
Asymptotic stochastics

Winter term 2016/2017

Norbert Henze, Institute of Stochastics

Contents

1. Basic facts from probability theory
2. A Poisson limit theorem for triangular arrays
3. The method of moments
4. A CLT for stationary m-dependent sequences
5. The multivariate normal distribution
6. Convergence in distribution and CLT in R^d
7. Empirical distribution functions
8. Limit theorems for U-statistics
9. Basic concepts of asymptotic estimation theory
10. Asymptotic properties of maximum likelihood estimators
11. Asymptotic (relative) efficiency of estimators
12. Asymptotic tests in parametric models
13. Probability measures on metric spaces

14. Weak convergence in metric spaces
15. Convergence in distribution
16. Relative compactness and tightness
17. Weak convergence and tightness in C
18. Wiener measure, Donsker's theorem
19. Brownian bridge, Wiener process on [0,∞)
20. The space D[0,1]
21. Empirical processes: applications to statistics
22. Gaussian distributions in separable Hilbert spaces
23. The central limit theorem in separable Hilbert spaces
24. Statistical applications: weighted L2-statistics

1 Basic facts from probability theory

Let $X, X_1, X_2, \ldots$ be real-valued r.v.'s on some probability space $(\Omega, \mathcal{A}, P)$.

1.1 Definition (Almost sure convergence)

$X_n \xrightarrow{a.s.} X \;:\Longleftrightarrow\; P\left( \{ \omega \in \Omega : \lim_{n\to\infty} X_n(\omega) = X(\omega) \} \right) = 1.$

1.2 Theorem (Characterization of almost sure convergence)

$X_n \xrightarrow{a.s.} X \;\Longleftrightarrow\; \lim_{n\to\infty} P\left( \sup_{k \ge n} |X_k - X| > \varepsilon \right) = 0 \quad \forall\, \varepsilon > 0.$

1.3 Definition (Convergence in probability)

$X_n \xrightarrow{P} X \;:\Longleftrightarrow\; \lim_{n\to\infty} P(|X_n - X| > \varepsilon) = 0 \quad \forall\, \varepsilon > 0.$

1.4 Theorem (Characterization of convergence in probability)

$X_n \xrightarrow{P} X \;\Longleftrightarrow\;$ each subsequence $(X_{n_k})$ contains a further subsequence $(X_{n'_k})$ such that $X_{n'_k} \xrightarrow{a.s.} X$ as $n'_k \to \infty$.

1.5 Generalization to random vectors

Let $X =: (X^{(1)}, \ldots, X^{(d)})$, $X_n = (X_n^{(1)}, \ldots, X_n^{(d)})$, $n \ge 1$, be $d$-dimensional random vectors. Then:

$X_n \xrightarrow{a.s.} X$ is defined without change,

$X_n \xrightarrow{P} X \;:\Longleftrightarrow\; \lim_{n\to\infty} P(\|X_n - X\| > \varepsilon) = 0 \;\forall\, \varepsilon > 0$, where $\|\cdot\|$ is any norm on $\mathbb{R}^d$ (why?),

$X_n \xrightarrow{a.s.} X \;\Longleftrightarrow\; X_n^{(j)} \xrightarrow{a.s.} X^{(j)}$ for each $j \in \{1, \ldots, d\}$ (why?),

$X_n \xrightarrow{P} X \;\Longleftrightarrow\; X_n^{(j)} \xrightarrow{P} X^{(j)}$ for each $j \in \{1, \ldots, d\}$. (why?)

1.6 Definition (Convergence in p-th mean)

Let $L^p := L^p(\Omega, \mathcal{A}, P) := \{ X : \Omega \to \mathbb{R} : E|X|^p < \infty \}$, $0 < p < \infty$. If $X, X_1, X_2, \ldots \in L^p$, one defines

$X_n \xrightarrow{L^p} X \;:\Longleftrightarrow\; E|X_n - X|^p \to 0.$

$p = 1$: convergence in the mean. $p = 2$: convergence in quadratic mean.

1.7 Remark

If $X_n \xrightarrow{a.s.} X$, then $X_n \xrightarrow{P} X$. (why?)

If $X_n \xrightarrow{L^p} X$, then $X_n \xrightarrow{P} X$. (why?)

In general, there are no further implications. (why?)

1.8 Theorem (Strong law of large numbers, SLLN)

Let $X_1, X_2, \ldots$ be independent, identically distributed (i.i.d.) random variables. Then the following are equivalent:

a) There is a random variable $X$ such that $\frac{1}{n} \sum_{j=1}^{n} X_j \xrightarrow{a.s.} X$.

b) $E|X_1| < \infty$.

If a) or b) holds, then $\frac{1}{n} \sum_{j=1}^{n} X_j \xrightarrow{a.s.} E X_1$.

Does this result generalize to random vectors?
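A minimal numerical illustration of the SLLN (Theorem 1.8) — a sketch assuming only NumPy; the Exp(1) distribution and the seed are arbitrary choices.

```python
# SLLN (1.8): the running mean of i.i.d. Exp(1) variables drifts to E X_1 = 1.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.exponential(scale=1.0, size=n)          # i.i.d. with E X_1 = 1
running_mean = np.cumsum(X) / np.arange(1, n + 1)

for k in (10, 1_000, 100_000):
    print(k, running_mean[k - 1])               # approaches 1 as k grows
```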

1.9 Definition (Convergence in distribution)

Let $X, X_1, X_2, \ldots$ be random variables with distribution functions $F, F_1, F_2, \ldots$. Write $C(F)$ for the set of continuity points of $F$.

$X_n \xrightarrow{D} X \;:\Longleftrightarrow\; \lim_{n\to\infty} F_n(x) = F(x) \quad \forall\, x \in C(F).$

Equivalent notations: $F_n \xrightarrow{D} F$, $P^{X_n} \xrightarrow{D} P^X$, $X_n \xrightarrow{D} P^X$.

1.10 Remarks

a) $P^X$ is called the limit (asymptotic) distribution of $X_n$ (of $P^{X_n}$).

b) $X_n \xrightarrow{P} X \Longrightarrow X_n \xrightarrow{D} X$. The converse holds if $P^X$ is degenerate.

c) Suppose $F$ is continuous. We then have Polya's theorem:

$X_n \xrightarrow{D} X \;\Longleftrightarrow\; \lim_{n\to\infty} \sup_{x \in \mathbb{R}} |F_n(x) - F(x)| = 0.$

Let

$C_b := \{ h : \mathbb{R} \to \mathbb{R} : h \text{ bounded and continuous} \},$
$C^{(0)} := \{ h \in C_b : h \text{ uniformly continuous} \},$
$C^{(r)} := \{ h \in C^{(0)} : h \text{ is } r \text{ times differentiable, } h^{(j)} \in C^{(0)} \text{ for } j \in \{1, \ldots, r\} \}, \quad r \in \mathbb{N}.$

1.11 Theorem (Characterization of convergence in distribution)

Let $r \in \mathbb{N}_0$ be fixed. Then the following assertions are equivalent:

a) $X_n \xrightarrow{D} X$,

b) $\lim_{n\to\infty} E h(X_n) = E h(X)$ for each $h \in C_b$,

c) $\lim_{n\to\infty} E h(X_n) = E h(X)$ for each $h \in C^{(r)}$.

Notice that

$X_n \xrightarrow{D} X \;\Longleftrightarrow\; \lim_{n\to\infty} E\, 1_{(-\infty,x]}(X_n) = E\, 1_{(-\infty,x]}(X) \quad \forall\, x \in C(F).$

1.12 Definition (Tightness and relative compactness)

Let $\mathcal{Q} \neq \emptyset$ be a set of probability measures on $\mathcal{B}^1$. $\mathcal{Q}$ is said to be

a) tight $:\Longleftrightarrow$ for each $\varepsilon > 0$ there is a compact set $K \subset \mathbb{R}$ such that $Q(K) \ge 1 - \varepsilon$ for each $Q \in \mathcal{Q}$,

b) relatively compact $:\Longleftrightarrow$ for each sequence $(Q_n)$ in $\mathcal{Q}$ there are a subsequence $(Q_{n_k})$ and a probability measure $Q$ on $\mathcal{B}^1$ such that $Q_{n_k} \xrightarrow{D} Q$ as $k \to \infty$.

1.13 Theorem We have: $\mathcal{Q}$ tight $\Longleftrightarrow$ $\mathcal{Q}$ relatively compact.

1.14 Corollary

a) $X_n \xrightarrow{D} X \Longrightarrow \{ P^{X_n} : n \in \mathbb{N} \}$ is tight.

b) Let $\{ P^{X_n} : n \ge 1 \}$ be tight. Suppose there is a probability measure $Q$ such that $X_{n_k} \xrightarrow{D} Q$ as $k \to \infty$ for each subsequence $(X_{n_k})$ that converges in distribution at all. Then $X_n \xrightarrow{D} Q$.

1.15 Definition (Characteristic function)

Let $X$ be a random variable. The function

$\varphi := \varphi_X : \mathbb{R} \to \mathbb{C}, \quad t \mapsto \varphi(t) := E\left[ e^{itX} \right] = \int_{\mathbb{R}} e^{itx} \, P^X(dx)$

is called the characteristic function of $X$.

1.16 Theorem (Some properties of characteristic functions)

a) $\varphi_{aX+b}(t) = e^{itb} \varphi_X(at)$, $t \in \mathbb{R}$.

b) If $E|X|^k < \infty$, then $\varphi$ is $k$ times continuously differentiable, and

$\frac{d^k}{dt^k} \varphi(t) = E\left[ (iX)^k e^{itX} \right], \quad t \in \mathbb{R}.$

c) If $X, Y$ are independent, then $\varphi_{X+Y} = \varphi_X \cdot \varphi_Y$.

d) If $a, b \in C(F)$ and $a < b$, then

$F(b) - F(a) = \lim_{T\to\infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \, \varphi(t) \, dt.$

e) We have $P^X = P^Y$ ($\Longleftrightarrow: X \overset{D}{=} Y$) $\Longleftrightarrow \varphi_X = \varphi_Y$.

1.17 Theorem (Continuity theorem of Lévy-Cramér)

Let $X, X_1, X_2, \ldots$ be random variables with characteristic functions $\varphi, \varphi_1, \varphi_2, \ldots$. We then have

$X_n \xrightarrow{D} X \;\Longleftrightarrow\; \lim_{n\to\infty} \varphi_n(t) = \varphi(t) \quad \forall\, t \in \mathbb{R}.$

1.18 Definition (Characteristic function of a random vector)

Let $X = (X_1, \ldots, X_d)^\top$ be a $d$-dimensional random (column) vector.

$\varphi := \varphi_X : \mathbb{R}^d \to \mathbb{C}, \quad t \mapsto \varphi(t) := E\left[ e^{it^\top X} \right]$

is called the characteristic function of $X$.

1.19 Proposition Let $-\infty < a_j < b_j < \infty$, $j = 1, \ldots, d$, and put

$B := [a_1, b_1] \times \ldots \times [a_d, b_d].$

If $P(X_j \in \{a_j, b_j\}) = 0$ for each $j = 1, \ldots, d$, then

$P^X(B) = \lim_{T\to\infty} \frac{1}{(2\pi)^d} \int_{-T}^{T} \cdots \int_{-T}^{T} \prod_{j=1}^{d} \frac{e^{-it_j a_j} - e^{-it_j b_j}}{it_j} \, \varphi_X(t) \, dt.$

Proof. Mimic the proof given for the case $d = 1$ (exercise!).

1.20 Corollary We have $X \overset{D}{=} Y \Longleftrightarrow \varphi_X = \varphi_Y$. (why?)

1.21 Theorem (Herglotz-Radon-Cramér-Wold)

We have

$X \overset{D}{=} Y \;\Longleftrightarrow\; t^\top X \overset{D}{=} t^\top Y \quad \forall\, t \in \mathbb{R}^d.$

Proof. Only "$\Longleftarrow$" is non-trivial. Fix $t \in \mathbb{R}^d$. Then

$\varphi_X(t) = E\left[ e^{it^\top X} \right] = E\left[ e^{i \cdot 1 \cdot t^\top X} \right] = \varphi_{t^\top X}(1) = \varphi_{t^\top Y}(1) = E\left[ e^{i \cdot 1 \cdot t^\top Y} \right] = E\left[ e^{it^\top Y} \right] = \varphi_Y(t).$

Corollary 1.20 $\Longrightarrow$ assertion.

1.22 Theorem (Central Limit Theorem, Lindeberg-Lévy)

Let $X_1, X_2, \ldots$ be i.i.d. random variables such that $E X_1^2 < \infty$. Put $a := E X_1$, $\sigma^2 := V(X_1)$, $S_n := X_1 + \ldots + X_n$, $n \ge 1$. If $\sigma^2 > 0$, then

$\frac{S_n - E(S_n)}{\sqrt{V(S_n)}} = \frac{S_n - na}{\sigma \sqrt{n}} \xrightarrow{D} N(0,1) \quad \text{as } n \to \infty.$

1.23 Theorem (Central Limit Theorem, Lindeberg-Feller)

For each $n \ge 2$, let $X_{n,1}, X_{n,2}, \ldots, X_{n,r_n}$ be independent random variables. Let $0 < \sigma_{n,j}^2 := V(X_{n,j}) < \infty$, $a_{n,j} := E X_{n,j}$, $\sigma_n^2 := \sigma_{n,1}^2 + \ldots + \sigma_{n,r_n}^2$, $S_n := X_{n,1} + \ldots + X_{n,r_n}$. For $\varepsilon > 0$, let

$L_n(\varepsilon) := \frac{1}{\sigma_n^2} \sum_{k=1}^{r_n} E\left[ (X_{n,k} - a_{n,k})^2 \, 1\{ |X_{n,k} - a_{n,k}| > \varepsilon \sigma_n \} \right].$

If $\lim_{n\to\infty} L_n(\varepsilon) = 0$ for each $\varepsilon > 0$ (the so-called Lindeberg condition), then

$\frac{S_n - E S_n}{\sqrt{V(S_n)}} \xrightarrow{D} N(0,1) \quad \text{as } n \to \infty.$
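A quick empirical sanity check of Theorem 1.22 — a sketch assuming only NumPy; Exp(1) summands (with $a = \sigma^2 = 1$) and the seed are arbitrary choices.

```python
# Lindeberg-Lévy CLT (1.22): the distribution of (S_n - n a)/(sigma sqrt(n))
# should be close to N(0,1) for large n.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 1_000, 10_000
X = rng.exponential(size=(reps, n))        # a = 1, sigma^2 = 1
Z = (X.sum(axis=1) - n) / np.sqrt(n)       # normalized partial sums

# Compare with Phi(0) = 0.5 and Phi(1) ≈ 0.8413
for z in (0.0, 1.0):
    print(z, np.mean(Z <= z))
```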

1.24 Theorem (Central Limit Theorem, Ljapunov)

Suppose that in 1.23 there is some $\delta > 0$ such that

$\lim_{n\to\infty} \frac{1}{\sigma_n^{2+\delta}} \sum_{k=1}^{r_n} E\left| X_{n,k} - a_{n,k} \right|^{2+\delta} = 0.$

Then

$\frac{S_n - E S_n}{\sqrt{V(S_n)}} \xrightarrow{D} N(0,1) \quad \text{as } n \to \infty.$

1.25 Theorem (Continuous mapping theorem, CMT)

If $X_n \xrightarrow{D} X$ and $h : \mathbb{R} \to \mathbb{R}$ is continuous, then $h(X_n) \xrightarrow{D} h(X)$.

1.26 Theorem (Slutsky's Lemma)

Suppose that $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} a$, $a \in \mathbb{R}$. We then have:

a) $X_n + Y_n \xrightarrow{D} X + a$,

b) $X_n \cdot Y_n \xrightarrow{D} a \cdot X$.

1.27 Theorem (Skorokhod)

Let $X, X_1, X_2, \ldots$ be random variables on a probability space $(\Omega, \mathcal{A}, P)$ such that $X_n \xrightarrow{D} X$. Then there are a probability space $(\tilde\Omega, \tilde{\mathcal{A}}, \tilde{P})$ and random variables $Y, Y_1, Y_2, \ldots$ on $\tilde\Omega$ with the property

$\tilde{P}^{Y} = P^{X}, \quad \tilde{P}^{Y_n} = P^{X_n} \text{ for each } n \quad (\Longrightarrow Y_n \xrightarrow{D} Y)$

and

$\lim_{n\to\infty} Y_n = Y \quad \tilde{P}\text{-almost surely}.$

Proof. Take

$\tilde\Omega := (0,1), \quad \tilde{\mathcal{A}} := \mathcal{B}^1 \cap (0,1), \quad \tilde{P} := \lambda^1|_{(0,1)}, \quad Y_n := F_n^{-1}, \quad Y := F^{-1}.$
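The quantile construction in the proof can be made concrete; a minimal sketch assuming NumPy, where $X_n \sim \text{Exp}(\lambda_n)$ with $\lambda_n \to 1$ is chosen only because the quantile functions are explicit.

```python
# Skorokhod coupling (1.27): on (0,1) with Lebesgue measure, Y_n := F_n^{-1}
# evaluated at the same "omega" converges pointwise to Y := F^{-1}.
import numpy as np

rng = np.random.default_rng(2)
U = rng.uniform(size=100_000)               # omega in (0,1)

def quantile(u, lam):                        # F^{-1} for Exp(lam)
    return -np.log1p(-u) / lam

Y = quantile(U, 1.0)
for n in (1, 10, 1_000):
    Yn = quantile(U, 1.0 + 1.0 / n)
    print(n, np.max(np.abs(Yn - Y)))         # sup over omega -> 0
```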

1.28 Definition (Conditional expectation)

Let $(\Omega, \mathcal{A}, P)$ be a probability space, $X \in L^1(\Omega, \mathcal{A}, P)$ and $\mathcal{G}$ be a sub-$\sigma$-field of $\mathcal{A}$. A random variable $Y$ is called a conditional expectation of $X$ given $\mathcal{G}$, for short $E[X|\mathcal{G}] := Y$, if:

a) $E|Y| < \infty$,

b) $Y$ is $\mathcal{G}$-measurable,

c) $E(Y 1_A) = E(X 1_A)$ for each $A \in \mathcal{G}$.

1.29 Theorem (Existence and uniqueness of conditional expectations)

$E[X|\mathcal{G}]$ exists, and it is uniquely determined $P$-almost surely.

1.30 Theorem (Properties of conditional expectations)

Let $(\Omega, \mathcal{A}, P)$ be a probability space, $\mathcal{G}$ a sub-$\sigma$-field of $\mathcal{A}$ and $X, Y \in L^1(\Omega, \mathcal{A}, P)$. Then:

a) $E(E[X|\mathcal{G}]) = E X$,

b) If $X$ is $\mathcal{G}$-measurable, then $E[X|\mathcal{G}] = X$,

c) $E[aX + bY|\mathcal{G}] = a E[X|\mathcal{G}] + b E[Y|\mathcal{G}]$, $a, b \in \mathbb{R}$ (linearity),

d) If $X \le Y$ $P$-a.s., then $E[X|\mathcal{G}] \le E[Y|\mathcal{G}]$ (monotonicity),

e) If $E|XY| < \infty$, and if $Y$ is $\mathcal{G}$-measurable, then $E[XY|\mathcal{G}] = Y E[X|\mathcal{G}]$ (treat $\mathcal{G}$-measurable functions like constants),

f) If $\mathcal{F} \subset \mathcal{G}$ is a sub-$\sigma$-field of $\mathcal{G}$, then $E[X|\mathcal{F}] = E\big[ E[X|\mathcal{G}] \,\big|\, \mathcal{F} \big]$ (tower property),

g) Suppose $X$ and $\mathcal{G}$ are independent. Then $E[X|\mathcal{G}] = E(X)$,

h) Let $\mathcal{H} \subset \mathcal{A}$ be a sub-$\sigma$-field of $\mathcal{A}$. If $\mathcal{H}$ and $\sigma(\sigma(X) \cup \mathcal{G})$ are independent, then $E[X|\mathcal{G}] = E[X|\sigma(\mathcal{G} \cup \mathcal{H})]$.

Notice that g) follows from h). (why?)

1.31 Theorem (Factorization theorem)

Let $(\Omega', \mathcal{A}')$ be a measurable space. If $\mathcal{G} = \sigma(Z) = Z^{-1}(\mathcal{A}')$ for an $(\mathcal{A}, \mathcal{A}')$-measurable mapping $Z : \Omega \to \Omega'$, then

$E[X|\mathcal{G}] = E[X|\sigma(Z)] =: E[X|Z] = h(Z)$

for some measurable function $h : \Omega' \to \mathbb{R}$.

2 A Poisson limit theorem for triangular arrays

For $n \ge 2$, let $X_{n,1}, \ldots, X_{n,n}$ be independent $\mathbb{N}_0$-valued random variables.

Aim: Give a necessary and sufficient condition ("nasc") for

$X_{n,1} + \ldots + X_{n,n} \xrightarrow{D} \text{Po}(\lambda) \quad \text{as } n \to \infty \text{ for some } \lambda > 0.$

2.1 Definition (Null array)

Let $\Delta := (X_{n,j} : 1 \le j \le n)_{n \ge 1}$ be a triangular array of $\mathbb{R}$-valued random variables. $\Delta$ is said to be a null array if

$\lim_{n\to\infty} \max_{1 \le j \le n} P(|X_{n,j}| > \varepsilon) = 0 \quad \text{for each } \varepsilon > 0. \tag{2.1}$

Equivalent wording: $\Delta$ is a null array $\Longleftrightarrow$ $\Delta$ is uniformly asymptotically negligible.

In what follows, we use the notation $x \wedge y := \min(x, y)$ ($\wedge$ = "wedge").

2.2 Proposition We have

$\lim_{n\to\infty} \max_{1 \le j \le n} P(|X_{n,j}| > \varepsilon) = 0 \;\forall\, \varepsilon > 0 \;\Longleftrightarrow\; \lim_{n\to\infty} \max_{1 \le j \le n} E[|X_{n,j}| \wedge 1] = 0.$

Proof: "$\Longleftarrow$": It suffices to consider $0 < \varepsilon \le 1$ (why?). Then

$1\{ |X_{n,j}| > \varepsilon \} \le \frac{1}{\varepsilon} \left( |X_{n,j}| \wedge 1 \right).$

Take $E[\,\cdot\,]$ and the maximum over $j$ $\Longrightarrow$ assertion.

"$\Longrightarrow$": Fix $\varepsilon > 0$. We have

$E[|X_{n,j}| \wedge 1] = E[(|X_{n,j}| \wedge 1) 1\{|X_{n,j}| > \varepsilon\}] + E[(|X_{n,j}| \wedge 1) 1\{|X_{n,j}| \le \varepsilon\}] \le P(|X_{n,j}| > \varepsilon) + (\varepsilon \wedge 1),$

since the first integrand is bounded by $1$ and the second by $\varepsilon \wedge 1$, q.e.d.

2.3 Definition (Generating function)

Let $X$ be an $\mathbb{N}_0$-valued random variable.

$g_X : [-1, 1] \to \mathbb{R}, \quad s \mapsto g_X(s) := \sum_{k=0}^{\infty} P(X = k) s^k$

is called the (probability) generating function of $X$ (of $P^X$).

2.4 Example (Poisson distribution)

If $X \sim \text{Po}(\lambda)$, then

$g_X(s) = \sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!} s^k = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(\lambda s)^k}{k!} = e^{-\lambda} e^{\lambda s} = e^{\lambda(s-1)}.$

2.5 Remark

a) $g_X$ determines $P^X$, since $\frac{d^r}{ds^r} g_X(s) \Big|_{s=0} = r! \cdot P(X = r)$,

b) $g_X(s) = E[s^X]$,

c) $X, Y$ independent $\Longrightarrow$

$g_{X+Y}(s) = E[s^{X+Y}] = E[s^X s^Y] = E[s^X]\, E[s^Y] = g_X(s)\, g_Y(s).$

2.6 Theorem (Continuity theorem for generating functions)

Let $X_0, X_1, X_2, \ldots$ be $\mathbb{N}_0$-valued random variables with generating functions $g_0, g_1, g_2, \ldots$. Then the following are equivalent:

a) $X_n \xrightarrow{D} X_0$,

b) $\lim_{n\to\infty} P(X_n = k) = P(X_0 = k) \quad \forall\, k \in \mathbb{N}_0$,

c) $\lim_{n\to\infty} g_n(s) = g_0(s) \quad \forall\, s \in [0, 1]$.

Proof: "a) $\Longrightarrow$ b)": Let $F_n$ be the distribution function of $X_n$. Fix $k \in \mathbb{N}_0$. Since $k + 1/2$ is a point of continuity of $F_0$, we have

$\lim_{n\to\infty} P(X_n \le k) = \lim_{n\to\infty} F_n\left( k + \tfrac12 \right) = F_0\left( k + \tfrac12 \right) = P(X_0 \le k).$

"b) $\Longrightarrow$ a)": $\checkmark$ (why?)

"b) $\Longrightarrow$ c)": W.l.o.g. let $s < 1$. Put $\Delta_{n,k} := |P(X_n = k) - P(X_0 = k)|$. Let $m \in \mathbb{N}$. We have

$|g_n(s) - g_0(s)| \le \sum_{k=0}^{\infty} \Delta_{n,k} s^k \le \max_{0 \le k \le m} \Delta_{n,k} \cdot \sum_{k=0}^{m} s^k + \sum_{k=m+1}^{\infty} s^k = \max_{0 \le k \le m} \Delta_{n,k} \cdot \frac{1 - s^{m+1}}{1 - s} + \frac{s^{m+1}}{1 - s}.$

Memo: $\Delta_{n,k} := |P(X_n = k) - P(X_0 = k)|$

Memo: $|g_n(s) - g_0(s)| \le \max_{0 \le k \le m} \Delta_{n,k} \cdot \frac{1 - s^{m+1}}{1 - s} + \frac{s^{m+1}}{1 - s}$, $m \in \mathbb{N}$ fixed.

Fix $\varepsilon > 0$. Choose $m$ so large that $s^{m+1}/(1-s) \le \varepsilon$. From b), we have

$\lim_{n\to\infty} \max_{0 \le k \le m} \Delta_{n,k} = 0 \quad \Longrightarrow \quad \limsup_{n\to\infty} |g_n(s) - g_0(s)| \le \varepsilon, \quad \text{q.e.d.}$

"c) $\Longrightarrow$ b)": For each $k \ge 0$, $(P(X_n = k))_{n \ge 1}$ is a bounded sequence. Bolzano-Weierstraß and Cantor's diagonal argument $\Longrightarrow$ $\exists$ subsequence $(X_{n'})$ such that

$u_k := \lim_{n'\to\infty} P(X_{n'} = k) \quad \text{exists for each } k \in \mathbb{N}_0.$

Part "b) $\Longrightarrow$ c)" $\Longrightarrow$ $\lim_{n'\to\infty} g_{n'}(s) = \sum_{k=0}^{\infty} u_k s^k$, $0 \le s < 1$.

Assumption c) $\Longrightarrow$ $\lim_{n'\to\infty} g_{n'}(s) = \sum_{k=0}^{\infty} P(X_0 = k) s^k$, $0 \le s < 1$,

$\Longrightarrow u_k = P(X_0 = k)$, $k \in \mathbb{N}_0$ (why?) $\Longrightarrow$ b). $\checkmark$

2.7 Lemma For $n \ge 1$, let $X_{n,1}, \ldots, X_{n,n}$ be $\mathbb{N}_0$-valued random variables with generating functions $g_{n,1}, \ldots, g_{n,n}$. We then have:

a) $\{ X_{n,j} : n \ge 1, 1 \le j \le n \}$ is a null array

$\Longleftrightarrow$ b) $\lim_{n\to\infty} \max_{1 \le j \le n} (1 - g_{n,j}(s)) = 0 \quad \forall\, s \in [0, 1]$.

Proof: We have

a) $\Longleftrightarrow \lim_{n\to\infty} \max_{1 \le j \le n} P(|X_{n,j}| > \varepsilon) = 0 \;\forall\, \varepsilon > 0$
$\Longleftrightarrow X_{n,k_n} \xrightarrow{P} 0$ for every sequence $(k_n)$ such that $1 \le k_n \le n$
$\Longleftrightarrow X_{n,k_n} \xrightarrow{D} \delta_0$ for every such sequence
$\Longleftrightarrow g_{n,k_n}(s) \to 1$, $0 \le s \le 1$, for every such sequence
$\Longleftrightarrow 1 - g_{n,k_n}(s) \to 0$, $0 \le s \le 1$, for every such sequence
$\Longleftrightarrow$ b).

2.8 Remark If $X_{n,1}, \ldots, X_{n,n}$ are $\mathbb{R}$-valued, then b) in 2.7 takes the form

$\lim_{n\to\infty} \max_{1 \le j \le n} \left| 1 - \varphi_{n,j}(t) \right| = 0, \quad t \in \mathbb{R}.$

Here, $\varphi_{n,j}$ is the characteristic function of $X_{n,j}$.

2.9 Theorem (Poisson Limit Theorem)

Let $\{ X_{n,j} : n \ge 2, 1 \le j \le n \}$ be a null array of rowwise independent $\mathbb{N}_0$-valued random variables and $X \sim \text{Po}(\lambda)$. We then have:

$X_{n,1} + \ldots + X_{n,n} \xrightarrow{D} X \;\Longleftrightarrow\; \text{(i)} \;\sum_{j=1}^{n} P(X_{n,j} > 1) \to 0, \quad \text{(ii)} \;\sum_{j=1}^{n} P(X_{n,j} = 1) \to \lambda.$

Proof: "$\Longleftarrow$": In view of 2.6, 2.5 c) and 2.4, we have to show

$\lim_{n\to\infty} \prod_{j=1}^{n} g_{n,j}(s) = e^{\lambda(s-1)}, \quad 0 \le s \le 1, \tag{2.2}$

$\Longleftrightarrow \sum_{j=1}^{n} \log\left( 1 - (1 - g_{n,j}(s)) \right) \to \lambda(s-1), \quad 0 \le s \le 1,$

$\Longleftrightarrow \sum_{j=1}^{n} (1 - g_{n,j}(s)) \to \lambda(1-s), \quad 0 \le s \le 1. \tag{2.3}$

For the last equivalence, use 2.7 b) and $1 - 1/t \le \log t \le t - 1$, $t > 0$.

We have

$\sum_{j=1}^{n} (1 - g_{n,j}(s)) = \sum_{j=1}^{n} \left[ 1 - \sum_{k=0}^{1} s^k P(X_{n,j} = k) \right] - \sum_{j=1}^{n} \sum_{k=2}^{\infty} s^k P(X_{n,j} = k)$

$= \sum_{j=1}^{n} \left[ 1 - P(X_{n,j} = 0) - s P(X_{n,j} = 1) \right] + \sum_{j=1}^{n} \sum_{k=2}^{\infty} (s - s^k) P(X_{n,j} = k) - \sum_{j=1}^{n} s P(X_{n,j} \ge 2)$

$= \underbrace{(1-s) \sum_{j=1}^{n} P(X_{n,j} > 0)}_{=:\, T_{n,1}(s)} + \underbrace{\sum_{k=2}^{\infty} (s - s^k) \sum_{j=1}^{n} P(X_{n,j} = k)}_{=:\, T_{n,2}(s)}.$

For $k \ge 2$ we have $s(1-s) \le s(1 - s^{k-1}) = s - s^k \le s$, hence

$s(1-s) \sum_{j=1}^{n} P(X_{n,j} > 1) \le T_{n,2}(s) \le s \sum_{j=1}^{n} P(X_{n,j} > 1),$

and both bounds tend to $0$ by (i).

Memo: $\sum_{j=1}^{n} (1 - g_{n,j}(s)) = T_{n,1}(s) + T_{n,2}(s)$, $\lim_{n\to\infty} T_{n,2}(s) = 0$.

Memo: $T_{n,1}(s) = (1-s) \sum_{j=1}^{n} P(X_{n,j} > 0)$

Notice that

$\sum_{j=1}^{n} P(X_{n,j} > 0) = \underbrace{\sum_{j=1}^{n} P(X_{n,j} = 1)}_{\to\, \lambda, \text{ cf. (ii)}} + \underbrace{\sum_{j=1}^{n} P(X_{n,j} > 1)}_{\to\, 0, \text{ cf. (i)}}.$

It follows that

$\lim_{n\to\infty} T_{n,1}(s) = (1-s)\lambda, \quad \text{q.e.d.}$

"$\Longrightarrow$": Suppose that $X_{n,1} + \ldots + X_{n,n} \xrightarrow{D} \text{Po}(\lambda)$

$\Longrightarrow \prod_{j=1}^{n} g_{n,j}(s) \to e^{\lambda(s-1)}, \quad 0 \le s \le 1,$

$\Longrightarrow \sum_{j=1}^{n} (1 - g_{n,j}(s)) \to \lambda(1-s), \quad 0 \le s \le 1.$

Put $s = 0$. Then

$\sum_{j=1}^{n} P(X_{n,j} > 0) \to \lambda.$

Memo: $\sum_{j=1}^{n} (1 - g_{n,j}(s)) = (1-s) \sum_{j=1}^{n} P(X_{n,j} > 0) + T_{n,2}(s)$

Memo: $s(1-s) \sum_{j=1}^{n} P(X_{n,j} > 1) \le T_{n,2}(s) \le s \sum_{j=1}^{n} P(X_{n,j} > 1)$

It follows that $\sum_{j=1}^{n} P(X_{n,j} > 1) \to 0$ (which is (i)), and

$\sum_{j=1}^{n} P(X_{n,j} = 1) = \sum_{j=1}^{n} P(X_{n,j} > 0) - \sum_{j=1}^{n} P(X_{n,j} > 1) \to \lambda,$

which is (ii), q.e.d.
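A numerical illustration of Theorem 2.9 — a sketch assuming NumPy; the Bernoulli array $X_{n,j} \sim \text{Bin}(1, \lambda/n)$ trivially satisfies (i) and (ii), and $\lambda = 2$ is arbitrary.

```python
# Poisson limit theorem (2.9): row sums of a Bernoulli(lambda/n) null
# array are approximately Po(lambda).
from math import exp, factorial
import numpy as np

rng = np.random.default_rng(3)
lam, n, reps = 2.0, 500, 20_000
S = rng.binomial(1, lam / n, size=(reps, n)).sum(axis=1)

for k in range(5):   # empirical frequency vs Po(2) probability
    print(k, np.mean(S == k), exp(-lam) * lam**k / factorial(k))
```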

3 The method of moments

A warning! In general, $X_n \xrightarrow{D} X$ does not imply $E X_n \to E X$.

Notice that $X + Y_n \xrightarrow{D} X$ if $Y_n \xrightarrow{P} 0$ (Slutsky).

If $P(Y_n = n^2) = 1/n$, $P(Y_n = 0) = 1 - 1/n$, we have $Y_n \xrightarrow{P} 0$ and $E Y_n \to \infty$.

3.1 Theorem Suppose $X_n \xrightarrow{D} X$. Then $E|X| \le \liminf_{n\to\infty} E|X_n|$.

Proof: Take $(\tilde\Omega, \tilde{\mathcal{A}}, \tilde{P})$ and $Y, Y_1, Y_2, \ldots$ as in Skorokhod's theorem. Notice that $X \overset{D}{=} Y$ and $0 \le |Y_n| \to |Y|$ $\tilde{P}$-a.s. Fatou's lemma $\Longrightarrow$

$E|X| = E|Y| = \int |Y| \, d\tilde{P} \le \liminf_{n\to\infty} \int |Y_n| \, d\tilde{P} = \liminf_{n\to\infty} E|Y_n| = \liminf_{n\to\infty} E|X_n|, \quad \text{q.e.d.}$
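The warning example is easy to check numerically; a sketch assuming NumPy.

```python
# The example before 3.1: Y_n -> 0 in probability, yet E Y_n = n -> infinity.
import numpy as np

rng = np.random.default_rng(4)
for n in (10, 100, 1_000):
    Y = np.where(rng.uniform(size=200_000) < 1 / n, n**2, 0)
    # P(|Y_n| > eps) ≈ 1/n, while the sample mean ≈ n
    print(n, np.mean(np.abs(Y) > 0.5), Y.mean())
```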

3.2 Definition (Uniform integrability)

Let $X_1, X_2, \ldots$ be random variables on $(\Omega, \mathcal{A}, P)$. The sequence $(X_n)$ is said to be uniformly integrable (UI) if

$\lim_{a\to\infty} \sup_{n \ge 1} E\left[ |X_n| \, 1\{ |X_n| \ge a \} \right] = 0.$

3.3 Corollary

a) If $(X_n)_{n \ge 1}$ is UI, then $\sup_{n \ge 1} E|X_n| < \infty$.

b) If $\sup_{n \ge 1} E|X_n|^{1+\delta} < \infty$ for some $\delta > 0$, then $(X_n)$ is UI.

c) If $\sup_{n \ge 1} |X_n| \le C < \infty$ for some $C$, then $(X_n)$ is UI.

Proof: a) For fixed $a > 0$, we have

$E|X_n| = E[|X_n| 1\{|X_n| < a\}] + E[|X_n| 1\{|X_n| \ge a\}] \le a + \sup_{k \ge 1} E[|X_k| 1\{|X_k| \ge a\}]. \checkmark$

b) Fix $a > 0$. We have

$E[|X_n| 1\{|X_n| \ge a\}] \le E\left[ |X_n| \cdot \left( \frac{|X_n|}{a} \right)^{\delta} \right] = \frac{1}{a^{\delta}} \cdot E|X_n|^{1+\delta}. \checkmark$

c) follows from b).

3.4 Theorem If $X_n \xrightarrow{D} X$ and $(X_n)$ is UI, then:

a) $E|X| < \infty$,

b) $\lim_{n\to\infty} E X_n = E X$.

Proof: a) follows from Thm. 3.1 and Cor. 3.3 a).

b) Skorokhod's theorem $\Longrightarrow$ w.l.o.g. $X_n \xrightarrow{a.s.} X$. Fix $a > 0$.

$|E X_n - E X| \le E|X_n - X| = \underbrace{E[|X_n - X| 1\{|X_n - X| < a\}]}_{=:\, \Delta_{n,1}} + \underbrace{E[|X_n - X| 1\{|X_n - X| \ge a\}]}_{=:\, \Delta_{n,2}}$

Notice that $\lim_{n\to\infty} \Delta_{n,1} = 0$ by dominated convergence. Furthermore,

$\Delta_{n,2} \le 2 \cdot E\left[ \max(|X_n|, |X|) \, 1\left\{ \max(|X_n|, |X|) \ge \tfrac{a}{2} \right\} \right]$
$\le 2 \cdot E\left[ |X_n| 1\left\{ |X_n| \ge \tfrac{a}{2} \right\} \right] + 2 \cdot E\left[ |X| 1\left\{ |X| \ge \tfrac{a}{2} \right\} \right]$
$\le 2 \cdot \sup_{k \ge 1} E\left[ |X_k| 1\left\{ |X_k| \ge \tfrac{a}{2} \right\} \right] + 2 \cdot E\left[ |X| 1\left\{ |X| \ge \tfrac{a}{2} \right\} \right].$

Both terms tend to $0$ as $a \to \infty$: the first since $(X_n)$ is UI, the second since $E|X| < \infty$. $\checkmark$

3.5 Corollary Let $r \in \mathbb{N}$, $\varepsilon > 0$. If $X_n \xrightarrow{D} X$ and $\sup_{n \ge 1} E|X_n|^{r+\varepsilon} < \infty$, then:

a) $E|X|^r < \infty$,

b) $\lim_{n\to\infty} E(X_n^r) = E(X^r)$.

Proof: We have $|X_n|^{r+\varepsilon} = |X_n^r|^{1+\varepsilon/r}$. Put $\delta := \varepsilon/r$.

Memo: If $\sup_{n \ge 1} E|Y_n|^{1+\delta} < \infty$ for some $\delta > 0$, then $(Y_n)$ is UI.

It follows that $(X_n^r)_{n \ge 1}$ is UI. The continuous mapping theorem gives $X_n^r \xrightarrow{D} X^r$. The assertion now follows from Theorem 3.4, q.e.d.

3.6 Theorem (Method of moments)

Suppose $P^X$ is uniquely determined by the sequence $(E X^k)_{k \ge 1}$ of moments. If $\lim_{n\to\infty} E X_n^k = E X^k$ for each $k \ge 1$, then $X_n \xrightarrow{D} X$.

Proof: For $a > 0$, Markov's inequality yields

$P(|X_n| > a) \le \frac{E X_n^2}{a^2} \to \frac{E X^2}{a^2} \quad \text{as } n \to \infty.$

It follows that $\{ P^{X_n} : n \ge 1 \}$ is tight. (why?)

Thm. 1.13 $\Longrightarrow \exists$ subsequence $(X_{n_k})$ $\exists$ r.v. $Y$ such that $X_{n_k} \xrightarrow{D} Y$ as $k \to \infty$.

Cor. 3.5 $\Longrightarrow \lim_{k\to\infty} E X_{n_k}^r = E Y^r$, $r \in \mathbb{N}$. (why?)

Assumption $\Longrightarrow X \overset{D}{=} Y$, i.e., $X_{n_k} \xrightarrow{D} X$. Cor. 1.14 $\Longrightarrow$ assertion.

Problem: Prove the CLT of Lindeberg-Lévy for bounded r.v.'s via Thm. 3.6.

3.7 Theorem (Sufficient condition for "$(E X^k)_{k \ge 1}$ determines $P^X$")

Let $X$ be a random variable such that $E|X|^k < \infty$, $k \ge 1$. Suppose that

$\sum_{k=1}^{\infty} \frac{E X^k}{k!} t^k$

has a non-vanishing radius of convergence. Then $P^X$ is uniquely determined by the sequence $(E X^k)_{k \ge 1}$.

Proof: Let $\varphi(t) := E e^{itX}$, $b_k := E|X|^k$, $k \ge 1$. Induction over $n$ $\Longrightarrow$

$\left| e^{itX} \left( e^{ihX} - \sum_{k=0}^{n} \frac{(ihX)^k}{k!} \right) \right| \le \frac{|h|^{n+1} |X|^{n+1}}{(n+1)!}, \quad t, h \in \mathbb{R}, \; n \ge 0.$

Since $E|X|^k < \infty$ implies $\varphi^{(k)}(t) = E[e^{itX} (iX)^k]$, it follows that

$\left| \varphi(t+h) - \sum_{k=0}^{n} \frac{h^k}{k!} \varphi^{(k)}(t) \right| \le \frac{|h|^{n+1}}{(n+1)!} \, b_{n+1}.$

Put $m_k := E X^k$. The assumption $\Longrightarrow \exists\, t_0 > 0$ with $\sum_{k=0}^{\infty} |m_k| t_0^k / k! < \infty$.

Memo: $\left| \varphi(t+h) - \sum_{k=0}^{n} \frac{h^k}{k!} \varphi^{(k)}(t) \right| \le \frac{|h|^{n+1}}{(n+1)!} \, b_{n+1}.$

Memo: $\sum_{k=0}^{\infty} \frac{|m_k| t_0^k}{k!} < \infty$, $0 < t_0 < \infty$, $m_k = E X^k$.

Since $|X|^{2k-1} \le 1 + |X|^{2k}$, we have $b_{2k-1} \le 1 + m_{2k}$. Since $m_{2k} = b_{2k}$,

$\frac{b_{2k-1} h^{2k-1}}{(2k-1)!} \le \frac{h^{2k-1}}{(2k-1)!} + \frac{m_{2k} t_0^{2k}}{(2k)!} \cdot \frac{2k \, h^{2k-1}}{t_0^{2k}}$

shows that the left-hand side tends to $0$ as $k \to \infty$ if $h \in (0, t_0)$. It follows that

$\varphi(t+h) = \sum_{k=0}^{\infty} \frac{\varphi^{(k)}(t)}{k!} h^k, \quad t \in \mathbb{R}, \; |h| < t_0. \tag{3.1}$

Let $Y$ be a r.v. with $E Y^k = m_k$, $k \ge 1$, and CF $\psi(t) = E e^{itY}$. Proceeding as above, we obtain

$\psi(t+h) = \sum_{k=0}^{\infty} \frac{\psi^{(k)}(t)}{k!} h^k, \quad t \in \mathbb{R}, \; |h| < t_0. \tag{3.2}$

Put $t = 0$ in (3.1), (3.2). Since $\psi^{(k)}(0) = \varphi^{(k)}(0) = i^k m_k$, $k \ge 1$, we have $\psi(t) = \varphi(t)$, $|t| < t_0$. Putting $t = \pm t_0/2$, $t = \pm t_0, \ldots$ in (3.1), (3.2) gives $\psi = \varphi$ $\Longrightarrow X \overset{D}{=} Y$, q.e.d.

3.8 Examples

a) If $P(|X| \le M) = 1$ for some $M < \infty$, then $P^X$ is determined by the sequence $(E X^k)$.

b) If $X \sim N(0,1)$, then $P^X$ is determined by the sequence $(E X^k)$.

c) If $X$ has a lognormal distribution, then $P^X$ is not determined by the sequence $(E X^k)$.
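For c), the moments of the standard lognormal are $E X^k = e^{k^2/2}$; they grow so fast that the series in 3.7 has radius of convergence $0$, so the sufficient condition fails. A small Monte Carlo check of the moment formula, a sketch assuming NumPy (estimates for higher $k$ are noisy because $X^k$ has huge variance):

```python
# Moments of X = exp(Z), Z ~ N(0,1): E X^k = e^{k^2/2}, so
# E X^k t^k / k! -> infinity for every t > 0 (radius of convergence 0).
import numpy as np

rng = np.random.default_rng(5)
X = np.exp(rng.standard_normal(2_000_000))
for k in (1, 2, 3):
    print(k, (X**k).mean(), np.exp(k**2 / 2))   # Monte Carlo vs exact
```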

4 A CLT for stationary m-dependent sequences

Let $(Y_j)_{j \ge 1}$ be a sequence of random variables on some probability space $(\Omega, \mathcal{A}, P)$. Recall: for $T \subset \mathbb{N}$, $T \neq \emptyset$,

$\sigma(Y_t : t \in T) := \sigma\left( \bigcup_{t \in T} Y_t^{-1}(\mathcal{B}^1) \right).$

4.1 Definition (m-dependence and stationarity)

a) $(Y_j)_{j \ge 1}$ is called m-dependent $:\Longleftrightarrow$ for each $s \ge 1$: $\sigma(Y_1, \ldots, Y_s)$ and $\sigma(Y_{s+m+j} : j \ge 1)$ are independent.

b) $(Y_j)_{j \ge 1}$ is said to be stationary $:\Longleftrightarrow$ $\forall\, j \in \mathbb{N}$ $\forall\, k \in \mathbb{N}_0$: the distribution of $(Y_j, \ldots, Y_{j+k})$ does not depend on $j$ ("shift invariance of finite-dimensional distributions").

Notice that

$Y_1, Y_2, \ldots$ are independent $\Longleftrightarrow (Y_j)_{j \ge 1}$ is 0-dependent. (!)

$Y_1, Y_2, \ldots$ i.i.d. $\Longrightarrow (Y_j)_{j \ge 1}$ is stationary and m-dependent for every $m \ge 0$.

4.2 Examples (Functions of blocks of i.i.d. sequences)

a) Let $X_1, X_2, \ldots$ be i.i.d. random variables. Let $\ell \in \mathbb{N}$ and $f : \mathbb{R}^\ell \to \mathbb{R}$ be a measurable function. Put $Y_j := f(X_j, X_{j+1}, \ldots, X_{j+\ell-1})$, $j \ge 1$. Then $(Y_j)_{j \ge 1}$ is stationary. Since, for $s \ge 1$, we have

$\sigma(Y_1, \ldots, Y_s) \subset \sigma(X_1, \ldots, X_{s+\ell-1}), \quad \sigma(Y_{s+\ell-1+j} : j \ge 1) \subset \sigma(X_{s+\ell}, X_{s+\ell+1}, \ldots),$

$Y_1, Y_2, \ldots$ are $(\ell-1)$-dependent.

b) (Special case of a)) Let $X_0, X_1, X_2, \ldots$ be i.i.d. Put $Y_j := 1\{ X_{j-1} > X_j < X_{j+1} \}$, $j \ge 1$ (local minimum at time $j$). Then $(Y_j)_{j \ge 1}$ is stationary and 2-dependent.

c) (Special case of a)) Let $X_0, X_1, \ldots$ be i.i.d. $\sim \text{Bin}(1, p)$, $0 < p < 1$. Let

$Y_j := (1 - X_{j-1}) \, X_j X_{j+1} \cdots X_{j+r-1} \, (1 - X_{j+r})$

(a lucky streak of exact length $r$ starts at the $j$th trial). $(Y_j)_{j \ge 1}$ is $(r+1)$-dependent and stationary.

In what follows, we assume $E Y_1^2 < \infty$. Put

$\mu := E(Y_1) = E(Y_j) \;\forall\, j \ge 1,$
$\sigma_{00} := V(Y_1) = V(Y_j) \;\forall\, j \ge 1,$
$\sigma_{0j} := \text{Cov}(Y_1, Y_{1+j}) = \text{Cov}(Y_i, Y_{i+j}) \;\forall\, i \ge 1$ (by stationarity!).

Notice that $\sigma_{0j} = 0$ if $j > m$ (because of m-dependence).

Let $S_n := Y_1 + \ldots + Y_n$, $n \ge 1$. Then $E S_n = n\mu$, and, for $n \ge m$,

$V(S_n) = \sum_{i=1}^{n} \sum_{j=1}^{n} \text{Cov}(Y_i, Y_j) = n \sigma_{00} + 2(n-1)\sigma_{01} + \ldots + 2(n-m)\sigma_{0m}.$

Notice that

$\lim_{n\to\infty} \frac{1}{n} V(S_n) = \sigma^2 := \sigma_{00} + 2 \sum_{j=1}^{m} \sigma_{0j}.$

$\sigma^2 = \sigma_{00} + 2 \sum_{j=1}^{m} \sigma_{0j}$ is called the long-run variance of $(Y_j)_{j \ge 1}$.

4.3 Lemma Let $Z_{n,k}, X_{n,k}$ ($n, k \in \mathbb{N}$) be random variables and

$T_n := Z_{n,k} + X_{n,k}, \quad n, k \ge 1 \quad (T_n \text{ does not depend on } k!).$

Suppose that

a) $\lim_{k\to\infty} \sup_{n \in \mathbb{N}} P(|X_{n,k}| \ge \delta) = 0 \;\forall\, \delta > 0$,

b) for each $k \ge 1$: $Z_{n,k} \xrightarrow{D} Z_k$ as $n \to \infty$ for some random variable $Z_k$,

c) $Z_k \xrightarrow{D} Z$ as $k \to \infty$ for some random variable $Z$.

Then $T_n \xrightarrow{D} Z$ as $n \to \infty$.

Proof: Let $F, F_1, F_2, \ldots$ be the distribution functions of $Z, Z_1, Z_2, \ldots$.

Fix $\varepsilon > 0$ and $z \in C(F)$. Since $\mathbb{R} \setminus C(F)$ is countable, there is some $\delta > 0$ with

$P(|Z - z| \le \delta) < \varepsilon \quad \text{and} \quad z + \delta, z - \delta \in C(F) \cap \bigcap_{k=1}^{\infty} C(F_k). \tag{4.1}$

From a), there is some $k_0$ such that

$P(|X_{n,k}| \ge \delta) < \varepsilon \quad \forall\, n \;\forall\, k \ge k_0. \tag{4.2}$

From c), there is some $k_1 \ge k_0$ with

$|P(Z_k \le z \pm \delta) - P(Z \le z \pm \delta)| < \varepsilon \quad \forall\, k \ge k_1. \tag{4.3}$

Memo: b) $\forall\, k \ge 1$: $Z_{n,k} \xrightarrow{D} Z_k$ as $n \to \infty$ for some random variable $Z_k$

Memo: $P(|Z - z| \le \delta) < \varepsilon$ and $z + \delta, z - \delta \in C(F) \cap \bigcap_{k=1}^{\infty} C(F_k)$ (4.1)

Memo: $P(|X_{n,k}| \ge \delta) < \varepsilon \;\forall\, n, \forall\, k \ge k_0$ (4.2)

Memo: $|P(Z_k \le z \pm \delta) - P(Z \le z \pm \delta)| < \varepsilon \;\forall\, k \ge k_1$ (4.3)

For $k \ge k_1$, we have

$P(T_n \le z) = P(Z_{n,k} + X_{n,k} \le z)$
$= P(Z_{n,k} + X_{n,k} \le z, |X_{n,k}| < \delta) + P(Z_{n,k} + X_{n,k} \le z, |X_{n,k}| \ge \delta)$
$\le P(Z_{n,k} \le z + \delta) + P(|X_{n,k}| \ge \delta)$
$\le P(Z_{n,k} \le z + \delta) + \varepsilon \quad \text{(by (4.2))}.$

From b), it follows that

$\limsup_{n\to\infty} P(T_n \le z) \le P(Z_k \le z + \delta) + \varepsilon \le P(Z \le z + \delta) + 2\varepsilon \le P(Z \le z) + 3\varepsilon,$

using b), (4.3) and (4.1) in turn. In the same way, using (4.3) with $z - \delta$, we obtain

$\liminf_{n\to\infty} P(T_n \le z) \ge P(Z \le z) - 3\varepsilon. \quad \varepsilon \downarrow 0 \Longrightarrow \text{assertion}.$

4.4 Theorem (CLT for stationary m-dependent sequences)

Let $(Y_j)_{j \ge 1}$ be a stationary m-dependent sequence satisfying $E Y_1^2 < \infty$ and $0 < \sigma^2$, where $\sigma^2$ is the long-run variance. For the sequence of partial sums $S_n = Y_1 + \ldots + Y_n$, we then have

$\frac{S_n - E S_n}{\sqrt{V(S_n)}} \xrightarrow{D} N(0,1) \quad \text{as } n \to \infty.$

Proof: W.l.o.g. let $\mu = E Y_1 = 0$. Idea: split $S_n$ into suitable blocks and use the CLT of Lindeberg-Lévy and Slutsky's lemma.

Fix $k > m$. Then $n =: s(k+m) + r$, where $0 \le r < k+m$. Put

$S_n =: S_{n,1} + S_{n,2} + R_n,$ where

$S_{n,1} := \sum_{j=0}^{s-1} V_{k,j}, \quad S_{n,2} := \sum_{j=0}^{s-1} W_{k,j},$

$V_{k,j} := \sum_{i=1}^{k} Y_{j(k+m)+i}, \quad W_{k,j} := \sum_{i=k+1}^{k+m} Y_{j(k+m)+i}, \quad R_n := \sum_{i=1}^{r} Y_{s(k+m)+i}.$

Notice that $V_{k,0}, \ldots, V_{k,s-1}$ are i.i.d.

The sequence $Y_1, \ldots, Y_n$ thus splits into alternating blocks $V_{k,0}, W_{k,0}, V_{k,1}, W_{k,1}, \ldots, V_{k,s-1}, W_{k,s-1}$, followed by the remainder $R_n$.

Put

$T_n := \frac{S_n}{\sqrt{n}}, \quad Z_{n,k} := \frac{S_{n,1} + R_n}{\sqrt{n}}, \quad X_{n,k} := \frac{S_{n,2}}{\sqrt{n}} \quad \Longrightarrow \quad T_n = \frac{S_n}{\sqrt{n}} = Z_{n,k} + X_{n,k}.$

We claim that a), b), c) of Lemma 4.3 hold.

a) To show: $\lim_{k\to\infty} \sup_{n \in \mathbb{N}} P(|X_{n,k}| \ge \delta) = 0 \;\forall\, \delta > 0$.

We have

$E X_{n,k} = 0, \quad V(X_{n,k}) = \frac{1}{n} V(S_{n,2}) = \frac{s}{n} V(S_m).$

Since $n = s(k+m) + r_n$, $0 \le r_n < k+m$, we have $\frac{s}{n} \le \frac{1}{k+m}$.

Tschebyshev $\Longrightarrow$ $\sup_{n \in \mathbb{N}} P(|X_{n,k}| \ge \delta) \le \frac{V(S_m)}{(k+m)\delta^2} \to 0$ as $k \to \infty$. $\checkmark$

Memo: b) $\forall\, k \ge 1$: $Z_{n,k} \xrightarrow{D} Z_k$ as $n \to \infty$ for some random variable $Z_k$

Memo: $Z_{n,k} = \frac{S_{n,1}}{\sqrt{n}} + \frac{R_n}{\sqrt{n}}$, $n = s_n(k+m) + r_n$, $0 \le r_n < k+m$.

Memo: $S_{n,1} = \sum_{j=0}^{s-1} V_{k,j}$, $V_{k,j} = \sum_{i=1}^{k} Y_{j(k+m)+i}$

b) We have: $V_{k,0}, \ldots, V_{k,s-1}$ i.i.d., $E(V_{k,j}) = 0$, $V(V_{k,j}) = V(S_k)$.

$\frac{S_{n,1}}{\sqrt{n}} = \sqrt{\frac{s_n}{n}} \cdot \frac{1}{\sqrt{s_n}} \sum_{j=0}^{s_n - 1} V_{k,j},$

where $\sqrt{s_n/n} \to 1/\sqrt{k+m}$ and, by Lindeberg-Lévy, $s_n^{-1/2} \sum_{j=0}^{s_n-1} V_{k,j} \xrightarrow{D} N(0, V(S_k))$. Hence

$\frac{S_{n,1}}{\sqrt{n}} \xrightarrow{D} Z_k \sim N\left( 0, \frac{V(S_k)}{k+m} \right).$

Furthermore,

$E\left[ \frac{R_n}{\sqrt{n}} \right] = 0, \quad V\left( \frac{R_n}{\sqrt{n}} \right) = \frac{V(S_r)}{n} \le \frac{(k+m)^2 \sigma_{00}}{n} \quad \text{(Cauchy-Schwarz)}$

$\Longrightarrow \frac{R_n}{\sqrt{n}} \xrightarrow{P} 0$. Slutsky $\Longrightarrow Z_{n,k} \xrightarrow{D} Z_k$.

Memo: c) $Z_k \xrightarrow{D} Z$ as $k \to \infty$ for some random variable $Z$.

Memo: $Z_k \sim N\left( 0, \frac{V(S_k)}{k+m} \right)$

Memo: $\lim_{n\to\infty} \frac{1}{n} V(S_n) = \sigma^2 := \sigma_{00} + 2 \sum_{j=1}^{m} \sigma_{0j}.$

Notice that

$\frac{V(S_k)}{k+m} = \frac{k}{k+m} \cdot \frac{V(S_k)}{k} \to 1 \cdot \sigma^2 = \sigma^2 \quad \text{as } k \to \infty.$

It follows that

$Z_k \xrightarrow{D} Z \sim N(0, \sigma^2) \quad \text{as } k \to \infty, \quad \text{q.e.d.}$

4.5 Examples

a) In Example 4.2 b), i.e. $Y_j = 1\{ X_{j-1} > X_j < X_{j+1} \}$, we have, provided that the distribution function $F$ of $X_1$ is continuous:

$\sqrt{n} \left( \frac{S_n}{n} - \frac{1}{3} \right) \xrightarrow{D} N\left( 0, \frac{2}{45} \right). \quad (!)$

Notice that $P(X_0 > X_1 < X_2) = P(X_1 = \min(X_0, X_1, X_2)) = 1/3$. A simulation sketch follows after this example.

b) In Example 4.2 c), i.e., $Y_j = (1 - X_{j-1}) X_j \cdots X_{j+r-1} (1 - X_{j+r})$, we have, writing $q = 1 - p$:

$E Y_1 = q^2 p^r = E Y_1^2 \Longrightarrow V(Y_1) = q^2 p^r - q^4 p^{2r} = \sigma_{00},$
$E(Y_1 Y_{1+r+1}) = q^3 p^{2r},$
$E(Y_j Y_{j+k}) = 0$ if $k \in \{1, \ldots, r\}$, hence $\sigma_{0k} = \text{Cov}(Y_j, Y_{j+k}) = -q^4 p^{2r}$ for $k \in \{1, \ldots, r\}$,
$\sigma_{0,r+1} = \text{Cov}(Y_1, Y_{1+r+1}) = q^3 p^{2r} - q^4 p^{2r},$
$\text{Cov}(Y_j, Y_\ell) = 0$ otherwise.

Thm. 4.4 $\Longrightarrow$

$\sqrt{n} \left( \frac{S_n}{n} - q^2 p^r \right) \xrightarrow{D} N(0, \sigma^2),$

where $\sigma^2 = \ldots = q^2 p^r + 2 q^3 p^{2r} - (2r+3) q^4 p^{2r}.$
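A simulation sketch for a), assuming NumPy; uniform $X_j$ are one choice of continuous $F$, and $2/45 \approx 0.0444$.

```python
# Example 4.5 a): proportion of local minima in an i.i.d. sequence.
# sqrt(n) (S_n/n - 1/3) is approximately N(0, 2/45).
import numpy as np

rng = np.random.default_rng(6)
n, reps = 2_000, 2_000
X = rng.uniform(size=(reps, n + 2))               # X_0, ..., X_{n+1}
Y = (X[:, :-2] > X[:, 1:-1]) & (X[:, 1:-1] < X[:, 2:])
T = np.sqrt(n) * (Y.mean(axis=1) - 1 / 3)

print(T.var(), 2 / 45)   # empirical variance vs long-run variance
```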

5 The multivariate normal distribution

Let $X = (X_1, \ldots, X_d)^\top$ be a $d$-dimensional random (column) vector.

5.1 Definition (Expectation vector, covariance matrix)

a) If $E|X_j| < \infty$, $j = 1, \ldots, d$, then

$E(X) := (E X_1, \ldots, E X_d)^\top$

is called the expectation (expectation vector) of $X$.

b) If $E X_j^2 < \infty$, $j = 1, \ldots, d$, the $(d \times d)$-matrix

$\Sigma(X) := \left( \text{Cov}(X_j, X_k) \right)_{1 \le j,k \le d}$

is called the covariance matrix of $X$.

More generally, if $Y := (Y_{j\ell})_{1 \le j \le k, 1 \le \ell \le m}$ is $\mathbb{R}^{km}$-valued (a random $(k \times m)$-matrix) and if $E|Y_{j\ell}| < \infty$ $\forall\, j, \ell$, then

$E(Y) := \left( E(Y_{j\ell}) \right)_{k \times m}.$

With this new notation, we have

$\Sigma(X) = E\left[ (X - EX)(X - EX)^\top \right] = E\left[ X X^\top \right] - EX \cdot (EX)^\top.$

5.2 Remark (Affine transformations)

If $A \in \mathbb{R}^{s \times d}$ and $b \in \mathbb{R}^s$, then

a) $E(AX + b) = A\, E(X) + b$,

b) $\Sigma(AX + b) = A\, \Sigma(X)\, A^\top$.

5.3 Theorem (Properties of covariance matrices)

a) $\Sigma(X)$ is symmetric and positive-semidefinite ($\Sigma \ge 0$),

b) $\Sigma(X)$ is singular (non-invertible) $\Longleftrightarrow$

$\exists\, c \in \mathbb{R}^d, c \neq 0, \;\exists\, \gamma \in \mathbb{R}$ such that $P(c^\top X = \gamma) = 1$

($\Longleftrightarrow$ there is a hyperplane $H \subset \mathbb{R}^d$ such that $P(X \in H) = 1$).

Proof: a) Since $\text{Cov}(U, V) = \text{Cov}(V, U)$, $\Sigma(X)$ is symmetric. Let $c := (c_1, \ldots, c_d)^\top \in \mathbb{R}^d$. We have

$0 \le V(c^\top X) = \text{Cov}\left( \sum_{j=1}^{d} c_j X_j, \sum_{k=1}^{d} c_k X_k \right) = \sum_{j=1}^{d} \sum_{k=1}^{d} c_j c_k \, \text{Cov}(X_j, X_k) = c^\top \Sigma(X) c.$

b) $\Sigma(X)$ singular $\Longleftrightarrow \exists\, c \neq 0 : V(c^\top X) = 0 \Longleftrightarrow \exists\, c \neq 0 \;\exists\, \gamma \in \mathbb{R} : P(c^\top X = \gamma) = 1$.

5.4 Definition (d-variate normal distribution)

$X = (X_1, \ldots, X_d)^\top$ has a d-variate normal distribution $:\Longleftrightarrow$

$\forall\, c = (c_1, \ldots, c_d)^\top \in \mathbb{R}^d : c^\top X = \sum_{j=1}^{d} c_j X_j$ has a normal distribution.

Here, $N(a, 0) := \delta_a$.

5.5 Corollary Suppose $X = (X_1, \ldots, X_d)^\top$ has a d-variate normal distribution. We then have:

a) Let $s \in \{1, \ldots, d\}$ and $1 \le i_1 < \ldots < i_s \le d$. Then $(X_{i_1}, \ldots, X_{i_s})^\top$ has an $s$-dimensional normal distribution.

b) $E(X)$ and $\Sigma(X)$ exist (i.e., $E X_j^2 < \infty$, $j = 1, \ldots, d$).

5.6 Remark and definition In the setting of 5.4, we have

$E(c^\top X) = c^\top E(X), \quad V(c^\top X) = c^\top \Sigma(X) c.$

Thm. 1.21 $\Longrightarrow P^X$ in 5.4 is uniquely determined by $a := E(X)$ and $\Sigma := \Sigma(X)$. Manner of speaking: $X$ has a d-variate normal distribution with expectation $a$ and covariance matrix $\Sigma$, for short: $X \sim N_d(a, \Sigma)$ or $P^X = N_d(a, \Sigma)$.

5.7 Corollary (Reproduction theorem for $N_d$)

If $X \sim N_d(a, \Sigma)$, $A \in \mathbb{R}^{s \times d}$, $b \in \mathbb{R}^s$, then

$AX + b \sim N_s\left( Aa + b, \; A \Sigma A^\top \right).$

Proof: $h \in \mathbb{R}^s \Longrightarrow h^\top(AX + b) = (A^\top h)^\top X + h^\top b$ has a univariate normal distribution, q.e.d.

5.8 Lemma If $\Sigma \ge 0$, there is a matrix $A$ with $\Sigma = AA^\top$.

Proof: $\Sigma$ has a complete system of orthonormal eigenvectors with nonnegative eigenvalues, i.e., we have

$\Sigma v_j = \lambda_j v_j, \quad v_j^\top v_k = \delta_{j,k} \quad (j, k = 1, \ldots, d).$

Put $V := (v_1 \cdots v_d) \Longrightarrow V^\top = V^{-1}$. Let $\Lambda := \text{diag}(\lambda_1, \ldots, \lambda_d)$. Then $\Sigma V = V \Lambda$ and $\Sigma = V \Lambda V^\top$. Put $A := V \Lambda^{1/2}$, where $\Lambda^{1/2} := \text{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_d})$. Then

$\Sigma = V \Lambda^{1/2} \Lambda^{1/2} V^\top = AA^\top.$

5.9 Theorem (Existence of $N_d(a, \Sigma)$)

For each $a \in \mathbb{R}^d$ and each symmetric positive-semidefinite $(d \times d)$-matrix $\Sigma$ there is a random vector $X$ such that $X \sim N_d(a, \Sigma)$.

Proof: Let $Y_1, \ldots, Y_d$ be i.i.d. $\sim N(0,1)$. From the addition theorem for the normal distribution, we have $Y := (Y_1, \ldots, Y_d)^\top \sim N_d(0, I_d)$, where $I_d$ is the unit matrix of order $d$. Let $A$ be a $(d \times d)$-matrix such that $\Sigma = AA^\top$. By the reproduction theorem 5.7, we have

$X := AY + a \sim N_d(a, \Sigma), \quad \text{q.e.d.}$

5.10 Theorem ($N_d(a, \Sigma)$ and independence)

Let $X := (X_1, \ldots, X_k)^\top$, $Y := (Y_1, \ldots, Y_\ell)^\top$. Suppose that $\binom{X}{Y}$ has a $(k+\ell)$-dimensional normal distribution. We then have:

$X, Y$ independent $\Longleftrightarrow \text{Cov}(X_i, Y_j) = 0 \;\forall\, i \in \{1, \ldots, k\}, \;\forall\, j \in \{1, \ldots, \ell\}.$

Proof: "$\Longrightarrow$" is obvious.
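The construction in 5.8/5.9 translates directly into a sampler; a sketch assuming NumPy, with an arbitrary $2 \times 2$ covariance matrix.

```python
# Sampling N_d(a, Sigma) as in 5.9: X = A Y + a with Sigma = A A^T,
# where A = V Lambda^{1/2} comes from the eigendecomposition of 5.8.
import numpy as np

a = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

lam, V = np.linalg.eigh(Sigma)           # Sigma = V diag(lam) V^T
A = V @ np.diag(np.sqrt(lam))            # then Sigma = A A^T

rng = np.random.default_rng(7)
Y = rng.standard_normal((100_000, 2))    # rows i.i.d. N(0, I_2)
X = Y @ A.T + a                          # rows ~ N_2(a, Sigma)

print(X.mean(axis=0))                    # ~ a
print(np.cov(X.T))                       # ~ Sigma
```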

"$\Longleftarrow$": Writing $0_{r \times s}$ for the zero matrix of order $r \times s$, we have by assumption

$\Sigma = \begin{pmatrix} \Sigma(X) & 0_{k \times \ell} \\ 0_{\ell \times k} & \Sigma(Y) \end{pmatrix}.$

From 5.8, there are matrices $A, B$ with $\Sigma(X) = AA^\top$, $\Sigma(Y) = BB^\top$. Let $Z_1, \ldots, Z_{k+\ell}$ be i.i.d. $\sim N(0,1)$. Then

$\begin{pmatrix} U_1 \\ \vdots \\ U_k \\ V_1 \\ \vdots \\ V_\ell \end{pmatrix} := \begin{pmatrix} A & 0_{k \times \ell} \\ 0_{\ell \times k} & B \end{pmatrix} \begin{pmatrix} Z_1 \\ \vdots \\ Z_k \\ Z_{k+1} \\ \vdots \\ Z_{k+\ell} \end{pmatrix} + \begin{pmatrix} EX \\ EY \end{pmatrix} \sim N_{k+\ell}\left( \begin{pmatrix} EX \\ EY \end{pmatrix}, \begin{pmatrix} AA^\top & 0_{k \times \ell} \\ 0_{\ell \times k} & BB^\top \end{pmatrix} \right).$

Put $U := (U_1, \ldots, U_k)^\top$, $V := (V_1, \ldots, V_\ell)^\top$. Then

$U = A (Z_1 \cdots Z_k)^\top + EX, \quad V = B (Z_{k+1} \cdots Z_{k+\ell})^\top + EY.$

Notice that $U$ and $V$ are independent (why?) and that $U \overset{D}{=} X$, $V \overset{D}{=} Y$. Since $\binom{X}{Y} \overset{D}{=} \binom{U}{V}$, the assertion follows. (why?)

5.11 Corollary Let $X = (X_1, \ldots, X_d)^\top \sim N_d(a, \Sigma)$. We then have:

$X_1, \ldots, X_d$ independent $\Longleftrightarrow \Sigma$ is a diagonal matrix.

5.12 Theorem (Addition theorem)

If $X \sim N_d(a, \Sigma)$, $Y \sim N_d(b, T)$ and $X$ and $Y$ are independent, then

$X + Y \sim N_d(a + b, \Sigma + T).$

Proof: Exercise!

5.13 Theorem (Density of a non-degenerate normal distribution)

The distribution $N_d(a, \Sigma)$ is called non-degenerate if $\det(\Sigma) > 0$, otherwise degenerate. If $\det(\Sigma) > 0$ and $X \sim N_d(a, \Sigma)$, then $X$ has the Lebesgue density

$f(x) = \frac{1}{(2\pi)^{d/2} \sqrt{\det(\Sigma)}} \exp\left( -\frac{1}{2} (x-a)^\top \Sigma^{-1} (x-a) \right), \quad x \in \mathbb{R}^d.$

Proof: Let $\Sigma = AA^\top$, $Z := (Z_1, \ldots, Z_d)^\top \sim N_d(0, I_d)$. We have $f_Z(z) = (2\pi)^{-d/2} \exp(-z^\top z / 2)$. Since $X \overset{D}{=} AZ + a$ and

$f_{AZ+a}(x) = f_Z(A^{-1}(x-a)) / |\det(A)|, \quad |\det(A)| = \sqrt{\det(\Sigma)},$

we are done.

5.14 Principal component decomposition

As in 5.8, let $\Sigma = V \Lambda V^\top$, $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_d)$, $X \sim N_d(a, \Sigma)$. Assume that $\Sigma$ is invertible. We then have

$X = \sum_{j=1}^{d} (v_j^\top X) v_j = \sum_{j=1}^{d} \left( v_j^\top (X - a) \right) v_j + \underbrace{\sum_{j=1}^{d} (v_j^\top a) v_j}_{=\, a} = \sum_{j=1}^{d} \sqrt{\lambda_j} \, Z_j \, v_j + a,$

where $Z_j := \lambda_j^{-1/2} v_j^\top (X - a)$, $j = 1, \ldots, d$.

Check that $Z_1, \ldots, Z_d$ are i.i.d. $N(0,1)$. (!)

If $\lambda_1 \ge \ldots \ge \lambda_d$, then $\sqrt{\lambda_j} Z_j v_j$ is called the $j$th principal component of $X$.

(Figure: a two-dimensional point cloud centred at $(a_1, a_2)$, with orthogonal directions $v_1, v_2$ and component lengths $\sqrt{\lambda_1} Z_1$, $\sqrt{\lambda_2} Z_2$ along them.)

5.15 Theorem If $X \sim N_d(a, \Sigma)$, $\det(\Sigma) > 0$, then

$(X - a)^\top \Sigma^{-1} (X - a) \sim \chi_d^2.$

Proof: Let $Q := (X - a)^\top \Sigma^{-1} (X - a)$. We have

$X \overset{D}{=} AZ + a, \quad \text{where } Z \sim N_d(0, I_d), \; \Sigma = AA^\top.$

Put $Z =: (Z_1, \ldots, Z_d)^\top$. We then have

$Q \overset{D}{=} (AZ)^\top \Sigma^{-1} (AZ) = Z^\top A^\top \left( AA^\top \right)^{-1} A Z = Z^\top Z = \sum_{j=1}^{d} Z_j^2 \sim \chi_d^2.$
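Continuing the sampler sketched after 5.10, Theorem 5.15 can be checked empirically; again only NumPy is assumed, and any $A$ with $\Sigma = AA^\top$ works (here a Cholesky factor instead of the eigendecomposition).

```python
# Empirical check of 5.15: the squared Mahalanobis distance of
# X ~ N_d(a, Sigma) is chi^2_d, with mean d and variance 2d.
import numpy as np

d = 3
rng = np.random.default_rng(8)
a = rng.normal(size=d)
M = rng.normal(size=(d, d))
Sigma = M @ M.T + d * np.eye(d)          # positive definite by construction

A = np.linalg.cholesky(Sigma)            # Sigma = A A^T
X = rng.standard_normal((200_000, d)) @ A.T + a

Sinv = np.linalg.inv(Sigma)
Q = np.einsum('ij,jk,ik->i', X - a, Sinv, X - a)
print(Q.mean(), Q.var())                 # ~ d and ~ 2d
```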

6 Convergence in distribution and CLT in R^d

Let $X = (X_1, \ldots, X_d)$ be a $d$-dimensional random vector on $(\Omega, \mathcal{A}, P)$.

6.1 Definition (Distribution function of a random vector)

The function $F : \mathbb{R}^d \to [0, 1]$, defined by

$F(x) := P(X_1 \le x_1, \ldots, X_d \le x_d), \quad x = (x_1, \ldots, x_d) \in \mathbb{R}^d,$

is called the distribution function of $X$.

In what follows, the notation $x^{(n)} \downarrow x$ for a sequence $(x^{(n)})$ in $\mathbb{R}^d$ and $x \in \mathbb{R}^d$ means

$x_j^{(n)} \downarrow x_j \quad \text{for each } j = 1, \ldots, d.$

Here, $x = (x_1, \ldots, x_d)$ and $x^{(n)} = (x_1^{(n)}, \ldots, x_d^{(n)})$.

6.2 Theorem (Properties of F) We have:

a) If $x, y \in \mathbb{R}^d$ and $x \le y$, then

$0 \le \Delta_x^y F := \sum_{(\varepsilon_1, \ldots, \varepsilon_d) \in \{0,1\}^d} (-1)^{d - \varepsilon_1 - \ldots - \varepsilon_d} F\left( y_1^{\varepsilon_1} x_1^{1-\varepsilon_1}, \ldots, y_d^{\varepsilon_d} x_d^{1-\varepsilon_d} \right)$

(generalized monotonicity),

b) $F$ is continuous from above, i.e., if $x^{(n)} \downarrow x$ then $\lim_{n\to\infty} F(x^{(n)}) = F(x)$,

c) If $x_j^{(n)} \to -\infty$ for some $j \in \{1, \ldots, d\}$, then $F(x^{(n)}) \to 0$; if $x_j^{(n)} \to \infty$ for each $j \in \{1, \ldots, d\}$, then $F(x^{(n)}) \to 1$.

Proof: Exercise! (Use that $P^X$ is continuous from above and from below, and that $\Delta_x^y F = P(X \in (x, y])$.)

6.3 Remarks

a) $P^X$ is uniquely determined by the values

$P^X(A), \quad A \in \mathcal{H}^d := \{ (x, y] : x, y \in \mathbb{R}^d, x \le y \}$

(uniqueness theorem for measures).

b) If a function $F : \mathbb{R}^d \to [0, 1]$ satisfies 6.2 a)-c), there is a unique probability measure $Q$ on $\mathcal{B}^d$ such that

$Q((x, y]) = \Delta_x^y F \quad \forall\, (x, y] \in \mathcal{H}^d$

(follows from Caratheodory's extension theorem, see e.g. Billingsley, P.: Probability and Measure, p. 177).

Notation: Let $\mathcal{O}^d$, $\mathcal{A}^d$ denote the class of open and closed sets in $\mathbb{R}^d$, respectively. For a set $B \subset \mathbb{R}^d$, let

$B^\circ := \bigcup \{ O \in \mathcal{O}^d : O \subset B \}$ (interior of $B$),
$\overline{B} := \bigcap \{ A \in \mathcal{A}^d : A \supset B \}$ (closure of $B$),
$\partial B := \overline{B} \setminus B^\circ$ (boundary of $B$).

Let $C_b := \{ f : \mathbb{R}^d \to \mathbb{R} : f \text{ bounded and continuous} \}$. Let $X, X_1, X_2, \ldots$ be $d$-dimensional random vectors on $(\Omega, \mathcal{A}, P)$. Put $Q := P^X$, $Q_n := P^{X_n}$, $F(x) := P(X \le x)$, $F_n(x) := P(X_n \le x)$.

6.4 Theorem (Portmanteau theorem)

The following assertions are equivalent:

a) $\lim_{n\to\infty} \int h \, dQ_n = \int h \, dQ \quad \forall\, h \in C_b$,

b) $\limsup_{n\to\infty} Q_n(A) \le Q(A) \quad \forall\, A \in \mathcal{A}^d$,

c) $\liminf_{n\to\infty} Q_n(O) \ge Q(O) \quad \forall\, O \in \mathcal{O}^d$,

d) $\lim_{n\to\infty} Q_n(B) = Q(B) \quad \forall\, B \in \mathcal{B}^d$ such that $Q(\partial B) = 0$,

e) $\lim_{n\to\infty} F_n(x) = F(x) \quad \forall\, x \in C(F)$ (the set of continuity points of $F$).

Notice that $\int h \, dQ_n = E h(X_n)$, $\int h \, dQ = E h(X)$, $Q_n(A) = P(X_n \in A)$ etc. Statements can be rephrased in terms of $X_n$ and $X$.

Proof: "a) $\Longrightarrow$ b)":

Memo: a) $\int h \, dQ_n \to \int h \, dQ \;\forall\, h \in C_b$, b) $\limsup_{n\to\infty} Q_n(A) \le Q(A) \;\forall\, A \in \mathcal{A}^d$

Let $\|\cdot\|$ be the Euclidean norm on $\mathbb{R}^d$. Fix $A \in \mathcal{A}^d$. Put

$h_j(x) := \max(0, 1 - j \, \|x - A\|), \quad j \ge 1,$

where $\|x - A\| := \inf\{ \|x - y\| : y \in A \}$. Then $h_j \in C_b$ and $h_j \ge 1_A$, $j \ge 1$. Moreover, $1 \ge h_j \downarrow 1_A$ as $j \to \infty$. (why?) From a), we have

$\lim_{n\to\infty} \int h_j \, dQ_n = \int h_j \, dQ, \quad j \ge 1.$

Furthermore,

$Q_n(A) = \int 1_A \, dQ_n \le \int h_j \, dQ_n, \quad n \ge 1, \; j \ge 1,$

and thus

$\limsup_{n\to\infty} Q_n(A) \le \int h_j \, dQ, \quad j \ge 1.$

Since $\int h_j \, dQ \downarrow \int 1_A \, dQ = Q(A)$ as $j \to \infty$ (why?), b) follows.

Memo: b) $\limsup_{n\to\infty} Q_n(A) \le Q(A) \;\forall\, A \in \mathcal{A}^d$, c) $\liminf_{n\to\infty} Q_n(O) \ge Q(O) \;\forall\, O \in \mathcal{O}^d$

Memo: d) $\lim_{n\to\infty} Q_n(B) = Q(B) \;\forall\, B \in \mathcal{B}^d$ such that $Q(\partial B) = 0$

"b) $\Longleftrightarrow$ c)": Take complements!

"b) + c) $\Longrightarrow$ d)": Let $B \in \mathcal{B}^d$. We have

$Q(B^\circ) \le \liminf_{n\to\infty} Q_n(B^\circ) \quad \text{(by c))}$
$\le \liminf_{n\to\infty} Q_n(B) \le \limsup_{n\to\infty} Q_n(B) \le \limsup_{n\to\infty} Q_n(\overline{B})$
$\le Q(\overline{B}) \quad \text{(by b))}$
$= Q(B^\circ) + Q(\partial B).$

If $Q(\partial B) = 0$, then $\lim_{n\to\infty} Q_n(B) = Q(B)$, q.e.d.

Memo: d) $Q_n(B) \to Q(B) \;\forall\, B \in \mathcal{B}^d$ s.th. $Q(\partial B) = 0$

Memo: a) $\int h \, dQ_n \to \int h \, dQ \;\forall\, h \in C_b$

"d) $\Longrightarrow$ a)": Approximate $h \in C_b$ by

$h_m := \sum_{j=1}^{m} \alpha_j 1_{B_j} \quad \text{such that } Q(\partial B_j) = 0 \;\forall\, j.$

To this end, let $K := \sup_{x \in \mathbb{R}^d} |h(x)| < \infty$. Fix $\varepsilon > 0$. Choose $\alpha_0 < \alpha_1 < \ldots < \alpha_m$ such that $\alpha_0 < -K$, $\alpha_m > K$ and $\alpha_j - \alpha_{j-1} \le \varepsilon$ for each $j = 1, \ldots, m$. If

$B_j := \{ x \in \mathbb{R}^d : \alpha_{j-1} < h(x) \le \alpha_j \} = \{ \alpha_{j-1} < h \le \alpha_j \},$

then $\|h - h_m\|_\infty \le \varepsilon$. Notice that $\partial B_j \subset \{ h = \alpha_{j-1} \} \cup \{ h = \alpha_j \}$. Hence if, in addition,

$P(h(X) \in \{ \alpha_0, \ldots, \alpha_m \}) = 0,$

then $Q(\partial B_j) = 0$ for each $j = 1, \ldots, m$.

Memo: d) $Q_n(B) \to Q(B) \;\forall\, B \in \mathcal{B}^d$ s.th. $Q(\partial B) = 0$

Memo: a) $\int h \, dQ_n \to \int h \, dQ \;\forall\, h \in C_b$

Memo: $h_m = \sum_{j=1}^{m} \alpha_j 1_{B_j}$, $Q(\partial B_j) = 0 \;\forall\, j$, $\|h - h_m\|_\infty \le \varepsilon$

Now,

$\left| \int h \, dQ_n - \int h \, dQ \right| \le \left| \int h \, dQ_n - \int h_m \, dQ_n \right| + \left| \int h_m \, dQ_n - \int h_m \, dQ \right| + \left| \int h_m \, dQ - \int h \, dQ \right|$

$\le \int |h - h_m| \, dQ_n + \left| \sum_{j=1}^{m} \alpha_j \left( Q_n(B_j) - Q(B_j) \right) \right| + \int |h_m - h| \, dQ$

$\le 2\varepsilon + \underbrace{\left| \sum_{j=1}^{m} \alpha_j \left( Q_n(B_j) - Q(B_j) \right) \right|}_{\to\, 0 \text{ by d)}}.$

Hence, $\limsup_{n\to\infty} \left| \int h \, dQ_n - \int h \, dQ \right| \le 2\varepsilon$, q.e.d. (since $\varepsilon > 0$ was arbitrary).

Memo: d) $Q_n(B) \to Q(B) \;\forall\, B \in \mathcal{B}^d$ s.th. $Q(\partial B) = 0$

Memo: e) $F_n(x) \to F(x) \;\forall\, x \in C(F)$

Memo: c) $\liminf_{n\to\infty} Q_n(O) \ge Q(O) \;\forall\, O \in \mathcal{O}^d$

"d) $\Longrightarrow$ e)": Let $B_x := (-\infty, x]$, $x \in \mathbb{R}^d$. Check that $x \in C(F) \Longleftrightarrow Q(\partial B_x) = 0$ (!), q.e.d.

"e) $\Longrightarrow$ c)": Let $D$ be a countable subset of $\mathbb{R}$ such that $\overline{D} = \mathbb{R}$ and

$Q\left( \{ (x_1, \ldots, x_d) \in \mathbb{R}^d : x_j = a \} \right) = 0 \quad \forall\, a \in D \;\forall\, j = 1, \ldots, d.$

Then $D^d \subset C(F)$. (!) Let

$\mathcal{M} := \{ (a_1, b_1] \times \cdots \times (a_d, b_d] : a_j, b_j \in D, \; a_j < b_j \text{ for } j \in \{1, \ldots, d\} \}.$

From e), we have

$Q_n\left( (a_1, b_1] \times \cdots \times (a_d, b_d] \right) = \Delta_a^b F_n \to \Delta_a^b F = Q\left( (a_1, b_1] \times \cdots \times (a_d, b_d] \right),$

i.e. $Q_n(B) \to Q(B)$ for each $B \in \mathcal{M}$.

Memo: $D^d \subset C(F)$

Memo: $\mathcal{M} = \{ (a_1, b_1] \times \cdots \times (a_d, b_d] : a_j, b_j \in D, a_j < b_j \}$

Memo: $Q_n(B) \to Q(B)$ for each $B \in \mathcal{M}$.

The system $\mathcal{M} \cup \{\emptyset\}$ is closed with respect to finite intersections. From the inclusion-exclusion formula, we thus obtain $Q_n(B) \to Q(B)$ if $B$ is a finite union of sets in $\mathcal{M}$.

Fix $O \in \mathcal{O}^d$, $O \neq \emptyset$. Since the system $\mathcal{M}$ is "sufficiently rich", there are $B_1, B_2, \ldots \in \mathcal{M}$ such that

$O = \bigcup_{j=1}^{\infty} B_j.$

For fixed $k \in \mathbb{N}$, we have

$Q\left( \bigcup_{j=1}^{k} B_j \right) = \lim_{n\to\infty} Q_n\left( \bigcup_{j=1}^{k} B_j \right) \le \liminf_{n\to\infty} Q_n(O).$

Since $Q$ is continuous from below, it follows that

$Q(O) \le \liminf_{n\to\infty} Q_n(O), \quad \text{q.e.d.}$

6.5 Definition (Convergence in distribution of random vectors)

Let $X, X_1, X_2, \ldots$ be $d$-dimensional random vectors on some probability space $(\Omega, \mathcal{A}, P)$.

$X_n \xrightarrow{D} X \;:\Longleftrightarrow\; \lim_{n\to\infty} E h(X_n) = E h(X) \quad \forall\, h \in C_b.$

By the Portmanteau theorem, there are the following equivalent statements:

$\limsup_{n\to\infty} P(X_n \in A) \le P(X \in A) \quad \forall\, A \in \mathcal{A}^d,$

$\liminf_{n\to\infty} P(X_n \in O) \ge P(X \in O) \quad \forall\, O \in \mathcal{O}^d,$

$\lim_{n\to\infty} P(X_n \in B) = P(X \in B) \quad \forall\, B \in \mathcal{B}^d$ such that $P(X \in \partial B) = 0$,

$\lim_{n\to\infty} F_n(x) = F(x) \quad \forall\, x \in C(F).$

Equivalent notations: $X_n \xrightarrow{D} X$, $X_n \xrightarrow{D} Q := P^X$, $F_n \xrightarrow{D} F$.

6.6 Theorem (Continuous Mapping Theorem, CMT)

Suppose $X_n \xrightarrow{D} X$. If $h : \mathbb{R}^d \to \mathbb{R}^s$ is measurable and $P(X \in C(h)) = 1$, then

$h(X_n) \xrightarrow{D} h(X).$

In terms of $Q_n := P^{X_n}$, $Q := P^X$, an equivalent statement is: if $Q_n \xrightarrow{D} Q$ and $Q(C(h)) = 1$, then $Q_n^h \xrightarrow{D} Q^h$.

Notice that the continuity of $h$ is a sufficient condition.

Proof: Fix $A \in \mathcal{A}^s$. To show: $\limsup_{n\to\infty} Q_n^h(A) \le Q^h(A)$. Notice that

$\overline{h^{-1}(A)} \subset \left( \mathbb{R}^d \setminus C(h) \right) \cup h^{-1}(A) \quad \text{(why?)} \Longrightarrow$

$\limsup_{n\to\infty} Q_n\left( h^{-1}(A) \right) \le \limsup_{n\to\infty} Q_n\left( \overline{h^{-1}(A)} \right) \le Q\left( \overline{h^{-1}(A)} \right) \le Q\left( \mathbb{R}^d \setminus C(h) \right) + Q\left( h^{-1}(A) \right) = 0 + Q^h(A), \quad \text{q.e.d.}$

6.7 Theorem (Slutsky's Lemma)

Let $X, X_1, X_2, \ldots; Y_1, Y_2, \ldots$ be $d$-dimensional random vectors. We then have:

$X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} 0 \quad \Longrightarrow \quad X_n + Y_n \xrightarrow{D} X.$

Proof: Fix $A \in \mathcal{A}^d$ and $\varepsilon > 0$. The set

$A^\varepsilon := \{ x \in \mathbb{R}^d : \exists\, y \in A \text{ with } \|x - y\| \le \varepsilon \}$

is closed, and the triangle inequality yields

$\{ X_n + Y_n \in A \} \subset \{ X_n \in A^\varepsilon \} \cup \{ \|Y_n\| > \varepsilon \}. \quad (!)$

It follows that

$\limsup_{n\to\infty} P(X_n + Y_n \in A) \le \limsup_{n\to\infty} P(X_n \in A^\varepsilon) + 0 \le P(X \in A^\varepsilon)$

(by Portmanteau, since $X_n \xrightarrow{D} X$). Letting $\varepsilon \downarrow 0$ gives $A^\varepsilon \downarrow A$ because $A$ is closed, so $\limsup_{n\to\infty} P(X_n + Y_n \in A) \le P(X \in A)$, and the Portmanteau theorem yields the assertion.

6.8 Definition (Tightness and relative compactness)

Let $\mathcal{Q} \neq \emptyset$ be a set of probability measures on $\mathcal{B}^d$.

a) $\mathcal{Q}$ tight $:\Longleftrightarrow \forall\, \varepsilon > 0 \;\exists\, K \subset \mathbb{R}^d$, $K$ compact: $Q(K) \ge 1 - \varepsilon \;\forall\, Q \in \mathcal{Q}$.

b) $\mathcal{Q}$ relatively compact $:\Longleftrightarrow \forall\, (P_n) \in \mathcal{Q}^{\mathbb{N}} \;\exists$ subsequence $(P_{n_k})$ $\exists$ probability measure $Q$: $P_{n_k} \xrightarrow{D} Q$ as $k \to \infty$.

6.9 Theorem (Prokhorov) $\mathcal{Q}$ tight $\Longleftrightarrow \mathcal{Q}$ relatively compact.

Proof: "$\Longleftarrow$": Suppose that $\mathcal{Q}$ is not tight, i.e., $\exists\, \varepsilon > 0$ $\exists$ sequence $(Q_n)$ in $\mathcal{Q}$ such that $Q_n(A_n) < 1 - \varepsilon$ for each $n \ge 1$, where $A_n := [-n, n]^d$.

Assumption $\Longrightarrow \exists$ subsequence $(Q_{n_k})$ $\exists\, Q$ with $Q_{n_k} \xrightarrow{D} Q$.

Put $A := [-M, M]^d$, where $M > 0$ is chosen to have $Q(A) \ge 1 - \varepsilon/2$ and $Q(\partial A) = 0$. Then, as $k \to \infty$, $Q_{n_k}(A) \to Q(A) \ge 1 - \varepsilon/2$. Since $A_{n_k} \supset A$ for sufficiently large $k$, we have

$Q_{n_k}(A) \le Q_{n_k}(A_{n_k}) < 1 - \varepsilon$

for those $k$, a contradiction!

"$\Longrightarrow$": (for $d = 1$; for $d > 1$ see Billingsley, P.: Probability and Measure, p. 392).

Let $(Q_n)$ be an arbitrary sequence in $\mathcal{Q}$. Put $F_n(x) := Q_n((-\infty, x])$, $x \in \mathbb{R}$. Bolzano-Weierstraß and Cantor's diagonal procedure $\Longrightarrow \exists$ subsequence $(F_{n_k})$ $\exists\, G : \mathbb{Q} \to [0, 1]$ such that

$G(q) := \lim_{k\to\infty} F_{n_k}(q)$

exists for each $q \in \mathbb{Q}$. Put $F(x) := \inf\{ G(q) : q > x \}$, $x \in \mathbb{R}$. Then $F$ is nondecreasing and, by definition, for each $x \in \mathbb{R}$ and $\varepsilon > 0$ there is a $q \in \mathbb{Q}$ such that $x < q$ and $G(q) < F(x) + \varepsilon$. If $x \le y < q$, then $F(y) \le G(q) < F(x) + \varepsilon$. Hence $F$ is continuous from the right.

If $x \in C(F)$, choose $y < x$ so that $F(x) - \varepsilon < F(y)$. Now choose $r, s \in \mathbb{Q}$ so that $y < r < x < s$ and $G(s) < F(x) + \varepsilon$. Since

$F(x) - \varepsilon < G(r) \le G(s) < F(x) + \varepsilon$

and $F_n(r) \le F_n(x) \le F_n(s)$, $n \ge 1$, it follows that

$F(x) - \varepsilon < G(r) \le \liminf_{k\to\infty} F_{n_k}(x) \le \limsup_{k\to\infty} F_{n_k}(x) \le G(s) < F(x) + \varepsilon$

and thus $\lim_{k\to\infty} F_{n_k}(x) = F(x)$.

Memo: $(Q_n)$ sequence in $\mathcal{Q}$, $F_n(x) := Q_n((-\infty, x])$ $\Longrightarrow \exists\, F : \mathbb{R} \to [0, 1]$

Memo: $F$ nondecreasing, continuous from the right, $\exists\, (n_k) : F_{n_k}(x) \to F(x) \;\forall\, x \in C(F)$

In general, $F$ need not be a distribution function, i.e. $0 < F(-\infty)$ and/or $F(\infty) < 1$ are possible. But: since $\mathcal{Q}$ is tight, we have:

$\forall\, \varepsilon > 0 \;\exists\, a, b$ with $a < b$ and $Q_n((a, b]) = F_n(b) - F_n(a) \ge 1 - \varepsilon$, $n \ge 1$.

Let $a', b' \in C(F)$ with $a' < a$, $b' > b$. Then

$1 - \varepsilon \le Q_{n_k}((a, b]) \le Q_{n_k}((a', b']) = F_{n_k}(b') - F_{n_k}(a') \to F(b') - F(a').$

Hence, $F$ is a distribution function. Let $Q$ be the distribution associated with $F$, q.e.d.

6.10 Remarks

a) $(X_n)_{n \ge 1}$ tight $:\Longleftrightarrow \{ P^{X_n} : n \ge 1 \}$ tight.

b) $X_n \xrightarrow{D} X \Longrightarrow (X_n)_{n \ge 1}$ tight.

c) If $(X_n)$ is tight and there is a probability distribution $Q$ such that $X_{n_k} \xrightarrow{D} Q$ for each subsequence $(X_{n_k})$ that converges in distribution, then $X_n \xrightarrow{D} X$, where $P^X = Q$.

Page 74: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Convergence in distribution and CLT in Rd

6.11 Theorem (Continuity Theorem of Levy-Cramer)

LetX,X1, X2, . . . be d-dimensional random vectors with characteristic functionsϕ,ϕ1, ϕ2, . . .. We then have:

XnD−→ X ⇐⇒ lim

n→∞ϕn(t) = ϕ(t) ∀ t ∈ R

d.

Proof:”=⇒“: For fixed t ∈ Rd, put h1(x) := cos(t⊤x), h2(x) := sin(t⊤x),

and use the definition of XnD−→ X.

”⇐=“: Let Xn =: (X

(1)n , . . . , X

(d)n )⊤, X =: (X(1), . . . , X(d))⊤. Write

ej := (0, . . . , 0, 1, 0, . . . , 0)⊤ for the jth unit vector in Rd. Put t := αej , whereα ∈ R. Then

ϕX

(j)n

(α) = E

[exp

(iαX(j)

n

) ]= ϕn(αej) → ϕ(αej)

= E

[exp

(iαX(j)

)]= ϕX(j) (α).

Thm. 1.17 =⇒ X(j)n

D−→ X(j) for each j ∈ 1, . . . , d. It follows that

(X(j)n )n≥1 is tight for each j ∈ 1, . . . , d. Thus, (Xn)n≥1 is tight. (!) Thm.

6.9 =⇒ ∃ subsequence (Xnk) ∃ r.v. Y with Xnk

D−→ Y as k →∞. Part”=⇒“

and 1.21 a) =⇒ XD= Y , i.e., Xnk

D−→ X. 6.10 c) =⇒ assertion, q.e.d.

Norbert Henze, KIT 6.17

Page 75: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Convergence in distribution and CLT in Rd

6.12 Theorem (Cramer-Wold-Device)

Let X,X1, X2, . . . be d-dimensional random vectors. We then have:

XnD−→ X ⇐⇒ c⊤Xn

D−→ c⊤X ∀c ∈ Rd.

Proof:”=⇒“ : Put h(x) := c⊤x and use the Continuous Mapping Theorem.

”⇐=“ : We have

ϕXn(c) = E

[exp

(ic⊤Xn

)]= ϕc⊤Xn

(1)

ϕX(c) = E

[exp

(ic⊤X

)]= ϕc⊤X(1).

By the Continuity Theorem of Levy-Cramer in R, we haveϕc⊤Xn

(1)→ ϕc⊤X(1). Thus, ϕXn(c)→ ϕX(c) for each c ∈ Rd, and theassertion follows from Theorem 6.11.

Norbert Henze, KIT 6.18

Page 76: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Convergence in distribution and CLT in Rd

6.13 Theorem (Multivariate Central Limit Theorem)

Let X1, X2, . . . be i.i.d. d-dimensional random vectors such that E‖X1‖2 <∞.Putting a := EX1, Σ := Σ(X1), we have

1√n

(n∑

j=1

Xj − na)

D−→ Nd(0,Σ).

Proof: Let Zn := n−1/2(∑n

j=1Xj − na), Y ∼ Nd(0,Σ). To show:

c⊤ZnD−→ c⊤Y ∀c ∈ R

d.

Notice that

c⊤Zn =1√n

(n∑

j=1

c⊤Xj − nc⊤a).

We have E(c⊤Zn) = 0, V(c⊤Zn) = V(c⊤X1) = c⊤Σc, c⊤Y ∼ N(0, c⊤Σc)=⇒ w.l.o.g. c⊤Σc > 0. Thm. 1.22, applied to (c⊤Xj)j≥1, yields

c⊤Zn√c⊤Σc

=

∑nj=1 c

⊤Zn − nc⊤a√nc⊤Σc

D−→ N(0, 1).

The CMT yields c⊤ZnD−→√c⊤Σc N(0, 1) = N(0, c⊤Σc), q.e.d.

Norbert Henze, KIT 6.19

Page 77: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Convergence in distribution and CLT in Rd

6.14 Example (Chi-square-test)

Let X1, X2, . . . be i.i.d., P(X1 = ej) := pj , j = 1, . . . , s,0 < pj < 1 ∀j, p1 + . . .+ ps = 1, ej is jth unit vector in Rs. Then

∑nj=1Xj ∼ Mult(n; p1, . . . , ps) (multinomial distribution).

a := E(X1) = (p1, . . . , ps)⊤,

Σ := Σ(X1) = (pjδkj − pjpk)1≤j,k≤s (Σ is singular!)

6.13 =⇒ 1√n

(n∑

j=1

Xj − na)

D−→ Z := (Z1, . . . , Zs)⊤ ∼ Ns(0,Σ).

Let A := (pjδkj − pjpk)1≤j,k≤s−1 =⇒ A−1 = (δjkp−1k + p−1

s )1≤j,k≤s−1. Put∑n

j=1Xj =: (Nn,1, . . . , Nn,s−1, Nn,s)⊤, Vn := (Nn,1, . . . , Nn,s−1)

⊤.

CMT =⇒Wn :=1√n

(Vn−n(p1, . . . , ps−1)

⊤)

D−→ (Z1, . . . , Zs−1)⊤ ∼ Ns−1(0, A).

CMT =⇒W⊤n A

−1WnD−→ (Z1, . . . , Zs−1)A

−1(Z1, . . . , Zs−1)⊤ ∼ χ2

s−1.

We have W⊤n A

−1Wn =s∑

j=1

(Nn,j − npj)2npj

(use Nn,1 + . . .+Nn,s = n).

Norbert Henze, KIT 6.20

Page 78: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Convergence in distribution and CLT in Rd

6.15 Theorem (Delta Method)

Let (Tn) be a sequence of d-dimensional random vectors such that, for someϑ ∈ Rd, √

n (Tn − ϑ) D−→ X ∼ Nd(0,Σ). (6.1)

Suppose that the measurable function g : Rd → Rs is differentiable at ϑ with(s× d)-Jacobian matrix g′(ϑ). We then have

√n (g(Tn)− g(ϑ)) D−→ Ns

(0, g′(ϑ)Σg′(ϑ)⊤

).

Proof: We have (pointwise on the underlying probability space)√n (g(Tn)− g(ϑ)) = g′(ϑ)

√n(Tn − ϑ) + ‖

√n(Tn − ϑ)‖ r(Tn − ϑ),

where r(Tn − ϑ)→ 0 as Tn → ϑ. (6.1) =⇒ Tn − ϑ P−→ 0 (!). Invoking 1.4, it

follows that r(Tn − ϑ) P−→ 0. Furthermore, (6.1) and the CMT yield

‖√n(Tn − ϑ)‖ D−→ ‖X‖. From Slutsky’s lemma, we therefore have

‖√n(Tn − ϑ)‖ · r(Tn − ϑ) P−→ 0.

From (6.1) and the CMT, we obtain g′(ϑ)√n(Tn − ϑ) D−→ g′(ϑ)X. Now, 5.7

implies g′(ϑ)X ∼ Ns(0, g′(ϑΣg′(ϑ)⊤), and the assertion follows from 6.7.

Norbert Henze, KIT 6.21

Page 79: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Convergence in distribution and CLT in Rd

6.16 Stochastic Landau notations

Let X,X1, X2, . . . be d-dimensional random vectors and (an) a sequence ofpositive real numbers. The following notation is frequently encountered:

Xn = OP(1) :⇐⇒ (Xn)n≥1 tight,

Xn = OP(an) :⇐⇒(Xn

an

)

n≥1

tight,

Xn = oP(1) :⇐⇒ XnP−→ 0,

Xn = oP(an) :⇐⇒ Xn

an

P−→ 0,

Xn = X + oP(1) :⇐⇒ Xn −X P−→ 0.

Norbert Henze, KIT 6.22

Page 80: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Convergence in distribution and CLT in Rd

6.17 Theorem (Properties of OP and oP)

Let Xn, Yn(n ≥ 1) be d-dimensional random vectors and (Zn)n≥1 a sequenceof random variables. We then have:

a) Xn = OP(1), Yn = OP(1) =⇒ Xn + Yn = OP(1),

b) Xn = oP(1), Yn = oP(1) =⇒ Xn + Yn = oP(1),

c) Xn = OP(1), Zn = OP(1) =⇒ Xn · Zn = OP(1),

d) Xn = OP(1), Zn = oP(1) =⇒ Xn · Zn = oP(1),

e) Xn = OP(1), h : Rd → Rs continuous =⇒ h(Xn) = OP(1).

Proof: Exercise!

6.18 Corollary If XnD−→ X and Zn = a+ oP(1), then ZnXn

D−→ aX.

Proof: We haveZnXn = (Zn − a)Xn + aXn.

Since XnD−→ X implies Xn = OP(1), the first summand on the right-hand

side is oP(1) by 6.17 d). The second term converges to aX in distribution by

the CMT. Hence ZnXnD−→ aX by Slutsky’s lemma.

Norbert Henze, KIT 6.23

Page 81: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Empirical distribution functions

7 Empirical distribution functions

Let X1, X2, . . . be i.i.d. random variables on a probability space (Ω,A, P)having distribution function F (x) := P(X1 ≤ x), x ∈ R.

7.1 Definition (Empirical distribution function)

The function

Fn :

Ω× R→ [0, 1]

(ω, x) 7→ Fωn (x) :=

1

n

n∑

j=1

1Xj(ω) ≤ x

is called the empirical distribution function (EDF) of X1, . . . , Xn.

7.2 Remarks

a) For fixed ω ∈ Ω, Fωn is the distribution function of the discrete probability

measure n−1∑n

j=1 δXj(ω).

b) For fixed x ∈ R, Fn(x) := n−1∑nj=1 1Xj ≤ x is a random variable.

SLLN =⇒ Fn(x)a.s.−→ F (x) as n→∞.

Norbert Henze, KIT 7.1

Page 82: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Empirical distribution functions

1

x

Fω8 (x)

x6 x2 x7 x5 x1 x3 x8 x4

.5

••

••

••

••

Realization of an EDF corresponding to data xj = Xj(ω), j = 1, . . . , 8

Norbert Henze, KIT 7.2

Page 83: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Empirical distribution functions

7.3 Theorem (Glivenko-Cantelli, fundamental theorem of statistics)

We havelim

n→∞supx∈R

∣∣Fn(x)− F (x)∣∣ = 0 P-a.s.

Proof: Let

Dn := supx∈R

∣∣Fn(x)− F (x)∣∣(= ‖Fn − F‖∞

),

Dωn := sup

x∈R

∣∣Fωn (x)− F (x)

∣∣, ω ∈ Ω.

Notice that, by right continuity, Dn = supx∈Q

∣∣Fn(x)− F (x)∣∣. Hence, Dn is

measurable (!) and thus a random variable.

To show: ∃Ω0 ∈ A with P(Ω0) = 1 and

limn→∞

Dωn = 0 ∀ω ∈ Ω0.

A bit of notation: For H : R→ R, H ր, put H(x−) := limy↑x,y<xH(x).

Norbert Henze, KIT 7.3

Page 84: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Empirical distribution functions

From the strong law of large numbers, we have:

∀x ∈ R ∃Ax ∈ A : P(Ax) = 1 and Fωn (x)→ F (x) ∀ω ∈ Ax,

∀x ∈ R ∃Bx ∈ A : P(Bx) = 1 and Fωn (x−)→ F (x−) ∀ω ∈ Bx.

For 0 < p < 1, let F−1(p) := infx ∈ R : F (x) ≥ p(F−1 is the quantile function of F ). We have (!)

F(F−1(p)−

)≤ p ≤ F

(F−1(p)

). (7.1)

For m ≥ 2 and 1 ≤ k ≤ m− 1, let xm,k := F−1(k/m).Putting p = k/m and p = (k − 1)/m in (7.1), we have (!)

F (xm,k−) − F (xm,k−1) ≤ 1

mfor each k = 2, . . . ,m− 1. (7.2)

Moreover,

F (xm,1−) ≤ 1

m, F (xm,m−1) ≥ 1− 1

m. (7.3)

Putting u ∨ v := max(u, v), set

Dωm,n := max

1≤k≤m−1

∣∣Fωn (xm,k)−F (xm,k)

∣∣∨∣∣Fω

n (xm,k−)−F (xm,k−)∣∣. (7.4)

Norbert Henze, KIT 7.4

Page 85: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Empirical distribution functions

Memo: Dωn = supx∈R

∣∣Fωn (x)− F (x)

∣∣

Memo: Dωm,n = max

1≤k≤m−1

∣∣Fωn (xm,k)− F (xm,k)

∣∣ ∨∣∣Fω

n (xm,k−)− F (xm,k−)∣∣

We claim that

Dωn ≤

1

m+Dω

m,n (m ≥ 2, n ≥ 1, ω ∈ Ω). (7.5)

To this end, fix x ∈ R.

Case 1: ∃k ∈ 2, . . . ,m− 1 such that xm,k−1 ≤ x < xm,k.

Monotonicity arguments yield

Fωn (x) ≤ Fω

n (xm,k−) ≤ F (xm,k−) +Dωm,n

≤ F (xm,k−1) +1

m+Dω

m,n ≤ F (x) +1

m+Dω

m,n.

Analogously, Fωn (x) ≥ F (x)− 1

m−Dω

m,n. Hence

∣∣Fωn (x)− F (x)

∣∣ ≤ 1

m+Dω

m,n. (7.6)

Case 2: x < xm,1 or x ≥ xm,m−1 by complete analogy, using (7.3), q.e.d. (7.5).

Norbert Henze, KIT 7.5

Page 86: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Empirical distribution functions

Memo: Dωn = supx∈R

∣∣Fωn (x)− F (x)

∣∣

Memo: Dωm,n = max

1≤k≤m−1

∣∣Fω(xm,k)− F (xm,n)∣∣ ∨∣∣Fω(xm,k−)− F (xm,n−)

∣∣

Memo: Dωn ≤

1

m+Dω

m,n (m ≥ 2, n ≥ 1, ω ∈ Ω).

Memo: ∀x ∈ R ∃Ax ∈ A : P(Ax) = 1 and Fωn (x)→ F (x)∀ω ∈ Ax

Memo: ∀x ∈ R ∃Bx ∈ A : P(Bx) = 1 and Fωn (x−)→ F (x−) ∀ω ∈ Bx

Put

Ω0 :=∞⋂

m=2

m−1⋂

k=1

(Axm,k

∩Bxm,k

).

We have Ω0 ∈ A and P(Ω0) = 1. (why?)

ω ∈ Ω0 =⇒ limn→∞

Dωm,n = 0, m ≥ 2,

=⇒ lim supn→∞

Dωn ≤

1

m∀m ≥ 2, q.e.d.

Norbert Henze, KIT 7.6

Page 87: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Empirical distribution functions

7.4 Remarks

a) In statistical terms, Theorem 7.3, i.e., ‖Fn − F‖∞ a.s.−→ 0, means that

(Fn) is a strongly consistent sequence of estimators of F .

b) Let X1, X2, . . . be i.i.d. d-dimensional random vectors. Let B ∈ Bd,

Qωn(B) :=

1

n

n∑

j=1

1B(Xj(ω)) =1

n

n∑

j=1

δXj(ω)(B), ω ∈ Ω.

From the SLLN, there is a set AB ∈ A with P(AB) = 1 andQω

n(B)→ PX1(B) ∀ω ∈ AB .

Let C ⊂ Bd be a class of Borel sets. Do we have

limn→∞

supB∈C

∣∣Qn(B)− PX1(B)

∣∣ = 0P-almost surely ? (7.7)

Thm. 7.3 =⇒ (7.7) holds if d = 1 and C = (−∞, x] : x ∈ R.(7.7) holds for C = (−∞, x] : x ∈ Rd (

”multivariate Glivenko-Cantelli“).

If X1 has a Lebesgue density, then (7.7) holds forC = B ∈ Bd : B convex.

Norbert Henze, KIT 7.7

Page 88: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Empirical distribution functions

Memo: Dn = supx∈R

∣∣Fn(x)− F (x)∣∣

c) There is a constant C, 0 < C <∞, not dependent on F , such that

P(Dn > t) ≤ C exp(−2nt2

), t > 0, n ∈ N. (DKW)

(Dvoretsky, Kiefer, Wolfowitz 1956).

If (DKW) holds for X1 ∼ U(0, 1), then it holds for any F ! (Exercise!)

Notice that (DKW) entails

∞∑

n=1

P(Dn > ε) ≤ C∞∑

n=1

exp(−2nε2

)<∞ ∀ ε > 0.

The Borel–Cantelli Lemma then gives

P

(lim supn→∞

Dn > ε)

= 0 ∀ε > 0,

from which Theorem 7.3 follows.

Norbert Henze, KIT 7.8

Page 89: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Empirical distribution functions

7.5 Theorem Let X1, X2, . . . be i.i.d. random variables with distributionfunction F ,

Bn(x) :=√n(Fn(x)− F (x)

), x ∈ R.

For any k ≥ 1 and any choice of x1, . . . , xk ∈ R, we have

Bn(x1)

...Bn(xk)

D−→ Nk

0...0

, Σ

,

where Σ = (σij)1≤i,j≤k and

σij = F (min(xi, xj)) − F (xi)F (xj), 1 ≤ i, j ≤ k.

Proof: Exercise! (use the multivariate CLT)

Norbert Henze, KIT 7.9

Page 90: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

8 Limit theorems for U -statistics

Let X1, X2, . . . be i.i.d. d-dimensional random vectors with distributionfunction F . For k ∈ N, let h : (Rd)k → R be measurable and symmetric.

8.1 Definition (U-statistic)

Un := Un(X1, . . . , Xn) :=1(nk

)∑

1≤i1<...<ik≤n

h (Xi1 , . . . , Xik )

is called U -statistic of order k with kernel h.

8.2 Remark In statistical applications, F is assumed to be unknown.

We assume that the second moment of h exists, i.e.,

EFh2 = EFh

2 (X1, . . . , Xk) <∞

and putϑ := ϑ(F ) := EF (Un) = EFh(X1, . . . , Xk).

Then Un is an unbiased estimator of ϑ.

Norbert Henze, KIT 8.1

Page 91: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

In the following examples, we have d = 1.

8.3 Examples

a) k = 1, Un =1

n

n∑

j=1

h(Xj), ϑ(F ) = EFh(X1).

b) k = 2, h(x1, x2) =1

2(x1 − x2)

2,

Un =1(n2

)∑

1≤i<j≤n

1

2(Xi −Xj)

2 = · · · = 1

n− 1

n∑

j=1

(Xj −Xn

)2(!)

ϑ(F ) = VF (X1), Un is the sample variance

c) k = 2, h(x1, x2) = 1x1 + x2 > 0,

Un =1(n2

)∑

1≤i<j≤n

1Xi +Xj > 0,

ϑ(F ) = PF (X1 +X2 > 0).

In the sequel, we often omit the index F and write E = EF , V = VF , P = PF

etc.

Norbert Henze, KIT 8.2

Page 92: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

8.4 Theorem (Variance of a U-statistic)

For c ∈ 1, 2, . . . , k, let

σ2c := Cov (h(X1, . . . , Xc, Xc+1, . . . , Xk), h(X1, . . . , Xc, Xk+1, . . . , X2k−c))

(”c common indices“). We then have

V(Un) =1(nk

)k∑

c=1

(k

c

)(n− kk − c

)σ2c .

Proof: Exercise! (use V(Un) = Cov(Un, Un) and the bilinearity of Cov(·, ·))

For c ∈ 1, . . . , k − 1, let

hc(x1, . . . , xc) := E [h(x1, . . . , xc, Xc+1, . . . , Xk)]

= E[h(X1, . . . , Xk)

∣∣X1 = x1, . . . , Xc = xc

].

Furthermore, put hk := h. Notice that

Ehc = E [hc(X1, . . . , Xc)] = ϑ = Eh.

Norbert Henze, KIT 8.3

Page 93: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Memo: σ2c := Cov (h(X1, . . . , Xk), h(X1, . . . , Xc, Xk+1, . . . , X2k−c))

Memo: hc(x1, . . . , xc) = E [h(x1, . . . , xc, Xc+1, . . . , Xk)] ; Ehc = ϑ

Memo: hc(X1, . . . , Xc) = E [h(X1, . . . , Xc, Xc+1, . . . , Xk)|X1, . . . , Xc]

8.5 Lemma We have σ2c = V (hc(X1, . . . , Xc)) .

Proof: We have

σ2c = E [h(X1, . . . , Xk) · h(X1, . . . , Xc, Xk+1, . . . , X2k−c)] − ϑ2

= E

[E

[h(X1, . . . , Xk)h(X1, . . . , Xc, Xk+1, . . . , X2k−c)

∣∣∣X1, . . . , Xc)] ]− ϑ2

︸ ︷︷ ︸= h2

c(X1, . . . , Xc)

= E(h2c

)− (Ehc)

2

= V (hc) .√

Norbert Henze, KIT 8.4

Page 94: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

8.6 Example (cf. Example 8.3 b))

h(x1, x2) =1

2(x1 − x2)

2

µ := EX1, µr := E [(X1 − µ)r] =⇒

h1(x1) = E

[1

2(x1 −X2)

2

]=

1

2E[(X2 − µ+ µ− x1)

2]

=1

2

(µ2 + (µ− x1)

2).

σ21 = V

(1

2

(µ2 + (X1 − µ)2

))=

1

4

(µ4 − µ2

2

),

σ22 = V

(1

2(X1 −X2)

2

)=

1

2

(µ4 + µ2

2

). (!)

Ex. 8.3 b), Thm. 8.4 =⇒

V

(1

n− 1

n∑

j=1

(Xj −Xn

)2)

=2

n(n− 1)

[(2

1

)(n− 2

2− 1

)σ21 +

(2

2

)(n− 2

2− 2

)σ22

]

=1

n

(µ4 − n− 3

n− 1µ22

). (!)

Norbert Henze, KIT 8.5

Page 95: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

8.7 Definition (Hajek Projection) Let

Un =1(nk

)∑

1≤i1<...<ik≤n

h (Xi1 , . . . , Xik )

be a U -statistic and ϑ = EUn. The random variable

Un :=n∑

j=1

E[Un|Xj ]− (n− 1)ϑ

is called the Hajek projection of Un.

Notice that EUn = ϑ, and that Un is a sum of independent random variables.

8.8 Lemma We have:

a) Un =k

n

n∑

j=1

(h1(Xj)− ϑ) + ϑ,

b) E(Un − Un)2 = σ2

1

k

(n−kk−1

)(nk

) − k2

n

+

1(nk

)k∑

c=2

(k

c

)(n− kk − c

)σ2c ,

c) E(Un − Un)2 = O(n−2) as n→∞.

Norbert Henze, KIT 8.6

Page 96: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

If A = a1, . . . , ak ⊂ 1, . . . , n, |A| = k, put h(XA) := h(Xa1 , . . . , Xak).

Proof: a) We have

E[Un|Xj ] =1(nk

)∑

A:|A|=k

E [h(XA)|Xj ] .

Now, E[h(XA)|Xj ] = ϑ if j /∈ A (why?) and E[h(XA)|Xj ] = h1(Xj), ifj ∈ A. Counting the respective cases gives

E[Un|Xj ] =1(nk

)[(

n− 1

k − 1

)h1(Xj) +

(n− 1

k

]=k

nh1(Xj) +

n− kn

ϑ.√

b) Since EUn = EUn we may assume w.l.o.g. ϑ = 0. Then

E(Un − Un)2 = V(Un) + V(Un)− 2E(UnUn)

=1(nk

)k∑

c=1

(k

c

)(n− kk − c

)σ2c +

k2

n2nσ2

1

−2 kn

1(nk

)∑

A:|A|=k

n∑

j=1

E [h(XA) · h1(Xj)] .︸ ︷︷ ︸

=?

Norbert Henze, KIT 8.7

Page 97: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

We have

E [h(XA) · h1(Xj)] =

0, if j /∈ A (since ϑ = 0),

σ21 , if j ∈ A. (*)

Proof of (*): By symmetry we have

E [h(XA)h1(Xj)] = E [h(X1, . . . , Xk)h1(X1)]

= E

[E [h(X1, . . . , Xk)h1(X1)|X1]

]

= E[h1(X1)E [h(X1, . . . , Xk)|X1]

]

= E [h1(X1) · h1(X1)]

= V(h1(X1)) = σ21 .√

Thus,

−2 kn

1(nk

)∑

A:|A|=k

n∑

j=1

E [h(XA) · h1(Xj)] = − 2k2

nσ21

︸ ︷︷ ︸= kσ2

1

and

E(Un − Un)2 =

1(nk

)k∑

c=1

(k

c

)(n− kk − c

)σ2c +

k2

n2n σ2

1 − 2k2

nσ21 .√

Norbert Henze, KIT 8.8

Page 98: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Memo: E(Un − Un)2 = σ2

1

k

(n−kk−1

)(nk

) − k2

n

+

1(nk

)k∑

c=2

(k

c

)(n− kk − c

)σ2c

c) The 2nd summand is of order O(n−2) since c ≥ 2.

The first summand equals

σ21k2

n

(n−kk−1

)(n−1k−1

) − 1

.

Check that the curly bracket is of order O(n−1) as n→∞, q.e.d.

Norbert Henze, KIT 8.9

Page 99: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Memo: Un − ϑ =k

n

n∑

j=1

(h1(Xj)− ϑ)

8.9 Theorem (CLT for nondegenerate U-statistics)

Let Un be a U -statistic. If σ21 > 0, Un is said to nondegenerate. We then have

√n(Un − ϑ) D−→ N(0, k2σ2

1).

Proof: We have

√n (Un − ϑ) =

√n(Un − ϑ

)+√n(Un − Un).︸ ︷︷ ︸=: Rn

Lemma 8.8 c) =⇒ E(R2n)→ 0 and thus Rn

P−→ 0.

Put Yj := k(h1(Xj)− ϑ). Notice that EYj = 0, V(Yj) = k2σ21 .

The CLT of Lindeberg–Levy gives

√n(Un − ϑ

)=

1√n

n∑

j=1

YjD−→ N(0, k2σ2

1), q.e.d.

Norbert Henze, KIT 8.10

Page 100: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

8.10 Example (Continuation of Example 8.3 c))

Leth(x1, x2) = 1x1 + x2 > 0.

We have

h1(x1) = E [1x1 +X2 > 0] = P(X2 > −x1) = 1− F (−x1)

and thereforeσ21 = V (1− F (−X1)) = V(F (−X1)).

If F is continuous and the distribution of X1 is symmetric around 0, i.e., if

X1D= −X1, then

F (−X1)D= F (X1)

D= U(0, 1)

and

σ21 = V(U(0, 1)) =

1

12.

The CLT now gives

√n

1(

n2

)∑

1≤i<j≤n

1Xi +Xj > 0 − 1

2

D−→ N

(0,

1

3

).

Norbert Henze, KIT 8.11

Page 101: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

8.11 Definition (Two-sample U-statistic)

Let X1, X2, . . . ;Y1, Y2, . . . be independent random variables, where X1, X2, . . .are identically distributed with distribution function F , and Y1, Y2, . . . are iden-tically distributed with distribution function G.

Furthermore, let h : Rk × Rℓ → R be a measurable function such thath(x1, . . . , xk, y1, . . . , yℓ) is symmetric in x1, . . . , xk and symmetric in y1, . . . , yℓ.Then

Um,n :=1(

mk

)(nℓ

)∑

1≤i1<...<ik≤m

1≤j1<...<jℓ≤n

h(Xi1 , . . . , Xik , Yj1 , . . . , Yjℓ)

is called a two-sample U -statistic of order (k, ℓ) with kernel h.

In a statistical context, F and G will be unknown.

We have

EF,G(Um,n) = EF,Gh(X1, . . . , Xk, Y1, . . . , Yℓ) =: ϑ(F,G) =: ϑ.

In what follows, we assume EF,Gh2 <∞.

Norbert Henze, KIT 8.12

Page 102: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

8.12 Example (Mann–Whitney-U-statistic)

Let k = ℓ = 1, h(x, y) = 1x ≤ y,

Um,n =1

mn

m∑

i=1

n∑

j=1

1Xi ≤ Yj,

ϑ(F,G) = EF,G[1X1 ≤ Y1] = PF,G(X1 ≤ Y1).

Notice that ϑ(F, F ) = 1/2 if F is continuous. (why?)

8.13 Theorem Let σ00 := 0 and, for c+ d ≥ 1,

σ2c,d := Cov

(h(XA1 , YB1), h(XA2 , YB2

)),

where A1, A2 ⊂ 1, . . . ,m, |A1∩A2| = c, B1, B2 ⊂ 1, . . . ,m, |B1∩B2| = d.Then

V(Um,n) =1(

mk

)(nℓ

)k∑

c=0

ℓ∑

d=0

(k

c

)(m− kk − c

)(ℓ

d

)(n− ℓℓ− d

)σ2c,d.

Proof: Exercise! (Use the bilinearity of Cov(·, ·) and symmetry).

Norbert Henze, KIT 8.13

Page 103: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

8.14 Remark For c ≤ k, d ≤ ℓ, let

hc,d(x1, . . . , xc, y1, . . . , yd)

:= E [h(x1, . . . , xc, Xc+1, . . . , Xk, y1, . . . , yd, Yd+1, . . . , Yℓ)]

Thenσ2c,d = V (hc,d(X1, . . . , Xc, Y1, . . . , Yd)) .

Proof: Exercise in conditional expectations, cf. 8.5 .

8.15 Definition (Hajek Projection)

Um,n :=

m∑

i=1

E[Um,n|Xi] +

n∑

j=1

E[Um,n|Yj ]− (m+ n− 1)ϑ

is called the Hajek projection of Um,n.

Notice that Um,n is a sum of independent random variables.

Norbert Henze, KIT 8.14

Page 104: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

8.16 Theorem We have:

a) Um,n =k

m

m∑

i=1

(h1,0(Xi)− ϑ) +ℓ

n

n∑

j=1

(h0,1(Yj)− ϑ) + ϑ,

b) Putting (a)j := a(a− 1) · . . . · (a− j + 1), we have

E(Um,n − Um,n)2 =

k2

m

(m− k)k−1

(m− 1)k−1

(n− ℓ)ℓ(n)ℓ

− 1

σ21,0

+ℓ2

n

(m− k)k(m)k

(n− ℓ)ℓ−1

(n)ℓ−1− 1

σ20,1

+1(

mk

)(nℓ

)k∑

c=0

ℓ∑

d=0

(k

c

)(m− kk − c

)(ℓ

d

)(n− ℓℓ− d

)σ2c,d

c+d≥2

Proof: Analogously to 8.8 a), b).

Norbert Henze, KIT 8.15

Page 105: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

8.17 Theorem (CLT for nondegenerate two-sample U-statistics)

If σ21,0 > 0, σ2

0,1 > 0 and m,n→∞ under the condition that

m

m+ n→ τ for some τ ∈ (0, 1) (8.1)

(so-called usual limiting regime in the two-sample case), then

√m+ n (Um,n − ϑ) D−→ N

(0,k2σ2

1,0

τ+ℓ2σ2

0,1

1− τ

).

Proof: We have

√m+ n (Um,n − ϑ) =

√m+ n

(Um,n − ϑ

)+Rm,n

where Rm,n =√m+ n(Um,n − Um,n). Notice that 8.16 b) implies

ER2m,n → 0 and thus Rm,n

P−→ 0.

(8.1) means m = ms, s ≥ 1, n = ns, s ≥ 1 and

lims→∞

ms

ms + ns= τ.

Norbert Henze, KIT 8.16

Page 106: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

From 8.16 a),

√ms + ns

(Um,n − ϑ

)=

ms+ns∑

i=1

Zs,i,

where

Zs,i =

√ms + ns

k

ms(h1,0(Xi)− ϑ) , if i ∈ 1, . . . ,ms,

√ms + ns

ns(h0,1(Yms−i)− ϑ) , if i ∈ ms + 1, . . . ,ms + ns.

(Zs,1, . . . , Zs,ms+ns)s≥1 is a triangular array of rowwise independent randomvariables. We have E(Zs,i) = 0 for each i and

V(Zs,i) =

k2ms + ns

m2s

σ21,0, if i ≤ ms,

ℓ2ms + ns

n2s

σ20,1, if i > ms.

Notice that, as s→∞,

σ2s :=

ms+ns∑

i=1

V(Zs,i) → σ2 :=k2σ2

1,0

τ+ℓ2σ2

0,1

1− τ .

Norbert Henze, KIT 8.17

Page 107: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

We now check the Lindeberg condition. We have

Ls(ε) :=1

σ2s

ms+ns∑

i=1

E[Z2

s,i1|Zs,i| > εσs]

=ms

σ2s

E[Z2

s,11|Zs,1| > εσs]+ns

σ2s

E[Z2

s,ms+11|Zs,ms+1| > εσs]

By definition of Zs,1 (= k(√ms + ns/ms)(h1,0(X1)− ϑ)) , we have

msE[Z2

s,11|Zs,1| > εσs]

= k2ms + ns

msE

[(h1,0(X1)− ϑ)21

|h1,0(X1)− ϑ| > εσsms

k√ms + ns

]

︸ ︷︷ ︸ ︸ ︷︷ ︸→ 1/τ →∞ as s→∞

Hence,lims→∞

ms

σ2s

E[Z2

s,11|Zs,1| > εσs]= 0. (why?)

In the same way,

lims→∞

ns

σ2s

E[Z2

s,ms+11|Zs,ms+1| > εσs]= 0.

This shows lims→∞ Ls(ε) = 0 ∀ε > 0.

Norbert Henze, KIT 8.18

Page 108: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

The CLT of Lindeberg–Feller yields

1

σs

ms+ns∑

i=1

Zs,iD−→ N(0, 1).

Since σ2s → σ2, it follows that

√ms + ns

(Um,n − ϑ

)=

ms+ns∑

i=1

Zs,iD−→ N(0, σ2),

where

σ2 =k2σ2

1,0

τ+ℓ2σ2

0,1

1− τ , q.e.d.

Norbert Henze, KIT 8.19

Page 109: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

8.18 Example (Mann–Whitney-U-statistic, cf. 8.12)

Let h(x, y) = 1x ≤ y, ϑ = P(X1 ≤ Y1). We have

σ21,0 = Cov(1X1 ≤ Y1, 1X1 ≤ Y2)

= P(X1 ≤ Y1, X1 ≤ Y2)− ϑ2,

σ20,1 = Cov(1X1 ≤ Y1, 1X2 ≤ Y1)

= P(X1 ≤ Y1, X2 ≤ Y1)− ϑ2.

If σ21,0 > 0 and σ2

0,1 > 0, then

√m+ n

(1

mn

m∑

i=1

n∑

j=1

1Xi ≤ Yj − ϑ)

D−→ N

(0,σ21,0

τ+

σ20,1

1− τ

).

Um,n is a widely used statistic for the testing problem H0 : F = G, where Fand G are assumed to be continuous.

If H0 holds, then ϑ = 1/2, σ21,0 = σ2

0,1 = 1/3− 1/4 (why?) = 1/12, and

√m+ n

(Um,n − 1

2

)D−→ N

(0,

1

12

(1

τ+

1

1− τ

)).

︸ ︷︷ ︸=

1

12τ (1− τ )

Norbert Henze, KIT 8.20

Page 110: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

In what follows, let

Un =1(nk

)∑

1≤i1<...<ik≤n

h(Xi1 , . . . , Xik )

as in 8.1. For σ2c (cf. 8.4, 8.5) we assume

0 = σ21 < σ2

2 (so-called first order degeneracy).

From 8.4 we have

V(Un) =1(nk

)(k

2

)(n− kk − 2

)σ22 +O

(1

n3

)

=2(k2

)2

n2σ22 +O

(1

n3

). (!)

Thus,

V (n(Un − ϑ)) → 2

(k

2

)2

σ22

and n(Un − ϑ) = OP(1). (why?)

Conjecture: n(Un − ϑ) has a non-degenerate limit distribution as n→∞.

Norbert Henze, KIT 8.21

Page 111: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

8.19 Example (a “warming up“)

Let

h(x1, x2) :=s∑

ν=1

λνϕν(x1)ϕν(x2),

where λ1, . . . , λs ∈ R \ 0, ϕ1, . . . , ϕs : R→ R measurable. Furthermore,

E[ϕν(X1)] = 0, E[ϕ2ν(X1)] = 1,

E[ϕµ(X1)ϕν(X1)] = δµ,ν (µ, ν ∈ 1, . . . , s).

Hence,

Un =1(n2

)∑

i<j

s∑

ν=1

λνϕν(Xi)ϕν(Xj) (=⇒ EUn = 0)

=

s∑

ν=1

λν1(n2

) 12

i6=j

ϕν(Xi)ϕν(Xj)

=

s∑

ν=1

λν1(n2

) 12

(n∑

j=1

ϕν(Xj)

)2 −

n∑

j=1

ϕ2ν(Xj)

Norbert Henze, KIT 8.22

Page 112: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Memo: Un =s∑

ν=1

λν1(n2

) 12

(n∑

j=1

ϕν(Xj)

)2

−n∑

j=1

ϕ2ν(Xj)

=⇒ nUn =s∑

ν=1

λνn

n− 1

(1√n

n∑

j=1

ϕν(Xj)

)2

− 1

n

n∑

j=1

ϕ2ν(Xj)

.

Notice that, by the SLLN,

1

n

n∑

j=1

ϕ2ν(Xj)

a.s.−→ E[ϕ2

ν(X1)]= 1.

Moreover, by the multivariate CLT

1√n

n∑

j=1

ϕ1(Xj)

...ϕs(Xj)

D−→ Ns

0...0

, Is

N1

...Ns

The continuous mapping theorem and Slutsky’s lemma now give

nUnD−→

s∑

ν=1

λν

(N2

ν − 1).

Norbert Henze, KIT 8.23

Page 113: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

General idea: Approximate kernel h by kernel of order 2 and the latter by akernel as in Example 8.19.

To be precise, put

Un :=∑

1≤i<j≤n

E [Un|Xi, Xj ] −(n

2

)ϑ+ ϑ

(cf. Hajek projection).

8.20 Lemma We have

a) Un − ϑ =

(k2

)(n2

)∑

1≤j<ℓ≤n

(h2(Xj , Xℓ)− ϑ),

b) E(Un − Un)2 = O

(1

n3

)as n→∞.

Proof: Recall h(XA) := h(Xi1 , . . . , Xik ), A = i1, . . . , ik

=⇒ E [Un|Xj , Xℓ] =1(nk

)∑

A:|A|=k

E [h(XA)|Xj , Xℓ] .

Norbert Henze, KIT 8.24

Page 114: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Now,

E [h(XA)|Xj , Xℓ] =

ϑ , if j, ℓ ∩A = ∅,h1(Xℓ) , if j /∈ A, ℓ ∈ A,h1(Xj) , if j ∈ A, ℓ /∈ A,

h2(Xj , Xℓ), if j, ℓ ⊂ A.

Since 0 = σ21 = V(h1(X1)) and ϑ = Eh1(X1) we have

E [h(XA)|Xj , Xℓ] =

ϑ , if |j, ℓ ∩ A| ≤ 1,

h2(Xj , Xℓ), otherwise,=⇒ a). (!)

b): W.l.o.g. let ϑ = 0 =⇒ E(Un − Un)2 = V(Un) + V(Un)− 2E

(UnUn

).

Memo: Un =1(nk

)∑

A:|A|=k

h(XA), Un =

(k2

)(n2

)∑

1≤j<ℓ≤n

h2(Xj , Xℓ)

Notice that

E [h2(X1, X2)h2(X1, X3)] = E[E [h2(X1, X2)h2(X1, X3)|X1]

]

= E[(E [h2(X1, X2)|X1])

2]

= E[(h1(X1))

2]

= V(h1(X1)) = σ21 = 0.

Proceed by analogy with Lemma 8.8 c), q.e.d.

Norbert Henze, KIT 8.25

Page 115: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Consequence:n(Un − ϑ) = n(Un − ϑ) + n(Un − Un). (8.2)

By Lemma 8.20 b),

E[(n(Un − Un)

)2 ]→ 0 =⇒ n(Un − Un)

P−→ 0.

Hence, n(Un − ϑ) and n(Un − ϑ) have the same limit distribution(if there is such a limit distribution).

Notice that

n(Un − ϑ

)=

(k

2

)2

n− 1

1≤j<ℓ≤n

h2(Xj , Xℓ), (8.3)

where h2(x, y) := h2(x, y)− ϑ and E(h2) = Eh2(X1, X2) = 0.

Let L2 := L2(R,B, dF ) be the separable Hilbert space of (equivalence classesof) square integrable functions with respect to dF (:= PX1).

〈f, g〉 :=

∫f(x)g(x) dF (x) =

∫fg dF,

‖g‖2 := 〈g, g〉 =

∫g2 dF.

Norbert Henze, KIT 8.26

Page 116: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

In what follows, each unspecified integral is over R.

8.21 Lemma For g ∈ L2, let

(Ag)(x) :=

∫h2(x, y)g(y)dF (y), x ∈ R.

We then have:

a) Ag ∈ L2,

b) A : L2 → L2 is a linear operator,

c) ‖Ag‖ ≤√

Eh22 ‖g‖ (=⇒ A is continuous) ,

d) 〈Af, g〉 = 〈f,Ag〉 (i.e., A is symmetric (self-adjoint)),

e) If ϕ1, ϕ2, . . . is a complete orthonormal set in L2, then

‖A‖2HS :=

∞∑

j=1

‖Aϕj‖2 =

∫ ∫h22(x, y) dF (x)dF (y) = Eh2

2 <∞

(i.e., A is a Hilbert–Schmidt operator and therefore compact)

(see, e.g. J.Weidmann: Linear operators in Hilbert spaces, Thm. 6.10)

Norbert Henze, KIT 8.27

Page 117: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Memo: (Ag)(x) :=∫h2(x, y)g(y) dF (y), x ∈ R.

Memo: a) Ag ∈ L2, b) A is a linear operator, c) ‖Ag‖ ≤√

Eh22 ‖g‖

Proof: a) By the Cauchy–Schwarz inequality, we have∫

(Ag)2 dF =

∫(Ag(x))2 dF (x)

=

∫ (∫h2(x, y)g(y)dF (y)

)2

dF (x)

≤∫ (∫

h22(x, y)dF (y)

∫g2(y)dF (y)

)dF (x)

︸ ︷︷ ︸= ‖g‖2

= ‖g‖2∫∫

h22(x, y)dF (x)dF (y)

= ‖g‖2 Eh22 <∞. √

b), c) follow from a).

Norbert Henze, KIT 8.28

Page 118: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Memo: (Ag)(x) :=∫h2(x, y)g(y) dF (y), x ∈ R.

Memo: d) 〈Af, g〉 = 〈f,Ag〉

Proof:

〈Af, g〉 =

∫(Af)(x)g(x)dF (x)

=

∫ (∫h2(x, y)f(y)dF (y)

)g(x) dF (x)

︸ ︷︷ ︸= h2(y, x)

=

∫f(y)

(∫h2(y, x)g(x)dF (x)

)dF (y) (Fubini)

︸ ︷︷ ︸= (Ag)(y)

= 〈f,Ag〉. √

Norbert Henze, KIT 8.29

Page 119: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Memo: (Ag)(x) :=∫h2(x, y)g(y) dF (y), x ∈ R.

Memo: e) ‖A‖2HS :=

∞∑

j=1

‖Aϕj‖2 =

∫∫h22(x, y) dF (x)dF (y) = Eh2

2

Proof: Put h2,x(y) := h2(x, y). Then

∞∑

j=1

‖Aϕj‖2 =∞∑

j=1

∫(Aϕj(x))

2 dF (x)

=

∞∑

j=1

∫ (∫h2(x, y)ϕj(y)dF (y)

)2

dF (x)

︸ ︷︷ ︸= 〈h2,x, ϕj〉2

=

∫ ∞∑

j=1

〈h2,x, ϕj〉2dF (x) (why?)

=

∫‖h2,x‖2 dF (x) (Parseval’s identity)

=

∫ ∫h22(x, y)dF (y) dF (x)

Norbert Henze, KIT 8.30

Page 120: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

8.22 Theorem (Expansion Thm. for compact self-adjoint linear operators)

There are λ1, λ2, . . . ∈ R with |λ1| ≥ |λ2| ≥ . . . > 0 and limn→∞ λn = 0 andϕ1, ϕ2, . . . ∈ L2 with 〈ϕi, ϕj〉 = δi,j ∀i, j, such that

Ag =∑

n≥1

λn〈g,ϕn〉ϕn, g ∈ L2.

If ψ1, ψ2, . . . is an orthonormal basis of g : Ag = 0, then ψ1, ψ2, . . . ∪ϕ1, ϕ2, . . . is an orthonormal basis of L2.

Proof: See, e.g. J.Weidmann: Linear operators in Hilbert spaces, Thm. 7.2.

Notice that Aϕk = λkϕk, k ≥ 1, i.e., λk is an eigenvalue of A associated withthe normalized eigenfunction ϕk.

From Thm. 8.21 e), we have

k≥1

λ2k =

k≥1

‖Aϕk‖2 < ∞. (8.4)

Norbert Henze, KIT 8.31

Page 121: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

In what follows, we put K(x, y) := h2(x, y).

For s ≥ 1, let Ks(x, y) :=

s∑

j=1

λjϕj(x)ϕj(y), (cf. Example 8.19).

8.23 Lemma We have

lims→∞

∫∫(K(x, y)−Ks(x, y))

2 dF (x)dF (y) = 0.

Proof: Let Kx(y) := K(x, y). Recall 〈Kx, g〉 = (Ag)(x).

Since∫∫

K2(x, y)dF (x)dF (y) <∞ we have∫K2(x, y)dF (y) <∞ for dF -almost all x

=⇒ Kx ∈ L2 for dF -almost all x.

Let ϕj , ψj as in Thm. 8.22. Then (Fourier expansion of Kx!)∫ (Kx(y)−

s∑

j=1

〈Kx, ψj〉ψj(y)−s∑

j=1

〈Kx, ϕj〉ϕj(y)

)2dF (y)→ 0

for dF -almost all x.

Norbert Henze, KIT 8.32

Page 122: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Memo:

∫ (Kx(y)−

s∑

j=1

〈Kx, ψj〉ψj(y)−s∑

j=1

〈Kx, ϕj〉ϕj(y)

)2

dF (y)→ 0 dF -a.s.

Notice that 〈Kx, ψj〉 = Aψj(x) = 0 (ψj : j ≥ 1 is ONB of g : Ag = 0).Since 〈Kx, ϕj〉 = λjϕj(x), this means

ρs(x) :=

∫(K(x, y)−Ks(x, y))

2 dF (y) → 0 for dF -almost all x.

We have |ρs(x)| ≤ 2

∫K2(x, y)dF (y) + 2

∫K2

s (x, y)dF (y). (why?) Since

∫K2

s (x, y)dF (y) =s∑

j,ℓ=1

λjλℓϕj(x)ϕℓ(x)

∫ϕj(y)ϕℓ(y)dF (y) ≤

∞∑

j=1

λ2jϕ

2j (x)

︸ ︷︷ ︸= δj,ℓ

we have |ρs(x)| ≤ ρ(x) := 2∫K2(x, y)dF (y) + 2

∑∞j=1 λ

2jϕ

2j (x). Since

∫ρ(x)dF (x) = 2

∫∫K2dF⊗dF+2

∞∑

j=1

λ2j <∞, DOM =⇒

∫ρs(x)dF (x)→ 0.

Norbert Henze, KIT 8.33

Page 123: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

8.24 Lemma We have Eϕj(X1) = 0, j ≥ 1.

Proof: Let h1 := h1 − ϑ. (Recall h1(x1) = Eh(x1, X2, . . . , Xk))

We have (!) h1(x) =∫h2(x, y) dF (y) =

∫K(x, y) dF (y) =⇒

∫ (h1(x)−

s∑

j=1

λjϕj(x)

∫ϕjdF

)2

dF (x)

︸ ︷︷ ︸= Eϕj(X1)

=

∫ (∫ [K(x, y)−

s∑

j=1

λjϕj(x)ϕj(y)]· 1 dF (y)

)2dF (x)

︸ ︷︷ ︸= Ks(x, y)

≤∫∫

(K −Ks)2dF ⊗ dF. (Cauchy–Schwarz inequality)

︸ ︷︷ ︸→ 0 as s→∞ (cf. Lemma 8.23)

Norbert Henze, KIT 8.34

Page 124: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Memo:

∫ (h1(x)−

s∑

j=1

λjϕj(x)

∫ϕjdF

)2

dF (x) → 0 as s→∞

Notice that

0 = σ21 = V(h1) = Eh2

1 =

∫h21(x)dF (x) =⇒ h1 = 0 dF -almost surely.

Memo =⇒

∆s :=

∫ ( s∑

j=1

λjϕj(x)

∫ϕjdF

)2 dF (x) → 0.

Now,

∆s =s∑

i,j=1

λiλj

∫ϕidF

∫ϕjdF

∫ϕi(x)ϕj(x)dF (x)

︸ ︷︷ ︸= δi,j

=s∑

j=1

λ2j

(∫ϕjdF

)2

.

It follows that∫ϕjdF = 0 = Eϕj(X1), j ≥ 1.

Norbert Henze, KIT 8.35

Page 125: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Memo: Ks(x, y) :=s∑

j=1

λjϕj(x)ϕj(y)

8.25 Lemma We have

∫∫(K −Ks)

2 dF ⊗ dF =

∫∫K2dF ⊗ dF −

s∑

j=1

λ2j =

∞∑

j=s+1

λ2j .

Proof: The last equality follows from Thm. 8.21 e). We have∫∫

(K−Ks)2dF⊗dF =

∫∫K2dF⊗dF

−2s∑

j=1

λj

∫ [∫K(x, y)ϕj(y)dF (y)

]ϕj(x)dF (x)

︸ ︷︷ ︸= λjϕj(x)

+s∑

j,ℓ=1

λjλℓ

∫ϕj(x)ϕℓ(x)dF (x)

∫ϕj(y)ϕℓ(y)dF (y).

︸ ︷︷ ︸ ︸ ︷︷ ︸= δj,ℓ = δj,ℓ

Norbert Henze, KIT 8.36

Page 126: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Put

Tn :=1

n

j 6=ℓ

K(Xj , Xℓ), (8.5)

Tn,s :=1

n

j 6=ℓ

Ks(Xj , Xℓ). (8.6)

8.26 Lemma We have E (Tn − Tn,s)2 ≤ 2

∞∑

j=s+1

λ2j .

Proof: We have

Tn − Tn,s = (n− 1)1(n2

)∑

j<ℓ

K(Xj , Xℓ)−Ks(Xj , Xℓ)

︸ ︷︷ ︸

=: Gs(Xj , Xℓ)︸ ︷︷ ︸=: ∆n (U -statistic !)

EGs(X1, X2) = Eh2(X1, X2)−s∑

j=1

λjEϕj(X1)Eϕj(X2) = 0.

By Lemma 8.25, EG2s(X1, X2) =

∑∞j=s+1 λ

2j .

Norbert Henze, KIT 8.37

Page 127: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Memo: Gs(Xj , Xℓ) = K(Xj , Xℓ)−Ks(Xj , Xℓ)

Memo: EGs(X1, X2) = 0, EG2s(X1, X2) =

∞∑

j=s+1

λ2j

Memo: Ks(Xj , Xℓ) =s∑

ν=1

λνϕν(Xj)ϕν(Xℓ)

Check that Lemma 8.24 implies

E [Gs(X1, X2)Gs(X1, X3)] = E

[h2(X1, X2)h2(X1, X3)

]

(recall K = h2). Now,

E

[h2(X1, X2)h2(X1, X3)

]=

∫Eh2(x,X2)Eh2(x,X3) dF (x)

=

∫h1(x)

2dF (x) = V(h1(X1)) = σ21 = 0.

We thus have

E(Tn − Tn,s)2 = V(Tn − Tn,s) = (n− 1)2V(∆n)

= (n− 1)21(n2

)EG2s(X1, X2) ≤ 2

∞∑

j=s+1

λ2j , q.e.d.

Norbert Henze, KIT 8.38

Page 128: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

8.27 Theorem (Limit distribution of singly-degenerate U-statistics)

Let Un be a U -statistic satisfying 0 = σ21 < σ2

2 , and let h2 := h2 − ϑ. Letλ1, λ2, . . . be the nonzero eigenvalues of the integral operator on L2(R,B,dF )

associated with h2, cf. 8.21. We then have

n(Un − ϑ) D−→(k

2

) ∞∑

j=1

λj

(N2

j − 1),

where N1, N2, . . . are i.i.d. standard normal random variables.

Proof: By (8.2), (8.3),

n(Un − ϑ) =

(k

2

)n

n− 1

1

n

j 6=ℓ

h2(Xj , Xℓ) + oP(1).

︸ ︷︷ ︸= Tn, cf. (8.5)

Let

Ys :=s∑

j=1

λj

(N2

j − 1).

Check that (Ys) is a Cauchy sequence in L2. Since L2 is complete, there is a

Y ∈ L2 such that YsL2

−→ Y . If Y =:∑∞

j=1 λj(N2j − 1), then Ys

D−→ Y .

Norbert Henze, KIT 8.39

Page 129: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

We provelim

n→∞EeitTn = EeitY , t ∈ R.

The continuity theorem of Levy–Cramer implies TnD−→ Y , q.e.d.

Let t ∈ R, t 6= 0, Tn,s as in (8.6). For fixed s ∈ N, we have∣∣EeitTn − EeitY

∣∣ ≤∣∣EeitTn − EeitTn,s

∣∣+∣∣EeitTn,s − EeitYs

∣∣

+∣∣EeitYs − EeitY

∣∣=: an,s + bn,s + cs.

Fix ε > 0. We have

an,s ≤ E

∣∣∣eitTn − eitTn,s

∣∣∣ = E

∣∣∣(eit(Tn−Tn,s) − 1

)eitTn,s

∣∣∣

= E

∣∣∣(eit(Tn−Tn,s) − 1

) ∣∣∣ ≤ |t| · E∣∣Tn − Tn,s

∣∣ ( |eitx − 1| ≤ |tx|)

≤ |t| ·(E(Tn − Tn,s)

2)1/2 (Cauchy–Schwarz inequality)

≤ |t| ·(2

∞∑

j=s+1

λ2j

)1/2

(by Lemma 8.26)

≤ ε, if s ≥ s1(ε, t), since∞∑

j=s+1

λ2j → 0 as s→∞.

Norbert Henze, KIT 8.40

Page 130: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Memo:∣∣EeitTn − EeitY

∣∣ ≤ an,s + bn,s + cs

Memo: an,s ≤ ε, if s ≥ s1(ε, t) Memo: cs =∣∣EeitYs − EeitY

∣∣

Memo: bn,s =∣∣EeitTn,s − EeitYs

∣∣

Since YsD−→ Y we have cs ≤ ε, if s ≥ s2 = s2(ε, t).

Put s0 := max(s1, s2). It follows that

lim supn→∞

∣∣∣EeiTn − EeiY∣∣∣ ≤ 2ε + lim sup

n→∞

∣∣∣EeiTn,s0 − EeiYs0

∣∣∣︸ ︷︷ ︸= 0, since Tn,so

D−→ Ys0 , cf. 8.19

q.e.d.

Norbert Henze, KIT 8.41

Page 131: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

8.28 Example (Cramer–von Mises statistic)

Let X1, X2, . . . be i.i.d. ∼ U(0, 1),

Fn(t) :=1

n

n∑

j=1

1Xj ≤ t, 0 ≤ t ≤ 1.

Let

ω2n :=

∫ 1

0

(√n(Fn(t)− t)

)2dt.

We have (Exercise!)

ω2n = (n− 1)

1(n2

)∑

1≤i<j≤n

h(Xi, Xj) +1

6+ oP(1),

where

h(x, y) =x2

2+y2

2−max(x, y) +

1

3.

We have Eh(X1, X2) = 0 = ϑ, h1(x) = Eh(x,X2) = 0 (!)=⇒ σ2

1 = V(h1(X1)) = 0.

Eh2(X1, X2) = V(h(X1, X2)) = σ22 = 1

90(!) > 0.

Norbert Henze, KIT 8.42

Page 132: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Theorem 8.27 =⇒ look for nonzero eigenvalues of the integral operator

Ag(x) =

∫ 1

0

h(x, y)g(y)dy (8.7)

on L2 = L2([0, 1],B ∩ [0, 1],U(0, 1)).

Notice: g := g0 ≡ 1 =⇒ Ag0(x) =∫ 1

0h(x, y)dy = h1(x) = 0 =⇒ g ≡ const

has eigenvalue 0. Suppose Ag = λg, λ 6= 0 =⇒∫ 1

0

g(x)dx = 〈g, 1〉 = 1

λ〈λg, 1〉 = 1

λ〈Ag,1〉 = 1

λ〈g,A1〉

=1

λ〈g, 0〉

= 0.

In our case, the integral equation (8.7), putting Ag = λg, takes the form

λg(x) =x2

2

∫ 1

0

g(y)dy +1

2

∫ 1

0

y2g(y)dy − x∫ x

0

g(y)dy −∫ 1

x

yg(y)dy +1

3

∫ 1

0

g(y)dy

︸ ︷︷ ︸ ︸ ︷︷ ︸= 0 = 0

Norbert Henze, KIT 8.43

Page 133: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Memo: λg(x) =1

2

∫ 1

0

y2g(y)dy − x∫ x

0

g(y)dy −∫ 1

x

yg(y)dy Approach:

Differentiate this equation twice =⇒

λg′(x) = −∫ x

0

g(y)dy − xg(x) + xg(x),

λg′′(x) = −g(x).Try g(x) = cos(ax) =⇒ g′′(x) = −a2g(x)

=⇒ − g(x) = 1

a2g′′(x) =⇒ λ =

1

a2.

Since

0 =

∫ 1

0

g(x)dx =1

asin(ax)

∣∣∣1

0=

1

asin a,

we have sin a = 0 and thus a ∈ kπ : k ∈ Z \ 0. Hence,

λk :=1

k2π2, k ≥ 1,

is an eigenvalue corresponding to the normalized eigenfunction

gk(x) =1√2cos(kπx), 0 ≤ x ≤ 1.

Norbert Henze, KIT 8.44

Page 134: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Limit theorems for U -statistics

Do we have obtained all solutions of the integral equation (8.7)?

8.21 e) =⇒∞∑

k=1

‖Aϕk‖2 = Eh2(X1, X2) =1

90

for any complete orthonormal system ϕ1, ϕ2, . . .. We have

∞∑

k=1

‖Agk‖2 =∞∑

k=1

λ2k =

1

π4

∞∑

k=1

1

k4=

1

π4

π4

90=

1

90.

From Thm. 8.27 we thus obtain

ω2n

D−→∞∑

k=1

1

π2k2(N2

k − 1)+

1

6∼ ω2,

where N1, N2, . . . are i.i.d. ∼ N(0, 1).

The distribution of ω2 is called Cramer–von Mises distribution.

ω2n is a suitable statistic for testing the hypothesis of a uniform distribution on

the unit interval.

Norbert Henze, KIT 8.45

Page 135: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Basic concepts of asymptotic estimation theory

9 Basic concepts of asymptotic estimation theory

9.1 Setting

Let X1, X2, . . . be i.i.d. random variables on some probability space (Ω,A, P)taking values in a measurable space (X0,B0). X0 is called the sample space.

Mostly, (X0,B0) = (Rd,Bd). Often, we will have d = 1.

LetM1 := Q : Q probability measure on B0.

Assumption: PX1 ∈ M1 is not completely known.

9.2 Definition (Parametric model)

A parametric model for PX1 is a subset P ⊂ M1 with the following property:There are an integer k, a set Θ ⊂ Rk, Θ 6= ∅, and a bijective mapping Θ ∋ ϑ 7→Qϑ from Θ onto P . We write

P = Qϑ : ϑ ∈ Θ.

We assume PX1 ∈ Qϑ : ϑ ∈ Θ and say ϑ is the true parameter, if PX1 = Qϑ.

Norbert Henze, KIT 9.1

Page 136: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Basic concepts of asymptotic estimation theory

9.3 Examples

a) Qϑ = Exp(ϑ), Θ = (0,∞).

b) Qϑ = N(µ, σ2), ϑ = (µ, σ2), Θ = R× R>0.

c) Qϑ = Bin(n, p), ϑ = p, Θ = [0, 1].

9.4 Canonical model

If not stated otherwise, we will adopt the so-called canonical model

Ω := X N0 , A := BN

0 , Pϑ = QNϑ,

i.e., the infinite product (Ω,A,Pϑ) := ⊗∞j=1(X0,B0, Qϑ). Moreover, given

ω = (xj)j≥1 ∈ Ω, we put Xj(ω) := xj . In other words, Xj is the jth coordinateprojection. Then X1, X2, . . . are i.i.d. random variables with distribution Qϑ.

Pϑ is the distribution of X := (Xj)j≥1.

(X ,B) :=(X N

0 ,BN0

)is the sample space of X.

(X ,B, Pϑ : ϑ ∈ Θ) is a suitable statistical space for asymptotic statistics.

If n ≥ 1, A ∈ B0 ⊗ . . .⊗B0 (n factors), then

(A××∞

j=n+1X0

)= Pϑ ((X1, . . . , Xn) ∈ A) .

Norbert Henze, KIT 9.2

Page 137: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Basic concepts of asymptotic estimation theory

In what follows, we stress the dependence of expecations, variances etc. on ϑby writing Eϑ, Vϑ etc.

9.5 Definition (Asymptotic properties of estimators)

A sequence (Sn) of estimators of ϑ is a sequence Sn : X → Rk (⊃ Θ) ofmeasurable mappings such that, for each n, Sn(x), x = (xj)j≥1, depends onlyon x1, . . . , xn.

Usual notation (canonical model): Sn = Sn(X1, . . . , Xn).

The sequence (Sn) is called

a) (asymptotically) unbiased (for ϑ) :⇔(lim

n→∞

)EϑSn = ϑ ∀ϑ ∈ Θ,

b) (weakly) consistent (for ϑ) :⇔ limn→∞

Pϑ(‖Sn−ϑ‖ > ε) = 0 ∀ε > 0∀ϑ ∈ Θ,

c) strongly consistent (for ϑ) :⇔ limn→∞

Sn = ϑ Pϑ-a.s. ∀ϑ ∈ Θ,

d)√n-consistent (for ϑ) :⇔ √n(Sn − ϑ) = OPϑ (1) ∀ϑ ∈ Θ.

Notice that a) requires Eϑ‖Sn‖ <∞ ∀ϑ ∈ Θ.

Notice that b) is equivalent to SnPϑ−→ ϑ for each ϑ ∈ Θ.

Norbert Henze, KIT 9.3

Page 138: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Basic concepts of asymptotic estimation theory

9.6 Remarks

a) Often there is interest only in γ(ϑ), where γ : Θ→ Rs, 1 ≤ s < k. Thenall definitions remain valid mutatis mutandis (Sn : X → Rs,limn→∞ EϑSn = γ(ϑ) ∀ϑ ∈ Θ etc.),

b) Let Sn =: (Sn1, . . . , Snk). If (Sn) is asymptotically unbiased andVϑ(Snj)→ 0 for each j ∈ 1, . . . , k, then (Sn) is consistent. (check!)

9.7 Example Let X1, X2, . . . be i.i.d. ∼ N(µ, σ2), ϑ := (µ, σ2), γ(ϑ) := σ2.Let

Sn :=1

n− 1

n∑

j=1

(Xj −Xn

)2, Xn :=

1

n

n∑

j=1

Xj .

We have EϑSn = σ2 = γ(ϑ) ∀ϑ ∈ Θ := R× R>0. Furthermore, by Ex. 8.6,

Vϑ(Sn) =1

n

(µ4 − n− 3

n− 1µ22

)→ 0.

Hence SnPϑ−→ γ(ϑ) ∀ϑ ∈ Θ. Since

√n(Sn− σ2)

Dϑ−→ N(0, 2σ4) (use 8.3b), 8.6,8.9), (Sn) is

√n-consistent. 8.3b), 8.6, 8.9 can be used to show asymptotic

normality even in greater generality: If X1, X2, . . . i.i.d. , E(X41 ) <∞ then√

n(Sn − σ2)D−→ N(0, µ4 − σ4), where µ4 = E(X1 − EX1)

4. (check!)

Norbert Henze, KIT 9.4

Page 139: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Basic concepts of asymptotic estimation theory

9.8 Definition (Asymptotic confidence region)

Let α ∈ (0, 1). An asymptotic confidence region for ϑ at level 1−α is a sequence(Cn), where Cn : X → P(Rk) and Cn(x), x = (xj)j≥1, is only dependent onx1, . . . , xn, such that

lim infn→∞

Pϑ (Cn(X1, . . . , Xn) ∋ ϑ) ≥ 1− α ∀ϑ ∈ Θ. (9.1)

9.9 Remarks

a) We must have x ∈ X : Cn(x1, . . . , xn) ∋ ϑ ∈ A ∀n ≥ 1, ∀ϑ ∈ Θ.

b) One often has more than (9.1), namely

limn→∞

Pϑ (Cn(X1, . . . , Xn) ∋ ϑ) = 1− α ∀ϑ ∈ Θ.

9.10 Example Let X1, X2, . . . i.i.d. ∼ Po(ϑ), ϑ ∈ Θ := (0,∞).

Sn := Xn → ϑ Pϑ-a.s. for each ϑ ∈ Θ. Since Vϑ(X1) = ϑ, the CLT gives√n(Xn − ϑ)√

ϑ

Dϑ−→ N(0, 1) ∀ϑ ∈ Θ.

Norbert Henze, KIT 9.5

Page 140: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Basic concepts of asymptotic estimation theory

Memo:

√n(Xn − ϑ)√

ϑ

Dϑ−→ N(0, 1) ∀ϑ ∈ Θ.

Slutsky’s lemma =⇒√n(Xn − ϑ)√

Xn

Dϑ−→ N(0, 1) ∀ϑ ∈ Θ

=⇒ limn→∞

(∣∣∣∣∣

√n(Xn − ϑ)√

Xn

∣∣∣∣∣ ≤ Φ−1(1− α

2

))= 1− α ∀ϑ ∈ Θ.

Thus,

Cn(X1, . . . , Xn) :=

[Xn −

Φ−1(1− α2)√

n

√Xn , Xn +

Φ−1(1− α2)√

n

√Xn

]

is an asymptotic confidence region for ϑ at level 1− α.

Norbert Henze, KIT 9.6

Page 141: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Asymptotic properties of maximum likelihood estimators

10 Asymptotic properties of maximum likelihood estimators

Memo: X1, X2, . . . i.i.d. on (Ω,A,P), Xj : Ω→ X0, (X0,B0) meas. space

Memo: Θ ⊂ Rk, PX1 ∈ Qϑ : ϑ ∈ Θ = P ⊂M1.

Suppose µ is a σ-finite measure on B0.If B0 = Bd, then either

µ = λd (Borel–Lebesgue measure)

or

µ is the counting measure on a countable subset D of Rd, i.e.,µ =

∑t∈D δt.

Suppose that in 9.2∀ϑ ∈ Θ : Qϑ = f(·, ϑ)µ,

i.e.,

Qϑ(B) =

B

f(x, ϑ)µ(dx), B ∈ B0.

In other words, Qϑ has a density f(·, ϑ) with respect to µ.

Norbert Henze, KIT 10.1

Page 142: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Asymptotic properties of maximum likelihood estimators

10.1 Definition (Likelihood function, maximum likelihood estimate)

a) For fixed x ∈ X0, the function Lx : Θ→ R, defined by

Lx(ϑ) = f(x, ϑ),

is called the likelihood function (pertaining to x).

b) Any ϑ(x) ∈ Θ satisfying

f(x, ϑ(x)

)= sup

ϑ∈Θf(x, ϑ) (10.1)

is called a maximum likelihood (ML) estimate of ϑ given x.

c) A measurable mapping ϑ : X0 → Θ satisfying (10.1) for each x ∈ X0 iscalled a maximum likelihood estimator (MLE) of ϑ.

10.2 Remark Suppose Θ ⊂ Rk and assume that ∂∂ϑj

f(x, ϑ) exists

(j = 1, . . . , k; ϑ = (ϑ1, . . . , ϑk)). Then one might try to find ϑ(x) in (10.1) bysolving the log-likelihood equations

∂ϑjlog f(x, ϑ) = 0, j = 1, . . . , k.

But: Be aware of relative maxima and (relative) maxima on the boundary of Θ.

Norbert Henze, KIT 10.2

Page 143: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Asymptotic properties of maximum likelihood estimators

10.3 Example (Binomial case)

Let X0 := 0, 1n, Θ := (0, 1), µ := counting measure on X0.

For x = (x1, . . . , xn) ∈ X0, let

f(x, ϑ) = ϑ∑

nj=1

xj (1− ϑ)n−∑

nj=1

xj (”binomial case“).

∂ϑlog f(x, ϑ) = 0 =⇒ ϑ(x) =

1

n

n∑

j=1

= xn.

Notice that ϑ(x) ∈ Θ⇐⇒ 0 <∑n

j=1 xj < n. (check!)

If∑n

j=1 xj = 0 then f(x, ϑ) = (1− ϑ)n = maxϑ! and ϑ(x) = 0.

If∑n

j=1 xj = n then f(x, ϑ) = ϑn = maxϑ! and ϑ(x) = 1.

If Θ is the closed interval [0, 1], then ϑ : X0 → [0, 1] is the MLE.

Norbert Henze, KIT 10.3

Page 144: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Asymptotic properties of maximum likelihood estimators

For asymptotic considerations, the following modification of 10.1 is convenient:

Assume the setting of 9.1 and 9.2. Let µ be some σ-finite measure on B0. Letf(·, ϑ) be the density of PX1

ϑ = Qϑ with respect to µ.

Then (X1, . . . , Xn) has the density

fn(x1, . . . , xn, ϑ) :=n∏

j=1

f(xj , ϑ)

with respect to the n-fold product measure

µn := µ⊗ µ⊗ · · · ⊗ µ (n factors). (why?)

For x = (xj)j≥1 ∈ X := X N0 , let

fn(x, ϑ) := fn(x1, . . . , xn, ϑ).

In what follows, we assume the canonical model of 9.4, , i.e.,

(Ω,A, Pϑ) := ⊗∞j=1(X0,B0, Qϑ), Xj(ω) := xj , where ω = (xj)j≥1.

Norbert Henze, KIT 10.4

Page 145: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Asymptotic properties of maximum likelihood estimators

10.4 Definition (Asymptotic maximum likelihood estimator)

Let X1, X2, . . . be i.i.d. with µ-density f(·, ϑ), ϑ ∈ Θ ⊂ Rk. For n ∈ N, let

Mn :=⋃

ϑ∈Θ

x ∈ X : fn(x, ϑ) = sup

t∈Θfn(x, t)

.

Suppose there is a set M ′n ⊂Mn with M ′

n ∈ B and Pϑ(M′n)→ 1 ∀ϑ ∈ Θ.

Then any sequence (ϑn) of measurable mappings ϑn : X → Θ satisfying

fn(x, ϑn(x)

)= sup

t∈Θfn(x, t) ∀x ∈M ′

n

is called an asymptotic maximum likelihood estimator.

Aim: Under certain regularity conditions, we have

√n(ϑn − ϑ

) Dϑ−→ Nk

(0, I1(ϑ)

−1)∀ϑ ∈ Θ.

Here, I1(ϑ) is the Fisher information matrix (to be defined).

Norbert Henze, KIT 10.5

Page 146: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Asymptotic properties of maximum likelihood estimators

10.5 Definition (Regularity conditions)

a) $\forall\, z\in\mathcal{X}_0\ \forall\, i,j\in\{1,\dots,k\}$: $\frac{\partial^2}{\partial\vartheta_i\partial\vartheta_j} f(z,\vartheta)$ exists and is continuous on $\Theta$,

b) $\forall\,\vartheta\in\Theta\ \forall\, i\in\{1,\dots,k\}$ we have

$$0 = E_\vartheta\Big[\frac{\partial}{\partial\vartheta_i}\log f(X_1,\vartheta)\Big] \ \bigg(= E_\vartheta\bigg[\frac{\frac{\partial}{\partial\vartheta_i} f(X_1,\vartheta)}{f(X_1,\vartheta)}\bigg]\bigg),$$

c) $\forall\,\vartheta\in\Theta\ \forall\, i,j\in\{1,\dots,k\}$ we have

$$0 = E_\vartheta\Big[\frac{1}{f(X_1,\vartheta)}\cdot\frac{\partial^2 f(X_1,\vartheta)}{\partial\vartheta_i\partial\vartheta_j}\Big],$$

d) $\forall\,\vartheta\in\Theta\ \exists\,\delta_\vartheta>0$ such that $U(\vartheta,\delta_\vartheta) := \{y\in\mathbb{R}^k : \|y-\vartheta\|<\delta_\vartheta\}\subset\Theta$, and $\exists$ a measurable function $M(\cdot,\vartheta)\ge0$ on $\mathcal{X}_0$ with $E_\vartheta M(X_1,\vartheta)<\infty$ and

$$\Big|\frac{\partial^2}{\partial\vartheta_i\partial\vartheta_j}\log f(\cdot,\vartheta')\Big| \le M(\cdot,\vartheta) \qquad \forall\,\vartheta'\in U(\vartheta,\delta_\vartheta)\ \forall\, i,j\in\{1,\dots,k\},$$

e) for each $\vartheta\in\Theta$, the so-called Fisher information matrix

$$I_1(\vartheta) := \Big(E_\vartheta\Big[\frac{\partial}{\partial\vartheta_i}\log f(X_1,\vartheta)\,\frac{\partial}{\partial\vartheta_j}\log f(X_1,\vartheta)\Big]\Big)_{1\le i,j\le k}$$

is invertible.


10.5 b), c) mean that we can interchange the order of integration and differentiation:

$$0 = E_\vartheta\Big[\frac{\partial}{\partial\vartheta_i}\log f(X_1,\vartheta)\Big]
= \int_{\mathcal{X}_0}\frac{\partial}{\partial\vartheta_i}\log f(z,\vartheta)\,f(z,\vartheta)\,\mu(dz)
= \int_{\mathcal{X}_0}\frac{\frac{\partial}{\partial\vartheta_i}f(z,\vartheta)}{f(z,\vartheta)}\,f(z,\vartheta)\,\mu(dz)$$

$$= \int_{\mathcal{X}_0}\frac{\partial}{\partial\vartheta_i}f(z,\vartheta)\,\mu(dz)
= \frac{\partial}{\partial\vartheta_i}\underbrace{\int_{\mathcal{X}_0}f(z,\vartheta)\,\mu(dz)}_{=\,1} = 0.$$


Put

$$\frac{d}{d\vartheta} := \Big(\frac{\partial}{\partial\vartheta_1},\dots,\frac{\partial}{\partial\vartheta_k}\Big)^{\!\top}.$$

10.6 Definition and Theorem (Score vector)

$$U_1(\vartheta) := \frac{d}{d\vartheta}\log f(X_1,\vartheta)$$

is called the score vector of $X_1$. Under the regularity conditions 10.5 we have:

a) $E_\vartheta(U_1(\vartheta)) = 0$ for each $\vartheta\in\Theta$,

b) $V_\vartheta(U_1(\vartheta)) = E_\vartheta\big[U_1(\vartheta)U_1(\vartheta)^\top\big] = I_1(\vartheta)$.

Proof: a) follows from 10.5 b), b) from 10.5 e).

10.7 Remarks

a) $I_1(\vartheta)$ is the covariance matrix of the score vector.

b) Let $U_n(\vartheta) := \frac{d}{d\vartheta}\log f_n(X_1,\dots,X_n,\vartheta)$ be the score vector of $(X_1,\dots,X_n)$. We then have $E_\vartheta(U_n(\vartheta)) = 0$ (why?) and

$$V_\vartheta(U_n(\vartheta)) = E_\vartheta\big(U_n(\vartheta)U_n(\vartheta)^\top\big) = \sum_{j,\ell=1}^n E_\vartheta\Big[\frac{d}{d\vartheta}\log f(X_j,\vartheta)\,\Big(\frac{d}{d\vartheta}\log f(X_\ell,\vartheta)\Big)^{\!\top}\Big] = \sum_{j,\ell=1}^n \delta_{j,\ell}\, I_1(\vartheta) = n\, I_1(\vartheta).$$
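The two identities in b) can be checked by simulation. Below is a minimal sketch (an illustration, assuming an $N(\mu,\sigma^2)$ model, for which $I_1(\vartheta) = \mathrm{diag}\big(1/\sigma^2,\ 1/(2\sigma^4)\big)$ with $\vartheta = (\mu,\sigma^2)$): it estimates mean and covariance of $U_n(\vartheta)$ from Monte Carlo replications and compares $V_\vartheta(U_n(\vartheta))/n$ with $I_1(\vartheta)$.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sig2, n, reps = 1.0, 2.0, 50, 20000

def score(x):
    # score vector of (X_1, ..., X_n) for theta = (mu, sigma^2)
    d_mu = np.sum(x - mu) / sig2
    d_s2 = -n / (2 * sig2) + np.sum((x - mu) ** 2) / (2 * sig2 ** 2)
    return np.array([d_mu, d_s2])

U = np.array([score(rng.normal(mu, np.sqrt(sig2), size=n)) for _ in range(reps)])
I1 = np.array([[1 / sig2, 0.0], [0.0, 1 / (2 * sig2 ** 2)]])
print(U.mean(axis=0))       # ~ (0, 0)
print(np.cov(U.T) / n)      # ~ I1(theta)
print(I1)
```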


10.8 Theorem (Main theorem of maximum likelihood estimation)

Assume the conditions in 10.5. If $(\hat\vartheta_n)$ is a consistent sequence of maximum likelihood estimators, then

$$\sqrt n\big(\hat\vartheta_n-\vartheta\big) \xrightarrow{\mathcal{D}_\vartheta} N_k\big(0, I_1(\vartheta)^{-1}\big) \qquad \forall\,\vartheta\in\Theta.$$

Proof: (Sketch) Fix $\vartheta\in\Theta$ and let $M_n'\subset\mathcal{X}$ be as in 10.4 ($P_\vartheta(M_n')\to1$).

Let $U(\vartheta,\delta_\vartheta)\subset\Theta$ be as in 10.5 d) and put

$$V_n := \{x\in\mathcal{X} : \hat\vartheta_n(x)\in U(\vartheta,\delta_\vartheta)\}.$$

Since $(\hat\vartheta_n)$ is consistent, we have $P_\vartheta(V_n)\to1$. Put

$$\tilde\vartheta_n(x) := \hat\vartheta_n(x)\,\mathbf{1}_{M_n'\cap V_n}(x) + \vartheta\,\mathbf{1}_{(M_n'\cap V_n)^c}(x), \qquad x\in\mathcal{X},$$

$\Longrightarrow \tilde\vartheta_n\xrightarrow{P_\vartheta}\vartheta$. Since $P_\vartheta(\tilde\vartheta_n\ne\hat\vartheta_n)\to0$ (why?), we have $\sqrt n(\tilde\vartheta_n-\hat\vartheta_n) = o_{P_\vartheta}(1)$ (!). By Slutsky's lemma, it suffices to show

$$\sqrt n\big(\tilde\vartheta_n-\vartheta\big) \xrightarrow{\mathcal{D}_\vartheta} N_k\big(0, I_1(\vartheta)^{-1}\big).$$


Let

$$U_n(t) := \sum_{j=1}^n \frac{d}{d\vartheta}\log f(X_j,\vartheta)\Big|_{\vartheta=t}, \qquad t\in\Theta.$$

On the set $M_n'\cap V_n$, $\tilde\vartheta_n$ satisfies the log-likelihood equations $0 = U_n(\tilde\vartheta_n)$.

Idea: Make a Taylor expansion of $U_n(t)$ at $t = \vartheta$. Let

$$W_n(\vartheta) := \frac{d}{d\vartheta^\top}U_n(\vartheta) = \Big(\sum_{\ell=1}^n\frac{\partial^2}{\partial\vartheta_i\partial\vartheta_j}\log f(X_\ell,\vartheta)\Big)_{1\le i,j\le k}.$$

We claim

$$E_\vartheta(W_n(\vartheta)) = -n\,I_1(\vartheta).$$

To prove this claim, notice that

$$\frac{\partial}{\partial\vartheta_j}\Big[\frac{\partial}{\partial\vartheta_i}\log f(X_\ell,\vartheta)\Big] = \frac{\partial}{\partial\vartheta_j}\,\frac{\frac{\partial}{\partial\vartheta_i}f(X_\ell,\vartheta)}{f(X_\ell,\vartheta)} = \underbrace{\frac{1}{f(X_\ell,\vartheta)}\,\frac{\partial^2}{\partial\vartheta_i\partial\vartheta_j}f(X_\ell,\vartheta)}_{E_\vartheta[\,\cdot\,]\,=\,0,\ \text{cf. 10.5 c)}} - \underbrace{\frac{\partial}{\partial\vartheta_i}\log f(X_\ell,\vartheta)\,\frac{\partial}{\partial\vartheta_j}\log f(X_\ell,\vartheta)}_{E_\vartheta[\,\cdot\,]\,=\,(i,j)\text{ entry of } I_1(\vartheta),\ \text{cf. 10.5 e)}}.$$


A Taylor expansion gives

$$0 = U_n(\tilde\vartheta_n) = U_n(\vartheta) + W_n(\vartheta)(\tilde\vartheta_n-\vartheta) + R_n(\vartheta,\tilde\vartheta_n-\vartheta).$$

Dividing this equation by $\sqrt n$ gives

$$0 = \frac{1}{\sqrt n}U_n(\vartheta) + \frac1n W_n(\vartheta)\cdot\sqrt n(\tilde\vartheta_n-\vartheta) + \underbrace{\frac{1}{\sqrt n}R_n(\vartheta,\tilde\vartheta_n-\vartheta)}_{=\,o_{P_\vartheta}(1),\ \text{cf. 10.5 d)}}$$

$$\Longrightarrow\quad \frac1n W_n(\vartheta)\cdot\sqrt n(\tilde\vartheta_n-\vartheta) = -\frac{1}{\sqrt n}U_n(\vartheta) + o_{P_\vartheta}(1).$$

The multivariate CLT, the CMT and Slutsky imply

$$-\frac{1}{\sqrt n}U_n(\vartheta) + o_{P_\vartheta}(1) \xrightarrow{\mathcal{D}_\vartheta} N_k(0, I_1(\vartheta)).$$

The CMT gives

$$(-I_1(\vartheta))^{-1}\,\frac1n W_n(\vartheta)\,\sqrt n(\tilde\vartheta_n-\vartheta) \xrightarrow{\mathcal{D}_\vartheta} N_k\big(0, I_1(\vartheta)^{-1}\big).$$

Since $\frac1n W_n(\vartheta)\to -I_1(\vartheta)$ $P_\vartheta$-a.s., the assertion follows.
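A Monte Carlo illustration of the theorem (a sketch under the assumption of the $\mathrm{Exp}(\vartheta)$ model, where the MLE is $\hat\vartheta_n = 1/\bar X_n$ and $I_1(\vartheta)^{-1} = \vartheta^2$):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 1.5, 400, 10000

# Exp(theta): f(x, theta) = theta * exp(-theta x); MLE = 1 / sample mean
x = rng.exponential(scale=1 / theta, size=(reps, n))
mle = 1.0 / x.mean(axis=1)
z = np.sqrt(n) * (mle - theta)

print(z.mean(), z.var())   # ~ 0 and ~ theta^2 = I_1(theta)^{-1} = 2.25
```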


10.9 Corollary (Representation of the estimation error)

Under the standing assumptions, we have

$$\sqrt n\big(\hat\vartheta_n-\vartheta\big) = \frac{1}{\sqrt n}\sum_{j=1}^n \ell(X_j,\vartheta) + o_{P_\vartheta}(1) \quad \text{as } n\to\infty,$$

where $\ell(X_j,\vartheta) = I_1(\vartheta)^{-1}\,\frac{d}{d\vartheta}\log f(X_j,\vartheta)$.

Proof: From the proof of Theorem 10.8, we have

$$\sqrt n\big(\hat\vartheta_n-\vartheta\big) = I_1(\vartheta)^{-1}\,\frac{1}{\sqrt n}\,U_n(\vartheta) + o_{P_\vartheta}(1), \quad \text{q.e.d.}$$

Notice that $E_\vartheta(\ell(X_1,\vartheta)) = 0$ and $V_\vartheta(\ell(X_1,\vartheta)) = I_1(\vartheta)^{-1}$.

$\ell(\cdot,\vartheta) : \mathcal{X}_0\to\mathbb{R}^k$ is called the influence function.

10.10 Remark Under general conditions, we have $\hat\vartheta_n\to\vartheta$ $P_\vartheta$-almost surely.

Further reading: Witting, H., Müller-Funk, U.: Mathematische Statistik II. B. G. Teubner 1995, p. 168ff.

Ferguson, Th.: A Course in Large Sample Theory. Chapman & Hall 1996, Chapters 16–18.


11 Asymptotic (relative) efficiency of estimators

11.1 Definition (Loewner semiorder)

Let $A, B$ be symmetric $(k\times k)$-matrices. We write $A\ge0$ if $A$ is positive semidefinite, and put

$$A\ge B :\iff A-B\ge0.$$

11.2 Remark (Multivariate Cramér–Rao inequality)

Let $X_1, X_2,\dots$ be i.i.d. $\sim Q_\vartheta$, where $\vartheta\in\Theta\subset\mathbb{R}^k$. Furthermore, let $Q_\vartheta = f(\cdot,\vartheta)\,\mu$, where $\mu$ is some $\sigma$-finite measure on $\mathcal{B}_0$. Let $T_n = T_n(X_1,\dots,X_n)$ be an unbiased estimator of $\vartheta$, i.e., $E_\vartheta(T_n) = \vartheta$ for each $\vartheta\in\Theta$.

Under further regularity conditions, we then have

$$V_\vartheta(T_n) = E_\vartheta\big[(T_n-\vartheta)(T_n-\vartheta)^\top\big] \ge \frac1n\, I_1(\vartheta)^{-1}$$

(multivariate Cramér–Rao inequality).


Memo: $V_\vartheta(T_n) = E_\vartheta\big[(T_n-\vartheta)(T_n-\vartheta)^\top\big] \ge \frac1n I_1(\vartheta)^{-1}$

$$\Longrightarrow\quad V_\vartheta\big(\sqrt n(T_n-\vartheta)\big) \ge I_1(\vartheta)^{-1}.$$

Now, suppose that $\sqrt n(T_n-\vartheta)\xrightarrow{\mathcal{D}_\vartheta} N_k(0,\Sigma(\vartheta))$, $\vartheta\in\Theta$.

Do we have $\Sigma(\vartheta)\ge I_1(\vartheta)^{-1}$, $\vartheta\in\Theta$?

11.3 Theorem (Bahadur)

Suppose that the conditions of 10.5 hold. Let $(T_n)$ be a sequence of estimators of $\vartheta$ such that

$$\sqrt n(T_n-\vartheta)\xrightarrow{\mathcal{D}_\vartheta} N_k(0,\Sigma(\vartheta)), \qquad \vartheta\in\Theta.$$

Then

$$\Sigma(\vartheta)\ge I_1(\vartheta)^{-1} \qquad \forall\,\vartheta\in\Theta\cap N^c,$$

where $N$ is a set with $\lambda^k(N) = 0$.

Proof: See Bahadur, Ann. Math. Statist. 38 (1967), 303–324.


11.4 Definition (BAN estimator)

Under the conditions of 10.5, a sequence $(T_n)$ of estimators of $\vartheta$ is said to be asymptotically efficient if

$$\sqrt n(T_n-\vartheta)\xrightarrow{\mathcal{D}_\vartheta} N_k\big(0, I_1(\vartheta)^{-1}\big) \qquad \forall\,\vartheta\in\Theta.$$

In this case, $T_n$ is called a best asymptotically normal (BAN) estimator.

11.5 The moment estimator

Let $X_1, X_2,\dots$ be i.i.d. $\mathbb{R}$-valued with $EX_1^{2k}<\infty$. Let

$$m_\ell := EX_1^\ell, \quad \ell = 1,\dots,2k, \qquad \hat m_{\ell,n} := \frac1n\sum_{j=1}^n X_j^\ell \xrightarrow{\text{a.s.}} m_\ell \ \text{ as } n\to\infty. \quad \text{(SLLN)}$$

Suppose that

$$\vartheta = g(m_1,\dots,m_k)\in\Theta\subset\mathbb{R}^k$$

for some continuously differentiable bijective function $g : D\subset\mathbb{R}^k\to\Theta$. Then $\hat\vartheta_n := g(\hat m_{1,n},\dots,\hat m_{k,n})$ is called the moment estimator of $\vartheta$.


Memo: $m_\ell = EX_1^\ell$, $\hat m_{\ell,n} = \frac1n\sum_{j=1}^n X_j^\ell \xrightarrow{\text{a.s.}} m_\ell$

Memo: $\vartheta = g(m_1,\dots,m_k)$, $\hat\vartheta_n := g(\hat m_{1,n},\dots,\hat m_{k,n})$

Notice that $\hat\vartheta_n\xrightarrow{\text{a.s.}}\vartheta$, $\vartheta\in\Theta$. (why?) Let

$$Y_j := \begin{pmatrix} X_j\\ X_j^2\\ \vdots\\ X_j^k \end{pmatrix}, \qquad a := EY_1 = \begin{pmatrix} m_1\\ m_2\\ \vdots\\ m_k \end{pmatrix},$$

$$T := E\big[(Y_1-a)(Y_1-a)^\top\big] = \big(E[(X_1^i-m_i)(X_1^j-m_j)]\big)_{1\le i,j\le k} = (m_{i+j}-m_i m_j)_{1\le i,j\le k}.$$

CLT $\Longrightarrow$

$$\sqrt n\,(\overline Y_n - a) = \frac{1}{\sqrt n}\Big(\sum_{j=1}^n Y_j - na\Big) \xrightarrow{\mathcal{D}} N_k(0,T).$$

$\overline Y_n = (\hat m_{1,n},\dots,\hat m_{k,n})^\top$, $g(\overline Y_n) = \hat\vartheta_n$, $g(a) = \vartheta$.


By the δ-method we have the following result:

11.6 Theorem (Limit distribution of the moment estimator)

Let $X_1, X_2,\dots$ be i.i.d., $EX_1^{2k}<\infty$, $m_\ell = EX_1^\ell$, $\hat m_{\ell,n} = n^{-1}\sum_{j=1}^n X_j^\ell$, $\ell = 1,\dots,2k$. Let $\vartheta = g(m_1,\dots,m_k)\in\Theta\subset\mathbb{R}^k$ for some continuously differentiable function $g : D\subset\mathbb{R}^k\to\Theta$. Then

$$\sqrt n\big(\hat\vartheta_n-\vartheta\big)\xrightarrow{\mathcal{D}_\vartheta} N_k(0,\Sigma(\vartheta)),$$

where

$$\Sigma(\vartheta) = g'(a)\,T\,g'(a)^\top, \qquad a = \big(EX_1, EX_1^2,\dots,EX_1^k\big)^\top, \quad T = \big(EX_1^{i+j} - EX_1^i\,EX_1^j\big)_{1\le i,j\le k}.$$

In general, $(\hat\vartheta_n)$ is not BAN.


11.7 Scoring (making estimators BAN)

Assume the conditions of 10.5. Let $(\tilde\vartheta_n)$ be any sequence of estimators of $\vartheta$ with $\tilde\vartheta_n\xrightarrow{\text{a.s.}}\vartheta$, $\vartheta\in\Theta$, and

$$\sqrt n(\tilde\vartheta_n-\vartheta)\xrightarrow{\mathcal{D}_\vartheta} N_k(0,\Sigma(\vartheta)), \qquad \vartheta\in\Theta.$$

Put

$$U_n(\vartheta) := \sum_{j=1}^n\frac{d}{d\vartheta}\log f(X_j,\vartheta), \qquad W_n(\vartheta) := \frac{d}{d\vartheta^\top}U_n(\vartheta), \quad \text{cf. proof of 10.8},$$

$$\vartheta_n^{(1)} := \tilde\vartheta_n - W_n(\tilde\vartheta_n)^{-1}U_n(\tilde\vartheta_n), \qquad \vartheta_n^{(2)} := \tilde\vartheta_n + I_1(\tilde\vartheta_n)^{-1}\,\frac1n\,U_n(\tilde\vartheta_n).$$

Let $\hat\vartheta_n$ denote the maximum likelihood estimator. For $j\in\{1,2\}$, we then have

$$\sqrt n\big(\vartheta_n^{(j)}-\hat\vartheta_n\big)\xrightarrow{P_\vartheta} 0 \qquad \forall\,\vartheta\in\Theta. \qquad (11.1)$$

From Slutsky's lemma, we obtain

$$\sqrt n\big(\vartheta_n^{(j)}-\vartheta\big) = \sqrt n\big(\hat\vartheta_n-\vartheta\big) + \sqrt n\big(\vartheta_n^{(j)}-\hat\vartheta_n\big) \xrightarrow{\mathcal{D}_\vartheta} N_k\big(0, I_1(\vartheta)^{-1}\big).$$

Hence, $(\vartheta_n^{(1)})$ and $(\vartheta_n^{(2)})$ are asymptotically efficient (BAN).


Memo: $\vartheta_n^{(1)} = \tilde\vartheta_n - W_n(\tilde\vartheta_n)^{-1}U_n(\tilde\vartheta_n)$, $U_n(\vartheta) = \sum_{j=1}^n\frac{d}{d\vartheta}\log f(X_j,\vartheta)$

We give a sketch of the proof of (11.1) in the case $j=1$; the case $j=2$ follows similarly. A Taylor expansion of $U_n(\cdot)$ at $\hat\vartheta_n$ yields

$$U_n(\tilde\vartheta_n) = \underbrace{U_n(\hat\vartheta_n)}_{=\,0} + W_n(\hat\vartheta_n)(\tilde\vartheta_n-\hat\vartheta_n) + \dots$$

Hence,

$$\vartheta_n^{(1)} - \hat\vartheta_n = \tilde\vartheta_n - \hat\vartheta_n - W_n(\tilde\vartheta_n)^{-1}\underbrace{U_n(\tilde\vartheta_n)}_{\approx\,W_n(\hat\vartheta_n)(\tilde\vartheta_n-\hat\vartheta_n)} \approx \big[I_k - W_n(\tilde\vartheta_n)^{-1}W_n(\hat\vartheta_n)\big](\tilde\vartheta_n-\hat\vartheta_n)$$

$$\Longrightarrow\ \sqrt n\big(\vartheta_n^{(1)}-\hat\vartheta_n\big) \approx \Big[I_k - \underbrace{\Big(\frac1n W_n(\tilde\vartheta_n)\Big)^{-1}}_{\xrightarrow{\text{a.s.}}\,(-I_1(\vartheta))^{-1}}\underbrace{\frac1n W_n(\hat\vartheta_n)}_{\xrightarrow{\text{a.s.}}\,-I_1(\vartheta)}\Big]\Big(\underbrace{\sqrt n(\tilde\vartheta_n-\vartheta)}_{=\,O_{P_\vartheta}(1)} - \underbrace{\sqrt n(\hat\vartheta_n-\vartheta)}_{=\,O_{P_\vartheta}(1)}\Big) = o_{P_\vartheta}(1), \quad \text{q.e.d.}$$
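A sketch of one scoring step of type $\vartheta_n^{(2)}$ (an illustration assuming the Cauchy location model, where $I_1(\vartheta) = 1/2$ and the sample median serves as the consistent preliminary estimator $\tilde\vartheta_n$):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n = 0.7, 500
x = rng.standard_cauchy(n) + theta      # Cauchy location model

def U_n(t):
    # score of the sample: sum_j d/dtheta log f(x_j, theta) at theta = t
    return np.sum(2 * (x - t) / (1 + (x - t) ** 2))

theta_tilde = np.median(x)              # preliminary estimator
I1 = 0.5                                # Fisher information of the Cauchy location model
theta_2 = theta_tilde + U_n(theta_tilde) / (n * I1)   # one scoring step

print(theta_tilde, theta_2)             # the one-step estimator is BAN
```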

Further reading: Ferguson, Th.: A Course in Large Sample Theory. Chapman & Hall 1996, p. 133ff.


11.8 Definition (Asymptotic relative Pitman efficiency)

Let $X_1, X_2,\dots$ be i.i.d. $\sim Q_\vartheta$ as in Chapter 9, and let $\mu : \Theta\to\mathbb{R}$. Let $S_n = S_n(X_1,\dots,X_n)$ and $T_n = T_n(X_1,\dots,X_n)$ be sequences of estimators of $\mu(\vartheta)$ satisfying

$$\sqrt n\,(S_n-\mu(\vartheta))\xrightarrow{\mathcal{D}_\vartheta} N(0,\sigma^2(\vartheta)), \qquad (11.2)$$

$$\sqrt n\,(T_n-\mu(\vartheta))\xrightarrow{\mathcal{D}_\vartheta} N(0,\tau^2(\vartheta)), \qquad (11.3)$$

and $0<\sigma^2(\vartheta),\tau^2(\vartheta)<\infty$, $\vartheta\in\Theta$. Then

$$\mathrm{ARE}_\vartheta\big((T_n):(S_n)\big) := \frac{\sigma^2(\vartheta)}{\tau^2(\vartheta)}$$

is called the asymptotic relative (Pitman) efficiency of $(T_n)$ with respect to $(S_n)$.

There is the alternative notation $\mathrm{ARE}_F\big((T_n):(S_n)\big)$ if $F$ is the distribution function of $X_1$.


Memo: $\mathrm{ARE}_\vartheta((T_n):(S_n)) := \sigma^2(\vartheta)/\tau^2(\vartheta)$, $\sqrt n(S_n-\mu(\vartheta))\xrightarrow{\mathcal{D}_\vartheta} N(0,\sigma^2(\vartheta))$

11.9 Interpretation of the ARE

Let $(m_n)_{n\ge1}$, $m_n = m_n(\vartheta)$, be a sequence of integers such that $m_n\to\infty$ and

$$\sqrt n\,\big(T_{m_n}-\mu(\vartheta)\big)\xrightarrow{\mathcal{D}_\vartheta} N(0,\sigma^2(\vartheta)).$$

A comparison with (11.2) shows that the estimator $T$ with sample size $m_n$ is asymptotically equivalent (with respect to asymptotic variance) to the estimator $S$ with sample size $n$. Since

$$\sqrt{\frac{n}{m_n}}\;\underbrace{\sqrt{m_n}\,\big(T_{m_n}-\mu(\vartheta)\big)}_{\xrightarrow{\mathcal{D}_\vartheta}\,N(0,\tau^2(\vartheta)),\ \text{cf. (11.3)}} \xrightarrow{\mathcal{D}_\vartheta} N(0,\sigma^2(\vartheta)) \iff \lim_{n\to\infty}\sqrt{\frac{n}{m_n}} = \frac{\sigma(\vartheta)}{\tau(\vartheta)},$$

we obtain

$$\mathrm{ARE}_\vartheta\big((T_n):(S_n)\big) = \lim_{n\to\infty}\frac{n}{m_n(\vartheta)}.$$


11.10 Example (Estimation of the center of a symmetric distribution)

Let $X_1, X_2,\dots$ be i.i.d., $EX_1^2<\infty$, $\vartheta = EX_1$. Suppose that the distribution of $X_1$ is symmetric around $\vartheta$, i.e., $X_1-\vartheta \stackrel{\mathcal{D}}{=} -(X_1-\vartheta) = \vartheta-X_1$.

Let $F$ be the distribution function of $X_1$. Assume that $f\big(F^{-1}(1/2)\big) := F'\big(F^{-1}(1/2)\big) > 0$. Then

$$\vartheta = F^{-1}\Big(\frac12\Big) = EX_1. \qquad \text{(expectation = median)}$$

Let $\sigma_F^2 := \sigma^2(F) := V_F(X_1)$. As a first estimator, take the sample mean

$$S_n := \overline X_n := \frac1n\sum_{j=1}^n X_j.$$

The CLT gives $\sqrt n(S_n-\vartheta)\xrightarrow{\mathcal{D}_F} N(0,\sigma^2(F))$.

For a second estimator, let $X_{(1)}\le X_{(2)}\le\dots\le X_{(n)}$ be the order statistics of $X_1,\dots,X_n$. Define the empirical median of $X_1,\dots,X_n$ as

$$T_n := \begin{cases} X_{(\frac{n+1}{2})}, & \text{if } n \text{ is odd},\\[2pt] \frac12\big(X_{(\frac n2)} + X_{(\frac n2+1)}\big), & \text{if } n \text{ is even}. \end{cases}$$


Memo: $T_n = X_{(\frac{n+1}{2})}$ ($n$ odd), $T_n = \frac12\big(X_{(\frac n2)} + X_{(\frac n2+1)}\big)$ ($n$ even).

Use

$$X_{(r)}\le t \iff \sum_{j=1}^n \mathbf{1}\{X_j\le t\} \ge r$$

and $\sum_{j=1}^n \mathbf{1}\{X_j\le t\} \sim \mathrm{Bin}(n, F(t))$ to show

$$\sqrt n(T_n-\vartheta)\xrightarrow{\mathcal{D}_F} N\Big(0, \frac{1}{4f^2(F^{-1}(1/2))}\Big)$$

(Exercise!). It follows that

$$\mathrm{ARE}_F\big((T_n):(S_n)\big) = 4f^2\big(F^{-1}(\tfrac12)\big)\,\sigma^2(F).$$

Special case: $X_1\sim t_s$ (Student's t-distribution with $s$ degrees of freedom), i.e.,

$$X_1 \sim \frac{N_0}{\sqrt{\frac1s\sum_{j=1}^s N_j^2}}, \qquad \text{where } N_0,\dots,N_s \text{ are i.i.d. } \sim N(0,1).$$


$X_1$ has the density

$$f_s(x) = \frac{\Gamma\big(\frac{s+1}{2}\big)}{\sqrt{\pi s}\;\Gamma\big(\frac s2\big)}\cdot\Big(1+\frac{x^2}{s}\Big)^{-(s+1)/2}, \qquad x\in\mathbb{R}.$$

Let $F_s(t) := \int_{-\infty}^t f_s(x)\,dx$, $t\in\mathbb{R}$, be the distribution function of $X_1$. We have (!)

$$\sigma^2(F_s) = \frac{s}{s-2} \ \text{ if } s\ge3, \qquad F_s^{-1}(1/2) = 0, \qquad E_{F_s}(X_1) = 0 \ \text{ if } s\ge2.$$

It follows that

$$a_s := \mathrm{ARE}_{F_s}\big((T_n):(S_n)\big) = \frac{4\,\Gamma^2\big(\frac{s+1}{2}\big)}{\pi s\,\Gamma^2\big(\frac s2\big)}\cdot\frac{s}{s-2}.$$

  s   :   3      4      5      6      ∞
  a_s :  1.621  1.125  0.961  0.879  0.637 (= 2/π)
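The table entries follow directly from the formula for $a_s$; a short computational sketch:

```python
from math import gamma, pi

def are_t(s):
    # a_s = 4 Gamma((s+1)/2)^2 / (pi s Gamma(s/2)^2) * s / (s - 2),  s >= 3
    return 4 * gamma((s + 1) / 2) ** 2 / (pi * s * gamma(s / 2) ** 2) * s / (s - 2)

for s in (3, 4, 5, 6):
    print(s, round(are_t(s), 3))   # 1.621, 1.125, 0.961, 0.879
print("inf", round(2 / pi, 3))     # limit 2/pi = 0.637
```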


12 Asymptotic tests in parametric models

12.1 The setting

Let $X_1, X_2,\dots$ be i.i.d., $\mathcal{X}_0$-valued, having density $f(\cdot,\vartheta)$, $\vartheta\in\Theta\subset\mathbb{R}^k$, with respect to some $\sigma$-finite measure $\mu$ on $\mathcal{B}_0$. Assume $\Theta$ to be open. Furthermore, assume the conditions in 10.8.

For testing a simple hypothesis $H_0 : \vartheta = \vartheta_0$ against a simple alternative $H_1 : \vartheta = \vartheta_1$ ($\vartheta_0,\vartheta_1\in\Theta$, $\vartheta_0\ne\vartheta_1$), there is an optimal (Neyman–Pearson) test. It uses the test statistic

$$T_n := \prod_{j=1}^n\frac{f(X_j,\vartheta_1)}{f(X_j,\vartheta_0)} \qquad \text{(likelihood ratio)};$$

$H_0$ is rejected for large values of $T_n$.

Now let $\Theta_0\subset\Theta$, $\Theta_0\ne\emptyset$, $\Theta\setminus\Theta_0\ne\emptyset$.

Hypothesis $H_0 : \vartheta\in\Theta_0$, Alternative $H_1 : \vartheta\in\Theta\setminus\Theta_0$.


12.2 Definition (Generalized likelihood ratio)

$$\Lambda_n := \Lambda_n(X_1,\dots,X_n) := \frac{\sup_{\vartheta\in\Theta_0}\prod_{j=1}^n f(X_j,\vartheta)}{\sup_{\vartheta\in\Theta}\prod_{j=1}^n f(X_j,\vartheta)} \quad (\le1)$$

is called the generalized likelihood ratio (GLR). The GLR test rejects $H_0$ for small values of $\Lambda_n$.

12.3 Theorem (Simple hypothesis)

Let $\Theta_0 = \{\vartheta_0\}$ and $M_n := -2\log\Lambda_n$. We then have

$$M_n \xrightarrow{\mathcal{D}_{\vartheta_0}} \chi^2_k \quad \text{as } n\to\infty.$$

Proof: Let

$$U_n(\vartheta) := \sum_{j=1}^n\frac{d}{d\vartheta}\log f(X_j,\vartheta), \qquad W_n(\vartheta) := \Big(\sum_{\ell=1}^n\frac{\partial^2}{\partial\vartheta_i\partial\vartheta_j}\log f(X_\ell,\vartheta)\Big)_{1\le i,j\le k} = \frac{d}{d\vartheta^\top}U_n(\vartheta).$$


Let $\hat\vartheta_n$ be the maximum likelihood estimator of $\vartheta$. We have

$$0 = U_n(\hat\vartheta_n) = U_n(\vartheta_0) + W_n(\vartheta_0)(\hat\vartheta_n-\vartheta_0) + o_{P_{\vartheta_0}}(\sqrt n). \qquad (12.1)$$

Notice that

$$\Lambda_n = \frac{\sup_{\vartheta\in\Theta_0}\prod_{j=1}^n f(X_j,\vartheta)}{\sup_{\vartheta\in\Theta}\prod_{j=1}^n f(X_j,\vartheta)} = \prod_{j=1}^n\frac{f(X_j,\vartheta_0)}{f(X_j,\hat\vartheta_n)}.$$

It follows that (recall: $M_n = -2\log\Lambda_n$)

$$M_n = 2\sum_{j=1}^n\big[\log f(X_j,\hat\vartheta_n) - \log f(X_j,\vartheta_0)\big] = 2\Big(\underbrace{U_n^\top(\vartheta_0)}_{=\,-(\hat\vartheta_n-\vartheta_0)^\top W_n(\vartheta_0)\,+\,o_{P_{\vartheta_0}}(\sqrt n)\ \text{(by (12.1))}}(\hat\vartheta_n-\vartheta_0) + \frac12(\hat\vartheta_n-\vartheta_0)^\top W_n(\vartheta_0)(\hat\vartheta_n-\vartheta_0) + o_{P_{\vartheta_0}}(1)\Big)$$

$$= \underbrace{\big(\sqrt n(\hat\vartheta_n-\vartheta_0)\big)^{\!\top}}_{\xrightarrow{\mathcal{D}_{\vartheta_0}}\,Z^\top,\ Z\sim N_k(0,\,I_1(\vartheta_0)^{-1})}\underbrace{\Big(-\frac1n W_n(\vartheta_0)\Big)}_{\xrightarrow{\text{a.s.}}\,I_1(\vartheta_0)}\underbrace{\sqrt n(\hat\vartheta_n-\vartheta_0)}_{\xrightarrow{\mathcal{D}_{\vartheta_0}}\,Z} + o_{P_{\vartheta_0}}(1).$$

CMT $\Longrightarrow M_n\xrightarrow{\mathcal{D}_{\vartheta_0}} Z^\top I_1(\vartheta_0)Z \sim \chi^2_k$, q.e.d.


12.4 Example (Multinomial distribution)

For $s\ge2$, let

$$\Theta := \{\vartheta := (p_1,\dots,p_{s-1}) : p_j>0\ \forall\, j,\ p_1+\dots+p_{s-1}<1\},$$

and put $p_s := 1-p_1-\dots-p_{s-1}$. Then $\Theta$ is an open subset of $\mathbb{R}^k$, where $k = s-1$.

Let $e_j$ be the $j$th canonical unit vector in $\mathbb{R}^s$, $j = 1,\dots,s$.

Let $X_1, X_2,\dots$ be i.i.d. $s$-dimensional random vectors, where

$$P_\vartheta(X_1 = e_j) = p_j, \qquad j = 1,\dots,s.$$

Notice that

$$\sum_{j=1}^n X_j =: (N_1,\dots,N_s) \sim \mathrm{Mult}(n; p_1,\dots,p_s).$$

Put

$$f(t,\vartheta) := \begin{cases} p_j, & \text{if } t = e_j\ (j = 1,\dots,s),\\ 0, & \text{otherwise}. \end{cases}$$

Then $X_1$ has density $f(\cdot,\vartheta)$ w.r.t. the counting measure $\mu$ on $\{e_1,\dots,e_s\}$.


The joint density of $(X_1,\dots,X_n)$ is

$$\prod_{j=1}^n f(X_j,\vartheta) = p_1^{N_1}p_2^{N_2}\cdots p_s^{N_s}.$$

Let $\Theta_0 := \{\vartheta_0\} = \{(q_1,\dots,q_{s-1})\}$, $q_s := 1-q_1-\dots-q_{s-1}$.

Hypothesis $H_0 : \vartheta\in\Theta_0$ (i.e., $p_j = q_j\ \forall\, j$), Alternative $H_1 : \vartheta\notin\Theta_0$.

The MLE of $\vartheta$ is $\hat\vartheta_n = (\hat p_1,\dots,\hat p_{s-1})$, $\hat p_j = \frac{N_j}{n}$ (!)

$$\Longrightarrow\ \Lambda_n = \frac{q_1^{N_1}\cdots q_s^{N_s}}{\hat p_1^{N_1}\cdots\hat p_s^{N_s}} = \prod_{i=1}^s\Big(\frac{q_i}{\hat p_i}\Big)^{N_i} \ \Longrightarrow\ M_n = 2\sum_{i=1}^s N_i\log\Big(\frac{N_i}{nq_i}\Big) \xrightarrow{\mathcal{D}_{\vartheta_0}} \chi^2_{s-1}.$$

Compare with the $\chi^2$-test statistic

$$T_n := \sum_{i=1}^s\frac{(N_i-nq_i)^2}{nq_i}.$$

We have (Exercise!) $T_n - M_n = o_{P_{\vartheta_0}}(1)$ (hence $T_n\xrightarrow{\mathcal{D}_{\vartheta_0}}\chi^2_{s-1}$).
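A small simulation (a sketch; the logarithm requires $N_i > 0$, which holds with overwhelming probability for the sample size chosen here) shows how close $M_n$ and $T_n$ are under $H_0$:

```python
import numpy as np

rng = np.random.default_rng(4)
q = np.array([0.2, 0.3, 0.5])      # hypothesis H0: p = q
n = 1000
N = rng.multinomial(n, q)          # (N_1, ..., N_s)

M_n = 2 * np.sum(N * np.log(N / (n * q)))    # GLR statistic (assumes N_i > 0)
T_n = np.sum((N - n * q) ** 2 / (n * q))     # chi-square statistic

print(M_n, T_n)   # close to each other; both approx chi^2 with s-1 = 2 df under H0
```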


12.5 Theorem (Composite hypothesis)

In 12.1, let $\Theta_0 = h(U)$, where $U\subset\mathbb{R}^\ell$ is open, $1\le\ell<k$, and $h : U\to\mathbb{R}^k$ is twice continuously differentiable and injective. Under further regularity conditions ($\to$ proof), we have

$$M_n = -2\log\Lambda_n \xrightarrow{\mathcal{D}_\vartheta} \chi^2_{k-\ell} \quad \text{for each } \vartheta\in\Theta_0.$$

Proof: (Sketch) Fix $\vartheta_0 = h(u_0)\in\Theta_0$, where $u_0\in U$.

Let $\hat\vartheta_n$ be a consistent MLE of $\vartheta_0$, and let $\hat u_n$ be a consistent MLE of $u_0$. Then $h(\hat u_n)$ is a consistent MLE of $\vartheta_0$ within (the submodel given by) $\Theta_0$.

We have

$$\Lambda_n = \prod_{j=1}^n\frac{f(X_j,h(\hat u_n))}{f(X_j,\hat\vartheta_n)}, \qquad M_n = 2\sum_{j=1}^n\Big\{\log f(X_j,\hat\vartheta_n) - \log f(X_j,h(\hat u_n))\Big\}. \qquad (12.2)$$

Let $h'(u)\in\mathbb{R}^{k\times\ell}$ be the Jacobian matrix of $h$ at $u$.


Memo: $M_n = 2\sum_{j=1}^n\big\{\log f(X_j,\hat\vartheta_n) - \log f(X_j,h(\hat u_n))\big\}$

The chain rule gives

$$\frac{d}{du}\log f(X_j,h(u)) = h'(u)^\top\,\frac{d}{d\vartheta}\log f(X_j,\vartheta)\Big|_{\vartheta=h(u)}. \qquad (12.3)$$

Differentiating once more and taking expectations (the additional term involving the second derivatives of $h$ multiplies the score and therefore vanishes in expectation, by 10.5 b)),

$$E_u\Big[\frac{d^2}{du\,du^\top}\log f(X_j,h(u))\Big] = h'(u)^\top\,\underbrace{E_u\Big[\frac{d^2}{d\vartheta\,d\vartheta^\top}\log f(X_j,\vartheta)\Big|_{\vartheta=h(u)}\Big]}_{=\,-I_1(h(u))}\,h'(u) =: -I_1(u).$$

Let

$$U_n(u) := \sum_{j=1}^n\frac{d}{du}\log f(X_j,h(u)), \qquad U_n(h(u)) = \sum_{j=1}^n\frac{d}{d\vartheta}\log f(X_j,\vartheta)\Big|_{\vartheta=h(u)}.$$

Then (12.3) implies

$$U_n(u) = h'(u)^\top U_n(h(u)). \qquad (12.4)$$


From 10.9 (representation of the estimation error), we have

$$\sqrt n(\hat\vartheta_n-\vartheta_0) = I_1(\vartheta_0)^{-1}\frac{1}{\sqrt n}U_n(\vartheta_0) + o_{P_{\vartheta_0}}(1), \qquad (12.5)$$

$$\sqrt n(\hat u_n-u_0) = I_1(u_0)^{-1}\frac{1}{\sqrt n}U_n(u_0) + o_{P_{u_0}}(1). \qquad (12.6)$$

(12.5) $\Longrightarrow$

$$\frac{1}{\sqrt n}U_n(\vartheta_0) = I_1(\vartheta_0)\sqrt n(\hat\vartheta_n-\vartheta_0) + o_{P_{\vartheta_0}}(1). \qquad (12.7)$$

(12.4) and (12.6) yield

$$\sqrt n(\hat u_n-u_0) = I_1(u_0)^{-1}h'(u_0)^\top\frac{1}{\sqrt n}U_n(\vartheta_0) + o_{P_{\vartheta_0}}(1) \qquad (12.8)$$

$$\stackrel{(12.7)}{=} \underbrace{I_1(u_0)^{-1}h'(u_0)^\top I_1(\vartheta_0)}_{=:\,A}\,\sqrt n(\hat\vartheta_n-\vartheta_0) + o_{P_{\vartheta_0}}(1).$$

Let

$$H_n(\vartheta_0) := \sum_{j=1}^n\frac{d^2}{d\vartheta\,d\vartheta^\top}\log f(X_j,\vartheta)\Big|_{\vartheta=\vartheta_0} \ \Longrightarrow\ \frac1n H_n(\vartheta_0)\to-I_1(\vartheta_0) \quad P_{\vartheta_0}\text{-a.s.}$$


Taylor expansions at $\vartheta_0$ resp. $u_0$ give

$$\sum_{j=1}^n\log f(X_j,\hat\vartheta_n) = \sum_{j=1}^n\log f(X_j,\vartheta_0) + U_n(\vartheta_0)^\top(\hat\vartheta_n-\vartheta_0) + \frac12(\hat\vartheta_n-\vartheta_0)^\top H_n(\vartheta_0)(\hat\vartheta_n-\vartheta_0) + o_{P_{\vartheta_0}}(1), \qquad (12.9)$$

$$\sum_{j=1}^n\log f(X_j,h(\hat u_n)) = \sum_{j=1}^n\log f(X_j,\underbrace{h(u_0)}_{=\,\vartheta_0}) + U_n(u_0)^\top(\hat u_n-u_0) + \frac12(\hat u_n-u_0)^\top\underbrace{H_n(u_0)}_{=\,h'(u_0)^\top H_n(\vartheta_0)h'(u_0)}(\hat u_n-u_0) + o_{P_{\vartheta_0}}(1). \qquad (12.10)$$

Memo: $M_n = 2\sum_{j=1}^n\big\{\log f(X_j,\hat\vartheta_n) - \log f(X_j,h(\hat u_n))\big\}$

Memo: $\sqrt n(\hat u_n-u_0) = A\sqrt n(\hat\vartheta_n-\vartheta_0) + o_{P_{\vartheta_0}}(1)$

Put $Z_n := \sqrt n(\hat\vartheta_n-\vartheta_0)$ and plug (12.7), (12.8) into (12.9), (12.10) $\Longrightarrow$


Memo: $Z_n := \sqrt n(\hat\vartheta_n-\vartheta_0)$

$$\Longrightarrow\ M_n = \dots = Z_n^\top\Big(I_1(\vartheta_0)\big[I_k - h'(u_0)A\big]\Big)Z_n + o_{P_{\vartheta_0}}(1). \quad \text{(check!)}$$

Now, from the main theorem of ML estimation,

$$Z_n\xrightarrow{\mathcal{D}_{\vartheta_0}} Z\sim N_k(0,\Sigma), \qquad \Sigma = I_1(\vartheta_0)^{-1}.$$

Let $B := I_1(\vartheta_0)\big[I_k - h'(u_0)A\big]$. The matrix $B$ has the following properties (check!):

$B$ is symmetric, $\quad B\Sigma = (B\Sigma)^2, \quad \mathrm{Rank}(B\Sigma) = k-\ell$.

By a general result (Exercise), it follows that $Z^\top BZ\sim\chi^2_{k-\ell}$.

By the CMT, $M_n\xrightarrow{\mathcal{D}_{\vartheta_0}} Z^\top BZ$, and the assertion follows.


12.6 Corollary Given $\alpha\in(0,1)$, let $\chi^2_{k-\ell;1-\alpha}$ be the $(1-\alpha)$-quantile of the $\chi^2_{k-\ell}$-distribution. Then, in the setting of 12.1 and 12.5, the sequence $(\varphi_n)$ of tests

$$\varphi_n := \mathbf{1}\big\{M_n \ge \chi^2_{k-\ell;1-\alpha}\big\}$$

(the so-called generalized likelihood ratio test) has asymptotic level $\alpha$, i.e.,

$$\lim_{n\to\infty}E_\vartheta(\varphi_n) = \alpha \qquad \forall\,\vartheta\in\Theta_0.$$

12.7 Theorem (Consistency of the generalized likelihood ratio test)

The generalized likelihood ratio test is consistent against each fixed alternative, i.e., we have

$$\lim_{n\to\infty}E_\vartheta(\varphi_n) = 1 \qquad \forall\,\vartheta\in\Theta\setminus\Theta_0.$$

Proof: We only consider the case $\ell = 0$, i.e., $\Theta_0 = \{\vartheta_0\}$. Fix $\vartheta_1\ne\vartheta_0$. We have

$$\Lambda_n = \prod_{j=1}^n\frac{f(X_j,\vartheta_0)}{f(X_j,\vartheta_1)}\cdot\prod_{j=1}^n\frac{f(X_j,\vartheta_1)}{f(X_j,\hat\vartheta_n)}.$$


Memo: $\Lambda_n = \prod_{j=1}^n\frac{f(X_j,\vartheta_0)}{f(X_j,\vartheta_1)}\cdot\prod_{j=1}^n\frac{f(X_j,\vartheta_1)}{f(X_j,\hat\vartheta_n)}$

It follows that

$$M_n = -2\log\Lambda_n = \underbrace{2\sum_{j=1}^n\big\{\log f(X_j,\hat\vartheta_n) - \log f(X_j,\vartheta_1)\big\}}_{=:\,Y_n} + 2n\cdot\underbrace{\frac1n\sum_{j=1}^n\log\frac{f(X_j,\vartheta_1)}{f(X_j,\vartheta_0)}}_{=:\,V_n}.$$

From Theorem 12.3 we have $Y_n\xrightarrow{\mathcal{D}_{\vartheta_1}}\chi^2_k$ as $n\to\infty$.

By the SLLN,

$$V_n\to E_{\vartheta_1}\Big[\log\frac{f(X_1,\vartheta_1)}{f(X_1,\vartheta_0)}\Big] \quad P_{\vartheta_1}\text{-a.s.}$$


Since $\log t \ge 1-\frac1t$ (with equality only for $t = 1$), we have

$$E_{\vartheta_1}\Big[\log\frac{f(X_1,\vartheta_1)}{f(X_1,\vartheta_0)}\Big] = \int\log\frac{f(z,\vartheta_1)}{f(z,\vartheta_0)}\,f(z,\vartheta_1)\,\mu(dz) > \int\Big(1-\frac{f(z,\vartheta_0)}{f(z,\vartheta_1)}\Big)f(z,\vartheta_1)\,\mu(dz) = \int f(z,\vartheta_1)\,\mu(dz) - \int f(z,\vartheta_0)\,\mu(dz) = 0.$$

Hence $M_n\xrightarrow{P}\infty$ under $P_{\vartheta_1}$, i.e., $\lim_{n\to\infty}P_{\vartheta_1}(M_n\ge c) = 1$ for each $c>0$, q.e.d.

12.8 Definition (Kullback–Leibler information)

Let $Q_{\vartheta_j} = f(\cdot,\vartheta_j)\,\mu$, $j = 0,1$, and assume $Q_{\vartheta_1}\ll Q_{\vartheta_0}$. Then

$$I_{KL}(Q_{\vartheta_1},Q_{\vartheta_0}) := E_{\vartheta_1}\Big[\log\frac{f(X_1,\vartheta_1)}{f(X_1,\vartheta_0)}\Big] = \int_{\mathcal{X}_0}\log\frac{f(z,\vartheta_1)}{f(z,\vartheta_0)}\,f(z,\vartheta_1)\,\mu(dz)$$

is called the Kullback–Leibler information of $Q_{\vartheta_1}$ w.r.t. $Q_{\vartheta_0}$.


12.9 Contingency tables

Let $(X,Y)$ be a discrete bivariate random vector, where

$$p_{i,j} := P(X = x_i, Y = y_j), \qquad 1\le i\le r,\ 1\le j\le s.$$

Let

$$\Theta := \Big\{\vartheta := (p_{1,1},\dots,p_{r,s-1})\in\mathbb{R}^{rs-1} : p_{i,j}>0\ \forall\, i,j,\ \sum_{(i,j)\ne(r,s)}p_{i,j}<1\Big\}.$$

$\Theta$ is an open subset of $\mathbb{R}^k$, where $k = rs-1$.

Let $(X_1,Y_1),\dots,(X_n,Y_n)$ be i.i.d. $\sim (X,Y)$. Put

$$N_{i,j} := \sum_{\nu=1}^n\mathbf{1}\{X_\nu = x_i, Y_\nu = y_j\} \ \Longrightarrow\ (N_{1,1},\dots,N_{r,s})\sim\mathrm{Mult}(n; p_{1,1},\dots,p_{r,s}).$$

[Table: the counts $N_{i,j}$ arranged in an $r\times s$ contingency table, with row sums $N_{i+} := \sum_{j=1}^s N_{i,j}$, column sums $N_{+j} := \sum_{i=1}^r N_{i,j}$, and overall total $n$.]


$H_0$: $X, Y$ independent $\iff p_{i,j} = p_i\,q_j\ \forall\, i,j$, where

$$p_i = P(X = x_i),\ i = 1,\dots,r, \qquad q_j = P(Y = y_j),\ j = 1,\dots,s.$$

$H_0$ corresponds to

$$\Theta_0 := \Big\{\vartheta\in\Theta : \exists\, p_1,\dots,p_{r-1}>0,\ \sum_{i=1}^{r-1}p_i<1,\ \exists\, q_1,\dots,q_{s-1}>0,\ \sum_{j=1}^{s-1}q_j<1, \text{ such that } p_{i,j} = p_iq_j\ \forall\,(i,j)\ne(r,s)\Big\}.$$

$\Theta_0$ is an $\ell$-dimensional submanifold of $\mathbb{R}^k$, where $\ell = r-1+s-1$.

The density $f(\cdot,\cdot,\vartheta)$ of $(X,Y)$ with respect to the counting measure on $\{x_1,\dots,x_r\}\times\{y_1,\dots,y_s\}$ is

$$f(x,y,\vartheta) := p_{i,j}, \quad \text{if } (x,y) = (x_i,y_j)$$

$$\Longrightarrow\ \prod_{\nu=1}^n f(X_\nu,Y_\nu,\vartheta) = \prod_{i=1}^r\prod_{j=1}^s p_{i,j}^{N_{i,j}}.$$


We have (in all products, $i$ runs from 1 to $r$ and $j$ from 1 to $s$)

$$\Lambda_n = \frac{\sup_{p_i,q_j}\prod_i\prod_j(p_iq_j)^{N_{i,j}}}{\prod_i\prod_j\big(\frac{N_{i,j}}{n}\big)^{N_{i,j}}} = \frac{\sup_{p_i}\prod_i p_i^{N_{i+}}\ \sup_{q_j}\prod_j q_j^{N_{+j}}}{\prod_i\prod_j\big(\frac{N_{i,j}}{n}\big)^{N_{i,j}}} = \frac{\prod_i\big(\frac{N_{i+}}{n}\big)^{N_{i+}}\prod_j\big(\frac{N_{+j}}{n}\big)^{N_{+j}}}{\prod_i\prod_j\big(\frac{N_{i,j}}{n}\big)^{N_{i,j}}}.$$

Generalized likelihood ratio test:

$$M_n = -2\log\Lambda_n = 2\Big\{\sum_{i,j}N_{i,j}\log\frac{N_{i,j}}{n} - \sum_i N_{i+}\log\frac{N_{i+}}{n} - \sum_j N_{+j}\log\frac{N_{+j}}{n}\Big\}$$

$$= 2n\sum_{i,j}\Big\{\frac{N_{i,j}}{n}\log\frac{N_{i,j}}{n} - \frac{N_{i+}}{n}\,\frac{N_{+j}}{n}\,\log\Big(\frac{N_{i+}}{n}\,\frac{N_{+j}}{n}\Big)\Big\}$$

(use $n = \sum_j N_{+j} = \sum_i N_{i+}$).


In this case we have $k-\ell = rs-1-(r-1+s-1) = (r-1)(s-1)$.

Thus, the generalized likelihood ratio test for independence in an $r\times s$ contingency table is:

Reject $H_0$ if $M_n \ge \chi^2_{(r-1)(s-1);1-\alpha}$.

Using

$$t\log t = t-1+\frac12(t-1)^2 + O\big((t-1)^3\big) \quad \text{as } t\to1,$$

an asymptotically equivalent test statistic is (Exercise!)

$$T_n := \sum_{i=1}^r\sum_{j=1}^s\frac{\big(N_{i,j} - n\,\frac{N_{i+}}{n}\frac{N_{+j}}{n}\big)^2}{n\,\frac{N_{i+}}{n}\,\frac{N_{+j}}{n}}.$$

This is more intuitive than $M_n$, since $\frac{N_{i+}}{n}$ estimates $p_i$ and $\frac{N_{+j}}{n}$ estimates $q_j$.
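A numerical sketch of both statistics for a simulated $2\times3$ table generated under independence (all cell counts are positive with overwhelming probability here):

```python
import numpy as np

rng = np.random.default_rng(5)
p, q = np.array([0.3, 0.7]), np.array([0.2, 0.3, 0.5])
n = 2000
cells = rng.multinomial(n, np.outer(p, q).ravel()).reshape(2, 3)  # H0 is true

Ni, Nj = cells.sum(axis=1), cells.sum(axis=0)   # row sums N_{i+}, column sums N_{+j}
E = np.outer(Ni, Nj) / n                        # expected counts n (N_{i+}/n)(N_{+j}/n)

M_n = 2 * np.sum(cells * np.log(cells / E))     # GLR statistic (assumes cells > 0)
T_n = np.sum((cells - E) ** 2 / E)              # chi-square statistic

print(M_n, T_n)   # both approx chi^2 with (r-1)(s-1) = 2 df under H0
```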


12.10 The parametric bootstrap

Let $X_1, X_2,\dots,X_n,\dots$ be i.i.d. $\mathcal{X}_0$-valued random variables with unknown distribution $P^{X_1}$.

We want to test the hypothesis

$$H_0 : P^{X_1}\in\{P_\vartheta : \vartheta\in\Theta_0\},$$

where $\Theta_0\subset\mathbb{R}^k$.

Let $T_n = T_n(X_1,\dots,X_n)$ be a sequence of test statistics for $H_0$ with upper rejection region, i.e., reject $H_0$ if $T_n>c$.

Here, $c$ is a suitable critical value that depends on the chosen level of significance $\alpha\in(0,1)$ and on the sample size $n$, i.e., $c = c(n,\alpha)$. But $c$ also depends on the underlying unknown distribution of $T_n$ under $H_0$ if $H_0$ is composite, i.e., if $|\Theta_0|\ge2$.

We want to have

$$P_\vartheta(T_n>c) = \alpha \quad \text{for each } \vartheta\in\Theta_0.$$

This goal cannot be achieved!


Basic idea of the parametric bootstrap:

Let $(\hat\vartheta_n)$ be a consistent sequence of estimators of $\vartheta$.

If (a realization of) $\hat\vartheta_n$ is near $\vartheta$, then $P_{\hat\vartheta_n}$ should be near $P_\vartheta$; for short:

$$\hat\vartheta_n\approx\vartheta \ \Longrightarrow\ P_{\hat\vartheta_n}\approx P_\vartheta.$$

If $X_1^*,\dots,X_n^*$ are i.i.d. $\sim P_{\hat\vartheta_n}$, then

$$P_\vartheta\big(T_n(X_1,\dots,X_n)>c\big) \approx P_{\hat\vartheta_n}\big(T_n(X_1^*,\dots,X_n^*)>c\big).$$

The right-hand side can be estimated by the parametric bootstrap (B. Efron 1977).

The bootstrap algorithm: Given $\hat\vartheta_n = \hat\vartheta_n(X_1,\dots,X_n)$ (i.e., conditionally on $X_1,\dots,X_n$):

(B1) Generate $X_1^*,\dots,X_n^*$ i.i.d. $\sim P_{\hat\vartheta_n}$, ($\to$ pseudorandom numbers)

(B2) Compute $T_n(X_1^*,\dots,X_n^*)$.

Carry out (B1), (B2) $b$ times ($b$ is the number of bootstrap replications).


$$X_{1,1}^*,\dots,X_{n,1}^* \ \to\ T_n(X_{1,1}^*,\dots,X_{n,1}^*) =: T_{n,1}^*,$$
$$X_{1,2}^*,\dots,X_{n,2}^* \ \to\ T_n(X_{1,2}^*,\dots,X_{n,2}^*) =: T_{n,2}^*,$$
$$\vdots$$
$$X_{1,b}^*,\dots,X_{n,b}^* \ \to\ T_n(X_{1,b}^*,\dots,X_{n,b}^*) =: T_{n,b}^*.$$

Here, $X_{i,j}^*$, $i = 1,\dots,n$; $j = 1,\dots,b$, are i.i.d. $\sim P_{\hat\vartheta_n}$.

Let

$$H_{n,b}^*(t) := \frac1b\sum_{j=1}^b\mathbf{1}\{T_{n,j}^*\le t\}$$

be the empirical distribution function of $T_{n,1}^*,\dots,T_{n,b}^*$.

Let $T_{(1)}^*\le\dots\le T_{(b)}^*$ be the order statistics of $T_{n,1}^*,\dots,T_{n,b}^*$. Put

$$c_{n,b}^*(\alpha) := H_{n,b}^{*-1}(1-\alpha) = \begin{cases} T^*_{(b(1-\alpha))}, & \text{if } b(1-\alpha)\in\mathbb{N},\\ T^*_{(\lfloor b(1-\alpha)+1\rfloor)}, & \text{otherwise}. \end{cases}$$

Reject $H_0$ if $T_n > c_{n,b}^*(\alpha)$.
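A minimal sketch of steps (B1)-(B2) (with illustrative choices not fixed by the lecture: an exponential null family and a Kolmogorov-Smirnov-type statistic $T_n$ with estimated parameter; the bootstrap critical value is computed with np.quantile rather than the explicit order statistic, which agrees up to interpolation):

```python
import numpy as np

rng = np.random.default_rng(6)

def T_n(x):
    # KS distance between the EDF and the fitted Exp distribution (illustrative)
    n = len(x)
    lam = 1.0 / x.mean()                       # estimator theta_hat_n
    F = 1.0 - np.exp(-lam * np.sort(x))
    d_plus = np.max(np.arange(1, n + 1) / n - F)
    d_minus = np.max(F - np.arange(0, n) / n)
    return np.sqrt(n) * max(d_plus, d_minus)

x = rng.exponential(scale=2.0, size=100)       # data; H0 (exponentiality) true here
t_obs = T_n(x)

b, alpha = 999, 0.05
lam_hat = 1.0 / x.mean()
t_star = np.array([T_n(rng.exponential(scale=1 / lam_hat, size=len(x)))
                   for _ in range(b)])         # (B1) + (B2), b replications
c_star = np.quantile(t_star, 1 - alpha)        # bootstrap critical value c*_{n,b}(alpha)

print(t_obs, c_star, t_obs > c_star)           # reject H0 iff True
```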


12.11 Theorem Let $H_{n,\vartheta}(t) := P_\vartheta(T_n\le t)$, $\vartheta\in\Theta$. Suppose that for each $\vartheta\in\Theta$ there is a continuous distribution function $H_\vartheta$ that is strictly increasing on $\{t\in\mathbb{R} : 0<H_\vartheta(t)<1\}$. Suppose further that for any sequence $(\vartheta_n)$ in $\Theta$ satisfying $\lim_{n\to\infty}\vartheta_n = \vartheta\in\Theta$ we have

$$\lim_{n\to\infty}\|H_{n,\vartheta_n}-H_\vartheta\|_\infty = 0. \qquad (12.11)$$

Finally, assume that

$$\hat\vartheta_n\xrightarrow{P_\vartheta}\vartheta, \qquad \vartheta\in\Theta. \qquad (12.12)$$

We then have for each $\vartheta\in\Theta$:

a) $\|H_{n,b}^*-H_\vartheta\|_\infty\xrightarrow{P_\vartheta}0$ as $n,b\to\infty$,

b) $c_{n,b}^*(\alpha)\xrightarrow{P_\vartheta}H_\vartheta^{-1}(1-\alpha)$ as $n,b\to\infty$,

c) $\lim_{n,b\to\infty}P_\vartheta\big(T_n>c_{n,b}^*(\alpha)\big) = \alpha$.

Proof: (12.11), (12.12) and the subsequence criterion for stochastic convergence yield

$$\|H_{n,\hat\vartheta_n}-H_\vartheta\|_\infty\xrightarrow{P_\vartheta}0 \quad \text{as } n\to\infty. \qquad (12.13)$$


Memo: To show: a) $\|H_{n,b}^*-H_\vartheta\|_\infty\xrightarrow{P_\vartheta}0$ as $n,b\to\infty$.

Memo: We know: $\|H_{n,\hat\vartheta_n}-H_\vartheta\|_\infty\xrightarrow{P_\vartheta}0$ as $n\to\infty$.

Let $U_1, U_2,\dots$ be i.i.d. with the uniform distribution $\mathrm{U}(0,1)$, independent of $X_1, X_2,\dots$. W.l.o.g., let

$$T_{n,j}^* := H_{n,\hat\vartheta_n}^{-1}(U_j) = \inf\{t : H_{n,\hat\vartheta_n}(t)\ge U_j\}.$$

Notice that, conditionally on $\hat\vartheta_n$, $T_{n,1}^*, T_{n,2}^*,\dots$ are i.i.d. with distribution function $H_{n,\hat\vartheta_n}$. It follows that

$$\|H_{n,b}^*-H_{n,\hat\vartheta_n}\|_\infty = \sup_{t\in\mathbb{R}}\Big|\frac1b\sum_{j=1}^b\underbrace{\mathbf{1}\{U_j\le H_{n,\hat\vartheta_n}(t)\}}_{\iff\ T_{n,j}^*\le t} - H_{n,\hat\vartheta_n}(t)\Big| \le \sup_{0\le u\le1}\Big|\frac1b\sum_{j=1}^b\mathbf{1}\{U_j\le u\} - u\Big| \to 0 \quad P_\vartheta\text{-a.s. as } b\to\infty.$$

(12.13) (see second memo) and the triangle inequality $\Longrightarrow$ a).


Memo: a) $\|H_{n,b}^*-H_\vartheta\|_\infty\xrightarrow{P_\vartheta}0$; b) $c_{n,b}^*(\alpha)\xrightarrow{P_\vartheta}H_\vartheta^{-1}(1-\alpha)$; c) $\lim_{n,b\to\infty}P_\vartheta(T_n>c_{n,b}^*(\alpha)) = \alpha$; $c_{n,b}^*(\alpha) := H_{n,b}^{*-1}(1-\alpha)$.

b) follows from a), together with the continuity and strict monotonicity of $H_\vartheta$.

c) follows from b), q.e.d.


13 Probability Measures on Metric Spaces

13.1 Motivation

Let $X_1, X_2,\dots$ be i.i.d. $\sim\mathrm{U}(0,1)$. Let

$$F_n(t) := \frac1n\sum_{j=1}^n\mathbf{1}\{X_j\le t\}, \qquad 0\le t\le1 \quad \text{(EDF of } X_1,\dots,X_n\text{)}.$$

$$\lim_{n\to\infty}\,\sup_{t\in[0,1]}\big|F_n(t)-F(t)\big| = 0 \quad P\text{-a.s.} \qquad \text{(Glivenko–Cantelli)}$$

Let $B_n(t) := \sqrt n\big(F_n(t)-F(t)\big)$, $0\le t\le1$ (uniform empirical process).

[Figure: realization of a uniform empirical process ($n = 25$).]


Memo: $B_n(t) = \sqrt n\big(F_n(t)-F(t)\big)$, $0\le t\le1$.

For any $k\ge1$ and any choice of $0\le t_1<t_2<\dots<t_k\le1$, we have:

$$(B_n(t_1),\dots,B_n(t_k)) \xrightarrow{\mathcal{D}} N_k\big(0, (t_i\wedge t_j - t_it_j)_{1\le i,j\le k}\big).$$

Question: Does $B_n(\cdot)$ converge in distribution as a random function in a suitable space of functions?

Answer: Yes! The "limit object" is called the Brownian bridge $(=: B(\cdot))$.

[Figure: 3 realizations of an (approximate) Brownian bridge.]


Memo: $B_n(t) = \sqrt n\big(F_n(t)-F(t)\big)$, $0\le t\le1$.

$B_n(\cdot)\xrightarrow{\mathcal{D}}B(\cdot)$ as $n\to\infty$ (in a sense to be defined).

Question: Do we have

$$T(B_n)\xrightarrow{\mathcal{D}}T(B)$$

for 'nice' real-valued functionals $T$ (defined on a suitable space of functions)?

Examples:

$$T_1(B_n) := \|B_n\|_\infty = \sqrt n\,\sup_{0\le t\le1}\big|F_n(t)-t\big|, \qquad T_2(B_n) := \int_0^1 B_n^2(t)\,dt = n\int_0^1\big(F_n(t)-t\big)^2dt.$$

$T_1(B_n)$ is called the Kolmogorov–Smirnov statistic, $T_2(B_n)$ the Cramér–von Mises statistic for testing the hypothesis of a uniform distribution on $[0,1]$.

Answer: Yes!
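Both statistics are easy to compute from the order statistics of a sample; a sketch using the standard computational formulas (for the Cramér–von Mises statistic, $T_2 = \frac{1}{12n}+\sum_{i=1}^n\big(x_{(i)}-\frac{2i-1}{2n}\big)^2$):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 25
x = np.sort(rng.uniform(size=n))
i = np.arange(1, n + 1)

# T1 = sqrt(n) sup_t |F_n(t) - t|   (Kolmogorov-Smirnov)
T1 = np.sqrt(n) * max(np.max(i / n - x), np.max(x - (i - 1) / n))

# T2 = n * integral of (F_n(t) - t)^2 dt   (Cramer-von Mises)
T2 = 1 / (12 * n) + np.sum((x - (2 * i - 1) / (2 * n)) ** 2)

print(T1, T2)
```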


13.2 Notation, basic concepts

Let $(S,\rho)$ be a metric space.

$B(x,\varepsilon) := B_\rho(x,\varepsilon) := \{y\in S : \rho(x,y)<\varepsilon\}$, $x\in S$, $\varepsilon>0$,

$O\subset S$ open $:\iff \forall\, x\in O\ \exists\,\varepsilon>0 : B(x,\varepsilon)\subset O$,

$\mathcal{O} := \{O\subset S : O \text{ open}\}$,

$A\subset S$ closed $:\iff A^c := S\setminus A$ open,

$\mathcal{A} := \{A\subset S : A \text{ closed}\}$,

$\mathcal{B} := \sigma(\mathcal{O})$ ($\sigma$-field of Borel sets),

$M^\circ := \bigcup\{O\in\mathcal{O} : O\subset M\}$ (interior of $M\subset S$),

$\overline M := \bigcap\{A\in\mathcal{A} : A\supset M\}$ (closure of $M$),

$\partial M := \overline M\setminus M^\circ$ (boundary of $M$).


For $x\in S$, $M\subset S$: $\rho(x,M) := \inf\{\rho(x,y) : y\in M\}$ (distance of $x$ to $M$),

$$|\rho(x,M)-\rho(z,M)|\le\rho(x,z), \qquad x,z\in S,\ M\subset S, \qquad (13.1)$$

$M^\varepsilon := \{x\in S : \rho(x,M)<\varepsilon\}$, $\varepsilon>0$ (parallel set of $M$ at distance $\varepsilon$),

$x_n\to x :\iff \rho(x_n,x)\to0$ (convergence in $S$),

$(x_n)$ Cauchy sequence $:\iff \lim_{m,n\to\infty}\rho(x_n,x_m) = 0$,

$(S,\rho)$ complete $:\iff$ each Cauchy sequence has a limit in $S$,

$M$ dense (in $S$) $:\iff \overline M = S$,

$(S,\rho)$ separable $:\iff \exists\, M\subset S$: $M$ countable and dense.


$\mathcal{O}_0\subset\mathcal{O}$ base of $\mathcal{O}$ $:\iff$ each $O\in\mathcal{O}$ is a union of sets of $\mathcal{O}_0$,

$\mathcal{U}\subset\mathcal{O}$ open cover of $M\subset S$ $:\iff M\subset\bigcup\{O : O\in\mathcal{U}\}$,

$M\subset S$ compact $:\iff$ each open cover of $M$ has a finite subcover,

$N\subset S$ $\varepsilon$-net for $M$ $:\iff \forall\, x\in M\ \exists\, y\in N : \rho(x,y)<\varepsilon$,

$M\subset S$ totally bounded $:\iff \forall\,\varepsilon>0$: $M$ has a finite $\varepsilon$-net.

13.3 Theorem (Separability, countable base, countable subcover)

The following assertions are equivalent:

a) $(S,\rho)$ is separable,

b) $S$ (i.e., $\mathcal{O}$) has a countable base,

c) each open cover of each subset of $S$ has a countable subcover.


13.4 Theorem (Relative compactness)

For a set $M\subset S$, the following assertions are equivalent:

a) $\overline M$ is compact ($:\iff M$ is relatively compact),

b) each sequence in $M$ has a convergent subsequence (limit possibly $\notin M$),

c) $M$ is totally bounded and $\overline M$ is complete.

$M\subset S$ nowhere dense $:\iff (\overline M)^\circ = \emptyset$.

13.5 Theorem (Baire)

Let $(S,\rho)$ be a complete metric space. If $S = \bigcup_{n=1}^\infty A_n$, then $(\overline{A_n})^\circ\ne\emptyset$ for at least one $n$.

I.e., $S$ cannot be a countable union of nowhere dense sets.


Let

$$C_b := \{f : S\to\mathbb{R} : f \text{ bounded and continuous}\}, \qquad C_{b0} := \{f\in C_b : f \text{ uniformly continuous}\}.$$

$f$ uniformly continuous $:\iff \forall\,\varepsilon>0\ \exists\,\delta>0\ \forall\, x,y\in S$: $\rho(x,y)<\delta \Longrightarrow |f(x)-f(y)|<\varepsilon$.

Let

$$\mathcal{P} := \mathcal{P}(\mathcal{B}) := \{P : \mathcal{B}\to[0,1] \mid P \text{ probability measure}\}.$$

13.6 Definition and Theorem (Separating class)

$\mathcal{M}\subset\mathcal{B}$ is a separating class for $\mathcal{P}$ if:

$$\forall\, P,Q\in\mathcal{P}: \text{ if } P(A) = Q(A)\ \forall\, A\in\mathcal{M}, \text{ then } P = Q.$$

The systems $\mathcal{O}$ and $\mathcal{A}$ of open resp. closed sets are separating classes.

Proof: Uniqueness theorem for measures.


13.7 Theorem (The integrals $\int f\,dP$, $f\in C_{b0}$, determine $P$)

Let $P,Q\in\mathcal{P}$. Then

$$P = Q \iff \int f\,dP = \int f\,dQ \quad \forall\, f\in C_{b0}.$$

Proof of "$\Longleftarrow$": Let $A\in\mathcal{A}$, $\varepsilon>0$, and

$$f_\varepsilon(x) := \max\Big(0,\ 1-\frac{\rho(x,A)}{\varepsilon}\Big), \qquad x\in S.$$

Then $0\le f_\varepsilon\le1$ and

$$|f_\varepsilon(x)-f_\varepsilon(y)|\le\frac{\rho(x,y)}{\varepsilon} \qquad \text{(use (13.1))}$$

$\Longrightarrow f_\varepsilon\in C_{b0}$. Furthermore, $\mathbf{1}_A\le f_\varepsilon\le\mathbf{1}_{A^\varepsilon}$. It follows that

$$P(A) = \int\mathbf{1}_A\,dP \le \int f_\varepsilon\,dP = \int f_\varepsilon\,dQ \le \int\mathbf{1}_{A^\varepsilon}\,dQ = Q(A^\varepsilon).$$

Letting $\varepsilon\downarrow0$ and using that $A$ is closed gives $A^\varepsilon\downarrow A$, whence $P(A)\le Q(A)$.

In the same way, $Q(A)\le P(A)$. √


13.8 Example (The space $C[0,1]$)

Let

$$S := C := C[0,1] := \{x : [0,1]\to\mathbb{R} : x \text{ continuous}\},$$

$$\|x\| := \sup_{0\le t\le1}|x(t)|,\ x\in C, \qquad \rho(x,y) := \|x-y\| = \max_{0\le t\le1}|x(t)-y(t)|.$$

$\rho(x_n,x)\to0 \iff$ uniform convergence of $x_n$ to $x$.

$(C,\rho)$ is separable, since the set

$$\Big\{[0,1]\ni t\mapsto\sum_{k=0}^n a_kt^k \ \Big|\ n\in\mathbb{N}_0,\ a_0,\dots,a_n\in\mathbb{Q}\Big\}$$

of polynomials with coefficients in $\mathbb{Q}$ is countable and dense in $C$

(Weierstraß' approximation theorem).


$(C,\rho)$ is complete:

Let $(x_n)$ be a Cauchy sequence in $C$. Then $\varepsilon_n := \sup_{m\ge n}\|x_n-x_m\|\to0$ as $n\to\infty$.

Thus, for fixed $t\in[0,1]$, $(x_n(t))$ is a Cauchy sequence in $\mathbb{R}$. Since $\mathbb{R}$ is complete,

$$x(t) := \lim_{n\to\infty}x_n(t)$$

exists. Notice that $|x_n(t)-x_m(t)|\le\varepsilon_n$ if $m\ge n$. Letting $m\to\infty$ gives

$$|x_n(t)-x(t)|\le\varepsilon_n \ \Longrightarrow\ \lim_{n\to\infty}\|x_n-x\| = 0.$$

It follows that $x\in C$. (why?)


Let $z_n(t) := nt\,\mathbf{1}_{[0,1/n]}(t) + (2-nt)\,\mathbf{1}_{(1/n,2/n]}(t)$, $0\le t\le1$.

[Figure: the tent function $z_n$, rising from 0 at $t=0$ to 1 at $t=1/n$ and back to 0 at $t=2/n$.]

Let $\varepsilon>0$. The sequence $(\varepsilon z_n)_{n\ge1}$ has no convergent subsequence, since $\rho(\varepsilon z_{n_k},z)\to0$ implies $z\equiv0$ (why?), but $\rho(\varepsilon z_{n_k},0) = \varepsilon$ for each $k$.

Consequence: $\overline{B(0,\varepsilon)}$ is not compact

$\Longrightarrow$ no closed ball $\overline{B(x,\varepsilon)}$ is compact (consider $x+\varepsilon z_n$)

$\Longrightarrow$ each compact set is nowhere dense

$\Longrightarrow$ $C$ is not $\sigma$-compact (a countable union of compact sets), by Baire's theorem 13.5.


For $k\in\mathbb{N}$ and $0\le t_1<\dots<t_k\le1$, let

$$\pi_{t_1,\dots,t_k} : C\to\mathbb{R}^k, \qquad x\mapsto\pi_{t_1,\dots,t_k}(x) := (x(t_1),\dots,x(t_k)).$$

The mappings $\pi_{t_1,\dots,t_k}$, $k\in\mathbb{N}$, $t_1,\dots,t_k\in[0,1]$, are called natural projections. Let

$$C_f := \big\{\pi_{t_1,\dots,t_k}^{-1}(H) \mid k\in\mathbb{N},\ 0\le t_1<\dots<t_k\le1,\ H\in\mathcal{B}^k\big\}$$

be the system of finite-dimensional sets.

We have $C_f\subset\mathcal{B}$. (why?)

Claim: $C_f$ is a field (algebra).

To show: (i) $S\in C_f$, (ii) $A\in C_f\Rightarrow A^c\in C_f$, (iii) $A,B\in C_f\Rightarrow A\cap B\in C_f$.

(i): $S = \pi_1^{-1}(\mathbb{R})\in C_f$. √

(ii): $C_f\ni A = \pi_{t_1,\dots,t_k}^{-1}(H)$, $H\in\mathcal{B}^k \Longrightarrow A^c = \pi_{t_1,\dots,t_k}^{-1}(\mathbb{R}^k\setminus H)\in C_f$. √

(iii): $C_f$ is a $\pi$-system (a bit tricky!).


(iii): For nonempty, finite $N\subset[0,1]$ and $n := |N|$, let

$$\pi_N : C\to\mathbb{R}^n, \qquad x\mapsto\pi_N(x) := (x(u_1),\dots,x(u_n)),$$

where $N = \{u_1,\dots,u_n\}$, $u_1<\dots<u_n$.

If $\emptyset\ne I\subset\{1,\dots,n\}$, $I = \{i_1,\dots,i_\ell\}$, $i_1<\dots<i_\ell$, then

$$\pi_I^n : \mathbb{R}^n\to\mathbb{R}^\ell, \qquad (v_1,\dots,v_n)\mapsto\pi_I^n(v_1,\dots,v_n) := (v_{i_1},\dots,v_{i_\ell})$$

$$\Longrightarrow\ \pi_I^n\circ\pi_N : C\to\mathbb{R}^\ell, \qquad x\mapsto\pi_I^n(x(u_1),\dots,x(u_n)) = (x(u_{i_1}),\dots,x(u_{i_\ell})).$$

Now, let $A := \pi_{t_1,\dots,t_k}^{-1}(K)$, $B := \pi_{s_1,\dots,s_\ell}^{-1}(L)\in C_f$ ($K\in\mathcal{B}^k$, $L\in\mathcal{B}^\ell$).

Put $M := \{t_1,\dots,t_k\}$, $N := \{s_1,\dots,s_\ell\}$, and let $n := |M\cup N|$, say $M\cup N = \{u_1,\dots,u_n\}$ with $u_1<\dots<u_n$. Choose $I\subset\{1,\dots,n\}$ with $M = \{u_i : i\in I\}$ and $J\subset\{1,\dots,n\}$ with $N = \{u_j : j\in J\}$. Then

$$\pi_{t_1,\dots,t_k} = \pi_I^n\circ\pi_{M\cup N}, \qquad \pi_{s_1,\dots,s_\ell} = \pi_J^n\circ\pi_{M\cup N}.$$


Memo: $A = \pi_{t_1,\dots,t_k}^{-1}(K)$, $B = \pi_{s_1,\dots,s_\ell}^{-1}(L)\in C_f$ ($K\in\mathcal{B}^k$, $L\in\mathcal{B}^\ell$).

Memo: $\pi_{t_1,\dots,t_k} = \pi_I^n\circ\pi_{M\cup N}$, $\pi_{s_1,\dots,s_\ell} = \pi_J^n\circ\pi_{M\cup N}$.

It follows that

$$A\cap B = \pi_{t_1,\dots,t_k}^{-1}(K)\cap\pi_{s_1,\dots,s_\ell}^{-1}(L) = (\pi_I^n\circ\pi_{M\cup N})^{-1}(K)\cap(\pi_J^n\circ\pi_{M\cup N})^{-1}(L)$$

$$= \pi_{M\cup N}^{-1}\big((\pi_I^n)^{-1}(K)\big)\cap\pi_{M\cup N}^{-1}\big((\pi_J^n)^{-1}(L)\big) = \pi_{M\cup N}^{-1}\Big(\underbrace{(\pi_I^n)^{-1}(K)\cap(\pi_J^n)^{-1}(L)}_{\in\,\mathcal{B}^n}\Big)\in C_f. \ \checkmark$$


Memo: $C_f = \{\pi_{t_1,\dots,t_k}^{-1}(H)\mid k\in\mathbb{N},\ 0\le t_1<\dots<t_k\le1,\ H\in\mathcal{B}^k\}$; $C_f\subset\mathcal{B}$; $C_f$ is a field.

Furthermore,

$$\overline{B(x,\varepsilon)} = \bigcap_{r\in\mathbb{Q}\cap[0,1]}\{y\in C : |y(r)-x(r)|\le\varepsilon\} = \bigcap_{r\in\mathbb{Q}\cap[0,1]}\pi_r^{-1}\big([x(r)-\varepsilon, x(r)+\varepsilon]\big)\in\sigma(C_f)$$

$$\Longrightarrow\ B(x,\varepsilon) = \bigcup_{n=1}^\infty\overline{B(x,\varepsilon-1/n)}\in\sigma(C_f).$$

If $M := \{x_1,x_2,\dots\}$ is dense in $C$, then

$$\sigma\big(\{B(x_j,\varepsilon) : x_j\in M,\ \varepsilon\in\mathbb{Q}_{>0}\}\big) = \mathcal{B}.$$

Therefore, $\mathcal{B} = \sigma(C_f)$, and $C_f$ is a separating class.


13.9 Example ($S = \mathbb{R}^\infty$)

Let $S := \mathbb{R}^\infty := \{x = (x_n)_{n\ge1} : x_n\in\mathbb{R}\ \forall\, n\ge1\}$,

$$\rho(x,y) := \sum_{k=1}^\infty\frac{\min(1,|x_k-y_k|)}{2^k}.$$

$(S,\rho)$ is a separable and complete metric space. For $x^n = (x_j^n)_{j\ge1}$,

$$\rho(x^n,x)\to0 \iff x_j^n\to x_j\ \forall\, j\ge1.$$

For $k\in\mathbb{N}$, let

$$\pi_k : S\to\mathbb{R}^k, \qquad x\mapsto\pi_k(x) := (x_1,\dots,x_k).$$

For $x\in S$, $\varepsilon>0$, $k\in\mathbb{N}$, let

$$N_{k,\varepsilon}(x) := \pi_k^{-1}\Big(\bigtimes_{j=1}^k(x_j-\varepsilon, x_j+\varepsilon)\Big) = \{y\in S : |y_j-x_j|<\varepsilon,\ j = 1,\dots,k\}\in\mathcal{O}.$$


The class

$$\mathcal{N} := \{N_{k,\varepsilon}(x) : x\in S,\ \varepsilon>0,\ k\in\mathbb{N}\}$$

is a base of $\mathcal{O}$. Let

$$\mathbb{R}^\infty_f := \{\pi_k^{-1}(H) : k\in\mathbb{N},\ H\in\mathcal{B}^k\}.$$

$\mathbb{R}^\infty_f$ is a field satisfying $\mathcal{B} = \sigma(\mathbb{R}^\infty_f) = \sigma(\mathcal{N})$.

$\mathbb{R}^\infty_f$ is a separating class.

$\mathbb{R}^\infty$ is not $\sigma$-compact (in contrast to $\mathbb{R}^d$).

We have

$$\overline A \text{ compact} \iff \forall\, k\in\mathbb{N}: \{x_k : x = (x_j)_{j\ge1}\in A\} \text{ is a bounded set in } \mathbb{R}.$$

(Exercise!)


14 Weak convergence in metric spaces

14.1 Definition (Weak convergence)

Let $(S,\rho)$ be a metric space, $P, P_1, P_2,\dots\in\mathcal{P}$.

$$P_n\xrightarrow{\mathcal{D}}P \text{ as } n\to\infty \ :\iff\ \lim_{n\to\infty}\int f\,dP_n = \int f\,dP \quad \forall\, f\in C_b.$$

Wording: $P_n$ converges weakly to $P$.

Notice that $P_n\xrightarrow{\mathcal{D}}P$ and $P_n\xrightarrow{\mathcal{D}}Q$ implies $P = Q$. (why?)

14.2 Example Let $\delta_z$ be the Dirac measure in $z\in S$, and let $x_0, x_1, x_2,\dots\in S$. We then have

$$\delta_{x_n}\xrightarrow{\mathcal{D}}\delta_{x_0} \iff x_n\to x_0.$$

Proof: "$\Longleftarrow$": Let $x_n\to x_0$. If $f\in C_b$ then

$$\int f\,d\delta_{x_n} = f(x_n)\to f(x_0) = \int f\,d\delta_{x_0}.$$


Memo: $\delta_{x_n}\xrightarrow{\mathcal{D}}\delta_{x_0} \iff x_n\to x_0$

"$\Longrightarrow$": Suppose $x_n\not\to x_0$. Then there is an $\varepsilon>0$ with $\rho(x_n,x_0)>\varepsilon$ for infinitely many $n$. Let

$$f_\varepsilon(x) := \max\Big(0,\ 1-\frac{\rho(x,x_0)}{\varepsilon}\Big), \qquad x\in S$$

(cf. proof of Thm. 13.7, putting $A = \{x_0\}$). Then

$f_\varepsilon\in C_b$, $\quad f_\varepsilon(x_0) = 1$, $\quad f_\varepsilon(x_n) = 0$ for infinitely many $n$.

$$\Longrightarrow\ \int f_\varepsilon\,d\delta_{x_n} = f_\varepsilon(x_n)\not\to f_\varepsilon(x_0) = \int f_\varepsilon\,d\delta_{x_0} \ \Longrightarrow\ \delta_{x_n}\stackrel{\mathcal{D}}{\not\longrightarrow}\delta_{x_0}.$$


Memo: $P_n\xrightarrow{\mathcal{D}}P :\iff \int f\,dP_n\to\int f\,dP\ \forall\, f\in C_b$

14.3 Theorem (Portmanteau)

The following assertions are equivalent:

a) $P_n\xrightarrow{\mathcal{D}}P$,

b) $\int f\,dP_n\to\int f\,dP\ \forall\, f\in C_{b0}$,

c) $\limsup_{n\to\infty}P_n(A)\le P(A)\ \forall\, A\in\mathcal{A}$,

d) $\liminf_{n\to\infty}P_n(O)\ge P(O)\ \forall\, O\in\mathcal{O}$,

e) $\lim_{n\to\infty}P_n(B) = P(B)\ \forall\, B\in\mathcal{C}(P) := \{C\in\mathcal{B} : P(\partial C) = 0\}$.

A set $B\in\mathcal{B}$ with the property $P(\partial B) = 0$ is called a $P$-continuity set.

Proof: (largely follows the proof in the case $S = \mathbb{R}^d$, cf. 6.4).

b) $\Longrightarrow$ c): Use $\mathbf{1}_A\le f_\varepsilon\le\mathbf{1}_{A^\varepsilon}$, where $f_\varepsilon$ was defined in the proof of 13.7.

e) $\Longrightarrow$ a): Let $f\in C_b$. If $|f|<L$ then $0<\big(\frac fL+1\big)\cdot\frac12<1$. By linearity of the integral, w.l.o.g. $0<f<1$.


Memo: To show: $\int f\,dP_n\to\int f\,dP$.

$$\int f\,dP_n = \int_0^1 t\,P_n^f(dt) \quad \text{(transformation formula)}$$

$$= \int_0^1\Big(\int_0^t 1\,\lambda^1(du)\Big)P_n^f(dt) = \int_0^1\Big(\int_u^1 P_n^f(dt)\Big)\lambda^1(du) \quad \text{(Tonelli's theorem)}$$

$$= \int_0^1 P_n(\{f>u\})\,du. \qquad (\{f>u\} = \{x\in S : f(x)>u\})$$

Likewise,

$$\int f\,dP = \int_0^1 P(\{f>u\})\,du.$$

$f$ continuous $\Longrightarrow \partial\{f>u\}\subset\{f=u\}$ (!) $\Longrightarrow P(\partial\{f>u\})\le P(\{f=u\})$.

$P(\{f=u\}) = 0$ with at most countably many exceptions. (why?)

e) $\Longrightarrow P_n(\{f>u\})\to P(\{f>u\})$ $\lambda^1$-almost everywhere.

Dominated convergence $\Longrightarrow$ a).


14.4 Theorem (Criterion for weak convergence I)

Let $P\in\mathcal{P}$ and let $\mathcal{M}_P\subset\mathcal{B}$ be a $\pi$-system (i.e., $\mathcal{M}_P$ is closed with respect to intersections). If each open set is a countable union of sets of $\mathcal{M}_P$, then:

$$P_n(A)\to P(A)\ \forall\, A\in\mathcal{M}_P \ \Longrightarrow\ P_n\xrightarrow{\mathcal{D}}P.$$

Proof: We show $P(O)\le\liminf_{n\to\infty}P_n(O)$ for all $O\in\mathcal{O}$; then 14.3 d) gives the assertion.

Let $A_1,\dots,A_k\in\mathcal{M}_P$. Since $\mathcal{M}_P$ is a $\pi$-system, the inclusion–exclusion formula yields

$$P_n\Big(\bigcup_{j=1}^k A_j\Big)\to P\Big(\bigcup_{j=1}^k A_j\Big). \qquad (\star)$$

Let $O\in\mathcal{O}$. By assumption, $O = \bigcup_{j=1}^\infty A_j$, where $A_1, A_2,\dots\in\mathcal{M}_P$. Fix $\varepsilon>0$ and choose $k$ such that

$$P(O)-\varepsilon\le P\Big(\bigcup_{j=1}^k A_j\Big).$$

$(\star)\Longrightarrow$

$$P(O)-\varepsilon\le\lim_{n\to\infty}P_n\Big(\bigcup_{j=1}^k A_j\Big)\le\liminf_{n\to\infty}P_n(O).$$

$\varepsilon\downarrow0\Longrightarrow$ assertion.


14.5 Theorem (Criterion for weak convergence II)

Let $(S,\rho)$ be separable, $P\in\mathcal{P}$, and let $\mathcal{M}_P\subset\mathcal{B}$ be a $\pi$-system. Suppose further that

$$\forall\, x\in S\ \forall\,\varepsilon>0\ \exists\, A\in\mathcal{M}_P:\ x\in A^\circ\subset A\subset B(x,\varepsilon). \qquad (\star)$$

If $P_n(A)\to P(A)$ for each $A\in\mathcal{M}_P$, then $P_n\xrightarrow{\mathcal{D}}P$.

Proof: Let $\emptyset\ne O\in\mathcal{O}$. The assumption gives

$$\forall\, x\in O\ \exists\, A_x\in\mathcal{M}_P:\ x\in A_x^\circ\subset A_x\subset O \ \Longrightarrow\ O = \bigcup_{x\in O}A_x^\circ \quad \text{(open cover!)}$$

Since $(S,\rho)$ is separable, Thm. 13.3 provides a countable subcover:

$$O = \bigcup_{j=1}^\infty A_{x_j}^\circ = \bigcup_{j=1}^\infty A_{x_j}.$$

14.4 $\Longrightarrow$ assertion.


Memo: $\mathcal{C}(P) = \{B\in\mathcal{B} : P(\partial B) = 0\}$.

Memo: $P_n\xrightarrow{\mathcal{D}}P \iff P_n(B)\to P(B)\ \forall\, B\in\mathcal{C}(P)$.

14.6 Definition (Convergence-determining class)

A system $\mathcal{M}\subset\mathcal{B}$ is called a convergence-determining class (CDC) $:\iff$

$$\forall\, P, P_1, P_2,\dots\in\mathcal{P}:\ P_n(A)\to P(A)\ \forall\, A\in\mathcal{M}\cap\mathcal{C}(P) \ \Longrightarrow\ P_n\xrightarrow{\mathcal{D}}P.$$

For $\mathcal{M}\subset\mathcal{B}$, $x\in S$, $\varepsilon>0$, put

$$\mathcal{M}_{x,\varepsilon} := \{A\in\mathcal{M} : x\in A^\circ\subset A\subset B(x,\varepsilon)\}, \qquad \partial\mathcal{M}_{x,\varepsilon} := \{\partial A : A\in\mathcal{M}_{x,\varepsilon}\}.$$


Memo: $\mathcal{M}$ CDC $:\iff \forall\, P,(P_n):\ P_n(A)\to P(A)\ \forall\, A\in\mathcal{M}\cap\mathcal{C}(P) \Rightarrow P_n\xrightarrow{\mathcal{D}}P$.

Memo: $\mathcal{M}_{x,\varepsilon} = \{A\in\mathcal{M} : x\in A^\circ\subset A\subset B(x,\varepsilon)\}$.

Memo: $(\star)$ $\forall\, x\in S\ \forall\,\varepsilon>0\ \exists\, A\in\mathcal{M}_P : x\in A^\circ\subset A\subset B(x,\varepsilon)$.

14.7 Theorem (Sufficient condition for a CDC)

Let $(S,\rho)$ be separable and let $\mathcal{M}\subset\mathcal{B}$ be a $\pi$-system satisfying

(i) $\forall\, x\in S\ \forall\,\varepsilon>0$: $\mathcal{M}_{x,\varepsilon}\ne\emptyset$,

(ii) $\forall\, x\in S\ \forall\,\varepsilon>0$: $\partial\mathcal{M}_{x,\varepsilon}$ contains $\emptyset$ or uncountably many disjoint sets.

Then $\mathcal{M}$ is a CDC.

Proof: Fix $P\in\mathcal{P}$ and put $\mathcal{M}_P := \mathcal{M}\cap\mathcal{C}(P)$. Since

$$\partial(A\cap B)\subset\partial A\cup\partial B, \quad (!)$$

$\mathcal{M}_P$ is a $\pi$-system. (i) and (ii) $\Longrightarrow \mathcal{M}_P$ satisfies condition $(\star)$ in 14.5.

Thus, $P_n(A)\to P(A)\ \forall\, A\in\mathcal{M}_P$ implies $P_n\xrightarrow{\mathcal{D}}P$, q.e.d.


Memo (14.7): $(S,\rho)$ separable, $\mathcal{M}$ a $\pi$-system; $\mathcal{M}$ is a CDC if (i) $\forall x\,\forall\varepsilon$: $\mathcal{M}_{x,\varepsilon}\ne\emptyset$ and (ii) $\forall x\,\forall\varepsilon$: $\partial\mathcal{M}_{x,\varepsilon}$ contains $\emptyset$ or uncountably many disjoint sets; $\mathcal{M}_{x,\varepsilon} = \{A\in\mathcal{M} : x\in A^\circ\subset A\subset B(x,\varepsilon)\}$.

14.8 Examples

a) Let $\mathcal{M}$ be the class of finite intersections of open balls. Since

$$\partial B(x,r)\subset\{y : \rho(x,y) = r\}, \quad (!)$$

$\mathcal{M}$ is a CDC by 14.7.

b) $S = \mathbb{R}^d$: $\mathcal{M} := \{(-\infty,x] : x\in\mathbb{R}^d\}$ is a CDC (cf. 6.4 e)).

c) $S = \mathbb{R}^\infty$: $\mathbb{R}^\infty_f = \{\pi_k^{-1}(H) : k\in\mathbb{N},\ H\in\mathcal{B}^k\}$, $\pi_k(x) := (x_1,\dots,x_k)$.

$\mathbb{R}^\infty_f$ is a separating class, cf. 13.9. $\mathbb{R}^\infty_f$ is also a CDC (Exercise!).

d) $S = C = C[0,1]$. By 13.8, a separating class is

$$C_f = \{\pi_{t_1,\dots,t_k}^{-1}(H) : k\in\mathbb{N},\ 0\le t_1<\dots<t_k\le1,\ H\in\mathcal{B}^k\}.$$


Claim: $C_f$ is not a CDC. Let $z_n\in C$ be the tent functions from Example 13.8.

Let $P_n := \delta_{z_n}$, $P := \delta_0$. We have $P_n\stackrel{\mathcal{D}}{\not\longrightarrow}P$, since $z_n\not\to0$ (cf. 14.2).

But: Let $k\in\mathbb{N}$ and $0\le t_1<\dots<t_k\le1$ be fixed. We have

$$\pi_{t_1,\dots,t_k}(z_n) = (0,0,\dots,0) = \pi_{t_1,\dots,t_k}(0) \quad \text{if } \frac2n < \begin{cases} t_1, & \text{if } t_1>0,\\ t_2, & \text{if } 0 = t_1<t_2. \end{cases}$$

I.e., $P_n(A)\to P(A)$ for each $A\in C_f$. (why?) Thus, $C_f$ is not a CDC.


14.9 Theorem (Subsequence criterion)

Let $P, P_1, P_2,\dots\in\mathcal{P}$. Then

$$P_n\xrightarrow{\mathcal{D}}P \iff \text{each subsequence } (P_{n_k}) \text{ contains a further subsequence } (P_{n_k'}) \text{ such that } P_{n_k'}\xrightarrow{\mathcal{D}}P.$$

Proof: "$\Longleftarrow$": Suppose $P_n\stackrel{\mathcal{D}}{\not\longrightarrow}P$. Then there exist $f\in C_b$, $\varepsilon>0$ and a subsequence $(P_{n_k})$ such that

$$\Big|\int f\,dP_{n_k} - \int f\,dP\Big| > \varepsilon \quad \text{for all } k.$$

No subsequence of $(P_{n_k})$ can then converge weakly to $P$, q.e.d.


Let $(S,\rho)$, $(S',\rho')$ be metric spaces and let $h : S\to S'$ be $(\mathcal{B},\mathcal{B}')$-measurable.

If $P\in\mathcal{P}$, then

$$P^h := P\circ h^{-1} =: Ph^{-1}, \qquad P^h(B') := P(h^{-1}(B')), \quad B'\in\mathcal{B}',$$

is a probability measure on $\mathcal{B}'$. Do we have $P_n\xrightarrow{\mathcal{D}}P \Longrightarrow P_n^h\xrightarrow{\mathcal{D}}P^h$?

14.10 Theorem (Continuous mapping theorem, CMT)

Let $C(h)$ be the set of points of continuity of $h$. We then have:

If $P_n\xrightarrow{\mathcal{D}}P$ and $P(C(h)) = 1$, then $P_n^h\xrightarrow{\mathcal{D}}P^h$.

Proof: Let $A'\in\mathcal{A}'$. We have (!, cf. proof of 6.6)

$$\overline{h^{-1}(A')}\subset(S\setminus C(h))\cup h^{-1}(A') \ \Longrightarrow$$

$$\limsup_{n\to\infty}P_n\big(h^{-1}(A')\big)\le\limsup_{n\to\infty}P_n\big(\overline{h^{-1}(A')}\big)\le P\big(\overline{h^{-1}(A')}\big)\le P(S\setminus C(h)) + P\big(h^{-1}(A')\big) = 0 + P\big(h^{-1}(A')\big).$$

14.3 c) $\Longrightarrow$ assertion.


Memo: $\mathbb{R}^\infty_f = \{\pi_k^{-1}(H) : k\in\mathbb{N},\ H\in\mathcal{B}^k\}$, $\pi_k(x) = (x_1,\dots,x_k)$

14.11 Example

Let $S = \mathbb{R}^\infty$. Then

$$P_n\xrightarrow{\mathcal{D}}P \iff \forall\, k\in\mathbb{N}:\ P_n\pi_k^{-1}\xrightarrow{\mathcal{D}}P\pi_k^{-1}.$$

Proof: "$\Longrightarrow$" follows from the CMT, since $\pi_k$ is continuous.

"$\Longleftarrow$": If $H\in\mathcal{B}^k$ then (!) $\partial\pi_k^{-1}(H) = \pi_k^{-1}(\partial H)$.

Suppose $A := \pi_k^{-1}(H)\in\mathcal{C}(P)$ $\Longrightarrow$

$$P\big(\pi_k^{-1}(\partial H)\big) = P\big(\partial\pi_k^{-1}(H)\big) = P(\partial A) = 0$$

$\Longrightarrow H\in\mathcal{C}(P\pi_k^{-1})$. The assumption gives

$$P_n(A)\to P(A) \quad \forall\, A\in\mathbb{R}^\infty_f\cap\mathcal{C}(P).$$

14.8 c) $\Longrightarrow$ assertion. ($\mathbb{R}^\infty_f$ is a CDC!)


14.12 Example Let $S = C[0,1]$. Then

$$P_n\xrightarrow{\mathcal{D}}P \ \Longrightarrow\ \forall\, k\ge1,\ \forall\, t_1,\dots,t_k:\ P_n\pi_{t_1,\dots,t_k}^{-1}\xrightarrow{\mathcal{D}}P\pi_{t_1,\dots,t_k}^{-1}.$$

Warning! The converse "$\Longleftarrow$" does not hold, cf. Example 14.8 d).


15 Convergence in distribution

15.1 Random elements

Let $(\Omega,\mathcal{A},\mathbb{P})$ be a probability space, $(S,\rho)$ a metric space, $\mathcal{B} := \sigma(\mathcal{O})$.

An $(\mathcal{A},\mathcal{B})$-measurable mapping $X : \Omega\to S$ is called a random element of $S$.

Manner of speaking:

$S = \mathbb{R}$: random variable,

$S = \mathbb{R}^d$: random vector,

$S = \mathbb{R}^\infty$: random sequence,

$S = C[0,1]$: random function.

The distribution of $X$ is the probability measure $\mathbb{P}^X = \mathbb{P}\circ X^{-1} = \mathbb{P}X^{-1}$.

Canonical construction, given a probability measure $P$ on $S$:

$$(\Omega,\mathcal{A}) := (S,\mathcal{B}), \qquad X := \mathrm{id}_\Omega, \qquad \mathbb{P} := P.$$


15.2 Notations for random functions

Let $(\Omega,\mathcal{A},\mathbb{P})$ be a probability space and $X : \Omega\to C = C[0,1]$ a random function.

For fixed $\omega\in\Omega$, $X(\omega)$ is a continuous function on $[0,1]$:

$$X(\omega)(t) =: X_t(\omega) =: X(t,\omega), \qquad 0\le t\le1.$$

$$X(t) := X_t : \Omega\to\mathbb{R}, \qquad \omega\mapsto X(t)(\omega) := X_t(\omega) := X(t,\omega); \qquad X(t) = \pi_t\circ X.$$

For $0\le t_1<\dots<t_k\le1$: $(X(t_1),\dots,X(t_k)) = \pi_{t_1,\dots,t_k}\circ X$.

The distributions of $(X(t_1),\dots,X(t_k))$, where $k\ge1$, $0\le t_1<\dots<t_k\le1$, are called the finite-dimensional distributions ("fidis") of $X$.

Notice that $X$ is a random function (i.e., $(\mathcal{A},\mathcal{B})$-measurable) if, and only if, $X(t)$ is a random variable (i.e., $(\mathcal{A},\mathcal{B}^1)$-measurable) for each $t\in[0,1]$ (Exercise!).

Norbert Henze, KIT 15.2

Page 222: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Convergence in Distribution

15.3 Example (The partial Sum Process)

Let Z1, Z2, . . . be i.i.d. R-valued random variables with E(Z21 ) = 1, E(Z1) = 0.

Let S0 := 0, Sk := Z1 + . . .+ Zk, k ≥ 1. For t ∈ [0, 1], let

Xn(t) :=S⌊nt⌋√n

+ (nt− ⌊nt⌋) · Z⌊nt⌋+1√n

.

The random function Xn is called n-th partial sum process of (Zn)n≥1.

0

1

2

−1

−2

0.5 1.0t

Xn(t)

Realizations of X100 (Here, P(Z1 = 1) = P(Z1 = −1) = 1/2)

Notice that Xn(1) =Sn√n

D−→ N(0, 1). (why?)

Norbert Henze, KIT 15.3

Page 223: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Convergence in Distribution

15.4 Definition (Convergence in Distribution)

Let X,X1, X2, . . . be random elements in S having distributions P = PX ,P1 = PX1 , P2 = PX2 , . . ..

XnD−→ X :⇐⇒ Pn

D−→ P

⇐⇒ Ef(Xn)→ Ef(X) ∀f ∈ Cb

Notice that only distributions matter.

Underlying probability space remains”offstage“.

A set A ∈ B is called an X-continuity set :⇐⇒ P(X ∈ ∂A) = 0.

Notice that A is an X-continuity set if, and only if, A ∈ C(PX) (= C(P )).

Norbert Henze, KIT 15.4

Page 224: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Convergence in Distribution

15.5 Theorem (Portmanteau)

The following assertions are equivalent:

a) XnD−→ X,

b) Ef(Xn)→ Ef(X) ∀f ∈ Cb0,c) lim supn→∞ P(Xn ∈ A) ≤ P(X ∈ A) ∀A ∈ A,

d) lim infn→∞ P(Xn ∈ O) ≥ P(X ∈ O) ∀O ∈ O,e) limn→∞ P(Xn ∈ B) = P(X ∈ B) for each X-continuity set B.

Remark: There are”hybrid“ notations, like Xn

D−→ P , PnD−→ X.

15.6 Theorem (Continuous mapping theorem, CMT)

Let (S′, ρ′) be a further metric space, h : S → S′ measurable. We then have:

If XnD−→ X and P(X ∈ C(h)) = 1, then h(Xn)

D−→ h(X).

Norbert Henze, KIT 15.5

Page 225: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Convergence in Distribution

15.7 Example (Random elements of C)

Let X,X1, X2, . . . be random elements of C[0, 1]. If XnD−→ X, then

∀k ≥ 1, ∀0 ≤ t1 < . . . < tk ≤ 1 : (Xn(t1), . . . , Xn(tk))D−→ (X(t1), . . . , X(tk)).

Proof: The function h = πt1,...,tk : C → Rk is continuous.

15.8 Definition (Convergence in Probability)

Let X,X1, X2, . . . be random elements of S, a ∈ S.

XnP−→ X :⇐⇒ lim

n→∞P(ρ(Xn, X) ≥ ε) = 0 ∀ε > 0,

XnP−→ a :⇐⇒ lim

n→∞P(ρ(Xn, a) ≥ ε) = 0 ∀ε > 0.

Attention! ρ(Xn, X) must be a random variable, i.e., measurable w.r.t. B ⊗ B.If (S, ρ) is separable, then (Xn, X) is a random element of S × S. (!!)

Since ρ : S × S → R is continuous, this condition holds.

Norbert Henze, KIT 15.6

Page 226: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Convergence in Distribution

15.9 Theorem (Slutzy’s Lemma)

Let (Xn, Yn), n ≥ 1, be random elements of S ×S, X a random element of S.

If XnD−→ X and ρ(Xn, Yn)

P−→ 0, then YnD−→ X.

Proof: Let A ∈ A, ε > 0, Aε := x : ρ(x,A) ≤ ε. We have

P(Yn ∈ A) = P(Yn ∈ A, ρ(Xn, Yn) ≥ ε) + P(Yn ∈ A,ρ(Xn, Yn) < ε)

≤ P(ρ(Xn, Yn) ≥ ε) + P(Xn ∈ Aε).

XnD−→ X, Aε ∈ A =⇒ lim supn→∞ P(Yn ∈ A) ≤ P(X ∈ Aε).

ε ↓ 0 =⇒ Aε ↓ A =⇒

lim supn→∞

P(Yn ∈ A) ≤ P(X ∈ A). 15.5 c) =⇒ assertion.

15.10 Corollary If XnP−→ X then Xn

D−→ X.

Proof: Put Xn := X, Yn := Xn, n ≥ 1, in 15.9.

Norbert Henze, KIT 15.7

Page 227: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Relative compactness and tightness

16 Relative compactness and tightness

Let Q ⊂ P be a nonempty set of probability measures on B.

16.1 Definition (Relative compactness, tightness)

a) Q relatively compact :⇐⇒ ∀(Pn) ∈ QN ∃ subsequence (Pnk) ∃Q ∈ P :

Pnk

D−→ Q as k →∞.

b) Q tight :⇐⇒ ∀ ε > 0 ∃K ⊂ S, K compact: Q(K) ≥ 1− ε ∀Q ∈ Q.

16.2 Remark (Relative compactness is necessary for weak convergence)

If PnD−→ P then Q := Pn : n ∈ N is relatively compact.

Proof: Use the subsequence criterion 14.9.

16.3 Remark If X1, X2, . . . are random elements in S, then Xn : n ∈ N isrelatively compact (tight) if PXn : n ∈ N is relatively compact (tight).

Norbert Henze, KIT 16.1

Page 228: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Relative compactness and tightness

16.4 Definition (Fidi convergence)

Let S = C[0, 1] and P, P1, P2, . . . ∈ P .

PnDfidi−→ P :⇐⇒ Pnπ

−1t1,...,tk

D−→ Pπ−1t1,...,tk

∀k ≥ 1, ∀ 0 ≤ t1 < . . . < tk ≤ 1

(weak convergence of all finite-dimensional distributions).

If X,X1, X2, . . . are random elements of C, then

XnDfidi−→ X :⇐⇒ (Xn(t1), . . . , Xn(tk))

D−→ (X(t1), . . . , X(tk))

∀k ≥ 1, ∀ 0 ≤ t1 < . . . < tk ≤ 1.

Warning: Fidi convergence is a necessary but not sufficient condition for

PnD−→ P or Xn

D−→ X, cf. Example 14.12 and 15.7.

Norbert Henze, KIT 16.2

Page 229: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Relative compactness and tightness

16.5 Theorem (Fidi conv. and relative compactness imply PnD−→ P in C)

Let S = C = C[0, 1] and P, P1, P2, . . . ∈ P . Suppose

PnDfidi−→ P, (16.1)

Pn : n ∈ N relatively compact. (16.2)

Then PnD−→ P .

Proof: (16.2) =⇒ each subsequence (Pni) contains a further subsequence

(Pn′

i) with Pn′

i

D−→ Q for some Q ∈ P . Let k ∈ N, 0 ≤ t1 < . . . < tk ≤ 1.

CMT =⇒ Pn′

iπ−1t1,...,tk

D−→ Qπ−1t1,...,tk

,

(16.1) =⇒ Pn′

iπ−1t1,...,tk

D−→ Pπ−1t1,...,tk

,

=⇒ P (B) = Q(B) ∀B ∈ Cf . Cf separating class =⇒ P = Q.

Subsequence criterion 14.9 =⇒ PnD−→ P , q.e.d.

Norbert Henze, KIT 16.3

Page 230: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Relative compactness and tightness

16.6 Theorem (Existence of probability measures on C[0, 1])

Let S = C[0, 1]. Let P1, P2, . . . ∈ P . Suppose

Pn : n ∈ N is relatively compact, (16.3)

∀k ≥ 1, ∀ 0 ≤ t1 < . . . < tk ≤ 1 ∃ probability measure (16.4)

µt1,...,tk on Bk such that Pnπ−1t1,...,tk

D−→ µt1,...,tk .

Then there is a P ∈ P with Pπ−1t1,...,tk

= µt1,...,tk ∀k ≥ 1, ∀t1, . . . tk.

Proof: (16.3) =⇒ ∃ subsequence (Pni) ∃P ∈ P such that Pni

D−→ P .

Fix k and t1, . . . , tk. The CMT implies

Pniπ−1t1,...,tk

D−→ Pπ−1t1,...,tk

.

From (16.4), we have

Pniπ−1t1,...,tk

D−→ µt1,...,tk

=⇒ Pπ−1t1,...,tk

= µt1,...,tk , q.e.d.

Norbert Henze, KIT 16.4

Page 231: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Relative compactness and tightness

16.7 Theorem (Prokhorov)

a) Q ⊂ P is tight =⇒ Q relatively compact,

b) “⇐=“ holds if (S, ρ) is separable and complete.

Proof of b): Let Q be relatively compact. Fix ε > 0. Let On ∈ O, n ≥ 1, withOn ↑ S. Claim: ∃n ∈ N with P (On) > 1− ε ∀P ∈ Q. Proof (by contradiction):

Suppose ∀n ∃Pn ∈ Q with Pn(On) ≤ 1− ε. Assumption =⇒ ∃ subsequence

(Pnk) ∃Q ∈ P with Pnk

D−→ Q. Portmanteau theorem =⇒

for fixed n : Q(On) ≤ lim infk→∞

Pnk(On) ≤ lim inf

k→∞Pnk

(Onk) ≤ 1− ε.

But Q(On) ↑ 1 (why?), a contradiction (=⇒ claim).

For k ≥ 1, let Bk,j , j ≥ 1, be open balls of radius 1/k that cover S(separability!)

Claim =⇒ ∃nk such that P

(nk⋃

j=1

Bk,j

)> 1− ε

2k∀P ∈ Q.

Let M := ∩k≥1 (∪j≤nkBk,j) =⇒ M totally bounded. K :=M is complete

(since S is complete). 13.4 =⇒ K compact, and P (K) > 1− ε ∀P ∈ Q.

Norbert Henze, KIT 16.5

Page 232: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Relative compactness and tightness

Proof of a) (sketch). Let (Pn) be a sequence in Q. To show: ∃ subsequence

(Pnk) ∃P ∈ P with Pnk

D−→ P as k →∞. Let K1,K2, . . . ⊂ S be compactsets with Kj ⊂ Kj+1, j ≥ 1 , and

Pn(Kj) > 1− 1

j∀j ≥ 1, ∀n ≥ 1.

For each m ≥ 1, Kj has a finite 1/m-net Nj,m. The set

N :=

∞⋃

j=1

∞⋃

m=1

Nj,m

is countable, and we have ∪∞j=1Kj ⊂ N , i.e., ∪∞

j=1Kj is separable. By a

general result, there is a countable system O ⊂ O of open sets such that

∀x ∈ S ∀O ∈ O : x ∈( ∞⋃

j=1

Kj

)∩O ⇒ ∃G ∈ O : x ∈ G ⊂ G ⊂ O.

Let H be the system of all finite unions of sets of the type G ∩Kj , whereG ∈ O and j ≥ 1, enlarged by ∅. The system H is countable, and byCantor’s diagonal procedure, there is a subsequence (Pnk

) such that

Norbert Henze, KIT 16.6

Page 233: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Relative compactness and tightness

α(H) := limk→∞

Pnk(H)

exists for each H ∈ H. The aim is to construct P ∈ P such that

P (O) = supα(H) : H ∈ H, H ⊂ O, O ∈ O. (16.5)

Suppose P exists. Then, for H ∋ H ⊂ O,

α(H) = limk→∞

Pnk(H) ≤ lim inf

k→∞Pnk

(O).

(16.5) implies P (O) ≤ lim infk→∞ Pnk(O).

The Portmanteau theorem then gives Pnk

D−→ P . To construct P , put

β(O) := supH∈H,H⊂O

α(H), γ(M) := infO∈O,O⊃M

β(O), M ⊂ S.

Then γ is an outer measure on the class of all subsets of S. By Caratheodory’slemma, the restriction of γ to the system A∗ of γ-measurable sets is ameasure, denoted by µ. Show that A ⊂ A∗, and that µ is a probabilitymeasure. This measure µ is the desired P .

Norbert Henze, KIT 16.7

Page 234: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Relative compactness and tightness

16.8 Corollary If (S, ρ) is separable, any finite set Q ⊂ P is tight.

Norbert Henze, KIT 16.8

Page 235: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Weak convergence and tightness in C

17 Weak convergence and tightness in C

Let S = C = C[0, 1], P, P1, P2 . . . ∈ P . From 16.5 and 16.7, we have:

17.1 Theorem (Fidi convergence and tightness imply PnD−→ P in C)

If PnDfidi−→ P and the sequence Pn : n ∈ N is tight then Pn

D−→ P .

How to prove tightness in C?

17.2 Definition (Modulus of continuity)For x ∈ C[0, 1], the function wx : (0, 1]→ R, defined by

wx(δ) := w(x, δ) := sup|s−t|≤δ

|x(s)− x(t)|, 0 < δ ≤ 1,

is called modulus of continuity of x.

Notice that x is uniformly continuous if, and only if, limδ→0 wx(δ) = 0.

Norbert Henze, KIT 17.1

Page 236: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Weak convergence and tightness in C

17.3 Remark For x, y ∈ C and 0 < δ ≤ 1, we have:

|wx(δ)− wy(δ)| ≤ 2‖x− y‖ =⇒ w(x, δ) continuous in x.

Proof. Let δ ∈ (0, 1] and x, y ∈ C. If s, t ∈ [0, 1] with |s− t| ≤ δ, then

|x(s)− x(t)| ≤ |x(s)− y(s)|+ |y(s)− y(t)|+ |y(t)− x(t)|≤ ‖x− y‖+ wy(δ) + ‖x− y‖≤ wy(δ) + 2‖x − y‖.

Therefore,wx(δ) ≤ wy(δ) + 2‖x − y‖,

q.e.d.

Norbert Henze, KIT 17.2

Page 237: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Weak convergence and tightness in C

Memo: A ⊂ S relatively compact :⇐⇒ A compact .

17.4 Theorem (Arzela–Ascoli)A set A ⊂ C[0, 1] is relatively compact if, and only if,

supx∈A|x(0)| <∞ (uniform boundedness at 0) (17.1)

andlimδ→0

supx∈A

wx(δ) = 0 (uniform equicontinuity) (17.2)

17.5 Example : Let zn ∈ C as in 13.8.

0

0.5

1.0

0 0.5 1.0

zn(t)

t1n

wzn(δ) = 1, δ ≥ 1/n

supx∈Awx(δ) = 1, δ > 0

A := zn : n ∈ N not relatively compact since (17.2) is violated.

Norbert Henze, KIT 17.3

Page 238: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Weak convergence and tightness in C

Memo: supx∈A |x(0)| <∞ (17.1), limδ→0 supx∈Awx(δ) = 0 (17.2)

Proof of”(17.1),(17.2) =⇒ A compact“:

Choose k large enough that supx∈Awx(1/k) <∞. Since

|x(t)| ≤ |x(0)|+k∑

j=1

∣∣∣∣x(jt

k

)− x

((j − 1)t

k

) ∣∣∣∣,︸ ︷︷ ︸

≤ wx(1/k)we have

α := sup0≤t≤1

supx∈A|x(t)| <∞. (17.3)

We now use (17.2) and (17.3) to show that A is totally bounded.

Since C is complete, it then follows that A is compact (cf. 13.4 c)).

Fix ε > 0. Choose a finite ε-net H in [−α, α] ⊂ R. Choose k large enough thatwx(1/k) < ε for all x ∈ A. Let B be the finite set of (polygonal) functions thatare linear on each interval Ikj := [(j − 1)/k, j/k], 1 ≤ j ≤ k, and take valuesin H at the endpoints. If x ∈ A, then |x(j/k)| ≤ α =⇒ ∃y ∈ B with|x(j/k) − y(j/k)| < ε for j = 0, 1, . . . , k. If t ∈ Ikj then

∣∣y(j/k)− x(t)∣∣ ≤

∣∣y(j/k)− x(j/k)∣∣+∣∣x(j/k)− x(t)

∣∣ < 2ε.

Norbert Henze, KIT 17.4

Page 239: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Weak convergence and tightness in C

Memo: |x(j/k) − y(j/k)| < ε for j = 0, 1, . . . , k.

Memo: wx(1/k) < ε, x ∈ A, Ikj := [(j − 1)/k, j/k]

Likewise, for t ∈ Ikj ,∣∣∣∣y(j − 1

k

)− x(t)

∣∣∣∣ ≤∣∣∣∣y(j − 1

k

)− x

(j − 1

k

) ∣∣∣∣+∣∣∣∣x(j − 1

k

)− x(t)

∣∣∣∣ < 2ε.

Since y(t) is a convex combination of y((j − 1)/k) and y(j/k), there is aβ = β(t) ∈ [0, 1] with

y(t) = β · y(j − 1

k

)+ (1− β) · y

(j

k

)=⇒

|y(t)− x(t)| =

∣∣∣∣β ·(y

(j − 1

k

)− x(t)

)+ (1− β) ·

(y

(j

k

)− x(t)

)∣∣∣∣

≤ β ·∣∣∣∣y(j − 1

k

)− x(t)

∣∣∣∣ + (1− β) ·∣∣∣∣y(j

k

)− x(t)

∣∣∣∣≤ 2ε.

Thus, B is a finite 2ε-net for A, q.e.d.

Norbert Henze, KIT 17.5

Page 240: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Weak convergence and tightness in C

A picture”tells it all“:

t1

α

−α

••••••••

•••••••••

H

2ε>

1k

j−1k

jk

Ikj

• y ∈ B

x(t)

• • •

• •

• • • •

Norbert Henze, KIT 17.6

Page 241: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Weak convergence and tightness in C

Memo: supx∈A |x(0)| <∞ (17.1), limδ→0 supx∈Awx(δ) = 0 (17.2)

The proof of”A compact =⇒ (17.1), (17.2)“ is easy:

Since π0 : C → R is continuous and A is compact, π0(A) ⊂ R is compact andthus bounded, which is (17.1).

Let fn(x) := w(x, 1/n).

fn is continuous, and we have fn(x) ↓ 0 as n→∞.

Fix ε > 0. Let On := x : fn(x) < ε ∈ O.Then On ⊂ On+1, n ≥ 1, and S = ∪∞

n=1On.

Since A is compact, we have A ⊂ On for some n, which gives (17.2).

Norbert Henze, KIT 17.7

Page 242: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Weak convergence and tightness in C

Memo: A compact ⇐⇒ supx∈A |x(0)| <∞, limδ→0 supx∈Awx(δ) = 0

17.6 Theorem (Characterization of tightness in C[0, 1])

Let (Pn)n≥1 be a sequence in P . We then have:

Pn : n ≥ 1 tight

⇐⇒ a) ∀η > 0 ∃a ∃n0 ∀n ≥ n0 : Pn(x : |x(0)| ≥ a) ≤ η,b) ∀ε > 0 ∀η > 0 ∃δ ∈ (0, 1) ∃n0 ∀n ≥ n0 : Pn(x : wx(δ) ≥ ε) ≤ η.

Proof:”=⇒“: Let Pn : n ≥ 1 be tight. Fix η > 0.

Choose a compact set K ⊂ C such that Pn(K) > 1− η for each n ≥ 1.

Thm. 17.4 (Arzela–Ascoli) =⇒ ∃a > 0: K ⊂ x : |x(0)| < a.

=⇒ Pn(x : |x(0)| < a) > 1− η ∀n ≥ 1.

Fix ε > 0. Thm. 17.4 =⇒ ∃δ ∈ (0, 1) such that K ⊂ x : wx(δ) < ε =⇒

Pn(x : wx(δ) < ε) > 1− η ∀n ≥ 1.

Norbert Henze, KIT 17.8

Page 243: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Weak convergence and tightness in C

Memo: A compact ⇐⇒ supx∈A |x(0)| <∞, limδ→0 supx∈Awx(δ) = 0

Pn : n ≥ 1 tight

⇐⇒ a)∀η > 0 ∃a ∃n0 ∀n ≥ n0 : Pn(x : |x(0)| ≥ a) ≤ η,b) ∀ε > 0 ∀η > 0 ∃δ ∈ (0, 1) ∃n0 ∀n ≥ n0 : Pn(x : wx(δ) ≥ ε) ≤ η.

Proof:”⇐=“: 16.8 and part

”=⇒“ imply w.l.o.g. n0 = 1.

Fix ε > 0. To show: There is a compact set K with Pn(K) ≥ 1− ε ∀n.In a), let η := ε/2. Choose a > 0 such that, putting B := x : |x(0)| ≤ a,

Pn(B) ≥ 1− ε

2, n ≥ 1.

For each k ∈ N, let ε := 1/k and η := ε/2k+1 in b). Choose δk > 0 such that,writing Bk := x : wx(δk) < 1/k,

Pn(Bk) ≥ 1− ε

2k+1, k ≥ 1, n ≥ 1.

Put K := B ∩ ∩∞k=1Bk. Then Pn(K) ≥ 1− ε for each n ≥ 1.

The set A := B ∩∩∞k=1Bk satisfies the conditions of 17.4. Thus, K is compact.

Norbert Henze, KIT 17.9

Page 244: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Weak convergence and tightness in C

17.7 Theorem Let 0 = t0 < t1 < . . . < tk = 1. If

min1<j<k

(tj − tj−1) ≥ δ, (17.4)

thenwx(δ) ≤ 3 max

1≤j≤ksup

tj−1≤s≤tj

|x(s)− x(tj−1)|, x ∈ C. (17.5)

Moreover, for P ∈ P and ε > 0,

P (x : wx(δ) ≥ 3ε) ≤k∑

j=1

P

(x : sup

tj−1≤s≤tj

|x(s)−x(tj−1)| ≥ ε)

(17.6)

Proof: Let m be the maximum in (17.5). If s, t ∈ Ij := [tj−1, tj ], then

|x(s)− x(t)| ≤ |x(s)− x(tj−1)|+ |x(tj−1)− x(t)| ≤ 2m.

If s ∈ Ij , t ∈ Ij+1, then

|x(s)− x(t)| ≤ |x(s)− x(tj−1)|+ |x(tj)− x(tj−1)|+ |x(tj)− x(t)| ≤ 3m.

If |s− t| ≤ δ, no further cases possible in view of (17.4). (17.6) follows from(17.5) and the subadditivity of P , q.e.d.

Norbert Henze, KIT 17.10

Page 245: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Weak convergence and tightness in C

17.8 Theorem Let X,X1, X2, . . . be random functions on (Ω,A,P).If, for each k ≥ 1 and 0 ≤ t1 ≤ . . . ≤ tk ≤ 1,

(Xn(t1), Xn(t2), . . . , Xn(tk))D−→ (X(t1), X(t2), . . . , X(tk)) (17.7)

andlimδ→0

lim supn→∞

P (w(Xn, δ) ≥ ε) = 0 ∀ε > 0, (17.8)

then XnD−→ X.

Proof: Let P := PX , Pn := PXn , n ≥ 1. (17.7) ⇐⇒ PnDfidi−→ P .

In view of Thm. 17.1, we have to show the tightness of Pn : n ≥ 1.

Since Xn(0)D−→ X(0), Pn π−1

0 : n ≥ 1 is tight =⇒ condition a) of Thm.17.6 holds. Now,

(17.8) ⇐⇒ limδ→0

lim supn→∞

Pn(x : w(x, δ) ≥ ε) = 0 ∀ε > 0

=⇒ condition b) of Thm. 17.6.

Norbert Henze, KIT 17.11

Page 246: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

18 Wiener Measure, Donsker’s Theorem

In what follows, let X := idC , X(x) := x, x ∈ C, Xt := πt X : C → R

(canonical construction).

If P is a probability measure on B, then, for u ∈ R,

P (Xt ≤ u) = P (x ∈ C : Xt(x) ≤ u).

18.1 Definition (Wiener Measure)

A probability measure W on B is called Wiener measure :⇐⇒

a) W (Xt ≤ u) = 1√2πt

∫ u

−∞exp

(−z

2

2t

)dz, 0 < t ≤ 1, u ∈ R,

b) W (X0 = 0) = 1,

c) ∀k ≥ 2, ∀ 0 ≤ t0 ≤ t1 ≤ . . . ≤ tk ≤ 1:

Xt1 −Xt0 , Xt2 −Xt1 , . . . , Xtk −Xtk−1 are independent under W .

I.e., Xt ∼ N(0, t), 0 ≤ t ≤ 1, and (Xt : 0 ≤ t ≤ 1) has independent increments.

Norbert Henze, KIT 18.1

Page 247: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

18.2 Corollary Under the Wiener measure W , we have:

a) Xt −Xs ∼ N(0, t− s), 0 ≤ s ≤ t ≤ 1,

b) Xt −Xs ∼ Xt−s, s ≤ t (”(Xt)0≤t≤1 has stationary increments“),

c) Cov(Xs, Xt) = min(s, t) =: s ∧ t, 0 ≤ s, t ≤ 1,

d) For each k ≥ 1, for each 0 ≤ t1 ≤ t2 ≤ . . . ≤ tk ≤ 1:

(Xt1 , Xt2 , . . . , Xtk )⊤ ∼ Nk(0,Σ),

where 0 = (0, 0, . . . , 0)⊤ ∈ Rd and Σ = (ti ∧ tj)1≤i,j≤k.

Proof: a) Let 0 ≤ s < t ≤ 1. We have

Xt = Xs + (Xt −Xs),

where, according to 18.1 c), Xs (= Xs −X0) and Xt −Xs are independent

=⇒ E

(eiuXt

)= E

(eiuXs

)· E(eiu(Xt−Xs)

), u ∈ R.

18.1a) =⇒ E

(eiuXt

)= exp

(− tu

2

2

), E

(eiuXs

)= exp

(−su

2

2

)

=⇒ E

(eiu(Xt−Xs)

)= exp

(− (t− s)u2

2

)=⇒ Xt−Xs ∼ N(0, t−s) =⇒ b).

Norbert Henze, KIT 18.2

Page 248: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

Memo: c) Cov(Xs, Xt) = min(s, t), 0 ≤ s, t ≤ 1

Proof of c): Let 0 ≤ s < t ≤ 1. We have XsXt = X2s +Xs(Xt −Xs)

=⇒ Cov(Xs, Xt) = E(XsXt) = E(X2

s

)+ E (Xs(Xt −Xs))

= E(X2

s

)+ 0 = s = min(s, t).

Proof of d): Notice that

Xt1

Xt2

Xt3

...

...Xtk

=

1 0 0 0 · · · 01 1 0 0 · · · 01 1 1 0 · · · 0...

...... 1 · · · 0

......

......

. . . 01 1 1 1 · · · 1

·

Xt1

Xt2 −Xt1

Xt3 −Xt2

...

...Xtk −Xtk−1

︸ ︷︷ ︸ ︸ ︷︷ ︸=: A ∼ Nk(0, D)

where D := diag(t1, t2 − t1, t3 − t2, . . . , tk − tk−1).

We have (!) ADA⊤ = Σ = (ti ∧ tj)1≤i,j≤k, q.e.d.

Norbert Henze, KIT 18.3

Page 249: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

18.3 Construction of W

Let Z1, Z2, . . . be i.i.d. random variables on some probability space (Ω,A,P)such that E(Z1) = 0, 0 < σ2 := V(Z1) <∞. Put S0 := 0,Sn := Z1 + . . .+ Zn, n ≥ 1. Let, for ω ∈ Ω,

Xn(t)(ω) :=1

σ√nS⌊nt⌋(ω) + (nt− ⌊nt⌋) 1

σ√nZ⌊nt⌋+1(ω), (18.1)

n ≥ 1, 0 ≤ t ≤ 1. Notice that

Xn

(j

n

)=

1

σ√nSj , j ∈ 0, 1, . . . , n.

1n

2n

jn 1

•S1σ√n

•S2σ√n

•Sj

σ√n

Xn is the n-th partial sum process associated with (Zj)j≥1, cf. Ex. 16.3.

Norbert Henze, KIT 18.4

Page 250: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

Let

Rn(t) :=nt− ⌊nt⌋σ√n· Z⌊nt⌋+1.

Then

Xn(t) =1

σ√nS⌊nt⌋ +Rn(t).

Notice that Rn(t)P−→ 0 as n→∞. Thus, for t > 0 (and n ≥ 1/t),

Xn(t) =

√⌊nt⌋√n· S⌊nt⌋

σ√⌊nt⌋

+ Rn(t)

︸ ︷︷ ︸ ︸ ︷︷ ︸ ︸ ︷︷ ︸→√t

D−→ N(0, 1)P−→ 0

CLT of Lindeberg-Levy, CMT and Slutsky =⇒ Xn(t)D−→ N(0, t) ∼ Xt.

Notice that Xn(0) = 0D−→ δ0 = N(0, 0) ∼ X0.

Thus, Xn(t)D−→ Xt for each t ∈ [0, 1].

Norbert Henze, KIT 18.5

Page 251: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

Likewise, consider 0 ≤ s < t ≤ 1. Notice that S⌊ns⌋ and S⌊nt⌋ − S⌊ns⌋ areindependent. (why?) We have

(Xn(s)

Xn(t)−Xn(s)

)=

1

σ√n

(S⌊ns⌋

S⌊nt⌋ − S⌊ns⌋

)+

(Rn(s)

Rn(t)−Rn(s)

)

D−→ N2

((00

),

(s 00 t− s

))

CMT =⇒(Xn(s)Xn(t)

)=

(1 01 1

)·(

Xn(s)Xn(t)−Xn(s)

)

D−→ N2

((00

),

(1 01 1

) (s 00 t− s

) (1 10 1

))

= N2

((00

),

(s ss t

))

∼(Xs

Xt

).

Norbert Henze, KIT 18.6

Page 252: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

Likewise, for any k ≥ 3 and any t1, . . . , tk with 0 ≤ t1 < . . . < tk ≤ 1:

Xn(t1)

...Xn(tk)

D−→ Nk

0...0

, (ti ∧ tj)1≤i,j≤k

X(t1)

...X(tk)

(18.2)

Let Pn := PXn . Suppose W exists

=⇒ Pn π−1t1,...,tk

D−→ W π−1t1,...,tk

∀k ∀t1, . . . , tk, i.e., PnDfidi−→ W.

Suppose Pn : n ∈ N is tight (⇐⇒: Xn : n ∈ N is tight).Prokhorov’s Thm. =⇒ ∃ subsequence Pnj

: j ≥ 1 ∃ probability measure

(=:W ) on B such that Pnj

D−→W . From the CMT, we then have

Pnj

Dfidi−→ W . Now, from (18.2),

Pnj π−1

t1,...,tk

D−→ Nk

(0, (ti ∧ tj)1≤i,j≤k

)=⇒

W has the desired fidis (which determine PW ). Subsequence criterion

=⇒ PnD−→W . It remains to prove tightness of Xn : n ∈ N.

Norbert Henze, KIT 18.7

Page 253: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

18.4 Lemma (Tightness of the PSP for stationary sequences)

Let Xn be the partial sum process of (18.1), where (Zj)j≥1 is a stationarysequence satisfying E(Z1) = 0, 0 < σ2 := V(Z1) <∞. If

limλ→∞

lim supn→∞

λ2P

(maxk≤n|Sk| ≥ λσ

√n

)= 0,

then Xn : n ∈ N is tight.

Proof: Since Xn(0) = 0, n ≥ 1, condition a) of Thm. 17.6 holds. From(17.8), it remains to prove

limδ→0

lim supn→∞

P (w(Xn, δ) ≥ ε) = 0 ∀ε > 0. (18.3)

If 0 = t0 < t1 < . . . < tk = 1 and min1<j<k(tj − tj−1) ≥ δ, (17.6) yields

P(w(Xn, δ) ≥ 3ε) ≤k∑

j=1

P

(sup

tj−1≤s≤tj

|Xn(s)−Xn(tj−1)| ≥ ε).

Choose

tj :=mj

n, where mj ∈ N0 and 0 = m0 < m1 < . . . < mk = n.

Norbert Henze, KIT 18.8

Page 254: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

Memo: P(w(Xn, δ) ≥ 3ε) ≤ ∑kj=1 P

(suptj−1≤s≤tj

|Xn(s)−Xn(tj−1)| ≥ ε).

If,for tj :=

mj

n,

mj

n− mj−1

n≥ δ, 1 < j < k, (18.4)

then (because of the polygonal character of Xn and stationarity!)

P (w(Xn, δ) ≥ 3ε) ≤k∑

j=1

P

(max

mj−1≤ℓ≤mj

∣∣∣∣Sℓ − Smj−1

σ√n

∣∣∣∣ ≥ ε)

=k∑

j=1

P

(max

ℓ≤mj−mj−1

|Sℓ| ≥ εσ√n

). (18.5)

Now, put m := ⌈nδ⌉ := minr ∈ N : r ≥ nδ, and mj := jm, 0 ≤ j < k,mk := n. Then the inequalities in (18.4) hold (!). Since we also need

mk−1 = (k − 1)m < n = mk ≤ km,

take k := ⌈n/m⌉. Then mk −mk−1 ≤ m. (18.5) and stationarity =⇒

P (w(Xn, δ) ≥ 3ε) ≤ k · P(maxℓ≤m|Sℓ| ≥ εσ

√n

).

Norbert Henze, KIT 18.9

Page 255: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

Memo: P (w(Xn, δ) ≥ 3ε) ≤ k · P (maxℓ≤m |Sℓ| ≥ εσ√n) ,

where

m = m(n, δ) =⌈nδ⌉, k = k(n, δ) =

⌈n

m

⌉.

Notice that k −→n→∞

1

δ<

2

δ,

n

m−→

n→∞1

δ>

1

2δ=⇒

P (w(Xn, δ) ≥ 3ε) ≤⌈n

m

⌉· P(maxℓ≤m|Sℓ| ≥ ε√

2δ· σ√m ·

√2δ

√n

m

)

︸ ︷︷ ︸→√2

≤ 2

δ· P(maxℓ≤m|Sℓ| ≥ ε√

2δ· σ√m

)

for sufficiently large n.

Put λ := ε√2δ. Then, for sufficiently large n,

P (w(Xn, δ) ≥ 3ε) ≤ 4λ2

ε2· P(maxℓ≤m|Sℓ| ≥ λσ

√m

).

Norbert Henze, KIT 18.10

Page 256: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

Memo: P (w(Xn, δ) ≥ 3ε) ≤ 4λ2

ε2· P(maxℓ≤m|Sℓ| ≥ λσ

√m

), n ≥ n0(δ).

Memo: Assumption: limλ→∞

lim supn→∞

λ2P

(maxk≤n|Sk| ≥ λσ

√n

)= 0.

Memo: To show: limδ→0

lim supn→∞

P (w(Xn, δ) ≥ ε) = 0 ∀ε > 0.

Assumption =⇒

∀ε > 0 ∀η > 0 ∃λ0 > 0 ∀λ ≥ λ0 :4λ2

ε2lim supm→∞

P

(maxℓ≤m|Sℓ| ≥ λσ

√m

)< η.

From this, the assertion follows (m goes to infinity along with n!).

How to bound P

(maxℓ≤m|Sℓ| ≥ β

)from above?

Norbert Henze, KIT 18.11

Page 257: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

18.5 Lemma (Etemadi’s inequality)

Let Z1, . . . , Zn be independent random variables on (Ω,A, P), S0 := 0,

Sk :=∑k

j=1 Zj , 1 ≤ k ≤ n. Then

P

(maxk≤n|Sk| ≥ 3α

)≤ 3max

k≤nP (|Sk| ≥ α) , α > 0.

Proof: Put

A :=

maxk≤n|Sk| ≥ 3α

and, for k ∈ 1, . . . , n,

Bk := |Sk| ≥ 3α, |Sj | < 3α for j = 0, . . . , k − 1.

Then

A =n∑

k=1

Bk

and A = A ∩ |Sn| ≥ α+ A ∩ |Sn| < α.

Norbert Henze, KIT 18.12

Page 258: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

Memo: A = maxk≤n |Sk| ≥ 3α

Memo: Bk = |Sk| ≥ 3α, |Sj | < 3α for j = 0, . . . , k − 1

Memo: A = B1 + . . .+Bk = A ∩ |Sn| ≥ α+ A ∩ |Sn| < α.

P(A) ≤ P(|Sn| ≥ α) +n∑

k=1

P(Bk ∩ |Sn| < α) (last memo)

≤ P(|Sn| ≥ α) +n∑

k=1

P(Bk ∩ |Sn − Sk| > 2α) (triangle inequal.)

= P(|Sn| ≥ α) +n∑

k=1

P(Bk) · P(|Sn − Sk| > 2α) (independence)

≤ P(|Sn| ≥ α) + maxk≤n

P(|Sn − Sk| > 2α)

(n∑

k=1

P(Bk) ≤ 1

)

≤ P(|Sn| ≥ α) + maxk≤n

(P(|Sn| ≥ α) + P(|Sk| ≥ α)) (triangle inequal.)

≤ 3maxk≤n

P(|Sk| ≥ α), q.e.d.

Norbert Henze, KIT 18.13

Page 259: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

Proof of the existence of Wiener measure W :

Consider special partial sum process

Xn(t) :=1

σ√nS⌊nt⌋ + (nt− ⌊nt⌋) 1

σ√nZ⌊nt⌋+1,

where Z1, Z2, . . . are i.i.d. ∼ N(0, σ2)!! In this case, we have

N :=Sk

σ√k∼ N(0, 1), k ≥ 1, =⇒

P(|Sk| ≥ λσ√n) = P

(|N | ≥ λ ·

√n

k

)

≤ P(|N | ≥ λ) (if k ≤ n)

≤ E(N4) 1

λ4=

3

λ4.

It follows that limλ→∞

lim supn→∞

λ2 max1≤k≤n

P(|Sk| ≥ λσ√n) = 0.

In view of Lemma 18.4 and Etemadi’s inequality 18.5, this was to be shown.

Norbert Henze, KIT 18.14

Page 260: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

18.6 Wiener Process on [0, 1]

In what follows, we also use the notation W for a random function on aprobability space (Ω,A,P) having distribution W . Then W : Ω→ C.

For fixed ω ∈ Ω, W (ω) ∈ C is called a path of W .

We put W (ω)(t) =:Wt(ω) =:W (ω, t) and suppress the dependence on ω bywriting

W (t) :=Wt (random variable on Ω).

Then (W (t),0 ≤ t ≤ 1) is a stochastic process (family of random variables)with the following properties:

a) P(W (0) = 0) = 1,

b) W (t) ∼ N(0, t), 0 ≤ t ≤ 1;

c) W has independent increments, i.e., ∀k ≥ 2,∀0 ≤ t0 < t1 < . . . < tk ≤ 1:W (t1)−W (t0), . . . ,W (tk)−W (tk−1) are independent.

d) (W (t), 0 ≤ t ≤ 1) is a Gaussian process, i.e., ∀k ≥ 1, ∀t1, . . . , tk,(W (t1), . . .W (tk))

⊤ has a k-variate normal distribution satisfyingEW (t) = 0 and Cov(W (s),W (t)) = min(s, t), 0 ≤ s, t ≤ 1.

(W (t), 0 ≤ t ≤ 1) is called the Wiener-Process or Brownian motion on [0, 1].

Norbert Henze, KIT 18.15

Page 261: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

The Wiener Process (Brownian motion) (W (t),0 ≤ t ≤ 1) is a fundamentalstochastic process.

It has continuous paths (W is a C[0, 1]-valued random element), but one canprove:

With probability one, the paths of W are nowhere differentiable,

With probability one, the paths of W are nowhere locally increasing ordecreasing,

With probability one, the paths of W have unbounded variation on everyinterval [s, t] with s < t.

→ Course”Brownian Motion“.

Norbert Henze, KIT 18.16

Page 262: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

18.7 Theorem (Donsker’s Theorem (1951))

Let Z1, Z2, . . . be i.i.d. random variables, E(Z1) = 0, 0 < σ2 := V(Z1) < ∞.Let S0 := 0, Sn :=

∑nj=1 Zj , n ≥ 1, and put

Xn(t) :=1

σ√nS⌊nt⌋ + (nt− ⌊nt⌋) · Z⌊nt⌋+1√

n.

We then have XnD−→ W .

Proof: We have to show (cf. 18.4, 18.5)

limλ→∞

lim supn→∞

λ2 max1≤k≤n

P(|Sk| > λσ

√n)= 0. (TP)

LetMn(λ) := max

1≤k≤nP(|Sk| > λσ

√n).

Notice that

P(|Sk| > λσ√n) ≤ kσ2

λ2σ2n=

k

λ2n. (Tschebyschew’s inequality)

Norbert Henze, KIT 18.17

Page 263: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

Memo: P(|Sk| > λσ√n) ≤ k

λ2n, Mn(λ) := max

1≤k≤nP(|Sk| > λσ

√n).

Put Yk :=Sk

σ√k. Note that Yk

D−→ N ∼ N(0, 1) as k →∞.

P(|Sk| > λσ√n) = P

(|Yk| > λ

√n/k

)≤ P(|Yk| > λ) −→

k→∞P(|N | > λ).

↑k ≤ n

Markov’s inequality =⇒ P(|N | > λ) ≤ E(N4)

λ4=

3

λ4.

Given λ > 0, let k(λ) ∈ N such that

P(|Yk| > λ) ≤ 6

λ4∀ k > k(λ).

=⇒ Mn(λ) ≤ max

(k(λ)

λ2n,6

λ4

)

=⇒ lim supn→∞

λ2Mn(λ) ≤ 6

λ2=⇒ lim

λ→∞lim supn→∞

λ2Mn(λ) = 0, i.e., (TP).

Norbert Henze, KIT 18.18

Page 264: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

0

1

2

−1

−2

0.5 1.0t

Realizations of PSP X1000, P(Z1 = ±1) = 1/2

0

1

2

−1

−2

0.5 1.0t

Realizations of PSP X1000, Z1 − 1 ∼ Exp(1)

Norbert Henze, KIT 18.19

Page 265: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

18.8 Corollary From Donsker’s Theorem, we have

Sn

σ√n

= Xn(1)D−→ W (1) ∼ N(0, 1) (CLT of Lindeberg-Levy)

18.9 Invariance Principle, functional Central Limit Theorem

a) Let Xn be a partial sum process as in Thm. 18.7. The limit process W in18.7 does not depend on the specific distribution of Z1 (we only needE(Z1) = 0, 0 < V(Z1) <∞). This fact is called the invariance principle.

b) Let h : C → Rk be measurable and W (C(h)) = 1. From XnD−→W and

the CMT, we have

h(Xn)D−→ h(W )

(so-called functional central limit theorem).

From the invariance principle, the limit distribution of h(Xn) does not dependon the special distribution of Z1.

Important consequence: If you can find the limit distribution of h(Xn) for asimple PSP (e.g., the simple symmetric random walk case P(Z1 = ±1) = 1/2),you know the distribution of h(W ).

Norbert Henze, KIT 18.20

Page 266: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

18.10 Theorem (The distribution of max0≤t≤1W (t))

For the Wiener process W , we have

max0≤t≤1

W (t) ∼ |N |, where N ∼ N(0, 1),

i.e.,

P

(max0≤t≤1

W (t) ≤ u)

= 2Φ(u) − 1, u ≥ 0,

where Φ is the distribution function of the standard normal distribution.

Proof: Let Xn be the partial sum process associated with the i.i.d.-sequence(Zj)j≥1, where P(Z1 = 1) = P(Z1 = −1) = 1/2 (simple symmetric randomwalk). Exercise =⇒

max0≤t≤1

Xn(t) =1√n· maxk=0,...,n

SkD−→ |N |.

Since XnD−→W and the function h : C → R, h(x) := max0≤t≤1 h(t), is

continuous, (check!) the CMT yields

max0≤t≤1

Xn(t) = h(Xn)D−→ h(W ) = max

0≤t≤1W (t).

Norbert Henze, KIT 18.21

Page 267: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

18.11 Corollary Let Z1, Z2, . . . be independent identically distributed randomvariables with E(Z1) = 0, 0 < σ2 := V(Z1) <∞. Then

1

σ√n

maxk=0,...,n

SkD−→ |N |, where N ∼ N(0, 1).

Consider the functionals

h+(x) := λ1 (t ∈ [0, 1] : x(t) > 0) , x ∈ C,

h0(x) := supt ∈ [0, 1] : x(t) = 0, x ∈ C.h+(x) is the time that x

”spends above the t-axis“,

h0(x) is the time of the last zero of x.

0

1

2

−11

x(t)

h0(x)

h+(x)t

Norbert Henze, KIT 18.22

Page 268: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

For the PSP Xn based on the symmetric simple random walk, we have (seeHenze,N. (2013): Irrfahrten und verwandte Zufalle, Spr. Spektrum, p.21, p.46):

limn→∞

P

(h0(Xn)

n≤ u)

= limn→∞

P

(h+(Xn)

n≤ u)

=2

πarcsin

√u, 0 ≤ u ≤ 1.

From this, we have the famous Arc Sine Law:

18.12 Theorem (Arc Sine Law for the Wiener process)

We have

P(h0(W ) ≤ u) = P(h+(W ) ≤ u) =2

πarcsin

√u, 0 ≤ u ≤ 1.

0 1u

1/(π√u(1− u))

0 1u

2πarcsin

√u

Density (left) and distribution function (right) of the Arc Sine distribution

Norbert Henze, KIT 18.23

Page 269: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

18.13 Theorem (Fourier representation of W )

Let N1, N2, . . . be i.i.d. standard normal random variables. Put

W (t) :=∞∑

j=1

√2 sin

((j − 1

2

)t)

(j − 1

2

·Nj , 0 ≤ t ≤ 1.

The series converges in L2 := L2([0, 1],B ∩ [0, 1], λ1|[0,1]), and we have

WDfidi= W (equality of finite-dimensional distributions).

The proof uses Mercer’s Theorem:

18.14 Theorem (Mercer)

Let K : [0, 1]2 → R, K 6≡ 0, be a continuous, symmetric function satisfying

∫ 1

0

∫ 1

0g(s)K(s, t)g(t) ds dt ≥ 0 ∀ g ∈ L2. (K positive-semidefinite)

ThenK(s, t) =

∑∞j=1λjϕj(s)ϕj(t), 0 ≤ s, t ≤ 1, (18.6)

where λ1, λ2, . . . are the positive eigenvalues and ϕ1, ϕ2, . . . the correspondingnormalized eigenfunctions of the integral operator associated with the kernel K.The series in (18.6) converges both uniformly and absolutely.

Norbert Henze, KIT 18.24

Page 270: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

18.15 Theorem (Mercer’s theorem, applied to K(s, t) = s ∧ t)We have

s ∧ t =∞∑

j=1

λj ϕj(s)ϕj(t), 0 ≤ s, t ≤ 1,

where

λj =1

π2(j − 1

2

)2 , ϕj(t) =√2 sin

((j − 1

2

)πt

), j ≥ 1.

Proof: Exercise! (Differentiate λf(s) =∫ 1

0s ∧ t f(t) dt twice.)

Let

W (t) :=

∞∑

j=1

√λj ϕj(t)Nj (L2-limit).

Proof of Theorem 18.13. Fix k ≥ 1 and 0 ≤ t1 < . . . < tk ≤ 1.Claim:

(W (t1), . . . , W (tk)

)D= (W (t1), . . . ,W (tk))

D= Nk

(0, (ti ∧ tj)1≤i,j≤k

).

Norbert Henze, KIT 18.25

Page 271: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Wiener Measure, Donsker’s Theorem

Memo: Claim:(W (t1), . . . , W (tk)

)D= Nk

(0, (ti ∧ tj)1≤i,j≤k

).

Fix c1, . . . , ck ∈ R. To show:k∑

ℓ=1

cℓW (tℓ) ∼ N(0,∑k

ℓ,m=1cℓcm tℓ ∧ tm).

Let

Wn(t) :=n∑

j=1

√λj ϕj(t)Nj , n ≥ 1.

Notice that

k∑

ℓ=1

cℓWn(tℓ) =

k∑

ℓ=1

cℓ

(n∑

j=1

√λjϕj(tℓ)Nj

)=

n∑

j=1

√λj

(k∑

ℓ=1

cℓϕj(tℓ)

)Nj

∼ N(0,∑n

j=1λj

∑kℓ,m=1cℓcm ϕj(tℓ)ϕj(tm)

)

= N(0,∑k

ℓ,m=1cℓcm∑n

j=1λjϕj(tℓ)ϕj(tm))

︸ ︷︷ ︸→ tℓ ∧ tm.

Since∑k

ℓ=1 cℓWn(tℓ)L2

−→∑kℓ=1 cℓW (tℓ), the assertion follows. (why?)

Norbert Henze, KIT 18.26

Page 272: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Brownian Bridge, Wiener process on [0,∞)

19 Brownian Bridge, Wiener process on [0,∞)

19.1 Definition (Brownian Bridge)

A C[0, 1]-valued random element B is called Brownian Bridge :⇐⇒a) P(B(0) = 0) = 1 = P(B(1) = 0),

b) For each k ≥ 1, for each t1, . . . , tk with 0 ≤ t1 < . . . < tk ≤ 1:

B(t1)

...B(tk)

∼ Nk

0...0

, (min(ti, tj)− titj)1≤i,j≤k

.

19.2 Theorem B exists.

Proof: Consider the mapping h : C → C, C ∋ x 7→ h(x), defined by

h(x)(t) := x(t)− t · x(1), 0 ≤ t ≤ 1.

Note that h is continuous, (why?) and that h(x)(1) = 0.

Moreover, x(0) = 0 =⇒ h(x)(0) = 0.

Norbert Henze, KIT 19.1

Page 273: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Brownian Bridge, Wiener process on [0,∞)

0

1

−1

1

tx(1)x(t)− tx(1)

x(t)

t

Let (W (t),0 ≤ t ≤ 1), be a Wiener process. Put

B(t) := W (t)− tW (1) = h W (t), 0 ≤ t ≤ 1.

Then P(B(0) = 0) = 1 = P(B(1) = 0), i.e., 19.1 a) holds. Notice that

B(t1)B(t2)

...B(tk)

=

1 0 · · · 0 −t0 1 0 0 −t... 0

. . . 0...

0 0 · · · 1 −t

W (t1)...

W (tk)W (1)

∼ Nk

Norbert Henze, KIT 19.2

Page 274: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Brownian Bridge, Wiener process on [0,∞)

Memo: B(t) =W (t)− tW (1).

We haveEB(t) = EW (t)− tEW (1) = 0,

and, for 0 ≤ s, t ≤ 1,

Cov(B(s),B(t)) = E [(W (s)− sW (1))(W (t)− tW (1))]

= E [W (s)W (t)] − sE [W (1)W (t)] − tE [W (s)W (1)]

+stE[W (1)2

]

= s ∧ t− st− ts+ st

= s ∧ t− st.

I.e., 19.1 b) holds, q.e.d.

Norbert Henze, KIT 19.3

Page 275: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Brownian Bridge, Wiener process on [0,∞)

0

0.5

1.0

−0.5

t

3 realizations of an (approximate) Brownian bridge

Norbert Henze, KIT 19.4

Page 276: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Brownian Bridge, Wiener process on [0,∞)

19.3 The Wiener process on [0,∞)

Let S := C[0,∞) := x : R≥0 → R | x continuous.For x, y ∈ C[0,∞), put

ρ(x, y) :=∞∑

j=1

1

2j· max0≤t≤j |x(t)− y(t)|1 + max0≤t≤j |x(t)− y(t)|

.

Then (C[0,∞), ρ) is a complete and separable metric space. (Exercise!)

ρ(xn, x)→ 0 ⇐⇒ maxt∈K|xn(t)− x(t)| → 0 for each compact set K ⊂ R≥0.

For j ∈ N, let

rj :

C[0,∞) → C[0, j],

x 7→ rj(x) := x|[0,j], (restriction of x to [0, j])

Let P, P1, P2, . . . be probability measures on B. Then

PnD−→ P ⇐⇒ Pnr

−1j

D−→ Pr−1j ∀ j ≥ 1.

Ref.: Whitt, W.: Ann. Mathem. Statist. 41, 1970, 939–944.

Norbert Henze, KIT 19.5

Page 277: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Brownian Bridge, Wiener process on [0,∞)

Consider the mapping

h :

C[0, 1]→ C[0,∞),

x 7→ h(x), h(x)(t) := (1 + t) · x(

t

1 + t

), 0 ≤ t <∞.

The function h is continuous. (why?)

Let (B(t))0≤t≤1 be a Brownian bridge. Put

V (t) := h(B)(t)

= (1 + t) ·B(

t

1 + t

), t ≥ 0.

Then V is a random element of C[0,∞).

Notice that EV (t) = 0, t ≥ 0, and that, for 0 ≤ s ≤ t,

Cov(V (s), V (t)) = (1 + s)(1 + t)Cov

(B

(s

1 + s

), B

(t

1 + t

))

= (1 + s)(1 + t)

(min

(s

1 + s,

t

1 + t

)− s

1 + s

t

1 + t

)

= s(1 + t)− st = s

= min(s, t).

Norbert Henze, KIT 19.6

Page 278: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Brownian Bridge, Wiener process on [0,∞)

Memo: V (t) = (1 + t) ·B(

t1+t

), EV (t) = 0, Cov(V (s), V (t)) = s ∧ t.

Notice that

a) P(V (0) = 0) = 1,

b) For each k ≥ 1, for each t1, . . . , tk with 0 ≤ t1 < . . . < tk <∞:

V (t1)

...V (tk)

∼ Nk

0...0

, (ti ∧ tj)1≤i,j≤k

.

From this property, it follows that V has independent increments.

Any random element V satisfying a) and b) is called Wiener process orBrownian motion on C[0,∞).

Norbert Henze, KIT 19.7

Page 279: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Brownian Bridge, Wiener process on [0,∞)

19.4 Theorem Let W be a Wiener process on [0, 1]. For ε > 0, let

Pε(A) := P (W ∈ A|0 ≤W (1) ≤ ε), A ∈ B.

We then have PεD−→ B as ε ↓ 0, where B is a Brownian bridge.

Proof: Let W be defined on (Ω,A, P), and let B be defined as

B(t) := W (t)− tW (1), 0 ≤ t ≤ 1.

According to the Portmanteau Theorem, we have to show

lim supε↓0

P(W ∈ A|0 ≤W (1) ≤ ε) ≤ P(B ∈ A) ∀A ∈ A.

Notice that, for each k ≥ 1 and each choice of t1, . . . , tk,(W (1), B(t1), . . . , B(tk)) has a (k + 1)-variate normal distribution. Moreover,for each j ∈ 1, . . . , k,

E[W (1)B(tj)] = E[W (1)(W (tj)− tjW (1)

]= tj − tj = 0 =⇒

W (1) and (B(t1), . . . B(tk)) are independent ∀k ≥ 1,∀t1, . . . , tk =⇒

P(W (1) ∈ A,B ∈M) = P(W (1) ∈ A) · P(B ∈M) ∀M ∈ Cf ∀A ∈ B1.

Norbert Henze, KIT 19.8

Page 280: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Brownian Bridge, Wiener process on [0,∞)

Memo: Cf :=π−1t1,...,tk

(H)∣∣k ∈ N, 0 ≤ t1 < . . . < tk ≤ 1,H ∈ Bk

Memo: P(W (1) ∈ A,B ∈M) = P(W (1) ∈ A) · P(B ∈M) ∀M ∈ Cf ∀A ∈ B1

Fix A ∈ B1. Put

DA := M ∈ B : P(W (1) ∈ A,B ∈M) = P(W (1) ∈ A) · P(B ∈M).

We have:

a) Cf ⊂ DA.

b) DA is a Dynkin system, i.e., we have:

C ∈ DA,

D,E ∈ DA and D ⊂ E =⇒ E \D ∈ DA,

E1, E2, . . . ∈ DA pairwise disjoint =⇒∑∞

n=1 En ∈ DA.

It follows that δ(Cf ) ⊂ DA, where δ(Cf ) is the smallest Dynkin system over Ccontaining Cf .

Cf π-system =⇒ B = σ(Cf ) = δ(Cf ) ⊂ DA.

Norbert Henze, KIT 19.9

Page 281: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Brownian Bridge, Wiener process on [0,∞)

Hence,P(B ∈M |0 ≤W (1) ≤ ε) = P(B ∈M) ∀M ∈ B.

We have (recall: B(t) =W (t)− tW (1))

ρ(W,B) = sup0≤t≤1

|W (t)− (W (t)− tW (1))| = |W (1)|.

Now, fix A ∈ A and δ > 0.

If |W (1)| ≤ δ and W ∈ A, then B ∈ Aδ := x : ρ(x,A) ≤ δ.If 0 < ε < δ, then

P(W ∈ A|0 ≤W (1) ≤ ε) ≤ P(B ∈ Aδ|0 ≤W (1) ≤ ε)= P(B ∈ Aδ)

=⇒ lim supε↓0

P(W ∈ A|0 ≤W (1) ≤ ε) ≤ P(B ∈ Aδ).

δ ↓ 0 =⇒ lim supε↓0

P(W ∈ A|0 ≤W (1) ≤ ε) ≤ P(B ∈ A), q.e.d.

Norbert Henze, KIT 19.10

Page 282: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Brownian Bridge, Wiener process on [0,∞)

19.5 Remark Loosely speaking, Thm. 19.4 reads

PB = P

W |W (1)=0.

(”Brownian bridge is tied down Brownian motion“)

0

0.5

1.0

−0.5

t

The Brownian bridge is tied down Brownian motion

Norbert Henze, KIT 19.11

Page 283: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Brownian Bridge, Wiener process on [0,∞)

19.6 Some relations between processes (Exercise!)

a) Let W be a Wiener process (WP) on [0,∞) and r > 0. Then

W ∗(t) := ±√rW(t

r

), t ≥ 0, is a WP on [0,∞).

b) Let W be a WP on [0,∞) and r > 0. Then

W (t) :=W (t+ r)−W (r), t ≥ 0, is a WP on [0,∞).

c) Let W be a WP [0,∞). Then (use W (s)/sa.s.−→ 0 as s→∞)

W (t) := tW

(1

t

), t ≥ 0, is a WP on [0,∞).

d) Let W be a WP on [0,∞). Then (use W (s)/sa.s.−→ 0 as s→∞)

B(t) := (1− t)W(

t

1− t

), 0 ≤ t ≤ 1, is a Brownian bridge.

e) Let B be a Brownian bridge and Z ∼ N(0, 1), independent of B. Then

W (t) := B(t) + tZ, 0 ≤ t ≤ 1, is a WP on [0, 1].

Norbert Henze, KIT 19.12

Page 284: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

Brownian Bridge, Wiener process on [0,∞)

19.7 Theorem (Reproduction Theorem for B)

Let B1, B2 be independent Brownian bridges. If a1, a2 ∈ R and a21 + a22 = 1,then

B := a1B1 + a2B2

is a Brownian bridge.

Proof: Notice that P(B(0) = 0) = 1 = P(B(1) = 0).

For each k ≥ 1, for each t1, . . . , tk ∈ [0, 1]:

(B(t1), . . . , B(tk))⊤ ∼ Nk (addition theorem for Nk).

We have EB(t) = 0, 0 ≤ t ≤ 1.

Let K(s, t) := s ∧ t− st (covariance function of a Brownian bridge).

E [B(s)B(t)] = E [(a1B1(s) + a2B2(s))(a1B1(t) + a2B2(t))]

= a21K(s, t) + a1a2E [B1(s)B2(t)] + a2a1E [B2(s)B1(t)] + a22K(s, t)︸ ︷︷ ︸ ︸ ︷︷ ︸= 0 = 0

= (a21 + a22)K(s, t)

= K(s, t), q.e.d. Generalization to n indep. Brownian bridges?

Norbert Henze, KIT 19.13

Page 285: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

The Space D[0, 1]

20 The space D[0, 1]

Motivation: Let U1, U2, . . . be i.i.d. random variables, where U1 ∼ U[0, 1]. Let

Fn(t) :=1

n

n∑

j=1

1Uj ≤ t, 0 ≤ t ≤ 1,

be the empirical distribution function (EDF) of U1, . . . , Un, cf. Chapter 7. Wealready know that, for each k ≥ 1, for each 0 ≤ t1 < . . . < tk ≤ 1:

√n(Fn(t1)− t1

)

...√n(Fn(tk)− tk

)

D−→ Nk

0...0

, (K(ti, tj))1≤i,j≤k

,

whereK(s, t) = min(s, t)− st, 0 ≤ s, t ≤ 1,

is the covariance function of a Brownian bridge B.

Does

(√n(Fn(t)− t

)0≤t≤1

)(as a discontinuous random function) converge

in distribution to B?

Norbert Henze, KIT 20.1

Page 286: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

The Space D[0, 1]

For x : [0, 1]→ R, let

x(t+) := lims↓t

x(s), t ∈ [0, 1) (right-hand limit)

x(t−) := lims↑t

x(s), t ∈ (0, 1] (left-hand limit).

20.1 Definition (The cadlag space D[0, 1])

Let

D[0, 1] := x : [0, 1]→ R|x(t+) = x(t)∀ t ∈ [0, 1), x(t−) exists ∀t ∈ (0, 1]

be the space of real functions on [0, 1] that are right-continuous and have left-hand limits. D := D[0, 1] is called cadlag space.

(french: continue a droite, limites a gauche).

For x ∈ D and T ⊂ [0, 1], let

wx(T ) := w(x, T ) := sups,t∈T

|x(s)− x(t)|.

Notice thatwx(δ) = sup

|u−v|≤δ

|x(u)− x(v)| = sup0≤t≤1−δ

wx([t, t+ δ]).

Norbert Henze, KIT 20.2

Page 287: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

The Space D[0, 1]

20.2 Lemma For each x ∈ D and each ε > 0, there exist points t0, t1, . . . , tkso that

0 = t0 < t1 < . . . < tk = 1

andwx([ti−1, ti)) < ε, i = 1, 2, . . . , k. (20.1)

Proof: Let

t := supt ∈ [0, 1] : [0, t) can be decomposed into finitely many

intervals satisfying (20.1).

Since x(0) = x(0+) we have t > 0. Since x(t−) exists, [0, t) can itself be sodecomposed. t < 1 is impossible because of x(t) = x(t+) in that case.

20.3 Corollary

a) ∀ε > 0:∣∣t ∈ [0, 1] : |x(t)− x(t−)| ≥ ε

∣∣ <∞,

b) ‖x‖∞ := sup0≤t≤1 |x(t)| <∞,

c) x is measurable (uniform limit of simple functions constant over intervals).

x ∈ D can have at most countably many discontinuities. (why?)

Norbert Henze, KIT 20.3

Page 288: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

The Space D[0, 1]

x(t)

tt1 t2 t3 t4 t5 tk

Notice that |x(t)| ≤ max0≤j≤k |x(tj)|+ ε.

Norbert Henze, KIT 20.4

Page 289: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

The Space D[0, 1]

For 0 < δ < 1, let

w′x(δ) := inf

max1≤i≤k

wx[ti−1, ti)∣∣∣k ∈ N, 0 = t0 < t1 < . . . < tk = 1,

min1≤i≤k

(ti − ti−1) > δ

w′x(·) is called the cadlag modulus.

Notice that the infimum is taken over all so-called δ-sparse partitions of [0, 1].

Lemma 20.2 is equivalent to saying that for each x ∈ D, w′x(δ)→ 0 as δ → 0.

We have

w′x(δ) ≤ wx(2δ), if δ < 1

2.

Proof. δ < 1/2 =⇒ ∃ δ-sparse partition with ti − ti−1 ≤ 2δ ∀i.Let

j(x) := sup0<t≤1

|x(t)− x(t−)|

be the maximum absolute jump of x. We have (Exercise!):

wx(δ) ≤ 2w′x(δ) + j(x).

Norbert Henze, KIT 20.5

Page 290: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

The Space D[0, 1]

Since, for 0 ≤ u ≤ 1, xu := 1[u,1] ∈ D and

u 6= v =⇒ ρ(xu, xv) = max0≤t≤1

|xu(t)− xv(t)| = 1,

the metric space (D, ρ) is not separable.

Idea: xu and xv should have a small distance if u ≈ v =⇒ allow for

”deformations of the time scale“. Let

Λ := λ : [0, 1]→ [0, 1] : λ continuous, strictly increasing, bijective.

Λ is a group with respect to composition”“, and λ(0) = 0, λ(1) = 1.

Put I(t) := t, 0 ≤ t ≤ 1, and

dS(x, y) := infλ∈Λ

max (‖x λ− y‖∞, ‖λ− I‖∞) , x, y ∈ D.

≤ ρ(x, y) = ‖x− y‖∞ (put λ := I)

Notice that

dS(xn, x)→ 0 ⇐⇒ ∃λn ∈ Λ : max (‖xn λn − x‖∞, ‖λn − I‖∞) → 0,

‖xn − x‖∞ → 0 =⇒ dS(xn, x)→ 0. (take λn = I)

Norbert Henze, KIT 20.6

Page 291: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

The Space D[0, 1]

Memo: dS(xn, x)→ 0⇐⇒ ∃λn ∈ Λ : max (‖xn λn − x‖∞, ‖λn − I‖∞)→ 0.

⇐⇒ ∃λn ∈ Λ : max (‖xn − x λn‖∞, ‖λn − I‖∞) → 0

We have

|xn(t)− x(t)| ≤ |xn(t)− x (λnt) |+ |x (λnt)− x(t)|≤ ‖xn − x λn‖∞ + wx (‖λn − I‖∞) .

As a consequence, we have:

If dS(xn, x)→ 0, then

xn(t)→ x(t) for each point of continuity t of x,

xn(t)→ x(t) with at most countably many exceptional values t,

‖xn − x‖∞ → 0, if x is continuous.

20.4 Definition and Theorem

dS is a metric on D (so-called Skorokhod metric). (Exercise!)

The metric space (D, dS) is separable but not complete.

Norbert Henze, KIT 20.7

Page 292: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

The Space D[0, 1]

Memo: dS(x, y) = infλ∈Λ max (‖x λ− y‖∞, ‖λ − I‖∞)

20.5 Example ((D, dS) is not complete)

Let an := 1/2n, xn := 1[0,an), n ≥ 1.

0

1

0 1 t•

an

xn(t)

an+1

λn(t)

‖λn − I‖∞ = an+1

xn+1 = 1[0,an+1)

xn+1 λn = 1[0,an+1) λn

= 1[0,an) = xn

=⇒ ‖xn+1 λn − xn‖∞ = 0

=⇒ dS(xn, xn+1) ≤ an+1 = 2−(n+1)

=⇒ (xn) Cauchy sequence (!)

Notice that xn(t)→ 0 for each t > 0. Let x ≡ 0 (∈ D).

We have dS(xn, x) = 1 ∀n. Thus, (xn) has no limit in D.

Norbert Henze, KIT 20.8

Page 293: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

The Space D[0, 1]

Memo: dS(x, y) = infλ∈Λ max (‖x λ− y‖∞, ‖λ − I‖∞)

For λ ∈ Λ, put

‖λ‖ := sups<t

∣∣∣∣ logλ(t)− λ(s)

t− s

∣∣∣∣ (≤ ∞)

dS(x, y) := infλ∈Λ

max (‖λ‖, ‖x λ− y‖∞) .

20.6 Theorem

a) dS and dS are equivalent metrics on D (generate the same topology).

b) The space (D, dS) is separable and complete.

Notice that C = C[0, 1] ⊂ D = D[0, 1].

The Skorokhod topology relativized to C coincides with the uniform topologyon C.

The Borel σ-field in C is the trace B(D) ∩ C, where B(D) is the Borel σ-fieldin D.

Norbert Henze, KIT 20.9

Page 294: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

The Space D[0, 1]

How to characterize relative compactness in D?

w′x(δ)=inf

max1≤i≤k

wx[ti−1, ti)∣∣∣k≥1, 0= t0<t1<. . .<tk=1, min

1≤i≤k(ti−ti−1)>δ

20.7 Theorem (“Arzela–Ascoli in D[0, 1]“)A set A ⊂ D[0, 1] is relatively compact if, and only if,

supx∈A‖x‖∞ <∞, (20.2)

limδ→0

supx∈A

w′x(δ) = 0. (20.3)

0 1t

xn(t)

•n A single t with supx∈A |x(t)| <∞ does not suffice!

Consider A := xn : n ≥ 1, where xn = n1[0.5,1)

We have (20.3) and supn≥1 |xn(0.25)| <∞,

but A is not relatively compact.

Norbert Henze, KIT 20.10

Page 295: Winter term 2016/2017 Norbert Henze, Institute of Stochasticshenze/media/asymptotic-stochastics-ws-2… · Norbert Henze, KIT 0.2. Contents 14. Weak convergence in metric spaces 15

The Space D[0, 1]

Memo: dS(xn, x)→ 0⇐⇒ ∃λn ∈ Λ : max (‖xn λn − x‖∞, ‖λn − I‖∞) → 0.

Memo: λ ∈ Λ =⇒ λ(0) = 0, λ(1) = 1.

Let D be the Borel σ-field over D.

For t1, . . . , tk ∈ [0, 1], let πt1,...,tk : D → Rk, πt1,...,tk(x) = (x(t1), . . . , x(tk)).

20.8 Theorem

a) The projections π_0 and π_1 are continuous.

b) If 0 < t < 1, then: π_t is continuous at x ⇐⇒ t ∈ C(x).

c) π_{t_1,…,t_k} is (D, B^k)-measurable ∀ k ≥ 1, ∀ t_1, …, t_k ∈ [0, 1].

d) If T ⊂ [0, 1], 1 ∈ T and T is dense in [0, 1], then D = σ(π_t : t ∈ T).

For a probability measure P on D, put

T_P := {t ∈ [0, 1] : π_t is continuous P-almost surely}.

We have {0, 1} ⊂ T_P, and [0, 1] \ T_P is countable.


20.9 Theorem  Let P, P_1, P_2, … be probability measures on D. If {P_n : n ≥ 1} is tight and

P_n ∘ π^{−1}_{t_1,…,t_k} →_D P ∘ π^{−1}_{t_1,…,t_k}  ∀ k ≥ 1, ∀ t_1, …, t_k ∈ T_P,

then P_n →_D P.

Let X, X_1, X_2, … be D-valued random elements. Put T_X := T_{P_X}.

20.10 Theorem  Suppose that

a) (X_n(t_1), …, X_n(t_k)) →_D (X(t_1), …, X(t_k)) ∀ k ≥ 1, ∀ t_1, …, t_k ∈ T_X,

b) X(1) − X(1−δ) →_D δ_0 as δ ↓ 0 (⇐⇒ X(1) − X(1−δ) →_P 0).

Suppose further that, for some continuous increasing function H : [0, 1] → R and constants α > 0, β ≥ 0:

c) P( |X_n(s) − X_n(r)| ∧ |X_n(t) − X_n(s)| ≥ γ ) ≤ (1/γ^{4β}) (H(t) − H(r))^{2α}

∀ γ > 0, ∀ n ≥ 1, ∀ r, s, t ∈ [0, 1] such that r ≤ s ≤ t.

Then X_n →_D X.


Memo: c) P( |X_n(s) − X_n(r)| ∧ |X_n(t) − X_n(s)| ≥ γ ) ≤ (1/γ^{4β}) (H(t) − H(r))^{2α}

20.11 Remark  A sufficient condition for c) is

c′) E[ |X_n(s) − X_n(r)|^{2β} · |X_n(t) − X_n(s)|^{2β} ] ≤ (H(t) − H(r))^{2α}.

Proof: On the event {|U| ∧ |V| ≥ γ}, both |U| ≥ γ and |V| ≥ γ, so

1{|U| ∧ |V| ≥ γ} ≤ |U|^{2β} |V|^{2β} / γ^{4β};

now take U = X_n(s) − X_n(r), V = X_n(t) − X_n(s) and expectations.

Let

ι : C → D, x ↦ ι(x) := x

be the canonical injection of C into D.

20.12 Definition (Wiener measure on D)

Let 𝒲 be Wiener measure on the Borel σ-field of C. Then the image 𝒲 ∘ ι^{−1} of 𝒲 under ι is called Wiener measure on D. We shall write W := 𝒲 ∘ ι^{−1}.


20.13 Theorem (Donsker)

Let Z_1, Z_2, … be i.i.d. random variables, E(Z_1) = 0, 0 < σ² := V(Z_1) < ∞. Let S_0 := 0, S_n := ∑_{j=1}^n Z_j, n ≥ 1, and put

X_n(t) := S_⌊nt⌋ / (σ√n), 0 ≤ t ≤ 1.

We then have X_n →_D W in D[0, 1].

Proof: We use Thm. 20.10. To show:

a): (X_n(t_1), …, X_n(t_k)) →_D (W(t_1), …, W(t_k)) ∀ k ≥ 1, ∀ t_1, …, t_k ∈ T_W.

This follows from the multivariate CLT. Notice that T_W = [0, 1] since W(C) = 1.

b): W(1) − W(1−δ) →_D δ_0. This holds since W(1) − W(1−δ) ∼ N(0, δ).

c′): E[ |X_n(s) − X_n(r)|^{2β} · |X_n(t) − X_n(s)|^{2β} ] ≤ (H(t) − H(r))^{2α}.

Notice that X_n has independent increments ⟹

E[ |X_n(s) − X_n(r)|² · |X_n(t) − X_n(s)|² ] = ((⌊ns⌋ − ⌊nr⌋)/n) · ((⌊nt⌋ − ⌊ns⌋)/n).


Memo: X_n(t) = S_⌊nt⌋ / (σ√n)

Memo: E[ |X_n(s) − X_n(r)|² |X_n(t) − X_n(s)|² ] = ((⌊ns⌋ − ⌊nr⌋)/n) · ((⌊nt⌋ − ⌊ns⌋)/n), if 0 ≤ r ≤ s ≤ t.

⟹ E[ |X_n(s) − X_n(r)|² |X_n(t) − X_n(s)|² ] ≤ ( (⌊nt⌋ − ⌊nr⌋)/n )².

If t − r ≥ 1/n, the right-hand side is ≤ 4(t − r)², since ⌊nt⌋ − ⌊nr⌋ ≤ n(t − r) + 1 ≤ 2n(t − r). (!)

If t − r < 1/n, the left-hand side is 0, because then ⌊ns⌋ = ⌊nr⌋ or ⌊nt⌋ = ⌊ns⌋.

Thus, putting β := 1,

E[ |X_n(s) − X_n(r)|^{2β} |X_n(t) − X_n(s)|^{2β} ] ≤ (H(t) − H(r))^{2α},

where H(t) = 2t and α = 1, q.e.d.
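The invariance principle behind Donsker's theorem is easy to probe numerically: a continuous functional such as sup_{0≤t≤1} |X_n(t)| = max_{k≤n} |S_k|/(σ√n) should have (almost) the same distribution for any standardized increment law once n is large. A minimal Monte Carlo sketch in Python (numpy only; sample sizes, seed and the two increment laws are our own choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sup_abs_partial_sum(n, draw, reps):
    """Monte Carlo sample of sup_t |X_n(t)| = max_k |S_k| / (sigma*sqrt(n)),
    for standardized increments (sigma = 1) produced by draw(n)."""
    out = np.empty(reps)
    for i in range(reps):
        s = np.cumsum(draw(n))
        out[i] = np.abs(s).max() / np.sqrt(n)
    return out

n, reps = 2000, 4000
rademacher = lambda n: rng.choice([-1.0, 1.0], size=n)   # mean 0, variance 1
cent_exp = lambda n: rng.exponential(1.0, size=n) - 1.0  # mean 0, variance 1

a = sup_abs_partial_sum(n, rademacher, reps)
b = sup_abs_partial_sum(n, cent_exp, reps)
for q in (0.5, 0.9, 0.95):   # the quantiles should nearly agree
    print(q, round(np.quantile(a, q), 3), round(np.quantile(b, q), 3))
```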


21 Empirical Processes: Applications to Statistics

Let X_1, X_2, … be i.i.d. rv's on (Ω, A, P), P(0 ≤ X_1 ≤ 1) = 1.

Let F(t) := P(X_1 ≤ t), F_n(t) := n^{−1} ∑_{j=1}^n 1{X_j ≤ t}.

For fixed n ≥ 1, let

Y_n : Ω → D[0, 1], ω ↦ Y_n^ω,

where

Y_n^ω(t) := √n ( (1/n) ∑_{j=1}^n 1{X_j(ω) ≤ t} − F(t) ), 0 ≤ t ≤ 1.

Let

Y_n(t) := √n (F_n(t) − F(t)), 0 ≤ t ≤ 1.

Then Y_n := (Y_n(t), 0 ≤ t ≤ 1) is a random element of D = D[0, 1].

21.1 Definition (Empirical process)

Y_n is called the empirical process based on X_1, …, X_n.

If X_1 ∼ U(0, 1), then Y_n is called the uniform empirical process.


[Figure: realization of a uniform empirical process t ↦ √n(F_n(t) − t), n = 25]
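Such a path is straightforward to generate; a minimal sketch (numpy only; grid size and seed are arbitrary choices), whose output can be fed to any plotting routine to reproduce a figure like the one above:

```python
import numpy as np

rng = np.random.default_rng(1)

def uniform_empirical_process(n, grid):
    """One path t -> sqrt(n) * (F_n(t) - t) of the uniform empirical process."""
    x = np.sort(rng.uniform(size=n))
    fn = np.searchsorted(x, grid, side="right") / n  # F_n evaluated on the grid
    return np.sqrt(n) * (fn - grid)

t = np.linspace(0.0, 1.0, 501)
path = uniform_empirical_process(25, t)
print(path.min(), path.max())
```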

21.2 Definition (Gaussian random element, Gaussian process)

A random element Y of D is called Gaussian (a Gaussian process) if, for any k ≥ 1 and any t_1, …, t_k ∈ [0, 1], the random vector (Y(t_1), …, Y(t_k)) has a k-variate normal distribution.


21.3 Theorem (Weak convergence of the empirical process)

For the empirical process Y_n(·) = √n(F_n(·) − F(·)), we have

Y_n →_D Y in D,

where Y is a Gaussian random element of D satisfying E Y(t) = 0, 0 ≤ t ≤ 1, and

Cov(Y(s), Y(t)) = F(s) ∧ F(t) − F(s)F(t), 0 ≤ s, t ≤ 1.

Proof: a) Let X_1 ∼ U(0, 1), i.e., F(t) = t. In this case, Y = B, a Brownian bridge on D (the image, under ι, of the Brownian bridge on C). Fidi convergence Y_n →_{D,fidi} B has been shown in an exercise (multivariate CLT). Thus, condition a) of Thm. 20.10 holds. Now,

B(1) − B(1−δ) ∼ N(0, δ(1−δ)) →_D δ_0 as δ ↓ 0.

Hence, condition b) of Thm. 20.10 holds. We show

E[ (Y_n(s) − Y_n(r))² · (Y_n(t) − Y_n(s))² ] ≤ 6(t − r)², 0 ≤ r ≤ s ≤ t ≤ 1.

Thus, putting α = β = 1, H(t) = √6·t, condition c′) of 20.11 holds, and the assertion is true if X_1 ∼ U(0, 1).


Memo: To show: Δ_n(r, s, t) := E[ (Y_n(s) − Y_n(r))² (Y_n(t) − Y_n(s))² ] ≤ 6(t − r)², 0 ≤ r ≤ s ≤ t ≤ 1.

Proof. Let 0 ≤ u ≤ 1. We have

Y_n(u) = √n ( (1/n) ∑_{i=1}^n 1{X_i ≤ u} − u ) = (1/√n) ∑_{i=1}^n (1{X_i ≤ u} − u) ⟹

Y_n(s) − Y_n(r) = (1/√n) ∑_{i=1}^n α_i, where α_i := 1{r < X_i ≤ s} − (s − r),

Y_n(t) − Y_n(s) = (1/√n) ∑_{k=1}^n β_k, where β_k := 1{s < X_k ≤ t} − (t − s).

Δ_n(r, s, t) = (1/n²) ∑_{i,j=1}^n ∑_{k,ℓ=1}^n E[α_i α_j β_k β_ℓ]

            = (1/n²) ( n E[α_1² β_1²] + n(n−1) E[α_1²] E[β_2²] + 2n(n−1) E[α_1 β_1] E[α_2 β_2] )


It follows that

Δ_n(r, s, t) ≤ E[α_1² β_1²] + E[α_1²] E[β_2²] + 2 E[α_1 β_1] E[α_2 β_2].

Memo: α_i = 1{r < X_i ≤ s} − (s − r), β_j = 1{s < X_j ≤ t} − (t − s).

We have

E[α_1² β_1²] = E[ (1{r < X_1 ≤ s} − (s − r))² (1{s < X_1 ≤ t} − (t − s))² ]
             = (s − r)(1 − (s − r))²(t − s)² + (t − s)(s − r)²(1 − (t − s))² + (1 − (t − r))(s − r)²(t − s)²
             ≤ 3(s − r)(t − s) ≤ 3(t − r)².

Likewise,

E[α_1²] E[β_2²] = (s − r)(1 − (s − r))(t − s)(1 − (t − s)) ≤ (s − r)(t − s) ≤ (t − r)²,

E[α_1 β_1] E[α_2 β_2] = … = (−(s − r)(t − s))² ≤ (t − r)²

⟹ Δ_n(r, s, t) ≤ 6(t − r)², q.e.d.


b) For the general case, put X_j := F^{−1}(U_j), where U_1, U_2, … are i.i.d. ∼ U(0, 1). Let

G_n(t) := (1/n) ∑_{j=1}^n 1{U_j ≤ t}, 0 ≤ t ≤ 1,

Z_n(t) := √n (G_n(t) − t), 0 ≤ t ≤ 1.

a) ⟹ Z_n →_D B, where B is a Brownian bridge.

Notice that F_n(t) = G_n(F(t)) and thus Y_n(t) = Z_n(F(t)).

Consider the mapping

ψ : D → D, x ↦ ψx, ψx(t) := x(F(t)).

Recall: d_S(x_n, x) → 0 and x ∈ C ⟹ ‖x_n − x‖_∞ → 0

⟹ ‖ψx_n − ψx‖_∞ → 0

⟹ d_S(ψx_n, ψx) → 0.

a) ⟹ Z_n →_D B. CMT ⟹ Y_n = ψ(Z_n) →_D ψ(B) =: Y. The process Y has the desired properties (Exercise!).


21.4 Goodness-of-fit tests

Let X_1, X_2, … be i.i.d. random variables with unknown distribution function F, where F is assumed to be continuous.

Let F_0 be a known continuous distribution function.

Suppose we want to test the hypothesis

H_0 : F = F_0

against the alternative H_1 : F ≠ F_0. Let

F_n(x) := (1/n) ∑_{j=1}^n 1{X_j ≤ x}, x ∈ R,

be the empirical distribution function of X_1, …, X_n.

A reasonable test statistic is the Kolmogorov test statistic

K_n := sup_{x∈R} |F_n(x) − F_0(x)|.

Let X_(1) < … < X_(n) denote the order statistics of X_1, …, X_n.

Notice that P(X_i ≠ X_j ∀ i ≠ j) = 1 since F is continuous.


Memo: H_0 : F = F_0, K_n := sup_{x∈R} |F_n(x) − F_0(x)|

We have (notice that F_n(X_(j)) = j/n)

K_n = max_{j=1,…,n} max( |F_0(X_(j)) − j/n|, |F_0(X_(j)) − (j−1)/n| ).

Put U_j := F_0(X_j), 1 ≤ j ≤ n. Then, under H_0, U_1, …, U_n are i.i.d. ∼ U(0, 1) and thus

(F_0(X_(1)), …, F_0(X_(n))) ∼ (U_(1), …, U_(n)),

where U_(1), …, U_(n) are the order statistics of U_1, …, U_n.

Consequence: Under H_0, the distribution of K_n does not depend on F_0

⟹ w.l.o.g. X_j ∼ U(0, 1).

Let B_n denote the uniform empirical process based on U_1, …, U_n. Then

√n · K_n ∼ ‖B_n‖_∞ under H_0.
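The order-statistics formula above makes K_n cheap to compute exactly. A minimal sketch (numpy only; the uniform test case is our own choice):

```python
import numpy as np

def kolmogorov_statistic(x, F0):
    """K_n = max_j max(|F0(X_(j)) - j/n|, |F0(X_(j)) - (j-1)/n|)."""
    u = np.sort(F0(np.asarray(x, dtype=float)))   # F0(X_(1)) <= ... <= F0(X_(n))
    n = u.size
    j = np.arange(1, n + 1)
    return np.max(np.maximum(np.abs(u - j / n), np.abs(u - (j - 1) / n)))

rng = np.random.default_rng(2)
x = rng.uniform(size=100)
# sqrt(n) * K_n, to be compared with quantiles of ||B||_inf
print(np.sqrt(100) * kolmogorov_statistic(x, lambda t: t))
```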


We have B_n →_D B as n → ∞, where B is a Brownian bridge.

Furthermore, the mapping h : D → R, defined by

h(x) := ‖x‖_∞,

is almost everywhere continuous with respect to B.

The CMT yields

√n · K_n →_D ‖B‖_∞ under H_0.

21.5 Definition (Kolmogorov distribution)

The distribution of K := ‖B‖_∞ is called the Kolmogorov distribution. We have

P(K ≤ x) = 1 − 2 ∑_{j=1}^∞ (−1)^{j−1} exp(−2j²x²), 0 < x < ∞.
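The series converges very fast, so critical values are easy to tabulate; a minimal sketch (the truncation level is an arbitrary choice):

```python
import numpy as np

def kolmogorov_cdf(x, terms=100):
    """P(K <= x) = 1 - 2 * sum_{j>=1} (-1)^(j-1) * exp(-2 * j^2 * x^2)."""
    j = np.arange(1, terms + 1)
    return 1.0 - 2.0 * np.sum((-1.0) ** (j - 1) * np.exp(-2.0 * j**2 * x**2))

print(kolmogorov_cdf(1.358))  # approx 0.95, so 1.358 is roughly the 5% critical value
```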


In Chapter 8, Example 8.28, we proved (using the theory of U-statistics)

∫_0^1 B_n²(t) dt →_D ω² := 1/6 + ∑_{k=1}^∞ (N_k² − 1)/(k²π²),

where N_1, N_2, … are i.i.d. ∼ N(0, 1).

Since the mapping h : D → R, defined by

h(x) := ∫_0^1 x²(t) dt,

is continuous almost everywhere with respect to B, it follows that

∫_0^1 B²(t) dt ∼ 1/6 + ∑_{k=1}^∞ (N_k² − 1)/(k²π²).

The distribution of ω² is called the Cramér–von Mises distribution.
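The series representation also yields a direct Monte Carlo approximation of the Cramér–von Mises distribution; a minimal sketch (truncation level, repetitions and seed are our own choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def cramer_von_mises_limit(reps=10000, terms=200):
    """Samples of omega^2 = 1/6 + sum_k (N_k^2 - 1)/(k^2 pi^2), truncated at `terms`."""
    k = np.arange(1, terms + 1)
    w = 1.0 / (k * np.pi) ** 2
    n2 = rng.standard_normal((reps, terms)) ** 2
    return 1.0 / 6.0 + (n2 - 1.0) @ w

omega2 = cramer_von_mises_limit()
print(omega2.mean(), np.quantile(omega2, 0.95))  # mean approx 1/6; upper 5% point approx 0.46
```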


21.6 The nonparametric two-sample problem

Let X_1, X_2, …; Y_1, Y_2, … be independent random variables, where X_1, X_2, … are i.i.d. with df F and Y_1, Y_2, … are i.i.d. with df G. F and G are assumed to be continuous but otherwise unknown.

The problem is to test the hypothesis

H_0 : F = G

against the general alternative H_1 : F ≠ G.

A reasonable test statistic is the Kolmogorov–Smirnov test statistic

K_{m,n} := sup_{x∈R} |F_m(x) − G_n(x)|,

where

F_m(x) := (1/m) ∑_{i=1}^m 1{X_i ≤ x}, G_n(x) := (1/n) ∑_{j=1}^n 1{Y_j ≤ x}

are the empirical df's of X_1, …, X_m and Y_1, …, Y_n, respectively.

Under H_0, the distribution of K_{m,n} does not depend on F. (!)
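Since both empirical df's are step functions, the supremum is attained at a pooled data point, which gives an exact computation; a minimal sketch (numpy only; the normal test samples are our own choice):

```python
import numpy as np

def ks_two_sample(x, y):
    """K_{m,n} = sup_x |F_m(x) - G_n(x)|; the sup is attained at a data point."""
    x, y = np.sort(x), np.sort(y)
    m, n = x.size, y.size
    z = np.concatenate([x, y])
    fm = np.searchsorted(x, z, side="right") / m
    gn = np.searchsorted(y, z, side="right") / n
    return np.max(np.abs(fm - gn))

rng = np.random.default_rng(4)
x, y = rng.normal(size=60), rng.normal(size=80)
# scaled statistic, to be compared with ||B||_inf quantiles (see Thm. 21.7 below)
print(np.sqrt(60 * 80 / 140) * ks_two_sample(x, y))
```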


21.7 Theorem  Under H_0, we have

√(mn/(m+n)) · K_{m,n} →_D ‖B‖_∞ as m, n → ∞,

where B is a Brownian bridge.

Proof: W.l.o.g. let X_i, Y_j ∼ U(0, 1). Then, putting

a_{m,n} = √(n/(m+n)), c_{m,n} = −√(m/(m+n)) (⟹ a_{m,n}² + c_{m,n}² = 1),

we have under H_0

√(mn/(m+n)) · K_{m,n} ∼ sup_{0≤t≤1} | a_{m,n} √m (F_m(t) − t) + c_{m,n} √n (G_n(t) − t) | = ‖a_{m,n} A_m + c_{m,n} C_n‖_∞,

where A_m and C_n are independent uniform empirical processes.

By the independence of A_m and C_n, we have (A_m, C_n) →_D (A, C), where A and C are independent Brownian bridges.


Memo: √(mn/(m+n)) · K_{m,n} ∼ ‖a_{m,n} A_m + c_{m,n} C_n‖_∞, (A_m, C_n) →_D (A, C)

Memo: a_{m,n}² + c_{m,n}² = 1

CMT ⟹ aA_m + cC_n →_D aA + cC, a, c ∈ R.

If a² + c² = 1, then, by the reproduction Theorem 19.7, aA + cC ∼ B, where B is a Brownian bridge.

If a_{m,n} → a and c_{m,n} → c, then

‖a_{m,n} A_m + c_{m,n} C_n − (aA_m + cC_n)‖_∞ = ‖(a_{m,n} − a) A_m + (c_{m,n} − c) C_n‖_∞
                                             ≤ |a_{m,n} − a| · ‖A_m‖_∞ + |c_{m,n} − c| · ‖C_n‖_∞ = o_P(1).

The assertion now follows from the subsequence criterion, q.e.d.


21.8 Remark  In the case m = n, the limit distribution of √(mn/(m+n)) · K_{m,n}, i.e., the distribution of ‖B‖_∞, can be obtained by elementary methods (simple symmetric random walk and the reflection principle; see, e.g., Henze, N.: Irrfahrten und andere Zufälle, Springer Spektrum 2013, p. 152 ff.).

In the same way, one can derive the following result.

21.9 Theorem (The distribution of sup_{0≤t≤1} B(t))

We have

P( sup_{0≤t≤1} B(t) ≤ x ) = 1 − exp(−2x²), x ≥ 0.

Proof. Let X_1, …, X_n, …; Y_1, …, Y_n, … be i.i.d. ∼ U(0, 1),

F_n(t) := (1/n) ∑_{j=1}^n 1{X_j ≤ t}, G_n(t) := (1/n) ∑_{j=1}^n 1{Y_j ≤ t}.


Let

U_n(t) := √n (F_n(t) − t), V_n(t) := √n (G_n(t) − t).

By Donsker's Theorem, U_n →_D B_1, V_n →_D B_2, where B_1 and B_2 are independent Brownian bridges. Put

B_n(t) := (1/√2)(U_n(t) − V_n(t)) = √(n/2) (F_n(t) − G_n(t)).

By independence of B_1, B_2 and the CMT,

B_n →_D B := (1/√2)(B_1 − B_2).

Theorem 19.7 ⟹ B is a Brownian bridge.

Let Z_(1) < … < Z_(2n) denote the order statistics of X_1, …, X_n, Y_1, …, Y_n.

Notice that sup_{0≤t≤1} (F_n(t) − G_n(t)) only depends on whether, for each j ∈ {1, …, 2n}, Z_(j) belongs to the X- or the Y-sample

⟹ w.l.o.g. jumps of the EDF's at equidistant points.


[Figure: t ↦ F_n(t) − G_n(t), a step function with jumps ±1/n at the points of the ordered pooled sample]

W.l.o.g. abscissa values 0, 1, 2, …, 2n.

Choose n of the points 0, 1, …, 2n−1 as times for unit "up-steps".

The other points are the times for unit "down-steps".

All (2n choose n) ways of choosing "up-step times" are equiprobable.


Model: W_{2n} := {(a_1, …, a_{2n}) ∈ {−1, 1}^{2n} : a_1 + … + a_{2n} = 0}.

Let P be the uniform distribution on W_{2n}, i.e.,

P(V_1 = a_1, …, V_{2n} = a_{2n}) = 1/(2n choose n), if (a_1, …, a_{2n}) ∈ W_{2n},

and P(V_1 = a_1, …, V_{2n} = a_{2n}) = 0, otherwise.

V_j models the direction of the step (+1 or −1) at time j − 1.

Let S_0 := 0, S_k := V_1 + … + V_k, if 1 ≤ k ≤ 2n. Let

M_{2n} := max_{k=0,…,2n} S_k.

Then

sup_{0≤t≤1} B_n(t) = √(n/2) · sup_{0≤t≤1} (F_n(t) − G_n(t)) ∼ (1/√(2n)) · M_{2n}.


Claim: For each x > 0 we have

lim_{n→∞} P( M_{2n}/√(2n) ≤ x ) = 1 − exp(−2x²). (q.e.d.)

Proof. We first show

P(M_{2n} ≥ k) = (2n choose n+k) / (2n choose n), k = 0, 1, …, n.

There is a bijection between paths from W_{2n} having M_{2n} ≥ k and paths from (0, 0) to (2n, 2k)!

[Figure: reflection of a path in the level k after its first passage to k]


Memo: P(M_{2n} ≥ k) = (2n choose n+k) / (2n choose n), k = 0, 1, …, n.

Fix x > 0. Let k_n := ⌈x√(2n)⌉. We have

P( M_{2n}/√(2n) ≥ x ) = P(M_{2n} ≥ k_n) = (2n choose n+k_n) / (2n choose n) = ∏_{j=0}^{k_n−1} ( 1 − k_n/(n − j + k_n) ).

Use 1 − 1/t ≤ log t ≤ t − 1 to show

log P( M_{2n}/√(2n) ≥ x ) ≤ −k_n ∑_{j=0}^{k_n−1} 1/(n − j + k_n) ≤ −k_n²/(n + k_n),

log P( M_{2n}/√(2n) ≥ x ) ≥ −k_n ∑_{j=0}^{k_n−1} 1/(n − j) ≥ −k_n²/(n − k_n + 1).

Since

lim_{n→∞} k_n²/(n + k_n) = 2x² = lim_{n→∞} k_n²/(n − k_n + 1),

the assertion follows.
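A quick Monte Carlo check of the claim: draw uniform elements of W_{2n} as random permutations of n up-steps and n down-steps and compare the empirical distribution of M_{2n}/√(2n) with 1 − exp(−2x²). A minimal sketch (walk length, repetitions and seed are our own choices):

```python
import numpy as np

rng = np.random.default_rng(5)

def max_of_walk_bridge(n, reps):
    """Samples of M_2n / sqrt(2n) for the uniform random walk bridge of length 2n."""
    out = np.empty(reps)
    steps = np.concatenate([np.ones(n), -np.ones(n)])
    for i in range(reps):
        s = np.cumsum(rng.permutation(steps))   # uniform element of W_2n
        out[i] = max(0.0, s.max()) / np.sqrt(2 * n)  # include S_0 = 0 in the maximum
    return out

m = max_of_walk_bridge(n=500, reps=4000)
for x in (0.5, 1.0, 1.5):
    print(x, round(np.mean(m <= x), 3), round(1 - np.exp(-2 * x**2), 3))
```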


22 Gaussian distributions in separable Hilbert spaces

22.1 Hilbert spaces: Basic facts

Let H be a separable real Hilbert space with scalar (inner) product ⟨x, y⟩, x, y ∈ H, and norm ‖x‖ := √⟨x, x⟩.

If {e_1, e_2, …} is a complete orthonormal system of H, then, for x, y ∈ H:

x = ∑_{k=1}^∞ ⟨x, e_k⟩ e_k  ( :⇐⇒ lim_{n→∞} ‖x − ∑_{k=1}^n ⟨x, e_k⟩ e_k‖ = 0 ),

‖x‖² = ∑_{k=1}^∞ ⟨x, e_k⟩² (Parseval's equality),

⟨x, y⟩² ≤ ‖x‖² · ‖y‖² (Cauchy–Schwarz inequality),

⟨x, y⟩ = ∑_{k=1}^∞ ⟨x, e_k⟩ ⟨y, e_k⟩ (generalized Parseval equality).

The metric ρ(x, y) = ‖x − y‖ renders (H, ρ) a complete separable metric space.

As before, let O denote the system of open subsets of H and B := σ(O) the σ-field of Borel sets.


22.2 Examples

a) H := ℓ² := {x = (x_k)_{k≥1} ∈ R^N : ∑_{k=1}^∞ x_k² < ∞}, ⟨x, y⟩ := ∑_{k=1}^∞ x_k y_k.

b) Let (Ω, A, µ) be a σ-finite measure space, where A = σ(M) for a countable system M ⊂ P(Ω). Let H := L²(Ω, A, µ) be the set of (equivalence classes of) measurable functions f : Ω → R satisfying ∫_Ω f² dµ < ∞. Here,

⟨f, g⟩ = ∫_Ω f g dµ.

Notice that a) is a special case of b). (why?)

Each infinite-dimensional separable Hilbert space is isomorphic to ℓ², since

H ∋ x ⟷ (⟨x, e_k⟩)_{k≥1} ∈ ℓ².


22.3 Definition (Properties of operators)

An operator is a function T : H → H. T is called

linear, if T(ax + by) = aTx + bTy, x, y ∈ H, a, b ∈ R,

bounded, if ‖Tx‖ ≤ K · ‖x‖, x ∈ H, for some K ∈ [0, ∞),

compact, if T(M) is relatively compact whenever M ⊂ H is bounded,

symmetric, if ⟨Tx, y⟩ = ⟨x, Ty⟩, x, y ∈ H,

positive, if ⟨Tx, x⟩ ≥ 0, x ∈ H.

A compact linear operator is called of trace class, if

∑_{k=1}^∞ |⟨e_k, T e_k⟩| < ∞, (22.1)

where {e_1, e_2, …} is a complete orthonormal system (COS) of H.

If T is of trace class, then

tr(T) := ∑_{k=1}^∞ ⟨e_k, T e_k⟩

is called the trace of T.

Condition (22.1) and tr(T) do not depend on the special choice of a COS.


Memo: A linear mapping ℓ : H → R is called a linear functional.

Memo: ℓ is bounded if ‖ℓ‖ := sup{|ℓ(x)| : x ∈ H, ‖x‖ = 1} < ∞.

22.4 Theorem (Riesz's representation theorem)

If ℓ is a bounded linear functional, there is a unique z ∈ H with

ℓ(x) = ⟨z, x⟩, x ∈ H.

Moreover, ‖ℓ‖ = ‖z‖.


22.5 Finite-dimensional sets

Fix a complete orthonormal set {e_1, e_2, …} of H. For k ∈ N, let

π_k : H → R^k, x ↦ π_k(x) := π_k x := (⟨x, e_1⟩, …, ⟨x, e_k⟩).

M := ⋃_{k=1}^∞ π_k^{−1}(B^k) is called the system of finite-dimensional sets.

M is a π-system (!), and we have σ(M) = B.

Proof. Since π_k is continuous (!), we have σ(M) ⊂ B.

For x ∈ H, ε > 0, let B(x, ε) := {y ∈ H : ‖x − y‖ < ε}. For k ≥ 1, we have

C_k(x, ε) := {y ∈ H : ∑_{n=1}^k ⟨x − y, e_n⟩² ≤ ε²}
           = {y ∈ H : ‖π_k x − π_k y‖_2² ≤ ε²} (Euclidean norm in R^k)
           = π_k^{−1}( {z ∈ R^k : ‖z − π_k x‖_2 ≤ ε} ) ∈ M.

Parseval ⟹ ⋂_{k=1}^∞ C_k(x, ε) = {y ∈ H : ‖x − y‖ ≤ ε} ∈ σ(M).

It follows that B(x, ε) = ⋃_{m=1}^∞ {y ∈ H : ‖x − y‖ ≤ ε − 1/m} ∈ σ(M), q.e.d.


Memo: M := ⋃_{k=1}^∞ π_k^{−1}(B^k), M a π-system, σ(M) = B.

22.6 Corollary

a) Let P be a probability measure on B. Then P is uniquely determined by the distributions P ∘ π_k^{−1} on B^k, k ≥ 1, the so-called finite-dimensional distributions of P.

b) Let (Ω, A, P) be a probability space. Suppose X : Ω → H is a random element of H, i.e., an (A, B)-measurable mapping.

Then the distribution P_X = P ∘ X^{−1} of X is uniquely determined by the distributions of the k-dimensional random vectors

(⟨X, e_1⟩, …, ⟨X, e_k⟩), k ≥ 1.

Here, {e_1, e_2, …} is any complete orthonormal set of H.


22.7 Definition (Expectation of an H-valued random element)

Let X be an H-valued random element (on some probability space (Ω, A, P)) satisfying E|⟨X, x⟩| < ∞, x ∈ H. Suppose there is m ∈ H with

⟨m, x⟩ = E⟨X, x⟩ ∀ x ∈ H.

Then m is called the expectation of X, and we write EX = m.

We thus have

⟨EX, x⟩ = E⟨X, x⟩, x ∈ H.

X is called centered, if EX = 0 (the zero vector in H).

Convince yourself that (!)

EX is uniquely determined (if it exists),

if H = R^d, EX is the expectation of the d-dimensional random vector X, as given in Definition 5.1 a).


Memo: ⟨EX, x⟩ = E⟨X, x⟩, x ∈ H.

22.8 Theorem  If E‖X‖ < ∞, then EX exists.

Proof. Fix x ∈ H. We have

E|⟨X, x⟩| ≤ E(‖X‖ · ‖x‖) = ‖x‖ · E‖X‖ < ∞.

Hence, E⟨X, x⟩ exists for each x ∈ H. Put ℓ(x) := E⟨X, x⟩, x ∈ H.

Then ℓ : H → R is a well-defined linear functional on H.

Moreover, |ℓ(x)| ≤ E‖X‖ · ‖x‖, x ∈ H, shows that ℓ is bounded.

Riesz's representation theorem ⟹ there is a unique m ∈ H with

ℓ(x) = ⟨m, x⟩, x ∈ H, q.e.d.

22.9 Remark  We have ‖EX‖ ≤ E‖X‖. Can you prove this fact?


22.10 Theorem (Linearity of expectations)

Let (Ω, A, P) be a probability space, and let L¹ be the set of all H-valued random elements X : Ω → H satisfying E‖X‖ < ∞.

Then L¹ is a vector space (over R), and we have

E[aX + bY] = aEX + bEY, a, b ∈ R, X, Y ∈ L¹.

Proof. We first show that, if X and Y are random elements of H and a, b ∈ R, then aX + bY is a random element of H, i.e., (A, B)-measurable (!). Recall

M := ⋃_{k=1}^∞ π_k^{−1}(B^k), π_k x := (⟨x, e_1⟩, …, ⟨x, e_k⟩), σ(M) = B.

Fix M ∈ M ⟹ ∃ k ∃ B_k ∈ B^k : M = π_k^{−1}(B_k). Now,

(aX + bY)^{−1}(M) = (aX + bY)^{−1}(π_k^{−1}(B_k)) = (π_k(aX + bY))^{−1}(B_k),

and π_k(aX + bY) = a(⟨X, e_1⟩, …, ⟨X, e_k⟩) + b(⟨Y, e_1⟩, …, ⟨Y, e_k⟩) is (A, B^k)-measurable.

Thus, (aX + bY)^{−1}(M) ⊂ A. Since σ(M) = B, the assertion follows.

The rest of the proof is an exercise!


22.11 Theorem

Let X be an H-valued random element. If E‖X‖² < ∞, there is a unique symmetric positive linear operator T : H → H of trace class that satisfies

⟨Tx, y⟩ = E[⟨X, x⟩⟨X, y⟩], x, y ∈ H.

Moreover, we have tr(T) = E‖X‖².

Proof. Fix x, y ∈ H. Cauchy–Schwarz ⟹

E|⟨X, x⟩⟨X, y⟩| ≤ ‖x‖ · ‖y‖ · E‖X‖² < ∞.

Thus,

ℓ(x, y) := E[⟨X, x⟩⟨X, y⟩], x, y ∈ H,

defines a bilinear, symmetric and bounded functional ℓ : H × H → R.

Fix x ∈ H, and put ℓ_x(y) := ℓ(x, y), y ∈ H. Then ℓ_x is a bounded linear functional. Riesz ⟹ there is a unique element Tx := T(x) ∈ H with

E[⟨X, x⟩⟨X, y⟩] = ℓ(x, y) = ℓ_x(y) = ⟨Tx, y⟩, y ∈ H.

T : H → H is symmetric (√), positive (√) and linear (Exercise!).


Memo: ⟨Tx, y⟩ = E[⟨X, x⟩⟨X, y⟩]

Memo: T : H → H linear, symmetric and positive.

T is of trace class, since, for a fixed complete orthonormal set {e_1, e_2, …},

E‖X‖² = E[ ∑_{k=1}^∞ ⟨X, e_k⟩² ] (why?)
      = ∑_{k=1}^∞ E[⟨X, e_k⟩²] (why?)
      = ∑_{k=1}^∞ ⟨Te_k, e_k⟩ < ∞.

Notice that E‖X‖² < ∞ implies E‖X‖ < ∞ (why?).

In particular, EX exists.


22.12 Theorem and Definition (Covariance operator)

Suppose X is an H-valued random element and E‖X‖² < ∞. Then there is a unique positive symmetric linear operator Σ : H → H satisfying

⟨Σx, y⟩ = E[⟨X − EX, x⟩⟨X − EX, y⟩], x, y ∈ H. (22.2)

The operator Σ is called the covariance operator of (the distribution of) X.

Proof. Theorem 22.11 ⟹ ∃! operator T : H → H, T linear, symmetric, positive, of trace class, satisfying

⟨Tx, y⟩ = E[⟨X, x⟩⟨X, y⟩], x, y ∈ H.

Put Σx := Σ(x) := Tx − ⟨EX, x⟩EX, x ∈ H. Then Σ is linear, symmetric, positive, and (22.2) holds. (Exercise!)

22.13 Remark  The covariance operator is of trace class. (Exercise!)


Memo: ⟨Σx, y⟩ = E[⟨X − EX, x⟩⟨X − EX, y⟩], x, y ∈ H.

22.14 Corollary  If H = R^d and thus X is a d-dimensional random vector, the covariance operator Σ of X is equal to the covariance matrix of X, viewed as a linear operator acting on column vectors.

Proof. Let X = (X_1, …, X_d)^⊤, EX = (EX_1, …, EX_d)^⊤.

Putting x = (x_1, …, x_d)^⊤, y = (y_1, …, y_d)^⊤, Σ = (σ_ij)_{1≤i,j≤d}, the left-hand side of the memo becomes

⟨Σx, y⟩ = ∑_{i=1}^d ∑_{j=1}^d σ_ij x_i y_j.

Since

⟨X − EX, x⟩ · ⟨X − EX, y⟩ = ∑_{i=1}^d ∑_{j=1}^d (X_i − EX_i) x_i (X_j − EX_j) y_j,

the right-hand side is

∑_{i=1}^d ∑_{j=1}^d Cov(X_i, X_j) x_i y_j, q.e.d.


22.15 Theorem (Covariance operators and independence)

Let X and Y be independent random elements of H with E‖X‖² < ∞ and E‖Y‖² < ∞. Writing Σ(Z) for the covariance operator of a random element Z, we then have:

Σ(X + Y) = Σ(X) + Σ(Y).

Proof. Fix x, y ∈ H, and put X̃ = X − EX, Ỹ = Y − EY. We have to show

⟨Σ(X + Y)x, y⟩ = ⟨(Σ(X) + Σ(Y))x, y⟩.

Now, since E(X + Y) = EX + EY and

E⟨Ỹ, x⟩ = E⟨Y − EY, x⟩ = E⟨Y, x⟩ − ⟨EY, x⟩ = 0,

we have

⟨Σ(X + Y)x, y⟩ = E[⟨X + Y − E(X + Y), x⟩ · ⟨X + Y − E(X + Y), y⟩]
              = E[⟨X̃ + Ỹ, x⟩ · ⟨X̃ + Ỹ, y⟩]
              = E[⟨X̃, x⟩⟨X̃, y⟩] + E[⟨Ỹ, x⟩⟨X̃, y⟩] + E[⟨X̃, x⟩⟨Ỹ, y⟩] + E[⟨Ỹ, x⟩⟨Ỹ, y⟩]
              = ⟨Σ(X)x, y⟩ + ⟨Σ(Y)x, y⟩ = ⟨(Σ(X) + Σ(Y))x, y⟩,

since the two mixed terms vanish by independence and centering. √


22.16 Definition (Characteristic functional)

Suppose X is an H-valued random element. Then the function

φ_X : H → C, x ↦ φ_X(x) := E[e^{i⟨X, x⟩}] = ∫_H e^{i⟨y, x⟩} P_X(dy)

is called the characteristic functional of (the distribution of) X.

22.17 Theorem (Properties of φ_X)

The characteristic functional has the following properties:

a) φ_X(0) = 1 (0 is the zero vector in H),

b) φ_X is continuous, (why?)

c) φ_X is positive-semidefinite, i.e.,

∑_{k,ℓ=1}^n α_k ᾱ_ℓ φ_X(x_k − x_ℓ) ≥ 0 ∀ n ≥ 1, ∀ x_1, …, x_n ∈ H, ∀ α_1, …, α_n ∈ C,

d) If X and Y are independent, then φ_{X+Y} = φ_X · φ_Y, (why?)

e) φ_X = φ_Y ⇐⇒ X =_D Y.


Proof of c): Notice that

0 ≤ E| ∑_{k=1}^n α_k e^{i⟨X, x_k⟩} |² = E[ ∑_{k,ℓ=1}^n α_k ᾱ_ℓ e^{i⟨X, x_k − x_ℓ⟩} ] = ∑_{k,ℓ=1}^n α_k ᾱ_ℓ φ_X(x_k − x_ℓ).

Proof of e): Let {e_1, e_2, …} be some COS of H. Put

X_k := (⟨X, e_1⟩, …, ⟨X, e_k⟩)^⊤, Y_k := (⟨Y, e_1⟩, …, ⟨Y, e_k⟩)^⊤.

Put x = a_1 e_1 + … + a_k e_k, where a_1, …, a_k ∈ R. Then

E[exp( i ∑_{j=1}^k a_j ⟨X, e_j⟩ )] = E[exp( i ⟨X, ∑_{j=1}^k a_j e_j⟩ )] = φ_X( ∑_{j=1}^k a_j e_j ) = φ_Y( ∑_{j=1}^k a_j e_j )
= E[exp( i ⟨Y, ∑_{j=1}^k a_j e_j⟩ )] = E[exp( i ∑_{j=1}^k a_j ⟨Y, e_j⟩ )],

i.e., φ_{X_k}(a) = φ_{Y_k}(a) ∀ a = (a_1, …, a_k) ∈ R^k ⟹ X_k =_D Y_k. 22.6 ⟹ assertion.


22.18 Proposition (Characteristic function of N_d(m, Σ))

Let m ∈ R^d, Σ ∈ R^{d×d} symmetric, positive-semidefinite. We then have

X ∼ N_d(m, Σ) ⇐⇒ φ_X(t) = E[e^{it^⊤X}] = exp( it^⊤m − (1/2) t^⊤Σt ), t ∈ R^d.

Proof. "⇐" follows from the uniqueness theorem for characteristic functions.

"⟹": 5.8 ⟹ ∃ A : Σ = AA^⊤ and X =_D AY + m, Y ∼ N_d(0, I_d).

Fix t ∈ R^d, and put z := A^⊤t. Notice that ‖z‖² = t^⊤AA^⊤t = t^⊤Σt. Then

φ_X(t) = E[e^{it^⊤(AY + m)}] = e^{it^⊤m} · E[e^{i(A^⊤t)^⊤Y}] = e^{it^⊤m} · E[e^{iz^⊤Y}]
       = e^{it^⊤m} · E[exp( i ∑_{k=1}^d z_k Y_k )] = e^{it^⊤m} · E[ ∏_{k=1}^d exp(iz_k Y_k) ]
       = e^{it^⊤m} ∏_{k=1}^d E[exp(iz_k Y_k)] = e^{it^⊤m} ∏_{k=1}^d exp(−z_k²/2) = e^{it^⊤m} · e^{−‖z‖²/2}, q.e.d.


L⁺_tr(H) := {T : H → H | T linear, bounded, symmetric, positive, of trace class}.

22.19 Definition (Gaussian (normal) distribution in H)

An H-valued random element X has a Gaussian (normal) distribution :⇐⇒

∃ m ∈ H ∃ Σ ∈ L⁺_tr(H) : φ_X(h) = e^{i⟨m, h⟩} exp( −⟨Σh, h⟩/2 ), h ∈ H.

In this case, we write X ∼ N(m, Σ).

If {x ∈ H : Σx = 0} = {0}, the distribution N(m, Σ) is called non-degenerate.

If m = 0, the distribution N(m, Σ) is called centered.

22.20 Theorem (Existence of Gaussian distributions)

For each m ∈ H and Σ ∈ L⁺_tr(H), there is a Gaussian distribution N(m, Σ).

Proof. Σ symmetric and compact ⟹ ∃ COS {e_1, e_2, …} of H and λ_1, λ_2, … ≥ 0 with Σe_k = λ_k e_k, k ≥ 1. Notice that

tr(Σ) = ∑_{k=1}^∞ ⟨Σe_k, e_k⟩ = ∑_{k=1}^∞ λ_k < ∞.


Put m_k = ⟨m, e_k⟩, k ≥ 1. Let

P* := ⨂_{k=1}^∞ N_1(m_k, λ_k)

be the infinite product measure on the product Borel σ-field B^∞ of R^∞.

Notice that ℓ² is a Borel subset of R^∞ = {x = (x_j)_{j≥1} : x_j ∈ R ∀ j ≥ 1}. (!)

Claim 1: P* is concentrated on ℓ², i.e.,

P*( {x ∈ R^∞ : ‖x‖²_{ℓ²} < ∞} ) = 1.

Proof of Claim 1: We have

∫_{R^∞} ‖x‖²_{ℓ²} P*(dx) = ∫_{R^∞} ∑_{k=1}^∞ x_k² P*(dx) = ∑_{k=1}^∞ ∫_{R^∞} x_k² P*(dx) (why?)
= ∑_{k=1}^∞ ∫_R x_k² N_1(m_k, λ_k)(dx_k) = ∑_{k=1}^∞ (λ_k + m_k²)
= tr(Σ) + ‖m‖² < ∞, q.e.d. (why?)


Memo: Probability space (R^∞, B^∞, P*), P* = ⨂_{j=1}^∞ N_1(m_j, λ_j), P*(ℓ²) = 1.

Let P̃ be the restriction of P* to B(ℓ²). Consider the mapping

γ : ℓ² → H, x = (x_j)_{j≥1} ↦ γ(x) := ∑_{j=1}^∞ x_j e_j.

Notice that, for x = (x_j)_{j≥1}, y = (y_j)_{j≥1} ∈ ℓ²,

⟨γ(x), γ(y)⟩ = ⟨ ∑_{j=1}^∞ x_j e_j, ∑_{k=1}^∞ y_k e_k ⟩ = ∑_{j=1}^∞ ∑_{k=1}^∞ x_j y_k ⟨e_j, e_k⟩ = ∑_{j=1}^∞ x_j y_j = ⟨x, y⟩_{ℓ²}.

It follows that γ is an isometry between ℓ² and H. Let

P := γ(P̃) = P̃ ∘ γ^{−1}

be the image (probability) measure of P̃ under the (measurable) mapping γ.


Memo: γ(x) = ∑_{j=1}^∞ x_j e_j, x = (x_j)_{j≥1} ∈ ℓ²; γ^{−1}(h) = (⟨h, e_j⟩)_{j≥1}, h ∈ H.

Claim 2: P (= γ(P̃)) = N(m, Σ). Proof of Claim 2: We show

∫_H e^{i⟨y, h⟩} P(dy) = exp( i⟨m, h⟩ − (1/2)⟨Σh, h⟩ ), h ∈ H. (q.e.d.)

Fix h ∈ H. We have

∫_H e^{i⟨y, h⟩} P(dy) = ∫_{ℓ²} e^{i⟨γ(x), h⟩} P̃(dx) (transformation of integrals)
= ∫_{R^∞} e^{i⟨γ(x), h⟩} P*(dx) (P̃ = P*|_{ℓ²}, P*(ℓ²) = 1)
= ∫_{R^∞} exp( i⟨x, γ^{−1}(h)⟩_{ℓ²} ) P*(dx) (γ isometry)
= ∫_{R^∞} exp( i ∑_{k=1}^∞ x_k ⟨h, e_k⟩ ) P*(dx) (memo)
= lim_{n→∞} ∫_{R^∞} exp( i ∑_{k=1}^n x_k ⟨h, e_k⟩ ) P*(dx). (why?)


Memo: ∫_R e^{its} N_1(a, σ²)(dt) = exp( ias − σ²s²/2 ), Σe_k = λ_k e_k

∫_H e^{i⟨y, h⟩} P(dy) = lim_{n→∞} ∫_{R^∞} exp( i ∑_{k=1}^n x_k ⟨h, e_k⟩ ) P*(dx)
= lim_{n→∞} ∫_{R^∞} ∏_{k=1}^n e^{ix_k⟨h, e_k⟩} P*(dx)
= lim_{n→∞} ∏_{k=1}^n ∫_R e^{ix_k⟨h, e_k⟩} N_1(m_k, λ_k)(dx_k)
= lim_{n→∞} ∏_{k=1}^n exp( im_k⟨h, e_k⟩ − (1/2)λ_k⟨h, e_k⟩² )
= lim_{n→∞} exp( i ∑_{k=1}^n ⟨m, e_k⟩⟨h, e_k⟩ − (1/2) ∑_{k=1}^n λ_k⟨h, e_k⟩² ).

Now,

∑_{k=1}^n ⟨m, e_k⟩⟨h, e_k⟩ → ∑_{k=1}^∞ ⟨m, e_k⟩⟨e_k, h⟩ = ⟨m, h⟩ (generalized Parseval),

∑_{k=1}^n λ_k⟨h, e_k⟩² = ∑_{k=1}^n ⟨Σh, e_k⟩⟨h, e_k⟩ → ∑_{k=1}^∞ ⟨Σh, e_k⟩⟨e_k, h⟩ = ⟨Σh, h⟩.


Memo: X ∼ N(m, Σ) ⇐⇒ E[e^{i⟨X, h⟩}] = exp( i⟨m, h⟩ − ⟨Σh, h⟩/2 ), h ∈ H.

22.21 Theorem (Properties of Gaussian distributions)

Suppose X ∼ N(m, Σ). We then have:

a) ⟨X, h⟩ ∼ N_1(⟨m, h⟩, ⟨Σh, h⟩) ∀ h ∈ H,

b) ∀ k ≥ 1, ∀ h_1, …, h_k ∈ H: (⟨X, h_1⟩, …, ⟨X, h_k⟩)^⊤ has a k-variate normal distribution,

c) E‖X‖² < ∞,

d) EX = m,

e) Σ(X) = Σ.

Proof. a) In the memo, replace h with th, where t ∈ R. Then

φ_{⟨X,h⟩}(t) = E[e^{it⟨X, h⟩}] = E[e^{i⟨X, th⟩}] = exp( i⟨m, h⟩t − ⟨Σh, h⟩t²/2 ), q.e.d.

b) Let a_1, …, a_k ∈ R. In a), put h = a_1h_1 + … + a_kh_k. Then ∑_{j=1}^k a_j⟨X, h_j⟩ has a univariate normal distribution, q.e.d.


c) E‖X‖² < ∞: Since ⟨X, e_k⟩ ∼ N_1(⟨m, e_k⟩, λ_k), we have

E‖X‖² = E[ ∑_{k=1}^∞ ⟨X, e_k⟩² ] = ∑_{k=1}^∞ E[⟨X, e_k⟩²] = ∑_{k=1}^∞ (λ_k + ⟨m, e_k⟩²)
      = ∑_{k=1}^∞ λ_k + ∑_{k=1}^∞ ⟨m, e_k⟩² = tr(Σ) + ‖m‖² < ∞.

d) EX = m: From a) we have E⟨X, h⟩ = ⟨m, h⟩ ∀ h ∈ H, q.e.d.

e) Σ(X) = Σ: To show: ⟨Σx, y⟩ = E[⟨X − EX, x⟩⟨X − EX, y⟩], x, y ∈ H.

Notice that ⟨X, x + y⟩ ∼ N_1(⟨m, x + y⟩, ⟨Σ(x + y), x + y⟩) ⟹

V(⟨X, x⟩ + ⟨X, y⟩) = ⟨Σ(x + y), x + y⟩ = ⟨Σx, x⟩ + ⟨Σy, y⟩ + 2⟨Σx, y⟩
                   = V(⟨X, x⟩) + V(⟨X, y⟩) + 2 Cov(⟨X, x⟩, ⟨X, y⟩)

⟹ ⟨Σx, y⟩ = Cov(⟨X, x⟩, ⟨X, y⟩) = Cov(⟨X − EX, x⟩, ⟨X − EX, y⟩) = E[⟨X − EX, x⟩ · ⟨X − EX, y⟩], q.e.d.


22.22 Theorem (Characterization of Gaussian distributions)

The following assertions are equivalent:

a) X is a Gaussian random element of H,

b) ⟨X, h⟩ has a univariate normal distribution for each h ∈ H.

Notice that b) implies (⟨X, h_1⟩, …, ⟨X, h_k⟩)^⊤ ∼ N_k ∀ k ≥ 1, ∀ h_1, …, h_k ∈ H. (why?)

Corollary 22.6 b) ⟹ P_X is uniquely determined by property b).

Proof. "⟹" follows from Theorem 22.21 a).

"⇐": We have E‖X‖² < ∞ (without proof, not trivial!). Thm. 22.8 and Thm. 22.12 ⟹ m := EX and Σ := Σ(X) exist.

b) ⟹ ∀ h ∈ H ∃ m_h ∈ R, σ_h² ≥ 0 : ⟨X, h⟩ ∼ N_1(m_h, σ_h²) ⟹

φ_X(h) = E[e^{i⟨X, h⟩}] = φ_{⟨X,h⟩}(1) = exp( im_h − σ_h²/2 ).

Now, m_h = E⟨X, h⟩ = ⟨EX, h⟩ = ⟨m, h⟩. Furthermore,

σ_h² = V(⟨X, h⟩) = E[⟨X − EX, h⟩ · ⟨X − EX, h⟩] = ⟨Σh, h⟩, q.e.d.


22.23 Theorem (The distribution of ‖X − m‖²)

Suppose X ∼ N(m, Σ) is a Gaussian random element of H. Then

‖X − m‖² =_D ∑_{j=1}^∞ λ_j N_j²,

where λ_1, λ_2, … are the eigenvalues of the covariance operator Σ, and N_1, N_2, … are i.i.d. standard normal random variables.

Proof. W.l.o.g. let m = 0 and λ_j > 0 for each j.

Thm. 22.20 ⟹ ∃ COS {e_1, e_2, …} of H with Σe_j = λ_j e_j, j ≥ 1.

Let Ñ_j := ⟨X, e_j⟩, j ≥ 1. Fix k ≥ 1. Thm. 22.22 ⟹ (Ñ_1, …, Ñ_k) ∼ N_k.

Notice that E(Ñ_j) = 0, and that

E(Ñ_i Ñ_j) = E[⟨X, e_i⟩⟨X, e_j⟩] = ⟨Σe_i, e_j⟩ = λ_i ⟨e_i, e_j⟩.

Hence, Ñ_1, Ñ_2, … are independent random variables (!), and Ñ_j ∼ N(0, λ_j).

Put N_j := Ñ_j/√λ_j, j ≥ 1. Then N_j ∼ N(0, 1), and

‖X‖² = ∑_{j=1}^∞ ⟨X, e_j⟩² = ∑_{j=1}^∞ Ñ_j² = ∑_{j=1}^∞ λ_j N_j², q.e.d.
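This representation makes the limit laws of L²-type statistics easy to simulate once the eigenvalues are known; a minimal sketch (numpy only; the truncation level and the Brownian-bridge eigenvalues λ_k = 1/(k²π²) are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)

def norm_sq_gaussian(lam, reps=10000):
    """Samples of ||X - m||^2 = sum_j lam_j * N_j^2 for a (truncated) eigenvalue sequence."""
    lam = np.asarray(lam, dtype=float)
    return (rng.standard_normal((reps, lam.size)) ** 2) @ lam

k = np.arange(1, 201)
samples = norm_sq_gaussian(1.0 / (k * np.pi) ** 2)  # Brownian bridge in L^2[0,1]
print(samples.mean())  # approx sum lam_k = 1/6, the Cramer-von Mises mean from Chapter 21
```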


22.24 Gaussian processes and Gaussian random elements in L²(R^d, B^d, µ)

Let

H := L² := L²(R^d, B^d, µ), µ a σ-finite measure on B^d,

⟨g, h⟩ := ∫_{R^d} g(t)h(t) µ(dt), g, h ∈ L².

All random elements will be defined on a common probability space (Ω, A, P).

A Gaussian process Z on R^d is a family Z = (Z(t))_{t∈R^d} of random variables Z(·, t) : Ω → R such that, for each k ≥ 1 and each choice of t_1, …, t_k ∈ R^d, the random vector (Z(t_1), …, Z(t_k)) has some k-variate normal distribution.

The distribution of Z is characterized by

the mean function m(t) := EZ(t), t ∈ R^d, and

the covariance function C(s, t) := Cov(Z(s), Z(t)), s, t ∈ R^d.

We assume that Z : Ω × R^d → R is (A ⊗ B^d, B)-measurable.


Memo: m(t) = EZ(t), t ∈ R^d; C(s, t) = Cov(Z(s), Z(t)), s, t ∈ R^d.

Now, assume

∫_{R^d} m²(t) µ(dt) < ∞, ∫_{R^d} C(t, t) µ(dt) < ∞.

Then

∞ > ∫_{R^d} E[Z²(t)] µ(dt) = E[ ∫_{R^d} Z²(t) µ(dt) ] (why?).

It follows that the path t ↦ Z(ω, t) lies in L² for P-almost all ω.

Thus (perhaps after redefining paths on a null set), Ω ∋ ω ↦ Z^ω may be regarded as a random element of H = L²(R^d, B^d, µ), where

Z^ω(t) := Z(ω, t), t ∈ R^d.

Is Z = (Z(t))_{t∈R^d} a Gaussian random element?

According to Thm. 22.22, we have to show:

⟨Z, h⟩ = ∫_{R^d} Z(t)h(t) µ(dt) ∼ N_1 ∀ h ∈ H.

This follows, e.g., by adapting the reasoning given on p. 12 of I. A. Ibragimov/Y. A. Rozanov: Gaussian Random Processes, Springer 1978.


Memo: m(t) = EZ(t), t ∈ R^d; C(s, t) = Cov(Z(s), Z(t)), s, t ∈ R^d.

Memo: ⟨Z, h⟩ = ∫_{R^d} Z(t)h(t) µ(dt)

Notice that

E⟨Z, h⟩ = E[ ∫_{R^d} Z(t)h(t) µ(dt) ] = ∫_{R^d} E[Z(t)] h(t) µ(dt) = ⟨m, h⟩ ∀ h ∈ H

⟹ EZ = m (equality µ-a.e. as functions in L²).

Let Σ be the covariance operator of Z.

Claim: Σ is determined by the covariance function C(s, t).

W.l.o.g., let m = EZ = 0. For g, h ∈ H, we have

⟨Σg, h⟩ = E[⟨Z, g⟩⟨Z, h⟩] = E[ ∫_{R^d} Z(s)g(s) µ(ds) ∫_{R^d} Z(t)h(t) µ(dt) ]
        = E[ ∫_{R^d} ∫_{R^d} Z(s)Z(t) g(s)h(t) µ(ds) µ(dt) ]
        = ∫_{R^d} ∫_{R^d} E[Z(s)Z(t)] g(s)h(t) µ(ds) µ(dt), where E[Z(s)Z(t)] = C(s, t), q.e.d.


23 The central limit theorem in separable Hilbert spaces

Let H be a separable infinite-dimensional Hilbert space with scalar product ⟨x, y⟩, x, y ∈ H, and norm ‖x‖ = √⟨x, x⟩, x ∈ H. Furthermore, let {e_k : k ≥ 1} be a fixed complete orthonormal system of H. For ℓ ∈ N, let

Π_ℓ : H → H, x ↦ Π_ℓ(x) := ∑_{k=1}^ℓ ⟨x, e_k⟩ e_k

(the orthogonal projection onto the linear subspace spanned by e_1, …, e_ℓ).

All H-valued random elements will be defined on (Ω, A, P).

23.1 Theorem (Convergence in distribution of X_n to X)

Suppose X, X_1, X_2, … are H-valued random elements. If

Π_ℓ(X_n) →_D Π_ℓ(X) as n → ∞ for each fixed ℓ ≥ 1, (23.1)

lim_{ℓ→∞} limsup_{n→∞} P(‖X_n − Π_ℓ(X_n)‖ ≥ δ) = 0 for each δ > 0, (23.2)

then X_n →_D X.


Proof. Fix any uniformly continuous bounded f : H → R. To show:

Ef(X_n) → Ef(X) as n → ∞ (cf. Portmanteau theorem).

Fix ε > 0. Uniform continuity ⟹ there is some δ > 0 satisfying

∀ x, y ∈ H: if ‖x − y‖ < δ, then |f(x) − f(y)| < ε. (23.3)

For fixed ℓ ∈ N, we have

|Ef(X_n) − Ef(X)| ≤ |Ef(X_n) − Ef(Π_ℓ(X_n))| + |Ef(Π_ℓ(X_n)) − Ef(Π_ℓ(X))| + |Ef(Π_ℓ(X)) − Ef(X)|
                 =: u_{n,ℓ} + v_{n,ℓ} + w_ℓ.

Dominated convergence ⟹ ∃ ℓ_0 = ℓ_0(ε) so that w_ℓ ≤ ε for each ℓ ≥ ℓ_0.

Put K := sup_{x∈H} |f(x)| < ∞. Then, from (23.3),

u_{n,ℓ} ≤ |E[(f(X_n) − f(Π_ℓ(X_n))) · 1{‖X_n − Π_ℓ(X_n)‖ ≥ δ}]| + |E[(f(X_n) − f(Π_ℓ(X_n))) · 1{‖X_n − Π_ℓ(X_n)‖ < δ}]|
        ≤ 2K · P(‖X_n − Π_ℓ(X_n)‖ ≥ δ) + ε.


Memo 1: Π_ℓ(X_n) →_D Π_ℓ(X) as n → ∞ for each fixed ℓ ≥ 1.

Memo 2: lim_{ℓ→∞} limsup_{n→∞} P(‖X_n − Π_ℓ(X_n)‖ ≥ δ) = 0 for each δ > 0.

For each fixed ℓ ≥ ℓ_0, we thus have

|Ef(X_n) − Ef(X)| ≤ 2K · P(‖X_n − Π_ℓ(X_n)‖ ≥ δ) + 2ε + |Ef(Π_ℓ(X_n)) − Ef(Π_ℓ(X))|,

where the last term → 0 as n → ∞ (Memo 1). Hence

limsup_{n→∞} |Ef(X_n) − Ef(X)| ≤ 2K · limsup_{n→∞} P(‖X_n − Π_ℓ(X_n)‖ ≥ δ) + 2ε.

Memo 2 implies

limsup_{n→∞} |Ef(X_n) − Ef(X)| ≤ 2ε, q.e.d.


23.2 Theorem (Lindeberg–Lévy-type CLT in Hilbert spaces)

Let (Z_j)_{j≥1} be a sequence of i.i.d. H-valued random elements, where H is a separable Hilbert space. Assume E‖Z_1‖² < ∞, and put m := EZ_1, C := Σ(Z_1). Then there is a centered Gaussian element X ∼ N(0, C) of H, and we have

(1/√n) ∑_{j=1}^n (Z_j − m) →_D X as n → ∞.

Proof. W.l.o.g. let m = 0. Put X_n := n^{−1/2}(Z_1 + … + Z_n). Since the covariance operator Σ(·) satisfies Σ(aY) = a²Σ(Y), a ∈ R (!), Thm. 22.15 gives Σ(X_n) = C. From 22.11 and 22.12, we then have

⟨Cx, y⟩ = E[⟨X_n, x⟩⟨X_n, y⟩], n ≥ 1, x, y ∈ H.

Since C ∈ L⁺_tr(H) (see 22.12, 22.13), there is a Gaussian random element X ∼ N(0, C) of H.

Let {e_k : k ≥ 1} be the COS of H satisfying Ce_k = λ_k e_k, k ≥ 1. Recall

∑_{k=1}^∞ ⟨Ce_k, e_k⟩ = ∑_{k=1}^∞ λ_k < ∞.


Memo: ⟨Cx, y⟩ = E[⟨X_n, x⟩⟨X_n, y⟩], ∑_{k=1}^∞ ⟨Ce_k, e_k⟩ < ∞.

We first show (cf. Thm. 23.1)

lim_{ℓ→∞} limsup_{n→∞} P(‖X_n − Π_ℓ(X_n)‖ ≥ δ) = 0 for each δ > 0. (23.2)

(Recall Π_ℓ(x) = ∑_{k=1}^ℓ ⟨x, e_k⟩ e_k.) Fix δ > 0. We have

P(‖X_n − Π_ℓ(X_n)‖ ≥ δ) ≤ (1/δ²) · E[‖X_n − Π_ℓ(X_n)‖²] (why?)
                        = (1/δ²) · E[ ∑_{k=ℓ+1}^∞ ⟨X_n, e_k⟩² ] (why?)
                        = (1/δ²) · ∑_{k=ℓ+1}^∞ E[⟨X_n, e_k⟩⟨X_n, e_k⟩] = (1/δ²) · ∑_{k=ℓ+1}^∞ ⟨Ce_k, e_k⟩.

Hence

limsup_{n→∞} P(‖X_n − Π_ℓ(X_n)‖ ≥ δ) ≤ (1/δ²) · ∑_{k=ℓ+1}^∞ ⟨Ce_k, e_k⟩,

and (23.2) follows.


It remains to show

Π_ℓ(X_n) →_D Π_ℓ(X) as n → ∞ for each fixed ℓ ≥ 1. (23.1)

Memo: Π_ℓ(X_n) := ∑_{k=1}^ℓ ⟨X_n, e_k⟩ e_k, X_n = (1/√n) ∑_{j=1}^n Z_j.

Notice that

Y_n := (⟨X_n, e_1⟩, …, ⟨X_n, e_ℓ⟩)^⊤ = (1/√n) ∑_{j=1}^n (⟨Z_j, e_1⟩, …, ⟨Z_j, e_ℓ⟩)^⊤ =: (1/√n) ∑_{j=1}^n V_j.

V_1, V_2, … are i.i.d. ℓ-dimensional random vectors, with EV_1 = 0.

Since E[⟨Z_1, e_i⟩⟨Z_1, e_j⟩] = ⟨Ce_i, e_j⟩, the covariance matrix of V_1 is Σ̃ := (⟨Ce_i, e_j⟩)_{1≤i,j≤ℓ}. Multivariate CLT ⟹ Y_n →_D N_ℓ(0, Σ̃).

Since Y := (⟨X, e_1⟩, ⟨X, e_2⟩, …, ⟨X, e_ℓ⟩)^⊤ ∼ N_ℓ(0, Σ̃) (!), we have Y_n →_D Y.

Consider the mapping ψ : R^ℓ → H, ψ(x) := ∑_{k=1}^ℓ x_k e_k, x = (x_1, …, x_ℓ).

CMT ⟹ Π_ℓ(X_n) = ψ(Y_n) →_D ψ(Y) = Π_ℓ(X), q.e.d.
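The theorem can be illustrated in ℓ²: take Z_j with coordinates ⟨Z_j, e_k⟩ = ξ_{jk}/k for i.i.d. standardized, non-Gaussian ξ_{jk}, so that ⟨Ce_k, e_k⟩ = 1/k²; then ‖X_n‖² should approach the law ∑_k N_k²/k² of Thm. 22.23. A minimal truncated simulation sketch (all sizes, the uniform increment law and the seed are our own choices):

```python
import numpy as np

rng = np.random.default_rng(7)

def norm_sq_of_sums(n, dim=100, reps=2000):
    """||n^{-1/2} sum_j Z_j||^2 with <Z_j, e_k> = xi_{jk}/k, xi i.i.d., mean 0, var 1."""
    k = np.arange(1, dim + 1)
    out = np.empty(reps)
    for i in range(reps):
        xi = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, dim))  # var 1
        x = xi.sum(axis=0) / (np.sqrt(n) * k)   # coordinates of X_n
        out[i] = np.sum(x**2)
    return out

sim = norm_sq_of_sums(n=100)
lam = 1.0 / np.arange(1, 101) ** 2
lim = (rng.standard_normal((2000, 100)) ** 2) @ lam  # law of ||X||^2 (Thm. 22.23)
print(np.quantile(sim, [0.5, 0.9]), np.quantile(lim, [0.5, 0.9]))
```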


24 Statistical applications: Weighted L2-statistics

Let X_1, X_2, … be i.i.d. d-variate random vectors on (Ω, A, P).

Let µ be a finite measure on M ∩ B^d, where M ∈ B^d.

For n ≥ 1, let z_n : (R^d)^n × M → R be a measurable function.

Put Z_n(t) := z_n(X_1, …, X_n, t), t ∈ M.

24.1 Definition (One-sample weighted L2-statistic)

The random variable

T_n = ∫_M Z_n²(t) µ(dt)

is called a (one-sample) weighted L2-statistic based on z_n and µ.

Often: µ(dt) = w(t) dt, where w : M → R_{≥0} is measurable.

Motivation: Test some hypothesis H_0 about the unknown distribution P^{X_1} of X_1.


24.2 Example  Testing for normality (Epps and Pulley, 1983)

H_0 : P^{X_1} ∈ {N(a, σ²) : a ∈ R, σ² > 0}. Fix β > 0.

Y_{n,j} := (X_j − X̄_n)/S_n, j = 1, …, n, S_n² = (1/n) ∑_{j=1}^n (X_j − X̄_n)²,

Ψ_n(t) := (1/n) ∑_{j=1}^n exp(itY_{n,j}) ≈ exp(−t²/2) under H_0.

T_{n,β} := n ∫_{−∞}^∞ | Ψ_n(t) − exp(−t²/2) |² w_β(t) dt, where w_β(t) := (1/(β√(2π))) exp(−t²/(2β²)),

so that

T_{n,β} = ∫_{−∞}^∞ Z_n²(t) w_β(t) dt,

where

Z_n(t) = (1/√n) ∑_{j=1}^n [ cos(tY_{n,j}) + sin(tY_{n,j}) − exp(−t²/2) ].

Rejection of H_0 is for large values of T_{n,β}.
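Because w_β is a normal density, the integral defining T_{n,β} can be evaluated in closed form: every term reduces to a Gaussian characteristic-function integral, ∫ cos(at) w_β(t) dt = exp(−β²a²/2). The reduction below is our own algebra, a sketch rather than a formula from the slides:

```python
import numpy as np

def epps_pulley(x, beta=1.0):
    """T_{n,beta} in closed form (our own Gaussian-integration reduction):
    (1/n) sum_{j,k} e^{-b^2 (Yj - Yk)^2 / 2}
      - 2/sqrt(1 + b^2) * sum_j e^{-b^2 Yj^2 / (2 (1 + b^2))}
      + n / sqrt(1 + 2 b^2)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    y = (x - x.mean()) / x.std()   # Y_{n,j}; np.std uses S_n^2 = (1/n) sum (X_j - mean)^2
    b2 = beta**2
    d2 = (y[:, None] - y[None, :]) ** 2
    t1 = np.exp(-b2 * d2 / 2.0).sum() / n
    t2 = 2.0 / np.sqrt(1.0 + b2) * np.exp(-b2 * y**2 / (2.0 * (1.0 + b2))).sum()
    return t1 - t2 + n / np.sqrt(1.0 + 2.0 * b2)

rng = np.random.default_rng(8)
print(epps_pulley(rng.normal(size=100)))       # small under H_0
print(epps_pulley(rng.exponential(size=100)))  # larger under a skewed alternative
```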


24.3 Example  Testing for exponentiality (Baringhaus and Henze, 1991)

H_0 : P^{X_1} ∈ {Exp(λ) : λ > 0}.

Motivation: P^{X_1} is determined by the Laplace transform

L(t) = E(e^{−tX_1}), t ≥ 0.

The Laplace transform L(t) = (1 + t)^{−1} of Exp(1) satisfies the differential equation

L(t) + (1 + t)L′(t) ≡ 0.

Put Y_{n,j} = X_j/X̄_n, j = 1, …, n. Under H_0, we should have

L_n(t) := (1/n) ∑_{j=1}^n e^{−tY_{n,j}} ≈ 1/(1 + t) ⟹ L_n(t) + (1 + t)L_n′(t) ≈ 0.

Z_n(t) = (1/√n) ∑_{j=1}^n e^{−tY_{n,j}} ( 1 − (1 + t)Y_{n,j} ), µ(dt) = e^{−βt} dt,

T_{n,β} = ∫_0^∞ Z_n²(t) e^{−βt} dt.

Rejection is for large values of T_{n,β}.
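T_{n,β} can be evaluated by simple quadrature, since the integrand decays exponentially; a minimal sketch (cutoff, grid and the test samples are our own choices):

```python
import numpy as np

def bh_statistic(x, beta=1.0, tmax=60.0, grid=4000):
    """T_{n,beta} = int_0^inf Z_n(t)^2 e^{-beta t} dt, approximated on [0, tmax]."""
    x = np.asarray(x, dtype=float)
    n = x.size
    y = x / x.mean()                  # Y_{n,j} = X_j / X_bar_n
    t = np.linspace(0.0, tmax, grid)
    # Z_n(t) = n^{-1/2} sum_j e^{-t Y_j} (1 - (1 + t) Y_j)
    z = (np.exp(-np.outer(t, y)) * (1.0 - (1.0 + t)[:, None] * y)).sum(axis=1) / np.sqrt(n)
    f = z**2 * np.exp(-beta * t)
    return np.sum((f[1:] + f[:-1]) / 2.0 * np.diff(t))  # trapezoidal rule

rng = np.random.default_rng(9)
print(bh_statistic(rng.exponential(size=100)))  # small under H_0
print(bh_statistic(rng.uniform(size=100)))      # larger under the alternative
```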


24.4 Example  Testing for reflected symmetry (Henze, Klar, Meintanis, 2003)

Let X_1, X_2, … be d-variate random vectors.

H_0 : X_1 − a =_D a − X_1 for some a ∈ R^d.

Notice that X_1 =_D −X_1 ⇐⇒ E[sin(t^⊤X_1)] = 0 ∀ t ∈ R^d.

Put

Y_{n,j} := S_n^{−1/2}(X_j − X̄_n), S_n := n^{−1} ∑_{j=1}^n (X_j − X̄_n)(X_j − X̄_n)^⊤,

Z_n(t) := (1/√n) ∑_{j=1}^n sin(t^⊤Y_{n,j}) ≈ 0 under H_0.

Put µ(dt) = exp(−β‖t‖²) dt, β > 0. Then

T_{n,β} := ∫_{R^d} Z_n²(t) exp(−β‖t‖²) dt
        = ( π^{d/2} / (2β^{d/2} n) ) ∑_{j,k=1}^n [ exp( −‖Y_{n,j} − Y_{n,k}‖²/(4β) ) − exp( −‖Y_{n,j} + Y_{n,k}‖²/(4β) ) ].
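The double-sum form makes T_{n,β} directly computable; a minimal sketch (numpy only; the eigendecomposition route to S_n^{−1/2} and the test samples are our own choices):

```python
import numpy as np

def symmetry_statistic(x, beta=1.0):
    """T_{n,beta} for reflected symmetry via the double-sum formula above."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    xc = x - x.mean(axis=0)
    w, v = np.linalg.eigh(xc.T @ xc / n)      # S_n = V diag(w) V^T
    y = xc @ (v @ np.diag(w**-0.5) @ v.T)     # rows Y_{n,j} = S_n^{-1/2}(X_j - X_bar)
    dm = ((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    dp = ((y[:, None, :] + y[None, :, :]) ** 2).sum(axis=2)
    c = np.pi ** (d / 2) / (2.0 * beta ** (d / 2) * n)
    return c * (np.exp(-dm / (4.0 * beta)) - np.exp(-dp / (4.0 * beta))).sum()

rng = np.random.default_rng(10)
print(symmetry_statistic(rng.normal(size=(100, 2))))       # symmetric: small
print(symmetry_statistic(rng.exponential(size=(100, 2))))  # skewed: larger
```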


Further examples: Testing for

bivariate exponential distribution (Alba-Fernández, Jiménez-Gamero, 2015)

bivariate Poisson (Novoa-Muñoz, Jiménez-Gamero, 2014)

Gamma distribution (Ebner et al., 2012)

bivariate extreme value copulas (Genest et al., 2011)

skew normal distribution (Meintanis, 2010)

inverse Gaussian distribution (Fragiadakis et al., 2009)

Marshall–Olkin distribution (Meintanis, 2007)

Laplace distribution (Meintanis, 2004)

Cauchy distribution (Gürtler, Henze, 2000)

Poisson distribution (Rueda et al., 1991)

etc.


Memo: T_n = ∫_M Z_n²(t) µ(dt).

We assume that Z_n is a random element of H := L²(M, M ∩ B^d, µ). Then

T_n = ‖Z_n‖².

Suppose one can prove

Z_n →_D Z under H_0 in H,

where Z ∼ N(0, C). Then, by the CMT,

T_n →_D ‖Z‖² under H_0.

Thm. 22.23 ⟹ ‖Z‖² ∼ ∑_{j=1}^∞ λ_j N_j², where λ_1, λ_2, … are the positive eigenvalues of C, and N_1, N_2, … are i.i.d. standard normal random variables.

In each of the examples,

Z_n(t) = (1/√n) ∑_{j=1}^n f(X_j, t, ϑ̂_n),

where ϑ̂_n = ϑ̂_n(X_1, …, X_n) is a suitable estimator of ϑ, and

E_ϑ[f(X_1, t, ϑ)] = 0 ∀ ϑ ∈ Θ, ∀ t ∈ M.


Memo: T_n = ∫_M Z_n²(t) µ(dt), H = L²(M, M ∩ B^d, µ),

Memo: Z_n(t) = (1/√n) ∑_{j=1}^n f(X_j, t, ϑ̂_n).

Suppose H_0 does not hold. Then, typically, there is a z ∈ H, z ≠ 0, so that

(1/n) ∑_{j=1}^n f(X_j, ·, ϑ̂_n) →_P z(·) in H,

i.e.,

‖ (1/n) ∑_{j=1}^n f(X_j, ·, ϑ̂_n) − z(·) ‖ →_P 0.

It follows that

T_n/n = ‖ (1/n) ∑_{j=1}^n f(X_j, ·, ϑ̂_n) ‖² →_P Δ := ‖z‖² = ∫_M z²(t) µ(dt) > 0.

Hence, T_n →_P ∞ ⟹ consistency against each such alternative.


Memo: T_n = ‖Z_n‖², Z_n(t) = (1/√n) ∑_{j=1}^n f(X_j, t, ϑ̂_n), Z_n/√n →_P z ≠ 0.

Memo: T_n/n →_P Δ = ‖z‖² > 0.

Put Z̄_n := Z_n/√n. Notice that

√n (T_n/n − Δ) = √n (‖Z̄_n‖² − ‖z‖²)
              = √n ⟨Z̄_n − z, Z̄_n + z⟩
              = √n ⟨Z̄_n − z, 2z + Z̄_n − z⟩
              = 2⟨√n(Z̄_n − z), z⟩ + (1/√n) ‖√n(Z̄_n − z)‖²
              = 2⟨V_n, z⟩ + (1/√n) ‖V_n‖², where V_n := √n(Z̄_n − z).


Memo: √n (T_n/n − Δ) = 2⟨V_n, z⟩ + (1/√n) ‖V_n‖².

24.5 Theorem  If V_n →_D V for some centered Gaussian random element V of H, then

√n (T_n/n − Δ) →_D 2⟨V, z⟩ ∼ N(0, σ²),

where, putting K(s, t) = E[V(s)V(t)],

σ² = 4 E[⟨V, z⟩²] = 4 ∫_M ∫_M K(s, t) z(s)z(t) µ(ds) µ(dt).

24.6 Corollary  If σ̂_n² is a consistent estimator of σ² > 0, then

(√n/σ̂_n) (T_n/n − Δ) →_D N(0, 1).

Further reading: Baringhaus, L., Henze, N., Ebner, B.: The Limit Distribution of Weighted L2-Goodness-of-Fit Statistics under Fixed Alternatives, with Applications (2016). Ann. Inst. Statist. Math. doi:10.1007/s10463-016-0567-8
