Lecture 7: Convergence in Probability and Consistency
MATH 667-01 Statistical Inference
University of Louisville
September 24, 2019 (last corrected: 9/26/2019)
1 / 21 Lecture 7: Convergence in Probability and Consistency
Introduction
We start by describing convergence in probability and properties of this type of convergence [1].
We next use Chebyshev's inequality [2] to prove the Weak Law of Large Numbers [3].
We then define consistent sequences of estimators [4].
Finally, we sketch a proof of a result on consistency of maximum likelihood estimators [5] under appropriate regularity assumptions.
[1] CB: Section 5.5, HMC: Section 5.1
[2] CB: Section 3.6, HMC: Section 1.10
[3] CB: Section 5.5, HMC: Section 5.1
[4] CB: Section 10.1, HMC: Section 5.1
[5] CB: Section 10.1, HMC: Section 6.1
Convergence in Probability
Definition L7.1 [6]: A sequence of random variables X1, X2, ... converges in probability to a random variable X if, for every ε > 0,

lim_{n→∞} P(|Xn − X| < ε) = 1.

If so, we write Xn →P X.

Equivalently, we can write the limit in the above definition as

lim_{n→∞} P(|Xn − X| ≥ ε) = 0.

Theorem L7.1 [7]:
(a) If Xn →P X and Yn →P Y, then Xn + Yn →P X + Y and XnYn →P XY.
(b) If Xn →P X and a ∈ R, then aXn →P aX.
(c) If Xn →P a where g is a function that is continuous at a ∈ R, then g(Xn) →P g(a).

[6] CB: Definition 5.5.1 on p.232, HMC: Definition 5.1.1 on p.322
[7] HMC: Section 5.1; CB: (c) is Theorem 5.5.4 on p.233
Simulated Example
Simulation: Suppose X1, ..., Xn are iid Bernoulli(p) random variables. Here we consider the behavior of p̂n = (1/n) Σ_{i=1}^n Xi as n becomes large.

Suppose the true value of the parameter is p = 0.5. For each n in {5, 10, 15, 20, 25, 100, 200, 500, 1000, 5000, 10000}, we simulate 100 data sets and plot p̂n for each data set.

As seen on slide 6, the variance decreases as n increases and p̂n →P 0.5.

The green curve shows the average of the 100 generated values of p̂n for each n.
R code for simulation
> set.seed(126573)
> p=.5
> n=c(5,10,15,20,25,100,200,500,1000,5000,10000)
> repetitions=100
> p.hat=matrix(0,repetitions,length(n))
> for (i in 1:length(n)){
+ for (r in 1:repetitions){
+ p.hat[r,i]=rbinom(1,size=n[i],prob=p)/n[i]
+ }
+ }
> plot(rep(n,repetitions),c(t(p.hat)),pch=19,cex=.7,
+ log="x",xlab="n",ylab="p.hat")
> abline(h=p,col="red")
> points(n,apply(p.hat,2,mean),type="l",col="green")
Simulated Example

[Figure: simulated values of p̂n plotted against n on a log scale, with the true value p = 0.5 marked by a red horizontal line and the average of the 100 simulated values traced in green.]
Chebyshev’s Inequality
Theorem L7.2 [8]: Let u be a nonnegative function and X be a random variable such that E[u(X)] exists. Then, for every c > 0,

P(u(X) ≥ c) ≤ E[u(X)]/c.

Proof of Theorem L7.2: For the continuous case, we have

E[u(X)] = ∫ u(x) f(x) dx ≥ ∫_A u(x) f(x) dx ≥ ∫_A c f(x) dx = c P(A),

where A = {x : u(x) ≥ c}. □

Letting u(X) = (X − µ)² and c = k²σ², we obtain Chebyshev's inequality.

Theorem L7.3 [9]: Let X be a random variable with mean µ and finite variance σ². Then, for every k > 0,

P(|X − µ| ≥ kσ) ≤ 1/k².
[8] CB: Theorem 3.6.1 on p.122, HMC: Theorem 1.10.2 on p.79
[9] HMC: Theorem 1.10.3 on p.79
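The bound in Theorem L7.3 can be checked numerically. Below is a short R sketch (not from the lecture) comparing exact standard-normal tail probabilities with Chebyshev's bound for a few values of k.

```r
# Numerical check of Chebyshev's inequality: for X ~ N(0, 1),
# compare the exact tail probability P(|X| >= k) with the bound 1/k^2.
k <- c(1.5, 2, 3)
tail.prob <- 2 * pnorm(-k)  # P(|X - mu| >= k*sigma) for a standard normal
bound <- 1 / k^2            # Chebyshev's bound
print(round(cbind(k, tail.prob, bound), 4))
stopifnot(all(tail.prob <= bound))
```

The bound is loose for the normal distribution (e.g. 0.25 versus about 0.0455 at k = 2), but it holds for any distribution with finite variance.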
Weak Law of Large Numbers
Theorem L7.4 [10]: Let X1, X2, ... be iid random variables with E[Xi] = µ and Var[Xi] = σ² < ∞. Then

X̄n = (1/n) Σ_{i=1}^n Xi →P µ.

Proof of Theorem L7.4: For every ε > 0, Chebyshev's inequality implies that

P(|X̄n − µ| ≥ ε) ≤ Var[X̄n]/ε² = σ²/(nε²).

Therefore, as n → ∞, it follows that

P(|X̄n − µ| < ε) = 1 − P(|X̄n − µ| ≥ ε) ≥ 1 − σ²/(nε²) → 1.
[10] CB: Theorem 5.5.2 on p.232, HMC: Theorem 5.1.1 on p.322
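The bound used in the proof can also be checked by simulation. Here is an R sketch (not from the lecture) for Bernoulli(0.5) data, where σ² = 0.25, so the proof gives P(|X̄n − µ| ≥ ε) ≤ 0.25/(nε²).

```r
# Simulation check of the Chebyshev bound in the WLLN proof:
# X_i ~ Bernoulli(0.5), n = 200, eps = 0.05.
set.seed(1)
n <- 200; eps <- 0.05; reps <- 10000
xbar <- rbinom(reps, size = n, prob = 0.5) / n    # 10000 sample means
emp <- mean(abs(xbar - 0.5) >= eps)               # empirical probability
bound <- 0.25 / (n * eps^2)                       # Chebyshev bound = 0.5
cat("empirical:", emp, " bound:", bound, "\n")
stopifnot(emp <= bound)
```

The empirical probability is well below the bound, and both tend to 0 as n grows with ε fixed.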
Convergence in Probability
Example L7.1: Suppose X1, ..., Xn are iid Bernoulli(p) random variables where p ∈ (0, 1). Show that

1/p̂n = n / Σ_{i=1}^n Xi →P 1/p.

Answer to Example L7.1: Theorem L7.4 shows that p̂n converges in probability to p. Since h(p) = 1/p is continuous on (0, 1), Theorem L7.1(c) then implies that 1/p̂n converges in probability to 1/p.
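This behavior is easy to see in a quick R sketch (not from the lecture): for Bernoulli(0.3) samples, 1/p̂n settles near 1/0.3 ≈ 3.33 as n grows.

```r
# Illustration of Theorem L7.1(c): 1/p.hat.n converges to 1/p.
set.seed(11)
p <- 0.3
for (n in c(100, 10000, 1000000)) {
  x <- rbinom(n, 1, p)                              # n Bernoulli(p) draws
  cat("n =", n, " 1/p.hat =", round(n / sum(x), 4), "\n")
}
```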
Consistency
Now, we consider the behavior of a sequence of estimators of a parameter θ in a parameter space Θ as n → ∞.

Let X1, X2, ... be iid random variables with pdf/pmf f(x; θ), and let Tn be a statistic based on X1, ..., Xn for n = 1, 2, ....

Definition L7.2 [11]: The statistic Tn = Tn(X1, ..., Xn), n = 1, 2, ..., is a (weakly) consistent estimator of θ if Tn →P θ.
[11] CB: Definition 10.1.1 on p.468, HMC: Definition 5.1.2 on p.324
Consistency
Theorem L7.5 [12]: If lim_{n→∞} Varθ[Tn] = 0 and lim_{n→∞} Biasθ[Tn] = 0 for every θ ∈ Θ, then {Tn} is a consistent sequence of estimators of θ.

Proof of Theorem L7.5: By Chebyshev's inequality,

Pθ(|Tn − θ| ≥ ε) ≤ Eθ[(Tn − θ)²]/ε²

and

Eθ[(Tn − θ)²] = Varθ[Tn] + (Biasθ[Tn])² → 0 + 0 = 0

as n → ∞.
[12] CB: Theorem 10.1.3 on p.469
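The decomposition Eθ[(Tn − θ)²] = Varθ[Tn] + (Biasθ[Tn])² used in the proof can be verified by simulation. The R sketch below (not from the lecture) uses an illustrative biased estimator Tn = Σ Xi/(n + 1) for Bernoulli(0.5) data.

```r
# Monte Carlo check of MSE = Var + Bias^2 for the biased estimator
# T_n = sum(X_i)/(n + 1) with X_i ~ Bernoulli(0.5), n = 20.
set.seed(3)
n <- 20; p <- 0.5; reps <- 100000
t.n <- rbinom(reps, n, p) / (n + 1)       # 100000 realizations of T_n
mse <- mean((t.n - p)^2)                  # direct Monte Carlo MSE
decomp <- var(t.n) + (mean(t.n) - p)^2    # variance plus squared bias
cat("MSE:", mse, " Var + Bias^2:", decomp, "\n")
stopifnot(abs(mse - decomp) < 1e-4)
```

(The tiny discrepancy comes from `var()` using the n − 1 denominator; the identity is exact for population moments.)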
Regularity Assumptions
Let X1, ..., Xn be a random sample with common pdf f(x; θ) and cdf F(x; θ) for θ ∈ Θ, and let W(x) = W(x1, ..., xn) be a function. Here are some regularity assumptions [13] that will be used for several upcoming theorems.

(R0) The cdfs are distinct.
(R1) The pdfs have common support for all θ.
(R2) The true parameter value θ0 is an interior point of Θ.
(R3) The pdf f(x; θ) is twice differentiable as a function of θ.
(R4) The integral ∫ f(x; θ) dx can be differentiated twice under the integral sign as a function of θ.
(R5) The integral ∫ w(x) f(x; θ) dx can be differentiated under the integral sign as a function of θ.
(R6) The pdf f(x; θ) is three times differentiable as a function of θ, and for all θ ∈ Θ, there exist c ∈ R and a function M(x) such that

|∂³/∂θ³ ln f(x; θ)| ≤ M(x),

with Eθ0[M(X)] < ∞, for all θ0 − c < θ < θ0 + c and all x in the support of X.
[13] Slightly different conditions appear in HMC: pp.356, 362, 368 and CB: p.516
Jensen’s Inequality
Theorem L7.6 [14]: If X is a random variable such that E|X| < ∞ and ϕ is a convex function such that E|ϕ(X)| < ∞, then

ϕ(E[X]) ≤ E[ϕ(X)].

Proof of Theorem L7.6: Let g(t) = a + bt be the line that passes through the point (c, ϕ(c)) where c = E[X]. Since ϕ is convex, ϕ(t) ≥ g(t) for all t. So, we have

E[ϕ(X)] ≥ E[a + bX] = a + bE[X] = g(E[X]) = ϕ(E[X]).
[14] HMC: Theorem 1.10.5 on p.81
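Jensen's inequality is also easy to see by simulation. Here is an R sketch (not from the lecture) with the convex function ϕ(x) = exp(x) and X ~ N(0, 1), for which E[exp(X)] = exp(1/2).

```r
# Monte Carlo illustration of Jensen's inequality: phi(E[X]) <= E[phi(X)]
# for the convex function phi(x) = exp(x) and X ~ N(0, 1).
set.seed(42)
x <- rnorm(100000)
lhs <- exp(mean(x))   # phi(E[X]), approximately exp(0) = 1
rhs <- mean(exp(x))   # E[phi(X)], approximately exp(0.5)
cat("phi(E[X]) =", round(lhs, 3), " E[phi(X)] =", round(rhs, 3), "\n")
stopifnot(lhs <= rhs)
```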
Likelihood Property
Theorem L7.7 [15]: Suppose θ0 is the true parameter and Eθ0[f(Xi; θ)/f(Xi; θ0)] exists. If (R0) and (R1) hold, then

lim_{n→∞} P[L(θ0; X1, ..., Xn) > L(θ; X1, ..., Xn)] = 1

for all θ ≠ θ0.

Proof of Theorem L7.7: L(θ0; X) > L(θ; X) is equivalent to

(1/n) Σ_{i=1}^n ln[f(Xi; θ)/f(Xi; θ0)] < 0.

Using Theorem L7.4 applied to {ln[f(Xi; θ)/f(Xi; θ0)]}_{i=1}^∞, we have

(1/n) Σ_{i=1}^n ln[f(Xi; θ)/f(Xi; θ0)] →P Eθ0[ln f(X1; θ)/f(X1; θ0)].
[15] HMC: Theorem 6.1.1 on p.356
Likelihood Property
Proof of Theorem L7.7 continued: By Theorem L7.6 (applied to the convex function −ln), we see that

Eθ0[ln f(X1; θ)/f(X1; θ0)] ≤ ln Eθ0[f(X1; θ)/f(X1; θ0)] = ln ∫ [f(x; θ)/f(x; θ0)] f(x; θ0) dx = ln 1 = 0.

Moreover, by (R0) the ratio f(X1; θ)/f(X1; θ0) is not degenerate at a constant, so the inequality is strict: µ := Eθ0[ln f(X1; θ)/f(X1; θ0)] < 0. So, taking ε = −µ > 0, we have

P(L(θ0; X) > L(θ; X)) = P((1/n) Σ_{i=1}^n ln[f(Xi; θ)/f(Xi; θ0)] < 0) ≥ P(|(1/n) Σ_{i=1}^n ln[f(Xi; θ)/f(Xi; θ0)] − µ| < ε) → 1

as n → ∞. □
Consistency of MLEs
Theorem L7.8 [16]: Assume X1, ..., Xn satisfy (R0)-(R3) where θ0 is the true parameter. Then the equation (∂/∂θ) L(θ) = 0 has a solution θ̂n such that θ̂n →P θ0.

Sketch of the proof of Theorem L7.8: Since θ0 is an interior point of Θ, it follows that (θ0 − a, θ0 + a) ⊂ Θ for some a > 0.

For any ε ≤ a, let Sn,ε = {x : L(θ0; x) > L(θ0 − ε; x)} ∩ {x : L(θ0; x) > L(θ0 + ε; x)}. By Theorem L7.7, P(Sn,ε) → 1 as n → ∞.

On Sn,ε, L(θ) has a local maximum θ̂n such that θ̂n ∈ (θ0 − ε, θ0 + ε) and L′(θ̂n) = 0.
[16] CB: Theorem 10.1.6 on p.470, HMC: Theorem 6.1.3 on p.359
Consistency of MLEs
Sketch of the proof of Theorem L7.8 continued: So, we have

Sn,ε ⊂ {x : |θ̂n(x) − θ0| < ε} ∩ {x : L′(θ̂n(x)) = 0}.

Thus, it follows that

1 = lim_{n→∞} P(Sn,ε) ≤ lim inf_{n→∞} P({x : |θ̂n(x) − θ0| < ε} ∩ {x : L′(θ̂n(x)) = 0}) ≤ 1.

So, for every ε > 0, we have

lim_{n→∞} P(|θ̂n − θ0| < ε) = 1. □
Invariance of the MLE
If L(θ; x) is the likelihood function for θ based on x, then define the induced likelihood function for τ(θ) as

L*(η; x) = sup_{θ: τ(θ)=η} L(θ; x),

and the value η̂ which maximizes L* is the MLE of η = τ(θ).

Theorem L7.9 [17]: If θ̂ is the MLE of θ0, then for any function τ with domain Θ, the MLE of τ(θ0) is τ(θ̂).

Proof of Theorem L7.9:

L*(η̂; x) = sup_η L*(η; x) = sup_η sup_{θ: τ(θ)=η} L(θ; x) = sup_θ L(θ; x) = L(θ̂; x) = sup_{θ: τ(θ)=τ(θ̂)} L(θ; x) = L*(τ(θ̂); x). □
[17] CB: Theorem 7.2.10 on p.330, HMC: Theorem 6.1.2 on p.358
Consistency of MLEs
Example L7.2: Suppose X1, ..., Xn is a random sample from a distribution with pdf

f(x; θ) = (1/θ) I(0,θ)(x)

where θ > 0.
(a) Compute the MLE and show that it is a consistent estimator of θ.
(b) Find a consistent estimator of √θ.

Answer to Example L7.2(a): The likelihood function

L(θ; x) = ∏_{i=1}^n (1/θ) I(0,θ)(xi) = (1/θⁿ) I(0,θ)(max xi)

is equal to 0 on (0, max xi] and is a positive, decreasing function of θ on (max xi, ∞). So, the MLE of θ is θ̂n = max_{1≤i≤n} xi.
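A quick R sketch (not from the lecture) shows this MLE in action: for Uniform(0, θ) data with θ = 2, the sample maximum approaches θ from below as n grows.

```r
# The MLE max(x_i) for Uniform(0, theta), theta = 2, over nested samples.
set.seed(7)
theta <- 2
x <- runif(10000, 0, theta)
for (n in c(10, 100, 10000)) {
  cat("n =", n, " theta.hat =", round(max(x[1:n]), 4), "\n")
}
stopifnot(max(x) <= theta)   # the MLE never exceeds the true theta
```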
Consistency of MLEs
Answer to Example L7.2(a) continued: Note that regularity condition (R1) is not satisfied.

In Example L6.4, it was shown that θ̂n = Y = max_{1≤i≤n} Xi has pdf

f(y) = (n yⁿ⁻¹/θⁿ) I(0,θ)(y),

and we can easily compute

E[θ̂n] = [n/(n+1)] E[((n+1)/n) Y] = [n/(n+1)] θ

and

Var[θ̂n] = [n/(n+1)]² Var[((n+1)/n) Y] = [n²/(n+1)²] · θ²/[n(n+2)].
Consistency of MLEs
Answer to Example L7.2(a) continued: Consequently, we have

E[θ̂n − θ] = [n/(n+1)] θ − θ = −θ/(n+1)  ⇒  Bias[θ̂n] → 0

and

Var[θ̂n] = nθ²/[(n+1)²(n+2)] → 0

as n → ∞. Theorem L7.5 implies that θ̂n is a consistent estimator of θ.

Answer to Example L7.2(b): In part (a), we showed θ̂n →P θ. Since g(t) = √t is continuous at every θ > 0, Theorem L7.1(c) implies √θ̂n →P √θ. So √θ̂n = √(max_{1≤i≤n} xi) is a consistent estimator of √θ.