Lecture 7: Convergence in Probability and Consistency
MATH 667-01 Statistical Inference
University of Louisville
September 24, 2019 (last corrected: 9/26/2019)
1 / 21 Lecture 7: Convergence in Probability and Consistency
Introduction
We start by describing convergence in probability and properties of this type of convergence [1].
We next use Chebyshev's inequality [2] to prove the Weak Law of Large Numbers [3].
We then define consistent sequences of estimators [4].
Finally, we sketch a proof of a result on consistency of maximum likelihood estimators [5] under appropriate regularity assumptions.
[1] CB: Section 5.5, HMC: Section 5.1
[2] CB: Section 3.6, HMC: Section 1.10
[3] CB: Section 5.5, HMC: Section 5.1
[4] CB: Section 10.1, HMC: Section 5.1
[5] CB: Section 10.1, HMC: Section 6.1
Convergence in Probability
Definition L7.1 [6]: A sequence of random variables X1, X2, ... converges in probability to a random variable X if, for every ε > 0,

lim_{n→∞} P(|Xn − X| < ε) = 1.

If so, we write Xn →P X.

Equivalently, we can write the limit in the above definition as

lim_{n→∞} P(|Xn − X| ≥ ε) = 0.

Theorem L7.1 [7]:
(a) If Xn →P X and Yn →P Y, then Xn + Yn →P X + Y and XnYn →P XY.
(b) If Xn →P X and a ∈ R, then aXn →P aX.
(c) If Xn →P a where g is a function that is continuous at a ∈ R, then g(Xn) →P g(a).

[6] CB: Definition 5.5.1 on p.232, HMC: Definition 5.1.1 on p.322
[7] HMC: Section 5.1; CB: (c) is Theorem 5.5.4 on p.233
Simulated Example
Simulation: Suppose X1, ..., Xn are iid Bernoulli(p) random variables. Here we consider the behavior of p̂n = (1/n) Σ_{i=1}^n Xi as n becomes large.

Suppose the true value of the parameter is p = 0.5. For each n in {5, 10, 15, 20, 25, 100, 200, 500, 1000, 5000, 10000}, we simulate 100 data sets and plot p̂n for each data set.

As seen on slide 6, the variance decreases as n increases and p̂n →P 0.5.

The green curve shows the average of the 100 generated values of p̂n for each n.
R code for simulation
> set.seed(126573)
> p=.5
> n=c(5,10,15,20,25,100,200,500,1000,5000,10000)
> repetitions=100
> p.hat=matrix(0,repetitions,length(n))
> for (i in 1:length(n)){
+ for (r in 1:repetitions){
+ p.hat[r,i]=rbinom(1,size=n[i],prob=p)/n[i]
+ }
+ }
> plot(rep(n,repetitions),c(t(p.hat)),pch=19,cex=.7,
+ log="x",xlab="n",ylab="p.hat")
> abline(h=p,col="red")
> points(n,apply(p.hat,2,mean),type="l",col="green")
Simulated Example

[Figure: simulated values of p̂n plotted against n on a log scale, with the true value p = 0.5 marked by a red horizontal line and the average of the 100 simulated values traced in green.]
Chebyshev’s Inequality
Theorem L7.2 [8]: Let u be a nonnegative function and X be a random variable such that E[u(X)] exists. Then, for every c > 0,

P(u(X) ≥ c) ≤ E[u(X)]/c.

Proof of Theorem L7.2: For the continuous case, we have

E[u(X)] = ∫ u(x) f(x) dx ≥ ∫_A u(x) f(x) dx ≥ ∫_A c f(x) dx = c P(A),

where A = {x : u(x) ≥ c}. □

Letting u(X) = (X − µ)² and c = k²σ², we obtain Chebyshev's inequality.

Theorem L7.3 [9]: Let X be a random variable with mean µ and finite variance σ². Then, for every k > 0,

P(|X − µ| ≥ kσ) ≤ 1/k².
[8] CB: Theorem 3.6.1 on p.122, HMC: Theorem 1.10.2 on p.79
[9] HMC: Theorem 1.10.3 on p.79
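The bound in Theorem L7.3 can be checked numerically. Below is a short R sketch (not from the lecture) comparing exact standard-normal tail probabilities with Chebyshev's bound for a few values of k.

```r
# Numerical check of Chebyshev's inequality: for X ~ N(0, 1),
# compare the exact tail probability P(|X| >= k) with the bound 1/k^2.
k <- c(1.5, 2, 3)
tail.prob <- 2 * pnorm(-k)  # P(|X - mu| >= k*sigma) for a standard normal
bound <- 1 / k^2            # Chebyshev's bound
print(round(cbind(k, tail.prob, bound), 4))
stopifnot(all(tail.prob <= bound))
```

The bound is loose for the normal distribution (e.g. 0.25 versus about 0.0455 at k = 2), but it holds for any distribution with finite variance.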
Weak Law of Large Numbers
Theorem L7.4 [10]: Let X1, X2, ... be iid random variables with E[Xi] = µ and Var[Xi] = σ² < ∞. Then

X̄n = (1/n) Σ_{i=1}^n Xi →P µ.

Proof of Theorem L7.4: For every ε > 0, Chebyshev's inequality implies that

P(|X̄n − µ| ≥ ε) ≤ Var[X̄n]/ε² = σ²/(nε²).

Therefore, as n → ∞, it follows that

P(|X̄n − µ| < ε) = 1 − P(|X̄n − µ| ≥ ε) ≥ 1 − σ²/(nε²) → 1.
[10] CB: Theorem 5.5.2 on p.232, HMC: Theorem 5.1.1 on p.322
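The bound used in the proof can also be checked by simulation. Here is an R sketch (not from the lecture) for Bernoulli(0.5) data, where σ² = 0.25, so the proof gives P(|X̄n − µ| ≥ ε) ≤ 0.25/(nε²).

```r
# Simulation check of the Chebyshev bound in the WLLN proof:
# X_i ~ Bernoulli(0.5), n = 200, eps = 0.05.
set.seed(1)
n <- 200; eps <- 0.05; reps <- 10000
xbar <- rbinom(reps, size = n, prob = 0.5) / n    # 10000 sample means
emp <- mean(abs(xbar - 0.5) >= eps)               # empirical probability
bound <- 0.25 / (n * eps^2)                       # Chebyshev bound = 0.5
cat("empirical:", emp, " bound:", bound, "\n")
stopifnot(emp <= bound)
```

The empirical probability is well below the bound, and both tend to 0 as n grows with ε fixed.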
Convergence in Probability
Example L7.1: Suppose X1, ..., Xn are iid Bernoulli(p) random variables where p ∈ (0, 1). Show that

1/p̂n = n / Σ_{i=1}^n Xi →P 1/p.

Answer to Example L7.1: Theorem L7.4 shows that p̂n converges in probability to p. Since h(p) = 1/p is continuous on (0, 1), Theorem L7.1(c) then implies that 1/p̂n converges in probability to 1/p.
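This behavior is easy to see in a quick R sketch (not from the lecture): for Bernoulli(0.3) samples, 1/p̂n settles near 1/0.3 ≈ 3.33 as n grows.

```r
# Illustration of Theorem L7.1(c): 1/p.hat.n converges to 1/p.
set.seed(11)
p <- 0.3
for (n in c(100, 10000, 1000000)) {
  x <- rbinom(n, 1, p)                              # n Bernoulli(p) draws
  cat("n =", n, " 1/p.hat =", round(n / sum(x), 4), "\n")
}
```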
Consistency
Now, we consider the behavior of a sequence of estimators of a parameter θ in a parameter space Θ as n → ∞.

Let X1, X2, ... be iid random variables with pdf/pmf f(x; θ), and let Tn be a statistic based on X1, ..., Xn for n = 1, 2, ....

Definition L7.2 [11]: The statistic Tn = Tn(X1, ..., Xn), n = 1, 2, ..., is a (weakly) consistent estimator of θ if Tn →P θ.
[11] CB: Definition 10.1.1 on p.468, HMC: Definition 5.1.2 on p.324
Consistency
Theorem L7.5 [12]: If lim_{n→∞} Varθ[Tn] = 0 and lim_{n→∞} Biasθ[Tn] = 0 for every θ ∈ Θ, then {Tn} is a consistent sequence of estimators of θ.

Proof of Theorem L7.5: By Chebyshev's inequality,

Pθ(|Tn − θ| ≥ ε) ≤ Eθ[(Tn − θ)²]/ε²

and

Eθ[(Tn − θ)²] = Varθ[Tn] + (Biasθ[Tn])² → 0 + 0 = 0

as n → ∞.
[12] CB: Theorem 10.1.3 on p.469
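The decomposition Eθ[(Tn − θ)²] = Varθ[Tn] + (Biasθ[Tn])² used in the proof can be verified by simulation. The R sketch below (not from the lecture) uses an illustrative biased estimator Tn = Σ Xi/(n + 1) for Bernoulli(0.5) data.

```r
# Monte Carlo check of MSE = Var + Bias^2 for the biased estimator
# T_n = sum(X_i)/(n + 1) with X_i ~ Bernoulli(0.5), n = 20.
set.seed(3)
n <- 20; p <- 0.5; reps <- 100000
t.n <- rbinom(reps, n, p) / (n + 1)       # 100000 realizations of T_n
mse <- mean((t.n - p)^2)                  # direct Monte Carlo MSE
decomp <- var(t.n) + (mean(t.n) - p)^2    # variance plus squared bias
cat("MSE:", mse, " Var + Bias^2:", decomp, "\n")
stopifnot(abs(mse - decomp) < 1e-4)
```

(The tiny discrepancy comes from `var()` using the n − 1 denominator; the identity is exact for population moments.)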
Regularity Assumptions
Let X1, ..., Xn be a random sample with common pdf f(x; θ) and cdf F(x; θ) for θ ∈ Θ, and let W(x) = W(x1, ..., xn) be a function. Here are some regularity assumptions [13] that will be used for several upcoming theorems.

(R0) The cdfs are distinct.
(R1) The pdfs have common support for all θ.
(R2) The true parameter value θ0 is an interior point of Θ.
(R3) The pdf f(x; θ) is twice differentiable as a function of θ.
(R4) The integral ∫ f(x; θ) dx can be differentiated twice under the integral sign as a function of θ.
(R5) The integral ∫ w(x) f(x; θ) dx can be differentiated under the integral sign as a function of θ.
(R6) The pdf f(x; θ) is three times differentiable as a function of θ, and for all θ ∈ Θ, there exist c ∈ R and a function M(x) such that

|∂³/∂θ³ ln f(x; θ)| ≤ M(x),

with Eθ0[M(X)] < ∞, for all θ0 − c < θ < θ0 + c and all x in the support of X.
[13] Slightly different conditions appear in HMC: pp.356, 362, 368 and CB: p.516
Jensen’s Inequality
Theorem L7.6 [14]: If X is a random variable such that E|X| < ∞ and ϕ is a convex function such that E|ϕ(X)| < ∞, then

ϕ(E[X]) ≤ E[ϕ(X)].

Proof of Theorem L7.6: Let g(t) = a + bt be the line that passes through the point (c, ϕ(c)) where c = E[X]. Since ϕ is convex, ϕ(t) ≥ g(t) for all t. So, we have

E[ϕ(X)] ≥ E[a + bX] = a + bE[X] = g(E[X]) = ϕ(E[X]).
[14] HMC: Theorem 1.10.5 on p.81
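Jensen's inequality is also easy to see by simulation. Here is an R sketch (not from the lecture) with the convex function ϕ(x) = exp(x) and X ~ N(0, 1), for which E[exp(X)] = exp(1/2).

```r
# Monte Carlo illustration of Jensen's inequality: phi(E[X]) <= E[phi(X)]
# for the convex function phi(x) = exp(x) and X ~ N(0, 1).
set.seed(42)
x <- rnorm(100000)
lhs <- exp(mean(x))   # phi(E[X]), approximately exp(0) = 1
rhs <- mean(exp(x))   # E[phi(X)], approximately exp(0.5)
cat("phi(E[X]) =", round(lhs, 3), " E[phi(X)] =", round(rhs, 3), "\n")
stopifnot(lhs <= rhs)
```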
Likelihood Property
Theorem L7.7 [15]: Suppose θ0 is the true parameter and Eθ0[f(Xi; θ)/f(Xi; θ0)] exists. If (R0) and (R1) hold, then

lim_{n→∞} P[L(θ0; X1, ..., Xn) > L(θ; X1, ..., Xn)] = 1

for all θ ≠ θ0.

Proof of Theorem L7.7: L(θ0; X) > L(θ; X) is equivalent to

(1/n) Σ_{i=1}^n ln[f(Xi; θ)/f(Xi; θ0)] < 0.

Using Theorem L7.4 applied to {ln[f(Xi; θ)/f(Xi; θ0)]}_{i=1}^∞, we have

(1/n) Σ_{i=1}^n ln[f(Xi; θ)/f(Xi; θ0)] →P Eθ0[ln f(X1; θ)/f(X1; θ0)].
[15] HMC: Theorem 6.1.1 on p.356
Likelihood Property
Proof of Theorem L7.7 continued: By Theorem L7.6 (applied to the convex function −ln), we see that

Eθ0[ln f(X1; θ)/f(X1; θ0)] ≤ ln Eθ0[f(X1; θ)/f(X1; θ0)] = ln ∫ [f(x; θ)/f(x; θ0)] f(x; θ0) dx = ln 1 = 0.

Moreover, by (R0) the ratio f(X1; θ)/f(X1; θ0) is not degenerate at a constant, so the inequality is strict: µ := Eθ0[ln f(X1; θ)/f(X1; θ0)] < 0. So, taking ε = −µ > 0, we have

P(L(θ0; X) > L(θ; X)) = P((1/n) Σ_{i=1}^n ln[f(Xi; θ)/f(Xi; θ0)] < 0) ≥ P(|(1/n) Σ_{i=1}^n ln[f(Xi; θ)/f(Xi; θ0)] − µ| < ε) → 1

as n → ∞. □
Consistency of MLEs
Theorem L7.8 [16]: Assume X1, ..., Xn satisfy (R0)-(R3) where θ0 is the true parameter. Then the equation (∂/∂θ) L(θ) = 0 has a solution θ̂n such that θ̂n →P θ0.

Sketch of the proof of Theorem L7.8: Since θ0 is an interior point of Θ, it follows that (θ0 − a, θ0 + a) ⊂ Θ for some a > 0.

For any ε ≤ a, let Sn,ε = {x : L(θ0; x) > L(θ0 − ε; x)} ∩ {x : L(θ0; x) > L(θ0 + ε; x)}. By Theorem L7.7, P(Sn,ε) → 1 as n → ∞.

On Sn,ε, L(θ) has a local maximum θ̂n such that θ̂n ∈ (θ0 − ε, θ0 + ε) and L′(θ̂n) = 0.
[16] CB: Theorem 10.1.6 on p.470, HMC: Theorem 6.1.3 on p.359
Consistency of MLEs
Sketch of the proof of Theorem L7.8 continued: So, we have

Sn,ε ⊂ {x : |θ̂n(x) − θ0| < ε} ∩ {x : L′(θ̂n(x)) = 0}.

Thus, it follows that

1 = lim_{n→∞} P(Sn,ε) ≤ lim inf_{n→∞} P({x : |θ̂n(x) − θ0| < ε} ∩ {x : L′(θ̂n(x)) = 0}) ≤ 1.

So, for every ε > 0, we have

lim_{n→∞} P(|θ̂n − θ0| < ε) = 1. □
Invariance of the MLE
If L(θ; x) is the likelihood function for θ based on x, then define the induced likelihood function for τ(θ) as

L*(η; x) = sup_{θ: τ(θ)=η} L(θ; x),

and the value η̂ which maximizes L* is the MLE of η = τ(θ).

Theorem L7.9 [17]: If θ̂ is the MLE of θ0, then for any function τ with domain Θ, the MLE of τ(θ0) is τ(θ̂).

Proof of Theorem L7.9:

L*(η̂; x) = sup_η L*(η; x) = sup_η sup_{θ: τ(θ)=η} L(θ; x) = sup_θ L(θ; x) = L(θ̂; x) = sup_{θ: τ(θ)=τ(θ̂)} L(θ; x) = L*(τ(θ̂); x). □
[17] CB: Theorem 7.2.10 on p.330, HMC: Theorem 6.1.2 on p.358
Consistency of MLEs
Example L7.2: Suppose X1, ..., Xn is a random sample from a distribution with pdf

f(x; θ) = (1/θ) I(0,θ)(x)

where θ > 0.
(a) Compute the MLE and show that it is a consistent estimator of θ.
(b) Find a consistent estimator of √θ.

Answer to Example L7.2(a): The likelihood function

L(θ; x) = ∏_{i=1}^n (1/θ) I(0,θ)(xi) = (1/θⁿ) I(0,θ)(max xi)

is equal to 0 on (0, max xi] and is a positive, decreasing function of θ on (max xi, ∞). So, the MLE of θ is θ̂n = max_{1≤i≤n} xi.
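A quick R sketch (not from the lecture) shows this MLE in action: for Uniform(0, θ) data with θ = 2, the sample maximum approaches θ from below as n grows.

```r
# The MLE max(x_i) for Uniform(0, theta), theta = 2, over nested samples.
set.seed(7)
theta <- 2
x <- runif(10000, 0, theta)
for (n in c(10, 100, 10000)) {
  cat("n =", n, " theta.hat =", round(max(x[1:n]), 4), "\n")
}
stopifnot(max(x) <= theta)   # the MLE never exceeds the true theta
```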
Consistency of MLEs
Answer to Example L7.2(a) continued: Note that regularity condition (R1) is not satisfied.

In Example L6.4, it was shown that θ̂n = Y = max_{1≤i≤n} Xi has pdf

f(y) = (n yⁿ⁻¹/θⁿ) I(0,θ)(y),

and we can easily compute

E[θ̂n] = [n/(n+1)] E[((n+1)/n) Y] = [n/(n+1)] θ

and

Var[θ̂n] = [n/(n+1)]² Var[((n+1)/n) Y] = [n²/(n+1)²] · θ²/[n(n+2)].
Consistency of MLEs
Answer to Example L7.2(a) continued: Consequently, we have

E[θ̂n − θ] = [n/(n+1)] θ − θ = −θ/(n+1)  ⇒  Bias[θ̂n] → 0

and

Var[θ̂n] = nθ²/[(n+1)²(n+2)] → 0

as n → ∞. Theorem L7.5 implies that θ̂n is a consistent estimator of θ.

Answer to Example L7.2(b): In part (a), we showed θ̂n →P θ. Since g(t) = √t is continuous at every θ > 0, Theorem L7.1(c) implies √θ̂n →P √θ. So √θ̂n = √(max_{1≤i≤n} xi) is a consistent estimator of √θ.