Week 9: The Central Limit Theorem and Estimation Concepts
TRANSCRIPT
Outline: Consistency and Law of Large Numbers · The Central Limit Theorem · Lab 5: The distribution of averages · Some Estimation Concepts
Week 9 Objectives
1 The Law of Large Numbers and the concept of consistency of averages are introduced. The condition for existence of the population mean is elaborated. The Cauchy distribution is used to illustrate the effects of violating this condition.
2 The averages of n = 2 and of n = 3 uniform random variables are used to demonstrate that the exact distribution of averages depends on the sample size.
3 Simulations illustrate the approximate normality of averages.
4 The Central Limit Theorem and the DeMoivre-Laplace Theorem are given.
5 Some estimation concepts are presented.
1 Consistency and Law of Large Numbers
2 The Central Limit Theorem
  Convolutions
  The Main Result
  The DeMoivre-Laplace Theorem
3 Lab 5: The distribution of averages
4 Some Estimation Concepts
  Model-Free and Model-Based Estimators
  Comparing Estimators
  The Method of Moments
We have seen that $\bar X$, $\hat p$, and $S^2$ approximate their population counterparts, and that the approximation improves as the sample size increases. This is called consistency. In general, an estimator $\hat\theta$ of a population parameter $\theta$ is called consistent if
$$\hat\theta \to \theta \quad \text{as } n \to \infty.$$
Consistency is such a basic property that virtually all estimators used in statistics are consistent.
Thus, the sample covariance, sample correlation, and the estimators of the intercept and slope of the simple linear regression model are all consistent estimators.
Consistency is supported by a very famous probability result:
Theorem (The Law of Large Numbers). Let $X_1, \dots, X_n$ be iid and let $g$ be a function such that $-\infty < E[g(X_1)] < \infty$. Then,
$$\hat\theta \equiv \frac{1}{n}\sum_{i=1}^n g(X_i) \;\to\; \theta \equiv E[g(X_1)] \quad \text{as } n \to \infty.$$
Consistency of the aforementioned estimators follows from the consistency of averages stated in the LLN.
For example, the consistency of $S^2$ follows from the fact that $\sigma^2 = E(X^2) - \mu^2$ and the consistency of $\bar X$ and $\frac{1}{n}\sum_{i=1}^n X_i^2$ as estimators of $\mu$ and $E(X^2)$, respectively.
The Assumption of Finite µ and σ²
The LLN requires that the population has a finite mean. Later we will also use the assumption of finite variance. Is it possible for the mean of a r.v. not to exist, or to be infinite? Is it possible for a r.v. to have finite $\mu$ but infinite $\sigma^2$?
The most notorious (for its abnormality) r.v. is the Cauchy, whose pdf is
$$f(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}, \quad -\infty < x < \infty.$$
The median of this r.v. is zero, but its mean does not exist.
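The contrast between the Cauchy's well-behaved median and its nonexistent mean shows up clearly in simulation. The following sketch is an illustration (not part of the slides; the R lab below does the analogous experiment with rcauchy): it generates standard Cauchy draws by inverse-CDF sampling, using the fact that tan(π(U − 1/2)) is standard Cauchy when U ∼ Unif(0, 1), and tracks the running mean.

```python
import math
import random
import statistics

random.seed(1)

def cauchy():
    # Inverse-CDF sampling: if U ~ Unif(0,1), then tan(pi*(U - 1/2))
    # is a draw from the standard Cauchy distribution.
    return math.tan(math.pi * (random.random() - 0.5))

draws = [cauchy() for _ in range(100_000)]

# The sample median settles near the true median, 0 ...
print("median:", statistics.median(draws))

# ... but the running means keep wandering as n grows,
# because E(X) does not exist.
for n in (100, 1_000, 10_000, 100_000):
    print("mean of first", n, "draws:", sum(draws[:n]) / n)
```

Rerunning with different seeds shows the running means jumping erratically from run to run, while the median stays put.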
It is not hard to construct other PDFs with infinite mean or variance:
Example. Let $X_1$ and $X_2$ have PDFs
$$f_1(x) = x^{-2}, \; 1 \le x < \infty, \qquad f_2(x) = 2x^{-3}, \; 1 \le x < \infty,$$
respectively. Show that $E(X_1) = \infty$, $E(X_2) = 2$, $\mathrm{Var}(X_2) = \infty$.
Solution: By straightforward integration we get
$$E(X_1) = \int_1^\infty x f_1(x)\,dx = \infty, \qquad E(X_1^2) = \int_1^\infty x^2 f_1(x)\,dx = \infty,$$
$$E(X_2) = \int_1^\infty x f_2(x)\,dx = 2, \qquad E(X_2^2) = \int_1^\infty x^2 f_2(x)\,dx = \infty.$$
Thus, $\mathrm{Var}(X_2) = E(X_2^2) - [E(X_2)]^2 = \infty$.
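The conclusions about $X_2$ can also be checked by simulation. Since $f_2$ has CDF $F_2(x) = 1 - x^{-2}$ on $[1, \infty)$, inverse-CDF sampling gives $X_2 = (1 - U)^{-1/2}$. The following is an illustrative sketch (the sampler is derived here, not taken from the slides):

```python
import random

random.seed(0)

def draw_x2():
    # f2(x) = 2 x^-3 on [1, inf) has CDF F2(x) = 1 - x^-2,
    # so inverse-CDF sampling gives X2 = (1 - U)^(-1/2).
    return (1.0 - random.random()) ** -0.5

xs = [draw_x2() for _ in range(200_000)]
mean = sum(xs) / len(xs)
second_moment = sum(x * x for x in xs) / len(xs)
print("sample mean:", mean)               # close to E(X2) = 2
print("sample E(X2^2):", second_moment)   # unstable across runs: E(X2^2) is infinite
```

The sample mean is close to 2 because $E(X_2)$ is finite, but the sample second moment changes wildly between seeds, a symptom of the infinite $E(X_2^2)$.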
Distributions with infinite mean, or infinite variance, are described as heavy tailed.
The LLN may not hold for heavy-tailed distributions: $\bar X$ may not converge (as in the case of the Cauchy), or may diverge. (The CLT, which we will see shortly, also does not hold.)
Samples coming from heavy-tailed distributions are much more likely to contain outliers.
If large outliers exist in a data set, it might be a good idea to focus on estimating another quantity, such as the median, which is well defined also for heavy-tailed distributions.
Convolutions · The Main Result · The DeMoivre-Laplace Theorem
• The LLN says nothing about the quality of the approximation for a given sample size. For that we need the distribution of $\bar X$.
The distribution of the sum of two independent random variables is called the convolution of the two distributions. In some cases it is easy to find the convolution of two RVs.
Example. You and a friend each flip a fair coin a number of times. Let X be the number of heads you get in your n1 flips, and Y the number of heads your friend gets in n2 flips.
(a) Are X and Y independent?
(b) What is the distribution of the total number of heads X + Y?
Solution: (a) Yes. (b) $X + Y \sim \mathrm{Bin}(n_1 + n_2, 0.5)$ (why?).
• In general, if $X \sim \mathrm{Bin}(n_1, p)$ and $Y \sim \mathrm{Bin}(n_2, p)$ are independent, then $X + Y \sim \mathrm{Bin}(n_1 + n_2, p)$.
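This closure property is easy to verify numerically: convolving the two binomial PMFs term by term reproduces the Bin(n1 + n2, p) PMF. A minimal sketch (the values n1 = 5, n2 = 7, p = 0.5 are arbitrary illustrative choices):

```python
from math import comb

def binom_pmf(n, p):
    # PMF of Bin(n, p) as a list indexed by k = 0, ..., n
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def convolve(f, g):
    # PMF of X + Y for independent X, Y:
    # P(X + Y = k) = sum_j P(X = j) * P(Y = k - j)
    h = [0.0] * (len(f) + len(g) - 1)
    for j, fj in enumerate(f):
        for i, gi in enumerate(g):
            h[j + i] += fj * gi
    return h

n1, n2, p = 5, 7, 0.5
conv = convolve(binom_pmf(n1, p), binom_pmf(n2, p))
direct = binom_pmf(n1 + n2, p)
print(max(abs(a - b) for a, b in zip(conv, direct)))  # essentially 0
```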
The convolution is also known in the Poisson and normal cases:
If $X \sim \mathrm{Poisson}(\lambda_1)$ and $Y \sim \mathrm{Poisson}(\lambda_2)$ are independent, then $X + Y \sim \mathrm{Poisson}(\lambda_1 + \lambda_2)$. (See Example 5.3-1, p. 216 for the derivation.)
If $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$ are independent, then
$$X_1 + X_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2).$$
• Such nice results are the exception. In general, the distribution of a sum of independent random variables is not easy to determine.
If $X \sim \mathrm{Bin}(n_1, p_1)$ and $Y \sim \mathrm{Bin}(n_2, p_2)$, with $p_1 \ne p_2$, then the distribution of $X + Y$ is not binomial.
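One way to see this numerically: when $p_1 \ne p_2$, the sum's variance is strictly smaller than that of the binomial with the same number of trials and the same mean, so no binomial can match. A sketch with the arbitrary illustrative choice X ∼ Bin(10, 0.3), Y ∼ Bin(10, 0.7):

```python
from math import comb

def binom_pmf(n, p):
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def convolve(f, g):
    # PMF of the sum of two independent integer-valued RVs
    h = [0.0] * (len(f) + len(g) - 1)
    for j, fj in enumerate(f):
        for i, gi in enumerate(g):
            h[j + i] += fj * gi
    return h

# X ~ Bin(10, 0.3), Y ~ Bin(10, 0.7): the sum has mean 10, like Bin(20, 0.5),
# but its variance 10(0.3)(0.7) + 10(0.7)(0.3) = 4.2 is smaller than 20(0.25) = 5,
# so the sum cannot be binomial.
conv = convolve(binom_pmf(10, 0.3), binom_pmf(10, 0.7))
mean = sum(k * pk for k, pk in enumerate(conv))
var = sum(k * k * pk for k, pk in enumerate(conv)) - mean**2
print(mean, var)
print(max(abs(a - b) for a, b in zip(conv, binom_pmf(20, 0.5))))  # clearly nonzero
```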
The convolution of two independent U(0,1) random variables is shown in http://personal.psu.edu/acq/401/fig/convol_2Unif.pdf. This is derived in Example 5.3-3, p. 217. A numerical derivation will be done in the next lab.
In general, the formula for the convolution of two random variables is given in relation (5.3.3), p. 219. It is not easy to compute!
The convolution of several random variables is even harder to compute.
Distribution of Sums of Normal Random Variables
Proposition. Let $X_1, X_2, \dots, X_n$ be independent with $X_i \sim N(\mu_i, \sigma_i^2)$, and set $Y = a_1X_1 + \cdots + a_nX_n$. Then $Y \sim N(\mu_Y, \sigma_Y^2)$, where
$$\mu_Y = a_1\mu_1 + \cdots + a_n\mu_n, \qquad \sigma_Y^2 = a_1^2\sigma_1^2 + \cdots + a_n^2\sigma_n^2.$$
Corollary. Let $X_1, X_2, \dots, X_n$ be iid $N(\mu, \sigma^2)$. Then
$$\bar X \sim N\!\left(\mu, \frac{\sigma^2}{n}\right).$$
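The corollary can be checked by simulation: repeated sample means of iid $N(\mu, \sigma^2)$ draws should behave like draws from $N(\mu, \sigma^2/n)$. A sketch (the values µ = 10, σ = 2, n = 25 are arbitrary illustrative choices):

```python
import random
import statistics

random.seed(2)
mu, sigma, n = 10.0, 2.0, 25

# 20000 replications of the sample mean of n iid N(mu, sigma^2) draws;
# by the corollary these means should look like N(mu, sigma^2 / n).
means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(20_000)]
print(statistics.fmean(means))   # near mu = 10
print(statistics.stdev(means))   # near sigma / sqrt(n) = 2 / 5 = 0.4
```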
Why care about such results?
• This result is relevant for answering an important statistical question, namely how to control the estimation error via the sample size:
Example. We want to be 95% certain that $\bar X$ does not differ from the population mean by more than 0.3 units. How large should the sample size n be? It is known that the population distribution is normal with $\sigma^2 = 9$.
Remark: This example assumes normality with known variance. In real life the variance is also unknown. More details on this will be given in Chapter 8.
Solution: Using the previous corollary we have
$$P(|\bar X - \mu| < 0.3) = P\!\left(\left|\frac{\bar X - \mu}{\sigma/\sqrt{n}}\right| < \frac{0.3}{\sigma/\sqrt{n}}\right) = P\!\left(|Z| < \frac{0.3}{\sigma/\sqrt{n}}\right).$$
We want to find n so that $P(|\bar X - \mu| < 0.3) = 0.95$. Thus,
$$\frac{0.3}{\sigma/\sqrt{n}} = z_{0.025}, \quad\text{or}\quad n = \left(\frac{1.96}{0.3}\,\sigma\right)^2 = 384.16.$$
Thus, using n = 385 will satisfy the precision objective.
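The arithmetic can be reproduced in a couple of lines (Python here, purely as a check of the calculation):

```python
import math

# Solve 0.3 / (sigma / sqrt(n)) = z_{0.025} = 1.96 for n,
# with sigma = 3 (since sigma^2 = 9):
sigma, margin, z = 3.0, 0.3, 1.96
n = (z * sigma / margin) ** 2
print(n, "-> use n =", math.ceil(n))  # 384.16 -> use n = 385
```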
The complexity of finding the exact distribution of a sum (or average) of k (> 2) RVs increases rapidly with k. This is why the following is the most important theorem in probability and statistics.
Theorem (The Central Limit Theorem). Let $X_1, \dots, X_n$ be iid with mean $\mu$ and variance $\sigma^2 < \infty$. Then, for large enough n ($n \ge 30$ for the purposes of this course), $\bar X$ has approximately a normal distribution with mean $\mu$ and variance $\sigma^2/n$, i.e.,
$$\bar X \stackrel{\cdot}{\sim} N\!\left(\mu, \frac{\sigma^2}{n}\right).$$
The Central Limit Theorem: Comments
The CLT does not require the $X_i$ to be continuous RVs.
$T = X_1 + \cdots + X_n$ has approximately a normal distribution with mean $n\mu$ and variance $n\sigma^2$, i.e.,
$$T = X_1 + \cdots + X_n \stackrel{\cdot}{\sim} N(n\mu, n\sigma^2).$$
The CLT can be stated for more general linear combinations of independent r.v.s, and also for certain dependent RVs. The present statement suffices for the range of applications of this book.
The CLT explains the central role of the normal distribution in probability and statistics.
Example. Two towers are constructed, each by stacking 30 segments of concrete. The height (in inches) of a randomly selected segment is uniformly distributed in (35.5, 36.5). A roadway can be laid across the two towers provided their heights are within 4 inches of each other. Find the probability that the roadway can be laid.
Solution: The heights of the two towers are
$$T_1 = X_{1,1} + \cdots + X_{1,30}, \qquad T_2 = X_{2,1} + \cdots + X_{2,30},$$
where $X_{i,j}$ is the height of the j-th segment used in the i-th tower. By the CLT,
$$T_1 \stackrel{\cdot}{\sim} N(1080, 2.5), \qquad T_2 \stackrel{\cdot}{\sim} N(1080, 2.5).$$
Example (Continued). Since the two heights $T_1$, $T_2$ are independent r.v.s, their difference, D, is distributed as
$$D = T_1 - T_2 \stackrel{\cdot}{\sim} N(0, 5).$$
Therefore, the probability that the roadway can be laid is
$$P(-4 < D < 4) \doteq P\!\left(\frac{-4}{2.236} < Z < \frac{4}{2.236}\right) = \Phi(1.79) - \Phi(-1.79) = 0.9633 - 0.0367 = 0.9266.$$
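The table lookup $\Phi(1.79)$ can be cross-checked without tables: the standard normal CDF can be written in terms of the error function. A short sketch:

```python
import math

def phi(z):
    # Standard normal CDF via the error function:
    # Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# D = T1 - T2 is approximately N(0, 5), so
# P(-4 < D < 4) = Phi(4 / sqrt(5)) - Phi(-4 / sqrt(5)).
z = 4.0 / math.sqrt(5.0)
prob = phi(z) - phi(-z)
print(round(prob, 4))  # about 0.926
```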
Example. The waiting time for a bus, in minutes, is uniformly distributed in [0, 10]. In five months a person catches the bus 120 times. Find the 95th percentile, $t_{0.05}$, of the person's total waiting time T.
Solution: Write $T = X_1 + \cdots + X_{120}$, where $X_i$ is the waiting time the i-th time the person catches the bus. Since $n = 120 \ge 30$, and $E(X_i) = 5$, $\mathrm{Var}(X_i) = 10^2/12 = 100/12$, we have
$$T \stackrel{\cdot}{\sim} N\!\left(120 \times 5,\; 120\,\frac{100}{12}\right) = N(600, 1000).$$
Therefore,
$$t_{0.05} \simeq 600 + z_{0.05}\sqrt{1000} = 600 + 1.645\sqrt{1000} = 652.02.$$
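As a quick check of the last step (Python, purely for the arithmetic):

```python
import math

# T is approximately N(600, 1000), so the 95th percentile of T is
# t_{0.05} = 600 + z_{0.05} * sqrt(1000), with z_{0.05} = 1.645.
t = 600 + 1.645 * math.sqrt(1000)
print(round(t, 2))  # 652.02
```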
The DeMoivre-Laplace theorem was proved first by DeMoivre in 1733 for p = 0.5 and extended to general p by Laplace in 1812. Due to the prevalence of the Bernoulli distribution in the experimental sciences, it continues to be stated separately even though it is now recognized as a special case of the CLT.
Theorem (DeMoivre-Laplace). If $T \sim \mathrm{Bin}(n, p)$ then, for n satisfying $np \ge 5$ and $n(1-p) \ge 5$,
$$T \stackrel{\cdot}{\sim} N\bigl(np,\; np(1-p)\bigr).$$
The continuity correction
For an integer-valued RV X, and Y the approximating normal RV, the principle of continuity correction specifies
$$P(X \le k) \simeq P(Y \le k + 0.5).$$
The DeMoivre-Laplace theorem and continuity correction yield
$$P(T \le k) \simeq \Phi\!\left(\frac{k + 0.5 - np}{\sqrt{np(1-p)}}\right).$$
For individual probabilities use
$$P(X = k) \simeq P(k - 0.5 < Y < k + 0.5).$$
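The benefit of the correction is easy to quantify. For example, for $T \sim \mathrm{Bin}(10, 0.5)$ and $k = 5$, the exact probability can be compared with the normal approximation with and without the correction (a sketch; erf gives the normal CDF):

```python
from math import comb, erf, sqrt

def phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p, k = 10, 0.5, 5
exact = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))
mu, var = n * p, n * p * (1 - p)
with_cc = phi((k + 0.5 - mu) / sqrt(var))   # with continuity correction
without_cc = phi((k - mu) / sqrt(var))      # without
print(exact, with_cc, without_cc)
# exact is about 0.6230; 0.6241 with the correction vs 0.5000 without
```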
Figure: Bin(10, 0.5) PMF and N(5, 2.5) PDF. Approximation of $P(X \le 5)$ with and without continuity correction.
Example. A college basketball team plays 30 regular season games, 16 of which are against class A teams and 14 against class B teams. The probability that the team wins a game is 0.4 against a class A team and 0.6 against a class B team. Assuming that the results of different games are independent, approximate the probability that
1 the team will win at least 18 games;
2 the number of wins against class B teams is smaller than that against class A teams.
Solution: If $X_1$ and $X_2$ denote the number of wins against class A teams and class B teams, respectively, then
$$X_1 \sim \mathrm{Bin}(16, 0.4), \qquad X_2 \sim \mathrm{Bin}(14, 0.6).$$
Since $16 \times 0.4$ and $14 \times 0.4$ are both $\ge 5$,
$$X_1 \stackrel{\cdot}{\sim} N(16 \times 0.4,\; 16 \times 0.4 \times 0.6) = N(6.4, 3.84),$$
$$X_2 \stackrel{\cdot}{\sim} N(14 \times 0.6,\; 14 \times 0.6 \times 0.4) = N(8.4, 3.36).$$
Consequently, since $X_1$ and $X_2$ are independent,
$$X_1 + X_2 \stackrel{\cdot}{\sim} N(6.4 + 8.4,\; 3.84 + 3.36) = N(14.8, 7.20),$$
$$X_2 - X_1 \stackrel{\cdot}{\sim} N(8.4 - 6.4,\; 3.84 + 3.36) = N(2, 7.20).$$
Solution continued: Hence, using also the continuity correction, the needed approximation for part 1 is
$$P(X_1 + X_2 \ge 18) = 1 - P(X_1 + X_2 \le 17) \simeq 1 - \Phi\!\left(\frac{17.5 - 14.8}{\sqrt{7.2}}\right) = 1 - \Phi(1.006) = 1 - 0.843 = 0.157,$$
and the needed approximation for part 2 is
$$P(X_2 - X_1 < 0) \simeq \Phi\!\left(\frac{-0.5 - 2}{\sqrt{7.2}}\right) = \Phi(-0.932) = 0.176.$$
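Since the two sample spaces are small, the approximations can also be compared with the exact probabilities, obtained by convolving the two binomial PMFs. A sketch:

```python
from math import comb, erf, sqrt

def phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def binom_pmf(n, p):
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

pA = binom_pmf(16, 0.4)  # wins against class A teams
pB = binom_pmf(14, 0.6)  # wins against class B teams

# Normal approximations with continuity correction, as in the slides:
approx1 = 1 - phi((17.5 - 14.8) / sqrt(7.2))   # P(X1 + X2 >= 18)
approx2 = phi((-0.5 - 2.0) / sqrt(7.2))        # P(X2 - X1 < 0)

# Exact probabilities by summing over the joint PMF of (X1, X2):
exact1 = sum(pA[i] * pB[j] for i in range(17) for j in range(15) if i + j >= 18)
exact2 = sum(pA[i] * pB[j] for i in range(17) for j in range(15) if j < i)

print(round(approx1, 3), "vs exact", round(exact1, 3))
print(round(approx2, 3), "vs exact", round(exact2, 3))
```

The approximate and exact values should agree to about two decimal places here.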
Consistency/inconsistency results
• Consistency of the sample proportion
n=100; x=rbinom(1,n,0.3); x/n # Repeat with n=1000 and n=10000
• Consistency of the sample mean for normal samples
n=100; mean(rnorm(n)) # Repeat with n=1000 and n=10000
• Inconsistency of the sample mean when E(X) does not exist
n=100; mean(rcauchy(n)) # Repeat with n=1000 and n=10000
Plotting probabilities
• Probabilities can be plotted with the plot and barplot commands. Empirical probabilities can be plotted with hist and barplot.
Plot the pmf of the Bin(10, 0.5):
plot(0:10, dbinom(0:10, 10, 0.5), pch=4, col=2)
lines(0:10, dbinom(0:10, 10, 0.5), col=2)
Alternatively: p=dbinom(0:10, 10, 0.5); barplot(p)
Try also: barplot(p, xlim=c(0,12), ylim=c(0,0.25), col=2)
Plot empirical probabilities resulting from x=rbinom(1000,10,0.5):
hist(x, seq(-0.5, 10.5, 1), col=5)
Alternatively: ph=table(x)/1000; barplot(ph)
Averages of uniform random variables
The pdf of the sum of two independent uniform random variables can be obtained from formula (5.3.3) in the book, i.e.,
$$f_{X_1+X_2}(y) = \int_{-\infty}^{\infty} f_{X_2}(y - x)\,f_{X_1}(x)\,dx,$$
and plotted by
g=function(x){x*(x<1) + (2-x)*(1<=x)*(x<2)}
curve(g,0,2)
Formula (5.3.3) can be applied repeatedly to get the pdf of the sum of 3 or more uniform r.v.s, but it quickly becomes very difficult.
So, we employ simulations to get an idea what the pdf of thesum (or average) of several uniform r.v.s looks like:
m=matrix(runif(5000000), nrow=80)
hist(m[1,], seq(0, 1, 0.1)) # should look like the pdf of the uniform
hist((m[1,]+m[2,])/2, seq(0, 1, 0.01)) # should look like the plotted pdf
hist(colMeans(m), seq(min(colMeans(m))-0.01, max(colMeans(m))+0.01, 0.01)) # shows what the density of X̄ for n = 80 looks like
mean(m[1,]); sd(m[1,]) # should be approximately 0.5 and √(1/12) = 0.2887
mean(colMeans(m)); sd(colMeans(m)) # should be approximately 0.5 and √(1/(12 · 80)) = 0.0323
Averages of log-normal random variables
curve(dlnorm, 0, 10) # this density has µ = exp(0.5) = 1.6487 and σ² = (e − 1)e = 4.6708
m=matrix(rlnorm(5000000), nrow=80)
hist(m[1,], seq(0, max(m[1,])+0.3, 0.3)) # should look like the plotted pdf
hist(colMeans(m), seq(min(colMeans(m))-0.1, max(colMeans(m))+0.1, 0.1)) # shows what the density of X̄ for n = 80 looks like
mean(m[1,]); sd(m[1,]) # should be approximately 1.6487 and √4.6708 = 2.1612
mean(colMeans(m)); sd(colMeans(m)) # should be approximately 1.6487 and √(4.6708/80) = 0.2416
Averages of negative binomial random variables
p=dnbinom(0:15, 5, 0.6); barplot(p) # has µ = 5(1/0.6) − 5 = 3.3333 and σ² = 5(0.4/0.6²) = 5.5556
m=matrix(rnbinom(4000000, 5, 0.6), nrow=80)
hist(m[1,], seq(-0.5, floor(max(m[1,]))+0.5, 1)) # should look like the plotted pmf
hist(colMeans(m), seq(floor(min(colMeans(m)))-0.5, floor(max(colMeans(m)))+1.5, 0.1)) # shows what the density of X̄ for n = 80 looks like
mean(m[1,]); sd(m[1,]) # should be approximately 3.3333 and √5.5556 = 2.357
mean(colMeans(m)); sd(colMeans(m)) # should be approximately 3.3333 and √(5.5556/80) = 0.2635
Model-Free and Model-Based Estimators · Comparing Estimators · The Method of Moments
What is a model?
In probability and statistics the word "model" refers to an assumption we make about the data at hand.
For univariate discrete data we may assume the binomial, Poisson, or another discrete probability model.
For univariate continuous data we may assume the normal, exponential, or another continuous probability model.
For bivariate data we may assume the bivariate normal model, the linear regression model, or another model.
When a model for the distribution of the data is assumed, it is of interest to estimate the model parameters.
In the normal model we want to estimate µ and σ².
In the uniform model, the two endpoints are of interest.
In the SLR model we want $\beta_0$, $\beta_1$, and $\sigma_\varepsilon^2$.
Estimating the model parameters is called fitting a model to the data. Estimating the model parameters is quite different from estimating the population mean, variance, percentiles, etc., which we saw in Chapter 1.
For example, how does one estimate the endpoints of a uniform distribution? (Chapter 6 presents two ways of doing this, and one of them will be discussed.)
Estimators of the model parameters are meaningful only if the assumed model is correct.
Estimators of the endpoints of the uniform distribution would be meaningless if the data are not uniformly distributed.
Estimators of the slope and intercept of the SLR model lose their interpretation if the true regression function is a higher-order polynomial.
The estimators of Chapter 1 do not have this problem. For example, assuming the population mean, µ, exists, $\bar X$ estimates µ regardless of the population distribution.
• The estimators of Chapter 1 will be called nonparametric or model-free.
Fitting models to data leads to alternative estimators of the population mean, variance, proportions, and percentiles.
If the normal model is assumed:
a) The density is estimated by the $N(\bar X, S^2)$ density.
b) The percentiles are also estimated as $\bar X + S z_\alpha$.
c) In addition to sample proportions, $P(X \le x)$ may also be estimated by $\Phi((x - \bar X)/S)$.
If the uniform model is assumed, and $\hat a$, $\hat b$ are the estimators of the two endpoints:
a) The mean and median are also estimated as $(\hat a + \hat b)/2$.
b) The variance is also estimated as $(\hat b - \hat a)^2/12$.
If the Poisson model is assumed, $\bar X$ is an alternative estimator of the variance.
• Such alternative estimators are called model-based.
• The existence of alternative estimators for the same population quantity raises the question of which is "better".
Estimators are compared in terms of their mean and variance. [Piece of jargon: The standard deviation of an estimator is called its standard error.]
If the mean of an estimator $\hat\theta$ equals its true value, i.e.,
$$E_\theta(\hat\theta) = \theta,$$
it is called unbiased. The difference expected value − true value is the bias:
$$\mathrm{bias}(\hat\theta) = E_\theta(\hat\theta) - \theta.$$
$\bar X$ and $\hat p$ are unbiased estimators:
$$E_\mu(\bar X) = \mu, \qquad E_p(\hat p) = p.$$
$S^2$ is also unbiased: $E_{\sigma^2}(S^2) = \sigma^2$. See p. 232 for a proof.
Later we will also see that $\hat\beta_0$ and $\hat\beta_1$ are unbiased.
However,
$S$ is only a consistent (not unbiased) estimator of $\sigma$.
$\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X^2$ is only a consistent (not unbiased) estimator of $\sigma^2$.
The Mean Square Error Selection Criterion
Variance and bias combine to define the Mean Square Error, $\mathrm{MSE}_\theta(\hat\theta) = E_\theta(\hat\theta - \theta)^2$:
$$\mathrm{MSE}_\theta(\hat\theta) = \mathrm{Var}(\hat\theta) + [\mathrm{bias}(\hat\theta)]^2.$$
The MSE selection criterion says that estimator $\hat\theta_1$ should be preferred over $\hat\theta_2$ if
$$\mathrm{MSE}_\theta(\hat\theta_1) \le \mathrm{MSE}_\theta(\hat\theta_2).$$
If $\hat\theta_1$ and $\hat\theta_2$ are unbiased, we have the variance selection criterion: $\hat\theta_1$ should be preferred over $\hat\theta_2$ if
$$\mathrm{Var}(\hat\theta_1) \le \mathrm{Var}(\hat\theta_2).$$
On the basis of the MSE (or variance) selection criterion, we can make the following statement:
• Under correct model assumptions, model-based estimators are preferable to the model-free ones. For example:
If the normal assumption is correct, $\bar X + S z_\alpha$ is preferable to the $(1-\alpha)100\%$ sample percentile.
If the normal assumption is correct, $\Phi((x - \bar X)/S)$ is preferable to the sample proportion.
If the normal assumption is correct, $\bar X$ is preferable to the sample median $\widetilde X$.
If the Poisson assumption is correct, $\bar X$ is preferable (as an estimator of the variance!) to $S^2$.
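The Poisson case in the last bullet can be checked by simulation: under the Poisson model both $\bar X$ (model-based) and $S^2$ (model-free) estimate the variance $\lambda$, and the simulated MSEs show which is better. A sketch (λ = 4, n = 20 are arbitrary illustrative choices; the Poisson sampler uses Knuth's multiplication method):

```python
import math
import random
import statistics

random.seed(3)
lam, n, reps = 4.0, 20, 5_000

def poisson(lam):
    # Knuth's multiplication method for a Poisson(lam) draw.
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

# Under the Poisson model Var(X) = lambda = E(X), so both Xbar and S^2
# are estimators of the variance; estimate their MSEs by simulation.
mse_mean = mse_var = 0.0
for _ in range(reps):
    x = [poisson(lam) for _ in range(n)]
    mse_mean += (statistics.fmean(x) - lam) ** 2
    mse_var += (statistics.variance(x) - lam) ** 2
mse_mean /= reps
mse_var /= reps
print(mse_mean, mse_var)  # Xbar should show the clearly smaller MSE
```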
For $r \ge 1$, the r-th moment of a r.v. X is defined as $E(X^r)$.
Let $X_1, \dots, X_n$ be iid from the same population as X. Then:
By the LLN, $\frac{1}{n}\sum_{i=1}^n X_i^r$ is a consistent estimator of $E(X^r)$. It is also unbiased.
The method of moments is based on expressing the model parameters in terms of the moments.
If $X \sim \mathrm{Unif}(0, \theta)$ then $\theta = 2\mu$.
If $X \sim \mathrm{Unif}(\alpha, \beta)$ then
$$\alpha = \mu - \sqrt{3\sigma^2}, \qquad \beta = \mu + \sqrt{3\sigma^2},$$
and, of course, $\sigma^2 = E(X^2) - \mu^2$.
Additional examples include:
If $X \sim \mathrm{Exp}(\lambda)$, $\lambda = 1/\mu$; if $X \sim \mathrm{Poisson}(\lambda)$, $\lambda = \mu$.
Ignoring the difference between $\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X^2$ and $S^2$, the above lead to the following method of moments estimators:
If $X_1, \dots, X_n$ is a sample from $N(\mu, \sigma^2)$, then $\hat\mu = \bar X$, $\hat\sigma^2 = S^2$.
If $X_1, \dots, X_n$ is a sample from $\mathrm{Unif}(0, \theta)$, $\hat\theta = 2\bar X$.
If $X_1, \dots, X_n$ is a sample from $\mathrm{Unif}(\alpha, \beta)$,
$$\hat\alpha = \bar X - \sqrt{3S^2}, \qquad \hat\beta = \bar X + \sqrt{3S^2}.$$
If $X_1, \dots, X_n$ is a sample from $\mathrm{Exp}(\lambda)$, $\hat\lambda = 1/\bar X$.
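As a sanity check, the $\mathrm{Unif}(\alpha, \beta)$ estimators can be tried on simulated data. A sketch (the endpoints 2 and 7 are arbitrary illustrative choices):

```python
import math
import random
import statistics

random.seed(4)
alpha, beta, n = 2.0, 7.0, 10_000

# Simulate a Unif(2, 7) sample and apply the method of moments estimators
# alpha_hat = Xbar - sqrt(3 S^2), beta_hat = Xbar + sqrt(3 S^2).
x = [random.uniform(alpha, beta) for _ in range(n)]
xbar = statistics.fmean(x)
s2 = statistics.variance(x)
alpha_hat = xbar - math.sqrt(3 * s2)
beta_hat = xbar + math.sqrt(3 * s2)
print(alpha_hat, beta_hat)  # both should land close to 2 and 7
```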
• In expressing the model parameters in terms of moments, the method of moments uses the lowest possible moments.
If $X_i$, $i = 1, \dots, n$, are iid $\mathrm{Poisson}(\lambda)$, use $\hat\lambda = \bar X$ instead of $\hat\lambda = S^2$.
• For bivariate data $(X_i, Y_i)$, $i = 1, \dots, n$, the product moment $E(XY)$ is estimated by $\frac{1}{n}\sum_{i=1}^n X_iY_i$.
The method of moments estimator of $\mathrm{Cov}(X, Y)$ is
$$\frac{1}{n}\sum_{i=1}^n X_iY_i - \bar X\,\bar Y,$$
which is slightly different from the sample version of $\mathrm{Cov}(X, Y)$.