Week 10: Point Estimation and Confidence Intervals



Week 10 Objectives

1 Simulations are used to demonstrate: a) the property of unbiasedness, b) the relation between bias and sample size, and c) the criterion of mean square error for the comparison of estimators.

2 The method of least squares is introduced and the least squares estimators for the simple linear regression model are derived.

3 The notion of interval estimation is explained, and confidence intervals for a mean, a proportion and the regression parameters are presented.

4 Sample size selection for precision in estimation is discussed.

5 Finally, the notion of prediction is introduced and prediction intervals are presented.


1 Lab 6: Unbiased and Biased Estimators; MSE

2 The Method of Least Squares

3 Interval Estimation

Basic Ideas of Interval Estimation

CIs for a Mean and a Proportion

CIs for the Regression Parameters

The issue of Precision

4 Prediction Intervals


Note on the apply function

apply is an incredibly useful function for simulations, i.e. for making the same calculation/procedure repeatedly over many samples stored as the columns or rows of a matrix. The apply function takes three arguments:

1 the matrix you wish to apply the procedure to,
2 either 1 or 2, depending on whether you are working on the rows or the columns, and
3 the procedure you wish to apply.

For example: m = matrix(1:10, ncol=2); m. Try

1 apply(m,2,sum); apply(m,1,sum)
2 x = apply(m,2,sum); sum(x), or simply sum(apply(m,2,sum))

NOTE: A related function, tapply, was used in Lab 2.


"Proof" of E(S²) = σ² and E(S) ≠ σ

We will use simulations to verify that E(S²) = σ² and E(S) ≠ σ. The "proof" uses the "consistency" of the averages:

1 Generate a large number, say B, of samples of size n ≥ 2 from any distribution (with σ² < ∞), and calculate S² from each sample. The sample variances S²_1, . . . , S²_B can be thought of as iid random variables.

2 By the LLN, the average of S²_1, . . . , S²_B should be approximately E(S²). Thus, if the average is close to σ² we have numerical verification that S² is unbiased.

3 Similarly, E(S) ≠ σ can be verified by checking that the corresponding average of the S values is not close to the population standard deviation.


Verification using Normal Samples

Choose the N(0,1) distribution and sample size n = 2. Generate B = 10,000 samples of size 2 from the N(0,1) distribution, by generating 20,000 N(0,1) random numbers and arranging them in the 10,000 columns of the matrix m:

m = matrix(rnorm(20000), ncol=10000)

Use mean(apply(m,2,var)) for the average of the 10,000 sample variances. This should be close to 1, verifying that E(S²) = σ², i.e., that S² is an unbiased estimator of σ².

Use mean(apply(m,2,sd)) for the average of the 10,000 sample standard deviations. This should not be so close to 1, verifying that S is a biased estimator of σ.


The bias of biased estimators decreases as n increases

Use

m = matrix(rnorm(200000), ncol=10000)
mean(apply(m,2,sd))

to approximate E(S) when the sample size is 20. It is clear that the bias of S for n = 20 is smaller than that for n = 2.


"Proof" that for normal samples X̄ is better than X̃

• We will use simulations and the MSE criterion to compare X̄ and X̃ (the sample median) as estimators of the normal mean.

1 Generate B = 10,000 samples of size n = 20 from N(0,1), and for each sample compute the sample mean and median.

2 Use the 10,000 sample means to approximate MSE(X̄) = Var(X̄) + Bias(X̄)². Similarly approximate MSE(X̃).

3 Which of X̄, X̃ is better?

4 The R commands for implementing items 1-2 are:

m = matrix(rnorm(200000), ncol=10000)
means = apply(m,2,mean); medians = apply(m,2,median)
var(means) + mean(means)**2; var(medians) + mean(medians)**2

The Method of Least Squares

• The method of least squares (LS) finds the line that achieves the least sum of squared vertical distances from the data points.

Figure: Illustration of the vertical distances of the data points from two candidate lines (Line 1 and Line 2)


The Least Squares Estimators

Since the vertical distance of the point (Xi, Yi) from a line a + bx is Yi − (a + bXi), the method of LS minimizes

∑i (Yi − a − bXi)²  with respect to a, b.

This minimization problem has a simple closed-form solution:

β̂1 = [n∑XiYi − (∑Xi)(∑Yi)] / [n∑Xi² − (∑Xi)²],   α̂1 = Ȳ − β̂1X̄.

These are the same as the empirical estimators, but the method of LS applies also to more complicated regression functions.
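As a quick sketch (the data vectors x and y below are hypothetical, made up only for illustration), the least squares estimates can be computed directly from these formulas, or with R's built-in lm() function:

x = c(1, 2, 3, 4, 5); y = c(2.1, 3.9, 6.2, 8.1, 9.8)   # hypothetical data
n = length(x)
b1 = (n*sum(x*y) - sum(x)*sum(y)) / (n*sum(x^2) - sum(x)^2)   # slope estimate
a1 = mean(y) - b1*mean(x)                                     # intercept estimate
fit = lm(y ~ x)        # the same estimates via R's linear model function
c(a1, b1); coef(fit)   # the two sets of estimates should agree

The fit object is reused in some of the sketches that follow.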


Some Regression Jargon

Estimated regression line: µ̂Y|X(x) = α̂1 + β̂1x.

Fitted or predicted values: Ŷi = µ̂Y|X(Xi), i = 1, . . . , n.

Residuals: ε̂i = Yi − Ŷi, i = 1, . . . , n.

Figure: Illustration of Fitted Values and Residuals


Estimation of the Intrinsic Error Variance

Sum of Squared Errors, or SSE: ∑i ε̂i².

A computational formula for SSE is

SSE = ∑Yi² − α̂1∑Yi − β̂1∑XiYi.

Mean Sum of Squared Errors, or MSE: SSE/(n − 2).

n − 2 is called the Residuals' Degrees of Freedom.

The MSE is an unbiased estimator of the error variance:

σ̂²ε = S²ε = SSE/(n − 2).
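A minimal sketch in R (continuing the hypothetical x, y and fit from the least squares sketch above): the residuals, SSE and S²ε can be read off the fitted model.

e   = resid(fit)             # residuals Yi - Yihat
SSE = sum(e^2)               # sum of squared errors
S2e = SSE / (length(x) - 2)  # SSE/(n-2); equals summary(fit)$sigma^2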


Prediction via the estimated regression line

The estimated regression line µ̂Y|X(x) = α̂1 + β̂1x is also used for predicting the response (i.e. the Y) at the value x of X.

For example, the Y at X = 3.8 is predicted by µ̂Y|X(3.8) = α̂1 + 3.8β̂1.

CAUTION: This prediction should only be applied if the value x lies within the range of the X-values in the data set.
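In R, such a prediction can be obtained with predict() (a sketch, again using the hypothetical fit from earlier; x = 3.8 lies within the range of the hypothetical x-values):

predict(fit, newdata = data.frame(x = 3.8))   # predicted Y at X = 3.8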


Example. n = 10 data points on X = stress applied and Y = time to failure yield ∑Xi = 200, ∑Xi² = 5412.5, ∑Yi = 484, ∑Yi² = 24,732, ∑XiYi = 8407.5. Find the best fitting line and the estimated error variance.

Solution: The formulas for β̂1, α̂1 and S²ε yield

β̂1 = [10(8407.5) − (200)(484)] / [10(5412.5) − (200)²] = −0.900885,

α̂1 = (1/10)(484) − (−0.900885)(200/10) = 66.4177,

S²ε = (1/8)[∑Yi² − α̂1∑Yi − β̂1∑XiYi] = (1/8)[24,732 − 32,146.17 + 7,574.19] = 20.0025.
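The example's hand computation can be checked in R from the given summary statistics (a sketch; the sums are taken from the example above):

n = 10; Sx = 200; Sxx = 5412.5; Sy = 484; Syy = 24732; Sxy = 8407.5
b1 = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)     # -0.900885
a1 = Sy/n - b1*Sx/n                       # 66.4177
S2e = (Syy - a1*Sy - b1*Sxy) / (n - 2)    # 20.0025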


Example. In the context of the previous example, predict Y at X = 38 and at X = 58.

Solution: First we need to check if 38 and 58 lie within the range of the X-values in the sample. In the absence of the actual values, we check if 38 and 58 lie within three standard deviations of the sample mean. (Why is this reasonable?) Here X̄ = 20 and

SX = [(1/9)(∑Xi² − 10X̄²)]^(1/2) = 12.52.

Since X̄ + 3SX = 57.6, we cannot predict at 58. The predicted Y at 38 is µ̂Y|X(38) = 66.4177 − 0.900885 · 38 = 32.184.


Interval Estimation: Basic Ideas of Interval Estimation


Bounding the Error of Estimation

By the CLT, if n is large, most estimators, θ̂, are (at least approximately) normally distributed, with mean equal to the true value, θ, of the parameter they estimate, i.e.,

θ̂ ∼ N(θ, σ²θ̂) (approximately).

For θ = µ, θ̂ = X̄ and σ²θ̂ = σ²/n. For θ = p, θ̂ = p̂ and σ²θ̂ = p(1 − p)/n.

It follows that the estimation error can be bounded with probability as high as desired. For example, |θ̂ − θ| ≤ 1.96 σθ̂ holds 95% of the time.


From Error Bounds to Confidence Intervals

The probabilistic error bound can be re-written as

θ̂ − 1.96 σθ̂ ≤ θ ≤ θ̂ + 1.96 σθ̂,

i.e., an interval of plausible values for θ, with degree of plausibility approximately 95%.

Such intervals are called Z confidence intervals, or Z CIs.

In general, the 100(1 − α)% Z CI is of the form

θ̂ − zα/2 σθ̂ ≤ θ ≤ θ̂ + zα/2 σθ̂,  or  θ̂ ± zα/2 σθ̂.

Z CIs for the mean require known variance, and either the assumption of normality or n ≥ 30. Z CIs will be used primarily for proportions.


The T Distribution and T Intervals

When sampling from normal populations, an estimator θ̂ of some parameter θ often satisfies, for all sample sizes n,

(θ̂ − θ)/σ̂θ̂ ∼ Tν, where σ̂θ̂ is the estimated standard error,

and Tν stands for the T distribution with ν degrees of freedom. For θ̂ = X̄, ν = n − 1. For the SLR parameters, ν = n − 2.

A T distribution is symmetric and its pdf tends to that of the standard normal as ν tends to infinity.

The 100(1 − α/2)th percentile of the T distribution with ν degrees of freedom will be denoted by tν,α/2.


Figure: PDF and Percentile of a T Distribution with ν degrees of freedom.

As the DF ν gets large, tν,α/2 approaches zα/2. For example, for ν = 9, 19, 60 and 120, tν,0.05 is:

1.833, 1.729, 1.671, 1.658,

respectively, while z0.05 = 1.645.
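These percentiles can be checked in R with qt() and qnorm() (a small sketch):

qt(0.95, c(9, 19, 60, 120))   # 1.833 1.729 1.671 1.658
qnorm(0.95)                   # 1.645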


Plots of N(0,1) and T densities in R

http://personal.psu.edu/acq/401/fig/ComparTdensit.pdf


The relation (θ̂ − θ)/σ̂θ̂ ∼ Tν leads to the following 1 − α bound on the error of estimation of θ:

|θ̂ − θ| ≤ tν,α/2 σ̂θ̂.

This error bound leads to the (1 − α)100% T CI for θ:

(θ̂ − tν,α/2 σ̂θ̂, θ̂ + tν,α/2 σ̂θ̂).    (4.1)

T intervals will be used for the mean, as well as for the regression parameters in the linear regression model.


CIs for a Mean and a Proportion


T CIs for the Mean

• The (1 − α)100% T CI for the mean is

(X̄ − tn−1,α/2 S/√n, X̄ + tn−1,α/2 S/√n).

• The above T CI is valid with any sample size if the sample comes from a normal population.

• If the sample does not come from a normal population, the T CI is approximately correct provided n ≥ 30.

• If n < 30, the plausibility of the assumption of normality can be checked with a Q-Q plot. See Example 7.3-1, p. 258.


Example. A sample of n = 56 cotton pieces gave average percent elongation of X̄ = 8.17 with S = 1.42. Construct a 95% CI for population mean percent elongation.

Solution. Because n = 56, the T CI can be used without the normality assumption. Here α = 0.05, so we need the 97.5th percentile of the T distribution with ν = 56 − 1 = 55 degrees of freedom. Since Table A.4 does not list percentiles for ν = 55, we interpolate the 97.5th percentiles corresponding to ν = 40 and ν = 80 to get t55,0.025 = 2.01, approximately. (The R command qt(0.975, 55) returns 2.004.) Using the approximate percentile value, the 95% T CI for µ is

8.17 ± 2.01 · 1.42/√56 = (7.79, 8.55).
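The same interval can be computed in R from the summary statistics (a sketch, using the exact percentile from qt):

xbar = 8.17; s = 1.42; n = 56
xbar + c(-1, 1) * qt(0.975, n - 1) * s / sqrt(n)   # approximately (7.79, 8.55)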


Z CIs for Proportions

• The (approximate) (1 − α)100% Z CI for a population proportion p is

p̂ ± zα/2 √(p̂(1 − p̂)/n).

• The approximation is considered adequate if np̂ ≥ 8 and n(1 − p̂) ≥ 8, i.e., at least 8 successes and at least 8 failures.
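A minimal sketch of this interval in R (the counts 985 out of 1516 are taken from the Gallup example on the next slide):

x = 985; n = 1516; phat = x/n
phat + c(-1, 1) * qnorm(0.975) * sqrt(phat*(1 - phat)/n)   # approximately 0.65 +/- 0.024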


Example. In a Gallup Survey on the drinking habits of adult Americans, 985 out of 1516 interviewed said that they drink beer, wine, or hard liquor on a regular basis. Construct a 95% Z CI for the proportion, p, of all Americans who drink.

Solution: Here α = 0.05, and z0.025 = 1.96. Thus

985/1516 ± 1.96 √(0.65 × 0.35/1516) = 0.65 ± 0.024.

QUESTION: An interpretation of the above CI is that the probability is 0.95 that the true proportion of adults who drink lies in the interval you obtained. True or False?


CIs for the Regression Parameters


T CIs for the Slope of a Regression Line

Proposition (The estimated s.e. of β̂1)

Let (X1, Y1), . . . , (Xn, Yn) be iid satisfying E(Yi | Xi = x) = α1 + β1x and Var(Yi | Xi = x) = σ²ε, the same for all x. Then,

Sβ̂1 = √( S²ε / (∑Xi² − (1/n)(∑Xi)²) ),  where

S²ε = (1/(n − 2)) [∑Yi² − α̂1∑Yi − β̂1∑XiYi].


• The (1 − α)100% T CI for β1 is

(β̂1 − tn−2,α/2 Sβ̂1, β̂1 + tn−2,α/2 Sβ̂1).

• If the intrinsic error variable is normally distributed, the above T CI is valid for all sample sizes.

• If the intrinsic error variable is not normally distributed, the above T CI is approximately correct provided n ≥ 30.
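In R, these intervals come from confint() applied to an lm fit (a sketch, reusing the hypothetical fit from the least squares sketch earlier):

confint(fit, level = 0.95)    # rows: (Intercept) = alpha1, x = beta1
summary(fit)$coefficients     # estimates and their estimated standard errors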


Example (Y = propagation of stress wave, X = tensile strength)

In this study, n = 14, ∑Xi = 890, ∑Xi² = 67,182, ∑Yi = 37.6, ∑Yi² = 103.54 and ∑XiYi = 2234.30. Let Y1 denote an observation made at X1 = 30, and Y2 denote an observation at X2 = 35. Construct a 95% CI for E(Y1 − Y2).

Solution. Note that E(Y1 − Y2) = −5β1 (why?). We will first construct a 95% CI for β1. With the available information we compute: β̂1 = −0.0147209, α̂1 = 3.6209072, and S²ε = 0.02187. Thus,

Sβ̂1 = √( S²ε / (∑Xi² − (1/n)(∑Xi)²) ) = √( 0.02187 / (67,182 − 890²/14) ) = 0.001414,


Example (Continued)

so that the 95% CI for β1 is

β̂1 ± t0.025,12 Sβ̂1 = −0.0147209 ± 2.179 × 0.001414
= −0.0147209 ± 0.00308
= (−0.0178, −0.01164).

The 95% CI for −5β1 follows by multiplying the endpoints of the above CI by −5:

(0.0582, 0.089).


T CIs for the Regression Line

Proposition (The estimated s.e. of µ̂Y|X=x = α̂1 + β̂1x)

Let (X1, Y1), . . . , (Xn, Yn) be iid satisfying E(Yi | Xi = x) = α1 + β1x and Var(Yi | Xi = x) = σ²ε, the same for all x. Then,

Sµ̂Y|X=x = Sε √( 1/n + n(x − X̄)² / (n∑Xi² − (∑Xi)²) ),  where

S²ε = (1/(n − 2)) [∑Yi² − α̂1∑Yi − β̂1∑XiYi].


• The (1 − α)100% T CI for µY|X=x is

(µ̂Y|X=x − tn−2,α/2 Sµ̂Y|X=x, µ̂Y|X=x + tn−2,α/2 Sµ̂Y|X=x).

• If the intrinsic error variable is normally distributed, the above T CI is valid for all sample sizes.

• If the intrinsic error variable is not normally distributed, the above T CI is approximately correct provided n ≥ 30.
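In R, a CI for the regression line at a given x is obtained with predict(..., interval = "confidence") (a sketch with the hypothetical fit from earlier and an arbitrary value x = 3):

predict(fit, newdata = data.frame(x = 3), interval = "confidence", level = 0.95)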


• The n(x − X̄)² term in the expression for Sµ̂Y|X=x means that the CIs for µY|X=x get wider as x gets farther away from X̄.

Figure: Confidence intervals for µY|X=x get wider away from X̄

• Estimation of µY|X=x for x < X(1) or x > X(n) is NOT recommended.


Example

n = 11 data points yield ∑Xi = 292.90, ∑Yi = 69.03, ∑Xi² = 8141.75, ∑XiYi = 1890.2, ∑Yi² = 442.1903. Construct a 95% CI for µY|X=26.627.

Solution: From the available information we obtain µ̂Y|X(x) = 2.22494 + 0.152119x and Sε = 0.3444. Thus,

Sµ̂Y|X=x = 0.3444 √( 1/11 + 11(x − X̄)² / (11(8141.75) − (292.9)²) ).

Here X̄ = 26.627, so Sµ̂Y|X=26.627 = 0.1038. Thus, since t0.025,9 = 2.262,

µ̂Y|X=26.627 ± 2.262 × 0.1038 = 6.275 ± 0.235.

Note that for µY|X=25, Sµ̂Y|X=25 = 0.1082, so the corresponding 95% CI, which is 6.028 ± 0.245, is wider.
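As a sketch, the CI at x = 26.627 can be reproduced in R from the sums given in the example:

n = 11; Sx = 292.90; Sy = 69.03; Sxx = 8141.75; Sxy = 1890.2; Syy = 442.1903
b1 = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2); a1 = Sy/n - b1*Sx/n
Se = sqrt((Syy - a1*Sy - b1*Sxy) / (n - 2))
x0 = 26.627
Smu = Se * sqrt(1/n + n*(x0 - Sx/n)^2 / (n*Sxx - Sx^2))
(a1 + b1*x0) + c(-1, 1) * qt(0.975, n - 2) * Smu   # approximately 6.275 +/- 0.235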


The Issue of Precision


Generalities

Precision in estimation is quantified by the length of the CI. The lengths of CIs for the mean and proportion are

2 tn−1,α/2 S/√n  and  2 zα/2 √(p̂(1 − p̂)/n),

respectively. Thus precision increases as n increases. It also increases as α increases, since this causes tn−1,α/2 and zα/2 to decrease. For example,

z.05 = 1.645 < z.025 = 1.96 < z.005 = 2.575.

Typically, we want to improve precision without adjusting α.


Sample Size Determination for Estimating µ

To construct a (1 − α)100% CI having a prescribed length of L, the sample size n is found by solving the equation

2 zα/2 σ/√n = L.

The solution is:  n = (2 zα/2 σ/L)².

If the solution is not an integer (as is typically the case), the number is rounded up. Rounding up guarantees that the prescribed precision objective will be more than met.
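A one-line sketch in R (σ = 25 and L = 10 are the values used in the example that follows):

sigma = 25; L = 10
ceiling((2 * qnorm(0.975) * sigma / L)^2)   # 97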


Example. The time to response (in milliseconds) to an editing command with a new operating system is normally distributed with an unknown mean µ and σ = 25. We want a 95% CI for µ of length L = 10 milliseconds. What sample size n should be used?

Solution. For a 95% CI, α/2 = .025 and z.025 = 1.96. Thus

n = (2 · 1.96 · 25/10)² = 96.04,

which is rounded up to n = 97.

• However, the above requires knowledge of σ, which, typically, is not available.


The Realistic Case: σ unknown

Sample size determination must rely on a preliminary approximation, Sprl, of σ. Two common methods are:

1 If the range of the population values is known, use

Sprl = range/3.5,  or  Sprl = range/4.

This approximation is inspired by the standard deviation of a U(a,b) random variable, which is σ = (b − a)/3.464.

2 Use the standard deviation, Sprl, of a preliminary sample. This is somewhat cumbersome because it requires some trial-and-error iterations.


Sample Size Determination for Estimating p

Equating the length of the (1 − α)100% CI for p to L and solving for n gives:

n = 4 z²α/2 p(1 − p)/L²,  rounded up.

Two commonly used methods for obtaining a preliminary approximation, pprl, are:

1 Obtain pprl either from a small pilot sample or from expert opinion, and use it in the above formula (in place of p).

2 Replace p(1 − p) in the formula by 0.25. This gives

n = z²α/2 / L²,  rounded up.
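A sketch in R of both versions; the numbers match the two examples that follow (preliminary p = 0.91 with L = 0.02 at 95% confidence, and the conservative formula with L = 0.04 at 90% confidence):

z = 1.96                                   # z_{0.025}
ceiling(4 * z^2 * 0.91 * 0.09 / 0.02^2)    # with a preliminary p: 3147
z90 = 1.645                                # z_{0.05}
ceiling(z90^2 / 0.04^2)                    # conservative formula: 1692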


Example

A preliminary sample gave pprl = 0.91. How large should n be to estimate the probability of interest to within 0.01 with 95% confidence?

Solution. "To within 0.01" is another way of saying that the 95% bound on the error of estimation should be 0.01, or the desired CI should have a width of 0.02. Since we have preliminary information, we use the first formula:

n = 4(1.96)²(0.91)(0.09)/(.02)² = 3146.27.

This is rounded up to 3147.


Example. A new method of pre-coating fittings used in oil, brake and other fluid systems in heavy-duty trucks is being studied. How large an n is needed to estimate the proportion of fittings that leak to within .02 with 90% confidence? (No prior info available.)

Solution. Here we have no preliminary information about p. Thus, we apply the second formula and we obtain

n = z²α/2 / L² = (1.645)²/(.04)² = 1691.26.

This is rounded up to 1692.

Prediction Intervals

Prediction refers to estimating an observation. The best predictor is the mean of the underlying population. While the estimated mean is used for prediction, prediction intervals (PIs) are different from CIs.

The fat content of the hot dog you are about to eat can be predicted by the sample mean of a sample of fat contents. But the PI is different from the CI for µ. The age of a tree can be predicted from its diameter by estimating the regression line. But the PI is different from the CI for µY|X=x.

In the first example, there was no explanatory variable. The second example involves a regression context.

We begin with the case of no explanatory variable.


Prediction Based on a Univariate Sample

To emphasize the difference between PIs and CIs, suppose that the amount of fat in a randomly selected hot dog is N(20, 9). Thus there are no unknown parameters to be estimated, and no need to construct a CI for µ.

Still, the amount of fat, X, in the hot dog which one is about to eat is unknown, simply because it is a random variable. The best point predictor of X is 20 and a 95% PI is 20 ± (1.96)3.

In general, a (1 − α)100% PI is an interval that contains the r.v. X with probability 1 − α.

For X ∼ N(µ, σ²), such an interval is: µ ± zα/2 σ.


Typically, µ, σ are unknown and are estimated from a sample X1, . . . , Xn by X̄, S, respectively. Then the best point predictor of a future observation is X̄.

Assuming normality, the (1 − α)100% PI for the next X is:

( X̄ − tα/2,n−1 S √(1 + 1/n), X̄ + tα/2,n−1 S √(1 + 1/n) ).

In this formula, the variability of X̄ is accounted for by the 1/n, and that of S is accounted for by the use of the t-percentiles.


Example. The fat content measurements from a sample of n = 10 hot dogs gave sample mean and sample standard deviation of X̄ = 21.9 and S = 4.134. Give a 95% PI for the fat content, X, of the next hot dog to be sampled.

Solution: Assuming that the fat content of a randomly selected hot dog has the normal distribution, the best point predictor of X is X̄ = 21.9 and the 95% PI is

X̄ ± t.025,9 S √(1 + 1/n) = (12.09, 31.71).
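The same PI in R from the summary statistics (a sketch):

xbar = 21.9; s = 4.134; n = 10
xbar + c(-1, 1) * qt(0.975, n - 1) * s * sqrt(1 + 1/n)   # approximately (12.09, 31.71)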


PIs for the Normal Simple Linear Regression Model

Let (X1, Y1), . . . , (Xn, Yn) be n observations that follow the normal simple linear regression model, i.e. Yi | Xi = xi ∼ N(α1 + β1xi, σ²).

The point predictor for a future observation Y made at X = x is µ̂Y|X=x = α̂1 + β̂1x.

The 100(1 − α)% PI is

µ̂Y|X=x ± tα/2,n−2 S √( 1 + 1/n + n(x − X̄)² / (n∑Xi² − (∑Xi)²) ).


Example

Consider again the study where n = 11, ∑Xi = 292.90, ∑Yi = 69.03, ∑Xi² = 8141.75, ∑XiYi = 1890.200, ∑Yi² = 442.1903, µ̂Y|X = 2.22494 + .152119X, and S = 0.3444. Construct a 95% PI for a future observation made at X = 25.

Solution. The point predictor is µ̂Y|X=25 = 6.028, and the 95% PI at X = 25 is 6.028 ± 0.8165, as obtained from the formula

µ̂Y|X=25 ± t.025,9 (0.3444) √( 1 + 1/11 + 11(1.627)² / (11∑Xi² − (∑Xi)²) ).

The 95% CI for µY|X=25 was found to be 6.028 ± 0.245. This demonstrates that PIs are wider than CIs.
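As a final sketch, the PI at x = 25 can be reproduced in R from the sums given in the example (self-contained, repeating the setup of the earlier CI sketch):

n = 11; Sx = 292.90; Sy = 69.03; Sxx = 8141.75; Sxy = 1890.2; Syy = 442.1903
b1 = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2); a1 = Sy/n - b1*Sx/n
Se = sqrt((Syy - a1*Sy - b1*Sxy) / (n - 2))
x0 = 25
SPI = Se * sqrt(1 + 1/n + n*(x0 - Sx/n)^2 / (n*Sxx - Sx^2))
(a1 + b1*x0) + c(-1, 1) * qt(0.975, n - 2) * SPI   # approximately 6.028 +/- 0.82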
