
Chapter 11: Linear Regression and Correlation

11.2 Estimating Model Parameters

11.1 A scatterplot of the data is given here:

[Scatterplot of y versus x]

11.2 a. y = 1.8 + 2.0(3) = 7.8

b. The equation is plotted here:

[Plot of the line y = 1.8 + 2.0x]

11.3 The calculations are given here:

i      xi    yi    (xi − 3)²    (xi − 3)(yi − 5.6)
1      1     2     4            7.2
2      2     4     1            1.6
3      3     6     0            0
4      4     7     1            1.4
5      5     9     4            6.8
Total  15    28    10           17

x = 15/5 = 3 y = 28/5 = 5.6

Sxx =∑i(xi − 3)2 = 10

Sxy =∑i(xi − 3)(yi − 5.6) = 17

β1 = Sxy/Sxx = 17/10 = 1.7

β0 = y − β1x = 5.6− (1.7)(3) = 0.5

y = 0.5 + 1.7x
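
As a quick numerical check of these hand calculations, the short Python sketch below (not part of the original solution; it only assumes NumPy is available) recomputes Sxx, Sxy, and the least squares estimates from the data listed above.

import numpy as np

# Data from Exercise 11.3
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 6, 7, 9], dtype=float)

xbar, ybar = x.mean(), y.mean()          # 3.0 and 5.6
Sxx = np.sum((x - xbar) ** 2)            # 10.0
Sxy = np.sum((x - xbar) * (y - ybar))    # 17.0

beta1 = Sxy / Sxx                        # 1.7
beta0 = ybar - beta1 * xbar              # 0.5
print(f"y-hat = {beta0:.2f} + {beta1:.2f}x")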

11.4 a. The calculations are given here:


i      xi    yi    (xi − 5)²    (xi − 5)(yi − 6.8)
1      1     1     16           23.2
2      3     4     4            5.6
3      5     8     0            0
4      7     9     4            4.4
5      9     12    16           20.8
Total  25    34    40           54

x = 25/5 = 5 y = 34/5 = 6.8

Sxx =∑i(xi − 5)2 = 40

Sxy =∑i(xi − 5)(yi − 6.8) = 54

β1 = Sxy/Sxx = 54/40 = 1.35

β0 = y − β1x = 6.8− (1.35)(5) = 0.05

y = 0.05 + 1.35x

b. y = 0.05 + 1.35(6) = 8.15

11.5 The calculations are given here:

i      xi    yi     (xi − 14)²    (xi − 14)(yi − 25.33)
1      5     10     81            137.97
2      10    19     16            25.32
3      12    21     4             8.66
4      15    28     1             2.67
5      18    34     16            34.68
6      24    40     100           146.70
Total  84    152    218           356

x = 84/6 = 14 y = 152/6 = 25.33

Sxx =∑i(xi − 14)2 = 218

Sxy =∑i(xi − 14)(yi − 25.33) = 356

β1 = Sxy/Sxx = 356/218 = 1.633

β0 = y − β1x = 25.33− (1.633)(14) = 2.468

y = 2.468 + 1.633x

11.6 a. y = 4.698 + 1.97x

b. yes, the plotted line is relatively close to all 10 data points

c. y = 4.698 + (1.97)(35) = 73.65

11.7 a. A scatterplot of the data is given here:

[Scatterplot: Firmness of Sweet Potatoes as a Function of Pectin Concentration; Firmness Reading versus Pectin Concentration]

b. The calculations are given here:

i      xi     yi      (xi − 1.5)²    (xi − 1.5)(yi − 64.43)
1      0.0    50.5    2.25           20.895
2      0.0    46.8    2.25           26.445
3      1.5    62.3    0              0
4      1.5    67.7    0              0
5      3.0    80.1    2.25           23.505
6      3.0    79.2    2.25           22.155
Total  9.0    386.6   9              93

x = 9/6 = 1.5 y = 386.6/6 = 64.43

Sxx =∑i(xi − 1.5)2 = 9

Sxy =∑i(xi − 1.5)(yi − 64.43) = 93

β1 = Sxy/Sxx = 93/9 = 10.33

β0 = y − β1x = 64.43− (10.33)(1.5) = 48.935

y = 48.935 + 10.33x


11.8 y = 48.935 + 10.33(1.0) = 59.265

11.9 a. A scatterplot of the data is given here:

[Scatterplot: Effect of Exposure Time on Fish Quality; Fish Quality versus Time After Capture]

b. Minitab output is given here:

Regression Analysis: QUALITY versus STORAGE

The regression equation is

QUALITY = 8.46 - 0.142 STORAGE

Predictor Coef SE Coef T P

Constant 8.46000 0.06610 127.99 0.000

STORAGE -0.141667 0.008995 -15.75 0.000

S = 0.1207 R-Sq = 96.9% R-Sq(adj) = 96.5%

Analysis of Variance

Source DF SS MS F P

Regression 1 3.6125 3.6125 248.07 0.000

Residual Error 8 0.1165 0.0146

Total 9 3.7290

The least squares estimates are β0 = 8.46 and β1 = −0.142


c. The estimated slope value β1 = −0.142 indicates that for each 1 hour increase in the time between the capture of the fish and their placement in storage there is approximately a 0.142 decrease in the average quality of the fish.

11.10 y = 8.46− 0.142x ⇒ y = 8.46− 0.142(10) = 7.04

It would not be a good idea to make predictions at x = 18 because the experimental data contained values between 0 and 12 hours. Thus, 18 hours is well beyond the experimental data.

11.11 a. The line on the plot gives the predicted values. We read the predicted y-value off the vertical axis. When x=1, the predicted y-value is approximately 16. When x=7, the predicted value is approximately 24.5. Therefore, a rough guess for the slope is

(difference in y-values)/(difference in x-values) = (24.5-16)/(7-1) = 1.4

b. Rough guess of slope is (24-15)/(.9-0) = 10

c. The plot of y vs log(x) has points nearly following a straight line. In the plot of y vs x, the middle values of x have y-values above the line, whereas the smaller and larger values of x have y-values below the line. In the plot of y vs log(x), the points are more evenly distributed above and below the line for all values of x.

11.12 a. β0 = 14.2917 and β1 = 1.4750 ⇒ y = 14.3 + 1.48x. Note that the guess from the graph for the slope was reasonably close to the least squares estimate.

b. s = 1.346.

11.13 a. β0 = 14.8755 and β1 = 10.522 ⇒ y = 14.9 + 10.5 log(x). Note that the guess from the graph for the slope was reasonably close to the least squares estimate.

b. s = 1.131

11.14 The residual standard deviation from the regression using log(x) was smaller. This result is consistent with the observations from the distribution of the data values about the fitted lines. The plot of y vs log(x) appeared to have points closer to the line than the plot of y vs x.

11.15 The Time Needed is increasing as the Number of Items increases, but the rate of increase is less for larger values of Number of Items than it is for smaller values of Number of Items. A square root or logarithmic transformation is suggested from the plot.

11.16 a. For the transformed data, the plotted points appear to be reasonably linear.

b. The least squares line is y = 3.10 + 2.76√x

(Using rounded values 3.097869 ≈ 3.10 and 2.7633138 ≈ 2.76).

That is, Estimated Time Needed = 3.10 + 2.76√(Number of Items).
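
The square-root fit can be carried out by regressing y on √x. The minimal Python sketch below illustrates the workflow; the arrays `items` and `time` are hypothetical placeholders, since the actual Number of Items and Time Needed values are not reproduced in this solution.

import numpy as np

# Hypothetical placeholder data; substitute the values from the exercise.
items = np.array([2, 5, 8, 12, 18], dtype=float)   # Number of Items
time  = np.array([7, 9, 11, 13, 15], dtype=float)  # Time Needed

root_items = np.sqrt(items)                          # transform the predictor
beta1, beta0 = np.polyfit(root_items, time, deg=1)   # slope, intercept
print(f"Estimated Time = {beta0:.2f} + {beta1:.2f} * sqrt(Number of Items)")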

11.17 The Root Mean Square Error (RMSE) is 2.923232. This is the square root of the “average” sum of squares of the y-values about the least squares line. This is an assessment of how “close” the data values are to the least squares line in the y-direction.

11.18 The RMSE for the transformed data can be compared to the RMSE for the original data because only the x-values were transformed. Because the y-values have not been transformed and RMSE measures the distance from the regression line to the plotted points in the y-direction, the units for RMSE are the same for both regression lines.


11.19 a. A scatterplot of the data is given here:

[Scatterplot: Number of Banks as a Function of Number of Independent Businesses]

From a scatterplot of the data, a straight-line relationship between y and x seems reasonable.

b. The intercept is 1.766846 (1.77) and the slope is 0.0111049 (0.0111).

c. An interpretation of the estimated slope is as follows. For an increase of 1 business in a zip code area, there is an increase of 0.0111 in the average number of banks in the zip code.

d. From the output, the sample residual standard deviation is given by “Root MSE”, 0.5583.

11.20 From the scatterplot, there does not appear to be an increase or decrease in the variability of y as x increases. It appears to be somewhat constant over the range of the x values. Increasing variability would appear as a funnel shape (or back-to-back funnels) in the scatterplot.

11.21 a. A scatterplot of the data is given here:

[Scatterplot of the data]

From the plot of the data, there is a definite pattern in the lifetime of the bits as the speed is increased. The lifetimes initially increase, then decrease, as the speed is increased. The relationship between lifetime and speed is not a straight line.

b. The lifetime of one of the bits at a speed of 100 is quite a bit smaller than the other three lifetimes at this speed. However, because a speed of 100 is the average of all speeds in the study, this lifetime has very low leverage and hence cannot have high influence.

11.22 a. The estimated intercept is 6.03 and the estimated slope is -0.017.

b. The slope has a negative sign which indicates a decreasing relation between lifetime and speed. The scatterplot indicates that this is misleading. The relation is not a straight line. In fact, as speed increases, the lifetimes initially increase, then decrease.

c. The residual standard deviation is given as Standard Error, 0.6324. This value is the square root of the MS(Residual), that is, 0.6324 = √0.400. This value is the square root of the average squared deviation of the data values about the fitted line in the vertical direction.

11.23 a. The prediction equation is y = 6.03 − 0.017x. Thus, we just replace x with the speeds of 60, 80, 100, 120, and 140 to obtain the values:

x = 60⇒ y = 6.03− (0.017)(60) = 5.01

x = 80⇒ y = 6.03− (0.017)(80) = 4.67


x = 100⇒ y = 6.03− (0.017)(100) = 4.33

x = 120⇒ y = 6.03− (0.017)(120) = 3.99

x = 140⇒ y = 6.03− (0.017)(140) = 3.65

b. The predicted values are larger than all the observed lifetimes at both the lowest speed, 60, and the highest speed, 140. Most of the lifetime values are greater than the predicted values for speeds 80 to 120. Thus we can conclude that the straight-line model for this data is not appropriate. If the fitted line is reasonable, there should not be any systematic pattern in the deviations of the observed data values from the predictions. An alternative model should be fit to these data values.

11.24 a. The LOWESS smooth is basically a straight line, with a slight bend at the larger values of Income. Thus, it would appear that the relation is basically linear.

b. The points with high leverage are those that are far from the average value of the x-values along the x-axis. In this case there are two points with very high income values. They are the last two values in the data listing. The plot shows that the point with Income=65.0 and Price=110.0 is a considerable distance from the LOWESS line. This point therefore has high influence. The other high-leverage point, Income=70.0 and Price=185.0, falls reasonably close to the LOWESS line. This point would not have such a large influence on the fitted line.

11.25 a. Slope = 1.80264 and Intercept = 47.15048, yielding y = 47.15 + 1.80x for the least squares line.

b. The estimated slope is approximately 1.80. This implies that as yearly income increases $1000, the average sale price of the home increases $1800. The estimated intercept would be the average sale price of homes bought by someone with 0 yearly income. Because the data set did not contain any data values with yearly income close to 0, the intercept does not have a meaningful interpretation.

c. The residual standard deviation is “Root MSE” = 14.445.

11.26 The estimated slope changed considerably from 1.80 to 2.46. This resulted from excluding the data value having high influence. Such a data point twists the fitted line towards itself and hence can greatly distort the value of the estimated slope.


11.3 Inferences about Regression Parameters

11.27 a. From the computer output on Page 554, the standard error of the estimated slope is SE(β1) = 0.216418. A 90% C.I. for β1 is given by

β1 ± t.05,98SE(β1)⇒ 2.7633± (1.645)(0.216418)⇒ (2.55, 2.98)

b. As the number of items increases, there is no change in the average Time needed to assemble for shipment.

c. Ha : β1 > 0.

d. The p-value for testing Ha : β1 > 0 is less than 0.0001. Thus, there is significant evidence that the slope is greater than 0. In order to conduct the test of hypothesis, the data must come from a series of populations having normal distributions with a common variance and with mean responses following the equation:

µy|x = βo + β1√x

11.28 p-value < 0.0001

11.29 Minitab output is given here:

Regression Analysis: FIRMNESS versus CONCENTRATION

The regression equation is

FIRMNESS = 48.9 + 10.3 CONCENTRATION

Predictor Coef SE Coef T P

Constant 48.933 1.541 31.76 0.000

CONCENTR 10.3333 0.7957 12.99 0.000

S = 2.387 R-Sq = 97.7% R-Sq(adj) = 97.1%

Analysis of Variance

Source DF SS MS F P

Regression 1 961.00 961.00 168.65 0.000

Residual Error 4 22.79 5.70

Total 5 983.79

a. y = 48.9 + 10.3x

b. S2 = 5.70

c. SE(β1) = 0.7957

11.30 Ho : β1 = 0 versus Ha : β1 ≠ 0

Test Statistic: t = β1/(sε/√Sxx) = 10.3333/(2.387/√9) = 12.99

p-value = 2P(t4 > 12.99) < 0.001 ⇒ Reject Ho and conclude there is significant evidence that β1 is not 0.
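
The test statistic and p-value can be reproduced from the summary quantities quoted above. A minimal Python sketch, assuming SciPy is available (this is illustrative, not part of the original solution):

from math import sqrt
from scipy import stats

beta1_hat = 10.3333   # estimated slope
s_eps = 2.387         # residual standard deviation
Sxx = 9.0             # sum of squares of concentration about its mean
n = 6

se_beta1 = s_eps / sqrt(Sxx)           # 0.7957, matches the Minitab output
t_stat = beta1_hat / se_beta1          # about 12.99
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # well below 0.05
print(f"t = {t_stat:.2f}, two-sided p-value = {p_value:.4f}")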


11.31 The original data and the log base 10 of recovery are given here:

Data Display

Cloud Time Recovery LogRecovery

1 0 70.6 1.849

2 5 52.0 1.716

3 10 33.4 1.524

4 15 22.0 1.342

5 20 18.3 1.262

6 25 15.1 1.179

7 30 13.0 1.114

8 35 10.0 1.000

9 40 9.1 0.959

10 45 8.3 0.919

11 50 7.9 0.898

12 55 7.7 0.886

13 60 7.7 0.886

a. Scatterplot of the data is given here:

[Scatterplot: Biological Recovery as a Function of Exposure Time; Biological Recovery (%) versus Time (minutes)]

b. Scatterplot of the data using log10(y) is given here:

[Scatterplot: Logarithm of Biological Recovery as a Function of Exposure Time; Log10 of Biological Recovery (%) versus Time (minutes)]

11.32 Minitab Output is given here:

Regression Analysis: LogRecovery versus Time

The regression equation is

LogRecovery = 1.67 - 0.0159 Time

Predictor Coef SE Coef T P

Constant 1.67243 0.05837 28.65 0.000

Time -0.015914 0.001651 -9.64 0.000

S = 0.1114 R-Sq = 89.4% R-Sq(adj) = 88.5%

Analysis of Variance

Source DF SS MS F P

Regression 1 1.1523 1.1523 92.93 0.000

Residual Error 11 0.1364 0.0124

Total 12 1.2887

a. y = 1.67− 0.0159x

b. sε = 0.1114

c. SE(βo) = 0.05837 SE(β1) = 0.001651


11.33 Ho : β1 = 0 versus Ha : β1 ≠ 0

Test Statistic: |t| = 9.64

p-value = 2P(t11 > 9.64) < 0.0001 < 0.05 ⇒ Reject Ho and conclude there is significant evidence that β1 is not 0.

11.34 1.67243± (2.201)(0.05837)⇒ (1.54, 1.80)

We are 95% confident that the average value of the logarithm of biological recovery percentage at time zero is between 1.54 and 1.80.

11.35 a. Yes, the data values fall approximately along a straight-line.

b. y = 12.51 + 35.83x

11.36 a. 1.069

b. 6.957

c. Ho : β1 ≤ 0 versus Ha : β1 > 0

Test Statistic: t = 5.15

p-value = P(t10 > 5.15) = 0.0002 ⇒ Reject Ho and conclude there is significant evidence that there is a positive linear relationship.

11.37 a. No because the interpretation for βo is the average weight gain for chickens who did not ingest lysine in their diet. However, lysine was placed in all the feed. Thus, all chickens consumed some lysine.

b. The model y = β1x + ε forces the estimated regression line to pass through the origin, the point (0,0). The model y = βo + β1x + ε does not necessarily force the fitted line to pass through the origin. This provides greater flexibility in fitting models where there was no data collected near x=0.

11.38 a. β1 = 106.52

b. Scatterplot of the data and the two fitted lines are given here:

[Scatterplot: Comparison of Two Models on Lysine Data; Weight Gain versus Lysine Ingested]

The model with the intercept term provided a better fit to the data.

11.39 a. Scatterplot of the data is given here:

[Scatterplot: Total Direct Cost versus Run Size]

An examination of the scatterplot reveals that a straight-line equation between total cost and run size may be appropriate. There is a single extreme point in the data set but no evidence of a violation of the constant variance requirement.

b. y = 99.777 + 51.918x

The residual standard deviation is s = √148.999 = 12.2065

c. A 95% C.I. for the slope is given by β1 ± t0.025,28 SE(β1) ⇒ 51.918 ± (2.048)(0.586455) ⇒ (50.72, 53.12)

11.40 a. T = 88.53

b. From the output p-value = 0.0000, which can be interpreted as p-value < 0.0001. This is the p-value for a two-sided test of hypotheses. In the context of this problem, the appropriate alternative hypothesis would be that the slope is greater than 0 since the total cost of a job should increase as the size of the job increases. Thus, for Ha : β1 > 0, p-value = P(t > 88.53) < 0.0001.

11.41 a. F = 7837.26 with p− value = 0.0000.

b. The F-test and two-sided t-test yield the same conclusion in this situation. In fact, for this type of hypotheses, F = t².

11.4 Predicting New y Values Using Regression

11.42 y = 1.67243− (0.015914)(30) = 1.195

95% C.I. on µy|x=30 : y ± t.025,11 sε √(1/n + (30 − x̄)²/Sxx) ⇒

1.195 ± (2.201)(0.1114)√(1/13 + (30 − 30)²/4550) ⇒ 1.195 ± 0.068 ⇒ (1.127, 1.263)

11.43 95% prediction interval for log biological recovery percentage at x=30 is given by

y ± t.025,11 sε √(1 + 1/n + (30 − x̄)²/Sxx) ⇒

1.195 ± (2.201)(0.1114)√(1 + 1/13 + (30 − 30)²/4550) ⇒ 1.195 ± 0.254 ⇒ (0.941, 1.449)

The prediction interval is somewhat wider than the confidence interval on the mean.
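
Both intervals can be reproduced from the summary quantities given above (n = 13 exposure times, x̄ = 30, Sxx = 4550, sε = 0.1114). A minimal Python sketch, illustrative only:

from math import sqrt

n, xbar, Sxx = 13, 30.0, 4550.0
s_eps = 0.1114          # residual standard deviation
t_mult = 2.201          # t.025 with 11 df
x0 = 30.0
y_hat = 1.67243 - 0.015914 * x0     # 1.195

half_ci = t_mult * s_eps * sqrt(1/n + (x0 - xbar)**2 / Sxx)      # C.I. on the mean
half_pi = t_mult * s_eps * sqrt(1 + 1/n + (x0 - xbar)**2 / Sxx)  # P.I. on a new y
print(f"95% C.I.: ({y_hat - half_ci:.3f}, {y_hat + half_ci:.3f})")
print(f"95% P.I.: ({y_hat - half_pi:.3f}, {y_hat + half_pi:.3f})")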

11.44 a. y = −1.733333 + 1.316667x

b. The p-value for testing Ho : β1 ≤ 0 versus Ha : β1 > 0 is

p-value = P(t10 ≥ 6.342) < 0.0001 ⇒ Reject Ho and conclude there is significant evidence that the slope β1 is greater than 0.

11.45 a. 95% Confidence Intervals for E(y) at selected values for x:

x = 4⇒ (2.6679, 4.3987)

x = 5⇒ (4.2835, 5.4165)

x = 6⇒ (5.6001, 6.7332)

x = 7⇒ (6.6179, 8.3487)


b. 95% Prediction Intervals for y at selected values for x :

x = 4⇒ (1.5437, 5.5229)

x = 5⇒ (2.9710, 6.7290)

x = 6⇒ (4.2877, 8.0456)

x = 7⇒ (5.4937, 9.4729)

c. The confidence intervals in part (a) are interpreted as “We are 95% confident that the average weight loss over many samples of the compound when exposed for 4 hours will be between 2.67 and 4.40 pounds.” Similar statements hold for other hours of exposure.

The prediction intervals in part (b) are interpreted as “We are 95% confident that the weight loss of a single sample of the compound when exposed for 4 hours will be between 1.54 and 5.52 pounds.” Similar statements hold for other hours of exposure.

11.46 a. y = 99.77704 + 51.9179x ⇒ When x=2.0, E(y) = 99.77704 + (51.9179)(2.0) = 203.613, as is shown in the output.

b. The 95% C.I. is given in the output as (198.902, 208.323)

11.47 No, because x = 2.0 is close to the mean of all x-values used in determining the least squares line.

11.48 a. The 95% P.I. is given in the output as (178.169, 229.057)

b. Yes, because $250 does not fall within the 95% prediction interval. In fact, it is considerably higher than the upper value of $229.057.

11.49 a. y = 23.817935 + (48.131793)(6) = 312.61

If the extrapolation penalty is ignored, the 95% prediction interval is

312.61 ± (2.000)(107.3671)√(1 + 1/n) ⇒ 312.61 ± 216.459 ⇒ (96.15, 529.07)

However, note that making a prediction at x = 6 is not advisable because the range of values for x in the data set was 0 to 4. Thus, x=6 is well beyond the observed data.

b. The extrapolation penalty is given by (xn+1 − x̄)²/Sxx. Since there are 62 data values, the value of Sxx should be considerably larger than (xn+1 − x̄)² for any value of xn+1 within the range of the observed data. However, x = 6 is well outside the range of the observed data and hence predictions at this value should not be considered because the model may not fit well for values of x outside of the observed data.

11.50 The interval would likely be too narrow if in fact it was valid to calculate the interval. However, predictions should not be made for x-values well beyond the observed data.

11.51 a. y = 14.2917 + (1.475)(20) = 43.792, as is shown in the “FIT” entry of the output for the model y = 14.3 + 1.48x

b. log10(20) = 1.301 ⇒ y = 14.8755 + (10.522)(1.301) = 28.565, as is shown in the “FIT” entry of the output for the model y = 14.9 + 10.5 log(x)

c. Neither of the predictions is reasonable since x = 20 is well beyond the range of the observed x-values. Note the warning about “extreme X values”.

11.52 Although the intervals are not valid, they are provided on the output:

95% P.I. for the model y = 14.3 + 1.48x : (36.854, 50.729)

95% P.I. for the model y = 14.9 + 10.5 log(x) : (25.378, 31.752)


11.5 Examining Lack of Fit in Linear Regression

11.53 a. Scatterplot of the data is given here:

[Scatterplot: Height of Detergent versus Amount of Detergent]

b. The SAS output is given here:

Model: MODEL1

Dependent Variable: Y

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Prob>F

Model 1 330.48450 330.48450 169.213 0.0001

Error 8 15.62450 1.95306

C Total 9 346.10900

Root MSE 1.39752 R-square 0.9549

Dep Mean 35.89000 Adj R-sq 0.9492

C.V. 3.89390

Parameter Estimates

Parameter Standard T for H0:

Variable DF Estimate Error Parameter=0 Prob > |T|

INTERCEP 1 3.370000 2.53872138 1.327 0.2210

X 1 4.065000 0.31249500 13.008 0.0001

y = 3.37 + 4.065x


c. The residual plot is given here:

[Residual plot of RESID versus PRED]

The residual plot indicates that higher order terms in x may be needed in the model.

11.54 a. A test of lack of fit will be conducted:

SSexp =∑ij(yij − ȳi.)² = (28.1 − 27.85)² + (27.6 − 27.85)² + (32.3 − 32.75)² + (33.2 − 32.75)² + (34.8 − 34.90)² + (35.0 − 34.90)² + (38.2 − 38.80)² + (39.4 − 38.80)² + (43.5 − 45.15)² + (46.8 − 45.15)² = 6.715

From the output of exercise 11.53, SS(Residuals) = 15.6245. Thus,

SSLack = SS(Residuals)− SSexp = 15.6245− 6.715 = 8.9095

dfLack = n− 2−∑i(ni − 1) = 10− 2− 5(2− 1) = 3

dfexp =∑i(ni − 1) = 5(2− 1) = 5

F = (8.9095/3)/(6.715/5) = 2.21 < 5.41 = F.05,3,5

There is not sufficient evidence that the linear model is inadequate.
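
The lack-of-fit F statistic and critical value can be checked numerically. A minimal Python sketch using the sums of squares quoted above (SciPy assumed available; illustrative only):

from scipy import stats

ss_resid = 15.6245    # SS(Residuals) from the regression in Exercise 11.53
ss_exp = 6.715        # pure experimental error sum of squares
df_exp = 5            # sum of (n_i - 1) over the 5 amounts
df_lack = 10 - 2 - df_exp          # 3

ss_lack = ss_resid - ss_exp                      # 8.9095
F = (ss_lack / df_lack) / (ss_exp / df_exp)      # about 2.21
F_crit = stats.f.ppf(0.95, df_lack, df_exp)      # about 5.41
print(f"F = {F:.2f}, F(0.05; {df_lack}, {df_exp}) = {F_crit:.2f}")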

b. Sxx =∑i(xi − x̄)² = 20.


95% Prediction interval for y at x:

y ± t.025,8 sε √(1 + 1/n + (x − x̄)²/Sxx) = 3.37 + 4.065x ± (2.306)(1.398)√(1 + 1/10 + (x − 8)²/20)

For x=6,7,8,9,10 we have

x     y        95% P.I.
6     27.760   (24.086, 31.434)
7     31.825   (28.369, 35.281)
8     35.890   (32.510, 39.270)
9     39.955   (36.499, 43.411)
10    44.020   (40.346, 47.694)

11.6 The Inverse Regression Problem (Calibration)

11.55 y = 1.47 + 0.797x

Ho : β1 ≤ 0 versus Ha : β1 > 0

t = 7.53 ⇒ p-value = Pr(t8 ≥ 7.53) < 0.0005 ⇒ There is sufficient evidence in the data that the slope is positive.

11.56 a. x = (13− 1.47)/0.797 = 14.47

b. sε = 1.017, Sxx = 92.4, x̄ = 15.6, t.025,8 = 2.306 ⇒

c² = (2.306)²(1.017)²/[(.797)²(92.4)] = 0.0937

d = [(2.306)(1.017)/.797] √[((10+1)/10)(1 − .0937) + (14.47 − 15.6)²/92.4] = 2.958

xL = 15.6 + (1/(1 − .0937))(14.47 − 15.6 − 2.958) = 11.09

xU = 15.6 + (1/(1 − .0937))(14.47 − 15.6 + 2.958) = 17.61 ⇒

(11.09, 17.61) is a 95% prediction interval on the actual volume when the estimated volume is 13.
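
The inverse (calibration) prediction interval above can be reproduced step by step. A minimal Python sketch using only the summary values from Exercises 11.55-11.56 (illustrative, not part of the original solution):

from math import sqrt

beta0_hat, beta1_hat = 1.47, 0.797
s_eps, Sxx, xbar, n = 1.017, 92.4, 15.6, 10
t_mult = 2.306                      # t.025 with 8 df
y0 = 13.0                           # observed estimated volume

x_hat = (y0 - beta0_hat) / beta1_hat                      # about 14.47
c2 = (t_mult * s_eps)**2 / (beta1_hat**2 * Sxx)           # about 0.0937
d = (t_mult * s_eps / beta1_hat) * sqrt(((n + 1)/n) * (1 - c2)
                                        + (x_hat - xbar)**2 / Sxx)
xL = xbar + (x_hat - xbar - d) / (1 - c2)
xU = xbar + (x_hat - xbar + d) / (1 - c2)
print(f"x-hat = {x_hat:.2f}, 95% limits: ({xL:.2f}, {xU:.2f})")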

11.57 a. y = 0.11277 + 0.11847x ≈ 0.113 + 0.118x

Dependent variable is Transformed CUMVOL and Independent variable is Log(Dose).

b. x = (y − 0.11277)/0.11847

sε = 0.04597, Sxx = 4.9321, x̄ = 2.40, t.005,22 = 2.819 ⇒

c² = (2.819)²(.04597)²/[(.11847)²(4.9321)] = 0.2426

d = [(2.819)(.04597)/.11847] √[((24+1)/24)(1 − .2426) + (x − 2.40)²/4.9321]

xL = 2.40 + (1/(1 − .2426))(x − 2.40 − d)

xU = 2.40 + (1/(1 − .2426))(x − 2.40 + d) ⇒

y    TRANS(y)   LOG(x)-hat   x-hat   d          LOG(x)L    LOG(x)U    xL      xU
10   .322       1.764        5.84    .242289    1.24038    1.88018    3.46    6.55
14   .383       2.285        9.83    .198634    1.98616    2.51068    7.29    12.31
19   .451       2.855        17.38   .221359    2.70876    3.29328    15.01   26.93

11.58 The four values of CUMVOL are 10, 20, 30, 12, yielding y = 0.42969 and sy = .05849, where y is the arcsine of the square root of CUMVOL.

For 50%: x = 2.367 and 95% confidence limits are (0.79, 4.45)

For 75%: x = 5.861 and 95% confidence limits are (2.20, 12.88)

11.7 Correlation

11.59 The output yields R-square = 0.9452. The estimated slope of the regression line is 0.0111049, which is positive, indicating an increasing relation between Branches and Business. Thus, the correlation is the positive square root of 0.9452, i.e., r = 0.9722.

11.60 a. Test Ho : ρyx ≤ 0 versus Ha : ρyx > 0

Test Statistic: t = .9722√(12 − 2)/√(1 − (.9722)²) = 13.13

p− value = Pr(t10 ≥ 13.13) < 0.0005

Conclusion: There is significant evidence that the correlation is positive.
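
The test statistic for the correlation can be checked directly from r and n. A minimal Python sketch (SciPy assumed available; illustrative only):

from math import sqrt
from scipy import stats

r, n = 0.9722, 12
t_stat = r * sqrt(n - 2) / sqrt(1 - r**2)      # about 13.13
p_value = stats.t.sf(t_stat, df=n - 2)         # one-sided, Ha: rho > 0
print(f"t = {t_stat:.2f}, one-sided p-value = {p_value:.2e}")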

b. The t-test for testing whether the slope is 0 is exactly the same (except for rounding error) as the t-test for whether the correlation is 0. From the output we have t = 13.138, which is approximately the same as we computed in part a. In fact, they would be identical if we had used a greater number of decimal places in our calculations in part a.

11.61 a. r²yx = 99.64% (R-squared on output). This large value for r²yx is reflected on the output by having the sum of squares for Model considerably larger than the sum of squares for Error. That is, virtually all the variation in TotalCost is accounted for by the variation in Runsize.

b. Since β1 is positive, there must be a general positive relation between the values for y and x. Thus, ryx should be positive.

c. In general, if the relation between the values of y and x is fairly consistent across the range of values for x, then a wider range of values for x yields a larger value for ryx. Examining the plot in Exercise 11.39 confirms the consistency of the relation. Thus, we would expect ryx to be smaller for the restricted data set.

11.62 a. The correlation is given as 0.956. This value indicates a strong positive trend between intensity of the advertising and the awareness percentage. That is, as intensity of the advertising increases we would generally observe an increase in awareness percentage.


b. Scatterplot of the data is given here:

[Scatterplot: Intensity of Advertising versus Awareness Percentage]

The relation between intensity and awareness does not appear to follow a straight line. It is generally an increasing relation with a threshold value of approximately x=5 before awareness increases beyond a value of 10%. A leveling off of the sharp increase in awareness occurs at an intensity of x=7.

11.63 a. There appears to be a general increase in salary as the level of experience increases. However, there is considerable variability in salary for persons having similar levels of experience.

b. Note that Case 11 has a person with 14 years of experience and a salary of 37.9. Salaries of less than 40 are generally associated with persons having less than 5 years of experience.

11.64 For the points with the smaller symbols, there is a general upward trend in SALARY as EXPER increases in value. The point denoted with a large box, SALARY=37.9 and EXPER 14 years, clearly does not fall within the general pattern of the other points. This point is an outlier; it has high leverage because the experience for this graduate is considerably larger than the average, and the point has high influence because the starting salary value is much smaller than all the other points having experience value close to 14 years.

11.65 a. y = 40.507 + 1.470x, where y is the starting salary and x is the years of experience. The slope value of 1.470 thousand dollars per year of experience can be interpreted in the following manner. A population of graduates with one extra year of experience would have an estimated average starting salary $1,470 higher than a population with one less year of experience. The intercept is the estimated average salary of graduates with no prior experience. Since the data included people with no prior experience, the estimated average salary for this group would in fact be the estimated intercept value of $40,507. (In situations where the data set does not contain points near x=0, the estimated intercept is not meaningful.)

b. The residual standard deviation is sε = 5.402. This is an indication of the amount of variation in starting salaries which is not accounted for by the linear model relating starting salary to years of experience. If there are other factors affecting the variation in starting salary and/or if the relationship between starting salary and years of experience is nonlinear, then sε would tend to be large.

c. From the output t = 6.916 with a p-value of 0.000. This would imply overwhelming evidence that there is a relation between starting salary and experience.

d. R2 = 0.494, which indicates that 49.4% of the variation in starting salaries is accounted for by its linear relation with experience.

11.66 a. The high-influence point was below the line and at the right edge of the data. Such points twist the line down and hence decrease the value of the slope. Omitting the point will cause the line to rise and thus the slope would increase, as was observed, 1.470 to 1.863.

b. In general, outliers tend to increase the standard deviation. Therefore, removing an outlier should decrease the standard deviation, as was observed, 5.40 to 4.07. This is approximately a 25% decrease, a substantial change.

c. High-influence points can either increase or decrease the correlation depending on structure in the remaining data values. In this data set, because the high-influence point resided well outside the pattern of the remaining data values, it should decrease the correlation. Thus, removing the high-influence point should result in an increase in the correlation, as was observed, 0.703 to 0.842.


Supplementary Exercises

11.67 a. Scatterplot of the data is given here:

[Scatterplot: Plot of Data for Exercise 11.67; y versus x]

b. The necessary calculations are given here:

x      y     (x − x̄)    (y − ȳ)     (x − x̄)(y − ȳ)    (x − x̄)²
10     25    −5.8571    −14.2857    83.6735           34.3061
12     30    −3.8571    −9.2857     35.8163           14.8776
14     36    −1.8571    −3.2857     6.1020            3.4490
15     37    −0.8571    −2.2857     1.9592            0.7347
18     42    2.1429     2.7143      5.8163            4.5918
19     50    3.1429     10.7143     33.6735           9.8776
23     55    7.1429     15.7143     112.2449          51.0204
Total  111   275        0           0                 279.2857    118.8571

x̄ = 111/7 = 15.8571   ȳ = 275/7 = 39.2857   β1 = 279.2857/118.8571 = 2.3498 ⇒

β0 = 39.2857 − (2.3498)(15.8571) = 2.0247 ⇒ y = 2.0247 + 2.3498x

c. When xn+1 = 21, y = 2.0247 + (2.3498)(21) = 51.37

11.68 a. The necessary calculations are given here:


x      y     ŷ          (y − ŷ)    (y − ŷ)²
10     25    25.5228    −0.5228    0.2734
12     30    30.2224    −0.2224    0.0494
14     36    34.9219    1.0781     1.1624
15     37    37.2716    −0.2716    0.0738
18     42    44.3209    −2.3209    5.3866
19     50    46.6707    3.3293     11.0844
23     55    56.0697    −1.0697    1.1443
Total  111   275        275        0          19.1743

b. sε = √(19.1743/(7 − 2)) = 1.9583

All the residuals fall within 0 ± 2sε = (−3.9166, 3.9166)
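
The residual calculations above can be reproduced directly from the data in Exercise 11.67 and the fitted line from part (b). A minimal Python sketch, illustrative only:

import numpy as np

x = np.array([10, 12, 14, 15, 18, 19, 23], dtype=float)
y = np.array([25, 30, 36, 37, 42, 50, 55], dtype=float)
beta0, beta1 = 2.0247, 2.3498          # estimates from Exercise 11.67(b)

y_hat = beta0 + beta1 * x              # fitted values
resid = y - y_hat                      # residuals
sse = np.sum(resid**2)                 # about 19.17
s_eps = np.sqrt(sse / (len(x) - 2))    # about 1.96
print(f"SSE = {sse:.4f}, s_eps = {s_eps:.4f}")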

11.69 a. y = 5.890659 + 0.0148652x

b. Ho : β1 ≤ 0 versus Ha : β1 > 0, from the output t = 5.812.

The p-value on the output is for a two-sided test. Thus, p-value = Pr(t5 ≥ 5.812) = 0.00106, which indicates that there is significant evidence that the slope is greater than 0.

11.70 a. y = 1.007445 + 2.307015ln(x)

b. Ho : β1 ≤ 0 versus Ha : β1 > 0, from the output t = 7.014.

The p-value on the output is for a two-sided test. Thus, p-value = Pr(t5 ≥ 7.014) = 0.00045, which indicates that there is significant evidence that the slope is greater than 0.

11.71 The regression line using ln(x) appears to provide the better fit. The scatterplots indicate that the line using ln(x) more closely matches the data points. Also, the residual standard deviation using ln(x) is smaller than the standard deviation using x (2.0135 vs 2.3801).

11.72 a. From the output, the 95% C.I. for β1 in the model y = β0 +β1x+ ε is (0.0083, 0.021).

b. From the output, the 95% C.I. for β1 in the model y = β0 +β1ln(x) + ε is (1.46, 3.15).

11.73 The preferred model is y = β0 + β1 ln(x) + ε with estimates y = 1.007445 + 2.307015 ln(x). With x=75, y = 1.007445 + 2.307015 ln(75) = 10.97. A 95% prediction interval for y when x=75 is

10.97 ± (2.571)(2.0135)√(1 + 1/7 + (4.31749 − 3.6502)²/37.4731) ⇒ 10.97 ± 5.56 ⇒ (5.41, 16.53).

This is a fairly wide interval and thus the regression line is not providing a very accurate prediction.

11.74 a. The prediction equation is y = 140.074 + 0.61896x.

b. The coefficient of determination is R2 = 0.9420, which implies that 94.20% of the variation in fuel usage is accounted for by its linear relationship with flight miles. Because the estimated slope is positive, the correlation coefficient is the positive square root of R2, i.e., r = √0.9420 = 0.97.

c. The only point in testing Ho : β1 ≤ 0 versus Ha : β1 > 0 would be in the situation where the flights were of essentially the same length and there is an attempt to determine if there are other important factors that may affect fuel usage. Otherwise, it would be obvious that longer flights would be associated with greater fuel usage.


11.75 a. With x = 1000, µy|x=1000 = 140.074 + (0.61896)(1000) = 759.03. From the output, a 95% C.I. on the mean fuel usage of 1000 mile flights is (733.68, 784.38).

b. With x = 1000, y = 140.074 + (0.61896)(1000) = 759.03. From the output, a 95% P.I. on the fuel usage for a particular 1000 mile flight is (678.33, 839.73). A value of y equal to 628 does not fall in the 95% P.I. It would be regarded as unusually low and should be investigated to see why fuel usage was so low on this particular flight.

11.76 The value of β1 is the increase in average fuel usage for an increase of one air mile. The value of β0 might be interpreted as the amount of fuel needed to take off and land, since the air miles are essentially 0 during these maneuvers.

11.77 a. t = 2.14, with a p-value of 0.0396. With α = 0.05, there is significant evidence of a linear relationship between rooms occupied and restaurant/lounge sales.

b. From the scatterplot, it would appear that the point in the upper left corner of the plot is pulling the regression line towards the point. If this point were removed, the regression line would undergo a counterclockwise twist, which would result in an increase in its slope.

11.78 a. The slope has increased from 3.176 to 9.437.

b. The intercept has changed from 557.724 to -50.185. Since there were no data values near x=0, it is not appropriate to interpret the value of the intercept.

c. Correcting the value of the outlier resulted in a reduction in sε, 182.253 to 129.81.

d. Correcting the value of the outlier resulted in an increase in R2, 11.87% to 55.29%.

11.79 a. y = 9.7709 + 0.0501x and sε = 2.2021

b. 90% C.I. on β1 : β1 ± t.05,46SE(β1)⇒ 0.0501± (1.679)(0.0009)⇒ (0.0486, 0.0516)

11.80 An examination of the data in the scatterplot indicates that two of the points may possibly be outliers since they are somewhat below the general pattern in the data. This may indicate that the data is nonnormal. Also, there appears to be an increase in the variability of the Rate values as the Mileage increases. This would indicate that the condition of constant variance may be violated.

11.81 There is an error in the Excel output given in the book. Output from Minitab is given here:


Regression Analysis: Rate: versus Mileage:

The regression equation is

Rate: = 9.79 + 0.0503 Mileage:

Predictor Coef SE Coef T P

Constant 9.7932 0.4391 22.30 0.000

Mileage: 0.0503229 0.0008230 61.14 0.000

S = 2.042 R-Sq = 98.8% R-Sq(adj) = 98.8%

Analysis of Variance

Source DF SS MS F P

Regression 1 15590 15590 3738.63 0.000

Residual Error 46 192 4

Total 47 15782

Unusual Observations

Obs Mileage: Rate: Fit SE Fit Residual St Resid

14 120 11.100 15.832 0.372 -4.732 -2.36R

27 330 18.000 26.400 0.300 -8.400 -4.16R

46 1050 58.700 62.632 0.614 -3.932 -2.02R

47 1200 75.800 70.181 0.725 5.619 2.94RX

48 1650 89.000 92.826 1.074 -3.826 -2.20RX

R denotes an observation with a large standardized residual

X denotes an observation whose X value gives it large influence.

Predicted Values for New Observations

New Obs Fit SE Fit 95.0% CI 95.0% PI

1 26.903 0.298 ( 26.303, 27.503) ( 22.749, 31.057)

Values of Predictors for New Observations

New Obs Mileage:

1 340

y = 9.7932 + (0.0503229)(340) = 26.903

The 95% P.I. when x=340 is (22.749, 31.057)

11.82 a. The group felt that towns with small populations should be associated with large amounts per person, whereas towns with larger populations would have small amounts per person. Thus, the group claimed that as town size increased the amount spent per person decreased. If the data supported their claim, then the slope would be negative.

b. The estimated slope was β1 = 0.0005324, which is positive. Thus, their claim is not supported by the data.

11.83 There is a single point in the upper right hand corner of the plot that is a considerable distance from the rest of the data points. A point located in this region will cause the fitted line to be drawn away from the line which would be fitted to the data values with this point excluded. Thus the regression line given in the plot seems to be somewhat misfitted. A regression line fitted to the data set excluding this point would most likely yield a line with a negative slope.


11.84 a. The point is a very high-influence outlier which has distorted the slope considerably.

b. The regression line with the one point eliminated has a negative slope, β1 = −0.0015766. This confirms the opinion of the group, which had argued that the smallest towns would have the highest per capita expenditures, with decreasing expenditures as the size of the towns increased.

11.85 The slope with the unusual town included was β1 = 0.0005324. The output with the unusual town excluded is shown to be β1 = −0.0015766. The slope has changed sign and increased in magnitude.

11.86 The output from SAS is given here:

The SAS System

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 1 867.00000 867.00000 99.80 <.0001

Error 16 139.00000 8.68750

Corrected Total 17 1006.00000

Root MSE 2.94746 R-Square 0.8618

Dependent Mean 21.66667 Adj R-Sq 0.8532

Coeff Var 13.60365

Parameter Estimates

Parameter Standard

Variable DF Estimate Error t Value Pr > |t|

Intercept 1 41.40306 2.09422 19.77 <.0001

x 1 28.23639 2.82649 9.99 <.0001


Plot of RESID*PRED. Symbol used is ’*’.

|

|

4.833 + *

|

4.333 + *

|

3.833 +

|

3.333 + *

|

2.833 +*

|

2.333 + *

|

1.833 +* *

R |

E 1.333 +

S |

I 0.833 +

D |

U 0.333 + *

A |------------------------------------------------------------

L -0.167 +* *

S |

-0.667 +

|

-1.167 +* *

|

-1.667 +

|

-2.167 +*

|

-2.667 + *

|

-3.167 +* *

|

-3.667 + *

|

-4.167 + *

|

-+----------------------------+----------------------------+-

13.167 21.667 30.167

PREDICTED VALUE

a. y = 41.40306 + 28.23639 log10(x)

b. There is no apparent pattern in the residuals that would indicate the need to add higher order terms in the model. There are no outliers and the variance appears constant.

c. From the output, t = 9.99⇒ p− value = Pr(t16 ≥ 9.99) < 0.0001

11.87 The Minitab output is given here:


Regression Analysis: Yield versus Nitrogen

The regression equation is

Yield = 11.8 + 6.50 Nitrogen

Predictor Coef SE Coef T P

Constant 11.778 1.887 6.24 0.000

Nitrogen 6.5000 0.8736 7.44 0.000

S = 2.140 R-Sq = 88.8% R-Sq(adj) = 87.2%

Analysis of Variance

Source DF SS MS F P

Regression 1 253.50 253.50 55.36 0.000

Residual Error 7 32.06 4.58

Lack of Fit 1 2.72 2.72 0.56 0.484

Pure Error 6 29.33 4.89

Total 8 285.56

y = 11.778 + 6.5x

From the output, MSLack = 2.72, MSexp = 4.89 ⇒ F = 2.72/4.89 = 0.56 with df = 1, 6. This yields p-value = 0.484, which implies there is no indication of lack of fit in the model.
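
The lack-of-fit F statistic and its p-value can be checked from the mean squares in the Minitab output. A minimal Python sketch (SciPy assumed available; illustrative only):

from scipy import stats

ms_lack, ms_pexp = 2.72, 4.89
F = ms_lack / ms_pexp                  # about 0.56
p_value = stats.f.sf(F, 1, 6)          # about 0.48, matching the output
print(f"F = {F:.2f}, p-value = {p_value:.3f}")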

11.88 a. From the scatterplot, it would appear that a straight-line model relating Homogenate to Pellet would provide an adequate fit.

b. No, the plot does not indicate any outliers and the variance appears to be constant.

c. We could use the calibration techniques and predict Pellet response using the observed Homogenate reading.

d. Use the calibration prediction interval.

11.89 a. Scatterplot of the data is given here:

[Scatterplot: Plot of Data for Exercise 11.89; PRICE versus SIZE]

b. One of the houses in the study is shown as having a size of 2, but a very low price. It appears to be an outlier, with a value well below any reasonable line that may be fitted to the data. Because the point has an x-value near x̄, the point does not have high leverage.

c. The Minitab output is given here:

Regression Analysis: PRICE versus SIZE

The regression equation is

PRICE = 51.1 + 59.2 SIZE

Predictor Coef SE Coef T P

Constant 51.08 13.97 3.66 0.001

SIZE 59.152 6.666 8.87 0.000

S = 29.14 R-Sq = 58.9% R-Sq(adj) = 58.1%

PRESS = 50352.0 R-Sq(pred) = 55.66%

Analysis of Variance

Source DF SS MS F P

Regression 1 66862 66862 78.73 0.000

Residual Error 55 46707 849

Total 56 113569

Unusual Observations

Obs SIZE PRICE Fit SE Fit Residual St Resid

4 4.70 352.00 329.09 18.32 22.91 1.01 X

27 2.00 19.10 169.38 3.86 -150.28 -5.20R

47 2.90 154.00 222.62 7.06 -68.62 -2.43R


R denotes an observation with a large standardized residual

X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 2.10

The least squares line is y = 51.08 + 59.152x

d. Minitab output for the data set without the outlier is given here:

Regression Analysis: PRICE versus SIZE

The regression equation is

PRICE = 54.0 + 59.0 SIZE

Predictor Coef SE Coef T P

Constant 53.99 10.06 5.37 0.000

SIZE 59.040 4.794 12.31 0.000

S = 20.96 R-Sq = 73.7% R-Sq(adj) = 73.3%

PRESS = 26440.0 R-Sq(pred) = 70.73%

Analysis of Variance

Source DF SS MS F P

Regression 1 66607 66607 151.64 0.000

Residual Error 54 23719 439

Total 55 90327

Unusual Observations

Obs SIZE PRICE Fit SE Fit Residual St Resid

4 4.70 352.00 331.48 13.18 20.52 1.26 X

13 2.90 181.00 225.20 5.09 -44.20 -2.17R

26 2.70 262.00 213.40 4.32 48.60 2.37R

46 2.90 154.00 225.20 5.09 -71.20 -3.50R

R denotes an observation with a large standardized residual

X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 1.83

Predicted Values for New Observations

New Obs Fit SE Fit 95.0% CI 95.0% PI

1 349.19 14.59 ( 319.94, 378.43) ( 297.99, 400.38) XX

X denotes a row with X values away from the center

XX denotes a row with very extreme X values

Values of Predictors for New Observations

New Obs SIZE

1 5.00

The least squares line is y = 53.99 + 59.04x. Note that the slope has changed very little (59.152 versus 59.04) since the outlier was of low leverage.

e. The residual standard deviations are sε = 29.14 with the outlier and sε = 20.96 without the outlier. There has been a considerable reduction in the residual standard deviation because it is based on the squared distances from the fitted line and hence is very sensitive to outliers.

11.90 a. The estimated intercept is βo = 53.99. This is the estimated mean price of houses of size 0. This could be interpreted as the estimated price of land upon which there is no building. However, there were no data values with x near 0. Therefore, the estimated intercept should not be directly interpreted but just taken as a portion of an overall model.

b. A slope of 0 would indicate that the estimated mean price of houses does not increase as the size of the house increases. That is, large houses have the same price as small houses. This is not very realistic. From the Minitab output, t = 12.31 with df=54 ⇒ p-value = Pr(t54 ≥ 12.31) < 0.0005. Thus, there is highly significant evidence that the slope is not 0.

c. Using the estimates from the Minitab output, a 95% C.I. for β1 is

59.040± (2.005)(4.794)⇒ (49.428, 68.652).
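
This interval can be reproduced from the slope and standard error in the Minitab output. A minimal Python sketch (SciPy assumed available; illustrative only):

from scipy import stats

beta1_hat = 59.040    # slope estimate for SIZE
se_beta1 = 4.794      # SE Coef for SIZE
df = 54               # n - 2 with the outlier removed

t_mult = stats.t.ppf(0.975, df)            # about 2.005
half_width = t_mult * se_beta1
print(f"95% C.I. for beta1: ({beta1_hat - half_width:.3f}, "
      f"{beta1_hat + half_width:.3f})")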

11.91 a. From the Minitab output, a 95% P.I. for the selling price when the size is 5:

(297.99, 400.38). The prediction may be somewhat suspect since a house of size 5 (5000 square feet) is beyond all the data values in the study.

b. Scatterplot of the data is given here:

[Scatterplot: Plot of Data for Exercise 11.91; PRICE versus SIZE]

The selling prices appear to increase in variability as the size of the houses increases. For small houses, the selling prices are concentrated near the fitted line. For large houses, there is a wide spread in the selling price for houses of nearly the same size. Therefore, the assumption of constant variance may not be valid.


c. The P.I. may not be valid since the statistical procedures depend on the condition that the variance remains constant across the range of x-values.

11.92 Minitab output is given here:

Regression Analysis: SALES versus DENSITY

The regression equation is

SALES = 142 - 12.9 DENSITY

Predictor Coef SE Coef T P

Constant 141.525 9.109 15.54 0.000

DENSITY -12.893 1.946 -6.63 0.000

S = 21.74 R-Sq = 59.4% R-Sq(adj) = 58.0%

PRESS = 17131.8 R-Sq(pred) = 50.95%

Analysis of Variance

Source DF SS MS F P

Regression 1 20747 20747 43.90 0.000

Residual Error 30 14180 473

Total 31 34927

Unusual Observations

Obs DENSITY SALES Fit SE Fit Residual St Resid

18 1.30 183.00 124.77 6.90 58.23 2.82R

19 1.30 171.00 124.77 6.90 46.23 2.24R

27 9.40 45.00 20.34 10.74 24.66 1.31 X

R denotes an observation with a large standardized residual

X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 1.82

a. The correlation is obtained by taking the square root of R2 and giving it the same sign as the estimated slope: r = −√.594 = −0.77. The negative sign implies that large values of DENSITY are associated with small values of SALES and small values of DENSITY are associated with large values of SALES.

b. y = 141.525 − 12.893x. The estimated intercept βo = 141.525 is the predicted sales for homes in a ZIP code area with 0 density (that is, lawn-care sales in an area with no homes). Thus, the intercept does not have a direct interpretation in this study. The estimated slope β1 = −12.893 indicates that the average number of lawn care sales per thousand decreases 12.893 for each extra home per acre.

c. The residual standard deviation is sε = 21.74. A large value of sε would indicate a poorly fitting model and hence inaccurate estimates and predictions. This would yield P.I.'s having large widths.

11.93 a. t = −6.63 ⇒ p-value = Pr(|t30| ≥ 6.63) < 0.0001. Thus, there is very strong evidence that density is related to sales.

b. 95% C.I. for β1 : −12.893± (2.042)(1.946)⇒ (−16.867,−8.919).


11.94 Scatterplot of the data is given here:

[Scatterplot: Plot of Sales vs Density]

There appears to be a curvature in the plotted points which would indicate that a straight-line model is not appropriate for modeling Sales as a function of Density.

11.95 Minitab output is given here:

Regression Analysis: SALES versus 1/DENSITY

The regression equation is

SALES = 27.7 + 199 1/DENSITY

Predictor Coef SE Coef T P

Constant 27.704 3.945 7.02 0.000

1/DENSIT 199.22 11.74 16.97 0.000

S = 10.48 R-Sq = 90.6% R-Sq(adj) = 90.3%

a. The density variable is number of homes per acre. Therefore, 1/density would be the number of acres per home, that is, the average lot size of a home. If 1/density equals 0.5, then the average lot size in this ZIP code area would be a 0.5 acre lot.

b. Scatterplot of the data is given here:

[Scatterplot: Plot of Sales vs Lotsize]

The plotted points appear to fit a straight-line model reasonably well.

c. The correlation is obtained by taking the square root of R2 and giving it the same sign as the estimated slope: r = √.906 = 0.952. The magnitude of r has increased (0.771 versus 0.952). The relation between Sales and Lot Size is more linear than the relation between Sales and Density. This results in an increase in the magnitude of the correlation coefficient.

11.96 Minitab output is given here:

Regression Analysis: DURABIL versus CONCENTR

The regression equation is

DURABIL = 47.0 + 0.308 CONCENTR

Predictor Coef SE Coef T P

Constant 47.020 4.728 9.94 0.000

CONCENTR 0.3075 0.1114 2.76 0.008

S = 12.21 R-Sq = 11.6% R-Sq(adj) = 10.1%

PRESS = 9368.62 R-Sq(pred) = 4.20%

Analysis of Variance

Source DF SS MS F P

Regression 1 1134.7 1134.7 7.61 0.008

Residual Error 58 8644.5 149.0

Total 59 9779.2

Unusual Observations


Obs CONCENTR DURABIL Fit SE Fit Residual St Resid

2 20.0 25.20 53.17 2.73 -27.97 -2.35R

4 20.0 20.30 53.17 2.73 -32.87 -2.76R

60 60.0 18.90 65.47 2.73 -46.57 -3.91R

R denotes an observation with a large standardized residual

Durbin-Watson statistic = 1.35

a. y = 47.020 + 0.3075x. The estimated slope β1 = 0.3075 can be interpreted as follows: there is a 0.3075 increase in average durability for a 1 unit increase in concentration.

b. The coefficient of determination is R2 = 11.6%. That is, 11.6% of the variation in durability is explained by its linear relationship with concentration. Thus, a straight-line model relating Durability to Concentration would not yield very accurate predictions.

11.97 For testing Ho : β1 = 0 versus Ha : β1 ≠ 0, t = 2.76 with p-value = 0.008. The p-value is less than α = 0.01, thus we reject Ho and conclude there is significant evidence that the slope is different from 0.

11.98 Scatterplot of the data is given here:

[Scatterplot: Plot of Durability vs Concentration]

a. From the scatterplot, there is a definite curvature in the relation between Durability and Concentration. A straight-line model would not appear to be appropriate.

b. The coefficient of determination, R2, measures the strength of the linear (straight-line) relation only. A straight-line model does not adequately describe the relation between Durability and Concentration. This is indicated by the small percentage of the variation, 11.6%, in the values of Durability explained by the model containing just a linear relation with Concentration. A more complex relation exists between the variables Durability and Concentration.

11.99 Some of the important points to mention are:

• A correlation of 80.1% indicates that the linear relation between the Estimated Cost and the Actual Cost is quite strong. This does not mean that the Estimated Cost itself is accurate. For instance, the Estimated Cost could always be off by a factor of 3 (i.e., consistently Actual Cost = 3 x Estimated Cost). This would result in a perfect correlation between Estimated Cost and Actual Cost, but the Estimated Cost would always be considerably different from the Actual Cost.

• The p-value is low, which corresponds to the high correlation. The low p-value affirms the strong linear relation between Estimated Cost and Actual Cost.

• The estimated slope is 1.25. This implies that the Actual Cost is generally higher than the Estimated Cost. Thus, the estimating method may not be a reasonable method.

• There are two observations that might need examination. The point with Estimated Cost=186200 and Actual Cost=152134 is very much below the general pattern in the data. The second point, with Estimated Cost=178300 and Actual Cost=215200, is somewhat above the general pattern in the data.
