Lecture 23 • Multiple Regression (Sections 19.3-19.4)


Post on 21-Dec-2015



Multiple Regression Model

• Multiple regression model:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$

• Required conditions:

– The regression function is a linear function of the independent variables x1,…,xk: $E(y \mid x_1,\dots,x_k) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$ (the multiple regression line does not systematically overestimate/underestimate y for any combination of x1,…,xk).

– The error $\varepsilon$ is normally distributed.

– The standard deviation $\sigma_\varepsilon$ of the error is constant for all values of the x's.

– The errors are independent.

Estimating the Coefficients and Assessing the Model, Example

• Data were collected from 100 randomly selected inns that belong to La Quinta, and a regression was run for the following suggested model:

$\text{Margin} = \beta_0 + \beta_1\,\text{Rooms} + \beta_2\,\text{Nearest} + \beta_3\,\text{Office} + \beta_4\,\text{College} + \beta_5\,\text{Income} + \beta_6\,\text{Disttwn}$

Margin  Number  Nearest  Office Space  Enrollment  Income  Distance
55.5    3203    4.2      549           8           37      2.7
33.8    2810    2.8      496           17.5        35      14.4
49      2890    2.4      254           20          35      2.6
31.9    3422    3.3      434           15.5        38      12.1
57.4    2687    0.9      678           15.5        42      6.9
49      3759    2.9      635           19          33      10.8

Xm19-01

Model Assessment

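• The model is assessed using three tools:

– The standard error of estimate: $s_\varepsilon = \sqrt{\dfrac{SSE}{n-k-1}}$, where $SSE = \sum (y_i - \hat{y}_i)^2$

– The coefficient of determination: $R^2 = 1 - \dfrac{SSE}{\sum (y_i - \bar{y})^2}$

– The F-test of the analysis of variance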
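As a rough illustration of the first two tools, the sketch below plugs the La Quinta sums of squares from the ANOVA printout in these slides into the two formulas. The variable names (`sse`, `sst`, `s_eps`, `r_squared`) are chosen here for illustration and do not come from the slides.

```python
import math

# La Quinta example: n = 100 inns, k = 6 independent variables.
n, k = 100, 6
sse = 2825.6259   # "Error" sum of squares from the ANOVA printout
sst = 5949.4579   # "C. Total" sum of squares, i.e. the sum of (y_i - ybar)^2

# Standard error of estimate: square root of SSE / (n - k - 1).
s_eps = math.sqrt(sse / (n - k - 1))

# Coefficient of determination: share of the total variation that is explained.
r_squared = 1 - sse / sst

print(f"s_eps = {s_eps:.3f}")
print(f"R^2   = {r_squared:.4f}")
```

With these inputs the standard error of estimate comes out at about 5.51 and R² at about 0.525, so roughly half of the variation in operating margin is explained by the six variables.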

Testing the Validity of the Model

• We pose the question: Is there at least one independent variable linearly related to the dependent variable (are any of the x's useful in predicting y)?

• To answer the question we test the hypotheses

$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$

$H_1$: At least one $\beta_i$ is not equal to zero.

• If at least one $\beta_i$ is not equal to zero, the model has some validity.

• The hypotheses are tested by an ANOVA procedure.

Testing the Validity of the La Quinta Inns Regression Model

Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
Model      6  3123.8320       520.639      17.1358  <.0001
Error     93  2825.6259        30.383
C. Total  99  5949.4579

[Variation in y] = SSR + SSE. If SSR is large relative to SSE, much of the variation in y is explained by the regression model; the model is useful and thus, the null hypothesis should be rejected. Thus, we reject for large F.

Rejection region

$F > F_{\alpha,\,k,\,n-k-1}$


$F = \dfrac{SSR/k}{SSE/(n-k-1)}$

$F_{\alpha,\,k,\,n-k-1} = F_{0.05,\,6,\,100-6-1} = 2.17$

$F = 17.14 > 2.17$

Also, the p-value (Significance F) = 0.0000. Reject the null hypothesis.
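The F ratio in the printout can be reproduced from the sums of squares alone. The following is a minimal sketch; the critical value 2.17 is taken as given from the slides rather than recomputed, and the variable names are illustrative.

```python
# Sums of squares from the La Quinta ANOVA table.
n, k = 100, 6
ssr = 3123.8320   # "Model" (regression) sum of squares
sse = 2825.6259   # "Error" sum of squares

msr = ssr / k             # mean square for regression
mse = sse / (n - k - 1)   # mean square for error
f_stat = msr / mse

f_crit = 2.17             # F(0.05; 6, 93) as quoted in the slides
reject_h0 = f_stat > f_crit

print(f"MSR = {msr:.3f}, MSE = {mse:.3f}, F = {f_stat:.2f}")
print("Reject H0:", reject_h0)
```

The computed F of about 17.14 falls well inside the rejection region, in line with the printed p-value.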


ANOVA
            df  SS      MS     F      Significance F
Regression   6  3123.8  520.6  17.14  0.0000
Residual    93  2825.6   30.4
Total       99  5949.5

Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. At least one of the $\beta_i$ is not equal to zero. Thus, at least one independent variable is linearly related to y. This linear regression model is valid.

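Relationships among SSE, $s_\varepsilon$, $R^2$ and F

SSE                         $s_\varepsilon$                              $R^2$       F      Assessment of model
0                           0                                            1           ∞      Perfect
Small                       Small                                        Close to 1  Large  Good
Large                       Large                                        Close to 0  Small  Poor
$\sum (y_i - \bar{y})^2$    $\sqrt{\sum (y_i - \bar{y})^2 / (n-k-1)}$    0           0      Useless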
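Because the variation in y splits into SSR + SSE, the F statistic can also be written in terms of $R^2$ alone, which is one way to see why the two measures move together. The identity below is a sketch; the symbol SST for the total sum of squares $\sum (y_i - \bar{y})^2$ is a notational assumption, since the slides do not name it.

```latex
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST},
\qquad
F = \frac{SSR/k}{SSE/(n-k-1)} = \frac{R^2/k}{(1-R^2)/(n-k-1)}
```

For the La Quinta output, $R^2 = 1 - 2825.6259/5949.4579 \approx 0.525$, and $(0.525/6)/(0.475/93) \approx 17.1$, which agrees with the reported F ratio.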

Interpreting the Coefficients

• b0 = 38.14. This is the intercept, the value of y when all the variables take the value zero. Since the data ranges of the independent variables do not cover the value zero, do not interpret the intercept.

• b1 = -0.0076. In this model, for each additional room within 3 miles of the La Quinta inn, the operating margin decreases on average by .0076% (assuming the other variables are held constant).

• b2 = 1.65. In this model, for each additional mile between the nearest competitor and a La Quinta inn, the operating margin increases on average by 1.65% when the other variables are held constant.

• b3 = 0.020. For each additional 1000 sq-ft of office space, the operating margin increases on average by .02% when the other variables are held constant.

• b4 = 0.21. For each additional thousand students, the operating margin increases on average by .21% when the other variables are held constant.

• b5 = 0.41. For each additional $1000 of median household income, the operating margin increases on average by .41% when the other variables are held constant.

• b6 = -0.23. For each additional mile to the downtown center, the operating margin decreases on average by .23% when the other variables are held constant.

Testing the Coefficients

• The hypotheses for each $\beta_i$ are

$H_0: \beta_i = 0$

$H_1: \beta_i \neq 0$

Test statistic: $t = \dfrac{b_i - \beta_i}{s_{b_i}}$, d.f. = n - k - 1

• JMP printout

Parameter Estimates
Term          Estimate   Std Error  t Ratio  Prob>|t|
Intercept     38.138575  6.992948    5.45    <.0001
Number        -0.007618  0.001255   -6.07    <.0001
Nearest       1.6462371  0.632837    2.60    0.0108
Office Space  0.0197655  0.00341     5.80    <.0001
Enrollment    0.2117829  0.133428    1.59    0.1159
Income        0.4131221  0.139552    2.96    0.0039
Distance      -0.225258  0.178709   -1.26    0.2107
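Under the null hypothesis the test statistic reduces to the estimate divided by its standard error, which is how the "t Ratio" column arises. The sketch below recomputes that column from the printed estimates. The cutoff 1.987 is the t value the slides use for their 95% interval and is taken as given; the names `estimates`, `t_ratios` and `significant` are illustrative.

```python
# (estimate, standard error) pairs copied from the JMP printout.
estimates = {
    "Intercept":    (38.138575, 6.992948),
    "Number":       (-0.007618, 0.001255),
    "Nearest":      (1.6462371, 0.632837),
    "Office Space": (0.0197655, 0.00341),
    "Enrollment":   (0.2117829, 0.133428),
    "Income":       (0.4131221, 0.139552),
    "Distance":     (-0.225258, 0.178709),
}

T_CRIT = 1.987  # two-sided 5% cutoff with n - k - 1 d.f., as quoted in the slides

# With beta_i = 0 under H0, t = b_i / s_{b_i}.
t_ratios = {term: b / se for term, (b, se) in estimates.items()}
significant = [term for term, t in t_ratios.items() if abs(t) > T_CRIT]

for term, t in t_ratios.items():
    print(f"{term:<13} t = {t:6.2f}")
print("Significant at the 5% level:", significant)
```

Enrollment and Distance fall below the cutoff, matching their printed p-values of 0.1159 and 0.2107.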

Confidence Intervals for Coefficients

• Note that the test of $H_0: \beta_i = 0$ is a test of whether xi helps to predict y given x1,…,xi-1,xi+1,…,xk. Results of the test might change as we change the other independent variables in the model.

• A confidence interval for $\beta_i$ is

$b_i \pm t_{\alpha/2,\,n-k-1}\; s_{b_i}$

• In the La Quinta data, a 95% confidence interval for $\beta_1$ (the coefficient on number of rooms) is

$-.0076 \pm 1.987 \times .0013 = (-.0102,\ -.0050)$
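The interval arithmetic for the rooms coefficient can be checked in a few lines. This is a sketch using the rounded values shown in the slides; the variable names are illustrative.

```python
b1 = -0.0076     # estimated coefficient on number of rooms (rounded)
se_b1 = 0.0013   # its standard error (rounded)
t_crit = 1.987   # t value for a 95% interval, as quoted in the slides

half_width = t_crit * se_b1
ci = (b1 - half_width, b1 + half_width)

print(f"95% CI for beta_1: ({ci[0]:.4f}, {ci[1]:.4f})")
```

The interval lies entirely below zero, which is consistent with the very small p-value reported for this coefficient.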

Using the Linear Regression Equation

• The model can be used for making predictions by

– Producing a prediction interval estimate for a particular value of y, for given values of xi.

– Producing a confidence interval estimate for the expected value of y, for given values of xi.

• The model can be used to learn about relationships between the independent variables xi and the dependent variable y, by interpreting the coefficients $\beta_i$.

La Quinta Inns, Predictions

• Predict the average operating margin of an inn at a site with the following characteristics:

– 3815 rooms within 3 miles,
– Closest competitor .9 miles away,
– 476,000 sq-ft of office space,
– 24,500 college students,
– $35,000 median household income,
– 11.2 miles distance to downtown center.

MARGIN = 38.14 - 0.0076(3815) + 1.65(.9) + 0.020(476) + 0.21(24.5) + 0.41(35) - 0.23(11.2) = 37.1%

Xm19-01
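The point prediction is just the fitted equation evaluated at the site's characteristics. The sketch below uses the full-precision estimates from the JMP printout instead of the rounded ones; note that office space, enrollment and income enter in thousands, as in the data. The dictionary names are illustrative.

```python
# Estimates from the JMP "Parameter Estimates" printout.
intercept = 38.138575
coefficients = {
    "Number":       -0.007618,   # rooms within 3 miles
    "Nearest":      1.6462371,   # miles to closest competitor
    "Office Space": 0.0197655,   # thousands of sq-ft
    "Enrollment":   0.2117829,   # thousands of students
    "Income":       0.4131221,   # thousands of dollars
    "Distance":     -0.225258,   # miles to downtown
}

# Site characteristics from the slide, in the same units as the data.
site = {
    "Number": 3815,
    "Nearest": 0.9,
    "Office Space": 476,
    "Enrollment": 24.5,
    "Income": 35,
    "Distance": 11.2,
}

predicted_margin = intercept + sum(coefficients[name] * site[name] for name in coefficients)
print(f"Predicted operating margin: {predicted_margin:.3f}%")
```

The result, about 37.09%, agrees with the 37.1% obtained from the rounded coefficients.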

Prediction Intervals and Confidence Intervals for Mean

• Prediction interval for y given x1,…,xk:

$\hat{y} \pm t_{\alpha/2,\,n-k-1}\; se_{ind}(\hat{y})$

• Confidence interval for the mean of y given x1,…,xk:

$\hat{y} \pm t_{\alpha/2,\,n-k-1}\; se_{pred}(\hat{y})$

• For the inn with the characteristics on the previous slide:

$\hat{y} = 37.091$

Confidence interval for mean = (32.970, 41.213)

Prediction interval = (25.395, 48.788)
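The prediction interval is wider than the confidence interval for the mean because an individual inn also carries its own error term. A standard way to express this, not stated in the slides, is that the squared standard error for an individual prediction equals the squared standard error of the estimated mean plus $s_\varepsilon^2$. The sketch below uses that relation to rebuild the prediction interval from the reported confidence interval. The mean square error 30.383 comes from the ANOVA table and 1.987 is the t value quoted in the slides; small differences from the printed interval are due to rounding.

```python
import math

y_hat = 37.091               # point prediction from the slides
ci_mean = (32.970, 41.213)   # reported confidence interval for the mean
mse = 30.383                 # mean square error from the ANOVA table (s_eps squared)
t_crit = 1.987               # t value as quoted in the slides

# Back out the standard error of the estimated mean from the reported interval.
se_mean = (ci_mean[1] - ci_mean[0]) / 2 / t_crit

# An individual outcome adds the error variance on top of that uncertainty.
se_individual = math.sqrt(mse + se_mean ** 2)

pi = (y_hat - t_crit * se_individual, y_hat + t_crit * se_individual)
print(f"Prediction interval: ({pi[0]:.2f}, {pi[1]:.2f})")
```

The reconstructed interval is close to the reported (25.395, 48.788).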

19.4 Regression Diagnostics - II

• The conditions required for the model assessment to apply must be checked.

– Is the error variable normally distributed? Draw a histogram of the residuals.

– Is the regression function correctly specified as a linear function of x1,…,xk? Plot the residuals versus the x's and $\hat{y}$.

– Is the error variance constant? Plot the residuals versus $\hat{y}$.

– Are the errors independent? Plot the residuals versus the time periods.

– Can we identify outliers?

– Is multicollinearity a problem?

Multicollinearity

• Condition in which independent variables are highly correlated.

• Multicollinearity causes two kinds of difficulties:

– The t statistics appear to be too small.
– The coefficients cannot be interpreted as “slopes”.

• Diagnostics:

– High correlation between independent variables
– Counterintuitive signs on regression coefficients
– Low values for t-statistics despite a significant overall fit, as measured by the F statistic

Diagnostics: Multicollinearity

• Example 19.2: Predicting house price (Xm19-02)

– A real estate agent believes that a house's selling price can be predicted using the house size, number of bedrooms, and lot size.

– A random sample of 100 houses was drawn and data recorded.

– Analyze the relationship among the four variables.

Price   Bedrooms  H Size  Lot Size
124100  3         1290    3900
218300  4         2080    6600
117800  3         1250    3750
.       .         .       .
.       .         .       .

• The proposed model is

$\text{PRICE} = \beta_0 + \beta_1\,\text{BEDROOMS} + \beta_2\,\text{H-SIZE} + \beta_3\,\text{LOTSIZE} + \varepsilon$

Summary of Fit
RSquare                     0.559998
RSquare Adj                 0.546248
Root Mean Square Error      25022.71
Mean of Response            154066
Observations (or Sum Wgts)  100

Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
Model      3  7.65017e10      2.5501e10    40.7269  <.0001
Error     96  6.0109e+10      626135896
C. Total  99  1.36611e11

Parameter Estimates
Term        Estimate   Std Error  t Ratio  Prob>|t|
Intercept   37717.595  14176.74    2.66    0.0091
Bedrooms    2306.0808  6994.192    0.33    0.7423
House Size  74.296806  52.97858    1.40    0.1640
Lot Size    -4.363783  17.024     -0.26    0.7982

The model is valid, but no variable is significantly related to the selling price?!
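The house-price printout shows the typical multicollinearity symptom: a significant overall fit alongside individually insignificant coefficients. The sketch below first reproduces the summary figures from the ANOVA sums of squares and then checks for that pattern using the printed p-values. The 5% significance level and all variable names are assumptions made for this illustration.

```python
import math

n, k = 100, 3
ssr = 7.65017e10   # "Model" sum of squares
sse = 6.0109e10    # "Error" sum of squares
sst = 1.36611e11   # "C. Total" sum of squares

r_squared = 1 - sse / sst
mse = sse / (n - k - 1)
root_mse = math.sqrt(mse)
f_ratio = (ssr / k) / mse

# p-values copied from the printout.
p_overall = 0.0001   # printed as "<.0001"; used here as an upper bound
p_coefficients = {"Bedrooms": 0.7423, "House Size": 0.1640, "Lot Size": 0.7982}

ALPHA = 0.05         # significance level assumed for this illustration
overall_significant = p_overall < ALPHA
individually_significant = [v for v, p in p_coefficients.items() if p < ALPHA]

# Symptom: the F-test rejects, yet no single t-test does.
symptom = overall_significant and not individually_significant

print(f"R^2 = {r_squared:.4f}, root MSE = {root_mse:.1f}, F = {f_ratio:.2f}")
print("Overall fit significant, no individual coefficient significant:", symptom)
```

Here R² of about 0.56 and an F ratio near 40.7 indicate a useful model overall, while none of the three slope coefficients passes its own t-test.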

• Multicollinearity is found to be a problem.

          Price   Bedrooms  H Size  Lot Size
Price     1
Bedrooms  0.6454  1
H Size    0.7478  0.8465    1
Lot Size  0.7409  0.8374    0.9936  1
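The first diagnostic, high correlation between independent variables, can be applied mechanically to the correlation matrix. The following sketch flags predictor pairs whose correlation exceeds a cutoff; the value 0.8 is an illustrative assumption, not a rule given in the slides, and the names `corr`, `THRESHOLD` and `flagged` are hypothetical.

```python
from itertools import combinations

predictors = ["Bedrooms", "H Size", "Lot Size"]

# Pairwise correlations among the independent variables, from the matrix.
corr = {
    ("Bedrooms", "H Size"): 0.8465,
    ("Bedrooms", "Lot Size"): 0.8374,
    ("H Size", "Lot Size"): 0.9936,
}

THRESHOLD = 0.8  # illustrative cutoff (an assumption, not from the slides)

# Collect pairs above the cutoff, strongest correlation first.
flagged = sorted(
    (
        (a, b, corr[(a, b)])
        for a, b in combinations(predictors, 2)
        if abs(corr[(a, b)]) > THRESHOLD
    ),
    key=lambda item: -abs(item[2]),
)

for a, b, r in flagged:
    print(f"{a} vs {b}: r = {r:.4f}")
```

All three pairs exceed the cutoff, with house size and lot size almost perfectly correlated, so the model has little information for separating their individual effects.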

Diagnostics: Multicollinearity

• Multicollinearity causes two kinds of difficulties:

– The t statistics appear to be too small.
– The coefficients cannot be interpreted as “slopes”.