
Part II

Multiple Linear Regression

Chapter 7

Multiple Regression

A multiple linear regression model is a linear model that describes how a y-variable relates to two or more x-variables (or transformations of x-variables).

For example, suppose that a researcher is studying factors that might affect systolic blood pressures for women aged 45 to 65 years old. The response variable is systolic blood pressure (Y). Suppose that two predictor variables of interest are age (X1) and body mass index (X2). The general structure of a multiple linear regression model for this situation would be

Y = β0 + β1X1 + β2X2 + ε.

The equation β0 + β1X1 + β2X2 describes the mean value of blood pressure for specific values of age and BMI.

The error term (ε) describes the differences between individual values of blood pressure and their expected values of blood pressure.
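As a hedged illustration of fitting such a model in R, the sketch below simulates a data set of this form; the data frame bp and its columns sbp, age, and bmi are invented for the example, not taken from an actual study.

##########
# Invented, simulated data standing in for a real study.
set.seed(1)
n  <- 100
bp <- data.frame(age = runif(n, 45, 65),   # X1: age in years
                 bmi = runif(n, 18, 35))   # X2: body mass index
bp$sbp <- 110 + 0.45 * bp$age + 0.9 * bp$bmi + rnorm(n, sd = 8)  # Y

# Fit Y = beta0 + beta1*X1 + beta2*X2 + epsilon.
fit <- lm(sbp ~ age + bmi, data = bp)
summary(fit)
##########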

One note concerning terminology. A linear model is one that is linear in the beta coefficients, meaning that each beta coefficient simply multiplies an x-variable or a transformation of an x-variable. For instance, y = β0 + β1x + β2x^2 + ε is called a multiple linear regression model even though it describes a quadratic, curved relationship between y and a single x-variable.



    7.1 About the Model

    Notation for the Population Model

A population model for a multiple regression model that relates a y-variable to p − 1 predictor variables is written as

yi = β0 + β1xi,1 + β2xi,2 + ... + βp−1xi,p−1 + εi. (7.1)

We assume that the εi have a normal distribution with mean 0 and constant variance σ^2. These are the same assumptions that we used in simple regression with one x-variable.

The subscript i refers to the ith individual or unit in the population. In the notation for the x-variables, the subscript following i simply denotes which x-variable it is.

    Estimates of the Model Parameters

The estimates of the β coefficients are the values that minimize the sum of squared errors for the sample. The exact formula for this will be given in the next chapter when we introduce matrix notation.

The letter b is used to represent a sample estimate of a β coefficient. Thus b0 is the sample estimate of β0, b1 is the sample estimate of β1, and so on.

MSE = SSE/(n − p) estimates σ^2, the variance of the errors. In the formula, n = sample size, p = number of β coefficients in the model, and SSE = sum of squared errors. Notice that for simple linear regression p = 2. Thus, we get the formula for MSE that we introduced in the context of one predictor.
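A minimal sketch of this in R, on simulated data with invented names, checks MSE = SSE/(n − p) against the software's own estimate of σ^2:

##########
# Simulated data; all names here are assumptions for illustration.
set.seed(2)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(30)
fit <- lm(y ~ x1 + x2, data = d)

sse <- sum(resid(fit)^2)     # SSE = sum of squared errors
n   <- nrow(d)               # sample size
p   <- length(coef(fit))     # number of beta coefficients (here 3)
sse / (n - p)                # MSE = SSE / (n - p)
summary(fit)$sigma^2         # the same value, as reported by lm()
##########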

In the case of two predictors, the estimated regression equation yields a plane (as opposed to a line in the simple linear regression setting). For more than two predictors, the estimated regression equation yields a hyperplane.



    Predicted Values and Residuals

A predicted value is calculated as ŷi = b0 + b1xi,1 + b2xi,2 + ... + bp−1xi,p−1, where the b values come from statistical software and the x-values are specified by us.

A residual (error) term is calculated as ei = yi − ŷi, the difference between an actual and a predicted value of y.

A plot of residuals versus predicted values ideally should resemble a horizontal random band. Departures from this form indicate difficulties with the model and/or data.

Other residual analyses can be done exactly as we did in simple regression. For instance, we might wish to examine a normal probability plot (NPP) of the residuals. Additional plots to consider are plots of residuals versus each x-variable separately. This might help us identify sources of curvature or nonconstant variance.
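A brief sketch of these residual checks in R, again on simulated data with invented names:

##########
# Simulated data; names are assumptions for illustration.
set.seed(3)
d <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(40)
fit <- lm(y ~ x1 + x2, data = d)

plot(fitted(fit), resid(fit))  # should resemble a horizontal random band
abline(h = 0, lty = 2)

qqnorm(resid(fit))             # normal probability plot of the residuals
qqline(resid(fit))

plot(d$x1, resid(fit))         # residuals versus each x-variable separately
plot(d$x2, resid(fit))
##########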

    Interaction Terms

An interaction term captures a coupling or combined effect of two or more independent variables.

Suppose we have a response variable (Y) and two predictors (X1 and X2). Then, the regression model with an interaction term is written as

Y = β0 + β1X1 + β2X2 + β3X1X2 + ε.

Suppose you also have a third predictor (X3). Then, the regression model with all interaction terms is written as

Y = β0 + β1X1 + β2X2 + β3X3 + β4X1X2 + β5X1X3 + β6X2X3 + β7X1X2X3 + ε.

In a model with more predictors, you can imagine how much the model grows by adding interactions. Just make sure that you have enough observations to cover the degrees of freedom used in estimating the corresponding regression coefficients!



For each observation, the value of an interaction term is found by multiplying the recorded values of the predictor variables in the interaction.

In models with interaction terms, the significance of the interaction term should always be assessed first, before proceeding with significance testing of the main variables.

If one of the main variables is removed from the model, then the model should not include any interaction terms involving that variable.
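In R's formula syntax, x1:x2 denotes a single interaction term, while x1*x2 expands to the main effects plus all their interactions; a hedged sketch with simulated data and invented names:

##########
# Simulated data; names are assumptions for illustration.
set.seed(4)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
d$y <- 1 + d$x1 + d$x2 + 0.5 * d$x1 * d$x2 + rnorm(50)

# Two predictors plus their interaction:
# y ~ x1 * x2 is shorthand for y ~ x1 + x2 + x1:x2.
fit2 <- lm(y ~ x1 * x2, data = d)

# Three predictors with all two- and three-way interactions; this expands
# to the eight-term model shown above (beta0 through beta7).
fit3 <- lm(y ~ x1 * x2 * x3, data = d)
summary(fit3)
##########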

    7.2 Significance Testing of Each Variable

Within a multiple regression model, we may want to know whether a particular x-variable is making a useful contribution to the model. That is, given the presence of the other x-variables in the model, does a particular x-variable help us predict or explain the y-variable? For instance, suppose that we have three x-variables in the model. The general structure of the model could be

Y = β0 + β1X1 + β2X2 + β3X3 + ε. (7.2)

As an example, to determine whether variable X1 is a useful predictor variable in this model, we could test

H0 : β1 = 0

HA : β1 ≠ 0.

If the null hypothesis above were the case, then a change in the value of X1 would not change Y, so Y and X1 are not related. Also, we would still be left with variables X2 and X3 being present in the model. When we cannot reject the null hypothesis above, we should say that we do not need variable X1 in the model given that variables X2 and X3 will remain in the model. In general, the interpretation of a slope in multiple regression can be tricky. Correlations among the predictors can change the slope values dramatically from what they would be in separate simple regressions.
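The sketch below illustrates this last point on simulated data (all names are assumptions): when x1 and x2 are correlated, the slope on x1 from a simple regression can differ markedly from its slope in the multiple regression.

##########
# Simulated data with correlated predictors; names are assumptions.
set.seed(5)
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.6)  # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(200)

coef(lm(y ~ x1))       # simple-regression slope absorbs part of x2's effect
coef(lm(y ~ x1 + x2))  # slope on x1 is near 2, given x2 in the model
##########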

To carry out the test, statistical software will report p-values for all coefficients in the model. Each p-value will be based on a t-statistic calculated as

    t = (sample coefficient - hypothesized value) / standard error of coefficient.



For our example above, the t-statistic is:

t = (b1 − 0) / s.e.(b1) = b1 / s.e.(b1).

Note that the hypothesized value is usually just 0, so this portion of the formula is often omitted.
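A small sketch of this calculation in R, pulling b1 and s.e.(b1) from the summary of a fit to simulated data (names are invented):

##########
# Simulated data; names are assumptions for illustration.
set.seed(6)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
d$y <- 1 + 2 * d$x1 + rnorm(30)
fit <- lm(y ~ x1 + x2, data = d)

ctab <- summary(fit)$coefficients
b1   <- ctab["x1", "Estimate"]
se1  <- ctab["x1", "Std. Error"]
b1 / se1                  # matches the reported t value for x1
ctab["x1", "t value"]
##########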

    7.3 Examples

Example 1: Heat Flux Data Set

The data are from n = 29 homes used to test solar thermal energy. The variables of interest for our model are y = total heat flux, and x1, x2, and x3, which are the focal points for the east, north, and south directions, respectively. There are two other measurements in this data set: another measurement of the focal points and the time of day. We will not utilize these predictors at this time. Table 7.1 gives the data used for this analysis.

    The regression model of interest is

yi = β0 + β1xi,1 + β2xi,2 + β3xi,3 + εi.

Figure 7.1(a) gives a histogram of the residuals. While the shape is not completely bell-shaped, it is again not suggestive of any severe departures from normality. Figure 7.1(b) gives a plot of the residuals versus the fitted values. Again, the values appear to be randomly scattered about 0, suggesting constant variance.

    The following provides the t-tests for the individual regression coefficients:

    ##########

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 389.1659 66.0937 5.888 3.83e-06 ***

    east 2.1247 1.2145 1.750 0.0925 .

    north -24.1324 1.8685 -12.915 1.46e-12 ***

    south 5.3185 0.9629 5.523 9.69e-06 ***

    ---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 8.598 on 25 degrees of freedom



[Figure 7.1: (a) Histogram of the residuals for the heat flux data set. (b) Residuals versus fitted values.]

    Multiple R-Squared: 0.8741, Adjusted R-squared: 0.859

    F-statistic: 57.87 on 3 and 25 DF, p-value: 2.167e-11

    ##########

At the α = 0.05 significance level, both north and south appear to be statistically significant predictors of heat flux. However, east is not (with a p-value of 0.0925). While we could claim this is a marginally significant predictor, we will rerun the analysis by dropping the east predictor.
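A hedged sketch of how these two fits might be produced in R; the data frame heatflux and its column names are assumptions, and the values below are simulated placeholders standing in for the Table 7.1 data so that the code runs:

##########
# 'heatflux' stands in for the n = 29 data set of Table 7.1; the values
# here are simulated placeholders, not the real measurements.
set.seed(501)
heatflux <- data.frame(east  = runif(29, 31, 40),
                       north = runif(29, 15, 20),
                       south = runif(29, 32, 40))
heatflux$flux <- 389 + 2.1 * heatflux$east - 24.1 * heatflux$north +
  5.3 * heatflux$south + rnorm(29, sd = 8.6)

fit_full <- lm(flux ~ east + north + south, data = heatflux)
summary(fit_full)

# Drop 'east' and refit; update() removes that term from the formula.
fit_reduced <- update(fit_full, . ~ . - east)
summary(fit_reduced)
##########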

The following provides the t-tests for the individual regression coefficients for the newly suggested model:

    ##########

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 483.6703 39.5671 12.224 2.78e-12 ***

    north -24.2150 1.9405 -12.479 1.75e-12 ***

    south 4.7963 0.9511 5.043 3.00e-05 ***

    ---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 8.932 on 26 degrees of freedom



    Multiple R-Squared: 0.8587, Adjusted R-squared: 0.8478

    F-statistic: 79.01 on 2 and 26 DF, p-value: 8.938e-12

    ##########

    The residual plots still appear okay (they are not included here) and