
Part II

Multiple Linear Regression

Chapter 7

Multiple Regression

A multiple linear regression model is a linear model that describes how a y-variable relates to two or more x-variables (or transformations of x-variables).

For example, suppose that a researcher is studying factors that might affect systolic blood pressures for women aged 45 to 65 years old. The response variable is systolic blood pressure (Y). Suppose that two predictor variables of interest are age (X1) and body mass index (X2). The general structure of a multiple linear regression model for this situation would be

Y = β0 + β1X1 + β2X2 + ε.

The equation β0 + β1X1 + β2X2 describes the mean value of blood pressure for specific values of age and BMI.

The error term (ε) describes the differences between individual values of blood pressure and their expected values of blood pressure.
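As a hedged illustration of fitting such a model in R, the sketch below simulates a data set of this form; the data frame bp and its columns sbp, age, and bmi are invented for the example, not taken from an actual study.

##########
# Invented, simulated data standing in for a real study.
set.seed(1)
n  <- 100
bp <- data.frame(age = runif(n, 45, 65),   # X1: age in years
                 bmi = runif(n, 18, 35))   # X2: body mass index
bp$sbp <- 110 + 0.45 * bp$age + 0.9 * bp$bmi + rnorm(n, sd = 8)  # Y

# Fit Y = beta0 + beta1*X1 + beta2*X2 + epsilon.
fit <- lm(sbp ~ age + bmi, data = bp)
summary(fit)
##########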

One note concerning terminology. A linear model is one that is linear in the beta coefficients, meaning that each beta coefficient simply multiplies an x-variable or a transformation of an x-variable. For instance, y = β0 + β1x + β2x^2 + ε is called a multiple linear regression model even though it describes a quadratic, curved relationship between y and a single x-variable.



    7.1 About the Model

    Notation for the Population Model

A population model for a multiple regression model that relates a y-variable to p − 1 predictor variables is written as

yi = β0 + β1xi,1 + β2xi,2 + ... + βp−1xi,p−1 + εi. (7.1)

We assume that the εi have a normal distribution with mean 0 and constant variance σ^2. These are the same assumptions that we used in simple regression with one x-variable.

The subscript i refers to the ith individual or unit in the population. In the notation for the x-variables, the subscript following i simply denotes which x-variable it is.

    Estimates of the Model Parameters

The estimates of the β coefficients are the values that minimize the sum of squared errors for the sample. The exact formula for this will be given in the next chapter when we introduce matrix notation.

The letter b is used to represent a sample estimate of a β coefficient. Thus b0 is the sample estimate of β0, b1 is the sample estimate of β1, and so on.

MSE = SSE/(n − p) estimates σ^2, the variance of the errors. In the formula, n = sample size, p = number of β coefficients in the model, and SSE = sum of squared errors. Notice that for simple linear regression p = 2. Thus, we get the formula for MSE that we introduced in the context of one predictor.
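A minimal sketch of this in R, on simulated data with invented names, checks MSE = SSE/(n − p) against the software's own estimate of σ^2:

##########
# Simulated data; all names here are assumptions for illustration.
set.seed(2)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(30)
fit <- lm(y ~ x1 + x2, data = d)

sse <- sum(resid(fit)^2)     # SSE = sum of squared errors
n   <- nrow(d)               # sample size
p   <- length(coef(fit))     # number of beta coefficients (here 3)
sse / (n - p)                # MSE = SSE / (n - p)
summary(fit)$sigma^2         # the same value, as reported by lm()
##########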

In the case of two predictors, the estimated regression equation yields a plane (as opposed to a line in the simple linear regression setting). For more than two predictors, the estimated regression equation yields a hyperplane.



    Predicted Values and Residuals

A predicted value is calculated as ŷi = b0 + b1xi,1 + b2xi,2 + ... + bp−1xi,p−1, where the b values come from statistical software and the x-values are specified by us.

A residual (error) term is calculated as ei = yi − ŷi, the difference between an actual and a predicted value of y.

A plot of residuals versus predicted values ideally should resemble a horizontal random band. Departures from this form indicate difficulties with the model and/or data.

Other residual analyses can be done exactly as we did in simple regression. For instance, we might wish to examine a normal probability plot (NPP) of the residuals. Additional plots to consider are plots of residuals versus each x-variable separately. This might help us identify sources of curvature or nonconstant variance.
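A brief sketch of these residual checks in R, again on simulated data with invented names:

##########
# Simulated data; names are assumptions for illustration.
set.seed(3)
d <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(40)
fit <- lm(y ~ x1 + x2, data = d)

plot(fitted(fit), resid(fit))  # should resemble a horizontal random band
abline(h = 0, lty = 2)

qqnorm(resid(fit))             # normal probability plot of the residuals
qqline(resid(fit))

plot(d$x1, resid(fit))         # residuals versus each x-variable separately
plot(d$x2, resid(fit))
##########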

    Interaction Terms

An interaction term captures a coupling or combined effect of two or more independent variables.

Suppose we have a response variable (Y) and two predictors (X1 and X2). Then, the regression model with an interaction term is written as

Y = β0 + β1X1 + β2X2 + β3X1X2 + ε.

Suppose you also have a third predictor (X3). Then, the regression model with all interaction terms is written as

Y = β0 + β1X1 + β2X2 + β3X3 + β4X1X2 + β5X1X3 + β6X2X3 + β7X1X2X3 + ε.

In a model with more predictors, you can imagine how much the model grows by adding interactions. Just make sure that you have enough observations to cover the degrees of freedom used in estimating the corresponding regression coefficients!



For each observation, the value of an interaction term is found by multiplying the recorded values of the predictor variables in the interaction.

In models with interaction terms, the significance of the interaction term should always be assessed first, before proceeding with significance testing of the main variables.

If one of the main variables is removed from the model, then the model should not include any interaction terms involving that variable.
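In R's formula syntax, x1:x2 denotes a single interaction term, while x1*x2 expands to the main effects plus all their interactions; a hedged sketch with simulated data and invented names:

##########
# Simulated data; names are assumptions for illustration.
set.seed(4)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
d$y <- 1 + d$x1 + d$x2 + 0.5 * d$x1 * d$x2 + rnorm(50)

# Two predictors plus their interaction:
# y ~ x1 * x2 is shorthand for y ~ x1 + x2 + x1:x2.
fit2 <- lm(y ~ x1 * x2, data = d)

# Three predictors with all two- and three-way interactions; this expands
# to the eight-term model shown above (beta0 through beta7).
fit3 <- lm(y ~ x1 * x2 * x3, data = d)
summary(fit3)
##########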

    7.2 Significance Testing of Each Variable

Within a multiple regression model, we may want to know whether a particular x-variable is making a useful contribution to the model. That is, given the presence of the other x-variables in the model, does a particular x-variable help us predict or explain the y-variable? For instance, suppose that we have three x-variables in the model. The general structure of the model could be

Y = β0 + β1X1 + β2X2 + β3X3 + ε. (7.2)

As an example, to determine whether variable X1 is a useful predictor variable in this model, we could test

H0 : β1 = 0

HA : β1 ≠ 0.

If the null hypothesis above were the case, then a change in the value of X1 would not change Y, so Y and X1 are not related. Also, we would still be left with variables X2 and X3 being present in the model. When we cannot reject the null hypothesis above, we should say that we do not need variable X1 in the model given that variables X2 and X3 will remain in the model. In general, the interpretation of a slope in multiple regression can be tricky. Correlations among the predictors can change the slope values dramatically from what they would be in separate simple regressions.
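The sketch below illustrates this last point on simulated data (all names are assumptions): when x1 and x2 are correlated, the slope on x1 from a simple regression can differ markedly from its slope in the multiple regression.

##########
# Simulated data with correlated predictors; names are assumptions.
set.seed(5)
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.6)  # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(200)

coef(lm(y ~ x1))       # simple-regression slope absorbs part of x2's effect
coef(lm(y ~ x1 + x2))  # slope on x1 is near 2, given x2 in the model
##########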

To carry out the test, statistical software will report p-values for all coefficients in the model. Each p-value will be based on a t-statistic calculated as

    t = (sample coefficient - hypothesized value) / standard error of coefficient.



For our example above, the t-statistic is:

t = (b1 − 0) / s.e.(b1) = b1 / s.e.(b1).

Note that the hypothesized value is usually just 0, so this portion of the formula is often omitted.
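A small sketch of this calculation in R, pulling b1 and s.e.(b1) from the summary of a fit to simulated data (names are invented):

##########
# Simulated data; names are assumptions for illustration.
set.seed(6)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
d$y <- 1 + 2 * d$x1 + rnorm(30)
fit <- lm(y ~ x1 + x2, data = d)

ctab <- summary(fit)$coefficients
b1   <- ctab["x1", "Estimate"]
se1  <- ctab["x1", "Std. Error"]
b1 / se1                  # matches the reported t value for x1
ctab["x1", "t value"]
##########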

    7.3 Examples

Example 1: Heat Flux Data Set

The data are from n = 29 homes used to test solar thermal energy. The variables of interest for our model are y = total heat flux, and x1, x2, and x3, which are the focal points for the east, north, and south directions, respectively. There are two other measurements in this data set: another measurement of the focal points and the time of day. We will not utilize these predictors at this time. Table 7.1 gives the data used for this analysis.

    The regression model of interest is

yi = β0 + β1xi,1 + β2xi,2 + β3xi,3 + εi.

Figure 7.1(a) gives a histogram of the residuals. While the shape is not completely bell-shaped, it is again not suggestive of any severe departures from normality. Figure 7.1(b) gives a plot of the residuals versus the fitted values. Again, the values appear to be randomly scattered about 0, suggesting constant variance.

    The following provides the t-tests for the individual regression coefficients:

    ##########

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 389.1659 66.0937 5.888 3.83e-06 ***

    east 2.1247 1.2145 1.750 0.0925 .

    north -24.1324 1.8685 -12.915 1.46e-12 ***

    south 5.3185 0.9629 5.523 9.69e-06 ***

    ---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 8.598 on 25 degrees of freedom



[Figure 7.1: (a) Histogram of the residuals for the heat flux data set. (b) Residuals versus fitted values.]

    Multiple R-Squared: 0.8741, Adjusted R-squared: 0.859

    F-statistic: 57.87 on 3 and 25 DF, p-value: 2.167e-11

    ##########

At the α = 0.05 significance level, both north and south appear to be statistically significant predictors of heat flux. However, east is not (with a p-value of 0.0925). While we could claim this is a marginally significant predictor, we will rerun the analysis by dropping the east predictor.
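A hedged sketch of how these two fits might be produced in R; the data frame heatflux and its column names are assumptions, and the values below are simulated placeholders standing in for the Table 7.1 data so that the code runs:

##########
# 'heatflux' stands in for the n = 29 data set of Table 7.1; the values
# here are simulated placeholders, not the real measurements.
set.seed(501)
heatflux <- data.frame(east  = runif(29, 31, 40),
                       north = runif(29, 15, 20),
                       south = runif(29, 32, 40))
heatflux$flux <- 389 + 2.1 * heatflux$east - 24.1 * heatflux$north +
  5.3 * heatflux$south + rnorm(29, sd = 8.6)

fit_full <- lm(flux ~ east + north + south, data = heatflux)
summary(fit_full)

# Drop 'east' and refit; update() removes that term from the formula.
fit_reduced <- update(fit_full, . ~ . - east)
summary(fit_reduced)
##########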

The following provides the t-tests for the individual regression coefficients for the newly suggested model:

    ##########

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 483.6703 39.5671 12.224 2.78e-12 ***

    north -24.2150 1.9405 -12.479 1.75e-12 ***

    south 4.7963 0.9511 5.043 3.00e-05 ***

    ---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 8.932 on 26 degrees of freedom



    Multiple R-Squared: 0.8587, Adjusted R-squared: 0.8478

    F-statistic: 79.01 on 2 and 26 DF, p-value: 8.938e-12

    ##########

    The residual plots still appear okay (they are not included here) and