
Chapter 13: Constructing a Multiple Regression Model
Hildebrand, Ott and Gray
Basic Statistical Ideas for Managers, Second Edition


Learning Objectives for Ch. 13

• This chapter presents a four-step process for building a multiple linear regression model:

• STEP ONE:
  • Initial Selection of Possible Predictor Variables
  • Incorporating Qualitative Independent Variables by Using Dummy or Indicator Variables
  • Incorporating Lagged Predictor Variables when there is Time-Series Data


Learning Objectives for Ch. 13

• STEP TWO:
  • Addressing Nonlinearity and Interaction Among the Variables
• STEP THREE:
  • Choosing Predictors Using Stepwise and Other Methods
• STEP FOUR:
  • Checking the Assumptions of Linearity, Heteroscedasticity, Normality, and Independence by Doing a Residual Analysis
  • Validating the Model


Section 13.1: Selecting Possible Independent Variables (Step 1)


13.1 Selecting Possible Independent Variables (Step 1)

• The basic purpose of a multiple regression model is to predict a response variable, Y, using two or more predictor or independent variables, xj, j = 1, 2, …, k.

• The objective is to produce a reliable and accurate estimate or prediction of Y.


13.1 Selecting Possible Independent Variables (Step 1)

• This will be affected by which independent variables are chosen.

• The overarching principle is parsimony: build the simplest model possible consistent with producing a good estimate of Y.
  • This will reduce complexity and simplify interpretation.
  • It will also save data collection costs.


13.1 Selecting Possible Independent Variables (Step 1)

• There is no substitute for a thorough understanding of the field in selecting good independent variables.

• In particular, any underlying theory could be very useful in identifying potential independent variables.

• A challenge, however, will be collinearity, as some predictors may be closely linked with others.


13.1 Selecting Possible Independent Variables (Step 1)

• Collinearity exists when the independent variables are correlated among themselves.

• This makes the interpretation of a partial slope ("the change in Y per unit change in x, holding all other x's constant") impossible.

• If several x's are correlated, then we cannot "hold some constant" while we increase another.


13.1 Selecting Possible Independent Variables (Step 1)

• Suppose candidates for the predictor variables have been determined.

• An ad-hoc assessment of their suitability as predictors, and of the extent of any collinearity, is found by:
  • a correlation matrix of all the variables;
  • a matrix plot for every pair of variables.

• These concepts will be explained in the context of Example 13.14.


13.1 Selecting Possible Independent Variables (Step 1)

Example 13.14: Data are collected for 20 independent pharmacies in an attempt to predict prescription volume (sales/month).

• The independent variables are:
  • total floor space (FLOOR_SP);
  • percentage of floor space allocated to the prescription department (PRESC_RX);
  • number of available parking spaces (PARKING);
  • whether or not the pharmacy is located in a shopping center (SHOPCNTR); and,
  • per-capita income of the surrounding community (INCOME).


13.1 Selecting Possible Independent Variables (Step 1)

• The correlation matrix for Example 13.14 follows:

Correlations: VOLUME, FLOOR_SP, PRESC_RX, PARKING, SHOPCNTR, INCOME

           VOLUME  FLOOR_SP  PRESC_RX  PARKING  SHOPCNTR
FLOOR_SP    0.183
            0.440
PRESC_RX   -0.663    -0.751
            0.001     0.000
PARKING    -0.069     0.504    -0.328
            0.772     0.023     0.158
SHOPCNTR   -0.203     0.710    -0.341    0.482
            0.392     0.000     0.141    0.031
INCOME      0.385     0.863    -0.845    0.393     0.645
            0.094     0.000     0.000    0.087     0.002

Cell Contents: Pearson correlation
               P-Value


13.1 Selecting Possible Independent Variables (Step 1)

• A correlation matrix lists the correlation of every possible pair of variables.

• Because the matrix is symmetric, only the lower left part of the matrix is displayed.

• The upper right part of the matrix is a mirror image of the lower left.

• Because a variable is perfectly correlated with itself, these values are not shown.


13.1 Selecting Possible Independent Variables (Step 1)

• The values in the matrix are the correlations of:
  • the variable indicated in the row, and
  • the variable indicated in the column.

• Example 13.14: The correlation between the predictors "Parking" and "Presc_Rx" is -0.328. This indicates that as the percentage of floor space allocated to the prescription department increases, the number of available parking spaces tends to decrease.


13.1 Selecting Possible Independent Variables (Step 1)

• Low correlation (close to 0) between a pair of x’s indicates little to no collinearity.

• If two or more predictors are highly correlated, then we might consider using just one of them.

• Example 13.14: Two concerns here are the high correlations between:
  • "Income" and "Floor Space" [.863], and
  • "Income" and "Presc_Rx" [-.845].

• The predictor that has the highest correlation with Y (“Volume”) is “Presc_Rx.”


13.1 Selecting Possible Independent Variables (Step 1)

• These findings tell us that two pairs of predictor variables are highly correlated: a potential collinearity problem.

• We may want to reconsider this set of predictors.


13.1 Selecting Possible Independent Variables (Step 1)

• The Matrix plot is the visual equivalent of the correlation matrix.

• For each pair of variables, a scatterplot is generated.

• The plots are scanned visually:
  • A distinct linear pattern indicates a correlated pair of variables.
  • A scattering without any obvious pattern indicates a pair of variables with little or no correlation.


13.1 Selecting Possible Independent Variables (Step 1)

• The Matrix Plot for Example 13.14 follows:

[Matrix plot for Example 13.14: a scatterplot for every pair of VOLUME, FLOOR_SP, PRESC_RX, PARKING, SHOPCNTR, and INCOME]


13.1 Selecting Possible Independent Variables (Step 1)

Example 13.14:
• That the correlation between "Income" and "Parking" is only .393, compared to the correlation of .863 between "Income" and "Floor Space," is evident in their scatterplots.

• The scatterplot between “Income” and “Parking” indicates a nonlinear relation between this pair of predictors.

• That the predictor “Presc_Rx” has the highest correlation of any of the predictors with Y is evident in the first row of the Matrix Plot.

• The unique appearance of the scatterplot between “Volume” and “ShopCntr” occurs because a pharmacy is either in a shopping center (1) or it is not (0).


13.1 Selecting Possible Independent Variables (Step 1)

• If the correlation matrix and Matrix plot indicate some of the independent variables are correlated with the others, we need to reconsider these variables.

• One possibility is to combine some of them into a single predictor.

• The correlation matrix and matrix plot may not show the full extent of a collinearity problem because only pairs of predictors are considered. That is why the VIF should be used (Chapter 12).
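These screening steps are easy to reproduce outside Minitab. A minimal Python sketch using pandas and statsmodels follows; the file name "pharmacy.csv" and its column layout are assumptions made for illustration, not part of the original example.

# Sketch: correlation matrix, matrix plot, and VIFs for data like
# Example 13.14's. The file "pharmacy.csv" is hypothetical.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("pharmacy.csv")   # columns assumed: VOLUME, FLOOR_SP,
                                   # PRESC_RX, PARKING, SHOPCNTR, INCOME

# Correlation matrix of every pair of variables
print(df.corr().round(3))

# Matrix plot: a scatterplot for each pair of variables
pd.plotting.scatter_matrix(df, figsize=(8, 8))

# VIFs catch collinearity that pairwise correlations can miss (Chapter 12)
X = sm.add_constant(df.drop(columns="VOLUME"))
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, i), 2))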


Section 13.2: Using Qualitative Predictors: Dummy Variables (Step 1)


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

• Up to now, we have exclusively used quantitative variables in regression. In Example 13.14, total floor space is a quantitative variable.

• Another type of independent variable is a qualitative variable. In Example 13.14, each of the 20 pharmacies either is, or is not, located in a shopping center.

• A dummy or indicator variable is used to model this.

Page 8: Chapter 13: Constructing a Multiple Regression Model...Chapter 13: Constructing a Multiple Regression Model Hildebrand, Ott and Gray Basic Statistical Ideas for Managers Second Edition


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

• When there are only two categories, the dummy variable represents the presence or absence of a particular category.

• In Example 13.14, the variable takes on the value 1 if the pharmacy is located in a shopping center, and the value 0 if the pharmacy is not located in a shopping center.

• In Example 13.14, suppose “FLOOR_SP” and “SHOPCNTR” are the only predictors.


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

The Minitab output follows:

Regression Analysis: VOLUME versus FLOOR_SP, SHOPCNTR

The regression equation is
VOLUME = 7.03 + 0.00352 FLOOR_SP - 8.26 SHOPCNTR

Predictor    Coef       SE Coef    T      P      VIF
Constant     7.033      5.350      1.31   0.206
FLOOR_SP     0.003517   0.001585   2.22   0.040  2.0
SHOPCNTR    -8.256      3.656     -2.26   0.037  2.0

S = 5.72922   R-Sq = 25.7%   R-Sq(adj) = 16.9%


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

• For a pharmacy in a shopping center,

VOLUME = 7.03 + 0.00352 FLOOR_SP - 8.26
       = (7.03 - 8.26) + 0.00352 FLOOR_SP
       = -1.23 + 0.00352 FLOOR_SP

• For a pharmacy not in a shopping center,

VOLUME = 7.03 + 0.00352 FLOOR_SP

• -8.26 is the estimated difference in sales volume between a pharmacy located in a shopping center (SHOPCNTR = 1) and one not located in a shopping center (SHOPCNTR = 0) for any specified value of Floor Space.
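The same fit is easy to reproduce with statsmodels' formula interface. The sketch below enters only the five worksheet rows shown on a later slide so that it runs as-is; the full 20-pharmacy data set would be needed to reproduce the Minitab coefficients (7.03, 0.00352, -8.26).

# Sketch: regression of VOLUME on FLOOR_SP and the SHOPCNTR dummy.
# Only five pharmacies are entered here, so the numbers will differ
# from the full-data Minitab output above.
import pandas as pd
import statsmodels.formula.api as smf

df5 = pd.DataFrame({
    "VOLUME":   [22, 19, 24, 28, 18],
    "FLOOR_SP": [4900, 5800, 5000, 4400, 3850],
    "SHOPCNTR": [1, 1, 1, 0, 0],
})

fit = smf.ols("VOLUME ~ FLOOR_SP + SHOPCNTR", data=df5).fit()
print(fit.params)   # SHOPCNTR's coefficient is the estimated shopping-center
                    # effect at any fixed floor space, as described above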


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

• When there are only two categories, it would be wrong to use a dummy variable for each category.

• In Example 13.14, it would be wrong to have two dummy variables:
  Yes = 1, if a pharmacy is in a shopping center; = 0, otherwise.
  No  = 1, if a pharmacy is not in a shopping center; = 0, otherwise.


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

• The worksheet for the first 5 pharmacies follows:

Pharmacy   VOLUME   FLOOR_SP   Yes   No
1          22       4900       1     0
2          19       5800       1     0
3          24       5000       1     0
4          28       4400       0     1
5          18       3850       0     1

• The dummy variables “Yes” and “No” add to one for each pharmacy.

Since these two variables have a correlation of -1, there is severe multicollinearity.


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

The Minitab output follows:

Regression Analysis: VOLUME versus FLOOR_SP, Yes, No

* No is highly correlated with other X variables
* No has been removed from the equation.

The regression equation is

VOLUME = 7.03 + 0.00352 FLOOR_SP - 8.26 Yes

• Minitab recognizes this multicollinearity and removes the “No” variable.


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

• Although a qualitative variable can be multi-level, interpretation can be problematic.

• Suppose we want to predict the GPA of an undergraduate student where the qualitative variable is class (freshman, sophomore, junior, senior).

• One possibility is to use an independent variable coded as:

1 = freshman    2 = sophomore    3 = junior    4 = senior

• The codes used were arbitrary.


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

• If Class is the only predictor of GPA, then the population model is:

E(GPA) = β0 + β1x

• For freshmen, E(GPA) = β0 + β1

For sophomores, E(GPA) = β0 + 2β1

For juniors, E(GPA) = β0 + 3β1

For seniors, E(GPA) = β0 + 4β1


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

• This implies that the change in E(Y) is the same when going from freshmen to sophomores as when going from sophomores to juniors.

• It is better to create three dummy variables for three of the classes, e.g.:

x1 = 1, if freshman; = 0, if not
x2 = 1, if sophomore; = 0, if not
x3 = 1, if junior; = 0, if not


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

• Then each variable is either 1 or 0 depending on whether the student is or is not in that class.

• For example, a freshman would have x1 = 1 and x2 = x3 = 0.

• Any student with a zero for all three variables would be in the fourth group.

• Specifically, a student with x1 = x2 = x3 = 0 would be a senior.


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

• With three dummy variables, the population model is:

  E(GPA) = β0 + β1x1 + β2x2 + β3x3

• The model for each class is:
  Freshmen:    E(GPA) = β0 + β1
  Sophomores:  E(GPA) = β0 + β2
  Juniors:     E(GPA) = β0 + β3
  Seniors:     E(GPA) = β0

• β2 − β1 is the differential effect between Sophomores and Freshmen.

• β1 is the differential effect between Freshmen and Seniors.
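As a sketch of this coding in Python: pd.get_dummies builds the 0/1 columns, and leaving the senior column out makes seniors the baseline, exactly as on the slide. The GPA values are invented purely to make the example runnable.

# Sketch: three class dummies with seniors as the baseline group.
import pandas as pd
import statsmodels.formula.api as smf

students = pd.DataFrame({
    "gpa":  [3.1, 2.8, 3.4, 3.0, 3.6, 2.9, 3.3, 3.5],   # made-up values
    "year": ["freshman", "sophomore", "junior", "senior"] * 2,
})

dummies = pd.get_dummies(students["year"]).astype(int)
students["x1"] = dummies["freshman"]
students["x2"] = dummies["sophomore"]
students["x3"] = dummies["junior"]   # seniors: x1 = x2 = x3 = 0

fit = smf.ols("gpa ~ x1 + x2 + x3", data=students).fit()
print(fit.params)   # Intercept = senior mean; x1..x3 = differentials vs. seniors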


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

• Testing whether or not two population means are equal (Section 9.1) can also be done using regression analysis with a dummy variable.

Exercise 9.5: Company officials are concerned about the length of time a particular drug retains its potency. A random sample (sample 1) of 10 bottles of the product is drawn from current production and analyzed for potency. A second sample (sample 2) is obtained, stored for one year, and then analyzed. The readings obtained are:

Sample 1:  10.6  10.0  10.2  10.7  10.6   9.8  10.8  10.3  10.5  10.2
Sample 2:   9.9   9.8   9.6   9.5   9.7  10.1  10.2  10.1   9.6   9.8

• Management is interested in determining if mean potency has decreased after one year.


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

• Assumption: Independent, random samples from two normal populations with parameters (µ1, σ1) and (µ2, σ2), respectively.

• Furthermore, assume σ1² and σ2² are unknown, but equal.

• For Exercise 9.5:
  H0: µ1 = µ2, or µ1 − µ2 = 0
  Ha: µ2 < µ1, or µ2 − µ1 < 0


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

• Population Regression Model for Exercise 9.5:

E(Y) = β0 + β1x

where x = 1, if Sample 1
      = 0, if Sample 2

• The null and research hypotheses become:

H0: β1 = 0 vs. Ha: β1 > 0

• The Minitab output follows:


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

Regression Analysis: POTENCY versus x1

The regression equation is
POTENCY = 9.83 + 0.540 x1

Predictor   Coef      SE Coef   T        P
Constant    9.83000   0.09012   109.07   0.000
x1          0.5400    0.1275    4.24     0.000

S = 0.284995   R-Sq = 49.9%   R-Sq(adj) = 47.1%

Unusual Observations
Obs   x1     POTENCY   Fit       SE Fit   Residual   St Resid
5     1.00   9.8000    10.3700   0.0901   -0.5700    -2.11R

R denotes an observation with a large standardized residual.

• Conclusion: Since p-value = .000/2 < .05, reject H0: β1 = 0
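The equivalence can be checked directly. The sketch below runs both the pooled two-sample t test of Section 9.1 and the dummy-variable regression, using the potency readings listed earlier; both give t = 4.24.

# Sketch: Exercise 9.5 two ways — pooled t test and dummy regression.
import numpy as np
import statsmodels.api as sm
from scipy import stats

sample1 = np.array([10.6, 10.0, 10.2, 10.7, 10.6, 9.8, 10.8, 10.3, 10.5, 10.2])
sample2 = np.array([9.9, 9.8, 9.6, 9.5, 9.7, 10.1, 10.2, 10.1, 9.6, 9.8])

# Pooled t test; equal_var=True matches the equal-variance assumption above
t, p_two_sided = stats.ttest_ind(sample1, sample2, equal_var=True)
print(t, p_two_sided / 2)          # halve the p-value for the one-sided test

# The same test as a regression on the dummy x (1 = current, 0 = stored)
y = np.concatenate([sample1, sample2])
x = np.array([1] * 10 + [0] * 10)
fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params)                  # 9.83 and 0.540, as in the Minitab output
print(fit.tvalues)                 # the slope's t statistic is the same 4.24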


13.2 Using Qualitative Predictors: Dummy Variables (Step 1)

• The value of the T-statistic (4.24) is the same as in Section 9.1.
• The NPP shows the requirement that the standardized residuals be normally distributed has been met.

[Normal probability plot of the standardized residuals from the regression approach to Exercise 9.5: Mean ≈ 0, StDev = 1.026, N = 20, AD = 0.291, P-Value = 0.572]


Section 13.3: Lagged Predictor Variables (Step 1)


13.3 Lagged Predictor Variables (Step 1)

• For time-series data, a regression model is frequently used to make forecasts.

• For example, a regression model to forecast monthly paint sales of a home-supply store chain for a region could be:

  Sales_t = β̂0 + β̂1(AdvExp_t) + β̂2(Income_t),

  where Income denotes Median Household Income of the region.

• There are two difficulties with this model.


13.3 Lagged Predictor Variables (Step 1)

• First, future estimates of Sales depend on future estimates of Adv Exp and Income for that time period.

• Secondly, it is likely that income for earlier months, rather than current income, will have more of an effect on current sales.


13.3 Lagged Predictor Variables (Step 1)

• A major question is the number of lags to use.
• Fight the temptation to use many lags:
  • The lagged variables are likely to be severely correlated.
  • Using multiple t-tests increases the overall probability of Type I error.
  • Each lag results in the loss of one observation.

• “Knowledge of the basic process involved is almost always useful in choosing lags.” (Hildebrand, Ott and Gray)

• To illustrate these concepts, consider Exercise 13.42.


13.3 Lagged Predictor Variables (Step 1)

Exercise 13.42: An auto-supply store had 60 months of data on variables that were thought to be relevant to sales. The data include monthly sales in thousands of dollars (SALES), average daily low temperature in degrees Fahrenheit (LOWTEMP), advertising expenditure for the month in thousands of dollars (ADEXP), used-car sales in the previous month (USEDCAR), and month number (MONTH).


13.3 Lagged Predictor Variables (Step 1)

• The variable USEDCAR is already in lagged form.

• Two new lagged variables were created:

LAG(USCA) is actually a two-month lagged variable for used-car sales.

LAG(ADEXP) is a one-month lagged variable for advertising expenditures.
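In pandas, each lagged predictor is one .shift() call. A sketch, assuming the 60 months sit in a hypothetical file "autosupply.csv" with the column names above:

# Sketch: building the lagged predictors of Exercise 13.42.
import pandas as pd

store = pd.read_csv("autosupply.csv")   # hypothetical file name

# .shift(1) moves a series down one month, so row t sees month t-1;
# shifting USEDCAR (already a one-month lag) yields a two-month lag.
store["Lag(ADEXP)"] = store["ADEXP"].shift(1)
store["Lag(USCA)"] = store["USEDCAR"].shift(1)

# The first row now has missing lags, which is why Minitab reports
# "59 cases used, 1 cases contain missing values."
print(store.head(3))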


13.3 Lagged Predictor Variables (Step 1)

• It was decided to use the following predictors:

• LOWTEMP: Drastic weather conditions for the current month immediately impact sales.

• LAG(ADEXP): Most sales at an auto-supply store are need-based rather than impulse-based. Meaningful advertising is retained by a potential customer for a future purchase.

• LAG(USCA): Since a warranty covers the repairs for the first 30 days, the impact of used-car purchasers won’t be felt for 2 months.

• The Minitab output follows:


13.3 Lagged Predictor Variables (Step 1)

Regression Analysis: SALES versus LOWTEMP, Lag(ADEXP), Lag(USCA)

The regression equation is
SALES = 352 - 4.00 LOWTEMP + 5.06 Lag(ADEXP) + 0.0154 Lag(USCA)

59 cases used, 1 cases contain missing values

Predictor    Coef       SE Coef    T      P      VIF
Constant     352.3      117.0      3.01   0.004
LOWTEMP     -4.0023     0.6858    -5.84   0.000  2.8
Lag(ADEXP)   5.0649     0.6640     7.63   0.000  1.8
Lag(USCA)    0.015412   0.007440   2.07   0.043  2.1

S = 54.9720   R-Sq = 83.2%   R-Sq(adj) = 82.3%
Durbin-Watson statistic = 0.410629


13.3 Lagged Predictor Variables (Step 1)

• The good news:
  • Each predictor adds statistically detectable predictive value, given the others.
  • There is no multicollinearity problem.
• The bad news:
  • There is a serious autocorrelation problem, since the Durbin-Watson statistic is 0.411 (Section 13.7).


Section 13.4: Nonlinear Regression Models (Step 2)


13.4 Nonlinear Regression Models

• For 2 predictors, the general form of the first-order fitted model is:

  Ŷ = β̂0 + β̂1x1 + β̂2x2

• The residual, Y − Ŷ, removes the linear effect of x1 and x2.

• If the fitted model should have included an x1² term, a plot of the SRi vs. x1 will show curvature.

• An example illustrating this is in Section 13.6.

• Instead of including an x1² term, another approach is to transform one or more of the existing variables.


13.4 Nonlinear Regression Models

• There is a difference between constant percentage growth and constant additive growth.

• Suppose initial sales of a company are $100 million. The difference between a constant percentage growth of 8% per year versus a constant additive growth of $8 million per year is shown in the table below.

Year          0      1      2      3      4
8% growth     100.0  108.0  116.6  126.0  136.0
$8M growth    100.0  108.0  116.0  124.0  132.0


13.4 Nonlinear Regression Models

• In time series data, the response variable Y frequently changes at an increasing rate.

• For example, if a present amount (P) is invested at a nominal annual interest rate (r) for t years, then the future amount after t years (Ft) is:

Ft = P e^(rt),

under continuous compounding.


13.4 Nonlinear Regression Models

• This nonlinear relation becomes linear after a logarithmic transformation:

ln(Ft) = ln(P) + rt

or

Y = β0 + β1t


13.4 Nonlinear Regression Models

• The general form of the original model is:

Y = β0 e^(β1 x)

• A logarithmic transformation yields

  ln(Y) = ln(β0) + β1x, or
  Y* = β0* + β1x

• For this model, β1 is the percent change in Y when x changes by 1 unit.
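A quick sketch of the transformation: regressing ln(Y) on t for the 8%-growth series from the earlier table recovers the growth rate as the slope, since ln(1.08) ≈ 0.077.

# Sketch: estimating a constant-percentage growth rate on the log scale.
import numpy as np
import statsmodels.api as sm

t = np.arange(0, 5)
sales = 100.0 * 1.08 ** t           # 100.0, 108.0, 116.6, 126.0, 136.0

fit = sm.OLS(np.log(sales), sm.add_constant(t)).fit()
b0, b1 = fit.params
print(np.exp(b0), b1)               # ≈ 100 and ≈ 0.077 = ln(1.08)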


13.4 Nonlinear Regression Models

• “A logarithmic transformation is only one possibility. It is, however, a particularly useful one, because logarithms convert a multiplicative relation to an additive one. A natural logarithm (base e = 2.7182818), often denoted ln(y), is especially useful, because the results are interpretable as percentage changes.

• For example, if a prediction of high school teachers' salaries yields predicted ln(salary) = constant + .042 (years' experience) + other terms, then an additional year's experience, other terms held constant, predicts a 4.2 percent increase in salary. This guideline isn't perfect, but very close for values less than 0.2 or so." (Hildebrand, Ott and Gray)


13.4 Nonlinear Regression Models

• Another example of a nonlinear model is the Cobb-Douglas production function:

Y = c I^α k^β,

where Y is production, I is labor input, k is capital input,and α and β are unknown parameters.

• After a logarithmic transformation, the nonlinear relation becomes linear:

ln(Y) = ln(c) + α ln(I) + β ln(k),

or

Y = β0 + β1 x1 + β2 x2,


13.4 Nonlinear Regression Models

• The general form of the original model is:

Y = β0 (x1^β1)(x2^β2)(x3^β3) … (xk^βk).

• A logarithmic transformation yields:

  ln(Y) = ln(β0) + β1 ln(x1) + … + βk ln(xk), or
  Y* = β0* + β1x1* + … + βkxk*.


13.4 Nonlinear Regression Models

• When there is only one predictor, β1 is the percentage change in Y per percentage change in x.

• When x = price and Y = demand, β1 is the price elasticity of demand.
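A sketch of the one-predictor log-log case: demand data are simulated here from Y = 500·price^(-1.2), so the fitted slope returns the assumed elasticity of -1.2.

# Sketch: price elasticity of demand from a log-log regression.
import numpy as np
import statsmodels.api as sm

price = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
demand = 500.0 * price ** -1.2      # simulated, noise-free demand curve

fit = sm.OLS(np.log(demand), sm.add_constant(np.log(price))).fit()
print(fit.params)                   # ln(500) ≈ 6.21 and the elasticity -1.2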


13.4 Nonlinear Regression Models

• In both models, the error term (ε) was omitted.

• For the first transformed model to have an additive error term, the original model needs to be:

Y = β0 e^(β1 x) e^ε.

• For the second transformed model to have an additive error term, the original model needs to be:

Y = β0 x1^β1 e^ε.


13.4 Nonlinear Regression Models

• In converting a nonlinear model to a linear one, one seldom thinks about the format of ε in the original model.

• It is important to note that the error term in the transformed model must satisfy all of the usual conditions.

• If the original model was Y = β0 x1^β1 e^ε, the transformed model is:

  ln(Y) = ln(β0) + β1 ln(x1) + ε.

• In this case, ε must satisfy all of the usual conditions.


13.4 Nonlinear Regression Models

• Another type of nonlinearity arises when the model includes an interaction term: the product of 2 or more predictors.

• For a first-order model with k = 2, the general form of the fitted model is:

  Ŷ = β̂0 + β̂1x1 + β̂2x2.

  The change in Ŷ per unit change in x1 is β̂1, when x2 is held constant.


13.4 Nonlinear Regression Models

• For a model with an interaction term:

  Ŷ = β̂0 + β̂1x1 + β̂2x2 + β̂12x1x2.

  The change in Ŷ per unit change in x1 is β̂1 + β̂12x2, which depends on the level of x2.

• An interaction term, with a dummy variable as one of the predictors, adds considerable flexibility in model building.


13.4 Nonlinear Regression Models

Example 13.14: Data are collected for 20 independent pharmacies in an attempt to predict prescription volume (sales/month).

• The independent variables used here are:
  • percentage of floor space allocated to the prescription department (PRESC_RX); and
  • whether or not the pharmacy is located in a shopping center (SHOPCNTR).

• From the Minitab output that follows, we see each of the predictors is statistically significant.


13.4 Nonlinear Regression Models

Regression Analysis: VOLUME versus PRESC_RX, SHOPCNTR

The regression equation is
VOLUME = 30.9 - 0.400 PRESC_RX - 5.97 SHOPCNTR

Predictor    Coef       SE Coef   T       P      VIF
Constant     30.869     2.618     11.79   0.000
PRESC_RX    -0.40046    0.07412   -5.40   0.000  1.1
SHOPCNTR    -5.970      1.887     -3.16   0.006  1.1

S = 3.94742   R-Sq = 64.7%   R-Sq(adj) = 60.6%

Unusual Observations
Obs   PRESC_RX   VOLUME   Fit      SE Fit   Residual   St Resid
5     13.0       18.000   25.663   1.813    -7.663     -2.19R
18    42.0        6.000   14.050   1.424    -8.050     -2.19R
R denotes an observation with a large standardized residual.


13.4 Nonlinear Regression Models

• A question of interest is the possible need for an interaction term.

• For a model with an interaction term:

  Ŷ = β̂0 + β̂1x1 + β̂2x2 + β̂12x1x2,

  where Y = VOLUME, x1 = PRESC_RX, and x2 = SHOPCNTR.


13.4 Nonlinear Regression Models

• For a pharmacy in a shopping center:

  Ŷ = (β̂0 + β̂2) + (β̂1 + β̂12)x1.

• For a pharmacy not in a shopping center:

  Ŷ = β̂0 + β̂1x1.

• The interaction term affects the intercept and the slope.

• A visual assessment of the need for interaction is gleaned from a scatterplot with a regression line for each group, which follows.


13.4 Nonlinear Regression Models

[Scatterplot of VOLUME vs. PRESC_RX with separate fitted lines for SHOPCNTR = 0 and SHOPCNTR = 1]


13.4 Nonlinear Regression Models

• Because the separate lines are nearly parallel, there is no evidence of interaction.

• This is confirmed by running a regression model with an interaction term and examining its p-value.

For such a model, the p-value = 0.678 (shown below).


13.4 Nonlinear Regression Models

Regression Analysis: VOLUME versus PRESC_RX, SHOPCNTR, (PRE)(SHP)

The regression equation is
VOLUME = 29.8 - 0.368 PRESC_RX - 4.22 SHOPCNTR - 0.064 (PRE)(SHP)

Predictor     Coef      SE Coef   T       P      VIF
Constant      29.846    3.613     8.26    0.000
PRESC_RX     -0.3679    0.1081   -3.40    0.004  2.3
SHOPCNTR     -4.223     4.560    -0.93    0.368  6.3
(PRE)(SHP)   -0.0643    0.1520   -0.42    0.678  5.6

S = 4.04635   R-Sq = 65.1%   R-Sq(adj) = 58.6%
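The same interaction test can be sketched in Python: the formula term PRESC_RX:SHOPCNTR plays the role of the (PRE)(SHP) product column, and "pharmacy.csv" is again a hypothetical file holding the 20 pharmacies.

# Sketch: testing the interaction term of Example 13.14.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("pharmacy.csv")    # hypothetical file name

fit = smf.ols("VOLUME ~ PRESC_RX + SHOPCNTR + PRESC_RX:SHOPCNTR", data=df).fit()
print(fit.pvalues)                  # interaction p-value ≈ 0.678 on the full data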


Section 13.5: Choosing Among Regression Models (Step 3)


13.5 Choosing Among Regression Models

• Concept: Use some objective criterion to select the independent or predictor variables to be in the model.

• Criteria to be considered:
  • Stepwise Regression (forward and backward)
  • Forward Selection
  • Backward Elimination
  • Best Subsets


13.5 Choosing Among Regression Models

• Example 13.14:
  • Assume that data are collected for 20 independent pharmacies in an attempt to predict prescription volume (sales/month).
  • The independent variables are:
    • total floor space (FLOOR_SP);
    • percentage of floor space allocated to the prescription department (PRESC_RX);
    • number of available parking spaces (PARKING);
    • whether or not the pharmacy is located in a shopping center (SHOPCNTR); and,
    • per-capita income of the surrounding community (INCOME).


13.5 Choosing Among Regression Models

• Example 13.14 (continued):
  • Portions of the data are shown below.
  • For 5 predictors, there are 2^5 = 32 models to consider.
  • Consider automatic screening procedures.

VOLUME   FLOOR_SP   PRESC_RX   PARKING   SHOPCNTR   INCOME
22       4900        9         40        1          18
19       5800       10         50        1          20
24       5000       11         55        1          17
28       4400       12         30        0          19
18       3850       13         42        0          10
…        …          …          …         …          …
 7       2900       45         30        1           9
17       2400       46         16        0           3


13.5 Choosing Among Regression Models

• Stepwise (forward and backward) Regression:
  • Let k denote the number of predictors under consideration, including interaction and quadratic terms.
  • Select an "alpha to enter" and an "alpha to remove":
    • "Alpha to enter" – the probability of a Type I error for entering a new predictor into a regression model.
    • "Alpha to remove" – the probability of a Type I error for retaining a predictor that was previously entered into the regression model.
  • In Minitab, the default value for both alphas is 0.15.


13.5 Choosing Among Regression Models

• Stepwise Details:
  • For all of the k possible simple regressions, select the predictor with the largest |t| statistic.
  • If the p-value for that predictor is greater than "alpha to enter," stop and choose a new set of predictors.
  • If the p-value for the largest |t| statistic is less than "alpha to enter," choose that predictor as the first to enter the model.
  • Let x[1] denote the first predictor to enter.


13.5 Choosing Among Regression Models

• Stepwise Details (continued)

• Now consider all possible (k − 1) two-predictor multiple regressions with x[1] as one of the two predictors.

• The second predictor to enter, denoted by x[2], is the predictor with the largest |t| statistic, provided its p-value is less than "alpha to enter." If not, stop with the simple regression obtained previously using x[1].


13.5 Choosing Among Regression Models

• Stepwise Details (continued):
  • If x[2] enters, consider the two-predictor model with x[1] and x[2].
  • Look at the t-test for x[1]. If its p-value is less than "alpha to remove," retain x[1].
  • If not, remove x[1] and use the simple regression with x[2] as the predictor for the next step.
  • The procedure continues until no new variables can be entered; a rough sketch of the idea follows.
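As a rough illustration only, here is a bare-bones forward-entry loop in Python; a full stepwise routine would also re-test previously entered predictors against "alpha to remove."

# Sketch: forward entry by smallest t-test p-value, as described above.
import pandas as pd
import statsmodels.api as sm

def forward_select(data, response, alpha_to_enter=0.15):
    remaining = [c for c in data.columns if c != response]
    chosen = []
    while remaining:
        # p-value of each candidate, given the predictors already entered
        pvals = {}
        for cand in remaining:
            X = sm.add_constant(data[chosen + [cand]])
            pvals[cand] = sm.OLS(data[response], X).fit().pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] > alpha_to_enter:
            break                    # no candidate clears "alpha to enter"
        chosen.append(best)
        remaining.remove(best)
    return chosen

df = pd.read_csv("pharmacy.csv")     # hypothetical file, as in earlier sketches
print(forward_select(df, "VOLUME"))  # expected: ['PRESC_RX', 'FLOOR_SP']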

• Consider Example 13.14. The Minitab output follows.


13.5 Choosing Among Regression Models

Stepwise Regression: VOLUME versus FLOOR_SP, PRESC_RX, ...

Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15
Response is VOLUME on 5 predictors, with N = 20

Step             1        2
Constant     25.98    48.29

PRESC_RX    -0.321   -0.582
T-Value      -3.76    -5.67
P-Value      0.001    0.000

FLOOR_SP            -0.0038
T-Value               -3.39
P-Value               0.003

S             4.84     3.84
R-Sq         43.93    66.57
R-Sq(adj)    40.82    62.63
Mallows C-p   10.2      1.6


13.5 Choosing Among Regression Models

• The first variable to enter is the percentage of floor space allocated to the prescription department (PRESC_RX). At the end of the first step, the fitted model is:

  VOLUME = 25.98 - 0.321 PRESC_RX

• The second variable to enter is total floor space (FLOOR_SP). The previously entered variable, PRESC_RX, remains. At the end of the second step, the fitted model is:

  VOLUME = 48.29 - 0.582 PRESC_RX - 0.0038 FLOOR_SP

• The stepwise procedure stops after two steps.

• The same results are obtained if “alpha to enter” and “alpha to remove” are both set at .10 or .05.


13.5 Choosing Among Regression Models

• Forward Selection:
  • At each step, enter the predictor with the largest |t| statistic, provided its p-value is less than "alpha to enter."
  • If not, stop with the previously obtained regression model.
  • In Minitab, the default value for "alpha to enter" is 0.25.
  • Consider Example 13.14 with "alpha to enter" = 0.10. The Minitab output follows.
  • For this example, the final results are the same for both procedures.


13.5 Choosing Among Regression Models

Stepwise Regression: VOLUME versus FLOOR_SP, PRESC_RX, ...

Forward selection.  Alpha-to-Enter: 0.1
Response is VOLUME on 5 predictors, with N = 20

Step             1        2
Constant     25.98    48.29

PRESC_RX    -0.321   -0.582
T-Value      -3.76    -5.67
P-Value      0.001    0.000

FLOOR_SP            -0.0038
T-Value               -3.39
P-Value               0.003

S             4.84     3.84
R-Sq         43.93    66.57
R-Sq(adj)    40.82    62.63
Mallows C-p   10.2      1.6


13.5 Choosing Among Regression Models

• Backward Elimination:
  • Begins with a model that contains all k predictors.
  • Removes them one at a time, without re-entering any.
  • Ends when the |t| statistic for each of the remaining predictors has a p-value less than "alpha to remove."
  • In Minitab, the default value for "alpha to remove" is 0.1.
  • Consider Example 13.14. The Minitab output follows.
  • For this example, all three procedures give the same model.


13.5 Choosing Among Regression Models

Stepwise Regression: VOLUME versus FLOOR_SP, PRESC_RX, ...

Backward elimination.  Alpha-to-Remove: 0.1
Response is VOLUME on 5 predictors, with N = 20

Step             1        2        3        4
Constant     42.09    43.47    42.83    48.29

FLOOR_SP   -0.0024  -0.0023  -0.0025  -0.0038
T-Value      -1.32    -1.34    -1.50    -3.39
P-Value      0.210    0.200    0.152    0.003

PRESC_RX     -0.50    -0.53    -0.53    -0.58
T-Value      -3.05    -4.65    -4.74    -5.67
P-Value      0.009    0.000    0.000    0.000

PARKING     -0.037   -0.040
T-Value      -0.56    -0.63
P-Value      0.582    0.537

SHOPCNTR      -3.1     -2.7     -3.0
T-Value      -0.95    -0.98    -1.14
P-Value      0.356    0.342    0.272

INCOME        0.11
T-Value       0.25
P-Value      0.807

S             4.01     3.88     3.81     3.84
R-Sq         70.01    69.87    69.07    66.57
R-Sq(adj)    59.30    61.84    63.27    62.63
Mallows C-p    6.0      4.1      2.4      1.6


13.5 Choosing Among Regression Models

• Best Subsets or all possible regressions is an alternative approach to model selection.

• For k possible predictors, there are k subsets of models.

• There is the subset of models each with one predictor, the subset with two predictors, …, and finally the subset with all k predictors.

• Consider Example 13.14. The Minitab output follows.


13.5 Choosing Among Regression Models

Best Subsets Regression: VOLUME vs. FLOOR_SP, PRESC_RX, ...

Response is VOLUME

Vars   R-Sq   R-Sq(adj)   Mallows C-p   S        Predictors in model
1      43.9   40.8        10.2          4.8351   PRESC_RX
1      14.8   10.1        23.8          5.9604   INCOME
2      66.6   62.6         1.6          3.8420   FLOOR_SP, PRESC_RX
2      64.7   60.6         2.5          3.9474   PRESC_RX, SHOPCNTR
3      69.1   63.3         2.4          3.8089   FLOOR_SP, PRESC_RX, SHOPCNTR
3      67.9   61.9         3.0          3.8778   (second-best three-predictor model)
4      69.9   61.8         4.1          3.8825   FLOOR_SP, PRESC_RX, PARKING, SHOPCNTR
4      69.3   61.1         4.3          3.9176   (second-best four-predictor model)
5      70.0   59.3         6.0          4.0099   FLOOR_SP, PRESC_RX, PARKING, SHOPCNTR, INCOME


13.5 Choosing Among Regression Models

• Minitab displays only the two best models for each subset, rather than all possible models.

• For each subset, three statistics of model performance are given: R², R²(adj), and Mallows' Cp.

• R² and R²(adj) were discussed in Chapter 12.


13.5 Choosing Among Regression Models

• The Cp criterion:

• For a model with p coefficients [the intercept and (p − 1) partial slopes] corresponding to (p − 1) predictors,

  Cp = [MS(Residual, p coefficients) × (n − p)] / MS(Residual, all coefficients) − (n − 2p)

• If the p-coefficient model has all of the useful predictors, then MS(Residual, p coefficients) ≈ MS(Residual, all coefficients).

• This implies Cp ≈ p for an appropriate model.
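A sketch of the computation, again assuming the hypothetical "pharmacy.csv": for Example 13.14, the two-predictor model's Cp should come out near 1.6, matching the best-subsets output above.

# Sketch: Mallows' Cp from the formula above, for one candidate subset.
import pandas as pd
import statsmodels.api as sm

def mallows_cp(data, response, subset, all_predictors):
    n = len(data)
    full = sm.OLS(data[response], sm.add_constant(data[all_predictors])).fit()
    sub = sm.OLS(data[response], sm.add_constant(data[subset])).fit()
    p = len(subset) + 1              # intercept plus (p - 1) partial slopes
    return sub.mse_resid * (n - p) / full.mse_resid - (n - 2 * p)

df = pd.read_csv("pharmacy.csv")     # hypothetical file name
print(mallows_cp(df, "VOLUME", ["FLOOR_SP", "PRESC_RX"],
                 ["FLOOR_SP", "PRESC_RX", "PARKING", "SHOPCNTR", "INCOME"]))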


13.5 Choosing Among Regression Models

• Consider Example 13.14.

• Using the Cp criterion, which set of predictors should be selected?

• By scanning down the column labeled "Mallows C-p," we see that:
  C-p = 10.2 when p = 2;
  C-p = 1.6 when p = 3;
  C-p = 2.4 when p = 4;
  C-p = 4.1 when p = 5; and
  C-p = 6.0 when p = 6.


13.5 Choosing Among Regression Models

• The subset with p = 3 (or k = 2) has a value of Cp actually less than p.

• The same could be said for the subsets with p = 4 and p = 5.

• The model with p = 3 is chosen because it has the smallest Cp and is more parsimonious.

• For the subset with all k predictors, Cp = p.
• This does not imply that you should use the model with all k predictors.


13.5 Choosing Among Regression Models

• Using the R²(adj) criterion, which set of predictors should be selected?

• For the subset with one predictor, the predictor with the largest R²(adj) is PRESC_RX.

• Minitab also gives the one-predictor model with the next largest R²(adj). That predictor is INCOME.

• For the subset with two predictors, the two predictors with the largest R²(adj) are FLOOR_SP and PRESC_RX.

• The percentage increase in R²(adj) in going from one to two predictors is (62.6 − 40.8)/40.8 = 53.4%, a substantial increase.


13.5 Choosing Among Regression Models

• For the subset with three predictors, the three predictors with the largest R²(adj) are FLOOR_SP, PRESC_RX, and SHOPCNTR.

• The percentage increase in R²(adj) in going from two to three predictors is (63.3 − 62.6)/62.6 = 1.1%.

• Is this negligible increase worth the loss in parsimony?

• The subset with four predictors is not considered since R²(adj) decreases.

• Using the R²(adj) criterion, the subset with two predictors appears to be the most reasonable, and the predictors are FLOOR_SP and PRESC_RX.


13.5 Choosing Among Regression Models

• Hopefully, the procedures will have some predictors in common that can be used as a starting point.

• “In selecting a regression model, a manager should use experience and judgment as well as statistical results. If one model involves reasonable relations and variables, yet does somewhat less well than another, less plausible model on a purely statistical basis, a manager might well choose the first model anyway.” (Hildebrand, Ott and Gray)


Section 13.6: Residual Analysis (Step 4)


13.6 Residual Analysis (Step 4)

• In fitting a regression model, certain assumptions are made regarding the errors (ε) of the true or population model.

• Since the errors are unobservable, we use the residuals (preferably the standardized residuals, SRi).

• If the fitted model does not accurately capture the nature of the data, the SRi’s will exhibit a pattern.

• If the fitted model does accurately capture the nature of the data, a plot of the SRi will be uniformly spread out.


13.6 Residual Analysis (Step 4)

• Standardized residuals (SRi) were defined in Chapter 11:

  SRi = Residual_i / (Standard deviation of residual_i)

• Recall that an SRi is considered large if |SRi| > 2.


13.6 Residual Analysis (Step 4)

• One type of unexplained structure is nonlinearity in some of the predictors (xj).

• To detect this, plot SRi versus each xj.
• If there is nonlinearity in an xj, the plot will show curvature.
• One remedy is to transform the dependent and/or independent variables. Possible transformations: ln's on Y and some or all of the xj, square roots on some or all of the xj, inverses on Y and some or all of the xj.


Exercise 11.33: A government agency responsible for awarding contracts for much of its research work is under careful scrutiny by a number of private companies. One company examines the relationship between the amount of the contract (in $10,000s) and the length of time between the submission of the contract proposal and the contract approval:

  Length (in months), Y:   3    4    6    8    11    14    20
  Size (in $10,000s), x:   1    5   10   50   100   500  1000

The scatterplot of Y vs. x follows.

It is obvious from the scatterplot that the relationship is nonlinear.


In the lower left portion of the scatterplot, the values are clustered together. Both SIZE and LENGTH are more spread out as one moves from left to right.

[Figure: Scatterplot of LENGTH vs. SIZE]


• The use of a ln transformation on both SIZE and LENGTH shrinks this increasing spread.

• The scatterplot of ln(LENGTH) vs. ln(SIZE) follows:

[Figure: Scatterplot of ln(LENGTH) vs. ln(SIZE)]
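A quick numerical check of what the two scatterplots show, using the exercise's seven data points; the R² comparison is an illustration, not part of the exercise:

```python
import numpy as np

# Exercise 11.33 data: contract size vs. approval time.
size   = np.array([1, 5, 10, 50, 100, 500, 1000], dtype=float)
length = np.array([3, 4, 6, 8, 11, 14, 20], dtype=float)

def fit_r2(x, y):
    """Simple-regression R^2 for a candidate transformation of the data."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    return 1 - resid @ resid / ((y - y.mean()) ** 2).sum()

print("linear R^2:", round(fit_r2(size, length), 3))
print("ln-ln  R^2:", round(fit_r2(np.log(size), np.log(length)), 3))
```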


• In Exercise 11.33, the use of a ln transformation on both variables was determined by inspecting the scatterplot.

• This kind of visual inspection is not possible in multiple regression.
• To illustrate the usefulness of a plot of SRi vs. each xj, consider a simulated data set.

Example: Consider the Sales vs. Advertising Expenditures and Income example introduced in Chapter 12. However, now the data were simulated from a model that also had the square of Advertising Expenditures as an independent variable. In fitting a model, only the linear effects of Advertising Expenditures and Income were used.


The regression output from Minitab follows:

Regression Analysis: Sales versus Adv Exp, Income

The regression equation is
Sales = -6.37 + 8.03 Adv Exp + 0.944 Income

Predictor   Coef     SE Coef   T       P       VIF
Constant    -6.366   8.578     -0.74   0.482
Adv Exp     8.0292   0.6817    11.78   0.000   1.8
Income      0.9439   0.2370    3.98    0.005   1.8

S = 2.69837   R-Sq = 98.2%   R-Sq(adj) = 97.7%

Unusual Observations
Obs   Adv Exp   Sales    Fit      SE Fit   Residual   St Resid
10    6.00      88.625   84.283   1.720    4.342      2.09R

R denotes an observation with a large standardized residual

The plot of SRi vs. Advertising Expenditures follows.


The curved relationship between SRi and Advertising Expenditures is obvious.

However, including a quadratic term in the model invites multicollinearity.

[Figure: Residuals Versus Adv Exp (response is Sales)]


Example: Consider the simulated data set for Sales vs. Advertising Expenditures and Income introduced above. The Minitab output follows for when the square of Advertising Expenditures is included as a predictor.

Regression Analysis: Sales versus Adv Exp, (AdvExp)^2, Income

The regression equation is
Sales = 0.00 + 1.55 Adv Exp + 0.948 (AdvExp)^2 + 0.993 Income

Predictor    Coef      SE Coef   T       P       VIF
Constant     0.001     2.042     0.00    1.000
Adv Exp      1.5467    0.5946    2.60    0.041   25.6
(AdvExp)^2   0.94800   0.08390   11.30   0.000   24.1
Income       0.99345   0.05440   18.26   0.000   1.8

S = 0.617500   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Although all three predictors are statistically significant, the VIF exceeds 10 for two of the predictors.

• When including a quadratic term, it is recommended that $(x - \bar{x})^2$ be used rather than $x^2$.
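A sketch of why centering helps, computing variance inflation factors for a raw and a centered quadratic term; the uniform x-values are an assumption for illustration:

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the other columns (with an intercept)."""
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid @ resid / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out.append(round(1 / (1 - r2), 1))
    return out

rng = np.random.default_rng(2)
x = rng.uniform(1, 6, size=50)                 # e.g., advertising expenditures
print(vif(np.column_stack([x, x ** 2])))               # raw quadratic: large VIFs
print(vif(np.column_stack([x, (x - x.mean()) ** 2])))  # centered: VIFs near 1
```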


The Minitab output follows for when (AdvExp - 3.2)^2 is used instead of (AdvExp)^2.

Regression Analysis: Sales versus Adv Exp, (AdvExp-3.2)^2, Income

The regression equation is
Sales = -9.71 + 7.61 Adv Exp + 0.948 (Adv Exp - 3.2)^2 + 0.993 Income

Predictor   Coef      SE Coef   T       P       VIF
Constant    -9.706    1.985     -4.89   0.003
Adv Exp     7.6139    0.1603    47.51   0.000   1.9
X^2         0.94800   0.08390   11.30   0.000   1.1
Income      0.99345   0.05440   18.26   0.000   1.8

S = 0.617500   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Multicollinearity is no longer a problem.


• Another requirement is that the error terms be normally distributed.

• Standardized Residuals can be viewed as values of a standard normal random variable.

• Thus, a normal probability plot (NPP) of the SRi's should be linear, and the p-value of a normality test should exceed 0.05.

• Example: Consider the simulated data set for Sales vs. Advertising Expenditures and Income considered above when (AdvExp-3.2)2 is also used as a predictor. The NPP follows:


The linearity of the NPP and the p-value = 0.682 of the Anderson-Darling normality test indicate that the SRi's are normally distributed.

[Figure: Normal Probability Plot of SRES — Mean = 0.06917, StDev = 1.036, N = 10, AD = 0.245, p-value = 0.682]
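A sketch of the same two checks with scipy; note it substitutes the Shapiro-Wilk test for Minitab's Anderson-Darling test, since scipy's Anderson-Darling routine reports critical values rather than a p-value. The residuals here are simulated placeholders:

```python
import numpy as np
from scipy import stats

# sr: standardized residuals from a fitted model (simulated stand-ins here).
rng = np.random.default_rng(3)
sr = rng.standard_normal(10)

# Normal probability plot: the correlation r of ordered SRs vs. normal
# quantiles should be near 1 when the plot is nearly linear.
(osm, osr), (slope, intercept, r) = stats.probplot(sr, dist="norm")
print("NPP correlation:", round(r, 3))

# A formal normality test with a p-value; > 0.05 means no evidence
# against normality.
stat, p = stats.shapiro(sr)
print("p-value:", round(p, 3))
```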


• In Chapter 11, two types of outliers were introduced.

• In simple regression, a high-leverage point is one for which the x-value is, in some sense, far away from most of the x-values.

• A high leverage point is not necessarily bad.

• A high leverage point has the potential to alter the fitted line.


• The concept of a high leverage point in multiple regression was considered in Chapter 12.

• In multiple regression, one must consider not only the range of values of each predictor but the region of values of all the predictors taken together.

• As was demonstrated in Chapter 12, Minitab will indicate a high leverage point by the “X” symbol.


• The other type of outlier is a Y-outlier.
• A Y-outlier is one where |SRi| > 2.

• One problem with outliers is that they can distort the regression equation.

• Another problem is that they can influence the plot of SRi vs. $\hat{Y}_i$ and the NPP of the SRi's.

• In the NPP, if there are a few residuals where |SRi| > 2, then the plot could look nonlinear and have a p-value < 0.05 solely because of these large SRi's.


• The "jackknife" method is another approach for detecting outliers.

• In the jackknife method, a set of n regression models is obtained, each time excluding one of the n observations.

• The coefficients of these models are compared to each other. If an observation is an outlier, the coefficients of the model fit without it should change substantially.
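A sketch of the procedure in Python; the ten (x, y) points reproduce the revised sales example discussed next, with region G changed to (10, 4):

```python
import numpy as np

def ols(y, X):
    """Least-squares fit with intercept; returns [intercept, slopes...]."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def jackknife_coefs(y, X):
    """Refit the model n times, leaving out one observation each time."""
    n = len(y)
    return np.array([ols(np.delete(y, i), np.delete(X, i, axis=0))
                     for i in range(n)])

# Data points from the example's jackknife table.
adv   = np.array([1, 2, 1, 3, 2, 4, 10, 5, 5, 6], dtype=float)
sales = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)

coefs = jackknife_coefs(sales, adv.reshape(-1, 1))
for (x, y_), (b0, b1) in zip(zip(adv, sales), coefs):
    print(f"excluding ({x:.0f},{y_:.0f}): intercept={b0:.3f} slope={b1:.3f}")
```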


• To illustrate this concept, consider the sales and advertising expenditures example presented in Chapter 11, where the observation for Region G was changed from (3,4) to (10,4).

• That (10,4) is a high influence point is evident from the fitted line plot which follows.


[Figure: Fitted Regression Line for Sales vs. Adv Exp (Revised Data): Sales_1 = 1.472 + 0.3919 Adv Exp_1; S = 1.08509, R-Sq = 52.9%, R-Sq(adj) = 47.0%]

• Applying the jackknife procedure, the following results are obtained.


• Notice that omitting the point (10,4) resulted in large changes in $\hat{\beta}_0$ and $\hat{\beta}_1$.

  Data Point Excluded   Slope   Intercept
  (1,1)                 0.345   1.76
  (2,1)                 0.351   1.78
  (1,2)                 0.399   1.43
  (3,2)                 0.382   1.58
  (2,3)                 0.416   1.29
  (4,3)                 0.392   1.48
  (10,4)                0.734   0.524
  (5,4)                 0.382   1.45
  (5,5)                 0.363   1.40
  (6,5)                 0.349   1.50


• For multiple regression, the problem becomes more complex as there are k partial slopes that need to be compared.

• The jackknife method is also computationally prohibitive for large n.

• For these situations, Cook's (1977) statistic or similar measures are recommended.
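Cook's statistic is not derived in the slides; a sketch under the standard definition $D_i = SR_i^2 \, h_{ii} / (p\,(1 - h_{ii}))$, with p regression coefficients and simulated placeholder data:

```python
import numpy as np

def cooks_distance(y, X):
    """Cook's D_i = (SR_i^2 / p) * h_ii / (1 - h_ii)."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
    h = np.diag(H)                           # leverages
    e = y - H @ y                            # residuals
    p = X1.shape[1]
    s2 = e @ e / (n - p)
    sr2 = e ** 2 / (s2 * (1 - h))            # squared standardized residuals
    return sr2 / p * h / (1 - h)

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=60)
D = cooks_distance(y, X)
print(np.argsort(D)[-3:])                    # indices of the most influential points
```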


• Regarding the true model, another assumption is constant variance. This means that the error variance ($\sigma_\varepsilon^2$) is constant.

• To investigate this, examine plots of the Standardized Residuals (SRi) vs. the fitted Y's ($\hat{Y}_i$) and each of the predictors.

• If any of these plots is fan (or funnel) shaped, this indicates that the variance of the error term is increasing (or decreasing), i.e., heteroscedasticity.

Example: To demonstrate heteroscedasticity, consider a simulated data set of 30 observations from the true model E(Y) = 1.0 + 0.80x, where the standard deviation of the error is 1.0 for the first 10 observations, 2.0 for the second set of 10 observations, and 3.0 for the last set of 10 observations.
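A sketch of simulating data like the example describes; the x-values 1..30 are an assumption, since the slides do not list them:

```python
import numpy as np

# E(Y) = 1.0 + 0.80x with error standard deviation 1.0, 2.0, 3.0
# for successive blocks of 10 observations.
rng = np.random.default_rng(10)
x = np.arange(1, 31, dtype=float)
sd = np.repeat([1.0, 2.0, 3.0], 10)
y = 1.0 + 0.80 * x + rng.normal(scale=sd)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
# Residual spread grows across the blocks -- the fan shape in the plot.
print([round(resid[i:i + 10].std(), 2) for i in (0, 10, 20)])
```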


The Minitab output follows.

Regression Analysis: Y versus X

The regression equation is
Y = 0.71 + 0.913 X

Predictor   Coef     SE Coef   T      P
Constant    0.713    1.240     0.57   0.570
X           0.9132   0.2870    3.18   0.004

S = 2.567   R-Sq = 26.6%   R-Sq(adj) = 23.9%

Unusual Observations
Obs   X      Y        Fit     SE Fit   Residual   St Resid
26    6.00   14.644   6.193   0.741    8.451      3.44R

R denotes an observation with a large standardized residual

To determine whether or not heteroscedasticity is a problem, look at the plot of SRs vs. $\hat{Y}_i$.


Residuals vs. Fits ($\hat{Y}_i$)

[Figure: Residuals Versus the Fitted Values (response is Y)]

The fan or funnel-shaped pattern indicates heteroscedasticity.


• When there is heteroscedasticity, there are two possible cures.

• One cure is to use weighted least squares.

• The other cure is to re-express the dependent variable.
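A sketch of the first cure, weighted least squares via rescaled rows; the weight choice $w_i = 1/x_i^2$ is an assumption about the variance pattern, not something the slides specify:

```python
import numpy as np

def wls(y, X, w):
    """Weighted least squares: minimize sum_i w_i * (y_i - x_i'beta)^2.
    Equivalent to OLS after scaling each row by sqrt(w_i)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X1 * sw[:, None], y * sw, rcond=None)
    return beta

# Simulated data whose error spread grows with x.
rng = np.random.default_rng(5)
x = np.linspace(1, 10, 50)
y = 1.0 + 0.8 * x + rng.normal(scale=0.3 * x)
print(wls(y, x.reshape(-1, 1), 1.0 / x ** 2))   # roughly [1.0, 0.8]
```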


Examples 13.19 and 13.20:
A very crude model for predicting the price of common stocks might use price per share (Y) as a linear function of previous year's earnings per share (x1), change in earnings per share (x2), and asset value per share (x3). A plot of standardized residuals versus fitted values ($\hat{Y}_i$) for a regression study of 26 stocks shows evidence of heteroscedasticity, since there is a general tendency for the magnitude of the SR's to increase as $\hat{Y}_i$ increases.

Refer to the plot of the SR's vs. $\hat{Y}_i$ that follows.


[Figure: Residuals Versus the Fitted Values (response is PRICE)]


• However, when the dependent variable is defined to be price per share (Y) divided by earnings per share (x1), or the P/E ratio, heteroscedasticity is not apparent.

Refer to the plot of the SR's vs. $\hat{Y}_i$ that follows.


The dilemma now is that neither x2 nor x3 is statistically significant.

[Figure: Residuals Versus the Fitted Values (response is P/E)]


Section 13.7: Autocorrelation (Step 4)


• A requirement in regression is that the error terms be uncorrelated.

• For time-series data, the error terms (ε) at different points in time could be correlated with each other.

• This is called autocorrelation.

• First-order or serial autocorrelation occurs if ε1 is correlated with ε2 , ε2 with ε3 , ε3 with ε4, etc.


• One way of stating that successive error terms are autocorrelated is as follows:

  $\varepsilon_t = \rho\,\varepsilon_{t-1} + u_t$

  where ρ is the correlation between successive error terms and the $u_t$ are independent normal random variables.

• This is called the first-order autoregressive error model.

• For many business and economic data sets, ρ is positive.
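A minimal simulation of this error model; ρ = 0.7 is an arbitrary illustrative value:

```python
import numpy as np

# Simulate eps_t = rho * eps_{t-1} + u_t with independent normal shocks u_t.
rng = np.random.default_rng(7)
rho, n = 0.7, 200
u = rng.normal(size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = rho * eps[t - 1] + u[t]

# The sample correlation between successive errors should be near rho.
print(round(np.corrcoef(eps[:-1], eps[1:])[0, 1], 2))
```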


• When the error terms are positively autocorrelated, consequences of using the least-squares line are:
  • The residual standard deviation (sε) underestimates the standard deviation of the error term (σε).
  • F and t tests will appear more significant than they really are; that is, p-values are smaller than they should be.
  • R² is larger than it should be.

• Positive autocorrelation leads to delusions of predictability. "We think we can predict more accurately than we actually can." (Hildebrand, Ott and Gray)


• How to detect autocorrelation in the unobservable error terms?

Use the residuals, which estimate the error terms! Determine if the residuals are autocorrelated!

• To illustrate residual autocorrelation, consider Exercise 13.42.


Exercise 13.42:
An auto-supply store had 60 months of data on variables that were thought to be relevant to sales. The data include monthly sales in thousands of dollars (SALES), average daily low temperature in degrees Fahrenheit (LOWTEMP), advertising expenditure for the month in thousands of dollars (ADEXP), used-car sales in the previous month (USEDCAR), and month number (MONTH).

For a preliminary analysis, use Simple Linear Regression with LOWTEMP as the predictor. Selected output is shown below. Is there autocorrelation in the residuals?


• A plot of the residuals vs. the order in which they occur follows.

• There is a tendency for a positive residual to be followed by a positive residual and a negative residual to be followed by a negative residual.

Residuals are positively autocorrelated.

[Figure: Residuals Versus the Order of the Data (response is SALES)]


• The scatterplot for pairs of successive residuals also shows positive first-order autocorrelation.

[Figure: Scatterplot of RES(t) vs. RES(t-1) with Fitted Line]


• The Durbin-Watson statistic is a formal test for first-order autocorrelation.

• Test statistic:

$d = \dfrac{\sum_{t=1}^{n-1}\left[\text{Residual}_{t+1} - \text{Residual}_{t}\right]^{2}}{\sum_{t=1}^{n}\text{Residual}_{t}^{2}}$
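A sketch of the statistic; the two test series (white noise and a random walk) are illustrative, not the exercise's residuals:

```python
import numpy as np

def durbin_watson(e):
    """d = sum of squared successive residual differences over the
    sum of squared residuals."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(6)
white = rng.normal(size=60)
print(round(durbin_watson(white), 2))             # near 2: no autocorrelation
print(round(durbin_watson(np.cumsum(white)), 2))  # positively autocorrelated: near 0
```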


• Properties of d

• 0 ≤ d ≤ 4

• If errors are uncorrelated, d ≈ 2

• If errors are positively correlated, d is close to 0

• If errors are negatively correlated, d is close to 4

• A description of the formal hypothesis testing procedure follows.


• H0: No first-order autocorrelation between consecutive error terms

Ha: Positive first-order autocorrelation between consecutive error terms

• Negative may be used in place of positive in the Ha.

• Ha could be two-sided: positive or negative autocorrelation.

• For most time series in business, use Ha with positive autocorrelation.


• The formal procedure for testing H0 using the d-statistic is as follows:

H0: No autocorrelation

Ha: Positive autocorrelation

Rejection region: d < dL,α

Nonrejection region: d > dU,α

Inconclusive (“possibly significant”) region: dL,α ≤ d ≤ dU,α


• Note: dL,α and dU,α are the lower and upper tabulated values, respectively, corresponding to k independent variables and n observations.

• Tables of the Durbin-Watson test bounds are available (Johnston, 1977).

• In practice, we hope not to reject H0: ρ = 0.
• Rule of thumb: "any value of d less than 1.5 or 1.6 leads us to suspect autocorrelation." (H,O & G)
• To illustrate using the Durbin-Watson test, consider Exercise 13.42.


Exercise 13.42:An auto-supply store had 60 months of data on variables that were thought to be relevant to sales. The data include monthly sales in thousands of dollars (SALES) and average daily low temperature in degrees Fahrenheit (LOWTEMP). For a preliminary analysis, use Simple Linear Regression with LOWTEMP as the predictor. Obtain the Durbin-Watson statistic. Does this statistic indicate that there is a serious autocorrelation problem?

Selected output is shown below.


Regression Analysis: SALES versus LOWTEMP

The regression equation is
SALES = 1026 - 5.68 LOWTEMP

Predictor   Coef      SE Coef   T       P
Constant    1026.09   29.38     34.93   0.000
LOWTEMP     -5.6836   0.5854    -9.71   0.000

S = 80.5599   R-Sq = 61.9%   R-Sq(adj) = 61.2%

Durbin-Watson statistic = 1.16730


• The rule of thumb says a value of d < 1.5 indicates positive autocorrelation.

• Since d = 1.17, reject H0 of no positive autocorrelation.

• The table value at the 5% level of significance is dL,.05 = 1.55. Since d = 1.17 < 1.55, reject H0 of no positive autocorrelation.


• The appropriate remedial measure for autocorrelation depends on the cause.

• If the autocorrelation is due to an omitted predictor, include another predictor variable whose cycles match the cycles of the residuals.

• For example, in regressing monthly sales of a restaurant chain on its monthly advertising expenditures, the addition of a competitor’s monthly advertising expenditures as a predictor could minimize an autocorrelation problem.


• If the model is correctly specified and autocorrelation occurs because the error terms really are autocorrelated, use transformed variables.

• The transformed variables are denoted as $Y_t'$ and $x_t'$, where:

  $Y_t' = Y_t - \hat{\rho}\,Y_{t-1}$,   $x_t' = x_t - \hat{\rho}\,x_{t-1}$,

  and $\hat{\rho}$ denotes the estimate of ρ.

• A quick approach is to let $\hat{\rho} = 1$.
• Thus, the transformed or differenced variables are: $Y_t' = Y_t - Y_{t-1}$ and $x_t' = x_t - x_{t-1}$.
• The Minitab procedure for finding the differenced variables is: Stat > Time Series > Differences
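A sketch of first differencing with np.diff, mirroring what Minitab's Differences command produces; the monthly series here are simulated stand-ins for the exercise's data (the seasonal LOWTEMP pattern and random-walk error term are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
lowtemp = 40 + 15 * np.sin(np.arange(60) * 2 * np.pi / 12) + rng.normal(size=60)
sales = 1026 - 5.7 * lowtemp + np.cumsum(rng.normal(size=60))  # autocorrelated errors

difsales = np.diff(sales)    # Y'_t = Y_t - Y_{t-1}: 59 differences from 60 values
diftemp = np.diff(lowtemp)   # x'_t = x_t - x_{t-1}
slope, intercept = np.polyfit(diftemp, difsales, 1)
print(round(intercept, 1), round(slope, 2))
```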


• Exercise 13.42 (cont'd.): First differences were obtained for the SALES and LOWTEMP variables and denoted as DIFSALES and DIFTEMP, respectively. Note that although there were 60 observations initially, there are 59 first differences. Is there evidence of an autocorrelation problem?

Selected Minitab output follows. At the 5% level of significance, dL,.05 ≈ 1.55 and dU,.05 ≈ 1.62. Since d = 3.03 > 1.62, do not reject H0 of no positive autocorrelation.


Regression Analysis: DIFSALES versus DIFTEMP

The regression equation is
DIFSALES = 4.4 - 6.15 DIFTEMP

59 cases used, 1 cases contain missing values

Predictor   Coef     SE Coef   T       P
Constant    4.40     11.40     0.39    0.701
DIFTEMP     -6.146   1.130     -5.44   0.000

S = 87.5610   R-Sq = 34.2%   R-Sq(adj) = 33.0%

Durbin-Watson statistic = 3.02730


• If anything, differencing has resulted in the conversion of positive autocorrelation to apparent negative autocorrelation. Refer to the plot of the residuals vs. their order.

[Figure: Residuals Versus the Order of the Data (response is DIFSALES)]


Section 13.8: Model Validation


• Assessing the appropriateness of a fitted regression model is much like tuning an automobile.

• When an automobile is tuned, it is tuned for the conditions under which it will typically operate, e.g., sea-level altitude. However, the automobile may not perform properly if operated at high altitudes.

• Similarly, the fitted model was built and tuned using the data in hand.

• However, will the fitted model work just as well for data not used to build the model?


• The fitted model needs to be validated: How well does the fitted model work with different data?

• The different data may be:
  • New data: data collected after the original data; or,
  • A holdout sample from the original data:
    • For cross-sectional data, the holdout sample could be a randomly selected subset of the original data;
    • For time-series data, the holdout sample is usually the most recent data.


• There are several methods for checking validity.

• H,O&G propose the following method:
  • Obtain the residuals using the new data;
  • Check to see if the average error in the validation sample is near 0 and if the standard deviation of the validation errors is reasonably close to the residual standard deviation of the model.


• The mean square forecast error, a.k.a. the mean square prediction error, is:

  $\text{MSPE} = \dfrac{\sum_{i=1}^{n_2}\left(Y_i - \hat{Y}_i\right)^2}{n_2}$

  where n2 is the size of the new sample.

• To illustrate this validation method, consider Exercise 13.55.
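A sketch of the validation arithmetic on simulated placeholder data (not Exercise 13.55's): fit on 120 random rows, then check the mean error and MSPE on the 40 held-out rows:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 160
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([0.5, -0.2]) + rng.normal(scale=0.4, size=n)

# Random split: 120 observations to build the model, 40 to validate it.
idx = rng.permutation(n)
train, hold = idx[:120], idx[120:]
X_tr = np.column_stack([np.ones(len(train)), X[train]])
beta, *_ = np.linalg.lstsq(X_tr, y[train], rcond=None)

# Prediction errors on the holdout sample.
yhat = np.column_stack([np.ones(len(hold)), X[hold]]) @ beta
err = y[hold] - yhat
print(round(err.mean(), 3))          # should be near 0
print(round(np.mean(err ** 2), 3))   # MSPE; compare with s^2 from the fit
```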


Exercise 13.55:

A bank that offers charge cards to customers studies the yearly purchase amount on the card as related to age, income, and years of education of the cardholder, and whether the cardholder owns or rents a home. The variable "owner" equals 1 if the cardholder owns a home and 0 if the cardholder rents a home.


• During the model building phase, if one were to use the first 120 observations, this would exclude homeowners.

• Moral: Randomly select the observations to use in building the model.


• Using 120 randomly selected observations, eliminating multicollinearity by defining the response variable to be "yearly purchase amount / income" (PURCH_1/INC_1), and removing the insignificant predictor, the model is:

  $\hat{Y} = 0.00703 + 0.000470\,(\text{AGE}) - 0.00095\,(\text{OWNER})$
                        (.000)             (.034)

The p-values are shown below the coefficient estimates.


• This model was used during the validation phase to obtain residuals for the other 40 observations.

• Since there is no systematic error in the residuals, the model has been validated.

• Note: As with an automobile, the model must undergo scheduled maintenance.


Keywords: Chapter 13

• Collinearity
• Correlation matrix
• Matrix plot
• Qualitative predictors
• Dummy variables
• Indicator variables
• Lagged variables
• Nonlinear models
• Logarithmic transformation
• Residual plots
• Stepwise regression
• Forward selection
• Backward elimination
• Mallows' Cp
• Outliers
• Jackknife method
• Homoscedasticity
• Heteroscedasticity
• Weighted least squares
• Autocorrelation
• First-order autoregressive model
• Durbin-Watson statistic
• First differences
• Model validation


Summary of Chapter 13

• This chapter presented a four-step process for building a multiple linear regression model:

• STEP ONE:
  • Initial Selection of Possible Predictor Variables
  • Incorporating Qualitative Independent Variables by Using Dummy or Indicator Variables
  • Incorporating Lagged Predictor Variables when there is Time-Series Data


• STEP TWO:
  • Addressing Nonlinearity and Interaction Among the Variables

• STEP THREE:
  • Choosing Predictors Using Stepwise and Other Methods

• STEP FOUR:
  • Checking the Assumptions of Linearity, Heteroscedasticity, Normality and Independence by Doing a Residual Analysis
  • Validating the Model
