Multiple Regression: Mann 2007, Chapter 14



Chapter 14: Multiple Regression

14.1 Multiple Regression Analysis
14.2 Assumptions of the Multiple Regression Model
14.3 Standard Deviation of Errors
14.4 Coefficient of Multiple Determination
14.5 Computer Solution of Multiple Regression

In Chapter 13, we discussed simple linear regression and linear correlation. A simple regression model includes one independent and one dependent variable, and it presents a very simplified scenario of real-world situations. In the real world, a dependent variable is usually influenced by a number of independent variables. For example, the sales of a company's product may be determined by the price of that product, the quality of the product, and the advertising expenditure incurred by the company to promote that product. Therefore, it makes more sense to use a regression model that includes more than one independent variable. Such a model is called a multiple regression model. In this chapter we will discuss multiple regression models.

14.1 Multiple Regression Analysis

The simple linear regression model discussed in Chapter 13 was written as

y = A + Bx + ε

This model includes one independent variable, which is denoted by x, and one dependent variable, which is denoted by y. As we know from Chapter 13, the term ε in the above model is called the random error.

Usually a dependent variable is affected by more than one independent variable. When we include two or more independent variables in a regression model, it is called a multiple regression model. Remember, whether it is a simple or a multiple regression model, it always includes one and only one dependent variable.

1519T_c14 03/27/2006 07:28 AM Page 614


A multiple regression model with y as a dependent variable and x1, x2, x3, ..., xk as independent variables is written as

y = A + B1x1 + B2x2 + B3x3 + ... + Bkxk + ε    (1)

where A represents the constant term, B1, B2, B3, ..., Bk are the regression coefficients of independent variables x1, x2, x3, ..., xk, respectively, and ε represents the random error term. This model contains k independent variables x1, x2, x3, ..., xk. From model (1), it would seem that multiple regression models can be used only when the relationship between the dependent variable and each independent variable is linear. Furthermore, it also appears as if there can be no interaction between two or more of the independent variables. This is far from the truth. In the real world, a multiple regression model can be much more complex. Discussion of such models is outside the scope of this book. When each term contains a single independent variable raised to the first power, as in model (1), we call it a first-order multiple regression model. This is the only type of multiple regression model we will discuss in this chapter.

In regression model (1), A represents the constant term, which gives the value of y when all independent variables assume zero values. The coefficients B1, B2, B3, ..., Bk are called the partial regression coefficients. For example, B1 is the partial regression coefficient of x1. It gives the change in y due to a one-unit change in x1 when all other independent variables included in the model are held constant. In other words, if we change x1 by one unit but keep x2, x3, ..., xk unchanged, then the resulting change in y is measured by B1. Similarly, the value of B2 gives the change in y due to a one-unit change in x2 when all other independent variables are held constant. In model (1) above, A, B1, B2, B3, ..., Bk are called the true regression coefficients or population parameters.

A positive value for a particular Bi in model (1) indicates a positive relationship between y and the corresponding xi; a negative value for a particular Bi indicates a negative relationship between y and the corresponding xi.

Remember that in a first-order regression model such as model (1), the relationship between each xi and y is a straight-line relationship. In model (1), A + B1x1 + B2x2 + B3x3 + ... + Bkxk is called the deterministic portion and ε is the stochastic portion of the model.

When we use the t distribution to make inferences about a single parameter of a multiple regression model, the degrees of freedom are calculated as

df = n − k − 1

where n represents the sample size and k is the number of independent variables in the model.

Definition
Multiple Regression Model  A regression model that includes two or more independent variables is called a multiple regression model. It is written as

y = A + B1x1 + B2x2 + B3x3 + ... + Bkxk + ε

where y is the dependent variable, x1, x2, x3, ..., xk are the k independent variables, and ε is the random error term. When each of the xi variables represents a single variable raised to the first power, as in the above model, this model is referred to as a first-order multiple regression model. For such a model with a sample size of n and k independent variables, the degrees of freedom are

df = n − k − 1

When a multiple regression model includes only two independent variables (with k = 2), model (1) reduces to

y = A + B1x1 + B2x2 + ε

A multiple regression model with three independent variables (with k = 3) is written as

y = A + B1x1 + B2x2 + B3x3 + ε
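The pieces of a first-order model can be made concrete with a short simulation. The sketch below generates data from a hypothetical model with k = 3 independent variables; the coefficient values, sample size, and error spread are invented for illustration and are not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 50, 3                     # sample size and number of independent variables (made up)
A = 10.0                         # constant term (made up)
B = np.array([2.0, -1.5, 0.5])   # partial regression coefficients B1, B2, B3 (made up)

x = rng.uniform(0, 10, size=(n, k))   # values of x1, x2, x3 for each member of the sample
eps = rng.normal(0, 2.0, size=n)      # random error term: mean 0, constant standard deviation

# First-order multiple regression model: y = A + B1*x1 + B2*x2 + B3*x3 + eps
y = A + x @ B + eps

# Degrees of freedom for inference about a single coefficient
df = n - k - 1
print(df)   # 46
```

Each row of x plays the role of one set of values of the independent variables; the deterministic portion is A + x @ B, and eps is the stochastic portion.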


If model (1) is estimated using sample data, which is usually the case, the estimated regression equation is written as

ŷ = a + b1x1 + b2x2 + b3x3 + ... + bkxk    (2)

In equation (2), a, b1, b2, b3, ..., bk are the sample statistics, which are the point estimators of the population parameters A, B1, B2, B3, ..., Bk, respectively.

In model (1), y denotes the actual values of the dependent variable for members of the sample. In the estimated model (2), ŷ denotes the predicted or estimated values of the dependent variable. The difference between any pair of y and ŷ values gives the error of prediction. For a multiple regression model,

SSE = Σ(y − ŷ)²

where SSE stands for the error sum of squares.

As in Chapter 13, the estimated regression equation (2) is obtained by minimizing the sum of squared errors, that is,

Minimize Σ(y − ŷ)²

The estimated equation (2) obtained by minimizing the sum of squared errors is called the least squares regression equation.

Usually the calculations in a multiple regression analysis are made by using statistical software packages, such as MINITAB, instead of using the formulas manually. Even for a multiple regression equation with two independent variables, the formulas are complex and manual calculations are time consuming. In this chapter we will perform the multiple regression analysis using MINITAB. The solutions obtained by using other statistical software packages, such as JMP, SAS, S-Plus, or SPSS, can be interpreted the same way. The TI-84 and Excel do not have built-in procedures for the multiple regression model.

14.2 Assumptions of the Multiple Regression Model

Like a simple linear regression model, a multiple (linear) regression model is based on certain assumptions. The following are the major assumptions for the multiple regression model (1).

Assumption 1: The mean of the probability distribution of ε is zero, that is,

E(ε) = 0

If we calculate errors for all measurements for a given set of values of independent variables for a population data set, the mean of these errors will be zero. In other words, while individual predictions will have some amount of error, on average our predictions will be correct. Under this assumption, the mean value of y is given by the deterministic part of regression model (1). Thus,

E(y) = A + B1x1 + B2x2 + B3x3 + ... + Bkxk

where E(y) is the expected or mean value of y for the population. This mean value of y is also denoted by μ_y|x1, x2, ..., xk.

Assumption 2: The errors associated with different sets of values of independent variables are independent. Furthermore, these errors are normally distributed and have a constant standard deviation, which is denoted by σ_ε.

Assumption 3: The independent variables are not linearly related. However, they can have a nonlinear relationship. When independent variables are highly linearly correlated, it is referred to as multicollinearity. This assumption is about the nonexistence of the multicollinearity problem. For example, consider the following multiple regression model:

y = A + B1x1 + B2x2 + B3x3 + ε


All of the following linear relationships (and other such linear relationships) between x1, x2, and x3 should be invalid for this model:

x1 = x2 + 4x3
x2 = 5x1 + 2x3
x1 = 3.5x2

If any linear relationship exists, we can substitute one variable for another, which will reduce the number of independent variables to two. However, nonlinear relationships, such as x1 = 4x2² and x2 = 2x1 + 6x2³, between x1, x2, and x3 are permissible.

In practice, multicollinearity is a major issue. Examining the correlation for each pair of independent variables is a good way to determine whether multicollinearity exists.

Assumption 4: There is no linear association between the random error term ε and each independent variable xi.

14.3 Standard Deviation of Errors

The standard deviation of errors (also called the standard error of the estimate) for the multiple regression model (1) is denoted by σ_ε, and it is a measure of variation among errors. However, when sample data are used to estimate multiple regression model (1), the standard deviation of errors is denoted by s_e. The formula to calculate s_e is as follows:

s_e = √(SSE / (n − k − 1))    where SSE = Σ(y − ŷ)²

Note that here SSE is the error sum of squares. We will not use this formula to calculate s_e manually. Rather, we will obtain it from the computer solution. Note that many software packages label s_e as Root MSE, where MSE stands for mean square error.

14.4 Coefficient of Multiple Determination

In Chapter 13, we denoted the coefficient of determination for a simple linear regression model by r² and defined it as the proportion of the total sum of squares SST that is explained by the regression model. The coefficient of determination for the multiple regression model, usually called the coefficient of multiple determination, is denoted by R² and is defined as the proportion of the total sum of squares SST that is explained by the multiple regression model. It tells us how good the multiple regression model is and how well the independent variables included in the model explain the dependent variable.

Like the value of r², the coefficient of multiple determination always lies in the range 0 to 1, that is,

0 ≤ R² ≤ 1

Just as in the case of the simple linear regression model, SST is the total sum of squares, SSR is the regression sum of squares, and SSE is the error sum of squares. SST is always equal to the sum of SSE and SSR. They are calculated as follows:

SSE = Σe² = Σ(y − ŷ)²
SST = SSyy = Σ(y − ȳ)²
SSR = Σ(ŷ − ȳ)²

SSR is the portion of SST that is explained by the use of the regression model, and SSE is the portion of SST that is not explained by the use of the regression model. The coefficient of multiple determination is given by the ratio of SSR and SST as follows:

R² = SSR / SST
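These sums of squares are easy to compute once a fitted equation is in hand. Here is a small numerical sketch; the data are invented for illustration, and numpy's least-squares routine stands in for the formulas the text leaves to software:

```python
import numpy as np

# Small made-up data set with k = 2 independent variables
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([6.0, 7.0, 11.0, 12.0, 17.0, 18.0])
n, k = len(y), 2

# Design matrix with a column of 1s for the constant term a
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes the sum of squared errors
y_hat = X @ coef

SSE = np.sum((y - y_hat) ** 2)          # error sum of squares
SST = np.sum((y - y.mean()) ** 2)       # total sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares

R2 = SSR / SST                      # coefficient of multiple determination
s_e = np.sqrt(SSE / (n - k - 1))    # standard deviation of errors (Root MSE)

print(round(R2, 4), round(s_e, 4))
```

For a least squares fit that includes a constant term, SST = SSR + SSE holds exactly, so R² can equivalently be computed as 1 − SSE/SST.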


The coefficient of multiple determination has one major shortcoming. The value of R² generally increases as we add more and more explanatory variables to the regression model (even if they do not belong in the model). Just because we can increase the value of R² does not imply that the regression equation with a higher value of R² does a better job of predicting the dependent variable. Such a value of R² will be misleading, and it will not represent the true explanatory power of the regression model. To eliminate this shortcoming of R², it is preferable to use the adjusted coefficient of multiple determination, which is denoted by R̄². Note that R̄² is the coefficient of multiple determination adjusted for degrees of freedom. The value of R̄² may increase, decrease, or stay the same as we add more explanatory variables to our regression model. If a new variable added to the regression model contributes significantly to explaining the variation in y, then R̄² increases; otherwise it decreases. The value of R̄² is calculated as follows:

R̄² = 1 − (1 − R²)·((n − 1) / (n − k − 1))

or

R̄² = 1 − (SSE / (n − k − 1)) / (SST / (n − 1))

Thus, if we know R², we can find the value of R̄². Almost all statistical software packages give the values of both R² and R̄² for a regression model.

Another property of R̄² to remember is that whereas R² can never be negative, R̄² can be negative.

While a general rule of thumb is that a higher value of R² implies that a specific set of independent variables does a better job of predicting a specific dependent variable, it is important to recognize that some dependent variables have a great deal more variability than others. Therefore, R² = .30 could imply that a specific model is not a very strong model, but it could be the best possible model in a certain scenario. Many good financial models have values of R² below .50.

14.5 Computer Solution of Multiple Regression

In this section, we take an example of a multiple regression model, solve it using MINITAB, interpret the solution, and make inferences about the population parameters of the regression model.

EXAMPLE 14–1
A researcher wanted to find the effect of driving experience and the number of driving violations on auto insurance premiums. A random sample of 12 drivers insured with the same company and having similar auto insurance policies was selected from a large city. Table 14.1 lists


Table 14.1

Monthly Premium    Driving Experience    Number of Driving Violations
(dollars)          (years)               (past 3 years)
148                 5                    2
 76                14                    0
100                 6                    1
126                10                    3
194                 4                    6
110                 8                    2
114                11                    3
 86                16                    1
198                 3                    5
 92                 9                    1
 70                19                    0
120                13                    3

(Using MINITAB to find a multiple regression equation.)


the monthly auto insurance premiums (in dollars) paid by these drivers, their driving experiences (in years), and the numbers of driving violations committed by them during the past three years.

Using MINITAB, find the regression equation of monthly premiums paid by drivers on the driving experiences and the numbers of driving violations.

Solution  Let

y = the monthly auto insurance premium (in dollars) paid by a driver
x1 = the driving experience (in years) of a driver
x2 = the number of driving violations committed by a driver during the past three years

We are to estimate the regression model

y = A + B1x1 + B2x2 + ε    (3)

The first step is to enter the data of Table 14.1 into the MINITAB spreadsheet as shown in Screen 14.1. Here we have entered the given data in columns C1, C2, and C3 and named them Monthly Premium, Driving Experience, and Driving Violations, respectively.

To obtain the estimated regression equation, select Stat > Regression > Regression. In the dialog box you obtain, enter Monthly Premium in the Response box, and Driving Experience and Driving Violations in the Predictors box, as shown in Screen 14.2. Note that you can enter the column names C1, C2, and C3 instead of the variable names in these boxes. Click OK to obtain the output, which is shown in Screen 14.3.

From the output given in Screen 14.3, the estimated regression equation is

ŷ = 110 − 2.75x1 + 16.1x2

Screen 14.1 (MINITAB worksheet with the data of Table 14.1 entered in columns C1, C2, and C3)
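Readers without MINITAB can reproduce the fit from the Table 14.1 data. The sketch below uses numpy's least-squares routine, which is not a procedure from the text, and also computes the s_e, R², and adjusted R² values that the MINITAB output reports:

```python
import numpy as np

# Data from Table 14.1
premium = np.array([148, 76, 100, 126, 194, 110, 114, 86, 198, 92, 70, 120], float)     # y
experience = np.array([5, 14, 6, 10, 4, 8, 11, 16, 3, 9, 19, 13], float)                # x1
violations = np.array([2, 0, 1, 3, 6, 2, 3, 1, 5, 1, 0, 3], float)                      # x2

n, k = len(premium), 2
X = np.column_stack([np.ones(n), experience, violations])

# Least squares estimates of A, B1, B2
coef, *_ = np.linalg.lstsq(X, premium, rcond=None)
a, b1, b2 = coef

y_hat = X @ coef
SSE = np.sum((premium - y_hat) ** 2)
SST = np.sum((premium - premium.mean()) ** 2)

s_e = np.sqrt(SSE / (n - k - 1))                  # standard deviation of errors
R2 = 1 - SSE / SST                                # coefficient of multiple determination
R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)     # adjusted for degrees of freedom

print(round(a, 2), round(b1, 4), round(b2, 3))    # 110.28 -2.7473 16.106
```

The printed coefficients match the Coef column of the MINITAB output, and s_e, R², and R̄² match the values discussed in Example 14–2.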

Screen 14.2 (MINITAB Regression dialog box with the response and predictors entered)

Screen 14.3 (MINITAB regression output, referenced as parts I, II, and III in the text)

14.5.1 Estimated Multiple Regression Model

Example 14–2 describes, among other things, how the coefficients of the multiple regression model are interpreted.

EXAMPLE 14–2
(Interpreting parts of the MINITAB solution of multiple regression.)

Refer to Example 14–1 and the MINITAB solution given in Screen 14.3.

(a) Explain the meaning of the estimated regression coefficients.
(b) What are the values of the standard deviation of errors, the coefficient of multiple determination, and the adjusted coefficient of multiple determination?
(c) What is the predicted auto insurance premium paid per month by a driver with seven years of driving experience and three driving violations committed in the past three years?
(d) What is the point estimate of the expected (or mean) auto insurance premium paid per month by all drivers with 12 years of driving experience and 4 driving violations committed in the past three years?

Solution
(a) From the portion of the MINITAB solution that is marked I in Screen 14.3, the estimated regression equation is

ŷ = 110 − 2.75x1 + 16.1x2    (4)

From this equation,

a = 110, b1 = −2.75, and b2 = 16.1

We can also read the values of these coefficients from the column labeled Coef in the portion of the output marked II in the MINITAB solution of Screen 14.3. From this column we obtain

a = 110.28, b1 = −2.7473, and b2 = 16.106

Notice that in this column the coefficients of the regression equation appear with more digits after the decimal point. With these coefficient values, we can write the estimated regression equation as

ŷ = 110.28 − 2.7473x1 + 16.106x2    (5)

The value of a = 110.28 in the estimated regression equation (5) gives the value of ŷ for x1 = 0 and x2 = 0. Thus, a driver with no driving experience and no driving violations committed in the past three years is expected to pay an auto insurance premium of $110.28 per month. Again, this is the technical interpretation of a. In reality, that may not be true because none of the drivers in our sample has both zero experience and zero driving violations. As all of us know, some of the highest premiums are paid by teenagers just after obtaining their driver's licenses.

The value of b1 = −2.7473 in the estimated regression model gives the change in ŷ for a one-unit change in x1 when x2 is held constant. Thus, we can state that a driver with one extra year of experience but the same number of driving violations is expected to pay $2.7473 (or $2.75) less per month for the auto insurance premium. Note that because b1 is negative, an increase in driving experience decreases the premium paid. In other words, y and x1 have a negative relationship.

The value of b2 = 16.106 in the estimated regression model gives the change in ŷ for a one-unit change in x2 when x1 is held constant. Thus, a driver with one extra driving violation during the past three years but with the same years of driving experience is expected to pay $16.106 (or $16.11) more per month for the auto insurance premium.


(b) The values of the standard deviation of errors, the coefficient of multiple determination, and the adjusted coefficient of multiple determination are given in part III of the MINITAB solution of Screen 14.3. From this part of the solution,

s_e = 12.1459, R² = 93.1%, and R̄² = 91.6%

Thus, the standard deviation of errors is 12.1459. The value of R² = 93.1% tells us that the two independent variables, years of driving experience and the number of driving violations, explain 93.1% of the variation in the auto insurance premiums. The value of R̄² = 91.6% is the value of the coefficient of multiple determination adjusted for degrees of freedom. It states that when adjusted for degrees of freedom, the two independent variables explain 91.6% of the variation in the dependent variable.

(c) To find the predicted auto insurance premium paid per month by a driver with seven years of driving experience and three driving violations during the past three years, we substitute x1 = 7 and x2 = 3 in the estimated regression equation (5). Thus,

ŷ = 110.28 − 2.7473(7) + 16.106(3) = $139.37

Note that this value of ŷ is a point estimate of the predicted value of y, which is denoted by ŷp. The concept of the predicted value of y is the same as that for a simple linear regression model, discussed in Section 13.8.2 of Chapter 13.

(d) To obtain the point estimate of the expected (mean) auto insurance premium paid per month by all drivers with 12 years of driving experience and four driving violations during the past three years, we substitute x1 = 12 and x2 = 4 in the estimated regression equation (5). Thus,

ŷ = 110.28 − 2.7473(12) + 16.106(4) = $141.74

This value of ŷ is a point estimate of the mean value of y, which is denoted by E(y) or μ_y|x1, x2. The concept of the mean value of y is the same as that for a simple linear regression model, discussed in Section 13.8.1 of Chapter 13.

14.5.2 Confidence Interval for an Individual Coefficient

The values of a, b1, b2, b3, ..., bk obtained by estimating model (1) using sample data give the point estimates of A, B1, B2, B3, ..., Bk, respectively, which are the population parameters. Using the values of the sample statistics b1, b2, b3, ..., bk, we can make confidence intervals for the corresponding population parameters B1, B2, B3, ..., Bk, respectively.

Because of the assumption that the errors are normally distributed, the sampling distribution of each bi is normal with its mean equal to Bi and standard deviation equal to σ_bi. For example, the sampling distribution of b1 is normal with its mean equal to B1 and standard deviation equal to σ_b1. However, usually σ_ε is not known and, hence, we cannot find σ_bi. Consequently, we use s_bi as an estimator of σ_bi and use the t distribution to determine a confidence interval for Bi. The formula to obtain a confidence interval for a population parameter Bi is given below. This is the same formula we used to make a confidence interval for B in Section 13.5.2 of Chapter 13. The only difference is that to make a confidence interval for a particular Bi for a multiple regression model, the degrees of freedom are n − k − 1.

Confidence Interval for Bi  The (1 − α)·100% confidence interval for Bi is given by

bi ± t·s_bi

The value of t that is used in this formula is obtained from the t distribution table for α/2 area in the right tail of the t distribution curve and (n − k − 1) degrees of freedom. The values of bi and s_bi are obtained from the computer solution.


Example 14–3 describes the procedure to make a confidence interval for an individual regression coefficient Bi.

EXAMPLE 14–3
(Making a confidence interval for an individual coefficient of a multiple regression model.)

Determine a 95% confidence interval for B1 (the coefficient of experience) for the multiple regression of auto insurance premium on driving experience and the number of driving violations. Use the MINITAB solution of Screen 14.3.

Solution  To make a confidence interval for B1, we use the portion marked II in the MINITAB solution of Screen 14.3. From that portion of the MINITAB solution,

b1 = −2.7473 and s_b1 = .9770

Note that the value of s_b1, the standard deviation of b1, is given in the column labeled SE Coef in part II of the MINITAB solution.

The confidence level is 95%. The area in each tail of the t distribution curve is obtained as follows:

Area in each tail of the t distribution = (1 − .95)/2 = .025

The sample size is 12, which gives n = 12. Because there are two independent variables, k = 2. Therefore,

Degrees of freedom = n − k − 1 = 12 − 2 − 1 = 9

From the t distribution table (Table V of Appendix C), the value of t for .025 area in the right tail of the t distribution curve and 9 degrees of freedom is 2.262. Then, the 95% confidence interval for B1 is

b1 ± t·s_b1 = −2.7473 ± 2.262(.9770) = −2.7473 ± 2.2100 = −4.9573 to −.5373

Thus, the 95% confidence interval for B1 is −4.96 to −.54. That is, we can state with 95% confidence that for one extra year of driving experience, the monthly auto insurance premium changes by an amount between −$4.96 and −$.54. Note that since both limits of the confidence interval have negative signs, we can also state that for each extra year of driving experience, the monthly auto insurance premium decreases by an amount between $.54 and $4.96.

By applying the procedure used in Example 14–3, we can make a confidence interval for any of the coefficients (including the constant term) of a multiple regression model, such as A and B2 in model (3). For example, the 95% confidence intervals for A and B2, respectively, are

a ± t·s_a = 110.28 ± 2.262(14.62) = 77.21 to 143.35
b2 ± t·s_b2 = 16.106 ± 2.262(2.613) = 10.20 to 22.02

14.5.3 Testing a Hypothesis about an Individual Coefficient

We can perform a test of hypothesis about any of the coefficients Bi of the regression model (1) using the same procedure that we used to make a test of hypothesis about B for a simple regression model in Section 13.5.3 of Chapter 13. The only difference is that the degrees of freedom are equal to n − k − 1 for a multiple regression model.

Again, because of the assumption that the errors are normally distributed, the sampling distribution of each bi is normal with its mean equal to Bi and standard deviation equal to σ_bi. However, usually σ_ε is not known and, hence, we cannot find σ_bi. Consequently, we use s_bi as an estimator of σ_bi and use the t distribution to perform the test.
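The interval arithmetic in Example 14–3 is easy to verify numerically; a sketch using the table value t = 2.262 quoted in the text:

```python
b1, s_b1 = -2.7473, 0.9770   # from the Coef and SE Coef columns of the MINITAB output
t = 2.262                    # t for .025 right-tail area and 9 df (Table V)

margin = t * s_b1
low, high = b1 - margin, b1 + margin
print(round(low, 4), round(high, 4))   # -4.9573 -0.5373
```

Rounding the endpoints to two decimal places gives the interval −4.96 to −.54 reported in the example.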


Example 14–4 illustrates the procedure for testing a hypothesis about a single coefficient.

EXAMPLE 14–4
(Testing a hypothesis about a coefficient of a multiple regression model.)

Using the 2.5% significance level, can you conclude that the coefficient of the number of years of driving experience in regression model (3) is negative? Use the MINITAB output obtained in Example 14–1 and shown in Screen 14.3 to perform this test.

Solution  From Example 14–1, our multiple regression model (3) is

y = A + B1x1 + B2x2 + ε

where y is the monthly auto insurance premium (in dollars) paid by a driver, x1 is the driving experience (in years), and x2 is the number of driving violations committed during the past three years. From the MINITAB solution, the estimated regression equation is

ŷ = 110.28 − 2.7473x1 + 16.106x2

To conduct a test of hypothesis about B1, we use the portion marked II in the MINITAB solution given in Screen 14.3. From that portion of the MINITAB solution,

b1 = −2.7473 and s_b1 = .9770

Note that the value of s_b1, the standard deviation of b1, is given in the column labeled SE Coef in part II of the MINITAB solution.

To make a test of hypothesis about B1, we perform the following five steps.

Step 1. State the null and alternative hypotheses.
We are to test whether or not the coefficient of the number of years of driving experience in regression model (3) is negative, that is, whether or not B1 is negative. The two hypotheses are

H0: B1 = 0
H1: B1 < 0

Note that we can also write the null hypothesis as H0: B1 ≥ 0, which states that the coefficient of the number of years of driving experience in regression model (3) is either zero or positive.

Step 2. Select the distribution to use.
The sample size is small (n < 30) and σ_ε is not known. The sampling distribution of b1 is normal because the errors are assumed to be normally distributed. Hence, we use the t distribution to make a test of hypothesis about B1.

Step 3. Determine the rejection and nonrejection regions.
The significance level is .025. The < sign in the alternative hypothesis indicates that the test is left-tailed. Therefore, the area in the left tail of the t distribution curve is α = .025. The degrees of freedom are

df = n − k − 1 = 12 − 2 − 1 = 9

From the t distribution table (Table V in Appendix C), the critical value of t for 9 degrees of freedom and .025 area in the left tail of the t distribution curve is −2.262, as shown in Figure 14.1.

Test Statistic for bi  The value of the test statistic t for bi is calculated as

t = (bi − Bi) / s_bi

The value of Bi is substituted from the null hypothesis. Usually, but not always, the null hypothesis is H0: Bi = 0. The MINITAB solution contains this value of the t statistic.
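This formula is a one-line computation; a quick numerical check using the b1 and s_b1 values from the MINITAB output:

```python
b1, s_b1 = -2.7473, 0.9770

# Under H0: B1 = 0, the observed test statistic is
t_obs = (b1 - 0) / s_b1
print(round(t_obs, 2))   # -2.81

# For a null value other than zero, e.g. H0: B1 = -2 (the follow-up note to Example 14-4),
# the same formula applies with that value substituted:
t_alt = (b1 - (-2)) / s_b1
print(round(t_alt, 3))   # -0.765
```

The first value matches the T column of the MINITAB output; the second must be computed by hand (or by script), since software reports the t statistic only for the null value zero.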


Step 4. Calculate the value of the test statistic and the p-value.

The value of the test statistic t for b1 can be obtained from the MINITAB solution given in Screen 14.3. This value is given in the column labeled T and the row named Driving Experience in the portion marked II in that MINITAB solution. Thus, the observed value of t is

    t = (b1 − B1) / s_b1 = (−2.7473 − 0) / .9770 = −2.81

Also, in the same portion of the MINITAB solution, the p-value for this test is given in the column labeled P and the row named Driving Experience. This p-value is .020. However, MINITAB always gives the p-value for a two-tailed test. Because our test is one-tailed, the p-value for our test is

    p-value = .020 / 2 = .010

Step 5. Make a decision.

The value of the test statistic, t = −2.81, is less than the critical value of t = −2.262, and it falls in the rejection region. Consequently, we reject the null hypothesis and conclude that the coefficient of x1 in regression model (3) is negative. That is, an increase in driving experience decreases the auto insurance premium.

Also, the p-value for the test is .010, which is less than the significance level of α = .025. Hence, based on this p-value also, we reject the null hypothesis and conclude that B1 is negative. ◼

Note that the observed value of t in Step 4 of Example 14-4 is obtained from the MINITAB solution only if the null hypothesis is H0: B1 = 0. However, if the null hypothesis states that B1 is equal to a number other than zero, then the t value obtained from the MINITAB solution is no longer valid. For example, suppose the null hypothesis in Example 14-4 is

    H0: B1 = −2

and the alternative hypothesis is

    H1: B1 < −2

In this case the observed value of t will be calculated as

    t = (b1 − B1) / s_b1 = (−2.7473 − (−2)) / .9770 = −.765

To calculate this value of t, the values of b1 and s_b1 are obtained from the MINITAB solution of Screen 14.3. The value of B1 is substituted from H0.

EXERCISES

CONCEPTS AND PROCEDURES

14.1 How are the coefficients of independent variables in a multiple regression model interpreted? Explain.

14.2 What are the degrees of freedom for a multiple regression model to make inferences about individual parameters?

14.5 Computer Solution of Multiple Regression 625

Figure 14.1  The t distribution curve with α = .025 in the left tail. The rejection region lies to the left of the critical value t = −2.262; the nonrejection region lies to its right.


14.3 What kinds of relationships among independent variables are permissible and which ones are not

permissible in a linear multiple regression model?

14.4 Explain the meaning of the coefficient of multiple determination and the adjusted coefficient of mul-

tiple determination for a multiple regression model. What is the difference between the two?

14.5 What are the assumptions of a multiple regression model?

14.6 The following table gives data on variables y, x1, x2, and x3.

    y    x1    x2    x3
    8    18    38    74
    11   26    25    64
    19   34    24    47
    21   38    44    31
    7    13    12    79
    23   49    48    35
    16   28    38    42
    27   59    52    18
    9    14    17    71
    13   21    39    57

Using MINITAB, estimate the regression model

    y = A + B1x1 + B2x2 + B3x3 + ε

Using the solution obtained, answer the following questions.

a. Write the estimated regression equation.
b. Explain the meaning of a, b1, b2, and b3 obtained by estimating the given regression model.
c. What are the values of the standard deviation of errors, the coefficient of multiple determination, and the adjusted coefficient of multiple determination?
d. What is the predicted value of y for x1 = 35, x2 = 40, and x3 = 65?
e. What is the point estimate of the expected (mean) value of y for all elements given that x1 = 40, x2 = 30, and x3 = 55?
f. Construct a 95% confidence interval for the coefficient of x3.
g. Using the 2.5% significance level, test whether or not the coefficient of x1 is positive.

14.7 The following table gives data on variables y, x1, and x2.

    y    x1    x2
    24   98    52
    14   51    69
    18   74    63
    31   108   35
    10   33    88
    29   119   54
    26   99    51
    33   141   31
    13   47    67
    27   103   41
    26   111   46

Using MINITAB, find the regression of y on x1 and x2. Using the solution obtained, answer the following questions.

a. Write the estimated regression equation.
b. Explain the meaning of the estimated regression coefficients of the independent variables.


c. What are the values of the standard deviation of errors, the coefficient of multiple determination,

and the adjusted coefficient of multiple determination?

d. What is the predicted value of y for x1 = 87 and x2 = 54?
e. What is the point estimate of the expected (mean) value of y for all elements given that x1 = 95 and x2 = 49?
f. Construct a 99% confidence interval for the coefficient of x1.
g. Using the 1% significance level, test if the coefficient of x2 in the population regression model is negative.

APPLICATIONS

14.8 The salaries of workers are expected to depend, among other factors, on the number of years they have spent in school and on their work experience. The following table gives information on the annual salaries (in thousands of dollars) of 12 persons, the number of years each of them spent in school, and the total number of years of their work experience.

Salary 52 44 48 77 68 48 59 83 28 61 27 69

Schooling 16 12 13 20 18 16 14 18 12 16 12 16

Experience 6 10 15 8 11 2 12 4 6 9 2 18

Using MINITAB, find the regression of salary on schooling and experience. Using the solution obtained,

answer the following questions.

a. Write the estimated regression equation.

b. Explain the meaning of the estimates of the constant term and the regression coefficients of inde-

pendent variables.

c. What are the values of the standard deviation of errors, the coefficient of multiple determination,

and the adjusted coefficient of multiple determination?

d. How much salary is a person with 18 years of schooling and 7 years of work experience expected

to earn?

e. What is the point estimate of the expected (mean) salary for all people with 16 years of schooling

and 10 years of work experience?

f. Determine a 99% confidence interval for the coefficient of schooling.

g. Using the 1% significance level, test whether or not the coefficient of experience is positive.

14.9 The CTO Corporation has a large number of chain restaurants throughout the United States. The

research department at the company wanted to find if the sales of restaurants depend on the size of the

population within a certain area surrounding the restaurants and the mean income of households in those

areas. The company collected information on these variables for 11 restaurants. The following table gives

information on the weekly sales (in thousands of dollars) of these restaurants, the population (in thou-

sands) within five miles of the restaurants, and the mean annual income (in thousands of dollars) of the

households for those areas.

Sales 19 29 17 21 14 30 33 22 18 27 24

Population 21 15 32 18 47 69 29 43 75 39 53

Income 58 69 49 52 67 76 81 46 39 64 28

Using MINITAB, find the regression of sales on population and income. Using the solution obtained, answer

the following questions.

a. Write the estimated regression equation.

b. Explain the meaning of the estimates of the constant term and the regression coefficients of pop-

ulation and income.

c. What are the values of the standard deviation of errors, the coefficient of multiple determination,

and the adjusted coefficient of multiple determination?

d. What are the predicted sales for a restaurant with 50 thousand people living within a five-mile area surrounding it and a $55 thousand mean annual income of households in that area?

e. What is the point estimate of the expected (mean) sales for all restaurants with 45 thousand people

living within a five-mile area surrounding them and $46 thousand mean annual income of house-

holds living in those areas?

f. Determine a 95% confidence interval for the coefficient of income.
g. Using the 1% significance level, test whether or not the coefficient of population is different from zero.
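Exercises 14.6 through 14.9 call for MINITAB. As a rough stand-in, the least squares estimates for a model with two independent variables can be computed by solving the normal equations (X'X)b = X'y. The sketch below is ours, and the data are invented for illustration; they do not come from any exercise table:

```python
# Least squares fit of y = A + B1*x1 + B2*x2 + e via the normal equations.

def fit_two_predictors(x1, x2, y):
    """Return (a, b1, b2) minimizing the sum of squared errors."""
    n = len(y)
    X = [[1.0, u, v] for u, v in zip(x1, x2)]
    # Build X'X (3x3) and X'y (3x1).
    xtx = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(3)]
           for r in range(3)]
    xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(3)]
    # Solve the 3x3 system by Gauss-Jordan elimination with partial pivoting.
    m = [row[:] + [b] for row, b in zip(xtx, xty)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return tuple(m[r][3] / m[r][r] for r in range(3))

# Illustrative data generated from y = 2 + 3*x1 - 1*x2 with no error term,
# so the fit should recover those coefficients almost exactly.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [5, 3, 8, 1, 7, 2]
y = [2 + 3*u - v for u, v in zip(x1, x2)]
a, b1, b2 = fit_two_predictors(x1, x2, y)
print(round(a, 6), round(b1, 6), round(b2, 6))   # 2.0 3.0 -1.0
```

Because the toy data contain no random error, the fitted coefficients match the generating equation; with real exercise data they would differ from the true B values by sampling error.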



Glossary

Adjusted coefficient of multiple determination  Denoted by R̄², it gives the proportion of SST that is explained by the multiple regression model, adjusted for the degrees of freedom.

Coefficient of multiple determination  Denoted by R², it gives the proportion of SST that is explained by the multiple regression model.

First-order multiple regression model  A regression model in which each term contains a single independent variable raised to the first power.

Least squares regression model  The estimated regression model obtained by minimizing the sum of squared errors.

Multicollinearity  The situation in which two or more independent variables in a regression model are highly correlated.

Multiple regression model  A regression model that contains two or more independent variables.

Partial regression coefficients  The coefficients of independent variables in a multiple regression model; each of them gives the effect of the corresponding independent variable on the dependent variable when all other independent variables are held constant.

USES AND MISUSES... ADDITIVE VERSUS MULTIPLICATIVE EFFECT

A first-order multiple regression model with (quantitative) independent variables is one of the simpler types of multiple regression models. However, this model has many limitations. A major limitation is that the independent variables have an additive effect on the dependent variable. What does additive mean here? Suppose we have the following estimated regression equation:

    ŷ = 4 + 6x1 + 3x2

From this estimated regression equation, if x1 increases by 1 unit (with x2 held constant), our predicted value of y increases by 6 units. If x2 increases by 1 unit (with x1 held constant), our predicted value of y increases by 3 units. But what happens if x1 and x2 both increase by 1 unit each? From this equation, our predicted value of y will increase by 6 + 3 = 9 units. The total increase in ŷ is simply the sum of the two increases. This change in ŷ does not depend on the values of x1 and x2 prior to the increase. Since the total increase in the dependent variable is equal to the sum of the increases from the two individual parts (independent variables), we say that the effect is additive.

Now suppose we have the following equation:

    ŷ = 4 + 6x1 + 3x2 + 5x1²x2

The important difference in this case is that the increase in the value of ŷ is no longer constant when x1 and x2 both increase by 1 unit each. Instead, it depends on the original values of x1 and x2. For example, consider the values of x1 and x2, and the changes in the value of ŷ, shown in the following table.

    x1    x2    ŷ     Change in ŷ (versus x1 = 2 and x2 = 3)
    2     3     85    —
    3     3     166   81
    2     4     108   23
    3     4     214   129

Unlike the previous example, here the total increase in ŷ is not equal to the sum of the increases from the individual parts. In this case, the effect is said to be multiplicative. It is important to recognize that the effect is multiplicative when the total increase does not equal the sum of the increases of the independent variables.

Pharmaceutical companies are always looking for multiplicative effects when creating new drugs. In many cases, a combination of two drugs might have a multiplicative effect on a certain condition. Simply stated, the two drugs provide greater relief when you take them together than if you take them separately so that only one drug is in your system at any time. Of course, the companies also have to look for multiplicative effects when it comes to side effects. Individual drugs may not have major side effects when taken separately, but could cause greater harm when taken together. One of the most noteworthy examples of this was the drug Fen-Phen, which was a combination of two drugs, Fenfluramine and Phentermine. Each of these two drugs had been approved for short-term (individual) control of obesity. However, the drugs used in combination became popular for long-term weight loss. Unfortunately, the combination, when associated with long-term use, resulted in severe side effects that were detailed in the following statement from the Food and Drug Administration in 1997:

    Thanks to the reporting of health care professionals, as of August 22, FDA has received reports of 82 cases (including Mayo's 24 cases) of cardiac valvular disease in patients—two of whom were men—on combination fenfluramine and phentermine. These reports have been from 23 different states. Severity of the cardiac valvular disease was graded as moderate or severe in over three-fourths of the cases, and two of the reports described deterioration from no detectable heart murmur to need for a valve replacement within one-and-a-half years. Sixteen of these 82 patients required surgery to repair their heart valves. At least one of these patients died following surgery to repair the valves. (The agency's findings, as of July 31, are described in more detail in the current issue of The New England Journal of Medicine, which also carries the Mayo study.) Source: http://www.fda.gov/cder/news/phen/fenphenupdate.htm
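A quick way to see the difference is to evaluate both estimated equations from the discussion above, ŷ = 4 + 6x1 + 3x2 and ŷ = 4 + 6x1 + 3x2 + 5x1²x2, at a few points (a sketch; the function names are of our choosing):

```python
# Additive vs. multiplicative effect of the two independent variables.

def y_additive(x1, x2):
    return 4 + 6*x1 + 3*x2

def y_interaction(x1, x2):
    return 4 + 6*x1 + 3*x2 + 5*x1**2*x2

# Additive model: a 1-unit increase in both x1 and x2 always adds 6 + 3 = 9,
# whatever the starting point.
print(y_additive(3, 4) - y_additive(2, 3))    # 9
print(y_additive(8, 9) - y_additive(7, 8))    # 9

# Interaction model: the change depends on the starting values of x1 and x2.
base = y_interaction(2, 3)                    # 85
print(y_interaction(3, 3) - base)             # 81
print(y_interaction(2, 4) - base)             # 23
print(y_interaction(3, 4) - base)             # 129
```

The second set of outputs reproduces the table in the discussion: the total change from increasing both variables (129) is not the sum of the two one-variable changes (81 + 23 = 104).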


Standard deviation of errors  Also called the standard deviation of estimate, it is a measure of the variation among errors.

SSE (error sum of squares)  The sum of the squared differences between the actual and predicted values of y. It is the portion of SST that is not explained by the regression model.

SSR (regression sum of squares)  The portion of SST that is explained by the regression model.

SST (total sum of squares)  The sum of squared differences between the actual y values and ȳ.
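The glossary quantities SST, SSE, SSR, R², and R̄² can be computed directly from a set of actual and predicted y values; a minimal sketch with made-up numbers (n = 6 observations, k = 2 independent variables):

```python
# R^2 and adjusted R^2 for a multiple regression with k predictors.

def r_squared(y, y_hat, k):
    n = len(y)
    y_bar = sum(y) / n
    sst = sum((v - y_bar) ** 2 for v in y)              # total sum of squares
    sse = sum((v - h) ** 2 for v, h in zip(y, y_hat))   # error sum of squares
    ssr = sst - sse                                     # explained portion
    r2 = ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)       # df adjustment
    return r2, adj_r2

# Toy numbers, not from the chapter:
y     = [10, 12, 15, 11, 14, 16]
y_hat = [10.5, 11.5, 14.0, 11.5, 14.5, 16.0]
r2, adj_r2 = r_squared(y, y_hat, k=2)
print(round(r2, 3), round(adj_r2, 3))   # 0.929 0.881
```

Note that the adjustment always pulls R̄² below R², and with a small n and a poor fit R̄² can even turn negative, which is the point of Self-Review question 3.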

Self-Review Test

1. When using the t distribution to make inferences about a single parameter, the degrees of freedom for a multiple regression model with k independent variables and a sample size of n are equal to
a. n − k − 1   b. n − k + 1   c. n + k − 1

2. The value of R² is always in the range
a. zero to 1   b. −1 to 1   c. −1 to zero

3. The value of the adjusted coefficient of multiple determination R̄² is
a. always positive   b. always nonnegative   c. can be positive, zero, or negative

4. What is the difference between the population multiple regression model and the estimated multiple regression model?

5. Why are the regression coefficients in a multiple regression model called the partial regression coefficients?

6. What is the difference between R² and R̄²? Explain.

7. A real estate expert wanted to find the relationship between the sale price of houses and various

characteristics of the houses. She collected data on four variables, recorded in the table, for 13 houses that

were sold recently. The four variables are

Price Sale price of a house in thousands of dollars

Lot size Size of the lot in acres

Living area Living area in square feet

Age Age of a house in years

Price Lot Size Living Area Age

455 1.4 2500 8

278 .9 2250 12

463 1.8 2900 5

327 .7 1800 9

505 2.6 3200 4

264 1.2 2400 28

445 2.1 2700 9

346 1.1 2050 13

487 2.8 2850 7

289 1.6 2400 16

434 3.2 2600 5

411 1.7 2300 8

223 .5 1700 19

Using MINITAB, find the regression of price on lot size, living area, and age. Using the solution ob-

tained, answer the following questions.

a. Indicate whether you expect a positive or a negative relationship between the dependent variable

and each of the independent variables.

b. Write the estimated regression equation. Are the signs of the coefficients of independent vari-

ables obtained in the solution consistent with your expectations of part a?

c. Explain the meaning of the estimated regression coefficients of all independent variables.

d. What are the values of the standard deviation of errors, the coefficient of multiple determina-

tion, and the adjusted coefficient of multiple determination?



e. What is the predicted sale price of a house that has a lot size of 2.5 acres, a living area of 3000

square feet, and is 14 years old?

f. What is the point estimate of the mean sale price of all houses that have a lot size of 2.2 acres, a

living area of 2500 square feet, and are 7 years old?

g. Determine a 99% confidence interval for each of the coefficients of the independent variables.

h. Construct a 98% confidence interval for the constant term in the population regression model.

i. Using the 1% significance level, test whether or not the coefficient of lot size is positive.

j. At the 2.5% significance level, test if the coefficient of living area is positive.

k. At the 5% significance level, test if the coefficient of age is negative.

Mini-Project 14-1

Refer to the McDonald's data set explained in Appendix B and given on the Web site of this text. Use MINITAB to estimate the following regression model for that data set:

    y = A + B1x1 + B2x2 + B3x3 + ε

where

    y = calories
    x1 = fat (measured in grams)
    x2 = carbohydrate (measured in grams)
    x3 = protein (measured in grams)

Now research on the Internet or in a book to find the number of calories in one gram of fat, one gram of carbohydrate, and one gram of protein.

a. Based on the information you obtain, write what the estimated regression equation should be.
b. Are the differences between your expectation in part a and the regression equation that you obtained from MINITAB small or large?
c. Since each gram of fat is worth a specific number of calories, and the same is true for a gram of carbohydrate and a gram of protein, one would expect the predicted and observed values of y to be the same for each food item, but that is not the case. The quantities of fat, carbohydrates, and protein are reported in whole numbers. Explain why this causes the differences discussed in part b.

DECIDE FOR YOURSELF

Dummy Variables

In Sanford & Son, a very popular TV show of the 1970s, Fred Sanford would often refer to other people as big dummies. So, if a statistics professor questions your work and mentions a dummy in the process, should you be offended? Obviously, context will help you to answer that question, but if the professor is referring to a dummy variable, then do not take it personally.

A dummy variable is the name given to a categorical independent variable used in a multiple regression model. The simplest version occurs when there are only two categories. In this case, we assign a value of 0 to one category and 1 to the other category of the variable.

Suppose you have the following first-order regression equation to predict the amount of tar inhaled (y) by smoking a cigarette based on the amount of tar in the cigarette (x1) and the presence of a filter (x2):

    ŷ = .94x1 − .45x2

Note that here x2 = 0 implies that a cigarette does not have a filter and x2 = 1 means that a filter exists.

Answer the following questions.

1. Does the presence of a filter increase or decrease the tar consumption? What part of the regression equation tells you this?

2. On average, what percentage of the tar in a cigarette is consumed if the cigarette is unfiltered? What if the cigarette is filtered?

3. Draw a graph of the above regression equation. (Hint: The graph consists of two different regression lines with two variables, not a plane.)
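A minimal sketch of the filter dummy at work, assuming the estimated equation as printed (ŷ = .94x1 − .45x2) and an illustrative tar content of 10 units:

```python
# Dummy-variable prediction: x2 = 1 for a filtered cigarette, 0 otherwise.

def tar_inhaled(x1, x2):
    """Predicted tar inhaled; x1 = tar in the cigarette, x2 = filter dummy."""
    return .94 * x1 - .45 * x2

# For a cigarette containing 10 units of tar:
print(round(tar_inhaled(10, 0), 2))   # unfiltered
print(round(tar_inhaled(10, 1), 2))   # filtered: .45 units less
```

Holding x2 at 0 or 1 gives the two parallel lines of question 3's hint: the same slope .94 in x1, with the filtered line shifted down by .45.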
