Chapter 12: Multiple Regression and Model Building
McClave: Statistics, 11th ed. Chapter 12: Multiple Regression and Model Building
Where We’ve Been
Introduced the straight-line model relating a dependent variable y to an independent variable x
Estimated the parameters of the straight-line model using least squares
Assessed the model estimates
Used the model to estimate a value of y given x
Where We’re Going
Introduce a multiple-regression model to relate a variable y to two or more x variables
Present multiple regression models with both quantitative and qualitative independent variables
Assess how well the multiple regression model fits the sample data
Show how analyzing the model residuals can help detect problems with the model and the necessary modifications
12.1: Multiple Regression Models
The General Multiple Regression Model
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$
where $y$ is the dependent variable, $x_1, x_2, \ldots, x_k$ are the independent variables, $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$ is the deterministic portion of the model, and $\beta_i$ determines the contribution of the independent variable $x_i$, which may be a quantitative variable of order one or higher or a qualitative variable.
Analyzing a Multiple-Regression Model
Step 1: Hypothesize the deterministic portion of the model by choosing the independent variables $x_1, x_2, \ldots, x_k$.
Step 2: Estimate the unknown parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$.
Step 3: Specify the probability distribution of the random error $\varepsilon$ and estimate the standard deviation $\sigma$ of this distribution.
Step 4: Check that the assumptions about $\varepsilon$ are satisfied; if not, make the required modifications to the model.
Step 5: Statistically evaluate the usefulness of the model.
Step 6: If the model is useful, use it for prediction, estimation and other purposes.
Assumptions about the Random Error $\varepsilon$
1. The mean is equal to 0.
2. The variance is equal to $\sigma^2$.
3. The probability distribution is a normal distribution.
4. Random errors are independent of one another.
12.2: The First-Order Model: Estimating and Making Inferences about the Parameters
A First-Order Model in Five Quantitative Independent Variables
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5$$
where $x_1, x_2, \ldots, x_5$ are all quantitative variables that are not functions of other independent variables.
The parameters are estimated by finding the values for the $\hat{\beta}$'s that minimize
$$SSE = \sum (y - \hat{y})^2.$$
Only a truly talented mathematician (or geek) would choose to solve the necessary system of simultaneous linear equations by hand. In practice, computers are left to do the complicated calculations required by multiple regression models.
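To make the "system of simultaneous linear equations" concrete, here is a minimal Python sketch of what the software does: it builds the normal equations and solves them for the estimates that minimize SSE. The data values and the helper names (`solve`, `fit_least_squares`, `sse`) are made up for illustration and are not taken from the text.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest entry of this column onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_least_squares(rows, ys):
    """Return [b0, b1, ..., bk] minimizing SSE; each row holds (x1, ..., xk)."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)


def sse(rows, ys, betas):
    """Sum of squared errors for the given coefficient values."""
    total = 0.0
    for r, y in zip(rows, ys):
        y_hat = betas[0] + sum(b * x for b, x in zip(betas[1:], r))
        total += (y - y_hat) ** 2
    return total


# Hypothetical data: two independent variables, six observations.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
ys = [4.1, 5.8, 5.2, 7.9, 6.8, 10.1]

betas = fit_least_squares(rows, ys)
best = sse(rows, ys, betas)
# Moving any estimate away from the least squares value increases SSE.
nudged = sse(rows, ys, [betas[0] + 0.1] + betas[1:])
print(betas, best, nudged)
```

Statistical packages use more numerically careful methods than this, but the quantity being minimized is the same SSE.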
A collector of antique clocks hypothesizes that the auction price can be modeled as
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$
where
$y$ = auction price in dollars
$x_1$ = age of clock in years
$x_2$ = number of bidders.
Based on the data in Table 12.1, the least squares prediction equation, the equation that minimizes SSE, is
$$\hat{y} = -1{,}339 + 12.74x_1 + 85.95x_2$$
$$SSE = 516{,}727$$
$$s^2 = \frac{SSE}{n - (k+1)} = \frac{516{,}727}{29} = 17{,}818$$
$$s = 133.5 \quad \text{(the estimate for } \sigma\text{)}$$
The estimate for $\beta_1$ is interpreted as the expected change in y given a one-unit change in $x_1$, holding $x_2$ constant.
The estimate for $\beta_2$ is interpreted as the expected change in y given a one-unit change in $x_2$, holding $x_1$ constant.
Since it makes no sense to sell a clock of age 0 at an auction with no bidders, the intercept term has no meaningful interpretation in this example.
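The arithmetic behind the estimate of σ for the clock model can be checked in a few lines. This is a sketch only; the sample size n = 32 is an assumption inferred from the denominator 29 = n − (k + 1) with k = 2.

```python
import math

sse = 516_727  # SSE reported for the clock model
k = 2          # independent variables: age and number of bidders
n = 32         # assumed: implied by n - (k + 1) = 29

s_squared = sse / (n - (k + 1))  # estimate of the error variance
s = math.sqrt(s_squared)         # estimate of sigma

print(round(s_squared), round(s, 1))
```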
Test of an Individual Parameter Coefficient in the Multiple Regression Model
One-Tailed Test
$H_0: \beta_i = 0$
$H_a: \beta_i < 0$ (or $H_a: \beta_i > 0$)
Rejection region: $t < -t_\alpha$ (or $t > t_\alpha$ when $H_a: \beta_i > 0$)
Two-Tailed Test
$H_0: \beta_i = 0$
$H_a: \beta_i \neq 0$
Rejection region: $|t| > t_{\alpha/2}$
Test statistic:
$$t = \frac{\hat{\beta}_i}{s_{\hat{\beta}_i}}$$
where $t_\alpha$ and $t_{\alpha/2}$ are based on $n - (k + 1)$ degrees of freedom, $n$ = number of observations, and $k + 1$ = number of $\beta$ parameters in the model.
Test of the Parameter Coefficient on the Number of Bidders
$H_0: \beta_2 = 0$
$H_a: \beta_2 > 0$
Rejection region: $t > t_{.05} = 1.699$
Test statistic:
$$t^* = \frac{\hat{\beta}_2}{s_{\hat{\beta}_2}} = \frac{85.953}{8.729} = 9.85$$
Since t* = 9.85 exceeds the critical value 1.699, reject the null hypothesis.
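The decision rule for this one-tailed test can be written out as a short sketch. The estimate, its standard error, and the critical value 1.699 are the figures quoted on the slide; the variable names are illustrative.

```python
beta2_hat = 85.953  # estimated coefficient on number of bidders
s_beta2 = 8.729     # standard error of that estimate
t_critical = 1.699  # t(.05) with 29 degrees of freedom, as quoted above

t_star = beta2_hat / s_beta2          # test statistic
reject_null = t_star > t_critical     # upper-tailed rejection region

print(round(t_star, 2), reject_null)
```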
A $100(1-\alpha)\%$ Confidence Interval for a $\beta$ Parameter
$$\hat{\beta}_i \pm t_{\alpha/2}\, s_{\hat{\beta}_i}$$
where $t_{\alpha/2}$ is based on $n - (k + 1)$ degrees of freedom, $n$ = number of observations, and $k + 1$ = number of $\beta$ parameters in the model. Valid inferences about $\beta_i$ also require that the four assumptions about $\varepsilon$ are satisfied.
A $100(1-\alpha)\%$ Confidence Interval for $\beta_1$ (with $\alpha = .10$)
$$\hat{\beta}_1 \pm t_{\alpha/2}\, s_{\hat{\beta}_1} = \hat{\beta}_1 \pm t_{.05}\, s_{\hat{\beta}_1} = 12.74 \pm 1.699(.905) = 12.74 \pm 1.54$$
Holding the number of bidders constant, the result above tells us that we can be 90% confident that the mean auction price rises by between $11.20 and $14.28 for each 1-year increase in age.
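The interval endpoints follow directly from the estimate, its standard error, and the critical value. A minimal sketch using the figures quoted for the age coefficient:

```python
beta1_hat = 12.74   # estimated coefficient on age
s_beta1 = 0.905     # standard error of that estimate
t_critical = 1.699  # t(.05) with 29 degrees of freedom, for a 90% interval

half_width = t_critical * s_beta1
lower = beta1_hat - half_width
upper = beta1_hat + half_width

print(round(half_width, 2), round(lower, 2), round(upper, 2))
```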
12.3: Evaluating Overall Model Utility
Reject $H_0$ for $\beta_i$: there is evidence of a linear relationship between y and $x_i$.
Do not reject $H_0$ for $\beta_i$: one of three explanations applies.
1. There may be no relationship between y and $x_i$.
2. A Type II error occurred.
3. The relationship between y and $x_i$ is more complex than a straight-line relationship.
The multiple coefficient of determination, $R^2$, measures how much of the overall variation in y is explained by the least squares prediction equation.
$$R^2 = 1 - \frac{SSE}{SS_{yy}} = \frac{SS_{yy} - SSE}{SS_{yy}} = \frac{\text{Explained variability}}{\text{Total variability}}$$
High values of R2 suggest a good model, but the usefulness of R2 falls as the number of observations becomes close to the number of parameters estimated.
The Adjusted Multiple Coefficient of Determination
$$R_a^2 = 1 - \left[\frac{n-1}{n-(k+1)}\right]\left(\frac{SSE}{SS_{yy}}\right) = 1 - \left[\frac{n-1}{n-(k+1)}\right]\left(1 - R^2\right)$$
$R_a^2$ adjusts for the number of observations and the number of parameter estimates. It will always have a value no greater than $R^2$.
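As an illustration, both coefficients can be computed for the antique clock model. This sketch rests on two assumptions that are not stated in this form in the slides: n = 32 (inferred from the 29 error degrees of freedom) and a total sum of squares rebuilt as k × MS(Model) + SSE from the mean square 2,141,531 that this chapter reports for the same model.

```python
def r_squared(sse, ss_yy):
    """Multiple coefficient of determination."""
    return 1 - sse / ss_yy


def adjusted_r_squared(sse, ss_yy, n, k):
    """R-squared adjusted for sample size n and number of x-terms k."""
    return 1 - ((n - 1) / (n - (k + 1))) * (sse / ss_yy)


sse = 516_727
k = 2
n = 32                        # assumed, see lead-in
ss_model = k * 2_141_531      # MS(Model) times its degrees of freedom
ss_yy = ss_model + sse        # total variability (reconstructed)

r2 = r_squared(sse, ss_yy)
r2_adj = adjusted_r_squared(sse, ss_yy, n, k)
print(round(r2, 3), round(r2_adj, 3))
```

The adjusted value comes out slightly below the unadjusted one, as the formula guarantees whenever k ≥ 1.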
The Analysis-of-Variance F-Test
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
$H_a$: At least one $\beta_i \neq 0$
Test statistic:
$$F = \frac{(SS_{yy} - SSE)/k}{SSE/[n-(k+1)]} = \frac{R^2/k}{(1-R^2)/[n-(k+1)]} = \frac{\text{Mean square (Model)}}{\text{Mean square (Error)}}$$
where $n$ is the sample size and $k$ is the number of terms in the model.
Rejection region: $F > F_\alpha$, with $k$ numerator and $n - (k + 1)$ denominator degrees of freedom.
Rejecting the null hypothesis means that something in your model helps explain variations in y, but it may be that another model provides more reliable estimates and predictions.
A collector of antique clocks hypothesizes that the auction price can be modeled as
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$
where $y$ = auction price in dollars, $x_1$ = age of clock in years, and $x_2$ = number of bidders.
$H_0: \beta_1 = \beta_2 = 0$
$H_a$: At least one of the two coefficients is nonzero
Test statistic:
$$F = \frac{MS(\text{Model})}{MSE} = \frac{2{,}141{,}531}{17{,}818} = 120.19$$
p-value: less than .00001
Something in the model is useful, but the F-test can’t tell us which x-variables are individually useful.
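The two equivalent forms of the global F statistic can be compared numerically for the clock model. In this sketch n = 32 is an assumption inferred from the 29 error degrees of freedom, and R² is rebuilt from the sums of squares implied by the quoted mean squares.

```python
k = 2
n = 32                      # assumed: n - (k + 1) = 29
ms_model = 2_141_531
mse = 17_818                # rounded mean square error from the slide
sse = 516_727

# Form 1: ratio of mean squares.
f_from_mean_squares = ms_model / mse

# Form 2: the same statistic written in terms of R-squared.
ss_yy = k * ms_model + sse
r2 = 1 - sse / ss_yy
f_from_r2 = (r2 / k) / ((1 - r2) / (n - (k + 1)))

print(round(f_from_mean_squares, 2), round(f_from_r2, 2))
```

The two results differ only in the third decimal place, because the slide's MSE of 17,818 is itself rounded.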
Checking the Utility of a Multiple-Regression Model
1. Use the F-test to conduct a test of the adequacy of the overall model.
2. Conduct t-tests on the “most important” $\beta$ parameters.
3. Examine $R_a^2$ and $2s$ to evaluate how well the model fits the data.
12.4: Using the Model for Estimation and Prediction
The model of antique clock prices can be used to predict sale prices for clocks of a certain age with a particular number of bidders.
What is the mean sale price for all 150-year-old clocks with 10 bidders?
What is the auction sale price for a single 150-year-old clock with 10 bidders?
The average value of all clocks with these characteristics can be estimated by using statistical software to generate a confidence interval; the price of a single clock is predicted with a prediction interval. (See Figure 12.7)
In this case, the prediction interval indicates that we can be 95% confident that the price of a single 150-year-old clock sold at auction with 10 bidders will be between $1,154.10 and $1,709.30.
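The point prediction at the center of that interval can be sketched from the fitted equation. The negative intercept is an assumption here: the minus sign does not survive in the slide text, but it is the only sign that places the prediction inside the reported interval.

```python
def predict_price(age, bidders):
    """Fitted clock model; the negative intercept is assumed (see lead-in)."""
    return -1339 + 12.74 * age + 85.95 * bidders


y_hat = predict_price(150, 10)
lower, upper = 1154.10, 1709.30   # 95% interval quoted above
midpoint = (lower + upper) / 2

print(round(y_hat, 1), round(midpoint, 1))
```

The prediction and the interval midpoint agree to within rounding of the coefficients.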
What is the sale price for a single 50-year-old clock with 2 bidders?
Since 50 years-of-age and 2 bidders are both outside of the range of values in our data set, any prediction using these values would be unreliable.
12.5: Model Building: Interaction Models
In some cases, the impact of an independent variable xi on y will depend on the value of some other independent variable xk.
Interaction models include the cross-products of independent variables as well as the first-order values.
An Interaction Model Relating E(y) to Two Quantitative Independent Variables
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$$
where $(\beta_1 + \beta_3 x_2)$ represents the change in $E(y)$ for every one-unit change in $x_1$, holding $x_2$ fixed, and $(\beta_2 + \beta_3 x_1)$ represents the change in $E(y)$ for every one-unit change in $x_2$, holding $x_1$ fixed.
In the antique clock auction example, assume the collector has reason to believe that the impact of age (x1) on price (y) varies with the number of bidders (x2) .
The model is now
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon.$$
The Global F-Test:
$H_0: \beta_1 = \beta_2 = \beta_3 = 0$
The test statistic is F = 193.04; the p-value is reported as 0.
Reject the null hypothesis.
The t-Test on the Interaction Parameter:
$H_0: \beta_3 = 0$
The test statistic is t = 6.11; the two-tailed p-value is reported as 0 (so the one-tailed p-value, 0/2, is also 0).
Reject the null hypothesis.
The MINITAB results are reported in Figure 12.11 in the text.
The estimated model is
$$\hat{y} = 320.5 + 0.878x_1 - 93.26x_2 + 1.2978x_1 x_2$$
To estimate the change in the price of a 150-year-old clock given a one-unit change in $x_2$, we must include the interaction term:
$$\text{Estimated } x_2 \text{ slope} = \hat{\beta}_2 + \hat{\beta}_3 x_1 = -93.26 + 1.30(150) = 101.74$$
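Because of the interaction term, the effect of one more bidder depends on the clock's age. The sketch below evaluates that slope at several ages, using the coefficient rounded to 1.30 as in the calculation above; the function name is illustrative.

```python
BETA2_HAT = -93.26   # coefficient on number of bidders
BETA3_HAT = 1.30     # interaction coefficient, rounded as on the slide


def bidder_slope(age):
    """Estimated change in price per additional bidder at a given age."""
    return BETA2_HAT + BETA3_HAT * age


slope_at_150 = bidder_slope(150)
for age in (120, 150, 180):
    print(age, round(bidder_slope(age), 2))
```

The slope grows with age, which is what a positive interaction coefficient means: extra bidders raise the price of older clocks by more.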
Once the interaction term has passed the t-test, it is unnecessary to test the individual independent variables.
12.6: Model Building: Quadratic and Other Higher Order Models
A quadratic (second-order) model includes the square of an independent variable:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon.$$
This allows more complex relationships to be modeled.
$\beta_1$ is the shift parameter and $\beta_2$ is the rate of curvature.
Example 12.7 considers whether home size (x) impacts electrical usage (y) in a positive but decreasing way.
The MINITAB results are shown in Figure 12.13.
According to the results, the equation that minimizes SSE for the 10 observations is
$$\hat{y} = -1{,}216.14 + 2.3989x - .00045x^2$$
$$R_a^2 = .9767$$
Since 0 is not in the range of the independent variable (a house of 0 ft²?), the estimated intercept is not meaningful.
The positive estimate on $\beta_1$ indicates a positive relationship, although the slope is not constant (we’ve estimated a curve, not a straight line).
The negative value on $\beta_2$ indicates the rate of increase in power usage declines for larger homes.
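The declining rate of increase can be seen by differentiating the fitted quadratic: the slope at home size x is the linear coefficient plus twice the quadratic coefficient times x. The sketch below uses the two estimates quoted above; the home sizes are hypothetical values chosen for illustration.

```python
BETA1_HAT = 2.3989     # linear coefficient
BETA2_HAT = -0.00045   # quadratic coefficient (negative, as noted above)


def usage_slope(size_sq_ft):
    """Estimated change in usage per extra square foot at a given home size."""
    return BETA1_HAT + 2 * BETA2_HAT * size_sq_ft


# Size at which the fitted curve stops rising (slope equals zero).
peak_size = -BETA1_HAT / (2 * BETA2_HAT)

for size in (1500, 2000, 2500):   # hypothetical home sizes
    print(size, round(usage_slope(size), 4))
print(round(peak_size))
```

Beyond the peak the fitted curve would turn downward, which is one reason to avoid predicting outside the range of the sampled home sizes.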
The Global F-Test
$H_0: \beta_1 = \beta_2 = 0$
$H_a$: At least one of the coefficients ≠ 0
The test statistic is F = 189.71, p-value near 0. Reject $H_0$.
t-Test of $\beta_2$
$H_0: \beta_2 = 0$
$H_a: \beta_2 < 0$
The test statistic is t = -7.62, p-value = .0001 (two-tailed). The one-tailed p-value is .0001/2 = .00005. Reject the null hypothesis.
Complete Second-Order Model with Two Quantitative Independent Variables
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$$
$\beta_0$: y-intercept
$\beta_1$, $\beta_2$: changing $\beta_1$ and $\beta_2$ causes the surface to shift along the $x_1$ and $x_2$ axes
$\beta_3$: controls the rotation of the surface
$\beta_4$, $\beta_5$: signs and values of these parameters control the type of surface and the rates of curvature
12.7: Model Building: Qualitative (Dummy) Variable Models
Qualitative variables can be included in regression models through the use of dummy variables.
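Designate one category as the base level, and create a separate dummy variable for each of the other categories: the dummy variable equals 1 when the observation falls in that category and 0 otherwise.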
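A Qualitative Independent Variable with k Levels
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{k-1} x_{k-1} + \varepsilon$$
where $x_i$ is the dummy variable for level $i + 1$ and
$x_i = 1$ if $y$ is observed at level $i + 1$, 0 otherwise.
$\mu_A = \beta_0$ and $\beta_1 = \mu_B - \mu_A$
$\mu_B = \beta_0 + \beta_1$ and $\beta_2 = \mu_C - \mu_A$
$\mu_C = \beta_0 + \beta_2$ and $\beta_3 = \mu_D - \mu_A$
In general, for a level $j$ other than the base level A, $\mu_j = \beta_0 + \beta_j$ and $\beta_j = \mu_j - \mu_A$.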
For the golf ball example from Chapter 10, there were four levels (the brands). Testing differences in brands can be done with the model
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$$
where
$x_1$ = 1 if Brand B, 0 otherwise
$x_2$ = 1 if Brand C, 0 otherwise
$x_3$ = 1 if Brand D, 0 otherwise
Brand A is the base level, so $\beta_0$ represents the mean distance ($\mu_A$) for Brand A, and
$\beta_1 = \mu_B - \mu_A$
$\beta_2 = \mu_C - \mu_A$
$\beta_3 = \mu_D - \mu_A$
Testing that the four means are equal is equivalent to testing the significance of the $\beta$s:
$H_0: \beta_1 = \beta_2 = \beta_3 = 0$
$H_a$: At least one of the $\beta$s ≠ 0
The test statistic is the F-statistic. Here F = 43.99, with a p-value reported as .000. Hence we reject the null hypothesis that the golf balls all have the same mean driving distance.
Remember that the maximum number of dummy variables is one less than the number of levels for the qualitative variable.
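The dummy-variable coding for the four brands can be sketched as follows. The mean distances are hypothetical numbers chosen only to show how the coefficients relate to the group means; they are not the Chapter 10 data.

```python
LEVELS = ["A", "B", "C", "D"]   # Brand A is the base level


def dummies(brand):
    """Return (x1, x2, x3): one 0/1 indicator per non-base brand."""
    return [1 if brand == level else 0 for level in LEVELS[1:]]


# Hypothetical mean driving distances, for illustration only.
means = {"A": 250.0, "B": 260.0, "C": 270.0, "D": 255.0}

beta0 = means["A"]                                        # base-level mean
betas = [means[level] - means["A"] for level in LEVELS[1:]]  # differences from A


def expected_distance(brand):
    """E(y) = beta0 + beta1*x1 + beta2*x2 + beta3*x3 for the given brand."""
    return beta0 + sum(b * x for b, x in zip(betas, dummies(brand)))


for brand in LEVELS:
    print(brand, dummies(brand), expected_distance(brand))
```

Four levels need only three dummy variables: Brand A is identified by all three being 0.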
12.8: Model Building: Models with Both Quantitative and Qualitative Variables
Suppose a first-order model is used to evaluate the impact on mean monthly sales of expenditures in three advertising media: television, radio and newspaper.
Expenditure, x1, is a quantitative variable.
Types of media, x2 and x3, are qualitative variables (the number of dummy variables is limited to one less than the number of levels).
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3$$
where
$x_1$ = advertising expenditure
$x_2$ = 1 if radio, 0 otherwise
$x_3$ = 1 if television, 0 otherwise
Newspaper is the base level.
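$$E(y) = \beta_0 + \underbrace{\beta_1 x_1}_{\text{Main effect, advertising expenditure}} + \underbrace{\beta_2 x_2 + \beta_3 x_3}_{\text{Main effects, type of medium}} + \underbrace{\beta_4 x_1 x_2 + \beta_5 x_1 x_3}_{\text{Interaction}}$$
Newspaper medium line: $E(y) = \beta_0 + \beta_1 x_1$ (intercept $\beta_0$, slope $\beta_1$)
Radio medium line: $E(y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1$ (intercept $\beta_0 + \beta_2$, slope $\beta_1 + \beta_4$)
Television medium line: $E(y) = (\beta_0 + \beta_3) + (\beta_1 + \beta_5) x_1$ (intercept $\beta_0 + \beta_3$, slope $\beta_1 + \beta_5$)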
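The three medium-specific lines fall out of the single model by setting the dummy variables. The sketch below does this with hypothetical coefficient values (the slides give no estimates for this model); `medium_line` is an illustrative helper name.

```python
def medium_line(betas, medium):
    """Return (intercept, slope) of the sales line for one advertising medium."""
    b0, b1, b2, b3, b4, b5 = betas
    x2 = 1 if medium == "radio" else 0        # dummy for radio
    x3 = 1 if medium == "television" else 0   # dummy for television
    intercept = b0 + b2 * x2 + b3 * x3
    slope = b1 + b4 * x2 + b5 * x3
    return intercept, slope


# Hypothetical values for beta0..beta5, for illustration only.
betas = (10.0, 2.0, 3.0, 5.0, 0.5, 1.0)

for medium in ("newspaper", "radio", "television"):
    print(medium, medium_line(betas, medium))
```

With both dummies at 0 the base-level (newspaper) line is recovered, and the interaction coefficients are what let the slopes differ across media.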
12.8: Model Building: Models with Both Quantitative and Qualitative Variables
Suppose now a second-order model is used to evaluate the impact of expenditures in the three advertising media on sales.
The relationship between expenditures, x1, and sales, y, is assumed to be curvilinear.
In this model, each medium is assumed to have the same impact on sales.
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2$$
where $x_1$ = advertising expenditure
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 x_3$$
where
$x_1$ = advertising expenditure
$x_2$ = 1 if radio, 0 otherwise
$x_3$ = 1 if television, 0 otherwise
Newspaper is the base level.
In this model, the intercepts differ but the shapes of the curves are the same.
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 x_3 + \beta_5 x_1 x_2 + \beta_6 x_1 x_3 + \beta_7 x_1^2 x_2 + \beta_8 x_1^2 x_3$$
In this model, the response curve for each media type is different; that is, advertising expenditure and media type interact, at varying rates.
12.9: Model Building: Comparing Nested Models
Two models are nested if one model contains all the terms of the second model and at least one additional term. The more complex of the two models is called the complete model and the simpler of the two is called the reduced model.
Recall the interaction model relating the auction price (y) of antique clocks to age (x1) and bidders (x2):
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2.$$
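If the relationship is not constant, a second-order model should be considered:
$$E(y) = \underbrace{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2}_{\text{Reduced model}} + \beta_4 x_1^2 + \beta_5 x_2^2.$$
The full expression, including the two quadratic terms, is the complete model.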
If the complete model produces a better fit, then the $\beta$s on the quadratic terms should be significant.
$H_0: \beta_4 = \beta_5 = 0$
$H_a$: At least one of $\beta_4$ and $\beta_5$ is nonzero
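F-Test for Comparing Nested Models
Reduced model: $E(y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_g x_g$
Complete model: $E(y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_g x_g + \beta_{g+1} x_{g+1} + \cdots + \beta_k x_k$
$H_0: \beta_{g+1} = \beta_{g+2} = \cdots = \beta_k = 0$
$H_a$: At least one of the $\beta$ parameters in $H_0$ is nonzero
Test statistic:
$$F = \frac{(SSE_R - SSE_C)/(k - g)}{SSE_C/[n - (k + 1)]} = \frac{(SSE_R - SSE_C)/\#\beta\text{s in } H_0}{MSE_C}$$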
In the F-test for comparing nested models:
$SSE_R$ = sum of squared errors for the reduced model
$SSE_C$ = sum of squared errors for the complete model
$MSE_C$ = mean square error ($s^2$) for the complete model
$k - g$ = number of $\beta$ parameters specified in $H_0$
$k + 1$ = number of $\beta$ parameters in the complete model
$n$ = sample size
Rejection region: $F > F_\alpha$, with $k - g$ numerator and $n - (k + 1)$ denominator degrees of freedom.
The growth of carnations (y) is assumed to be a function of the temperature (x1) and the amount of fertilizer (x2).
The data are shown in Table 12.6 in the text.
The complete second-order model is
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$$
The least squares prediction equation from Table 12.6 is, rounded,
$$\hat{y} = -5{,}127.90 + 31.10x_1 + 139.75x_2 - .146x_1 x_2 - .133x_1^2 - 1.14x_2^2$$
To test the significance of the contribution of the interaction and second-order terms, use
$H_0: \beta_3 = \beta_4 = \beta_5 = 0$
$H_a$: At least one of $\beta_3$, $\beta_4$ or $\beta_5$ ≠ 0
This requires estimating the reduced model, that is, the complete model with the parameters in the null hypothesis dropped. Results are given in Figure 12.31.
$H_0: \beta_3 = \beta_4 = \beta_5 = 0$
$H_a$: At least one of the $\beta$ parameters in $H_0$ is nonzero
Test statistic:
$$F = \frac{(SSE_R - SSE_C)/(k - g)}{SSE_C/[n - (k + 1)]} = \frac{(SSE_R - SSE_C)/\#\beta\text{s in } H_0}{MSE_C} = \frac{(6{,}671.50852 - 59.17832)/3}{2.81802} = 782.15$$
Rejection region: $F > F_{.05} = 3.07$
Since F = 782.15 exceeds the critical value 3.07, reject the null hypothesis: the complete model seems to provide better predictions than the reduced model.
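The nested-model F statistic for the carnation example can be reproduced from the three quantities quoted above. This is a minimal sketch; `nested_f` is an illustrative helper, not a function from any statistics package.

```python
def nested_f(sse_reduced, sse_complete, num_tested, mse_complete):
    """F statistic for H0: the num_tested extra betas are all zero."""
    drop_in_sse = sse_reduced - sse_complete   # improvement from the extra terms
    return (drop_in_sse / num_tested) / mse_complete


f_stat = nested_f(
    sse_reduced=6671.50852,    # first-order (reduced) model
    sse_complete=59.17832,     # complete second-order model
    num_tested=3,              # beta3, beta4, beta5
    mse_complete=2.81802,
)
f_critical = 3.07              # F(.05) as quoted above
reject_null = f_stat > f_critical

print(round(f_stat, 2), reject_null)
```

A large F means the extra terms reduce SSE by far more than chance alone would explain.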
A parsimonious model is a general linear model with a small number of parameters. In situations where two competing models have essentially the same predictive power (as determined by an F-test), choose the more parsimonious of the two.
If the models are not nested, the choice is more subjective, based on $R_a^2$, $s$, and an understanding of the theory behind the model.
12.10: Model Building: Stepwise Regression
It is often unclear which independent variables have a significant impact on y.
Screening variables in an attempt to identify the most important ones is known as stepwise regression.
Stepwise regression must be used with caution.
Many t-tests are conducted, leading to high probabilities of Type I or Type II errors.
Usually, no interaction or higher-order terms are considered – and reality may not be that simple.
12.11: Residual Analysis: Checking the Regression Assumptions
Regression analysis is based on the four assumptions about the random error considered earlier.
1. The mean is equal to 0.
2. The variance is equal to $\sigma^2$.
3. The probability distribution is a normal distribution.
4. Random errors are independent of one another.
If these assumptions are not valid, the results of the regression estimation are called into question.
Checking the validity of the assumptions involves analyzing the residuals of the regression.
A regression residual is defined as the difference between an observed y-value and its corresponding predicted value:
$$\hat{\varepsilon} = (y - \hat{y}) = y - (\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k)$$
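The residual definition can be sketched in a few lines. This is a minimal illustration with hypothetical coefficient estimates and data; `predict` and `residuals` are names chosen here, not functions from any statistics package.

```python
def predict(betas, xs):
    """Predicted value b0 + b1*x1 + ... + bk*xk.

    betas = [b0, b1, ..., bk], xs = [x1, ..., xk].
    """
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))


def residuals(betas, rows, ys):
    """Observed y minus predicted y for each observation."""
    return [y - predict(betas, xs) for xs, y in zip(rows, ys)]


# Hypothetical estimates and observations (two independent variables).
betas = [1.0, 2.0, -0.5]
rows = [(1, 2), (2, 0), (3, 4)]
ys = [2.5, 5.0, 4.0]
print(residuals(betas, rows, ys))  # [0.5, 0.0, -1.0]
```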
Properties of the Regression Residuals
1. The mean of the residuals is equal to 0:
$$\sum(\text{Residuals}) = \sum (y - \hat{y}) = 0$$
2. The standard deviation of the residuals is equal to the standard deviation of the fitted regression model:
$$s^2 = \frac{\sum(\text{Residuals})^2}{n-(k+1)} = \frac{SSE}{n-(k+1)} = MSE$$
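Both residual properties can be checked numerically. The sketch below fits a straight line (k = 1) by least squares to a small illustrative data set, then reports the sum of the residuals, SSE, and s. The data and function names are assumptions made for this example.

```python
import math


def fit_line(xs, ys):
    """Least squares estimates (b0, b1) for the straight-line model."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residual_summary(xs, ys):
    """Return (sum of residuals, SSE, s) for the fitted straight line."""
    k = 1  # one independent variable
    b0, b1 = fit_line(xs, ys)
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in resid)
    s = math.sqrt(sse / (len(xs) - (k + 1)))
    return sum(resid), sse, s


total, sse, s = residual_summary([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(f"sum of residuals = {total:.6f}, SSE = {sse:.2f}, s = {s:.4f}")
```

The sum of the residuals comes out as zero up to floating-point rounding, which is a property of any least squares fit that includes an intercept.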
If the model is misspecified, the mean of $\varepsilon$ will not equal 0. Residual analysis may reveal this problem. The home-size electricity usage example illustrates this.
The plot of the first-order model shows a curvilinear residual pattern …
while the quadratic model shows a more random pattern.
A pattern in the residual plot may indicate a problem with the model.
A residual larger than 3s (in absolute value) is considered an outlier. Outliers will have an undue influence on the estimates. An outlier may result from:
1. Mistakenly recorded data
2. An observation that is for some reason truly different from the others
3. Random chance
Leaving in the data set an outlier that should be removed (causes 1 and 2 above) will produce misleading estimates and predictions.
So will removing an outlier that actually belongs in the data set (cause 3 above).
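The 3s rule for flagging outliers can be sketched as below. The residuals and the value of s are hypothetical, and the function only flags observations for inspection; whether a flagged point should be removed still depends on its cause.

```python
def flag_outliers(residuals, s, threshold=3.0):
    """Indices of residuals whose absolute value exceeds threshold * s."""
    return [i for i, e in enumerate(residuals) if abs(e) > threshold * s]


# Hypothetical residuals with s = 2.0, so the cutoff is |residual| > 6.0.
resid = [0.5, -1.2, 7.5, 0.3, -0.8]
print(flag_outliers(resid, s=2.0))  # [2]
```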
Residual plots should be centered on 0 and within ±3s of 0.
Residual histograms should be relatively bell-shaped.
Residual normal probability plots should display straight lines.
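The coordinates of a normal probability plot can be computed with the standard library alone. This sketch pairs the ordered residuals with expected standard-normal quantiles; it assumes Blom's plotting positions (i - 0.375)/(n + 0.25), which is one common convention and not necessarily the one used for the plots in the slides. The residual values are made up.

```python
from statistics import NormalDist


def normal_scores(n):
    """Expected standard-normal quantiles for n ordered observations,
    using Blom's plotting positions (an assumption of this sketch)."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((i - 0.375) / (n + 0.25))
            for i in range(1, n + 1)]


def probability_plot_points(residuals):
    """(normal score, ordered residual) pairs for a normal probability plot."""
    ordered = sorted(residuals)
    return list(zip(normal_scores(len(ordered)), ordered))


for score, value in probability_plot_points([0.3, -1.0, 0.1, 1.2, -0.4]):
    print(f"{score:7.3f}  {value:5.1f}")
```

If the errors are approximately normal, these points should fall close to a straight line when plotted.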
Slight departures from normality will not seriously harm the validity of the estimates, but as the departure from normality grows, the validity falls.
REGRESSION ANALYSIS IS ROBUST WITH RESPECT TO (SMALL) NONNORMAL ERRORS.
If the variance of $\varepsilon$ changes as $y$ changes, the constant variance assumption is violated.
A first-order model is used to relate the salaries (y) of social workers to years of experience (x).
$$E(y) = \beta_0 + \beta_1 x$$
$$\hat{y} = 11{,}368.72 + 2141.38x$$
$$R^2 = .787, \quad t = 13.31, \quad p\text{-value} \approx 0$$
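As a small worked example, the fitted first-order model can be used to estimate mean salary for a given number of years of experience. The sketch assumes the fitted line reads ŷ = 11,368.72 + 2141.38x; the function name and the chosen x value are illustrative.

```python
B0 = 11368.72  # estimated intercept
B1 = 2141.38   # estimated change in salary per additional year of experience


def predicted_salary(years_experience):
    """Estimated mean salary from the fitted first-order model."""
    return B0 + B1 * years_experience


print(f"{predicted_salary(10):,.2f}")  # 32,782.52
```

Such a prediction is only reasonable for values of x inside the range of experience observed in the sample.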
The model seems to provide good predictions, but the residual plot reveals a non-random pattern:
The residual increases as the estimated mean salary increases, violating the constant variance assumption
Transforming the dependent variable often stabilizes the residual variance. Possible transformations of y: natural logarithm, square root, $\sin^{-1} y^{1/2}$.
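The three transformations of y listed above can be written as short helper functions. This is a sketch with hypothetical function names. Note the domain restrictions that follow from the mathematics: the logarithm needs y > 0, the square root needs y >= 0, and the arcsine square root needs 0 <= y <= 1, so it applies to proportions.

```python
import math


def log_transform(y):
    """Natural logarithm; requires y > 0."""
    return math.log(y)


def sqrt_transform(y):
    """Square root; requires y >= 0."""
    return math.sqrt(y)


def arcsine_sqrt_transform(y):
    """Arcsine of the square root; requires 0 <= y <= 1 (a proportion)."""
    return math.asin(math.sqrt(y))


# Hypothetical salaries: refit the model with ln(y) in place of y.
salaries = [21000.0, 34500.0, 52000.0, 78000.0]
log_salaries = [log_transform(y) for y in salaries]
print([round(v, 3) for v in log_salaries])
```

After transforming, the model is refitted with the transformed values as the dependent variable and the residual plot is examined again.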
12.12: Some Pitfalls: Estimability, Multicollinearity and Extrapolation
Problem 1: Parameter Estimability
If x does not take on a sufficient number of different values, no single unique line can be estimated.
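A quick estimability check can be sketched as below. It assumes the standard requirement that the number of distinct observed x values must be at least one more than the order of the polynomial in x being fitted (two distinct values for a straight line, three for a quadratic). The function name is hypothetical.

```python
def can_fit_polynomial(xs, order):
    """True if x takes enough distinct values to estimate a polynomial
    of the given order in x (order 1 = straight line)."""
    return len(set(xs)) >= order + 1


print(can_fit_polynomial([3, 3, 3, 3], order=1))  # False: only one x value
print(can_fit_polynomial([1, 2, 1, 2], order=1))  # True
print(can_fit_polynomial([1, 2, 1, 2], order=2))  # False: a quadratic needs 3 levels
```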
Problem 2: Multicollinearity
Multicollinearity exists when two or more of the independent variables in a regression are correlated.
If $x_i$ and $x_j$ move together in some way, finding the impact on $y$ of a one-unit change in either of them holding the other constant will be difficult or impossible.
Multicollinearity can be detected in various ways. A simple check is to calculate the correlation coefficients ($r_{ij}$) for each pair of independent variables in the model. Any significant $r_{ij}$ may indicate a multicollinearity problem.
If severe multicollinearity exists, the result may be
1. Significant F-values but insignificant t-values
2. Signs on $\hat{\beta}$s opposite to those expected
3. Errors in estimates, standard errors, etc.
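The pairwise correlation check can be sketched with the standard library. The data below are made up so that x1 and x2 move almost in lockstep; `pearson_r` and `pairwise_correlations` are names chosen for this illustration.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Sample coefficient of correlation between two variables."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


def pairwise_correlations(columns):
    """Correlation for every pair of independent variables.

    columns maps a variable name to its list of observed values.
    """
    return {(a, b): pearson_r(columns[a], columns[b])
            for a, b in combinations(columns, 2)}


# Hypothetical data: x2 is roughly twice x1, x3 is unrelated.
data = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.8],
    "x3": [5, 3, 6, 2, 4],
}
for pair, r in pairwise_correlations(data).items():
    print(pair, round(r, 3))
```

A correlation near 1 in absolute value, as for x1 and x2 here, is the kind of result that would prompt a closer look for multicollinearity.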
The Federal Trade Commission (FTC) ranks cigarettes according to their tar ($x_1$), nicotine ($x_2$), weight in grams ($x_3$) and carbon monoxide ($y$) content.
25 data points (see Table 12.11) are used to estimate the model
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3.$$
The fitted model (see Figure 12.49) is
$$\hat{y} = 3.202 + .963x_1 - 2.63x_2 - 0.13x_3$$
$F = 78.98$, p-value < .0001
$t_1 = 3.97$, p-value = .0007
$t_2 = -0.67$, p-value = .5072
$t_3 = -0.03$, p-value = .9735
The negative signs on two of the estimated coefficients and the insignificant t-values are suggestive of multicollinearity.
The coefficients of correlation, $r_{ij}$, provide further evidence:
$r_{\text{tar, nicotine}} = .9766$
$r_{\text{tar, weight}} = .4908$
$r_{\text{weight, nicotine}} = .5002$
Each $r_{ij}$ is significantly different from 0 at the $\alpha = .05$ level.
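The significance claim can be checked with the usual t statistic for a correlation coefficient, $t = r\sqrt{n-2}/\sqrt{1-r^2}$, with n - 2 degrees of freedom. The sketch below uses the three correlations and n = 25 from the example; the two-tailed critical value 2.069 for 23 degrees of freedom is taken from a standard t table and is an input to the sketch, not something the code computes.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


N = 25              # data points in the cigarette example
T_CRITICAL = 2.069  # two-tailed, alpha = .05, 23 df (from a t table)

correlations = {
    ("tar", "nicotine"): 0.9766,
    ("tar", "weight"): 0.4908,
    ("weight", "nicotine"): 0.5002,
}
results = {pair: correlation_t_statistic(r, N)
           for pair, r in correlations.items()}
for pair, t in results.items():
    verdict = "significant" if abs(t) > T_CRITICAL else "not significant"
    print(pair, round(t, 2), verdict)
```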
Possible Responses to Problems Created by Multicollinearity in Regression
1. Drop one or more correlated independent variables from the model.
2. If all the xs are retained,
a. Avoid making inferences about the individual $\beta$ parameters from the t-tests.
b. Restrict inferences about E(y) and future y values to values of the xs that fall within the range of the sample data.
Problem 3: Extrapolation
The data used to estimate the model provide information only on the range of values in the data set. There is no reason to assume that the dependent variable’s response will be the same over a different range of values.
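A simple guard against extrapolation is to check a new set of x values against the ranges seen in the sample before predicting. This is a minimal sketch with made-up data and a hypothetical function name. It checks each variable separately, which is necessary but not sufficient: a combination of x values can lie outside the region covered by the data even when each value is inside its own range.

```python
def within_sample_range(sample_rows, new_row):
    """True if every value in new_row lies between the sample minimum and
    maximum of the corresponding independent variable."""
    columns = list(zip(*sample_rows))
    return all(min(col) <= value <= max(col)
               for col, value in zip(columns, new_row))


# Hypothetical sample of (x1, x2) pairs.
sample = [(1.0, 10.0), (2.5, 14.0), (4.0, 12.0), (3.0, 18.0)]
print(within_sample_range(sample, (2.0, 15.0)))  # True
print(within_sample_range(sample, (5.0, 15.0)))  # False: x1 exceeds 4.0
```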
Problem 4: Correlated Errors
If the error terms are not independent (a frequent problem in time series), the model tests and prediction intervals are invalid. Special techniques are used to deal with time series models.