Chapter 11: Simple Linear Regression Analysis

Chapter Outline
11.1 The Simple Linear Regression Model
11.2 The Least Squares Estimates, and Point Estimation and Prediction
11.3 Model Assumptions and the Standard Error
11.4 Testing the Significance of the Slope and y-Intercept
11.5 Confidence and Prediction Intervals
11.6 Simple Coefficients of Determination and Correlation
11.7 An F Test for the Model
*11.8 Residual Analysis
*11.9 Some Shortcut Formulas
*Optional section

Managers often make decisions by studying the relationships between variables, and process improvements can often be made by understanding how changes in one or more variables affect the process output. Regression analysis is a statistical technique in which we use observed data to relate a variable of interest, which is called the dependent (or response) variable, to one or more independent (or predictor) variables. The objective is to build a regression model, or prediction equation, that can be used to describe, predict, and control the dependent variable on the basis of the independent variables. For example, a company might wish to improve its marketing process. After collecting data concerning the demand for a product, the product's price, and the advertising expenditures made to promote the product, the company might use regression analysis to develop an equation to predict demand on the basis of price and advertising expenditure.

Predictions of demand for various combinations of price and advertising expenditure can then be used to evaluate potential changes in the company's marketing strategies. As another example, a manufacturer might use regression analysis to describe the relationship between several input variables and an important output variable. Understanding the relationships between these variables would allow the manufacturer to identify control variables that can be used to improve the process performance. In the next two chapters we give a thorough presentation of regression analysis. We begin in this chapter by presenting simple linear regression analysis. Using this technique is appropriate when we are relating a dependent variable to a single independent variable and when a straight-line model describes the relationship between these two variables. We explain many of the methods of this chapter in the context of two new cases:

The Fuel Consumption Case: A management consulting firm uses simple linear regression analysis to predict the weekly amount of fuel (in millions of cubic feet of natural gas) that will be required to heat the homes and businesses in a small city on the basis of the week's average hourly temperature. A natural gas company uses these predictions to improve its gas ordering process. One of the gas company's objectives is to reduce the fines imposed by its pipeline transmission system when the company places inaccurate natural gas orders.

The QHIC Case: The marketing department at Quality Home Improvement Center (QHIC) uses simple linear regression analysis to predict home upkeep expenditure on the basis of home value. Predictions of home upkeep expenditures are used to help determine which homes should be sent advertising brochures promoting QHIC's products and services.

11.1 The Simple Linear Regression Model

The simple linear regression model assumes that the relationship between the dependent variable, which is denoted y, and the independent variable, denoted x, can be approximated by a straight line. We can tentatively decide whether there is an approximate straight-line relationship between y and x by making a scatter diagram, or scatter plot, of y versus x. First, data concerning the two variables are observed in pairs. To construct the scatter plot, each value of y is plotted against its corresponding value of x. If the y values tend to increase or decrease in a straight-line fashion as the x values increase, and if there is a scattering of the (x, y) points around the straight line, then it is reasonable to describe the relationship between y and x by using the simple linear regression model. We illustrate this in the following case study, which shows how regression analysis can help a natural gas company improve its gas ordering process.


EXAMPLE 11.1 The Fuel Consumption Case: Reducing Natural Gas Transmission Fines

When the natural gas industry was deregulated in 1993, natural gas companies became responsible for acquiring the natural gas needed to heat the homes and businesses in the cities they serve. To do this, natural gas companies purchase natural gas from marketers (usually through long-term contracts) and periodically (daily, weekly, monthly, or the like) place orders for natural gas to be transmitted by pipeline transmission systems to their cities. There are hundreds of pipeline transmission systems in the United States, and many of these systems supply a large number of cities.



For instance, the map on pages 448 and 449 illustrates the pipelines of, and the cities served by, the Columbia Gas System. To place an order (called a nomination) for an amount of natural gas to be transmitted to its city over a period of time (day, week, month), a natural gas company makes its best prediction of the city's natural gas needs for that period. The natural gas company then instructs its marketer(s) to deliver this amount of gas to its pipeline transmission system. If most of the natural gas companies being supplied by the transmission system can predict their cities' natural gas needs with reasonable accuracy, then the overnominations of some companies will tend to cancel the undernominations of other companies. As a result, the transmission system will probably have enough natural gas to efficiently meet the needs of the cities it supplies.

In order to encourage natural gas companies to make accurate transmission nominations and to help control costs, pipeline transmission systems charge, in addition to their usual fees, transmission fines. A natural gas company is charged a transmission fine if it substantially undernominates natural gas, which can lead to an excessive number of unplanned transmissions, or if it substantially overnominates natural gas, which can lead to excessive storage of unused gas. Typically, pipeline transmission systems allow a certain percentage nomination error before they impose a fine. For example, some systems do not impose a fine unless the actual amount of natural gas used by a city differs from the nomination by more than 10 percent. Beyond the allowed percentage nomination error, fines are charged on a sliding scale: the larger the nomination error, the larger the transmission fine. Furthermore, some transmission systems evaluate nomination errors and assess fines more often than others. For instance, some transmission systems do this as frequently as daily, while others do this weekly or monthly (this frequency depends on the number of storage fields to which the transmission system has access, the system's accounting practices, and other factors). In any case, each natural gas company needs a way to accurately predict its city's natural gas needs so it can make accurate transmission nominations.

Suppose we are analysts in a management consulting firm. The natural gas company serving a small city has hired the consulting firm to develop an accurate way to predict the amount of fuel (in millions of cubic feet, MMcf, of natural gas) that will be required to heat the city. Because the pipeline transmission system supplying the city evaluates nomination errors and assesses fines weekly, the natural gas company wants predictions of future weekly fuel consumptions.¹ Moreover, since the pipeline transmission system allows a 10 percent nomination error before assessing a fine, the natural gas company would like the actual and predicted weekly fuel consumptions to differ by no more than 10 percent. Our experience suggests that weekly fuel consumption substantially depends on the average hourly temperature (in degrees Fahrenheit) measured in the city during the week. Therefore, we will try to predict the dependent (response) variable weekly fuel consumption (y) on the basis of the independent (predictor) variable average hourly temperature (x) during the week. To this end, we observe values of y and x for eight weeks. The data are given in Table 11.1. In Figure 11.1 we give an Excel output of a scatter plot of y versus x. This plot shows

1 A tendency for the fuel consumption to decrease in a straight-line fashion as the temperatures increase.
2 A scattering of points around the straight line.

A regression model describing the relationship between y and x must represent these two characteristics. We now develop such a model.² We begin by considering a specific average hourly temperature x. For example, consider the average hourly temperature 28°F, which was observed in week 1, or consider the average hourly temperature 45.9°F, which was observed in week 5 (there is nothing special about these two average hourly temperatures, but we will use them throughout this example to help explain the idea of a regression model). For the specific average hourly temperature x that we consider, there are, in theory, many weeks that could have this temperature.

¹ For whatever period of time a transmission system evaluates nomination errors and charges fines, a natural gas company is free to actually make nominations more frequently. Sometimes this is a good strategy, but we will not discuss it further.

² Generally, the larger the sample size is (that is, the more combinations of values of y and x that we have observed), the more accurately we can describe the relationship between y and x. Therefore, as the natural gas company observes values of y and x in future weeks, the new data should be added to the data in Table 11.1.


TABLE 11.1 The Fuel Consumption Data (FuelCon1)

Week   Average Hourly Temperature, x (°F)   Weekly Fuel Consumption, y (MMcf)
 1     28.0                                 12.4
 2     28.0                                 11.7
 3     32.5                                 12.4
 4     39.0                                 10.8
 5     45.9                                  9.4
 6     57.8                                  9.5
 7     58.1                                  8.0
 8     62.5                                  7.5

FIGURE 11.1 Excel Output of a Scatter Plot of y (FUELCONS) versus x (TEMP)
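A plot like Figure 11.1 can be reproduced from Table 11.1 with a few lines of code. The following is a minimal Python sketch assuming matplotlib is available; it is our illustration, not part of the original example (which uses Excel).

```python
import matplotlib.pyplot as plt

# Fuel consumption data from Table 11.1
temp = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]   # x, average hourly temperature (deg F)
fuelcons = [12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5]   # y, weekly fuel consumption (MMcf)

plt.scatter(temp, fuelcons)        # plot each y value against its corresponding x value
plt.xlabel("TEMP")
plt.ylabel("FUELCONS")
plt.title("Scatter plot of y versus x")
plt.show()
```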

However, although these weeks each have the same average hourly temperature, other factors that affect fuel consumption could vary from week to week. For example, these weeks might have different average hourly wind velocities, different thermostat settings, and so forth. Therefore, the weeks could have different fuel consumptions. It follows that there is a population of weekly fuel consumptions that could be observed when the average hourly temperature is x. Furthermore, this population has a mean, which we denote as μy|x (pronounced "mu of y given x"). We can represent the straight-line tendency we observe in Figure 11.1 by assuming that μy|x is related to x by the equation

μy|x = β0 + β1x

This equation is the equation of a straight line with y-intercept β0 (pronounced "beta zero") and slope β1 (pronounced "beta one"). To better understand the straight line and the meanings of β0 and β1, we must first realize that the values of β0 and β1 determine the precise value of the mean weekly fuel consumption μy|x that corresponds to a given value of the average hourly temperature x. We cannot know the true values of β0 and β1, and in the next section we learn how to estimate these values. However, for illustrative purposes, let us suppose that the true value of β0 is 15.77 and the true value of β1 is -.1281. It would then follow, for example, that the mean of the population of all weekly fuel consumptions that could be observed when the average hourly temperature is 28°F is

μy|28 = β0 + β1(28) = 15.77 - .1281(28) = 12.18 MMcf of natural gas

As another example, it would also follow that the mean of the population of all weekly fuel consumptions that could be observed when the average hourly temperature is 45.9°F is

μy|45.9 = β0 + β1(45.9) = 15.77 - .1281(45.9) = 9.89 MMcf of natural gas

Note that, as the average hourly temperature increases from 28°F to 45.9°F, mean weekly fuel consumption decreases from 12.18 MMcf to 9.89 MMcf of natural gas. This makes sense because we would expect to use less fuel if the average hourly temperature increases. Of course, because we do not know the true values of β0 and β1, we cannot actually calculate these mean weekly fuel consumptions. However, when we learn in the next section how to estimate β0 and β1, we will then be able to estimate the mean weekly fuel consumptions. For now, when we say that the equation μy|x = β0 + β1x is the equation of a straight line, we mean that the different mean weekly fuel consumptions that correspond to different average hourly temperatures lie exactly on a straight line. For example, consider the eight mean weekly fuel consumptions that correspond to the eight average hourly temperatures in Table 11.1.
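To see numerically that these eight means lie exactly on a line, the line of means can be evaluated at each observed temperature. This is a minimal Python sketch using the illustrative values β0 = 15.77 and β1 = -.1281 assumed above; it is not part of the original text.

```python
beta0, beta1 = 15.77, -0.1281          # illustrative "true" parameter values used in this example
temps = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]

for x in temps:
    mean_y = beta0 + beta1 * x         # mu_{y|x}, the mean weekly fuel consumption at temperature x
    print(f"x = {x:4.1f}  mean of y given x = {mean_y:.2f} MMcf")
```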


[Map on pages 448 and 449: the pipelines of, and the communities served by, the Columbia Gas System, including its transmission systems, storage fields, distribution service territory, and the Cove Point Terminal. Source: Columbia Gas System 1995 Annual Report. Reprinted courtesy of Columbia Gas System.]


FIGURE 11.2 The Simple Linear Regression Model Relating Weekly Fuel Consumption (y) to Average Hourly Temperature (x)
(a) The line of means and the error terms: the observed fuel consumptions (for example, y = 12.4 in the first week, a positive error term, and y = 9.4 in the fifth week, a negative error term) deviate from the mean weekly fuel consumptions on the straight line defined by the equation μy|x = β0 + β1x.
(b) The slope of the line of means: β1 is the change in mean weekly fuel consumption that is associated with a one-degree increase in average hourly temperature.
(c) The y-intercept of the line of means: β0 is the mean weekly fuel consumption when the average hourly temperature is 0°F.

In Figure 11.2(a) we depict these mean weekly fuel consumptions as triangles that lie exactly on the straight line defined by the equation μy|x = β0 + β1x. Furthermore, in this figure we draw arrows pointing to the triangles that represent the previously discussed means μy|28 and μy|45.9. Sometimes we refer to the straight line defined by the equation μy|x = β0 + β1x as the line of means.

In order to interpret the slope β1 of the line of means, consider two different weeks. Suppose that for the first week the average hourly temperature is c. The mean weekly fuel consumption for all such weeks is

β0 + β1(c)

For the second week, suppose that the average hourly temperature is (c + 1). The mean weekly fuel consumption for all such weeks is

β0 + β1(c + 1)

It is easy to see that the difference between these mean weekly fuel consumptions is β1. Thus, as illustrated in Figure 11.2(b), the slope β1 is the change in mean weekly fuel consumption that is associated with a one-degree increase in average hourly temperature.
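Writing out the subtraction makes the last statement explicit:

[β0 + β1(c + 1)] - [β0 + β1(c)] = β1(c + 1) - β1(c) = β1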


To interpret the meaning of the y-intercept β0, consider a week having an average hourly temperature of 0°F. The mean weekly fuel consumption for all such weeks is

β0 + β1(0) = β0

Therefore, as illustrated in Figure 11.2(c), the y-intercept β0 is the mean weekly fuel consumption when the average hourly temperature is 0°F. However, because we have not observed any weeks with temperatures near 0, we have no data to tell us what the relationship between mean weekly fuel consumption and average hourly temperature looks like for temperatures near 0. Therefore, the interpretation of β0 is of dubious practical value. More will be said about this later.

Now recall that the observed weekly fuel consumptions are not exactly on a straight line. Rather, they are scattered around a straight line. To represent this phenomenon, we use the simple linear regression model

y = μy|x + ε = β0 + β1x + ε

This model says that the weekly fuel consumption y observed when the average hourly temperature is x differs from the mean weekly fuel consumption μy|x by an amount equal to ε (pronounced "epsilon"). Here ε is called an error term. The error term describes the effect on y of all factors other than the average hourly temperature. Such factors would include the average hourly wind velocity and the average hourly thermostat setting in the city. For example, Figure 11.2(a) shows that the error term for the first week is positive. Therefore, the observed fuel consumption y = 12.4 in the first week was above the corresponding mean weekly fuel consumption for all weeks when x = 28. As another example, Figure 11.2(a) also shows that the error term for the fifth week was negative. Therefore, the observed fuel consumption y = 9.4 in the fifth week was below the corresponding mean weekly fuel consumption for all weeks when x = 45.9. More generally, Figure 11.2(a) illustrates that the simple linear regression model says that the eight observed fuel consumptions (the dots in the figure) deviate from the eight mean fuel consumptions (the triangles in the figure) by amounts equal to the error terms (the line segments in the figure). Of course, since we do not know the true values of β0 and β1, the relative positions of the quantities pictured in the figure are only hypothetical.

With the fuel consumption example as background, we are ready to define the simple linear regression model relating the dependent variable y to the independent variable x. We suppose that we have gathered n observations; each observation consists of an observed value of x and its corresponding value of y. Then:

The Simple Linear Regression Model

The simple linear (or straight line) regression model is:

y = μy|x + ε = β0 + β1x + ε

Here
1 μy|x = β0 + β1x is the mean value of the dependent variable y when the value of the independent variable is x.
2 β0 is the y-intercept. β0 is the mean value of y when x equals 0.³
3 β1 is the slope. β1 is the change (amount of increase or decrease) in the mean value of y associated with a one-unit increase in x. If β1 is positive, the mean value of y increases as x increases. If β1 is negative, the mean value of y decreases as x increases.
4 ε is an error term that describes the effects on y of all factors other than the value of the independent variable x.
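As an illustration of how this model generates data, the following minimal Python sketch (not part of the text) simulates weekly fuel consumptions: each simulated y is the mean β0 + β1x plus a random error term ε. The parameter values echo the illustrative values used earlier, and the spread of the error terms is assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

beta0, beta1 = 15.77, -0.1281        # illustrative regression parameters (assumed)
sigma = 0.5                           # assumed spread of the error terms

temps = np.array([28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5])
errors = rng.normal(0.0, sigma, size=temps.size)   # error terms with mean 0
y = beta0 + beta1 * temps + errors                 # simple linear regression model: y = beta0 + beta1*x + error

for x_i, y_i in zip(temps, y):
    print(f"x = {x_i:4.1f}  simulated y = {y_i:5.2f}")
```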

This model is illustrated in Figure 11.3 (note that x0 in this figure denotes a specific value of the independent variable x). The y-intercept β0 and the slope β1 are called regression parameters.

³ As implied by the discussion of Example 11.1, if we have not observed any values of x near 0, this interpretation is of dubious practical value.


FIGURE 11.3 The Simple Linear Regression Model (Here β1 > 0). The figure shows the straight line defined by the equation μy|x = β0 + β1x, its y-intercept β0 and its slope β1 (the change in the mean value of y associated with a one-unit change in x), the mean value of y when x equals x0, an observed value of y when x equals x0, and the corresponding error term. Here x0 denotes a specific value of the independent variable x.

Because we do not know the true values of these parameters, we must use the sample data to estimate these values. We see how this is done in the next section. In later sections we show how to use these estimates to predict y.

The fuel consumption data in Table 11.1 were observed sequentially over time (in eight consecutive weeks). When data are observed in time sequence, the data are called time series data. Many applications of regression utilize such data. Another frequently used type of data is called cross-sectional data. This kind of data is observed at a single point in time.

EXAMPLE 11.2 The QHIC Case


Quality Home Improvement Center (QHIC) operates five stores in a large metropolitan area. The marketing department at QHIC wishes to study the relationship between x, home value (in thousands of dollars), and y, yearly expenditure on home upkeep (in dollars). A random sample of 40 homeowners is taken and asked to estimate their expenditures during the previous year on the types of home upkeep products and services offered by QHIC. Public records of the county auditor are used to obtain the previous year's assessed values of the homeowners' homes. The resulting x and y values are given in Table 11.2. Because the 40 observations are for the same year (for different homes), these data are cross-sectional. The MINITAB output of a scatter plot of y versus x is given in Figure 11.4. We see that the observed values of y tend to increase in a straight-line (or slightly curved) fashion as x increases. Assuming that μy|x and x have a straight-line relationship, it is reasonable to relate y to x by using the simple linear regression model having a positive slope (β1 > 0)

y = β0 + β1x + ε

The slope β1 is the change (increase) in mean dollar yearly upkeep expenditure that is associated with each $1,000 increase in home value. In later examples the marketing department at QHIC will use predictions given by this simple linear regression model to help determine which homes should be sent advertising brochures promoting QHIC's products and services.


TABLE 11.2 The QHIC Upkeep Expenditure Data (QHIC)

Home   Value of Home, x          Upkeep Expenditure, y     Home   Value of Home, x          Upkeep Expenditure, y
       (Thousands of Dollars)    (Dollars)                        (Thousands of Dollars)    (Dollars)
 1     237.00                    1,412.08                   21    153.04                      849.14
 2     153.08                      797.20                   22    232.18                    1,313.84
 3     184.86                      872.48                   23    125.44                      602.06
 4     222.06                    1,003.42                   24    169.82                      642.14
 5     160.68                      852.90                   25    177.28                    1,038.80
 6      99.68                      288.48                   26    162.82                      697.00
 7     229.04                    1,288.46                   27    120.44                      324.34
 8     101.78                      423.08                   28    191.10                      965.10
 9     257.86                    1,351.74                   29    158.78                      920.14
10      96.28                      378.04                   30    178.50                      950.90
11     171.00                      918.08                   31    272.20                    1,670.32
12     231.02                    1,627.24                   32     48.90                      125.40
13     228.32                    1,204.78                   33    104.56                      479.78
14     205.90                      857.04                   34    286.18                    2,010.64
15     185.72                      775.00                   35     83.72                      368.36
16     168.78                      869.26                   36     86.20                      425.60
17     247.06                    1,396.00                   37    133.58                      626.90
18     155.54                      711.50                   38    212.86                    1,316.94
19     224.20                    1,475.18                   39    122.02                      390.16
20     202.04                    1,413.32                   40    198.02                    1,090.84

FIGURE 11.4 MINITAB Plot of Upkeep Expenditure (UPKEEP) versus Value of Home (VALUE) for the QHIC Data

We have interpreted the slope β1 of the simple linear regression model to be the change in the mean value of y associated with a one-unit increase in x. We sometimes refer to this change as the effect of the independent variable x on the dependent variable y. However, we cannot prove that a change in an independent variable causes a change in the dependent variable. Rather, regression can be used only to establish that the two variables move together and that the independent variable contributes information for predicting the dependent variable. For instance, regression analysis might be used to establish that as liquor sales have increased over the years, college professors' salaries have also increased. However, this does not prove that increases in liquor sales cause increases in college professors' salaries. Rather, both variables are influenced by a third variable: long-run growth in the national economy.

Exercises for Section 11.1

CONCEPTS

11.1 When does the scatter plot of the values of a dependent variable y versus the values of an independent variable x suggest that the simple linear regression model

y = μy|x + ε = β0 + β1x + ε

might appropriately relate y to x?



11.2 In the simple linear regression model, what are y, μy|x, and ε?

11.3 In the simple linear regression model, define the meanings of the slope β1 and the y-intercept β0.

11.4 What is the difference between time series data and cross-sectional data?

METHODS AND APPLICATIONS

11.5 THE STARTING SALARY CASE (StartSal)
The chairman of the marketing department at a large state university undertakes a study to relate starting salary (y) after graduation for marketing majors to grade point average (GPA) in major courses. To do this, records of seven recent marketing graduates are randomly selected.

Marketing Graduate   GPA, x   Starting Salary, y (Thousands of Dollars)
1                    3.26     33.8
2                    2.60     29.8
3                    3.35     33.5
4                    2.86     30.4
5                    3.82     36.4
6                    2.21     27.6
7                    3.47     35.3

[MINITAB scatter plot of starting salary versus GPA]

Using the scatter plot (from MINITAB) of y versus x, explain why the simple linear regression model

y = μy|x + ε = β0 + β1x + ε

might appropriately relate y to x.

11.6 THE STARTING SALARY CASE (StartSal)
Consider the simple linear regression model describing the starting salary data of Exercise 11.5.
a Explain the meaning of μy|x=4.00 = β0 + β1(4.00).
b Explain the meaning of μy|x=2.50 = β0 + β1(2.50).
c Interpret the meaning of the slope parameter β1.
d Interpret the meaning of the y-intercept β0. Why does this interpretation fail to make practical sense?
e The error term ε describes the effects of many factors on starting salary y. What are these factors? Give two specific examples.

11.7 THE SERVICE TIME CASE (SrvcTime)
Accu-Copiers, Inc., sells and services the Accu-500 copying machine. As part of its standard service contract, the company agrees to perform routine service on this copier. To obtain information about the time it takes to perform routine service, Accu-Copiers has collected data for 11 service calls. The data are as follows:

Service Call   Number of Copiers Serviced, x   Number of Minutes Required, y
 1             4                               109
 2             2                                58
 3             5                               138
 4             7                               189
 5             1                                37
 6             3                                82
 7             4                               103
 8             5                               134
 9             2                                68
10             4                               112
11             6                               154

[Excel scatter plot of Minutes versus Copiers]


Using the scatter plot (from Excel) of y versus x, discuss why the simple linear regression model might appropriately relate y to x.

11.8 THE SERVICE TIME CASE (SrvcTime)
Consider the simple linear regression model describing the service time data in Exercise 11.7.
a Explain the meaning of μy|x=4 = β0 + β1(4).
b Explain the meaning of μy|x=6 = β0 + β1(6).
c Interpret the meaning of the slope parameter β1.
d Interpret the meaning of the y-intercept β0. Does this interpretation make practical sense?
e The error term ε describes the effects of many factors on service time. What are these factors? Give two specific examples.

11.9 THE FRESH DETERGENT CASE (Fresh)
Enterprise Industries produces Fresh, a brand of liquid laundry detergent. In order to study the relationship between price and demand for the large bottle of Fresh, the company has gathered data concerning demand for Fresh over the last 30 sales periods (each sales period is four weeks). Here, for each sales period,

y = demand for the large bottle of Fresh (in hundreds of thousands of bottles) in the sales period
x1 = the price (in dollars) of Fresh as offered by Enterprise Industries in the sales period
x2 = the average industry price (in dollars) of competitors' similar detergents in the sales period
x4 = x2 - x1 = the price difference in the sales period

Note: We denote the price difference as x4 (rather than, for example, x3) to be consistent with other notation to be introduced in the Fresh detergent case in Chapter 12.

Fresh Detergent Demand Data (Fresh)

Sales Period   x1     x2     x4 = x2 - x1   y
 1             3.85   3.80   -.05           7.38
 2             3.75   4.00    .25           8.51
 3             3.70   4.30    .60           9.52
 4             3.70   3.70    0             7.50
 5             3.60   3.85    .25           9.33
 6             3.60   3.80    .20           8.28
 7             3.60   3.75    .15           8.75
 8             3.80   3.85    .05           7.87
 9             3.80   3.65   -.15           7.10
10             3.85   4.00    .15           8.00
11             3.90   4.10    .20           7.89
12             3.90   4.00    .10           8.15
13             3.70   4.10    .40           9.10
14             3.75   4.20    .45           8.86
15             3.75   4.10    .35           8.90
16             3.80   4.10    .30           8.87
17             3.70   4.20    .50           9.26
18             3.80   4.30    .50           9.00
19             3.70   4.10    .40           8.75
20             3.80   3.75   -.05           7.95
21             3.80   3.75   -.05           7.65
22             3.75   3.65   -.10           7.27
23             3.70   3.90    .20           8.00
24             3.55   3.65    .10           8.50
25             3.60   4.10    .50           8.75
26             3.65   4.25    .60           9.21
27             3.70   3.65   -.05           8.27
28             3.75   3.75    0             7.67
29             3.80   3.85    .05           7.93
30             3.70   4.25    .55           9.26

Using the scatter plot (from MINITAB) of y versus x4 shown below, discuss why the simple linear regression model might appropriately relate y to x4.

[MINITAB scatter plot of Demand, y versus PriceDif, x4]


Direct Labor Cost Data (DirLab)

Batch Size, x   Direct Labor Cost, y ($100s)
  5              71
 62             663
 35             381
 12             138
 83             861
 14             145
 46             493
 52             548
 23             251
100            1024
 41             435
 75             772

11.10 THE FRESH DETERGENT CASE (Fresh)

Consider the simple linear regression model relating demand, y, to the price difference, x4, and the Fresh demand data of Exercise 11.9.
a Explain the meaning of μy|x4=.10 = β0 + β1(.10).
b Explain the meaning of μy|x4=-.05 = β0 + β1(-.05).
c Explain the meaning of the slope parameter β1.
d Explain the meaning of the intercept β0. Does this explanation make practical sense?
e What factors are represented by the error term in this model? Give two specific examples.

11.11 THE DIRECT LABOR COST CASE (DirLab)
An accountant wishes to predict direct labor cost (y) on the basis of the batch size (x) of a product produced in a job shop. Data for 12 production runs are given in the accompanying table.
a Construct a scatter plot of y versus x.
b Discuss whether the scatter plot suggests that a simple linear regression model might appropriately relate y to x.

11.12 THE DIRECT LABOR COST CASE (DirLab)
Consider the simple linear regression model describing the direct labor cost data of Exercise 11.11.
a Explain the meaning of μy|x=60 = β0 + β1(60).
b Explain the meaning of μy|x=30 = β0 + β1(30).
c Explain the meaning of the slope parameter β1.
d Explain the meaning of the intercept β0. Does this explanation make practical sense?
e What factors are represented by the error term in this model? Give two specific examples of these factors.

11.13 THE REAL ESTATE SALES PRICE CASE (RealEst)
A real estate agency collects data concerning y = the sales price of a house (in thousands of dollars), and x = the home size (in hundreds of square feet). The data are given in the accompanying table.
a Construct a scatter plot of y versus x.
b Discuss whether the scatter plot suggests that a simple linear regression model might appropriately relate y to x.

11.14 THE REAL ESTATE SALES PRICE CASE (RealEst)
Consider the simple linear regression model describing the sales price data of Exercise 11.13.
a Explain the meaning of μy|x=20 = β0 + β1(20).
b Explain the meaning of μy|x=18 = β0 + β1(18).
c Explain the meaning of the slope parameter β1.
d Explain the meaning of the intercept β0. Does this explanation make practical sense?
e What factors are represented by the error term in this model? Give two specific examples.

Real Estate Sales Price Data (RealEst)

Home Size (x)   Sales Price (y)
23              180
11               98.1
20              173.1
17              136.5
15              141
21              165.9
24              193.5
13              127.8
19              163.5
25              172.5

Source: Reprinted with permission from The Real Estate Appraiser and Analyst, Spring 1986 issue. Copyright 1986 by the Appraisal Institute, Chicago, Illinois.

11.2 The Least Squares Estimates, and Point Estimation and Prediction

The true values of the y-intercept (β0) and slope (β1) in the simple linear regression model are unknown. Therefore, it is necessary to use observed data to compute estimates of these regression parameters. To see how this is done, we begin with a simple example.

EXAMPLE 11.3 The Fuel Consumption Case


Consider the fuel consumption problem of Example 11.1. The scatter plot of y (fuel consumption) versus x (average hourly temperature) in Figure 11.1 suggests that the simple linear regression model appropriately relates y to x. We now wish to use the data in Table 11.1 to estimate the intercept β0 and the slope β1 of the line of means. To do this, it might be reasonable to estimate the line of means by fitting the best straight line to the plotted data in Figure 11.1. But how do we fit the best straight line? One approach would be to simply eyeball a line through the points. Then we could read the y-intercept and slope off the visually fitted line and use these values as the estimates of β0 and β1.



FIGURE 11.5 Visually Fitting a Line to the Fuel Consumption Data. The visually fitted line intersects the y axis at 15, its slope is (change in y)/(change in x) = (12.8 - 13.8)/(20 - 10) = -.1, and the deviation y - ŷ = 12.4 - 12.2 = .2 for the first week is shown.

FIGURE 11.6 Using the Visually Fitted Line to Predict ŷ = 12.2 When x = 28

For example, Figure 11.5 shows a line that has been visually fitted to the plot of the fuel consumption data. We see that this line intersects the y axis at y = 15. Therefore, the y-intercept of the line is 15. In addition, the figure shows that the slope of the line is

change in y / change in x = (12.8 - 13.8)/(20 - 10) = -1/10 = -.1

Therefore, based on the visually fitted line, we estimate that β0 is 15 and that β1 is -.1. In order to evaluate how good our point estimates of β0 and β1 are, consider using the visually fitted line to predict weekly fuel consumption. Denoting such a prediction as ŷ (pronounced "y hat"), a prediction of weekly fuel consumption when the average hourly temperature is x is

ŷ = 15 - .1x

For instance, when the temperature is 28°F, predicted fuel consumption is

ŷ = 15 - .1(28) = 15 - 2.8 = 12.2

Here ŷ is simply the point on the visually fitted line corresponding to x = 28 (see Figure 11.6).


TABLE 11.3 Calculation of SSE for a Line Visually Fitted to the Fuel Consumption Data

 y      x      ŷ = 15 - .1x              y - ŷ      (y - ŷ)²
12.4   28.0    15 - .1(28.0) = 12.2        .2         .04
11.7   28.0    15 - .1(28.0) = 12.2       -.5         .25
12.4   32.5    15 - .1(32.5) = 11.75       .65        .4225
10.8   39.0    15 - .1(39.0) = 11.1       -.3         .09
 9.4   45.9    15 - .1(45.9) = 10.41     -1.01       1.0201
 9.5   57.8    15 - .1(57.8) = 9.22        .28        .0784
 8.0   58.1    15 - .1(58.1) = 9.19      -1.19       1.4161
 7.5   62.5    15 - .1(62.5) = 8.75      -1.25       1.5625

SSE = Σ(y - ŷ)² = 4.8796

We can evaluate how well the visually determined line fits the points on the scatter plot by comparing each observed value of y with the corresponding predicted value of y given by the fitted line. We do this by computing the deviation y - ŷ. For instance, looking at the first observation in Table 11.1 (page 447), we observed y = 12.4 and x = 28.0. Since the predicted fuel consumption when x equals 28 is ŷ = 12.2, the deviation y - ŷ equals 12.4 - 12.2 = .2. This deviation is illustrated in Figure 11.5. Table 11.3 gives the values of y, x, ŷ, and y - ŷ for each observation in Table 11.1. The deviations (or prediction errors) are the vertical distances between the observed y values and the predictions obtained using the fitted line; that is, they are the line segments depicted in Figure 11.5. If the visually determined line fits the data well, the deviations (errors) will be small. To obtain an overall measure of the quality of the fit, we compute the sum of squared deviations or sum of squared errors, denoted SSE. Table 11.3 also gives the squared deviations and the SSE for our visually fitted line. We find that SSE = 4.8796. Clearly, the line shown in Figure 11.5 is not the only line that could be fitted to the observed fuel consumption data. Different people would obtain somewhat different visually fitted lines. However, it can be shown that there is exactly one line that gives a value of SSE that is smaller than the value of SSE that would be given by any other line that could be fitted to the data. This line is called the least squares regression line or the least squares prediction equation. To show how to find the least squares line, we first write the general form of a straight-line prediction equation as

ŷ = b0 + b1x
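The arithmetic in Table 11.3 is easy to check with a short Python sketch; this is our reproduction of the table's calculations, not part of the original text.

```python
# Fuel consumption data from Table 11.1
x = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]
y = [12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5]

# Visually fitted line: y_hat = 15 - 0.1 * x
y_hat = [15 - 0.1 * xi for xi in x]
deviations = [yi - yhi for yi, yhi in zip(y, y_hat)]   # y - y_hat for each week
sse = sum(d ** 2 for d in deviations)                  # sum of squared deviations

print(deviations)   # .2, -.5, .65, -.3, -1.01, .28, -1.19, -1.25 (up to rounding)
print(sse)          # approximately 4.8796
```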

Here b0 (pronounced "b zero") is the y-intercept and b1 (pronounced "b one") is the slope of the line. In addition, ŷ denotes the predicted value of the dependent variable when the value of the independent variable is x. Now suppose we have collected n observations (x1, y1), (x2, y2), . . . , (xn, yn). If we consider a particular observation (xi, yi), the predicted value of yi is

ŷi = b0 + b1xi

Furthermore, the prediction error (also called the residual) for this observation is

ei = yi - ŷi = yi - (b0 + b1xi)

Then the least squares line is the line that minimizes the sum of the squared prediction errors (that is, the sum of squared residuals):

SSE = Σ_{i=1}^{n} (yi - (b0 + b1xi))²

To find this line, we find the values of the y-intercept b0 and the slope b1 that minimize SSE. These values of b0 and b1 are called the least squares point estimates of β0 and β1.


Using calculus, it can be shown that these estimates are calculated as follows:⁴

The Least Squares Point Estimates

For the simple linear regression model:

1 The least squares point estimate of the slope β1 is

b1 = SSxy / SSxx

where

SSxy = Σ(xi - x̄)(yi - ȳ) = Σxiyi - (Σxi)(Σyi)/n   and   SSxx = Σ(xi - x̄)² = Σxi² - (Σxi)²/n

2 The least squares point estimate of the y-intercept β0 is

b0 = ȳ - b1x̄

where

ȳ = Σyi/n   and   x̄ = Σxi/n

Here n is the number of observations (an observation is an observed value of x and its corresponding value of y).
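The two formulas in the box translate directly into code. The following minimal Python sketch (the function name is ours, not from the text) computes SSxy, SSxx, and the least squares point estimates for any set of paired observations.

```python
def least_squares_estimates(x, y):
    """Return (b0, b1), the least squares point estimates of the y-intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))   # SSxy
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)                          # SSxx
    b1 = ss_xy / ss_xx              # least squares point estimate of the slope
    b0 = y_bar - b1 * x_bar         # least squares point estimate of the y-intercept
    return b0, b1
```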

The following example illustrates how to calculate these point estimates and how to use these point estimates to estimate mean values and predict individual values of the dependent variable. Note that the quantities SSxy and SSxx used to calculate the least squares point estimates are also used throughout this chapter to perform other important calculations.

EXAMPLE 11.4 The Fuel Consumption Case

Part 1: Calculating the least squares point estimates
Again consider the fuel consumption problem. To compute the least squares point estimates of the regression parameters β0 and β1, we first calculate the following preliminary summations:

 yi     xi      xi²                       xiyi
12.4   28.0    (28.0)² = 784             (28.0)(12.4) = 347.2
11.7   28.0    (28.0)² = 784             (28.0)(11.7) = 327.6
12.4   32.5    (32.5)² = 1,056.25        (32.5)(12.4) = 403
10.8   39.0    (39.0)² = 1,521           (39.0)(10.8) = 421.2
 9.4   45.9    (45.9)² = 2,106.81        (45.9)(9.4) = 431.46
 9.5   57.8    (57.8)² = 3,340.84        (57.8)(9.5) = 549.1
 8.0   58.1    (58.1)² = 3,375.61        (58.1)(8.0) = 464.8
 7.5   62.5    (62.5)² = 3,906.25        (62.5)(7.5) = 468.75

Σyi = 81.7    Σxi = 351.8    Σxi² = 16,874.76    Σxiyi = 3,413.11

Using these summations, we calculate SSxy and SSxx as follows:

SSxy = Σxiyi - (Σxi)(Σyi)/n = 3,413.11 - (351.8)(81.7)/8 = -179.6475

SSxx = Σxi² - (Σxi)²/n = 16,874.76 - (351.8)²/8 = 1,404.355

⁴ In order to simplify notation, we will often drop the limits on summations in this and subsequent chapters. That is, instead of using the summation Σ_{i=1}^{n}, we will simply write Σ.


TABLE 11.4 Calculation of SSE Obtained by Using the Least Squares Point Estimates

 yi     xi      ŷi = 15.84 - .1279xi                yi - ŷi (residual)    (yi - ŷi)²
12.4   28.0    15.84 - .1279(28.0) = 12.2588           .1412               .0199374
11.7   28.0    15.84 - .1279(28.0) = 12.2588          -.5588               .3122574
12.4   32.5    15.84 - .1279(32.5) = 11.68325          .71675              .5137306
10.8   39.0    15.84 - .1279(39.0) = 10.8519          -.0519               .0026936
 9.4   45.9    15.84 - .1279(45.9) = 9.96939          -.56939              .324205
 9.5   57.8    15.84 - .1279(57.8) = 8.44738          1.05262             1.1080089
 8.0   58.1    15.84 - .1279(58.1) = 8.40901          -.40901              .1672892
 7.5   62.5    15.84 - .1279(62.5) = 7.84625          -.34625              .1198891

SSE = Σ(yi - ŷi)² = 2.5680112

It follows that the least squares point estimate of the slope β1 is

b1 = SSxy / SSxx = -179.6475 / 1,404.355 = -.1279

Furthermore, because

ȳ = Σyi/n = 81.7/8 = 10.2125   and   x̄ = Σxi/n = 351.8/8 = 43.975

the least squares point estimate of the y-intercept β0 is

b0 = ȳ - b1x̄ = 10.2125 - (-.1279)(43.975) = 15.84
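These hand calculations can be checked in Python. This is a minimal sketch using numpy's polyfit (our choice of tool, not part of the text); it reproduces the estimates above and the SSE reported in Table 11.4.

```python
import numpy as np

x = np.array([28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5])
y = np.array([12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5])

b1, b0 = np.polyfit(x, y, deg=1)      # fits y = b0 + b1*x by least squares (slope returned first)
y_hat = b0 + b1 * x                    # predicted fuel consumptions
sse = np.sum((y - y_hat) ** 2)         # sum of squared residuals

print(b0, b1)   # about 15.84 and -0.1279
print(sse)      # about 2.568
```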

Since b1 = -.1279, we estimate that mean weekly fuel consumption decreases (since b1 is negative) by .1279 MMcf of natural gas when average hourly temperature increases by 1 degree. Since b0 = 15.84, we estimate that mean weekly fuel consumption is 15.84 MMcf of natural gas when average hourly temperature is 0°F. However, we have not observed any weeks with temperatures near 0, so making this interpretation of b0 might be dangerous. We discuss this point more fully after this example. Table 11.4 gives predictions of fuel consumption for each observed week obtained by using the least squares line (or prediction equation)

ŷ = b0 + b1x = 15.84 - .1279x

The table also gives each of the residuals and squared residuals and the sum of squared residuals (SSE = 2.5680112) obtained by using this prediction equation. Notice that the SSE here, which was obtained using the least squares point estimates, is smaller than the SSE of Table 11.3, which was obtained using the visually fitted line ŷ = 15 - .1x. In general, it can be shown that the SSE obtained by using the least squares point estimates is smaller than the value of SSE that would be obtained by using any other estimates of β0 and β1. Figure 11.7(a) illustrates the eight observed fuel consumptions (the dots in the figure) and the eight predicted fuel consumptions (the squares in the figure) given by the least squares line. The distances between the observed and predicted fuel consumptions are the residuals. Therefore, when we say that the least squares point estimates minimize SSE, we are saying that these estimates position the least squares line so as to minimize the sum of the squared distances between the observed and predicted fuel consumptions. In this sense, the least squares line is the best straight line that can be fitted to the eight observed fuel consumptions. Figure 11.7(b) gives the MINITAB output of this best fit line. Note that this output gives the least squares estimates b0 = 15.8379 and b1 = -.127922. In general, we will rely on MINITAB, Excel, and MegaStat to compute the least squares estimates (and to perform many other regression calculations).

Part 2: Estimating a mean fuel consumption and predicting an individual fuel consumption
We define the experimental region to be the range of the previously observed values of the average hourly temperature x. Because we have observed average hourly temperatures between 28°F and 62.5°F (see Table 11.4), the experimental region consists of the range of average hourly temperatures from 28°F to 62.5°F.


FIGURE 11.7 The Least Squares Line for the Fuel Consumption Data
(a) The observed and predicted fuel consumptions, with the least squares line ŷ = 15.84 - .1279x; the residual and the observed and predicted fuel consumptions when x = 45.9 are labeled.
(b) The MINITAB output of the least squares line (best fit line for fuel consumption data): Y = 15.8379 - 0.127922X, R-Squared = 0.899.

The simple linear regression model relates weekly fuel consumption y to average hourly temperature x for values of x that are in the experimental region. For such values of x, the least squares line is the estimate of the line of means. This implies that the point on the least squares line that corresponds to the average hourly temperature x,

ŷ = b0 + b1x = 15.84 - .1279x

is the point estimate of the mean of all the weekly fuel consumptions that could be observed when the average hourly temperature is x:

μy|x = β0 + β1x

Note that ŷ is an intuitively logical point estimate of μy|x. This is because the expression b0 + b1x used to calculate ŷ has been obtained from the expression β0 + β1x for μy|x by replacing the unknown values of β0 and β1 by their least squares point estimates b0 and b1. The quantity ŷ is also the point prediction of the individual value

y = β0 + β1x + ε

which is the amount of fuel consumed in a single week when the average hourly temperature equals x. To understand why ŷ is the point prediction of y, note that y is the sum of the mean β0 + β1x and the error term ε. We have already seen that ŷ = b0 + b1x is the point estimate of β0 + β1x. We will now reason that we should predict the error term ε to be 0, which implies that ŷ is also the point prediction of y. To see why we should predict the error term to be 0, note that in the next section we discuss several assumptions concerning the simple linear regression model. One implication of these assumptions is that the error term has a 50 percent chance of being positive and a 50 percent chance of being negative. Therefore, it is reasonable to predict the error term to be 0 and to use ŷ as the point prediction of a single value of y when the average hourly temperature equals x.

Now suppose a weather forecasting service predicts that the average hourly temperature in the next week will be 40°F. Because 40°F is in the experimental region,

ŷ = 15.84 - .1279(40) = 10.72 MMcf of natural gas

is (1) the point estimate of the mean weekly fuel consumption when the average hourly temperature is 40°F and (2) the point prediction of an individual weekly fuel consumption when the average hourly temperature is 40°F.


FIGURE 11.8 Point Estimation and Point Prediction in the Fuel Consumption Problem. The figure shows the least squares line ŷ = 15.84 - .1279x, the true line of means μy|x = β0 + β1x, the point estimate of mean fuel consumption when x = 40 (ŷ = 10.72), the true mean fuel consumption when x = 40, and an individual value of fuel consumption when x = 40.

FIGURE 11.9 The Danger of Extrapolation Outside the Experimental Region. The experimental region runs from x = 28 to x = 62.5. The relationship between mean fuel consumption and x might become curved at low temperatures, so the estimated mean fuel consumption when x = -10, obtained by extrapolating the least squares line ŷ = 15.84 - .1279x, might differ greatly from the true mean fuel consumption when x = -10.

This says that (1) we estimate that the average of all possible weekly fuel consumptions that could potentially be observed when the average hourly temperature is 40°F equals 10.72 MMcf of natural gas, and (2) we predict that the fuel consumption in a single week when the average hourly temperature is 40°F will be 10.72 MMcf of natural gas.


Figure 11.8 illustrates (1) the point estimate of mean fuel consumption when x is 40°F (the square on the least squares line), (2) the true mean fuel consumption when x is 40°F (the triangle on the true line of means), and (3) an individual value of fuel consumption when x is 40°F (the dot in the figure). Of course this figure is only hypothetical. However, it illustrates that the point estimate of the mean value of y (which is also the point prediction of the individual value of y) will (unless we are extremely fortunate) differ from both the true mean value of y and the individual value of y. Therefore, it is very likely that the point prediction ŷ = 10.72, which is the natural gas company's transmission nomination for next week, will differ from next week's actual fuel consumption, y. It follows that we might wish to predict the largest and smallest that y might reasonably be. We will see how to do this in Section 11.5.

To conclude this example, note that Figure 11.9 illustrates the potential danger of using the least squares line to predict outside the experimental region. In the figure, we extrapolate the least squares line far beyond the experimental region to obtain a prediction for a temperature of -10°F. As shown in Figure 11.1, for values of x in the experimental region the observed values of y tend to decrease in a straight-line fashion as the values of x increase. However, for temperatures lower than 28°F the relationship between y and x might become curved. If it does, extrapolating the straight-line prediction equation to obtain a prediction for x = -10 might badly underestimate mean weekly fuel consumption (see Figure 11.9).

The previous example illustrates that when we are using a least squares regression line, we should not estimate a mean value or predict an individual value unless the corresponding value of x is in the experimental region (the range of the previously observed values of x). Often the value x = 0 is not in the experimental region. For example, consider the fuel consumption problem. Figure 11.9 illustrates that the average hourly temperature 0°F is not in the experimental region. In such a situation, it would not be appropriate to interpret the y-intercept b0 as the estimate of the mean value of y when x equals 0. For example, in the fuel consumption problem it would not be appropriate to use b0 = 15.84 as the point estimate of the mean weekly fuel consumption when the average hourly temperature is 0. Therefore, because it is not meaningful to interpret the y-intercept in many regression situations, we often omit such interpretations. We now present a general procedure for estimating a mean value and predicting an individual value:

Point Estimation and Point Prediction in Simple Linear Regression

Let b0 and b1 be the least squares point estimates of the y-intercept β0 and the slope β1 in the simple linear regression model, and suppose that x0, a specified value of the independent variable x, is inside the experimental region. Then

ŷ = b0 + b1x0

is the point estimate of the mean value of the dependent variable when the value of the independent variable is x0. In addition, ŷ is the point prediction of an individual value of the dependent variable when the value of the independent variable is x0. Here we predict the error term to be 0.
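A minimal Python sketch of this procedure follows. The function name and the range check are our own illustration; the text's warning about the experimental region is expressed by refusing to compute ŷ when x0 lies outside the observed range of x.

```python
def point_estimate(b0, b1, x0, x_observed):
    """Return y_hat = b0 + b1*x0, but only if x0 is inside the experimental region."""
    lo, hi = min(x_observed), max(x_observed)
    if not (lo <= x0 <= hi):
        raise ValueError(f"x0 = {x0} is outside the experimental region [{lo}, {hi}]")
    return b0 + b1 * x0

# Fuel consumption example: b0 = 15.84, b1 = -0.1279, observed temperatures 28 to 62.5
temps = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]
print(point_estimate(15.84, -0.1279, 40.0, temps))   # about 10.72 MMcf
```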

EXAMPLE 11.5 The QHIC Case

Consider the simple linear regression model relating yearly home upkeep expenditure, y, to home value, x. Using the data in Table 11.2 (page 453), we can calculate the least squares point estimates of the y-intercept β0 and the slope β1 to be b0 = -348.3921 and b1 = 7.2583. Since b1 = 7.2583, we estimate that mean yearly upkeep expenditure increases by $7.26 for each additional $1,000 increase in home value. Consider a home worth $220,000, and note that x0 = 220 is in the range of previously observed values of x: 48.9 to 286.18 (see Table 11.2). It follows that

ŷ = b0 + b1x0 = -348.3921 + 7.2583(220) = 1,248.43 (or $1,248.43)

is the point estimate of the mean yearly upkeep expenditure for all homes worth $220,000 and is the point prediction of a yearly upkeep expenditure for an individual home worth $220,000.


The marketing department at QHIC wishes to determine which homes should be sent advertising brochures promoting QHIC's products and services. The prediction equation ŷ = b0 + b1x implies that the home value x corresponding to a predicted upkeep expenditure of ŷ is

x = (ŷ - b0)/b1 = (ŷ - (-348.3921))/7.2583 = (ŷ + 348.3921)/7.2583

Therefore, for example, if QHIC wishes to send an advertising brochure to any home that has a predicted upkeep expenditure of at least $500, then QHIC should send this brochure to any home that has a value of at least

x = (ŷ + 348.3921)/7.2583 = (500 + 348.3921)/7.2583 = 116.886 ($116,886)
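This back-calculation is just the prediction equation solved for x. A minimal Python sketch (our own illustration, using the QHIC estimates quoted above) is:

```python
def home_value_for_target(y_hat_target, b0=-348.3921, b1=7.2583):
    """Return the home value x (in $1,000s) whose predicted upkeep expenditure equals y_hat_target."""
    return (y_hat_target - b0) / b1

print(home_value_for_target(500))   # about 116.886, i.e. roughly $116,886
```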

Exercises for Section 11.2

CONCEPTS

11.15 What does SSE measure?

11.16 What is the least squares regression line, and what are the least squares point estimates?

11.17 How do we obtain a point estimate of the mean value of the dependent variable and a point prediction of an individual value of the dependent variable?

11.18 Why is it dangerous to extrapolate outside the experimental region?

METHODS AND APPLICATIONS

Exercises 11.19, 11.20, and 11.21 are based on the following MINITAB and Excel output. At the left is the output obtained when MINITAB is used to fit a least squares line to the starting salary data given in Exercise 11.5 (page 454). In the middle is the output obtained when Excel is used to fit a least squares line to the service time data given in Exercise 11.7 (page 454). The rightmost output is obtained when MINITAB is used to fit a least squares line to the Fresh detergent demand data given in Exercise 11.9 (page 455).

[Left: MINITAB regression plot of starting salary (Y) versus GPA for the StartSal data: Y = 14.8156 + 5.70657X, R-Sq = 0.977.
Middle: Excel Copiers line fit plot of Minutes versus Copiers for the SrvcTime data: Y = 11.4641 + 24.6022X.
Right: MINITAB regression plot of Demand versus PriceDif for the Fresh data: Y = 7.81409 + 2.66522X, R-Sq = 0.792.]

11.19 THE STARTING SALARY CASE

StartSal

Using the leftmost output
a  Identify and interpret the least squares point estimates b0 and b1. Does the interpretation of b0 make practical sense?
b  Use the least squares line to obtain a point estimate of the mean starting salary for all marketing graduates having a grade point average of 3.25 and a point prediction of the starting salary for an individual marketing graduate having a grade point average of 3.25.


11.20 THE SERVICE TIME CASE  SrvcTime


Using the middle output
a  Identify and interpret the least squares point estimates b0 and b1. Does the interpretation of b0 make practical sense?
b  Use the least squares line to obtain a point estimate of the mean time to service four copiers and a point prediction of the time to service four copiers on a single call.

11.21 THE FRESH DETERGENT CASE  Fresh

Using the rightmost output
a  Identify and interpret the least squares point estimates b0 and b1. Does the interpretation of b0 make practical sense?
b  Use the least squares line to obtain a point estimate of the mean demand in all sales periods when the price difference is .10 and a point prediction of the actual demand in an individual sales period when the price difference is .10.
c  If Enterprise Industries wishes to maintain a price difference that corresponds to a predicted demand of 850,000 bottles (that is, ŷ = 8.5), what should this price difference be?

11.22 THE DIRECT LABOR COST CASE  DirLab

Consider the direct labor cost data given in Exercise 11.11 (page 456), and suppose that a simple linear regression model is appropriate.
a  Verify that b0 = 18.4880 and b1 = 10.1463 by using the formulas illustrated in Example 11.4 (pages 459–460).
b  Interpret the meanings of b0 and b1. Does the interpretation of b0 make practical sense?
c  Write the least squares prediction equation.
d  Use the least squares line to obtain a point estimate of the mean direct labor cost for all batches of size 60 and a point prediction of the direct labor cost for an individual batch of size 60.

11.23 THE REAL ESTATE SALES PRICE CASE  RealEst

Consider the sales price data given in Exercise 11.13 (page 456), and suppose that a simple linear regression model is appropriate.
a  Verify that b0 = 48.02 and b1 = 5.7003 by using the formulas illustrated in Example 11.4 (pages 459–460).
b  Interpret the meanings of b0 and b1. Does the interpretation of b0 make practical sense?
c  Write the least squares prediction equation.
d  Use the least squares line to obtain a point estimate of the mean sales price of all houses having 2,000 square feet and a point prediction of the sales price of an individual house having 2,000 square feet.

11.3 Model Assumptions and the Standard Error

Model assumptions    In order to perform hypothesis tests and set up various types of intervals when using the simple linear regression model

y = μ_y|x + ε = β0 + β1x + ε

we need to make certain assumptions about the error term ε. At any given value of x, there is a population of error term values that could potentially occur. These error term values describe the different potential effects on y of all factors other than the value of x. Therefore, these error term values explain the variation in the y values that could be observed when the independent variable is x. Our statement of the simple linear regression model assumes that μ_y|x, the mean of the population of all y values that could be observed when the independent variable is x, is β0 + β1x. This model also implies that ε = y − (β0 + β1x), so this is equivalent to assuming that the mean of the corresponding population of potential error term values is 0. In total, we make four assumptions, called the regression assumptions, about the simple linear regression model.


These assumptions can be stated in terms of potential y values or, equivalently, in terms of potential error term values. Following tradition, we begin by stating these assumptions in terms of potential error term values:

The Regression Assumptions

1  Mean Zero Assumption    At any given value of x, the population of potential error term values has a mean equal to 0.

2  Constant Variance Assumption    At any given value of x, the population of potential error term values has a variance that does not depend on the value of x. That is, the different populations of potential error term values corresponding to different values of x have equal variances. We denote the constant variance as σ².

3  Normality Assumption    At any given value of x, the population of potential error term values has a normal distribution.

4  Independence Assumption    Any one value of the error term ε is statistically independent of any other value of ε. That is, the value of the error term ε corresponding to an observed value of y is statistically independent of the value of the error term corresponding to any other observed value of y.

Taken together, the first three assumptions say that, at any given value of x, the population of potential error term values is normally distributed with mean zero and a variance σ² that does not depend on the value of x. Because the potential error term values cause the variation in the potential y values, these assumptions imply that the population of all y values that could be observed when the independent variable is x is normally distributed with mean β0 + β1x and a variance σ² that does not depend on x. These three assumptions are illustrated in Figure 11.10 in the context of the fuel consumption problem. Specifically, this figure depicts the populations of weekly fuel consumptions corresponding to two values of average hourly temperature: 32.5 and 45.9. Note that these populations are shown to be normally distributed with different means (each of which is on the line of means) and with the same variance (or spread).
The independence assumption is most likely to be violated when time series data are being utilized in a regression study. Intuitively, this assumption says that there is no pattern of positive error terms being followed (in time) by other positive error terms, and there is no pattern of positive error terms being followed by negative error terms. That is, there is no pattern of higher-than-average y values being followed by other higher-than-average y values, and there is no pattern of higher-than-average y values being followed by lower-than-average y values.
It is important to point out that the regression assumptions very seldom, if ever, hold exactly in any practical regression problem. However, it has been found that regression results are not extremely sensitive to mild departures from these assumptions. In practice, only pronounced

F I G U R E 11.10  An Illustration of the Model Assumptions

[Figure: the populations of y values when x = 32.5 and when x = 45.9, each normally distributed with its mean on the straight line defined by μ_y|x = β0 + β1x (the line of means) and with equal spread; the observed values y = 12.4 (when x = 32.5) and y = 9.4 (when x = 45.9) are marked.]


departures from these assumptions require attention. In optional Section 11.8 we show how to check the regression assumptions. Prior to doing this, we will suppose that the assumptions are valid in our examples.
In Section 11.2 we stated that, when we predict an individual value of the dependent variable, we predict the error term to be 0. To see why we do this, note that the regression assumptions state that, at any given value of the independent variable, the population of all error term values that can potentially occur is normally distributed with a mean equal to 0. Since we also assume that successive error terms (observed over time) are statistically independent, each error term has a 50 percent chance of being positive and a 50 percent chance of being negative. Therefore, it is reasonable to predict any particular error term value to be 0.
The mean square error and the standard error    To present statistical inference formulas in later sections, we need to be able to compute point estimates of σ² and σ, the constant variance and standard deviation of the error term populations. The point estimate of σ² is called the mean square error and the point estimate of σ is called the standard error. In the following box, we show how to compute these estimates:

The Mean Square Error and the Standard Error

If the regression assumptions are satisfied and SSE is the sum of squared residuals:

1  The point estimate of σ² is the mean square error  s² = SSE / (n − 2)

2  The point estimate of σ is the standard error  s = √(SSE / (n − 2))

In order to understand these point estimates, recall that σ² is the variance of the population of y values (for a given value of x) around the mean value μ_y|x. Because ŷ is the point estimate of this mean, it seems natural to use

SSE = Σ(yᵢ − ŷᵢ)²

to help construct a point estimate of σ². We divide SSE by n − 2 because it can be proven that doing so makes the resulting s² an unbiased point estimate of σ². Here we call n − 2 the number of degrees of freedom associated with SSE.

EXAMPLE 11.6  The Fuel Consumption Case

Consider the fuel consumption situation, and recall that in Table 11.4 (page 460) we have calculated the sum of squared residuals to be SSE = 2.568. It follows, because we have observed n = 8 fuel consumptions, that the point estimate of σ² is the mean square error

s² = SSE / (n − 2) = 2.568 / (8 − 2) = .428


This implies that the point estimate of σ is the standard error s = √.428 = .6542

As another example, it can be verified that the standard error for the simple linear regression model describing the QHIC data is s = 146.897. To conclude this section, note that in optional Section 11.9 we present a shortcut formula for calculating SSE. The reader may study Section 11.9 now or at any later point.
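The following sketch (ours, not the textbook's; it assumes Python with NumPy and uses the fuel consumption data shown as Sample 1 in Table 11.5) computes SSE, the mean square error, and the standard error directly from the residuals.

import numpy as np

# Fuel consumption data (Table 11.5, Sample 1) and the least squares line.
x = np.array([28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5])
y = np.array([12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5])
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

# Residuals, SSE, mean square error s^2, and standard error s.
residuals = y - (b0 + b1 * x)
SSE = (residuals ** 2).sum()        # approx 2.568
n = len(y)
s_squared = SSE / (n - 2)           # mean square error, approx .428
s = np.sqrt(s_squared)              # standard error, approx .6542
print(SSE, s_squared, s)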


Exercises for Section 11.3

CONCEPTS
11.24 What four assumptions do we make about the simple linear regression model?
11.25 What is estimated by the mean square error, and what is estimated by the standard error?

METHODS AND APPLICATIONS
11.26 THE STARTING SALARY CASE  StartSal
Refer to the starting salary data of Exercise 11.5 (page 454). Given that SSE = 1.438, calculate s² and s.
11.27 THE SERVICE TIME CASE  SrvcTime
Refer to the service time data in Exercise 11.7 (page 454). Given that SSE = 191.70166, calculate s² and s.
11.28 THE FRESH DETERGENT CASE  Fresh
Refer to the Fresh detergent data of Exercise 11.9 (page 455). Given that SSE = 2.8059, calculate s² and s.
11.29 THE DIRECT LABOR COST CASE  DirLab
Refer to the direct labor cost data of Exercise 11.11 (page 456). Given that SSE = 747, calculate s² and s.
11.30 THE REAL ESTATE SALES PRICE CASE  RealEst
Refer to the sales price data of Exercise 11.13 (page 456). Given that SSE = 896.8, calculate s² and s.

11.31 Ten sales regions of equal sales potential for a company were randomly selected. The advertising expenditures (in units of $10,000) in these 10 sales regions were purposely set during July of last year at, respectively, 5, 6, 7, 8, 9, 10, 11, 12, 13, and 14. The sales volumes (in units of $10,000) were then recorded for the 10 sales regions and found to be, respectively, 89, 87, 98, 110, 103, 114, 116, 110, 126, and 130. Assuming that the simple linear regression model is appropriate, it can be shown that b0 = 66.2121, b1 = 4.4303, and SSE = 222.8242.  SalesVol  Calculate s² and s.

11.4 Testing the Significance of the Slope and y Intercept

Testing the significance of the slope    A simple linear regression model is not likely to be useful unless there is a significant relationship between y and x. In order to judge the significance of the relationship between y and x, we test the null hypothesis

H0: β1 = 0

which says that there is no change in the mean value of y associated with an increase in x, versus the alternative hypothesis

Ha: β1 ≠ 0

which says that there is a (positive or negative) change in the mean value of y associated with an increase in x. It would be reasonable to conclude that x is significantly related to y if we can be quite certain that we should reject H0 in favor of Ha.
In order to test these hypotheses, recall that we compute the least squares point estimate b1 of the true slope β1 by using a sample of n observed values of the dependent variable y. A different sample of n observed y values would yield a different least squares point estimate b1. For example, consider the fuel consumption problem, and recall that we have observed eight average hourly temperatures. Corresponding to each temperature there is a (theoretically) infinite population of fuel consumptions that could potentially be observed at that temperature [see Table 11.5(a)]. Sample 1 in Table 11.5(b) is the sample of eight fuel consumptions that we have actually observed from these populations (these are the same fuel consumptions originally given in Table 11.1).


T A B L E 11.5  Three Samples in the Fuel Consumption Case

(a) The Eight Populations of Fuel Consumptions
Week   Average Hourly Temperature x   Population of Potential Weekly Fuel Consumptions
1      28.0                           Population of fuel consumptions when x = 28.0
2      28.0                           Population of fuel consumptions when x = 28.0
3      32.5                           Population of fuel consumptions when x = 32.5
4      39.0                           Population of fuel consumptions when x = 39.0
5      45.9                           Population of fuel consumptions when x = 45.9
6      57.8                           Population of fuel consumptions when x = 57.8
7      58.1                           Population of fuel consumptions when x = 58.1
8      62.5                           Population of fuel consumptions when x = 62.5

(b) Three Possible Samples
      Sample 1   Sample 2   Sample 3
y1    12.4       12.0       10.7
y2    11.7       11.8       10.2
y3    12.4       12.3       10.5
y4    10.8       11.5        9.8
y5     9.4        9.1        9.5
y6     9.5        9.2        8.9
y7     8.0        8.5        8.5
y8     7.5        7.2        8.0

(c) Three Possible Values of b1:   b1 = −.1279    b1 = −.1285    b1 = −.0666
(d) Three Possible Values of b0:   b0 = 15.84     b0 = 15.852    b0 = 12.44
(e) Three Possible Values of s²:   s² = .428      s² = .47       s² = .0667
(f) Three Possible Values of s:    s = .6542      s = .686       s = .2582

Samples 2 and 3 in Table 11.5(b) are two other samples that we could have observed. In general, an infinite number of such samples could be observed. Because each sample yields its own unique values of b1, b0, s², and s [see Table 11.5(c) through (f)], there is an infinite population of potential values of each of these estimates. If the regression assumptions hold, then the population of all possible values of b1 is normally distributed with a mean of β1 and with a standard deviation of

σ_b1 = σ / √SSxx

The standard error s is the point estimate of σ, so it follows that a point estimate of σ_b1 is

s_b1 = s / √SSxx

which is called the standard error of the estimate b1. Furthermore, if the regression assumptions hold, then the population of all values of (b1 − β1)/s_b1 has a t distribution with n − 2 degrees of freedom. It follows that, if the null hypothesis H0: β1 = 0 is true, then the population of all possible values of the test statistic

t = b1 / s_b1


has a t distribution with n − 2 degrees of freedom. Therefore, we can test the significance of the regression relationship as follows:

Testing the Significance of the Regression Relationship: Testing the Significance of the Slope

Define the test statistic

t = b1 / s_b1    where    s_b1 = s / √SSxx

and suppose that the regression assumptions hold. Then we can reject H0: β1 = 0 in favor of a particular alternative hypothesis at significance level α (that is, by setting the probability of a Type I error equal to α) if and only if the appropriate rejection point condition holds, or, equivalently, the corresponding p-value is less than α.

Alternative Hypothesis   Rejection Point Condition: Reject H0 if   p-Value
Ha: β1 ≠ 0               |t| > t_α/2                               Twice the area under the t curve to the right of |t|
Ha: β1 > 0               t > t_α                                   The area under the t curve to the right of t
Ha: β1 < 0               t < −t_α                                  The area under the t curve to the left of t

Here t_α/2, t_α, and all p-values are based on n − 2 degrees of freedom.

We usually use the two-sided alternative Ha: β1 ≠ 0 for this test of significance. However, sometimes a one-sided alternative is appropriate. For example, in the fuel consumption problem we can say that if the slope β1 is not 0, then it must be negative. A negative β1 would say that mean fuel consumption decreases as temperature x increases. Because of this, it would be appropriate to decide that x is significantly related to y if we can reject H0: β1 = 0 in favor of the one-sided alternative Ha: β1 < 0. Although this test would be slightly more effective than the usual two-sided test, there is little practical difference between using the one-sided or two-sided alternative. Furthermore, computer packages (such as MINITAB and Excel) present results for testing a two-sided alternative hypothesis. For these reasons we will emphasize the two-sided test. It should also be noted that

1  If we can decide that the slope is significant at the .05 significance level, then we have concluded that x is significantly related to y by using a test that allows only a .05 probability of concluding that x is significantly related to y when it is not. This is usually regarded as strong evidence that the regression relationship is significant.

2  If we can decide that the slope is significant at the .01 significance level, this is usually regarded as very strong evidence that the regression relationship is significant.

3  The smaller the significance level α at which H0 can be rejected, the stronger is the evidence that the regression relationship is significant.

EXAMPLE 11.7  The Fuel Consumption Case

Again consider the fuel consumption model

y = β0 + β1x + ε

For this model SSxx = 1,404.355, b1 = −.1279, and s = .6542 [see Examples 11.4 (pages 459–460) and 11.6 (page 467)]. Therefore

s_b1 = s / √SSxx = .6542 / √1,404.355 = .01746

and

t = b1 / s_b1 = −.1279 / .01746 = −7.33


F I G U R E 11.11  MINITAB and Excel Output of a Simple Linear Regression Analysis of the Fuel Consumption Data

(a) The MINITAB output

Regression Analysis
The regression equation is FUELCONS = 15.8 - 0.128 TEMP

Predictor   Coef        Stdev      T        p
Constant    15.8379a    0.8018c    19.75e   0.000g
TEMP        -0.12792b   0.01746d   -7.33f   0.000g

S = 0.6542h   R-Sq = 89.9%i   R-Sq(adj) = 88.3%

Analysis of Variance
Source       DF   SS        MS       F        p
Regression   1    22.981j   22.981   53.69m   0.000n
Error        6    2.568k    0.428
Total        7    25.549l

Fit        Stdev Fit   95.0% CIq           95.0% PIr
10.721o    0.241p      (10.130, 11.312)    (9.014, 12.428)

a b0   b b1   c s_b0   d s_b1   e t for testing H0: β0 = 0   f t for testing H0: β1 = 0   g p-values for t statistics   h s = standard error   i r²   j Explained variation   k SSE = Unexplained variation   l Total variation   m F(model) statistic   n p-value for F(model)   o ŷ   p s_ŷ   q 95% confidence interval when x = 40   r 95% prediction interval when x = 40

(b) The Excel output

Regression Statistics
Multiple R            0.948413871
R Square              0.899488871i
Adjusted R Square     0.882737016
Standard Error        0.654208646h
Observations          8

ANOVA
             df   SS             MS         F          Significance F
Regression   1    22.98081629j   22.98082   53.69488m  0.000330052n
Residual     6    2.567933713k   0.427989
Total        7    25.54875l

            Coefficients     Standard Error   t Stat      P-valueg    Lower 95%      Upper 95%
Intercept   15.83785741a     0.801773385c     19.75353e   1.09E-06    13.87598718    17.79972765
Temp        -0.127921715b    0.01745733d      -7.32768f   0.00033     -0.170638294o  -0.08520514o

a b0   b b1   c s_b0   d s_b1   e t for testing H0: β0 = 0   f t for testing H0: β1 = 0   g p-values for t statistics   h s = standard error   i r²   j Explained variation   k SSE = Unexplained variation   l Total variation   m F(model) statistic   n p-value for F(model)   o 95% confidence interval for β1

To test the significance of the slope we compare |t| with t_α/2 based on n − 2 = 8 − 2 = 6 degrees of freedom. Because

|t| = 7.33 > t.025 = 2.447

we can reject H0: β1 = 0 in favor of Ha: β1 ≠ 0 at level of significance .05. The p-value for testing H0 versus Ha is twice the area to the right of |t| = 7.33 under the curve of the t distribution having n − 2 = 6 degrees of freedom. Since this p-value can be shown to be .00033, we can reject H0 in favor of Ha at level of significance .05, .01, or .001. We therefore have extremely strong evidence that x is significantly related to y and that the regression relationship is significant.
Figure 11.11 presents the MINITAB and Excel outputs of a simple linear regression analysis of the fuel consumption data. Note that b0 = 15.84, b1 = −.1279, s = .6542, s_b1 = .01746, and t = −7.33 (each of which has been previously calculated) are given on these outputs. Also note that Excel gives the p-value of .00033, and MINITAB has rounded this p-value to .000 (which means less than .001). Other quantities on the MINITAB and Excel outputs will be discussed later.
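For readers who want to reproduce these calculations, here is a sketch of ours (not the textbook's) that computes s_b1, the t statistic, the two-sided p-value, and the rejection point using Python with NumPy and SciPy; the data are the eight fuel consumption observations from Table 11.5.

import numpy as np
from scipy import stats

# Fuel consumption data and least squares fit (as in the earlier sketches).
x = np.array([28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5])
y = np.array([12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5])
SSxx = ((x - x.mean()) ** 2).sum()                        # approx 1404.355
b1 = ((x - x.mean()) * (y - y.mean())).sum() / SSxx       # approx -.1279
b0 = y.mean() - b1 * x.mean()
n = len(y)
s = np.sqrt(((y - (b0 + b1 * x)) ** 2).sum() / (n - 2))   # approx .6542

# Standard error of b1, t statistic, two-sided p-value, and rejection point (df = n - 2).
s_b1 = s / np.sqrt(SSxx)                                  # approx .01746
t_stat = b1 / s_b1                                        # approx -7.33
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)           # approx .00033
t_crit = stats.t.ppf(0.975, df=n - 2)                     # approx 2.447
print(t_stat, p_value, abs(t_stat) > t_crit)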


In addition to testing the significance of the slope, it is often useful to calculate a confidence interval for β1. We show how this is done in the following box:

A Confidence Interval for the Slope

If the regression assumptions hold, a 100(1 − α) percent confidence interval for the true slope β1 is [b1 ± t_α/2 s_b1]. Here t_α/2 is based on n − 2 degrees of freedom.

EXAMPLE 11.8  The Fuel Consumption Case

The MINITAB and Excel outputs in Figure 11.11 tell us that b1 = −.1279 and s_b1 = .01746. Thus, for instance, because t.025 based on n − 2 = 8 − 2 = 6 degrees of freedom equals 2.447, a 95 percent confidence interval for β1 is

[b1 ± t.025 s_b1] = [−.1279 ± 2.447(.01746)] = [−.1706, −.0852]

This interval says we are 95 percent confident that, if average hourly temperature increases by one degree, then mean weekly fuel consumption will decrease (because both the lower bound and the upper bound of the interval are negative) by at least .0852 MMcf of natural gas and by at most .1706 MMcf of natural gas. Also, because the 95 percent confidence interval for β1 does not contain 0, we can reject H0: β1 = 0 in favor of Ha: β1 ≠ 0 at level of significance .05. Note that the 95 percent confidence interval for β1 is given on the Excel output but not on the MINITAB output.
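A short sketch (our own, assuming SciPy is available) reproduces this interval from the quantities quoted in the example.

from scipy import stats

# 95 percent confidence interval for the slope, using b1, s_b1, and df = n - 2 = 6
# as quoted in Example 11.8.
b1, s_b1, df = -0.1279, 0.01746, 6
t_crit = stats.t.ppf(0.975, df)     # approx 2.447
lower = b1 - t_crit * s_b1          # approx -.1706
upper = b1 + t_crit * s_b1          # approx -.0852
print(lower, upper)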

EXAMPLE 11.9  The QHIC Case

Figure 11.12 presents the MegaStat output of a simple linear regression analysis of the QHIC data. Below we summarize some important quantities from the output (we discuss the other quantities later):

b0 = −348.3921    b1 = 7.2583    s = 146.897
s_b1 = .4156      t = b1 / s_b1 = 17.466    p-value for t < .001

Since the p-value for testing the significance of the slope is less than .001, we can reject H0: β1 = 0 in favor of Ha: β1 ≠ 0 at the .001 level of significance. It follows that we have extremely strong evidence that the regression relationship is significant. The MegaStat output also tells us that a 95 percent confidence interval for the true slope β1 is [6.4170, 8.0995]. This interval says we are 95 percent confident that mean yearly upkeep expenditure increases by between $6.42 and $8.10 for each additional $1,000 increase in home value.

Testing the significance of the y-intercept    We can also test the significance of the y-intercept β0. We do this by testing the null hypothesis H0: β0 = 0 versus the alternative hypothesis Ha: β0 ≠ 0. To carry out this test we use the test statistic

t = b0 / s_b0    where    s_b0 = s √(1/n + x̄² / SSxx)

Here the rejection point and p-value conditions for rejecting H0 are the same as those given previously for testing the significance of the slope, except that t is calculated as b0 / s_b0. For example, if we consider the fuel consumption problem and the MINITAB output in Figure 11.11, we see that b0 = 15.8379, s_b0 = .8018, t = 19.75, and p-value = .000. Because |t| = 19.75 > t.025 = 2.447 and p-value < .05, we can reject H0: β0 = 0 in favor of Ha: β0 ≠ 0 at the .05 level of significance.


F I G U R E 11.12  MegaStat Output of a Simple Linear Regression Analysis of the QHIC Data

Regression Analysis
r² = 0.889i   r = 0.943   Std. Error = 146.897h   n = 40   k = 1   Dep. Var. = Upkeep

ANOVA table
Source       SS               df   MS               F         p-value
Regression   6,582,759.6972j   1   6,582,759.6972   305.06m   9.49E-20n
Residual       819,995.5427k  38       21,578.8301
Total        7,402,755.2399l  39

Regression output
variables    coefficients   std. error   t (df = 38)   p-valueg    95% lower    95% upper
Intercept    -348.3921a     76.1410c     -4.576e       4.95E-05    -502.5315    -194.2527
Value           7.2583b      0.4156d     17.466f       9.49E-20       6.4170o      8.0995o

Predicted values for: Upkeep
Value    Predictedp     95% Confidence Intervalq (lower, upper)   95% Prediction Intervalr (lower, upper)   Leverages
220.00   1,248.42597    (1,187.78943, 1,309.06251)                (944.92878, 1,551.92317)                  0.042

a b0   b b1   c s_b0   d s_b1   e t for testing H0: β0 = 0   f t for testing H0: β1 = 0   g p-values for t statistics   h s = standard error   i r²   j Explained variation   k SSE = Unexplained variation   l Total variation   m F(model) statistic   n p-value for F(model)   o 95% confidence interval for β1   p ŷ   q 95% confidence interval when x = 220   r 95% prediction interval when x = 220   s distance value

In fact, since the p-value < .001, we can also reject H0 at the .001 level of significance. This provides extremely strong evidence that the y-intercept β0 does not equal 0 and that we should include β0 in the fuel consumption model. In general, if we fail to conclude that the intercept is significant at a level of significance of .05, it might be reasonable to drop the y-intercept from the model. However, remember that β0 equals the mean value of y when x equals 0. If, logically speaking, the mean value of y would not equal 0 when x equals 0 (for example, in the fuel consumption problem, mean fuel consumption would not equal 0 when the average hourly temperature is 0), it is common practice to include the y-intercept whether or not H0: β0 = 0 is rejected. In fact, experience suggests that it is definitely safest, when in doubt, to include the intercept β0.
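The intercept test can be reproduced in the same way. The sketch below is ours (it assumes NumPy and SciPy and reuses the fuel consumption data from Table 11.5); it computes s_b0 = s √(1/n + x̄²/SSxx), the t statistic, and the two-sided p-value.

import numpy as np
from scipy import stats

# Testing H0: beta0 = 0 for the fuel consumption model.
x = np.array([28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5])
y = np.array([12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5])
n = len(y)
SSxx = ((x - x.mean()) ** 2).sum()
b1 = ((x - x.mean()) * (y - y.mean())).sum() / SSxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(((y - (b0 + b1 * x)) ** 2).sum() / (n - 2))

# s_b0 = s * sqrt(1/n + xbar^2 / SSxx), then t = b0 / s_b0.
s_b0 = s * np.sqrt(1.0 / n + x.mean() ** 2 / SSxx)   # approx .8018
t_stat = b0 / s_b0                                   # approx 19.75
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)      # approx 1.1e-06
print(s_b0, t_stat, p_value)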

Exercises for Section 11.4

CONCEPTS
11.32 What do we conclude if we can reject H0: β1 = 0 in favor of Ha: β1 ≠ 0 by setting
a  α equal to .05?
b  α equal to .01?
11.33 Give an example of a practical application of the confidence interval for β1.

METHODS AND APPLICATIONS
In Exercises 11.34 through 11.38, we refer to MINITAB, MegaStat, and Excel output of simple linear regression analyses of the data sets related to the five case studies introduced in the exercises for Section 11.1. Using the appropriate output for each case study,
a  Identify the least squares point estimates b0 and b1 of β0 and β1.
b  Identify SSE, s², and s.
c  Identify s_b1 and the t statistic for testing the significance of the slope. Show how t has been calculated by using b1 and s_b1.
d  Using the t statistic and appropriate rejection points, test H0: β1 = 0 versus Ha: β1 ≠ 0 by setting α equal to .05. What do you conclude about the relationship between y and x?
e  Using the t statistic and appropriate rejection points, test H0: β1 = 0 versus Ha: β1 ≠ 0 by setting α equal to .01. What do you conclude about the relationship between y and x?
f  Identify the p-value for testing H0: β1 = 0 versus Ha: β1 ≠ 0. Using the p-value, determine whether we can reject H0 by setting α equal to .10, .05, .01, and .001. What do you conclude about the relationship between y and x?
g  Calculate the 95 percent confidence interval for β1. Discuss one practical application of this interval.


h  Calculate the 99 percent confidence interval for β1.
i  Identify s_b0 and the t statistic for testing the significance of the y intercept. Show how t has been calculated by using b0 and s_b0.
j  Identify the p-value for testing H0: β0 = 0 versus Ha: β0 ≠ 0. Using the p-value, determine whether we can reject H0 by setting α equal to .10, .05, .01, and .001. What do you conclude?
k  Using the appropriate data set, show how s_b0 and s_b1 have been calculated. Hint: Calculate SSxx.

11.34 THE STARTING SALARY CASE  StartSal

The MINITAB output of a simple linear regression analysis of the data set for this case (see Exercise 11.5 on page 454) is given in Figure 11.13. Recall that a labeled MINITAB regression output is on page 471.

11.35 THE SERVICE TIME CASE  SrvcTime
The MegaStat output of a simple linear regression analysis of the data set for this case (see Exercise 11.7 on page 454) is given in Figure 11.14. Recall that a labeled MegaStat regression output is on page 473.

F I G U R E 11.13  MINITAB Output of a Simple Linear Regression Analysis of the Starting Salary Data

The regression equation is SALARY = 14.8 + 5.71 GPA

Predictor   Coef     StDev    T       P
Constant    14.816   1.235    12.00   0.000
GPA         5.7066   0.3953   14.44   0.000

S = 0.5363   R-Sq = 97.7%   R-Sq(adj) = 97.2%

Analysis of Variance
Source       DF   SS       MS       F        P
Regression   1    59.942   59.942   208.39   0.000
Error        5    1.438    0.288
Total        6    61.380

Fit      StDev Fit   95.0% CI            95.0% PI
33.362   0.213       (32.813, 33.911)    (31.878, 34.846)

F I G U R E 11.14  MegaStat Output of a Simple Linear Regression Analysis of the Service Time Data

Regression Analysis
r² = 0.990   r = 0.995   Std. Error = 4.615   n = 11   k = 1   Dep. Var. = Minutes

ANOVA table
Source       SS            df   MS            F        p-value
Regression   19,918.8438    1   19,918.8438   935.15   2.09E-10
Residual        191.7017    9       21.3002
Total        20,110.5455   10

Regression output
variables   Coefficients   std. error   t (df = 9)   p-value    95% lower   95% upper
Intercept   11.4641        3.4390        3.334       .0087       3.6845     19.2437
Copiers     24.6022        0.8045       30.580       2.09E-10   22.7823     26.4221

Predicted values for: Minutes
Copiers   Predicted
1         36.066
2         60.669
3         85.271
4         109