AP Statistics – Ch. 3 Notes

Describing Relationships: Scatterplots and Linear Regression

In many statistical studies, looking at the distribution of one variable is not enough. The relationships between two or more variables can give a more complete picture. A response variable is the variable of interest in a study. Often, we are interested in predicting its value. An explanatory variable helps explain or influence changes in a response variable.

• There are some situations where there isn’t a clear distinction between the variables. If we are simply observing both variables, there may or may not be explanatory and response variables. It depends on how you intend to use the data.

• In other fields of mathematics, response variables are called dependent variables and explanatory variables are called independent variables. In statistics, “dependent” and “independent” have other very specific meanings, so we prefer “explanatory” and “response”.

Examples: For each situation, determine which of the two variables is the response variable and which is the explanatory variable.

a) Weed growth and amount of rain.
Explanatory: Amount of rain
Response: Weed growth

b) Winning percentage of a basketball team and attendance at basketball games. If you feel that winning percentage may influence attendance, then winning percentage would be the explanatory variable and attendance would be the response variable. If you believe that crowd support can improve a team’s performance, then attendance would be the explanatory variable and winning percentage would be the response variable.

c) Amount of daily exercise and resting pulse rate.
Explanatory: Amount of daily exercise
Response: Resting pulse rate

Example: Julie wants to know if she can predict a student’s weight from his or her height. Jim wants to know if there is a relationship between height and weight. In each case, identify the explanatory variable and response variable, if possible.

Since Julie is trying to use height to predict weight, she is treating height as an explanatory variable and weight as a response variable. Jim only wants to know if there is a relationship between the two. In his case, there is no clear-cut explanatory or response variable.

The best way to display the relationship between two quantitative variables is on a scatterplot. The values of one variable appear on the horizontal axis and the values of the other variable appear on the vertical axis. Each individual in the data appears as a point in the graph.

How to Make a Scatterplot:
1. Decide which variable should go on each axis. (The eXplanatory variable goes on the x-axis.)
2. Label and scale your axes.
3. Plot individual data values.

• Pick “nice” values to mark on each axis, making sure you cover the range of each variable.
• In a scatterplot, the axes usually do not start at 0, and the axes are often not on the same scale, but you need to make sure that the scale on each axis is consistent.
• Remember to clearly LABEL the variable name and units on each axis!

Example: The table below shows data for 13 students in a statistics class. Each member of the class ran a 40-yard sprint and then did a long jump (with a running start). Make a scatterplot of the relationship between sprint time (in seconds) and long-jump distance (in inches).

How to Examine a Scatterplot
As with any graph, look for the overall pattern and for striking departures from that pattern.

• Describe the direction (positive or negative), form (linear or curved, and also mention distinct clusters), and strength (strong, weak, moderate) of the relationship.

• Describe any outliers (individuals that fall outside the overall pattern of the relationship).

Describing Direction:

Two variables have a positive association if above-average values of one variable tend to accompany above-average values of the other, and below-average values also tend to occur together. Two variables have a negative association when above-average values of one tend to accompany below-average values of the other.

Tips:
• When discussing the form of a relationship, don’t let one or two unusual values influence your description. If covering up one value makes the form go from nonlinear to linear, call it a linear association with an outlier.
• There is no specific rule for identifying outliers on a scatterplot. Just look for points that fall outside the overall pattern of the association.
• Always describe the direction, form, strength, and outliers in context!
• Association does not imply causation! The fact that there is a relationship between two variables doesn’t mean that one causes the other.

[Scatterplot: Long-Jump Distance (in) vs. Sprint Time (sec)]

Sprint Time (seconds):        5.41  5.05  9.49  8.09  7.01  7.17  6.83  6.73  8.01  5.68  5.78  6.31  6.04
Long-Jump Distance (inches):  171   184   48    151   90    65    94    78    71    130   173   143   141

Example: Describe what the scatterplot in the previous example reveals about the relationship between sprint times and long-jump distances.

There is a moderate, negative, slightly curved relationship between sprint time and long-jump distance. In general, individuals with longer sprint times had shorter jump distances than those with faster times. The individual with a sprint time of 8.09 seconds and a jump distance of 151 inches appears to be an outlier. He or she jumped much further than would be expected based on the slow sprint speed.

When describing scatterplots, we are often interested in linear relationships because they can be described by simple models. A linear relationship is strong if the points lie close to a straight line and weak if they are widely scattered about a line. Unfortunately, sometimes it’s hard to judge the strength of a linear relationship just from looking at a scatterplot. Luckily, we have a numerical measure to help. The correlation r measures the direction and strength of the linear relationship between two quantitative variables.

• r is always a number between –1 and 1.

• r > 0 for a positive association and r < 0 for a negative association.

• Values of r near 0 indicate a very weak linear relationship.

• The strength of a linear relationship increases as r moves away from 0 toward either –1 or 1.

• The extreme values r = –1 and r = 1 occur only in the case of a perfect linear relationship, when the points lie exactly along a straight line.

• A value of r close to 1 does not guarantee a linear relationship between two variables. A scatterplot with a clearly curved form can have a correlation that is near 1 or –1. ALWAYS PLOT THE DATA!

How to Calculate the Correlation r (Don’t worry – in practice, this is almost never done by hand, but it’s good to look at the formula and figure out what it is telling us.)

Suppose that we have data on variables x and y for n individuals. The values for the first individual are x₁ and y₁, the values for the second individual are x₂ and y₂, and so on. The means and standard deviations of the two variables are x̄ and s_x for the x-values, and ȳ and s_y for the y-values. The correlation r between x and y is

r = (1/(n − 1)) [ ((x₁ − x̄)/s_x)((y₁ − ȳ)/s_y) + ((x₂ − x̄)/s_x)((y₂ − ȳ)/s_y) + … + ((x_n − x̄)/s_x)((y_n − ȳ)/s_y) ] = (1/(n − 1)) Σ z_x z_y

Notice that the formula involves z-scores for each individual for both variables. These z-scores are multiplied together. The correlation r is an “average” of the products of the standardized scores for all individuals.

Examples: Estimate the correlation r for each graph. Then interpret the value of r in context.
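To see what the formula is doing, here is a minimal sketch in plain Python that computes r for the sprint-time and long-jump data as the “average” of z-score products (the helper name `correlation` is just for illustration):

```python
import math

# Sprint times (seconds) and long-jump distances (inches) from the class data
times = [5.41, 5.05, 9.49, 8.09, 7.01, 7.17, 6.83, 6.73, 8.01, 5.68, 5.78, 6.31, 6.04]
jumps = [171, 184, 48, 151, 90, 65, 94, 78, 71, 130, 173, 143, 141]

def correlation(x, y):
    """r = (1/(n-1)) * sum of z_x * z_y, using sample standard deviations."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / (n - 1)

r = correlation(times, jumps)
print(round(r, 2))  # negative: slower sprint times go with shorter jumps
```

Swapping `times` and `jumps` gives the same value, since each z-score product is unchanged when the roles of x and y are exchanged.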

a) r ≈ 0.9. There is a strong, positive, linear association between the number of boats registered in Florida and the number of manatees killed. In general, as the number of boats registered increases, so does the number of manatees killed by boats.

b) r ≈ 0.5. There is a moderate, positive, linear association between the number of storms predicted and the number of storms actually observed. In general, as the number of predicted storms increases, so does the number of actual storms.

c) r ≈ 0.3. There is a weak, positive, linear association between the healing rates of the two front limbs of the newts in the experiment. In general, as the healing rate for one limb increased, so did the healing rate for the other limb.


d) r ≈ –0.1. There is a very weak, negative, linear association between last year’s percent return in the stock market and this year’s percent return in the stock market. In general, as last year’s return increases, this year’s return decreases.

Example: The following data give the weight (in pounds) and cost (in dollars) of a sample of 11 stand mixers (from Consumer Reports, November 2005).

Weight (lbs):  23   28   19   17   25   26   21   32   16   17   8
Cost ($):      180  250  300  150  300  370  400  350  200  150  30

a) Look at a scatterplot of the data and describe what you see.
There is a moderately strong, positive, linear association between the weight of stand mixers and their cost. In general, as the weight increases, so does the cost.

b) Calculate the correlation.
r = 0.74

c) The last mixer in the table is from Walmart. What happens to the correlation when you remove this point?
The correlation drops to r = 0.56. The relationship is weaker than it was before.

d) What happens to the correlation if the Walmart mixer weighs 25 pounds instead of 8 pounds? Add the point (25, 30) and recalculate the correlation.
The correlation drops to r = 0.34. The relationship is even weaker than it was before.

[Scatterplot: Cost ($) vs. Weight (lbs) for the stand-mixer data]

e) Suppose that a new titanium mixer was introduced that weighed 8 pounds, but the cost was $500. Remove the point (25, 30), add the point (8, 500), and recalculate the correlation.
The correlation actually becomes negative (r = –0.08). There is now a weak, negative, linear association between the weight of stand mixers and their cost.

f) Summarize what you have learned about the effect of a single point on the correlation.
A single point can have a huge effect on correlation. Points at the far right and far left of the scatterplot (extreme x values) tend to have the greatest effect. If removing a point has a dramatic effect on the correlation, the point is called an influential point.

Facts about Correlation
1. Correlation makes no distinction between explanatory and response variables. If you swap the explanatory and response variables, the correlation remains the same.
2. Since it is based on z-scores, r does not change when we change the units of measurement of x, y, or both.
3. The correlation r has no unit of measurement. It is just a number.

Warnings
• Correlation requires that both variables be quantitative.
• Correlation measures the strength of only the linear relationship between two variables. Correlation does not describe curved relationships between variables, no matter how strong the relationship is! A correlation of 0 doesn’t guarantee that there’s no relationship between two variables, just that there’s no linear relationship.
• A value of r close to 1 does not guarantee a linear relationship between two variables. A scatterplot with a clearly curved form can have a correlation that is near 1 or –1. ALWAYS PLOT THE DATA!
• The correlation is not resistant. r is strongly affected by a few outlying observations.
• Correlation is not a complete summary of two-variable data. You should give the means and standard deviations of both x and y along with the correlation.

Least-Squares Regression
A regression line is a line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.

Suppose that y is a response variable and x is an explanatory variable. A regression line relating y to x has an equation of the form ŷ = a + bx.

• ŷ (“y hat”) is the predicted value of y for a given value of the explanatory variable x. In statistics, “putting a hat on it” is standard notation to indicate that something has been predicted by a model.
• b is the slope, the amount by which y is predicted to change when x increases by one unit.
• a is the y-intercept, the predicted value of y when x = 0.

• You may have noticed the emphasis on the word “predicted.” The AP people are huge on this. When interpreting a slope or intercept, always use language like, “The model predicts that for every one mph increase in average speed, the gas mileage will decrease by y miles per gallon.” Also, always write your equation with a hat over the predicted quantity, like (gas mileage)-hat = a + b(average speed).

Example: The following data show the number of miles driven and the advertised price for 11 used Honda CR-Vs from the 2002–2006 model years. The scatterplot shows a strong negative linear association between number of miles and advertised price. The correlation is r = –0.874. The line on the plot is the regression line for predicting advertised price based on number of miles. The regression equation is

ŷ = 18,773 – 86.18x, or predicted price = 18,773 – 86.18(thousands of miles).

a) Identify the slope and y-intercept of the regression line. Interpret each value in context.

Slope: b = –$86.18 per thousand miles.
The model predicts that for each additional 1000 miles driven, the price of the CR-V will decrease by about $86.18.

y-intercept: a = $18,773.
The model predicts that a used CR-V with 0 miles on it will have a price of $18,773. This does not make much sense. It’s unlikely that anyone would be selling a used car with 0 miles or that the price for a car with 0 miles would be as low as $18,773.

Miles Driven (thousands):  22      29      35      39      45      49      55      56      69      70      86
Price (dollars):           17,998  16,450  14,998  13,998  14,599  14,988  13,599  14,599  11,998  14,450  10,998

[Scatterplot of Price (dollars) vs. Miles Driven (thousands), with the regression line]

b) Predict the price for a used Honda CR-V with 50,000 miles.
predicted price = 18,773 – 86.18(50) = $14,464

c) Would it be appropriate to predict the asking price for a used 2002–2006 Honda CR-V with 250,000 miles? Why or why not?
No! This would be extrapolation. We only have data for cars with between 22,000 and 86,000 miles on them. We don’t know if the linear pattern will continue beyond these values. In fact, if we did use the equation to predict a price for a car with 250,000 miles, the prediction would be –$2,772.
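Parts b) and c) amount to plugging values into the regression equation; a quick sketch in plain Python (the function name is just for illustration):

```python
def predicted_price(thousands_of_miles):
    """Regression equation from the notes: price-hat = 18,773 - 86.18x."""
    return 18773 - 86.18 * thousands_of_miles

print(predicted_price(50))   # within the data range (22 to 86): a reasonable prediction
print(predicted_price(250))  # extrapolation: the model returns a negative price
```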

Extrapolation is the use of a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line. Few relationships are linear for all values of the explanatory variable. Don’t make predictions using values of x that are much larger or much smaller than those that actually appear in your data! (Notice that the y-intercept is often an extrapolation, so interpreting it doesn’t always make sense.)

Choosing a Regression Line
How do we know which line is the best? A good regression line makes the vertical distances of the points from the line as small as possible. A residual is the difference between an observed value of the response variable and the value predicted by the regression line.

residual = observed y – predicted y = y – ŷ

Example: Find the residuals for the two Honda CR-Vs that were driven 39,000 miles and 70,000 miles. What do the signs on the residuals indicate?

39,000 miles:
predicted price = 18,773 – 86.18(39) = $15,412
residual = price – predicted price = $13,998 – $15,412 = –$1,414

70,000 miles:
predicted price = 18,773 – 86.18(70) = $12,740
residual = price – predicted price = $14,450 – $12,740 = $1,710

The CR-V with 39,000 miles sold for $1,414 less than the model predicted, and the CR-V with 70,000 miles sold for $1,710 more than the model predicted.

The least-squares regression line of y on x is the line that makes the sum of the squared residuals as small as possible.
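Residual calculations follow the same two-step pattern every time: predict, then subtract. A minimal sketch in plain Python using the regression equation from this example:

```python
def predicted_price(thousands_of_miles):
    # Regression equation from the notes: price-hat = 18,773 - 86.18x
    return 18773 - 86.18 * thousands_of_miles

def residual(observed_price, thousands_of_miles):
    """residual = observed y - predicted y"""
    return observed_price - predicted_price(thousands_of_miles)

print(round(residual(13998, 39)))  # negative: sold for less than predicted
print(round(residual(14450, 70)))  # positive: sold for more than predicted
```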

Equation of the Least-Squares Regression Line
The least-squares regression line is the line ŷ = a + bx, where:

b = r(s_y / s_x)   and   a = ȳ – b·x̄

• The equation for the slope tells us that along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y.
• The equation for the y-intercept makes use of the fact that the least-squares line always passes through the point (x̄, ȳ).

Example: The number of miles (in thousands) for the 11 used Hondas has a mean of 50.5 and a standard deviation of 19.3. The advertised prices have a mean of $14,425 and a standard deviation of $1,899. The correlation for these variables is r = –0.874. Find the equation of the least-squares regression line.

x̄ = 50.5 thousand miles,  s_x = 19.3 thousand miles,  ȳ = $14,425,  s_y = $1,899,  r = –0.874

b = –0.874(1899/19.3) = –86.0
a = 14,425 – (–86.0)(50.5) = 18,768

The equation of the least-squares regression line for predicting price from miles driven is predicted price = 18,768 – 86.0(thousands of miles). The reason it’s slightly different from the equation used in the previous examples is that we used rounded values of x̄, ȳ, s_x, s_y, and r in this example.

Explain what change in price we would expect for each additional 19.3 thousand miles.
A change of one standard deviation in x corresponds to a change of r standard deviations in y. 19.3 thousand miles is one s_x. This corresponds to a change of r·s_y.

r·s_y = (–0.874)(1899) = –1659.73

For each additional 19.3 thousand miles, we should expect the price to drop by about $1,659.73.
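A sketch in plain Python that applies the slope and intercept formulas to the rounded summary statistics above:

```python
# Summary statistics for the 11 used Hondas (rounded values from the notes)
x_bar, s_x = 50.5, 19.3        # miles driven, in thousands
y_bar, s_y = 14425, 1899       # advertised price, in dollars
r = -0.874

b = r * s_y / s_x              # slope: b = r * (s_y / s_x)
a = y_bar - b * x_bar          # intercept: a = y-bar - b * x-bar

print(round(b, 1), round(a))   # close to the -86.18 and 18,773 found from raw data

# A change of one s_x in x corresponds to a change of r * s_y in y:
print(round(r * s_y, 2))
```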

Residual Plots
A residual plot is a scatterplot with the x values along the x-axis and the residuals along the y-axis. ALWAYS look at a residual plot to make sure that a linear model is the best fit. Basically, you want it to be the most boring scatterplot you’ve ever seen—no obvious pattern and residuals randomly scattered around zero. Watch out for curved patterns and plots with low scatter in some areas and high scatter in others.

Here are some examples of what you want to see: there is no real pattern, and the residuals appear to be randomly scattered around zero.

Here are some examples of residual plots that signal trouble:

The residual plot to the left has a clearly curved pattern, so a linear model is definitely not appropriate. The residual plot above has increasing scatter as the x-values increase. A linear model is still appropriate because the residuals still appear to be randomly scattered around zero and there’s no curved pattern, but we should use caution when using the model for larger x-values because the prediction error increases as x increases. A residual plot with a fan-shaped pattern like this will also mean we shouldn’t use certain methods and formulas we will learn about later in the year.

Example: The residual plot for the used Honda example is shown below. What does this plot indicate about the appropriateness of using a linear model?

The residual plot shows no obvious pattern. The residuals appear to be randomly scattered around 0. This indicates that a linear model is appropriate for this data.

[Residual plot: Residuals vs. Miles Driven (thousands); response is Price (dollars)]

Standard Deviation of the Residuals (s)
It’s important when looking at a regression model to be able to measure how far off, on average, the predictions are from the actual values. The approximate size of the “typical” or “average” prediction error is measured by the standard deviation of the residuals (s):

s = √( Σ residual² / (n – 2) ) = √( Σ(y_i – ŷ_i)² / (n – 2) )

s is measured in the units of the response variable.

Example: For the used Hondas data,

s = √( 8,496,888 / (11 – 2) ) ≈ $972

Interpret this value in context.
The typical difference between the actual and predicted prices of the used Honda CR-Vs is about $972.

Coefficient of Determination (r²)
The coefficient of determination is a measure of how well the least-squares line predicts values of the response variable y. It is abbreviated r² for the excellent reason that it is the square of the correlation. r² is the percent of the variation in the y values that can be explained by the linear relationship between x and y.

r² = 1 – SSE/SST = 1 – [ Σ residual² / Σ(y_i – ȳ)² ] = 1 – [ Σ(y_i – ŷ_i)² / Σ(y_i – ȳ)² ]

In this formula, SSE measures the variation of the y values from the regression line (the unexplained variation), and SST measures the total variation in the y values. SST is based on the line ŷ = ȳ.

Example: Calculate r² for the used Honda example and interpret it in context.
SSE = Σ residual² = 8,496,888 and SST = Σ(y_i – ȳ)² = 36,072,692. Verify that the answer is also the square of the correlation (r = –0.874).

r² = 1 – SSE/SST = 1 – 8,496,888/36,072,692 = 0.764
(–0.874)² = 0.764

76.4% of the variation in the price of used Honda CR-Vs can be explained by the approximate linear relationship with the number of miles driven.

Unusual Points: Influential Observations and Outliers
An outlier is an observation that lies outside the overall pattern of the other observations. Points that are outliers in the y direction but not the x direction of a scatterplot have large residuals, but may not have a very large effect on the regression line. An observation is influential if removing it would dramatically change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential. Beware—influential points drag the line to them, so they often have small residuals!

• You are not allowed to just ignore or delete outliers or influential points just because they are inconvenient! However, you can report the results of the regression analysis with and without them.

Reading Computer Output
Here is computer output for the Honda example from a common statistical program. You need to learn to write the regression equation and identify the important information from computer output.

Steps in a Linear Regression Analysis:
1. Summarize the data graphically by constructing a scatterplot.
2. Based on the scatterplot, decide if it looks like there is an approximate linear relationship between x and y. If so, proceed to the next step.
3. Find the equation of the least-squares regression line. Interpret the values of b and a in context.
4. Construct a residual plot and look for any patterns or unusual features that may indicate that a line is not the best way to summarize the relationship between x and y. If none are found, proceed to the next step.
5. Compute the values of s and r² and interpret them in context.
6. Based on what you learned in the previous two steps, decide whether the least-squares regression line is useful for making predictions. If so, proceed to the last step.
7. Use the least-squares regression line to make predictions.

Common Regression Mistakes (plagiarized shamelessly from “Statistics: Learning from Data” by Peck and Olsen and “Stats: Modeling the World” by Bock, Velleman, and De Veaux.)

1. CORRELATION (OR ANY ASSOCIATION) DOES NOT IMPLY CAUSATION!
2. Correlation and regression lines only describe linear relationships. IF THE RELATIONSHIP ISN’T LINEAR, THE REGRESSION ANALYSIS IS MEANINGLESS AND MISLEADING! A correlation near 0 does not necessarily imply that there is no relationship between two variables. Always look at a scatterplot to see if there is a strong, but non-linear relationship. Also, a correlation near 1 or –1 is useless if there isn’t a linear relationship. Sometimes it is possible to transform the data to make it linear.
3. The least-squares regression line for predicting y from x is not the same line as the line for predicting x from y. By definition, the least-squares line is the line that minimizes the sum of the squared residuals in the y direction. This is not the same line as the line that minimizes the sum of the squared residuals in the x direction. The consequence of this is that you can’t solve a regression equation for the explanatory variable and then try to predict values of the explanatory variable from values of the response variable.
4. BEWARE OF EXTRAPOLATION!
5. Be careful interpreting the value of the intercept of the regression line. Often, this is an extrapolation, and in many cases, it is nonsense.
6. Be on guard for different groups in your regression. Check for evidence that the data consist of separate subsets. If you find subsets that behave differently, consider fitting a different linear model to each subset.
7. Remember that the least-squares regression line may be the “best” line, but that doesn’t necessarily mean that it will produce good predictions. Make sure to carefully interpret s and r² before betting your house on your results.
8. Consider both r² and s, not just one of the two—they measure different things. Ideally, you want a small value for s (which indicates that the predictions are close to the actual values) and a large value for r² (which indicates that the linear relationship explains a large proportion of the variability in the response variable).
9. The values of the correlation coefficient and the slope and intercept of the least-squares line are extremely sensitive to influential observations in the data set. You can always get a good fit if you remove enough carefully selected points, but this isn’t honest and it won’t give you a good understanding. Consider comparing two regressions—one with an influential point included, and one without. Also, realize that some variables are just not related in a way that’s simple enough for a linear model to fit well. When that happens, report the failure and stop.
10. Watch out for regressions based on summary data. If the data values are summaries themselves (means or medians), they will be less variable than the data on which they are based, which makes the relationship look stronger than it actually is. (r and r² will be larger for a regression based on means than a regression based on raw data, and s will be smaller.)

Transforming to Achieve Linearity
If the scatterplot or residual plot of bivariate data shows a curved relationship, it is sometimes possible to transform the data in some way to straighten a non-linear relationship. Once we’ve straightened the data, we can use linear regression techniques to estimate the values of the response variable. Here are some of the most common transformations:

Square Root: √y-hat = a + bx
Reciprocal: 1/y-hat = a + bx
Logarithmic: ŷ = a + b ln x  – or –  ŷ = a + b log x
Exponential: (ln y)-hat = a + bx  – or –  (log y)-hat = a + bx  (can also be written in the form ŷ = e^(a + bx))
Power: (ln y)-hat = a + b ln x  – or –  (log y)-hat = a + b log x  (can also be written in the form ŷ = ax^b)
Other possibilities: squaring one or both sides, cubing one or both sides, using 1/x² or 1/y, etc.

• You need to be able to evaluate the effectiveness of a model (using a residual plot and the values of R² and s) and use the regression equation from the transformed data to make predictions.
• For a non-linear model, the coefficient of determination is denoted with a capital R². R² is the proportion of the variation in the original response variable that can be explained by the approximate quadratic/exponential/logarithmic/power model. This means R² can be compared directly for different models even if the transformed y variables have different units.
• Be very careful comparing the values of s for different transformed models. Remember that it is the prediction error for predicting the transformed y variable, not the original y variable. You shouldn’t compare s directly for two models if the transformed y variables are different (e.g. if one model is predicting ln y and another is predicting √y).

Example: The following data set shows the circumferences (cm) and masses (grams) of several different types of fruits.

Name of Fruit:       Grapefruit  Tomato  Cantaloupe  Plum   Green Melon  Orange Melon  Nectarine
Circumference (cm):  34.3        15.5    45.7        19.5   51.5         45.1          25.3
Mass (g):            540.3       68.9    1240        111.3  1730         1130          244.4

Name of Fruit:       Apple  Orange  Peach  Vine Tomato  Navel Orange  Pink Grapefruit  Globe Grape
Circumference (cm):  24.5   24      22.3   12.4         26.2          32.5             8.5
Mass (g):            232.3  201.3   190.9  33.7         250           510.5            14

The computer output for a linear regression model is shown below. Write the equation of the least-squares line. What is the correlation coefficient? What does this suggest about the linear relationship between fruit circumference and fruit mass? What should we make sure we do before using the regression equation to make predictions? Response Variable: Mass (g)

Term Coef SE Coef T-Value P-Value

Constant -614 114 -5.36 0.000

Circumference (cm) 38.97 3.78 10.32 0.000

S R-sq R-sq(adj)

174.709 89.87% 89.03%

� ( )mass 614 38.97 circumference= − + 0.8987 0.9480.r = =

If we didn’t look at a graph, this value would lead us to believe that there is a strong, positive, linear association between the circumference of fruits and their masses. However, if you look at a scatterplot, you find out that the relationship is NOT LINEAR! Remember, r is meaningless if the relationship isn’t linear. You should never use regression output without looking at a graph.

Based on the scatterplot and residual plot below, is it appropriate to use a linear model for these data?

A linear model is not appropriate for these data. The scatterplot and the residual plot both show a clearly curved pattern.

[Figure: Scatterplot of Mass (g) vs. Circumference (cm)]
[Figure: Residuals vs. Circumference (response is Mass)]

Transformation: Mass vs. Circumference

Response Variable: Sqrt Mass

Term Coef SE Coef T-Value P-Value

Constant -5.966 0.761 -7.84 0.000

Circumference (cm) 0.8853 0.0251 35.23 0.000

S R-sq R-sq(adj)

1.16231 99.04% 98.96%

How would you write the equation for the least-squares regression line for the transformed data?

√mass-hat = -5.966 + 0.8853(circumference)

What mass does the model predict for a fruit with a circumference of 30 cm?

√mass-hat = -5.966 + 0.8853(30) = 20.593
mass-hat = 20.593² ≈ 424.1 g
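Because the response was transformed, the prediction must be back-transformed. A quick Python check of the arithmetic (coefficients taken from the output above):

```python
# Predict sqrt(mass) at 30 cm, then square to undo the transform
sqrt_mass_hat = -5.966 + 0.8853 * 30
mass_hat = sqrt_mass_hat ** 2
print(round(sqrt_mass_hat, 3), round(mass_hat, 1))  # 20.593 424.1
```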

Is this an appropriate model to describe the relationship between fruit circumference and mass? Explain. No. The scatterplot and the residual plot still show a clearly curved pattern.

[Figure: Scatterplot of Sqrt Mass vs. Circumference (cm)]
[Figure: Residuals vs. Circumference (response is Sqrt Mass)]

Transformation: ln Mass vs. Circumference

Response Variable: ln Mass

Term Coef SE Coef T-Value P-Value

Constant 2.596 0.243 10.68 0.000

Circumference (cm) 0.10305 0.00802 12.85 0.000

S R-sq R-sq(adj)

0.371003 93.22% 92.66%

How would you write the equation for the least-squares regression line for the transformed data?

ln(Mass)-hat = 2.596 + 0.10305(Circumference)

What mass does the model predict for a fruit with a circumference of 30 cm?

ln(Mass)-hat = 2.596 + 0.10305(30) = 5.6875
Mass-hat = e^5.6875 ≈ 295.2 g
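The same back-transformation idea applies here, except e^x undoes the natural log. A quick check:

```python
import math

# Predict ln(mass) at 30 cm, then exponentiate to undo the transform
ln_mass_hat = 2.596 + 0.10305 * 30
mass_hat = math.exp(ln_mass_hat)
print(round(ln_mass_hat, 4), round(mass_hat, 1))  # 5.6875 295.2
```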

Is this an appropriate model to describe the relationship between fruit circumference and mass? Explain. No, the scatterplot and the residual plot still show a clearly curved pattern.

[Figure: Scatterplot of ln Mass vs. Circumference (cm)]
[Figure: Residuals vs. Circumference (response is ln Mass)]

Transformation: Mass vs. ln Circumference

Response Variable: Mass (g)

Term Coef SE Coef T-Value P-Value

Constant -2311 536 -4.31 0.001

ln Circumference 865 165 5.24 0.000

S R-sq R-sq(adj)

302.705 69.59% 67.06%

How would you write the equation for the least-squares regression line for the transformed data?

Mass-hat = -2311 + 865 ln(Circumference)

What mass does the model predict for a fruit with a circumference of 30 cm?

Mass-hat = -2311 + 865 ln(30) ≈ 631.0 g

Is this an appropriate model to describe the relationship between fruit circumference and mass? Explain. No – it’s terrible! The scatterplot and the residual plot have even more of a curved pattern than those in the other models, and R2 is much lower than in any of the other models, so this model explains much less of the variation in mass than the other models. Also, s is huge. The typical prediction error for predicting mass using this model is 302.705 grams!

[Figure: Scatterplot of Mass (g) vs. ln Circumference]
[Figure: Residuals vs. ln Circumference (response is Mass (g))]

Transformation: ln Mass vs. ln Circumference

Response Variable: ln Mass

Term Coef SE Coef T-Value P-Value

Constant -3.183 0.109 -29.31 0.000

ln Circumference 2.6888 0.0334 80.40 0.000

How would you write the equation for the least-squares regression line for the transformed data?

ln(Mass)-hat = -3.183 + 2.6888 ln(Circumference)

What mass does the model predict for a fruit with a circumference of 30 cm?

ln(Mass)-hat = -3.183 + 2.6888 ln(30) = 5.96213952
Mass-hat = e^5.96213952 ≈ 388.4 g
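For the power model both variables were logged, so when predicting, take the log of the explanatory value going in and exponentiate the result coming out. A quick check that reproduces the 388.4 g:

```python
import math

# Power model: predict ln(mass) from ln(circumference), then exponentiate
ln_mass_hat = -3.183 + 2.6888 * math.log(30)
mass_hat = math.exp(ln_mass_hat)
print(round(mass_hat, 1))  # 388.4
```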

Is this an appropriate model to describe the relationship between fruit circumference and mass? Explain. Yes! The scatterplot looks approximately linear, and the residual plot does not have a curved pattern. The residuals appear to be randomly scattered around 0. (It’s like magic!)

Interpret the values of a, b, R², and s in context. (These sound really awkward – but you need to take the transformations into account when interpreting!)

a: The model predicts that when ln Circumference is 0 ln cm, ln Mass will be -3.183 ln grams. Since we only have data for ln Circumferences between about 2 and 4, this is extrapolation and should not be trusted.

b: The model predicts that for each additional ln cm in ln Circumference, ln Mass will increase by 2.6888 ln grams.

R²: 99.81% of the variation in mass can be explained by the power model.

s: The typical difference between the predicted and actual values of ln Mass is about 0.0613516 ln grams. (Some algebra and the rules of logarithms can show that this means that this model results in about a 6.1% prediction error for predicting mass.)

S R-sq R-sq(adj)
0.0613516 99.81% 99.80%

[Figure: Scatterplot of ln Mass vs. ln Circumference]
[Figure: Residuals vs. ln Circumference (response is ln Mass)]

Example: The lengths (inches) and weights (pounds) of 25 crocodiles are shown below. In your calculator, create the following lists:

L1: Length
L2: Weight
L3: Square root of weight
L4: Common log of length
L5: Common log of weight

Look at the scatterplots for the following:

• Weight vs. Length

• √Weight vs. Length

• log(Weight) vs. Length

• Weight vs. log(Length)

• log(Weight) vs. log(Length)

Length (in) 94 74 147 58 86 94 63 86 69 72 128 85 82

Weight (lb) 130 51 640 28 80 110 33 90 36 38 366 84 80

Length (in) 86 88 72 74 61 90 89 68 76 114 90 78

Weight (lb) 83 70 61 54 44 106 84 39 42 197 102 57

Describe the scatterplot for weight vs. length in context.

There is a strong, positive, curved relationship between crocodile length and weight. In general, as length increases, so does weight. The weight seems to increase more quickly for the longer lengths.

Which of the relationships appears the most linear? Check the residual plots to verify the linear relationship. (To make a residual plot, first run the LinReg (a + bx) command on the two lists. This will create a residuals list. Then make a scatterplot with the x variable as the XList and YList: RESID.)

All of the residual plots except the one for log(Weight) vs. Length (the exponential model) show curved patterns. The residual plot for log(Weight) vs. Length does not have a curved pattern and the residuals appear randomly scattered around zero, so it is the best model.

Write the equation for the least-squares line for the transformation that seems best at eliminating the curvature in the crocodile data.

log(Weight)-hat = 0.579929 + 0.015381(Length)
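The same fit can be reproduced off-calculator. A Python sketch (standard library only; variable names are my own) that refits log(Weight) on Length and builds the residuals list:

```python
import math

# Crocodile data from the table above
length = [94, 74, 147, 58, 86, 94, 63, 86, 69, 72, 128, 85, 82,
          86, 88, 72, 74, 61, 90, 89, 68, 76, 114, 90, 78]
weight = [130, 51, 640, 28, 80, 110, 33, 90, 36, 38, 366, 84, 80,
          83, 70, 61, 54, 44, 106, 84, 39, 42, 197, 102, 57]

log_w = [math.log10(w) for w in weight]   # L5: common log of weight

n = len(length)
x_bar = sum(length) / n
y_bar = sum(log_w) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(length, log_w)) / \
    sum((x - x_bar) ** 2 for x in length)
a = y_bar - b * x_bar

# Residuals for the transformed model (what the calculator stores in RESID)
resid = [y - (a + b * x) for x, y in zip(length, log_w)]

print(f"log(Weight)-hat = {a:.6f} + {b:.6f}(Length)")
```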

Interpret a, b, R², and s in context.

a = 0.579929 log pounds.

The model predicts that the log of the weight of a crocodile with a length of 0 inches would be 0.579929 log pounds. This is extrapolation and clearly doesn’t make sense because a crocodile would never be 0 inches long.

b = 0.015381 log pounds per inch.

The model predicts that for each additional inch in crocodile length, the log of the crocodile’s weight will increase by 0.015381 log pounds.

R² = 96.01%.

96.01% of the variation in crocodile weights is explained by the exponential model.

s = 0.0648396 log pounds.

The typical difference between the actual and predicted values of log(Weight) is 0.0648396 log pounds.

Using your model, predict the weight of a crocodile whose length is 140 inches.

log(Weight)-hat = 0.579929 + 0.015381(140) = 2.733269
Weight-hat = 10^2.733269 ≈ 541.1 pounds

The model predicts that a crocodile whose length is 140 inches would weigh about 541.1 pounds.
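As a final check of the back-transformation with the common log, the prediction in Python:

```python
# Predict log10(weight) at 140 inches, then raise 10 to that power
log_w_hat = 0.579929 + 0.015381 * 140
weight_hat = 10 ** log_w_hat
print(round(weight_hat, 1))  # about 541.1
```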