scatterplots and correlation - virtual.yosemite.cc.ca.usvirtual.yosemite.cc.ca.us/jcurl/math134 4...


Page 1: Scatterplots and Correlation - virtual.yosemite.cc.ca.usvirtual.yosemite.cc.ca.us/jcurl/Math134 4 s/ch4-090-114.pdf · That’s where bullets hit ... statistical analysis of several-variable

P1: PBU/OVY P2: PBU/OVY QC: PBU/OVY T1: PBU

GTBL011-04 GTBL011-Moore-v15.cls May 16, 2006 17:8

CHAPTER 4

Scatterplots and Correlation

In this chapter we cover...

Explanatory and response variables

Displaying relationships: scatterplots

Interpreting scatterplots

Adding categorical variables to scatterplots

Measuring linear association: correlation

Facts about correlation

A medical study finds that short women are more likely to have heart attacks than women of average height, while tall women have the fewest heart attacks. An insurance group reports that heavier cars have fewer deaths per 10,000 vehicles registered than do lighter cars. These and many other statistical studies look at the relationship between two variables. Statistical relationships are overall tendencies, not ironclad rules. They allow individual exceptions. Although smokers on the average die younger than nonsmokers, some people live to 90 while smoking three packs a day.

To understand a statistical relationship between two variables, we measure both variables on the same individuals. Often, we must examine other variables as well. To conclude that shorter women have higher risk from heart attacks, for example, the researchers had to eliminate the effect of other variables such as weight and exercise habits. In this chapter we begin our study of relationships between variables. One of our main themes is that the relationship between two variables can be strongly influenced by other variables that are lurking in the background.

Explanatory and response variables

We think that car weight helps explain accident deaths and that smoking influences life expectancy. In each of these relationships, the two variables play different roles: one explains or influences the other.


RESPONSE VARIABLE, EXPLANATORY VARIABLE

A response variable measures an outcome of a study. An explanatory variable may explain or influence changes in a response variable.

You will often find explanatory variables called independent variables, and response variables called dependent variables. The idea behind this language is that the response variable depends on the explanatory variable. Because “independent” and “dependent” have other meanings in statistics that are unrelated to the explanatory-response distinction, we prefer to avoid those words.

It is easiest to identify explanatory and response variables when we actually set values of one variable in order to see how it affects another variable.

EXAMPLE 4.1 Beer and blood alcohol

How does drinking beer affect the level of alcohol in our blood? The legal limit for driving in all states is 0.08%. Student volunteers at The Ohio State University drank different numbers of cans of beer. Thirty minutes later, a police officer measured their blood alcohol content. Number of beers consumed is the explanatory variable, and percent of alcohol in the blood is the response variable.

When we don’t set the values of either variable but just observe both variables, there may or may not be explanatory and response variables. Whether there are depends on how we plan to use the data.

EXAMPLE 4.2 College debts

A college student aid officer looks at the findings of the National Student Loan Survey. She notes data on the amount of debt of recent graduates, their current income, and how stressed they feel about college debt. She isn’t interested in predictions but is simply trying to understand the situation of recent college graduates. The distinction between explanatory and response variables does not apply.

A sociologist looks at the same data with an eye to using amount of debt and income, along with other variables, to explain the stress caused by college debt. Now amount of debt and income are explanatory variables and stress level is the response variable.

In many studies, the goal is to show that changes in one or more explanatory variables actually cause changes in a response variable. Other explanatory-response relationships do not involve direct causation. The SAT scores of high school students help predict the students’ future college grades, but high SAT scores certainly don’t cause high college grades.

After you plot your data, think!

The statistician Abraham Wald (1902–1950) worked on war problems during World War II. Wald invented some statistical methods that were military secrets until the war ended. Here is one of his simpler ideas. Asked where extra armor should be added to airplanes, Wald studied the location of enemy bullet holes in planes returning from combat. He plotted the locations on an outline of the plane. As data accumulated, most of the outline filled up. Put the armor in the few spots with no bullet holes, said Wald. That’s where bullets hit the planes that didn’t make it back.

Most statistical studies examine data on more than one variable. Fortunately, statistical analysis of several-variable data builds on the tools we used to examine individual variables. The principles that guide our work also remain the same:

• Plot your data. Look for overall patterns and deviations from those patterns.

• Based on what your plot shows, choose numerical summaries for some aspects of the data.


APPLY YOUR KNOWLEDGE

4.1 Explanatory and response variables? You have data on a large group of college students. Here are four pairs of variables measured on these students. For each pair, is it more reasonable to simply explore the relationship between the two variables or to view one of the variables as an explanatory variable and the other as a response variable? In the latter case, which is the explanatory variable and which is the response variable?

(a) Amount of time spent studying for a statistics exam and grade on the exam.

(b) Weight in kilograms and height in centimeters.

(c) Hours per week of extracurricular activities and grade point average.

(d) Score on the SAT math exam and score on the SAT verbal exam.

4.2 Coral reefs. How sensitive to changes in water temperature are coral reefs? To find out, measure the growth of corals in aquariums where the water temperature is controlled at different levels. Growth is measured by weighing the coral before and after the experiment. What are the explanatory and response variables? Are they categorical or quantitative?

4.3 Beer and blood alcohol. Example 4.1 describes a study in which college students drank different amounts of beer. The response variable was their blood alcohol content (BAC). BAC for the same amount of beer might depend on other facts about the students. Name two other variables that could influence BAC.

Displaying relationships: scatterplots

The most useful graph for displaying the relationship between two quantitative variables is a scatterplot.

EXAMPLE 4.3 State SAT scores

Some people use average SAT scores to rank state school systems. This is not proper, because state average scores depend on more than just school quality. Following our four-step process (page 53), let’s look at one influence on state SAT scores.

STATE: The percent of high school students who take the SAT varies from state to state. Does this fact help explain differences among the states in average SAT score?

FORMULATE: Examine the relationship between percent taking and state mean score. Choose the explanatory and response variables (if any). Make a scatterplot to display the relationship between the variables. Interpret the plot to understand the relationship.

SOLVE (first steps): We suspect that “percent taking” will help explain “mean score.” So “percent taking” is the explanatory variable and “mean score” is the response variable. We want to see how mean score changes when percent taking changes, so we put percent taking (the explanatory variable) on the horizontal axis. Figure 4.1 is the scatterplot. Each point represents a single state. In Colorado, for example, 27% took the SAT, and their mean SAT score was 1107. Find 27 on the x (horizontal) axis and 1107 on the y (vertical) axis. Colorado appears as the point (27, 1107) above 27 and to the right of 1107.


FIGURE 4.1 Scatterplot of the mean SAT score in each state against the percent of that state’s high school graduates who take the SAT. The dotted lines intersect at the point (27, 1107), the data for Colorado. [Horizontal axis: percent of graduates taking the SAT, 0 to 100. Vertical axis: state mean SAT score, 900 to 1300. Annotation on the plot: In Colorado, 27% took the SAT and the mean score was 1107.]

SCATTERPLOT

A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each individual in the data appears as the point in the plot fixed by the values of both variables for that individual.

Always plot the explanatory variable, if there is one, on the horizontal axis (the x axis) of a scatterplot. As a reminder, we usually call the explanatory variable x and the response variable y. If there is no explanatory-response distinction, either variable can go on the horizontal axis.


APPLY YOUR KNOWLEDGE

4.4 Bird colonies. One of nature’s patterns connects the percent of adult birds in a colony that return from the previous year and the number of new adults that join the colony. Following are data for 13 colonies of sparrowhawks:1


Percent return 74 66 81 52 73 62 52 45 62 46 60 46 38

New adults 5 6 8 11 12 15 16 17 18 18 19 20 20

Plot the count of new adults (response) against the percent of returning birds (explanatory).
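One way to rough out the plot this exercise asks for, using only the Python standard library. The helper name `ascii_scatter` and the grid size are illustrative assumptions, not part of the text, and a plotting library would normally draw a proper graph. The explanatory variable (percent return) goes on the horizontal axis, as the scatterplot box recommends.

```python
# Sparrowhawk data from Exercise 4.4.
percent_return = [74, 66, 81, 52, 73, 62, 52, 45, 62, 46, 60, 46, 38]  # explanatory (x)
new_adults = [5, 6, 8, 11, 12, 15, 16, 17, 18, 18, 19, 20, 20]          # response (y)


def ascii_scatter(xs, ys, width=40, height=12):
    """Return a rough text scatterplot: xs across, ys up (hypothetical helper)."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column and row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 of the grid is the top of the plot
    return "\n".join("".join(line).rstrip() for line in grid)


plot = ascii_scatter(percent_return, new_adults)
print(plot)
print("x: percent return (38 to 81), y: new adults (5 to 20)")
```

Each `*` is one colony. The grid is coarse, so nearby colonies can share a cell; the sketch is only meant to show which variable goes on which axis.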

Interpreting scatterplots

To interpret a scatterplot, apply the strategies of data analysis learned in Chapters 1 and 2.

EXAMINING A SCATTERPLOT

In any graph of data, look for the overall pattern and for striking deviations from that pattern.

You can describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship.

An important kind of deviation is an outlier, an individual value that falls outside the overall pattern of the relationship.

EXAMPLE 4.4 Understanding state SAT scores

SOLVE (interpret the plot): Figure 4.1 shows a clear direction: the overall pattern moves from upper left to lower right. That is, states in which a higher percent of high school graduates take the SAT tend to have lower mean SAT score. We call this a negative association between the two variables.

The form of the relationship is roughly a straight line with a slight curve to the right as it moves down. What is more, most states fall into two distinct clusters. In the cluster at the right of the plot, 49% or more of high school graduates take the SAT and the mean scores are low. The states in the cluster at the left have higher SAT scores and no more than 32% of graduates take the test. Only Nevada, where 40% take the SAT, lies between these clusters.

The strength of a relationship in a scatterplot is determined by how closely the points follow a clear form. The overall relationship in Figure 4.1 is moderately strong: states with similar percents taking the SAT tend to have roughly similar mean SAT scores.

What explains the clusters? There are two widely used college entrance exams, the SAT and the ACT. Each state favors one or the other. The left cluster in Figure 4.1 contains the ACT states, and the SAT states make up the right cluster. In ACT states, most students who take the SAT are applying to a selective college that requires SAT scores. This select group of students has a higher mean score than the much larger group of students who take the SAT in SAT states.

CONCLUDE: Percent taking explains much of the variation among states in average SAT score. States in which a higher percent of students take the SAT tend to have lower mean scores. SAT states as a group have lower mean SAT scores than ACT states. Average SAT score says almost nothing about quality of education in a state.


POSITIVE ASSOCIATION, NEGATIVE ASSOCIATION

Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together.

Two variables are negatively associated when above-average values of one tend to accompany below-average values of the other, and vice versa.

Here is an example of a relationship with a clearer form.

EXAMPLE 4.5 Counting carnivores

Ecologists look at data to learn about nature’s patterns. One pattern they have found relates the size of a carnivore and how many of those carnivores there are in an area. Measure size by body mass in kilograms. Measure “how many” by counting carnivores per 10,000 kilograms of their prey in the area. Table 4.1 gives data for 25 carnivore species.2

To see the pattern, plot carnivore abundance (response) against body mass (explanatory). Biologists often find that patterns involving sizes and counts are simpler when we plot the logarithms of the data. Figure 4.2 does that: you can see that 1, 10, 100, and 1000 are equally spaced on the vertical scale.

This scatterplot shows a negative association. That is, bigger carnivores are less abundant. The form of the association is linear. That is, the overall pattern follows a straight line from upper left to lower right. The association is quite strong because the points don’t deviate a great deal from the line. It is striking that animals from many different parts of the world should fit so simple a pattern.

TABLE 4.1 Size and abundance of carnivores

Carnivore species        Body mass (kg)   Abundance
Least weasel                   0.14         1656.49
Ermine                         0.16          406.66
Small Indian mongoose          0.55          514.84
Pine marten                    1.3            31.84
Kit fox                        2.02           15.96
Channel Island fox             2.16          145.94
Arctic fox                     3.19           21.63
Red fox                        4.6            32.21
Bobcat                        10.0             9.75
Canadian lynx                 11.2             4.79
European badger               13.0             7.35
Coyote                        13.0            11.65
Ethiopian wolf                14.5             2.70
Eurasian lynx                 20.0             0.46
Wild dog                      25.0             1.61
Dhole                         25.0             0.81
Snow leopard                  40.0             1.89
Wolf                          46.0             0.62
Leopard                       46.5             6.17
Cheetah                       50.0             2.29
Puma                          51.9             0.94
Spotted hyena                 58.6             0.68
Lion                         142.0             3.40
Tiger                        181.0             0.33
Polar bear                   310.0             0.60
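The remark in Example 4.5 that 1, 10, 100, and 1000 are equally spaced on a logarithmic scale can be checked directly. The short sketch below takes base-10 logarithms of those tick values and of a few rows copied from Table 4.1; the variable names are illustrative, not from the text.

```python
import math

# A few (body mass in kg, abundance) pairs copied from Table 4.1.
carnivores = {
    "Least weasel": (0.14, 1656.49),
    "Red fox": (4.6, 32.21),
    "Lion": (142.0, 3.40),
    "Polar bear": (310.0, 0.60),
}

# Equal steps on a log scale correspond to equal ratios, so the ticks
# 1, 10, 100, 1000 sit one unit apart after taking base-10 logarithms.
tick_positions = [math.log10(t) for t in (1, 10, 100, 1000)]
print(tick_positions)

for name, (mass, abundance) in carnivores.items():
    print(f"{name:12s} log10(mass) = {math.log10(mass):6.2f}   "
          f"log10(abundance) = {math.log10(abundance):6.2f}")
```

On the log scale the least weasel and the polar bear, whose abundances differ by a factor of several thousand, end up only a few units apart, which is why the pattern is easier to see.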


FIGURE 4.2 Scatterplot of the abundance of 25 species of carnivores against their body mass. Larger carnivores are less abundant. (Logarithmic scales are used for both variables.) [Horizontal axis: carnivore body mass (kilograms). Vertical axis: abundance per 10,000 kg of prey.]

Of course, not all relationships have a simple form and a clear direction that we can describe as positive association or negative association. Exercise 4.6 gives an example that does not have a single direction.

APPLY YOUR KNOWLEDGE

4.5 Bird colonies. Describe the form, direction, and strength of the relationship between number of new sparrowhawks in a colony and percent of returning adults, as displayed in your plot from Exercise 4.4.

For short-lived birds, the association between these variables is positive: changes in weather and food supply drive the populations of new and returning birds up or down together. For long-lived territorial birds, on the other hand, the association is negative because returning birds claim their territories in the colony and don’t leave room for new recruits. Which type of species is the sparrowhawk?

4.6 Does fast driving waste fuel? How does the fuel consumption of a car change as its speed increases? Here are data for a British Ford Escort. Speed is measured in kilometers per hour, and fuel consumption is measured in liters of gasoline used per 100 kilometers traveled.3

Speed   10     20     30     40    50    60    70    80
Fuel    21.00  13.00  10.00  8.00  7.00  5.90  6.30  6.95

Speed   90    100   110   120   130    140    150
Fuel    7.57  8.27  9.03  9.87  10.79  11.77  12.83


(a) Make a scatterplot. (Which is the explanatory variable?)

(b) Describe the form of the relationship. It is not linear. Explain why the form of the relationship makes sense.

(c) It does not make sense to describe the variables as either positively associated or negatively associated. Why?

(d) Is the relationship reasonably strong or quite weak? Explain your answer.

Adding categorical variables to scatterplots

The Census Bureau groups the states into four broad regions, named Midwest, Northeast, South, and West. We might ask about regional patterns in SAT exam scores. Figure 4.3 repeats part of Figure 4.1, with an important difference. We have plotted only the Northeast and Midwest groups of states, using the plot symbol “+” for the northeastern states and the symbol “•” for the midwestern states.

The regional comparison is striking. The 9 northeastern states are all SAT states; in fact, at least 66% of high school graduates in each of these states take the SAT. The 12 midwestern states are mostly ACT states. In 10 of these states, the percent taking the SAT is between 5% and 11%. One midwestern state is clearly an outlier within the region. Indiana is an SAT state (64% take the SAT) that falls close to the northeastern cluster. Ohio, where 28% take the SAT, also lies outside the midwestern cluster.

FIGURE 4.3 Mean SAT score and percent of high school graduates who take the test for only the northeastern (+) and midwestern (•) states. [Horizontal axis: percent of graduates taking SAT, 0 to 100. Vertical axis: state mean SAT score, 900 to 1300. Indiana (IN) and Ohio (OH) are labeled on the plot.]


Dividing the states into regions introduces a third variable into the scatterplot. “Region” is a categorical variable that has four values, although we plotted data from only two of the four regions. The two regions are identified by the two different plotting symbols.

CATEGORICAL VARIABLES IN SCATTERPLOTS

To add a categorical variable to a scatterplot, use a different plot color or symbol for each category.

APPLY YOUR KNOWLEDGE

4.7 How fast do icicles grow? Japanese researchers measured the growth of icicles in a cold chamber under various conditions of temperature, wind, and water flow.4 Table 4.2 contains data produced under two sets of conditions. In both cases, there was no wind and the temperature was set at −11°C. Water flowed over the icicle at a higher rate (29.6 milligrams per second) in Run 8905 and at a slower rate (11.9 mg/s) in Run 8903.

(a) Make a scatterplot of the length of the icicle in centimeters versus time in minutes, using separate symbols for the two runs.

(b) What does your plot show about the pattern of growth of icicles? What does it show about the effect of changing the rate of water flow on icicle growth?

TABLE 4.2 Growth of icicles over time

           Run 8903                          Run 8905
Time   Length   Time   Length      Time   Length   Time   Length
(min)  (cm)     (min)  (cm)        (min)  (cm)     (min)  (cm)
  10    0.6      130   18.1          10    0.3      130   10.4
  20    1.8      140   19.9          20    0.6      140   11.0
  30    2.9      150   21.0          30    1.0      150   11.9
  40    4.0      160   23.4          40    1.3      160   12.7
  50    5.0      170   24.7          50    3.2      170   13.9
  60    6.1      180   27.8          60    4.0      180   14.6
  70    7.9                          70    5.3      190   15.8
  80   10.1                          80    6.0      200   16.2
  90   10.9                          90    6.9      210   17.9
 100   12.7                         100    7.8      220   18.8
 110   14.4                         110    8.3      230   19.9
 120   16.6                         120    9.6      240   21.1


Measuring linear association: correlation

A scatterplot displays the direction, form, and strength of the relationship between two quantitative variables. Linear (straight-line) relations are particularly important because a straight line is a simple pattern that is quite common. A linear relation is strong if the points lie close to a straight line, and weak if they are widely scattered about a line. Our eyes are not good judges of how strong a linear relationship is. The two scatterplots in Figure 4.4 depict exactly the same data, but the lower plot is drawn smaller in a large field. The lower plot seems to show a stronger linear relationship. Our eyes can be fooled by changing the plotting scales or the amount of space around the cloud of points in a scatterplot.5 We need to follow our strategy for data analysis by using a numerical measure to supplement the graph. Correlation is the measure we use.

FIGURE 4.4 Two scatterplots of the same data. The straight-line pattern in the lower plot appears stronger because of the surrounding space. [Upper plot: x from 40 to 140, y from 60 to 160. Lower plot: x and y both from 0 to 250.]


CORRELATION

The correlation measures the direction and strength of the linear relationship between two quantitative variables. Correlation is usually written as r.

Suppose that we have data on variables x and y for n individuals. The values for the first individual are $x_1$ and $y_1$, the values for the second individual are $x_2$ and $y_2$, and so on. The means and standard deviations of the two variables are $\bar{x}$ and $s_x$ for the x-values, and $\bar{y}$ and $s_y$ for the y-values. The correlation r between x and y is

$$r = \frac{1}{n-1}\left[\left(\frac{x_1-\bar{x}}{s_x}\right)\left(\frac{y_1-\bar{y}}{s_y}\right) + \left(\frac{x_2-\bar{x}}{s_x}\right)\left(\frac{y_2-\bar{y}}{s_y}\right) + \cdots + \left(\frac{x_n-\bar{x}}{s_x}\right)\left(\frac{y_n-\bar{y}}{s_y}\right)\right]$$

or, more compactly,

$$r = \frac{1}{n-1}\sum\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$

The formula for the correlation r is a bit complex. It helps us see what correlation is, but in practice you should use software or a calculator that finds r from keyed-in values of two variables x and y. Exercise 4.8 asks you to calculate a correlation step-by-step from the definition to solidify its meaning.

Death from superstition?

Is there a relationship between superstitious beliefs and bad things happening? Apparently there is. Chinese and Japanese people think that the number 4 is unlucky because when pronounced it sounds like the word for “death.” Sociologists looked at 15 years’ worth of death certificates for Chinese and Japanese Americans and for white Americans. Deaths from heart disease were notably higher on the fourth day of the month among Chinese and Japanese but not among whites. The sociologists think the explanation is increased stress on “unlucky days.”

The formula for r begins by standardizing the observations. Suppose, for example, that x is height in centimeters and y is weight in kilograms and that we have height and weight measurements for n people. Then $\bar{x}$ and $s_x$ are the mean and standard deviation of the n heights, both in centimeters. The value

$$\frac{x_i-\bar{x}}{s_x}$$

is the standardized height of the ith person, familiar from Chapter 3. The standardized height says how many standard deviations above or below the mean a person’s height lies. Standardized values have no units; in this example, they are no longer measured in centimeters. Standardize the weights also. The correlation r is an average of the products of the standardized height and the standardized weight for the n people.
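The definition above translates almost line for line into code. The following is a minimal sketch in Python, not a substitute for statistical software; the function names are illustrative, and the three-point data set is made up (it is not from the text) so that the answer can be checked by hand.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std_dev(values):
    """Sample standard deviation, dividing by n - 1."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def standardize(values):
    """Standardized values (value - mean) / standard deviation; these have no units."""
    m, s = mean(values), std_dev(values)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """r = sum of the products of standardized values, divided by n - 1."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


# Made-up data (not from the text): y is exactly 2x, a perfect linear relationship.
xs = [1, 2, 3]
ys = [2, 4, 6]
r = correlation(xs, ys)
print(standardize(xs))  # [-1.0, 0.0, 1.0]
print(r)                # 1.0
```

Because the made-up points lie exactly on a straight line with positive slope, both variables have the same standardized values and r comes out as 1.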

APPLY YOUR KNOWLEDGE

4.8 Coffee and deforestation. Coffee is a leading export from several developing countries. When coffee prices are high, farmers often clear forest to plant more coffee trees. Here are five years’ data on prices paid to coffee growers in Indonesia and the percent of forest area lost in a national park that lies in a coffee-producing region:6

Price (cents per pound) 29 40 54 55 72

Forest lost (percent) 0.49 1.59 1.69 1.82 3.10


(a) Make a scatterplot. Which is the explanatory variable? What kind of pattern does your plot show?

(b) Find the correlation r step-by-step. First find the mean and standard deviation of each variable. Then find the five standardized values for each variable. Finally, use the formula for r. Explain how your value for r matches your graph in (a).

(c) Enter these data into your calculator or software and use the correlation function to find r. Check that you get the same result as in (b), up to roundoff error.
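For readers who use Python as their software, here is a sketch of how the steps in parts (b) and (c) might be checked with the standard `statistics` module. The variable names are illustrative; the data are those given in the exercise.

```python
import statistics

price = [29, 40, 54, 55, 72]                  # cents per pound (explanatory)
forest_lost = [0.49, 1.59, 1.69, 1.82, 3.10]  # percent of forest lost (response)

# Step 1: mean and (n - 1) standard deviation of each variable.
x_bar, s_x = statistics.mean(price), statistics.stdev(price)
y_bar, s_y = statistics.mean(forest_lost), statistics.stdev(forest_lost)

# Step 2: the five standardized values for each variable.
z_x = [(x - x_bar) / s_x for x in price]
z_y = [(y - y_bar) / s_y for y in forest_lost]

# Step 3: the formula for r.
r = sum(a * b for a, b in zip(z_x, z_y)) / (len(price) - 1)
print(round(r, 3))

# Python 3.10 and later also provide statistics.correlation(price, forest_lost),
# which should agree with r up to roundoff.
```

Compare the printed value with your hand calculation; small differences in the last digit are roundoff error.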

Facts about correlation

The formula for correlation helps us see that r is positive when there is a positive association between the variables. Height and weight, for example, have a positive association. People who are above average in height tend to also be above average in weight. Both the standardized height and the standardized weight are positive. People who are below average in height tend to also have below-average weight. Then both standardized height and standardized weight are negative. In both cases, the products in the formula for r are mostly positive and so r is positive. In the same way, we can see that r is negative when the association between x and y is negative. More detailed study of the formula gives more detailed properties of r. Here is what you need to know in order to interpret correlation.

1. Correlation makes no distinction between explanatory and response variables. It makes no difference which variable you call x and which you call y in calculating the correlation.

2. Because r uses the standardized values of the observations, r does not change when we change the units of measurement of x, y, or both. Measuring height in inches rather than centimeters and weight in pounds rather than kilograms does not change the correlation between height and weight. The correlation r itself has no unit of measurement; it is just a number.

3. Positive r indicates positive association between the variables, and negative r indicates negative association.

4. The correlation r is always a number between −1 and 1. Values of r near 0 indicate a very weak linear relationship. The strength of the linear relationship increases as r moves away from 0 toward either −1 or 1. Values of r close to −1 or 1 indicate that the points in a scatterplot lie close to a straight line. The extreme values r = −1 and r = 1 occur only in the case of a perfect linear relationship, when the points lie exactly along a straight line.
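Facts 1, 2, and 4 can be checked numerically. The sketch below uses made-up heights and weights for five people (these numbers are hypothetical, not data from the text) and approximate unit conversions; the `correlation` helper is an illustrative implementation of the definition.

```python
import math
import statistics


def correlation(xs, ys):
    """Correlation r computed from standardized values, as in the definition."""
    x_bar, s_x = statistics.mean(xs), statistics.stdev(xs)
    y_bar, s_y = statistics.mean(ys), statistics.stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


# Hypothetical heights (cm) and weights (kg) for five people.
height_cm = [160.0, 165.0, 172.0, 180.0, 188.0]
weight_kg = [55.0, 68.0, 63.0, 80.0, 84.0]

# Fact 2: a change of units rescales the data but not the standardized values.
height_in = [h / 2.54 for h in height_cm]
weight_lb = [w * 2.2046 for w in weight_kg]

r_metric = correlation(height_cm, weight_kg)
r_imperial = correlation(height_in, weight_lb)
r_swapped = correlation(weight_kg, height_cm)  # Fact 1: roles of x and y swapped

print(math.isclose(r_metric, r_imperial))  # same r in either set of units
print(math.isclose(r_metric, r_swapped))   # same r whichever variable is called x
print(-1 <= r_metric <= 1)                 # Fact 4
```

The comparisons use `math.isclose` rather than `==` because the two calculations can differ in the last few binary digits.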


EXAMPLE 4.6 From scatterplot to correlation

The scatterplots in Figure 4.5 illustrate how values of r closer to 1 or −1 correspond to stronger linear relationships. To make the meaning of r clearer, the standard deviations of both variables in these plots are equal, and the horizontal and vertical scales are the same. In general, it is not so easy to guess the value of r from the appearance of a scatterplot. Remember that changing the plotting scales in a scatterplot may mislead our eyes, but it does not change the correlation.

The real data we have examined also illustrate how correlation measures the strength and direction of linear relationships. Figure 4.2 shows a strong negative linear relationship between the logarithms of body mass and abundance for carnivore species. The correlation is r = −0.912. Figure 4.1 shows a weaker but still quite strong negative association between percent of students taking the SAT and the mean SAT score in a state. The correlation is r = −0.876.

[Figure 4.5: six scatterplot panels, labeled Correlation r = 0, r = 0.5, r = 0.9, r = −0.3, r = −0.7, and r = −0.99.]

FIGURE 4.5 How correlation measures the strength of a linear relationship. Patterns closer to a straight line have correlations closer to 1 or −1.

Describing the relationship between two variables is a more complex task than describing the distribution of one variable. Here are some more facts about correlation, cautions to keep in mind when you use r.

1. Correlation requires that both variables be quantitative, so that it makes sense to do the arithmetic indicated by the formula for r. We cannot calculate a correlation between the incomes of a group of people and what city they live in, because city is a categorical variable.

2. Correlation measures the strength of only the linear relationship between two variables. Correlation does not describe curved relationships between variables, no matter how strong they are. Exercise 4.11 illustrates this important fact.

3. Like the mean and standard deviation, the correlation is not resistant: r is strongly affected by a few outlying observations. Use r with caution when outliers appear in the scatterplot. To explore how extreme observations can influence r, use the Correlation and Regression applet.

4. Correlation is not a complete summary of two-variable data, even when the relationship between the variables is linear. You should give the means and standard deviations of both x and y along with the correlation.
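The lack of resistance in caution 3 is easy to see with a small computation. This is a minimal sketch under stated assumptions: the `correlation` helper uses the usual standardized-values formula for r, and the data points, including the single outlier, are hypothetical values chosen only to make the effect visible.

```python
from statistics import mean, stdev

def correlation(x, y):
    """r = sum of products of standardized x and y values, divided by n - 1."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    return sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / (len(x) - 1)

# Hypothetical data: ten points lying close to a straight line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
r_without = correlation(x, y)

# Add one hypothetical outlier far from the pattern: large x, small y.
r_with = correlation(x + [20], y + [2.0])

print("r without outlier:", round(r_without, 3))
print("r with outlier:   ", round(r_with, 3))
```

A single added observation moves r from nearly 1 to a much weaker value, which is why a scatterplot should accompany any reported correlation.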

Because the formula for correlation uses the means and standard deviations, these measures are the proper choice to accompany a correlation. Here is an example in which understanding requires both means and correlation.

EXAMPLE 4.7 Scoring figure skaters

Until a scandal at the 2002 Olympics brought change, figure skating was scored by judges on a scale from 0.0 to 6.0. The scores were often controversial. We have the scores awarded by two judges, Pierre and Elena, to many skaters. How well do they agree? We calculate that the correlation between their scores is r = 0.9. But the mean of Pierre’s scores is 0.8 point lower than Elena’s mean.

These facts do not contradict each other. They are simply different kinds of information. The mean scores show that Pierre awards lower scores than Elena. But because Pierre gives every skater a score about 0.8 point lower than Elena, the correlation remains high. Adding the same number to all values of either x or y does not change the correlation. If both judges score the same skaters, the competition is scored consistently because Pierre and Elena agree on which performances are better than others. The high r shows their agreement. But if Pierre scores some skaters and Elena others, we must add 0.8 points to Pierre’s scores to arrive at a fair comparison.
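The point that a shift in the mean leaves r untouched can be illustrated with a short sketch. The scores below are hypothetical, made up so that Pierre averages 0.8 point below Elena; they are not the judges' actual scores, and the `correlation` helper assumes the usual standardized-values formula for r.

```python
from statistics import mean, stdev

def correlation(x, y):
    """r = sum of products of standardized x and y values, divided by n - 1."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    return sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / (len(x) - 1)

# Hypothetical scores for the same seven skaters (not data from the text).
elena = [5.8, 5.5, 5.9, 5.2, 5.6, 5.0, 5.7]
pierre = [5.0, 4.6, 5.2, 4.4, 4.9, 4.2, 4.8]

mean_gap = mean(elena) - mean(pierre)   # how much lower Pierre scores on average
r = correlation(pierre, elena)

# Adding the same number to every one of Pierre's scores closes the gap
# in the means but does not change the correlation.
pierre_adjusted = [p + 0.8 for p in pierre]
r_adjusted = correlation(pierre_adjusted, elena)

print("mean gap:", round(mean_gap, 2))
print("r before adjustment:", round(r, 3))
print("r after adjustment: ", round(r_adjusted, 3))
```

The correlation reports agreement about which performances are better; the means report how generous each judge is. Neither number can stand in for the other.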

Of course, even giving means, standard deviations, and the correlation for state SAT scores and percent taking will not point out the clusters in Figure 4.1. Numerical summaries complement plots of data, but they don’t replace them.

APPLY YOUR KNOWLEDGE

4.9 Changing the units. Coffee is currently priced in dollars. If it were priced in euros, and the dollar prices in Exercise 4.8 were translated into the equivalent prices in euros, would the correlation between coffee price and percent of forest loss change? Explain your answer.

4.10 Changing the correlation.

(a) Use your calculator or software to find the correlation between the percent of returning birds and the number of new birds from the data in Exercise 4.4.

(b) Make a scatterplot of the data with two new points added. Point A: 10% return, 25 new birds. Point B: 40% return, 5 new birds. Find two new correlations: one for the original data plus Point A, and another for the original data plus Point B.

(c) In terms of what correlation measures, explain why adding Point A makes the correlation stronger (closer to −1) and adding Point B makes the correlation weaker (closer to 0).

4.11 Strong association but no correlation. The gas mileage of an automobile first increases and then decreases as the speed increases. Suppose that this relationship is very regular, as shown by the following data on speed (miles per hour) and mileage (miles per gallon):

Speed 20 30 40 50 60

MPG 24 28 30 28 24

Make a scatterplot of mileage versus speed. Show that the correlation between speed and mileage is r = 0. Explain why the correlation is 0 even though there is a strong relationship between speed and mileage.
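One way to check the arithmetic in Exercise 4.11 with software is sketched below. It uses the speed and mileage values from the table above; the `correlation` helper is a hypothetical name and assumes the usual standardized-values formula for r. The explanation of why r is 0 is still left to the reader.

```python
from statistics import mean, stdev

def correlation(x, y):
    """r = sum of products of standardized x and y values, divided by n - 1."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    return sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / (len(x) - 1)

# Data from Exercise 4.11.
speed = [20, 30, 40, 50, 60]   # miles per hour
mpg = [24, 28, 30, 28, 24]     # miles per gallon

r = correlation(speed, mpg)

# Floating-point rounding may leave a tiny nonzero remainder,
# so compare against a small tolerance rather than exact zero.
print("r is zero up to rounding:", abs(r) < 1e-9)
```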

CHAPTER 4 SUMMARY

To study relationships between variables, we must measure the variables on the same group of individuals.

If we think that a variable x may explain or even cause changes in another variable y, we call x an explanatory variable and y a response variable.

A scatterplot displays the relationship between two quantitative variables measured on the same individuals. Mark values of one variable on the horizontal axis (x axis) and values of the other variable on the vertical axis (y axis). Plot each individual’s data as a point on the graph. Always plot the explanatory variable, if there is one, on the x axis of a scatterplot.

Plot points with different colors or symbols to see the effect of a categorical variable in a scatterplot.

In examining a scatterplot, look for an overall pattern showing the direction, form, and strength of the relationship, and then for outliers or other deviations from this pattern.

Direction: If the relationship has a clear direction, we speak of either positive association (high values of the two variables tend to occur together) or negative association (high values of one variable tend to occur with low values of the other variable).

Form: Linear relationships, where the points show a straight-line pattern, are an important form of relationship between two variables. Curved relationships and clusters are other forms to watch for.

Strength: The strength of a relationship is determined by how close the points in the scatterplot lie to a simple form such as a line.

The correlation r measures the strength and direction of the linear association between two quantitative variables x and y. Although you can calculate a correlation for any scatterplot, r measures only straight-line relationships.

Correlation indicates the direction of a linear relationship by its sign: r > 0 for a positive association and r < 0 for a negative association. Correlation always satisfies −1 ≤ r ≤ 1 and indicates the strength of a relationship by how close it is to −1 or 1. Perfect correlation, r = ±1, occurs only when the points on a scatterplot lie exactly on a straight line.

Correlation ignores the distinction between explanatory and response variables. The value of r is not affected by changes in the unit of measurement of either variable. Correlation is not resistant, so outliers can greatly change the value of r.

CHECK YOUR SKILLS

4.12 You have data for many families on the parents’ income and the years of education their eldest child completes. When you make a scatterplot, the explanatory variable on the x axis

(a) is parents’ income.

(b) is years of education.

(c) can be either income or education.

4.13 You have data for many families on the parents’ income and the years of education their eldest child completes. You expect to see

(a) a positive association.

(b) very little association.

(c) a negative association.

4.14 Figure 4.6 is a scatterplot of reading test scores against IQ test scores for 14 fifth-grade children. There is one low outlier in the plot. The IQ and reading scores for this child are

(a) IQ = 10, reading = 124.

(b) IQ = 124, reading = 72.

(c) IQ = 124, reading = 10.

4.15 Removing the outlier in Figure 4.6 would

(a) increase the correlation between IQ and reading score.

(b) decrease the correlation between IQ and reading score.

(c) have little effect on the correlation.

4.16 If we leave out the low outlier, the correlation for the remaining 14 points in Figure 4.6 is closest to

(a) 0.5. (b) −0.5. (c) 0.95.

4.17 What are all the values that a correlation r can possibly take?

(a) r ≥ 0 (b) 0 ≤ r ≤ 1 (c) −1 ≤ r ≤ 1

[Figure 4.6: scatterplot. Horizontal axis: Child's IQ test score (90 to 150). Vertical axis: Child's reading test score (10 to 120).]

FIGURE 4.6 Scatterplot of reading test score against IQ test score for fifth-grade children, for Exercises 4.14 to 4.16.

4.18 The points on a scatterplot lie very close to the line whose equation is y = 4 − 3x. The correlation between x and y is close to

(a) −3. (b) −1. (c) 1.

4.19 If women always married men who were 2 years older than themselves, the correlation between the ages of husband and wife would be

(a) 1.

(b) 0.5.

(c) Can’t tell without seeing the data.

4.20 For a biology project, you measure the weight in grams and the tail length in millimeters of a group of mice. The correlation is r = 0.7. If you had measured tail length in centimeters instead of millimeters, what would be the correlation? (There are 10 millimeters in a centimeter.)

(a) 0.7/10 = 0.07 (b) 0.7 (c) (0.7)(10) = 7

4.21 Because elderly people may have difficulty standing to have their heights measured, a study looked at predicting overall height from height to the knee. Here are data (in centimeters) for five elderly men:

Knee height x 57.7 47.4 43.5 44.8 55.2

Height y 192.1 153.3 146.4 162.7 169.1

Use your calculator or software: the correlation between knee height and overall height is about

(a) r = 0.88. (b) r = 0.09. (c) r = 0.77.

CHAPTER 4 EXERCISES

4.22 Stocks versus T-bills. What is the relationship between returns from buying Treasury bills and returns from buying common stocks? To buy a Treasury bill is to make a short-term loan to the U.S. government. This is much less risky than buying stock in a company, so (on the average) the returns on Treasury bills are lower than the return on stocks. Figure 4.7 plots the annual returns on stocks for the years 1950 to 2003 against the returns on Treasury bills for the same years.

(a) The best year for stocks during this period was 1954. The worst year was 1974. About what were the returns on stocks in those two years?

(b) Treasury bills are a measure of the general level of interest rates. The years around 1980 saw very high interest rates. Treasury bill returns peaked in 1981. About what was the percent return that year?

(c) Some people say that high Treasury bill returns tend to go with low returns on stocks. Does such a pattern appear clearly in Figure 4.7? Does the plot have any clear pattern?

[Figure 4.7: scatterplot. Horizontal axis: Percent return on Treasury bills (0 to 14). Vertical axis: Percent return on common stocks (−40 to 60).]

FIGURE 4.7 Scatterplot of yearly return on common stocks against return on Treasury bills, for Exercise 4.22.

4.23 Can children estimate their own reading ability? To study this question, investigators asked 60 fifth-grade children to estimate their own reading ability, on a scale from 1 (low) to 5 (high). Figure 4.8 is a scatterplot of the children’s estimates (response) against their scores on a reading test (explanatory).7

[Figure 4.8: scatterplot. Horizontal axis: Child's score on a test of reading ability (0 to 100). Vertical axis: Child's self-estimate of reading ability (1 to 5).]

FIGURE 4.8 Scatterplot of children’s estimates of their reading ability (on a scale of 1 to 5) against their score on a reading test, for Exercise 4.23.

(a) What explains the “stair-step” pattern in the plot?

(b) Is there an overall positive association between reading score and self-estimate?

(c) There is one clear outlier. What is this child’s self-estimated reading level? Does this appear to over- or underestimate the level as measured by the test?

4.24 Data on dating. A student wonders if tall women tend to date taller men than do short women. She measures herself, her dormitory roommate, and the women in the adjoining rooms; then she measures the next man each woman dates. Here are the data (heights in inches):

Women (x) 66 64 66 65 70 65

Men (y) 72 68 70 68 71 65

(a) Make a scatterplot of these data. Based on the scatterplot, do you expect the correlation to be positive or negative? Near ±1 or not?

(b) Find the correlation r between the heights of the men and women. Do the data show that taller women tend to date taller men?

TABLE 4.3 World record times for the 10,000-meter run

Men                                         Women

Record   Time        Record   Time         Record   Time
year     (seconds)   year     (seconds)    year     (seconds)

1912     1880.8      1962     1698.2       1967     2286.4
1921     1840.2      1963     1695.6       1970     2130.5
1924     1835.4      1965     1659.3       1975     2100.4
1924     1823.2      1972     1658.4       1975     2041.4
1924     1806.2      1973     1650.8       1977     1995.1
1937     1805.6      1977     1650.5       1979     1972.5
1938     1802.0      1978     1642.4       1981     1950.8
1939     1792.6      1984     1633.8       1981     1937.2
1944     1775.4      1989     1628.2       1982     1895.3
1949     1768.2      1993     1627.9       1983     1895.0
1949     1767.2      1993     1618.4       1983     1887.6
1949     1761.2      1994     1612.2       1984     1873.8
1950     1742.6      1995     1603.5       1985     1859.4
1953     1741.6      1996     1598.1       1986     1813.7
1954     1734.2      1997     1591.3       1993     1771.8
1956     1722.8      1997     1587.8
1956     1710.4      1998     1582.7
1960     1698.8      2004     1580.3

4.25 World record running times. Table 4.3 shows the progress of world record times (in seconds) for the 10,000-meter run for both men and women.

(a) Make a scatterplot of world record time against year, using separate symbols for men and women. Describe the pattern for each sex. Then compare the progress of men and women.

(b) Find the correlation between record time and year separately for men and for women. What do the correlations say about the patterns?

(c) Women began running this long distance later than men, so we might expect their improvement to be more rapid. Moreover, it is often said that men have little advantage over women in distance running as opposed to sprints, where muscular strength plays a greater role. Do the data appear to support these claims?

4.26 Thinking about correlation. Exercise 4.24 presents data on the heights of women and of the men they date.

(a) How would r change if all the men were 6 inches shorter than the heights given in the table? Does the correlation tell us whether women tend to date men taller than themselves?

(b) If heights were measured in centimeters rather than inches, how would the correlation change? (There are 2.54 centimeters in an inch.)

(c) If every woman dated a man exactly 3 inches taller than herself, what would be the correlation between male and female heights?

4.27 Heating a home. The Sanchez household is about to install solar panels to reduce the cost of heating their house. In order to know how much the solar panels help, they record their consumption of natural gas before the panels are installed. Gas consumption is higher in cold weather, so the relationship between outside temperature and gas consumption is important. Here are data for 16 consecutive months:8

Month                 Nov.   Dec.   Jan.   Feb.   Mar.   Apr.   May    June
Degree-days per day   24     51     43     33     26     13     4      0
Gas used per day      6.3    10.9   8.9    7.5    5.3    4.0    1.7    1.2

Month                 July   Aug.   Sept.  Oct.   Nov.   Dec.   Jan.   Feb.
Degree-days per day   0      1      6      12     30     32     52     30
Gas used per day      1.2    1.2    2.1    3.1    6.4    7.2    11.0   6.9

Outside temperature is recorded in degree-days, a common measure of demand for heating. A day’s degree-days are the number of degrees its average temperature falls below 65°F. Gas used is recorded in hundreds of cubic feet. Make a plot and describe the pattern. Is correlation a helpful way to describe the pattern? Why or why not? Find the correlation if it is helpful.

4.28 How many corn plants are too many? How much corn per acre should a farmer plant to obtain the highest yield? Too few plants will give a low yield. On the other hand, if there are too many plants, they will compete with each other for moisture and nutrients, and yields will fall. To find the best planting rate, plant at different rates on several plots of ground and measure the harvest. (Be sure to treat all the plots the same except for the planting rate.) Here are data from such an experiment:9

Plants per acre   Yield (bushels per acre)

12,000            150.1   113.0   118.4   142.6
16,000            166.9   120.7   135.2   149.8
20,000            165.3   130.1   139.6   149.9
24,000            134.7   138.4   156.1
28,000            119.0   150.5

(a) Is yield or planting rate the explanatory variable?

(b) Make a scatterplot of yield and planting rate. Use a scale of yields from 100 to 200 bushels per acre so that the pattern will be clear.

(c) Describe the overall pattern of the relationship. Is it linear? Is there a positive or negative association, or neither? Is correlation r a helpful description of this relationship? Find the correlation if it is helpful.

(d) Find the mean yield for each of the five planting rates. Plot each mean yield against its planting rate on your scatterplot and connect these five points with lines. This combination of numerical description and graphing makes the relationship clearer. What planting rate would you recommend to a farmer whose conditions were similar to those in the experiment?

4.29 Do solar panels reduce gas usage? After the Sanchez household gathered the information recorded in Exercise 4.27, they added solar panels to their house. They then measured their natural-gas consumption for 23 more months. Here are the data:10

Degree-days   19    3     3     0     0     0     8     11    27    46    38    34
Gas used      3.2   2.0   1.6   1.0   0.7   0.7   1.6   3.1   5.1   7.7   7.0   6.1

Degree-days   16    9     2     1     0     2     3     18    32    34    40
Gas used      3.0   2.1   1.3   1.0   1.0   1.0   1.2   3.4   6.1   6.5   7.5

Add the new data to your scatterplot from Exercise 4.27, using a different color or symbol. What do the before-and-after data show about the effect of solar panels?

4.30 Hot mutual funds. Fidelity Investments, like other large mutual-funds companies, offers many “sector funds” that concentrate their investments in narrow segments of the stock market. These funds often rise or fall by much more than the market as a whole. We can group them by broader market sector to compare returns. Here are percent total returns for 23 Fidelity “Select Portfolios” funds for the year 2003, a year in which stocks rose sharply:11

Market sector        Fund returns (percent)

Consumer             23.9   14.1   41.8   43.9   31.1
Financial services   32.3   36.5   30.6   36.9   27.5
Natural resources    22.9   7.6    32.1   28.7   29.5   19.1
Technology           26.1   62.7   68.1   71.9   57.0   35.0   59.4

(a) Make a plot of total return against market sector (space the four market sectors equally on the horizontal axis). Compute the mean return for each sector, add the means to your plot, and connect the means with line segments.

(b) Based on the data, which of these market sectors were the best places to invest in 2003? Hindsight is wonderful.

(c) Does it make sense to speak of a positive or negative association between market sector and total return? Why? Is correlation r a helpful description of the relationship? Why?

4.31 Statistics for investing. Investment reports now often include correlations. Following a table of correlations among mutual funds, a report adds: “Two funds can have perfect correlation, yet different levels of risk. For example, Fund A and Fund B may be perfectly correlated, yet Fund A moves 20% whenever Fund B moves 10%.” Write a brief explanation, for someone who knows no statistics, of how this can happen. Include a sketch to illustrate your explanation.

4.32 Statistics for investing. A mutual-funds company’s newsletter says, “A well-diversified portfolio includes assets with low correlations.” The newsletter includes a table of correlations between the returns on various classes of investments. For example, the correlation between municipal bonds and large-cap stocks is 0.50, and the correlation between municipal bonds and small-cap stocks is 0.21.

(a) Rachel invests heavily in municipal bonds. She wants to diversify by adding an investment whose returns do not closely follow the returns on her bonds. Should she choose large-cap stocks or small-cap stocks for this purpose? Explain your answer.

(b) If Rachel wants an investment that tends to increase when the return on her bonds drops, what kind of correlation should she look for?

4.33 The effect of changing units. Changing the units of measurement can dramatically alter the appearance of a scatterplot. Return to the data on knee height and overall height in Exercise 4.21:

Knee height x 57.7 47.4 43.5 44.8 55.2

Height y 192.1 153.3 146.4 162.7 169.1

Both heights are measured in centimeters. A mad scientist prefers to measure knee height in millimeters and height in meters. The data in these units are:

Knee height x 577 474 435 448 552

Height y 1.921 1.533 1.464 1.627 1.691

(a) Make a plot with x axis extending from 0 to 600 and y axis from 0 to 250. Plot the original data on these axes. Then plot the new data using a different color or symbol. The two plots look very different.

(b) Nonetheless, the correlation is exactly the same for the two sets of measurements. Why do you know that this is true without doing any calculations? Find the two correlations to verify that they are the same.

4.34 Teaching and research. A college newspaper interviews a psychologist about student ratings of the teaching of faculty members. The psychologist says, “The evidence indicates that the correlation between the research productivity and teaching rating of faculty members is close to zero.” The paper reports this as “Professor McDaniel said that good researchers tend to be poor teachers, and vice versa.” Explain why the paper’s report is wrong. Write a statement in plain language (don’t use the word “correlation”) to explain the psychologist’s meaning.

4.35 Sloppy writing about correlation. Each of the following statements contains a blunder. Explain in each case what is wrong.

(a) “There is a high correlation between the gender of American workers and their income.”

(b) “We found a high correlation (r = 1.09) between students’ ratings of faculty teaching and ratings made by other faculty members.”

(c) “The correlation between planting rate and yield of corn was found to be r = 0.23 bushel.”

4.36 Correlation is not resistant. Go to the Correlation and Regression applet. Click on the scatterplot to create a group of 10 points in the lower-left corner of the scatterplot with a strong straight-line pattern (correlation about 0.9).

(a) Add one point at the upper right that is in line with the first 10. How does the correlation change?

(b) Drag this last point down until it is opposite the group of 10 points. How small can you make the correlation? Can you make the correlation negative? You see that a single outlier can greatly strengthen or weaken a correlation. Always plot your data to check for outlying points.

4.37 Match the correlation. You are going to use the Correlation and Regression applet to make scatterplots with 10 points that have correlation close to 0.7. The lesson is that many patterns can have the same correlation. Always plot your data before you trust a correlation.

(a) Stop after adding the first two points. What is the value of the correlation? Why does it have this value?

(b) Make a lower-left to upper-right pattern of 10 points with correlation about r = 0.7. (You can drag points up or down to adjust r after you have 10 points.) Make a rough sketch of your scatterplot.

(c) Make another scatterplot with 9 points in a vertical stack at the left of the plot. Add one point far to the right and move it until the correlation is close to 0.7. Make a rough sketch of your scatterplot.

(d) Make yet another scatterplot with 10 points in a curved pattern that starts at the lower left, rises to the right, then falls again at the far right. Adjust the points up or down until you have a quite smooth curve with correlation close to 0.7. Make a rough sketch of this scatterplot also.

The following exercises ask you to answer questions from data without having the steps outlined as part of the exercise. Follow the Formulate, Solve, and Conclude steps of the four-step process described on page 53.

4.38 Brighter sunlight? The brightness of sunlight at the earth’s surface changes over time depending on whether the earth’s atmosphere is more or less clear. Sunlight dimmed between 1960 and 1990. After 1990, air pollution dropped in industrial countries. Did sunlight brighten? Here are data from Boulder, Colorado, averaging over only clear days each year. (Other locations show similar trends.) The response variable is solar radiation in watts per square meter.12

Year 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002

Sun 243.2 246.0 248.0 250.3 250.9 250.9 250.0 248.9 251.7 251.4 250.9

4.39 Merlins breeding. Often the percent of an animal species in the wild that survive to breed again is lower following a successful breeding season. This is part of nature’s self-regulation to keep population size stable. A study of merlins (small falcons) in northern Sweden observed the number of breeding pairs in an isolated area and the percent of males (banded for identification) who returned the next breeding season. Here are data for nine years:13

Breeding pairs 28 29 29 29 30 32 33 38 38

Percent return 82 83 70 61 69 58 43 50 47

Do the data support the theory that a smaller percent of birds survive following a successful breeding season?

4.40 Does social rejection hurt? We often describe our emotional reaction to social rejection as “pain.” A clever study asked whether social rejection causes activity in areas of the brain that are known to be activated by physical pain. If it does, we really do experience social and physical pain in similar ways. Subjects were first included and then deliberately excluded from a social activity while changes in brain activity were measured. After each activity, the subjects filled out questionnaires that assessed how excluded they felt. Here are data for 13 subjects:14

          Social     Brain                  Social     Brain
Subject   distress   activity     Subject   distress   activity

1         1.26       −0.055       8         2.18       0.025
2         1.85       −0.040       9         2.58       0.027
3         1.10       −0.026       10        2.75       0.033
4         2.50       −0.017       11        2.75       0.064
5         2.17       −0.017       12        3.33       0.077
6         2.67       0.017        13        3.65       0.124
7         2.01       0.021

The explanatory variable is “social distress” measured by each subject’s questionnaire score after exclusion relative to the score after inclusion. (So values greater than 1 show the degree of distress caused by exclusion.) The response variable is change in activity in a region of the brain that is activated by physical pain. Discuss what the data show.

4.41 Hot mutual funds? The data for 2003 in Exercise 4.30 make sector funds look attractive. Stocks rose sharply in 2003, after falling sharply in 2002 (and also in 2001 and 2000). Let’s look at the percent returns for 2003 and 2002 for these same 23 funds.

2002     2003       2002     2003       2002     2003
return   return     return   return     return   return

−17.1    23.9       −0.7     36.9       −37.8    59.4
−6.7     14.1       −5.6     27.5       −11.5    22.9
−21.1    41.8       −26.9    26.1       −0.7     7.6
−12.8    43.9       −42.0    62.7       64.3     32.1
−18.9    31.1       −47.8    68.1       −9.6     28.7
−7.7     32.3       −50.5    71.9       −11.7    29.5
−17.2    36.5       −49.5    57.0       −2.3     19.1
−11.4    30.6       −23.4    35.0

Do a careful analysis of these data: side-by-side comparison of the distributions of returns in 2002 and 2003 and also a description of the relationship between the returns of the same funds in these two years. What are your most important findings? (The outlier is Fidelity Gold Fund.)