
Page 1:

Chapter 5

Relationships: Regression

Page 2:

Objectives (BPS chapter 5): Regression

Regression lines

The least-squares regression line

Facts about least-squares regression

Residuals

Influential observations

Cautions about correlation and regression

Association does not imply causation

Page 3:

Correlation tells us about the strength (scatter) and direction of the linear relationship between two quantitative variables.

We would like to have a numerical description of how the variables vary together. We would also like to make predictions based on the observed association between those two variables.

We wish to find the straight line that best fits our data. But which line best describes our data?

Page 4:

A regression line

A regression line is a straight line that describes how a response variable (y) changes as an explanatory variable (x) changes. We often use a regression line to predict the value of y for a given value of x.

Page 5:

Linear Regression

We wish to quantify the linear relationship between an explanatory variable and a response variable. We can then predict the average response for all subjects with a given value of the explanatory variable.

Regression equation: ŷ = a + bx
– x is the value of the explanatory variable
– ŷ is the average value of the response variable
– note that a and b are just the y-intercept and slope of a straight line

Page 6:

Thought Question 1

[Scatterplot of data points]

How would you draw a line through the points? How do you determine which line ‘fits best’?

Page 7:

Linear Equations

Y = mX + b

m = slope (the change in Y for a given change in X)
b = Y-intercept

This is the notation used in high school algebra.

Page 8:

The Linear Model

Remember from algebra that a straight line can be written as y = mx + b.

In statistics we use a slightly different notation: ŷ = a + bx.

We write ŷ to emphasize that the points that satisfy this equation are just our predicted values, not the actual data values.

Page 9:

Example: Fat Versus Protein

The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu:

We wish to fit a straight line through the data.

Page 10:

Residuals

The model won’t be perfect, regardless of the line we draw.

Some points will be above the line and some will be below.

The estimate made from a model is the predicted value (denoted as ŷ).

Page 11:

Residuals (cont.)

The difference between the observed value and its associated predicted value is called the residual.

To find the residuals, we always subtract the predicted value from the observed one:

residual = observed − predicted = y − ŷ
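As a minimal sketch of this calculation (hypothetical data, with an intercept a and slope b assumed already fitted):

```python
import numpy as np

# Hypothetical data; a and b are assumed to come from an already-fitted line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
a, b = 0.1, 2.0

y_hat = a + b * x       # predicted values (y-hat)
residuals = y - y_hat   # observed minus predicted
print(residuals)        # positive -> underestimate, negative -> overestimate
```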

Page 12:

Residuals (cont.)

A negative residual means the predicted value is too big (an overestimate).

A positive residual means the predicted value is too small (an underestimate).

Page 13:

“Best Fit” Means Least Squares

Some residuals are positive, others are negative, and, on average, they cancel each other out. So we can’t assess how well the line fits by adding up all the residuals.

Similar to what we did with the standard deviation, we square the residuals and add the squares. The smaller the sum, the better the fit.

The line of best fit is the line for which the sum of the squared residuals is smallest.

Page 14:

Least Squares

Used to determine the “best” line

We want the line to be as close as possible to the data points in the vertical (y) direction (since that is what we are trying to predict)

Least Squares: use the line that minimizes the sum of the squares of the vertical distances of the data points from the line
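A small sketch of the criterion on hypothetical data: two candidate lines are compared by their sum of squared vertical distances, and the smaller sum wins:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sum_of_squares(a, b):
    """Sum of squared vertical distances from the line y = a + b*x."""
    return float(np.sum((y - (a + b * x)) ** 2))

print(sum_of_squares(0.1, 2.0))   # one candidate line
print(sum_of_squares(1.0, 1.5))   # another; the larger sum means a worse fit
```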

Page 15:

The Linear Model (cont.)

We write b and a for the slope and intercept of the line. The b and a are called the coefficients of the linear model.

The coefficient b is the slope, which tells us how rapidly ŷ changes with respect to x. The coefficient a is the intercept, which tells us where the line hits (intercepts) the y-axis.

Page 16:

How to:

First we calculate the slope of the line, b. We already know how to calculate r, sx, and sy:

b = r × (sy / sx)

r is the correlation (the slope has the same sign as r)
sy is the standard deviation of the response variable y
sx is the standard deviation of the explanatory variable x

Once we know b, the slope, we can calculate a, the y-intercept:

a = ȳ − b·x̄

where x̄ and ȳ are the sample means of the x and y variables.

This means that we don’t have to calculate a lot of squared distances to find the least-squares regression line for a data set. We can instead rely on these equations. Some calculators can calculate r, a, and b.
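A sketch of these formulas in Python on hypothetical data; np.polyfit is included only as a cross-check on the hand computation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]             # correlation
b = r * y.std(ddof=1) / x.std(ddof=1)   # slope: b = r * (sy / sx)
a = y.mean() - b * x.mean()             # intercept: a = y-bar - b * x-bar
print(a, b)

print(np.polyfit(x, y, 1))              # [slope, intercept]; should match b, a
```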

Page 17:

Example

Fill in the missing information in the table below:

Page 18:

Facts about least-squares regression

1. The distinction between explanatory and response variables is essential in regression.

2. The correlation coefficient (r) and the slope (b) of the least-squares line have the same sign. The direction of the association determines the sign of the slope of the regression line.

3. The least-squares regression line always passes through the point (x̄, ȳ).

4. The correlation r describes the strength of a straight-line relationship. The square of the correlation, r², is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x. The square of the correlation is called the coefficient of determination.
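Fact 3 is easy to verify numerically; a minimal sketch with hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b, a = np.polyfit(x, y, 1)   # least-squares slope and intercept
print(a + b * x.mean())      # value of the fitted line at x-bar...
print(y.mean())              # ...equals y-bar
```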

Page 19:

The distinction between explanatory and response variables is crucial in regression. If you exchange y for x in calculating the regression line, you will get the wrong line. Regression examines the distance of all points from the line in the y direction only.

Data from the Hubble telescope about galaxies moving away from Earth: the two lines shown are the two regression lines calculated either correctly (x = distance, y = velocity, solid line) or incorrectly (x = velocity, y = distance, dotted line).

Page 20:

Interpretation of the Slope and Intercept

The slope b indicates the amount by which ŷ changes when x changes by one unit.

The intercept a is the value of ŷ when x = 0. It is not always meaningful.

Page 21:

Example

The regression line for the Burger King data is ŷ = 6.8 + 0.97x, where x is protein (g) and ŷ is the predicted fat content (g). Interpret the slope and the intercept.

Slope: For every one-gram increase in protein, the predicted fat content increases by 0.97 g.

Intercept: A BK item that has 0 g of protein is predicted to contain 6.8 g of fat.

Page 22:

Predictions

In predicting a value of y based on some given value of x:

1. If there is no linear correlation, the best predicted y-value is ȳ, the mean of the observed y-values.

2. If there is a linear correlation, the best predicted y-value is found by substituting the x-value into the regression equation.

Page 23:

Example: Fat Versus Protein

The regression line for the Burger King data fits the data well:

– The equation is ŷ = 6.8 + 0.97x.

– The predicted fat content for a BK Broiler chicken sandwich that contains 30 g of protein is 6.8 + 0.97(30) = 35.9 grams of fat.
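The same prediction as a tiny sketch, using the coefficients quoted above:

```python
a, b = 6.8, 0.97            # intercept and slope of the BK regression line
protein = 30                # grams of protein in the sandwich
fat_hat = a + b * protein   # substitute x into the regression equation
print(fat_hat)              # about 35.9 grams of fat
```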

Page 24:

Prediction via Regression Line

Husband and Wife: Ages
(Hand, et al., A Handbook of Small Data Sets, London: Chapman and Hall)

The regression equation is ŷ = 3.6 + 0.97x
– ŷ is the average age of all husbands who have wives of age x

Suppose we know that an individual wife’s age is 30. What would we predict her husband’s age to be?

For all women aged 30, we predict the average husband age to be 32.7 years:

3.6 + (0.97)(30) = 32.7 years

Page 25:

Caution! Beware of Extrapolation

Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line. This can be misleading, as seen here.

Page 26:

Caution! Beware of Extrapolation

Sarah’s height was plotted against her age.

Can you predict her height at age 42 months?

Can you predict her height at age 30 years (360 months)?

[Scatterplot: age (months) on the x-axis, height (cm) on the y-axis]

Page 27:

A Caution: Beware of Extrapolation

Regression line: ŷ = 71.95 + 0.383x

Height at age 42 months? ŷ = 71.95 + 0.383(42) ≈ 88 cm.

Height at age 30 years (360 months)? ŷ = 71.95 + 0.383(360) ≈ 209.8 cm.
– She is predicted to be 6' 10.5" at age 30.

[Scatterplot with the fitted line extended far beyond the observed ages; age (months) on the x-axis, height (cm) on the y-axis]

Page 28:

Residuals Revisited

Residuals help us to see whether the model makes sense. When a regression model is appropriate, nothing interesting should be left behind.

After we fit a regression model, we usually plot the residuals in the hope of finding no apparent pattern.

Page 29:

Residual Plot Analysis

A residual plot is a scatterplot of the regression residuals against the explanatory variable.

If a residual plot does not reveal any pattern, the regression equation is a good representation of the association between the two variables.

If a residual plot reveals some systematic pattern, the regression equation is not a good representation of the association between the two variables.
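A minimal residual-plot sketch with hypothetical data (matplotlib assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.8, 8.3, 9.9, 12.2])

b, a = np.polyfit(x, y, 1)       # least-squares fit
residuals = y - (a + b * x)

plt.scatter(x, residuals)        # residuals against the explanatory variable
plt.axhline(0, linestyle="--")   # reference line at residual = 0
plt.xlabel("x (explanatory variable)")
plt.ylabel("residual")
plt.show()                       # no visible pattern -> linear model is reasonable
```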

Page 30:

Residuals Revisited (cont.)

The residuals for the BK menu regression look appropriately boring:

[Residual plot for the BK regression]

Page 31:

Residuals are randomly scattered: good!

A curved pattern means the relationship you are looking at is not linear.

A change in variability across the plot is a warning sign. You need to find out what it is, and remember that predictions made in areas of larger variability will not be as good.

Page 32:

Coefficient of Determination (R²)

R² measures the usefulness of a regression prediction. R² (or r², the square of the correlation) measures the percentage of the variation in the values of the response variable (y) that is explained by the regression line.

r = 1: R² = 1: the regression line explains all (100%) of the variation in y.

r = 0.7: R² = 0.49: the regression line explains almost half (about 50%) of the variation in y.
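A sketch computing R² two equivalent ways on hypothetical data: as the square of r, and as the fraction of the variation in y explained by the line (1 − SS_residual / SS_total):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.8, 8.3, 9.9, 12.2])

r = np.corrcoef(x, y)[0, 1]
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

r2_from_r = r ** 2
r2_from_ss = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2_from_r, r2_from_ss)   # identical for least-squares simple regression
```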

Page 33:

R² (cont.)

Along with the slope and intercept for a regression, you should always report R² so that readers can judge for themselves how successful the regression is at fitting the data.

Statistics is about variation, and R² measures the success of the regression model in terms of the fraction of the variation of y accounted for by the regression.

Page 34:

r = −1, r² = 1: Changes in x explain 100% of the variation in y. y can be entirely predicted for any given value of x.

r = 0, r² = 0: Changes in x explain 0% of the variation in y. The value(s) y takes is (are) entirely independent of what value x takes.

r = 0.87, r² = 0.76: Here the change in x only explains 76% of the change in y. The rest of the change in y (the vertical scatter, shown as red arrows) must be explained by something other than x.

Page 35:

Caution with regression

Since regression and correlation are closely related, we need to check the same conditions for regression as we did for correlation:

– Quantitative Variables Condition
– Straight Enough Condition
– Outlier Condition

Page 36:

Caution with regression

Do not use a regression on inappropriate data:
– Pattern in the residuals
– Presence of large outliers (use residual plots for help)
– Clumped data falsely appearing linear

Recognize when the correlation/regression is performed on averages.

A relationship, however strong, does not imply causation.

Beware of lurking variables.

Avoid extrapolating (predicting outside the observed x data range).

Page 37:

Guidelines for Using the Regression Equation

1. If there is no linear correlation, don’t use the regression equation to make predictions.

2. When using the regression equation for predictions, stay within the scope of the available sample data.

3. A regression equation based on old data is not necessarily valid now.

4. Don’t make predictions about a population that is different from the population from which the sample data were drawn.

Page 38:

Vocabulary

Marginal Change – refers to the slope; the amount the response variable changes when the explanatory variable changes by one unit.

Outlier – a point lying far away from the other data points.

Influential Point – an outlier that has the potential to change the regression line.
– Points that are outliers in either the x or y direction of a scatterplot are often influential for the correlation.
– Points that are outliers in the x direction are often influential for the least-squares regression line.

Page 39:

Are these points influential?

[Scatterplot comparing three regression lines: all data, without child 18, and without child 19; one point is an outlier in the y direction, the other is influential]

Page 40:

Vocabulary: Lurking vs. Confounding

LURKING VARIABLE
A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.

CONFOUNDING
Two variables are confounded when their effects on a response variable cannot be distinguished from each other. The confounded variables may be either explanatory variables or lurking variables.

Page 41:

Lurking variables

Lurking variables can falsely suggest a relationship.

What is the lurking variable in these examples? How could you answer if you didn’t know anything about the topic?

– Strong positive association between the number of firefighters at a fire site and the amount of damage a fire does

– Negative association between moderate amounts of wine drinking and death rates from heart disease in developed nations

Page 42:

Correlation Does Not Imply Causation

Even very strong correlations may not correspond to a real causal relationship.

Page 43:

Evidence of Causation

A properly conducted experiment establishes the connection.

Other considerations:
– A reasonable explanation for a cause and effect exists
– The connection happens in repeated trials
– The connection happens under varying conditions
– Potential confounding factors are ruled out
– The alleged cause precedes the effect in time

Page 44:

Evidence of Causation

An observed relationship can be used for prediction without worrying about causation, as long as the patterns found in past data continue to hold true.

We must make sure that the prediction makes sense.

We must be very careful of extreme extrapolation.

Page 45:

Reasons Two Variables May Be Related (Correlated)

– The explanatory variable causes a change in the response variable
– The response variable causes a change in the explanatory variable
– The explanatory variable may be a contributing cause, but not the sole cause, of changes in the response variable
– Confounding variables may exist
– Both variables may result from a common cause (such as both variables changing over time)
– The correlation may be merely a coincidence

Page 46:

Common Response
(both variables change due to a common cause)

Explanatory: divorce among men
Response: percent abusing alcohol

Both may result from an unhappy marriage.

Page 47:

Both Variables Are Changing Over Time

Both divorces and suicides have increased dramatically since 1900. Are divorces causing suicides? Are suicides causing divorces?

The population has increased dramatically since 1900 (causing both to increase).

Better to investigate: has the rate of divorce or the rate of suicide changed over time?

Page 48:

The Relationship May Be Just a Coincidence

Sometimes we see strong correlations (or apparent associations) just by chance, even when the variables are not related in the population.

Page 49:

Coincidence (?) Vaccines and Brain Damage

A required whooping cough vaccine was blamed for seizures that caused brain damage
– led to reduced production of the vaccine (due to lawsuits)

A study of 38,000 children found no evidence for the accusations (reported in the New York Times):
– “people confused association with cause-and-effect”
– “virtually every kid received the vaccine…it was inevitable that, by chance, brain damage caused by other factors would occasionally occur in a recently vaccinated child”

Page 50:

Quiz

1. The least-squares regression line is

A) the line that best splits the data in half, with half of the points above the line and half below the line.
B) the line that makes the square of the correlation in the data as large as possible.
C) the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
D) all of the above.

Page 51:

Quiz

2. The fraction of the variation in the values of a response y that is explained by the least-squares regression of y on x is

A) the intercept of the least-squares regression line.
B) the correlation coefficient.
C) the slope of the least-squares regression line.
D) the square of the correlation coefficient.

Page 52:

Quiz

3. A researcher obtained the average SAT scores of all students in each of the 50 states, and the average teacher salaries in each of the 50 states of the US. He found a negative correlation between these variables. The researcher concluded that a lurking variable must be present. By lurking variable he means

A) the true cause of a response.
B) the true variable, which is explained by the explanatory variable.
C) a variable that is not among the variables studied but which affects the response variable.
D) any variable that produces a large residual.

Page 53:

Example:

1. Find the least-squares regression line of John’s height on age based on the data.

2. Interpret the slope of the regression line.

3. If John’s height at 54 months has a residual of 1.2 in, what was his actual height?

4. What percentage of the variation in John’s height is explained by the regression model?

Page 54:

Key Concepts

– Least-squares regression equation
– Interpretation of the slope and intercept
– Prediction: avoid extrapolations
– Residual plot analysis
– R²
– Correlation does not imply causation
– Reasons variables may be correlated