
M7S3 - Regression Thoughts

Professor Jarad Niemi

STAT 226 - Iowa State University

November 27, 2018

Professor Jarad Niemi (STAT226@ISU) M7S3 - Regression Thoughts November 27, 2018 1 / 21


Outline

Regression thoughts

Properties
- Coefficient of determination (r²) is the amount of variation explained
- Not reversible
- Always through (x̄, ȳ)
- Residuals sum to zero
- Residual plots
- Leverage and influence

Cautions
- Extrapolation
- Correlation does not imply causation
- Lurking variables (Simpson's Paradox)
- Correlations based on average data


Simple linear regression Review

Simple linear regression

For a collection of observations (x_i, y_i) for i = 1, ..., n, we can fit a regression line

y_i = b0 + b1 x_i + e_i

where

b0 is the sample intercept,

b1 is the sample slope, and

e_i is the residual for individual i

by minimizing the sum of squared residuals

∑_{i=1}^n e_i²,  where  e_i = y_i − ŷ_i = y_i − (b0 + b1 x_i)

and ŷ_i is the predicted value for individual i.
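The least-squares minimization above can be checked numerically. A minimal sketch in Python (rather than the software used in the course); the `fit` helper and the data are made up for illustration:

```python
# Closed-form least-squares fit for simple linear regression:
# b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), b0 = ybar - b1*xbar.
def fit(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
         / sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    return b0, b1

# Toy data, roughly y = -2.5x - 2 plus noise.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [-2.1, -4.4, -7.2, -9.4, -12.3, -14.8, -17.1, -19.9, -22.4, -24.7]
b0, b1 = fit(x, y)
print(b0, b1)   # intercept near -2, slope near -2.5
```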


Simple linear regression Review

Simple linear regression graphically

[Figure: scatterplot of y (from −30 to 0) versus x (from 0 to 10) illustrating a fitted simple linear regression.]


Properties Coefficient of determination

Coefficient of determination

The sample correlation r measures the direction and strength of the linear relationship between x and y.

Definition

The coefficient of determination

r² = 1 − ( ∑_{i=1}^n e_i² ) / ( ∑_{i=1}^n (y_i − ȳ)² )

measures the amount of variability in y that can be explained by the linear relationship between x and y.
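A quick numeric check of the definition, as a Python sketch with made-up data (not from the slides): r² computed from residuals agrees with the squared sample correlation, as it must in simple linear regression.

```python
import math

# Least-squares fit (closed-form formulas for simple linear regression).
def fit(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) \
         / sum((a - xbar) ** 2 for a in x)
    return ybar - b1 * xbar, b1

x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]
b0, b1 = fit(x, y)

# r^2 from the definition: 1 - SSE / SST.
ybar = sum(y) / len(y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - ybar) ** 2 for yi in y)
r2 = 1 - sse / sst

# Sample correlation r, squared for comparison.
xbar = sum(x) / len(x)
r = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / math.sqrt(
    sum((a - xbar) ** 2 for a in x) * sum((b - ybar) ** 2 for b in y))
print(r2, r ** 2)   # the two values agree
```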


Properties Coefficient of determination

Example

The correlation between weekly sales amount and weekly radio ads is 0.98. The coefficient of determination is r² ≈ 0.96. Thus about 96% of the variability in weekly sales amount can be explained by the linear relationship with weekly radio ads.

If you are only told r², you cannot determine the direction of the relationship.


Properties Coefficient of determination

Symmetric

Correlation is symmetric: the correlation of x with y is the same as the correlation of y with x.

cor(x,y)

[1] -0.8866024

cor(y,x)

[1] -0.8866024

Thus the coefficient of determination is symmetric.


Properties Not reversible

Equation not reversible

The regression line is

y = b0 + b1 x

Solving this equation for x gives

x = −b0/b1 + (1/b1) y

but this is not the regression line obtained by regressing x on y:

regress(y,x)

(Intercept)           x
  -1.904408   -2.589194

-b0/b1; 1/b1

[1] -0.7355215

[1] -0.3862206

regress(x,y)

(Intercept)           y
  0.4915144  -0.3035940
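The same non-reversibility can be shown with a short Python sketch (toy data and a hand-rolled `fit` helper, not the `regress` output above): algebraically inverting the fit of x on y does not recover the fit of y on x unless the correlation is exactly ±1.

```python
# Least-squares fit: return (intercept, slope) for regressing y on x.
def fit(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) \
            / sum((a - xbar) ** 2 for a in x)
    return ybar - slope * xbar, slope

x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [-1.0, -5.2, -6.9, -10.3, -11.8, -15.6, -16.5, -20.4, -21.7, -25.3]
b0, b1 = fit(x, y)   # regression of y on x
a0, a1 = fit(y, x)   # regression of x on y

# Inverting x = a0 + a1*y would give slope 1/a1 for y on x,
# but this differs from b1 whenever |r| < 1 (since b1 * a1 = r^2).
print(b1, 1 / a1)
```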


Properties Always through (x̄, ȳ)

Always through (x̄, ȳ)

Recall that knowing any two points is enough to determine a straight line. It can be proved that the regression line always passes through the point (x̄, ȳ). Suppose you know that x̄ = 5, ȳ = −15, and the y-intercept is −2. What is the slope?

ȳ = b0 + b1 x̄  =⇒  b1 = (ȳ − b0)/x̄

So the slope is

(ybar-b0)/xbar

[1] -2.6
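Both facts can be verified in a few lines of Python (a sketch; the `fit` helper and toy data are mine, only the x̄ = 5, ȳ = −15, b0 = −2 numbers come from the slide):

```python
# Least-squares fit: return (b0, b1).
def fit(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) \
         / sum((a - xbar) ** 2 for a in x)
    return ybar - b1 * xbar, b1

# (1) The fitted line passes through (xbar, ybar) on any data set.
x = [1, 2, 3, 4, 5]
y = [3.0, 4.5, 5.1, 7.2, 8.0]
b0, b1 = fit(x, y)
xbar, ybar = sum(x) / 5, sum(y) / 5
print(abs((b0 + b1 * xbar) - ybar))   # essentially zero

# (2) The slide's arithmetic: solve ybar = b0 + b1*xbar for the slope.
slope = (-15 - (-2)) / 5
print(slope)   # -2.6
```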


Properties Always through (x̄, ȳ)

[Figure: scatterplot of y versus x with the fitted regression line passing through the point (x̄, ȳ).]


Properties Residuals sum to zero

Residuals sum to zero

When the regression includes an intercept (b0), it can be proved that the residuals sum to zero, i.e.

∑_{i=1}^n e_i = 0.

We will often look at residual plots:

Residuals vs explanatory variable

Residuals vs predicted value

These will be centered on 0 due to the result above.
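This result is easy to check numerically. A Python sketch (toy data; the `fit` helper is mine):

```python
# Least-squares fit with an intercept.
def fit(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) \
         / sum((a - xbar) ** 2 for a in x)
    return ybar - b1 * xbar, b1

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 2.9, 4.1, 3.8, 5.5, 7.2, 6.9, 8.4]
b0, b1 = fit(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(sum(resid))   # zero up to floating-point rounding
```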


Properties Residual plots

Residual vs explanatory variable

[Figure: plot of residuals versus the explanatory variable x; the residuals are centered around 0.]


Properties Residual plots

Residual vs predicted

[Figure: plot of residuals versus the predicted values; the residuals are centered around 0.]


Properties Leverage and influence

Leverage and influence

Definition

An individual has high leverage if its explanatory variable value is far from the explanatory variable values of the other observations. An individual with high leverage will be an outlier in the x direction. An individual has high influence if its inclusion dramatically affects the fitted regression line. Only individuals with high leverage can have high influence.
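The definition can be illustrated numerically. In this Python sketch (made-up data, hand-rolled fit), a far-out x point that sits off the trend drags the slope dramatically (high leverage, high influence), while the same x value on the trend leaves the slope unchanged (high leverage, low influence):

```python
# Least-squares slope and intercept.
def fit(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) \
         / sum((a - xbar) ** 2 for a in x)
    return ybar - b1 * xbar, b1

x = list(range(10))
y = [2 * xi for xi in x]          # base data: exactly y = 2x

_, slope_base = fit(x, y)                 # slope 2
_, slope_infl = fit(x + [30], y + [0])    # off-trend point: slope changes a lot
_, slope_no   = fit(x + [30], y + [60])   # on-trend point: slope unchanged
print(slope_base, slope_infl, slope_no)
```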


Properties Leverage and influence

Leverage and influence

[Figure: 2×2 grid of scatterplots of y versus x illustrating the four combinations of high/low leverage and high/low influence.]


Cautions Correlation does not imply causation

Correlation does not imply causation

You have all likely heard the adage

correlation does not imply causation.

If two variables have a correlation that is close to −1 or 1, the two variables are highly correlated. This does not mean that one variable causes the other.

Spurious correlations: http://www.tylervigen.com/spurious-correlations


Cautions Correlation does not imply causation

Correlation does not imply causation (cont.)

From https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5402407/:

My attention was drawn to the recent article by Song et al. entitled “How jet lag impairs Major League Baseball performance” (1), not only by its slightly unusual subject but more importantly because I wondered how one could ever actually prove the effect of jet lag on baseball performance.

...Although I do not dispute the large amount of work involved and would be well-nigh incapable of judging the validity of the analyses performed, I must admit that I was taken aback by the way Song et al. (1) systematically present the correlations they identify as direct proof of causality between jet lag and the affected variables. It is actually quite remarkable to me that the word “correlation” does not appear even once in the paper, when this is actually what the authors have been looking at and, in my opinion, to be scientifically accurate, the title of the article should really read: “How jet lag correlates with impairments in Major League Baseball performance.”

...this tendency to amalgamate correlation with causality is apparently extremely frequent in this field of investigation. But given the broad readership of PNAS and the subject of this article, I feel that it is likely to be relayed by the press and to attract the attention of many people, both scientists and nonscientists.

Considering the current tendency to misinterpret scientific data, via the misuse of statistics in particular, I feel that a journal such as PNAS should aim to educate by example, and thus ought to enforce more rigor in the presentation of scientific articles regarding the difference between correlations and proven causality.


Cautions Lurking variables

Lurking variables

Definition

A lurking variable is a variable that has an important effect on the relationship of the variables under investigation, but that is not included in the variables being studied.

What is the relationship between a person’s height and their ideal partner’s height?

[Figure: scatterplot of ideal partner height versus own height with a linear fit: ideal height = 80.081761 − 0.1654006 × own height.]


Cautions Lurking variables

Ideal partner height

In this example, gender is a lurking variable:

[Figure: scatterplot of ideal partner height versus own height with separate linear fits by gender.
Linear Fit Females: predicted ideal height = 35.798818 + 0.5469203 × own height
Linear Fit Males: predicted ideal height = 34.971329 + 0.4484906 × own height]

This phenomenon is called Simpson’s Paradox.
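The sign reversal can be reproduced with a tiny numeric example in Python (fabricated heights, not the survey data behind the slide): within each group the slope is positive, but pooling the groups flips it negative.

```python
# Least-squares slope only.
def slope(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) \
           / sum((a - xbar) ** 2 for a in x)

# Shorter group wants taller partners; taller group wants shorter ones.
f_own, f_ideal = [62, 64, 66], [70.0, 71.0, 72.0]   # within-group slope +0.5
m_own, m_ideal = [70, 72, 74], [66.5, 67.5, 68.5]   # within-group slope +0.5

pooled = slope(f_own + m_own, f_ideal + m_ideal)    # pooled slope is negative
print(slope(f_own, f_ideal), slope(m_own, m_ideal), pooled)
```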


Cautions Correlations based on average data

Correlations based on average data

Correlations based on average data are often much higher (closer to −1 or 1) than correlations based on individual data. This occurs because the averages smooth out the variability between individuals.
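A small Python illustration with made-up numbers: the correlation computed from group averages is noticeably higher than the correlation computed from the individual observations.

```python
import math

# Sample correlation coefficient.
def cor(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Three groups of individuals; within-group scatter weakens the trend.
x_ind = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y_ind = [0, 1, 2, 1, 2, 3, 2, 3, 4]
r_ind = cor(x_ind, y_ind)            # about 0.71

# Averaging y within each group smooths out the individual variability.
x_avg = [1, 2, 3]
y_avg = [1, 2, 3]                    # group means of y_ind
r_avg = cor(x_avg, y_avg)            # exactly 1 here
print(r_ind, r_avg)
```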


Cautions Extrapolation

Extrapolation

Definition

Extrapolation occurs when making predictions for explanatory variable values below the sample minimum or above the sample maximum of the explanatory variable.

Regression assumes a linear pattern between the response variable and the explanatory variable. Even if this linear assumption is correct for a range of explanatory variable values, there is no reason to expect that this will continue beyond that range.
