Regression/Correlation

Upload: clemence-burns

Post on 03-Jan-2016


Page 1: Regression/Correlation. A Linear Model Regression modeling is the process of evaluating the goodness of fit of a linear model between two or more random

Regression/Correlation


A Linear Model

• Regression modeling is the process of evaluating the goodness of fit of a linear model between two or more random variables.

– Linear models (your predictors don’t show up in the exponents of the model)

• This is the case when you are looking at two variables measured on the same person where you did not set the levels of the predictors.

– If you set the levels of the predictors, you can force the amount of correlation you want.


Least Squares from Last Time

• The goal is to minimize the differences between each point and the regression line (plane/vector in n-dimensional space).

• It is the same old game where you try to minimize the squared difference between, in this case, the points and the regression line.

– Last time I showed the algebra to find the values but you will never do that in real life.

– You will always use a computer to do the algebra.
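As a rough illustration of what the computer does for you, here is a minimal sketch in Python (an assumption for illustration; the course itself uses R and SAS). The function names and the five data points are made up. It uses the standard closed-form answer for a single predictor and then checks that nudging the slope makes the squared error worse.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, intercept, slope)
# Moving the slope away from the fitted value can only increase the squared error.
worse = sum_squared_residuals(xs, ys, intercept, slope + 0.1)
print(round(intercept, 3), round(slope, 3), best < worse)
```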


I Forgot Something

• As I mentioned, you will want to see a regression line showing your best guess for the outcome given the predictors.

You will also want to see the confidence limits around the line.

You can use the scatter plot graphic and …


Guess at new subjects (individual)

Guess at the means


Strength of the Association

• After you are done building the linear model, you want to know how good a job you did describing the outcome given the predictors.

• Essentially, you want to quantify the amount of variability around the regression line.

• Last time I demonstrated that you can test to see if the slope of the line is statistically different from zero as part of building the linear model.

– You care about the strength of the relationship. As the predictor goes up, the outcome goes up, but is there a lot or a little variability around the prediction?


Dependence

• How do two variables vary together?

– Are they independent? If independent, you do not gain any information about the values of one variable given what you know about the second.

– Is there a dependence? You can use your knowledge of one to make a “good” guess at the other.


What is a good guess?

• You have already seen the algebra showing that you can quantify the relationship between two variables using a t-statistic or using the more intuitive F statistic.


Partitioning the Variance


             d.f.   SS           MS           F-ratio
Regression   1      36,464.20    36,464.20    99.8
Residual     47     17,173.10    365.384
Total        48     53,637.30

Total: $\sum (y_i - \bar{y})^2 = 53{,}637.3$

Regression: $\sum (\hat{y}_i - \bar{y})^2 = 36{,}464.2$

Residual: $\sum (y_i - \hat{y}_i)^2 = 17{,}173.1$
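The arithmetic behind the table can be checked in a few lines. This is a sketch in Python (chosen for illustration; the variable names are assumptions) that uses only the sums of squares and degrees of freedom shown above.

```python
# Sums of squares and degrees of freedom copied from the table above.
ss_regression, df_regression = 36464.20, 1
ss_residual, df_residual = 17173.10, 47
ss_total = ss_regression + ss_residual

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F-ratio: variability explained by the line relative to leftover variability.
f_ratio = ms_regression / ms_residual

# Share of the total variability that the regression accounts for.
r_squared = ss_regression / ss_total

print(f"MS residual = {ms_residual:.3f}")
print(f"F = {f_ratio:.1f}")
print(f"R^2 = {r_squared:.2f}")
```

The last line previews where the deck is heading: the regression sum of squares as a fraction of the total is R².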


Quantify the Variability

• How do you quantify the variability around that line?

• You want a correlation coefficient.


Correlation

• The correlation concepts were invented by Sir Francis Galton.

• He described the concept of regression to the mean (meaning that it is hard to stay unusual). Mr. Big’s baby is big, but smaller than Mr. Big.

• He invented the concept of correlation.

• He is the father of psychometrics.

• He popularized the use of questionnaires and surveys for collecting data.

• He coined the phrase “nature versus nurture”.

• He espoused eugenics and was Darwin’s half-cousin.

Sir Francis Galton


Korrelation

• Galton's ideas were expanded and the math was worked out by Karl Pearson (he changed the C in Carl to K because he was a fan of Karl Marx).

• You want a statistic that expresses the amount of variability in one measure that can be explained by the other variable. KP’s statistic to describe correlation is known as Pearson’s Product Moment Correlation Coefficient which is naturally abbreviated r.

Karl Pearson


Univariate Distributions

• You already know how to visualize and quantify the variability of one variable around its mean.

– Look at its histogram (density).

– Variance

Y ~ N(μ, σ²)

Y is distributed with a normal distribution described by mean μ and variance σ²


Bivariate

• If you have two continuous variables, visualize them as a two-dimensional histogram (a topographic map or ideally, a simulated 3D surface).

• Drop a big ol’ pile of beans onto a table. You want a way to describe the distance from each point to the middle of the pile. Think of a combined variance as adding up the variance in two dimensions, from left to right as you look at the pile and also away from you vs. toward you.


Bivariate Normal

• Many of the statistics you will play with assume the data are described by a bivariate normal distribution. That means normal in both directions in the plot, and the relationship between the variables can be described by a single other parameter. The population pattern is typically represented with these 5 parameters.

(X, Y) ~ N(μx, μy, σx², σy², ρxy)

2 Means, 2 Variances and a Correlation
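To get a feel for the five parameters, here is a sketch in Python that simulates a bivariate normal sample and checks that the sample correlation lands near ρ. The function names and the parameter values (means, SDs, ρ = 0.6) are assumptions made up for illustration; the mixing trick with two independent standard normals is one standard way to build in a correlation, not something taken from the slides.

```python
import math
import random


def simulate_bivariate_normal(n, mu_x, mu_y, sd_x, sd_y, rho, seed=0):
    """Draw n (x, y) pairs from a bivariate normal with the given 5 parameters."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rng.gauss(0, 1)
        x = mu_x + sd_x * z1
        # Mixing z1 and z2 gives y a correlation of rho with x.
        y = mu_y + sd_y * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        pairs.append((x, y))
    return pairs


def pearson_r(pairs):
    """Sample correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical parameter values chosen only for illustration.
sample = simulate_bivariate_normal(20000, mu_x=170, mu_y=70, sd_x=10, sd_y=12, rho=0.6)
sample_r = pearson_r(sample)
print(round(sample_r, 2))
```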


Bivariate Distribution Plots


Covariance

• How do you assess the relationship between two variables’ variance and their means?

• In math books you will see this written as “the expectation of the vector product x*y.”

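$$
\begin{aligned}
\operatorname{cov}(x, y) &= E[(x - \mu_x)(y - \mu_y)] \\
&= E[xy - x\mu_y - \mu_x y + \mu_x\mu_y] \\
&= E(xy) - \mu_y E(x) - \mu_x E(y) + \mu_x\mu_y \\
&= E(xy) - \mu_x\mu_y - \mu_x\mu_y + \mu_x\mu_y \\
&= E(xy) - \mu_x\mu_y \\
&= E(xy) - E(x)E(y)
\end{aligned}
$$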


Difference vs. the Mean

• If xi is greater than the mean of X, and yi is greater than the mean of Y, what is the product?

• If xi is less than the mean of X, and yi is less than the mean of Y, what is the product?

• If xi is greater than the mean of X, and yi is less than the mean of Y, what is the product?

• If xi is less than the mean of X, and yi is greater than the mean of Y, what is the product?

$$\operatorname{cov}(x, y) = E[(x - \mu_x)(y - \mu_y)]$$
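The four questions above can be answered by trying one point in each quadrant around the means. This is a small sketch in Python with made-up means and points (all values are hypothetical); only the sign of each product matters.

```python
# Hypothetical means and points, one per quadrant around the means.
mean_x, mean_y = 10.0, 50.0
points = {
    "x above, y above": (12.0, 55.0),
    "x below, y below": (8.0, 45.0),
    "x above, y below": (12.0, 45.0),
    "x below, y above": (8.0, 55.0),
}

products = {}
for label, (x, y) in points.items():
    # Product of the two deviations from the means.
    products[label] = (x - mean_x) * (y - mean_y)
    print(f"{label}: product = {products[label]:+.1f}")
```

Points that sit on the same side of both means push the covariance up; points on opposite sides pull it down.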


Expectation (xy)

• You know how to get the expectation of a variable. Try using the mean.

• Because you will be hanging out with someone who loves math soon ….

• Probability functions are frequently written as the likelihood of events across all possible events.

• You will see the expectation written as the weighted average of the products.

$$E(xy) = \sum_{x,\,y} x\,y\,f(x, y)$$

For each unique X and Y combination, calculate the product and multiply it by the percentage of the values that have that pattern. This is a weighted mean of the unique values.


Weighted Mean

x     y     x*y
1    -1    -1
1     0     0
1     0     0
1     0     0
1     2     2
1     2     2
2    -1    -2
2    -1    -2
2     2     4
2     2     4
2     2     4
3     0     0
3     0     0
3     0     0
3     0     0
3     2     6
3     2     6
3     2     6

average   2.06   0.72   1.61
          E(x)   E(y)   E(xy)

Scores (18 in total)

         y = -1   y = 0   y = 2
x = 1    1        3       2
x = 2    2        0       3
x = 3    0        4       3

Product (xy)

         y = -1   y = 0   y = 2
x = 1    -1       0       2
x = 2    -2       0       4
x = 3    -3       0       6

Percentiles f(x,y)

         y = -1      y = 0       y = 2
x = 1    0.055556    0.166667    0.111111
x = 2    0.111111    0           0.166667
x = 3    0           0.222222    0.166667

xy * f(x,y)

         y = -1       y = 0   y = 2
x = 1    -0.055556    0       0.222222
x = 2    -0.222222    0       0.666667
x = 3    0            0       1

sum = 1.611111
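The two routes to E(xy) on this slide can be compared directly. This is a sketch in Python (the variable names are assumptions) using the 18 pairs listed above: one route averages the row-by-row products, the other weights each unique product by the proportion of rows that have it.

```python
from collections import Counter

# The 18 (x, y) pairs listed on the slide.
pairs = [
    (1, -1), (1, 0), (1, 0), (1, 0), (1, 2), (1, 2),
    (2, -1), (2, -1), (2, 2), (2, 2), (2, 2),
    (3, 0), (3, 0), (3, 0), (3, 0), (3, 2), (3, 2), (3, 2),
]
n = len(pairs)

# Route 1: plain average of the row-by-row products.
e_xy_rows = sum(x * y for x, y in pairs) / n

# Route 2: weighted mean over the unique (x, y) combinations,
# where f(x, y) is the proportion of rows with that combination.
counts = Counter(pairs)
e_xy_weighted = sum(x * y * (count / n) for (x, y), count in counts.items())

print(round(e_xy_rows, 4), round(e_xy_weighted, 4))
```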


Hold the Matrix Algebra, Please

Averages of the 18 rows from the Weighted Mean slide: E(x) = 2.06, E(y) = 0.72, E(xy) = 1.61

$$\operatorname{cov}(x, y) = E(xy) - E(x)E(y) = 1.61 - 2.06 \times 0.72$$

The covariance is approximately .12

Nice … what does that mean?
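Here is the same calculation without the rounding, as a sketch in Python (variable names are assumptions). It computes the covariance both from the shortcut E(xy) − E(x)E(y) and from the definition as the average product of deviations, dividing by n rather than n − 1 to match the expectations used on the slide. Unrounded, the value comes out near 0.127, in line with the slide's approximate figure.

```python
# The 18 (x, y) pairs from the Weighted Mean slide.
pairs = [
    (1, -1), (1, 0), (1, 0), (1, 0), (1, 2), (1, 2),
    (2, -1), (2, -1), (2, 2), (2, 2), (2, 2),
    (3, 0), (3, 0), (3, 0), (3, 0), (3, 2), (3, 2), (3, 2),
]
n = len(pairs)

e_x = sum(x for x, _ in pairs) / n
e_y = sum(y for _, y in pairs) / n
e_xy = sum(x * y for x, y in pairs) / n

# Shortcut form: cov(x, y) = E(xy) - E(x)E(y).
cov_shortcut = e_xy - e_x * e_y

# Definition: the average product of deviations from the means.
cov_definition = sum((x - e_x) * (y - e_y) for x, y in pairs) / n

print(round(e_x, 2), round(e_y, 2), round(e_xy, 2))
print(round(cov_shortcut, 4), round(cov_definition, 4))
```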


Covariance is Difficult to Interpret

• The values of a covariance depend on the unit in which the variables were measured. If you multiply both variables by 10, the covariance goes up by a factor of 100.


Covariance and Scale

[Figure: two scatter plots of the same point pattern. Left panel: x axis 0 to 3.5, y axis -1.5 to 2.5. Right panel: x axis 0 to 35, y axis -15 to 25.]

Left: covariance approximately .12. Right: covariance approximately 12.

If you rescale the values (say cm to mm), you increase the covariance even though the patterns are the same.
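A quick check of the rescaling claim, as a sketch in Python (the `covariance` helper is an assumption written for this example). It multiplies both variables from the Weighted Mean slide by 10 and compares the two covariances.

```python
def covariance(xs, ys):
    """Average product of deviations from the means (divides by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# The 18 x and y values from the Weighted Mean slide.
xs = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]
ys = [-1, 0, 0, 0, 2, 2, -1, -1, 2, 2, 2, 0, 0, 0, 0, 2, 2, 2]

cov_original = covariance(xs, ys)
# Multiply both variables by 10, as in a cm-to-mm conversion.
cov_rescaled = covariance([10 * x for x in xs], [10 * y for y in ys])
ratio = cov_rescaled / cov_original

print(round(cov_original, 3), round(cov_rescaled, 2), round(ratio, 1))
```

Each variable contributes its own factor of 10, which is why the covariance grows a hundredfold while the picture looks the same.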


Correlation Coefficient

• You have seen cases where you rescale variables so they are measured in a common scale (the number of standard deviations away from a mean).

• With Z scores, you can change the values to the number of standard deviations away from the mean by subtracting the mean and dividing by the SD.

• Guess what…?


KP’s Correlation

• His little adjustment forced the covariance to fall between -1 and 1.

• If you can perfectly predict y given a score on x, you will have a score of 1 or -1.

1 if as x increases, y increases

-1 if as x increases, y decreases

• If you can’t use x to make a linear prediction about y, the correlation is 0.

$$\rho(x, y) = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x) \cdot \operatorname{var}(y)}}$$
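The formula can be sketched in Python as follows (the `pearson_r` helper is an assumption written for this example, using the 18 values from the Weighted Mean slide). It also checks the claims above: rescaling leaves r alone, and a perfect straight-line relationship gives 1 or -1 depending on direction.

```python
import math


def pearson_r(xs, ys):
    """Covariance divided by the square root of the product of the variances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)


# The 18 x and y values from the Weighted Mean slide.
xs = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]
ys = [-1, 0, 0, 0, 2, 2, -1, -1, 2, 2, 2, 0, 0, 0, 0, 2, 2, 2]

r_original = pearson_r(xs, ys)
# Rescaling both variables changes the covariance but not the correlation.
r_rescaled = pearson_r([10 * x for x in xs], [10 * y for y in ys])
# Perfect straight-line relationships, one rising and one falling.
r_perfect_up = pearson_r(xs, [2 * x + 1 for x in xs])
r_perfect_down = pearson_r(xs, [-3 * x for x in xs])

print(round(r_original, 3), round(r_rescaled, 3))
print(round(r_perfect_up, 3), round(r_perfect_down, 3))
```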


KP’s Correlation is Linear

• It is critically important to remember that r is for measuring linear relations only.

– Are there patterns that are not linear in medicine?

• Keep in mind that the statistic is being driven by a couple of means.

– Are means sensitive to outliers?

– What could possibly go wrong?


Scatter Plot for Correlations

[Figure: four scatter plots, each with x axis 0 to 20 and y axis 0 to 15]

All have r² = .67

Anscombe 1973, Graphs in Statistical Analysis
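If you want to verify the r² claim yourself, here is a sketch in Python. The data values are Anscombe's four data sets as they are commonly reproduced (typed in here from memory of the published table, so treat them as an assumption and check against the paper); the `pearson_r` helper is written for this example.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Anscombe's quartet as commonly reproduced; sets 1 to 3 share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

r_squared = [pearson_r(xs, ys) ** 2 for xs, ys in quartet]
for i, value in enumerate(r_squared, start=1):
    print(f"data set {i}: r^2 = {value:.2f}")
```

The numbers agree, yet the plots look nothing alike, which is the reason to always draw the scatter plot.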


Non-linear Patterns

• You can easily model curves in data. Anybody remember the formula for a parabola? It’s something like this:

y = a + b * Score + c * Score²


A Bad Fit

• What happens when you fit a straight linear model to curvilinear data?

[Figure: two scatter plots of Y = Size against X = Age (x axis 0 to 50, y axis 0 to 140); a residual is marked on one panel]

Is this better than a flat line at the mean?

Data from Statistical Computing


Is it good?

• A tiny p-value does not mean a good model!

• Where on the output does it tell that this is a good or a poor model?


Residuals?

[Figure: scatter plot of Y = Size against X = Age (x axis 0 to 50, y axis 0 to 140)]

Flatten the line, then look up and down to see if you are systematically off.
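To see what "systematically off" looks like in numbers, here is a sketch in Python with made-up curved data (y = x², purely hypothetical). A straight line is fit and the residuals are listed in order of x; the `fit_line` helper is written for this example.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical curved data: y grows with the square of x.
xs = list(range(11))
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
# Residual = observed value minus the value the line predicts.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

for x, res in zip(xs, residuals):
    print(f"x = {x:2d}  residual = {res:+.1f}")
```

The residuals are positive at both ends and negative in the middle. They still add up to zero, so the pattern rather than the total is what gives the poor fit away.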


Curve Fitting!

• You can build a model that has a curve using a polynomial. The degree of the polynomial determines how many “bends” appear in a curve. So a 2nd degree polynomial would use x and x² while a 3rd degree polynomial would use x and x² and x³. These squared or cubed values don’t do anything especially complicated. They are just like adding new variables.


Polynomials

[Figure: scatter plot of Y = Size against X = Age (x axis 0 to 50, y axis 0 to 140)]

size = intercept + X * something + X² * something else

[Figure: scatter plot of Y = Size against X = Age (x axis 0 to 50, y axis 0 to 140)]

size = intercept + X * something + X² * something else + X³ * another thing

poly2 = lm(y~poly(x,2))

poly3 = lm(y~poly(x,3))
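To show that the squared term really is "just like adding a new variable", here is a sketch in Python of a polynomial least-squares fit done by hand. Everything in it is an assumption for illustration: the `fit_poly` and `residual_ss` helpers, the normal-equations approach, and the made-up data. Note that R's `poly()` uses orthogonal polynomials by default, so its coefficients will differ from these raw-power coefficients even though the fitted curve is the same.

```python
def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients, lowest power first."""
    m = degree + 1
    # Augmented normal-equation matrix [X'X | X'y] for columns 1, x, x^2, ...
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)]
         + [sum((x ** i) * y for x, y in zip(xs, ys))]
         for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][m] / a[i][i] for i in range(m)]


def residual_ss(xs, ys, coefs):
    """Sum of squared differences between the data and the fitted polynomial."""
    return sum((y - sum(c * x ** k for k, c in enumerate(coefs))) ** 2
               for x, y in zip(xs, ys))


# Hypothetical data that follow an exact parabola.
xs = list(range(11))
ys = [5 + 2 * x - 0.3 * x ** 2 for x in xs]

line_coefs = fit_poly(xs, ys, 1)
quad_coefs = fit_poly(xs, ys, 2)
rss_line = residual_ss(xs, ys, line_coefs)
rss_quad = residual_ss(xs, ys, quad_coefs)

print([round(c, 3) for c in quad_coefs])
print(rss_quad < rss_line)
```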


What is a good fit?

• Choosing where to stop adding terms to a model is as much an art as a science. You can do comparisons between the models and ask to see if it is a statistically significant difference.

• There are criteria, like AIC, that penalize your model as you add more and more terms to it.


Comparing Models the Hard Way

• If you have R, it can compare models easily. In SAS you have to do this by hand.

Residual in 2nd model

Differences between models
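The by-hand comparison can be sketched in Python as follows. The data are made up, the helper names are assumptions, and the AIC shown is the usual least-squares form n·ln(RSS/n) + 2k, which leaves out a constant that is the same for every model fit to the same data. The F-ratio divides the drop in residual sum of squares per added term by the residual mean square of the bigger model, which is the comparison R's `anova()` reports for nested linear models.

```python
import math


def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients, lowest power first."""
    m = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)]
         + [sum((x ** i) * y for x, y in zip(xs, ys))]
         for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][m] / a[i][i] for i in range(m)]


def residual_ss(xs, ys, coefs):
    return sum((y - sum(c * x ** k for k, c in enumerate(coefs))) ** 2
               for x, y in zip(xs, ys))


def aic(rss, n_obs, n_coefs):
    """AIC for a least-squares fit, up to a constant shared by all models."""
    return n_obs * math.log(rss / n_obs) + 2 * n_coefs


# Hypothetical data with a gentle curve plus some scatter.
xs = list(range(12))
ys = [1.2, 2.9, 5.1, 6.2, 8.4, 8.9, 10.3, 10.6, 11.5, 11.2, 11.9, 11.4]
n = len(xs)

rss_line = residual_ss(xs, ys, fit_poly(xs, ys, 1))  # smaller model: 2 coefficients
rss_quad = residual_ss(xs, ys, fit_poly(xs, ys, 2))  # bigger model: 3 coefficients

# Differences between models, per added term, over the residual in the 2nd model.
extra_terms = 1
df_residual_quad = n - 3
f_ratio = ((rss_line - rss_quad) / extra_terms) / (rss_quad / df_residual_quad)

aic_line = aic(rss_line, n, 2)
aic_quad = aic(rss_quad, n, 3)
print(round(f_ratio, 1), round(aic_line, 1), round(aic_quad, 1))
```

A large F-ratio (checked against an F table with 1 and n − 3 degrees of freedom) or a lower AIC both point toward keeping the extra term.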


EG Example

• There is an example on the class website that shows how to do the regression in SAS. Unfortunately, SAS is not built to compare models as easily as R.


Lowess

• R can easily fit localized regression or spline curves. I like these for detecting non-linearity in data.

[Figure: scatter plot of y against x (x axis 0 to 50, y axis 0 to 140)]
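To give a sense of what "localized regression" means, here is a simplified sketch in Python. It is not R's `lowess` (it skips the robustness passes and is written from scratch for illustration; the function name and `frac` parameter are assumptions). For each point it fits a straight line to nearby points only, weighting close neighbours more heavily with the tricube weight, and keeps the fitted value at that point.

```python
import math


def local_linear_smooth(xs, ys, frac=0.5):
    """Simplified lowess: one tricube-weighted straight-line fit per point."""
    n = len(xs)
    k = max(3, math.ceil(frac * n))  # number of neighbours in each local window
    smoothed = []
    for x0 in xs:
        distances = sorted(abs(x - x0) for x in xs)
        h = distances[min(k, n) - 1]  # distance to the k-th nearest point
        weights = [(1 - (abs(x - x0) / h) ** 3) ** 3 if abs(x - x0) < h else 0.0
                   for x in xs]
        total = sum(weights)
        mean_x = sum(w * x for w, x in zip(weights, xs)) / total
        mean_y = sum(w * y for w, y in zip(weights, ys)) / total
        sxy = sum(w * (x - mean_x) * (y - mean_y)
                  for w, x, y in zip(weights, xs, ys))
        sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
        # Fall back to the local weighted mean if the window has no spread in x.
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(mean_y + slope * (x0 - mean_x))
    return smoothed


# Hypothetical data: an exact straight line and an exact curve.
xs = list(range(11))
line_ys = [2 * x + 1 for x in xs]
curve_ys = [(x - 5) ** 2 for x in xs]

smooth_line = local_linear_smooth(xs, line_ys)
smooth_curve = local_linear_smooth(xs, curve_ys)
print([round(v, 1) for v in smooth_curve])
```

On straight-line data the smoother hands back the line; on curved data it bends with the points, which is what makes it handy for spotting non-linearity.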


Nonlinear Correlation Coefficient

• Spearman’s Rho (i.e., ρ) is the non-parametric version of Pearson’s r. It is essentially the same statistic only it works on the rank ordered values.

– You will see differences if there are lots of ties.
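"The same statistic on the ranks" can be sketched in Python like this (helper names and the cubic example data are assumptions for illustration). Tied values share the average of the ranks they would have occupied, which is the usual convention.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(xs, ys):
    """Pearson's r computed on the rank-ordered values."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Hypothetical data: y always rises with x, but not in a straight line.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x ** 3 for x in xs]

rho = spearman_rho(xs, ys)
r = pearson_r(xs, ys)
tied_ranks = average_ranks([10, 20, 20, 30])
print(round(rho, 3), round(r, 3), tied_ranks)
```

Because the ranks line up perfectly, ρ is 1 even though the curved relationship keeps Pearson's r below 1.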


Conditional Changes in Variance

• X is partially explaining the variance in Y.

• If there is no correlation (R² = 0) the SD around Y is just as wide as when you ignored X.

• With a perfect correlation (R² = 1), the SD of Y is reduced to 0 around the regression line.

• The percentage of reduction of the SD is small unless R² is big.


R      SD(Y|X) / SD(Y)
0.1    0.995
0.3    0.95
0.5    0.87
0.7    0.71
0.9    0.43

From Biostatistics: The Bare Essentials. Norman & Streiner
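The table values follow from the relationship SD(Y|X) / SD(Y) = √(1 − R²). A short sketch in Python (the `sd_ratio` function name is an assumption) reproduces them to within rounding.

```python
import math


def sd_ratio(r):
    """SD(Y|X) / SD(Y) for a linear relationship with correlation r."""
    return math.sqrt(1 - r ** 2)


ratios = {r: sd_ratio(r) for r in (0.1, 0.3, 0.5, 0.7, 0.9)}
for r, ratio in ratios.items():
    print(f"R = {r:.1f}  SD(Y|X)/SD(Y) = {ratio:.3f}")
```

Even at R = 0.5 the spread around the line is still about 87% of the original spread, which is the point of the slide.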


Caution on Interpretation of R2

• Amount of variance explained:

• You can get tiny p-values quickly as your sample size goes up.

– Sample size = 10, two tailed p < .05 with r around .63
    • You have statistical significance but it only explains about 40% of the variability.

– Sample size = 30, two tailed p < .05 with r of .36
    • You have statistical significance but it only explains about 13% of the variability.

– Sample size = 50, two tailed p < .05 with r of .28
    • You have statistical significance but it only explains about 8% of the variability.

– Sample size = 250, two tailed p < .05 with r of .124
    • You have statistical significance but it only explains less than 2% of the variability.
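Thresholds like these come from the t-test on r with n − 2 degrees of freedom: the smallest significant |r| is t / √(t² + df). Here is a sketch in Python; the function name is an assumption, and the t critical values are typed in from a standard two-tailed .05 t table rather than computed.

```python
import math


def critical_r(n, t_critical):
    """Smallest |r| reaching significance, given the t critical value for n - 2 d.f."""
    df = n - 2
    return t_critical / math.sqrt(t_critical ** 2 + df)


# Two-tailed .05 critical values of t for n - 2 d.f., from a standard t table.
t_table = {10: 2.306, 30: 2.048, 50: 2.011, 250: 1.970}

results = {}
for n, t_crit in t_table.items():
    r = critical_r(n, t_crit)
    results[n] = (r, r ** 2)
    print(f"n = {n:3d}  critical r = {r:.3f}  variance explained = {r ** 2:.1%}")
```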


ThoS (The Theory of the Stork)

• There is a well documented correlation between the stork population in Germany and the number of births outside of hospitals!

• Thomas Höfer, Hildegard Przyrembel and Silvia Verleger in Paediatric and Perinatal Epidemiology, 2004 Jan;18(1):88-92.

• Similar patterns, with reductions in stork populations in Scandinavia and Scandinavian birth rates, have been documented.


Causality

• Do not assume that if you change one value, the other value will change.

– There is a strong correlation between height and weight in adult males.

– Therefore, if I eat more I should get taller!


Extrapolating

• If you see a pattern within the range of data you studied, do not blindly assume the same pattern will exist outside the range of values you studied.