Lecture 12: Correlation and linear regression


Page 1: Lecture 12 Correlation  and  linear regression

Lecture 12: Correlation and linear regression

The regression line: $y = ax + b$

Sum of squared deviations:
$$D = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left[y_i - (ax_i + b)\right]^2$$

Setting the partial derivatives to zero:
$$\frac{\partial D}{\partial a} = -2\sum_{i=1}^{n} x_i\,(y_i - ax_i - b) = 0 \qquad \frac{\partial D}{\partial b} = -2\sum_{i=1}^{n} (y_i - ax_i - b) = 0$$

Solving for the parameters:
$$a = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \qquad b = \bar{y} - a\bar{x}$$

Since $y = ax + b$ and $\bar{y} = a\bar{x} + b$, the fitted line passes through the centroid: $y - \bar{y} = a(x - \bar{x})$.

The least squares method of Carl Friedrich Gauß.
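A minimal numerical sketch of these closed-form formulas (hypothetical data, not from the lecture):

```python
import numpy as np

# Hypothetical example data; the beetle data of the later slides are not used here.
x = np.array([2.0, 5.0, 8.0, 11.0, 14.0, 17.0])
y = np.array([3.1, 5.9, 9.2, 11.8, 15.1, 17.7])

n = len(x)
# Slope and intercept from the least-squares solution derived above.
a = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b = y.mean() - a * x.mean()
print(f"y = {a:.3f}x + {b:.3f}")
```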

[Figure: scatter plot of Y against X with the ordinary least squares regression line OLRy; the fit minimizes the vertical deviations Δy, i.e. the sum of Δy².]

Page 2: Lecture 12 Correlation  and  linear regression

$$a = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2} = \frac{s_{xy}}{s_x^2}$$

Covariance

Variance

Correlation coefficient

$$r = \frac{s_{xy}}{s_x\, s_y} \qquad r^2 = \frac{s_{xy}^2}{s_x^2\, s_y^2}$$

Coefficient of determination

$$R^2 = \frac{\text{Explained variance}}{\text{Total variance}}$$

$$s_{xy} = a\,s_x^2 = r\,s_x s_y \quad\Rightarrow\quad r = a\,\frac{s_x}{s_y}$$

Slope a and coefficient of correlation r are zero if the covariance is zero.

$$-1 \le r \le 1 \qquad 0 \le r^2 \le 1$$
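A short sketch (same hypothetical data as above) showing how covariance, variances, r, and the slope hang together; note that the 1/(n−1) factor cancels in the slope ratio:

```python
import numpy as np

x = np.array([2.0, 5.0, 8.0, 11.0, 14.0, 17.0])
y = np.array([3.1, 5.9, 9.2, 11.8, 15.1, 17.7])

n = len(x)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)  # covariance
s_x, s_y = x.std(ddof=1), y.std(ddof=1)                   # standard deviations

a = s_xy / s_x**2          # regression slope
r = s_xy / (s_x * s_y)     # correlation coefficient; note r = a * s_x / s_y
print(a, r, r**2)          # r**2 is the coefficient of determination
```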

Page 3: Lecture 12 Correlation  and  linear regression

[Figure: brachypterous species vs. macropterous species; fitted line y = 0.192x + 0.4671, R² = 0.1723.]

[Figure: dimorphic species vs. macropterous species; fitted line y = 0.3875x + 3.7188, R² = 0.4455.]

Relationships between macropterous, dimorphic and brachypterous ground beetles on 17 Mazurian lake islands.

Positive correlation; r = √R² = 0.41. The regression is weak: macropterous species richness explains only 17% of the variance in brachypterous species richness. Some islands have no brachypterous species. We do not really know which variable is the independent one, and there is no clear-cut logical connection.

Positive correlation; r = √R² = 0.67. The regression is moderate: macropterous species richness explains only 45% of the variance in dimorphic species richness. The relationship appears to be non-linear; a log transformation is indicated (no zero counts). We do not really know which variable is the independent one, and there is no clear-cut logical connection.

Page 4: Lecture 12 Correlation  and  linear regression

[Figure: brachypterous species vs. isolation; fitted line y = -36.203x + 5.5585, R² = 0.2311.]

[Figure: brachypterous species vs. ln area; fitted line y = 0.4894x + 22.094, R² = 0.0037.]

Negative correlation; r = -√R² = -0.48. The regression is weak: island isolation explains only 23% of the variance in brachypterous species richness. There are two apparent outliers; without them the whole relationship would vanish, i.e. R² ≈ 0. Outliers have to be eliminated from regression analysis. Here we have a clear hypothesis about the logical relationship: isolation should be a predictor of species richness.

No correlation; r = √R² = 0.06. The regression slope is nearly zero: area explains less than 1% of the variance in brachypterous species richness. Here too we have a clear hypothesis about the logical relationship: area should be a predictor of species richness.

Page 5: Lecture 12 Correlation  and  linear regression

The matrix perspective

[Figure: brachypterous species vs. macropterous species; fitted line y = 0.192x + 0.4671, R² = 0.1723.]

In matrix form the regression model is $\mathbf{Y} = \mathbf{X}\mathbf{a}$, where $\mathbf{Y}$ is the vector of brachypterous species counts, $\mathbf{X}$ is the matrix whose first column holds the macropterous counts and whose second column is a constant column of ones, and $\mathbf{a} = (a_1, a_0)$ holds the slope and the intercept.

$\mathbf{X}$ is not square, so it does not possess an inverse. Multiplying by $\mathbf{X}^T$ first gives a square, invertible matrix:

$$\mathbf{Y} = \mathbf{X}\mathbf{a} \;\Rightarrow\; \mathbf{X}^T\mathbf{Y} = \mathbf{X}^T\mathbf{X}\,\mathbf{a} \;\Rightarrow\; (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = (\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{X}^T\mathbf{X})\,\mathbf{a} = \mathbf{I}\mathbf{a} = \mathbf{a}$$

Brachy  Macro  Constant
4       7      1
6       12     1
3       13     1
4       18     1
1       10     1
4       14     1
2       7      1
5       22     1
1       9      1
0       7      1
0       15     1
0       13     1
1       8      1
4       10     1
2       8      1
6       14     1
2       6      1

Transpose $\mathbf{X}^T$:
Macro     7  12  13  ...
Constant  1   1   1  ...

Dispersion matrix and its inverse:
$$\mathbf{X}^T\mathbf{X} = \begin{pmatrix} 2499 & 193 \\ 193 & 17 \end{pmatrix} \qquad (\mathbf{X}^T\mathbf{X})^{-1} = \begin{pmatrix} 0.003248 & -0.03687 \\ -0.03687 & 0.477455 \end{pmatrix}$$

$$\mathbf{X}^T\mathbf{Y} = \begin{pmatrix} 570 \\ 45 \end{pmatrix} \qquad \mathbf{a} = \begin{pmatrix} a_1 \\ a_0 \end{pmatrix} = \begin{pmatrix} 0.192014 \\ 0.467138 \end{pmatrix}$$

$$\mathbf{Y} = \mathbf{X}\mathbf{a} \qquad \mathbf{a} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$
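A sketch of the same normal-equation computation in NumPy, using the Brachy/Macro counts from the table above; it should reproduce the coefficients a1 ≈ 0.192 and a0 ≈ 0.467 reported on this slide:

```python
import numpy as np

# Brachypterous (response) and macropterous (predictor) counts for the 17 islands.
brachy = np.array([4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2], dtype=float)
macro  = np.array([7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6], dtype=float)

X = np.column_stack([macro, np.ones_like(macro)])  # predictor column plus constant column
Y = brachy

# a = (X^T X)^-1 X^T Y
a = np.linalg.inv(X.T @ X) @ X.T @ Y
print(a)  # approximately [0.192, 0.467]
```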

Page 6: Lecture 12 Correlation  and  linear regression

[Figure: brachypterous species vs. macropterous species; fitted line y = 0.192x + 0.4671, R² = 0.1723.]

Dispersion matrix:
$$\mathbf{X}^T\mathbf{X} = \begin{pmatrix} 2499 & 193 \\ 193 & 17 \end{pmatrix}$$

$$\mathbf{X}^T\mathbf{X} = \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i\,\text{const} \\ \sum_{i=1}^{n} \text{const}\,x_i & \sum_{i=1}^{n} \text{const}^2 \end{pmatrix}$$

With const = 1 the entries are $\sum x_i^2$, $\sum x_i = n\bar{x}$, and $n$, which link the dispersion matrix to the variance:

$$s_x^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$$

(Data matrix with columns Brachy, Macro, Constant as shown on Page 5.)

Covariance:
$$s_{xy} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \frac{1}{n}\left(\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right)$$

The entries of $\mathbf{X}^T\mathbf{X}$ therefore carry the information needed for the variances and the covariance.

Page 7: Lecture 12 Correlation  and  linear regression

Raw data matrix $\mathbf{X}$ (columns Brachy, Macro, Constant; 17 islands, as on Page 5) and matrix of arithmetic means $\mathbf{M}$ (each row = 2.64706, 11.3529, 1). Dispersion matrix and squared means:

$$\mathbf{X}^T\mathbf{X} = \begin{pmatrix} 185 & 570 & 45 \\ 570 & 2499 & 193 \\ 45 & 193 & 17 \end{pmatrix} \qquad \mathbf{M}^T\mathbf{M} = \begin{pmatrix} 119.12 & 510.88 & 45 \\ 510.88 & 2191.1 & 193 \\ 45 & 193 & 17 \end{pmatrix}$$

$$\mathbf{X}^T\mathbf{X} - \mathbf{M}^T\mathbf{M} = \begin{pmatrix} 65.882 & 59.118 & 0 \\ 59.118 & 307.88 & 0 \\ 0 & 0 & 0 \end{pmatrix} \qquad \frac{1}{17}\left(\mathbf{X}^T\mathbf{X} - \mathbf{M}^T\mathbf{M}\right) = \begin{pmatrix} 3.8754 & 3.4775 & 0 \\ 3.4775 & 18.111 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$

The last matrix (rows and columns: Brachy, Macro, Constant) holds the variances on the diagonal and the covariances off the diagonal.

$$\boldsymbol{\Sigma} = \frac{1}{n}\left(\mathbf{X}^T\mathbf{X} - \mathbf{M}^T\mathbf{M}\right) = \frac{1}{n}(\mathbf{X}-\mathbf{M})^T(\mathbf{X}-\mathbf{M})$$

$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2 \end{pmatrix}$$

The covariance matrix is square and symmetric: variances on the diagonal, covariances off the diagonal.

$$r_{xy} = \frac{\sigma_{xy}}{\sigma_x\,\sigma_y} \qquad \mathbf{R} = \mathbf{V}\,\boldsymbol{\Sigma}\,\mathbf{V}, \quad \mathbf{V} = \begin{pmatrix} 1/\sigma_x & 0 \\ 0 & 1/\sigma_y \end{pmatrix}$$

With $\mathbf{V} = \mathrm{diag}(1/s_{Brachy}, 1/s_{Macro}) = \mathrm{diag}(0.508, 0.235)$:

$$\boldsymbol{\Sigma} = \begin{pmatrix} 3.8754 & 3.4775 \\ 3.4775 & 18.111 \end{pmatrix} \qquad \mathbf{V}\boldsymbol{\Sigma} = \begin{pmatrix} 1.9686 & 1.7665 \\ 0.8171 & 4.2557 \end{pmatrix}$$

$$\mathbf{R} = \mathbf{V}\boldsymbol{\Sigma}\mathbf{V} = \begin{pmatrix} 1 & 0.4151 \\ 0.4151 & 1 \end{pmatrix} \qquad r = 0.4151, \quad r^2 = 0.1723$$
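The same computation as a NumPy sketch, using the Brachy/Macro counts from the data table; dividing by n = 17 as on this slide, it should reproduce Σ and r ≈ 0.4151:

```python
import numpy as np

brachy = np.array([4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2], dtype=float)
macro  = np.array([7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6], dtype=float)

X = np.column_stack([brachy, macro])
n = len(X)
M = X.mean(axis=0)                          # arithmetic means (2.647, 11.353)

Sigma = (X - M).T @ (X - M) / n             # Sigma = (1/n) (X - M)^T (X - M)

V = np.diag(1.0 / np.sqrt(np.diag(Sigma)))  # V = diag(1/s_Brachy, 1/s_Macro)
R = V @ Sigma @ V                           # correlation matrix
print(Sigma)                                # approx [[3.875, 3.478], [3.478, 18.111]]
print(R[0, 1], R[0, 1]**2)                  # approx 0.4151 and 0.1723
```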

Page 8: Lecture 12 Correlation  and  linear regression

Non-linear relationships

[Figure: species vs. individuals; linear fit y = 0.0056x + 24.305, R² = 0.2963.]

The species–individuals relationship is obviously non-linear.

Ground beetles on Mazurian lake islands

[Figure: species vs. individuals on log axes; logarithmic fit y = 6.0987 ln(x) - 8.3513, R² = 0.6003.]

[Figure: species vs. individuals; power fit y = 6.7337x^0.2306, R² = 0.67, shown on linear and on log–log axes.]

Three fits: linear function, logarithmic function, power function.

The power function in logarithmic form:
$$S = 6.733\,I^{0.2308} \;\Rightarrow\; \ln S = \ln(6.733) + 0.2308 \ln I = 1.907 + 0.2308 \ln I$$
(intercept 1.907, slope 0.2308)

The power function has the highest R² and therefore explains most of the variance in species richness. The coefficient of determination is a measure of goodness of fit.
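A sketch of fitting the power function by ordinary least squares on the log–log scale (hypothetical counts; the real island data appear on the next slides):

```python
import numpy as np

# Hypothetical species (S) and individuals (I) counts for illustration only.
individuals = np.array([50.0, 150.0, 400.0, 900.0, 2000.0, 3500.0])
species     = np.array([14.0, 22.0, 28.0, 33.0, 40.0, 46.0])

ln_I, ln_S = np.log(individuals), np.log(species)
n = len(ln_I)

# Linear regression of ln S on ln I:  ln S = ln c + z * ln I
z = (np.sum(ln_I * ln_S) - n * ln_I.mean() * ln_S.mean()) / (np.sum(ln_I**2) - n * ln_I.mean()**2)
ln_c = ln_S.mean() - z * ln_I.mean()
print(f"S = {np.exp(ln_c):.3f} * I^{z:.3f}")  # back-transformed power function
```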

Page 9: Lecture 12 Correlation  and  linear regression

Having more than one predictor

[Block diagram: Individuals, Area, and Isolation as predictors of Species.]

Describe species richness as a function of the number of individuals, the area, and the isolation of the islands.

We need a clear hypothesis about which variables are dependent and which are independent. Use a block diagram.

Island  Species  Individuals  Area   Isolation
1pog    13       55           0.01   0.088719
2pog    24       149          0.9    0.088592
3pog    31       206          2.1    0.081131
cor     29       3450         6.84   0.089384
dab     31       505          10     0.080644
ful     37       996          9.9    0.094508
gil     54       1895         10     0.093676
guc     27       476          0.92   0.097195
hel     25       325          2.3    0.088938
lip     30       459          4.19   0.088367
mil     34       1410         0.2    0.089204
sos     33       829          20.09  0.087405
swi     34       1704         2.08   0.096915
ter     16       91           0.03   0.085875
wil     21       102          1      0.096584
wron    28       342          0.15   0.01
wros    21       258          0.15   0.01

Page 10: Lecture 12 Correlation  and  linear regression

[Block diagram: Area, Isolation, and Individuals as predictors of Species; Individuals itself depends on Area and Isolation.]

The predictors are not independent: the number of individuals depends on area and degree of isolation.

We need linear relationships. We therefore use ln-transformed variables for species, area, and individuals.

Check for multicollinearity using a correlation matrix; check for non-linearities using plots.

Of the predictors, area and individuals are highly correlated. The correlation between area and individuals is highly significant: the probability of H0 is 0.004.

In linear regression analysis, correlations between predictors below 0.7 are acceptable.

Collinearity
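A sketch of such a multicollinearity check with NumPy, using the raw island data from the table on Page 9 (ln-transforming individuals and area first); the ln Ind – ln Area pair is the one to watch:

```python
import numpy as np

individuals = np.array([55, 149, 206, 3450, 505, 996, 1895, 476, 325, 459,
                        1410, 829, 1704, 91, 102, 342, 258], dtype=float)
area        = np.array([0.01, 0.9, 2.1, 6.84, 10, 9.9, 10, 0.92, 2.3, 4.19,
                        0.2, 20.09, 2.08, 0.03, 1, 0.15, 0.15])
isolation   = np.array([0.088719, 0.088592, 0.081131, 0.089384, 0.080644, 0.094508,
                        0.093676, 0.097195, 0.088938, 0.088367, 0.089204, 0.087405,
                        0.096915, 0.085875, 0.096584, 0.01, 0.01])

# Correlation matrix of the predictors (rows: ln Ind, ln Area, Isolation).
predictors = np.vstack([np.log(individuals), np.log(area), isolation])
print(np.corrcoef(predictors))
```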

Page 11: Lecture 12 Correlation  and  linear regression

The final data for our analysis

The model

$$\ln S = a_0 + a_1 \ln(\text{Ind}) + a_2 \ln(\text{Area}) + a_3\,\text{Isolation}$$

$$\mathbf{Y} = \mathbf{X}\mathbf{a} \qquad \mathbf{a} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$

Multiple linear regression

The vector Y contains the response variable.

The matrix X contains the effect (predictor) variables

Island  ln_S      Constant  ln_Ind    ln_Area    Isolation
1pog    2.564949  1         4.007333  -4.60517   0.088719
2pog    3.164068  1         5.003946  -0.10536   0.088592
3pog    3.427515  1         5.327876   0.741937  0.081131
cor     3.366817  1         8.14613    1.922788  0.089384
dab     3.443352  1         6.224558   2.302585  0.080644
ful     3.609114  1         6.903747   2.292535  0.094508
gil     3.985008  1         7.546974   2.302585  0.093676
guc     3.294602  1         6.165418  -0.08338   0.097195
hel     3.236061  1         5.783825   0.832909  0.088938
lip     3.401197  1         6.12905    1.432701  0.088367
mil     3.521447  1         7.251345  -1.60944   0.089204
sos     3.483143  1         6.72022    3.000222  0.087405
swi     3.531251  1         7.440734   0.732368  0.096915
ter     2.772589  1         4.51086   -3.50656   0.085875
wil     3.060271  1         4.624973   0         0.096584
wron    3.332205  1         5.834811  -1.89712   0.01
wros    3.020425  1         5.55296   -1.89712   0.01

The predictor variables have to contain different information.

If the columns of X are linearly dependent (X is singular), XᵀX has no inverse.
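A sketch of the multiple regression with NumPy, building the design matrix from the raw island data (Page 9) and solving the normal equations with a least-squares solver; the resulting coefficients should be close to the model reported on the next slide:

```python
import numpy as np

species     = np.array([13, 24, 31, 29, 31, 37, 54, 27, 25, 30, 34, 33, 34, 16, 21, 28, 21], dtype=float)
individuals = np.array([55, 149, 206, 3450, 505, 996, 1895, 476, 325, 459,
                        1410, 829, 1704, 91, 102, 342, 258], dtype=float)
area        = np.array([0.01, 0.9, 2.1, 6.84, 10, 9.9, 10, 0.92, 2.3, 4.19,
                        0.2, 20.09, 2.08, 0.03, 1, 0.15, 0.15])
isolation   = np.array([0.088719, 0.088592, 0.081131, 0.089384, 0.080644, 0.094508,
                        0.093676, 0.097195, 0.088938, 0.088367, 0.089204, 0.087405,
                        0.096915, 0.085875, 0.096584, 0.01, 0.01])

Y = np.log(species)
X = np.column_stack([np.ones(len(Y)), np.log(individuals), np.log(area), isolation])

# a = (X^T X)^-1 X^T Y, computed with the numerically safer least-squares solver.
a, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(a)  # a0, a1 (ln Ind), a2 (ln Area), a3 (Isolation)
```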

Page 12: Lecture 12 Correlation  and  linear regression

$$\ln S = 2.48 + 0.15 \ln(\text{Ind}) + 0.07 \ln(\text{Area}) - 0.91\,\text{Isolation}$$

The model explains 78.6% of the variance in species richness; 21.4% of the variance remains unexplained.

The probability that R² is zero is only 0.01%. With 99.99% probability R² > 0, hence the model is statistically significant.

The t-tests give the probabilities that the coefficients deviate from zero only by chance. Isolation is not a significant predictor.

$$t = \frac{\text{Coefficient}}{\text{Standard error}} \qquad F = t^2 = \frac{(n-2)\,r^2}{1-r^2}$$

Page 13: Lecture 12 Correlation  and  linear regression

[Figure: scatter of Y against X with the two ordinary least squares lines, OLRy (y on x) and OLRx (x on y), and the deviations Δy² and Δx² they minimize.]

$$a_{OLRy} = \frac{s_{xy}}{s_x^2} \qquad a_{OLRx} = \frac{s_y^2}{s_{xy}}$$

Model I regression

$$a_{OLRx} \cdot a_{OLRy} = \frac{s_y^2}{s_{xy}} \cdot \frac{s_{xy}}{s_x^2} = \frac{s_y^2}{s_x^2}$$

What distance to minimize?


Page 14: Lecture 12 Correlation  and  linear regression

$$a_{RMA} = \sqrt{a_{OLRy}\,a_{OLRx}} = \sqrt{\frac{s_{xy}}{s_x^2}\cdot\frac{s_y^2}{s_{xy}}} = \frac{s_y}{s_x} \qquad a_{RMA} = \frac{s_y}{s_x} = \frac{a_{OLRy}}{r}$$

The reduced major axis (RMA) regression slope is the geometric mean of $a_{OLRy}$ and $a_{OLRx}$.

Model II regression

$$|a_{RMA}| \ge |a_{OLRy}| \quad \text{(since } |r| \le 1\text{)}$$

[Figure: scatter of Y against X with the OLRy, OLRx, and RMA regression lines and the deviations Δy² and Δx².]
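A sketch of the RMA slope with the beetle counts from Page 5; the RMA slope comes out as a_OLRy / r, i.e. steeper than the model I slope (the sign convention follows the sign of the covariance):

```python
import numpy as np

brachy = np.array([4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2], dtype=float)
macro  = np.array([7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6], dtype=float)

x, y = macro, brachy
s_x, s_y = x.std(ddof=1), y.std(ddof=1)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
r = s_xy / (s_x * s_y)

a_olry = s_xy / s_x**2                 # model I (OLRy) slope, approx 0.192
a_rma  = np.sign(r) * s_y / s_x        # model II (RMA) slope = a_olry / r
print(a_olry, a_rma, a_olry / r)       # the last two values coincide
```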

Page 15: Lecture 12 Correlation  and  linear regression

PAST: standard output of linear regression and reduced major axis.

Parameters and standard errors

Parametric probability for r = 0

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, \quad df = n-2 \qquad F = t^2 = \frac{(n-2)\,r^2}{1-r^2}$$
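A sketch of this parametric test with SciPy, plugging in the observed r ≈ 0.415 and n = 17 from the Brachy–Macro example; the two-sided p comes out above 0.05, in line with the slide's conclusion:

```python
import numpy as np
from scipy import stats

r, n = 0.415, 17                      # observed correlation and sample size (Brachy vs. Macro)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
df = n - 2
p = 2 * stats.t.sf(abs(t), df)        # two-sided parametric probability for r = 0
print(t, df, p)                       # p > 0.05: the correlation is not significant
```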

We do not have a clear hypothesis about the causal relationships. In this case RMA is indicated.

Permutation test for statistical significance

Both tests indicate that Brachy and Macro are not significantly correlated. The RMA regression slope is not significant.

Page 16: Lecture 12 Correlation  and  linear regression

[Table: the original Macro and Brachy counts together with five example randomizations of the Macro column (Los() columns of random numbers and the reshuffled Macro values); the five randomizations gave r = 0.099, -0.055, 0.303, 0.359, and -0.041.]

N = 1000, observed r = 0.415, mean permuted r = 0.061
Lower CL = -0.538, Upper CL = 0.768

Permutation test for statistical significance

Randomize x or y 1000 times and calculate r each time. Plot the resulting distribution and calculate the lower and upper confidence limits.
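A sketch of this permutation test with NumPy, using the Brachy/Macro counts; the confidence limits are read off at rank positions 25 and 975, as described below:

```python
import numpy as np

rng = np.random.default_rng(1)        # arbitrary seed for reproducibility

brachy = np.array([4, 6, 3, 4, 1, 4, 2, 5, 1, 0, 0, 0, 1, 4, 2, 6, 2], dtype=float)
macro  = np.array([7, 12, 13, 18, 10, 14, 7, 22, 9, 7, 15, 13, 8, 10, 8, 14, 6], dtype=float)

observed_r = np.corrcoef(macro, brachy)[0, 1]

# Shuffle one variable 1000 times and recompute r each time.
perm_r = np.sort([np.corrcoef(rng.permutation(macro), brachy)[0, 1] for _ in range(1000)])

lower_cl, upper_cl = perm_r[24], perm_r[974]   # rank positions 25 and 975
print(observed_r, lower_cl, upper_cl)          # observed r lies inside the limits -> not significant
```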

[Figure: histogram of the 1000 permuted correlation coefficients with the lower and upper confidence limits marked.]

Calculating confidence limits

Rank all 1000 coefficients of correlation and take the values at rank positions 25 and 975.

N × 2.5% = 25 and N × 97.5% = 975. The observed r is compared with the values at these rank positions.

Page 17: Lecture 12 Correlation  and  linear regression

The RMA regression has a much steeper slope. This slope is often intuitively better.

The coefficient of correlation is independent of the regression method

The 95% confidence limits of the regression slope mark the 95% probability that the true slope lies within these limits. The lower CL is negative, hence a zero slope lies within the 95% CL.

In OLRy regression, an insignificant slope also means that r and R² are insignificant.

Page 18: Lecture 12 Correlation  and  linear regression

[Figure: scatter plot of Y against X with the OLRy regression line and the deviations Δy², illustrating the influence of an outlier.]

Outliers have a disproportionate influence on correlation and regression. Outliers should be eliminated from regression analysis.

Instead of the Pearson coefficient of correlation, use Spearman's rank order correlation.

[Figure: scatter plot of the ranked data, Y against X.]

Spearman's rank correlation is the ordinary (Pearson) correlation computed on the ranked data.

r_Pearson = 0.79
r_Spearman = 0.77
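A sketch (hypothetical data with one outlier) showing that Spearman's coefficient is just the Pearson correlation of the ranks; scipy.stats.spearmanr gives the same value directly:

```python
import numpy as np
from scipy import stats

# Hypothetical data with one outlier in x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 40.0])
y = np.array([1.2, 2.1, 2.9, 4.3, 4.8, 6.5, 9.0])

r_pearson  = np.corrcoef(x, y)[0, 1]
r_spearman = np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1]  # Pearson on ranks
print(r_pearson, r_spearman)
print(stats.spearmanr(x, y).correlation)   # should match r_spearman
```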

Page 19: Lecture 12 Correlation  and  linear regression

Homework and literature

Refresh:

• Coefficient of correlation
• Pearson correlation
• Spearman correlation
• Linear regression
• Non-linear regression
• Model I and model II regression
• RMA regression

Prepare for the next lecture:

• F-test
• F-distribution
• Variance

Literature:

Łomnicki: Statystyka dla biologów
http://statsoft.com/textbook/