1
Chapter 10
Linear regression and correlation
Relationship between variables
2
Relationship between variables
Age and blood pressure
Nutrient level and growth of cells
Height and weight
To determine the strength of the relationship between two variables and to test whether it is statistically significant
3
Two-sample t test vs regression

The same data laid out two ways.

For a two-sample t test:

  Group 1   Group 2
  22        18
  21        15
  27        13
  34        20
  30        32

For a regression, with group as X:

  Observation   Group
  22            1
  21            1
  27            1
  34            1
  30            1
  18            2
  15            2
  13            2
  20            2
  32            2
4
Difference, variation and association analysis
  Relationship                 Variable Y   Variable X
  Two-sample t test            quantity     group (0, 1) (category)
  One-way ANOVA                quantity     group (A, B, C) (category)
  Regression and correlation   quantity     quantity
5
Sir Francis Galton (16 February 1822 – 17 January 1911)
Polymath: meteorology (the anti-cyclone and the first popular weather maps); psychology (synaesthesia); biology (the nature and mechanism of heredity); eugenics; criminology (fingerprints); statistics (regression and correlation).
8
Related but Different
Regression analysis:
one of the variables (e.g. blood pressure) is dependent on (caused by) the other, which is fixed and measured without error (e.g. age).
Correlation analysis:
both variables are experimental and measured with error (e.g. height and weight).
9
Regression analysis
The experimental data (repeated experiments):

  Recording number   X: temperature (°Celsius)   Y: heart rate (beats/minute)
  1                  2                            5
  2                  4                            11
  3                  6                            11
  4                  8                            14
  5                  10                           22
  6                  12                           23
  7                  14                           32
  8                  16                           29
  9                  18                           32
10
Correlation analysis
The experimental data (more individuals measured):

  Animal   X: length (cm)   Y: width (cm)
  1        10.7             5.8
  2        11.0             6.0
  3        9.5              5.0
  4        11.1             6.0
  5        10.3             5.3
  6        10.7             5.8
  7        9.9              5.2
  8        10.6             5.7
  9        10.0             5.3
  10       12.0             6.3
11
Regression analysis

Equation for a straight line:

Y = a + bX

If you know a and b, you can predict Y from X: this is the goal of regression.
12
Regression vs correlation
13
Concepts
Simple linear regression
Simple linear correlation
Correlation analysis based on ranks
14
Example
Consider the growth rate of a yeast colony and the nutrient level.
If you increase the nutrient level, the growth rate increases.
Growth rate is dependent on nutrient level, but nutrient level is NOT dependent on growth rate.
15
Growth rate is called the Dependent Variable and is given the symbol Y.
Nutrient level (the causal factor) is called the Independent Variable and is given the symbol X.
Variables in Regression
(Plot: growth rate Y against nutrient level X)
16
Simple linear model assumptions

a) X's are fixed and measured without error
b) The conditional means fall on a straight line: μ_{Y|X} = E(Y|X) = α + βX
c) Y_i = α + βX_i + ε_i, or Y_i = μ_{Y|X_i} + ε_i, with ε_i ~ i.i.d. N(0, σ²) (independent, identically distributed)
d) Homoscedastic: the error variance σ² is the same at every X

α, β: constant real numbers, β ≠ 0
18
General steps for simple linear regression analysis
① Graphing the data
② Fitting the best straight line
③ Testing whether the linear relationship is
statistically significant or not
19
① Graphing the data
② Fitting the best straight line
(Scatterplot panels: no relationship; a relationship but not straight-lined; a negative linear relationship; a positive linear relationship)

Which one? We need a criterion.
20
Example: Area of a yeast colony on successive days.
(Plot: area (y) against time in days (x), with the best-fit straight line; the slope b = H/L is the rise H over the run L, and a is the intercept at x = 0)
21
Problem

(Plot: area (y) against time in days (x))
How to estimate a and b?
22
Method

(Plot: the data points (X_i, Y_i) with the horizontal line Ŷ = Ȳ)

Fitting Ŷ = Ȳ to the data estimates E(Y) = μ_Y.

Total sum of squares for Y: SS_Total = Σ(Y_i − Ȳ)²
23
Method

(Plot: the data points (X_i, Y_i) with the fitted line Ŷ = a + bX)

Fitting Ŷ = a + bX to the data.

Residual error sum of squares: SS_E = Σ(Y_i − Ŷ_i)²

a and b should minimize the residual error.
24
Method

Decompose each deviation from the mean: Y_i − Ȳ = (Y_i − Ŷ_i) + (Ŷ_i − Ȳ)
25
Y_i − Ȳ = (Y_i − Ŷ_i) + (Ŷ_i − Ȳ)

Σ(Y_i − Ȳ)² = Σ[(Y_i − Ŷ_i) + (Ŷ_i − Ȳ)]²
            = Σ(Y_i − Ŷ_i)² + 2Σ(Y_i − Ŷ_i)(Ŷ_i − Ȳ) + Σ(Ŷ_i − Ȳ)²

The cross-product term is 0, so

Σ(Y_i − Ȳ)² = Σ(Y_i − Ŷ_i)² + Σ(Ŷ_i − Ȳ)²

that is,

SS_Total (total sum of squares) = SS_E (residual or error sum of squares) + SS_R (sum of squares due to regression)

The best fit maximizes SS_R and minimizes SS_E.
26
Least squares regression equation

Minimize SS_Error by partial derivatives. Writing the fitted line as Ŷ_i = a' + b(X_i − X̄):

SS_Error = Σ(Y_i − Ŷ_i)² = Σ{Y_i − [a' + b(X_i − X̄)]}²

∂SS_Error/∂a' = −2 Σ{Y_i − [a' + b(X_i − X̄)]} = 0

ΣY_i − na' − b Σ(X_i − X̄) = 0

and since Σ(X_i − X̄) = 0:

ΣY_i − na' = 0, so a' = ΣY_i/n = Ȳ
27
Least squares regression equation

∂SS_Error/∂b = −2 Σ{Y_i − [a' + b(X_i − X̄)]}(X_i − X̄) = 0

ΣY_i(X_i − X̄) − a' Σ(X_i − X̄) − b Σ(X_i − X̄)² = 0

and since Σ(X_i − X̄) = 0:

b = ΣY_i(X_i − X̄) / Σ(X_i − X̄)²
  = [ΣX_iY_i − (ΣX_i)(ΣY_i)/n] / [ΣX_i² − (ΣX_i)²/n]
  = SS_XY / SS_X
28
Result
Least squares regression line:

b = SS_XY/SS_X = [ΣXY − (ΣX)(ΣY)/n] / [ΣX² − (ΣX)²/n]

a = Ȳ − bX̄

Ŷ = μ̂_{Y|X} = a + bX, or equivalently Ŷ = a' + b(X − X̄) with a' = Ȳ
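A minimal Python sketch of these formulas (function and variable names are ours, not from the slides), checked against the yeast data that appears later in the deck:

```python
def least_squares(x, y):
    """Fit Y-hat = a + b*X via the corrected-sum-of-squares formulas above."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n                    # SS_X
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n  # SS_XY
    b = ss_xy / ss_x                     # slope
    a = sum(y) / n - b * (sum(x) / n)    # intercept: a = Y-bar - b * X-bar
    return a, b

# Yeast data (time in days vs log area, from the example below):
days = [1, 2, 3, 4, 5, 6, 7, 8, 9]
log_area = [3.6, 3.8, 4.2, 4.5, 5.0, 5.2, 5.5, 5.6, 6.1]
a, b = least_squares(days, log_area)     # a = 3.27, b = 0.3117
```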
29
③ Simple Linear Regression Analysis
A global test for the regression (ANOVA)
A test for the regression coefficient (Student's t test)
30
Hypothesis
H0: the variation in Y is not explained by a linear model, i.e., β = 0
Ha: a significant portion of the variation in Y is explained by a linear model, i.e., β ≠ 0
31
Partitioning the Sum of Squares
SS_R = Σ(Ŷ_i − Ȳ)² = Σ(Ȳ + b(X_i − X̄) − Ȳ)² = b² Σ(X_i − X̄)² = b² SS_X
     = (SS_XY/SS_X)² SS_X = SS_XY²/SS_X = b·SS_XY

SS_Total = Σ(Y_i − Ȳ)² = ΣY_i² − (ΣY_i)²/n

SS_E = Σ(Y_i − Ŷ_i)² = SS_Total − SS_R
32
The ANOVA table for a regression analysis:

  Source of variation   SS         DF    MS     E(MS)                F           c.v.
  Regression            SS_R       1     MS_R   σ²_{Y·X} + β² SS_X   MS_R/MS_E   see Table C.7
  Error                 SS_E       n−2   MS_E   σ²_{Y·X}
  Total                 SS_Total   n−1

E(F) ≈ E(MS_R)/E(MS_E) = (σ²_{Y·X} + β² SS_X) / σ²_{Y·X}

If H0 is true (β = 0), E(F) = 1; if Ha is true (β ≠ 0), E(F) > 1.

Test statistic: F = MS_R/MS_E ~ F(1, n − 2)
33
Coefficient of determination
a measure of the amount of the variability in Y that is explained by its dependence on X.
Coeff. of D. = R² = SS_R / SS_Total
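The same corrected sums of squares give the whole ANOVA in a few lines. A sketch (our names; standard library only); on the yeast data of the later example it reproduces SS_R = 5.8282, SS_E = 0.0718 and R² = 98.8% (the deck's F = 565.8 agrees with the unrounded value of about 568 up to intermediate rounding):

```python
def regression_anova(x, y):
    """Partition SS_Total into SS_R + SS_E and form F = MS_R / MS_E ~ F(1, n-2)."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_total = sum(v * v for v in y) - sum(y) ** 2 / n     # SS_Total = SS_Y
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    ss_r = ss_xy ** 2 / ss_x        # regression SS, df = 1
    ss_e = ss_total - ss_r          # error SS, df = n - 2
    f = ss_r / (ss_e / (n - 2))     # MS_R / MS_E
    r_squared = ss_r / ss_total     # coefficient of determination
    return f, r_squared
```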
34
Simple Linear Regression Analysis
A global test for the regression (ANOVA)
A test for the regression coefficient (Student's t test)
35
Hypothesis
H0: the variation in Y is not explained by a linear model, i.e., β = 0
Ha: a significant portion of the variation in Y is explained by a linear model, i.e., β ≠ 0
36
t test statistic
t = b/s_b ~ t(n − 2)

Variance of b: σ_b² = σ²_{Y·X} / SS_X

Its estimate: s_b² = SS_E / [(n − 2) SS_X] = MS_E / SS_X

Standard error of b: s_b = √(MS_E / SS_X)
37
F(1, n − 2) = t²(n − 2)

ANOVA: F = MS_R/MS_E = b² SS_X / MS_E

Student's t: t = b/s_b = b / √(MS_E/SS_X)

Hence F = t².
38
Confidence interval for β

(b − β)/s_b follows Student's t distribution, so

P(−t_{1−α/2} ≤ (b − β)/s_b ≤ t_{1−α/2}) = 1 − α

C(b − t_{1−α/2} s_b ≤ β ≤ b + t_{1−α/2} s_b) = 1 − α, with df = n − 2

L1 = b − t_{1−α/2} s_b,  L2 = b + t_{1−α/2} s_b
39
Confidence interval for μ_{Y|X}

Since Ŷ_i = Ȳ + b(X_i − X̄), and s_Ȳ² = MS_E/n, s_b² = MS_E/SS_X,

the standard error of Ŷ (its sampling error) is

s_Ŷ² = MS_E [1/n + (X_i − X̄)²/SS_X],  where MS_E = SS_E/(n − 2) and SS_X = Σ(X_i − X̄)²
40
Confidence interval for μ_{Y|X}

(Ŷ − μ_{Y|X})/s_Ŷ follows Student's t distribution, so

C(Ŷ − t_{1−α/2} s_Ŷ ≤ μ_{Y|X} ≤ Ŷ + t_{1−α/2} s_Ŷ) = 1 − α, with df = n − 2

L1 = Ŷ − t_{1−α/2} s_Ŷ,  L2 = Ŷ + t_{1−α/2} s_Ŷ
41
Confidence interval for an individual observation Y_i

Since Y_i = Ȳ + b(X_i − X̄) + ε_i, and s_Ȳ² = MS_E/n, s_b² = MS_E/SS_X,

the standard error of Y_i (sampling error plus the variation of a single observation) is

s_{Y_i}² = MS_E [1 + 1/n + (X_i − X̄)²/SS_X]
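The three intervals differ only in the standard error under the square root. A hedged Python sketch (names are ours; t_crit is t_{1−α/2} with df = n − 2, looked up from a t table):

```python
import math

def regression_intervals(x, y, x0, t_crit):
    """CIs for the slope beta, for the mean response at x0, and for a new Y at x0."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    b = ss_xy / ss_x
    y_hat = y_bar + b * (x0 - x_bar)                 # predicted mean at x0
    ms_e = (ss_y - ss_xy ** 2 / ss_x) / (n - 2)      # residual mean square
    se_b = math.sqrt(ms_e / ss_x)                                      # for beta
    se_mean = math.sqrt(ms_e * (1 / n + (x0 - x_bar) ** 2 / ss_x))     # for mu_{Y|X}
    se_new = math.sqrt(ms_e * (1 + 1 / n + (x0 - x_bar) ** 2 / ss_x))  # for one Y_i
    return [(c - t_crit * s, c + t_crit * s)
            for c, s in ((b, se_b), (y_hat, se_mean), (y_hat, se_new))]
```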
42
Understand the regression
analysis via example
43
Example 1: yield of tomato varieties

  Variety 1   Variety 2
  22          18
  21          15
  27          13
  34          20
  30          32
  28          27
  21          11
  29          20
  22          16
  14          17
  Totals:
  248         189

Summarized data:

                        Variety 1   Variety 2
  Mean                  24.8        18.9
  St. dev.              5.8271      6.3675
  Variance              33.96       40.54
  No. of observations   10          10
44
A. Student’s t test
Equality of variances: F = s1²/s2² = 33.96/40.54 = 0.84, p = 0.796 → accept H0: there is no difference between the two variances.

H0: μ1 = μ2. Pooled variance: s_p² = (40.54 + 33.96)/2 = 37.3

t = [(24.8 − 18.9) − 0] / √[37.3 (1/10 + 1/10)] = 5.9/2.731 = 2.16, with df = n1 + n2 − 2 = 18

p = 0.045 → reject H0: there is a difference between the two means.
45
B. ANOVA
H0: μ1 = μ2

  Item      df   SS       MS       F(1,18)   P
  Between   1    174.05   174.05   4.67*     <0.05
  Within    18   670.5    37.25
  Total     19   844.55

Conclusion: reject H0.
46
Compare ANOVA with t test
t was 2.16 for 18 df, 0.05 > P > 0.01
F was 4.67 for 1 and 18 df, 0.05 > P > 0.01

In fact, F = t² (i.e. 4.67 = 2.16²).

Why? Because with t we are dealing with differences, while with F we are dealing with variances (differences squared).
47
C. Regression

Recode the data: Y = observed yield, X = variety (1 or 2).

  Observation (Y)   Variety (X)
  22                1
  21                1
  27                1
  34                1
  30                1
  18                2
  15                2
  13                2
  20                2
  32                2
  …                 …

(the full data, as before:)

  Variety 1   Variety 2
  22          18
  21          15
  27          13
  34          20
  30          32
  28          27
  21          11
  29          20
  22          16
  14          17
48
Calculations:

n = 20, ΣX = 30, ΣY = 437

ΣX² = 10·1² + 10·2² = 50
ΣY² = 10393
ΣXY = 626

SS_X = ΣX² − (ΣX)²/n = 50 − 30²/20 = 5
SS_Y = ΣY² − (ΣY)²/n = 10393 − 437²/20 = 844.55
SS_XY = ΣXY − (ΣX)(ΣY)/n = 626 − 30·437/20 = −29.5
49
Estimation:

SS_X = 5, SS_Y = 844.55, SS_XY = −29.5

Regression coefficient: b = SS_XY/SS_X = −29.5/5 = −5.9

Intercept: a = Ȳ − bX̄ = 21.85 − (−5.9)(1.5) = 30.7

Regression equation: Ŷ = 30.7 − 5.9X

Intercept at X̄: a' = Ȳ = 21.85, so equivalently Ŷ = 21.85 − 5.9(X − 1.5)
50
Testing the significance: ANOVA

H0: no linear relation between Y and X, i.e., β = 0
Ha: the variation in Y is linearly explained by the variation in X, i.e., β ≠ 0

Here μ1 = E(Y|X = 1) and μ2 = E(Y|X = 2), so the slope is

β = (μ2 − μ1)/(2 − 1) = μ2 − μ1

and H0: β = 0 is the same hypothesis as H0: μ2 − μ1 = 0.
51
ANOVA sums of squares:

SS_X = 5, SS_Y = 844.55, SS_XY = −29.5

SS_Total = SS_Y = 844.55
SS_R = SS_XY²/SS_X = (−29.5)²/5 = 174.05
SS_E = SS_Total − SS_R = 670.5
52
Regression ANOVA
  Item         df   SS       MS       F       P
  Regression   1    174.05   174.05   4.67*   <0.05
  Error        18   670.5    37.25
  Total        19   844.55

Conclusion: reject H0.
53
Coefficient of determination
a measure of the amount of the variability in y that is explained by its dependence on x.
Coeff. of D. = R² = SS_R/SS_Total = 174.05/844.55 = 20.6%
54
t test for the regression coefficient

H0: β = 0

s_b = √(MS_E/SS_X) = √(37.25/5) = 2.729

t = b/s_b = −5.9/2.729 = −2.16 ~ t(18)
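To see the t test, the ANOVA and the regression agree numerically, a short self-contained Python sketch of Example 1 (variable names are ours):

```python
x = [1] * 10 + [2] * 10                          # variety, recoded as X
y = [22, 21, 27, 34, 30, 28, 21, 29, 22, 14,     # variety 1 yields
     18, 15, 13, 20, 32, 27, 11, 20, 16, 17]     # variety 2 yields
n = len(x)
ss_x = sum(v * v for v in x) - sum(x) ** 2 / n                      # 5
ss_y = sum(v * v for v in y) - sum(y) ** 2 / n                      # 844.55
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n  # -29.5
b = ss_xy / ss_x                                                    # -5.9
ms_e = (ss_y - ss_xy ** 2 / ss_x) / (n - 2)                         # 37.25
t = b / (ms_e / ss_x) ** 0.5                                        # -2.16
f = (ss_xy ** 2 / ss_x) / ms_e                                      # 4.67 = t**2
```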
55
Example 2: yeast data

A yeast colony grown on agar. The area was measured (mm²) on 9 successive days and transformed to logs.

  Time (days)   Area (log mm²)
  1             3.6
  2             3.8
  3             4.2
  4             4.5
  5             5.0
  6             5.2
  7             5.5
  8             5.6
  9             6.1

  Σx = 45   Σy = 43.5
56
(Plot: the untransformed areas grow exponentially with time)
57
Scatterplot of yeast data

(Scatterplot: area (log mm²) against days 1 to 9; an apparent positive linear relationship)
58
Nonlinear -> linear

Power-law: Y = aX^b, so log Y = log a + b log X
Exponential: Y = ab^X, so log Y = log a + X log b
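A small sketch of the two transformations in Python (function names are ours; base-10 logs, as in the yeast example, though any base works):

```python
import math

# Power law:   Y = a * X**b  ->  log Y = log a + b * log X (regress log Y on log X)
# Exponential: Y = a * b**X  ->  log Y = log a + X * log b (regress log Y on X)

def linearize_power(x, y):
    return [math.log10(v) for v in x], [math.log10(v) for v in y]

def linearize_exponential(x, y):
    return list(x), [math.log10(v) for v in y]
```

The yeast data were handled the exponential way: area was logged, time was left untransformed.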
59
Calculations (data as in the table above):

n = 9, ΣX = 45, ΣY = 43.5

ΣX² = 1² + 2² + … + 9² = 285
ΣY² = 3.6² + 3.8² + … + 6.1² = 216.15
ΣXY = 1·3.6 + 2·3.8 + … + 9·6.1 = 236.2

SS_X = ΣX² − (ΣX)²/n = 285 − 45²/9 = 60
SS_Y = ΣY² − (ΣY)²/n = 216.15 − 43.5²/9 = 5.9
SS_XY = ΣXY − (ΣX)(ΣY)/n = 236.2 − 45·43.5/9 = 18.7
60
Estimation:

SS_X = 60, SS_Y = 5.9, SS_XY = 18.7

Regression coefficient: b = SS_XY/SS_X = 18.7/60 = 0.3117

Intercept: a = Ȳ − bX̄ = 4.83 − 0.3117·5 = 3.27

Regression equation: Ŷ = 3.27 + 0.3117X

Intercept at X̄: a' = Ȳ = 4.83, so equivalently Ŷ = 4.83 + 0.3117(X − 5.0)
61
To fit the line

Regression equation: Ŷ = a + bX = 3.27 + 0.3117X

Use two extreme values of x, 0 and 9:
when x = 0, ŷ = 3.27;
when x = 9, ŷ = 3.27 + 0.3117·9 = 6.08.
62
Fitting the best line

(Plot: the data with the fitted line ŷ = 3.27 + 0.3117x, which passes through (0, 3.27), (5, 4.83) and (9, 6.08))
63
Testing the significance: ANOVA

H0: no linear relation between y and x, i.e., β = 0
Ha: the variation in y is linearly explained by the variation in x, i.e., β ≠ 0

  Item         df    SS   MS   F   c.v.
  Regression   1
  Error        n−2
  Total        n−1
64
ANOVA sums of squares:

SS_X = 60, SS_Y = 5.9, SS_XY = 18.7

SS_Total = SS_Y = 5.9
SS_R = SS_XY²/SS_X = 18.7²/60 = 5.8282
SS_E = SS_Total − SS_R = 5.9 − 5.8282 = 0.0718
65
Regression ANOVA
  Item         df   SS       MS       F         P
  Regression   1    5.8282   5.8282   565.8**   <0.01
  Error        7    0.0718   0.0103
  Total        8    5.9000

Conclusion: reject H0 (of no relationship) and conclude that a significant portion of the variability in colony area is explained by regression on time.
66
Coefficient of determination
a measure of the amount of the variability in y that is explained by its dependence on x.
Coeff. of D. = R² = SS_R/SS_Total = 5.8282/5.9 = 98.8%
67
Inference from the yeast data

H0: log area has no linear relationship with time
Ha: log area has a linear relationship with time

Inference: reject H0 and accept Ha, i.e. log area changes linearly with time. Moreover, the regression explains 98.8% of the variation.
68
Confidence interval for β

Standard error of b: s_b² = MS_E/SS_X = 0.0103/60 = 0.0001717, so s_b = √0.0001717 = 0.0131

95% confidence interval for β (t_{1−α/2}(7) = 2.365):

L1 = 0.3117 − 2.365·0.0131 = 0.2807
L2 = 0.3117 + 2.365·0.0131 = 0.3427

Note also t = b/s_b = 0.3117/0.0131 = 23.79, and t² = 565.8 = F.
69
Confidence interval for μ_{Y|X}

At X_i = 5: Ŷ = μ̂_{Y|X=5} = 4.83 + 0.3117(5 − 5) = 4.83

Standard error: s_Ŷ = √(MS_E [1/n + (X_i − X̄)²/SS_X]) = √(0.0103 [1/9 + (5 − 5)²/60]) = 0.034

95% confidence interval limits (t_{1−α/2}(7) = 2.365):

L1 = 4.83 − 2.365·0.034 = 4.75
L2 = 4.83 + 2.365·0.034 = 4.91
70
Summary of Regression
Regression analysis is used when one variable (x) is fixed and likely to cause variation in the other (y).
Graph the data to check that a linear relationship is apparent.
Calculate the regression equation using the least squares method.
Test the significance of this equation with ANOVA.
If significant, plot the equation on the graphed data.
Calculate the required confidence intervals.
71
Example 3: the effect of carbon dioxide on respiration rate

  No.   Partial pressure CO2 (torr)   Respiration rate (breaths/minute)
  1     30                            8.1
  2     32                            8.0
  3     34                            9.9
  4     36                            11.2
  5     38                            11.0
  6     40                            13.2
  7     42                            14.6
  8     44                            16.6
  9     46                            16.7
  10    48                            18.3
  11    50                            18.2

① Construct a scatterplot of these data
② Compute the linear regression equation
③ Test the significance of this equation via ANOVA
④ Calculate the 95% CI for β
⑤ Find the predicted respiration rate for 48 torr and the 95% CI
⑥ Find the predicted respiration rate for 38 torr and the 95% CI
⑦ Why do these two CIs have different lengths?
72
(Scatterplot of CO2 pressure against respiration rate: a positive linear relationship)
73
Compute the linear regression equation:

n = 11
ΣX = 440, X̄ = 40, ΣX² = 18040
ΣY = 145.8, Ȳ = 13.25, ΣY² = 2082.04
ΣXY = 6085

b = [ΣXY − (ΣX)(ΣY)/n] / [ΣX² − (ΣX)²/n] = (6085 − 440·145.8/11) / (18040 − 440²/11) = 0.575

a = Ȳ − bX̄ = 13.25 − 0.575·40 = −9.745

Ŷ = −9.745 + 0.575X
74
(Fitted line plot: res rate = −9.745 + 0.5750 CO2; S = 0.671009, R-Sq = 97.3%, R-Sq(adj) = 97.0%)
75
Test the significance:

SS_Total = ΣY² − (ΣY)²/n = 2082.04 − 145.8²/11 = 149.53
SS_R = [ΣXY − (ΣX)(ΣY)/n]² / [ΣX² − (ΣX)²/n] = 253²/440 = 145.47
SS_E = SS_Total − SS_R = 149.53 − 145.47 = 4.05

  Item         df   SS       MS       F          P
  Regression   1    145.47   145.47   323.10**   <0.01
  Remainder    9    4.05     0.45
  Total        10   149.53

Conclusion: reject H0 and accept Ha; respiration rate changes linearly with partial pressure CO2.
76
t_{1−α/2} = ? The candidates 2.365, 2.306, 2.262 and 2.228 correspond to df = 7, 8, 9 and 10; here df = n − 2 = 9, so t_{1−α/2} = 2.262.

L1 = b − t_{1−α/2} s_b, L2 = b + t_{1−α/2} s_b: the CI for β is [0.503, 0.647]

L1 = Ŷ − t_{1−α/2} s_Ŷ, L2 = Ŷ + t_{1−α/2} s_Ŷ:
the CI for ŷ at 48 torr is [17.53, 18.18], length 0.65;
the CI for ŷ at 38 torr is [11.89, 12.32], length 0.43.
77
The difference in length between the two CIs arises because 48 torr is further from the average partial pressure CO2 (40 torr) than 38 torr is. The CI length increases as the partial pressure moves further from the mean, because it is determined by the standard error of Ŷ, and that standard error depends on the squared deviation of the individual partial pressure from the average:

s_Ŷ = √(MS_E [1/n + (X_i − X̄)²/SS_X])

L1 = Ŷ − t_{1−α/2} s_Ŷ,  L2 = Ŷ + t_{1−α/2} s_Ŷ
78
Confidence interval for μ_{Y|X} (recap)

Since Ŷ_i = Ȳ + b(X_i − X̄), with s_Ȳ² = MS_E/n and s_b² = MS_E/SS_X, the standard error (sampling error) of Ŷ is

s_Ŷ² = MS_E [1/n + (X_i − X̄)²/SS_X],  MS_E = SS_E/(n − 2)

and for an individual observation Y_i:

s_{Y_i}² = MS_E [1 + 1/n + (X_i − X̄)²/SS_X]
79
(Plot: the regression line y = 0.575x − 9.7455, R² = 0.9729, drawn through the data with the narrower CI band for the mean Ŷ and the wider CI band for individual Y_i)
80
Concepts
Simple linear regression
Simple linear correlation
Correlation analysis based on ranks
81
Correlation Analysis
To measure the intensity of the association observed between a pair of variables and to test its statistical significance.
To test whether two variables covary, i.e. are interdependent.
83
Correlation analysis
Two variables, X and Y, have the same status: we cannot tell which is cause and which is effect.
There may be some common underlying cause of both.

Examples:
height and weight
heights of siblings
84
Types of relationship

(Scatterplot panels:)
Positive correlation: large X's are associated with large Y's.
Negative correlation: large X's are associated with small Y's.
No correlation: Y and X have no linear correlation.
85
No linear correlation ≠ independent

Y = X² makes Y completely dependent on X, yet shows no linear correlation. Linear correlation implies dependence, but zero linear correlation does not imply independence.
86
Index of association

Σ [(X_i − X̄)/S_X] [(Y_i − Ȳ)/S_Y]

the product of the standardized normal deviates for X and for Y, summed over the observations.

Positive correlation (large X's associated with large Y's): mostly (X_i − X̄)(Y_i − Ȳ) > 0.
Negative correlation (large X's associated with small Y's): mostly (X_i − X̄)(Y_i − Ȳ) < 0.
87
Pearson product-moment correlation coefficient

Abbr.: Pearson correlation coefficient

r = Σ(X_i − X̄)(Y_i − Ȳ) / [(n − 1) S_X S_Y]
  = SS_XY / √(SS_X SS_Y)
  = [ΣXY − (ΣX)(ΣY)/n] / √{[ΣX² − (ΣX)²/n] [ΣY² − (ΣY)²/n]}

where SS_XY is the corrected cross products, SS_X and SS_Y are the corrected sums of squares of X and Y, and n is the sample size.
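A minimal Python sketch of this formula (names are ours), checked against the chiton data worked through below:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation from the corrected sums of squares."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    return ss_xy / math.sqrt(ss_x * ss_y)

# Chiton plate data (length vs width, from the example below): r = 0.969
length = [10.7, 11.0, 9.5, 11.1, 10.3, 10.7, 9.9, 10.6, 10.0, 12.0]
width = [5.8, 6.0, 5.0, 6.0, 5.3, 5.8, 5.2, 5.7, 5.3, 6.3]
r = pearson_r(length, width)
```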
88
Pearson correlation coefficient

A widely used index of the association of two quantitative variables.

An estimate of the population correlation coefficient

ρ_XY = COV(X, Y)/(σ_X σ_Y), where COV(X, Y) = E[(X − E(X))(Y − E(Y))]
89
Linear correlation model

Population correlation coefficient:

ρ = (1/N) Σ [(X − μ_X)/σ_X] [(Y − μ_Y)/σ_Y]
  = Σ(X − μ_X)(Y − μ_Y) / √[Σ(X − μ_X)² Σ(Y − μ_Y)²]

Sample correlation coefficient:

r = [1/(n − 1)] Σ [(X − X̄)/s_x] [(Y − Ȳ)/s_y]
  = SS_XY / √(SS_X SS_Y)
90
Linear correlation model

The Y's at each X are normally distributed, and the X's at each Y are normally distributed.

The conditional population means follow the regression equations

μ_{Y|X} = μ_Y + β_{Y|X}(X − μ_X)
μ_{X|Y} = μ_X + β_{X|Y}(Y − μ_Y)

estimated by

Ŷ = Ȳ + b_{Y|X}(X − X̄)
X̂ = X̄ + b_{X|Y}(Y − Ȳ)
91
Regression vs correlation

b_{Y|X} = SS_XY/SS_X;  b_{X|Y} = SS_XY/SS_Y

b_{Y|X} b_{X|Y} = SS_XY²/(SS_X SS_Y) = r²

SS_R = b² SS_X = (SS_XY/SS_X)² SS_X = SS_XY²/SS_X

Coeff. of D. = SS_R/SS_Total = (SS_XY²/SS_X)/SS_Y = SS_XY²/(SS_X SS_Y) = r²

SS_R: the explainable variability; SS_Total = SS_Y: the total variability.
92
Characteristics of the correlation coefficient

r² = Coeff. of D. = SS_R/SS_Total, so r = ±√(SS_R/SS_Total)

SS_Total = SS_R + SS_E, and 0 ≤ SS_R ≤ SS_Total

Hence 0 ≤ SS_R/SS_Total ≤ 1, i.e. 0 ≤ r² ≤ 1, and therefore −1 ≤ r ≤ 1.
93
r = +1: a complete positive correlation between x and y
r = −1: a complete negative correlation between x and y
r = 0: x and y are not linearly correlated

Correlation has upper and lower limits of +1 and −1 respectively.
94
Test of hypothesis: t test

To test whether r differs from zero.

Hypotheses: H0: ρ = 0; Ha: ρ ≠ 0

The standard error of r is s_r = √[(1 − r²)/(n − 2)]

t = (r − 0)/s_r = r / √[(1 − r²)/(n − 2)] follows the t distribution with df = n − 2.

The critical value of r for a given n and α: r = t_{1−α/2} / √(n − 2 + t²_{1−α/2})
95
Regression = correlation

r² = Coeff. of D. = SS_R/SS_Total = 1 − SS_E/SS_Total, so 1 − r² = SS_E/SS_Total

t_r = r / √[(1 − r²)/(n − 2)] = [SS_XY/√(SS_X SS_Y)] / √[(SS_E/SS_Total)/(n − 2)] = SS_XY / √(SS_X MS_E)

t_b = b/s_b = (SS_XY/SS_X) / √(MS_E/SS_X) = SS_XY / √(SS_X MS_E),  df = n − 2

So the t test for r is identical to the t test for b.
96
Understand the correlation
analysis via example
97
① Preliminary calculations

The length and width of the eight overlapping plates composing the shell of 10 Chiton olivaceous.

  Animal   X: length (cm)   Y: width (cm)
  1        10.7             5.8
  2        11.0             6.0
  3        9.5              5.0
  4        11.1             6.0
  5        10.3             5.3
  6        10.7             5.8
  7        9.9              5.2
  8        10.6             5.7
  9        10.0             5.3
  10       12.0             6.3

n = 10
ΣX = 105.8, X̄ = 10.58, ΣX² = 1123.9
ΣY = 56.4, Ȳ = 5.64, ΣY² = 319.68
ΣXY = 599.31
98
② Calculate the correlation coefficient:

r = [ΣXY − (ΣX)(ΣY)/n] / √{[ΣX² − (ΣX)²/n] [ΣY² − (ΣY)²/n]}
  = [599.31 − 105.8·56.4/10] / √{[1123.9 − 105.8²/10] [319.68 − 56.4²/10]}
  = 0.969

③ Test of significance

Hypotheses: H0: ρ = 0; Ha: ρ ≠ 0

Test statistic: t = (r − 0)/s_r = (0.969 − 0) / √[(1 − 0.969²)/(10 − 2)] = 0.969/0.0871 = 11.14

Since 11.14 > t_{0.05}(8) = 2.306, reject H0.

④ Confidence interval: ??? (this needs Fisher's Z transformation, below)
99
Need different tests of ρ

We can ONLY use the t test for the H0 that ρ = 0, because only in this case (independence) is the distribution of r approximately normal. In all other cases the distribution of r is asymmetrical, so we have to use a different test, as follows.
100
Distribution of r in samples

(Plot: the sampling distribution of r, from −1 to +1, is symmetric when ρ = 0 but skewed when ρ = 0.8)

Fisher's Z transformation must be employed: Z is the inverse hyperbolic tangent of r.
101
Fisher's Z transformation

Transform r to Z (the inverse hyperbolic tangent):

r = tanh Z = (e^Z − e^{−Z}) / (e^Z + e^{−Z})

Z = tanh⁻¹ r = ½ ln[(1 + r)/(1 − r)] = ½ [ln(1 + r) − ln(1 − r)]

Z follows a normal distribution approximately:

Z ~ N(ζ + ρ/[2(n − 1)], 1/(n − 3))

Standard error of Z: σ_Z = 1/√(n − 3)
102
Testing when ρ0 ≠ 0

Hypotheses: H0: ρ = ρ0; Ha: ρ ≠ ρ0

With Z = tanh⁻¹ r and ζ0 = tanh⁻¹ ρ0, the test statistic follows an approximately normal distribution:

u = (Z − ζ0)/σ_Z = (tanh⁻¹ r − tanh⁻¹ ρ0) / √[1/(n − 3)] ~ N(0, 1)
103
Confidence interval for the correlation coefficient

95% CI for ζ:

L1 = Z − 1.960 σ_Z = Z − 1.960/√(n − 3)
L2 = Z + 1.960 σ_Z = Z + 1.960/√(n − 3)

95% CI for ρ, by back-transforming:

r1 = tanh L1,  r2 = tanh L2
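Python's math module provides tanh and atanh, so the transformation, test and interval fit in a few lines (a sketch; names are ours):

```python
import math

def fisher_ci(r, n, z_crit=1.960):
    """Approximate 95% CI for rho via Fisher's Z; SE(Z) = 1/sqrt(n - 3)."""
    z = math.atanh(r)                    # 0.5 * ln((1 + r) / (1 - r))
    half = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half), math.tanh(z + half)  # back to the r scale

# Sibling-height example below: r = 0.558, n = 50 -> about (0.331, 0.724)
lo, hi = fisher_ci(0.558, 50)
```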
104
Example when the expected ρ ≠ 0

For genetical reasons, the correlation of height among sibs (brothers or sisters) is expected to be 0.5, i.e. H0 is that ρ = 0.5.

  Family   X: brother (cm)   Y: sister (cm)
  1        71                69
  2        68                64
  3        66                65
  4        67                64
  5        70                65
  6        71                62
  7        70                62
  8        73                64
  9        72                66
  …        …                 …
  50       66                62

r = SS_XY / √((74)(64)) = 0.558,  r² = 0.3114
105
Test of significance when ρ ≠ 0

For ρ0 = 0.5:  ζ0 = ½ ln[(1 + 0.5)/(1 − 0.5)] = 0.5493
For r = 0.558:  Z = ½ ln[(1 + 0.558)/(1 − 0.558)] = 0.63

Standard error: σ_Z = 1/√(50 − 3) = 0.146

u = (Z − ζ0)/σ_Z = (0.63 − 0.55)/0.146 = 0.55 < 1.960

Therefore we cannot reject H0.
106
Example when the expected ρ ≠ 0

L1 = 0.63 − 1.960·0.146 = 0.63 − 0.2859 = 0.3441
L2 = 0.63 + 1.960·0.146 = 0.63 + 0.2859 = 0.9159

Back-transforming to the r scale:

Z: 0.3441 ≤ ζ ≤ 0.9159 (point estimate 0.63)
r: 0.3311 ≤ ρ ≤ 0.7240 (point estimate 0.558)
107
Concepts
Simple linear regression
Simple linear correlation
Correlation analysis based on ranks
108
Bivariate random sample (not normally distributed)

Examples:
first-grade students' ages and their performance on a standardized test
student heights and GPAs

Goal: to test the relationship between the two variables, i.e. to validate the dependence of two random variables.
109
The characteristics for an index of association

Values only between −1 and +1, inclusive.
The stronger the positive correlation, the closer the value is to +1.
The stronger the negative correlation, the closer the value is to −1.
For uncorrelated pairs of X and Y, the value should be close to 0.
110
Kendall correlation coefficient τ

Directly compare the n observations with each other. For a pair (i, j):

Concordant (C): (X_i − X_j)(Y_i − Y_j) > 0 (agreeing with a positive correlation)
Discordant (D): (X_i − X_j)(Y_i − Y_j) < 0 (agreeing with a negative correlation)
Tie (E): (X_i − X_j)(Y_i − Y_j) = 0

The total number of comparisons is n(n − 1)/2 = C + D + E.

τ = (C − D) / [n(n − 1)/2] = 2(C − D) / [n(n − 1)]

where C − D is the difference between the numbers of concordant and discordant pairs; −1 ≤ τ ≤ 1.
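A direct O(n²) Python sketch of this definition (names are ours), checked against the gifted-referral example that follows:

```python
def kendall_tau(x, y):
    """Kendall's tau by comparing every pair: C concordant, D discordant."""
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1      # concordant pair
            elif s < 0:
                d += 1      # discordant pair
            # s == 0: tied pair, counted in neither C nor D
    return 2 * (c - d) / (n * (n - 1))

# Months after cut-off vs students evaluated (data below): tau = -0.606
months = list(range(1, 13))
students = [53, 47, 32, 42, 35, 32, 37, 38, 27, 24, 29, 27]
tau = kendall_tau(months, students)
```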
111
① Rank X and Y
② Compare each observation with the other observations below it
③ Summarize C, D and E

  X_i   Y_i   Rank X_i   Rank Y_i   Concordant below   Discordant below   Tied below
  1     53    1          12         0                  11                 0
  2     47    2          11         0                  10                 0
  3     32    3          5.5        4                  4                  1
  4     42    4          10         0                  8                  0
  5     35    5          7          2                  5                  0
  6     32    6          5.5        2                  4                  0
  7     37    7          8          1                  4                  0
  8     38    8          9          0                  4                  0
  9     27    9          2.5        1                  1                  1
  10    24    10         1          2                  0                  0
  11    29    11         4          0                  1                  0
  12    27    12         2.5        0                  0                  0

C = 4 + … = 12
D = 11 + … = 52
E = 1 + 1 = 2

④ Calculate the Kendall correlation coefficient:

τ = 2(C − D)/[n(n − 1)] = 2(12 − 52)/[12(12 − 1)] = −0.606
112
The relative age effects on academic and social performance in Geneva.

X: the month after the cut-off date
Y: the number of students in grades K through 4 evaluated for the district's Gifted and Talented Student Program

  Birth month   Month after cut-off (X)   Students evaluated (Y)
  Dec           1                         53
  Jan           2                         47
  Feb           3                         32
  Mar           4                         42
  Apr           5                         35
  May           6                         32
  Jun           7                         37
  Jul           8                         38
  Aug           9                         27
  Sep           10                        24
  Oct           11                        29
  Nov           12                        27

The older students tend to be overrepresented and the younger ones underrepresented.
113
⑥ Test significance: two-tailed test

Hypotheses: H0: τ = 0; Ha: τ ≠ 0

Test statistic: min(C + E/2, D + E/2) = min(12 + 1, 52 + 1) = 13

From Table C.12 for n = 12: p = 2·0.0027 = 0.0054 < 0.01, so reject H0.

Conclusion: there is a moderately strong negative correlation between month after cut-off date and referrals for gifted evaluation.
114
Spearman's rank correlation coefficient r_s

One of the most common correlation coefficients.

Idea: rank the X and Y observations separately and compute the Pearson correlation coefficient on the ranks rather than on the original data.

Advantage: simpler to compute:

r_s = 1 − 6 Σ_{i=1}^{n} d_i² / [n(n² − 1)],  where d_i = r_{x_i} − r_{y_i}
115
Why the shortcut works (no ties): the ranks of each variable are the integers 1, …, n, so

Σ r_x = Σ r_y = n(n + 1)/2
Σ r_x² = Σ r_y² = n(n + 1)(2n + 1)/6
SS_rx = SS_ry = n(n + 1)(2n + 1)/6 − n(n + 1)²/4 = n(n² − 1)/12

Since d_i = r_{x_i} − r_{y_i},

Σ d_i² = Σ r_x² − 2 Σ r_x r_y + Σ r_y²,  so  Σ r_x r_y = [Σ r_x² + Σ r_y² − Σ d_i²]/2

Pearson's r computed on the ranks is therefore

r_s = [Σ r_x r_y − (Σ r_x)(Σ r_y)/n] / √(SS_rx SS_ry)
    = [n(n + 1)(2n + 1)/6 − Σ d_i²/2 − n(n + 1)²/4] / [n(n² − 1)/12]
    = [n(n² − 1)/12 − Σ d_i²/2] / [n(n² − 1)/12]
    = 1 − 6 Σ d_i² / [n(n² − 1)]
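A Python sketch of the shortcut (names are ours; ties would need midranks, which this simple ranking does not produce, but the soccer example below has no tied Y values):

```python
def spearman_rs(x, y):
    """Spearman's r_s via the rank-difference shortcut (assumes no ties)."""
    n = len(x)

    def ranks(values):
        order = sorted(range(n), key=lambda i: values[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    d_sq = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_sq / (n * (n * n - 1))

# Month vs actual-minus-expected players (data below): r_s = -0.727
months = list(range(1, 13))
diff = [8.73, 5.62, 13.74, -2.60, -0.16, 2.95,
        -3.38, -6.83, -6.16, -7.71, -0.93, -3.27]
rs = spearman_rs(months, diff)
```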
116
The distribution of German professional players' birthdays and that of the general population of Germany.

X: the month
Y: the difference between actual and expected numbers of professional players

  Month   Actual players   Expected players   Difference
  1       37               28.27              8.73
  2       33               27.38              5.62
  3       40               26.26              13.74
  4       25               27.60              -2.60
  5       29               29.16              -0.16
  6       33               30.05              2.95
  7       28               31.38              -3.38
  8       25               31.83              -6.83
  9       25               31.16              -6.16
  10      23               30.71              -7.71
  11      30               30.93              -0.93
  12      27               30.27              -3.27
117
① Rank X and Y
② Compute the differences d_i = r_x − r_y
③ Calculate r_s

  X: month   Y: difference   r_x   r_y   d_i
  1          8.73            1     11    -10
  2          5.62            2     10    -8
  3          13.74           3     12    -9
  4          -2.60           4     6     -2
  5          -0.16           5     8     -3
  6          2.95            6     9     -3
  7          -3.38           7     4     3
  8          -6.83           8     2     6
  9          -6.16           9     3     6
  10         -7.71           10    1     9
  11         -0.93           11    7     4
  12         -3.27           12    5     7

Σ d_i² = (−10)² + … + 7² = 494

r_s = 1 − 6 Σ d_i² / [n(n² − 1)] = 1 − 6·494/[12(12² − 1)] = −0.727

A negative correlation.
118
④ Test significance

Hypotheses: H0: ρ_s = 0; Ha: ρ_s ≠ 0

Table C.13 critical value = 0.587, for n = 12 and α = 0.05.

Since |r_s| = 0.727 > 0.587, reject H0.

Inference: there is an excess in the number of players born early in the competition year and a lack of those born late among professional soccer players in Germany.
119
Compare the two coefficients
Kendall's correlation coefficient τ: easy to test for a significant difference from zero.
Spearman's rank correlation coefficient r_s: more common; related to Pearson's correlation coefficient.

The two produce no radically different correlation values.
120
Comparing different regressions

A common situation is to compare your regression with a published one, or to compare the regressions in two or more experiments you have done.

Two linear regression equations:

Ŷ1 = a1 + b1 X1
Ŷ2 = a2 + b2 X2
121
Compare intercepts: using a t test

Hypothesis: H0: α1 = α2, i.e. α1 − α2 = 0

Standard error for a1 − a2:

s_{a1−a2} = √{ s²_{Y·X} [1/n1 + 1/n2 + X̄1²/SS_X(1) + X̄2²/SS_X(2)] },  SS_X(k) = Σ(X_k − X̄_k)²

with the pooled residual variance

s²_{Y·X} = [s²_{Y·X(1)}(n1 − 2) + s²_{Y·X(2)}(n2 − 2)] / [(n1 − 2) + (n2 − 2)]

t = (a1 − a2)/s_{a1−a2} follows the t distribution with df = (n1 − 2) + (n2 − 2)
122
Compare slopes: using a t test

Hypothesis: H0: β1 = β2, i.e. β1 − β2 = 0

Standard error for b1 − b2:

s_{b1−b2} = √{ s²_{Y·X} [1/SS_X(1) + 1/SS_X(2)] }

with the same pooled residual variance s²_{Y·X} as above.

t = (b1 − b2)/s_{b1−b2} follows the t distribution with df = (n1 − 2) + (n2 − 2)
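A Python sketch of the slope comparison, taking each experiment's summary quantities as inputs (names are ours; ss_x is Σ(X − X̄)² and ss_e the residual sum of squares SS_E):

```python
import math

def compare_slopes(b1, ss_x1, ss_e1, n1, b2, ss_x2, ss_e2, n2):
    """t statistic for H0: beta1 = beta2, pooling the two residual variances."""
    df = (n1 - 2) + (n2 - 2)
    s2 = (ss_e1 + ss_e2) / df                       # pooled s^2_{Y.X}
    se = math.sqrt(s2 * (1 / ss_x1 + 1 / ss_x2))    # SE of b1 - b2
    return (b1 - b2) / se, df   # compare |t| with t_{1-alpha/2}(df)
```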
123
Combining data from two experiments

If the data for the two experiments are not significantly different, they can be combined. The combined estimate of b is

b = [SS_XY(1) + SS_XY(2)] / [SS_X(1) + SS_X(2)]

and its standard error is

s_b = √{ [SS_E(1) + SS_E(2)] / [(n1 − 2) + (n2 − 2)] / [SS_X(1) + SS_X(2)] }

Ideally, one should combine the original data.
124
Two experiments

(Plots of Expt. 1 and Expt. 2 on common axes:)
Same slopes, different means.
Different slopes; the means may or may not differ.
125
Compare two different correlations

Use Fisher's Z transformation. Hypothesis: H0: ρ1 = ρ2, i.e. ζ1 = ζ2

Variance for Z1 − Z2:

σ²_{Z1−Z2} = 1/(n1 − 3) + 1/(n2 − 3)

Test statistic:

u = [(Z1 − Z2) − (ζ1 − ζ2)] / √[1/(n1 − 3) + 1/(n2 − 3)] ~ N(0, 1)
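A matching Python sketch (names are ours):

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """u statistic for H0: rho1 = rho2, via Fisher's Z; compare with N(0, 1)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se   # e.g. |u| > 1.960 rejects H0 at alpha = 0.05
```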