lecture 1 correlation

Methodology and Statistics | University of Maastricht © Bjorn Winkens 2008

Statistics: part 2
Regression Analysis and SPSS

Correlation (syllabus chapter 2)

Bjorn Winkens
Methodology and Statistics

University of Maastricht
[email protected]

11 April 2008

2

Content

• Covariance and correlation
• Pearson correlation coefficient
• Tests and confidence interval for correlations
• Spearman correlation
• Pitfalls

3

Association

Study goal = examine the association between two variables

Some questions arise:
• What measure of association should we use?
• Is there a positive or negative association?
• Is there a linear association?
• Is there a significant association?

4

Covariance (1)
= measure of how much two random variables vary together

• difference with variance?
• formula:

$$\operatorname{cov}(X,Y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

• Note: cov(X,X) = var(X)

5

Covariance (2)
Example:
• X = height, Y = weight
• Positive or negative covariance?
• $\bar{x} = 181$ cm, $\bar{y} = 76.5$ kg
• Cov(X,Y) = 35.0
  – Positive association
  – Strong or weak?
• X* = height in meters:
  – Cov(X*,Y) = 0.35

[Figure: scatterplot of weight (kg) against height (cm), with the quadrants around the means marked + and -]
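The covariance formula can be sketched in a few lines of Python. The heights and weights below are made up for illustration (they are not the lecture data), and the function name `sample_cov` is likewise an assumption. The sketch shows why the size of a covariance says little about strength: expressing height in meters instead of centimeters divides the covariance by 100.

```python
def sample_cov(x, y):
    """Sample covariance: sum((x_i - mean_x) * (y_i - mean_y)) / (n - 1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Hypothetical data, for illustration only
height_cm = [165, 172, 178, 181, 186, 193]
weight_kg = [62, 70, 68, 80, 77, 90]
height_m = [h / 100 for h in height_cm]

cov_cm = sample_cov(height_cm, weight_kg)   # height in cm
cov_m = sample_cov(height_m, weight_kg)     # same data, height in m
var_height = sample_cov(height_cm, height_cm)  # cov(X, X) is the variance of X

print(cov_cm, cov_m, var_height)
```

The sign of the covariance is meaningful, but its magnitude depends on the measurement units.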

6

Correlation (1)

= measure of linear association between two random variables

• Notation:
  – population: ρ (rho)
  – sample: r
• Can take any value from -1 to 1
• Closer to -1: stronger negative association
• Closer to +1: stronger positive association

7

Correlation (2)
• Pearson’s correlation coefficient

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}} = \frac{\operatorname{cov}(X,Y)}{s_X s_Y}$$

• No dimension
• Invariant under linear transformations

Example (X, X* = height (cm; m), Y = weight (kg)):
• Corr(X,Y) = r = 0.38
• Corr(X*,Y) = r = 0.38
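A minimal Python sketch of Pearson's r, again on made-up heights and weights (the data and the name `pearson_r` are assumptions for illustration). It checks that changing the unit of height leaves r unchanged. Note that a linear transformation with a negative scale factor would flip the sign of r.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: cross-products divided by the root of the product of sums of squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only
height_cm = [165, 172, 178, 181, 186, 193]
weight_kg = [62, 70, 68, 80, 77, 90]
height_m = [h / 100 for h in height_cm]

r_cm = pearson_r(height_cm, weight_kg)
r_m = pearson_r(height_m, weight_kg)
print(round(r_cm, 3), round(r_m, 3))  # the two values agree
```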

8

Practical examples (1)

[Figure: FEV (l) against height (in). Strong positive correlation, r = 0.9]
[Figure: serum cholesterol (mg/dL) against dietary intake of cholesterol. Weak positive correlation, r = 0.3]

9


[Figure: FEV (l) against number of cigarettes per day. Weak negative correlation, r = -0.2]
[Figure: Difficulty Numerical Task (DNT) against caffeine. Correlation r = 0.0. No association?]
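A correlation of 0.0 does not by itself mean "no association". The sketch below uses an invented inverted-U pattern (not the caffeine data) in which Y is completely determined by X, yet Pearson's r is zero because the relation is not linear. The data and the helper name `pearson_r` are assumptions.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical inverted-U relation: y rises and then falls as x increases
x = [-3, -2, -1, 0, 1, 2, 3]
y = [9 - v ** 2 for v in x]

r = pearson_r(x, y)
print(r)  # essentially zero despite the perfect (nonlinear) dependence
```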

10


[Figure: scatterplot of Y against X]

• r = 0.6
• straight line appropriate?

Always check linearity by a (scatter)plot!

11

Size does not matter, shape is important

12

Test for correlation coefficient(s)

1. One-sample t-test: H0: ρ = 0

2. One-sample z-test: H0: ρ = ρ0
• Fisher’s z-transformation
• Confidence interval for ρ

3. Two-sample z-test: H0: ρ1 = ρ2 (independent samples)

13

One-sample t-test: H0: ρ = 0
Example: Is there a correlation between serum-cholesterol levels in spouses?

• X = serum-cholesterol husband (normally distributed)
• Y = serum-cholesterol wife (normally distributed)
• H0: ρ = 0, H1: ρ ≠ 0
• t-test:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

t-distributed with df = n-2 when H0: ρ = 0 is true

14

Example: serum-cholesterol (1)

• n = 100 spouse pairs
• Pearson’s correlation coefficient r = 0.25
• Is this correlation large enough to reject H0: ρ = 0?
• t-test:

$$t = \frac{0.25\sqrt{100-2}}{\sqrt{1-0.25^2}} = 2.56$$

• Conclusion?

15

Example: serum-cholesterol (2)

• Two-sided p-value: p = 2*0.006 = 0.012
• Conclusion?

[Figure: t98 distribution with both tails marked: P(t98 ≤ -2.56) = 0.006 and P(t98 ≥ 2.56) = 0.006]
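The serum-cholesterol test can be reproduced with the Python standard library. This is a sketch under stated assumptions: the function names are invented, and the t-distribution tail area is obtained by numerically integrating the density with Simpson's rule rather than by a statistics package.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the area between -|t| and |t| (Simpson's rule)."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps  # steps must be even for Simpson's rule
    area = t_pdf(0, df) + t_pdf(b, df)
    for k in range(1, steps):
        area += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area *= h / 3  # area between 0 and |t|
    return 1.0 - 2.0 * area

def corr_t_test(r, n):
    """t statistic and two-sided p-value for H0: rho = 0."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    return t, t_two_sided_p(t, n - 2)

t_stat, p_value = corr_t_test(0.25, 100)
print(round(t_stat, 2), round(p_value, 3))
```

The result should agree with the slide values (t = 2.56, two-sided p of about 0.012) up to rounding.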

16

Be aware!

• Significance depends on sample size:

n      Significant (α = 0.05) if r ≥
10     0.63
20     0.44
50     0.28
100    0.20
200    0.14
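The thresholds in this table can be recovered numerically. The sketch below searches, by bisection, for the smallest r whose two-sided t-test p-value drops to α = 0.05. The function names and the use of Simpson's rule for the t-distribution are assumptions; a statistics package would normally supply the critical value directly.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the interval [0, |t|]."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    area = t_pdf(0, df) + t_pdf(b, df)
    for k in range(1, steps):
        area += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 1.0 - 2.0 * area * h / 3

def critical_r(n, alpha=0.05):
    """Smallest positive r that is significant at level alpha (two-sided)."""
    lo, hi = 0.0, 0.999
    for _ in range(60):
        mid = (lo + hi) / 2
        t = mid * math.sqrt(n - 2) / math.sqrt(1 - mid * mid)
        if t_two_sided_p(t, n - 2) > alpha:
            lo = mid  # not yet significant: a larger r is needed
        else:
            hi = mid
    return hi

for n in (10, 20, 50, 100, 200):
    print(n, round(critical_r(n), 2))
```

With large samples even a weak correlation becomes significant, so significance should not be read as strength.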

17

Example: Estriol – SPSS (1)

Example:
Is there an association between estriol level and birthweight?

Sample: n = 31

[Figure: scatterplot of birthweight (g) against estriol (mg/24 hr)]

18

Example: Estriol – SPSS (2)
SPSS:

Correlations
                                    Estriol    Birthweight
Estriol      Pearson Correlation    1          .610**
             Sig. (2-tailed)        .          .000
             N                      31         31
Birthweight  Pearson Correlation    .610**     1
             Sig. (2-tailed)        .000       .
             N                      31         31
**. Correlation is significant at the 0.01 level (2-tailed).

Conclusion?
H0: ρ = 0.1?
H0: ρ = 0.3?

19






20

One-sample z-test: H0: ρ = ρ0 (1)
• If ρ0 ≠ 0, r has a skewed distribution
  – e.g. H0: ρ = 0.5: more “room” for deviation below 0.5 than above 0.5
  – previous t-test for correlations is invalid!
• Solution: Fisher’s z-transformation of r

$$z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$$

• ln = natural logarithm (base = e = 2.718)

21

Fisher’s z-transformation

[Figure: graph of Fisher’s z-transformation with marked points: r = 0.05 → z = 0.05; r = 0.8 → z = 1.1; r = -0.8 → z = -1.1]
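A short Python sketch of Fisher's z-transformation, evaluated at the three r values shown on this slide. The function name `fisher_z` is an assumption. The transformation is the same function as the inverse hyperbolic tangent, `math.atanh`.

```python
import math

def fisher_z(r):
    """Fisher's z-transformation: 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

for r in (0.05, 0.8, -0.8):
    print(r, round(fisher_z(r), 2))
```

Near r = 0 the transformation changes little, while values close to -1 or +1 are stretched out, which removes most of the skewness.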

22

One-sample z-test: H0: ρ = ρ0 (2)

• z is approximately normally distributed under H0 with mean

$$z_0 = \frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right)$$

and variance 1/(n-3)

• Equivalently, λ = (z – z0)√(n-3) ~ N(0,1)

23

One-sample z-test: H0: ρ = ρ0 (3)

In conclusion:
• H0: ρ = ρ0 (≠ 0); H1: ρ ≠ ρ0
• Compute sample correlation coefficient r
• Transform r and ρ0 to z and z0, respectively, using Fisher’s z-transformation
• Compute test statistic λ = (z – z0)√(n-3)
• Compute p-value (λ ~ N(0,1))
• Not (yet) available in SPSS!!!

24

Example: Body weight (1)
Research question:
Does the association between the body weights of father and son differ between biological and non-biological fathers?
Previous research:
A correlation of 0.1 is expected based on previous research with sons and non-biological fathers
Sample:
• n = 100 biological fathers and sons
• Pearson’s correlation coefficient r = 0.38

25

Example: Body weight (2)

• H0: ρ = ρ0 = 0.10; H1: ρ ≠ 0.10

• r = 0.38 → z = 0.5*ln(1.38/0.62) = 0.40
• ρ0 = 0.10 → z0 = 0.5*ln(1.10/0.90) = 0.10
• λ = (0.40 – 0.10)*√(100 – 3) = 2.955
• p-value = 0.0031
• Conclusion?
• Confidence interval?
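Since this test is not available in SPSS, it can be sketched in Python with the standard library. The function name is an assumption. The code uses unrounded z-scores, so λ and the p-value differ slightly in the last digits from the slide, which rounds z and z0 to two decimals first.

```python
import math
from statistics import NormalDist

def one_sample_corr_z_test(r, rho0, n):
    """Test H0: rho = rho0 using Fisher's z-transformation."""
    z = math.atanh(r)       # Fisher z-score of the sample correlation
    z0 = math.atanh(rho0)   # Fisher z-score of the hypothesized correlation
    lam = (z - z0) * math.sqrt(n - 3)
    p = 2 * (1 - NormalDist().cdf(abs(lam)))  # two-sided p-value
    return lam, p

lam, p = one_sample_corr_z_test(r=0.38, rho0=0.10, n=100)
print(round(lam, 2), round(p, 4))
```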

26

Confidence interval for ρ

Step 1: compute sample correlation r
Step 2: transform r to a Fisher z-score (z)
Step 3: compute a 100(1 - α)% CI for zρ:

$$z_1 = z - \frac{z_{1-\alpha/2}}{\sqrt{n-3}}, \qquad z_2 = z + \frac{z_{1-\alpha/2}}{\sqrt{n-3}}$$

Step 4: transform this CI to CI for ρ:

$$\rho_1 = \frac{e^{2z_1} - 1}{e^{2z_1} + 1}, \qquad \rho_2 = \frac{e^{2z_2} - 1}{e^{2z_2} + 1}$$

27

Example: Body weight (3)
95% confidence interval for ρ:
Step 1: sample correlation r = 0.38 (n = 100)
Step 2: z = 0.5*ln(1.38/0.62) = 0.40
Step 3:
z1 = 0.40 – 1.96/√97 = 0.20
z2 = 0.40 + 1.96/√97 = 0.60
Step 4:
ρ1 = (e^(2*0.2) – 1)/(e^(2*0.2) + 1) = 0.20
ρ2 = (e^(2*0.6) – 1)/(e^(2*0.6) + 1) = 0.54
Conclusion?
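The four steps can be sketched in Python as follows. The function name is an assumption. The back-transformation (e^(2z) – 1)/(e^(2z) + 1) is the hyperbolic tangent, so `math.tanh` is used for step 4 and `math.atanh` for step 2.

```python
import math
from statistics import NormalDist

def corr_confidence_interval(r, n, level=0.95):
    """Confidence interval for rho via Fisher's z-transformation."""
    z = math.atanh(r)                                    # step 2
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)   # 1.96 for 95%
    half_width = z_crit / math.sqrt(n - 3)               # step 3
    return math.tanh(z - half_width), math.tanh(z + half_width)  # step 4

lower, upper = corr_confidence_interval(0.38, 100)
print(round(lower, 2), round(upper, 2))
```

The interval is not symmetric around r = 0.38, which reflects the skewed distribution of r.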

28

Example: Body weight (4)
95% CI for ρ:
1. Compute r: r = 0.38
2. Transform to z-score: z = 0.40
3. Compute CI for zρ: (0.20; 0.60)
4. Transform CI for zρ back to CI for ρ: (0.20; 0.54)

29






30

Example: Body-weight (1) – different design –

Research question:
Does the association between the body weights of father and son differ between biological and non-biological fathers?

No previous research

Two samples:
• First group (biological): n1 = 100; r1 = 0.38
• Second group (non-biological): n2 = 50; r2 = 0.10

31

Two-sample z-test: H0: ρ1 = ρ2

• Samples:
  – group 1: sample size n1, correlation r1
  – group 2: sample size n2, correlation r2
• Fisher’s z-scores: z1, z2
• Test statistic:

$$\lambda = \frac{z_1 - z_2}{\sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}}$$

• λ is approximately N(0,1)-distributed under H0
• Compare with one-sample z-test (sheet 22)

32

Example: Body-weight (2) – different design –

• Samples:
  – Group 1 (biological): n1 = 100; r1 = 0.38
  – Group 2 (non-biological): n2 = 50; r2 = 0.10
• Fisher’s transformation:
  – z1 = 0.5*ln(1.38/0.62) = 0.40
  – z2 = 0.5*ln(1.10/0.90) = 0.10
• Test statistic:

$$\lambda = \frac{0.40 - 0.10}{\sqrt{\frac{1}{97} + \frac{1}{47}}} = 1.69$$

• p-value = 0.091
• Conclusion?
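A Python sketch of the two-sample z-test for this example, using only the standard library. The function name is an assumption, and the z-scores are left unrounded, so the last digit of the p-value may differ slightly from the slide.

```python
import math
from statistics import NormalDist

def two_sample_corr_z_test(r1, n1, r2, n2):
    """Test H0: rho1 = rho2 for two independent samples."""
    z1, z2 = math.atanh(r1), math.atanh(r2)         # Fisher z-scores
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))     # standard error of z1 - z2
    lam = (z1 - z2) / se
    p = 2 * (1 - NormalDist().cdf(abs(lam)))        # two-sided p-value
    return lam, p

lam, p = two_sample_corr_z_test(r1=0.38, n1=100, r2=0.10, n2=50)
print(round(lam, 2), round(p, 3))
```

The same difference in correlations (0.38 versus 0.10) was significant in the one-sample design but is not significant at α = 0.05 here, because the second correlation is now an estimate with its own sampling error.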

33

Rank correlation (1)
• So far we assumed that X and Y are normally distributed

• If X and/or Y are either ordinal or have a distribution far from normal (due to outliers), then significance tests based on the Pearson correlation coefficient are no longer valid

• A non-parametric alternative should then be used. For example, a test based on the Spearman rank correlation coefficient

34

Rank correlation (2)

Spearman’s rank correlation coefficient:
= Pearson’s correlation coefficient based on the ranks of X and Y
• Less sensitive to outliers; more general association (not specifically linear)
• n ≥ 10 (or 30): similar tests and CI as for Pearson correlation
• n < 10 (or 30): exact significance levels can be found in table
• Many ties (same value): use Kendall’s Tau
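The definition "Pearson's correlation on the ranks" translates directly into code. In this sketch the helper names and the data are assumptions: the data are monotone but not linear and contain one extreme value, so Spearman's coefficient equals 1 while Pearson's coefficient is pulled well below 1. Tied values receive the average of the ranks they occupy.

```python
import math

def average_ranks(values):
    """Ranks 1..n; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: increasing but nonlinear, with one extreme y value
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 8, 27, 64, 125, 216, 343, 5000]

pearson_value = pearson_r(x, y)
spearman_value = spearman_r(x, y)
print(round(pearson_value, 2), round(spearman_value, 2))
```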

35

Normality check (1)
• Use pp-plots and histograms to check normality (symmetry)
• Problem with (significance) tests for normality:
  – Small sample size: no or little power to detect discrepancy from normality
  – Medium or large sample size: no or small impact due to central limit theorem
• Data skewed (outliers) & small sample size → data transformation

36

Normality check (2)
Be aware: significance depends on sample size!

[Figure: two histograms of Outcome (values 1 to 6) against Frequency, one with frequencies up to 1.5 and one with frequencies up to 6; Shapiro-Wilk: p = 0.961 and p = 0.039]

37

Example: Apgar scores

• Apgar score (physical condition) at 1 and 5 minutes for 24 newborns
• Minimal score = 0; maximal score = 10
• Spearman rank correlation = 0.593 (Pearson’s correlation = 0.845)
• t-test for Spearman rank correlation: t = 3.45, df = 24 – 2 = 22 → p-value < 0.01
• Conclusion?
• Remarks?

38

Pitfalls
• Spurious correlations
• No measurement of agreement
• Change scores (Y-X) always related to baseline X (“regression to the mean”)
• Dependent pairs of observations (xi, yi)
• …

Note:
• No mathematical problem
• Interpretation is incorrect
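The change-score pitfall can be illustrated with a small simulation. This is a sketch under strong assumptions: baseline X and follow-up Y are drawn independently from the same normal distribution, so there is no true change and no true relation. Even so, the change score Y-X correlates clearly negatively with the baseline X (in theory -1/√2, about -0.71, in this setup). The seed, sample size, and names are arbitrary choices.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)  # fixed seed so the run is reproducible
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]  # baseline
y = [random.gauss(0, 1) for _ in range(n)]  # follow-up, independent of baseline
change = [b - a for a, b in zip(x, y)]      # change score Y - X

r_xy = pearson_r(x, y)
r_change = pearson_r(x, change)
print(round(r_xy, 2), round(r_change, 2))
```

The computation is mathematically correct; the mistake would be to interpret the negative correlation as an effect of the baseline value on the change.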

39

Dependent pairs of observations
• Association between study duration and grade
• Plot 1: dependency ignored → negative association
• Plot 2: dependency taken into account (data from same subject connected) → positive association

[Figure: two scatterplots of grade (3 to 10) against study duration (4 to 7); in the second plot the two observations of each student are connected]

Students were measured twice!!!
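The reversal between the two plots can be mimicked with invented numbers (three hypothetical students, each measured twice; these are not the data behind the plots). Pooling all six points as if they were independent gives a negative correlation, while within every student the grade goes up with longer study duration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: (study duration, grade), two measurements per student
students = {
    "A": [(4, 8.0), (5, 9.0)],
    "B": [(5, 6.5), (6, 7.5)],
    "C": [(6, 5.0), (7, 6.0)],
}

# Plot 1: dependency ignored, all six points pooled
durations = [d for pairs in students.values() for d, _ in pairs]
grades = [g for pairs in students.values() for _, g in pairs]
pooled_r = pearson_r(durations, grades)

# Plot 2: dependency taken into account, change in grade per extra unit of study
within_slopes = {
    name: (pairs[1][1] - pairs[0][1]) / (pairs[1][0] - pairs[0][0])
    for name, pairs in students.items()
}

print(round(pooled_r, 2))  # negative
print(within_slopes)       # positive for every student
```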

40

Relation between two variables

Three main purposes:
• Association
  – Pearson or Spearman correlation coefficient
• Agreement (same quantity: X = Y)
  – Method of Bland and Altman (Lancet, 1986)
• Prediction
  – Regression analysis

41

QUESTIONS?
