Multivariate Data Summary

Upload: raymond-rice

Post on 30-Dec-2015


Page 1: Multivariate Data Summary

Multivariate Data Summary

Page 2: Multivariate Data Summary

Linear Regression and Correlation

Page 3: Multivariate Data Summary

Pearson’s correlation coefficient r.

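$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\;\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$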
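A minimal Python sketch of the formula for r, assuming the data arrive as two equal-length lists. The function name `pearson_r` and the sample values are made up for illustration and are not from the slides.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r computed as S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)

# Illustrative data only (hypothetical, not from the slides)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

For these values S_xy = 6, S_xx = 10 and S_yy = 6, so r = 6 / sqrt(60), roughly 0.77.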

Page 4: Multivariate Data Summary

Slope and Intercept of the Least Squares line

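Slope:

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$

Intercept:

$$a = \bar{y} - b\,\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x}$$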
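The slope and intercept can be sketched in Python as follows. The helper name `least_squares_line` and the data are assumptions chosen for illustration; only the formulas b = S_xy / S_xx and a = ȳ − b·x̄ come from the slide.

```python
def least_squares_line(x, y):
    """Return (a, b): intercept and slope of the least squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Illustrative data only (hypothetical, not from the slides)
a, b = least_squares_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"y = {a:.2f} + {b:.2f} x")
```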

Page 5: Multivariate Data Summary

Scatter Plot Patterns

[Figure: four scatter plots, with panels labelled r = 0.0, r = +0.7, r = +0.9 and r = +1.0]

Circular

• No relationship between X and Y

• Unable to predict Y from X

Ellipsoidal

• Positive relationship between X and Y

• Increases in X correspond to increases in Y (but not always)

• Major axis of the ellipse has positive slope

Page 6: Multivariate Data Summary

[Figure: three scatter plots, with panels labelled r = -0.7, r = -0.9 and r = -1.0]

Ellipsoidal

• Negative relationship between X and Y

• Increases in X correspond to decreases in Y (but not always)

• Major axis of the ellipse has negative slope

Page 7: Multivariate Data Summary

Non-Linear Patterns

[Figure: two scatter plots showing non-linear (curved) patterns]

If the pattern is non-linear, r can take any value between -1 and +1, depending on how well a straight line can be fitted to the pattern.
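A small Python sketch of this point, using the made-up curve y = x² (an assumption for illustration, not data from the slides). The same perfectly determined curve gives r = 0 when x is symmetric about zero and r close to +1 when only the rising half is observed.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r computed as S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)

# y depends exactly on x in both cases, but not linearly
x_sym = [-3, -2, -1, 0, 1, 2, 3]
r_sym = pearson_r(x_sym, [xi ** 2 for xi in x_sym])

x_half = [0, 1, 2, 3, 4, 5, 6]
r_half = pearson_r(x_half, [xi ** 2 for xi in x_half])

print(f"symmetric range: r = {r_sym:.2f}")
print(f"rising half only: r = {r_half:.2f}")
```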

Page 8: Multivariate Data Summary

The Coefficient of Determination

$$r^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$$

Page 9: Multivariate Data Summary

An important Identity in Statistics

(Total variability in Y) = (variability in Y explained by X) + (variability in Y unexplained by X)

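$$\sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2 + \sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$

$$SS_{\text{Total}} = SS_{\text{Explained}} + SS_{\text{Unexplained}}$$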

Page 10: Multivariate Data Summary

It can also be shown:

$$r^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$$

= proportion of variability in Y explained by X

= the coefficient of determination
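The decomposition of the total variability and its link to r² can be checked numerically. The sketch below uses made-up data (an assumption, not from the slides) and fits the least squares line with the slope and intercept formulas given earlier.

```python
from math import sqrt

# Illustrative data only (hypothetical, not from the slides)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b = s_xy / s_xx                      # slope
a = y_bar - b * x_bar                # intercept
y_hat = [a + b * xi for xi in x]     # fitted values

ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_explained = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r = s_xy / sqrt(s_xx * s_yy)
r_squared = ss_explained / ss_total

print(f"SS_Total = {ss_total:.2f}")
print(f"SS_Explained + SS_Unexplained = {ss_explained + ss_unexplained:.2f}")
print(f"r^2 = {r ** 2:.2f}, SS_Explained / SS_Total = {r_squared:.2f}")
```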

Page 11: Multivariate Data Summary

Categorical Data

Techniques for summarizing, displaying and graphing

Page 12: Multivariate Data Summary

The frequency table

The bar graph

Suppose we have collected data on a categorical variable X having k categories: 1, 2, … , k.

To construct the frequency table we simply count, for each category i of X, the number of cases falling in that category (f_i).

To plot the bar graph we simply draw a bar of height f_i above each category i of X.
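A minimal Python sketch of both steps, assuming the raw data are a list of category codes. The data values and the function name `frequency_table` are hypothetical; the bar graph is drawn as text so the example needs no plotting library.

```python
from collections import Counter

def frequency_table(values):
    """Count the number of cases f_i falling in each category i."""
    counts = Counter(values)
    return {category: counts[category] for category in sorted(counts)}

# Illustrative data only: a categorical variable X with k = 3 categories
data = [1, 3, 2, 1, 3, 3, 2, 1, 1, 3]
table = frequency_table(data)

# Text bar graph: one bar of length f_i per category
for category, f in table.items():
    print(f"{category}: {'#' * f} ({f})")
```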

Page 13: Multivariate Data Summary

Example

In this example data were collected for n = 34,118 subjects.

• The purpose of the study was to determine the relationship between the use of Antidepressants, Mood medication, Anxiety medication, Stimulants and Sleeping pills.

• In addition, the study was interested in examining the effects of the independent variables (gender, age, income, education and role) on both individual use of the medications and multiple use of the medications.

Page 14: Multivariate Data Summary

The variables were:

1. Antidepressant use
2. Mood medication use
3. Anxiety medication use
4. Stimulant use
5. Sleeping pills use
6. Gender
7. Age
8. Income
9. Education
10. Role:
    i. Parent, worker, partner
    ii. Parent, partner
    iii. Parent, worker
    iv. Worker, partner
    v. Worker only
    vi. Parent only
    vii. Partner only
    viii. No roles

Page 15: Multivariate Data Summary

Frequency Table for Age

Age - (G)

                Frequency   Percent   Valid Percent   Cumulative Percent
Valid  20-29         5349      15.7            15.7                 15.7
       30-39         6758      19.8            19.8                 35.5
       40-49         6420      18.8            18.8                 54.3
       50-59         5528      16.2            16.2                 70.5
       60-69         4400      12.9            12.9                 83.4
       70+           5663      16.6            16.6                100.0
       Total        34118     100.0           100.0
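The Percent and Cumulative Percent columns follow directly from the frequencies. A short Python sketch using the age frequencies from the table (variable names are illustrative):

```python
age_groups = ["20-29", "30-39", "40-49", "50-59", "60-69", "70+"]
frequencies = [5349, 6758, 6420, 5528, 4400, 5663]

n = sum(frequencies)                              # total number of cases
percents = [100 * f / n for f in frequencies]     # Percent column

cumulative = []                                   # Cumulative Percent column
running = 0
for f in frequencies:
    running += f
    cumulative.append(100 * running / n)

for group, f, p, c in zip(age_groups, frequencies, percents, cumulative):
    print(f"{group:>6} {f:>6} {p:>6.1f} {c:>6.1f}")
print(f"{'Total':>6} {n:>6}")
```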

Page 16: Multivariate Data Summary

Bar Graph for Age

[Figure: bar graph of Count (vertical axis, 0 to 7,000) against Age - (G) categories 20-29, 30-39, 40-49, 50-59, 60-69 and 70+]

Page 17: Multivariate Data Summary

Frequency Table for Role

role

                                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid    parent, partner, worker          6614      19.4            24.5                 24.5
         parent, partner                  1068       3.1             4.0                 28.5
         parent, worker                   1351       4.0             5.0                 33.5
         partner, worker                  5427      15.9            20.1                 53.6
         worker only                      5711      16.7            21.2                 74.7
         parent only                       456       1.3             1.7                 76.4
         partner only                     3262       9.6            12.1                 88.5
         no roles                         3097       9.1            11.5                100.0
         Total                           26986      79.1           100.0
Missing  System                           7132      20.9
Total                                    34118     100.0

Page 18: Multivariate Data Summary

Bar Graph for Role

[Figure: bar graph of Count (vertical axis, 0 to 7,000) against role categories: parent, partner, worker; parent, partner; parent, worker; partner, worker; worker only; parent only; partner only; no roles]

Page 19: Multivariate Data Summary

The pie chart

• An alternative to the bar chart

• Draw a circle (a pie)

• Divide the circle into segments, with the area of each segment proportional to f_i or p_i = f_i / n

Page 20: Multivariate Data Summary

Example

• In this study the population consists of individuals who received a head injury (n = 22540).

• The variable is the mechanism that caused the head injury (InjMech), with categories:

– MVA (Motor vehicle accident)

– Falls

– Violence

– Other VA (Other vehicle accidents)

– Accidents (industrial accident)

– Other (all other mechanisms for head injury)

Page 21: Multivariate Data Summary

Graphical and Tabular Display of Categorical Data.

• The frequency table

• The bar graph

• The pie chart

Page 22: Multivariate Data Summary

The frequency table

InjMech

                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid  Accidents         565       2.5             2.5                  2.5
       Falls            4875      21.6            21.6                 24.1
       MVA             13565      60.2            60.2                 84.3
       other             765       3.4             3.4                 87.7
       other VA         2338      10.4            10.4                 98.1
       Violence          432       1.9             1.9                100.0
       Total           22540     100.0           100.0

Page 23: Multivariate Data Summary

The bar graph

[Figure: bar graph of frequency f (vertical axis, 0 to 14,000) against InjMech categories MVA, Falls, Violence, other VA, Accidents and other; cases weighted by f]

Page 24: Multivariate Data Summary

The pie chart

[Figure: pie chart of InjMech with segments MVA, Falls, Violence, other VA, Accidents and other; cases weighted by f]
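The segment sizes of the pie chart follow from p_i = f_i / n. A short Python sketch using the InjMech frequencies from the frequency table; converting each proportion to an angle out of 360 degrees is one common way to draw the segments and is an assumption about how the chart was produced.

```python
# Frequencies f_i from the InjMech frequency table
frequencies = {
    "MVA": 13565,
    "Falls": 4875,
    "Violence": 432,
    "other VA": 2338,
    "Accidents": 565,
    "other": 765,
}

n = sum(frequencies.values())
proportions = {k: f / n for k, f in frequencies.items()}   # p_i = f_i / n
angles = {k: 360 * p for k, p in proportions.items()}      # segment angle in degrees

for k in frequencies:
    print(f"{k:>10}: p = {proportions[k]:.3f}, angle = {angles[k]:.1f} degrees")
```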

Page 25: Multivariate Data Summary

Multivariate Categorical Data

Page 26: Multivariate Data Summary

The two-way frequency table

The χ² statistic

Techniques for examining dependence amongst two categorical variables

Page 27: Multivariate Data Summary

Situation

• We have two categorical variables R and C.

• The number of categories of R is r.

• The number of categories of C is c.

• We observe n subjects from the population and count x_ij = the number of subjects for which R = i and C = j.

• R = rows, C = columns

Page 28: Multivariate Data Summary

Example

Both Systolic Blood pressure (C) and Serum Cholesterol (R) were measured for a sample of n = 1237 subjects.

The categories for Blood Pressure are:

<127 127-146 147-166 167+

The categories for Cholesterol are:

<200 200-219 220-259 260+

Page 29: Multivariate Data Summary

Table: two-way frequency

Serum          Systolic Blood pressure
Cholesterol     <127   127-146   147-166   167+   Total
<200             117       121        47     22     307
200-219           85        98        43     20     246
220-259          119       209        68     43     439
260+              67        99        46     33     245
Total            388       527       204    118    1237

Page 30: Multivariate Data Summary

Example

This comes from the drug use data.

The two variables are:

1. Age (C) and

2. Antidepressant Use (R)

measured for a sample of n = 33,957 subjects.

Page 31: Multivariate Data Summary

Two-way Frequency Table

Took anti-depressants - 12 mo * Age - (G) Crosstabulation

Count

Took anti-depressants            Age - (G)
- 12 mo                  20-29   30-39   40-49   50-59   60-69     70+   Total
YES                        322     523     570     522     265     249    2451
NO                        5007    6201    5822    4982    4114    5380   31506
Total                     5329    6724    6392    5504    4379    5629   33957

Percentage antidepressant use vs Age

Age - (G)                20-29   30-39   40-49   50-59   60-69     70+
Percentage use           6.04%   7.78%   8.92%   9.48%   6.05%   4.42%
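The percentages are the YES counts divided by the column totals. A short Python sketch using the counts from the crosstabulation (variable names are illustrative):

```python
age_groups = ["20-29", "30-39", "40-49", "50-59", "60-69", "70+"]
yes = [322, 523, 570, 522, 265, 249]          # took antidepressants
no = [5007, 6201, 5822, 4982, 4114, 5380]     # did not

# Column totals and percentage use within each age group
totals = [y + m for y, m in zip(yes, no)]
percent_use = [100 * y / t for y, t in zip(yes, totals)]

for group, p in zip(age_groups, percent_use):
    print(f"{group:>6}: {p:.2f}%")
```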

Page 32: Multivariate Data Summary

Antidepressant Use vs Age

[Figure: bar chart of percentage antidepressant use (vertical axis, 0.0% to 10.0%) by age group 20-29, 30-39, 40-49, 50-59, 60-69 and 70+]

Page 33: Multivariate Data Summary

The χ² statistic for measuring dependence amongst two categorical variables

Define

$$R_i = \sum_{j=1}^{c} x_{ij} = i^{\text{th}} \text{ row total}$$

$$C_j = \sum_{i=1}^{r} x_{ij} = j^{\text{th}} \text{ column total}$$

$$E_{ij} = \frac{R_i C_j}{n}$$

= Expected frequency in the (i,j)th cell in the case of independence.

Page 34: Multivariate Data Summary

                 Columns
Rows       1     2     3     4     5    Total
1        x11   x12   x13   x14   x15    R1
2        x21   x22   x23   x24   x25    R2
3        x31   x32   x33   x34   x35    R3
4        x41   x42   x43   x44   x45    R4
Total     C1    C2    C3    C4    C5    n

$$R_i = \sum_{j=1}^{c} x_{ij} = i^{\text{th}} \text{ row total}$$

$$C_j = \sum_{i=1}^{r} x_{ij} = j^{\text{th}} \text{ column total}$$

Page 35: Multivariate Data Summary

                 Columns
Rows       1     2     3     4     5    Total
1        E11   E12   E13   E14   E15    R1
2        E21   E22   E23   E24   E25    R2
3        E31   E32   E33   E34   E35    R3
4        E41   E42   E43   E44   E45    R4
Total     C1    C2    C3    C4    C5    n

$$E_{ij} = \frac{R_i C_j}{n}$$
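Because E_ij depends only on the row totals, the column totals and n, the expected frequencies can be computed from the margins alone. A short Python sketch using the margins of the blood pressure and cholesterol table (variable names are illustrative):

```python
# Margins of the serum cholesterol (rows) by systolic blood pressure (columns) table
row_totals = [307, 246, 439, 245]    # R_i
col_totals = [388, 527, 204, 118]    # C_j
n = sum(row_totals)

# E_ij = R_i * C_j / n
expected = [[r_i * c_j / n for c_j in col_totals] for r_i in row_totals]

for row in expected:
    print("  ".join(f"{e:7.2f}" for e in row))
```

Each row of expected frequencies sums back to its row total R_i, and each column to its column total C_j.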

Page 36: Multivariate Data Summary

Justification

If $E_{ij} = \dfrac{R_i C_j}{n}$ then

$$\frac{E_{ij}}{R_i} = \frac{C_j}{n}$$

(Proportion in column j for row i) = (overall proportion in column j)

Rows       1     2     3     4     5    Total
1        E11   E12   E13   E14   E15    R1
2        E21   E22   E23   E24   E25    R2
3        E31   E32   E33   E34   E35    R3
4        E41   E42   E43   E44   E45    R4
Total     C1    C2    C3    C4    C5    n

Page 37: Multivariate Data Summary

and if $E_{ij} = \dfrac{R_i C_j}{n}$ then

$$\frac{E_{ij}}{C_j} = \frac{R_i}{n}$$

(Proportion in row i for column j) = (overall proportion in row i)

Rows       1     2     3     4     5    Total
1        E11   E12   E13   E14   E15    R1
2        E21   E22   E23   E24   E25    R2
3        E31   E32   E33   E34   E35    R3
4        E41   E42   E43   E44   E45    R4
Total     C1    C2    C3    C4    C5    n

Page 38: Multivariate Data Summary

The χ² statistic

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(x_{ij}-E_{ij})^2}{E_{ij}}$$

E_ij = expected frequency in the (i,j)th cell in the case of independence.

x_ij = observed frequency in the (i,j)th cell.
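A minimal Python sketch of the statistic, assuming the two-way table is given as a list of rows. The function name `chi_square` and both small tables are made up for illustration. The second table has identical row proportions, so the statistic is zero.

```python
def chi_square(observed):
    """Sum of (x_ij - E_ij)^2 / E_ij with E_ij = R_i * C_j / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, x_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n
            chi2 += (x_ij - e_ij) ** 2 / e_ij
    return chi2

# Illustrative tables only (hypothetical, not from the slides)
dependent = [[10, 20], [20, 10]]      # row proportions differ
independent = [[10, 20], [30, 60]]    # every row splits 1:2

print(f"dependent:   chi-square = {chi_square(dependent):.3f}")
print(f"independent: chi-square = {chi_square(independent):.3f}")
```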

Page 39: Multivariate Data Summary

Example: studying the relationship between Systolic Blood pressure and Serum Cholesterol

In this example we are interested in whether Systolic Blood pressure and Serum Cholesterol are related or whether they are independent.

Both were measured for a sample of n = 1237 cases

Page 40: Multivariate Data Summary

Observed frequencies

Serum          Systolic Blood pressure
Cholesterol     <127   127-146   147-166   167+   Total
<200             117       121        47     22     307
200-219           85        98        43     20     246
220-259          119       209        68     43     439
260+              67        99        46     33     245
Total            388       527       204    118    1237

Page 41: Multivariate Data Summary

Expected frequencies

Serum          Systolic Blood pressure
Cholesterol      <127   127-146   147-166    167+   Total
<200            96.29    130.79     50.63   29.29     307
200-219         77.16    104.80     40.57   23.47     246
220-259        137.70    187.03     72.40   41.88     439
260+            76.85    104.38     40.40   23.37     245
Total             388       527       204     118    1237

In the case of independence the distribution across a row is the same for each row.

The distribution down a column is the same for each column.

Page 42: Multivariate Data Summary

Table: Expected frequencies, (Observed frequencies), Standardized residuals

Serum          Systolic Blood pressure
Cholesterol      <127   127-146   147-166    167+   Total
<200            96.29    130.79     50.63   29.29     307
                (117)     (121)      (47)    (22)
                 2.11     -0.86     -0.51   -1.35
200-219         77.16    104.80     40.57   23.47     246
                 (85)      (98)      (43)    (20)
                 0.89     -0.66      0.38   -0.72
220-259        137.70    187.03     72.40   41.88     439
                (119)     (209)      (68)    (43)
                -1.59      1.61     -0.52    0.17
260+            76.85    104.38     40.40   23.37     245
                 (67)      (99)      (46)    (33)
                -1.12     -0.53      0.88    1.99
Total             388       527       204     118    1237

χ² = 20.85

$$r_{ij} = \frac{x_{ij}-E_{ij}}{\sqrt{E_{ij}}}$$

Page 43: Multivariate Data Summary

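Standardized residuals

$$r_{ij} = \frac{x_{ij}-E_{ij}}{\sqrt{E_{ij}}}$$

The χ² statistic

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(x_{ij}-E_{ij})^2}{E_{ij}} = \sum_{i=1}^{r}\sum_{j=1}^{c} r_{ij}^2 = 20.85$$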
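The standardized residuals and the χ² value for the blood pressure and cholesterol example can be reproduced with a short Python sketch. The cell counts are the observed frequencies shown in parentheses in the combined table; the function name is illustrative.

```python
from math import sqrt

def standardized_residuals(observed):
    """r_ij = (x_ij - E_ij) / sqrt(E_ij), with E_ij = R_i * C_j / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(observed):
        res_row = []
        for j, x_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n
            res_row.append((x_ij - e_ij) / sqrt(e_ij))
        residuals.append(res_row)
    return residuals

# Rows: serum cholesterol <200, 200-219, 220-259, 260+
# Columns: systolic blood pressure <127, 127-146, 147-166, 167+
observed = [
    [117, 121, 47, 22],
    [85, 98, 43, 20],
    [119, 209, 68, 43],
    [67, 99, 46, 33],
]

residuals = standardized_residuals(observed)
chi2 = sum(r_ij ** 2 for row in residuals for r_ij in row)

for row in residuals:
    print("  ".join(f"{r_ij:6.2f}" for r_ij in row))
print(f"chi-square = {chi2:.2f}")
```

The largest residuals sit in the low cholesterol, low blood pressure cell and in the high cholesterol, high blood pressure cell, which is where the table departs most from independence.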

Page 44: Multivariate Data Summary

Example

This comes from the drug use data.

The two variables are:

1. Role (C) and

2. Antidepressant Use (R)

measured for a sample of n = 26,968 subjects.

Page 45: Multivariate Data Summary

Two-way Frequency Table

Took anti-depressants - 12 mo * role Crosstabulation

Count

role                        YES      NO   Total
parent, partner, worker     344    6268    6612
parent, partner             101     967    1068
parent, worker              201    1150    1351
partner, worker             275    5150    5425
worker only                 455    5249    5704
parent only                  63     392     455
partner only                224    3036    3260
no roles                    414    2679    3093
Total                      2077   24891   26968

Percentage antidepressant use vs Role

role                       Percentage use
parent, partner, worker             5.20%
parent, partner                     9.46%
parent, worker                     14.88%
partner, worker                     5.07%
worker only                         7.98%
parent only                        13.85%
partner only                        6.87%
no roles                           13.39%

Page 46: Multivariate Data Summary

Antidepressant Use vs Role

[Figure: bar chart of percentage antidepressant use (vertical axis, 0.0% to 20.0%) by role: parent, partner, worker; parent, partner; parent, worker; partner, worker; worker only; parent only; partner only; no roles]

χ² = 381.961

Page 47: Multivariate Data Summary

Calculation of χ²

The raw data

               1       2       3       4       5       6       7       8   Total
YES          344     101     201     275     455      63     224     414    2077
NO          6268     967    1150    5150    5249     392    3036    2679   24891
Total       6612    1068    1351    5425    5704     455    3260    3093   26968

Expected frequencies

                   1        2        3        4        5        6        7        8   Total (R_i)
YES           509.24    82.25   104.05   417.82   439.31    35.04   251.08   238.21    2077
NO           6102.76   985.75  1246.95  5007.18  5264.69   419.96  3008.92  2854.79   24891
Total (C_j)     6612     1068     1351     5425     5704      455     3260     3093   26968

$$E_{ij} = \frac{R_i C_j}{n}, \qquad r_{ij} = \frac{x_{ij}-E_{ij}}{\sqrt{E_{ij}}}$$

Page 48: Multivariate Data Summary

The residuals

$$r_{ij} = \frac{x_{ij}-E_{ij}}{\sqrt{E_{ij}}}$$

            1       2       3       4       5       6       7       8
YES     -7.32    2.07    9.50   -6.99    0.75    4.72   -1.71   11.39
NO       2.12   -0.60   -2.75    2.02   -0.22   -1.36    0.49   -3.29

The calculation of χ²

$$\chi^2 = \sum_{i}\sum_{j} r_{ij}^2 = \sum_{i}\sum_{j}\frac{(x_{ij}-E_{ij})^2}{E_{ij}} = 381.961$$
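The same calculation for the role data can be sketched in Python, this time keeping each cell's contribution so the cell that contributes most to χ² can be identified. The counts come from the crosstabulation; the variable names are illustrative.

```python
roles = [
    "parent, partner, worker", "parent, partner", "parent, worker",
    "partner, worker", "worker only", "parent only", "partner only", "no roles",
]
observed = {
    "YES": [344, 101, 201, 275, 455, 63, 224, 414],
    "NO": [6268, 967, 1150, 5150, 5249, 392, 3036, 2679],
}

row_totals = {use: sum(counts) for use, counts in observed.items()}
col_totals = [y + m for y, m in zip(observed["YES"], observed["NO"])]
n = sum(col_totals)

# Contribution of each cell: (x_ij - E_ij)^2 / E_ij
contributions = {}
for use, counts in observed.items():
    for role, x_ij, c_j in zip(roles, counts, col_totals):
        e_ij = row_totals[use] * c_j / n
        contributions[(use, role)] = (x_ij - e_ij) ** 2 / e_ij

chi2 = sum(contributions.values())
largest_cell = max(contributions, key=contributions.get)

print(f"chi-square = {chi2:.2f}")
print(f"largest contribution: {largest_cell}")
```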

Page 49: Multivariate Data Summary

Example

• In this example n = 57407 individuals had been victimized twice by crimes

• Rows = crime of first victimization

• Cols = crime of second victimization

Page 50: Multivariate Data Summary

Table 1: Frequencies

Rows: first victimization in pair. Columns: second victimization in pair.

            Ra       A      Ro   PP/PS      PL       B      HL      MV   Total
Ra          26      50      11       6      82      39      48      11     273
A           65    2997     238      85    2553    1083    1349     216    8586
Ro          12     279     197      36     459     197     221      47    1448
PP/PS        3     102      40      61     243     115     101      38     703
PL          75    2628     413     229   12137    2658    3689     687   22516
B           52    1117     191     102    2649    3210    1973     301    9595
HL          42    1251     206     117    3757    1962    4646     391   12372
MV           3     221      51      24     678     301     367     269    1914
Total      278    8645    1347     660   22558    9565   12394    1960   57407

Page 51: Multivariate Data Summary

Table 2: Standardized residuals

Rows: first victimization in pair. Columns: second victimization in pair.

            Ra       A      Ro   PP/PS      PL       B      HL      MV
Ra        21.5     1.4     1.8     1.6    -2.4    -1.0    -1.9     0.6
A          3.6    47.4     2.6    -1.4   -14.1    -9.2   -11.7    -4.5
Ro         1.9     4.1    28.0     4.7    -4.6    -2.8    -5.2    -0.3
PP/PS     -0.2    -0.4     5.8    18.6    -2.0    -0.2    -4.1     2.9
PL        -3.3   -13.1    -5.0    -1.9    35.0   -17.9   -16.8    -2.9
B          0.8    -8.6    -2.3    -0.8   -18.3    40.3    -2.2    -1.5
HL        -2.3   -14.2    -4.9    -2.1   -15.8    -2.2    38.2    -1.5
MV        -2.1    -4.0     0.9     0.4    -2.7    -1.0    -2.3    25.2

11,430 (highly significant)
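The pattern in Table 2 can be explored with a short Python sketch that recomputes the standardized residuals from the Table 1 frequencies. Row totals are computed from the cell counts rather than typed in, and the variable names are illustrative. The sketch only checks that the diagonal cells (same crime type both times) carry the largest residual in every row.

```python
from math import sqrt

labels = ["Ra", "A", "Ro", "PP/PS", "PL", "B", "HL", "MV"]

# Rows: first victimization; columns: second victimization (Table 1)
observed = [
    [26, 50, 11, 6, 82, 39, 48, 11],
    [65, 2997, 238, 85, 2553, 1083, 1349, 216],
    [12, 279, 197, 36, 459, 197, 221, 47],
    [3, 102, 40, 61, 243, 115, 101, 38],
    [75, 2628, 413, 229, 12137, 2658, 3689, 687],
    [52, 1117, 191, 102, 2649, 3210, 1973, 301],
    [42, 1251, 206, 117, 3757, 1962, 4646, 391],
    [3, 221, 51, 24, 678, 301, 367, 269],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# r_ij = (x_ij - E_ij) / sqrt(E_ij), with E_ij = R_i * C_j / n
residuals = [
    [(x_ij - r_i * c_j / n) / sqrt(r_i * c_j / n)
     for x_ij, c_j in zip(row, col_totals)]
    for row, r_i in zip(observed, row_totals)
]

for label, row in zip(labels, residuals):
    print(f"{label:>6} " + " ".join(f"{r_ij:7.1f}" for r_ij in row))
```

In every row the largest residual falls on the diagonal, so a second victimization is far more likely to be the same type of crime as the first than independence would predict.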

Page 52: Multivariate Data Summary

Next Topic:

Brief introduction to Statistical Packages