
Page 1: Stats 330: Lecture 30

Page 2: Plan of the day

In today's lecture we continue our discussion of contingency tables. Topics:

– Simpson’s Paradox

– Collapsing tables

– Adequacy of the chi-square approximation

See also Tutorial 10 (available for download).

Page 3: Example: Florida murders

If we "collapse" the table over victim, we get the 2 x 2 table of defendant by death penalty:

                     Death = N   Death = Y
  Defendant = black     149          17
  Defendant = white     141          19

Odds of being black for the "no DP" group: 149/141
Odds of being black for the "DP" group: 17/19

OR = (149/141)/(17/19) = (149*19)/(141*17) = 1.181

The odds of being black are 18% higher for the no-DP group than for the DP group, i.e. blacks are favoured.
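A quick check of this marginal odds ratio in R (a minimal sketch, not lecture code; the matrix is just the 2 x 2 table above):

# Marginal table: rows = defendant (black, white), columns = death penalty (N, Y)
marginal <- matrix(c(149, 141, 17, 19), nrow = 2,
                   dimnames = list(defendant = c("black", "white"),
                                   dp = c("N", "Y")))
(marginal["black", "N"] / marginal["white", "N"]) /
    (marginal["black", "Y"] / marginal["white", "Y"])   # 1.181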

Page 4: Example: Florida murders

The tables conditional on victim are:

Victim = black, OR = 0.79 (add 0.5 to each cell):

                     Death = N   Death = Y
  Defendant = black      97           6
  Defendant = white       9           0

Victim = white, OR = 0.67:

                     Death = N   Death = Y
  Defendant = black      52          11
  Defendant = white     132          19

Now the odds of being black are greater in the DP group!
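The same check for the conditional tables (again a sketch, not lecture code). Applying the 0.5 continuity correction to both tables, needed because of the zero cell in the victim = black table, reproduces both slide values:

# Conditional tables: rows = defendant (black, white), columns = DP (N, Y)
vb <- matrix(c(97, 9, 6, 0), nrow = 2)      # victim = black
vw <- matrix(c(52, 132, 11, 19), nrow = 2)  # victim = white
or05 <- function(tab) {                     # odds ratio with 0.5 added to each cell
    tt <- tab + 0.5
    (tt[1, 1] / tt[2, 1]) / (tt[1, 2] / tt[2, 2])
}
or05(vb)   # 0.79
or05(vw)   # 0.67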

Page 5: Association Reversal

In our regression lectures, we saw that regression coefficients could change sign when new variables were added to the model.

i.e. the coefficient of x1 in the model

y ~ x1

can have the opposite sign to the coefficient of x1 in the model

y ~ x1 + x2

– See e.g. Lecture 5, Slides 12 and 13.
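A small simulation sketch of this effect (hypothetical data, not from Lecture 5): x1 and x2 are positively correlated, y depends negatively on x1 but strongly positively on x2, and the marginal coefficient of x1 comes out positive.

set.seed(330)
x2 <- rnorm(200)
x1 <- x2 + rnorm(200, sd = 0.5)   # x1 strongly correlated with x2
y  <- -x1 + 3 * x2 + rnorm(200)   # true coefficient of x1 is -1
coef(lm(y ~ x1))["x1"]            # marginal slope: positive
coef(lm(y ~ x1 + x2))["x1"]       # conditional slope: negative, near -1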

Page 6: The reason

• The coefficient of x1 in the model y~x1 is the slope of the fitted line in the scatter plot, summarising the marginal relationship between y and x1

• The coefficient of x1 in the model y~x1+x2 is the slope of the line in the coplot, which summarises the relationship between y and x1, conditional on x2

Page 7: Marginal and conditional association

• In the first case, we are looking at association between y and x1

• In the second case, we are looking at the association between y and x1, with x2 held fixed (conditional on x2)

• We can do the same sort of thing in contingency tables

Page 8: Association in contingency tables

• In 2 x 2 tables, we measure association using the odds ratio.

• If we have 3 factors A, B and C, each at 2 levels, we can look at the marginal odds ratio of A and B, using the AB table, ignoring C.

• Or we can look at the two conditional odds ratios: one from the AB table corresponding to C = 1, the other from the AB table corresponding to C = 2.

Page 9: When will these be the same?

• In ordinary regression, the coefficient of x1 in the model y~x1 will be the same as the coefficient of x1 in the model y~x1+x2 if and only if x1 and x2 are uncorrelated.

• In contingency tables, the marginal and conditional population ORs for the AB tables will be the same if
– A and C are independent, given B, or
– B and C are independent, given A.

• Thus, we can collapse the table over C if the ABC interaction is zero, and either the AC interaction is zero or the BC interaction is zero.

• If this is not the case, association reversal (Simpson's paradox) may occur. The more strongly A and/or B are associated with C, the more likely it is to happen.

Page 10: Florida data

• Very strong association between DP and race of victim: if the victim is black, there is a small chance of DP

• Very strong association between race of victim and race of defendant

• Since blacks murder blacks (and whites murder whites), there are not so many DPs for black defendants overall

Page 11: When can we collapse?

• Thus, we can collapse the table over C if
– the ABC interaction is zero, and
– either the AC interaction is zero, or the BC interaction is zero.

If these conditions are not met, then the marginal and conditional tables may have different (even reversed) degrees of association.

Page 12: Florida data

Recall that there are significant interactions between
– race of victim and race of defendant
– race of victim and death penalty

Thus, we can't collapse over race of victim.

Call:
glm(formula = counts ~ defendant * dp * victim, family = poisson,
    data = murder.df)

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
(Intercept)              5.27300    0.07161  73.633  < 2e-16 ***
defendantw              -2.32856    0.24033  -9.689  < 2e-16 ***
dpy                     -2.70805    0.28645  -9.454  < 2e-16 ***
victimw                 -0.61904    0.12105  -5.114 3.15e-07 ***
defendantw:dpy          -0.23639    1.06521  -0.222  0.82438
defendantw:victimw       3.25433    0.26657  12.208  < 2e-16 ***
dpy:victimw              1.18958    0.36750   3.237  0.00121 **
defendantw:dpy:victimw  -0.16131    1.10322  -0.146  0.88375
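The transcript does not show how murder.df was built. A plausible reconstruction from the slide 4 tables, with factor names and levels chosen to match the coefficient labels above: the odds ratios on slide 4 suggest 0.5 was added to each cell before fitting, and that choice reproduces all the coefficient estimates above apart from the intercept.

# Cell counts in expand.grid order: defendant varies fastest, then dp, then victim
counts <- c(97, 9, 6, 0,             # victim = b: (def=b,dp=n), (w,n), (b,y), (w,y)
            52, 132, 11, 19) + 0.5   # victim = w, same order; 0.5 added to every cell
murder.df <- data.frame(counts,
    expand.grid(defendant = c("b", "w"), dp = c("n", "y"), victim = c("b", "w")))
# R warns about non-integer counts, but the fit goes through
murder.glm <- glm(counts ~ defendant * dp * victim, family = poisson,
                  data = murder.df)
summary(murder.glm)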

Page 13: In graphical form

[Independence graph with nodes A, B and C: edges A-B and A-C, no edge between B and C.]

Can collapse over B or C, but not A.

Page 14: Example: Berkeley admissions

• Admission data from UC Berkeley grad school:

                   Admitted = Yes      Admitted = No
   Department      Male   Female       Male   Female
   A                512       89        313       19
   B                353       17        207        8
   C                120      202        205      391
   D                138      131        279      244
   E                 53       94        138      299
   F                 22       24        351      317

• The ORs (odds of admission for males / odds of admission for females) are:

   A 0.35
   B 0.80
   C 1.13
   D 0.92
   E 1.22
   F 0.83
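These counts are R's built-in UCBAdmissions dataset, so the per-department odds ratios can be checked directly (a sketch using that built-in array, not lecture code):

# UCBAdmissions is a 2 x 2 x 6 array: Admit x Gender x Dept
apply(UCBAdmissions, "Dept", function(tab)
    (tab["Admitted", "Male"]   / tab["Rejected", "Male"]) /
    (tab["Admitted", "Female"] / tab["Rejected", "Female"]))
#    A     B     C     D     E     F
# 0.35  0.80  1.13  0.92  1.22  0.83   (rounded to 2 dp)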

Page 15: Collapse over departments

              Admitted
  Gender      Yes      No
  Male       1198    1493
  Female      557    1278

• OR = 1.84 (odds of admission for males / odds of admission for females)

• Seems to be strong evidence of bias against women

• Reason: females apply to programs that have low admission rates, males to programs that have high admission rates, so there is strong dependence between department and the other two factors
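Continuing the same sketch, collapsing over department and recomputing the odds ratio:

# Sum over Dept to get the marginal Admit x Gender table
marg <- margin.table(UCBAdmissions, c(1, 2))
(marg["Admitted", "Male"] / marg["Rejected", "Male"]) /
    (marg["Admitted", "Female"] / marg["Rejected", "Female"])   # 1.84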

Page 16: Anova table

                  Df Deviance Resid. Df Resid. Dev   P(>|Chi|)
NULL                                23    2650.10
Dept               5   159.52       18    2490.57    1.251e-32
Gender             1   162.87       17    2327.70    2.665e-37
Admit              1   230.03       16    2097.67    5.879e-52
Dept:Gender        5  1220.61       11     877.06   1.006e-261
Dept:Admit         5   855.32        6      21.74   1.242e-182
Gender:Admit       1     1.53        5      20.20         0.22
Dept:Gender:Admit  5    20.20        0  7.239e-14    1.144e-03
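A sketch of a fit that produces a table like this (the lecture's data frame is not shown; as.data.frame turns the built-in UCBAdmissions array into cell counts with factors Admit, Gender and Dept):

berkeley.df <- as.data.frame(UCBAdmissions)   # columns Admit, Gender, Dept, Freq
berkeley.glm <- glm(Freq ~ Dept * Gender * Admit, family = poisson,
                    data = berkeley.df)
anova(berkeley.glm, test = "Chisq")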

Page 17: Thus

• The 3-factor interaction is significant, so

– Dept is not conditionally independent of admission, given gender

– Dept is not conditionally independent of gender, given admission

• Collapse over Dept at your peril!

Page 18: Moral: Collapsing tables

• It is dangerous to collapse tables when the factor being collapsed over (e.g. victim's race, Berkeley department) is strongly associated with the other factors

• Marginal (collapsed) tables may give a very misleading picture; we need to look at the conditional tables

Page 19: How good is chi-square?

• The bigger the cell counts, the better the chi-square approximation to the distribution of the deviance

• The smaller the number of cells, the better the approximation

• But the approximation is often OK even if some cell counts are small: our next example illustrates this.

Page 20: The Dayton Study

The data in this case study were gathered by researchers investigating teenage substance abuse in a community near Dayton, Ohio in 1992.

• The researchers asked 2276 high school students if they had ever used alcohol, cigarettes or marijuana.

Page 21: The data

A 3-dimensional contingency table:

                    Used marijuana = Yes      Used marijuana = No
                     Used cigarettes           Used cigarettes
   Used alcohol      Yes        No             Yes        No
   Yes               911        44             538       456
   No                  3         2              43       279

Page 22: Making the data frame

> counts = c(911, 3, 44, 2, 538, 43, 456, 279)
> ACM.df = data.frame(counts, expand.grid(A = c("Yes","No"),
+                     C = c("Yes","No"), M = c("Yes","No")))
> ACM.df
  counts   A   C   M
1    911 Yes Yes Yes
2      3  No Yes Yes
3     44 Yes  No Yes
4      2  No  No Yes
5    538 Yes Yes  No
6     43  No Yes  No
7    456 Yes  No  No
8    279  No  No  No

Page 23: Fitting models

> ACM.glm = glm(counts ~ A*C*M, family = poisson, data = ACM.df)
> anova(ACM.glm, test = "Chisq")
Analysis of Deviance Table

      Df Deviance Resid. Df Resid. Dev   P(>|Chi|)
NULL                  7      2851.46
A      1  1281.71     6      1569.75  1.064e-280
C      1   227.81     5      1341.93   1.787e-51
M      1    55.91     4      1286.02   7.575e-14
A:C    1   442.19     3       843.83   3.607e-98
A:M    1   346.46     2       497.37   2.504e-77
C:M    1   497.00     1         0.37  4.283e-110
A:C:M  1     0.37     0   2.509e-14         0.54

Page 24: Try the indicated homogeneous association model

> ham.glm = glm(counts ~ A*C + A*M + C*M, family = poisson, data = ACM.df)
> summary(ham.glm)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  6.81387    0.03313 205.699  < 2e-16 ***
ANo         -5.52827    0.45221 -12.225  < 2e-16 ***
CNo         -3.01575    0.15162 -19.891  < 2e-16 ***
MNo         -0.52486    0.05428  -9.669  < 2e-16 ***
ANo:CNo      2.05453    0.17406  11.803  < 2e-16 ***
ANo:MNo      2.98601    0.46468   6.426 1.31e-10 ***
CNo:MNo      2.84789    0.16384  17.382  < 2e-16 ***

    Null deviance: 2851.46098  on 7  degrees of freedom
Residual deviance:    0.37399  on 1  degrees of freedom

Page 25: Summary

• The model counts ~ A*C + C*M + A*M seems to fit well.

• This is the homogeneous association model; we can also write it as

counts ~ (C + A + M)^2

or

counts ~ A*C*M - A:C:M

i.e. with all possible 2-way interactions but no 3-way interaction.

Page 26: Is chi-square adequate?

• Consider the 3-dimensional table: it had two small counts.

• We fitted a model to these data having all 2-factor interactions but no 3-factor interaction.

• We can estimate the Poisson means using predict:

> means = predict(ham.glm, type = "response")
> means
         1          2          3          4          5          6          7          8
910.383170   3.616830  44.616830   1.383170 538.616830  42.383170 455.383170 279.616830

Page 27: Is chi-square adequate? (2)

• Assuming these estimated means are the true ones, we can simulate a table by generating 8 new table entries.

• Thus, for example, if cell 1 has assumed mean 910.383, we generate an entry for cell 1 by selecting a value at random from a Poisson distribution with mean 910.38:

> rpois(1, 910.38)
[1] 969

• Repeat for every cell to get a simulated table.

• Calculate the deviance of the model for the simulated table.

• Repeat for 10000 tables, recording the deviance each time.

Page 28: Results

> 1 - pchisq(0.37399, 1)
[1] 0.5408374
> mean(result > 0.37399)
[1] 0.6006

Not too bad: the chi-square tail probability (0.54) is reasonably close to the simulated one (0.60).

[Figure: histogram of the 10000 simulated deviances (x-axis: deviance, 0 to 14; y-axis: density), with the observed deviance of 0.37399 marked and the chi-square(1) density overlaid.]

Page 29: Code

means = predict(ham.glm, type = "response")
Nsim = 10000
# generate 10000 tables, one per column
data <- matrix(rpois(Nsim * 8, means), 8, Nsim)
result <- numeric(Nsim)
for (i in 1:Nsim)
    result[i] <- deviance(glm(data[, i] ~ A*C + C*M + A*M,
                              family = poisson, data = ACM.df))
# draw a picture of the results
hist(result, nclass = 30, freq = F, xlab = "deviance",
     main = "Histogram of 10000 simulated deviances")
# add the chi-square(1) density
xx <- seq(0, 14, length = 100)
lines(xx, dchisq(xx, 1), lwd = 2, col = "red")