
Page 1: Stats 330: Lecture 30

Page 2: Plan of the day

In today's lecture we continue our discussion of contingency tables. Topics:

– Simpson’s Paradox

– Collapsing tables

– Adequacy of the chi-square approximation

See also Tutorial 10 (available for download).

Page 3: Example: Florida murders

If we "collapse" the table over victim, we get the 2 x 2 table of defendant by death penalty:

                     Death = N   Death = Y
  Defendant = black     149          17
  Defendant = white     141          19

Odds of being black for the "no DP" group: 149/141
Odds of being black for the "DP" group: 17/19

OR = (149/141)/(17/19) = (149*19)/(141*17) = 1.181

The odds of being black are 18% higher for the no-DP group than for the DP group, i.e. blacks are favoured.
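A quick check of this marginal odds ratio in R (a minimal sketch, not lecture code; the matrix is just the 2 x 2 table above):

# Marginal table: rows = defendant (black, white), columns = death penalty (N, Y)
marginal <- matrix(c(149, 141, 17, 19), nrow = 2,
                   dimnames = list(defendant = c("black", "white"),
                                   dp = c("N", "Y")))
(marginal["black", "N"] / marginal["white", "N"]) /
    (marginal["black", "Y"] / marginal["white", "Y"])   # 1.181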

Page 4: Example: Florida murders

The tables conditional on victim are:

Victim = black, OR = 0.79 (add 0.5 to each cell):

                     Death = N   Death = Y
  Defendant = black      97           6
  Defendant = white       9           0

Victim = white, OR = 0.67:

                     Death = N   Death = Y
  Defendant = black      52          11
  Defendant = white     132          19

Now the odds of being black are greater in the DP group!
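The same check for the conditional tables (again a sketch, not lecture code). Applying the 0.5 continuity correction to both tables, needed because of the zero cell in the victim = black table, reproduces both slide values:

# Conditional tables: rows = defendant (black, white), columns = DP (N, Y)
vb <- matrix(c(97, 9, 6, 0), nrow = 2)      # victim = black
vw <- matrix(c(52, 132, 11, 19), nrow = 2)  # victim = white
or05 <- function(tab) {                     # odds ratio with 0.5 added to each cell
    tt <- tab + 0.5
    (tt[1, 1] / tt[2, 1]) / (tt[1, 2] / tt[2, 2])
}
or05(vb)   # 0.79
or05(vw)   # 0.67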

Page 5: Association Reversal

In our regression lectures, we saw that regression coefficients could change sign when new variables were added to the model.

i.e. the coefficient of x1 in the model

y ~ x1

can have the opposite sign to the coefficient of x1 in the model

y ~ x1 + x2

– See e.g. Lecture 5, Slides 12 and 13.
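A small simulation sketch of this effect (hypothetical data, not from Lecture 5): x1 and x2 are positively correlated, y depends negatively on x1 but strongly positively on x2, and the marginal coefficient of x1 comes out positive.

set.seed(330)
x2 <- rnorm(200)
x1 <- x2 + rnorm(200, sd = 0.5)   # x1 strongly correlated with x2
y  <- -x1 + 3 * x2 + rnorm(200)   # true coefficient of x1 is -1
coef(lm(y ~ x1))["x1"]            # marginal slope: positive
coef(lm(y ~ x1 + x2))["x1"]       # conditional slope: negative, near -1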

Page 6: The reason

• The coefficient of x1 in the model y~x1 is the slope of the fitted line in the scatter plot, summarising the marginal relationship between y and x1

• The coefficient of x1 in the model y~x1+x2 is the slope of the line in the coplot, which summarises the relationship between y and x1, conditional on x2

Page 7: Marginal and conditional association

• In the first case, we are looking at association between y and x1

• In the second case, we are looking at the association between y and x1, with x2 held fixed (conditional on x2)

• We can do the same sort of thing in contingency tables

Page 8: Association in contingency tables

• In 2 x 2 tables, we measure association using the odds ratio.

• If we have 3 factors A, B and C, each at 2 levels, we can look at the marginal odds ratio of A and B, using the AB table, ignoring C.

• Or we can look at the two conditional odds ratios: one from the AB table corresponding to C = 1, the other from the AB table corresponding to C = 2.

Page 9: When will these be the same?

• In ordinary regression, the coefficient of x1 in the model y~x1 will be the same as the coefficient of x1 in the model y~x1+x2 if and only if x1 and x2 are uncorrelated.

• In contingency tables, the marginal and conditional population ORs for the AB tables will be the same if
– A and C are independent, given B, or
– B and C are independent, given A.

• Thus, we can collapse the table over C if the ABC interaction is zero, and either the AC interaction is zero or the BC interaction is zero.

• If this is not the case, association reversal (Simpson's paradox) may occur. The more strongly A and/or B are associated with C, the more likely it is to happen.

Page 10: Florida data

• Very strong association between DP and race of victim: if the victim is black, there is a small chance of DP

• Very strong association between race of victim and race of defendant

• Since blacks murder blacks (and whites murder whites), there are not so many DPs for black defendants overall

Page 11: When can we collapse?

• Thus, we can collapse the table over C if
– the ABC interaction is zero, and
– either the AC interaction is zero, or the BC interaction is zero.

If these conditions are not met, then the marginal and conditional tables may have different (even reversed) degrees of association.

Page 12: Florida data

Recall that there are significant interactions between
– race of victim and race of defendant
– race of victim and death penalty

Thus, we can't collapse over race of victim.

Call:
glm(formula = counts ~ defendant * dp * victim, family = poisson,
    data = murder.df)

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
(Intercept)              5.27300    0.07161  73.633  < 2e-16 ***
defendantw              -2.32856    0.24033  -9.689  < 2e-16 ***
dpy                     -2.70805    0.28645  -9.454  < 2e-16 ***
victimw                 -0.61904    0.12105  -5.114 3.15e-07 ***
defendantw:dpy          -0.23639    1.06521  -0.222  0.82438
defendantw:victimw       3.25433    0.26657  12.208  < 2e-16 ***
dpy:victimw              1.18958    0.36750   3.237  0.00121 **
defendantw:dpy:victimw  -0.16131    1.10322  -0.146  0.88375
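The transcript does not show how murder.df was built. A plausible reconstruction from the slide 4 tables, with factor names and levels chosen to match the coefficient labels above: the odds ratios on slide 4 suggest 0.5 was added to each cell before fitting, and that choice reproduces all the coefficient estimates above apart from the intercept.

# Cell counts in expand.grid order: defendant varies fastest, then dp, then victim
counts <- c(97, 9, 6, 0,             # victim = b: (def=b,dp=n), (w,n), (b,y), (w,y)
            52, 132, 11, 19) + 0.5   # victim = w, same order; 0.5 added to every cell
murder.df <- data.frame(counts,
    expand.grid(defendant = c("b", "w"), dp = c("n", "y"), victim = c("b", "w")))
# R warns about non-integer counts, but the fit goes through
murder.glm <- glm(counts ~ defendant * dp * victim, family = poisson,
                  data = murder.df)
summary(murder.glm)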

Page 13: In graphical form

[Independence graph with nodes A, B and C: edges A-B and A-C, no edge between B and C.]

Can collapse over B or C, but not A.

Page 14: Example: Berkeley admissions

• Admission data from UC Berkeley grad school:

                   Admitted = Yes      Admitted = No
   Department      Male   Female       Male   Female
   A                512       89        313       19
   B                353       17        207        8
   C                120      202        205      391
   D                138      131        279      244
   E                 53       94        138      299
   F                 22       24        351      317

• The ORs (odds of admission for males / odds of admission for females) are:

   A 0.35
   B 0.80
   C 1.13
   D 0.92
   E 1.22
   F 0.83
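These counts are R's built-in UCBAdmissions dataset, so the per-department odds ratios can be checked directly (a sketch using that built-in array, not lecture code):

# UCBAdmissions is a 2 x 2 x 6 array: Admit x Gender x Dept
apply(UCBAdmissions, "Dept", function(tab)
    (tab["Admitted", "Male"]   / tab["Rejected", "Male"]) /
    (tab["Admitted", "Female"] / tab["Rejected", "Female"]))
#    A     B     C     D     E     F
# 0.35  0.80  1.13  0.92  1.22  0.83   (rounded to 2 dp)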

Page 15: Collapse over departments

              Admitted
  Gender      Yes      No
  Male       1198    1493
  Female      557    1278

• OR = 1.84 (odds of admission for males / odds of admission for females)

• Seems to be strong evidence of bias against women

• Reason: females apply to programs that have low admission rates, males to programs that have high admission rates, so there is strong dependence between department and the other two factors
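Continuing the same sketch, collapsing over department and recomputing the odds ratio:

# Sum over Dept to get the marginal Admit x Gender table
marg <- margin.table(UCBAdmissions, c(1, 2))
(marg["Admitted", "Male"] / marg["Rejected", "Male"]) /
    (marg["Admitted", "Female"] / marg["Rejected", "Female"])   # 1.84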

Page 16: Anova table

                  Df Deviance Resid. Df Resid. Dev   P(>|Chi|)
NULL                                23    2650.10
Dept               5   159.52       18    2490.57    1.251e-32
Gender             1   162.87       17    2327.70    2.665e-37
Admit              1   230.03       16    2097.67    5.879e-52
Dept:Gender        5  1220.61       11     877.06   1.006e-261
Dept:Admit         5   855.32        6      21.74   1.242e-182
Gender:Admit       1     1.53        5      20.20         0.22
Dept:Gender:Admit  5    20.20        0  7.239e-14    1.144e-03
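A sketch of a fit that produces a table like this (the lecture's data frame is not shown; as.data.frame turns the built-in UCBAdmissions array into cell counts with factors Admit, Gender and Dept):

berkeley.df <- as.data.frame(UCBAdmissions)   # columns Admit, Gender, Dept, Freq
berkeley.glm <- glm(Freq ~ Dept * Gender * Admit, family = poisson,
                    data = berkeley.df)
anova(berkeley.glm, test = "Chisq")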

Page 17: Thus

• The 3-factor interaction is significant, so

– Dept is not conditionally independent of admission, given gender

– Dept is not conditionally independent of gender, given admission

• Collapse over Dept at your peril!

Page 18: Moral: Collapsing tables

• It is dangerous to collapse tables when the factor being collapsed over (e.g. victim's race, Berkeley department) is strongly associated with the other factors

• Marginal (collapsed) tables may give a very misleading picture; we need to look at the conditional tables

Page 19: How good is chi-square?

• The bigger the cell counts, the better the chi-square approximation to the distribution of the deviance

• The smaller the number of cells, the better the approximation

• But the approximation is often OK even if some cell counts are small: our next example illustrates this.

Page 20: The Dayton Study

The data in this case study were gathered by researchers investigating teenage substance abuse in a community near Dayton, Ohio in 1992.

• The researchers asked 2276 high school students if they had ever used alcohol, cigarettes or marijuana.

Page 21: The data

A 3-dimensional contingency table:

                    Used marijuana = Yes      Used marijuana = No
                     Used cigarettes           Used cigarettes
   Used alcohol      Yes        No             Yes        No
   Yes               911        44             538       456
   No                  3         2              43       279

Page 22: Making the data frame

> counts = c(911, 3, 44, 2, 538, 43, 456, 279)
> ACM.df = data.frame(counts, expand.grid(A = c("Yes","No"),
+                     C = c("Yes","No"), M = c("Yes","No")))
> ACM.df
  counts   A   C   M
1    911 Yes Yes Yes
2      3  No Yes Yes
3     44 Yes  No Yes
4      2  No  No Yes
5    538 Yes Yes  No
6     43  No Yes  No
7    456 Yes  No  No
8    279  No  No  No

Page 23: Fitting models

> ACM.glm = glm(counts ~ A*C*M, family = poisson, data = ACM.df)
> anova(ACM.glm, test = "Chisq")
Analysis of Deviance Table

      Df Deviance Resid. Df Resid. Dev   P(>|Chi|)
NULL                  7      2851.46
A      1  1281.71     6      1569.75  1.064e-280
C      1   227.81     5      1341.93   1.787e-51
M      1    55.91     4      1286.02   7.575e-14
A:C    1   442.19     3       843.83   3.607e-98
A:M    1   346.46     2       497.37   2.504e-77
C:M    1   497.00     1         0.37  4.283e-110
A:C:M  1     0.37     0   2.509e-14         0.54

Page 24: Try the indicated homogeneous association model

> ham.glm = glm(counts ~ A*C + A*M + C*M, family = poisson, data = ACM.df)
> summary(ham.glm)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  6.81387    0.03313 205.699  < 2e-16 ***
ANo         -5.52827    0.45221 -12.225  < 2e-16 ***
CNo         -3.01575    0.15162 -19.891  < 2e-16 ***
MNo         -0.52486    0.05428  -9.669  < 2e-16 ***
ANo:CNo      2.05453    0.17406  11.803  < 2e-16 ***
ANo:MNo      2.98601    0.46468   6.426 1.31e-10 ***
CNo:MNo      2.84789    0.16384  17.382  < 2e-16 ***

    Null deviance: 2851.46098  on 7  degrees of freedom
Residual deviance:    0.37399  on 1  degrees of freedom

Page 25: Summary

• The model counts ~ A*C + C*M + A*M seems to fit well.

• This is the homogeneous association model; we can also write it as

counts ~ (C + A + M)^2

or

counts ~ A*C*M - A:C:M

i.e. with all possible 2-way interactions but no 3-way interaction.

Page 26: Is chi-square adequate?

• Consider the 3-dimensional table: it had two small counts.

• We fitted a model to these data having all 2-factor interactions but no 3-factor interaction.

• We can estimate the Poisson means using predict:

> means = predict(ham.glm, type = "response")
> means
         1          2          3          4          5          6          7          8
910.383170   3.616830  44.616830   1.383170 538.616830  42.383170 455.383170 279.616830

Page 27: Is chi-square adequate? (2)

• Assuming these estimated means are the true ones, we can simulate a table by generating 8 new table entries.

• Thus, for example, if cell 1 has assumed mean 910.383, we generate an entry for cell 1 by selecting a value at random from a Poisson distribution with mean 910.38:

> rpois(1, 910.38)
[1] 969

• Repeat for every cell to get a simulated table.

• Calculate the deviance of the model for the simulated table.

• Repeat for 10000 tables, recording the deviance each time.

Page 28: Results

> 1 - pchisq(0.37399, 1)
[1] 0.5408374
> mean(result > 0.37399)
[1] 0.6006

Not too bad: the chi-square tail probability (0.54) is reasonably close to the simulated one (0.60).

[Figure: histogram of the 10000 simulated deviances (x-axis: deviance, 0 to 14; y-axis: density), with the observed deviance of 0.37399 marked and the chi-square(1) density overlaid.]

Page 29: Code

means = predict(ham.glm, type = "response")
Nsim = 10000
# generate 10000 tables, one per column
data <- matrix(rpois(Nsim * 8, means), 8, Nsim)
result <- numeric(Nsim)
for (i in 1:Nsim)
    result[i] <- deviance(glm(data[, i] ~ A*C + C*M + A*M,
                              family = poisson, data = ACM.df))
# draw a picture of the results
hist(result, nclass = 30, freq = F, xlab = "deviance",
     main = "Histogram of 10000 simulated deviances")
# add the chi-square(1) density
xx <- seq(0, 14, length = 100)
lines(xx, dchisq(xx, 1), lwd = 2, col = "red")