Stat 504, Lecture 16
Introduction to
log-linear models
Key Concepts:
• Benefits of models
• Two-way Log-linear models
• Parameter constraints, estimation, and
interpretation
• Inference for log-linear models
Objectives:
• Understand the structure of the log-linear models
in two-way tables
• Understand the concepts of independence and
associations described via log-linear models in
two-way tables
Useful Links:
• The CATMOD procedure in SAS: http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/catmod_index.htm
• The GENMOD procedure in SAS: http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/genmod_index.htm
• The SAS source on log-linear model analysis: http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/catmod_sect30.htm#stat_catmod_catmodllma
• Fitting log-linear models in R: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/loglin.html
• Fitting log-linear models in R via generalized linear models (glm()): http://spider.stat.umn.edu/R/library/stats/html/glm.html
Readings:
• Agresti (2002) Ch. 8, 9
• Agresti (1996) Ch. 6, 7
Benefits of models over significance tests
Thus far our focus has been on describing interactions
or associations between two or three categorical
variables mostly via single summary statistics and
with significance testing.
Models can handle more complicated situations and
analyze the simultaneous effects of multiple variables,
including mixtures of categorical and continuous
variables.
For example, the Breslow-Day statistic works only
for 2 × 2 × K tables, while log-linear models allow
us to test for homogeneous association in I × J × K
and higher-dimensional tables.
The structural form of the model describes the
patterns of interactions and associations. The model
parameters provide measures of strength of
associations.
In models, the focus is on estimating the model
parameters. The basic inference tools (e.g., point
estimation, hypothesis testing, and confidence
intervals) will be applied to these parameters.
When discussing models, we will keep in mind
• Objective
• Model structure (e.g. variables, formula,
equation)
• Model assumptions
• Parameter estimates and interpretation
• Model fit (e.g. goodness-of-fit tests and statistics)
• Model selection
For example, recall a simple linear regression model
• Objective: model the expected value of a
continuous variable, Y, as a linear function of the
continuous predictor, X: $E(Y_i) = \beta_0 + \beta_1 x_i$
• Model structure: $Y_i = \beta_0 + \beta_1 x_i + e_i$
• Model assumptions: the errors are independent and
normally distributed with constant variance,
$e_i \sim N(0, \sigma^2)$, so Y is normally distributed, and X is fixed.
• Parameter estimates and interpretation: $\hat{\beta}_0$ is
the estimate of the intercept $\beta_0$, and $\hat{\beta}_1$ is the
estimate of the slope $\beta_1$, etc. What is the interpretation of
the slope?
• Model fit: $R^2$, residual analysis, F-statistic
• Model selection
See handout labeled as Lec16LinRegExample.doc on
modeling average water usage given the amount of
bread production:
Water = 2273 + 0.0799 Production
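To make the reading of the fitted line concrete, here is a minimal sketch that evaluates Water = 2273 + 0.0799 Production. The two production levels are made up for illustration; the handout's data and units are not reproduced here.

```python
# Fitted line quoted above: Water = 2273 + 0.0799 * Production
INTERCEPT = 2273.0
SLOPE = 0.0799


def predicted_water(production):
    """Estimated mean water usage at a given production level."""
    return INTERCEPT + SLOPE * production


# Hypothetical production levels, chosen only for illustration.
low, high = 10000.0, 11000.0

# Change in estimated mean water usage for 1000 extra units of production.
difference = predicted_water(high) - predicted_water(low)

print(predicted_water(low))
print(difference)
```

The difference equals 1000 times the slope, whatever the starting level: the slope is the estimated change in mean water usage per one-unit increase in production.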
Two-way ANOVA
Does the amount of sunlight and watering affect the
growth of geraniums?
Objective: model the continuous response as function
of two factors.
Model structure: $Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + e_{ijk}$ with
$e_{ijk} \sim N(0, \sigma^2)$, $i = 1, \ldots, I$, $j = 1, \ldots, J$, and
$k = 1, \ldots, n_{ij}$
Model assumptions: At each combination of levels the
outcome is normally distributed with the same
variance: $y_{ijk} \sim N(\mu_{ij}, \sigma^2)$, where
$\mu_{ij} = E(y_{ijk}) = \mu + \alpha_i + \beta_j + \gamma_{ij}$
This model is over-parametrized because term $\gamma_{ij}$
already has I × J parameters corresponding to the
cell means $\mu_{ij}$. The constant, $\mu$, and the main effects,
$\alpha_i$ and $\beta_j$, give us additional 1 + I + J parameters.
We use constraints such as
$\sum_i \alpha_i = \sum_j \beta_j = \sum_i \sum_j \gamma_{ij} = 0$
to deal with this overparametrization.
Does level of watering affect the growth of potted
geraniums? (Is there a significant main effect for
factor A, e.g. $H_0: \alpha_i = 0$ for all $i$?)
Does level of sunlight affect the growth of potted
geraniums? (Is there a significant main effect for
factor B?)
Does the effect of level of sunlight depend on level of
watering? (Is there a significant interaction between
factors A and B?)
Analysis of Variance for YIELD
Source        DF      SS      MS      F      P
WATER          1   342.3   342.3  24.02  0.000
SUNLIGHT       1    20.3    20.3   1.42  0.256
Interaction    1   132.3   132.3   9.28  0.010
Error         12   171.0    14.3
Total         15   665.8
Level means (individual 95% CI plots not reproduced)
WATER      HIGH 22.0   LOW 12.8
SUNLIGHT   HIGH 18.5   LOW 16.3
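The F column of the ANOVA table can be recomputed from the sums of squares and degrees of freedom. The sketch below copies those numbers from the output above. It only illustrates F = MS(source) / MS(error) and does not refit the model.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table above.
anova = {
    "WATER": (342.3, 1),
    "SUNLIGHT": (20.3, 1),
    "Interaction": (132.3, 1),
}
ss_error, df_error = 171.0, 12

ms_error = ss_error / df_error  # mean square error

f_stats = {}
for source, (ss, df) in anova.items():
    ms = ss / df                     # mean square for this source
    f_stats[source] = ms / ms_error  # F = MS(source) / MS(error)

for source, f in f_stats.items():
    print(f"{source:12s} F = {f:.2f}")
```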
Two-way Log-Linear Model
Now let $\mu_{ij}$ be the expected counts, $E(n_{ij})$, in an
I × J table. An analogous model to two-way ANOVA
is
$\log(\mu_{ij}) = \mu + \alpha_i + \beta_j + \gamma_{ij}$
or in the notation used by Agresti
$\log(\mu_{ij}) = \lambda + \lambda_i^A + \lambda_j^B + \lambda_{ij}^{AB}$
with constraints
$\sum_i \lambda_i^A = \sum_j \lambda_j^B = \sum_i \sum_j \lambda_{ij}^{AB} = 0$
to deal with overparametrization.
Log-linear models specify how the cell counts depend
on the levels of categorical variables. They model the
association and interaction patterns among
categorical variables.
Log-linear modeling is natural for Poisson,
multinomial, and product-multinomial sampling.
These models are appropriate when there is no clear
distinction between response and explanatory
variables, or when there are more than two responses.
Example: General Social Survey
Cross-classification of respondents according to their
choice for president in the 1992 presidential election
(Bush, Clinton, Perot) and their political view on a
7-point scale (extremely liberal, liberal, slightly liberal,
moderate, slightly conservative, conservative,
extremely conservative)
http://sda.berkeley.edu:7502/D3/GSS96/Doc/gss90017.htmpres92
Let’s consider a 3 × 3 table:
               Bush   Clinton   Perot   Total
Liberal          70       324      56     450
Moderate        195       332     101     628
Conservative    382       199     117     698
Total           647       855     274    1776
Are political view and choice independent?
You already know how to answer this via chi-square
test of independence, but now we want to model the
cell counts with the log-linear model of independence
and ask if this model fits well.
Two-way Log-linear models
Given two categorical random variables, A and B,
there are two main models we will consider:
• Independence model, (A, B)
• Saturated model, (AB)
Objective: Model the cell counts: $\mu_{ij} = n\pi_{ij}$
Main assumption: The N = IJ counts in the cells are
assumed to be independent observations of Poisson
random variables.
Log-linear model of independence for 2-way
tables
Recall independence in terms of cell probabilities
as a product of marginal probabilities:
$\pi_{ij} = \pi_{i+}\pi_{+j}$, $i = 1, \ldots, I$, $j = 1, \ldots, J$
or, in terms of expected cell frequencies:
$\mu_{ij} = n\pi_{ij} = n\pi_{i+}\pi_{+j}$, $i = 1, \ldots, I$, $j = 1, \ldots, J$
By taking logarithms of the expected
counts we obtain the loglinear model of
independence:
$\log \mu_{ij} = \log n + \log \pi_{i+} + \log \pi_{+j}$
$\log(\mu_{ij}) = \lambda + \lambda_i^A + \lambda_j^B$
where A and B stand for the two categorical variables.
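A minimal numerical sketch of this step, using the 3 × 3 vote table from the General Social Survey example. The variable names are illustrative, and the marginal probabilities are estimated by the sample proportions. It shows that the fitted counts under independence are additive on the log scale.

```python
import math

# 3 x 3 vote table (rows: Liberal, Moderate, Conservative;
# columns: Bush, Clinton, Perot).
counts = [
    [70, 324, 56],
    [195, 332, 101],
    [382, 199, 117],
]

n = sum(sum(row) for row in counts)
row_tot = [sum(row) for row in counts]
col_tot = [sum(col) for col in zip(*counts)]

# Estimated marginal probabilities (sample proportions).
pi_row = [r / n for r in row_tot]
pi_col = [c / n for c in col_tot]

# Fitted counts under independence: mu_ij = n * pi_i+ * pi_+j
mu = [[n * pr * pc for pc in pi_col] for pr in pi_row]

# On the log scale the same quantity is a sum of three terms.
log_mu = [[math.log(n) + math.log(pr) + math.log(pc) for pc in pi_col]
          for pr in pi_row]

print(mu[0][0], math.exp(log_mu[0][0]))
```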
$\log(\mu_{ij}) = \lambda + \lambda_i^A + \lambda_j^B$
This is an ANOVA-type representation, where
$\lambda$ represents an "overall" effect, or a grand mean of
the logarithms of the expected counts, and it ensures
that $\sum_i \sum_j \mu_{ij} = n$;
$\lambda_i^A$ represents a "main" effect of variable A, or a
deviation from the grand mean, and it ensures that
$\sum_j \mu_{ij} = n_{i+}$. It represents the effect of classification
in row i;
$\lambda_j^B$ represents a "main" effect of variable B, or a
deviation from the grand mean, and it ensures that
$\sum_i \mu_{ij} = n_{+j}$. This is the effect of classification in
column j;
and $\lambda_I^A = \lambda_J^B = 0$.
The ML fitted values are the same as the expected values
under the test of independence:
$\hat{\mu}_{ij} = n_{i+} n_{+j} / n$
Thus, the $X^2$ and $G^2$ for the test of independence are
goodness-of-fit statistics for the loglinear model of
independence, testing that the independence model
holds vs. that it does not.
The model also implies that ALL odds ratios are
equal to 1.
For our example, see vote.sas and compare the results
of the PROC FREQ and PROC GENMOD procedures.
Statistics for Table of pview by choice
Statistic                      DF       Value     Prob
------------------------------------------------------
Chi-Square                      4    238.5354   <.0001
Likelihood Ratio Chi-Square     4    247.6951   <.0001
...
Criteria For Assessing Goodness Of Fit
Criterion    DF       Value    Value/DF
Deviance      4    247.6951     61.9238
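The two statistics in this output can be recomputed by hand from the 3 × 3 vote table. The sketch below is plain Python rather than vote.sas (which is not reproduced here). It uses the fitted values (row total × column total) / n and the usual Pearson and likelihood-ratio formulas.

```python
import math

counts = [
    [70, 324, 56],     # Liberal
    [195, 332, 101],   # Moderate
    [382, 199, 117],   # Conservative
]

n = sum(sum(row) for row in counts)
row_tot = [sum(row) for row in counts]
col_tot = [sum(col) for col in zip(*counts)]

x2 = 0.0  # Pearson chi-square
g2 = 0.0  # likelihood-ratio chi-square (the deviance)
for i, row in enumerate(counts):
    for j, obs in enumerate(row):
        fitted = row_tot[i] * col_tot[j] / n  # ML fitted value under independence
        x2 += (obs - fitted) ** 2 / fitted
        g2 += 2.0 * obs * math.log(obs / fitted)

df = (len(counts) - 1) * (len(counts[0]) - 1)
print(f"X2 = {x2:.4f}, G2 = {g2:.4f}, df = {df}")
```

The value of g2 is what GENMOD reports as the deviance of the independence model.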
Parameter Constraints & Uniqueness
There are I − 1 unknown parameters in the set $\{\lambda_i^A\}$
and J − 1 in the set $\{\lambda_j^B\}$.
There can be many different parametrizations. We
need to set the constraints to account for redundant
parameters.
One way is to fix one value in the set to be equal to a
constant, typically 0. This corresponds to using
dummy coding for the categorical variables (e.g.
A = 1, 0). SAS/GENMOD sets the last level to 0.
$\mu_{11} = ?$
Another way is to fix the sum of the terms equal to a
constant, typically 0. That’s the ANOVA-type
constraint. This corresponds to using "effect" coding
for categorical variables (e.g. A = 1, 0, −1).
SAS/CATMOD uses the zero-sum constraint.
$\mu_{11} = ?$
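The sketch below compares the two parametrizations for the independence model fitted to the 3 × 3 vote table. It is hand-rolled Python that mimics the last-level-zero and zero-sum conventions described above; it does not call SAS, and the variable names are illustrative. Both sets of parameters give the same fitted count for cell (1, 1), and the same differences between main-effect parameters.

```python
import math

counts = [
    [70, 324, 56],
    [195, 332, 101],
    [382, 199, 117],
]
I, J = len(counts), len(counts[0])
n = sum(sum(row) for row in counts)
row_tot = [sum(row) for row in counts]
col_tot = [sum(col) for col in zip(*counts)]

# Log of the ML fitted values under independence.
log_mu = [[math.log(row_tot[i] * col_tot[j] / n) for j in range(J)]
          for i in range(I)]

# (a) Dummy coding: the last level of each variable is set to 0.
lam_ref = log_mu[I - 1][J - 1]
a_ref = [log_mu[i][J - 1] - lam_ref for i in range(I)]
b_ref = [log_mu[I - 1][j] - lam_ref for j in range(J)]

# (b) Effect coding: the parameters of each variable sum to 0.
lam_sum = sum(sum(row) for row in log_mu) / (I * J)
a_sum = [sum(log_mu[i]) / J - lam_sum for i in range(I)]
b_sum = [sum(log_mu[i][j] for i in range(I)) / I - lam_sum for j in range(J)]

# Fitted count for cell (1, 1) under each parametrization.
mu11_ref = math.exp(lam_ref + a_ref[0] + b_ref[0])
mu11_sum = math.exp(lam_sum + a_sum[0] + b_sum[0])

print(a_ref, a_sum)
print(mu11_ref, mu11_sum)
```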
Link to Odds ratio
We can have different parameter estimates depending
on type of constraints we set. So, what is unique
about parameters?
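The differences are unique:
$\lambda_i^A - \lambda_{i'}^A$
$\lambda_j^B - \lambda_{j'}^B$
The odds are also unique!
$\log(\text{odds}) = \log(\mu_{i1}/\mu_{i2}) = \log(\mu_{i1}) - \log(\mu_{i2})$
$= (\lambda + \lambda_i^A + \lambda_1^B) - (\lambda + \lambda_i^A + \lambda_2^B)$
$= \lambda_1^B - \lambda_2^B$
How about the odds ratio?
$\log(\text{odds ratio}) = \log\left(\frac{\mu_{11}\mu_{22}}{\mu_{12}\mu_{21}}\right)$
$= \log(\mu_{11}) + \log(\mu_{22}) - \log(\mu_{12}) - \log(\mu_{21})$ ...
=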
The odds ratio measures the strength of the
association and depends only on the interaction terms
$\{\lambda_{ij}^{AB}\}$.
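One way to see this is to write out the log odds ratio for the first two rows and columns, assuming the two-way model with an interaction term, $\log \mu_{ij} = \lambda + \lambda_i^A + \lambda_j^B + \lambda_{ij}^{AB}$. The symbol $\theta$ for the odds ratio is introduced here only for this sketch.

```latex
\begin{aligned}
\log\theta
  &= \log\mu_{11} + \log\mu_{22} - \log\mu_{12} - \log\mu_{21} \\
  &= \lambda_{11}^{AB} + \lambda_{22}^{AB} - \lambda_{12}^{AB} - \lambda_{21}^{AB}
\end{aligned}
```

The terms $\lambda$, $\lambda_i^A$ and $\lambda_j^B$ each enter twice with a plus sign and twice with a minus sign, so they cancel. Under independence every interaction term is zero, which gives $\log\theta = 0$ and $\theta = 1$.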
How many numbers do we need to completely
characterize associations in I × J tables?
Saturated Loglinear Model for two-way tables:
$\log \mu_{ij} = \lambda + \lambda_i^A + \lambda_j^B + \lambda_{ij}^{AB}$
$\lambda_{ij}^{AB}$ represents an interaction/association between
the two variables, reflects the departure from
independence, and ensures that $\mu_{ij} = n_{ij}$.
What constraint must hold?
The saturated model
1. has fitted values exactly equal to ...
2. has df = 0,
3. is the most complex model,
4. has the independence model as a special case,
5. has a direct functional relationship with the
odds ratios (and the unique number of those).
See vote.sas example.
We typically want a simpler, more parsimonious model
that smooths the data more.
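As a sketch of point 5, assume the last row and last column are the reference levels (the dummy-coding convention). Each free interaction term of the saturated model is then the log odds ratio of a cell against the reference row and column, so an I × J table has (I − 1)(J − 1) of them. The code uses the 3 × 3 vote table; the variable names are illustrative.

```python
import math

counts = [
    [70, 324, 56],     # Liberal
    [195, 332, 101],   # Moderate
    [382, 199, 117],   # Conservative (reference row)
]
I, J = len(counts), len(counts[0])

# Interaction terms under reference coding: log odds ratio of cell (i, j)
# against the last row and last column.
lam_ab = [
    [
        math.log(counts[i][j] * counts[I - 1][J - 1]
                 / (counts[i][J - 1] * counts[I - 1][j]))
        for j in range(J - 1)
    ]
    for i in range(I - 1)
]

n_free = sum(len(row) for row in lam_ab)  # (I - 1) * (J - 1)
print(n_free)
for row in lam_ab:
    print([round(v, 4) for v in row])
```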
Hierarchical Models
These models include all lower order terms that
comprise higher-order terms in the model.
(A,B) is a simpler model than (AB)
Interpretation does not depend on how the variables
are coded.
Is this a hierarchical model?
$\log \mu_{ij} = \lambda + \lambda_i^A + \lambda_{ij}^{AB}$
Loglinear Models for three-way tables
Extending the log-linear model notation to 3-way
tables:
$\log \mu_{ijk} = \lambda + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{jk}^{BC} + \lambda_{ijk}^{ABC}$
The main questions for the next lecture are:
What do the $\lambda$ terms mean in this model? Which
hypotheses about them correspond to the models of
independence we already know?
What are some efficient ways to specify and interpret
these models and tables?
What are some efficient ways to fit and select among
many possible models in three and higher dimensions?
Example. Let’s go back to our familiar dataset on
graduate admissions at Berkeley:
         Men        Men        Women      Women
Dept.    rejected   accepted   rejected   accepted
A             313        512         19         89
B             207        353          8         17
C             205        120        391        202
D             278        139        244        131
E             138         53        299         94
F             351         22        317         24
Let D = department, S = sex, and A = admission
status (rejected or accepted). We analyzed this as a
three-way table on Assignment 5; more specifically, we
looked at partial and marginal tables. Now we’ll look
at it from a loglinear point of view. Let $y_i$ be the
frequency or count in a particular cell of the
three-way table.
Saturated loglinear model:
Using PROC GENMOD, let’s fit the saturated
loglinear model.
options nocenter nodate nonumber linesize=72;
data berkeley;
input D $ S $ A $ y;
cards;
DeptA Male Reject 313
DeptA Male Accept 512
DeptA Female Reject 19
DeptA Female Accept 89
DeptB Male Reject 207
DeptB Male Accept 353
DeptB Female Reject 8
DeptB Female Accept 17
DeptC Male Reject 205
DeptC Male Accept 120
DeptC Female Reject 391
DeptC Female Accept 202
DeptD Male Reject 278
DeptD Male Accept 139
DeptD Female Reject 244
DeptD Female Accept 131
DeptE Male Reject 138
DeptE Male Accept 53
DeptE Female Reject 299
DeptE Female Accept 94
DeptF Male Reject 351
DeptF Male Accept 22
DeptF Female Reject 317
DeptF Female Accept 24
;
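/* Saturated loglinear model: Poisson counts with a log link. */
/* order=data keeps class levels in the order they appear in the data. */
proc genmod data=berkeley order=data;
class D S A;
model y = D S A D*S D*A S*A D*S*A / dist=poisson link=log;
run;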
When you use the order=data option, GENMOD
orders the levels of class variables in the same order
as they appear in the dataset. For each class variable,
GENMOD creates a set of dummy variables using the last
category as the reference group. Therefore, we can
interpret a two-way association as a log-odds ratio for
the two variables in question, with the other variable
held constant at its last category.
Here’s a portion of the SAS output. I edited the table
of ML estimates to remove the omitted zero terms.
Analysis Of Parameter Estimates
                                           Standard
Parameter                   DF  Estimate      Error
Intercept                    1    3.1781     0.2041
D      DeptA                 1    1.3106     0.2300
D      DeptB                 1   -0.3448     0.3170
D      DeptC                 1    2.1302     0.2159
D      DeptD                 1    1.6971     0.2220
D      DeptE                 1    1.3652     0.2287
S      Male                  1   -0.0870     0.2952
A      Reject                1    2.5808     0.2117
D*S    DeptA Male            1    1.8367     0.3167
D*S    DeptB Male            1    3.1203     0.3857
D*S    DeptC Male            1   -0.4338     0.3169
D*S    DeptD Male            1    0.1463     0.3193
D*S    DeptE Male            1   -0.4860     0.3415
D*A    DeptA Reject          1   -4.1250     0.3297
D*A    DeptB Reject          1   -3.3346     0.4782
D*A    DeptC Reject          1   -1.9204     0.2288
D*A    DeptD Reject          1   -1.9589     0.2378
D*A    DeptE Reject          1   -1.4237     0.2425
S*A    Male Reject           1    0.1889     0.3052
D*S*A  DeptA Male Reject     1    0.8632     0.4027
D*S*A  DeptB Male Reject     1    0.0311     0.5335
D*S*A  DeptC Male Reject     1   -0.3138     0.3374
D*S*A  DeptD Male Reject     1   -0.1177     0.3401
D*S*A  DeptE Male Reject     1   -0.3891     0.3650
Scale                        0    1.0000     0.0000
The intercept is a normalizing constant and should be
ignored. The main effects for D, S and A are all
difficult to interpret and not very meaningful. But the
two- and three-way associations are highly
meaningful. For example, the estimated coefficient for
the SA association is 0.1889.
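Because the saturated model reproduces the observed counts, several of these estimates can be recovered directly from the data. The sketch below assumes last-level reference coding (Dept F, Female, Accept), as described for GENMOD with order=data. It is plain Python, not SAS output, and the variable names are illustrative.

```python
import math

# Berkeley counts per department:
# (male reject, male accept, female reject, female accept)
counts = {
    "A": (313, 512, 19, 89),
    "B": (207, 353, 8, 17),
    "C": (205, 120, 391, 202),
    "D": (278, 139, 244, 131),
    "E": (138, 53, 299, 94),
    "F": (351, 22, 317, 24),
}

m_rej_f, m_acc_f, f_rej_f, f_acc_f = counts["F"]

# Reference cell: Dept F, Female, Accept (the last level of each factor).
intercept = math.log(f_acc_f)

# Main effects: log ratio against the reference cell, changing one factor.
s_male = math.log(m_acc_f / f_acc_f)            # S Male
a_reject = math.log(f_rej_f / f_acc_f)          # A Reject
d_dept_a = math.log(counts["A"][3] / f_acc_f)   # D DeptA

# S*A Male Reject: log odds ratio of sex by admission within Dept F.
sa_male_reject = math.log(m_rej_f * f_acc_f / (m_acc_f * f_rej_f))

for name, value in [("Intercept", intercept), ("D DeptA", d_dept_a),
                    ("S Male", s_male), ("A Reject", a_reject),
                    ("S*A Male Reject", sa_male_reject)]:
    print(f"{name:16s} {value:8.4f}")
```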
Exponentiating this coefficient gives
exp(0.1889) = 1.208,
which is the estimated SA odds ratio for Department
F. The reference group for S is “women,” and the
reference group for A is “accept.” If we write the
2 × 2 table for S × A in Department F, with the
reference groups in the last row and column, we get
Dept F Reject Accept
Men 351 22
Women 317 24
for which the estimated odds ratio is
$\frac{351 \times 24}{317 \times 22} = 1.208$.
The Wald z-statistic for this coefficient,
$z = \frac{0.1889}{0.3052} = 0.62$,
indicates that the SA odds ratio for Department F is
not significantly different from 1.00.
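The same numbers can be checked from the Department F table alone. The sketch below uses the usual large-sample standard error of a log odds ratio from a 2 × 2 table (the square root of the sum of reciprocal cell counts) and a two-sided normal p-value. It is an illustration in plain Python, not part of the SAS output.

```python
import math

# 2 x 2 table for Department F (rows: men, women; columns: reject, accept).
men_reject, men_accept = 351, 22
women_reject, women_accept = 317, 24

odds_ratio = (men_reject * women_accept) / (women_reject * men_accept)
log_or = math.log(odds_ratio)

# Large-sample standard error of the log odds ratio:
# square root of the sum of reciprocal cell counts.
se = math.sqrt(1 / men_reject + 1 / men_accept
               + 1 / women_reject + 1 / women_accept)

z = log_or / se

# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"OR = {odds_ratio:.3f}, log OR = {log_or:.4f}, SE = {se:.4f}, z = {z:.2f}")
```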
To get the SA odds ratio for any other department,
we have to combine the SA coefficient with one of the
DSA coefficients. For example, the SA odds ratio for
Department A is
exp(0.1889 + 0.8632) = 2.864.
The Wald z-statistic for the first DSA coefficient,
$z = \frac{0.8632}{0.4027} = 2.14$,
indicates that the SA odds ratio for Department A is
significantly different from the SA odds ratio in
Department F. To see if the SA odds ratio in
Department A is significantly different from 1.00, we
would have to compute the standard error of the sum of
the two coefficients using the estimated covariance
matrix.
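For the saturated model there is a shortcut that avoids the covariance matrix. The sum of the SA and DSA coefficients is the log odds ratio of Department A's own 2 × 2 table, so its large-sample standard error is the square root of the sum of that table's reciprocal cell counts. The sketch below relies on that fact; the helper function is hypothetical, and the code is plain Python rather than GENMOD output.

```python
import math


def log_odds_ratio(men_reject, men_accept, women_reject, women_accept):
    """Return (log odds ratio, its variance) for one sex-by-admission table."""
    log_or = math.log(men_reject * women_accept / (men_accept * women_reject))
    var = (1 / men_reject + 1 / men_accept
           + 1 / women_reject + 1 / women_accept)
    return log_or, var


log_or_a, var_a = log_odds_ratio(313, 512, 19, 89)   # Department A
log_or_f, var_f = log_odds_ratio(351, 22, 317, 24)   # Department F

# SA odds ratio in Department A and its Wald z-statistic against 1.00.
odds_ratio_a = math.exp(log_or_a)
se_a = math.sqrt(var_a)
z_a = log_or_a / se_a

# The DSA coefficient is the difference of the two log odds ratios. The two
# departments use disjoint cells, so the variances add.
dsa = log_or_a - log_or_f
se_dsa = math.sqrt(var_a + var_f)
z_dsa = dsa / se_dsa

print(f"Dept A: OR = {odds_ratio_a:.3f}, SE(log OR) = {se_a:.4f}, z = {z_a:.2f}")
print(f"DSA:    {dsa:.4f}, SE = {se_dsa:.4f}, z = {z_dsa:.2f}")
```

For models that are not saturated this shortcut does not apply, and the standard error has to come from the estimated covariance matrix.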