multilevel models for binary and ordered response...
TRANSCRIPT
Multilevel Models for Binary and Ordered Response Data
UK Data Archive Study Number 6765 - Health Survey for England, 2003-2005: Multilevel Modelling Teaching Datasets
Generalisedmultilevel models
•So Far
Response at level 1 has been a continuous variable and
associated level 1 random term has been assumed to have
a Normal distribution
•Now a range of other data types for the response
All can be handled routinely by MLwiN
•Achieved by 2 aspects
a non‐linear link between response and predictors
a non‐Gaussian level 1 distribution
Typology of discrete responses
Response Example Model
BinaryCategorical
Yes/No Logit or probit or log-log model with binomial L1 random term
Proportion Proportion un-employed Logit etc. with binomial L1 random term
Multiple categories Travel by train, car, foot Logit model with unordered multi-nomialrandom term
Ordered Category Underweight, Ideal, Overweight, Obese, Morbidly Obese
Logit model with ordered multi-nomialrandom term
Count No of crimes in area Log model with L1 Poisson random term
Count LOS Log model with L1 NBD random term
4
Binary Response Data
yij = 0 or 1
)1Pr( == ijij yπ
),Binomial(~ ijijij ny π
ijijijij nyVar /)1()( ππ −= , where nij = 1 for binary data.
5
Modelling Proportions
E.g. yij is proportion using contraception in area i of region j
Denominator nij could be number of women of reproductive age in area i of region j
After defining nij we assume yij has a binomial distribution with mean
ijπ
Focus on modelling proportions•Proportions eg death rate; employment rate; obesity rates: can be conceived as the underlying probability of dying; probability of being employed; probability of being obese
•Four important attributes of a proportion that MUST be taken into account in modelling
(1) Closed range: bounded between 0 and 1
(2) Anticipated non‐linearity between response and predictors; as predicted response approaches bounds, greater and greater change in x is required to achieve the same change in outcome; examination analogy(3) Consists of two numbers: numerator which is subset of denominator(4) Heterogeneity: variance is not homoscedastic; two aspects(a) the variance depends on the mean;
as approach bound of 0 and 1, less room to varyie Variance is a function of the predicted probability
(b) the variance depends on the denominator;small denominators result in highly variable proportions
Modelling Proportions
•Linear probability model: that is use standard regression model with linear relationship and Gaussian random term
•But 3 problems
(1) Nonsensical predictions: predicted proportions are unbounded, outside range of 0 and 1
(2) Anticipated non‐linearity as approach bounds
(3) Heterogeneity: inherent unequal variance
dependent on mean and on denominator
•Logit model with Binomial random term resolves all three problems (could use probit, clog‐clog)
The logistic model: resolves problems 1 & 2
• Models not the proportion but a non‐linear transformation of it (solves problems 1+2)
•The relationship between the probability and predictor(s) can be represented by a logistic function, that resembles a S-shaped curve
• L = LOGe(p/ (1‐p))
• L = Logit = the log of the odds
• p = proportion having an attribute
• 1‐p = proportion not having the attribute
• p/(1‐p) = the odds of having an attribute compared to not having an attribute
• As p goes from 0 to 1, L goes from minus to plus infinity, so if model L, cannot get predicted proportions that lie outside 0 and 1; (ie solves problem 1)
• Easy to move between proportions, odds and logits
The Logit transformation
Proportions, Odds and LogitsProportion/Probability Odds
A 5 out of 10 5 to 5B 6 out of 10 6 to 4C 8 out of 10 8 to 2
Logit OddsA e0 1.0B e0.41 1.5C e1.39 4
Proportion(p)
Odds(p/1-p)
Log of oddsLoge (p/1-p)
A 0.5 1.0 0B 0.6 1.5 0.41C 0.8 4 1.39
Logit ProportionA e0/(1+ e0) 0.5B e0.41/(1+ e0.41) 0.6C e1.39/(1+ e1.39) 0.8
The logistic model
110
110
1 x
x
ee
ββ
ββ
π +
+
+=
where e is the base of the natural logarithm
• linearized by the logit transformation(log = natural logarithm)
1101log xββ
ππ
+=⎟⎠⎞
⎜⎝⎛−
The underlying probability or proportion is non‐linearly related to the predictor
The logistic model: key characteristics
• The logit transformation produces a linear function of the parameters.
• Bounded between 0 and 1• Thereby solving problems 1 and 2
Solving problem 3: Assume Binomial variation
• Variance of the response in logistic models is presumed to be binomial:
1,)ˆ1(ˆ, 2 =
−=+= ei
i
iiiiiii nzzey σπππ
nyVar )1()|( πππ −
=
I.E. depends on underlying proportion and the denominator
• In practice this is achieved by replacing the constant variable at level 1 by a binomial weight, z, and constraining the level‐1 variance to 1 for exact binomial variation
• The random (level‐1) component can be written as
Multilevel Logistic Model
),(Binomial~ ijijij ny π
jijijijij
ijij uxxx 03322110)1(
ln)(logit ++++=−
= ββββπ
ππ
• Underlying proportions/probabilities, in turn, are related to a set of individual and neighborhood predictors by the logit link function
• Linear predictor of the fixed part and the higher‐level random part
• Assume observed response comes from a Binomial distribution with a denominator for each cell, and an underlying probability/proportion
15
2‐Level Random Intercept Logit Model
ijijij ey += π , where eij has mean 0 and variance )1( ijij ππ −
jijij
ij ux ++=⎟⎟⎠
⎞⎜⎜⎝
⎛
− 101log ββ
ππ
),0N(~ 2
uju σ
16
Other Link Functions
More generally, a model of binary responses can be written:
jijij uxf ++= 10)( ββπ where f(.) is logit (see above), probit, or complementary log-log function – basically any function that maps from the whole real line to [0,1]
17
Interpretation: Fixed Part
jijij
ij ux ++=⎟⎟⎠
⎞⎜⎜⎝
⎛
− 101log ββ
ππ , ),0N(~ 2
uju σ
• 0β is the log-odds that y = 1 when x = 0 and u = 0
• 1β is effect on log-odds of 1-unit increase in x for individuals in same group (same value of u)
• 1β often referred to as cluster-specific effect of x
18
Interpretation: Random Part
jijij
ij ux ++=⎟⎟⎠
⎞⎜⎜⎝
⎛
− 101log ββ
ππ , ),0N(~ 2
uju σ
• uj is the effect of being in group j on the log-odds that y = 1;
also known as the level 2 residual
• As for continuous y, we can obtain estimates and confidence intervals for uj
• 2uσ is the level 2 (residual) variance, or the between-group
variance in the log-odds that y = 1 after accounting for x
19
Response Probabilities from Logit Model
Response probability for individual i in group j calculated as:
)exp(1
)exp(
10
10
jij
jijij ux
ux
+++++
=ββ
ββπ
Substitute estimates of 0β , 1β and uj to get predicted probability ijπ . We can also make predictions for ‘typical’ individuals with particular values for x, but need to decide what to substitute for uj (see later).
20
Example: US Voting Intentions
Individuals (at level 1) within states (at level 2) Results from null logit model (no x): Parameter Estimate se 0β (intercept) -0.107 0.049 2uσ (between-state variance) 0.091 0.023
• Is 2
uσ significantly different from zero? • Does 2
uσ = 0.091 represent a large state effect?
21
Testing the Level 2 Variance
• Likelihood ratio test. Only available if model estimated via maximum likelihood (not in MLwiN)
• Wald test (equivalent to t-test), but only approximate because variance estimates do not have normal sampling distributions
• Bayesian credible intervals. Available if model estimated using Markov chain Monte Carlo (MCMC) methods
22
Example of Wald Test: US Voting Data
Wald statistic = 65.15023.0091.0
seˆ 22
=⎟⎠⎞
⎜⎝⎛=⎟⎟
⎠
⎞⎜⎜⎝
⎛ uσ
Compare with 21χ → reject H0 that 2
uσ = 0 and conclude there are state differences.
Take p-value/2 because HA is that 2uσ > 0 (one-sided).
23
State Effects on Probability of Voting Bush
Calculate π for ‘average’ states (u = 0) and for states 2 s.d. above and below the average (u = uσ2± )
302.0091.0ˆ ==uσ
uu σ2−= = -0.603 → π = 0.33 u = 0 → π = 0.47
uu σ2+= = +0.603 → π = 0.62 Under a normal distribution, expect 95% of states to have π within (0.33, 0.62)
24
State Residuals with 95% Confidence Intervals
25
Adding Income as a Predictorxij is household income (9 categories) centred at mean of 5.23 Parameter Estimate se 0β -0.099 0.056
1β (income effect) 0.140 0.008 2uσ (between-state variance) 0.125 0.030
• 0.140 is effect on the log-odds of voting Bush of a 1-category
increase in income, adjusting for state differences
• odds of voting Bush are exp(8×0.14) = 3.1 times higher for an individual in the highest income band than for individual in same state but lowest income band
26
Prediction Lines by State: Random Intercepts
27
Latent Variable RepresentationConsider latent continuous y* underlying observed binary y:
⎪⎩
⎪⎨⎧
<
≥=
0if0
0if1*
*
ij
ijij y
yy
Threshold model:
*10
*ijjijij euxy +++= ββ
• *ije ~ N(0,1) → probit model
• *ije ~ standard logistic (variance ≈ 3.29) → logit model
28
Variance Partition Coefficient for Binary y
From threshold model for latent y* we obtain:
22*
2VPC
ue
u
σσσ+
=
where 2
*eσ = 1 for probit model and 3.29 for logit model In voting intentions example, VPC = 0.037 (for logit). Adjusting for income, about 4% of variance in propensity to vote Bush is attributable to state differences.
29
Impact of Adding uj on Coefficients
Single-level model: *10
*iii exy ++= ββ
29.3)var()|var( ** == iii exy
Multilevel model: *
10*
ijjijij euxy +++= ββ
29.3)var()var(),|var( 2** +=+= uijjii euuxy σ Adding uj increases residual variance, so scale of y* stretched out and 0β and 1β increase in absolute value. Coefficients have different interpretation to single-level andother marginal models.
30
Marginal Model for Clustered y
An alternative to a random effects model is a marginal model, which has 2 components: • A generalised linear model for relationship between ijπ and
xij
• Specification of structure of correlation between pairs of
individuals in same group, e.g. exchangeable (equal correlation) or autocorrelation structure
Estimated using Generalised Estimation Equations (GEE)
31
Marginal vs. Random Effects Approaches
Marginal • Clustering regarded as nuisance • No parameter representing between-group variance • No distributional assumptions about group effects (but no
estimates either) Random effects • Clustering of substantive interest • Estimate of between-group variance 2
uσ and group effects • Can allow between-group variance to depend on x (via
random coefficients)
32
Marginal and Random Effects Models
Marginal β have population-averaged (PA) interpretation Random effects β have cluster-specific (CS) interpretation Random intercept logit model
jijCSCS
ij ux ++= 10)logit( ββπ , where ),0(~ 2uj Nu σ
Marginal logit model
ijPAPA
ij x10)logit( ββπ += with specification of structure of within-cluster covariance matrix
33
Interpretation of CS and PA Effects
Cluster-specific • CS
1β is effect of x on log-odds for a given cluster, i.e. holding constant (or conditioning on) cluster-specific unobservables
• Compares two individuals in same cluster with x-values 1 unit apart
Population-averaged • PA
1β is effect of x on log-odds in the study population, i.e. averaging over cluster-specific unobservables
34
Example: PA vs. CS Interpretation (1)
Consider longitudinal study with yij indicating whether patient j has adverse reaction at occasion i to dose xij of a drug. CS1β is effect of 1-unit increase in dose, holding constant
time-invariant unobserved characteristics uj.
PA1β compares individuals whose dosage xij differs by 1 unit,
averaging over between-individual differences in tolerance.
Dose is time-varying (level 1). What about a level 2 x?
35
Example: PA vs. CS Interpretation (2)
Suppose we add gender (x2j) with coefficient 2β . • Because x2j is fixed over time, cannot interpret CS
2β as a within-person effect. Instead CS
2β compares men and women with same value of xij and uj (same dose and unobserved characteristics)
• PA2β compares men and women receiving same dose xij,
averaging over individual unobservables For a level 2 variable PA
2β usually of greater interest.
36
Obtaining PA Effects from a CS Model: Predicted Probabilities
Compute π for *xx = using usual formula for predicted probabilities from a logit model:
)*ˆexp(1)*ˆexp(
*ux
ux++
+=
ββπ
What value for u? u=0, i.e. the mean?
Because of nonlinearity of logistic transformation, *π at u=0 is not equal to the mean *π
Solutions: integrate out u, or simulate distribution of u
37
Predicted Probabilities by Simulation (now implemented in MLwiN)
(i) Generate M values for random effect u from )ˆ,0( 2uN σ :
u(1) , u(2) . . ., u(M)
(ii) For m = 1,…,M compute (for x=x*):
)*ˆexp(1
)*ˆexp(
)(
)(*)(
m
mm ux
ux
++
+=
ββ
π
(iii) Mean (population average) probability:
∑=m
mM*)(
1* ππ
38
Estimation in MLwiN
• Quasi-likelihood (MQL/PQL – 1st and 2nd
order) are approximate procedures, good for model screening– MQL1 crudest approximation. Estimates may be biased
downwards (esp. if clusters small). But stable.– PQL2 best approximation, but may not converge.– Tip: Start with MQL1 to get starting values for PQL.
• MCMC better (check results for final model)
Ordered Response Models
• When we have several categories and there is an underlying ordering to the categories then a convenient parameterisation is to work with cumulative probabilities i.e. the probabilities that an inidividual crosses each threshold between categories. For example with obesity ranges:
Class Probability Threshold Cumulative Probability
Under π1i ≤Under γ1i = π1i
Ideal π2i ≤Ideal γ2i = π1i+π2i
Over π3i ≤Obese γ3i = π1i+π2i+π3i
Obese π4i ≤ Obese γ4i = π1i+π2i+π3i+π4i
M. Obese π5i ≤ M.Obese γ5i = π1i+π2i+π3i+π4i+π5i= 1
With an ordered multinomial model we work with the set of cumulative probabilities, γki. As we see that γ5i = 1 we need only model four categories in the model. This modelling will collapse to the binary logistic modelling we have already considered when there are only 2 categories.
A model with no explanatory variables
• We will again use the logit link so we have:
obese of odds log ))1/(log(tover weigh of odds log ))1/(log(
weightideal of odds log ))1/(log(htunder weig of odds log ))1/(log(
344
233
122
011
≤=−≤=−≤=−≤=−
βγγβγγβγγβγγ
ii
ii
ii
ii
The threshold probabilities γki. are given by anti‐logit(βki )Because γ1i ≤ γ2i ≤ γ3i ≤ γ4i it follows that β0i ≤ β1i ≤ β2i ≤ β3i
Adding covariates to the model
...obese of odds log ))1/(log(
tover weigh of odds log ))1/(log( weightideal of odds log ))1/(log(
htunder weig of odds log ))1/(log(
2514
344
233
122
011
++=≤+=−≤+=−≤+=−≤+=−
iii
iii
iii
iii
iii
xxhhhhh
βββγγβγγβγγβγγ
Here predictors are added in common across the response threshold categories. This is referred to as proportional odds modelling. This means that the log odds ratios and odds ratios for threshold category membership are independent of the predictor variables.
Multilevel Modelling etc.
• Ordered multinomial models can be extended to multilevel settings by adding in random effects to the equation for hi as in the binary case.
• The proportional odds assumption can be tested and relaxed.
• These features need a more advanced workshop but see material at the multilevel modelling centre website for details.