
Multilevel Models for Binary and Ordered Response Data

UK Data Archive Study Number 6765 - Health Survey for England, 2003-2005: Multilevel Modelling Teaching Datasets

Generalised multilevel models

• So far

Response at level 1 has been a continuous variable and the associated level 1 random term has been assumed to have a Normal distribution

• Now a range of other data types for the response

All can be handled routinely by MLwiN

• Achieved by 2 aspects

a non‐linear link between response and predictors

a non‐Gaussian level 1 distribution

Typology of discrete responses

Response | Example | Model
Binary categorical | Yes/No | Logit, probit or log-log model with binomial L1 random term
Proportion | Proportion unemployed | Logit etc. with binomial L1 random term
Multiple categories | Travel by train, car, foot | Logit model with unordered multinomial random term
Ordered categories | Underweight, Ideal, Overweight, Obese, Morbidly Obese | Logit model with ordered multinomial random term
Count | No. of crimes in area | Log model with L1 Poisson random term
Count | LOS | Log model with L1 NBD random term

Binary Response Data

$y_{ij} = 0$ or $1$

$\pi_{ij} = \Pr(y_{ij} = 1)$

$y_{ij} \sim \text{Binomial}(n_{ij}, \pi_{ij})$

$\text{Var}(y_{ij}) = \pi_{ij}(1 - \pi_{ij}) / n_{ij}$, where $n_{ij} = 1$ for binary data.

Modelling Proportions

E.g. yij is proportion using contraception in area i of region j

Denominator nij could be number of women of reproductive age in area i of region j

After defining $n_{ij}$ we assume $y_{ij}$ has a binomial distribution with mean $\pi_{ij}$

Focus on modelling proportions

• Proportions, e.g. death rate, employment rate, obesity rate, can be conceived as the underlying probability of dying, of being employed, of being obese

• Four important attributes of a proportion that MUST be taken into account in modelling

(1) Closed range: bounded between 0 and 1

(2) Anticipated non‐linearity between response and predictors: as the predicted response approaches the bounds, a greater and greater change in x is required to achieve the same change in outcome (the examination analogy)

(3) Consists of two numbers: a numerator which is a subset of the denominator

(4) Heterogeneity: the variance is not homoscedastic; two aspects

(a) the variance depends on the mean: as the proportion approaches the bounds of 0 and 1 there is less room to vary, i.e. the variance is a function of the predicted probability

(b) the variance depends on the denominator: small denominators result in highly variable proportions
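As a rough illustration of attribute (4), the Python sketch below evaluates the binomial variance π(1 − π)/n stated earlier for binary data. The function name and the chosen values of π and n are illustrative assumptions only.

```python
def proportion_variance(pi, n):
    """Binomial variance of an observed proportion with underlying
    probability pi and denominator n: pi * (1 - pi) / n."""
    return pi * (1.0 - pi) / n


# Illustrative values only: the variance shrinks as pi nears a bound
# and as the denominator n grows.
for pi in (0.5, 0.9, 0.99):
    for n in (10, 1000):
        var = proportion_variance(pi, n)
        print(f"pi={pi:<5} n={n:<5} variance={var:.6f}")
```

The variance is largest at π = 0.5 and for small n, which is why a constant-variance Gaussian level 1 term is unsuitable for proportions.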

Modelling Proportions

•Linear probability model: that is use standard regression model with linear relationship and Gaussian random term

•But 3 problems

(1) Nonsensical predictions: predicted proportions are unbounded, outside range of 0 and 1

(2) Anticipated non‐linearity as approach bounds

(3) Heterogeneity: inherent unequal variance, dependent on the mean and on the denominator

• A logit model with a Binomial random term resolves all three problems (a probit or complementary log‐log link could be used instead)

The logistic model: resolves problems 1 & 2

• Models not the proportion but a non‐linear transformation of it (solves problems 1+2)

•The relationship between the probability and predictor(s) can be represented by a logistic function, that resembles a S-shaped curve

• L = log_e(p / (1 − p))

• L = Logit = the log of the odds

• p = proportion having an attribute

• 1‐p = proportion not having the attribute

• p/(1‐p) = the odds of having an attribute compared to not having an attribute

• As p goes from 0 to 1, L goes from minus to plus infinity, so if we model L we cannot get predicted proportions that lie outside 0 and 1 (i.e. this solves problem 1)

• Easy to move between proportions, odds and logits

The Logit transformation

Proportions, Odds and Logits

Case | Proportion/Probability | Odds
A | 5 out of 10 | 5 to 5
B | 6 out of 10 | 6 to 4
C | 8 out of 10 | 8 to 2

Case | Proportion (p) | Odds (p/(1-p)) | Log of odds, log_e(p/(1-p))
A | 0.5 | 1.0 | 0
B | 0.6 | 1.5 | 0.41
C | 0.8 | 4 | 1.39

Case | Logit to odds | Odds
A | e^0 | 1.0
B | e^0.41 | 1.5
C | e^1.39 | 4

Case | Logit to proportion | Proportion
A | e^0/(1 + e^0) | 0.5
B | e^0.41/(1 + e^0.41) | 0.6
C | e^1.39/(1 + e^1.39) | 0.8
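The conversions for cases A, B and C can be checked with a few lines of Python. This is only a sketch; the helper names are illustrative and are not part of MLwiN.

```python
import math


def odds(p):
    """Odds of having the attribute: p / (1 - p)."""
    return p / (1.0 - p)


def logit(p):
    """Log of the odds (natural logarithm)."""
    return math.log(odds(p))


def inv_logit(l):
    """Back-transform a logit to a proportion: e^L / (1 + e^L)."""
    return math.exp(l) / (1.0 + math.exp(l))


# Cases A, B and C from the slide.
for label, p in (("A", 0.5), ("B", 0.6), ("C", 0.8)):
    l = logit(p)
    print(label, p, round(odds(p), 2), round(l, 2), round(inv_logit(l), 2))
```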

The logistic model

$\pi = \dfrac{e^{\beta_0 + \beta_1 x_1}}{1 + e^{\beta_0 + \beta_1 x_1}}$

where e is the base of the natural logarithm

• linearized by the logit transformation (log = natural logarithm)

$\log\left(\dfrac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x_1$

The underlying probability or proportion is non‐linearly related to the predictor

The logistic model: key characteristics

• The logit transformation produces a linear function of the parameters.

• The predicted proportion is bounded between 0 and 1

• Thereby solving problems 1 and 2

Solving problem 3: Assume Binomial variation

• Variance of the response in logistic models is presumed to be binomial:

$\text{Var}(y \mid \pi) = \dfrac{\pi(1 - \pi)}{n}$

i.e. it depends on the underlying proportion and the denominator

• In practice this is achieved by replacing the constant variable at level 1 by a binomial weight, z, and constraining the level‐1 variance to 1 for exact binomial variation

• The random (level‐1) component can be written as

$y_i = \pi_i + e_i z_i, \quad z_i = \sqrt{\hat{\pi}_i (1 - \hat{\pi}_i) / n_i}, \quad \sigma^2_e = 1$

Multilevel Logistic Model

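$y_{ij} \sim \text{Binomial}(n_{ij}, \pi_{ij})$

$\text{logit}(\pi_{ij}) = \ln\left(\dfrac{\pi_{ij}}{1 - \pi_{ij}}\right) = \beta_0 + \beta_1 x_{1ij} + \beta_2 x_{2ij} + \beta_3 x_{3ij} + u_{0j}$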

• Underlying proportions/probabilities, in turn, are related to a set of individual and neighborhood predictors by the logit link function

• Linear predictor of the fixed part and the higher‐level random part

• Assume observed response comes from a Binomial distribution with a denominator for each cell, and an underlying probability/proportion

2‐Level Random Intercept Logit Model

$y_{ij} = \pi_{ij} + e_{ij}$, where $e_{ij}$ has mean 0 and variance $\pi_{ij}(1 - \pi_{ij})$

$\log\left(\dfrac{\pi_{ij}}{1 - \pi_{ij}}\right) = \beta_0 + \beta_1 x_{ij} + u_j$

$u_j \sim N(0, \sigma^2_u)$

Other Link Functions

More generally, a model of binary responses can be written:

$f(\pi_{ij}) = \beta_0 + \beta_1 x_{ij} + u_j$

where f(.) is the logit (see above), probit, or complementary log-log function: basically any function that maps probabilities in (0, 1) onto the whole real line
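To make the three link functions concrete, here is a minimal Python sketch that evaluates each one at a few probabilities. The function names are illustrative; the probit uses `statistics.NormalDist().inv_cdf`, which assumes Python 3.8 or later.

```python
import math
from statistics import NormalDist


def logit(p):
    """Log-odds link."""
    return math.log(p / (1.0 - p))


def probit(p):
    """Inverse of the standard Normal cumulative distribution function."""
    return NormalDist().inv_cdf(p)


def cloglog(p):
    """Complementary log-log link: log(-log(1 - p))."""
    return math.log(-math.log(1.0 - p))


# Each link stretches the open interval (0, 1) over the real line.
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(f"p={p:<5} logit={logit(p):7.3f} "
          f"probit={probit(p):7.3f} cloglog={cloglog(p):7.3f}")
```

The logit and probit are symmetric about p = 0.5, whereas the complementary log-log is not.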

Interpretation: Fixed Part

$\log\left(\dfrac{\pi_{ij}}{1 - \pi_{ij}}\right) = \beta_0 + \beta_1 x_{ij} + u_j, \quad u_j \sim N(0, \sigma^2_u)$

• $\beta_0$ is the log-odds that y = 1 when x = 0 and u = 0

• $\beta_1$ is the effect on the log-odds of a 1-unit increase in x for individuals in the same group (same value of u)

• $\beta_1$ is often referred to as the cluster-specific effect of x

Interpretation: Random Part

$\log\left(\dfrac{\pi_{ij}}{1 - \pi_{ij}}\right) = \beta_0 + \beta_1 x_{ij} + u_j, \quad u_j \sim N(0, \sigma^2_u)$

• $u_j$ is the effect of being in group j on the log-odds that y = 1; also known as the level 2 residual

• As for continuous y, we can obtain estimates and confidence intervals for $u_j$

• $\sigma^2_u$ is the level 2 (residual) variance, or the between-group variance in the log-odds that y = 1 after accounting for x

Response Probabilities from Logit Model

Response probability for individual i in group j calculated as:

$\pi_{ij} = \dfrac{\exp(\beta_0 + \beta_1 x_{ij} + u_j)}{1 + \exp(\beta_0 + \beta_1 x_{ij} + u_j)}$

Substitute estimates of $\beta_0$, $\beta_1$ and $u_j$ to get predicted probability $\pi_{ij}$. We can also make predictions for ‘typical’ individuals with particular values for x, but need to decide what to substitute for $u_j$ (see later).

Example: US Voting Intentions

Individuals (at level 1) within states (at level 2)

Results from null logit model (no x):

Parameter | Estimate | se
$\beta_0$ (intercept) | -0.107 | 0.049
$\sigma^2_u$ (between-state variance) | 0.091 | 0.023

• Is $\sigma^2_u$ significantly different from zero?

• Does $\sigma^2_u$ = 0.091 represent a large state effect?


Testing the Level 2 Variance

• Likelihood ratio test. Only available if model estimated via maximum likelihood (not in MLwiN)

• Wald test (equivalent to t-test), but only approximate because variance estimates do not have normal sampling distributions

• Bayesian credible intervals. Available if model estimated using Markov chain Monte Carlo (MCMC) methods

Example of Wald Test: US Voting Data

Wald statistic = $\left(\dfrac{\hat{\sigma}^2_u}{\text{se}}\right)^2 = \left(\dfrac{0.091}{0.023}\right)^2 = 15.65$

Compare with $\chi^2_1$ → reject H0 that $\sigma^2_u = 0$ and conclude there are state differences.

Take p-value/2 because HA is that $\sigma^2_u > 0$ (one-sided).
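The Wald calculation can be reproduced with the Python standard library. This is a sketch only: it uses the identity that the upper tail of a chi-squared distribution on 1 degree of freedom equals erfc(sqrt(w/2)), and the variable names are illustrative.

```python
import math

# Estimates from the null model for the US voting data.
sigma2_u_hat = 0.091   # between-state variance
se = 0.023             # its standard error

# Wald statistic: (estimate / se) squared.
wald = (sigma2_u_hat / se) ** 2

# Upper-tail probability of chi-squared with 1 degree of freedom:
# P(Z^2 > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w / 2)).
p_value = math.erfc(math.sqrt(wald / 2.0))

# Halve the p-value because the alternative (variance > 0) is one-sided.
p_one_sided = p_value / 2.0

print(f"Wald statistic = {wald:.2f}")
print(f"one-sided p-value = {p_one_sided:.2g}")
```

As the slide on testing notes, this test is only approximate because variance estimates do not have normal sampling distributions.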

State Effects on Probability of Voting Bush

Calculate π for ‘average’ states (u = 0) and for states 2 s.d. above and below the average ($u = \pm 2\sigma_u$)

$\hat{\sigma}_u = \sqrt{0.091} = 0.302$

$u = -2\hat{\sigma}_u = -0.603$ → π = 0.33

$u = 0$ → π = 0.47

$u = +2\hat{\sigma}_u = +0.603$ → π = 0.62

Under a normal distribution, expect 95% of states to have π within (0.33, 0.62)
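These three probabilities follow from the null-model estimates via the inverse logit. The Python sketch below repeats the arithmetic; the helper name `inv_logit` is an illustrative choice.

```python
import math


def inv_logit(eta):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Estimates from the null model for the US voting data.
beta0 = -0.107
sigma2_u = 0.091
sigma_u = math.sqrt(sigma2_u)

# 'Average' state and states 2 s.d. below and above the average.
state_effects = {"-2 sd": -2.0 * sigma_u, "average": 0.0, "+2 sd": 2.0 * sigma_u}
for label, u in state_effects.items():
    print(f"{label:>8}: u = {u:+.3f}, pi = {inv_logit(beta0 + u):.2f}")
```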

[Figure: State Residuals with 95% Confidence Intervals]

Adding Income as a Predictor

$x_{ij}$ is household income (9 categories) centred at mean of 5.23

Parameter | Estimate | se
$\beta_0$ | -0.099 | 0.056
$\beta_1$ (income effect) | 0.140 | 0.008
$\sigma^2_u$ (between-state variance) | 0.125 | 0.030

• 0.140 is the effect on the log-odds of voting Bush of a 1-category increase in income, adjusting for state differences

• The odds of voting Bush are exp(8×0.14) = 3.1 times higher for an individual in the highest income band than for an individual in the same state but the lowest income band
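The odds-ratio arithmetic can be written out in Python as below. This is a small sketch with illustrative variable names; the 8-category gap is the distance between the lowest and highest of the 9 income bands.

```python
import math

beta1 = 0.140        # estimated income effect on the log-odds
n_categories = 9     # household income bands

# Odds ratio for a 1-category increase in income, within the same state.
or_one_category = math.exp(beta1)

# Odds ratio comparing the highest with the lowest band (8 steps apart).
or_highest_vs_lowest = math.exp((n_categories - 1) * beta1)

print(f"per-category odds ratio   = {or_one_category:.2f}")
print(f"highest vs lowest band    = {or_highest_vs_lowest:.1f}")
```

Because the state effect u is the same for both individuals being compared, it cancels out of the ratio.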

[Figure: Prediction Lines by State: Random Intercepts]

Latent Variable Representation

Consider latent continuous y* underlying observed binary y:

$y_{ij} = \begin{cases} 1 & \text{if } y^*_{ij} \ge 0 \\ 0 & \text{if } y^*_{ij} < 0 \end{cases}$

Threshold model:

$y^*_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + e^*_{ij}$

• $e^*_{ij} \sim N(0, 1)$ → probit model

• $e^*_{ij} \sim$ standard logistic (variance ≈ 3.29) → logit model

Variance Partition Coefficient for Binary y

From threshold model for latent y* we obtain:

$\text{VPC} = \dfrac{\sigma^2_u}{\sigma^2_u + \sigma^2_{e^*}}$

where $\sigma^2_{e^*}$ = 1 for probit model and 3.29 for logit model

In voting intentions example, VPC = 0.037 (for logit). Adjusting for income, about 4% of variance in propensity to vote Bush is attributable to state differences.
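The VPC for the voting example can be reproduced as follows. This is a sketch with an illustrative function name; the logit level 1 variance is taken as π²/3 ≈ 3.29, the variance of the standard logistic distribution.

```python
import math

# Level 1 variance on the latent scale for each link.
LEVEL1_VARIANCE = {
    "probit": 1.0,
    "logit": math.pi ** 2 / 3.0,   # approximately 3.29
}


def vpc(sigma2_u, link="logit"):
    """Variance partition coefficient from the threshold model."""
    return sigma2_u / (sigma2_u + LEVEL1_VARIANCE[link])


# Between-state variance after adjusting for income.
print(f"VPC (logit) = {vpc(0.125):.3f}")
```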

Impact of Adding uj on Coefficients

Single-level model: $y^*_i = \beta_0 + \beta_1 x_i + e^*_i$

$\text{var}(y^*_i \mid x_i) = \text{var}(e^*_i) = 3.29$

Multilevel model: $y^*_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + e^*_{ij}$

$\text{var}(y^*_{ij} \mid x_{ij}) = \text{var}(u_j) + \text{var}(e^*_{ij}) = \sigma^2_u + 3.29$

Adding $u_j$ increases the residual variance, so the scale of y* is stretched out and $\beta_0$ and $\beta_1$ increase in absolute value. Coefficients have a different interpretation to those from single-level and other marginal models.

Marginal Model for Clustered y

An alternative to a random effects model is a marginal model, which has 2 components:

• A generalised linear model for the relationship between $\pi_{ij}$ and $x_{ij}$

• Specification of the structure of correlation between pairs of individuals in the same group, e.g. exchangeable (equal correlation) or autocorrelation structure

Estimated using Generalised Estimating Equations (GEE)

Marginal vs. Random Effects Approaches

Marginal

• Clustering regarded as nuisance

• No parameter representing between-group variance

• No distributional assumptions about group effects (but no estimates either)

Random effects

• Clustering of substantive interest

• Estimate of between-group variance $\sigma^2_u$ and group effects

• Can allow between-group variance to depend on x (via random coefficients)

Marginal and Random Effects Models

Marginal β have population-averaged (PA) interpretation

Random effects β have cluster-specific (CS) interpretation

Random intercept logit model

$\text{logit}(\pi_{ij}) = \beta_0^{CS} + \beta_1^{CS} x_{ij} + u_j$, where $u_j \sim N(0, \sigma^2_u)$

Marginal logit model

$\text{logit}(\pi_{ij}) = \beta_0^{PA} + \beta_1^{PA} x_{ij}$ with specification of structure of within-cluster covariance matrix

Interpretation of CS and PA Effects

Cluster-specific

• $\beta_1^{CS}$ is effect of x on log-odds for a given cluster, i.e. holding constant (or conditioning on) cluster-specific unobservables

• Compares two individuals in same cluster with x-values 1 unit apart

Population-averaged

• $\beta_1^{PA}$ is effect of x on log-odds in the study population, i.e. averaging over cluster-specific unobservables

Example: PA vs. CS Interpretation (1)

Consider longitudinal study with $y_{ij}$ indicating whether patient j has adverse reaction at occasion i to dose $x_{ij}$ of a drug.

$\beta_1^{CS}$ is effect of 1-unit increase in dose, holding constant time-invariant unobserved characteristics $u_j$.

$\beta_1^{PA}$ compares individuals whose dosage $x_{ij}$ differs by 1 unit, averaging over between-individual differences in tolerance.

Dose is time-varying (level 1). What about a level 2 x?

Example: PA vs. CS Interpretation (2)

Suppose we add gender ($x_{2j}$) with coefficient $\beta_2$.

• Because $x_{2j}$ is fixed over time, cannot interpret $\beta_2^{CS}$ as a within-person effect. Instead $\beta_2^{CS}$ compares men and women with same value of $x_{ij}$ and $u_j$ (same dose and unobserved characteristics)

• $\beta_2^{PA}$ compares men and women receiving same dose $x_{ij}$, averaging over individual unobservables

For a level 2 variable $\beta_2^{PA}$ is usually of greater interest.

Obtaining PA Effects from a CS Model: Predicted Probabilities

Compute π* for x = x* using usual formula for predicted probabilities from a logit model:

$\pi^* = \dfrac{\exp(\hat{\beta} x^* + u)}{1 + \exp(\hat{\beta} x^* + u)}$

What value for u? u = 0, i.e. the mean?

Because of nonlinearity of logistic transformation, π* at u = 0 is not equal to the mean π*

Solutions: integrate out u, or simulate distribution of u

Predicted Probabilities by Simulation (now implemented in MLwiN)

(i) Generate M values for random effect u from $N(0, \hat{\sigma}^2_u)$: $u^{(1)}, u^{(2)}, \ldots, u^{(M)}$

(ii) For m = 1,…,M compute (for x = x*):

$\pi^{*(m)} = \dfrac{\exp(\hat{\beta} x^* + u^{(m)})}{1 + \exp(\hat{\beta} x^* + u^{(m)})}$

(iii) Mean (population average) probability:

$\pi^* = \dfrac{1}{M} \sum_{m} \pi^{*(m)}$
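Steps (i) to (iii) can be sketched in Python as below. This is an illustration only and does not reproduce MLwiN's own implementation: the function name, the seed, the number of draws and the example values (a fixed-part linear predictor of 2.0 and a between-group variance of 4.0) are all hypothetical.

```python
import math
import random


def inv_logit(eta):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def population_average_probability(fixed_part, sigma2_u, n_draws=10000, seed=1):
    """Average the predicted probability over simulated random effects.

    fixed_part is the fixed-part linear predictor (beta-hat times x*).
    """
    rng = random.Random(seed)
    sigma_u = math.sqrt(sigma2_u)
    total = 0.0
    for _ in range(n_draws):
        u = rng.gauss(0.0, sigma_u)          # step (i): draw u
        total += inv_logit(fixed_part + u)   # step (ii): probability for this u
    return total / n_draws                   # step (iii): mean probability


# Hypothetical values chosen to make the CS/PA gap visible.
cs_probability = inv_logit(2.0)                       # u fixed at 0
pa_probability = population_average_probability(2.0, 4.0)
print(f"probability at u = 0:       {cs_probability:.3f}")
print(f"population-average (sim.):  {pa_probability:.3f}")
```

With a large between-group variance the simulated mean is pulled towards 0.5 relative to the probability at u = 0, which is the nonlinearity effect noted on the previous slide.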

Estimation in MLwiN

• Quasi-likelihood (MQL/PQL – 1st and 2nd order) are approximate procedures, good for model screening

– MQL1 crudest approximation. Estimates may be biased downwards (esp. if clusters small). But stable.

– PQL2 best approximation, but may not converge.

– Tip: Start with MQL1 to get starting values for PQL.

• MCMC better (check results for final model)

Ordered Response Models

• When we have several categories and there is an underlying ordering to the categories, a convenient parameterisation is to work with cumulative probabilities, i.e. the probabilities that an individual falls at or below each threshold between categories. For example, with obesity ranges:

Class | Probability | Threshold | Cumulative Probability
Under | π1i | ≤ Under | γ1i = π1i
Ideal | π2i | ≤ Ideal | γ2i = π1i+π2i
Over | π3i | ≤ Over | γ3i = π1i+π2i+π3i
Obese | π4i | ≤ Obese | γ4i = π1i+π2i+π3i+π4i
M. Obese | π5i | ≤ M. Obese | γ5i = π1i+π2i+π3i+π4i+π5i = 1

With an ordered multinomial model we work with the set of cumulative probabilities, γki. Because γ5i = 1, only the first four cumulative probabilities need to be modelled. This modelling collapses to the binary logistic modelling we have already considered when there are only 2 categories.
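As a small illustration of the cumulative parameterisation, the Python sketch below turns a set of category probabilities into cumulative probabilities. The probabilities are hypothetical values invented for the example, not estimates from the Health Survey for England data.

```python
from itertools import accumulate

classes = ["Under", "Ideal", "Over", "Obese", "M. Obese"]

# Hypothetical category probabilities for one individual (sum to 1).
category_probs = [0.05, 0.35, 0.35, 0.20, 0.05]

# Running sum: gamma_k = pi_1 + ... + pi_k.
cumulative_probs = list(accumulate(category_probs))

for name, pi_k, gamma_k in zip(classes, category_probs, cumulative_probs):
    print(f"{name:<9} pi = {pi_k:.2f}   gamma (<= {name}) = {gamma_k:.2f}")
```

The last cumulative probability is 1 by construction, which is why only the first four need to be modelled.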

A model with no explanatory variables

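• We will again use the logit link so we have:

$\log(\gamma_{1i} / (1 - \gamma_{1i})) = \beta_0$ (log odds of ≤ underweight)

$\log(\gamma_{2i} / (1 - \gamma_{2i})) = \beta_1$ (log odds of ≤ ideal weight)

$\log(\gamma_{3i} / (1 - \gamma_{3i})) = \beta_2$ (log odds of ≤ overweight)

$\log(\gamma_{4i} / (1 - \gamma_{4i})) = \beta_3$ (log odds of ≤ obese)

The threshold probabilities $\gamma_{ki}$ are given by anti‐logit($\beta_{k-1}$), k = 1,…,4. Because $\gamma_{1i} \le \gamma_{2i} \le \gamma_{3i} \le \gamma_{4i}$ it follows that $\beta_0 \le \beta_1 \le \beta_2 \le \beta_3$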

Adding covariates to the model

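$\log(\gamma_{1i} / (1 - \gamma_{1i})) = \beta_0 + h_i$ (log odds of ≤ underweight)

$\log(\gamma_{2i} / (1 - \gamma_{2i})) = \beta_1 + h_i$ (log odds of ≤ ideal weight)

$\log(\gamma_{3i} / (1 - \gamma_{3i})) = \beta_2 + h_i$ (log odds of ≤ overweight)

$\log(\gamma_{4i} / (1 - \gamma_{4i})) = \beta_3 + h_i$ (log odds of ≤ obese)

$h_i = \beta_4 x_{1i} + \beta_5 x_{2i} + \ldots$

Here predictors are added in common across the response threshold categories. This is referred to as proportional odds modelling. It means that each predictor has the same effect (the same odds ratio) at every threshold; equivalently, the log odds ratios and odds ratios between threshold categories are independent of the predictor variables.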
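The proportional odds structure can be illustrated with a short Python sketch. The threshold values and the value of h below are hypothetical, chosen only to show the mechanics, and the function names are illustrative.

```python
import math


def inv_logit(eta):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def category_probabilities(thresholds, h):
    """Category probabilities from cumulative logits beta_k + h.

    thresholds are the increasing intercepts beta_0..beta_3 and h is the
    common linear predictor shared by every threshold.
    """
    cumulative = [inv_logit(b + h) for b in thresholds] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs


# Hypothetical thresholds (must be increasing) and linear predictors.
thresholds = [-3.0, -0.5, 1.0, 3.0]
for h in (0.0, 0.7):
    rounded = [round(p, 3) for p in category_probabilities(thresholds, h)]
    print(f"h = {h}: {rounded}")
```

Whatever the threshold, shifting h from 0 to 0.7 multiplies the cumulative odds by the same factor, exp(0.7); that common multiplier is the proportional odds property.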

Multilevel Modelling etc.

• Ordered multinomial models can be extended to multilevel settings by adding in random effects to the equation for hi as in the binary case.

• The proportional odds assumption can be tested and relaxed.

• These features need a more advanced workshop but see material at the multilevel modelling centre website for details.
