Beyond MCMC in fitting complex Bayesian models: The INLA method
Valeska Andreozzi
Centre of Statistics and Applications of
Lisbon University
(valeska.andreozzi at fc.ul.pt)
European Congress of Epidemiology, Porto, Portugal
September 2012
Summary
• This talk will focus on an alternative method for performing Bayesian inference in complex models, namely the integrated nested Laplace approximation (INLA).
• This alternative technique will be illustrated by an application of a generalized linear/additive mixed model in the analysis of a cohort study.
• An R interface for the INLA program (the r-INLA package) will be demonstrated.
Outline
• Describe statistical modelling
• Give an illustrative example (cohort study)
• Describe Bayesian inference
• Summarize MCMC methods
• Present a brief introduction to INLA
• Illustrate the software available to implement INLA (the r-INLA package)
• Provide references for more details about INLA
• Concluding remarks
Complex models in Health
• By nature, Health is a complex system.
• Applied statisticians increasingly face model complexity in order to represent Health problems appropriately.
Statistical Modelling
Set a hypothesis
⇓
Collect data (y)
⇓
Trend + error
⇓
Statistical model for the data
⇓
Joint probability distribution for the data → f(y|θ)
Statistical Modelling
• Actually, the joint probability distribution f(y|θ) for the data depends on the values of the unknown parameters θ.
• In practice, when f(y|θ) is viewed as a function of θ rather than of y, f(y|θ) is known as the likelihood for the data.
• The likelihood under the proposed model represents how likely it is to observe the data y we have collected, given specific values for the parameters θ.
Illustrative example
Let's introduce an example to illustrate the methods that will be discussed:
• Investigate the rate of infant weight gain.
◦ Observational cohort study: infants attending the Public Health System of the city of Rio de Janeiro.
◦ Project: Quality assessment of care for infants under six months of age at the Public Health System of the city of Rio de Janeiro.
◦ Coordinator: Doctor Maria do Carmo Leal (ENSP/Fiocruz/Brazil)
Infant weight gain in Rio de Janeiro
Data
• Data consist of repeated measures of infant weight collected from the child health handbook.
[Figure: individual weight trajectories, Weight (kg) versus age in Months (0 to 25)]
Covariates available:
• Infant characteristics: age, birth weight, gender, type of breastfeeding, whether attending nursery school, whether has been hospitalized.
• Mother characteristics: age, level of education, marital status, number of pregnancies, number of people in the household, number of children less than 5 years old in the household, Kotelchuck Index.
Infant weight gain in Rio de Janeiro
Model formulation
• The simplest model (likelihood - f(y|θ)) for the data yi is the normal distribution.
• To take into account the longitudinal structure of the data (within-child correlation among repeated measures of weight), random effects have been included.
• The resulting model for the infant weight gain can be expressed as

yi = Xiβ + Ziωi + εi
Infant weight gain in Rio de Janeiro
Model formulation (cont.)
• Considering a hierarchical model formulation

yi ∼ N(µi, Σ−1)
µi = Xiβ + Ziωi
ωi ∼ N(0, D−1)

• Using the previous notation, the data y are assumed to have a normal likelihood (f(y|θ)) and the vector of unknown parameters consists of θ = (β, Σ, D).
Model inference
Maximum likelihood estimates (MLE)
• To find good estimates for θ, one can choose the values θ̂ that maximize the likelihood f(y|θ) ⇒ MLE.
• Maximum likelihood methods are fine, but if the form of the likelihood f(y|θ) is complex and/or the number of individual parameters involved is large, then the approach may prove either very difficult or infeasible to implement.
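The slides contain no code here, but the maximization idea can be sketched in a few lines. This is my own illustration (all names are hypothetical), assuming a normal likelihood with known sigma, where the MLE of µ is found by a brute-force grid search:

```python
import math
import random

def log_likelihood(mu, data, sigma=1.0):
    """Log-likelihood of a normal sample, sigma assumed known."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (y - mu)**2 / (2 * sigma**2) for y in data)

random.seed(42)
data = [random.gauss(3.0, 1.0) for _ in range(200)]   # simulated data

# Grid search: evaluate the likelihood on a grid and keep the maximizer.
grid = [i / 100 for i in range(200, 401)]             # candidates in [2, 4]
mle = max(grid, key=lambda mu: log_likelihood(mu, data))

print(round(mle, 2), round(sum(data) / len(data), 2))
```

For a concave likelihood like this one, the grid maximizer lands on the grid point nearest the analytic MLE, which here is the sample mean; the difficulty flagged in the slide is exactly that such searches stop being feasible once θ has many dimensions.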
Bayes inference
• In this scenario, Bayesian inference is an appealing approach.
• Start by reviewing Bayes' theorem. The posterior probability distribution for the parameters given the observed data is:

P(θ|y) = P(θ)f(y|θ) / f(y) = P(θ)f(y|θ) / ∫θ P(θ)f(y|θ) dθ

• This result shows that:
Bayes inference
Posterior ∝ Prior × Likelihood
• The prior probability distribution for the parameters, P(θ), expresses the uncertainty about θ before taking the data into account.
• Usually it is chosen to be non-informative.
• The posterior probability distribution for the parameters, P(θ|y), expresses the uncertainty about θ after observing the data.
• So in Bayesian inference all parameter information comes from the posterior distribution.
• For example:
θ̂ = E(θ|y) = ∫θ θ P(θ|y) dθ = ∫θ θ f(y|θ)P(θ) dθ / ∫θ f(y|θ)P(θ) dθ
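As a hypothetical numerical aside (not in the slides): for a Bernoulli likelihood with a flat Beta(1, 1) prior, both integrals above can be approximated by Riemann sums over a grid of θ values, and the resulting posterior mean can be checked against the known conjugate answer:

```python
# Posterior mean by numerical integration of prior x likelihood,
# for y_i ~ Bernoulli(theta) with a flat Beta(1, 1) prior on theta.
data = [1, 0, 1, 1, 0, 1, 1, 1]          # 6 successes in 8 trials
n, s = len(data), sum(data)

def likelihood(theta):
    return theta**s * (1 - theta)**(n - s)

# Riemann sums over a fine grid of theta values (prior density = 1).
grid = [(i + 0.5) / 10_000 for i in range(10_000)]
numer = sum(t * likelihood(t) for t in grid)   # approximates the numerator integral
denom = sum(likelihood(t) for t in grid)       # approximates the denominator integral
post_mean = numer / denom

# Conjugate check: the posterior is Beta(1 + s, 1 + n - s), mean (1 + s) / (2 + n).
print(round(post_mean, 4), round((1 + s) / (2 + n), 4))
```

In one dimension this brute-force quadrature works perfectly well; the point of the following slides is that for a high-dimensional θ it does not, which is what motivates MCMC and, later, INLA.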
Computing the posterior distribution
• Until recently, model solutions were very hard to obtain because the integrals of the posterior distribution cannot be evaluated analytically.
• Solution: draw sample values from the posterior distribution, approximating any characteristic of it by the corresponding characteristic of these samples.
• But how?
MCMC
Use Monte Carlo integration methods with Markov chains (MCMC)
• The algorithm constructs a Markov chain with stationary distribution identical to the posterior, and uses values from the Markov chain, after a sufficiently long burn-in, as simulated samples from the posterior.
• An attractive method for implementing an MCMC algorithm is Gibbs sampling.
Gibbs sampling
• Suppose the set of conditional distributions:

[θ1|θ2, . . . , θp, data]
[θ2|θ1, . . . , θp, data]
...
[θp|θ1, . . . , θp−1, data]

• The idea behind Gibbs sampling is that we can set up a Markov chain simulation algorithm for the joint posterior distribution by successively simulating individual parameters from the set of p conditional distributions.
• Under general conditions, draws from the simulation algorithm will converge to the target distribution of interest (the joint posterior distribution of θ).
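The scheme above can be made concrete with a toy example of my own (not from the slides): a Gibbs sampler for a standard bivariate normal with correlation rho, whose two full conditionals are the univariate normals shown in the comments. The empirical moments of the draws should recover the target's mean and covariance:

```python
import random

def gibbs_bivariate_normal(rho, n_iter=20_000, burn_in=2_000, seed=1):
    """Gibbs sampler for a standard bivariate normal with correlation rho.
    Each full conditional is univariate normal:
        theta1 | theta2 ~ N(rho * theta2, 1 - rho^2)   (and symmetrically)."""
    random.seed(seed)
    sd = (1.0 - rho**2) ** 0.5
    t1 = t2 = 0.0
    draws = []
    for i in range(n_iter):
        t1 = random.gauss(rho * t2, sd)   # draw theta1 | theta2
        t2 = random.gauss(rho * t1, sd)   # draw theta2 | theta1
        if i >= burn_in:                  # discard burn-in draws
            draws.append((t1, t2))
    return draws

draws = gibbs_bivariate_normal(rho=0.8)
n = len(draws)
mean1 = sum(a for a, _ in draws) / n
mean2 = sum(b for _, b in draws) / n
cov12 = sum(a * b for a, b in draws) / n - mean1 * mean2
print(round(mean1, 2), round(mean2, 2), round(cov12, 2))
```

Note the two practical choices already visible even in this toy: a burn-in length, and enough post-burn-in draws to beat the autocorrelation of the chain. These are exactly the convergence and mixing issues raised on the next slides.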
MCMC estimation
• MCMC is very flexible for programming complex models.
• However, one has to monitor the performance of an MCMC algorithm to decide, after a long run, whether the simulated sample provides a reasonable approximation to the posterior density.
• Issues to ensure good estimates:
◦ Convergence (burn-in required).
◦ Mixing (required number of samples after convergence).
MCMC estimation
• It can be difficult to construct an MCMC scheme that converges in a reasonable amount of time (it can take hours or even days to deliver correct results).
• There can also be difficulty in specifying prior distributions.
• In practice, the handicap of data analysis using MCMC is the large computational burden.

How to overcome this efficiency problem?
Alternative method
Use an alternative approximation method to MCMC, namely the Integrated Nested Laplace Approximation.
• Designed for a class of hierarchical models called latent Gaussian models.
• Examples where INLA can be applied:
◦ Generalized linear mixed models
◦ Generalized additive mixed models
◦ Spline smoothing models
◦ Disease mapping (including ecological studies)
◦ Spatial and spatio-temporal models
◦ Dynamic generalized linear models
◦ Survival models
Latent Gaussian models
• It is a hierarchical model defined in 3 stages:
Observation model: yi|θ ∼ f(yi|θ)
Latent Gaussian field or Parameter model: θ|γ ∼ N(µ(γ), Q(γ)−1)
Hyperparameter: γ ∼ f(γ)
Latent Gaussian models
• The observation yi belongs to the exponential family, where the mean µi is linked to a structured additive predictor ηi through a link function g(µi) = ηi:

ηi = β0 + ∑(k=1..nβ) βk xki + ∑(j=1..nh) h(j)(zji) + εi

• βk represents the linear effect of covariate xk.
• h(j) are unknown functions of the covariates z. They can represent non-linear effects, time trends, seasonal effects, random effects, or spatial random effects.
• εi are the unstructured random effects.
Latent Gaussian models
Observation model: yi|θ ∼ f(yi|θ)
Latent Gaussian field or Parameter model: θ|γ ∼ N(µ(γ), Q(γ)−1)
Hyperparameter: γ ∼ f(γ)
The models are assumed to satisfy two properties:
1. The latent Gaussian field θ, usually of large dimension, has conditional independence properties, so its precision matrix Q(γ) is sparse (full of zeros, because many parameters are not correlated).
2. The dimension of the hyperparameter vector γ is small (≤ 6).
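The sparsity in property 1 can be made concrete with a small sketch (an illustrative aside of mine, not from the slides): under a first-order random-walk prior over T time points, each point interacts only with its neighbours, so the precision matrix Q is tridiagonal.

```python
def rw1_precision(T, tau=1.0):
    """Precision matrix of a first-order random-walk prior,
    Q = tau * D'D, where D takes first differences of neighbours.
    Only adjacent time points interact, so Q is tridiagonal."""
    Q = [[0.0] * T for _ in range(T)]
    for t in range(T - 1):          # each increment x[t+1] - x[t]
        Q[t][t] += tau
        Q[t + 1][t + 1] += tau
        Q[t][t + 1] -= tau
        Q[t + 1][t] -= tau
    return Q

Q = rw1_precision(100)
nonzero = sum(1 for row in Q for v in row if v != 0.0)
print(nonzero, 100 * 100)   # 298 nonzeros out of 10000 entries
```

It is precisely this sparsity that fast numerical linear algebra (and hence INLA) exploits when the latent field is large.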
INLA
• With Gaussian parameters and a sparse precision matrix, a multivariate Gaussian distribution can be assumed in the ordinary scenario (normal response variable).
• In the Gaussian layout, inference is an easy task: the posterior distribution of the parameters is easy to calculate.
• For more complex models (a non-Gaussian response variable, for example), an approximation method for integrating the posterior density has to be used ⇒ Laplace.
INLA
• The marginal posterior density for θi is given by

P(θi|y) = ∫γ P(θi|γ, y) P(γ|y) dγ

• INLA approximates this by

P̃(θi|y) = ∑k P̃(θi|γk, y) P̃(γk|y) ∆k

• Laplace approximations are applied to carry out the integrations required for the evaluation of P̃.
• So no simulations are needed to find estimates for θ.
• The INLA output is the marginal posterior distribution, which can be summarized by means, variances and quantiles.
• DIC and predictive measures (CPO, PIT) are also available.
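A very rough sketch of the Laplace idea behind P̃ (my own stand-in, not the actual INLA machinery): match a Gaussian to a density at its mode, with variance given by the negative inverse curvature of the log-density there. Here a Gamma(a, b) density plays the role of a posterior, so the answers can be checked analytically (mode (a−1)/b, Laplace variance (a−1)/b²):

```python
import math

def laplace_approx(log_density, x0=1.0, h=1e-5, n_steps=200):
    """Laplace approximation: locate the mode of log_density by a crude
    Newton iteration with numerical derivatives, then match a Gaussian
    via the curvature of the log-density at the mode."""
    x = x0
    for _ in range(n_steps):
        # central-difference first and second derivatives of the log-density
        g = (log_density(x + h) - log_density(x - h)) / (2 * h)
        H = (log_density(x + h) - 2 * log_density(x) + log_density(x - h)) / h**2
        x -= g / H                     # Newton step towards the mode
    return x, -1.0 / H                 # Gaussian mean and variance

# Stand-in "posterior": an unnormalized Gamma(a, b) log-density.
a, b = 20.0, 4.0
mode, var = laplace_approx(lambda x: (a - 1) * math.log(x) - b * x, x0=3.0)
print(round(mode, 3), round(var, 3))
```

INLA nests such Gaussian approximations inside the numerical integration over the (low-dimensional) hyperparameter γ, which is why no simulation is needed.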
Results
weighti = ηi = β0 + ∑(k=1..nβ) βk xki + ∑(j=1..nh) h(j)(zji) + εi

Structured additive predictor ηi                     Processing total time (s)   DIC
β0 + β1 agei + β2 agei² + h(1)(ui) + εi              9.70                        4647.21
β0 + β1 agei + β2 agei² + h(1)(ui) + εi + cov        15.62                       4631.66
β0 + h(1)(ui) + h(2)(agei) + εi                      15.46                       4641.46
β0 + h(1)(ui) + h(2)(agei) + εi + cov                19.52                       4627.81

h(1)(ui) ∼ N(0, τ1−1) and τ1 ∼ Gamma(a, b)
h(2)(agei) is a random walk smoothness prior with precision τ2
Results
Model hyperparameters:

Precision                mean      sd        0.025quant   0.975quant
β0 + β1 agei + β2 agei² + h(1)(ui) + εi
  within child (ε)       5.54      0.16      5.23         5.87
  intercept (u)          3.14      0.16      2.83         3.47
β0 + β1 agei + β2 agei² + h(1)(ui) + εi + cov
  within child (ε)       5.55      0.16      5.24         5.87
  intercept (u)          3.55      0.19      3.19         3.95
β0 + h(1)(ui) + h(2)(agei) + εi
  within child (ε)       5.56      0.16      5.25         5.89
  intercept (u)          3.14      0.16      2.82         3.47
  age                    3710.58   2207.72   949.28       9313.81
β0 + h(1)(ui) + h(2)(agei) + εi + cov
  within child (ε)       5.56      0.17      5.21         5.88
  intercept (u)          3.57      0.20      3.15         3.95
  age                    3756.31   2187.72   909.55       9182.95
Results
• Posterior density estimation of hyperparameters.
[Figure: PostDens (Precision for Gaussian observations); PostDens (Precision for id); PostDens (Precision for age)]
Results
• Posterior estimate of the non-linear effect of age, from the model adjusted for mother and infant covariates.
[Figure: posterior mean and 0.025, 0.5 and 0.975 quantiles of the age effect over 0 to 25 months]
Results
• Posterior density estimation of covariate effects
[Figure: posterior density plots for each covariate effect; posterior means and standard deviations below]

Covariate                    Mean      SD
(Intercept)                  5.86      0.199
genderFem                    -0.297    0.037
education                    -0.011    0.007
parity2+                     0.202     0.053
nbornalive                   -0.038    0.023
seindex                      0.016     0.018
nhouse                       -0.012    0.014
n5house                      -0.005    0.036
kcindexInadequado            0.051     0.174
kcindexIntermediário         0.117     0.17
kcindexAdequado              0.182     0.169
kcindexMais que adequado     0.079     0.185
blength                      0.004     0.001
feedingnomore                -0.038    0.065
feedingsupplemented          -0.157    0.039
nurseryschoolsim             0.07      0.13
hospitalsim                  -0.191    0.063
INLA in R
> library(INLA)
> # Gaussian model with fixed effects x1 and x2, a smooth RW2 effect of
> # age and an iid random intercept per child (id); DIC and CPO requested
> model <- inla(y ~ x1 + x2 +
+                 f(age, model = "rw2") +
+                 f(id, model = "iid"),
+               data = dataframe, family = "gaussian",
+               control.compute = list(dic = TRUE, cpo = TRUE))
> summary(model)
> plot(model)
www.r-inla.org
Concluding remarks
[Figure: number of articles in PubMed per year, 2010 to 2012. Search: Integrated Nested Laplace Approximation(s)]

Explore it!
Concluding remarks
• The main aim of this presentation was twofold:
◦ to highlight the fact that MCMC sampling is not a simple tool to be used in routine analysis, due to convergence and computational time issues;
◦ to show that applied research can benefit, especially computationally, if an alternative method such as INLA is adopted for Bayesian inference.
Concluding remarks
• The lessons are:
◦ At the end of the day, there's no free lunch. INLA does not substitute for MCMC. It is designed for a specific class of models, albeit a very flexible one.
◦ It comes to complement MCMC and to make the statistical analysis of Health problems (Bayesian speaking) more straightforward.
◦ And beyond any doubt:
If you want r-INLA to have a particular feature, observation
model or prior model, you need to ask us!
Simpson D, Lindgren F, Rue H
Selected references
Rue H., Martino S. and Chopin N. (2009). Approximate Bayesian Inference for Latent Gaussian Models Using Integrated Nested Laplace Approximations (with discussion). Journal of the Royal Statistical Society, Series B, 71, 319-392.

Fong Y., Rue H. and Wakefield J. (2010). Bayesian inference for Generalized Linear Mixed Models. Biostatistics, 11, 397-412. (R code and supplementary material)

Simpson D., Lindgren F. and Rue H. (2011). Fast approximate inference with INLA: the past, the present and the future. Technical report at arxiv.org.

Held L., Schrodle B. and Rue H. (2009). Posterior and Cross-validatory Predictive Checks: A Comparison of MCMC and INLA. Chapter in Statistical Modelling and Regression Structures, editors Tutz G. and Kneib T. Physica-Verlag, Heidelberg.

Martino S. and Rue H. (2010). Case Studies in Bayesian Computation using INLA. To appear in Complex Data Modeling and Computationally Intensive Statistical Methods. (R code)

A complete list can be found at www.r-inla.org/papers
This research has been partially supported by National Funds through FCT — Fundação para Ciência e Tecnologia, projects PTDC/MAT/118335/2010 and PEst-OE/MAT/UI0006/2011.
Thank you very much for your attention!