

STATISTICS IN MEDICINE
Statist. Med. 2000; 19:2265–2278

Comparing hierarchical models for spatio-temporally misaligned data using the deviance information criterion

Li Zhu and Bradley P. Carlin∗,†

Division of Biostatistics, School of Public Health, University of Minnesota, Box 303, Mayo Memorial Building, Minneapolis, Minnesota 55455-0392, U.S.A.

SUMMARY

Bayes and empirical Bayes methods have proven effective in smoothing crude maps of disease risk, eliminating the instability of estimates in low-population areas while maintaining overall geographic trends and patterns. Recent work extends these methods to the analysis of areal data which are spatially misaligned, that is, involving variables (typically counts or rates) which are aggregated over differing sets of regional boundaries. The addition of a temporal aspect complicates matters further, since now the misalignment can arise either within a given time point, or across time points (as when the regional boundaries themselves evolve over time). Hierarchical Bayesian methods (implemented via modern Markov chain Monte Carlo computing methods) enable the fitting of such models, but a formal comparison of their fit is hampered by their large size and often improper prior specifications. In this paper, we accomplish this comparison using the deviance information criterion (DIC), a recently proposed generalization of the Akaike information criterion (AIC) designed for complex hierarchical model settings like ours. We investigate the use of the delta method for obtaining an approximate variance estimate for DIC, in order to attach significance to apparent differences between models. We illustrate our approach using a spatially misaligned data set relating a measure of traffic density to paediatric asthma hospitalizations in San Diego County, California. Copyright © 2000 John Wiley & Sons, Ltd.

1. INTRODUCTION

The problem of selecting the best of a collection of candidate models has a long history in the statistical literature. For years, researchers operating within the Bayesian paradigm were advised to use only Bayes factors for this purpose. More specifically, given models $M_1$ and $M_2$ having prior probabilities $P(M_1)$ and $P(M_2)$, the Bayes factor in favour of $M_1$ is the ratio of the posterior odds of $M_1$ to the prior odds of $M_1$,

$$BF = \frac{P(M_1 \mid \mathbf{y}) / P(M_2 \mid \mathbf{y})}{P(M_1) / P(M_2)}$$

∗ Correspondence to: Bradley P. Carlin, Division of Biostatistics, School of Public Health, University of Minnesota, Box 303, Mayo Memorial Building, Minneapolis, Minnesota 55455-0392, U.S.A.

† E-mail: [email protected]

Contract/grant sponsor: NIEHS; contract/grant number: 1-R01-ES07750

Copyright © 2000 John Wiley & Sons, Ltd.


which by Bayes theorem also equals the ratio of the observed marginal densities for the two models, $p(\mathbf{y} \mid M_1)/p(\mathbf{y} \mid M_2)$. Unfortunately, the Bayes factor can be quite difficult both to compute and interpret for high-dimensional hierarchical models, and in any case is not well-defined for models having improper prior distributions, though alternative formulations to correct this problem include the intrinsic Bayes factor [1] and the fractional Bayes factor [2].

These difficulties have led to a host of recently developed alternative Bayesian model choice criteria. These include cross-validatory residual analyses [3, 4], various fairly informal conditional predictive schemes [5, 6], as well as more formal decision-theoretic methods for minimizing posterior predictive loss [7].

Most recently, Spiegelhalter et al. [8] suggested a generalization of the Akaike information criterion (AIC [9]) that is based on the posterior distribution of the deviance statistic

$$D(\boldsymbol{\theta}) = -2 \log p(\mathbf{y} \mid \boldsymbol{\theta}) + 2 \log f(\mathbf{y})$$

where $p(\mathbf{y} \mid \boldsymbol{\theta})$ is the likelihood function for the observed data vector $\mathbf{y}$ given the parameter vector $\boldsymbol{\theta}$, and $f(\mathbf{y})$ is some standardizing function of the data alone (which thus has no impact on model selection). In this approach the fit of a model is summarized by the posterior expectation of the deviance, $\bar{D} = E_{\theta \mid y}[D]$, while the complexity of a model is captured by the effective number of parameters $p_D$ (note that this is often less than the total number of model parameters due to the ‘borrowing of strength’ across individual-level parameters in hierarchical models). Below we show that a reasonable definition of $p_D$ is the expected deviance minus the deviance evaluated at the posterior expectations,

$$p_D = E_{\theta \mid y}[D] - D(E_{\theta \mid y}[\boldsymbol{\theta}]) = \bar{D} - D(\bar{\boldsymbol{\theta}})$$

The deviance information criterion (DIC) is then defined as

$$\mathrm{DIC} = \bar{D} + p_D = 2\bar{D} - D(\bar{\boldsymbol{\theta}}) \tag{1}$$

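with smaller values of DIC indicating a better-fitting model. As with other model choice criteria, we caution that DIC is not intended for identification of the ‘correct’ model, but rather merely as a method of comparing a collection of alternative formulations (all of which may be incorrect).

An asymptotic justification of DIC is straightforward in cases where the number of observations $n$ grows with respect to the number of parameters $p$, and where the prior $p(\boldsymbol{\theta})$ is non-hierarchical and completely specified (that is, having no unknown parameters). Here we may expand $D(\boldsymbol{\theta})$ around $\bar{\boldsymbol{\theta}}$ to give, to second order,

$$D(\boldsymbol{\theta}) \approx D(\bar{\boldsymbol{\theta}}) - 2(\boldsymbol{\theta} - \bar{\boldsymbol{\theta}})^T L' - (\boldsymbol{\theta} - \bar{\boldsymbol{\theta}})^T L'' (\boldsymbol{\theta} - \bar{\boldsymbol{\theta}}) \tag{2}$$

where $L = \log p(\mathbf{y} \mid \boldsymbol{\theta}) = -D(\boldsymbol{\theta})/2$ and $L'$ and $L''$ are the first derivative vector and second derivative matrix with respect to $\boldsymbol{\theta}$. However, from the well-known ‘Bayesian central limit theorem’ we have that $\boldsymbol{\theta} \mid \mathbf{y}$ is approximately distributed as $N(\hat{\boldsymbol{\theta}}, -[L'']^{-1})$, where $N$ denotes the multivariate normal distribution, and $\bar{\boldsymbol{\theta}} = \hat{\boldsymbol{\theta}}$ are the maximum likelihood estimates such that $L' = 0$. This in turn implies that $(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^T (-L'') (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})$ has an approximate chi-squared distribution with $p$ degrees of freedom. Thus writing $D_{non}(\boldsymbol{\theta})$ to represent the deviance for a non-hierarchical model, from (2) we have that

$$D_{non}(\boldsymbol{\theta}) \approx D(\hat{\boldsymbol{\theta}}) - (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^T L'' (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) = D(\hat{\boldsymbol{\theta}}) + \chi^2_p \tag{3}$$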
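As a numerical illustration of approximation (3), the following Python sketch uses a toy model that is an assumption of this example and not a model fitted in this paper: independent groups of unit-variance normal observations with flat priors on the group means. In that setting the quadratic expansion is exact, so the difference between the deviance at a posterior draw and the fitted deviance is exactly chi-squared with $p$ degrees of freedom, and its Monte Carlo average should be close to $p$. The function name `expected_minus_fitted_deviance` and the data are hypothetical.

```python
import random

def expected_minus_fitted_deviance(groups, n_draws=20000, seed=0):
    """Monte Carlo estimate of E[D(theta) | y] - D(theta_hat) in a toy normal model.

    groups: list of lists of observations; group k has its own mean theta_k,
    unit variance, and a flat prior (all assumptions of this sketch).
    """
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    mle = [sum(g) / len(g) for g in groups]  # theta_hat: the group sample means

    def deviance(theta):
        # -2 log-likelihood up to an additive constant that cancels in the difference
        return sum((y - th) ** 2 for g, th in zip(groups, theta) for y in g)

    total = 0.0
    for _ in range(n_draws):
        # Flat prior and unit variance imply theta_k | y ~ N(ybar_k, 1 / n_k)
        theta = [rng.gauss(m, (1.0 / n) ** 0.5) for m, n in zip(mle, sizes)]
        total += deviance(theta)
    return total / n_draws - deviance(mle)
```

With three groups the returned value should be close to 3, the number of free parameters, up to Monte Carlo error.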


Rearranging (3) and taking expectations with respect to the posterior distribution of $\boldsymbol{\theta}$, we have

$$p \approx E_{\theta \mid y}[D_{non}(\boldsymbol{\theta})] - D(\hat{\boldsymbol{\theta}}) \tag{4}$$

so that the number of parameters is approximately the expected deviance $\bar{D} = E_{\theta \mid y}[D_{non}(\boldsymbol{\theta})]$ minus the fitted deviance. But since $\mathrm{AIC} = D(\hat{\boldsymbol{\theta}}) + 2p$, from (4) we obtain $\mathrm{AIC} \approx \bar{D} + p$, the expected deviance plus the number of parameters. The DIC approach for hierarchical models thus follows this equation and equation (4), but substituting the posterior mean $\bar{\boldsymbol{\theta}}$ for the maximum likelihood estimate $\hat{\boldsymbol{\theta}}$. It is a generalization of Akaike’s criterion, since for non-hierarchical models $\bar{\boldsymbol{\theta}} \approx \hat{\boldsymbol{\theta}}$, $p_D \approx p$, and $\mathrm{DIC} \approx \mathrm{AIC}$. DIC can also be shown to have much in common with the hierarchical model selection tools previously suggested by Ye [10] and Hodges and Sargent [11], though the DIC idea applies much more generally.

As with many other model comparison tools, DIC consists of two terms, one representing ‘goodness of fit’ and the other a penalty for increasing model complexity. Besides its generality, another attractive aspect of DIC arises when it is coupled with the Markov chain Monte Carlo (MCMC) methods now typically used to generate samples from posterior distributions. Specifically, DIC may be readily calculated during an MCMC run by monitoring both $\boldsymbol{\theta}$ and $D(\boldsymbol{\theta})$, and at the end of the run simply taking twice the sample mean of the simulated values of $D$, minus the plug-in estimate of the deviance using the sample means of the simulated values of $\boldsymbol{\theta}$, in agreement with (1). This quantity can be calculated for each model being considered without analytic adaptation, complicated loss functions, additional MCMC sampling (say, of predictive values), or any matrix inversion.

Spatial models of the sort typically used in disease mapping offer an ideal opportunity for exploiting the DIC, since they are typically high-dimensional and not easily handled using earlier Bayesian model choice criteria. Developed by Clayton and Kaldor [12] and refined by Besag, York and Mollié [13], these models typically assume the observed disease count in region $i$, $Y_i$, has a Poisson distribution with mean $E_i e^{\mu_i}$, where $E_i$ is an expected disease count (perhaps obtained via reference to an external standard table) and $\mu_i$ is a log-relative risk of disease, modelled linearly as

$$\mu_i = \mathbf{x}_i' \boldsymbol{\beta} + \theta_i + \phi_i, \quad i = 1, \ldots, I \tag{5}$$

Here the $\mathbf{x}_i$ are explanatory spatial covariates, while $\boldsymbol{\beta}$ is a vector of fixed effects. The $\theta_i$ capture heterogeneity among the regions via the mixture specification $\theta_i \overset{iid}{\sim} N(0, 1/\tau_h)$, while the $\phi_i$ capture regional clustering by assuming that $\phi_i \mid \phi_{j \neq i} \sim N(\bar{\phi}_i, 1/(\tau_c m_i))$, where $m_i$ is the number of ‘neighbours’ of region $i$, and $\bar{\phi}_i = m_i^{-1} \sum_{j \in \partial_i} \phi_j$ with $\partial_i$ denoting the neighbour set of region $i$. The usual assumption is that regions are neighbours if and only if they are adjacent on the map, though other (for example, distance-based) modifications are often considered. This distribution for $\boldsymbol{\phi} \equiv \{\phi_i\}$ is called a conditionally autoregressive (CAR) specification, which for brevity we typically write in vector notation as $\boldsymbol{\phi} \sim \mathrm{CAR}(\tau_c)$.

Model (5) with the CAR prior formulation has several quirks that encourage use of DIC instead of other Bayesian model choice criteria. First, the CAR prior is translation invariant, since an arbitrary constant could be added to all of the $\phi_i$ without changing the joint probability specification. This necessitates the addition of an identifiability-preserving constraint (say, $\sum_{i=1}^{I} \phi_i = 0$), which is awkward theoretically but easy to implement ‘on the fly’ during an MCMC algorithm. Even with this correction to the prior, only the sum $\eta_i \equiv \theta_i + \phi_i$ is identified by the datapoint $Y_i$, so the effective dimension of the full model (as can be measured by $p_D$) is often much smaller than the actual parameter count. Finally, in practice these models are often extended to spatio-temporal forms, or applied in settings where the data are spatially misaligned, that is, where the covariate information $x$ is available over a different regional grid than the response variable $Y$. As we shall see in the next section, these modifications lead to substantial size and complexity in the parameter space, and hence significant difficulties in implementing model choice criteria that require posterior predictive samples.

As with any new data analysis technique, however, several practical questions must be resolved before DIC can become a standard element of the applied Bayesian toolkit. For example, DIC is not invariant to parameterization, so (as with prior elicitation) the most plausible parameterization must be carefully chosen beforehand. Unknown scale parameters and other innocuous restructuring of the model can also lead to small changes in the computed DIC value. Still, the most vexing practical problem in the application of DIC lies in determining an appropriate variance estimate for it. Reasonably accurate estimates of the Monte Carlo variance of DIC are crucial if we are to make statements about one model having a ‘significantly’ smaller DIC value than another (and hence, suggestive of a significantly better model).

In Section 2 we lay out the details of our models for spatio-temporally misaligned data, define the DIC in this context, and describe the methods by which we attempt to estimate var(DIC). Section 3 presents our data set, which relates traffic density in San Diego County, California, to the numbers of paediatric asthma hospitalizations in each zip code in the area over the period 1983–1990, and subsequently applies and compares our Section 2 methods in this context. We then go on to select the best of a collection of models for this data set. Finally, Section 4 discusses our findings, and presents some possible avenues for future research.
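The CAR specification introduced with model (5) in this section is defined entirely through its conditional distributions, which makes it easy to sketch in code. The Python fragment below is an illustration only: the neighbour map, the function names `car_conditional` and `recentre`, and the numerical values are hypothetical, and the precision symbol follows the CAR notation used above. It computes the conditional mean and variance of one clustering effect given its neighbours, and applies the sum-to-zero constraint by recentring.

```python
def car_conditional(phi, neighbours, i, tau_c):
    """Mean and variance of phi_i given the other phi_j under a CAR(tau_c) prior.

    neighbours[i] lists the indices of the regions adjacent to region i.
    """
    nbrs = neighbours[i]
    m_i = len(nbrs)                               # number of neighbours of region i
    mean = sum(phi[j] for j in nbrs) / m_i        # average of the neighbouring effects
    variance = 1.0 / (tau_c * m_i)                # conditional variance 1 / (tau_c * m_i)
    return mean, variance

def recentre(phi):
    """Impose the sum-to-zero constraint by subtracting the current mean."""
    shift = sum(phi) / len(phi)
    return [p - shift for p in phi]
```

In an MCMC sampler, a recentring step of this kind would typically be applied after each update of the clustering effects; that placement is a common convention and an assumption here.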

2. ESTIMATION OF DIC FOR SPATIAL MODELS

2.1. Models for spatio-temporally misaligned areal data

Zhu et al. [14] developed hierarchical models to analyse spatio-temporally misaligned data wherein the covariate is available on a grid that is a refinement of the regional grid for which the response variable is available, and where the regional boundaries may also evolve over time. The basic idea is to model the number of disease events in subregion (atom) $j$ of region $i$ during time period $t$, $Y_{ijt}$, as conditionally independent of the other atom-level disease counts given the covariate values. The model is

$$Y_{ijt} \mid \mu_{ijt} \sim \mathrm{Po}(E_{ijt} \exp(\mu_{ijt})), \quad i = 1, \ldots, I_t, \; j = 1, \ldots, J_{it}, \; t = 1, \ldots, T \tag{6}$$

where the expected count for atom $j$ of region $i$ in time period $t$, $E_{ijt}$, is proportional to the corresponding known population count, $n_{ijt}$. (In our data set only the zip-level populations $n_{it}$ are known, so we ‘break out’ the atom-level counts $n_{ijt}$ by interpolating them proportional to atom area.) Specifically, we set $E_{ijt} = R\, n_{ijt}$, where the proportionality constant $R$ is the grand disease rate (that is, the total number of events divided by the total population size, summing over all regions and time periods). The log-relative risk in atom $ijt$ is then modelled as

$$\mu_{ijt} = x_{ijt} \beta_t + \delta_t + \theta_{it} + \phi_{it} \tag{7}$$

where $x_{ijt}$ is a (possibly vector-valued) atom-level explanatory covariate, $\beta_t$ is the corresponding main effect, $\delta_t$ is an overall intercept for time period $t$, and $\theta_{it}$ and $\phi_{it}$ are zip- and time-specific heterogeneity and clustering random effects. In the spatio-temporal case, the distributions on these effects become

$$\boldsymbol{\theta}_t \overset{ind}{\sim} N\left(\mathbf{0}, \frac{1}{\tau_t} I\right) \quad \text{and} \quad \boldsymbol{\phi}_t \overset{ind}{\sim} \mathrm{CAR}(\lambda_t)$$

where $\boldsymbol{\theta}_t = (\theta_{1t}, \ldots, \theta_{I_t t})'$, $\boldsymbol{\phi}_t = (\phi_{1t}, \ldots, \phi_{I_t t})'$, and we encourage similarity among these effects across time periods by assuming $\tau_t \overset{iid}{\sim} \mathrm{Gamma}(a, b)$ and $\lambda_t \overset{iid}{\sim} \mathrm{Gamma}(c, d)$. Note that we add the constraints $\sum_i \phi_{it} = 0$, $t = 1, \ldots, T$, to identify the fixed effects $\delta_t$. Placing flat priors on the main effects $\beta_t$ and $\delta_t$ completes the prior specification.

Since only the zip-level disease counts $Y_{it}$ (and not the atom-level counts $Y_{ijt}$) are observed, we use the additivity of the conditionally independent Poisson distributions in (6) to obtain

$$Y_{it} \mid \beta_t, \delta_t, \theta_{it}, \phi_{it} \sim \mathrm{Po}\left(\sum_{j=1}^{J_{it}} E_{ijt} \exp(\mu_{ijt})\right), \quad i = 1, \ldots, I_t, \; t = 1, \ldots, T \tag{8}$$
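To make the aggregation in (8) concrete, the following Python sketch builds the zip-level Poisson mean from atom-level pieces: the zip population is broken out to atoms in proportion to atom area, each atom receives an expected count equal to the grand rate times its population, and the atom means are summed. It assumes a scalar covariate, and the function names `atom_populations` and `zip_poisson_mean` and all numerical inputs are hypothetical.

```python
import math

def atom_populations(zip_pop, atom_areas):
    """Allocate a zip-level population to its atoms in proportion to atom area."""
    total_area = sum(atom_areas)
    return [zip_pop * area / total_area for area in atom_areas]

def zip_poisson_mean(zip_pop, atom_areas, x_atoms, grand_rate,
                     beta_t, delta_t, theta_it, phi_it):
    """Poisson mean of the zip-level count: sum over atoms of E_ijt * exp(mu_ijt)."""
    mean = 0.0
    for n_ijt, x_ijt in zip(atom_populations(zip_pop, atom_areas), x_atoms):
        e_ijt = grand_rate * n_ijt                              # expected count for the atom
        mu_ijt = x_ijt * beta_t + delta_t + theta_it + phi_it   # log-relative risk (scalar covariate)
        mean += e_ijt * math.exp(mu_ijt)
    return mean
```

When every log-relative risk is zero, the result reduces to the grand rate times the zip population, which is a convenient sanity check.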

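Our full Bayesian model specification thus takes the form

$$\left[\prod_{t=1}^{T} \prod_{i=1}^{I_t} p(y_{it} \mid \beta_t, \delta_t, \theta_{it}, \phi_{it})\right] \left[\prod_{t=1}^{T} p(\boldsymbol{\theta}_t \mid \tau_t)\, p(\boldsymbol{\phi}_t \mid \lambda_t)\, p(\tau_t)\, p(\lambda_t)\right]$$

We remark that only the $\tau_t$ and $\lambda_t$ parameters are easily updated using Gibbs sampling (since their full conditional distributions emerge as gamma distributions); Metropolis steps are typically used for the remaining parameters.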

2.2. Estimating DIC and var(DIC)

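Under model (8), the deviance (in the absence of any standardizing functions $f(\mathbf{y})$) is

$$\begin{aligned} D(\boldsymbol{\mu}) &= -2 \log p(\mathbf{y} \mid \boldsymbol{\mu}) \\ &= 2 \sum_{t=1}^{T} \sum_{i=1}^{I_t} \sum_{j=1}^{J_{it}} E_{ijt} e^{\mu_{ijt}} - 2 \sum_{t=1}^{T} \sum_{i=1}^{I_t} y_{it} \log\left(\sum_{j=1}^{J_{it}} E_{ijt} e^{\mu_{ijt}}\right) + 2 \sum_{t=1}^{T} \sum_{i=1}^{I_t} \log(y_{it}!) \end{aligned} \tag{9}$$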
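Expression (9) translates directly into code. The sketch below is a plain Python rendering under an assumed nested-list layout (years, then zips, then atoms); the function name `deviance` and the data layout are choices made for this illustration, not the layout of the FORTRAN program used in the paper.

```python
import math

def deviance(y, E, mu):
    """Poisson deviance for zip-level counts with atom-level means.

    y[t][i]     : observed count in zip i, year t
    E[t][i][j]  : expected count for atom j of zip i, year t
    mu[t][i][j] : log-relative risk for atom j of zip i, year t
    """
    total = 0.0
    for y_t, E_t, mu_t in zip(y, E, mu):
        for y_it, E_it, mu_it in zip(y_t, E_t, mu_t):
            # Zip-level Poisson mean: sum of atom-level means
            lam = sum(e * math.exp(m) for e, m in zip(E_it, mu_it))
            # math.lgamma(y + 1) equals log(y!)
            total += 2.0 * lam - 2.0 * y_it * math.log(lam) + 2.0 * math.lgamma(y_it + 1)
    return total
```

For a single zip-year the value equals minus twice the log of the Poisson probability of the observed count, which gives a simple way to check an implementation.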

As mentioned above, our Monte Carlo approach to estimating DIC first draws $\{\mu_{ijt}^{(g)}\}_{g=1}^{G}$ values from the posterior, and then computes corresponding $\{D^{(g)}\}_{g=1}^{G}$ values from (9). Finally, writing $\boldsymbol{\mu} = \{\mu_{ijt}\}$ we estimate DIC as $2\bar{D} - D(\bar{\boldsymbol{\mu}})$, where $\bar{D} = \frac{1}{G} \sum_{g=1}^{G} D^{(g)}$ and $\bar{\boldsymbol{\mu}} = \frac{1}{G} \sum_{g=1}^{G} \boldsymbol{\mu}^{(g)}$. A ‘brute force’ approach to estimating var(DIC) would thus simply replicate the calculation of DIC a large number of times $N$, obtaining a sequence of DIC estimates $\{\mathrm{DIC}_l, \; l = 1, \ldots, N\}$. Since these estimates are independent, we could then estimate var(DIC) by the sample variance

$$\widehat{\mathrm{var}}(\mathrm{DIC}) = \frac{1}{N - 1} \sum_{l=1}^{N} (\mathrm{DIC}_l - \overline{\mathrm{DIC}})^2 \tag{10}$$

This algorithm is appealingly straightforward, but also painfully time-consuming. The need to re-estimate this variance over a possibly large collection of models in order to make comparisons motivates our search for computationally quicker alternatives.
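The two calculations just described can be sketched in a few lines of Python. The helper names `dic_from_draws` and `brute_force_var`, and the convention that each posterior draw is a flat list of parameter values, are assumptions of this illustration; any deviance function taking such a list could be supplied.

```python
def dic_from_draws(deviance, draws):
    """Return (D_bar, p_D, DIC) from a list of posterior draws.

    deviance: callable mapping one parameter vector (a list) to its deviance
    draws:    list of parameter vectors, one per retained MCMC iteration
    """
    g = len(draws)
    d_bar = sum(deviance(mu) for mu in draws) / g          # posterior mean deviance
    mu_bar = [sum(col) / g for col in zip(*draws)]         # componentwise posterior means
    p_d = d_bar - deviance(mu_bar)                         # effective number of parameters
    return d_bar, p_d, d_bar + p_d                         # DIC = 2 * D_bar - D(mu_bar)

def brute_force_var(dic_replicates):
    """Sample variance of DIC estimates obtained from independent replicate runs."""
    n = len(dic_replicates)
    mean = sum(dic_replicates) / n
    return sum((d - mean) ** 2 for d in dic_replicates) / (n - 1)
```

In the brute-force scheme each element of `dic_replicates` would come from a separate, independently seeded MCMC run, which is what makes the approach so expensive.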


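Suppose that in our Monte Carlo sampling we use two independent samples of size $G_1$ and $G_2$, respectively, in computing $\bar{D}$ and $D(\bar{\boldsymbol{\mu}})$. Then these two estimates would be uncorrelated, and hence

$$\mathrm{var}(\mathrm{DIC}) = \mathrm{var}[2\bar{D} - D(\bar{\boldsymbol{\mu}})] = 4\,\mathrm{var}(\bar{D}) + \mathrm{var}[D(\bar{\boldsymbol{\mu}})] \tag{11}$$

The first variance term in the rightmost part of (11) can be estimated directly from the $\{D^{(g_1)}\}_{g_1=1}^{G_1}$ output; for the second we use the multivariate delta method (see, for example, Reference [15], pp. 71–72) with the $\{\boldsymbol{\mu}^{(g_2)}\}_{g_2=1}^{G_2}$ output. That is, we express the variance of $D(\bar{\boldsymbol{\mu}})$ as a function of $\mathrm{var}(\bar{\mu}_{ijt})$ and $\mathrm{cov}(\bar{\mu}_{ijt}, \bar{\mu}_{(ijt)'})$ for $(ijt) \neq (ijt)'$ as follows:

$$\widehat{\mathrm{var}}(D(\bar{\boldsymbol{\mu}})) \approx \sum_{ijt} \left(\frac{\partial D(\bar{\boldsymbol{\mu}})}{\partial \bar{\mu}_{ijt}}\right)^2 \mathrm{var}(\bar{\mu}_{ijt}) + \sum_{(ijt) \neq (ijt)'} \frac{\partial D(\bar{\boldsymbol{\mu}})}{\partial \bar{\mu}_{ijt}} \frac{\partial D(\bar{\boldsymbol{\mu}})}{\partial \bar{\mu}_{(ijt)'}} \mathrm{cov}(\bar{\mu}_{ijt}, \bar{\mu}_{(ijt)'}) \tag{12}$$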
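The right-hand side of (12) is the quadratic form of the gradient of the deviance with the covariance matrix of the posterior means. The sketch below works that out in Python for the deviance (9). Differentiating (9) with respect to one atom-level log-relative risk gives $2 E_{ijt} e^{\mu_{ijt}} (1 - y_{it} / \sum_{j} E_{ijt} e^{\mu_{ijt}})$; this derivative is worked out here for the illustration and is not quoted from the paper. Atoms are stored in one flat list, and `zip_of`, `deviance_gradient` and `delta_method_var` are hypothetical names.

```python
import math

def deviance_gradient(y, E, mu, zip_of):
    """Gradient of the Poisson deviance with respect to each atom-level mu.

    y[z]      : observed count for zip-year z
    E[a], mu[a]: expected count and log-relative risk for atom a
    zip_of[a] : index of the zip-year containing atom a
    """
    lam = [0.0] * len(y)
    for a, z in enumerate(zip_of):
        lam[z] += E[a] * math.exp(mu[a])          # zip-level Poisson mean
    return [2.0 * E[a] * math.exp(mu[a]) * (1.0 - y[z] / lam[z])
            for a, z in enumerate(zip_of)]

def delta_method_var(grad, cov):
    """Delta-method variance: squared-gradient terms plus all cross terms."""
    k = len(grad)
    return sum(grad[a] * cov[a][b] * grad[b] for a in range(k) for b in range(k))
```

Here `cov` stands for the estimated covariance matrix of the posterior means, however that estimate is obtained; with roughly 800 atoms the double loop is still cheap compared with the MCMC run itself.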

Since the $\bar{\mu}_{ijt}$ are posterior means of the log-relative risk, their variance and covariance can now be estimated directly from the $\{\boldsymbol{\mu}^{(g_2)}\}_{g_2=1}^{G_2}$ output.

We now consider three different approaches to estimating the individual variances and covariances required by the delta method. First, we assume that the $D^{(g_1)}$ samples are approximately uncorrelated for $g_1 = 1, \ldots, G_1$, so that correlation between two distinct samples $D^{(k)}$ and $D^{(l)}$ can be ignored. Similarly, we assume independence among the $\mu_{ijt}^{(g_2)}$ samples for $g_2 = 1, \ldots, G_2$. Then by the central limit theorem we have that $\mathrm{var}(\bar{D})$ is estimated by the sample variance of the $D^{(g_1)}$ samples divided by $G_1$, and the variance and covariance of $\bar{\mu}_{ijt}$ can be estimated by the corresponding sample variance and covariance divided by $G_2$.

In most cases, however, the autocorrelations in the realized Monte Carlo draws are not negligible, since these draws come from a Markov chain. As such, our second approach modifies our first by replacing the actual Monte Carlo sample sizes $G_1$ and $G_2$ with ‘effective sample sizes’ (Neal [16], p. 105; Kass et al. [17], p. 99). The effective sample size is the actual sample size ($G_1$ or $G_2$) divided by the autocorrelation time, $\kappa$, which is in turn defined as $1 + 2 \sum_{l=1}^{\infty} \rho(l)$ where $\rho(l)$ is the sample autocorrelation at lag $l$ for the parameter of interest (in our case, $D$ or one of the $\mu_{ijt}$). It is necessary to cut off the sum at a definite value $L$ beyond which the autocorrelation estimate is close to zero, since adding autocorrelations for higher lags will only add in excess noise. Thus in the case of $\mathrm{var}(\bar{D})$ we would have

$$\widehat{\mathrm{var}}(\bar{D}) = s_D^2 / (G_1 / \kappa_1) \tag{13}$$

where $s_D^2$ is the usual sample variance of the $D^{(g_1)}$ samples, and $\kappa_1 = 1 + 2 \sum_{l=1}^{L_1} \rho(l)$ where $L_1$ is the chosen autocorrelation sum cut-off.

Finally, a third (and computationally simpler) method of estimating $\mathrm{var}(\bar{D})$ and $\mathrm{var}(\bar{\mu}_{ijt})$ while still accounting for autocorrelation is through batching (see, for example, Carlin and Louis [18], pp. 194–195). Illustrating in the case of $\mathrm{var}(\bar{D})$, divide the sequence $D^{(g)}$, $g = 1, \ldots, G_1$ into $m$ successive batches of length $k$ with batch means $B_1, \ldots, B_m$, and $\bar{B} = \frac{1}{m} \sum_{i=1}^{m} B_i$. Then we have the variance estimate

$$\widehat{\mathrm{var}}(\bar{D}) = \widehat{\mathrm{var}}(\bar{B}) = \frac{1}{m(m - 1)} \sum_{i=1}^{m} (B_i - \bar{B})^2 \tag{14}$$

which is valid if $k$ is large enough so that the correlation between batches is negligible, and if $m$ is large enough to reliably estimate $\mathrm{var}(B_i)$. Choice of appropriate $k$ and $m$ is problematic, analogous to the problem of selecting an appropriate upper bound for the $\rho(l)$ sum in our second approach above. Plots of sample autocorrelations may be useful in both circumstances.

Figure 1. Adjusted traffic density (average vehicles per kilometre of major roadway per year) in thousands by zip code subregion for 1983.
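The second and third approaches of this section differ only in how they inflate the naive variance of a sample mean, and both fit in a short Python sketch. The function names below are hypothetical, the autocorrelation estimator is the usual one normalized by the lag-zero sum of squares (an assumption, since the paper does not spell out its estimator), and the cut-off and batch length are left to the caller.

```python
def sample_var(x):
    """Usual sample variance with divisor n - 1."""
    n = len(x)
    mean = sum(x) / n
    return sum((v - mean) ** 2 for v in x) / (n - 1)

def autocorr(x, lag):
    """Sample autocorrelation at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    return sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / denom

def var_mean_ess(x, cutoff):
    """Effective-sample-size estimate: s^2 / (G / kappa), with the sum cut at `cutoff`."""
    kappa = 1.0 + 2.0 * sum(autocorr(x, l) for l in range(1, cutoff + 1))
    return sample_var(x) * kappa / len(x)

def var_mean_batch(x, k):
    """Batch-means estimate from successive batches of length k.

    Draws left over after the last complete batch are ignored.
    """
    m = len(x) // k
    means = [sum(x[i * k:(i + 1) * k]) / k for i in range(m)]
    grand = sum(means) / m
    return sum((b - grand) ** 2 for b in means) / (m * (m - 1))
```

With a cut-off of zero the effective-sample-size estimate reduces to the naive estimate of the first approach, which makes the three methods easy to compare on the same chain.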

3. APPLICATION TO SPATIO-TEMPORALLY MISALIGNED ASTHMA RATES

3.1. Description of data set

We evaluate our method in the context of a spatially misaligned data set arising from San Diego County, California, a region pictured in Figure 1. The city of San Diego is located near the southwestern corner of the map; the map’s western boundary is the Pacific Ocean, while Mexico forms its southern boundary. Our analytic goal is to investigate the link between proximity to highway traffic and hospitalization for asthma in children living in the region. A full description of the data set appears in Zhu et al. [14]; here we briefly review its components. First, over the eight-year (1983–1990) study period, we have computerized boundaries for every zip code in the county. A complicating factor is that these boundaries actually changed four times during the period (in 1984, 1987, 1988 and 1990), due to changes made to the zip code grid by the U.S. Postal Service. Second, for each year $t$ we have the number of discharges $Y_{it}$ from hospitalizations due to asthma for children aged 14 and younger living in zip code $i$, $i = 1, \ldots, I_t$, $t = 1, \ldots, T = 8$ (California Office of Statewide Health Planning and Development [19]). Third, we have estimates of the number of residents aged 14 and younger $n_{it}$ in zip $i$ and year $t$, $i = 1, \ldots, I_t$, $t = 1, \ldots, T$ (Scalf and English [20]). While there are possible sources of uncertainty in these $n_{it}$ estimates, we assume these population counts to be fixed and known.

Finally, for each of the major roads in San Diego County, we have mean yearly traffic counts on each road segment. While we could easily obtain zip-level summaries (for example, total traffic count per kilometre of major road), we wished to use our road segment information to create more refined exposure estimates. Previous epidemiological work [21, 22] suggests that 99 per cent of airborne traffic pollutants have decayed at distances greater than 500 m. As such, we used the geographic information system (GIS) ARC/INFO to create 500 m buffers around each major road, thus subdividing the zips into ‘exposed’ and ‘unexposed’ subregions. Figure 1 shows the result for the year 1983. Our definition leads to some urban zips becoming ‘entirely exposed’, since they contain no point further than 500 m from a major road. On the other hand, many zips in the thinly-populated eastern part of the county contain at most one major road, suggestive of little or no traffic exposure. As a result, we redefined those zips having traffic densities less than 2000 cars per year per km of major roads as being ‘entirely unexposed’ (adjusted traffic density = 0). This left slightly less than half of the zips (47 of the 97 for the displayed year, 1983) in the middle range, having some exposed and some unexposed subregions. These adjusted traffic densities are shown at the subregional level for 1983 in Figure 1, and are used as the covariate $x_{ijt}$ in model (7). Notice that the 500 m buffers are apparent only in regions which are neither ‘entirely exposed’ (urban) nor ‘entirely unexposed’ (rural/desert). Zhu et al. [14] discuss alternative definitions of the exposure covariate for this data set.

3.2. Comparison of methods for the full model

We completed the specification of our full model (7) by setting $a = 1$, $b = 10$ (that is, the $\tau_t$ have prior mean and standard deviation both equal to 10) and $c = 0.1$, $d = 10$ (that is, the $\lambda_t$ have prior mean 1, standard deviation $\sqrt{10}$). Recall that since $\tau_t$ and $\lambda_t$ are precisions (not variances), these priors imply that for any given year $t$, we expect the heterogeneity random effects $\theta_{it}$ to have standard deviation around $(10)^{-1/2} = 0.316$, while the clustering random effects $\phi_{it}$ should have conditional standard deviation (given the neighbouring $\phi_{jt}$) around $(1)^{-1/2} = 1.0$. These choices are broadly consistent with advice given by Bernardinelli et al. [23] for establishing a ‘fair’ prior trade-off between these two components of variance. Moreover, the resulting priors are fairly vague, and thus allow the data to dominate the allocation of excess spatial variability to heterogeneity and clustering.

We ran two sets of three parallel MCMC chains for 1000 iterations following a 2000 iteration burn-in period for each, resulting in $G_1 = G_2 = 3000$ post-convergence samples for posterior summarization. Table I presents the resulting variance estimates for $\bar{D}$, $D(\bar{\boldsymbol{\mu}})$, $p_D$ (effective model size), and DIC obtained using the methods described in Section 2.2. Besides the ‘brute force’ method (10), we compare the three delta methods: method 1 (assuming autocorrelation in the sampling chains to be negligible); method 2 (handling autocorrelation by replacing actual with effective sample sizes), and method 3 (handling autocorrelation via batching). Specifically, method 2 estimates the components of variance in the manner of equation (13), where $\kappa_z = 1 + 2 \sum_{l=1}^{L_z} \rho(l)$ for $z = 1, 2$; based on inspection of sample autocorrelations for $D$ and several $\mu_{ijt}$ we select the cut-offs $L_1 = 10$ and $L_2 = 40$. Similarly, method 3 requires selection of batch sizes $k_z$ such that $k_z m_z = G_z$, $z = 1, 2$, to use with equation (14). Here we compare results from six possible choices: case (a) takes $k_1 = k_2 = 10$; case (b) takes $k_1 = 10$, $k_2 = 20$; case (c) takes $k_1 = 20$, $k_2 = 10$; case (d) takes $k_1 = k_2 = 20$; case (e) takes $k_1 = k_2 = 50$; and case (f) takes $k_1 = k_2 = 100$.
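The prior summaries quoted at the start of this subsection follow from simple gamma arithmetic, reproduced below as a short Python check. It assumes the shape and scale parameterization of the gamma distribution (mean equal to shape times scale), which is inferred from the stated prior means rather than given explicitly in the text; the variable and function names are hypothetical.

```python
import math

def gamma_mean_sd(shape, scale):
    """Mean and standard deviation of a gamma distribution (shape-scale form)."""
    return shape * scale, math.sqrt(shape) * scale

# Heterogeneity precision prior with a = 1, b = 10
het_mean, het_sd = gamma_mean_sd(1.0, 10.0)
# Clustering precision prior with c = 0.1, d = 10
clu_mean, clu_sd = gamma_mean_sd(0.1, 10.0)

# Implied random-effect standard deviations at the prior mean precision
sd_heterogeneity = het_mean ** -0.5
sd_clustering = clu_mean ** -0.5
```

Under this parameterization the heterogeneity prior has mean and standard deviation 10, the clustering prior has mean 1 and standard deviation equal to the square root of 10, and the implied effect standard deviations are about 0.316 and 1.0, matching the figures in the text.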


Table I. Comparison of variance estimation methods for the full model.

Method              Runtime (hours)   Var($\bar{D}$)   Var($D(\bar{\boldsymbol{\mu}})$)   Var($p_D$)   Var(DIC)
Brute force              80               6.9              25.0                               31.0         49.6
Delta method: 1          12               0.4               0.3                                0.7          1.9
              2          12               6.6               4.0                               10.5         30.3
              3(a)        2               1.5               2.5                                4.0          8.5
              3(b)        2               1.5               4.2                                5.7         10.3
              3(c)        2               2.0               2.5                                4.5         10.5
              3(d)        2               2.0               4.2                                6.2         12.2
              3(e)        1.5             2.8               8.0                               10.8         19.2
              3(f)        1               3.0              10.9                               14.0         23.2
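As a quick consistency check on the delta-method rows of Table I, the Python sketch below recombines the two component variances as in (11), and also adds them to obtain the variance of the effective model size. The second relation is not stated in the text; it is an assumption that follows from the same independence argument applied to the difference of the two components. The brute-force row is left out because its entries are presumably sample variances of replicates rather than sums of components. The dictionary and function names are hypothetical.

```python
# (Var(D_bar), Var(D(mu_bar)), Var(pD), Var(DIC)) as reported in Table I
delta_rows = {
    "1":    (0.4, 0.3, 0.7, 1.9),
    "2":    (6.6, 4.0, 10.5, 30.3),
    "3(a)": (1.5, 2.5, 4.0, 8.5),
    "3(b)": (1.5, 4.2, 5.7, 10.3),
    "3(c)": (2.0, 2.5, 4.5, 10.5),
    "3(d)": (2.0, 4.2, 6.2, 12.2),
    "3(e)": (2.8, 8.0, 10.8, 19.2),
    "3(f)": (3.0, 10.9, 14.0, 23.2),
}

def recombination_gaps(rows):
    """Absolute gaps between the reported totals and the recombined components."""
    gaps = {}
    for name, (v_dbar, v_plug, v_pd, v_dic) in rows.items():
        gap_pd = abs(v_dbar + v_plug - v_pd)            # assumed: Var(pD) = sum of components
        gap_dic = abs(4.0 * v_dbar + v_plug - v_dic)    # 4 Var(D_bar) + Var(D(mu_bar))
        gaps[name] = (gap_pd, gap_dic)
    return gaps
```

All gaps are a few tenths at most, which is compatible with the components having been rounded to one decimal place before tabulation.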

In implementing the ‘brute force’ method, the calculation was repeated $N = 1000$ times, producing an empirical distribution of 1000 DIC estimates which we take as the ‘exact’ distribution. While our FORTRAN program took about 80 hours to complete this work, it successfully produced Monte Carlo variance estimates for both $p_D$ and DIC.

The results from the three delta methods are all based on the same MCMC samples, to heighten comparability. Looking first at method 1, we see variance estimates that are all far too small, suggesting strongly that the assumption of uncorrelated posterior estimates is not even remotely appropriate in this case. In particular, this method’s estimate of var(DIC), 1.9, is less than 4 per cent of the exact estimate. Method 2 does a relatively good job of estimating $\mathrm{var}(\bar{D})$, but significantly underestimates $\mathrm{var}(D(\bar{\boldsymbol{\mu}}))$, leading to corresponding underestimation of $\mathrm{var}(p_D)$ and var(DIC). Here the culprit could be our use of a common upper bound $L_2$ for all $ijt$, instead of allowing the correlation time $\kappa_2$ in this second term to vary across atoms and years. However, since there are roughly 800 atoms in our data set, appropriate calculation of such $\kappa_2^{(ijt)}$ is not at all clear. Moreover, the culprit might instead be the method itself, rather than its computation; perhaps the multivariate delta approximation (12) simply is not accurate enough in our high-dimensional model setting.

Finally, the method 3 results are disappointing, featuring significant underestimation of the variances for both components of DIC. Submethods (a)–(f) do feature relatively short runtimes, due to the replacement of the full MC sample by the batch means, the number of which naturally decreases as output is grouped into ever-larger batches. The variance estimates generally increase with batch size, but of course quality begins to deteriorate as the number of batches available from our fixed generated sample becomes too small to effectively compute the necessary sample variance of the batch means. Increasing the total run length in submethod (f) by a factor of 10 (that is, $m_z = 300$ batches of length $k_z = 100$) and adjusting for the larger overall sample size improves the variance estimate of $\bar{D}$ to 4.7, but of course at a corresponding cost in runtime. As with method 2, the deficiencies in the method may be due to our assumption of a common batch size across all $ijt$ in our $\mathrm{Var}(D(\bar{\boldsymbol{\mu}}))$ estimation, or to the inadequacy of the approximation (12) itself.

3.3. Comparison of models

In Table II we present the estimated $p_D$ and DIC values and their corresponding variance estimates for four competing models using the ‘exact’ variance estimation method.

Table II. Comparison of $p_D$ and DIC estimates and associated ‘exact’ variance estimates for various models.

Model   $p_{total}$   $p_D$    Var($p_D$)   DIC      Var(DIC)
I           16         16.0      0.09       4245.9     0.3
II         818        363.1     18.3        3268.5    39.8
III        818        440.6     21.5        3287.0    39.2
IV        1620        438.6     29.7        3271.7    49.6

The log-relative risk models we compare are:

Model I: $\mu_{ijt} = x_{ijt} \beta_t + \delta_t$
Model II: $\mu_{ijt} = x_{ijt} \beta_t + \delta_t + \theta_{it}$
Model III: $\mu_{ijt} = x_{ijt} \beta_t + \delta_t + \phi_{it}$
Model IV: $\mu_{ijt} = x_{ijt} \beta_t + \delta_t + \theta_{it} + \phi_{it}$

Thus model IV is our full model (7) and the others are various simplifications of it, with model III omitting the heterogeneity random effects, model II omitting the clustering random effects, and model I omitting both (retaining only the covariate and intercept main effects). Also included in the table are values for $p_{total}$, the total number of parameters present in each model (obtained simply by counting).

Beginning with the fixed effect model I, we note that $p_D = 16.0$, equal (to one decimal place) to the true number of parameters $p_{total} = 16$. Thus there is little evidence of shrinkage across years for the main effects $\gamma_t$ and $\delta_t$, as expected under their flat prior specifications. For models II and III, the true number of parameters is 818, while the $p_D$ estimates are roughly 363 and 441, respectively. This suggests that the exchangeable random effect model is a more parsimonious explanation of the excess between-area variability in log-relative risk (beyond that explained by our exposure covariate), since this model has a smaller effective model size. Finally, model IV is estimated to have no more effective parameters than model III, even though its actual size is nearly twice as large ($p_{total} = 1620$).

Turning to the comparison of DIC and its variance estimate, the rather large advantage model II enjoys in parsimony seems to translate into a relatively small advantage in our omnibus model choice criterion (1). As judged by their variance estimates, the DIC values for models II, III and IV are fairly similar, with model I falling well behind. (A referee has pointed out that, assuming independence and approximate normality of the DIC replicates across models, a formal test of $H_0: \mathrm{DIC}_{II} = \mathrm{DIC}_{III}$ produces a test statistic of $T = 1.97$, hence a significant advantage for model II at the usual 0.05 level.) Differences among the DIC scores can be seen even more clearly from the histograms of the simulated DIC posterior estimates for each of the four models shown in Figure 2. The overlap in support among the model II, III and IV histograms is substantial.

In the original analysis of these data, Zhu et al. [14] do not seriously consider the issue of model choice, instead simply adopting the full model (model IV) and subsequently obtaining a map of the fitted rates $R e^{\hat{\mu}_{ijt}}$, where $R$ is again the grand asthma rate across all zips and years and $\hat{\mu}_{ijt}$ is obtained by plugging in the estimated posterior means for the various components in equation (7). In light of Table II, we might instead opt for model II, concluding that the excess heterogeneity in the data can be adequately explained by the non-spatially referenced random effects $\theta_{it}$. Figure 3 maps these model II fitted rates alongside a set of crude rates obtained in ARC/INFO by allocating the zip-level asthma hospitalizations to the atoms proportional to traffic exposure, and then computing the crude rates as $r_{ijt} = Y_{ijt}/n_{ijt}$. Even though spatial smoothing is not


Figure 2. Comparison of simulated DIC values across models.

explicitly fit by model II, the fitted rates nonetheless exhibit a fairly high degree of smoothness, while correcting for oddities in the crude rates (for example, the surprisingly high rates in a few suburban and eastern desert districts).
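The referee's test mentioned above can be sketched numerically from the Table II entries. The following is a minimal illustration, not the referee's actual calculation: the function names are hypothetical, and the statistic is assumed to be the DIC difference divided by the square root of the summed variance estimates, which is what independence and approximate normality would suggest.

```python
import math


def dic_z_statistic(dic_a, var_a, dic_b, var_b):
    """z-type statistic for H0: DIC_a = DIC_b, treating the two DIC
    estimates as independent and approximately normal (an assumption)."""
    return (dic_b - dic_a) / math.sqrt(var_a + var_b)


def two_sided_p(z):
    """Two-sided standard normal tail probability for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# (DIC, Var(DIC)) pairs copied from Table II
table_ii = {"II": (3268.5, 39.8), "III": (3287.0, 39.2), "IV": (3271.7, 49.6)}

# Compare model II against each of its two close competitors
for other in ("III", "IV"):
    z = dic_z_statistic(*table_ii["II"], *table_ii[other])
    print(f"II vs {other}: T = {z:.2f}, two-sided p = {two_sided_p(z):.3f}")
```

With the rounded Table II entries this sketch gives a model II versus model III statistic of about 2.08 rather than the 1.97 quoted above. The text does not say which inputs the referee used, so the gap may simply reflect rounding or slightly different variance estimates; either value exceeds the two-sided 5 per cent critical value of 1.96. The same calculation for model II versus model IV is far from significant, in line with the substantial overlap of the histograms.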

4. DISCUSSION AND FUTURE DIRECTIONS

In this paper we have investigated the use of the deviance information criterion [8] for choosing amongst a collection of high-dimensional models for spatio-temporally misaligned data, and have sought to obtain a suitable estimate of its variance when it is calculated (as is customary) from MCMC output. While the DIC statistic itself seems to perform reasonably well, our various delta method approaches to the variance estimation problem met with mixed results, with none emerging as sufficiently accurate to merit routine use. At least in complex problem settings like ours, it appears that replicating the DIC calculation as in (10) is currently the only suitable method for capturing its variability (or its entire distribution, as in Figure 2).

Of the delta method approaches, method 2 (using effective sample sizes) performed the best, providing a reasonable estimate for the variance of $\bar{D}$, but not for that of $D(\bar{\boldsymbol{\theta}})$ (and hence not for that of $p_D$ or DIC either). Future work on this approach might attempt to see if the quality of the approximation could be improved using a different autocorrelation sum cut-off $L_2$ for each atom $ijt$. Alternatively, perhaps a higher-order approximation (as opposed to the essentially first-order approximation provided by our multivariate delta method) would be necessary; $D$ is after all a highly non-linear function of $\boldsymbol{\theta}$.

Turning to more general implementation issues, while DIC can be sensitive to choice of parameterization, the fact that $p_D$ equalled $p_{total}$ (to one decimal place) for our model I, in agreement


Figure 3. Paediatric asthma hospitalization rate (per thousand children) by zip code for 1983: (a) naively imputed crude rate; (b) Model II fitted rate.


with equation (4), is encouraging; we obtained similar agreement under the smaller fixed effects-only model $\mu_{ijt} = \delta_t$ ($p_D = p_{total} = 8.0$). Thus the normal-theory asymptotics justifying $p_D$ as an effective model size seem acceptable here under our canonical parameterization. Van der Linde [24] provides further theoretical justification for DIC (using Kullback–Leibler distances and utility theory), as well as connections to (and further doubts about) Bayes factors. Overall, combined with more traditional residual analyses and posterior predictive model checks, DIC appears to offer a comprehensive framework for comparison and evaluation within our complex model class.
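The replication idea favoured in this discussion can be illustrated on a toy problem. The sketch below is not the authors' model or their equation (10): it assumes a one-parameter Poisson likelihood with a weak gamma prior (so that posterior draws can be generated directly rather than by MCMC), uses the standard definitions $p_D = \bar{D} - D(\bar{\boldsymbol{\theta}})$ and $\mathrm{DIC} = \bar{D} + p_D$, and takes the sample variance over independent replicates as the variance estimate. All function names are hypothetical.

```python
import math
import numpy as np


def poisson_deviance(lam, y):
    """Deviance D(lam) = -2 log L(lam) for i.i.d. Poisson(lam) counts y.

    Accepts a scalar lam or an array of posterior draws.
    """
    log_fact = sum(math.lgamma(v + 1.0) for v in y)
    log_lik = np.sum(y) * np.log(lam) - len(y) * np.asarray(lam) - log_fact
    return -2.0 * log_lik


def dic_from_draws(draws, y):
    """Return (p_D, DIC) computed from one set of posterior draws."""
    d_bar = np.mean(poisson_deviance(draws, y))   # posterior mean deviance
    d_hat = poisson_deviance(np.mean(draws), y)   # deviance at posterior mean
    p_d = d_bar - d_hat
    return p_d, d_bar + p_d


def replicate_dic(y, n_rep=50, n_draws=2000, seed=0):
    """Repeat the p_D / DIC calculation on n_rep independent sets of draws.

    Toy assumption: a Gamma(0.01, 0.01) prior, so the posterior is
    Gamma(0.01 + sum(y), rate = 0.01 + n) and can be sampled directly.
    With real MCMC output each replicate would be a separate chain.
    """
    rng = np.random.default_rng(seed)
    shape = 0.01 + np.sum(y)
    rate = 0.01 + len(y)
    out = [dic_from_draws(rng.gamma(shape, 1.0 / rate, size=n_draws), y)
           for _ in range(n_rep)]
    p_d, dic = np.array(out).T
    return p_d, dic


# Simulated counts, purely for illustration
y = np.random.default_rng(1).poisson(5.0, size=40)
p_d, dic = replicate_dic(y)
print("mean p_D:", p_d.mean(), " Var(p_D):", p_d.var(ddof=1))
print("mean DIC:", dic.mean(), " Var(DIC):", dic.var(ddof=1))
```

Because this toy model has a single, weakly constrained parameter, the replicated $p_D$ values should come out close to 1, mirroring the agreement between $p_D$ and $p_{total}$ noted for the fixed effects models. The spread of the replicated DIC values plays the role of the ‘exact’ variance estimate.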

ACKNOWLEDGEMENTS

The research of both authors was supported in part by National Institute of Environmental Health Sciences (NIEHS) grant 1-R01-ES07750. The contents of this paper are solely the responsibility of the authors and do not necessarily represent the official views of the NIEHS or NIH. The authors are grateful to Dr Paul English and Mr Rusty Scalf of the Environmental Health Investigations Branch, California Department of Health Services, for permission to analyse the San Diego data, as well as substantial assistance with background material, database issues and analytic methods.

REFERENCES

1. Berger JO, Pericchi LR. The intrinsic Bayes factor for linear models. In Bayesian Statistics 5, Bernardo JM, Berger JO, Dawid AP, Smith AFM (eds). Oxford University Press: Oxford, 1996; 25–44.

2. O’Hagan A. Fractional Bayes factors for model comparison (with discussion). Journal of the Royal Statistical Society, Series B 1995; 57:99–138.

3. Gelfand AE, Dey DK, Chang H. Model determination using predictive distributions with implementation via sampling-based methods (with discussion). In Bayesian Statistics 4, Bernardo JM, Berger JO, Dawid AP, Smith AFM (eds). Oxford University Press: Oxford, 1992; 147–167.

4. Xia H, Carlin BP, Waller LA. Hierarchical models for mapping Ohio lung cancer rates. Environmetrics 1997; 8:107–120.

5. Laud P, Ibrahim J. Predictive model selection. Journal of the Royal Statistical Society, Series B 1995; 57:247–262.

6. Waller LA, Carlin BP, Xia H, Gelfand AE. Hierarchical spatio-temporal mapping of disease rates. Journal of the American Statistical Association 1997; 92:607–617.

7. Gelfand AE, Ghosh SK. Model choice: a minimum posterior predictive loss approach. Biometrika 1998; 85:1–11.

8. Spiegelhalter DJ, Best N, Carlin BP. Bayesian deviance, the effective number of parameters, and the comparison of arbitrarily complex models. Research Report 98-009, Division of Biostatistics, University of Minnesota, 1998.

9. Akaike H. Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory, Petrov BN, Csáki F (eds). Akadémiai Kiadó: Budapest, 1973; 267–281.

10. Ye J. On measuring and correcting the effects of data mining and model selection. Journal of the American Statistical Association 1998; 93:120–131.

11. Hodges JS, Sargent DJ. Counting degrees of freedom in hierarchical and other richly-parameterised models. Technical report, Division of Biostatistics, University of Minnesota, 1998.

12. Clayton DG, Kaldor J. Empirical Bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics 1987; 43:671–681.

13. Besag J, York JC, Mollié A. Bayesian image restoration, with two applications in spatial statistics (with discussion). Annals of the Institute of Statistical Mathematics 1991; 43:1–59.

14. Zhu L, Carlin BP, English P, Scalf R. Hierarchical modeling of spatio-temporally misaligned data: relating traffic density to pediatric asthma hospitalizations. Environmetrics 2000; 11:43–61.

15. Elandt-Johnson RC, Johnson NL. Survival Models and Data Analysis. Wiley: New York, 1980.

16. Neal RM. Probabilistic inference using Markov chain Monte Carlo methods. Technical report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.

17. Kass RE, Carlin BP, Gelman A, Neal R. Markov chain Monte Carlo in practice: a roundtable discussion. American Statistician 1998; 52:93–100.

18. Carlin BP, Louis TA. Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall/CRC Press: Boca Raton, FL, 1996.

19. California Office of Statewide Health Planning and Development. Hospital Patient Discharge Data (Public Use Version). State of California: Sacramento, CA, 1997.


20. Scalf R, English P. Border Health GIS Project: documentation for inter-censal zip code population estimates. Technical report, Impact Assessment Inc., Environmental Health Investigations Branch, California Department of Health Services, 1996.

21. Versluis AH. Methodology for predicting vehicle emissions on motorways and their impact on air quality in the Netherlands. Science of the Total Environment 1994; 146/147:359–364.

22. Fraigneau YC, Gonzalez M, Coppalle A. Dispersion and chemical reaction of a pollutant near a motorway. Science of the Total Environment 1995; 169:83–91.

23. Bernardinelli L, Clayton DG, Montomoli C. Bayesian estimates of disease maps: How important are priors? Statistics in Medicine 1995; 14:2411–2431.

24. Van der Linde A. DIC put into Bayesian perspective: From residual deviances to adjusted Bayes factors. Technical report, Department of Mathematics and Statistics, University of Edinburgh, 1998.
