global, local and focused geographic clustering for case-control data with residential histories

19
BioMed Central Page 1 of 19 (page number not for citation purposes) Environmental Health: A Global Access Science Source Open Access Methodology Global, local and focused geographic clustering for case-control data with residential histories Geoffrey M Jacquez* 1 , Andy Kaufmann 1 , Jaymie Meliker 2 , Pierre Goovaerts 1 , Gillian AvRuskin 1 and Jerome Nriagu 2 Address: 1 BioMedware, Inc., 516 North State Street, Ann Arbor, MI, 48104-1236, USA and 2 Department of Environmental Health Sciences, The University of Michigan School of Public Health, 109 S. Observatory St. Ann Arbor, MI, 48109-2029, USA Email: Geoffrey M Jacquez* - [email protected]; Andy Kaufmann - [email protected]; Jaymie Meliker - [email protected]; Pierre Goovaerts - [email protected]; Gillian AvRuskin - [email protected]; Jerome Nriagu - [email protected] * Corresponding author Abstract Background: This paper introduces a new approach for evaluating clustering in case-control data that accounts for residential histories. Although many statistics have been proposed for assessing local, focused and global clustering in health outcomes, few, if any, exist for evaluating clusters when individuals are mobile. Methods: Local, global and focused tests for residential histories are developed based on sets of matrices of nearest neighbor relationships that reflect the changing topology of cases and controls. Exposure traces are defined that account for the latency between exposure and disease manifestation, and that use exposure windows whose duration may vary. Several of the methods so derived are applied to evaluate clustering of residential histories in a case-control study of bladder cancer in south eastern Michigan. These data are still being collected and the analysis is conducted for demonstration purposes only. Results: Statistically significant clustering of residential histories of cases was found but is likely due to delayed reporting of cases by one of the hospitals participating in the study. Conclusion: Data with residential histories are preferable when causative exposures and disease latencies occur on a long enough time span that human mobility matters. To analyze such data, methods are needed that take residential histories into account. Background U.S. population-based surveys estimate that adults spend 87% of their day indoors, 69% in their place of residence, and 6% in a vehicle [1-3]. To date, most published disease cluster investigations use static geographies in which indi- viduals are assumed to be sessile. Examples include the use of geocoded place of residence at time of diagnosis, death, and at time of birth (e.g. [4]), as well as the address of the admitting hospital (e.g. [5]) to record locations of health events, even though most researchers acknowledge that residential mobility should be accounted for, espe- cially for diseases with long latencies such as cancer. In a recent review of standard methods for evaluating expo- sure/hazards, disease mapping and clustering techniques, Bayesian approaches, Markov Chain Monte Carlo (MCMC) and geostatistical methods, Mather et al. [6] Published: 22 March 2005 Environmental Health: A Global Access Science Source 2005, 4:4 doi:10.1186/1476-069X-4-4 Received: 28 October 2004 Accepted: 22 March 2005 This article is available from: http://www.ehjournal.net/content/4/1/4 © 2005 Jacquez et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Upload: independent

Post on 20-Apr-2023

0 views

Category:

Documents


0 download

TRANSCRIPT

BioMed Central

Environmental Health: A Global Access Science Source

ss

Open AcceMethodologyGlobal, local and focused geographic clustering for case-control data with residential historiesGeoffrey M Jacquez*1, Andy Kaufmann1, Jaymie Meliker2, Pierre Goovaerts1, Gillian AvRuskin1 and Jerome Nriagu2

Address: 1BioMedware, Inc., 516 North State Street, Ann Arbor, MI, 48104-1236, USA and 2Department of Environmental Health Sciences, The University of Michigan School of Public Health, 109 S. Observatory St. Ann Arbor, MI, 48109-2029, USA

Email: Geoffrey M Jacquez* - [email protected]; Andy Kaufmann - [email protected]; Jaymie Meliker - [email protected]; Pierre Goovaerts - [email protected]; Gillian AvRuskin - [email protected]; Jerome Nriagu - [email protected]

* Corresponding author

AbstractBackground: This paper introduces a new approach for evaluating clustering in case-control datathat accounts for residential histories. Although many statistics have been proposed for assessinglocal, focused and global clustering in health outcomes, few, if any, exist for evaluating clusters whenindividuals are mobile.

Methods: Local, global and focused tests for residential histories are developed based on sets ofmatrices of nearest neighbor relationships that reflect the changing topology of cases and controls.Exposure traces are defined that account for the latency between exposure and diseasemanifestation, and that use exposure windows whose duration may vary. Several of the methodsso derived are applied to evaluate clustering of residential histories in a case-control study ofbladder cancer in south eastern Michigan. These data are still being collected and the analysis isconducted for demonstration purposes only.

Results: Statistically significant clustering of residential histories of cases was found but is likely dueto delayed reporting of cases by one of the hospitals participating in the study.

Conclusion: Data with residential histories are preferable when causative exposures and diseaselatencies occur on a long enough time span that human mobility matters. To analyze such data,methods are needed that take residential histories into account.

BackgroundU.S. population-based surveys estimate that adults spend87% of their day indoors, 69% in their place of residence,and 6% in a vehicle [1-3]. To date, most published diseasecluster investigations use static geographies in which indi-viduals are assumed to be sessile. Examples include theuse of geocoded place of residence at time of diagnosis,death, and at time of birth (e.g. [4]), as well as the address

of the admitting hospital (e.g. [5]) to record locations ofhealth events, even though most researchers acknowledgethat residential mobility should be accounted for, espe-cially for diseases with long latencies such as cancer. In arecent review of standard methods for evaluating expo-sure/hazards, disease mapping and clustering techniques,Bayesian approaches, Markov Chain Monte Carlo(MCMC) and geostatistical methods, Mather et al. [6]

Published: 22 March 2005

Environmental Health: A Global Access Science Source 2005, 4:4 doi:10.1186/1476-069X-4-4

Received: 28 October 2004Accepted: 22 March 2005

This article is available from: http://www.ehjournal.net/content/4/1/4

© 2005 Jacquez et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Page 1 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

identified as substantial weaknesses (1) the lack of tempo-ral referencing of geospatial data and (2) the inability ofmethods to account for residential histories. A recentmeeting of this nation's experts recognized the need toaccount for latency and human mobility as especiallypressing in studies of cancer [7]. Boscoe et al. [8] identi-fied residential history information as a primary need forthe analysis of cancer data.

The representation of individuals as sessile (immobile)rather than vagile (mobile) in part is due to the staticworld view of GIS software, which is not well suited torepresenting temporal change [9,10]. Recently, technolog-ical advances have resulted in Space Time Intelligence Sys-tems (e.g. [11-13]) that implement several constructsfrom Geographic Information Science for representinghuman mobility (see [14] for a review). The methods pre-sented in this paper build on this body of prior work toproduce case-control cluster statistics for residentialhistories.

We begin with a brief background on tests for disease clus-tering, followed by a summary of approaches to modelinghuman mobility. We then develop a suite of novel testsfor evaluating local, global and focused clustering in resi-dential histories using case-control data. Finally, we illus-trate several of the new techniques by quantifying local,global and focused clustering of residential histories in acase-control study of bladder cancer in Michigan.

Background on cluster testsCluster tests work within a hypothesis testing frameworkthat proceeds by calculating a statistic (e.g. clustering met-ric) to quantify a relevant aspect of spatial pattern in ahealth outcome (e.g. case/control location, disease inci-dence, or mortality rate). The numerical value of this sta-tistic is then compared to the distribution of that statistic'svalue under a null spatial model, providing a probabilisticassessment of how unlikely an observed cluster statistic isunder the null hypothesis [15]. Waller and Jacquez [16]formalized this approach by identifying five componentsof a spatial cluster test. The test statistic quantifies a rele-vant aspect of spatial pattern (e.g. Moran's I). The alterna-tive hypothesis describes the spatial pattern that the test isdesigned to detect. This may be a specific alternative, suchas a circular cluster for the scan statistic, or it may be theomnibus "not the null hypothesis". The null hypothesisdescribes the spatial pattern expected when the alternativehypothesis is false (e.g. uniform cancer risk). The null spa-tial model is a mechanism for generating the reference dis-tribution. This may be based on distribution theory, or itmay use randomization (e.g. Monte Carlo) techniques.Most disease cluster tests employ heterogeneous Poissonand Bernoulli models for specifying null hypotheses [17].The reference distribution is the distribution of the test

statistic when the null hypothesis is true. Comparison ofthe test statistic to the reference distribution allows calcu-lation of the probability of observing that value of the teststatistic under the null hypothesis of no clustering. Thisfive-component mechanism underpins most commonlyused clustering methods.

There are dozens of cluster statistics (see [17-19] forreviews) that may be categorized for convenience as glo-bal, local, and focused tests. Global cluster statistics aresensitive to spatial clustering, or departures from the nullhypothesis, that occur anywhere in the study area. Manyearly tests for spatial pattern, such as Moran's I [20] areglobal tests. While global statistics can determine whetherspatial structure (e.g. clustering, autocorrelation, uniform-ity) exists, they do not identify where the clusters are, nordo they quantify how spatial dependency varies from oneplace to another.

Local statistics such as Local Indicators of Spatial Autocor-relation (LISA) [21] quantify spatial autocorrelation andclustering within the small areas that together comprisethe study geography. Local statistics quantify spatialdependency (e.g. not significantly different from the nullexpectation, cluster of high values, cluster of low values,and high or low spatial outlier) in a given locality. Manylocal statistics have global counterparts that often are cal-culated as functions of local statistics. For example,Moran's I is the sum of the scaled local Moran statistics.

Focused statistics quantify clustering around a specificlocation or focus. These tests are particularly useful forexploring possible clusters of disease near potentialsources of environmental pollutants. For example, Law-son and Waller [22,23] proposed tests that score each areafor the difference between observed and expected diseasecounts, weighted by exposure to the focus (also see [24]for a review of these approaches). A commonly used expo-sure function is inverse distance to the focus (1/d). Thenull hypothesis is no clustering relative to the focus, withexpected number of cases calculated as the Poisson expec-tation using the population at risk in each area and theassumption that risk is uniform over the study area.

Hundreds of cluster investigations are recorded in the lit-erature, and several of these have resulted in cancer con-trol activities such as epidemiological studies tounderstand potential causes. Notable examples of clusterstudies include brain cancer [25], liver cancer [26], breastcancer [27,28], prostate cancer [29], colorectal cancer[30], and cancer disparities [31], to name only a few.

In studies of lung, breast and colorectal cancer on CapeCod, stronger evidence for spatial clustering was foundonce latency was taken into account [32]. In a population-

Page 2 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

based case-control study Vieira et al. [33] incorporatedresidential location to evaluate lung cancer risks notexplained by age and smoking. Han et al. [34] exploredgeographic clustering of breast cancer based on place ofresidence early in life and found space-time clustering incase-control data. They also explored clustering of casesusing place of residence at critical time points including atthe subject's birth, menarche, and at the women's firstbirth. Boscoe et al. [8] recognized representation of resi-dential mobility as a primary need for data used in studiesof cancer. But to date and to our knowledge, residentialmobility has yet to be directly accounted for in clusterstudies.

How might one account for residential mobility in clusterstudies? Hagerstrand [35] conceptualized the space timepath as an individual's continuous physical movementthrough space and time, and visually represented this as a3-dimensional graph. Hornsby and Egenhofer [36] recog-nized that space-time paths mediate individual-levelexposure to pathogens and environmental toxins, andthat practical application would require a mechanism forrepresenting location uncertainty. A space time prism refersto the possible locations an individual could feasibly passthrough in a specific time interval, given knowledge oftheir actual locations in the times bracketing that interval.The potential path area [37] shows the locations the indi-vidual could occupy given these constraints, and repre-sents places where exposure events might occur. Theseconstructs enabled new research approaches in diversefields such as student life [38], sports analysis [39], socialsystems [40], transportation [37], and the analysis of dis-parities in gender accessibility in households [41]. Whilethese approaches provide a proven mechanism for mode-ling geospatial lifelines and related constructs, to date andto our knowledge there are no methods for the statisticalevaluation of clustering among such lifelines other thanthe paper by Sinha and Mark [42], who use Minkowski-type metrics to calculate a dissimilarity metric for geospa-tial lifelines, and then cluster this dissimilarity metric.

This paper proposes a novel technique for undertaking thestatistical evaluation of clustering of residential historiesfor case-control data. We first develop the method, andthen apply it to an ongoing case-control study of bladdercancer in southeastern Michigan.

Setting the ProblemA naïve approach when considering residential histories isto take an existing test for spatial clustering, and to thenapply it repeatedly for different time values. For example,when considering the geographic distribution of bladdercancer, one might use place of residence of individuals ina case-control study from T years ago, and then allow T tovary in a range of several decades. Locations of place of

residence will change, as may the numbers of cases andcontrols extant in the study area. How might results varydepending on when one looks at the system (e.g. on selec-tion of T)? To answer this question we analyzed data froma population-based bladder cancer case-control study cur-rently underway in southeastern Michigan. Cases arerecruited from the Michigan State Cancer Registry anddiagnosed in the years 2000–2004.

Controls are frequency matched to cases by age (± 5years), race, and gender, and recruited using a randomdigit dialing procedure from an age-weighted list. Thisdata set is described more fully later in this paper, and iscomprised of 63 cases and 182 controls. Using Cuzick andEdwards Tk statistic with k = 5 nearest neighbors we thenanalyzed these data at every point in time when the topol-ogy of place of residence of the cases and controls changedby having a participant move, enter or leave the studyarea. The graph of Tk through time (Figure 1) is ascending,reflecting the larger number of cases and controls residingin the study area in later time periods. We found five peri-ods when cases were significantly clustered relative to thecontrols: January 1 1929 through January 1 1935, January1 1941 through November 26 1942, January 1 1960through January 1 1961, August 22 1967 through January1 1975 and January 1 1995 through January 1 1997.Clearly, results of cluster analyses that rely on single loca-tions may be highly sensitive to the choice of the time atwhich the analysis is conducted. What are needed are newmethods that account for the dynamic topology of casesand controls that arise as a consequence of residentialmobility, and that are suited to multi-temporal analyses.The development of such techniques is the focus of thispaper.

MethodsWe begin by defining an algebra for residential histories,and a matrix representation that describes how spatialnearest neighbor relationships change through time. Nextwe develop a local case control cluster test, and thenextend it to create global, local and focused tests at specifictime points, and then for entire residential histories. Aftercompleting development of the cluster statistics for resi-dential histories, we next describe exposure traces thataccount for latency periods and exposure windows. Wethen develop clustering methods for exposure traces. Afterthat we describe the bladder cancer data set that was ana-lyzed with the new methods. In the Results section wedescribe application of several of the new cluster tests toevaluate possible clustering of residential histories ofcases of bladder cancer in Michigan.

NotationDefine the coordinate ui,t = {xi,t, yi,t} to indicate the geo-graphic location of the place of residence of the ith case or

Page 3 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

Cuzick and Edward's statistic through timeFigure 1Cuzick and Edward's statistic through time. Graph of Cuzick and Edward's Tk statistic (top) and its Probability (bottom) through time for k = 5. Shown in red are those time intervals in which the probability of Tk was 0.0l or smaller.

Page 4 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

control at time t. Residential histories for individual casesand controls can then be represented as the set of space-time locations:

Li = (ui0, ui1,..., uiT) (Equation 1)

This defines individual i living at his or her place of resi-dence found at ui0 at the beginning of the study (time 0),and moving to location ui1 at time t = 1. At the end of thestudy individual i may be found at uiT. T is defined to bethe number of unique observation times on all individu-als in the study. This bears some emphasis as understand-ing of how T is recorded is essential in order to understandthe cluster tests for residential histories. In other words, Tis the total number of different observation times acrossall individuals, and so one might expect several geo-graphic locations in an individual residential history to bethe same. For example, suppose we have 2 individuals (iand j) and record their residential histories (Figure 2). Werecord their places of residence at t = 0, the beginning ofthe study. At some time t = 1 "i" moves to a differenthome, and moves again at time t = 2. "j" never moves atall and hence has the location of the same initial place ofresidence recorded at times t = 0, 1, and 2. In this exampleT = 2. Notice the duration between t = 0 to t = 1 may notequal the duration from t = 1 to t = 2. This will be impor-

tant later when we develop duration-weighted versions ofthe statistics.

While observations on residential histories occur at afinite number of time points or observation times, theseobservations do not have to happen at the same time forall individuals under scrutiny. When residential historiesare self-reported, these observation times are defined bythe "move" dates reported by the respondent. We mod-eled this as an instantaneous displacement from the spa-tial coordinates for entity i at time t (uit) to those at timet+1 (uit+1). We defined this instantaneous displacement asoccurring at time t+1. We viewed this as an observationalmodel in which the entity is assumed to reside at itsknown location up until that moment when it is observedelsewhere (e.g. Figure 2).

Individual residential histories can be associated withtime-dependent attributes such as weight, height, diseasestate, smoking status, case control status, and so on. Theseattributes may be associated with risk and thereby influ-ence calculation of the latency period and exposure win-dows defined later. Later we also will use time of diagnosisto define exposure windows during which carcinogenesiswas thought to have occurred. For now let us define a case-control identifier, ci to be

Define na to be the number of cases and nb to be thenumber of controls. The total number of individuals inthe study is then N = na+nb.

Nearest Neighbor RelationshipsLet k indicate the number of nearest neighbors to considerwhen evaluating nearest neighbor relationships (e.g.[63]), and define a nearest neighbor indicator to be:

We then define a binary matrix of kth nearest neighborrelationships at a given time t as:

By convention we define ηi,i,k,t = 0 (the diagonal elements)since we do not wish to count individuals as nearestneighbors of themselves. This matrix enumerates the knearest neighbors (indicated by a 1) for each of the Nindividuals. The entries of this matrix are 1 (indicating

Schematic of residential historiesFigure 2Schematic of residential histories. Graphical representa-tion of residential histories from Equation 1 using the instan-taneous displacement movement model. Location is on the x-axis, time on the y-axis. Individual i moves from location ui0 to ui1 at time t = 1, and stays at that place of residence until t = T. Individual j stays at the same place of residence from t = 0 to t = T.

t

x,y

Ui,0

Ui,11

0

T Ui,T Uj,T

Uj,0

ci

i =

( )1if and only if is a case

0 otherwiseEquation 2

ηi j k tj k i

, , , =1 if and only if is a nearest neighbor of att time

otherwiseEquation 3

t

0

( )

ηηk t

k t N k t

k t

N N k t

N k t

,

, , , , , ,

, , ,

, , ,

, , ,

. .

.

. . .

. .

.

=

0

01 2 1

2 1

1

1

η ηη

ηη .. , , ,ηN N k t−

( )

1 0

Equation 4

Page 5 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

that j is a k nearest neighbor of i at time t) or 0 (indicatingj is not a k nearest neighbor of i at time t). It may be asym-metric about the 0 diagonal since nearest neighbor rela-tionships are not necessarily reflexive (e.g. Imagine 3people, call them A, B and C, standing in a line. B is in themiddle but is closer to person A than to person C. Thenearest neighbor to C is B, but the nearest neighbor to B isA. The nearest neighbor relationships are not reflexive).Since two individuals cannot occupy the same location,we assume at any time t that any individual has k uniquek-nearest neighbors. (While it is true that two individualscannot occupy the exact same location, such as the spaceoccupied by one individuals body, residential historyinformation can assign two individuals the same coordi-nate when they live in the same house. How might tiednearest neighbor relationships arising from this situationbe resolved? Two approaches have been proposed. Thefirst creates fractional nearest neighbor weights [43], thesecond propagates uncertainty in the nearest neighborrelationships by evaluating the permutations of possiblenearest neighbors for the tied nearest neighbor relation-ships [44]). The row sums thus are equal to k (ηi,•,k,t = k)although the column sums vary depending on the spatialdistribution of case control locations at time t. The sum ofall the elements in the matrix is Nk. There exists a 1 × T+1vector of times denoting those instants in time wheneither (1) the system is observed and the locations of theentities are recorded, or (2) under continuous observationat least one entity changes geographic location. We canthen consider the sequence of T nearest neighbor matricesgiven by

This defines the sequence of k nearest neighbor matricesfor each unique temporal observation recorded in thedata set, and thus quantifies how nearest neighbor rela-tionships change through time. This demonstrates oneway in which spatial weights (here the nearest neighborrelationship) can be specified from residential histories.We will now use these nearest neighbor relationships toconstruct case control spatial and space-time cluster testsfor residential histories.

Spatially and Temporally Local Spatial Cluster StatisticA spatially and temporally local case-control cluster statis-tic is:

This is the count, at time t, of the number of k nearestneighbors of case i that are cases, and not controls (assum-ing i indeed is a case, if it isn't Qi,k,t = 0). Since a given indi-vidual i may have k unique nearest neighbors, this statistic

is in the range 0..k. It always is 0 when i is a control. Wheni is a case, low values indicate cluster avoidance (e.g. a casesurrounded by controls), and large values (near k)indicate a cluster of cases. When Qi,k,t = k, at time t all ofthe k nearest neighbors of case i are cases.

Probabilities, Null Hypotheses and RandomizationThe statistical significance of Qi,k,t may be evaluated usingconditional randomization that holds the case controlidentifier for individual i fixed and then allocates the vec-tor of remaining N-1 case-control identifiers across theremaining individuals with a given probability function.If we assume equiprobability such that all individualshave equal disease risk we obtain:

Given the case-control identifier for individual i, this is theprobability of individual j being a case under Goovaertand Jacquez's [45] neutral model Type IV (HIV) of spatialindependence of risk for a spatially heterogeneous popu-lation density. As expressed in Equation 7, the exactnumber of cases (na) and controls (nb) might not be repro-duced under probabilistic sampling.

Their neutral model type V retains a specified level of spa-tial autocorrelation and may be simulated using rejectionsampling, sequential indicator simulation, or conditionalcase-control index swapping to achieve the observed levelof spatial autocorrelation [46]. Probabilities for neutralmodel type V are difficult to write in a closed form analo-gous to Equation 7.

Probabilities for neutral model type HVI describe the situ-ation where not all individuals have the same probabilityof being labeled a case. This occurs, for example, when weare concerned with detecting clusters that arise from addi-tional risk above and beyond that of a background risk thatis itself spatially heterogeneous. This may be accom-plished in a variety of fashions to model known individ-ual and environmental risk factors. Tests of thesignificance of Qi,k,t are then identifying clusters of casesabove and beyond that expected under the neutral model.

One calculates the value of the test statistic for each reali-zation of the spatial distribution of cases generated underthe chosen neutral model. These values under randomiza-tion are retained and used to construct the reference dis-tribution of the statistic under the corresponding nullhypothesis. The observed value of the test statistic for the

not randomized data (denoted ) is then compared tothe reference distribution to calculate the p-value:

ηη ηηkT

k t t T= = ( ){ ; .. }, 0 Equation 5

Q c ci k t i i j k t jj

N

, , , , ,= ( )=∑η

1

Equation 6

P c c Hn c

n nj i IVa i

a b( | , )= =

−+ −

( )11

Equation 7

Qikt*

Page 6 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

Here a is the number of conditional randomizationswhose cluster statistic was greater than or equal to thatobserved for the not randomized data, and b is the totalnumber of randomization runs conducted.

A convenient algorithm for conditional randomizationunder neutral model IV is to hold the case-control identi-fier for the ith individual constant, and to then draw fromthe 1 × N-1 vector of remaining case-control identifiersnew case-control identifiers for the k nearest neighborssurrounding i. This sampling is accomplished withoutreplacement. Alternatively, one could populate the k-near-est neighbors about i using the probabilities from Equa-tion 7. This equation is correct for the first identifier sodrawn, but needs to be adjusted for the second, third andso on. For the mth identifier the correct probability forsampling without replacement is:

If one assumes sampling with replacement, so that thecases and controls are assumed drawn from a larger pop-ulation, one can use Equation 7 without modification.

This approach does not work for neutral models type Vand VI, since spatial structure in the background risk islost. Instead one calculates the value of the test statistic foreach of the N locations, for each realization of the spatialneutral model (of type V or VI) that produces a spatialpoint pattern of cases and controls with the desired levelof spatial autocorrelation. The probability assigned toclusters from these tests (as given by Equation 8) thenaccounts for the specified background variation in diseaserisk.

Note for each of the approaches listed above, that a refer-ence distribution, test statistic, and corresponding p-value, may be calculated for each of the na case locations.

Simes Correction for Local DependencyThe P-values for the k individuals surrounding the ith caseare not independent of one another, as they include oneanother as their own k nearest neighbors. We thereforeemploy a modified Simes correction [47] to account forthe lack of spatial independence in the local Q statistics.The Simes adjustment is calculated as pi' = (k + 1 - a) pi.Here k is the number of p-values being considered (thenumber of neighbors), and a is the index (starting at 1)indicating the rank in the sorted vector of p values for

individual i and its neighbors. We employ this correctionlater when reporting p-values for the local Q-statistics.

Global Test for Spatial Clustering at Time tA global statistic for spatial clustering at time t may beconstructed as:

This is the time-referenced form of Cuzick and Edward's[43] global test for case-control clustering used in Figure1. It is the count, over all cases, of the number of cases thatare k-nearest neighbors to those cases at time t. One coulddivide this statistic, and others to follow, by na to facilitatetheir interpretation. The test statistic would then be anaverage number of neighbor cases per case instead of theinteger total number of cases, and would facilitate com-parison across different studies with different numbers ofcases. In this paper we will use the case-count version.

The probability of Qk,t under HIV is evaluated by allocatingthe case-control id's with equal probability over the Nlocations at time t. Qk,t is then calculated and this processis repeated b times to construct the reference distributionand probability (Equation 8). Notice that since this is aglobal test conditional randomization that holds the case-control id for individual i constant is not needed.

Global Test for Spatial Clustering of Residential HistoriesA global test for spatial clustering among the N residentialhistories as represented in Equation 1 is

This is the sum, over all T+1 time points, of the global sta-tistic Qk,t. It is a measure of the persistence of global clus-tering and is large when case clustering persists throughtime. Its reference distribution may be constructed undera randomization procedure in which the case-control idsare allocated with equal probability over the residentialhistories comprising the set

{Li, i = 1..N} (Equation 12)

This randomization procedure is conditioned on the totalnumber of cases and controls in the data set, so that eachdata set constructed under randomization has the samenumber of cases and controls as the original data.

Local Test for Spatial Clustering of Residential Histories through TimeTo determine whether cases tend to cluster through timearound a specific case we may construct a test statistic:

p Q Ha

bikt m( | )( )

( )* = +

+( )1

1Equation 8

P c c c j m H

n c c

n n mm i j IV

a i jj

m

a b( | , .. , )= ∀ = − =

− −

+ − −=

−∑

1 1 12

1

1

Equationn 9( )

Q Qk t i k ti

N

, , ,= ( )=∑

1

Equation 10

Q Qk k tt

T= ( )

=∑ ,

0

Equation 11

Page 7 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

For the ith residential history, this is the sum, over all T+1time points, of the local spatial cluster statistic Qi,k,t. It isthe number of cases that are k-nearest neighbors of the ith

residential history (a case), summed over all T+1 timepoints. It will be large when cases tend to cluster aroundthe ith case through time. Under neutral model type IV, thesignificance of Qi,k,t is evaluated under a conditional rand-omization that holds the case id for i constant, and thenallocates the remaining case-control id's at random overthe N-1 remaining residential histories. This statistic isuseful for determining whether there is local clustering ofresidential histories about a specific case. The statistic canbe calculated for all cases in the data set to identify thosecases whose residential histories form local spatial clus-ters. However, when calculating significance one shouldcorrect for the multiple testing inherent when many spa-tial locations are evaluated.

Focused Test for Spatial Clustering at Time tSuppose that one suspects that the cases may be clusteringabout a specific focus defined by the lifeline (e.g. record ofbusiness addresses):

LF = {uF,0, uF,1,.., uF,T} (Equation 14)

This records the locations of the focus as it moves aboutthrough space-time, and includes situations in which thefocus doesn't move as a degenerate instance. A test for spa-tial clustering of cases about the focus at a given time t isthen:

Here ηF,j,k,t is the nearest neighbor index indicating at timet whether the jth individual is a kth nearest neighbor of thegeographic location of the focus defined by uF,t. The statis-tic QF,k,t is then the count of the number of k-nearestneighbors about the focus at time t that are cases. Undernull hypothesis type IV randomization at time t may beaccomplished by allocating the case control identifierswith equal probability over the N-individuals. Since onlythe k-nearest neighbors are considered it is only necessaryto allocate their indices. This may be accomplished bysampling without replacement from the 1 × N vector ofthe case-control identifiers, or by drawing the k requiredcase control identifiers with probabilities defined byEquation 9 (for sampling without replacement) or Equa-tion 7 (for sampling with replacement).

Focused Test for Spatial Clustering of Residential Histories about a Mobile FocusA test for focused clustering of residential historiesthrough time is:

This is the count, over the T times, of the number of casesthat are k nearest neighbors of the focus at each timepoint. This statistic is large when residential histories thatare near the focus are cases. Its maximal value is

max(QF,k) = kT. (Equation 17)

One drawback of using nearest neighbor relationships forfocused tests is that the set of nearest neighbors to thefocus are given equal weight in Equations 15 and 16,regardless of their actual geographic distance and direc-tion with respect to the focus. But diffusion and activetransport mechanisms that might carry emissions fromthe focus typically result in higher exposures near thefocus, and it thus may make sense to use a maximum dis-tance within which ki nearest neighbors are found. Inthese instances the set of nearest neighbors to the focuswill vary (hence the i subscript denoting the ith focus)depending on the number of cases and controls foundwithin the specified distance of the focus.

Power of the Focused Tests and Specification of the ExposureNotice that the power of the tests given by Equations 15and 16 decreases as k approaches N since QF,k,t = na whenk = N, and its probability is then:

P(QF,k,t | H0, k = N) = 1.0. (Equation 18)

When one wishes to search for clustering in instanceswhere k approaches N power may be retained by con-structing a weight function to model the hypothesizedexposure. For geographically localized foci this may bebased on proximity to the focus. One choice is

Here rF,j,t is the rank indicating proximity of the locationof the jth individual at time t (as given by uj,t) to the loca-tion of the focus at time t (uF,t). For example, the first near-est neighbor to the focus has rank 1, the second rank 2,and so on.

In many situations, such as airborne pollution or ground-water contamination, the magnitude of exposure is afunction not only of the proximity to the focus but also of

Q Qi k i k tt

T

, , ,= ( )=∑

0

Equation 13

Q cF k t F j k t jj

N

, , , , ,= ( )=∑η

1

Equation 15

Q QF k F k tt

T

, , ,= ( )=∑

0

Equation 16

wrF j tF j t

, ,, ,

= ( )1Equation 19

Page 8 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

its orientation, since most dispersion processes (i.e.winds, infiltration through porous media) are anisotropicor direction-dependent. Depending on the amount ofinformation available, exposure models of increasingcomplexity can be built.

An easy way to account for anisotropy is to replace therank value rF,j,t by a function of the separation vector hjF,t= |uj,t - uF,t| joining the location of the jth individual attime t to the location of the focus at time t. Covariancefunctions seem to be natural choices for the weight func-tions wF,j,t since they incorporate the spatial pattern ofdependence of exposure data. For example, one could usethe Exponential or Gaussian covariance functions definedas:

where a(θ) is the practical range of autocorrelation of thecovariance models; that is the distance h at which the cov-ariance function equals 0.05. This range is a function ofthe azimuth of the separation vector hjF,t. For example, therange of exposure to an airborne contaminant is expectedto be larger in the direction of the prevailing winds.

More complex weight functions could be created if a proc-ess-based model of dispersion is available. For the exam-ple of airborne pollution, an atmospheric dispersion anddeposition model could be developed to predict the fateof emissions and dust plumes from targeted facilities [48].However, such models require many more parametersand assumptions concerning, for example, the emissionrate, the meteorological conditions, complex terraineffects, the particle size and density for depositioncalculation.

A limitation of process-based models is that they fail toprovide a measure of uncertainty attached to their predic-tions and field exposure data are not readily incorporated.Geostatistics [49] provide tools for modeling the spatio-temporal distributions of exposure and assessing theattached uncertainty. Various sources of information canbe taken into account, such as measurements at a fewmonitoring stations, coordinates of major sources ofexposure (i.e. factories) and transport characteristics (i.e.wind directions) that could be either directly incorporatedinto the prediction algorithm [50] or fed into physicalmodels to derive spatial trends [51]. In the latter case,geostatistics are used to model the unexplained or

residual part of the variability predicted by the process-based models.

The weight function, either based on geographic proxim-ity (as in Equation 19) or derived using a process-basedmodel or geostatistics (as in Equations 20 & 21), is thenused to construct the weighted focused test at time t as:

The test for spatial clustering of residential histories aboutthe focus through time is then:

Notice these weighted tests are conducted for the k nearestneighbors being considered. When k = N the maximumvalues are:

Duration-Weighted Tests for Clustering of Residential HistoriesThe number of time points defined by the t = 0 ..T obser-vation times, and the frequency with which they are taken,can have some influence on the value of the above statis-tics. For example, many repeated observations when thereis a chance of clustering could lead to spurious signifi-cance for the local and global tests for clustering of resi-dential histories. We therefore developed duration-weighted versions of the tests, and these are presented inthe Appendix [see Additional file 1].

Accounting for Exposure Windows and Latency PeriodsWhen dealing with cancers, causative exposures mayoccur during an exposure window (∆E), followed by alatency period (∆L) before cancer is manifested and diag-nosed. Given the residential history for case i, Li, furtherdenote the space-time coordinate representing place of

residence at time of diagnosis as , noting that

∈ Li We can then define that subset of the residential his-tory Li over which the exposure window occurred as:

Here ti,D is the time of diagnosis for individual i. The term(ti,D - ∆L) indicates the time prior to diagnosis when thelatency period began and (ti,D - ∆L - ∆E) is the time whenthe causative exposure began. Hence equation 25 denotesthat portion of individual i's residential history wherecausative exposures could have occurred. Notice that boththe exposure window and latency period could be

C ExpaExp( )

| |

( )h

h= −

( )3

θEquation 20

C Expa

Gaus( )| |

[ ( )]h

h= −

( )3 2

2θEquation 21

′ = ( )=∑Q w cF k t F j t jj

N

F j k t, , , , , , ,1

η Equation 22

′ = ′ ( )=∑Q QF k F k tt

T

, , ,0

Equation 23

max( ) max( ) max( ), , , , ,′ = ′ = ′= =∑ ∑Q

kQ QF k t

k

N

F k F k tt

T1

1 0

and Equatiion 24( )

ui tD, ui tD,

L uiE

i t i D L i D L Et t t= ∀ − > > − − ( ){ ( ) ( )}, , ,∆ ∆ ∆ Equation 25

Page 9 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

covariate-adjusted to account for risk factors such assmoking and age (see Discussion). In this instance thelatency period and exposure window vary from one indi-vidual to another and we write:

Here ∆i,L and ∆i,E are the latency period and exposure win-dows for the ith individual. In either case (Equations 25 or

26) we call the exposure trace for the ith individual.

Randomization Procedures for Exposure TracesIn order to evaluate whether exposure traces of the casescluster we must first construct a randomization procedurefor generating representative times of diagnosis, latencyperiods, and exposure windows. Once this is accom-plished we will be able to determine whether the exposuretraces for the cases cluster relative to those so constructedfor the controls. For a case, the exposure trace is definedby the time of diagnosis and the latency period, with thelatency period potentially dependent on age, gender andother covariates. The procedure proceeds as follows:

(1) Since controls are matched to cases, the "time of diag-nosis" for each control is set equal to the time of diagnosisfor the matched case.

(2) The exposure window and latency period for each con-trol is then defined based on the covariates for each con-trol as was accomplished for that controls matched case.

(3) Completion of steps (1) and (2) will result in exposuretraces defined for both cases and controls. Now randomlyassign case control identifiers across the residential histo-ries with equiprobability conditioned on the totalnumber of cases and the total number of controls.

(4) Calculate the desired test statistic for clustering ofexposure traces.

(5) Repeat steps 3 and 4 a desired number of times to con-struct the reference distribution of the statistic underrandomization.

Test statistics for assessing clustering of exposure traces arepresented below.

Local Case-Control Test for the Spatial Clustering of Exposure Traces at Time tWhen health events such as cancers are caused by expo-sure to geographically localized factors we might expectthe exposure traces for the cases to cluster relative to theexposure traces that are generated for the controls. Thedurations of the exposure traces may vary, and we

therefore will employ duration-weighted statistics. Wewould like to know whether exposure traces for the casesexhibit spatial clustering relative to the controls bothlocally (to identify places where causative exposuresoccurred) and globally (to ascertain whether the exposuretraces for the cases cluster when considered as a group).We also might wish to ask whether exposure traces for thecases exhibit focused clustering.

The exposure trace for case i ( ) records those placeswhere that individual lived during that time when expo-sures occurred that might have caused cancer later in life.Now define an indicator, ei,t, as:

When ei,t is 1, let us say the exposure trace is "active". Alocal case-control test for spatial clustering of exposuretraces at time t is then:

This is the count, at time t, of the number of k nearestneighbors of case i's active exposure trace that are cases(and not controls) whose exposure traces also are active.Hence the statistic will be large at those times when expo-sure traces of a group of cases are active and cluster. Itsvalue is 0 when individual i is a control, and also whenindividual i is a case with an inactive exposure trace. Theduration weighted version of this statistic is:

Local Case-Control Test for the Spatial Clustering of Exposure Traces through TimeWe can explore whether active exposure traces of casestend to cluster spatially through time. A statistic sensitiveto this pattern is:

will tend to be large when active exposure traces for

cases tend to cluster around the active exposure trace ofthe ith case. It will be 0 when i is a control, and small whena given case i has the traces of many controls as its neigh-bors. The duration-based version of this statistic is:

L uiE

i t i D i L i D i L i Et t t= ∀ − > > − − ( ){ ( ) ( )}, , , , , ,∆ ∆ ∆ Equation 26

LiE Li

E

et

i t, =1if and only if time is within the exposure trace foor individual

0 otherwiseEquation 27

i ( )

Q c e c ei k tE

i i t i j k t j j tj

N

, , , , , , ,= ( )=∑η

1

Equation 41

Q c e c ei kE

t i i t i j k t j j tj

N

t, , , , , , ,ω ω η= ( )=∑

1

Equation 29

Q Qi kE

i k tE

t

T

, , ,= ( )=∑

0

Equation 30

Qi kE,

Page 10 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

This statistic will be expressed in case-time units, indicat-ing the number (for example) of case-days over the entirestudy period for which cases with active traces were k-nearest neighbors of the active trace of case i.

Global Case-Control Test for the Spatial Clustering of Exposure Traces at Time tWe can ask whether, as a group, active case traces are spa-tially clustered relative to the active traces of the controlsat a given time t. This is accomplished using the statistic:

This is simply the sum, over all cases, of the local statisticfor clustering of case exposure traces at time t. This statisticwill be large when active traces of cases tend to be nearone another and small when the active traces of cases tendto have controls as their k nearest neighbors. The dura-tion-based version is:

Global Case-Control Test for the Spatial Clustering of Exposure Traces through TimeA global test for the spatial clustering of the active expo-sure traces of cases through time is:

This is the sum, over all time periods, of the global clustertest for the clustering of exposure traces. It will be largewhen global clustering of active exposure traces tends topersist through time. The duration-based version of thisstatistic is:

Focused Case-Control Test for the Spatial Clustering of Exposure Traces at Time tWe can also ask whether the exposure traces of cases clus-ter near putative emission sources. Again, these sourcesmay be mobile, and we accomplish this by assigninglarger weights for those cases that are near the focus. Recallfrom Equation 14 that we can represent a mobile source

as LF = {uF,0, uF,1,.., uF,T}. The test for spatial clustering ofcases about a focus at a given time t (Equation 15) maythen be extended to be a focused test for clustering ofexposure traces as:

This is the count of the number of cases with active expo-sure traces that are k nearest neighbors of the focus at timet. Significance of this statistic may be evaluated by con-structing exposure traces for the controls as described ear-lier, and by then repeatedly allocating case-controlidentifiers across the N lifelines that are k nearest neigh-bors of the focus in order to construct the reference distri-

bution for . The duration weighted version of this

statistic is

Focused Test for Spatial Clustering of Exposure Traces about a Mobile Focus through TimeWe can evaluate whether there is statistically significantclustering of exposure traces of cases about a mobile focusthrough time using the statistic:

This is the count, over T+1 times, of the number of casesthat have active exposure traces that are k nearest neigh-bors of the focus at each time point. The maximum valueof this statistic is kT, and its significance may be evaluatedunder randomization by reallocating case-control identi-ties over the exposure traces of the cases and controls asdescribed in the previous section. The duration-weightedversion of this statistic is:

Weighted Focused tests for Exposure TracesThe power of the k-nearest neighbor based focused test forexposure traces decreases as k approaches N. Weights suchas that suggested in Equations 19–21 may be used to con-struct a weighted focused test for exposure traces at a giventime t:

The test for focused clustering of exposure traces throughtime is then:

ηηk t

k t N k t

k t

N N k t

N k t

,

, , , , , ,

, , ,

, , ,

, , ,

. .

.

. . .

. .

.

=

0

01 2 1

2 1

1

1

η ηη

ηη .. , , ,ηN N k t−

( )

1 0

Equation 4

Q Qk tE

i k tE

i

N

, , ,= ( )=∑

1

Equation 32

Q QkE

i kE

i

N

t t, , ,ω ω= ( )=∑

1

Equation 33

Q QkE

k tE

t

T= ( )

=∑ ,

0

Equation 34

Q QkE

kE

t

T

t

,,

ωω= ( )

=

−∑

0

1Equation 35

Q c eF k tE

F j k t jj

N

j t, , , , , ,= ( )=∑η

1

Equation 36

QF k tE

, ,

Q c eF kE

t F j k t jj

N

j tt, , , , , ,ω ω η= ( )

=∑

1

Equation 37

Q QF kE

F k tE

t

T

, , ,= ( )=∑

0

Equation 38

Q QF kE

F kE

t

T

t,,

, ,ω

ω= ( )=

−∑

0

1Equation 39

′ = ( )=∑Q w c eF k t

EF j t

j

N

F j k t j j, , , , , , ,1

η Equation 40

Page 11 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

The significance of these statistics is evaluated using rand-omization across the k nearest neighbors of the focus asdescribed earlier. The corresponding duration-weightedversions are

This is the weighted focused test over duration ωt. Theduration-based weighted focused test for exposure tracesthrough time is

Bladder Cancer in southeastern MichiganA population-based bladder cancer case-control study iscurrently underway in southeastern Michigan. Cases arerecruited from the Michigan State Cancer Registry anddiagnosed in the years 2000–2004. Controls are fre-quency matched to cases by age (± 5 years), race, and gen-der, and recruited using a random digit dialing procedurefrom an age-weighted list. To be eligible for inclusion inthe study, participants must have lived in the elevencounty study area for at least the past 5 years and had noprior history of cancer (with the exception of non-melanoma skin cancer). Participants are offered a modestfinancial incentive and research is approved by the Uni-versity of Michigan IRB-Health Committee.

The data presented here are from 63 cases and 182 con-trols. As part of the study, participants complete a writtenquestionnaire describing their residential mobility his-tory. The duration of residence and exact street addresswere obtained, otherwise the closest cross streets wereprovided. Each residence in the study area was geocodedand assigned a geographic coordinate in ArcGIS; resi-dences outside the study area were not geocoded. Partici-pants resided at 1004 homes within the study area, withtime spent averaging 64% of their lifetimes. Residenceswithin the study area were successfully geocoded: 76%automatically matched using ArcGIS settings of spellingsensitivity equal to 75, minimum candidate score equal to10, and a minimum match score equal to 60. Theunmatched addresses were manually matched using crossstreets with the assistance of internet mapping services(15%). If cross streets were not provided, best informedguess placed the address on the road (5%), and as a lastresort, residence was matched to town centroid (4%).

Industrial histories have also been collected for the studyarea, and will be explored to explain local clustering.Industries reported to or believed to emit contaminantsthat have been associated with bladder cancer were iden-tified using the Toxics Release Inventory [52] and theDirectory of Michigan Manufacturers (Manufacturer Pub-lishing Co., 1946, 1953, 1960, 1969, 1977, 1982). Stand-ard Industrial Classification (SIC) codes were adopted,but prior to SIC coding, industrial classification titles wereselected. Characteristics of 268 industries, including, butnot limited to, fabric finishing, wood preserving, pulpmills, industrial organic chemical manufacturing, andpaint, rubber, and leather manufacturing, were compiledinto a database. Industries were geocoded following thesame matching procedure as described for residences:89% matched to the address, 5% were placed on the roadusing best informed guess, and as a last resort, 6% werematched to town centroid. Each industry was assigned astart year and end year, based on best available data. Thedata on these industries is used to demonstrate thefocused versions of the Q statistics.

ResultsAt the time of this writing, geocoding and data collectionare ongoing; hence the results reported in this manuscriptare entirely preliminary and should not be used to drawany conclusions regarding the spatial patterns of bladdercancer in Michigan. The analysis undertaken in the man-uscript is provided only as an example application of thenew Q statistics.

To demonstrate the methods we implemented the localand global Q statistics for clustering of residential histo-ries, specifically the local test at time t, Qi,k,t (Equations 6),and its global counterpart Qk,t (Equation 10). We alsoimplemented the local test for clustering of residential his-tories through time Qi,k (Equation 13), and the global testfor clustering of residential histories Qk (Equation 11). Wealso were concerned with possible clustering of cases nearthe industrial facilities, and evaluated this using thefocused test at time t QF,k,t (Equation 15) as well as thefocused test through time QF,k (Equation 16). In additionwe programmed the duration-weighted versions of thesestatistics, and for the focused tests we also employed expo-sure weights calculated using the inverse rank distance(Equation 19).

Results for QktThese techniques were implemented in TerraSeer's STISsoftware using the Application Programmer's Interface.This allowed us to create a methods dynamic linkedlibrary with our new techniques that we then invokedusing an automatically generated dialog. Time animatedmaps of the places of residence of the cases and controls,and of the changing geography of the municipal water

′ = ′ ( )=∑Q QF k

EF k tE

t

T

, , ,0

Equation 41

′ = ( )=∑Q w c eF k w

Et F j t

j

N

F j k t j jt, , , , , , ,ω η

1

Equation 42

′ = ′ ( )=∑Q QF k

EF kE

t

T

t, , ,ω

ω0

Equation 43

Page 12 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

supplies, were constructed using STIS (Figure 3 [see Addi-tional file 2] [see Additional file 3]). These display thechanging geography of the cases and controls as theymove from one place to another, alterations in the geog-raphy of the municipal water supplies as they arefounded, expand and merge, as well as township bound-aries. To verify the methods we compared results using theQ statistics to those obtained using Cuzick and Edward'stest in the ClusterSeer software. Specifically, we used STISto calculate the Qkt statistics through time and thenexported the data for July 1, 1969. We choose this timepoint, because Qkt reached a local peak of Qkt = 77 that wasstatistically significant (see Figure 1). The Cuzick andEdward's test in ClusterSeer returned T5 = 77, confirming

the results from STIS. As noted earlier, Cuzick andEdward's test is a special case of the Q-statistic for the glo-bal test at time t, Qkt. Note that Qkt is calculated as the sumof the local Q statistics at time t, Qikt, and thereby providesverification that the statistic Qikt, from which the family ofQ statistics is derived, is being calculated correctly. Wemust remind the reader that these results are highly pre-liminary and that data collection is incomplete. In fact,and as noted later in the Discussion, it is likely theobserved clustering in these data is due to the geographicordering in which the data are being collected. Nonethe-less, this example demonstrates how plots of the Qkt statis-tics may be used to evaluate geographic case clustering ofresidential histories.

Still from the animation of residential histories of cases, controls and industry in southeastern Michigan (additional file 2 and additional file 3).Figure 3Still from the animation of residential histories of cases, controls and industry in southeastern Michigan (addi-tional file 2 and additional file 3).

Page 13 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

Results for

The results reported above were not time standardized.We therefore undertook an analysis using the time-stand-

ardized version of Qkt called as per Equation A4.

This expresses the amount of clustering at a given timeinterval in cases per unit time period. STIS reports timesdown to the second, hence results are recorded in personseconds. Figure 4 shows a similar overall increasing trendbut also a greater variability in the value of the Q statisticthrough time. This is driven both by the increased numberof cases through time and also by differences in the dura-tions between movement events. When these sources ofvariability are accounted for we find episodic case cluster-ing in approximately the same time intervals as found forthe not time weighted statistic.

Results for Qk

Having found some of the Qkt and statistics to bestatistically significant, the question then arises as towhether there is overall global clustering given themultiple time points evaluated. To accomplish this weused the global Qk test under the randomization proce-dure that holds the residential histories of the cases andcontrols as given and then allocates the na case and nb con-trol identifiers across the N residential histories. Weaccomplished this randomization 99 times in the STISwith a resulting p-value of 0.01, and concluded there wasglobal clustering in the residential histories.

Results for Qi,k to evaluate Clustering of Residential Histories

The statistics Qkt and are sensitive to a clustering of

cases relative to the controls, and are evaluated at each ofthe T+1 time points in the set of residential histories. Wealso can ask, whether residential histories of the cases clus-ter near the residential histories of other cases by using thestatistics Qi,k (Equation 13) and its duration-weighted ver-

sion (Equation A6). Since our analysis above demon-

strated the results are not overly sensitive to durationweighting, we report results only for the not-weightedtests. This test will associate a statistic and a p-value witheach residential history. A map of the residential historieson April 12, 1997 is shown in Figure 5. Note the two reddots that denote the place of residence of the two caseswith statistically significant clustering of residentialhistories. Over the entire time span of the study, these twocases tend to be surrounded by residential histories ofother cases, rather than the residential histories of con-trols. Because of residential mobility, the two red dotsmove about through time. This animation is quite com-pelling in the STIS and is approximated by the simpler

animation in Figure 3. Note the animation in Figure 3 issampled from the complete animation created when run-ning the STIS software. This is necessary to create .avi filesof small enough size for effective posting on the internet.Periods in which a red dot disappears from the animationdenote time periods when that individual moved out ofthe study area. It is important to note that we have notadjusted these local tests for the multiple testing in themany spatial locations that were evaluated. However, theglobal Qk statistic was statistically significant and the largelocal statistics observed for the two red dots reference thetwo residential histories that contributed the most to theglobal Qk.

Focused clusteringTo demonstrate the use of the focused versions of the Qstatistic we analyzed possible clustering of the residentialhistories of cases near the 268 industrial facilities that pro-duced compounds thought to be putative carcinogens forbladder cancer. We undertook two sets of analyses usingQF,k (Equation 16). The first evaluated focused clusteringof residential histories using k = 5 nearest neighbors. Thesecond only considered those nearest neighbors within 1kilometer of the focus.

When considering the 5 nearest neighbors to each indus-try, 24 of the 268 industrial facilities had p-values lessthan 0.05. Thus under the null hypothesis that each per-son in the study had an equal probability of being labeleda case, these 24 candidate foci had a significant excess ofcases among each of their five nearest neighbors, at leastat the nominal 0.05 level. Notice that at the 0.05 level, wewould have expected 13.4 foci to be significant under thisnull hypothesis. Using an experiment-wise errorapproach, and a 5% critical value, the adjusted alpha levelof the test is 0.000187 using the Bonferonni correction,and is 0.000191 using Sidak's multiplicative inequality.Using 49,999 randomizations, we were able to resolve p-values as small as 0.00005. None of these industriesproved to be statistically significant foci once multipletesting was accounted for.

We also used the distance-based approach consideringthose neighbors within 4,000 m of each industrial facility.Under this approach, 10 industrial facilities had p-values< 0.05, but none of these were significant once multipletesting was accounted for.

DiscussionThis paper presented a new approach for evaluating casecontrol clustering of residential histories. To date and toour knowledge, almost all case control cluster tests rely onthe static view, analyzing clustering at one point in time orindependently at several points in time. By using themathematical construct of a residential history in

Qk wt,

Qk wo,

Qk wo,

Qk wo,

Qi k,ω

Page 14 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

through timeFigure 4

through time. Graph of (top) and its probability (bottom) through time for k = 5. is the time weighted

version of Qtki and is expressed in case-seconds.

0.00E+00

1.00E+09

2.00E+09

3.00E+09

0

0.2

0.4

0.6

25-Nov-21 4-Aug-35 12-Apr-49 20-Dec-62 28-Aug-76 7-May-90 14-Jan-04

Qk wt, Qk wt, Qk wt,

Page 15 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

Equation 1, and the notion of super sets of proximitymatrices (Equation 5) to represent the changing geometryof place of residence, we have derived local, global andfocused tests that are realistic in the sense that they quan-tify human residential mobility.

The results of the analysis of the bladder cancer data areentirely preliminary, and should not be interpreted toreach any inferences or conclusions regarding case-controlclustering of bladder cancer in Michigan. At the time ofthis writing we believe statistically significant spatial clus-tering of cases is the result of a geographic pattern in thetemporal order in which cases are reported. Because of

recent implementation of the HIPPA (Health InsurancePortability and Accountability Act) legislation, The Uni-versity of Michigan hospital systems had been unwillingto release case data until its official position on theserequirements was completely formulated. As a result,bladder cancer cases that were treated at the University ofMichigan hospitals are only now being recruited to thestudy's data set. Because many of these cases are from thesurrounding environs of Washtenaw and Livingstoncounties, the data set analyzed in this paper has a deficitof cases in these areas. Selection of controls employs apopulation sample using random digit dialing, andappropriately represents the entire study area. As a result,

Map of cases and controls on 4/12/1997Figure 5Map of cases and controls on 4/12/1997. Map of cases and controls on 4/12/1997. Cases are shown as dots within a circle, controls are shown as crosses. The two cases whose residential histories tend to be surrounded by the residential histories of other cases are shown in red.

Page 16 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

there is a deficit of cases in Washtenaw and Livingstoncounties, and a concomitant clustering of cases in the bal-ance of the study area. Further investigation of any blad-der cancer clusters would also entail including knownbladder cancer risk factors, such as cigarette smoking andoccupational exposure history, in the analysis. We intendto revisit this analysis once the data set is complete.

Selection of controls through random digit dialing canintroduce bias, since not everyone is equally likely to beselected due to different numbers of phones and the like-lihood of answering the phone. While such bias might bereduced by first selecting a census block group based oncensus numbers, adjusted for age and gender, and thendoing random digit dialing within that block group, sucha procedure has the potential of over-matching on expo-sure [53]. This would make it very difficult to detect anyspatial pattern that arises at a spatial scale greater than theblock group. In this study we chose not to match on geog-raphy because some of the exposures of interest display ageographic pattern and over-matching on exposure was apossibility. These exposures include regional patterns inarsenic concentration in drinking waters associated withsurficial geology and regional differences in householdwater supply sources [54,55].

Southeastern Michigan includes rural farming areas aswell as portions of metropolitan Detroit, and differentialresponse rates under random digit dialing are a concern.We attempted to ensure these areas do not have differen-tial response rates by comparing addresses of respondersand non-responders in age-weighted lists.

Exposures in early life and over an individual's life coursemay be important risk factors for the onset of cancer[56,57,34], thereby impacting both the date of diagnosisand latency period. But how can such risk factors beaccounted for in exposure trace analysis? We need toexplicitly model the latency period to take into accountnot only exposures of direct interest (arsenic in our exam-ple) but also additional risk factors (such as smoking) thatmight decrease the latency period and accelerate diseaseonset. Many common epidemiological risk-disease meas-ures (e.g. odds ratio) are concerned with whether an expo-sure occurred, rather than with when it occurred, and arethus of little use for estimating relationships between tim-ing of exposure and disease onset [58]. For cohort analy-ses, Robins and Greenland[59] argued that, whenconditioned on age, Years of Life Lost (YLL) due to earlyexposures cannot be estimated without bias in theabsence of causal models for how exposure causes death.This result was demonstrated analytically by Morfeld [60],who developed a framework for causal thinking in epide-miology, and applied it to evaluate the estimability of YLLand related measures. Candidate causal modeling

approaches cited by Morfeld include Robin's G-estima-tion procedure [61,62] which can be used to estimate thetime period between exposures and outcomes such asdeath, and thus appear promising for incorporating cov-ariates into models of the latency period. Applications ofG-estimation and the YLL procedures of Robins [61] inexposure trace modeling is thus an important futureresearch direction.

Discussion of the type of spatial metric to use (nearestneighbor, adjacency, or geographic distance-based) aswell as the number of k nearest neighbors to analyze iswarranted. The approaches detailed in this paper are gen-eral in the sense that weights such as inverse distance andadjacency could be used in place of k-nearest neighborrelationships in Equation 4. We chose to work with near-est neighbor measures because we've found them to bemore powerful than adjacency- and distance-based meas-ures in some situations (e.g. [63]). As noted earlier in thispaper, we used k = 5 because we found in the past that spa-tial clustering under nearest neighbor methods often maybe detected at that level of k. Such a justification issufficient in analyses conducted purely for demonstrationpurposes but is deficient in applied settings. In practice,two approaches may be used, which we call a priori andexploratory. When prior information is available on thescale of clustering this can be used to select a specificnumber of nearest neighbors to explore. Hence if onewishes to detect clusters of five individuals one might setk = 5. When such prior information is lacking an explora-tory approach may be used in which several levels of k areanalyzed, and probabilities from the analyses must thenbe adjusted to account for multiple testing [63].

We could not demonstrate each of the statistics developedin this paper, due to both data and space constraints. Wenote that exposure traces could be implemented to repre-sent cases and controls of similar ages, in addition tothose at a point in time. For example, a researcher maywish to determine whether cases cluster together whenthey were children, irrespective of year, thereby indicatingearly-lifetime vulnerability to an environmental exposurein the area. These clustering tools thus can be used to dis-play cancer clusters of similarly-aged participants, as wellas clusters based on the years a participant lived at aresidence. In this manner, clusters of children can beinvestigated, whether they are born in the same genera-tion or born in different generations.

ConclusionIn conclusion, the methods presented in this paperaccount for residential mobility and are thus far morerealistic than existing tests that are founded on static geo-graphic representations. They thus are preferred over clus-tering methods that ignore human mobility. The

Page 17 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

techniques demonstrated in this paper have been pro-grammed in a dynamic linked library that can be obtainedfrom the first author and used in conjunction with a STIS.

List of AbbreviationsGIS: Geographic Information System

HIPPA: Health Insurance Portability and AccountabilityAct

HIV: Goovaert and Jacquez's [45] neutral model Type IV

HVI: Goovaert and Jacquez's [45] neutral model Type IV

IRB: Institutional Review Board

LISA: Local Indicators of Spatial Autocorrelations

MCMC: Markov Chain Monte Carlo

STIS: Space-Time Intelligence System

SIC: Standard Industrial Classification code

YLL: Years of Life Lost

Competing interestsGeoffrey Jacquez is President of BioMedware, the softwarecompany that is developing the STIS software.

Authors' contributionsGJ derived the methods and drafted the majority of thismanuscript. He also accomplished the analysis of thebladder cancer data set. AK programmed and tested thestatistical methods in the STIS software. GA and JM pro-vided data and wrote the data set description. PG wrotethe sections on geostatistical weighting functions for thefocused tests. JN is Principal Investigator on the R01project that is collecting the bladder cancer data set.

Additional material

AcknowledgementsThis study was supported by grant R01 CA96002–10, Geographic-Based Research in Cancer Control and Epidemiology, from the National Cancer Institute. Development of the STIS software was funded by grants R43 ES10220 from the National Institutes of Environmental Health Sciences and R01 CA92669 from the National Cancer Institute. The critical comments and suggestions of Martin Kulldorff, Thomas Webster, and Al Ozonoff greatly improved the manuscript. Access to cancer case records was pro-vided by Michigan Cancer Surveillance Program within the Division for Vital Records and Health Statistics, Michigan Department of Community Health. The authors thank the Michigan Public Health Institute for conducting the telephone interviews and Stacey Fedewa and Lisa Bailey for entering writ-ten surveys into a database. Thanks to Wanda Angelomatis and Fred Wal-lace who hosted the first author and his daughter for two weeks at Berkenhead Lake in British Columbia where these methods were originally formulated.

References1. Collia DV, Sharp J, Giesbrecht L: The 2001 National Household

Travel Survey: a look into the travel patterns of olderAmericans. J Safety Res 2003, 34(4):461-70.

2. Klepeis NE, Nelson WC, Ott WR, Robinson JP, Tsang AM, Switzer P,Behar JV, Hern SC, Engelmann WH: The National Human Activ-ity Pattern Survey (NHAPS): a resource for assessing expo-sure to environmental pollutants. J Expo Anal Environ Epidemiol2001, 11:231-252.

3. Reuscher TR, Schmoyer RL, Hu PS: Transferability of NationwidePersonal Transportation Survey data to regional and localscales. Transport Res Rec 2002, 1817:25-32.

4. Sabel CE, Boyle PJ, Löytönen M, Gatrell AC, Jokelainen M, FlowerdewR, Maasilta P: Spatial clustering of amyotrophic lateral sclero-sis in Finland at place of birth and place of death. Am J Epidemiol2003, 157:898-905.

5. Syndromic surveillance in public health practice, New YorkCity [http://www.cdc.gov/ncidod/EID/vol10no5/03-0646.htm]

6. Mather FJ, Whited LE, Langlois EC, Shorter CF, Swalm CM, Shaffer JG,Harley WR: Statistical methods for linking health, exposureand hazards. Environ Health Perspect 2004, 112:1440-1445.

7. Pickle LW, Waller LA, Lawson AB: Current practices in cancerspatial data analysis: a call for guidance. Int J Health Geogr inpress.

8. Boscoe FP, Ward MH, Reynolds P: Current practices in spatialanalysis of cancer data: data characteristics and data sourcesfor geographic studies of cancer. Int J Health Geogr 2004, 3:28.

9. Goodchild M: GIS and transportation: status and challenges.GeoInformatica 2000, 4:127-139.

10. AvRuskin GA, Jacquez GM, Meliker JR, Slotnick MJ, Kaufmann A,Nriagu JO: Visualization and exploratory analysis of epidemi-ologic data using using a novel space time informationsystem. Int J Health Geogr 2004, 3:26.

11. Jacquez GM, Greiling D, Kaufmann A: Design and implementa-tion of a Space-Time Intelligence System for diseasesurveillance. J Geograph Syst 2005, 7:7-23.

12. Greiling DA, Jacquez GM, Kaufmann AM, Rommel RG: Space timevisualization and analysis in the Cancer Atlas Viewer. J Geo-graph Syst 2005, 7:67-84.

13. Meliker J, Slotnick M, AvRuskin GA, Kaufmann A, Jacquez GM, NriaguJO: Improving exposure assessment in environmental epide-

Additional File 1Appendix This file contains the article's Appendix.Click here for file[http://www.biomedcentral.com/content/supplementary/1476-069X-4-4-S1.doc]

Additional File 2Animation for Figure 3in QuickTime This is the animation for Figure 3 in QuickTime format.Click here for file[http://www.biomedcentral.com/content/supplementary/1476-069X-4-4-S2.mov]

Additional File 3Animation for Figure 3as an animated GIF This is the animation for Figure 3 in GIF format.Click here for file[http://www.biomedcentral.com/content/supplementary/1476-069X-4-4-S3.gif]

Page 18 of 19(page number not for citation purposes)

Environmental Health: A Global Access Science Source 2005, 4:4 http://www.ehjournal.net/content/4/1/4

miology: application of a spaho-temp oral visualization tools.J Geograph Syst 2005, 7:49-66.

14. Miller H: A measurement theory for time geography. GeogrAnal 2005, 37:17-45.

15. Gustafson EJ: Quantifying landscape spatial pattern: What isthe state of the art? Ecosystems 1998, 1:143-156.

16. Waller LA, Jacquez GM: Disease models implicit in statisticaltests of disease clustering. Epidemiology 1995, 6(6):584-590.

17. Lawson AB, Kulldorff M: A review of cluster detection methods.In Advanced Methods of Disease Mapping and Risk Assessment for PublicHealth Decision Making Edited by: Lawson A, Bohning D, Lesaffre E,Viel J-F, Bertollini R. London: Wiley; 1999:99-110.

18. Jacquez GM, Grimson R, Waller L, Wartenberg D: The analysis ofdisease clusters Part 2: Introduction to techniques. Infect Con-trol Hosp Epidemiol 1996, 17:385-397.

19. Jacquez GM, Waller L, Grimson R, Wartenberg D: The analysis ofdisease clusters Part I: State of the art. Infect Control HospEpidemiol 1996, 17:319-327.

20. Moran PA: Notes on continuous stochastic phenomena.Biometrika 1950, 37:17-23.

21. Ord JK, Getis A: Local spatial autocorrelation statistics: Distri-butional issues and an application. Geogr Anal 1995, 27:286-306.

22. Lawson AB: Score tests for detection of spatial trend in morbidity dataDundee: Dundee Institute of Technology; 1989.

23. Waller LA, Turnbull BW, Clark LC, Nasca P: Chronic disease sur-veillance and testing of clustering of disease and exposure:Application to leukemia incidence and TCE-contaminateddumpsites in upstate New York. Environmetrics 1992, 3:281-300.

24. Lawson AB, Waller LA: A review of point pattern methods forspatial modelling of events around sources of pollution. Envi-ronmetrics 1996, 7:471-487.

25. Kulldorff M, Athas WF, Feuer E, Miller B, Key C: Evaluation of clus-ter alarms: A space-time scan statistic and brain cancer inLos Alamos. Am J Public Health 1998, 88:1377-1379.

26. Zhan FB: Are deaths from liver cancer, kidney cancer, andleukemia clustered in San Antonio? Tex Med 2002, 98:51-6.

27. Roche LM, Skinner R, Weinstein RB: Use of a geographic infor-mation system to identify and characterize areas with highproportions of distant stage breast cancer. J Public Health ManagPract 2002, 8:26-32.

28. Gregorio DI, Kulldorff M, Barry L, Samociuk H: Geographic differ-ences in invasive and in situ breast cancer incidence accord-ing to precise geographic coordinates, Connecticut, 1991–95. Int J Cancer 2002, 100:194-8.

29. Jemal A, Kulldorff M, Devesa SS, Hayes RB, Fraumeni JF: A geo-graphic analysis of prostate cancer mortality in the UnitedStates. Int J Cancer 2002, 101:168-174.

30. Thomas A, Carlin BP: Late detection of breast and colorectalcancer in Minnesota counties: An application of spatialsmoothing and clustering. Stat Medicine 2003, 22:113-127.

31. Hsu CE, Jacobson HE, Mas FS: Evaluating the disparity of femalebreast cancer mortality among racial groups; a spatiotem-poral analysis. Int J Health Geogr 2004, 3:4.

32. Vieira V, Webster T, Weinberg J, Aschengrau A, Ozonoff D: Spatialanalysis of lung, breast and colorectal cancer on Cape Codusing generalized additive modeling [abstract]. Epidemiology2002, 13:s202.

33. Vieira V, Webster T, Aschengrau A, Ozonoff D: A method for spa-tial analysis of risk in a population-based case-control study.Int J Hyg Environ Health 2002, 205:115-120.

34. Han D, Rogerson PA, Nie J, Bonner MR, Vena JE, Vito D, Muti P, Tre-visan M, Edge SB, Freudenheim JL: Geographic clustering of resi-dence in early life and subsequent risk of breast cancer(United States). Cancer Causes Control 2004, 15:921-929.

35. Hagerstrand T: What about people in regional science? Pap RegSci 1970, 24:7-21.

36. Hornsby K, Egenhofer M: Modeling moving objects over multi-ple granularities, special issue on Spatial and TemporalGranularity. Ann Math Artif Intell 2002, 36:177-194.

37. Miller HJ: Modeling accessibility using space-time prism con-cepts within geographical information systems. IJGIS 1991,5:287-301.

38. Huisman O, Forer P: Computational agents and urban lifespaces: A preliminary realization of the time-geography ofstudent lifestyles. In Proceedings of the 3rd International Conference onGeoComputation University of Bristol, UK; 1998.

39. Moore AB, Whigham P, Holt A, Aldridge C, Hodge K: A time geog-raphy approach to the visualisation of sport. Proceedings of the7th International Conference on GeoComputation University of Southamp-ton, United Kingdom, CD-ROM . 8 – 10 September 2003

40. Kwan MP: Extensibility and individual hybrid accessibility inspace time: A multi scale representation using GIS. In Informa-tion, Place and Cyberspace: Issues in Acessibility Edited by: Janelle DG,Hodge DC. New York: Springer; 2000.

41. Kwan MP, Janelle DG, Goodchild MF: Accessibility in space andtime: A theme in spatially integrated social science. J GeographSyst 2003, 5:1-3.

42. Sinha G, Mark DM: Measuring similarity between geospatiallifelines in studies of environmental health. J Geograph Syst2005, 7:115-136.

43. Cuzick J, Edwards R: Spatial clustering for inhomogeneouspopulations. J R Stat Soc [Ser B] 1990, 52:73-104.

44. Jacquez GM: Cuzick and Edwards test when exact locationsare unknown. Am J Epidemiol 1994, 140:58-64.

45. Goovaerts P, Jacquez G: Accounting for regional backgroundand population size in the detection of spatial clusters andoutliers using geostatistical filtering and spatial neutral mod-els: the case of lung cancer in Long Island, New York. Int JHealth Geogr 2004, 3:14.

46. Liebisch N, Jacquez GM, Goovaerts P, Kaufmann A: New methodsto generate neutral images for spatial pattern recognition. InLect Notes Comput Sc Volume 2478. Springer-Verlag Berlin Heidelberg;2002:181-195.

47. Simes RJ: An improved Bonferroni procedure for multipletests of significance. Biometrika 1986, 73:751-4.

48. Small MJ, Nunn AR, Forslund BL, Daily DA: Source attribution ofelevated residential soil lead near a battery recycling site.Environ Sci Tech 1995, 29:883-895.

49. Goovaerts P: Geostatistics for Natural Resources Evaluation New York:Oxford University Press; 1997.

50. Saito H, Goovaerts P: Accounting for source location and trans-port direction into geostatistical prediction of contaminants.Environ Sci Tech 2001, 35:3924-3930.

51. Goovaerts P, Van Meirvenne M: Delineation of hazardous areasand additional sampling strategy in presence of a location-specific threshold. In geoENV III – Geostatistics for EnvironmentalApplications Edited by: Soares A, Gomez-Hernandez J, Froidevaux R.Dordrecht: Kluwer Academic Publishers; 2001:125-136.

52. Toxics Release Inventory (TRI) Data Files 2000 [http://www.epa.gov/tri/tridata/tri00/data/index.htm].

53. Agudo A, Gonzalez CA: Secondary matching: A method forselecting controls in case-control studies on environmentalrisk factors. Int J Epidemiol 1999, 28(6):1130-1133.

54. Goovaerts P, Avruskin G, Meliker J, Slotnick M, Jacquez G, Nriagu J:Geostatistical Modeling of the Spatial Variability of Arsenicin Groundwater of Southeast Michigan. Water Resour ResAccepted .

55. Kolker A, Haack SK, Cannon WF, Westjohn DB, Kim MJ, Nriagu J,Woodruff LG: Arsenic in southeastern Michigan. In Arsenic inGround Water Edited by: Welch AH, Stollenwerk KG. Norwell, Mas-sachusetts: Kluwer Academic Publishers; 2003.

56. Barker DJP: Fetal and Infant Origins of Adult Disease London: BMJPublishing; 1992.

57. Kuh D, Ben-Shlomo Y: A Life Course Approach to Chronic Disease Epide-miology: Tracing the Origins of Ill-health from Early to Later Life Oxford:Oxford University Press; 1997.

58. Morfeld P: Letter to the Editor. An alternate characterizationof hazard in occupational epidemiology: Years of life lost peryears worked. Am J Ind Med 2003, 43:332-333.

59. Robins JM, Greenland S: Estimability and estimation ofexpected years of life lost due to a hazardous exposure. StatMedicine 1991, 10:79-93.

60. Morfeld P: Years of Life Lost due to Exposure – Causal Con-cepts and Empirical Shortcomings. Epidemiol Perspect Innov2004, 1:5.

61. Robins JM: Causal inference from complex longitudinal data.In Latent variable modeling with applications to causality Edited by: Ber-kane M. New York: Springer; 1997:69-117.

62. Rothmann KJ, Greenland S: Modern Epidemiology 2nd edition. Philadel-phia: Lippincott-Raven; 1998.

63. Jacquez GM: A k-nearest neighbor test for space-timeinteraction. Stat Medicine 1996, 15:1934-1949.

Page 19 of 19(page number not for citation purposes)