shape and scale in detecting disease clusters
University of Iowa
Iowa Research Online
Theses and Dissertations
2008
Shape and scale in detecting disease clusters
Soumya Mazumdar
University of Iowa
Copyright 2008 Soumya Mazumdar
This dissertation is available at Iowa Research Online: http://ir.uiowa.edu/etd/208
Follow this and additional works at: http://ir.uiowa.edu/etd
Part of the Geography Commons
Recommended Citation
Mazumdar, Soumya. "Shape and scale in detecting disease clusters." PhD (Doctor of Philosophy) thesis, University of Iowa, 2008.
http://ir.uiowa.edu/etd/208.
SHAPE AND SCALE IN DETECTING DISEASE CLUSTERS
by
Soumya Mazumdar
An Abstract
Of a thesis submitted in partial fulfillment of the requirements for the Doctor of
Philosophy degree in Geography in the Graduate College of
The University of Iowa
December 2008
Thesis Supervisor: Professor Gerard Rushton
ABSTRACT
This dissertation offers a new cluster detection method. This method looks at the
cluster detection problem from a new perspective. I change the question of "What do real
clusters look like?" to the questions "What do spurious clusters look like?" and "How
do spurious clusters affect the ability to recover real clusters?" Spurious clusters can be
identified from their geographical characteristics. These are related to the spatial
distribution of people at risk, the shape and scale of the geographic units used to
aggregate these people, the shape and scale of the spatial configurations that the disease
mapping or cluster detection method may impose on the data and the shape and scale of
the area of increased risk. The statistical testing process may also create spurious clusters.
I propose that the problem of spurious clusters can be resolved using a computational
geographic approach. I argue that Monte Carlo simulations can be used to estimate the
patterns of spurious clusters in any situation of interest given knowledge of the first three
of these four determinants of spurious clusters. Then, given these determinants, where
real measurements of disease or mortality are known, it is possible to show those areas of
increased risk that are true clusters as opposed to those that are spurious clusters. This
distinction is made in a three dimensional signature space, with shape, size and rate as the
three axes. The extent of similarity (or dissimilarity) of a cluster to the simulated spurious cluster influences whether it can be recovered. These experiments show that this method
is successful in detecting clusters. This method can also predict with reasonable certainty
which clusters can be recovered, and which cannot. I compare this method with
Rogerson's Score statistic method. These comparisons expose the weaknesses of
Rogerson's method. Finally, these two methods and the Spatial Scan Statistic are applied
to searching for possible clusters of prostate cancer incidence in Iowa. The implications
of the findings are discussed.
SHAPE AND SCALE IN DETECTING DISEASE CLUSTERS
by
Soumya Mazumdar
A thesis submitted in partial fulfillment of the requirements for the Doctor of
Philosophy degree in Geography in the Graduate College of
The University of Iowa
December 2008
Thesis Supervisor: Professor Gerard Rushton
Graduate College
The University of Iowa
Iowa City, Iowa
CERTIFICATE OF APPROVAL
PH.D. THESIS
This is to certify that the Ph.D. thesis of
Soumya Mazumdar
has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Geography at the December 2008 graduation.
Thesis Committee:
Gerard Rushton, Thesis Supervisor
David Bennett
Naresh Kumar
Marc Linderman
Dale Zimmerman
ACKNOWLEDGMENTS
I would like to acknowledge the help I have received during the course of my stay
in Iowa. I would like to thank Dr Rushton for supervising my research. I would also like
to thank my committee members for their contributions. The last four years of my life
have been emotionally challenging for me. I thank the great masters before us who have
helped me through. I am thankful to the writings of M. Scott Peck, Viktor Frankl, Swami
Vivekananda, and the yogic practices of Sri Sri Ravishankar at the Art of Living Foundation.
I would also like to thank my family members, especially my mom, mishtimashi and the late
Dr Mazumdar, for their support. Thanks are also due to all my friends and well-wishers.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER
1. DETECTING CLUSTERS OF DISEASE: INVESTIGATING SPURIOUS CLUSTERS
   1.1 Statement of Purpose
   1.2 Introduction
   1.3 Organization of the dissertation
   1.4 Review of existing methods of cluster detection
      1.4.1 Map data without further geographic processing
         1.4.1.1 Methods that do not smooth the data
         1.4.1.2 Methods that smooth the data
      1.4.2 Methods that pre-process the data before calculating and/or testing for significant disease risk
         1.4.2.1 Non combinatorial approaches
         1.4.2.2 Combinatorial approaches
         1.4.2.3 Hybrid approaches
      1.4.3 Significance testing and spurious clusters
      1.4.4 Identifying spurious clusters and distinguishing true clusters from spurious clusters
         1.4.4.1 The spatial distribution of the locations of people in the map
         1.4.4.2 The scale and spatial configuration of the geographic units that are used to aggregate data into discrete small areas
      1.4.5 Identifying spurious clusters and distinguishing true clusters from spurious clusters
      1.4.6 Why use size, shape and rate
2. THE SHAPE SIZE SENSITIVE (S.S.S) METHOD FOR DETECTING DISEASE CLUSTERS
   2.1 Theoretical foundations of the S.S.S method
   2.2 Hypothesis testing
   2.3 The simulated dataset
      2.3.1 Hypothetical study area and population
      2.3.2 Hypothetical case population
      2.3.3 Datasets under the null hypothesis of no clustering
      2.3.4 Extracting the cluster candidates
      2.3.5 Datasets under the alternative hypothesis of clustering
         2.3.5.1 Rationale behind the choice of these configurations of synthetic clusters
   2.4 Rogerson's Score Statistic
      2.4.1 Theory
   2.5 Diagnostics
   2.6 Computational Scheme
   2.7 Results
   2.8 Discussions and future directions
3. INVESTIGATING THE SPATIAL PATTERNS OF PROSTATE CANCER IN IOWA
   3.1 Background
   3.2 Methods
   3.3 Results
   3.4 Discussion
   3.5 Conclusion
   3.6 Contribution that this dissertation makes to the geography literature
REFERENCES
LIST OF TABLES
Table
2.1 Hold one validation for null hypothesis.
2.2 Hold one validation for alternative hypothesis.
2.3 Summary statistics of the simulated 3675 spurious clusters.
2.4 Shape, size, risk (signature) and the ability to recover simulated clusters.
2.5 The table illustrates the average sensitivity (ability to detect a cluster when it exists) and specificity (ability to classify an area that is not a cluster as such).
2.6 This table compares the sensitivity and specificity with which clusters are recovered for the S.S.S method and Rogerson's method; the higher the sensitivity, the better the cluster is recovered.
2.7 Cluster recovery using only rates and only shapes.
2.8 How do true clusters differ in shape and size from spurious clusters.
LIST OF FIGURES
Figure
1.1 This figure displays the statistical significance of accidents per square kilometer (a p-map over densities), where accidents have been randomly scattered across the study area. A 30 meter grid was laid over the entire study area and a 600 meter filter was used to estimate the accident densities. The black areas are significant noisy clusters.
1.2 This figure displays a spurious cluster detected by Duczmal's Simulated Annealing based SaTScan method. This cluster has a high, statistically significant likelihood value.
1.3 In the geographic area, 42 people are distributed over a uniform grid. Each circle represents an individual. They are color coded white to indicate that they are healthy.
1.4 A noise or spurious cluster generating process operates at the scale of the entire geographical area. No person is at a greater risk of disease than any other. All people are at a risk of 0.24. Diseased people are randomly distributed over the map. These diseased people are color coded black to indicate a diseased state.
1.5 A boundary is drawn around those people who are diseased. This represents our gerrymandered cluster. Note the highly irregular and large shape of the cluster.
1.6 In contrast to 1.4, a cluster generating process operates on this geographic area. The cluster generating process predisposes the people living in the area bound by the dotted lines to a greater risk than other areas of the map. These people are at a risk of 0.56. In one realization of the process, a cluster of 10 people is therefore diseased in this area.
1.7 The cluster is then enclosed within a boundary. Note the relatively regular shape of the cluster (compared to a random distribution of diseased people).
1.8 People are distributed non uniformly over space.
1.9 The entire geographic space is subject to the same risk (0.24) noise generating process. The resulting 10 diseased people and the gerrymandered cluster are shown.
1.10 The cluster generating process in Figure 1.6 operates on the inhomogeneously distributed population. The risk elevation is the same as in Figure 1.6 (0.56). This causes 8 people to fall ill from an at-risk population of 14.
1.11 The estimated cluster shape and size is very different from what the shape and size of the cluster is in reality (the dotted line in Figure 1.10). It is also very different from what was obtained for a homogeneous distribution of people in Figure 1.6.
1.12 Now a cluster generating process operates on this space. The white river within the dotted lines is the area of excess risk. People living within this area are at an excess risk of disease.
1.13 Assuming an inhomogeneous distribution of people as in Figure 1.8 and a risk elevation of 0.71, we see that a certain number of people (10) within the area of excess risk are diseased.
1.14 The gerrymandered cluster now encloses the diseased people. Note the highly irregular and large shape of this cluster.
1.15 Two cluster generating processes of circular shape and risk elevation of 0.75 operate on a homogeneous distribution of people.
1.16 The clusters that are estimated from this have the same triangular shape. This is highly unlikely in reality.
1.17 In this example a slightly larger area of increased risk is considered than in the earlier example. 6 people in each of the two clusters are subject to a risk of 0.5, which results in 3 of them becoming cases/falling ill.
1.18 The clusters that are generated have very different shapes. In fact the larger the area of increased risk, the greater the number of possible shapes and sizes of the estimated cluster.
1.19 In this example people are inhomogeneously distributed. The same cluster generating process in Figure 1.15 gives rise to two circular areas of increased risk where the risk elevation is 0.5.
1.20 The two clusters generated have very different shapes. There is no configuration of cases within the clusters for which two estimated clusters could have the same shape.
2.1 Using echelons to extract cluster candidates.
2.2 A set of 50,000 cardiovascular disease mortality cases are randomly distributed by population weights to each of 942 ZCTAs in the state of Iowa. A pattern is then extracted using Spatial Filtering. The pattern is binarized, and the resulting polygon cluster candidates are extracted using a GIS.
2.3 An example set of spurious cluster signatures S(ZN) in signature space.
2.4 An example set of spurious cluster signatures S(ZN) in signature space with a few candidate clusters (grey squares).
2.5 Bounding rectangle for elliptical footprint.
2.6 Flowchart of the S.S.S method.
2.7 Population distribution of ZCTAs in Iowa, 2000.
2.8 This figure displays the computational process used to create the simulated dataset. Each bin is labeled as k and has a specific size. For the simulations in this research n=942.
2.9 The simulated datasets follow a multinomial distribution.
2.10 Summary of shapes of simulated spurious clusters, frequency and cumulative frequency.
2.11 Summary of sizes of simulated spurious clusters, frequency and cumulative frequency.
2.12 Summary of rates of simulated spurious clusters, frequency and cumulative frequency.
2.13 Characteristics of the four clusters simulated under the alternative hypothesis.
2.14 Cluster detection diagnostics (the key to the numbers is in the text).
2.15 Patterns detected by the Score statistic and the S.S.S method for one dataset among 20 datasets simulated for cluster-4. The true cluster pattern can be seen in the inset. In this particular dataset S.S.S is able to identify 62% of the true cluster pattern, while the Score statistic is able to identify 20%.
2.16 Patterns detected by the Score statistic and the S.S.S method for one dataset among 20 datasets simulated for cluster-3. The true cluster pattern can be seen in the inset. In this particular dataset S.S.S is able to identify 98% of the true cluster pattern, while the Score statistic is able to identify 91%.
3.1 Spatial patterns of prostate cancer incidence (1999-2004) in Iowa.
3.2 Cluster of prostate cancer incidence in Iowa, detected by the S.S.S method.
3.3 Cluster detected by SaTScan when the geometry of the cluster is assumed to be ellipsoidal.
3.4 Cluster detected by SaTScan when the geometry of the cluster is assumed to be circular.
3.5 Large secondary cluster with low elevation in risk detected by Kulldorff's SaTScan when the geometry of the cluster is assumed to be elliptical.
3.6 ZCTAs in Iowa with a significant value of Rogerson's Score statistic.
3.7 Expected number of cases in ZCTAs: Entire Iowa versus areas with a significant value of Rogerson's Score statistic.
3.8 ZCTAs in the North West Iowa cluster of high prostate cancer incidence.
3.9 County boundaries with ZCTAs in the North West Iowa cluster of high prostate cancer incidence.
3.10 Change in mortality and incidence rates from 1990-2004 in five counties (Dickinson, Clay, Buena-Vista, Emmet and Clay Counties) in the cluster. The expected counts for the particular year (1990, 1991, ..., 2000) are calculated using 2000 census population for the local area, and incidence/mortality information for the state of Iowa (same procedure as indirect standardization).
3.11 Variations in the directly standardized incidence and mortality rate in Iowa, and incidence of prostate cancer in Dickinson County for the years 1990-2004.
3.12 Variations in the directly standardized incidence and mortality rate in Iowa, and incidence of prostate cancer in Clay County for the years 1990-2004.
CHAPTER 1: DETECTING CLUSTERS OF DISEASE: INVESTIGATING
SPURIOUS CLUSTERS
1.1 Statement of Purpose
This dissertation offers a new cluster detection method. This method looks at the
cluster detection problem from a new perspective. I change the question of "What do real
clusters look like?" to the questions "What do spurious clusters look like?" and "How
do spurious clusters affect the ability to recover real clusters?" Spurious clusters can be
identified from their geographical characteristics. These are related to the spatial
distribution of people at risk, the shape and scale of the geographic units used to
aggregate these people, the shape and scale of the spatial configurations that the disease
mapping or cluster detection method may impose on the data and the shape and scale of
the area of increased risk. The statistical testing process may also create spurious clusters.
I propose that the problem of spurious clusters can be resolved using a computational
geographic [1] approach. I argue that Monte Carlo simulations can be used to estimate the patterns of spurious clusters in any situation of interest given knowledge of the first
three of these four determinants of spurious clusters. Then, given these determinants,
where real measurements of disease or mortality are known, it is possible to show those
areas of increased risk that are true clusters as opposed to those that are spurious clusters.
The extent of similarity (or dissimilarity) of a cluster to the simulated spurious cluster influences whether it can be recovered. These experiments show that this method is
successful in detecting clusters. This method can also predict with reasonable certainty
which clusters can be recovered, and which cannot. I compare this method with
Rogerson's Score statistic method [2]. These comparisons expose the weaknesses of Rogerson's method. Finally, these two methods and the Spatial Scan Statistic [3] are
applied to searching for possible clusters of prostate cancer incidence in Iowa. The
implications of the findings are discussed.
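The Monte Carlo idea in this statement can be illustrated with a minimal Python sketch. It assumes a simple null model in which a fixed number of cases is allocated to areas with probability proportional to population, so that no area carries excess risk. The area populations, the case total, the number of simulations and the function name `simulate_null_counts` are hypothetical values chosen for illustration only.

```python
import random

def simulate_null_counts(populations, total_cases, rng):
    """Allocate cases to areas with probability proportional to population.

    Under this null model every person has the same risk, so any area
    that looks elevated in a simulated dataset is spurious by construction.
    """
    areas = range(len(populations))
    draws = rng.choices(areas, weights=populations, k=total_cases)
    counts = [0] * len(populations)
    for area in draws:
        counts[area] += 1
    return counts

rng = random.Random(0)
populations = [500, 1200, 300, 8000, 2500]  # hypothetical area populations
total_cases = 1000                          # hypothetical case total
n_sims = 200                                # hypothetical number of simulations

sims = [simulate_null_counts(populations, total_cases, rng) for _ in range(n_sims)]
overall_rate = total_cases / sum(populations)

# For each simulated map, the highest area rate relative to the overall rate.
max_ratio = [
    max(c / p for c, p in zip(counts, populations)) / overall_rate
    for counts in sims
]
print(f"largest spurious rate ratio across simulations: {max(max_ratio):.2f}")
```

The collection of `max_ratio` values is a crude picture of how extreme an apparent excess can be when no true cluster exists; small areas tend to produce the largest ratios.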
1.2 Introduction
Disease mapping has a long history. From John Snow's cholera map to the "intelligent
agents" [4] of the present century, disease mapping has progressed with developments in
science, especially Geographical Information Systems (G.I.S) and epidemiology. Some of
the first disease maps were simple dot maps indicating the location of disease cases.
These gave way to maps of statistical summaries known as "thematic maps". These maps
convey more information than simple dot maps and are therefore powerful exploratory and
decision making tools. For example, when mortality maps of lung cancer for the United
States were made in the 1960s, high rates were found in areas of the Eastern Seaboard
[5, 6]. Later, these high rates were attributed to exposure to asbestos among shipyard
workers in these areas. A disease map can thus be used to map spatial variations in
disease risk. A decision maker can ask "Is a person living in a given area at a greater
risk of disease than a person living in another area?" or "In which areas of the map do
people have the greatest risk of disease?" In the disease mapping literature the problem
of finding areas of excess risk is often called "cluster detection", a cluster being
defined as "A geographically bounded group of occurrences of sufficient size and
concentration to be unlikely to have occurred by chance" [7] or, in plain English, a
geographic area of high disease risk. Geographical cluster detection is therefore the
spatial analogue of statistical clustering [8], where the question of interest is finding
things that are near one another in statistical space rather than in geographical space.
While investigating the causal factors (or etiology) of areas of increased risk is
important, there are other important applications of these methods. Public health agencies
are often interested in allocating resources to areas with an increased burden of disease
[9, 10]. Cluster detection methods are used to identify such areas. Sometimes,
environmental policy is formulated on the basis of such studies. In one instance, the
Vatican was taken to task for operating radio transmitters at illegal frequencies after
studies showed an increased risk of cancer among people living close to these
transmitters [11, 12]. Note that policies are often formulated on the basis of evidence
that an increased risk exists even though the etiological basis for the increased risk may
not have been established. An interesting extension to etiological research is that the
presence of spatial clusters of increased risk could also be used to prove the existence
of disease risk factors that are spatially non random. For example, it has been claimed
that clusters of autism in California prove the existence of risk factors that are not
related to genetics or the vaccine hypothesis¹ (barring selective migration) [13]. Many
public health agencies maintain "on the fly" cluster investigation infrastructure to
address cluster related enquiries [14].
A number of methods exist that can be used to delineate clusters. A persistent problem
with many of these methods is that areas not at high risk are identified as being at high
risk. Some convenient terms for such false positives are "noise" [15], "noisy clusters" or
"spurious clusters" [16-19]. In this research I develop a method to detect and adjust for
the occurrence of spurious clusters in cluster detection studies.
The cluster detection literature identifies at least three types of spurious clusters.
The first arises when the estimate of risk in an area is based on a small number of people
[15]. These estimates of risk are unreliable, and therefore the area may not have a
significant excess risk. A number of solutions to this problem exist [20-26]. The second
type of spurious cluster stems from statistical issues in the cluster detection method.
For example, failing to adjust for multiple hypothesis testing may give rise to spurious
clusters [18, 27]. This problem is an area of active research [28].
1 The vaccine hypothesis is that exposure to Thimerosal, a mercury-based additive in
vaccines, is a risk factor for autism.
Kulldorff's SaTScan method resolves this problem by adopting a likelihood based
hypothesis testing framework [3].
The third type of spurious cluster is created by a mismatch between the scale and spatial
structure of the process that generates the cluster and the scale and spatial structure
used to measure that process. The scale and spatial structure, or spatial form, of the
cluster search process (which measures or samples the underlying data) can generate
spurious clusters. Unlike the other sources of spurious clusters, very little research
exists on this form of noise. There are a number of reasons for this. Until recently, the
computational power available to researchers for cluster detection problems was limited.
A cluster can have any geometry or spatial form in reality. However, limited computational
power confined researchers to searching for clusters within a small range of spatial forms.
For instance, it is a common strategy to search for circular clusters. This strategy was
adopted by some of the first cluster search methods [27], and remains common today [29].
If the real cluster is not circular in shape, then the power to detect it is greatly
reduced. But a limited search also implies that the likelihood of mismatch between the
circles and the underlying true cluster is limited (given that the spatial form of this
true cluster is unknown). In contrast, if the cluster search incorporates a number of
different spatial forms, then the likelihood of mismatch increases. Since computational
power is no longer a limiting factor, some researchers have developed "shape free" disease
cluster detection methods. These methods, which draw from the work of geographers in the
1960s and 70s [30], measure spatial attributes (like disease counts or rates) at a large
number of possible shapes, sizes and scales. The measured spatial attributes, or some
functions of the attributes, are used to decide whether an area of a given shape and size
at a given scale is a cluster. For example, Duczmal's [31] scan assigns a likelihood value
to each cluster it finds, where the likelihood is a function of attributes such as the
observed number of cases in the cluster. The candidates with the highest likelihood are
most likely to be clusters. These methods thus promise to
seek out the true clusters, no matter what their spatial form. However, this also means
that at some shape and scale, noise or spurious clusters will be detected. These spatial
forms will represent a mismatch between the shape and scale of the process that
generated the pattern and the shape and scale of the process being used to detect it. The
closest analogy is what is known in the disease mapping literature as the "Texas
Sharpshooter Effect". If a shotgun is fired at a wall, the wall is splattered with
seemingly random bullet holes. At the scale of the wall, the process is random. However,
it is always possible to draw targets a posteriori around the bullet holes. The act of
drawing a target is similar to searching for a cluster at a scale different from the scale
at which the original process occurred (the entire wall). Duczmal's search procedure thus
often finds clusters that are spurious. Such spurious clusters will be found by any method
that offers even the least amount of geometric freedom to the cluster search. In fact,
these spurious clusters have been found even when the search is limited to circular
geometries (for example, see Kulldorff [32]). Tackling this problem therefore requires
a) a thorough understanding of what gives rise to these spurious clusters and b) a method
to solve or, at the very least, manage this problem. This dissertation attempts both.
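The sharpshooter effect can be made concrete with a small Python sketch. It is an illustration under stated assumptions and not any published scan method: 42 people sit on a 6 by 7 grid and each is diseased independently with risk 0.24 (the values used in the captions of Figures 1.3 and 1.4), and the search is a hypothetical exhaustive scan over axis-aligned rectangular windows holding at least six people. The function names and the `min_people` threshold are assumptions made for this example.

```python
import random

ROWS, COLS, RISK = 6, 7, 0.24  # 42 people, uniform risk everywhere

def simulate_grid(rng):
    """One realization of the noise process: 1 = diseased, 0 = healthy."""
    return [[1 if rng.random() < RISK else 0 for _ in range(COLS)]
            for _ in range(ROWS)]

def best_window(grid, min_people=6):
    """Scan every rectangular window and return (highest rate, window)."""
    best = (0.0, None)
    for r0 in range(ROWS):
        for r1 in range(r0, ROWS):
            for c0 in range(COLS):
                for c1 in range(c0, COLS):
                    n = (r1 - r0 + 1) * (c1 - c0 + 1)
                    if n < min_people:
                        continue
                    cases = sum(grid[r][c]
                                for r in range(r0, r1 + 1)
                                for c in range(c0, c1 + 1))
                    rate = cases / n
                    if rate > best[0]:
                        best = (rate, (r0, r1, c0, c1))
    return best

rng = random.Random(1)
grids = [simulate_grid(rng) for _ in range(200)]
best_rates = [best_window(g)[0] for g in grids]
mean_best = sum(best_rates) / len(best_rates)

print(f"true risk everywhere: {RISK}")
print(f"mean rate inside the best window: {mean_best:.2f}")
```

Because the whole grid is itself one of the windows, the best window can never have a lower rate than the map as a whole. The target drawn after the fact therefore always looks at least as bad as the wall.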
It is clear that an understanding of this problem requires an understanding of the scale
and shape of the spurious cluster or noise generating process. The shape, size and risk
elevation of a cluster, whether spurious or real, are unique to each disease
mapping/cluster detection situation. The characteristics (shape, size and risk elevation)
of a cluster depend on: a) the cluster generating process, especially the shape and size
of the area of excess risk, b) the spatial distribution of people over space and c) the
scale at which the spatial data are aggregated [19]. These factors are unique to each
disease mapping situation, and they are responsible for creating spurious clusters. Two
points follow: 1) every disease mapping situation has a unique noise or spurious cluster
signature, and 2) it is not possible to guess this signature a priori. However, this
signature may be computed, as explained below.
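To fix ideas, a signature for one cluster candidate might be computed as in the following Python sketch. It assumes the candidate is a set of unit grid cells, measures size as the number of cells, and measures shape with the isoperimetric quotient 4πA/P² (close to 1 for compact shapes, near 0 for elongated ones). That shape index, the grid representation and the function name `signature` are assumptions chosen for illustration; they are not necessarily the measures used later in this dissertation.

```python
import math

def signature(cells, cases, population):
    """Return (shape, size, rate) for a candidate made of unit grid cells.

    cells      -- set of (row, col) tuples
    cases      -- dict mapping cell -> number of cases
    population -- dict mapping cell -> number of people at risk
    """
    size = len(cells)
    # Perimeter: count cell edges that do not border another cell in the set.
    perimeter = 0
    for (r, c) in cells:
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if (r + dr, c + dc) not in cells:
                perimeter += 1
    shape = 4 * math.pi * size / perimeter ** 2  # assumed compactness index
    rate = (sum(cases[cell] for cell in cells)
            / sum(population[cell] for cell in cells))
    return shape, size, rate

# Two hypothetical candidates with the same size and rate but different shape.
compact = {(r, c) for r in range(3) for c in range(3)}  # 3 x 3 block
snake = {(0, c) for c in range(9)}                      # 1 x 9 strip
population = {cell: 10 for cell in compact | snake}
cases = {cell: 3 for cell in compact | snake}

sig_compact = signature(compact, cases, population)
sig_snake = signature(snake, cases, population)
print("compact:", sig_compact)
print("snake:  ", sig_snake)
```

The two candidates are indistinguishable on rate alone, which is the motivation for adding shape and size as further axes.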
Since each disease mapping situation has a unique noise or spurious cluster
signature, it follows that in every disease mapping situation there will be some clusters
that are hard to detect. These clusters will be similar in some ways to the spurious or
noisy clusters. This issue, the issue of "recoverability", has only recently begun to be
discussed in the disease mapping literature [33, 34]. The method I describe incorporates
the following features. First, it extracts cluster candidates using an exploratory
approach. Second, shape, size and rate are used to distinguish true clusters from spurious
clusters. Third, the method incorporates recoverability of clusters into the analyses. The
researcher is able to know (computationally) a priori what spatial forms of clusters are
recoverable. The method utilizes computational geography and two fundamental geographic
aspects of clusters, shape and size, to analyze the recoverability of clusters and to
separate true clusters from spurious clusters. This dissertation diverges from the
traditional disease clustering literature in taking shape and size into consideration.
Traditionally only the rate at a given location or some function of the rate is used to
separate a true cluster from a spurious one. Since the method incorporates the shape and
size of the cluster in its analysis, I call it the Shape, Size Sensitive disease cluster
detection method or the S.S.S method. The S.S.S method is tested and validated on
simulated data. This method demonstrates the power of computational geography over
traditional methods [35]. The ideas and methods developed and tested in this dissertation
are either new, or have been discussed only in scant detail in the literature. Yet, they
are fundamental to geography and disease mapping. This research thus makes an important
contribution to the disease mapping literature.
1.3 Organization of the dissertation
In this chapter (Chapter 1) I discuss how various disease mapping and cluster detection techniques approach the problem of spurious clusters. I then argue that these
methods do not address the issue of spurious clusters adequately. I suggest that a
geographical approach can help us better understand the problem and explain how
geography gives rise to spurious clusters. Then, having understood the geographical
bases for spurious clusters I propose a geographically sensitive disease cluster detection
method. I explain this method, the Shape, Size Sensitive (S.S.S) method, in Chapter 2. Then, using simulated data, I test the sensitivity of this method. I also compare the
performance of the S.S.S method with Rogerson's Score statistic method for detecting
disease clusters. The final, short chapter is Chapter 3. Here I use the S.S.S method,
Rogerson's Score statistic and Kulldorff's Spatial Scan Statistic to investigate the spatial
patterns of prostate cancer risk in Iowa. The implications of the findings are discussed.
1.4 Review of existing methods of cluster
detection
All disease mapping and cluster detection approaches share a common goal. This
is to uncover the underlying pattern of risk. These methods calculate statistics as rates or
likelihoods which serve as measures of risk. The "patterns" on a map are obtained by
mapping either these statistics, or those areas that cross some threshold of the calculated
statistic. When the second procedure is followed, that is, when the rate or the likelihood of an
area having an excess risk is statistically tested, the method is often called a cluster
detection method. Most cluster detection methods test a large number of areas which
could possibly be clusters. These are called "candidate clusters" [31, 36] or "cluster candidates". If a cluster passes the statistical test, but demarcates an area where no
cluster exists in reality, then it is a "noisy cluster" [31] or "spurious cluster" [16-19]. The term "true cluster" may be used to indicate geographic areas of excess risk. It is also
possible that a true cluster is suppressed by the cluster detection process. In the disease
cluster detection literature this problem is usually not discussed separately, but forms an
integral part of the spurious cluster detection problem. Spurious clusters may be created
at various stages in the disease mapping/cluster detection process. The first step for
applying a cluster detection method is to collect spatial data. This data may come pre-
aggregated into administrative regions, or it may come in the individual form [37, 38]. If the data are in the individual form, they need to be processed and aggregated
such that summary statistics may be gleaned from them and the summary statistics
mapped. The process of aggregation may create spurious clusters. One solution is to use
the individual level data to search for clusters [39]. While a number of methods will work with both aggregated and individual level data, there are very few methods that have
been developed exclusively for individual level data [40, 41]. With better quality data becoming increasingly available, such analyses will become more common [37, 42]. The majority of disease mapping situations start with aggregated data, and summary statistics are calculated from these datasets. When the summary statistics are calculated based on a
small base population (also called a small support size), these statistical estimates are likely to be unreliable. This is the "small number problem". Some methods carry out
a process called "smoothing", where information from neighboring regions is used to
obtain a better estimate of the mapped statistic for a given region. This, to some extent,
alleviates the problem of spurious clusters created from small numbers. The statistical
testing procedure could also create spurious clusters. If multiple hypothesis tests are
carried out without adjustment, spurious clusters may arise. In a famous example, Openshaw [27] carried out multiple hypothesis tests when searching for leukemia clusters in Northern England. Whenever a test was significant, a circle was
drawn. Some of these circles were spurious clusters, and would not have existed if
adjustments for multiple testing had been carried out. Sometimes, using the wrong reference distribution may also create spurious clusters. Conversely, using overly conservative
multiple testing correction techniques may suppress true clusters [28]. Waller and Gotway [4] write of situations where, for a Poisson reference distribution, it is not possible to distinguish a lack of fit to the Poisson distribution (spurious cluster) from a rejection of the null hypothesis (true cluster). This is an area of active statistical research, and some new and innovative solutions have been proposed to these problems [43, 44]. Kulldorff's SaTScan method uses a likelihood based hypothesis testing framework to
solve the problem of multiple testing [3]. Instead of testing multiple hypotheses, this method tests only one hypothesis. This hypothesis test is carried out on the cluster
candidate that is most likely to be a cluster. The likelihood is a statistical function
that is calculated under the assumption that the observed data conform to certain known
distributions (e.g., Poisson or binomial). There still remains the third source of spurious clusters. Unlike the first two, there
is little research on this source of spurious clusters. This is when spurious clusters are
created from mismatch between the process that generates the disease map patterns, and
the processes used to recover the patterns. This mismatch could arise when the data are
aggregated to administrative regions, or to other shapes and scales by the method of
analysis. In this section I discuss the various methods of cluster
detection in the context of their ability to handle this problem. Among the various methods
available, some offer the opportunity of multiscalar analysis. In these methods,
the data may be geographically rescaled. While these methods geographically process the
data before mapping patterns, other methods consider the sanctity of geographic
boundaries unbreachable. The latter attempt to expose the underlying risk pattern by
mapping summary statistics within existing geographic boundaries without any further
geographic processing of the data.
1.4.1 Map data without further geographic
processing
In these methods the geographic boundaries of regions are left as they are;
however, various statistical manipulations are carried out on the data. Some researchers
prefer to call this group of methods "disease mapping" methods [45]. As I discussed earlier, these methods can again be subdivided into two groups: methods that smooth the
data and methods that do not smooth the data.
1.4.1.1 Methods that do not smooth the data
The vast majority of disease maps are maps of raw rates, where the number of cases per unit population within existing geographic regions such as counties or states is
mapped [46]. Another approach is a "map of probabilities" [47, 48], where instead of mapping a rate, the probability of observing the rate within a geographic region is
mapped. Mapping raw rates is often problematic when the rates are based on small base
populations [15]. The maps thus produced are likely to display noisy patterns (the small number problem).
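The contrast between a raw rate map and a probability map can be made concrete with a small sketch. It assumes, purely for illustration, that counts in each region follow a Poisson distribution whose mean is the overall rate multiplied by the region's population; the function names and the data are hypothetical and are not taken from the cited works.

```python
import math

def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected), by direct summation."""
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)      # P(X = 0)
    cdf = term
    for x in range(1, observed):
        term *= expected / x        # P(X = x) obtained from P(X = x - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def rate_and_probability_map(cases, population):
    """Return (raw rate, upper-tail probability) for every region.

    Assumption: the expected count in a region is the overall rate
    times the region's population.
    """
    overall_rate = sum(cases) / sum(population)
    rows = []
    for c, p in zip(cases, population):
        expected = overall_rate * p
        rows.append((c / p, poisson_upper_tail(c, expected)))
    return rows

# Hypothetical data: the first two regions share the same raw rate (0.02),
# but the first rests on a much smaller base population.
cases = [2, 20, 200]
population = [100, 1000, 20000]
for rate, prob in rate_and_probability_map(cases, population):
    print(f"rate={rate:.4f}  P(X >= observed)={prob:.4f}")
```

In this toy example the two regions with identical raw rates receive very different probabilities, which is the small number problem in miniature: the rate based on 100 people is far less informative than the same rate based on 1000.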
1.4.1.2 Methods that smooth the data
In these methods various statistical manipulations are used to smooth the rates
in each region while at the same time keeping the geographic boundaries intact.
Information from neighboring regions is used to stabilize the rates in a given region.
Some examples of this approach can be found in the Bayesian disease mapping literature
[23, 24]. Other examples are the method of moving averages and headbanging [20, 22]. These methods are not very successful in dealing with the problem of spurious clusters. A study by Kafadar [22] has shown that many of the popular smoothers such as headbanging and empirical Bayes are unable to detect true patterns in the data or have
issues with detecting spurious patterns or clusters. Some of the methods smooth the data
by averaging rates over kernels or filters. For example, Sabel et al. [49] investigate rates of Amyotrophic Lateral Sclerosis (Lou Gehrig's disease) incidence in Finland by smoothing rates using Gaussian kernels. Another method is Rogerson's Local Score
statistic [2, 4, 50]. In this method the deviations from the expected rate are smoothed using Gaussian kernels. As with other methods, if the rates are based on small numbers,
then smoothing these unreliable rates may create spurious clusters. I use Rogerson's
Score statistic in my research and therefore, this method is discussed in detail in later
sections. Spurious clusters are often created by these methods. First, because these
methods map the rates based on small areas before smoothing them, they are prone to the
small number problem. Second, these methods do not in any way attempt to deal with the
problem of spurious clusters from spatial mismatch discussed earlier. Third, the statistical
tests that these methods carry out may not be able to distinguish spurious clusters from
true clusters. For example, there is no consensus on what the correct reference
distribution is for Rogerson's Score statistic [2, 4, 50]. A separate group of methods that often smooth the data are local measures of
spatial similarity. These methods, which are also known as LISA (Local Indicators of Spatial Association) [51], address the question "How similar is the risk at a given small area to that of its neighbors?" The greater the similarity, the higher the likelihood
that the small area belongs to (or is) a cluster. Some of the LISA statistics are local Moran's I and local Geary's C [50-54]. Since the underlying philosophy of this approach is that things nearer are more similar than things farther away [55], the implicit definition of scale here is the distance at which this similarity is manifested. Thus a process that acts
at a large scale may cause greater similarity among immediately neighboring local areas than
processes that work at a smaller scale. As with other methods, if the statistics are calculated
on small areas, they could be unreliable. The reference distributions of LISA statistics are
often not known [4] and the scale at which a process operates is not investigated before
LISA statistics are calculated. Any of these factors could lead to the creation of spurious
clusters.
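As an illustration of how a LISA statistic measures similarity between an area and its neighbors, the following is a minimal sketch of a local Moran's I calculation. It assumes the commonly used form with row-standardized binary neighbor weights; the function name, the neighbor lists and the rates are hypothetical.

```python
def local_morans_i(values, neighbors):
    """Local Moran's I for each area.

    values:    one rate per area
    neighbors: neighbors[i] is the list of indices adjacent to area i
    Assumption: binary adjacency, row-standardized, so the spatial lag
    is the mean deviation of an area's neighbors.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]              # deviations from the mean
    m2 = sum(d * d for d in z) / n              # second moment
    if m2 == 0:
        raise ValueError("all values are identical; the statistic is undefined")
    result = []
    for i in range(n):
        nb = neighbors[i]
        lag = sum(z[j] for j in nb) / len(nb) if nb else 0.0
        result.append(z[i] * lag / m2)
    return result

# Hypothetical chain of five areas: low rates on the left, high on the right.
rates = [1.0, 1.0, 1.0, 5.0, 5.0]
chain = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(local_morans_i(rates, chain))
```

Areas surrounded by similar values obtain positive scores, while an area at the boundary between the low and high groups obtains a negative score. Nothing in the calculation reflects how many people each rate is based on, which is one way in which unreliable rates can feed into the statistic.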
1.4.2 Methods that pre-process the data
before calculating and/or testing for significant
disease risk
These methods allow the modification of geographic boundaries to extract the
underlying risk surface and/or to find which area has the greatest excess in disease risk.
One group of methods, often called density estimation methods [56], simply ignores existing geographic boundaries. Drawing from the "field" theory of geographic
phenomena [20], they consider that disease risk patterns are continuous in nature and that they do not change or stop abruptly at geographic boundaries. When appropriately used,
these methods provide the opportunity to control the spatial basis of support, and thus the
scale of the analysis [57, 58]. The other group of methods draws from concepts of region building which were developed by geographers [30]. One approach to building regions is to coalesce groups of areas to build aggregate regions. These methods attempt to find
that combination of areas which has the greatest likelihood of being a zone of high
disease risk. A third group of methods combine concepts of region building methods with
the first group of methods or with methods discussed in the last section. The ability of all
these methods is limited by the scale of the data. Often the data come aggregated into
small areas and the analysis must be carried out at scales equal to or greater than the scale of
aggregation. Nevertheless, these methods are better equipped than other methods to
control the shape and the scale of the data, and this gives them an edge over other
methods when dealing with the problem of spurious clusters.
1.4.2.1 Non combinatorial approaches
These methods ignore geographic boundaries and attempt to extract the
underlying patterns of risk. They often lay a uniform grid over the map area and measure
the statistic of interest at each grid point. Irrespective of whether the data are aggregated
or not, a value can be obtained at each grid point. While there are a number of approaches
to calculating the statistic at each grid point [21], a simple and common approach is to "filter" the data using circular spatial filters [3, 9, 21, 27]. Some methods map the statistic calculated at each grid point [9] while others do not [3]. These circles can be of fixed or varying sizes. However, since these filters are of a certain shape, they bias the cluster
search. The bias is in favor of detecting clusters of, or similar to, the shape of the filter
(circles in this case). Statistically, clusters that have the shape of the filter are detected with higher power than clusters of other shapes. This approach therefore
overcomes the limitation outlined in the methods discussed earlier, but is limited in its
treatment of geographic shape. Ellipses and other geometric shapes have also been
studied [29, 59]. One of the methods, based on Rushton's Adaptive DMap [9], maps rates at grid points using adaptive filters and interpolates these with an IDW (Inverse Distance Weighting) interpolation algorithm. The adaptive filter [58, 60] ensures that the rates are based on the same number of people, that is, the same support size. Thus, unlike the
LISA methods, all statistics are equally reliable. Also, the use of an adaptive filter
ensures that the scale of the analysis can be precisely controlled. The Inverse Distance
Weighting algorithm used for creating the final pattern was also found by Kafadar [22] to be the least noisy of all smoothing/interpolation methods. Thus, by allowing
multiscalar analysis and relative freedom of cluster shape (clusters don't have to conform to geographic boundaries), and by using a robust interpolation technique, Rushton's Adaptive Filtering method is best suited for dealing with the problem of spurious clusters from
mismatch between the process and analysis scales. I use this method in my analyses.
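The idea of an adaptive filter, a circle that grows until it encloses a fixed support size, can be sketched as follows. This is only an illustration of the principle under simplifying assumptions (areas represented by centroids, whole areas added in order of distance, positive populations); it is not the DMap implementation, and the function name and data are hypothetical.

```python
import math

def adaptive_filter_rate(grid_point, areas, min_population):
    """Rate at one grid point from a filter that expands to a target support size.

    areas: list of (x, y, cases, population) tuples, one per area centroid.
    The filter absorbs centroids in order of distance from the grid point
    until the enclosed population reaches min_population.
    Returns (rate, filter radius, enclosed population).
    """
    gx, gy = grid_point
    ordered = sorted(areas, key=lambda a: math.hypot(a[0] - gx, a[1] - gy))
    cases = 0
    pop = 0
    radius = 0.0
    for x, y, c, p in ordered:
        cases += c
        pop += p
        radius = math.hypot(x - gx, y - gy)
        if pop >= min_population:
            break
    return cases / pop, radius, pop

# Hypothetical centroids on a line, 100 people each.
areas = [(0, 0, 1, 100), (1, 0, 3, 100), (5, 0, 10, 100)]
print(adaptive_filter_rate((0, 0), areas, 200))   # small radius where areas are close together
print(adaptive_filter_rate((5, 0), areas, 200))   # larger radius where areas are sparse
```

Both grid points end up with a rate based on 200 people, but the filter radius differs: the geographic scale adapts so that the support size, and hence the reliability of the rate, stays constant.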
Another important density estimation method is Kulldorff's SaTScan [3]. While the
DMap method maps the extracted pattern, and is therefore good for visualizing and
exploring the underlying pattern, SaTScan can be used to map only those areas that are
significant clusters. SaTScan has found wide acceptance in the public health community
because of its ability to account for the multiple hypothesis testing problem and because of its robust,
freely available software. Some of the recent developments in the disease clustering
literature have followed the combinatorial approaches that I discuss next, and their
method of choice has been based on the Spatial Scan Statistic method of cluster
detection. Since multiple testing is an issue with these combinatorial approaches, the
Spatial Scan Statistic is a reasonable choice. Since I use the Spatial Scan Statistic in
Chapter 3 to investigate clusters of prostate cancer in northwest Iowa, some of the
details of the Spatial Scan Statistic are provided next.
The scan statistic originated as a one dimensional test. Its objective was to test if a one dimensional point process is purely random. The one dimensional scan
statistic was extended by Kulldorff into the spatial domain [3]. The spatial scan statistic moves a circle across the study area. The circle is centered on a centroid. The centroid
could be the location of a single individual for unaggregated data, the centroid of a census
tract (for example) for aggregated data, or one of a set of grid points. Kulldorff (1997) [3] states: "The zone defined by a circle consists of all individuals in those cells whose
centroids lie inside the circle and each zone is uniquely identified by these individuals.
Thus, although the number of circles is infinite the number of zones will be finite. For
unaggregated data the zones are perfectly circular, that is, the individuals in the zone are
exactly those located within a defining circle. With data aggregated into census districts,
a zone may have irregular boundaries that depend on the size and the shape of the several
contiguous census districts it includes." The Spatial Scan Statistic is implemented
through the freely available software SaTScan [32]. The methodology of the Spatial Scan Statistic is explained as follows. The method involves two steps: 1. Confounder
adjustment and 2. Hypothesis testing.
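The zone construction described in the passage quoted from Kulldorff can be illustrated with a short sketch that enumerates the distinct zones produced by circles centered on each centroid. It assumes that circles are centered only on the centroids themselves and are capped at a maximum radius; the function name and coordinates are hypothetical, and SaTScan's own enumeration may differ in detail.

```python
import math

def circular_zones(centroids, max_radius):
    """Enumerate distinct zones (sets of area indices) defined by circles.

    For each center, the circle is grown through the other centroids in
    order of distance. A zone is recorded each time the circle reaches a
    new distance, so centroids at exactly the same distance always enter
    together.
    """
    zones = set()
    for cx, cy in centroids:
        by_dist = sorted(
            (math.hypot(x - cx, y - cy), i) for i, (x, y) in enumerate(centroids)
        )
        members = []
        for pos, (d, i) in enumerate(by_dist):
            if d > max_radius:
                break
            members.append(i)
            is_last = pos + 1 == len(by_dist)
            if is_last or by_dist[pos + 1][0] > d:
                zones.add(frozenset(members))
    return zones

# Three hypothetical centroids on a line.
zones = circular_zones([(0, 0), (1, 0), (3, 0)], max_radius=1.5)
print(sorted(sorted(z) for z in zones))
```

Although infinitely many circles could be drawn, only a handful of distinct zones result, because two circles that enclose the same centroids define the same zone.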
In disease cluster detection studies known risk factors or confounders are
adjusted for, before the cluster detection algorithm is implemented. Thus, for example, it is known that age is associated with prostate cancer. It may be desirable to remove the
effect of age from the analyses, such that the clusters that are detected reflect the presence
of other, yet unknown, risk factors. The confounder adjustment procedure that SaTScan utilizes is known as the indirect standardization method. It is as follows:
If
$e_i$ = Expected number of cases in local area/ZCTA $i$ after confounder adjustment,
$n_i$ = Observed number of cases in local area/ZCTA $i$ after confounder adjustment,
$r$ = a specific confounder group, for example the age group from 45-65 yrs,
$n_r$ = Total number of cases in G in age group $r$,
$N_{ir}$ = Total number of people in local area $i$ of G, in age group $r$,
then the confounder adjustment procedure is:
$$e_i = \sum_{r} \left[ \frac{n_r}{\sum_{i'} N_{i'r}} \, N_{ir} \right]$$
where the outer sum runs over all confounder groups and the inner sum runs over all local areas in G.
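A minimal sketch of this indirect standardization, assuming the reading that each group's region-wide rate is applied to the local population in that group, is given below. The function name, argument layout and numbers are hypothetical.

```python
def expected_cases(cases_by_group, population):
    """Expected cases per local area under indirect standardization.

    cases_by_group[r]: total cases in the whole region G in confounder group r
    population[i][r]:  people in local area i who belong to group r
    """
    n_groups = len(cases_by_group)
    # Total population of G in each confounder group.
    group_totals = [sum(area[r] for area in population) for r in range(n_groups)]
    return [
        sum(cases_by_group[r] / group_totals[r] * area[r] for r in range(n_groups))
        for area in population
    ]

# Two hypothetical local areas and two age groups.
population = [[100, 50], [300, 150]]
cases_by_group = [4, 20]
print(expected_cases(cases_by_group, population))
```

A useful check on any such calculation is that the expected counts summed over all local areas equal the total number of observed cases in G.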
The adjusted numbers of cases are then used to test the hypothesis that a given local
area/ZCTA i has an excess risk/belongs to a cluster. The hypothesis testing procedure is
explained next. The Spatial Scan Statistic tests the hypothesis that a given area of the map
(for example a collection of ZCTAs) has a greater (or lesser) risk than the rest of the
ZCTAs in the entire geographic region G.
If $Z_j$ is the $j$th cluster in $Z$ (the collection of $k$ possible clusters in G), $R(\text{inside}, j)$ is the risk inside $Z_j$ and $R(\text{outside}, j)$ is the risk outside $Z_j$, then the null and alternative hypotheses are:
$H_0$: $R(\text{inside}, j) = R(\text{outside}, j)$
$H_1$: $R(\text{inside}, j) > R(\text{outside}, j)$
The observed number of cases $n_j$ inside (or outside) a cluster candidate is assumed to be Poisson distributed, and a function of the expected number of cases in the cluster $e_j$ and the risk $R(\text{inside}, j)$. Let $n$ be the total number of cases in G. Then
$$n_j \sim \text{Poisson}\left[ e_j \cdot R(\text{inside}, j) \right]$$
The likelihood ratio obtained from these null and alternative hypotheses is:
$$LR_j = \frac{\text{Likelihood}\left(R(\text{inside}, j) > R(\text{outside}, j)\right)}{\text{Likelihood}\left(R(\text{inside}, j) = R(\text{outside}, j)\right)}$$
This likelihood ratio can be solved and written in the logarithmic form as follows:
$$LLR_j = n_j \ln\left(\frac{n_j}{e_j}\right) + (n - n_j) \ln\left(\frac{n - n_j}{n - e_j}\right)$$
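The log likelihood ratio can be evaluated directly for any cluster candidate. The sketch below assumes that n is the total number of cases in G and, in keeping with the one-sided alternative hypothesis, returns zero when the candidate shows no excess of cases; that convention and the function name are assumptions made for illustration.

```python
import math

def log_likelihood_ratio(n_j, e_j, n):
    """Poisson log likelihood ratio for one cluster candidate.

    n_j: observed cases inside the candidate
    e_j: expected cases inside the candidate
    n:   total cases in the whole region (assumed to equal total expected cases)
    """
    if n_j <= e_j:
        return 0.0                     # no excess risk inside: nothing in favor of H1
    llr = n_j * math.log(n_j / e_j)
    if n > n_j:                        # skip the outside term when no cases lie outside
        llr += (n - n_j) * math.log((n - n_j) / (n - e_j))
    return llr

# Hypothetical candidate: 10 cases observed where 5 were expected, 100 cases overall.
print(log_likelihood_ratio(10, 5.0, 100))
```

The value grows both with the relative excess inside the candidate and with the number of cases involved, so a doubling of risk based on many cases scores higher than the same doubling based on few.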
The significance of the log likelihood ratio is tested using a Monte Carlo
hypothesis test. The SaTScan program carries out a user-specified number of Monte
Carlo randomizations of the data and tests for the presence of a cluster at the 0.001 % significance level (the percentage can be user
specified too). A p value is reported. This is
calculated as "p value = Rank of LLR / (1 + #simulation)". Note that the spatial scan
statistic procedure does not adjust for multiple testing in the traditional sense, for example
by carrying out a Bonferroni or other multiple testing adjustment procedure. Instead, it
avoids the problem of testing multiple hypotheses by concentrating on those cluster
candidates that are most likely to be true clusters (and thus have the highest log likelihood
value). Also note that the Spatial Scan Statistic procedure explained above is the spatial
Poisson model, which is the model used in disease mapping. There are numerous other
modifications to the Spatial Scan Statistic procedure [29].
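The Monte Carlo test of the most likely cluster can be sketched in a few lines. The sketch assumes that, under the null hypothesis, the total number of cases is redistributed among local areas with probabilities proportional to their expected counts, and that the candidate zones are supplied as lists of area indices that never cover the whole region. All names and data are hypothetical, and this is not SaTScan's code.

```python
import math
import random

def llr(n_j, e_j, n):
    """Poisson log likelihood ratio; zero when there is no excess inside."""
    if n_j <= e_j:
        return 0.0
    value = n_j * math.log(n_j / e_j)
    if n > n_j:
        value += (n - n_j) * math.log((n - n_j) / (n - e_j))
    return value

def max_llr(cases, expected, zones):
    """Largest log likelihood ratio over all candidate zones."""
    n = sum(cases)
    return max(
        llr(sum(cases[i] for i in zone), sum(expected[i] for i in zone), n)
        for zone in zones
    )

def monte_carlo_p_value(cases, expected, zones, n_sim=999, seed=0):
    """p value = rank of the observed maximum LLR / (1 + number of simulations)."""
    rng = random.Random(seed)
    n = sum(cases)
    areas = list(range(len(cases)))
    observed = max_llr(cases, expected, zones)
    rank = 1
    for _ in range(n_sim):
        simulated = [0] * len(cases)
        # Null hypothesis: each case falls in an area with probability
        # proportional to that area's expected count.
        for i in rng.choices(areas, weights=expected, k=n):
            simulated[i] += 1
        if max_llr(simulated, expected, zones) >= observed:
            rank += 1
    return rank / (1 + n_sim)

# Hypothetical map of five areas with a strong excess in area 0.
zones = [[0], [1], [2], [3], [4], [0, 1], [1, 2], [2, 3], [3, 4]]
print(monte_carlo_p_value([30, 5, 5, 5, 5], [10.0] * 5, zones, n_sim=99))
```

Because the observed maximum is compared with the maximum from each simulated map, rather than with the value at the same zone, only one hypothesis is tested regardless of how many candidate zones are examined.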
1.4.2.2 Combinatorial Approaches
Some geographers are interested in creating or building regions [30, 61-64]. Regions are built up by assigning small areas to groups such that they fulfill certain
criteria. Regional geographers have called this the "assignment problem". Small areas
are assigned to regions such that a certain attribute of the region is optimized [30, 62]. Sometimes, the problem could involve maximizing the variation in an attribute of the
newly built region as a proportion of the variation within the entire map [30, 65]. The general question in this approach is "What combination of areas will optimize a given
objective?". In the disease mapping context, disease risk or the likelihood of risk can be maximized. An example in the disease mapping context was investigated by Alvanides
[61]. A similar strategy was also suggested (but not implemented) by Rushton [66]. These ideas were implemented in computer programs first by Openshaw [64] and later by other researchers [63, 67, 68]. Independently, Duczmal suggested a similar solution to finding disease clusters of any shape. He operationally achieved this by maximizing the
Spatial Scan Statistic likelihood function over possible combinations of areas. While it is
sometimes possible to look at all possible combinations/collections of areas, for most
realistic geographical areas this is not possible (for example, see Cliff and Haggett [62]). Neither are there theoretical solutions to the problem. In operations research, such
problems are called NP-complete. This means that, for a collection of n areas, no algorithm is known that
solves the problem in polynomial computer time. Heuristics are used to solve such
problems. Duczmal uses the Simulated Annealing (SA) and Genetic Algorithm (GA) heuristics in his research [31, 69]. An important aspect of these methods is that they provide enormous freedom of analysis of shape and scale. The analysis scale and shape
vary across a multitude of combinations. Thus instead of asking the question "Is there a
cluster at a given scale of the following shape?" these methods demand "Find clusters
of any shape at any scale". This makes these methods immensely powerful. But this
strength also brings about a weakness. If spurious clusters are created from a mismatch
between the process and analysis scales and shapes, and if a large number of scales and
shapes are evaluated by this analysis method, then it follows that noisy clusters will
almost always be detected by these methods alongside genuine or true clusters. At the
end of this section we shall see an example of this. The next section discusses some of
the modifications that researchers have proposed to these methods. These modifications
offer better power of detecting clusters.
1.4.2.3 Hybrid Approaches
These approaches combine some of the strategies of the non-combinatorial
approaches with a combinatorial search. Some examples are the approaches proposed by
Patil and Taillie [70], Tango [71] and Yiannakoulias [36]. Tango proposed that the search begin with a circular cluster as a "seed", but that regions adjacent to the circular cluster then be coalesced with it and the resulting hybrid be tested as a possible cluster. With every
additional level of adjacency enumerated the problem becomes more computationally complex, and therefore in his example Tango suggested that three levels of adjacency be tested. Patil and Taillie's [70] approach restricts the search space to areas with the highest rates, which Patil and Taillie call the "Upper level sets". These methods provide
interesting extensions to the combinatorial shape-free methods of cluster search.
We are now in a position to summarize the various methods discussed. All the
methods outlined above have one singular goal: To extract the underlying pattern of
significant excess risk. Some methods are good at mapping the entire pattern [9], while others are good at testing for significant excess risk [3]. In the next section, I discuss how problems with significance testing can introduce spurious clusters.
1.4.3 Significance Testing and Spurious
Clusters
In general all methods, at some point, address the following question: "Of all the
candidate clusters in the pattern of risk (whether mapped or not), what clusters are true clusters?" Each candidate cluster has a specific risk elevation, a size, and a shape.
Traditionally most "cluster detection" techniques have used some function of the risk
elevation or rate of a given area to decide if the area is a true cluster. The question that is
asked is "How likely are we to observe this risk elevation or rate in this area if the
underlying process is noise?" If the probability is small, then the area is considered a cluster.
The distribution of risks/rates under the process of noise is also known as the "reference
distribution". Traditionally, the reference distribution is normatively chosen. Some
choices are the normal distribution [2, 50], the chi-squared distribution [2, 50], the Poisson distribution [3] and the Gumbel distribution [43]. However, using such distributions is problematic. If the populations are small, the normal distribution cannot
be used. It is often hard to distinguish a lack of fit to the chi-squared distribution from
a genuine deviation from the chi-squared distribution (indicating clustering) [4]. A more robust method of achieving this is to use a Monte Carlo simulation approach to
empirically determine the reference distribution. Methodologically this may be achieved
by simulating a series of maps, in each of which noise is the underlying process. Multiple
Monte Carlo simulations of the data are used to mimic the noise process. If the observed
risk elevation (or some function of the risk value such as the rate) for the area is significantly different from the ones in the simulated maps, then the area is considered to
be a cluster. However, Monte Carlo simulations do not guarantee that spurious clusters
will not be detected. Steenberghen et al. [72] carried out an experiment that illustrates this problem. This is displayed in Fig 1.1. Fig 1.1 is a map in which simulated locations
of traffic accidents (points) were randomly scattered [72] and filtered using 600 meter filters,
the density of points was estimated, the resulting clusters were tested for significance and the level
of significance was displayed (also known as a p-map). If areas which show 0.025 % significance are called clusters, the black shapes in Figure 1.1 are spurious clusters.
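The following sketch illustrates, in a deliberately simplified setting, why a Monte Carlo reference distribution does not by itself prevent spurious clusters. A map of pure noise is generated, each area is tested separately against simulated noise maps, and the areas that pass the test are listed. The setup (equal-sized areas, cases scattered uniformly, a pointwise threshold of 0.025) and all names are assumptions made for illustration; it does not reproduce the experiment of Steenberghen et al.

```python
import random

def simulate_counts(n_cases, n_areas, rng):
    """Scatter cases uniformly at random over equally sized areas (pure noise)."""
    counts = [0] * n_areas
    for _ in range(n_cases):
        counts[rng.randrange(n_areas)] += 1
    return counts

def spurious_flags(n_cases=2000, n_areas=400, n_sim=200, alpha=0.025, seed=1):
    """Indices of areas flagged as 'clusters' on a map that contains only noise."""
    rng = random.Random(seed)
    observed = simulate_counts(n_cases, n_areas, rng)   # the 'observed' map is itself noise
    exceed = [0] * n_areas
    for _ in range(n_sim):
        sim = simulate_counts(n_cases, n_areas, rng)
        for i in range(n_areas):
            if sim[i] >= observed[i]:
                exceed[i] += 1
    # Pointwise Monte Carlo p value for each area, with no multiple testing adjustment.
    p_values = [(e + 1) / (n_sim + 1) for e in exceed]
    return [i for i in range(n_areas) if p_values[i] <= alpha]

flagged = spurious_flags()
print(len(flagged), "areas flagged on a map of pure noise")
```

Because hundreds of areas are tested one at a time, some of them can be expected to cross the threshold by chance alone. Every such area is, by construction, a spurious cluster.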
Some methods attempt to tackle this problem with a combination of both Monte
Carlo and normative statistical techniques. Examples are Duczmal's and Kulldorff's
methods. Duczmal's method [3, 31, 43, 69, 73] (which derives from Kulldorff's method) generates a large number of irregular cluster candidates. For each candidate the rate is
calculated. The rate is then fed into a function known as a likelihood function to yield a
likelihood value of the cluster candidate being a true cluster. This value is divided by
the likelihood of the cluster candidate not being a true cluster. This ratio is known as the
likelihood ratio. The likelihood ratios for all cluster candidates are calculated. The
cluster candidates with the highest ratios are the most likely clusters. Multiple Monte
Carlo simulations are carried out, and the rates at all the candidate clusters calculated.
Again, the rates are fed into the likelihood function, thus generating a reference
distribution of likelihood ratios for each cluster candidate. The likelihood ratio value of
the cluster candidate is compared with the reference distribution to decide if the cluster
candidate is a true cluster. However when Duczmal applied this approach to some of his
data, problems with this approach were dramatically exposed. In one of his studies
Duczmal [31] simulated breast cancer cases and randomly distributed them over 245 counties in New England (Fig 1.2). When he instructed his Simulated Annealing (SA) SaTScan based irregular cluster search algorithm to search for clusters, one of the clusters
that it found was a large and extremely irregular cluster encompassing 122 counties, and
enclosing a large percentage of the randomly scattered cases. This cluster is an example
of a noisy cluster. The noise generating process (random distribution of cases) operated at the scale of 245 counties (aggregated). The shape of the area at which this process operated is the shape of the New England region that we see in Fig 1.2. At this scale and
shape, the process generates noise. However, if this process is studied at the scale of an
aggregation of 122 counties and at the shape that follows the darker (orange if your copy of this document is in color) shaded counties in Figure 1.2, then, a noisy or spurious cluster is generated. It is known that the process that generated this cluster is noise.
This example thus illustrates a situation where spurious clusters are created from a
mismatch between the scale and shape of the process that generates the cluster and the
scale and the shape imposed by the method of analysis. Duczmal [31] noted that this noisy cluster was large in size and extremely irregular in shape. Duczmal [73] suggests that large and irregular clusters like the one found in his study (above) are likely to be spurious. He and some other researchers [36] therefore incorporate a penalty for irregularity of shape in the cluster search algorithm. The extent of this penalty is decided
on a priori knowledge of the shape of the cluster. Therefore, if researchers believe that
the clusters in an area are likely to be circular, they place a high penalty on clusters that
are not circular in shape, and vice versa. The spurious cluster detected by Duczmal's
method and the proposed solution raise some important questions. Is this spurious
cluster, large and irregular with a high risk/rate elevation, a product of his particular
method, or is it possible that if a cluster detection method is given freedom of shape and
size then such clusters are likely to be detected? We note that the shape and size of the
spurious clusters in Fig 1.1 are different from the shape and size of Duczmal's spurious
cluster. Thus not all spurious clusters are large and irregular.
Duczmal's problem has reintroduced the otherwise rarely discussed issue of shape
and size in the disease cluster detection literature [69, 74, 75]. Risk elevation is just one possible characteristic of a cluster. McCullagh [76] states: "In map analysis, features of prime importance may be size, shape, orientation and spacing". It is possible for clusters
of different shapes and sizes to have the same risk elevation. It is also possible for
clusters of the same shape and size to have different risk elevations. The first objective of any cluster search should therefore be to distinguish spurious or noisy clusters from
everything else. The risk or rate value of a possible cluster alone is not sufficient to make
this distinction. The shape and size of the cluster must also be factored in, when
considering if a cluster is a true cluster. Duczmal proposes a solution that makes certain a
priori assumptions about the shape and size of a cluster. This solution is interesting.
However, the problem of spurious clusters may be approached from a different angle.
Instead of asking the question "What is the shape of a true cluster?", which is what these
methods do, and which is hard if not impossible to answer, the
question that should be asked is "What is the shape of a spurious cluster?". Unlike the
first question, this is easier to answer. This is because the shape of a spurious cluster,
unlike that of a true cluster, can be mined a posteriori from the data. To know how this can be
done, we first need to understand how spurious clusters are generated in the first place.
Thus, in the chapter that follows I discuss in depth, the phenomenon of noise and the
creation of spurious clusters.
1.4.4 Identifying spurious clusters and
distinguishing true clusters from spurious
clusters
Spurious clusters enclose noise. Across disciplines noise is defined as "... a
random and unpredictable signal" [77]. By this definition, if the nature of the signal is known, then noise can be detected and filtered out. For example, in a satellite image, it
may be known that certain frequencies are the signal frequencies and therefore a spectral
analysis and subsequent filtering may help remove the undesirable noise. In a satellite
image the signal has a physical existence. For example, infrared radiation emitted by
vegetation can be measured with certain instruments. In contrast, in mapping disease the
signal cannot be physically measured. The signal is conceptual and has to be estimated
from the available data. Some geographers and statisticians attempt to tackle the problem
by developing statistical models that attempt to separate signal from noise [21, 23, 78-80].
Perhaps a better approach to understanding signal and noise in a disease map is to understand the physical process that gives rise to the signal (as in a satellite signal). It is known that in a disease map, the observed patterns are the result of underlying processes.
The observed patterns are patterns obtained from mapping statistical summaries of
disease outcomes. For example, a map of patterns of cholera mortality in England could
be displaying the number of cholera deaths per unit population in each county. The
outcome in this case is cholera mortality which is the outcome of a disease process. Since
cholera is a communicable disease it is possible that the spread of cholera can be modeled
as a contact network process [81]. There exist many other spatially explicit disease processes (see footnote 2). For example, patterns of disease could be the result of processes that reflect
an underlying lack of access to healthcare [10, 56, 82-84]. Whatever the specific process may be, these processes have a common trait in having a spatial form [85], and this means that they predispose some areas of the map to have a greater risk than others.
It is also possible that the underlying process does not cause any region of the
map to have a greater risk than any other. Since a disease case may appear at any point on
the map by random chance, by the earlier definition of noise, this is a noise generating
process. A cluster defined by enclosing some of these disease cases is a spurious cluster.
On any given map disease patterns can be the result of one or more processes. It could be
the result of one process that generates clusters and another process that generates noise.
The challenge therefore, is to distinguish the areas of a pattern that are the result of a
cluster generating process from those that are not. Also, given a disease process that
generates patterns on a map, a number of other factors influence the patterns we
actually observe. Given a cluster generating process, the following factors influence the
pattern that is then extracted:
1. The spatial distribution of the locations of people in the map.
2. The shape and size of the geographic units that are used to aggregate individuals
into discrete small areas.
3. The shape and size of the spatial configuration that the disease mapping or cluster
detection method may impose on the data (in addition to 2).
Understanding these factors is essential to understanding noise and spurious
clusters. I discuss these next.
Footnote 2: It is important to distinguish between a spatially explicit disease process and a
spatial disease process. Some scientists attempt to model diseases as purely spatial processes. Examples of this can be seen in the cellular automata based disease modeling literature. No disease process is purely spatial and therefore such models are misleading.
1.4.4.1 The spatial distribution of the locations
of people in the map
A cluster generating process causes an area of the map to have a greater risk than
other areas of the map. Cluster detection methods seek to estimate the shape, size and risk
elevation of the area of increased risk using the locations of people as proxy sample sites.
A representative spatial sample of the area of risk would be a uniform grid [86]. People are never distributed uniformly over space; instead, a likely spatial distribution consists
of dense settlements interspersed with sparsely populated areas. This creates a challenge
in estimating the true shape of the cluster. As I illustrate in figures 1.3 to 1.11, a
cluster that in reality has a uniform shape may be estimated as having a highly irregular
shape because of the way people are distributed over space [75]. The shape of the actual area of increased risk or true cluster created by the cluster generating process also
influences the shape of the cluster that is finally estimated. If the shape of the true cluster
is highly irregular, it is quite likely that the shape of the cluster that is estimated is also
highly irregular, but the converse may also be true! This is illustrated in figures 1.12 to
1.14. Another phenomenon long observed by geographers is that the same risk process
may give birth to differently shaped clusters in different areas of the map or, in more
general terms, the same cluster generating process may give rise to different patterns
[87]. While the shape of the original area of increased risk or true cluster may be the same in two areas and the spatial distribution of the people may be the same, it is not
necessary that the pattern of people who are diseased (and who are not) will be the same in both areas. This means that the shape of the estimated area of increased risk need not be
the same in both areas. This is further complicated by the fact that people are almost
never distributed similarly over space in two different regions (Figures 1.15 to 1.20).
First, for the purposes of understanding this issue, let us assume the highly
improbable situation that people are uniformly distributed over space. Let the distribution
be over a uniform grid. Figure 1.3 illustrates the situation. Next, let us consider that out
of the 42 people in the region, 10 are afflicted by some disease. However, we assume that
the process that causes disease is a noise generating process. Therefore, we expect
diseased people (or cases) to be randomly distributed over the region among the 42 people, as shown in Figure 1.4. A convex hull boundary of these cases is seen in Figure 1.5. In
contrast, if there is a cluster generating process, we would expect the diseased people to
be clustered together. Figure 1.6 illustrates such a situation. People enclosed within a
dotted area of increased risk are diseased, the risk being 0.24 (the risk in other areas being 0). We observe in Figure 1.6 one realization of the risk process, so 10 people are diseased. Figure 1.7 displays the convex hull boundary of this cluster of diseased
people. The smooth and regular shape of this cluster is in sharp contrast to the irregular
cluster shape that we observe in Figure 1.5. Since it is highly unlikely that people will be
uniformly distributed over space, Figure 1.8 illustrates the more realistic possibility of
people being non-uniformly distributed over space. If the entire geographic area in Figure
1.8 is subject to a risk, we expect some people to become diseased (again, one realization of the process). Figure 1.9 illustrates this and the boundary that demarcates the cluster. The shape of the cluster is very different from what was obtained in Figure 1.5. An
area of increased risk on such a heterogeneously distributed population gives rise to
clusters of unpredictable shapes (Figures 1.10 and 1.11). These examples show how the spatial distribution of the people affects the shape and size of the risk surface detected.
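The uniform-grid example can be sketched in code. What follows is a minimal illustration under stated assumptions, not the procedure used to draw the figures: it assumes a 7 x 6 grid for the 42 people, takes the 10 people nearest a hypothetical centre as the clustered realization, and uses the area of the convex hull as one crude summary of cluster size. All function names, the centre and the random seed are illustrative choices.

```python
import random

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula; returns 0.0 for degenerate hulls."""
    n = len(vertices)
    if n < 3:
        return 0.0
    total = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Assumption: the 42 people sit on a uniform 7 x 6 grid.
people = [(x, y) for x in range(7) for y in range(6)]

def noise_cases(rng, n_cases=10):
    """Noise generating process: every person is equally likely to be a case."""
    return rng.sample(people, n_cases)

def clustered_cases(center=(3.0, 2.5), n_cases=10):
    """Cluster generating process (simplified): the people nearest a
    hypothetical centre are the cases. Ties are broken by list order."""
    def dist2(p):
        return (p[0] - center[0]) ** 2 + (p[1] - center[1]) ** 2
    return sorted(people, key=dist2)[:n_cases]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
noise_areas = [polygon_area(convex_hull(noise_cases(rng))) for _ in range(1000)]
mean_noise_area = sum(noise_areas) / len(noise_areas)
cluster_area = polygon_area(convex_hull(clustered_cases()))

print("mean hull area, noise realizations:", round(mean_noise_area, 2))
print("hull area, clustered realization:", cluster_area)
```

Under these assumptions the hull around randomly scattered cases is, on average, much larger than the compact hull around the clustered cases, which mirrors the contrast between Figures 1.5 and 1.7. Hull area says nothing about irregularity of outline; a shape measure would have to be added to capture that.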
From these examples it may seem that for a given distribution of people over
space, a cluster generating process gives rise to patterns on a map that are regular
compared to the shapes generated by a noise generating process. Indeed, some scientists
use measures of regularity of a cluster's shape to distinguish a "true" cluster from a
spurious cluster [73]. However, people are never distributed uniformly over geographic space. Next, we see how this affects the shape and size of the cluster detected.
In the example I have discussed I assumed that the cluster generating process gives rise to
a very regularly shaped area of increased risk (the area within the dotted line). In reality this may not be true. The area of increased risk may have a very irregular shape. Some
examples of geographic features that can be areas of increased risk are rivers, roads,
underground groundwater streams, plumes of aerial pollution or a combination of some
of these. We therefore observe that the shape and size of a cluster cannot be predicted a
priori and is unique to the risk elevation of the cluster generating process and the spatial
distribution of the people. Another aspect of a cluster generating process is that the same
process can give rise to differently shaped clusters in different regions of the map. This can
happen even if people are uniformly distributed. The examples below illustrate this:
From the discussion and the examples, we can conclude that both the spatial
distribution of people and the shape and size of the area of increased risk have an
important bearing on the shape and size of the cluster that is finally detected. The area of
increased risk or the "true" cluster may have a very different spatial configuration from
the cluster that is detected. Parts of the true cluster may be suppressed or spurious areas
of increased risk may arise. Spurious clusters are created from the method used to
measure the outcome of the process of clustering. By definition, the method uses a scale
and (or) shape of measurement that is dependent on the spatial distribution of people. Since this distribution is not representative of the underlying area of increased risk, there
is a mismatch between the measurement shape/scale and the process shape/scale. While
the above examples are with individual level data, the conclusions drawn can be
generalized to aggregated data. The act of data aggregation itself could introduce noise
over and above the problem of heterogeneously distributed people. This is discussed in
the next section.
1.4.4.2 The scale and spatial configuration
of the geographic units that are used to
aggregate data into discrete small areas
In the geography literature the term "scale" is used to refer to three different kinds
of scales, two of which are of relevance here. The first is the phenomenon scale, or the
scale at which a spatial process operates. The second is the analysis scale, the scale at
which data are aggregated for measurement and analysis [88]. When a phenomenon such as a disease operates at a given scale, its outcome is often registered as heterogeneity in
disease rates at that scale [89]. Geographers have often attempted to find the scale at which a process operates [90]. Two well known methods are the use of spectral analysis [65] and variogram [91] modeling. The latter approach is often used in the health geography literature. Studies in China have shown that esophageal and liver cancers
operate at scales of less than 150 km while stomach cancers operate at scales of less than
90 km [91]. In Sweden substance related disorders operate at scales of less than 3 km [92]. Unfortunately, the scale at which a given process operates is not known in most
geographic studies. A geographer attempts to study a process by collecting and analyzing
spatial data. This process involves analysis through the calculation of statistical
summaries of data aggregated at an appropriate scale. When the process scale is not
known there is every possibility of a mismatch between the process scale and the analysis
scale. This mismatch or misalignment arises from two sources. First, geographic data are
aggregated into discrete units, often for purposes different from the analyses for
which they are being used. These units of aggregation could differ in shape and scale
from the process scale and shape. As Haining [93] states in "Conceptual models of spatial variation": "... This might be referred to as process-induced spatial heterogeneity. This source of heterogeneity may be compounded in the case of regional data by measuring
attributes through spatial units of different size. This might be referred to as
measurement-induced heterogeneity because it is a product of how attributes are
observed and measured." A second source of mismatch is from the spatial structures that a
disease mapping or cluster detection method imposes on the data. For example, spatial
filtering [9, 10] and Spatial Scan Statistic based methods calculate summary statistics by aggregating data along circular filters. In the geography literature the problems that
arise from spatial mismatch are grouped under MAUP or the Modifiable Areal Unit
Problem [91, 94]. MAUP phenomena are in turn grouped into two broad subgroups: the zone effect and the scale effect. The creation of spurious heterogeneity or destruction
of true heterogeneity with changing scales is a manifestation of the scale effect. If the
scale is kept fixed but the shape of the zones of aggregation are changed, then the zone
effect is likely to be seen. Geographic data aggregated to administrative units often
display both the zone and scale effects of MAUP. Aggregating data has a smoothing
effect on disease rates [95], and therefore clusters at scales smaller than the scale of aggregation could be missed when analyses are done using these data. Conversely, if the
scale of aggregation is smaller than the process scale, then noisy clusters could be
detected. A recent study by Ozonoff et al. [19] demonstrated that when individual level data are aggregated and a Spatial Scan Statistic cluster search method is used on the data,
then noise increases with increasing levels of aggregation. Therefore, analysis and
process scales interact in complex ways to create noisy clusters and suppress true clusters.
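The smoothing effect of aggregation can be shown with a small numerical sketch. The setup is hypothetical and deliberately simple: 16 equally populated units along a line, one of which has five times the baseline case count, with fixed case counts rather than random draws. The unit counts, populations and block sizes are illustrative assumptions, not values from any study cited here.

```python
def aggregate_rates(cases, population, block):
    """Merge consecutive units in groups of `block` and return the rate of each group."""
    rates = []
    for start in range(0, len(cases), block):
        c = sum(cases[start:start + block])
        p = sum(population[start:start + block])
        rates.append(c / p)
    return rates

# Hypothetical data: 16 small units of 100 people each, 1 case per unit,
# except one unit with 5 cases (a cluster smaller than any merged block).
population = [100] * 16
cases = [1] * 16
cases[5] = 5

# Highest rate seen on the map at each scale of aggregation.
peak_rate = {
    block: max(aggregate_rates(cases, population, block))
    for block in (1, 4, 8, 16)
}

for block, rate in peak_rate.items():
    print(f"units merged per zone: {block:2d}  peak rate: {rate:.4f}")
```

In this sketch the peak rate falls from 0.05 at the finest scale toward the overall rate of 0.0125 as more units are merged, so a cluster confined to one small unit becomes progressively harder to see. Shifting the block boundaries while keeping the block size fixed would be one way to explore the zone effect in the same setup.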
We can conclude from our discussions above, that a number of complex factors
influence the shape, size and the risk elevation of the clusters that are detected and the
spurious clusters created. These factors are dependent on the spatial distribution of the
people and the process and analysis scales. It is not possible to make a priori assumptions
about these factors, and it is certainly not possible to predict the shape of a noisy cluster a
priori. What approach is then appropriate if the spurious clusters have to be separated
from the true clusters? The section that follows answers this question.
1.4.5 Identifying the "noisy" or spurious
components of the pattern
A reasonable cluster detection technique should take into consideration not only
the risk elevation but also the shape and size of the cluster. I propose a spatially enabled
computational process that uses these attributes of a cluster, to identify the signature of
spurious clusters from patterns on a disease map. Earlier, I introduced the idea that a
pattern is the outcome of a process. Analyzing a pattern or the components of a pattern
such as individual clusters may yield clues about the underlying process. A map of
disease patterns represents one realization of the underlying process. It may not be
possible to draw conclusions on the process that generated the pattern or components of
the pattern by analyzing just one map. However, if multiple maps were available, representing multiple realizations of the process, then analyzing the patterns may yield
clues about the underlying process. A well known example of this approach can be found in
Hagerstrand's classic paper [96], in which he simulates multiple maps assuming an underlying process. He then compares maps of empirical data with the maps that he has
simulated to draw conclusions about the validity with which he represents the process in
his model. Another example can be seen in Diggle [97]. Therefore, if maps were
created using a known process, then analysis of the simulated patterns on the maps would
yield clues on the "signature" of that particular process. Once this "signature" is known,
then the pattern could imply (or not imply) the existence of this process. More specifically, this scheme can help identify a "signature" for spurious clusters. These
signatures can then be used to distinguish clusters that are spurious from clusters that are
"true", in any given pattern of disease risk. Shape, size and risk elevation are part of this
"signature". For example, the signature of spurious clusters in Duczmal's [73] method was that these clusters were large in size and had irregular shapes. The next chapter is
devoted to the method I have developed based on these ideas. The method is first
described, then tested and validated on simulated data.
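The general idea of simulating many maps from a known process to learn its signature can be sketched as follows. This is not the method developed in the next chapter; it is a simplified stand-in under loudly stated assumptions: a hypothetical population of 60 people in two dense settlements plus scattered households, a noise generating process in which each person is diseased independently with a hypothetical risk of 0.24, and circular windows defined by each person's k nearest neighbours. Because the windows are circular, only rate and size are recorded; shape is left out of this sketch.

```python
import random

def make_population(rng, n_dense=25, n_sparse=10):
    """Hypothetical non-uniform population: two settlements plus scattered people."""
    people = []
    for cx, cy in [(2.0, 2.0), (8.0, 7.0)]:
        people += [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for _ in range(n_dense)]
    people += [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(n_sparse)]
    return people

def neighbour_order(people):
    """For each person, indices of all people sorted by distance (self first)."""
    order = []
    for xi, yi in people:
        idx = sorted(
            range(len(people)),
            key=lambda j: (people[j][0] - xi) ** 2 + (people[j][1] - yi) ** 2,
        )
        order.append(idx)
    return order

def simulate_null_max_rates(people, risk, sizes, n_sims, rng):
    """Simulate maps from the noise process and record, for each window size k,
    the highest case rate found in any k-nearest-neighbour window."""
    order = neighbour_order(people)
    max_rates = {k: [] for k in sizes}
    for _ in range(n_sims):
        diseased = [1 if rng.random() < risk else 0 for _ in people]
        for k in sizes:
            best = max(sum(diseased[j] for j in nbrs[:k]) / k for nbrs in order)
            max_rates[k].append(best)
    return max_rates

def percentile(values, q):
    """Simple empirical percentile (no interpolation)."""
    s = sorted(values)
    return s[min(len(s) - 1, int(q * len(s)))]

rng = random.Random(1)  # fixed seed so the sketch is repeatable
people = make_population(rng)
sizes = (5, 10, 20)
max_rates = simulate_null_max_rates(people, risk=0.24, sizes=sizes, n_sims=200, rng=rng)

# Rate a window of each size must exceed to look unusual relative to pure noise.
thresholds = {k: percentile(max_rates[k], 0.95) for k in sizes}
for k in sizes:
    print(f"window size {k:2d}: 95th percentile of spurious peak rate = {thresholds[k]:.2f}")
```

In this sketch small windows routinely reach very high rates by chance alone while large windows rarely do, so the rate that marks a window as unusual depends on its size and on how people are distributed. That dependence is the kind of signature the text describes, here reduced to two of its three dimensions.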
1.4.6 Why use size, shape and rate
The reason I add the dimensions of size and shape, in addition to rate, is to
characterize the reference space in which spurious clusters are located. I know from
theory (as discussed in this chapter) that spurious clusters arise differently to the extent that the numbers of people at risk, in relation to the overall relative risk of the disease,
differ across the space. When people are distributed uniformly in space, the average
number and average size of spurious clusters in that space can be determined from
theory. As Schinazi [98] shows, deterministic statistics can be used to determine the chance of finding a given number of clusters with a rate higher or lower than the expected
rate. However, when people at risk are distributed non-uniformly in space, the equivalent
number is more difficult to determine directly from theory. The same theory still applies;
it is just more difficult to implement in the case of non-uniform distribution of people at risk. For this reason, I use Monte Carlo simulation to discover the rate, size, shape space
in which typical spurious clusters lie, given the particular distribution of people at risk
and the particular overall relative risk of the disease in the study area in question. In his
seminal paper King [85] states The mathematics of stochast