shape and scale in detecting disease clusters

University of Iowa Iowa Research Online Theses and Dissertations 2008 Shape and scale in detecting disease clusters Soumya Mazumdar University of Iowa Copyright 2008 Soumya Mazumdar This dissertation is available at Iowa Research Online: http://ir.uiowa.edu/etd/208 Follow this and additional works at: http://ir.uiowa.edu/etd Part of the Geography Commons Recommended Citation Mazumdar, Soumya. "Shape and scale in detecting disease clusters." PhD (Doctor of Philosophy) thesis, University of Iowa, 2008. http://ir.uiowa.edu/etd/208.

  • University of Iowa

    Iowa Research Online

    Theses and Dissertations

    2008

    Shape and scale in detecting disease clusters

    Soumya Mazumdar

    University of Iowa

    Copyright 2008 Soumya Mazumdar

    This dissertation is available at Iowa Research Online: http://ir.uiowa.edu/etd/208

    Follow this and additional works at: http://ir.uiowa.edu/etd

    Part of the Geography Commons

    Recommended Citation

    Mazumdar, Soumya. "Shape and scale in detecting disease clusters." PhD (Doctor of Philosophy) thesis, University of Iowa, 2008. http://ir.uiowa.edu/etd/208.

  • 1

    SHAPE AND SCALE IN DETECTING DISEASE CLUSTERS

    by

    Soumya Mazumdar

    An Abstract

    Of a thesis submitted in partial fulfillment of the requirements for the Doctor of

    Philosophy degree in Geography in the Graduate College of

    The University of Iowa

    December 2008

    Thesis Supervisor: Professor Gerard Rushton

  • 1

    ABSTRACT

    This dissertation offers a new cluster detection method. This method looks at the

    cluster detection problem from a new perspective. I change the question of "What do real

    clusters look like?" to the questions of "What do spurious clusters look like?" and "How

    do spurious clusters affect the ability to recover real clusters?" Spurious clusters can be

    identified from their geographical characteristics. These are related to the spatial

    distribution of people at risk, the shape and scale of the geographic units used to

    aggregate these people, the shape and scale of the spatial configurations that the disease

    mapping or cluster detection method may impose on the data and the shape and scale of

    the area of increased risk. The statistical testing process may also create spurious clusters.

    I propose that the problem of spurious clusters can be resolved using a computational

    geographic approach. I argue that Monte Carlo simulations can be used to estimate the

    patterns of spurious clusters in any situation of interest given knowledge of the first three

    of these four determinants of spurious clusters. Then, given these determinants, where

    real measurements of disease or mortality are known, it is possible to show those areas of

    increased risk that are true clusters as opposed to those that are spurious clusters. This

    distinction is made in a three dimensional signature space, with shape, size and rate as the

    three axes. The extent of similarity (or dissimilarity) of a cluster to the simulated spurious cluster influences whether it can be recovered. These experiments show that this method

    is successful in detecting clusters. This method can also predict with reasonable certainty

    which clusters can be recovered, and which cannot. I compare this method with

    Rogerson's Score statistic method. These comparisons expose the weaknesses of

    Rogerson's method. Finally, these two methods and the Spatial Scan Statistic are applied

    to searching for possible clusters of prostate cancer incidence in Iowa. The implications

    of the findings are discussed.

  • 2

    Abstract Approved: ___________________________________ Thesis Supervisor

    ___________________________________

    Title and Department

    ___________________________________

    Date

  • SHAPE AND SCALE IN DETECTING DISEASE CLUSTERS

    by

    Soumya Mazumdar

    A thesis submitted in partial fulfillment of the requirements for the Doctor of

    Philosophy degree in Geography in the Graduate College of

    The University of Iowa

    December 2008

    Thesis Supervisor: Professor Gerard Rushton

  • Graduate College The University of Iowa

    Iowa City, Iowa

    CERTIFICATE OF APPROVAL

    _______________________

    PH.D. THESIS

    _______________

    This is to certify that the Ph.D. thesis of

    Soumya Mazumdar

    has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Geography at the December 2008 graduation.

    Thesis Committee: ___________________________________ Gerard Rushton, Thesis Supervisor

    ___________________________________

    David Bennett

    ___________________________________

    Naresh Kumar

    ___________________________________

    Marc Linderman

    ___________________________________

    Dale Zimmerman

  • ii

    ACKNOWLEDGMENTS

    I would like to acknowledge the help I have received during the course of my stay

    in Iowa. I would like to thank Dr Rushton for supervising my research. I would also like

    to thank my committee members for their contributions. The last four years of my life

    have been emotionally challenging for me. I thank the great masters before us who have

    helped me through. I am thankful to the writings of M. Scott Peck, Viktor Frankl, Swami

    Vivekananda, and the yogic practices of Sri Sri Ravishankar @ Art of Living Foundation.

    I would also like to thank my family members, especially my mom, mishtimashi and late

    Dr Mazumdar for their support. Thanks are also due to all my friends and well wishers.

  • iii

    TABLE OF CONTENTS

    LIST OF TABLES ---------- v

    LIST OF FIGURES ---------- vi

    CHAPTER

    1. DETECTING CLUSTERS OF DISEASE: INVESTIGATING SPURIOUS CLUSTERS ---------- 1

    1.1 Statement of Purpose ---------- 1
    1.2 Introduction ---------- 2
    1.3 Organization of the dissertation ---------- 7
    1.4 Review of existing methods of cluster detection ---------- 7

    1.4.1 Map data without further geographic processing ---------- 9
    1.4.1.1 Methods that do not smooth the data ---------- 10
    1.4.1.2 Methods that smooth the data ---------- 10

    1.4.2 Methods that pre-process the data before calculating and/or testing for significant disease risk ---------- 12
    1.4.2.1 Non-combinatorial approaches ---------- 13
    1.4.2.2 Combinatorial approaches ---------- 17
    1.4.2.3 Hybrid approaches ---------- 18

    1.4.3 Significance testing and spurious clusters ---------- 19
    1.4.4 Identifying spurious clusters and distinguishing true clusters from spurious clusters ---------- 22
    1.4.4.1 The spatial distribution of the locations of people in the map ---------- 24
    1.4.4.2 The scale and spatial configuration of the geographic units that are used to aggregate data into discrete small areas ---------- 27

    1.4.5 Identifying spurious clusters and distinguishing true clusters from spurious clusters ---------- 29
    1.4.6 Why use size, shape and rate ---------- 30

    2. THE SHAPE SIZE SENSITIVE (S.S.S) METHOD FOR DETECTING DISEASE CLUSTERS ---------- 55

    2.1 Theoretical foundations of the S.S.S method ---------- 55
    2.2 Hypothesis testing ---------- 60
    2.3 The simulated dataset ---------- 65

    2.3.1 Hypothetical study area and population ---------- 65
    2.3.2 Hypothetical case population ---------- 66
    2.3.3 Datasets under the null hypothesis of no clustering ---------- 66
    2.3.4 Extracting the cluster candidates ---------- 68
    2.3.5 Datasets under the alternative hypothesis of clustering ---------- 69

  • iv

    2.3.5.1 Rationale behind the choice of these configurations of synthetic clusters ---------- 69

    2.4 Rogerson's Score Statistic ---------- 73
    2.4.1 Theory ---------- 73
    2.5 Diagnostics ---------- 75
    2.6 Computational Scheme ---------- 76
    2.7 Results ---------- 77
    2.8 Discussions and future directions ---------- 81

    3. INVESTIGATING THE SPATIAL PATTERNS OF PROSTATE CANCER IN IOWA ---------- 109

    3.1 Background ---------- 109
    3.2 Methods ---------- 111
    3.3 Results ---------- 115
    3.4 Discussion ---------- 119
    3.5 Conclusion ---------- 120
    3.6 Contribution that this dissertation makes to the geography literature ---------- 120

    REFERENCES ---------- 135

  • v

    LIST OF TABLES

    Table

    2.1 Hold one validation for null hypothesis. ---------- 102

    2.2 Hold one validation for alternative hypothesis. ---------- 102

    2.3 Summary statistics of the simulated 3675 spurious clusters. ---------- 103

    2.4 Shape, size, risk (signature) and the ability to recover simulated clusters. ---------- 104

    2.5 The table illustrates the average sensitivity (ability to detect a cluster when it exists) and specificity (ability to classify an area that is not a cluster as such). ---------- 105

    2.6 This table compares the sensitivity and specificity with which clusters are recovered by the S.S.S method and Rogerson's method. The higher the sensitivity, the better the cluster is recovered. ---------- 106

    2.7 Cluster recovery using only rates and only shapes. ---------- 107

    2.8 How do true clusters differ in shape and size from spurious clusters. ---------- 108

  • vi

    LIST OF FIGURES

    Figure

    1.1 This figure displays the statistical significance of accidents per square kilometer (a p-map over densities), where accidents have been randomly scattered across the study area. A 30 meter grid was laid over the entire study area and a 600 meter filter was used to estimate the accident densities. The black areas are significant noisy clusters. ---------- 35

    1.2 This figure displays a spurious cluster detected by Duczmal's Simulated Annealing based SaTScan method. This cluster has a high, statistically significant likelihood value. ---------- 36

    1.3 In the geographic area, 42 people are distributed over a uniform grid. Each circle represents an individual. They are color coded white to indicate that they are healthy. ---------- 37

    1.4 A noise or spurious cluster generating process operates at the scale of the entire geographical area. No person is at a greater risk of disease than any other. All people are at a risk of 0.24. Diseased people are randomly distributed over the map. These diseased people are color coded black to indicate a diseased state. ---------- 38

    1.5 A boundary is drawn around those people who are diseased. This represents our gerrymandered cluster. Note the highly irregular and large shape of the cluster. ---------- 39

    1.6 In contrast to Figure 1.4, a cluster generating process operates on this geographic area. The cluster generating process predisposes the people living in the area bounded by the dotted lines to a greater risk than other areas of the map. These people are at a risk of 0.56. In one realization of the process, a cluster of 10 people is therefore diseased in this area. ---------- 40

    1.7 The cluster is then enclosed within a boundary. Note the relatively regular shape of the cluster (compared to a random distribution of diseased people). ---------- 41

    1.8 People are distributed non-uniformly over space. ---------- 42

    1.9 The entire geographic space is subject to the same risk (0.24) noise generating process. The resulting 10 diseased people and the gerrymandered cluster are shown. ---------- 43

    1.10 The cluster generating process in Figure 1.6 operates on the inhomogeneously distributed population. The risk elevation is the same as in Figure 1.6 (0.56). This causes 8 people to fall ill from an at-risk population of 14. ---------- 44

  • vii

    1.11 The estimated cluster shape and size is very different from what the shape and size of the cluster is in reality (the dotted line in Figure 1.10). It is also very different from what was obtained for a homogeneous distribution of people in Figure 1.6. ---------- 45

    1.12 Now a cluster generating process operates on this space. The white river within the dotted lines is the area of excess risk. People living within this area are at an excess risk of disease. ---------- 46

    1.13 Assuming an inhomogeneous distribution of people as in Figure 1.8 and a risk elevation of 0.71, we see that a certain number of people (10) within the area of excess risk are diseased. ---------- 47

    1.14 The gerrymandered cluster now encloses the diseased people. Note the highly irregular and large shape of this cluster. ---------- 48

    1.15 Two cluster generating processes of circular shape and risk elevation of 0.75 operate on a homogeneous distribution of people. ---------- 49

    1.16 The clusters that are estimated from this have the same triangular shape. This is highly unlikely in reality. ---------- 50

    1.17 In this example a slightly larger area of increased risk is considered than in the earlier example. 6 people in each of the two clusters are subject to a risk of 0.5, which results in 3 of them becoming cases (falling ill). ---------- 51

    1.18 The clusters that are generated have very different shapes. In fact, the larger the area of increased risk, the greater the number of possible shapes and sizes of the estimated cluster. ---------- 52

    1.19 In this example people are inhomogeneously distributed. The same cluster generating process as in Figure 1.15 gives rise to two circular areas of increased risk where the risk elevation is 0.5. ---------- 53

    1.20 The two clusters generated have very different shapes. There is no configuration of cases within the clusters for which two estimated clusters could have the same shape. ---------- 54

    2.1 Using echelons to extract cluster candidates. ---------- 87

    2.2 A set of 50,000 cardiovascular disease mortality cases are randomly distributed by population weights to each of 942 ZCTAs in the state of Iowa. A pattern is then extracted using Spatial Filtering. The pattern is binarized, and the resulting polygon cluster candidates are extracted using a GIS. ---------- 88

    2.3 An example set of spurious cluster signatures S(ZN) in signature space. ---------- 89

    2.4 An example set of spurious cluster signatures S(ZN) in signature space with a few candidate clusters (grey squares). ---------- 90

    2.5 Bounding rectangle for elliptical footprint. ---------- 91

  • viii

    2.6 Flowchart of the S.S.S method. ---------- 92

    2.7 Population distribution of ZCTAs in Iowa, 2000. ---------- 93

    2.8 This figure displays the computational process used to create the simulated dataset. Each bin is labeled as k and has a specific size. For the simulations in this research n=942. ---------- 93

    2.9 The simulated datasets follow a multinomial distribution. ---------- 94

    2.10 Summary of shapes of simulated spurious clusters, frequency and cumulative frequency. ---------- 95

    2.11 Summary of sizes of simulated spurious clusters, frequency and cumulative frequency. ---------- 96

    2.12 Summary of rates of simulated spurious clusters, frequency and cumulative frequency. ---------- 97

    2.13 Characteristics of the four clusters simulated under the alternative hypothesis. ---------- 98

    2.14 Cluster detection diagnostics (the key to the numbers is in the text). ---------- 99

    2.15 Patterns detected by the Score statistic and the S.S.S method for one dataset among 20 datasets simulated for cluster-4. The true cluster pattern can be seen in the inset. In this particular dataset S.S.S is able to identify 62% of the true cluster pattern, while the Score statistic is able to identify 20%. ---------- 100

    2.16 Patterns detected by the Score statistic and the S.S.S method for one dataset among 20 datasets simulated for cluster-3. The true cluster pattern can be seen in the inset. In this particular dataset S.S.S is able to identify 98% of the true cluster pattern, while the Score statistic is able to identify 91%. ---------- 101

    3.1 Spatial patterns of prostate cancer incidence (1999-2004) in Iowa. ---------- 123

    3.2 Cluster of prostate cancer incidence in Iowa, detected by the S.S.S method. ---------- 124

    3.3 Cluster detected by SaTScan when the geometry of the cluster is assumed to be ellipsoidal. ---------- 125

    3.4 Cluster detected by SaTScan when the geometry of the cluster is assumed to be circular. ---------- 126

    3.5 Large secondary cluster with low elevation in risk detected by Kulldorff's SaTScan when the geometry of the cluster is assumed to be elliptical. ---------- 127

    3.6 ZCTAs in Iowa with a significant value of Rogerson's Score statistic. ---------- 128

  • ix

    3.7 Expected number of cases in ZCTAs: Entire Iowa versus areas with a significant value of Rogerson's Score statistic. ---------- 129

    3.8 ZCTAs in the North West Iowa cluster of high prostate cancer incidence. ---------- 130

    3.9 County boundaries with ZCTAs in the North West Iowa cluster of high prostate cancer incidence. ---------- 131

    3.10 Change in mortality and incidence rates from 1990-2004 in five counties (Dickinson, Clay, Buena-Vista, Emmet and Clay Counties) in the cluster. The expected counts for the particular year (1990, 1991, ..., 2000) are calculated using 2000 census population for the local area, and incidence/mortality information for the state of Iowa (same procedure as indirect standardization). ---------- 132

    3.11 Variations in the directly standardized incidence and mortality rate in Iowa, and incidence of prostate cancer in Dickinson County for the years 1990-2004. ---------- 133

    3.12 Variations in the directly standardized incidence and mortality rate in Iowa, and incidence of prostate cancer in Clay County for the years 1990-2004. ---------- 134

  • 1

    CHAPTER 1: DETECTING CLUSTERS OF DISEASE: INVESTIGATING

    SPURIOUS CLUSTERS

    1.1 Statement of Purpose

    This dissertation offers a new cluster detection method. This method looks at the

    cluster detection problem from a new perspective. I change the question of "What do real

    clusters look like?" to the questions of "What do spurious clusters look like?" and "How

    do spurious clusters affect the ability to recover real clusters?" Spurious clusters can be

    identified from their geographical characteristics. These are related to the spatial

    distribution of people at risk, the shape and scale of the geographic units used to

    aggregate these people, the shape and scale of the spatial configurations that the disease

    mapping or cluster detection method may impose on the data and the shape and scale of

    the area of increased risk. The statistical testing process may also create spurious clusters.

    I propose that the problem of spurious clusters can be resolved using a computational

    geographic [1] approach. I argue that Monte Carlo simulations can be used to estimate the patterns of spurious clusters in any situation of interest given knowledge of the first

    three of these four determinants of spurious clusters. Then, given these determinants,

    where real measurements of disease or mortality are known, it is possible to show those

    areas of increased risk that are true clusters as opposed to those that are spurious clusters.

    The extent of similarity (or dissimilarity) of a cluster to the simulated spurious cluster influences whether it can be recovered. These experiments show that this method is

    successful in detecting clusters. This method can also predict with reasonable certainty

    which clusters can be recovered, and which cannot. I compare this method with

    Rogerson's Score statistic method [2]. These comparisons expose the weaknesses of Rogerson's method. Finally, these two methods and the Spatial Scan Statistic [3] are

  • 2

    applied to searching for possible clusters of prostate cancer incidence in Iowa. The

    implications of the findings are discussed.
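The Monte Carlo idea mentioned above can be illustrated with a minimal sketch. It is not the dissertation's implementation: it assumes a hypothetical study area of ten areas, allocates a fixed number of cases to areas in proportion to population (a multinomial draw, which is how the list of figures describes the simulated datasets), and then looks at how high the observed/expected ratio of each area can get purely by chance. The function names, population sizes and case total are assumptions made for illustration.

```python
import numpy as np

def simulate_null_counts(populations, total_cases, n_sims, seed=0):
    """Allocate total_cases to areas in proportion to population, n_sims times."""
    rng = np.random.default_rng(seed)
    populations = np.asarray(populations, dtype=float)
    weights = populations / populations.sum()
    # Each row is one simulated map under the null hypothesis of no clustering.
    return rng.multinomial(total_cases, weights, size=n_sims)

def null_rate_ratios(populations, total_cases, n_sims, seed=0):
    """Observed/expected ratio for every area in every simulated map."""
    populations = np.asarray(populations, dtype=float)
    counts = simulate_null_counts(populations, total_cases, n_sims, seed)
    expected = total_cases * populations / populations.sum()
    return counts / expected

# Hypothetical study area (assumed values): 5 small areas and 5 large ones.
populations = [200] * 5 + [20000] * 5
ratios = null_rate_ratios(populations, total_cases=1000, n_sims=2000)

# Per-area 95th percentile of the ratio when no real cluster exists.
upper = np.quantile(ratios, 0.95, axis=0)
print(upper.round(2))
```

In this sketch the small areas reach much higher ratios than the large ones even though every person has the same risk, which is one way a spurious pattern can arise from the distribution of people alone.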

    1.2 Introduction

    Disease mapping has a long history. From John Snow's

    cholera map to the "intelligent agents" [4] of the present century, disease mapping has progressed with developments in science, especially Geographical Information Systems

    (GIS) and epidemiology. Some of the first disease maps were simple dot maps indicating the location of disease cases. These gave way to maps of statistical summaries

    known as "thematic maps". These maps convey more information than simple dot maps

    and are therefore powerful exploratory and decision making tools. For example, when

    mortality maps of lung cancer for the United States were made in the 1960s, high rates

    were found in areas of the Eastern Seaboard [5, 6]. Later, these high rates were attributed to exposure to asbestos among shipyard workers in these areas. A disease map can thus

    be used to map spatial variations in disease risk. A decision maker can ask "Is a person

    living in a given area at a greater risk of disease than a person living in another area?" or

    "In which areas of the map do people have the greatest risk of disease?" In the disease

    mapping literature the problem of finding areas of excess risk is often called "cluster

    detection", a cluster being defined as "A geographically bounded group of occurrences of

    sufficient size and concentration to be unlikely to have occurred by chance" [7] or, in plain English, a geographic area of high disease risk. A geographical cluster is therefore

    spatially analogous to statistical clustering [8], where the question of interest is finding things that are near in statistical space instead of geographical space.

    While investigating the causal factors (or etiology) of areas of increased risk is important, there are other important applications of these methods. Public health agencies

    are often interested in allocating resources to areas with an increased burden of disease

    [9, 10]. Cluster detection methods are used to identify areas with increased burden of

  • 3

    disease. Sometimes, environmental policy is formulated on the basis of such studies. In

    one instance, the Vatican was taken to task for operating radio transmitters at illegal

    frequencies after studies showed an increased risk of cancer among people living close to

    these transmitters [11, 12]. Note that policies are often formulated on the basis of evidence that an increased risk exists even though the etiological basis for the increased

    risk may not have been established. An interesting extension to etiological research is that

    the presence of spatial clusters of increased risk could also be used to prove the existence

    of disease risk factors that are spatially non random. For example, it has been claimed

    that clusters of autism in California prove the existence of risk factors that are not related

    to genetics or the vaccine hypothesis¹ (barring selective migration) [13]. Many public health agencies maintain "on the fly" cluster investigation infrastructure to address

    cluster related enquiries [14].

    A number of methods exist that can be used to delineate clusters. A persistent

    problem with many of these methods is that areas that are not at high risk are

    identified as clusters. Some convenient terms for such false positives are "noise" [15], "noisy clusters" or "spurious clusters" [16-19]. In this research I develop a method to detect and adjust for the occurrence of spurious clusters in cluster detection studies.

    The cluster detection literature identifies at least three types of spurious clusters.

    The first arises when the estimate of risk in an area is based on a small number of people

    [15]. These estimates of risk are unreliable and therefore the area may not have a significant excess risk. A number of solutions exist to solve this problem [20-26]. The second type of spurious cluster stems from statistical issues in the cluster detection

    method. For example, failing to adjust for multiple hypothesis testing may give rise to spurious clusters [18, 27]. This problem is an area of active research [28].

    ¹ The vaccine hypothesis is that exposure to Thimerosal, a mercury based additive in

    vaccines, is a risk factor for autism.

  • 4

    Kulldorff's SaTScan method resolves this problem by adopting a likelihood based

    hypothesis testing framework [3].

    The third type of spurious cluster is created by a mismatch between the scale and spatial

    structure of the process that generates the cluster and the scale and spatial structure used

    to measure the process. The scale and spatial structure or spatial form of the cluster

    search process (which measures or samples the underlying data) can generate spurious clusters. Unlike the other sources of spurious clusters, very little research exists on this

    form of noise. There are a number of reasons for this. Until recently, the computational

    power available to researchers for cluster detection problems was limited. A cluster can

    have any geometry or spatial form in reality. However, a limited amount of computational

    power confined researchers to searching for clusters within a small range of spatial forms.

    For instance, it is a common strategy to search for circular clusters. This strategy was

    adopted by some of the first cluster search methods [27], and remains common today [29]. If the real cluster is not circular in shape, then the power to detect it is greatly reduced. But a limited search also implies that the likelihood of

    mismatch between the circles and the underlying true cluster is also limited (given that the spatial form of this true cluster is unknown). In contrast, if the cluster search incorporates a number of different spatial forms, then the likelihood of mismatch

    increases. Since computational power is no longer a limiting factor, some researchers

    have developed "shape free" disease cluster detection methods. These methods, which draw

    from the work of geographers in the 1960s and 70s [30], measure spatial attributes (like disease counts or rates) at a large number of possible shapes, sizes and scales. The measured spatial attributes or some functions of the attributes are used to decide if an

    area of a given shape and size at a given scale is a cluster or not. For example, Duczmal's

    [31] scan assigns a likelihood value to each cluster it finds, where the likelihood is a function of attributes such as the observed number of cases in the cluster. The candidates

    with the highest likelihood are most likely to be clusters. These methods thus promise to

  • 5

    seek out the true clusters, no matter what their spatial form. However, this also means

    that at some shape and scale, noise or spurious clusters will be detected. These spatial

    forms will represent a mismatch between the shape and scale of the process that

    generated the cluster and the shape and scale of the process being used to detect it. The

    closest analogy is what is known in the disease

    mapping literature as the "Texas Sharpshooter Effect". If a shotgun is fired at a wall,

    then the wall is splattered with seemingly random bullet holes. At the scale of the wall,

    the process is random. However, it is always possible to draw targets a posteriori around

    the bullet holes. The act of drawing a target is similar to searching for a cluster at a scale

    different from the scale at which the original process occurred (the entire wall). Duczmal's search procedure thus often finds clusters that are spurious. Such spurious

    clusters will be found by any method that offers even the least amount of geometric freedom to

    the cluster search. In fact, these spurious clusters have even been found when the search

    is limited to circular geometries (for example, see Kulldorff [32]). Tackling this problem therefore requires a) a thorough understanding of what gives rise to these spurious clusters and b) a method to solve or, at the very least, manage this problem. This dissertation is an attempt at this.
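The sharpshooter effect can be demonstrated with a small simulation. The sketch below is an illustration under stated assumptions, not a method from the dissertation: it places 42 people on a hypothetical 6 x 7 grid, gives every person the same risk of 0.24 (the values used in the captions of Figures 1.3 and 1.4), and then "draws the target afterwards" by scanning every 2 x 2 window and keeping the one with the highest case rate. The grid layout, window size and function name are assumptions.

```python
import numpy as np

def max_window_rate(cases, window):
    """Highest case rate over all window x window blocks of a 0/1 grid."""
    rows, cols = cases.shape
    best = 0.0
    for r in range(rows - window + 1):
        for c in range(cols - window + 1):
            best = max(best, cases[r:r + window, c:c + window].mean())
    return best

rng = np.random.default_rng(1)
risk = 0.24            # uniform risk: no real cluster exists anywhere
grid_shape = (6, 7)    # assumed layout of 42 people, one per cell
sims = 1000

window_max = np.empty(sims)
overall = np.empty(sims)
for i in range(sims):
    # One random map: each person is diseased with the same probability.
    cases = (rng.random(grid_shape) < risk).astype(int)
    overall[i] = cases.mean()
    window_max[i] = max_window_rate(cases, window=2)

print("mean map-wide rate:", overall.mean().round(2))
print("mean rate in best 2x2 window:", window_max.mean().round(2))
```

Averaged over the simulated maps, the best window shows a much higher rate than the map as a whole, even though the generating process is random at the scale of the entire map. Allowing more window shapes and sizes would only widen this gap.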

    It is clear that an understanding of this problem requires an understanding of the scale

    and shape of the spurious cluster or noise generating process. The shape, size and risk

    elevation of a cluster, whether spurious or real, are unique to each and every disease

    mapping/cluster detection situation. The characteristics (shape, size and risk elevation) of a cluster depend on: a) the cluster generating process, especially the shape and size of the area of excess risk, b) the spatial distribution of people over space and c) the scale at which the spatial data are aggregated [19]. These factors are unique to each disease mapping situation, and these factors are responsible for creating spurious

    clusters. Given these facts, two take-home points follow: 1) every disease mapping situation has a unique noise or spurious cluster signature and 2) it is not possible to

  • 6

    guess this signature a-priori. However this signature may be computed as explained

    below.

    Since each disease mapping situation has a unique noise or spurious cluster signature, it follows that in every disease mapping situation some clusters will be hard to detect. These clusters will in some ways resemble the spurious or noisy clusters. This issue, the issue of recoverability, has only recently begun to be discussed in the disease mapping literature [33, 34]. The method I describe incorporates the following features. First, it extracts cluster candidates using an exploratory approach. Second, it uses shape, size and rate to distinguish true clusters from spurious clusters. Third, it incorporates the recoverability of clusters into the analysis: the researcher is able to know (computationally) a priori what spatial forms of clusters are recoverable. The method utilizes computational geography and two fundamental geographic aspects of clusters, shape and size, to analyze the recoverability of clusters and to separate true clusters from spurious clusters. This dissertation diverges from the traditional disease clustering literature in taking shape and size into consideration. Traditionally, only the rate at a given location, or some function of the rate, is used to separate a true cluster from a spurious one. Since the method incorporates the shape and size of the cluster in its analysis, I call it the Shape, Size Sensitive disease cluster detection method, or the S.S.S method. The S.S.S method is tested and validated on simulated data. This method demonstrates the power of computational geography over traditional methods [35]. The ideas and methods developed and tested in this dissertation are either new or have been discussed only in scant detail in the literature. Yet they are fundamental to geography and disease mapping. This research thus makes an important contribution to the disease mapping literature.


    1.3 Organization of the dissertation

    In this chapter (Chapter 1) I discuss how various disease mapping and cluster detection techniques approach the problem of spurious clusters. I then argue that these methods do not address the issue of spurious clusters adequately. I suggest that a geographical approach can help us better understand the problem and explain how geography gives rise to spurious clusters. Then, having understood the geographical bases for spurious clusters, I propose a geographically sensitive disease cluster detection method. I explain this method, the Shape Size Sensitive (S.S.S) method, in Chapter 2. Then, using simulated data, I test the sensitivity of this method. I also compare the performance of the S.S.S method with Rogerson's Score statistic method for detecting disease clusters. The final, short chapter is Chapter 3. Here I use the S.S.S method, Rogerson's Score statistic and Kulldorff's Spatial Scan Statistic to investigate the spatial patterns of prostate cancer risk in Iowa. The implications of the findings are discussed.

    1.4 Review of existing methods of cluster detection

    All disease mapping and cluster detection approaches share a common goal: to uncover the underlying pattern of risk. These methods calculate statistics, such as rates or likelihoods, which serve as measures of risk. The "patterns" on a map are obtained by mapping either these statistics or those areas that cross some threshold of the calculated statistic. When the second procedure is followed, that is, when the rate or the likelihood of an area having an excess risk is statistically tested, the method is often called a cluster detection method. Most cluster detection methods test a large number of areas which could possibly be clusters. These are called "candidate clusters" [31, 36] or "cluster candidates". If a cluster passes the statistical test but demarcates an area where no cluster exists in reality, then it is a "noisy cluster" [31] or "spurious cluster" [16-19]. The term "true cluster" may be used to indicate geographic areas of excess risk. It is also


    possible that a true cluster is suppressed by the cluster detection process. In the disease

    cluster detection literature this problem is usually not discussed separately, but forms an

    integral part of the spurious cluster detection problem. Spurious clusters may be created

    at various stages in the disease mapping/cluster detection process. The first step in applying a cluster detection method is to collect spatial data. These data may come pre-aggregated into administrative regions, or they may come in individual form [37, 38]. If the data are in individual form, they need to be processed and aggregated so that summary statistics may be gleaned from them and mapped. The process of aggregation may create spurious clusters. One solution is to use the individual level data to search for clusters [39]. While a number of methods will work with both aggregated and individual level data, very few methods have been developed exclusively for individual level data [40, 41]. With better quality data increasingly available, such analyses will become more common [37, 42]. The majority of disease mapping situations start with aggregated data, and summary statistics are calculated from these datasets. When the summary statistics are based on a small base population (also called a small support size), these statistical estimates are likely to be unreliable. This is the "small number problem". Some methods carry out a process called "smoothing", where information from neighboring regions is used to obtain a better estimate of the mapped statistic for a given region. This, to some extent, alleviates the problem of spurious clusters created from small numbers. The statistical testing procedure could also create spurious clusters. If multiple hypothesis tests are carried out without adjustment, this process may also give rise to spurious clusters. In a famous example, Openshaw [27] carried out multiple hypothesis tests when searching for leukemia clusters in Northern England. Whenever a test was significant, a circle was drawn. Some of these circles were spurious clusters and would not have existed if adjustments for multiple testing had been carried out. Sometimes, using the wrong reference distribution may also create spurious clusters. Conversely, using overly conservative


    multiple testing correction techniques may suppress true clusters [28]. Waller and Gotway [4] write of situations where, for a Poisson reference distribution, it is not possible to distinguish a lack of fit to the Poisson distribution (spurious cluster) from a rejection of the null hypothesis (true cluster). This is an area of active statistical research, and some new and innovative solutions have been proposed to these problems [43, 44]. Kulldorff's SaTScan method uses a likelihood based hypothesis testing framework to solve the problem of multiple testing [3]. Instead of testing multiple hypotheses, this method tests only one hypothesis. This hypothesis test is carried out on the cluster candidate that is most likely to be a cluster. The "likelihood" is a statistical function that is calculated under the assumption that the observed data conform to certain known distributions (e.g., Poisson or binomial).

    There still remains a third source of spurious clusters. Unlike the first two, there is little research on this source. This is when spurious clusters are created from a mismatch between the process that generates the disease map patterns and the processes used to recover the patterns. This mismatch could arise when the data are aggregated to administrative regions, or to other shapes and scales, by the method of analysis. In this section I discuss the various cluster detection methods in the context of their ability to handle this problem. Among the various methods available, some offer the opportunity of multiscalar analysis. In these methods, the data may be geographically rescaled. While these methods geographically process the data before mapping patterns, other methods treat geographic boundaries as inviolable. The latter attempt to expose the underlying risk pattern by mapping summary statistics within existing geographic boundaries without any further geographic processing of the data.


    1.4.1 Map data without further geographic processing

    In these methods the geographic boundaries of regions are left as they are; however, various statistical manipulations are carried out on the data. Some researchers prefer to call this group of methods disease mapping methods [45]. As discussed earlier, these methods can be further subdivided into two groups: methods that smooth the data and methods that do not.

    1.4.1.1 Methods that do not smooth the data

    The vast majority of disease maps are maps of raw rates, where the number of cases per unit population within existing geographic regions such as counties or states is mapped [46]. Another approach is a "map of probabilities" [47, 48], where instead of mapping a rate, the probability of observing the rate within a geographic region is mapped. Mapping raw rates is often problematic when the rates are based on small base populations [15]. The maps thus produced are likely to display noisy (small number problem) patterns.
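
    The small number problem can be illustrated with a minimal sketch. It assumes, purely for illustration, that the number of cases in a region is binomially distributed, so that the standard error of a raw rate shrinks with the square root of the base population. The function name and the figure of 150 cases per 100,000 people are hypothetical and are not taken from the studies cited above.

```python
import math


def rate_standard_error(rate, population):
    """Binomial standard error of an observed rate for a given base population."""
    return math.sqrt(rate * (1.0 - rate) / population)


# Hypothetical underlying risk: 150 cases per 100,000 people.
true_rate = 150 / 100_000

for population in (500, 5_000, 500_000):
    se = rate_standard_error(true_rate, population)
    # Relative error: the standard error expressed as a fraction of the rate itself.
    print(f"population={population:>7}  relative SE={se / true_rate:.2f}")
```

    Under this assumption, the same underlying risk produces far less stable raw rates in sparsely populated regions, which is why maps of raw rates tend to highlight small areas.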

    1.4.1.2 Methods that smooth the data

    In these methods various statistical manipulations are used to "smooth" the rates in each region while keeping the geographic boundaries intact. Information from neighboring regions is used to stabilize the rates in a given region. Some examples of this approach can be found in the Bayesian disease mapping literature [23, 24]. Other examples are the method of moving averages and headbanging [20, 22]. These methods are not very successful in dealing with the problem of spurious clusters. A study by Kafadar [22] has shown that many of the popular smoothers, such as headbanging and empirical Bayes, are unable to detect true patterns in the data or are prone to detecting spurious patterns or clusters. Some of the methods smooth the data


    by averaging rates over kernels or filters. For example, Sabel et al. [49] investigate rates of Amyotrophic Lateral Sclerosis (Lou Gehrig's disease) incidence in Finland by smoothing rates using Gaussian kernels. Another method is Rogerson's Local Score statistic [2, 4, 50]. In this method the deviations from the expected rate are smoothed using Gaussian kernels. As with other methods, if the rates are based on small numbers, then smoothing these unreliable rates may create spurious clusters. I use Rogerson's Score statistic in my research, and this method is therefore discussed in detail in later sections. Spurious clusters are often created by these methods. First, because these methods calculate rates for small areas before smoothing them, they are prone to the small number problem. Second, these methods do not in any way attempt to deal with the problem of spurious clusters from spatial mismatch discussed earlier. Third, the statistical tests that these methods carry out may not be able to distinguish spurious clusters from true clusters. For example, there is no consensus on what the correct reference distribution is for Rogerson's Score statistic [2, 4, 50].

    A separate group of methods that often smooth the data are local measures of

    spatial similarity. These methods, which are also known as LISA (Local Indicators of Spatial Autocorrelation) [51], address the question "How similar is the risk at a given small area to that of its neighbors?" The greater the similarity, the higher the likelihood that the small area belongs to (or is) a cluster. Some of the LISA statistics are local Moran's I and local Geary's C [50-54]. Since the underlying philosophy of this approach is that things nearer are more similar than things farther away [55], the implicit definition of scale here is the distance at which this similarity is manifested. Thus a process that acts at a large scale may cause greater similarity among immediately neighboring local areas than processes that work at a smaller scale. As with other methods, if the statistics are calculated on small areas, they could be unreliable. The reference distribution of LISA statistics is often not known [4], and the scale at which a process operates is not investigated before


    LISA statistics are calculated. Any of these factors could lead to the creation of spurious

    clusters.
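
    As a rough illustration of how a LISA statistic compares an area with its neighbors, the sketch below computes local Moran's I in one common formulation, I_i = (z_i / m2) * sum_j w_ij z_j, where z_i is the deviation of area i from the mean and m2 is the mean squared deviation. The four-area map, its rates and the row-standardized chain adjacency are hypothetical, and the sketch carries out no significance test.

```python
def local_morans_i(values, weights):
    """Local Moran's I for each area.

    values  : one rate per area
    weights : weights[i][j] is the spatial weight between areas i and j
              (zero on the diagonal; assumed row-standardized here)
    Assumes the values are not all identical (otherwise m2 is zero).
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]          # deviations from the map-wide mean
    m2 = sum(d * d for d in z) / n          # mean squared deviation
    return [
        (z[i] / m2) * sum(weights[i][j] * z[j] for j in range(n))
        for i in range(n)
    ]


# Hypothetical map: four areas in a row, two low-rate areas next to two high-rate areas.
values = [1.0, 1.0, 5.0, 5.0]
weights = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
local_i = local_morans_i(values, weights)
print(local_i)
```

    In this toy map the two end areas, whose only neighbor has a similar rate, receive positive values, while the two middle areas straddle the boundary between low and high rates. The result depends directly on the choice of neighbors, which is the implicit definition of scale discussed above.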

    1.4.2 Methods that pre-process the data before calculating and/or testing for significant disease risk

    These methods allow the modification of geographic boundaries to extract the

    underlying risk surface and/or to find which area has the greatest excess in disease risk.

    One group of methods, often called density estimation methods [56], simply ignore existing geographic boundaries. Drawing from the "field" theory of geographic phenomena [20], they consider that disease risk patterns are continuous in nature and that they do not change or stop abruptly at geographic boundaries. When appropriately used, these methods provide the opportunity to control the spatial basis of support, and thus the scale of the analysis [57, 58]. A second group of methods draws from concepts of region building which were developed by geographers [30]. One approach to building regions is to coalesce groups of areas to build aggregate regions. These methods attempt to find

    that combination of areas which has the greatest likelihood of being a zone of high

    disease risk. A third group of methods combine concepts of region building methods with

    the first group of methods or with methods discussed in the last section. The ability of all

    these methods is limited by the scale of the data. Often the data come aggregated into

    small areas and the analysis must be carried out at scales equal to or greater than the scale of

    aggregation. Nevertheless, these methods are better equipped than other methods to

    control the shape and the scale of the data, and this gives them an edge over other

    methods when dealing with the problem of spurious clusters.


    1.4.2.1 Non-combinatorial approaches

    These methods ignore geographic boundaries and attempt to extract the

    underlying patterns of risk. They often lay a uniform grid over the map area and measure

    the statistic of interest at each grid point. Irrespective of whether the data are aggregated

    or not, a value can be obtained at each grid point. While there are a number of approaches to calculating the statistic at each grid point [21], a simple and common approach is to "filter" the data using circular spatial filters [3, 9, 21, 27]. Some methods map the statistic calculated at each grid point [9] while others do not [3]. These circles can be of fixed or varying sizes. However, since these filters are of a certain shape, they bias the cluster search. The bias is in favor of detecting clusters of, or similar to, the shape of the filter (circles in this case). Statistically, clusters that have the shape of the filter have a higher power of detection than clusters of other shapes. This approach therefore overcomes the limitation outlined in the methods discussed earlier, but is limited in its treatment of geographic shape. Ellipses and other geometric shapes have also been studied [29, 59]. One of the methods, based on Rushton's Adaptive DMap [9], maps rates at grid points using adaptive filters and interpolates these with an IDW (Inverse Distance Weighting) interpolation algorithm. The adaptive filter [58, 60] ensures that the rates are based on the same number of people, or the same support size. Thus, unlike the

    LISA methods, all statistics are equally reliable. Also, the use of an adaptive filter

    ensures that the scale of the analysis can be precisely controlled. The Inverse Distance

    Weighting Algorithm used for creating the final pattern was also found by Kafadar [22] to be the least noisy of all smoothing/interpolation methods. Thus, by allowing

    multiscalar analysis, relative freedom of cluster shape (clusters don't have to conform to geographic boundaries) and using a robust interpolation technique, Rushton's Adaptive Filtering method is best suited for dealing with the problem of spurious clusters from

    mismatch between the process and analysis scales. I use this method in my analyses.
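
    The idea of an adaptive filter can be sketched as follows. This is an illustrative simplification and not the DMap implementation: the function name, the tuple layout of `areas` and the rule of adding whole small areas in order of centroid distance are all assumptions made here. Because whole areas are added, the support reached is at least, rather than exactly, the requested population.

```python
import math


def adaptive_filter_rate(grid_point, areas, min_population):
    """Rate at one grid point, using the nearest areas needed to reach min_population.

    areas: list of (x, y, cases, population) tuples, one per small-area centroid.
    Returns (rate, radius), or None if the whole map holds fewer people than required.
    """
    gx, gy = grid_point
    # Visit area centroids from nearest to farthest.
    by_distance = sorted(areas, key=lambda a: math.hypot(a[0] - gx, a[1] - gy))
    cases = 0
    population = 0
    for x, y, c, p in by_distance:
        cases += c
        population += p
        if population >= min_population:
            # The filter radius is the distance to the last centroid that was added.
            return cases / population, math.hypot(x - gx, y - gy)
    return None


# Hypothetical data: three small areas along a line.
areas = [(0, 0, 1, 100), (1, 0, 3, 100), (5, 0, 10, 100)]
result = adaptive_filter_rate((0, 0), areas, min_population=150)
print(result)
```

    Repeating this calculation at every grid point gives rates that rest on comparable support sizes, at the cost of a filter radius that varies from place to place.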

    Another important density estimation method is Kulldorff's SaTScan [3]. While the


    DMap method maps the extracted pattern, and is therefore good for visualizing and exploring the underlying pattern, SaTScan can be used to map only those areas that are significant clusters. SaTScan has found wide acceptance in the public health community because of its ability to account for the multiple hypothesis testing problem and because of its robust, freely available software. Some of the recent developments in the disease clustering literature have followed the combinatorial approaches that I discuss next, and their method of choice has been based on the Spatial Scan Statistic method of cluster detection. Since multiple testing is an issue with these combinatorial approaches, the Spatial Scan Statistic is a reasonable choice. Since I use the Spatial Scan Statistic in Chapter 3 to investigate clusters of prostate cancer in North West Iowa, some of the details of the Spatial Scan Statistic are provided next.

    The scan statistic originated as a one dimensional test. Its objective was to test whether a one dimensional point process is purely random. The one dimensional scan statistic was extended by Kulldorff into the spatial domain [3]. The spatial scan statistic moves a circle across the study area. The circle is centered on a centroid. The centroid could be the location of a single individual for unaggregated data, the centroid of a census tract (for example) for aggregated data, or one of a set of grid points. Kulldorff (1997) [3] states: "The zone defined by a circle consists of all individuals in those cells whose centroids lie inside the circle and each zone is uniquely identified by these individuals. Thus, although the number of circles is infinite the number of zones will be finite. For unaggregated data the zones are perfectly circular, that is, the individuals in the zone are exactly those located within a defining circle. With data aggregated into census districts, a zone may have irregular boundaries that depend on the size and the shape of the several contiguous census districts it includes." The Spatial Scan Statistic is implemented through the freely available software SaTScan [32]. The methodology of the Spatial Scan Statistic involves two steps: 1. confounder adjustment and 2. hypothesis testing.


    In disease cluster detection studies known risk factors or confounders are

    adjusted for, before the cluster detection algorithm is implemented. Thus, for example, it is known that age is associated with prostate cancer. It may be desirable to remove the

    effect of age from the analyses, such that the clusters that are detected reflect the presence

    of other, yet unknown, risk factors. The confounder adjustment procedure that SaTScan utilizes is known as the indirect standardization method. It is as follows:

    If:

    $e_i$ = Expected number of cases in local area/ZCTA i after confounder adjustment.

    $n_i$ = Observed number of cases in local area/ZCTA i after confounder adjustment.

    r = Specific confounder group, for example the age group from 45-65 yrs.

    = Total number of confounder groups.

    $n_r$ = Total number of cases in G in age group r

    $N_{ir}$ = Total number of people in G in local area i, in age group r.

    The confounder adjustment procedure is:

    $$e_i = \sum_{r} \left[ \frac{n_r}{\sum_{i'} N_{i'r}} \, N_{ir} \right]$$
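
    The indirect standardization step can be sketched in a few lines. This is a minimal illustration under stated assumptions and is not SaTScan's own code: the function name, the dictionary layout and the two-area, two-age-group numbers are hypothetical. The region-wide rate in each confounder group is applied to each local area's population in that group, and the results are summed over groups.

```python
def expected_cases(population_by_area, cases_by_group):
    """Indirectly standardized expected case counts for each local area.

    population_by_area: {area: {group: people}}             (the N_ir above)
    cases_by_group:     {group: cases in the whole region}   (the n_r above)
    """
    # Total population of each confounder group over the whole region.
    group_population = {}
    for groups in population_by_area.values():
        for r, people in groups.items():
            group_population[r] = group_population.get(r, 0) + people

    # Region-wide rate in each group, applied to each area's own population.
    return {
        area: sum(cases_by_group[r] / group_population[r] * people
                  for r, people in groups.items())
        for area, groups in population_by_area.items()
    }


# Hypothetical region with two local areas and two age groups.
population = {
    "A": {"45-64": 100, "65+": 50},
    "B": {"45-64": 300, "65+": 50},
}
cases = {"45-64": 4, "65+": 10}
expected = expected_cases(population, cases)
print(expected)
```

    A useful check on any such implementation is that the expected counts summed over all local areas equal the total number of observed cases in the region.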

    The adjusted numbers of cases are then used to test the hypothesis if a given local

    area/ZCTA i has an excess risk/belongs to a cluster. The hypothesis testing procedure is

    explained next. The Spatial Scan Statistic tests the hypothesis if a given area of the map

    (for example a collection of ZCTAs) has a greater (or lesser) risk, than the rest of the

    ZCTAs in the entire geographic region G.

    If Zj is the jth cluster candidate, then for every possible Zj in Z (the collection of k possible clusters in G), let R(inside, j) be the risk inside Zj and R(outside, j) the risk outside Zj. The null and alternative hypotheses are then:

    H0: R(inside, j) = R(outside, j)

    H1: R(inside, j) > R(outside, j)

    The observed number of cases nj inside (or outside) a cluster candidate is assumed to be Poisson distributed, as a function of the expected number of cases in the cluster, ej, and the risk R(inside, j).

    Let $n = \sum_{i=1}^{k} \sum_{r} N_{ir}$ and $n_j \sim \text{Poisson}[e_j \cdot R(\text{inside}, j)]$. The likelihood ratio that is formed from these null and alternative hypotheses is as follows:

    Likelihood ratio = Likelihood(R(inside, j) > R(outside, j)) / Likelihood(R(inside, j) = R(outside, j))

    This likelihood ratio can be solved and written in the logarithmic form as follows:

    Log Likelihood Ratio or LLRj = (nj ln (nj/ ej)) + ((n- nj) ln [(n- nj)/(n- ej)])
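
    The log likelihood ratio above translates directly into a short function. The sketch below is illustrative only: the function name is hypothetical, the arguments n_j, e_j and n are the quantities of the formula above, and the rule of returning zero when a zone shows no excess is an assumption that follows from the one-sided alternative hypothesis H1.

```python
import math


def log_likelihood_ratio(n_j, e_j, n):
    """Log likelihood ratio for one candidate zone, following the formula above.

    n_j: observed cases inside the zone
    e_j: expected cases inside the zone (must be positive and smaller than n)
    n:   the region-wide total appearing in the formula
    """
    if n_j <= e_j:
        # No excess inside the zone, so it is not evidence for H1.
        return 0.0
    inside = n_j * math.log(n_j / e_j)
    outside = 0.0
    if n > n_j:
        # When n == n_j the second term is 0 * ln(0), which is taken as 0.
        outside = (n - n_j) * math.log((n - n_j) / (n - e_j))
    return inside + outside


# Hypothetical zone: 20 observed cases against 10 expected, with n = 100.
print(log_likelihood_ratio(20, 10, 100))
```

    The candidate zone with the largest value of this function is the most likely cluster, and its significance is assessed separately.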

    The significance of the log likelihood ratio is tested using a Monte Carlo hypothesis test. The SaTScan program carries out a user-specified number of Monte Carlo randomizations of the data and tests the significance of the presence of a cluster to 0.001 % (this percentage can also be specified by the user). A p value is reported. This is calculated as "p value = Rank of LLR / (1 + #simulation)." Note that the spatial scan statistic procedure does not adjust for multiple testing in the traditional sense, for example by carrying out a Bonferroni or other multiple testing adjustment procedure. Instead, it avoids the problem of testing multiple hypotheses by concentrating on those cluster candidates that are most likely to be true clusters (and thus have the highest log likelihood ratio values). Also note that the Spatial Scan Statistic procedure explained above is the spatial Poisson model, which is the model used in disease mapping. There are numerous other modifications to the Spatial Scan Statistic procedure [29].
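
    The Monte Carlo p value quoted above can be sketched as follows. The function name is hypothetical, and the sketch assumes that each simulated value is the largest log likelihood ratio found over all candidate zones in one random replicate of the data. It is this comparison against the maximum that lets a single test stand in for many.

```python
def monte_carlo_p_value(observed_llr, simulated_max_llrs):
    """p value = rank of the observed LLR / (1 + number of simulations).

    simulated_max_llrs holds, for each random replicate of the data, the largest
    LLR found over all candidate zones in that replicate.
    """
    # Rank 1 means the observed value exceeds every simulated maximum.
    rank = 1 + sum(1 for s in simulated_max_llrs if s >= observed_llr)
    return rank / (1 + len(simulated_max_llrs))


# Hypothetical run: 999 replicates, none of which reaches the observed value.
print(monte_carlo_p_value(10.0, [1.0] * 999))
```

    With 999 replicates, the smallest p value that can be reported is 1/1000, so the number of simulations sets the resolution of the test.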

    1.4.2.2 Combinatorial Approaches

    Some geographers are interested in creating or building regions [30, 61-64]. Regions are built up by assigning small areas to groups such that they fulfill certain criteria. Regional geographers have called this the "assignment problem". Small areas are assigned to regions in such a way that a certain attribute of the region is optimized [30, 62]. Sometimes the problem could involve maximizing the variation in an attribute of the newly built region as a proportion of the variation within the entire map [30, 65]. The general question in this approach is "What combination of areas will optimize a given objective?" In the disease mapping context, disease risk or the likelihood of risk can be maximized. An example in the disease mapping context was investigated by Alvanides [61]. A similar strategy was also suggested (but not implemented) by Rushton [66]. These ideas were implemented in computer programs first by Openshaw [64] and later by other researchers [63, 67, 68]. Independently, Duczmal suggested a similar solution to finding disease clusters of any shape. He operationally achieved this by maximizing the Spatial Scan Statistic likelihood function over possible combinations of areas. While it is sometimes possible to look at all possible combinations/collections of areas, for most realistic geographical areas this is not possible (for example, see Cliff and Haggett [62]). Neither are there theoretical solutions to the problem. In operations research, such problems are called NP-complete. This means that no algorithm is known that solves the problem for a collection of n areas in polynomial computer time. Heuristics are used to solve such problems. Duczmal uses the Simulated Annealing (SA) and Genetic Algorithm (GA) heuristics in his research [31, 69]. An important aspect of these methods is that they provide enormous freedom of analysis of shape and scale. The analysis scale and shape


    vary across a multitude of combinations. Thus instead of asking the question "Is there a cluster at a given scale of the following shape?" these methods demand "Find clusters of any shape at any scale". This makes these methods immensely powerful. But this strength also brings about a weakness. If spurious clusters are created from a mismatch between the process and analysis scales and shapes, and if a large number of scales and shapes are evaluated by the analysis method, then it follows that noisy clusters will almost always be detected by these methods alongside genuine or true clusters. At the end of this section we shall see an example of this. The next section discusses some of the modifications that researchers have proposed to these methods. These modifications offer better power of detecting clusters.
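
    To give a sense of why exhaustive enumeration is rarely feasible, the sketch below counts the non-empty subsets of n small areas. This is only an upper bound on the number of candidate zones, since it ignores the requirement that the areas in a zone be contiguous, and the function name is hypothetical. The value of 245 areas matches the number of counties in the New England example discussed later in this chapter.

```python
def count_candidate_zones(n_areas):
    """Number of non-empty subsets of n_areas small areas (contiguity is ignored)."""
    return 2 ** n_areas - 1


for n in (10, 50, 245):
    print(n, count_candidate_zones(n))
```

    Even if only a small fraction of these subsets were contiguous, the search space would remain far too large to enumerate, which is why heuristics such as Simulated Annealing are used instead.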

    1.4.2.3 Hybrid Approaches

    These approaches combine some of the strategies of the non-combinatorial

    approaches with a combinatorial search. Some examples are the approaches proposed by Patil and Taillie [70], Tango [71] and Yiannakoulias [36]. Tango proposed that the search begin with a circular cluster as a "seed", but that regions adjacent to the circular cluster then be coalesced with it and the resulting hybrid be tested as a possible cluster. With every level of adjacency enumerated the problem becomes computationally more complex, and therefore in his example Tango suggested that three levels of adjacency be tested. Patil and Taillie's [70] approach is limited to restricting the search space to areas with the highest rates, which Patil and Taillie call the "upper level sets". These methods provide interesting extensions to the combinatorial shape-free methods of cluster search.

    We are now in a position to summarize the various methods discussed. All the

    methods outlined above have one singular goal: To extract the underlying pattern of

    significant excess risk. Some methods are good at mapping the entire pattern [9], while others are good at testing for significant excess risk [3]. In the next section, I discuss how problems with significance testing can introduce spurious clusters.


    1.4.3 Significance Testing and Spurious Clusters

    In general, all methods at some point address the following question: "Of all the candidate clusters in the pattern of risk (whether mapped or not), which clusters are true clusters?" Each candidate cluster has a specific risk elevation, a size, and a shape. Traditionally, most "cluster detection" techniques have used some function of the risk elevation or rate of a given area to decide if the area is a true cluster. The question that is asked is "How likely are we to observe this risk elevation or rate in this area if the underlying process is noise?" If the probability is small, then the area is considered a cluster.

    The distribution of risks/rates under the process of noise is also known as the reference distribution. Traditionally, the reference distribution is normatively chosen. Some choices are the normal distribution [2, 50], the chi-squared distribution [2, 50], the Poisson distribution [3] and the Gumbel distribution [43]. However, using such distributions is problematic. If the populations are small, the normal distribution cannot be used. It is often hard to distinguish a lack of fit to the chi-squared distribution from a genuine deviation from the chi-squared distribution (indicating clustering) [4]. A more robust approach is to use Monte Carlo simulation to determine the reference distribution empirically. Methodologically this may be achieved by simulating a series of maps, in each of which noise is the underlying process. Multiple Monte Carlo simulations of the data are used to mimic the noise process. If the observed risk elevation (or some function of the risk value such as the rate) for the area is significantly different from the ones in the simulated maps, then the area is considered to be a cluster. However, Monte Carlo simulations do not guarantee that spurious clusters will not be detected. Steenberghen et al. [72] carried out an experiment that illustrates this problem. This is displayed in Fig 1.1. Fig 1.1 is a map in which simulated locations of traffic accidents (points) were randomly scattered [72], filtered using 600 meter filters,


    the density of points estimated, the resulting clusters tested for significance and the level of significance displayed (a map of this kind is also known as a p-map). If areas that show 0.025 % significance are called clusters, the black shapes in Figure 1.1 are spurious clusters.

    Some methods attempt to tackle this problem with a combination of both Monte

    Carlo and normative statistical techniques. Examples are Duczmal's and Kulldorff's methods. Duczmal's method [3, 31, 43, 69, 73] (which derives from Kulldorff's method) generates a large number of irregular cluster candidates. For each candidate the rate is

    calculated. The rate is then fed into a function known as a likelihood function to yield a

    likelihood value of the cluster candidate being a true cluster. This value is divided by

    the likelihood of the cluster candidate not being a true cluster. This ratio is known as the

    likelihood ratio. The likelihood ratios for all cluster candidates are calculated. The

    cluster candidates with the highest ratios are the most likely clusters. Multiple Monte

    Carlo simulations are carried out, and the rates at all the candidate clusters calculated.

    Again, the rates are fed into the likelihood function, thus generating a reference

    distribution of likelihood ratios for each cluster candidate. The likelihood ratio value of

    the cluster candidate is compared with the reference distribution to decide if the cluster

    candidate is a true cluster. However when Duczmal applied this approach to some of his

    data, problems with this approach were dramatically exposed. In one of his studies

    Duczmal [31] simulated breast cancer cases and randomly distributed them over 245 counties in New England (Fig 1.2). When he instructed his Simulated Annealing (SA) SaTScan based irregular cluster search algorithm to search for clusters, one of the clusters

    that it found was a large and extremely irregular cluster encompassing 122 counties, and

    enclosing a large percentage of the randomly scattered cases. This cluster is an example

    of a noisy cluster. The noise generating process (random distribution of cases) operated at the scale of 245 aggregated counties. The shape of the area over which this process operated is the shape of the New England region seen in Fig 1.2. At this scale and shape, the process generates noise. However, if this process is studied at the scale of an aggregation of 122 counties and at the shape that follows the darker shaded counties in Figure 1.2 (orange if your copy of this document is in color), then a noisy or spurious cluster is generated. It is known that the process that generated this cluster is noise.

    This example thus illustrates a situation where spurious clusters are created from a

    mismatch between the scale and shape of the process that generates the cluster and the

    scale and the shape imposed by the method of analysis. Duczmal [31] noted that this noisy cluster was large in size and extremely irregular in shape. Duczmal [73] suggests that large and irregular clusters like the one found in his study (above) are likely to be spurious. He and some other researchers [36] therefore incorporate a penalty for irregularity of shape in the cluster search algorithm. The extent of this penalty is decided on a priori knowledge of the shape of the cluster. Therefore, if researchers believe that the clusters in an area are likely to be circular, they place a high penalty on clusters that are not circular in shape, and vice versa. The spurious cluster detected by Duczmal's method and the proposed solution raise some important questions. Is this spurious cluster, large and irregular with a high risk/rate elevation, an artifact of his particular method, or is it possible that any cluster detection method given freedom of shape and size is likely to detect such clusters? We note that the shape and size of the spurious clusters in Fig 1.1 are different from the shape and size of Duczmal's spurious cluster. Thus not all spurious clusters are large and irregular.
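The likelihood-ratio and Monte Carlo procedure described earlier in this section can be sketched in a few lines. This is only an illustration under stated assumptions: the text does not give the likelihood function, so the Poisson form commonly associated with Kulldorff's scan statistic is assumed here, the reference distribution is built from the maximum ratio over all candidates (one common choice, not necessarily Duczmal's), and the function names, counts and candidate zones are hypothetical.

```python
import math
import random

def poisson_llr(cases_in, expected_in, total_cases):
    """Log likelihood ratio of one candidate zone (assumed Poisson form)."""
    c, e, C = cases_in, expected_in, total_cases
    if c <= e:                      # only elevated rates count as evidence
        return 0.0
    llr = c * math.log(c / e)
    if C > c:                       # outside term vanishes if all cases are inside
        llr += (C - c) * math.log((C - c) / (C - e))
    return llr

def max_llr(cases, pop, candidates):
    """Highest ratio over all candidate zones (each a list of area indices)."""
    C, P = sum(cases), sum(pop)
    best = 0.0
    for zone in candidates:
        c = sum(cases[i] for i in zone)
        e = C * sum(pop[i] for i in zone) / P   # expected cases under the null
        best = max(best, poisson_llr(c, e, C))
    return best

def monte_carlo_p(cases, pop, candidates, n_sim=199, seed=0):
    """Compare the observed maximum ratio with ratios from random maps."""
    rng = random.Random(seed)
    observed = max_llr(cases, pop, candidates)
    areas = list(range(len(pop)))
    exceed = 0
    for _ in range(n_sim):
        sim = [0] * len(pop)
        # Null hypothesis: each case falls in an area in proportion to population
        for i in rng.choices(areas, weights=pop, k=sum(cases)):
            sim[i] += 1
        if max_llr(sim, pop, candidates) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_sim + 1)

# Hypothetical six-area map with an excess of cases in areas 2 and 3
cases = [2, 3, 12, 11, 2, 3]
pop = [100] * 6
candidates = [[0], [1], [2], [3], [4], [5], [0, 1], [2, 3], [4, 5]]
print(monte_carlo_p(cases, pop, candidates))
```

Because the candidate with the highest ratio is always reported, a method that is free to propose very many irregular candidates has more chances to find a high ratio in pure noise, which is the difficulty the New England example exposes.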

    Duczmal's problem has reintroduced the otherwise rarely discussed issue of shape and size in the disease cluster detection literature [69, 74, 75]. Risk elevation is just one possible characteristic of a cluster. McCullagh [76] states, "In map analysis, features of prime importance may be size, shape, orientation and spacing". It is possible for clusters of different shapes and sizes to have the same risk elevation. It is also possible for clusters of the same shape and size to have different risk elevations. The first objective of any cluster search should therefore be to distinguish spurious or noisy clusters from everything else. The risk or rate value of a possible cluster alone is not sufficient to make this distinction. The shape and size of the cluster must also be factored in when considering whether a cluster is a true cluster. Duczmal proposes a solution that makes certain a priori assumptions about the shape and size of a cluster. This solution is interesting.

    However, the problem of spurious clusters may be approached from a different angle.

    Instead of asking the question "What is the shape of a true cluster?", which is what these methods do and which is hard if not impossible to answer, the question that should be asked is "What is the shape of a spurious cluster?". Unlike the first question, this one can be answered, because the shape of a spurious cluster, unlike that of a true cluster, can be mined a posteriori from the data. To know how this can be done, we first need to understand how spurious clusters are generated in the first place. Thus, in the chapter that follows, I discuss in depth the phenomenon of noise and the

    creation of spurious clusters.

    1.4.4 Identifying spurious clusters and distinguishing true clusters from spurious clusters

    Spurious clusters enclose noise. Across disciplines, noise is defined as "... a

    random and unpredictable signal" [77]. By this definition, if the nature of the signal is known, then noise can be detected and filtered out. For example, in a satellite image, it

    may be known that certain frequencies are the signal frequencies and therefore a spectral

    analysis and subsequent filtering may help remove the undesirable noise. In a satellite

    image the signal has a physical existence. For example, infrared radiation emitted by

    vegetation can be measured with certain instruments. In contrast, in mapping disease the

    signal cannot be physically measured. The signal is conceptual and has to be estimated

    from the available data. Some geographers and statisticians attempt to tackle the problem

    by developing statistical models that attempt to separate signal from noise [21, 23, 78-80]. Perhaps a better approach to understanding signal and noise in a disease map is to understand the physical process that gives rise to the signal (as in a satellite signal). It is known that in a disease map, the observed patterns are the result of underlying processes.

    The observed patterns are patterns obtained from mapping statistical summaries of

    disease outcomes. For example, a map of patterns of cholera mortality in England could

    be displaying the number of cholera deaths per unit population in each county. The

    outcome in this case is cholera mortality which is the outcome of a disease process. Since

    cholera is a communicable disease it is possible that the spread of cholera can be modeled

    as a contact network process [81]. There exist many other spatially explicit disease processes (see footnote 2). For example, patterns of disease could be the result of processes that reflect

    an underlying lack of access to healthcare [10, 56, 82-84]. Whatever the specific process may be, these processes have a common trait in having a spatial form [85], and this means that they predispose some areas of the map to have a greater risk than others.

    It is also possible that the underlying process does not cause any region of the

    map to have a greater risk than any other. Since a disease case may appear at any point on

    the map by random chance, by the earlier definition of noise, this is a noise generating

    process. A cluster defined by enclosing some of these disease cases is a spurious cluster.

    On any given map, disease patterns can be the result of one or more processes. They could be the result of one process that generates clusters and another process that generates noise. The challenge, therefore, is to distinguish the areas of a pattern that are the result of a cluster generating process from those that are not.

    2 It is important to distinguish between a spatially explicit disease process and a spatial disease process. Some scientists attempt to model diseases as purely spatial processes. Examples of this can be seen in the cellular automata based disease modeling literature. No disease process is purely spatial, and therefore such models are misleading.

    Also, given a disease process that generates patterns on a map, a number of other factors influence the patterns we actually observe. Given a cluster generating process, the following factors influence the pattern that is then extracted:

    1. The spatial distribution of the locations of people in the map.

    2. The shape and size of the geographic units that are used to aggregate individuals

    into discrete small areas.

    3. The shape and size of the spatial configuration that the disease mapping or cluster

    detection method may impose on the data (in addition to 2).

    Understanding these factors is essential to understanding noise and spurious

    clusters. I discuss them next.

    1.4.4.1 The spatial distribution of the locations of people in the map

    A cluster generating process causes an area of the map to have a greater risk than

    other areas of the map. Cluster detection methods seek to estimate the shape, size and risk

    elevation of the area of increased risk using the locations of people as proxy sample sites.

    A representative spatial sample of the area of risk would be a uniform grid [86]. People are never distributed uniformly over space; instead, a likely spatial distribution consists of dense settlements interspersed with sparsely populated areas. This creates a challenge in estimating the true shape of the cluster. As I illustrate in Figures 1.3 to 1.11, a cluster that in reality has a uniform shape may be estimated as having a highly irregular shape because of the way people are distributed over space [75]. The shape of the actual area of increased risk, or true cluster, created by the cluster generating process also influences the shape of the cluster that is finally estimated. If the shape of the true cluster is highly irregular, it is quite likely that the shape of the estimated cluster is also highly irregular, but the converse may also be true! This is illustrated in Figures 1.12 to 1.14. Another phenomenon long observed by geographers is that the same risk process may give birth to different shaped clusters in different areas of the map or, in more general terms, the same cluster generating process may give rise to different patterns [87]. While the shape of the original area of increased risk, or true cluster, may be the same in two areas and the spatial distribution of the people may be the same, the pattern of people who are diseased (and who are not) will not necessarily be the same in both areas. This means that the shape of the estimated area of increased risk may differ between the two areas. This is further complicated by the fact that people are almost never distributed similarly over space in two different regions (Figures 1.15 to 1.20).

    First, for the purposes of understanding this issue, let us assume the highly

    improbable situation that people are uniformly distributed over space. Let the distribution

    be over a uniform grid. Figure 1.3 illustrates the situation. Next, let us consider that out

    of the 42 people in the region, 10 are afflicted by some disease. However, we assume that

    the process that causes disease is a noise generating process. Therefore, we expect

    diseased people (or cases) to be randomly distributed over the region among 42 people as shown in figure 1.4. A convex hull boundary of these cases is seen in Figure 1.5. In

    contrast, if there is a cluster generating process, we would expect the diseased people to

    be clustered together. Figure 1.6 illustrates such a situation. People enclosed within a

    dotted area of increased risk are diseased, the risk being 0.24 (the risk in other areas being 0). We observe in Figure 1.6 one realization of the risk process, so 10 people are diseased. Figure 1.7 displays the convex hull boundary of this cluster of diseased people. The smooth and regular shape of this cluster is in sharp contrast to the irregular cluster shape that we observe in Figure 1.5. Since it is highly unlikely that people will be uniformly distributed over space, Figure 1.8 illustrates the more realistic possibility of people being non-uniformly distributed over space. If the entire geographic area in Figure 1.8 is subject to a risk, we expect some people to become diseased (again, one realization of the process). Figure 1.9 illustrates this and the boundary that demarcates the cluster. The shape of the cluster is very different from what was obtained in Figure 1.5. An area of increased risk on such a heterogeneously distributed population gives rise to clusters of unpredictable shapes (Figures 1.10 and 1.11). These examples show how the spatial distribution of people affects the shape and size of the risk surface detected.
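The uniform-grid thought experiment above (42 people, 10 cases, convex hull boundaries) can be reproduced approximately as follows. This is a minimal sketch and not a reconstruction of the figures: the 7 x 6 grid coordinates, the random seed and the choice of the 10 people nearest an arbitrary centre as the "clustered" realization are all assumptions, and hull area is used only as a crude stand-in for the extent of the enclosed region.

```python
import random

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    total = 0
    for i in range(len(vertices)):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % len(vertices)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2

# 42 people on a uniform 7 x 6 grid (hypothetical coordinates)
people = [(x, y) for x in range(7) for y in range(6)]

# Noise generating process: 10 cases scattered at random among the 42 people
rng = random.Random(1)
noise_cases = rng.sample(people, 10)

# Cluster generating process: the 10 people nearest an assumed centre of risk
centre = (3, 2.5)
clustered_cases = sorted(
    people, key=lambda p: (p[0] - centre[0]) ** 2 + (p[1] - centre[1]) ** 2
)[:10]

noise_area = polygon_area(convex_hull(noise_cases))
clustered_area = polygon_area(convex_hull(clustered_cases))
print(noise_area, clustered_area)
```

On a uniform grid the hull of the clustered cases stays compact while the hull of the scattered cases sprawls across the region. Replacing `people` with an uneven set of coordinates is a simple way to explore how the contrast weakens once the population is no longer uniform.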

    From these examples it may seem that for a given distribution of people over

    space, a cluster generating process gives rise to patterns on a map that are regular

    compared to the shapes generated by a noise generating process. Indeed, some scientists use measures of regularity of a cluster's shape to distinguish a true cluster from a spurious cluster [73]. Also, people are never distributed uniformly over geographic space. Next, we see how this affects the shape and size of the cluster detected. In the example I have discussed, I assumed that the cluster generating process gives rise to a very regularly shaped area of increased risk (the area within the dotted line). In reality this may not be true. The area of increased risk may have a very irregular shape. Some examples of geographic features that can be areas of increased risk are rivers, roads, underground groundwater streams, plumes of aerial pollution or a combination of some of these. We therefore observe that the shape and size of a cluster cannot be predicted a priori and is unique to the risk elevation of the cluster generating process and the spatial distribution of the people. Another aspect of a cluster generating process is that the same process can give rise to different shaped clusters in different regions of the map. This can happen even if people are uniformly distributed. The examples below illustrate this:

    From the discussion and the examples, we can conclude that both the spatial

    distribution of people and the shape and size of the area of increased risk, have an

    important bearing on the shape and size of the cluster that is finally detected. The area of increased risk, or the "true" cluster, may have a very different spatial configuration from the cluster that is detected. Parts of the true cluster may be suppressed, or spurious areas of increased risk may arise. Spurious clusters are created by the method used to measure the outcome of the process of clustering. By definition, the method uses a scale and/or shape of measurement that is dependent on the spatial distribution of people. Since this distribution is not representative of the underlying area of increased risk, there is a mismatch between the measurement shape/scale and the process shape/scale. While

    the above examples are with individual level data, the conclusions drawn can be

    generalized to aggregated data. The act of data aggregation itself could introduce noise

    over and above the problem of heterogeneously distributed people. This is discussed in

    the next section.

    1.4.4.2 The scale and spatial configuration of the geographic units that are used to aggregate data into discrete small areas

    In the geography literature the term "scale" is used to refer to three different kinds of scales, two of which are of relevance here. The first is the phenomenon scale, or the scale at which a spatial process operates. The second is the analysis scale, the scale at which data are aggregated for measurement and analysis [88]. When a phenomenon such as a disease operates at a given scale, its outcome is often registered as heterogeneity in disease rates at that scale [89]. Geographers have often attempted to find the scale at which a process operates [90]. Two well known methods are spectral analysis [65] and variogram modeling [91]. The latter approach is often used in the health geography literature. Studies in China have shown that esophageal and liver cancers operate at scales of less than 150 km, while stomach cancers operate at scales of less than 90 km [91]. In Sweden, substance related disorders operate at scales of less than 3 km [92]. Unfortunately, the scale at which a given process operates is not known in most

    geographic studies. A geographer attempts to study a process by collecting and analyzing spatial data. This involves the calculation of statistical summaries of data aggregated at an appropriate scale. When the process scale is not known, there is every possibility of a mismatch between the process scale and the analysis scale. This mismatch or misalignment arises from two sources. First, geographic data are often aggregated into discrete units for purposes different from the analyses for which they are being used. These units of aggregation could differ in shape and scale from the process scale and shape. As Haining [93] states in "Conceptual models of spatial variation", "... This might be referred to as process-induced spatial heterogeneity. This source of heterogeneity may be compounded in the case of regional data by measuring attributes through spatial units of different size. This might be referred to as measurement-induced heterogeneity because it is a product of how attributes are observed and measured." A second source of mismatch is from the spatial structures that a

    disease mapping/cluster detection method imposes on the data. For example, spatial filtering [9, 10] and Spatial Scan Statistic based methods calculate summary statistics by aggregating data within circular filters. In the geography literature the problems that arise from spatial mismatch are grouped under the Modifiable Areal Unit Problem, or MAUP [91, 94]. MAUP phenomena are in turn grouped into two broad subgroups, the zone effect and the scale effect. The creation of spurious heterogeneity or destruction of true heterogeneity with changing scales is a manifestation of the scale effect. If the scale is kept fixed but the shape of the zones of aggregation is changed, then the zone effect is likely to be seen. Geographic data aggregated to administrative units often display both the zone and scale effects of MAUP. Aggregating data has a smoothing effect on disease rates [95], and therefore clusters at scales smaller than the scale of aggregation could be missed when analyses are done using these data. Conversely, if the scale of aggregation is smaller than the process scale, then noisy clusters could be detected. A recent study by Ozonoff et al. [19] demonstrated that when individual level data are aggregated and a Spatial Scan Statistic cluster search method is used on the data, noise increases with increasing levels of aggregation. Therefore, analysis and process scales interact in complex ways to create noisy clusters and suppress true clusters.
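The smoothing effect of aggregation mentioned above can be seen in a toy calculation. The sketch below is purely illustrative: the sixteen equal-population units arranged along a line, the case counts and the block size of four are invented for the example and do not come from any of the cited studies.

```python
def rates(cases, pop):
    """Disease rate in each unit."""
    return [c / p for c, p in zip(cases, pop)]

def aggregate(values, block):
    """Sum consecutive groups of `block` units (a 1-D stand-in for merging small areas)."""
    return [sum(values[i:i + block]) for i in range(0, len(values), block)]

pop = [100] * 16     # hypothetical population of each small unit
cases = [2] * 16     # background of 2 cases per 100 people
cases[5] = 10        # one small unit with five times the background rate

fine = rates(cases, pop)
coarse = rates(aggregate(cases, 4), aggregate(pop, 4))

print(max(fine) / min(fine))      # elevation visible at the fine scale
print(max(coarse) / min(coarse))  # elevation left after aggregation
```

The single elevated unit stands out clearly at the fine scale, but once it is merged with three background units its rate is pulled toward the background, which is how a cluster smaller than the units of aggregation can be missed.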

    We can conclude from our discussions above that a number of complex factors

    influence the shape, size and the risk elevation of the clusters that are detected and the

    spurious clusters created. These factors are dependent on the spatial distribution of the

    people and the process and analysis scales. It is not possible to make a priori assumptions

    about these factors, and it is certainly not possible to predict the shape of a noisy cluster a

    priori. What approach is then appropriate if the spurious clusters have to be separated

    from the true clusters? The section that follows answers this question.

    1.4.5 Identifying the "noisy" or spurious components of the pattern

    A reasonable cluster detection technique should take into consideration not only

    the risk elevation but also the shape and size of the cluster. I propose a spatially enabled

    computational process that uses these attributes of a cluster, to identify the signature of

    spurious clusters from patterns on a disease map. Earlier, I introduced the idea that a

    pattern is the outcome of a process. Analyzing a pattern or the components of a pattern

    such as individual clusters may yield clues about the underlying process. A map of

    disease patterns represents one realization of the underlying process. It may not be

    possible to draw conclusions on the process that generated the pattern or components of

    the pattern by analyzing just one map. However, if multiple maps were available, representing multiple realizations of the process, then analyzing the patterns may yield

    clues about the underlying process. A classic example of this approach can be found in Hagerstrand's paper [96], in which he simulates multiple maps assuming an underlying process. He then compares maps of empirical data with the maps that he has simulated to draw conclusions about the validity with which he represents the process in his model. Another example can be seen in Diggle [97]. Therefore, if maps were created using a known process, then analysis of the simulated patterns on the maps would yield clues on the "signature" of that particular process. Once this "signature" is known, then the pattern could imply (or not imply) the existence of this process. More specifically, this scheme can help identify a "signature" for spurious clusters. These signatures can then be used to distinguish clusters that are spurious from clusters that are "true", in any given pattern of disease risk. Shape, size and risk elevation are part of this "signature". For example, the signature of spurious clusters in Duczmal's [73] method was that these clusters were large in size and had irregular shapes. The next chapter is

    devoted to the method I have developed based on these ideas. The method is first

    described, then tested and validated on simulated data.
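The general idea of mining a "signature" of spurious clusters from simulated maps can be sketched as follows. This is not the method developed in the next chapter, only a hedged illustration of the principle. The grid of populations, the rule that a cell is "elevated" when its rate exceeds 1.5 times the overall rate, the use of 4-connected groups of elevated cells as clusters, and the bounding-box fill ratio as a shape measure are all assumptions made for the example.

```python
import random

def simulate_null(pop, total_cases, rng):
    """Noise generating process: cases fall in cells in proportion to population."""
    cells = [(r, c) for r in range(len(pop)) for c in range(len(pop[0]))]
    weights = [pop[r][c] for r, c in cells]
    cases = [[0] * len(pop[0]) for _ in pop]
    for r, c in rng.choices(cells, weights=weights, k=total_cases):
        cases[r][c] += 1
    return cases

def elevated_clusters(cases, pop, ratio=1.5):
    """Group 4-connected cells whose rate exceeds `ratio` times the overall rate."""
    overall = sum(map(sum, cases)) / sum(map(sum, pop))
    hot = {(r, c) for r in range(len(pop)) for c in range(len(pop[0]))
           if pop[r][c] > 0 and cases[r][c] / pop[r][c] > ratio * overall}
    clusters = []
    while hot:
        stack, member = [hot.pop()], []
        while stack:                      # flood fill over neighbouring hot cells
            r, c = stack.pop()
            member.append((r, c))
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in hot:
                    hot.remove(nb)
                    stack.append(nb)
        clusters.append(member)
    return clusters

def describe(member, cases, pop):
    """Return (size, rate, shape); shape is the share of the bounding box filled."""
    rows = [r for r, _ in member]
    cols = [c for _, c in member]
    box = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    n_cases = sum(cases[r][c] for r, c in member)
    n_pop = sum(pop[r][c] for r, c in member)
    return len(member), n_cases / n_pop, len(member) / box

def spurious_signature(pop, total_cases, n_sim=200, seed=0):
    """Collect (size, rate, shape) of every cluster found on maps of pure noise."""
    rng = random.Random(seed)
    signature = []
    for _ in range(n_sim):
        cases = simulate_null(pop, total_cases, rng)
        for member in elevated_clusters(cases, pop):
            signature.append(describe(member, cases, pop))
    return signature

# Hypothetical uneven population: a dense column beside sparse cells
pop = [[200, 20, 20, 20], [200, 20, 20, 20], [200, 20, 20, 20]]
signature = spurious_signature(pop, total_cases=60)
print(len(signature), max(size for size, _, _ in signature))
```

Every cluster collected this way is spurious by construction, so the cloud of (size, rate, shape) values describes what noise looks like for this particular population. A cluster found on an observed map could then be compared against that cloud.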

    1.4.6 Why use size, shape and rate

    The reason I add the dimensions of size and shape, in addition to rate, is to

    characterize the reference space in which spurious clusters are located. I know from

    theory (as discussed in this chapter) that spurious clusters arise differently to the extent that the numbers of people at risk, in relation to the overall relative risk of the disease,

    differ across the space. When people are distributed uniformly in space, the average

    number and average size of spurious clusters in that space can be determined from

    theory. As Schinazi [98] shows, deterministic statistics can be used to determine the chance of finding a given number of clusters with a rate higher or lower than the expected

    rate. However, when people at risk are distributed non-uniformly in space, the equivalent

    number is more difficult to determine directly from theory. The same theory still applies;

    it is just more difficult to implement in the case of non-uniform distribution of people at risk. For this reason, I use Monte Carlo simulation to discover the rate, size, shape space

    in which typical spurious clusters lie, given the particular distribution of people at risk

    and the particular overall relative risk of the disease in the study area in question. In his

    seminal paper King [85] states The mathematics of stochast