shape and scale in detecting disease clusters

University of Iowa

Iowa Research Online

Theses and Dissertations

2008

Shape and scale in detecting disease clusters

Soumya Mazumdar

University of Iowa

Copyright 2008 Soumya Mazumdar

This dissertation is available at Iowa Research Online: http://ir.uiowa.edu/etd/208

Part of the Geography Commons

Recommended Citation

Mazumdar, Soumya. "Shape and scale in detecting disease clusters." PhD (Doctor of Philosophy) thesis, University of Iowa, 2008. http://ir.uiowa.edu/etd/208.

SHAPE AND SCALE IN DETECTING DISEASE CLUSTERS

by

Soumya Mazumdar

An Abstract

Of a thesis submitted in partial fulfillment of the requirements for the Doctor of

Philosophy degree in Geography in the Graduate College of

The University of Iowa

December 2008

Thesis Supervisor: Professor Gerard Rushton

ABSTRACT

This dissertation offers a new cluster detection method that looks at the cluster detection problem from a new perspective. I change the question "What do real clusters look like?" to the questions "What do spurious clusters look like?" and "How do spurious clusters affect the ability to recover real clusters?" Spurious clusters can be identified from their geographical characteristics. These are related to four determinants: the spatial distribution of people at risk; the shape and scale of the geographic units used to aggregate these people; the shape and scale of the spatial configurations that the disease mapping or cluster detection method may impose on the data; and the shape and scale of the area of increased risk. The statistical testing process may also create spurious clusters. I propose that the problem of spurious clusters can be resolved using a computational geographic approach. I argue that Monte Carlo simulations can be used to estimate the patterns of spurious clusters in any situation of interest, given knowledge of the first three of these four determinants. Then, given these determinants, where real measurements of disease or mortality are known, it is possible to show which areas of increased risk are true clusters and which are spurious clusters. This distinction is made in a three-dimensional signature space, with shape, size and rate as the three axes. The extent of similarity (or dissimilarity) of a cluster to the simulated spurious clusters influences whether it can be recovered. Experiments with simulated data show that this method is successful in detecting clusters. This method can also predict with reasonable certainty which clusters can be recovered and which cannot. I compare this method with Rogerson's Score statistic method. These comparisons expose the weaknesses of Rogerson's method. Finally, these two methods and the Spatial Scan Statistic are applied to a search for possible clusters of prostate cancer incidence in Iowa. The implications of the findings are discussed.

SHAPE AND SCALE IN DETECTING DISEASE CLUSTERS

by

Soumya Mazumdar

A thesis submitted in partial fulfillment of the requirements for the Doctor of

Philosophy degree in Geography in the Graduate College of

The University of Iowa

December 2008

Thesis Supervisor: Professor Gerard Rushton

Graduate College The University of Iowa

Iowa City, Iowa

CERTIFICATE OF APPROVAL

PH.D. THESIS

This is to certify that the Ph.D. thesis of

Soumya Mazumdar

has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Geography at the December 2008 graduation.

Thesis Committee:

Gerard Rushton, Thesis Supervisor

David Bennett

Naresh Kumar

Marc Linderman

Dale Zimmerman

ACKNOWLEDGMENTS

I would like to acknowledge the help I have received during the course of my stay in Iowa. I would like to thank Dr. Rushton for supervising my research. I would also like to thank my committee members for their contributions. The last four years of my life have been emotionally challenging for me. I thank the great masters before us who have helped me through. I am thankful for the writings of M. Scott Peck, Viktor Frankl and Swami Vivekananda, and for the yogic practices of Sri Sri Ravishankar of the Art of Living Foundation. I would also like to thank my family members, especially my mom, mishtimashi and the late Dr. Mazumdar, for their support. Thanks are also due to all my friends and well-wishers.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

CHAPTER

1. DETECTING CLUSTERS OF DISEASE: INVESTIGATING SPURIOUS CLUSTERS

    1.1 Statement of Purpose
    1.2 Introduction
    1.3 Organization of the dissertation
    1.4 Review of existing methods of cluster detection
        1.4.1 Map data without further geographic processing
            1.4.1.1 Methods that do not smooth the data
            1.4.1.2 Methods that smooth the data
        1.4.2 Methods that pre-process the data before calculating and/or testing for significant disease risk
            1.4.2.1 Non-combinatorial approaches
            1.4.2.2 Combinatorial approaches
            1.4.2.3 Hybrid approaches
        1.4.3 Significance testing and spurious clusters
        1.4.4 Identifying spurious clusters and distinguishing true clusters from spurious clusters
            1.4.4.1 The spatial distribution of the locations of people in the map
            1.4.4.2 The scale and spatial configuration of the geographic units that are used to aggregate data into discrete small areas
        1.4.5 Identifying spurious clusters and distinguishing true clusters from spurious clusters
        1.4.6 Why use size, shape and rate

2. THE SHAPE SIZE SENSITIVE (S.S.S) METHOD FOR DETECTING DISEASE CLUSTERS

    2.1 Theoretical foundations of the S.S.S method
    2.2 Hypothesis testing
    2.3 The simulated dataset
        2.3.1 Hypothetical study area and population
        2.3.2 Hypothetical case population
        2.3.3 Datasets under the null hypothesis of no clustering
        2.3.4 Extracting the cluster candidates
        2.3.5 Datasets under the alternative hypothesis of clustering
            2.3.5.1 Rationale behind the choice of these configurations of synthetic clusters
    2.4 Rogerson's Score Statistic
        2.4.1 Theory
    2.5 Diagnostics
    2.6 Computational Scheme
    2.7 Results
    2.8 Discussions and future directions

3. INVESTIGATING THE SPATIAL PATTERNS OF PROSTATE CANCER IN IOWA

    3.1 Background
    3.2 Methods
    3.3 Results
    3.4 Discussion
    3.5 Conclusion
    3.6 Contribution that this dissertation makes to the geography literature

REFERENCES

LIST OF TABLES

Table

2.1 Hold one validation for null hypothesis.

2.2 Hold one validation for alternative hypothesis.

2.3 Summary statistics of the simulated 3675 spurious clusters.

2.4 Shape, size, risk (signature) and the ability to recover simulated clusters.

2.5 The table illustrates the average sensitivity (ability to detect a cluster when it exists) and specificity (ability to classify an area that is not a cluster as such).

2.6 This table compares the sensitivity and specificity with which clusters are recovered by the S.S.S method and Rogerson's method; the higher the sensitivity, the better the cluster is recovered.

2.7 Cluster recovery using only rates and only shapes.

2.8 How do true clusters differ in shape and size from spurious clusters.

LIST OF FIGURES

Figure

1.1 This figure displays the statistical significance of accidents per square kilometer (a p-map over densities), where accidents have been randomly scattered across the study area. A 30 meter grid was laid over the entire study area and a 600 meter filter was used to estimate the accident densities. The black areas are significant noisy clusters.

1.2 This figure displays a spurious cluster detected by Duczmal's Simulated Annealing based SaTScan method. This cluster has a high, statistically significant likelihood value.

1.3 In the geographic area, 42 people are distributed over a uniform grid. Each circle represents an individual. They are color coded white to indicate that they are healthy.

1.4 A noise or spurious cluster generating process operates at the scale of the entire geographical area. No person is at a greater risk of disease than any other. All people are at a risk of 0.24. Diseased people are randomly distributed over the map. These diseased people are color coded black to indicate a diseased state.

1.5 A boundary is drawn around those people who are diseased. This represents our gerrymandered cluster. Note the highly irregular and large shape of the cluster.

1.6 In contrast to 1.4, a cluster generating process operates on this geographic area. The cluster generating process predisposes the people living in the area bound by the dotted lines to a greater risk than other areas of the map. These people are at a risk of 0.56. In one realization of the process, a cluster of 10 people are therefore diseased in this area.

1.7 The cluster is then enclosed within a boundary. Note the relatively regular shape of the cluster (compared to a random distribution of diseased people).

1.8 People are distributed non-uniformly over space.

1.9 The entire geographic space is subject to the same risk (0.24) noise generating process. The resulting 10 diseased people and the gerrymandered cluster are shown.

1.10 The cluster generating process in Figure 1.6 operates on the inhomogeneously distributed population. The risk elevation is the same as in Figure 1.6 (0.56). This causes 8 people to fall ill from an at-risk population of 14.

1.11 The estimated cluster shape and size is very different from what the shape and size of the cluster is in reality (the dotted line in Figure 1.10). It is also very different from what was obtained for a homogeneous distribution of people in Figure 1.6.

1.12 Now a cluster generating process operates on this space. The white river within the dotted lines is the area of excess risk. People living within this area are at an excess risk of disease.

1.13 Assuming an inhomogeneous distribution of people as in Figure 1.8 and a risk elevation of 0.71, we see that a certain number of people (10) within the area of excess risk are diseased.

1.14 The gerrymandered cluster now encloses the diseased people. Note the highly irregular and large shape of this cluster.

1.15 Two cluster generating processes of circular shape and risk elevation of 0.75 operate on a homogeneous distribution of people.

1.16 The clusters that are estimated from this have the same triangular shape. This is highly unlikely in reality.

1.17 In this example a slightly larger area of increased risk is considered than in the earlier example. 6 people in each of the two clusters are subject to a risk of 0.5, which results in 3 of them becoming cases (falling ill).

1.18 The clusters that are generated have very different shapes. In fact, the larger the area of increased risk, the greater the number of possible shapes and sizes of the estimated cluster.

1.19 In this example people are inhomogeneously distributed. The same cluster generating process as in Figure 1.15 gives rise to two circular areas of increased risk where the risk elevation is 0.5.

1.20 The two clusters generated have very different shapes. There is no configuration of cases within the clusters for which two estimated clusters could have the same shape.

2.1 Using echelons to extract cluster candidates.

2.2 A set of 50,000 cardiovascular disease mortality cases are randomly distributed by population weights to each of 942 ZCTAs in the state of Iowa. A pattern is then extracted using Spatial Filtering. The pattern is binarized, and the resulting polygon cluster candidates are extracted using a GIS.

2.3 An example set of spurious cluster signatures S(ZN) in signature space.

2.4 An example set of spurious cluster signatures S(ZN) in signature space with a few candidate clusters (grey squares).

2.5 Bounding rectangle for elliptical footprint.

2.6 Flowchart of the S.S.S method.

2.7 Population distribution of ZCTAs in Iowa, 2000.

2.8 This figure displays the computational process used to create the simulated dataset. Each bin is labeled as k and has a specific size. For the simulations in this research n=942.

2.9 The simulated datasets follow a multinomial distribution.

2.10 Summary of shapes of simulated spurious clusters, frequency and cumulative frequency.

2.11 Summary of sizes of simulated spurious clusters, frequency and cumulative frequency.

2.12 Summary of rates of simulated spurious clusters, frequency and cumulative frequency.

2.13 Characteristics of the four clusters simulated under the alternative hypothesis.

2.14 Cluster detection diagnostics (the key to the numbers is in the text).

2.15 Patterns detected by the Score statistic and the S.S.S method for one dataset among 20 datasets simulated for cluster-4. The true cluster pattern can be seen in the inset. In this particular dataset S.S.S is able to identify 62% of the true cluster pattern, while the Score statistic is able to identify 20%.

2.16 Patterns detected by the Score statistic and the S.S.S method for one dataset among 20 datasets simulated for cluster-3. The true cluster pattern can be seen in the inset. In this particular dataset S.S.S is able to identify 98% of the true cluster pattern, while the Score statistic is able to identify 91%.

3.1 Spatial patterns of prostate cancer incidence (1999-2004) in Iowa.

3.2 Cluster of prostate cancer incidence in Iowa, detected by the S.S.S method.

3.3 Cluster detected by SaTScan when the geometry of the cluster is assumed to be ellipsoidal.

3.4 Cluster detected by SaTScan when the geometry of the cluster is assumed to be circular.

3.5 Large secondary cluster with low elevation in risk detected by Kulldorff's SaTScan when the geometry of the cluster is assumed to be elliptical.

3.6 ZCTAs in Iowa with a significant value of Rogerson's Score statistic.

3.7 Expected number of cases in ZCTAs: entire Iowa versus areas with a significant value of Rogerson's Score statistic.

3.8 ZCTAs in the North West Iowa cluster of high prostate cancer incidence.

3.9 County boundaries with ZCTAs in the North West Iowa cluster of high prostate cancer incidence.

3.10 Change in mortality and incidence rates from 1990-2004 in five counties (Dickinson, Clay, Buena-Vista, Emmet and Clay Counties) in the cluster. The expected counts for the particular year (1990, 1991 ... 2000) are calculated using 2000 census population for the local area, and incidence/mortality information for the state of Iowa (same procedure as indirect standardization).

3.11 Variations in the directly standardized incidence and mortality rate in Iowa, and incidence of prostate cancer in Dickinson County for the years 1990-2004.

3.12 Variations in the directly standardized incidence and mortality rate in Iowa, and incidence of prostate cancer in Clay County for the years 1990-2004.

CHAPTER 1: DETECTING CLUSTERS OF DISEASE: INVESTIGATING

SPURIOUS CLUSTERS

1.1 Statement of Purpose

This dissertation offers a new cluster detection method that looks at the cluster detection problem from a new perspective. I change the question "What do real clusters look like?" to the questions "What do spurious clusters look like?" and "How do spurious clusters affect the ability to recover real clusters?" Spurious clusters can be identified from their geographical characteristics. These are related to four determinants: the spatial distribution of people at risk; the shape and scale of the geographic units used to aggregate these people; the shape and scale of the spatial configurations that the disease mapping or cluster detection method may impose on the data; and the shape and scale of the area of increased risk. The statistical testing process may also create spurious clusters. I propose that the problem of spurious clusters can be resolved using a computational geographic [1] approach. I argue that Monte Carlo simulations can be used to estimate the patterns of spurious clusters in any situation of interest, given knowledge of the first three of these four determinants. Then, given these determinants, where real measurements of disease or mortality are known, it is possible to show which areas of increased risk are true clusters and which are spurious clusters. The extent of similarity (or dissimilarity) of a cluster to the simulated spurious clusters influences whether it can be recovered. Experiments with simulated data show that this method is successful in detecting clusters. This method can also predict with reasonable certainty which clusters can be recovered and which cannot. I compare this method with Rogerson's Score statistic method [2]. These comparisons expose the weaknesses of Rogerson's method. Finally, these two methods and the Spatial Scan Statistic [3] are applied to a search for possible clusters of prostate cancer incidence in Iowa. The implications of the findings are discussed.
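The Monte Carlo idea in this statement can be sketched in a few lines of Python. This is a minimal illustration and not the dissertation's implementation: the five area populations, the case total and the function names are hypothetical. Cases are allocated to areas with probability proportional to population, so no area carries excess risk, and the highest area rate of each simulated map is recorded.

```python
import random

def simulate_null_counts(populations, total_cases, rng):
    """Allocate cases to areas with probability proportional to population.

    This is one multinomial draw in which no area carries excess risk.
    """
    areas = list(range(len(populations)))
    counts = [0] * len(populations)
    for area in rng.choices(areas, weights=populations, k=total_cases):
        counts[area] += 1
    return counts

def null_max_rates(populations, total_cases, n_sims, seed=0):
    """Highest area rate found in each simulated null dataset."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sims):
        counts = simulate_null_counts(populations, total_cases, rng)
        maxima.append(max(c / p for c, p in zip(counts, populations)))
    return maxima

# Hypothetical populations for five areas (illustrative values only).
populations = [50, 200, 1000, 5000, 20000]
total_cases = 500
overall_rate = total_cases / sum(populations)
maxima = null_max_rates(populations, total_cases, n_sims=200)

print(f"overall rate: {overall_rate:.4f}")
print(f"median of the highest null rate: {sorted(maxima)[len(maxima) // 2]:.4f}")
```

Even though every simulated map is pure noise, the highest area rate always sits at or above the overall rate, which is why a pattern of elevated rates alone cannot separate true clusters from spurious ones.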

1.2 Introduction

Disease mapping has a long history. From John Snow's cholera map to the "intelligent agents" [4] of the present century, disease mapping has progressed with developments in science, especially Geographical Information Systems (G.I.S) and epidemiology. Some of the first disease maps were simple dot maps indicating the location of disease cases. These gave way to maps of statistical summaries known as "thematic maps". These maps convey more information than simple dot maps and are therefore powerful exploratory and decision making tools. For example, when mortality maps of lung cancer for the United States were made in the 1960s, high rates were found in areas of the Eastern Seaboard [5, 6]. Later, these high rates were attributed to exposure to asbestos among shipyard workers in these areas. A disease map can thus be used to map spatial variations in disease risk. A decision maker can ask "Is a person living in a given area at a greater risk of disease than a person living in another area?" or "In which areas of the map do people have the greatest risk of disease?" In the disease mapping literature the problem of finding areas of excess risk is often called "cluster detection", a cluster being defined as "A geographically bounded group of occurrences of sufficient size and concentration to be unlikely to have occurred by chance" [7] or, in plain English, a geographic area of high disease risk. A geographical cluster is therefore the spatial analogue of a statistical cluster [8], where the question of interest is finding things that are near in statistical space instead of geographical space.
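One common way to make "unlikely to have occurred by chance" concrete is a tail probability. The sketch below assumes, purely for illustration, that the case count in an area follows a Poisson distribution with a known expected count; the Poisson model, the function name and the numbers are assumptions and are not taken from the definition quoted above.

```python
import math

def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    below = sum(
        math.exp(-expected) * expected ** k / math.factorial(k)
        for k in range(observed)
    )
    return 1.0 - below

# Hypothetical area: 4 cases expected from the reference rate, 10 observed.
p_value = poisson_upper_tail(observed=10, expected=4.0)
print(f"P(at least 10 cases | 4 expected) = {p_value:.4f}")  # roughly 0.008
```

A small tail probability for a single, pre-specified area suggests an excess. It says nothing yet about areas chosen after looking at the map, which is where spurious clusters arise.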

While investigating the causal factors (or etiology) of areas of increased risk is important, there are other important applications of these methods. Public health agencies are often interested in allocating resources to areas with an increased burden of disease [9, 10]. Cluster detection methods are used to identify such areas. Sometimes, environmental policy is formulated on the basis of such studies. In one instance, the Vatican was taken to task for operating radio transmitters at illegal frequencies after studies showed an increased risk of cancer among people living close to these transmitters [11, 12]. Note that policies are often formulated on the basis of evidence that an increased risk exists even though the etiological basis for the increased risk may not have been established. An interesting extension to etiological research is that the presence of spatial clusters of increased risk could also be used to prove the existence of disease risk factors that are spatially non-random. For example, it has been claimed that clusters of autism in California prove the existence of risk factors that are not related to genetics or the vaccine hypothesis¹ (barring selective migration) [13]. Many public health agencies maintain "on the fly" cluster investigation infrastructure to address cluster-related enquiries [14].

A number of methods exist that can be used to delineate clusters. A persistent problem with many of these methods is that areas not at high risk are identified as high-risk areas. Some convenient terms for such false positives are "noise" [15], "noisy clusters" or "spurious clusters" [16-19]. In this research I develop a method to detect and adjust for the occurrence of spurious clusters in cluster detection studies.

The cluster detection literature identifies at least three types of spurious clusters.

The first is when the estimate of risk in an area is based on a small number of people [15]. These estimates of risk are unreliable and therefore the area may not have a significant excess risk. A number of solutions exist to solve this problem [20-26]. The second type of spurious cluster stems from statistical issues in the cluster detection method. For example, failing to adjust for multiple hypothesis testing may give rise to spurious clusters [18, 27]. This problem is an area of active research [28]. Kulldorff's SaTScan method resolves this problem by adopting a likelihood-based hypothesis testing framework [3].

¹ The vaccine hypothesis is that exposure to thimerosal, a mercury-based additive in vaccines, is a risk factor for autism.

The third type of spurious cluster is created by a mismatch between the scale and spatial structure of the process that generates the cluster and the scale and spatial structure used to measure that process. The scale and spatial structure, or spatial form, of the cluster search process (which measures or samples the underlying data) can generate spurious clusters. Unlike the other sources of spurious clusters, very little research exists on this form of noise. There are a number of reasons for this. Until recently, the computational power available to researchers for cluster detection problems was limited. A cluster can have any geometry or spatial form in reality. However, limited computational power confined researchers to searching for clusters within a small range of spatial forms.

For instance, it is a common strategy to search for circular clusters. This strategy was adopted by some of the first cluster search methods [27], and remains common today [29]. If the real cluster is not circular in shape, then the power to detect it is greatly reduced. But a limited search also limits the likelihood of a mismatch between the search circles and the underlying true cluster (given that the spatial form of this true cluster is unknown). In contrast, if the cluster search incorporates a number of different spatial forms, then the likelihood of mismatch increases.

Since computational power is no longer a limiting factor, some researchers have developed "shape free" disease cluster detection methods. These methods, which draw from the work of geographers in the 1960s and 70s [30], measure spatial attributes (like disease counts or rates) at a large number of possible shapes, sizes and scales. The measured spatial attributes, or some functions of the attributes, are used to decide whether an area of a given shape and size at a given scale is a cluster. For example, Duczmal's [31] scan assigns a likelihood value to each cluster it finds, where the likelihood is a function of attributes such as the observed number of cases in the cluster. The candidates with the highest likelihood are most likely to be clusters. These methods thus promise to seek out the true clusters, no matter what their spatial form. However, this also means that at some shape and scale, noise or spurious clusters will be detected. These spatial forms will represent a mismatch between the shape and scale of the process that generated the data and the shape and scale of the process being used to detect it. The closest analogy is what is known in the disease mapping literature as the "Texas Sharpshooter Effect". If a shotgun is fired at a wall, the wall is splattered with seemingly random bullet holes. At the scale of the wall, the process is random. However, it is always possible to draw targets a posteriori around the bullet holes. The act of drawing a target is similar to searching for a cluster at a scale different from the scale at which the original process occurred (the entire wall). Duczmal's search procedure thus often finds clusters that are spurious. Such spurious clusters will be found by any method that offers even the least amount of geometric freedom to the cluster search. In fact, these spurious clusters have been found even when the search is limited to circular geometries (for example, see Kulldorff [32]).

Tackling this problem therefore requires a) a thorough understanding of what gives rise to these spurious clusters, and b) a method to solve or, at the very least, manage this problem. This dissertation is an attempt at both.
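The Texas Sharpshooter Effect can be reproduced with a small simulation. The sketch below is a simplified illustration under stated assumptions: cases are scattered at random over a hypothetical 10 x 10 grid with a uniform population, square windows of three sizes stand in for a flexible search, and an observed/expected ratio stands in for the likelihood values used by the scan methods cited above. None of the names or parameters come from those methods.

```python
import random

def random_cases(n, n_cases, rng):
    """Scatter cases uniformly at random over an n x n grid of cells."""
    grid = [[0] * n for _ in range(n)]
    for _ in range(n_cases):
        grid[rng.randrange(n)][rng.randrange(n)] += 1
    return grid

def max_excess(grid, sizes, n_cases):
    """Largest observed/expected ratio over all square windows of the given sizes."""
    n = len(grid)
    per_cell = n_cases / (n * n)  # expected cases per cell under randomness
    best = 0.0
    for s in sizes:
        for r in range(n - s + 1):
            for c in range(n - s + 1):
                observed = sum(
                    grid[i][j] for i in range(r, r + s) for j in range(c, c + s)
                )
                best = max(best, observed / (per_cell * s * s))
    return best

rng = random.Random(42)
n, n_cases, sizes = 10, 50, (1, 2, 3)

# The "observed" map is itself pure noise: there is no true cluster.
observed_max = max_excess(random_cases(n, n_cases, rng), sizes, n_cases)

# Monte Carlo reference: the best window found in other pure-noise maps.
null_maxima = [max_excess(random_cases(n, n_cases, rng), sizes, n_cases)
               for _ in range(199)]
p_adjusted = (1 + sum(m >= observed_max for m in null_maxima)) / (1 + len(null_maxima))

print(f"best window shows {observed_max:.1f} times the expected cases")
print(f"Monte Carlo p-value of that best window: {p_adjusted:.3f}")
```

Every noise map contains a window with at least twice the expected number of cases, because the window is drawn after the cases have fallen. Comparing the best observed window with the best windows of simulated noise maps is one way to judge how unusual it really is.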

It is clear that an understanding of this problem requires an understanding of the scale and shape of the spurious cluster or noise generating process. The shape, size and risk elevation of a cluster, whether spurious or real, are unique to each disease mapping or cluster detection situation. These characteristics depend on: a) the cluster generating process, especially the shape and size of the area of excess risk; b) the spatial distribution of people; and c) the scale at which the spatial data are aggregated [19]. These factors are unique to each disease mapping situation, and they are responsible for creating spurious clusters. Two facts follow: 1) every disease mapping situation has a unique noise or spurious cluster signature; 2) it is not possible to guess this signature a priori. However, this signature may be computed, as explained below.
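As an illustration of what a (shape, size, rate) signature might look like for a single candidate cluster, the sketch below works on a grid of cells. The compactness index 4πA/P² is an assumed stand-in for the shape measure, which this section does not define, and the grid, counts and function names are hypothetical.

```python
import math

def perimeter(cells):
    """Number of cell edges on the boundary of a set of grid cells."""
    cells = set(cells)
    return sum(
        (r + dr, c + dc) not in cells
        for r, c in cells
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
    )

def signature(cells, cases, population):
    """Return (shape, size, rate) for a candidate cluster of grid cells."""
    size = len(cells)
    shape = 4 * math.pi * size / perimeter(cells) ** 2  # assumed compactness index
    rate = sum(cases[cell] for cell in cells) / sum(population[cell] for cell in cells)
    return shape, size, rate

# Hypothetical 4 x 4 study area with 10 people per cell and 1 case per cell,
# except for a 2 x 2 block where each cell has 3 cases.
population = {(r, c): 10 for r in range(4) for c in range(4)}
cases = {cell: 1 for cell in population}
block = [(0, 0), (0, 1), (1, 0), (1, 1)]
strip = [(3, 0), (3, 1), (3, 2), (3, 3)]
for cell in block:
    cases[cell] = 3

block_sig = signature(block, cases, population)
strip_sig = signature(strip, cases, population)
print("compact block:", block_sig)
print("elongated strip:", strip_sig)
```

The block and the strip have the same size but different shape values, so two candidates with equal area and similar rates can still occupy different positions in signature space.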

Since each disease mapping situation has a unique noise or spurious cluster signature, it follows that in every disease mapping situation there will be some clusters that are hard to detect. These clusters will be similar in some ways to the spurious or noisy clusters. This issue, the issue of "recoverability", has only just started to be discussed in the disease mapping literature [33, 34]. The method I describe incorporates the following features. First, it extracts cluster candidates using an exploratory approach. Second, it uses shape, size and rate to distinguish true clusters from spurious clusters. Third, it incorporates the recoverability of clusters into the analysis. The researcher is able to know (computationally) a priori which spatial forms of clusters are recoverable. The method uses computational geography and two fundamental geographic aspects of clusters, shape and size, to analyze the recoverability of clusters and to separate clusters from non-clusters or spurious clusters. This dissertation diverges from the traditional disease clustering literature in taking shape and size into consideration. Traditionally, only the rate at a given location, or some function of the rate, is used to separate a true cluster from a spurious one. Since the method incorporates the shape and size of the cluster in its analysis, I call it the Shape, Size Sensitive disease cluster detection method, or the S.S.S method. The S.S.S method is tested and validated on simulated data. This method demonstrates the power of computational geography over traditional methods [35]. The ideas and methods developed and tested in this dissertation are either new or have been discussed only in scant detail in the literature. Yet they are fundamental to geography and disease mapping. This research thus makes an important contribution to the disease mapping literature.

1.3 Organization of the dissertation

In this chapter (Chapter 1) I discuss how various disease mapping and cluster detection techniques approach the problem of spurious clusters. I then argue that these

methods do not address the issue of spurious clusters adequately. I suggest that a

geographical approach can help us better understand the problem and explain how

geography gives rise to spurious clusters. Then, having understood the geographical

bases for spurious clusters I propose a geographically sensitive disease cluster detection

method. I explain this method, the Shape Size Sensitive (S.S.S) method, in Chapter 2. Then, using simulated data, I test the sensitivity of this method. I also compare the

performance of the S.S.S method with Rogerson's Score statistic method for detecting

disease clusters. The final, short chapter is Chapter 3. Here I use the S.S.S method,

Rogerson's Score statistic and Kulldorff's Spatial Scan Statistic to investigate the spatial

patterns of prostate cancer risk in Iowa. The implications of the findings are discussed.

1.4 Review of existing methods of cluster detection

All disease mapping and cluster detection approaches share a common goal. This

is to uncover the underlying pattern of risk. These methods calculate statistics as rates or

likelihoods which serve as measures of risk. The "patterns" on a map are obtained by

mapping either these statistics, or those areas that cross some threshold of the calculated

statistic. When the second procedure is followed, that is, when the rate or the likelihood of an

area having an excess risk is statistically tested, the method is often called a "cluster

detection" method. Most cluster detection methods test a large number of areas which

could possibly be clusters. These are called "candidate clusters" [31, 36] or "cluster candidates". If a cluster passes the statistical test but demarcates an area where no

cluster exists in reality, then it is a "noisy cluster" [31] or "spurious cluster" [16-19]. The term "true cluster" may be used to indicate geographic areas of excess risk. It is also

possible that a true cluster is suppressed by the cluster detection process. In the disease

cluster detection literature this problem is usually not discussed separately, but forms an

integral part of the spurious cluster detection problem. Spurious clusters may be created

at various stages in the disease mapping/cluster detection process. The first step for

applying a cluster detection method is to collect spatial data. These data may come

pre-aggregated into administrative regions, or they may come in individual form [37, 38]. If the data are in individual form, they need to be processed and aggregated

such that summary statistics may be gleaned from them and the summary statistics

mapped. The process of aggregation may create spurious clusters. One solution is to use

the individual level data to search for clusters [39]. While a number of methods will work with both aggregated and individual level data, very few methods have

been developed exclusively for individual level data [40, 41]. With better quality data being increasingly available, such analyses will become more common [37, 42]. The majority of disease mapping situations start with aggregated data, and summary statistics are calculated from these datasets. When the summary statistics are calculated based on a

small base population (also called a small support size), these statistical estimates are likely to be unreliable. This is the "small number problem". Some methods carry out

a process called "smoothing", where information from neighboring regions is used to

obtain a better estimate of the mapped statistic for a given region. This, to some extent

alleviates the problem of spurious clusters created from small numbers. The statistical

testing procedure could also create spurious clusters. If multiple hypothesis tests are carried out

without adjustment, this process may also give rise to spurious clusters. In a famous example, Openshaw [27] carried out multiple hypothesis tests when searching for leukemia clusters in Northern England. Whenever a test was significant, a circle was

drawn. Some of these circles were spurious clusters, and would not have existed if

adjustments for multiple testing had been carried out. Sometimes, using the wrong reference distribution may also create spurious clusters. Conversely, using overly conservative

multiple testing correction techniques may suppress true clusters [28]. Waller and Gotway [4] write of situations where, for a Poisson reference distribution, it is not possible to distinguish a lack of fit to the Poisson distribution (spurious cluster) from a rejection of the null hypothesis (true cluster). This is an area of active statistical research, and some new and innovative solutions have been proposed to these problems [43, 44]. Kulldorff's SaTScan method uses a likelihood based hypothesis testing framework to

solve the problem of multiple testing [3]. Instead of testing multiple hypotheses, this method tests only one hypothesis. This hypothesis test is carried out on the cluster

candidate that is most likely to be a cluster. The likelihood is a statistical function,

that is calculated under the assumption that the observed data conform to certain known

distributions (e.g., Poisson or binomial).

There still remains the third source of spurious clusters. Unlike the first two, there

is little research on this source of spurious clusters. This is when spurious clusters are

created from mismatch between the process that generates the disease map patterns, and

the processes used to recover the patterns. This mismatch could arise when the data are

aggregated to administrative regions, or to other shapes and scales by the method of

analysis. In this section I discuss the various methods of cluster

detection in the context of their ability to handle this problem. Among the various methods

available, some methods offer the opportunity of multiscalar analysis. In these methods,

the data may be geographically rescaled. While these methods geographically process the

data before mapping patterns, other methods consider the sanctity of geographic

boundaries unbreachable. The latter attempt to expose the underlying risk pattern by

mapping summary statistics within existing geographic boundaries without any further

geographic processing of the data.

1.4.1 Map data without further geographic processing

In these methods the geographic boundaries of regions are left as they are,

however various statistical manipulations are carried out on the data. Some researchers

prefer to call this group of methods "disease mapping" methods [45]. As I discussed earlier, these methods can be subdivided into two groups: methods that smooth the

data and methods that do not smooth the data.

1.4.1.1 Methods that do not smooth the data

The vast majority of disease maps are maps of raw rates, where the number of cases per unit population within existing geographic regions such as counties or states is

mapped [46]. Another approach is a "map of probabilities" [47, 48], where instead of mapping a rate, the probability of observing the rate within a geographic region is

mapped. Mapping raw rates is often problematic when the rates are based on small base

populations [15]. The maps thus produced are likely to display noisy (small number problem) patterns.
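The small number problem can be illustrated with a short simulation. The sketch below is not taken from any of the cited methods; the population sizes, the true risk of 0.01 and the function names are hypothetical values chosen only for illustration. Every area has the same true risk, so any variation in the raw rates is noise.

```python
import random

def simulate_raw_rates(populations, risk, seed=0):
    """Draw a case count for each area (one Bernoulli trial per person)
    and return the raw rate cases / population for every area."""
    rng = random.Random(seed)
    rates = []
    for pop in populations:
        cases = sum(1 for _ in range(pop) if rng.random() < risk)
        rates.append(cases / pop)
    return rates

def spread(rates):
    """Range of the raw rates across areas."""
    return max(rates) - min(rates)

# Hypothetical setup: identical true risk, very different support sizes.
true_risk = 0.01
small = simulate_raw_rates([50] * 200, true_risk, seed=1)
large = simulate_raw_rates([5000] * 200, true_risk, seed=2)

print("range of raw rates, population 50:  ", round(spread(small), 4))
print("range of raw rates, population 5000:", round(spread(large), 4))
```

Under these assumptions the rates based on 50 people vary far more than the rates based on 5000 people, although the underlying risk is identical. A map of the first set of rates would therefore show apparent high-risk areas that are noise.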

1.4.1.2 Methods that smooth the data

In these methods various statistical manipulations are used to smooth the rates

in each region while at the same time keeping the geographic boundaries intact.

Information from neighboring regions is used to stabilize the rates in a given region.

Some examples of this approach can be found in the Bayesian disease mapping literature

[23, 24]. Other examples are the method of moving averages and headbanging [20, 22]. These methods are not very successful in dealing with the problem of spurious clusters. A study by Kafadar [22] has shown that many of the popular smoothers, such as headbanging and empirical Bayes, are unable to detect true patterns in the data or have

issues with detecting spurious patterns or clusters. Some of the methods smooth the data

by averaging rates over kernels or filters. For example, Sabel et al. [49] investigate rates of Amyotrophic Lateral Sclerosis (Lou Gehrig's disease) incidence in Finland by smoothing rates using Gaussian kernels. Another method is Rogerson's Local Score

statistic [2, 4, 50]. In this method the deviations from the expected rate are smoothed using Gaussian Kernels. Like other methods, if the rates are based on small numbers,

then smoothing these unreliable rates may create spurious clusters. I use Rogerson's

Score statistic in my research and therefore, this method is discussed in detail in later

sections. Spurious clusters are often created by these methods. First, because these

methods map the rates based on small areas before smoothing them, they are prone to the

small number problem. Second, these methods do not in any way attempt to deal with the

problem of spurious clusters from spatial mismatch discussed earlier. Third, the statistical

tests that these methods carry out may not be able to distinguish spurious clusters from

true clusters. For example, there is no consensus on what the correct reference

distribution is for Rogerson's Score statistic [2, 4, 50].

A separate group of methods that often smooth the data are local measures of

spatial similarity. These methods, which are also known as LISA (Local Indicators of Spatial Autocorrelation) [51], address the question "How similar is the risk at a given small area to that of its neighbors?" The greater the similarity, the higher the likelihood

that the small area belongs to (or is) a cluster. Some of the LISA statistics are local Moran's I and local Geary's C [50-54]. Since the underlying philosophy of this approach is that things nearer are more similar than things farther away [55], the implicit definition of scale here is the distance at which this similarity is manifested. Thus a process that acts

at a large scale may cause greater similarity among immediately neighboring local areas than

processes that work at a smaller scale. As with other methods, if the statistics are calculated

on small areas, they could be unreliable. The reference distributions of LISA statistics are

often not known [4] and the scale at which a process operates is not investigated before

LISA statistics are calculated. Any of these factors could lead to the creation of spurious

clusters.
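As an illustration of how a LISA statistic measures similarity between an area and its neighbors, the sketch below computes local Moran's I in its commonly used form, I_i = (z_i / m2) * (average of z_j over the neighbors of i), where z is the deviation of each rate from the mean and m2 is the variance of the rates. The row-standardized binary weights, the six-area chain and the rates are assumptions made for this example and do not come from the studies cited above.

```python
def local_morans_i(values, neighbors):
    """Local Moran's I for each area, using row-standardized binary weights.

    values    -- list of rates, one per area
    neighbors -- dict mapping an area index to the list of its neighbors
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n  # variance of the rates
    result = []
    for i in range(n):
        nbrs = neighbors[i]
        lag = sum(z[j] for j in nbrs) / len(nbrs)  # mean deviation of the neighbors
        result.append(z[i] * lag / m2)
    return result

# Hypothetical chain of six areas; the first three have elevated rates.
rates = [9.0, 8.0, 9.0, 2.0, 1.0, 2.0]
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
local_i = local_morans_i(rates, chain)
print([round(v, 2) for v in local_i])
```

Areas surrounded by similar values receive positive scores. Note that the result depends entirely on the neighborhood definition passed in, which is the implicit choice of scale discussed above.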

1.4.2 Methods that pre-process the data before calculating and/or testing for significant disease risk

These methods allow the modification of geographic boundaries to extract the

underlying risk surface and/or to find which area has the greatest excess in disease risk.

One group of methods, often called density estimation methods [56], simply ignore existing geographic boundaries. Drawing from the "field" theory of geographic

phenomena [20], they consider that disease risk patterns are continuous in nature and that they do not change or stop abruptly at geographic boundaries. When appropriately used,

these methods provide the opportunity to control the spatial basis of support, and thus, the

scale of the analysis [57, 58]. The other group of methods draw from concepts of region building which were developed by geographers [30]. One approach to building regions is to coalesce groups of areas to build aggregate regions. These methods attempt to find

that combination of areas which has the greatest likelihood of being a zone of high

disease risk. A third group of methods combine concepts of region building methods with

the first group of methods or with methods discussed in the last section. The ability of all

these methods is limited by the scale of the data. Often the data come aggregated into

small areas and the analysis must be carried out at scales equal to or greater than the scale of

aggregation. Nevertheless, these methods are better equipped than other methods to

control the shape and the scale of the data, and this gives them an edge over other

methods when dealing with the problem of spurious clusters.

1.4.2.1 Non combinatorial approaches

These methods ignore geographic boundaries and attempt to extract the

underlying patterns of risk. They often lay a uniform grid over the map area and measure

the statistic of interest at each grid point. Irrespective of whether the data are aggregated

or not, a value can be obtained at each grid point. While there are a number of approaches

to calculating the statistic at each grid point [21], a simple and common approach is to "filter" the data using circular spatial filters [3, 9, 21, 27]. Some methods map the statistic calculated at each grid point [9] while others do not [3]. These circles can be of fixed or varying sizes. However, since these filters are of a certain shape, they bias the cluster

search. The bias is in favor of detecting clusters of, or similar to, the shape of the filter

(circles in this case). Statistically, the clusters that are of the shape of the filter have a higher power of detection than clusters of other shapes. This approach therefore,

overcomes the limitation outlined in the methods discussed earlier, but is limited in its

treatment of geographic shape. Ellipses and other geometric shapes have also been

studied [29, 59]. One of the methods, based on Rushton's Adaptive DMap [9], maps rates at grid points using adaptive filters and interpolates these with an IDW (Inverse Distance Weighting) interpolation algorithm. The adaptive filter [58, 60] ensures that the rates are based on the same number of people or the same support size. Thus, unlike the

LISA methods, all statistics are equally reliable. Also, the use of an adaptive filter

ensures that the scale of the analysis can be precisely controlled. The Inverse Distance

Weighting Algorithm used for creating the final pattern was also found by Kafadar [22] to be the least noisy of all smoothing/interpolation methods. Thus, by allowing

multiscalar analysis, relative freedom of cluster shape (clusters don't have to conform to geographic boundaries) and the use of a robust interpolation technique, Rushton's Adaptive Filtering method is best suited for dealing with the problem of spurious clusters from

mismatch between the process and analysis scales. I use this method in my analyses.
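The idea of an adaptive filter can be sketched in a few lines. The code below is a generic illustration, not the DMap implementation: at a grid point it pools the nearest area centroids until a minimum population is reached, so that every rate rests on a comparable support size. The tuple layout, the function name and the sample data are hypothetical.

```python
import math

def adaptive_filter_rate(grid_point, areas, min_population):
    """Rate at one grid point, pooled over the nearest areas until the
    pooled population reaches min_population.

    areas -- list of (x, y, population, cases) tuples, one per area centroid
    Returns (rate, radius), where radius is the distance from the grid
    point to the farthest centroid that was included.
    """
    gx, gy = grid_point
    # Visit centroids from nearest to farthest.
    by_distance = sorted(areas, key=lambda a: math.hypot(a[0] - gx, a[1] - gy))
    pop = cases = 0
    radius = 0.0
    for x, y, p, c in by_distance:
        pop += p
        cases += c
        radius = math.hypot(x - gx, y - gy)
        if pop >= min_population:
            break
    return cases / pop, radius

# Hypothetical centroids: (x, y, population, cases).
areas = [(0, 0, 100, 2), (1, 0, 400, 3), (5, 0, 1000, 10), (6, 0, 50, 1)]
rate_a, radius_a = adaptive_filter_rate((0, 0), areas, min_population=500)
rate_b, radius_b = adaptive_filter_rate((6, 0), areas, min_population=1200)
print(rate_a, radius_a)
print(rate_b, radius_b)
```

In this sketch the filter radius grows where people are sparse and shrinks where they are dense, which is how the support size is held roughly constant. If the total population is below the threshold, the sketch simply uses all areas.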

Another important density estimation method is Kulldorff's SaTScan [3]. While the

DMap method maps the extracted pattern, and is therefore good for visualizing and

exploring the underlying pattern, SaTScan can be used to map only those areas that are

significant clusters. SaTScan has found wide acceptance in the public health community

because of its ability to account for the multiple hypotheses testing problem and a robust,

freely available software. Some of the recent developments in the disease clustering

literature have followed the combinatorial approaches that I discuss next, and their

method of choice has been based on the Spatial Scan Statistic method of cluster

detection. Since multiple testing is an issue with these combinatorial approaches, the

Spatial Scan Statistic is a reasonable choice. Since I use the Spatial Scan Statistic in

Chapter 3 to investigate clusters of prostate cancer in northwest Iowa, some of the

details of the Spatial Scan Statistic are provided next:

The scan statistic originated as a one dimensional test. Its objective was to test if a one dimensional point process is purely random. The one dimensional scan

statistic was extended by Kulldorff into the spatial domain [3]. The spatial scan statistic moves a circle across the study area. The circle is centered on a centroid. The centroid

could be the location of a single individual for unaggregated data, the centroid of a census

tract (for example) for aggregated data, or one of a set of grid points. Kulldorff (1997) [3] states: "The zone defined by a circle consists of all individuals in those cells whose

centroids lie inside the circle and each zone is uniquely identified by these individuals.

Thus, although the number of circles is infinite the number of zones will be finite. For

unaggregated data the zones are perfectly circular, that is, the individuals in the zone are

exactly those located within a defining circle. With data aggregated into census districts,

a zone may have irregular boundaries that depend on the size and the shape of the several

contiguous census districts it includes." The Spatial Scan Statistic is implemented

through the freely available software SaTScan [32]. The methodology of the Spatial Scan Statistic is explained as follows. The method involves two steps: 1. confounder

adjustment and 2. hypothesis testing.

In disease cluster detection studies known risk factors or confounders are

adjusted for, before the cluster detection algorithm is implemented. Thus, for example, it is known that age is associated with prostate cancer. It may be desirable to remove the

effect of age from the analyses, such that the clusters that are detected reflect the presence

of other, yet unknown, risk factors. The confounder adjustment procedure that SaTScan utilizes is known as the indirect standardization method. It is as follows:

If

$e_i$ = expected number of cases in local area/ZCTA $i$ after confounder adjustment,

$n_i$ = observed number of cases in local area/ZCTA $i$ after confounder adjustment,

$r$ = a specific confounder group, for example the age group from 45 to 65 years,

$n_r$ = total number of cases in $G$ in age group $r$,

$N_{ir}$ = total number of people in $G$ in local area $i$, in age group $r$,

then, summing over all confounder groups $r$ and writing $\sum_{i'} N_{i'r}$ for the total number of people in $G$ in group $r$, the confounder adjustment procedure is:

$$e_i = \sum_{r} \left[ \frac{n_r}{\sum_{i'} N_{i'r}} \, N_{ir} \right]$$
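As a worked illustration of the indirect standardization formula, the sketch below computes the expected count for each area from group-specific regional rates. It is not SaTScan code; the function name, the two age groups, the two areas and all the counts are hypothetical.

```python
def expected_cases(cases_by_group, population):
    """Indirectly standardized expected counts e_i for every local area.

    cases_by_group -- dict: group r -> total cases n_r in the whole region G
    population     -- dict: area i -> dict: group r -> number of people N_ir
    """
    # N_r: total population of group r over all areas in G.
    group_totals = {
        r: sum(pop[r] for pop in population.values()) for r in cases_by_group
    }
    # Region-wide rate of each group, n_r / N_r.
    group_rates = {r: cases_by_group[r] / group_totals[r] for r in cases_by_group}
    # e_i: apply each regional group rate to the area's own group population.
    return {
        i: sum(group_rates[r] * pop[r] for r in cases_by_group)
        for i, pop in population.items()
    }

# Hypothetical region with two age groups and two areas.
cases_by_group = {"45-64": 30, "65+": 90}
population = {
    "A": {"45-64": 1000, "65+": 500},
    "B": {"45-64": 2000, "65+": 1000},
}
e = expected_cases(cases_by_group, population)
print(e)
```

In this example the regional rates are 30/3000 and 90/1500, so area A is expected to have 10 + 30 = 40 cases and area B 20 + 60 = 80 cases. The expected counts sum to the 120 observed cases in the region, as they must under indirect standardization.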

The adjusted numbers of cases are then used to test the hypothesis if a given local

area/ZCTA i has an excess risk/belongs to a cluster. The hypothesis testing procedure is

explained next. The Spatial Scan Statistic tests the hypothesis if a given area of the map

(for example a collection of ZCTAs) has a greater (or lesser) risk, than the rest of the

ZCTAs in the entire geographic region G.

Let $Z_j$ be the $j$th cluster candidate in $Z$, the collection of $k$ possible clusters in $G$. For every $Z_j$ in $Z$, let $R(\text{inside}, j)$ be the risk inside $Z_j$ and $R(\text{outside}, j)$ the risk outside $Z_j$. The null and alternative hypotheses are:

$$H_0: R(\text{inside}, j) = R(\text{outside}, j)$$

$$H_1: R(\text{inside}, j) > R(\text{outside}, j)$$

The observed number of cases $n_j$ inside (or outside) a cluster candidate is assumed to be Poisson distributed, and a function of the expected number of cases in the cluster $e_j$ and the risk $R(\text{inside}, j)$:

$$n_j \sim \text{Poisson}\left[ e_j \, R(\text{inside}, j) \right]$$

Let $n$ be the total number of cases in $G$. The likelihood ratio formed from the null and alternative hypotheses is:

$$\text{LR}_j = \frac{\text{Likelihood}\left( R(\text{inside}, j) > R(\text{outside}, j) \right)}{\text{Likelihood}\left( R(\text{inside}, j) = R(\text{outside}, j) \right)}$$

This likelihood ratio can be solved and written in the logarithmic form as follows:

$$\text{LLR}_j = n_j \ln\left( \frac{n_j}{e_j} \right) + (n - n_j) \ln\left( \frac{n - n_j}{n - e_j} \right)$$
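The log likelihood ratio can be evaluated directly for any cluster candidate. The following sketch is an illustration rather than the SaTScan source. It assumes that n is the total number of cases in the region and that the expected counts sum to n. Returning zero when the candidate shows no excess is an assumption made here to match the one-sided alternative hypothesis.

```python
import math

def log_likelihood_ratio(n_j, e_j, n):
    """LLR_j = n_j ln(n_j / e_j) + (n - n_j) ln((n - n_j) / (n - e_j)).

    n_j -- observed cases inside the candidate
    e_j -- expected cases inside the candidate
    n   -- total cases in the region (assumed equal to the total expected)
    """
    if n_j <= e_j:
        # Assumption: only an excess inside the candidate counts as evidence.
        return 0.0
    inside = n_j * math.log(n_j / e_j)
    # The second term is taken as 0 when every case lies inside the candidate.
    outside = (n - n_j) * math.log((n - n_j) / (n - e_j)) if n_j < n else 0.0
    return inside + outside

# Hypothetical candidate expecting 10 of the region's 100 cases.
for observed in (5, 15, 20):
    print(observed, round(log_likelihood_ratio(observed, 10.0, 100), 3))
```

The value grows as the observed count moves further above the expected count, which is why the candidate with the largest value is treated as the most likely cluster.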

The significance of the log likelihood ratio is tested using a Monte Carlo

hypothesis test. The SaTScan program carries out a user-specified number of Monte

Carlo randomizations of the data and tests the presence of a cluster at a significance level of 0.001 % (this level can also be user

specified). A p value is reported. This is

calculated as "p value = Rank of LLR / (1 + #simulation)". Note that the spatial scan

statistic procedure does not adjust for multiple testing in the traditional sense for example

by carrying out a Bonferroni or other multiple testing adjustment procedure. Instead, it

avoids the problem of testing multiple hypotheses by concentrating on those cluster

candidates that are most likely to be true clusters (and thus have the highest log likelihood

value). Also note that the Spatial Scan Statistic procedure explained above is the spatial

Poisson model, which is the model used in disease mapping. There are numerous other

modifications to the Spatial Scan Statistic procedure [29].
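The Monte Carlo hypothesis test described above can be sketched as follows. This is a simplified illustration and not the SaTScan procedure itself: each single area is treated as a candidate zone, cases are redistributed among areas in proportion to their expected counts, and only the maximum log likelihood ratio of each replicate is kept. The function names and the data are hypothetical.

```python
import math
import random

def llr(n_j, e_j, n):
    """Log likelihood ratio of one candidate (zero when there is no excess)."""
    if n_j <= e_j:
        return 0.0
    outside = (n - n_j) * math.log((n - n_j) / (n - e_j)) if n_j < n else 0.0
    return n_j * math.log(n_j / e_j) + outside

def max_llr(counts, expected):
    """Statistic of the most likely cluster among single-area candidates."""
    n = sum(counts)
    return max(llr(c, e, n) for c, e in zip(counts, expected))

def monte_carlo_p_value(observed, expected, n_simulations=999, seed=0):
    """p value = rank of the observed maximum LLR / (1 + number of simulations)."""
    rng = random.Random(seed)
    n = sum(observed)
    areas = range(len(expected))
    observed_stat = max_llr(observed, expected)
    rank = 1  # the observed data set counts as one replicate
    for _ in range(n_simulations):
        counts = [0] * len(expected)
        # Null hypothesis: each case falls in an area with probability e_i / n.
        for i in rng.choices(areas, weights=expected, k=n):
            counts[i] += 1
        if max_llr(counts, expected) >= observed_stat:
            rank += 1
    return rank / (1 + n_simulations)

# Hypothetical data: ten areas that each expect 10 of the 100 cases.
expected = [10.0] * 10
clustered = [30, 8, 8, 8, 8, 8, 8, 8, 7, 7]
flat = [10] * 10
p_clustered = monte_carlo_p_value(clustered, expected)
p_flat = monte_carlo_p_value(flat, expected)
print(p_clustered, p_flat)
```

Because each replicate contributes only its maximum, the observed most likely cluster is compared with the most likely cluster of every noise data set. This is the sense in which a single hypothesis is tested rather than one hypothesis per candidate.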

1.4.2.2 Combinatorial Approaches

Some geographers are interested in creating or building regions [30, 61-64]. Regions are built up by assigning small areas to groups such that they fulfill certain

criteria. Regional geographers have called this the "assignment problem". Small areas

are so assigned to regions, that a certain attribute of the region is optimized [30, 62]. Sometimes, the problem could involve maximizing the variation in an attribute of the

newly built region as a proportion of the variation within the entire map [30, 65]. The general question in this approach is "What combination of areas will optimize a given

objective?". In the disease mapping context disease risk or the likelihood of risk can be maximized. An example in the disease mapping context was investigated by Alvanides

[61]. A similar strategy was also suggested (but not implemented) by Rushton [66]. These ideas were implemented in computer programs first by Openshaw [64] and later by other researchers [63, 67, 68]. Independently Duczmal suggested a similar solution to finding disease clusters of any shape. He operationally achieved this by maximizing the

Spatial Scan Statistic likelihood function over possible combinations of areas. While it is

sometimes possible to look at all possible combinations/ collections of areas, for most

realistic geographical areas this is not possible (for example, see Cliff and Haggett [62]). Neither are there theoretical solutions to the problem. In operations research, such

problems are called NP-complete. This means that for a collection of n areas, no algorithm is known that

solves the problem in polynomial computer time. Heuristics are used to solve such

problems. Duczmal uses the Simulated Annealing (SA) and Genetic Algorithm (GA) heuristics in his research [31, 69]. An important aspect of these methods is that they provide enormous freedom of analysis of shape and scale. The analysis scale and shape

vary across a multitude of combinations. Thus instead of asking the question "Is there a

cluster at a given scale of the following shape?" these methods demand "Find clusters

of any shape at any scale." This makes these methods immensely powerful. But this

strength also brings about a weakness. If spurious clusters are created from a mismatch

between the process and analysis scale and shapes, and if a large number of scales and

shapes are evaluated by this analysis method, then it follows that noisy clusters will

almost always be detected by these methods alongside genuine or true clusters. At the

end of this section we shall see an example of this. The next section discusses some of

the modifications that researchers have proposed to these methods. These modifications

offer better power of detecting clusters.

1.4.2.3 Hybrid Approaches

These approaches combine some of the strategies of the non-combinatorial

approaches with a combinatorial search. Some examples are the approaches proposed by

Patil and Taillie [70], Tango [71] and Yiannakoulias [36]. Tango proposed that the search begin with a circular cluster as a "seed", but then regions adjacent to the circular cluster be coalesced with it and the resulting hybrid be tested as a possible cluster. With every

level of adjacency enumerated the problem becomes computationally more complex, and therefore in his example Tango suggested that three levels of adjacency be tested. Patil and Taillie's [70] approach is limited to restricting the search space to areas with the highest rates, which Patil and Taillie call the "upper level sets". These methods provide

interesting extensions to the combinatorial shape-free methods of cluster search.

We are now in a position to summarize the various methods discussed. All the

methods outlined above have one singular goal: To extract the underlying pattern of

significant excess risk. Some methods are good at mapping the entire pattern [9], while others are good at testing for significant excess risk [3]. In the next section, I discuss how problems with significance testing can introduce spurious clusters.

1.4.3 Significance Testing and Spurious Clusters

In general, all methods at some point address the following question: "Of all the

candidate clusters in the pattern of risk (whether mapped or not), what clusters are true clusters?" Each candidate cluster has a specific risk elevation, a size, and a shape.

Traditionally most "cluster detection" techniques have used some function of the risk

elevation or rate of a given area to decide if the area is a true cluster. The question that is

asked is "How likely are we to observe this risk elevation or rate in this area if the

underlying process is noise?" If the probability is small then the area is considered a cluster.

The distribution of risks/rates under the process of noise is also known as the reference

distribution. Traditionally, the reference distribution is normatively chosen. Some

choices are the normal distribution [2, 50], the chi-squared distribution [2, 50], the Poisson distribution [3] and the Gumbel distribution [43]. However, using such distributions is problematic. If the populations are small, the normal distribution cannot

be used. It is often hard to distinguish a lack of fit to the chi-squared distribution from

a genuine deviation from the chi-squared distribution (indicating clustering) [4]. A more robust alternative is to use a Monte Carlo simulation approach to

empirically determine the reference distribution. Methodologically this may be achieved

by simulating a series of maps, in each of which noise is the underlying process. Multiple

Monte-Carlo simulations of the data are used to mimic the noise process. If the observed

risk elevation (or some function of the risk value such as the rate) for the area is significantly different from the ones in the simulated maps, then the area is considered to

be a cluster. However Monte Carlo simulations do not guarantee that spurious clusters

will not be detected. Steenberghen et al. [72] carried out an experiment that illustrates this problem. This is displayed in Fig 1.1. Fig 1.1 is a map in which simulated locations

of traffic accidents (points) were randomly scattered [72], filtered using 600 meter filters,

the density of points estimated, the resulting clusters tested for significance and the level

of significance was displayed (also known as a p-map). If areas which show 0.025 % significance are called clusters, the black shapes in Figure 1.1 are spurious clusters.
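The appearance of significant areas in pure noise can be reproduced with a small simulation. The sketch below does not replicate the experiment of Steenberghen et al.; it assumes independent filter locations, a common population of 200 and a constant risk of 0.05, all of which are hypothetical. Each location is tested with an exact binomial upper-tail test, with and without a Bonferroni adjustment.

```python
import math
import random

def upper_tail_p(x, n, p):
    """Exact P(X >= x) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))

def count_false_positives(n_locations, population, risk, alpha, seed=0):
    """Test every location under pure noise and count the significant ones.

    Returns (unadjusted, bonferroni): the number of locations flagged at
    level alpha and at level alpha / n_locations respectively.
    """
    rng = random.Random(seed)
    unadjusted = bonferroni = 0
    for _ in range(n_locations):
        # Noise process: every person has the same risk everywhere.
        cases = sum(1 for _ in range(population) if rng.random() < risk)
        p_value = upper_tail_p(cases, population, risk)
        unadjusted += p_value <= alpha
        bonferroni += p_value <= alpha / n_locations
    return unadjusted, bonferroni

unadjusted, bonferroni = count_false_positives(1000, 200, 0.05, alpha=0.05)
print("significant without adjustment:", unadjusted)
print("significant with Bonferroni:   ", bonferroni)
```

Every location flagged in this simulation is a spurious cluster by construction, since the risk is constant. The adjustment reduces the number of such locations, but at the cost of the conservatism discussed earlier.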

Some methods attempt to tackle this problem with a combination of both Monte

Carlo and normative statistical techniques. Examples are Duczmal's and Kulldorff's

methods. Duczmal's method [3, 31, 43, 69, 73] (which derives from Kulldorff's method) generates a large number of irregular cluster candidates. For each candidate the rate is

calculated. The rate is then fed into a function known as a likelihood function to yield a

likelihood value of the cluster candidate being a true cluster. This value is divided by

the likelihood of the cluster candidate not being a true cluster. This ratio is known as the

likelihood ratio. The likelihood ratios for all cluster candidates are calculated. The

cluster candidates with the highest ratios are the most likely clusters. Multiple Monte

Carlo simulations are carried out, and the rates at all the candidate clusters calculated.

Again, the rates are fed into the likelihood function, thus generating a reference

distribution of likelihood ratios for each cluster candidate. The likelihood ratio value of

the cluster candidate is compared with the reference distribution to decide if the cluster

candidate is a true cluster. However when Duczmal applied this approach to some of his

data, problems with this approach were dramatically exposed. In one of his studies

Duczmal [31] simulated breast cancer cases and randomly distributed them over 245 counties in New England (Fig 1.2). When he instructed his Simulated Annealing (SA) SaTScan based irregular cluster search algorithm to search for clusters, one of the clusters

that it found was a large and extremely irregular cluster encompassing 122 counties, and

enclosing a large percentage of the randomly scattered cases. This cluster is an example

of a noisy cluster. The noise generating process (random distribution of cases) operated at the scale of 245 counties (aggregated). The shape of the area at which this process operated is the shape of the New England region that we see in Fig 1.2. At this scale and

shape, the process generates noise. However, if this process is studied at the scale of an

aggregation of 122 counties and at the shape that follows the darker (orange if your copy of this document is in color) shaded counties in Figure 1.2, then, a noisy or spurious cluster is generated. It is known that the process that generated this cluster is noise.

This example thus illustrates a situation where spurious clusters are created from a

mismatch between the scale and shape of the process that generates the cluster and the

scale and the shape imposed by the method of analysis. Duczmal [31] noted that this noisy cluster was large in size and extremely irregular in shape. Duczmal [73] suggests that large and irregular clusters like the one found in his study (above) are likely to be spurious. He and some other researchers [36] therefore, incorporate a penalty for irregularity of shape in this cluster search algorithm. The extent of this penalty is decided

on a priori knowledge of the shape of the cluster. Therefore, if researchers believe that

the clusters in an area are likely to be circular, they place a high penalty on clusters that

are not circular in shape and vice versa. The spurious cluster detected by Duczmal's

method and the proposed solution raise some important questions. Is this spurious

cluster, large and irregular with a high risk/rate elevation, an artifact of his particular

method, or is it possible that if a cluster detection method is given freedom of shape and

size then such clusters are likely to be detected? We note that the shape and size of the

spurious clusters in Fig 1.1 are different from the shape and size of Duczmal's spurious

cluster. Thus not all spurious clusters are large and irregular.

Duczmal's problem has reintroduced the otherwise rarely discussed issue of shape

and size in the disease cluster detection literature [69, 74, 75]. Risk elevation is just one possible characteristic of a cluster. McCullagh [76] states: "In map analysis, features of prime importance may be size, shape, orientation and spacing". It is possible for clusters

of different shapes and sizes to have the same risk elevation. It is also possible for

clusters of the same shape and size to have different risk elevations. The first objective of any cluster search should therefore be to distinguish spurious or noisy clusters from

everything else. The risk or rate value of a possible cluster alone is not sufficient to make

this distinction. The shape and size of the cluster must also be factored in when

considering if a cluster is a true cluster. Duczmal proposes a solution that makes certain a

priori assumptions about the shape and size of a cluster. This solution is interesting.

However, the problem of spurious clusters may be approached from a different angle.

Instead of asking the question "What is the shape of a true cluster?", which is what these

methods do and which is hard, if not impossible, to answer, the

question that should be asked is "What is the shape of a spurious cluster?". Unlike the

first question, this one is easier to answer, because the shape of a spurious cluster,

unlike that of a true cluster, can be mined a posteriori from the data. To know how this can be

done, we first need to understand how spurious clusters are generated.

Thus, in the chapter that follows, I discuss in depth the phenomenon of noise and the

creation of spurious clusters.

1.4.4 Identifying spurious clusters and distinguishing true clusters from spurious clusters

Spurious clusters enclose noise. Across disciplines, noise is defined as "... a

random and unpredictable signal" [77]. By this definition, if the nature of the signal is known, then noise can be detected and filtered out. For example, in a satellite image it

may be known that certain frequencies are the signal frequencies and therefore a spectral

analysis and subsequent filtering may help remove the undesirable noise. In a satellite

image the signal has a physical existence. For example, infrared radiation emitted by

vegetation can be measured with certain instruments. In contrast, in mapping disease the

signal cannot be physically measured. The signal is conceptual and has to be estimated

from the available data. Some geographers and statisticians attempt to tackle the problem

by developing statistical models that attempt to separate signal from noise [21, 23, 78-80]. Perhaps a better approach to understanding signal and noise in a disease map is to understand the physical process that gives rise to the signal (as in a satellite signal). It is known that in a disease map, the observed patterns are the result of underlying processes.

The observed patterns are patterns obtained from mapping statistical summaries of

disease outcomes. For example, a map of patterns of cholera mortality in England could

be displaying the number of cholera deaths per unit population in each county. The

outcome in this case is cholera mortality which is the outcome of a disease process. Since

cholera is a communicable disease it is possible that the spread of cholera can be modeled

as a contact network process [81]. There exist many other spatially explicit disease processes². For example, patterns of disease could be the result of processes that reflect

an underlying lack of access to healthcare [10, 56, 82-84]. Whatever the specific process may be, these processes have a common trait in having a spatial form [85], and this means that they predispose some areas of the map to have a greater risk than others.

It is also possible that the underlying process does not cause any region of the

map to have a greater risk than any other. Since a disease case may appear at any point on

the map by random chance, by the earlier definition of noise, this is a noise generating

process. A cluster defined by enclosing some of these disease cases is a spurious cluster.

On any given map, disease patterns can be the result of one or more processes. They could be

the result of one process that generates clusters and another process that generates noise.

The challenge, therefore, is to distinguish the areas of a pattern that are the result of a

cluster generating process from those that are not. Also, given a disease process that

generates patterns on a map, a number of other factors also influence the patterns we

actually observe. Given a cluster generating process, the following factors influence the

pattern that is then extracted:

1. The spatial distribution of the locations of people in the map.

2. The shape and size of the geographic units that are used to aggregate individuals

into discrete small areas.

3. The shape and size of the spatial configuration that the disease mapping or cluster

detection method may impose on the data (in addition to 2).

Understanding these factors is essential to understanding noise and spurious

clusters. I discuss them next.

2 It is important to distinguish between a spatially explicit disease process and a

spatial disease process. Some scientists attempt to model diseases as purely spatial processes. Examples of this can be seen in the cellular automata based disease modeling literature. No disease process is purely spatial and therefore such models are misleading.

1.4.4.1 The spatial distribution of the locations of people in the map

A cluster generating process causes an area of the map to have a greater risk than

other areas of the map. Cluster detection methods seek to estimate the shape, size and risk

elevation of the area of increased risk using the locations of people as proxy sample sites.

A representative spatial sample of the area of risk would be a uniform grid [86]. People are never distributed uniformly over space; instead, a likely spatial distribution consists

of dense settlements interspersed with sparsely populated areas. This creates a challenge

in estimating the true shape of the cluster. As I illustrate in Figures 1.3 to 1.11, a

cluster that in reality has a uniform shape may be estimated as having a highly irregular

shape because of the way people are distributed over space [75]. The shape of the actual area of increased risk, or true cluster, created by the cluster generating process also

influences the shape of the cluster that is finally estimated. If the shape of the true cluster

is highly irregular, it is quite likely that the shape of the cluster that is estimated is also

highly irregular, but the converse may also be true. This is illustrated in Figures 1.12 to

1.14. Another phenomenon long observed by geographers is that the same risk process

may give rise to differently shaped clusters in different areas of the map or, in more

general terms, the same cluster generating process may give rise to different patterns

[87]. While the shape of the original area of increased risk, or true cluster, may be the same in two areas and the spatial distribution of the people may be the same, it does not

follow that the pattern of people who are diseased (and who are not) will be the same in both areas. This means that the shape of the estimated area of increased risk need not be

the same in both areas. This is further complicated by the fact that people are almost

never distributed similarly over space in two different regions (Figures 1.15 to 1.20).

First, for the purposes of understanding this issue, let us assume the highly

improbable situation that people are uniformly distributed over space. Let the distribution

be over a uniform grid. Figure 1.3 illustrates the situation. Next, let us consider that out

of the 42 people in the region, 10 are afflicted by some disease. However, we assume that

the process that causes disease is a noise generating process. Therefore, we expect

diseased people (or cases) to be randomly distributed over the region among the 42 people, as shown in Figure 1.4. A convex hull boundary of these cases is seen in Figure 1.5. In

contrast, if there is a cluster generating process, we would expect the diseased people to

be clustered together. Figure 1.6 illustrates such a situation. People enclosed within a

dotted area of increased risk are diseased, the risk being 0.24 (the risk in other areas being 0). We observe in Figure 1.6 one realization of the risk process, so 10 people are diseased. Figure 1.7 displays the convex hull boundary of this cluster of diseased

people. The smooth and regular shape of this cluster is in sharp contrast to the irregular

cluster shape that we observe in Figure 1.5. Since it is highly unlikely that people will be

uniformly distributed over space, Figure 1.8 illustrates the more realistic possibility of

people being non-uniformly distributed over space. If the entire geographic area in Figure

1.8 is subject to a risk, we expect some people to become diseased (again, one realization of the process). Figure 1.9 illustrates this and the boundary that demarcates the cluster. The shape of the cluster is very different from what was obtained in Figure 1.5. An

increased area of risk on such a heterogeneously distributed population gives rise to

clusters of unpredictable shapes (Figures 1.10 and 1.11). These examples show how the spatial distribution of the people affects the shape and size of the risk surface detected.
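
The uniform-grid thought experiment above (42 people, 10 cases) can be reproduced with a short sketch. The 7 x 6 grid layout, the cluster centre and the simplified rule that a cluster generating process affects the 10 people nearest that centre are assumptions made for illustration; the figures in the text are not reconstructed here.

```python
import random

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def hull_area(points):
    """Area enclosed by the convex hull of the points (0 if they are collinear)."""
    hull = convex_hull(points)
    n = len(hull)
    if n < 3:
        return 0.0
    twice_area = sum(hull[i][0] * hull[(i + 1) % n][1]
                     - hull[(i + 1) % n][0] * hull[i][1] for i in range(n))
    return abs(twice_area) / 2.0

# 42 people on a uniform 7 x 6 grid (the layout is an assumption).
people = [(x, y) for x in range(7) for y in range(6)]

def noise_cases(rng, n_cases=10):
    """Noise generating process: cases fall on people purely at random."""
    return rng.sample(people, n_cases)

def clustered_cases(center=(1.0, 1.0), n_cases=10):
    """Simplified cluster generating process: the people nearest a centre are cases."""
    by_distance = sorted(people, key=lambda p: (p[0] - center[0]) ** 2
                                               + (p[1] - center[1]) ** 2)
    return by_distance[:n_cases]

rng = random.Random(0)
noise_areas = [hull_area(noise_cases(rng)) for _ in range(1000)]
mean_noise_area = sum(noise_areas) / len(noise_areas)
cluster_area = hull_area(clustered_cases())
print(mean_noise_area, cluster_area)
```

Under these assumptions the convex hull of the randomly placed cases covers, on average, a much larger area than the hull of the clustered cases. This mirrors the contrast drawn between Figures 1.5 and 1.7, but it holds only because the population here is uniform.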

From these examples it may seem that for a given distribution of people over

space, a cluster generating process gives rise to patterns on a map that are regular

compared to the shapes generated by a noise generating process. Indeed, some scientists

use measures of regularity of a cluster's shape to distinguish a true cluster from a

spurious cluster [73]. Also, people are never distributed uniformly over geographic space. Next, we see how this affects the shape and size of the cluster detected.

In the example I have discussed, I assumed that the cluster generating process gives rise to

a very regularly shaped area of increased risk (the area within the dotted line). In reality this may not be true. The area of increased risk may have a very irregular shape. Some

examples of geographic features that can be areas of increased risk are rivers, roads,

underground groundwater streams, plumes of aerial pollution or a combination of some

of these. We therefore observe that the shape and size of a cluster cannot be predicted

a priori and is unique to the risk elevation of the cluster generating process and the spatial

distribution of the people. Another aspect of a cluster generating process is that the same

process can give rise to differently shaped clusters in different regions of the map. This can

happen even if people are uniformly distributed. The examples below illustrate this:

From the discussion and the examples, we can conclude that both the spatial

distribution of people and the shape and size of the area of increased risk have an

important bearing on the shape and size of the cluster that is finally detected. The area of

increased risk, or the true cluster, may have a very different spatial configuration from

the cluster that is detected. Parts of the true cluster may be suppressed, or spurious areas

of increased risk may arise. Spurious clusters are created by the method used to

measure the outcome of the process of clustering. By definition, the method uses a scale

and (or) shape of measurement that is dependent on the spatial distribution of people. Since this distribution is not representative of the underlying area of increased risk, there

is a mismatch between the measurement shape/scale and the process shape/scale. While

the above examples are with individual level data, the conclusions drawn can be

generalized to aggregated data. The act of data aggregation itself could introduce noise

over and above the problem of heterogeneously distributed people. This is discussed in

the next section.

1.4.4.2 The scale and spatial configuration of the geographic units that are used to aggregate data into discrete small areas

In the geography literature the term "scale" is used to refer to three different kinds

of scales, two of which are of relevance here. The first is the phenomenon scale, or the

scale at which a spatial process operates. The second is the analysis scale, the scale at

which data are aggregated for measurement and analysis [88]. When a phenomenon such as a disease operates at a given scale, its outcome is often registered as heterogeneity in

disease rates at that scale [89]. Geographers have often attempted to find the scale at which a process operates [90]. Two well known methods are spectral analysis [65] and variogram modeling [91]. The latter approach is often used in the health geography literature. Studies in China have shown that esophageal and liver cancers

operate at scales of less than 150 km, while stomach cancers operate at scales of less than

90 km [91]. In Sweden, substance related disorders operate at scales of less than 3 km [92]. Unfortunately, the scale at which a given process operates is not known in most

geographic studies. A geographer attempts to study a process by collecting and analyzing

spatial data. This process involves analysis through the calculation of statistical

summaries of data aggregated at an appropriate scale. When the process scale is not

known there is every possibility of a mismatch between the process scale and the analysis

scale. This mismatch or misalignment arises from two sources. First, geographic data are

often aggregated into discrete units for purposes different from the analyses for

which they are being used. These units of aggregation could differ in shape and scale

from the process scale and shape. As Haining [93] states in "Conceptual models of spatial variation": "... This might be referred to as process-induced spatial heterogeneity. This source of heterogeneity may be compounded in the case of regional data by measuring

attributes through spatial units of different size. This might be referred to as

measurement-induced heterogeneity because it is a product of how attributes are

observed and measured." A second source of mismatch is the spatial structures that a

disease mapping or cluster detection method imposes on the data. For example, spatial

filtering [9, 10] and Spatial Scan Statistic based methods calculate summary statistics by aggregating data using circular filters. In the geography literature the problems that

arise from spatial mismatch are grouped under the MAUP, or Modifiable Areal Unit

Problem [91, 94]. MAUP phenomena fall into two broad subgroups: the zone effect and the scale effect. The creation of spurious heterogeneity or destruction

of true heterogeneity with changing scales is a manifestation of the scale effect. If the

scale is kept fixed but the shape of the zones of aggregation is changed, then the zone

effect is likely to be seen. Geographic data aggregated to administrative units often

display both the zone and scale effects of MAUP. Aggregating data has a smoothing

effect on disease rates [95], and therefore clusters at scales smaller than the scale of aggregation could be missed when analyses are done using these data. Conversely, if the

scale of aggregation is smaller than the process scale, then noisy clusters could be

detected. A recent study by Ozonoff et al. [19] demonstrated that when individual level data are aggregated and a Spatial Scan Statistic cluster search method is used on the data,

noise increases with increasing levels of aggregation. Therefore, analysis and

process scales interact in complex ways to create noisy clusters and suppress true clusters.
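
The smoothing side of the scale effect described above can be illustrated with a small deterministic sketch. It assumes a hypothetical 12 x 12 map with a 1 x 1 area of elevated risk and people placed on a regular lattice, and it works with expected rates rather than sampled cases so that the effect of aggregation is isolated from sampling noise. All names and parameter values are illustrative.

```python
def risk_at(x, y, base=0.05, elevated=0.5):
    """Hypothetical risk surface: a 1 x 1 area of elevated risk inside a 12 x 12 map."""
    return elevated if (5 <= x < 6 and 5 <= y < 6) else base

# People on a regular lattice, four per unit length (an assumption for clarity).
people = [(0.125 + 0.25 * i, 0.125 + 0.25 * j) for i in range(48) for j in range(48)]

def expected_cell_rates(cell_size):
    """Expected disease rate in each square aggregation unit of the given side length."""
    totals = {}
    for x, y in people:
        key = (int(x // cell_size), int(y // cell_size))
        risk_sum, count = totals.get(key, (0.0, 0))
        totals[key] = (risk_sum + risk_at(x, y), count + 1)
    return {key: risk_sum / count for key, (risk_sum, count) in totals.items()}

# Highest expected rate on the map at each analysis scale.
peak_rate = {size: max(expected_cell_rates(size).values()) for size in (1, 3, 6, 12)}
for size, rate in peak_rate.items():
    print(size, round(rate, 4))
```

When the aggregation unit matches the process scale (side 1), the peak expected rate is 0.5. With units of side 3, 6 and 12 it falls to 0.1, 0.0625 and about 0.053, so the cluster is progressively smoothed into the background rate of 0.05. The sketch does not show the converse case, in which units smaller than the process scale produce noisy clusters.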

We can conclude from our discussions above that a number of complex factors

influence the shape, size and the risk elevation of the clusters that are detected and the

spurious clusters created. These factors are dependent on the spatial distribution of the

people and the process and analysis scales. It is not possible to make a priori assumptions

about these factors, and it is certainly not possible to predict the shape of a noisy cluster a

priori. What approach is then appropriate if the spurious clusters have to be separated

from the true clusters? The section that follows answers this question.

1.4.5 Identifying the "noisy" or spurious components of the pattern

A reasonable cluster detection technique should take into consideration not only

the risk elevation but also the shape and size of the cluster. I propose a spatially enabled

computational process that uses these attributes of a cluster to identify the signature of

spurious clusters from patterns on a disease map. Earlier, I introduced the idea that a

pattern is the outcome of a process. Analyzing a pattern or the components of a pattern

such as individual clusters may yield clues about the underlying process. A map of

disease patterns represents one realization of the underlying process. It may not be

possible to draw conclusions on the process that generated the pattern or components of

the pattern by analyzing just one map. However, if multiple maps were available, representing multiple realizations of the process, then analyzing the patterns may yield

clues about the underlying process. A well known example of this approach can be found in

Hägerstrand's classic paper [96], in which he simulates multiple maps assuming an underlying process. He then compares maps of empirical data with the maps that he has

simulated to draw conclusions about the validity with which he represents the process in

his model. Another example can be seen in Diggle [97]. Therefore, if maps were

created using a known process, then analysis of the simulated patterns on the maps would

yield clues on the "signature" of that particular process. Once this "signature" is known,

then the pattern could imply (or not imply) the existence of this process. More specifically, this scheme can help identify a "signature" for spurious clusters. These

signatures can then be used to distinguish clusters that are spurious from clusters that are

"true" in any given pattern of disease risk. Shape, size and risk elevation are part of this

"signature". For example, the signature of spurious clusters in Duczmal's [73] method was that these clusters were large in size and had irregular shapes. The next chapter is

devoted to the method I have developed based on these ideas. The method is first

described, then tested and validated on simulated data.
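
The general idea of simulating many maps under a known noise process and recording the "signature" of the clusters that appear can be sketched as follows. This is a simplified illustration and not the method developed in the next chapter. It assumes circular windows centred on people, so only size (radius) and rate enter the signature and shape is omitted. The population layout, radii, overall risk and all function names are hypothetical.

```python
import random

def build_windows(people, radii, min_pop=3):
    """Candidate circular windows, stored as (radius, indices of people inside)."""
    windows = []
    for cx, cy in people:
        for r in radii:
            members = [i for i, (x, y) in enumerate(people)
                       if (x - cx) ** 2 + (y - cy) ** 2 <= r * r]
            if len(members) >= min_pop:
                windows.append((r, members))
    return windows

def max_rate_by_radius(is_case, windows):
    """Highest window rate found at each radius on one map."""
    best = {}
    for r, members in windows:
        rate = sum(is_case[i] for i in members) / len(members)
        best[r] = max(best.get(r, 0.0), rate)
    return best

def null_signature(people, radii, overall_risk, n_sims, rng):
    """Monte Carlo distribution of the maximum rate per radius under pure noise."""
    windows = build_windows(people, radii)
    signature = {r: [] for r in radii}
    for _ in range(n_sims):
        # Noise generating process: every person has the same risk.
        is_case = [rng.random() < overall_risk for _ in people]
        for r, rate in max_rate_by_radius(is_case, windows).items():
            signature[r].append(rate)
    return signature

def spurious_p_value(signature, radius, observed_rate):
    """Share of noise-only maps whose best window at this radius matches or beats the rate."""
    sims = signature[radius]
    return (1 + sum(rate >= observed_rate for rate in sims)) / (1 + len(sims))

rng = random.Random(42)
# Non-uniform population: one dense settlement plus a sparse background (assumed).
people = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
          + [(rng.uniform(-4, 4), rng.uniform(-4, 4)) for _ in range(40)])
signature = null_signature(people, radii=(0.5, 1.0, 2.0), overall_risk=0.2,
                           n_sims=200, rng=rng)
print(spurious_p_value(signature, 2.0, 0.45))
```

An observed cluster whose size and rate are commonly reached on noise-only maps falls inside the signature of spurious clusters. One whose combination is rarely reached under noise is a better candidate for a true cluster.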

1.4.6 Why use size, shape and rate

The reason I add the dimensions of size and shape, in addition to rate, is to

characterize the reference space in which spurious clusters are located. I know from

theory (as discussed in this chapter) that spurious clusters arise differently to the extent that the numbers of people at risk, in relation to the overall relative risk of the disease,

differ across the space. When people are distributed uniformly in space, the average

number and average size of spurious clusters in that space can be determined from

theory. As Schinazi [98] shows, deterministic statistics can be used to determine the chance of finding a given number of clusters with a rate higher or lower than the expected

rate. However, when people at risk are distributed non-uniformly in space, the equivalent

number is more difficult to determine directly from theory. The same theory still applies;

it is just more difficult to implement in the case of non-uniform distribution of people at risk. For this reason, I use Monte Carlo simulation to discover the rate, size, shape space

in which typical spurious clusters lie, given the particular distribution of people at risk

and the particular overall relative risk of the disease in the study area in question. In his

seminal paper King [85] states The mathematics of stochast
