
Interpolation of data collected in polygons

Konstantin Krivoruchko1,2, Alexander Gribov1, Eric Krause1 1Environmental Systems Research Institute,

380 New York St, Redlands, CA, USA, 92373 [email protected], [email protected], [email protected]

2Corresponding author

Abstract

We describe a multivariate version of areal interpolation in which both the primary and the secondary cokriging variables can be binomial, negative binomial, or Gaussian. For all data types, we describe model fitting and model diagnostics. Areal kriging can be used as an alternative to traditional choropleth maps because it better represents the data variability and the “hot spots” that are difficult to recognize when the raw data are displayed. Gaussian areal interpolation is the basis for understanding models for count data, and the Gaussian model can be useful for interpolating air and soil contamination data collected in populated places when the exact sample locations are unknown. Poisson kriging of count data collected in points and polygons was recently discussed in the geostatistical literature. To account for overdispersion in the Poisson process, we extend this model to the negative binomial distribution. Binomial kriging is helpful for interpolation and reaggregation of epidemiological and social data. We show examples of Gaussian, overdispersed Poisson, and binomial areal kriging and cokriging using environmental and epidemiological data. All case studies were prepared using the beta version of ArcGIS Geostatistical Analyst 10.1. The mathematical details of the models described in the main text are provided in the appendix.

Key Words: interpolation, counts, binomial, overdispersed Poisson, cokriging

1. Areal Interpolation in GIS Literature

Areal interpolation is the process of estimating values in polygons whose size and shape differ from those of an original set of polygons located in the same data domain. In GIS, areal interpolation is often associated with estimating the number of people in the required polygons. The most popular areal interpolation methods are areal weighting (Goodchild and Lam, 1980; Flowerdew and Green, 1992), pycnophylactic interpolation (Tobler, 1979), and dasymetric mapping (Eicher and Brewer, 2001; Mennis and Hultgren, 2006).

The areal weighting method assumes that the variable of interest is uniformly distributed within the input polygons. The estimated values in the target polygons are then proportional to the areas of intersection with the input polygons.

Pycnophylactic interpolation assumes that the data are described by a smooth density function. The value of each overlapping grid cell is first set to the value $Z_i$ of the input polygon $i$ divided by the number of cells in the polygon. Then each value is smoothed using the values in the neighboring cells. Finally, the cell values are iteratively updated to preserve the mass balance in the polygons, $\iint \varphi(x,y)\,dx\,dy = Z_i$; that is, the sum of the predictions in the subareas is equal to the area total.

Dasymetric mapping uses remote sensing as ancillary data to distribute the value $Z_i$ of the input polygon among grid cells, assuming that the variable of interest and the raster cell values (such as a land use classifier derived from a satellite image) are strongly related. The cell values are then calculated so that their sum is equal to the value $Z_i$ in the polygon. For example, if the ancillary data are binary (that is, they are indicator values equal to 1 for residential areas and 0 for non-residential areas), the population of each cell that belongs to a residential area is equal to the value $Z_i$ divided by the sum of the indicator values in the polygon $i$. Dasymetric mapping was developed and named “dasymetric” by the Russian cartographer Semenov-Tyan-Shansky in the 1920s (see Preobrazenski, 1954).

Areal weighting is criticized in the literature because its assumption of data uniformity is almost always unrealistic. The risk of having a particular disease or of being a crime victim usually changes smoothly across administrative boundaries because these boundaries are defined irrespective of the spatial variation of environmental variables and social factors. The main problems with pycnophylactic interpolation are its inability to provide the uncertainty of the predictions and its inability to use covariates. Dasymetric mapping gives good results for disaggregation of the population, provided that the ancillary data are accurate, but it is very unlikely that this method can be successfully used with other data, such as social, epidemiological, or crime data, because suitable ancillary data are unlikely to be available (for example, we cannot derive the local lung cancer risk distribution from a remote sensing image alone).
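The two simplest reallocation rules described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the matrix of intersection areas, and the binary residential mask are hypothetical inputs, and the target polygons are assumed to cover each source polygon completely.

```python
import numpy as np

def areal_weighting(source_values, overlap_areas):
    """Reallocate source polygon totals to target polygons.

    overlap_areas[i, j] is the area of the intersection of source
    polygon i with target polygon j (hypothetical input). The variable
    is assumed uniform inside each source polygon.
    """
    overlap = np.asarray(overlap_areas, dtype=float)
    source_area = overlap.sum(axis=1)        # assumes targets cover each source
    share = overlap / source_area[:, None]   # fraction of source i falling in target j
    return np.asarray(source_values, dtype=float) @ share

def binary_dasymetric(total, residential_mask):
    """Spread a polygon total equally over cells flagged as residential."""
    mask = np.asarray(residential_mask, dtype=float)
    n_residential = mask.sum()
    if n_residential == 0:
        raise ValueError("polygon contains no residential cells")
    return total * mask / n_residential

targets = areal_weighting([100.0, 50.0], [[2.0, 2.0], [0.0, 4.0]])
cells = binary_dasymetric(90.0, [[1, 0], [1, 1]])
```

Both rules preserve the polygon totals by construction, which is the mass balance property that pycnophylactic interpolation also enforces.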

2. Gaussian Areal Kriging

Statistical data averaging has been an important task of geostatistical theory from the very beginning because accurate estimation of average values is required in both meteorological and geological applications (Gandin and Kagan, 1962; Matheron, 1968). The region $A$ over which the variable $Z$ is averaged is called the support of $Z(A)$ in the geostatistical literature, and the statistical model for averaging is called block kriging. Changing the support of a variable creates a new variable with different statistical properties. A review of the change of support problem and a detailed discussion of areal Gaussian kriging can be found in Gotway and Young, 2002 and 2004. In practice, all measurements are collected in some, usually small, volumes $v$. Hence, the measurement values $Z(v)$ assigned to the points $s$ are averages:

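$$Z(v) = \frac{\int_v w(s)\,Z(s)\,ds}{\int_v w(s)\,ds},$$

where $w(s)$ is the data density over $v$. Similarly, calculation of the prediction variance in the volume $v$, $\sigma^2(v)$, requires knowledge of the point data covariance:

$$\sigma^2(v) = \frac{\iint_{v,v} w(s)\,w(s')\,\operatorname{cov}\left(Z(s),Z(s')\right)\,ds\,ds'}{\left(\int_v w(s)\,ds\right)^2},$$

where $\operatorname{cov}(Z(s),Z(s'))$ is the covariance between measurements made in points $s$ and $s'$.

Spatial correlations between data observed in polygons $A_i$ and $A_j$, and in point $s$ and polygon $A_i$, are calculated using the covariances

$$\operatorname{cov}\left(Z(A_i),Z(A_j)\right) = \frac{\iint_{A_i,A_j} w(s)\,w(s')\,\operatorname{cov}\left(Z(s),Z(s')\right)\,ds\,ds'}{\int_{A_i} w(s)\,ds \int_{A_j} w(s')\,ds'}$$

and

$$\operatorname{cov}\left(Z(s),Z(A_i)\right) = \frac{\int_{A_i} w(s')\,\operatorname{cov}\left(Z(s),Z(s')\right)\,ds'}{\int_{A_i} w(s')\,ds'}.$$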

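From the formulas above, we see that calculation of the prediction and prediction standard error values in the polygons requires information about the point data density and the point covariance. In practice, values of $Z(A_i)$, $\sigma^2(A_i)$, $\operatorname{cov}(Z(A_i),Z(A_j))$, and $\operatorname{cov}(Z(s),Z(A_i))$ are computed using the following approximations:

$$\hat{Z}(A_i) = \frac{\sum_{i=1}^{N} w(s_i)\,Z(s_i)}{\sum_{i=1}^{N} w(s_i)},$$

$$\hat{\sigma}^2(A_i) = \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} w(s_i)\,w(s_j)\,\operatorname{cov}\left(Z(s_i),Z(s_j)\right)}{\left(\sum_{i=1}^{N} w(s_i)\right)^2},$$

$$\widehat{\operatorname{cov}}\left(Z(A_i),Z(A_j)\right) = \frac{\sum_{i=1}^{N}\sum_{j=1}^{M} w(s_i)\,w(s_j)\,\operatorname{cov}\left(Z(s_i),Z(s_j)\right)}{\sum_{i=1}^{N} w(s_i)\,\sum_{j=1}^{M} w(s_j)},$$

$$\widehat{\operatorname{cov}}\left(Z(s),Z(A_i)\right) = \frac{\sum_{i=1}^{N} w(s_i)\,\operatorname{cov}\left(Z(s),Z(s_i)\right)}{\sum_{i=1}^{N} w(s_i)},$$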

where $N$ and $M$ are the numbers of points in the polygons used for the integral approximation. If the sample data density is unknown, it is natural to assume that samples are distributed uniformly, so that all weights are equal.

The areal kriging model assumes that both observed polygonal and unobserved point data values come from the same spatial random process, so that the areal value in a polygon is a weighted average of the unobserved point values in that polygon. Geostatistical and statistical textbooks (for example, Journel and Huijbregts, 1978, and Cressie, 1993) provide prediction and prediction standard error formulas for various combinations of measurements made in points and areas.

Areal kriging requires estimation of the within-polygon point covariances, and the first computational problem is how to choose the distribution of points, given that the polygons may have vastly different sizes and that it is important to analyze data in polygons that are overlapping or disjoint. The simplest idea is to use an overlapping grid with a cell size smaller than the smallest polygon in the dataset. This, however, may be very computationally inefficient because much larger polygons will be partitioned into an unnecessarily large number of small blocks. Another obvious idea is to simulate a fixed number of random points inside each polygon. This approach does not work well because the fast Fourier transform algorithm cannot be used with random points. A similar problem arises in other applications, for example, when selecting the best locations of knots in spline and radial basis function interpolation. To our knowledge, this problem has not been solved yet, and some researchers believe that finding the solution is more difficult than developing the algorithm that uses these knots. Therefore, our choice of knots is purely empirical. The default cell size is $\sqrt{\frac{1}{m}\sqrt{\frac{1}{n}\sum_{i=1}^{n} A_i^2}}$, where $A_i$ is the area of polygon $i$, $n$ is the number of polygons, and $m$ is a parameter controlling the average number of points per polygon. If the areas of all polygons are equal, then each polygon will be partitioned into approximately $m$ blocks. If a polygon has fewer than 3 blocks, then it is represented by its centroid. In the case of disjoint polygons, empty areas are not covered by points and, in the case of overlapping polygons, points are shared in the area of the overlap. Since the computation time depends heavily on the number of points, the default partitioning tries to minimize the point density, and this may produce non-optimal partitioning.

It should be noted that applications which require downscaling of averaged Gaussian data from large polygons to small ones or to points are rare because point data are usually available for interpolation and averaging to the required sets of polygons. Exceptions are usually of theoretical interest. For example, Yoo et al. (2010) described the use of Gaussian areal kriging for mapping the population density (so that the expected population in the required region $v$ is $\int_v \mu(s)\,ds$, where $\mu(s)$ is the population density over $v$) and compared it with Tobler’s pycnophylactic interpolation. However, the dasymetric mapping algorithm, which mechanically reallocates population using fine local information derived from remote sensing images, can do the population reallocation job better and faster than kriging.

One interesting application of Gaussian areal kriging is interpolation of environmental data collected somewhere in populated places. For example, the Cs-137 soil contamination data collected after the Chernobyl accident are associated with the names of the cities and villages, and researchers use the centroids of these populated places to locate the measurements on the map. However, the populated places are usually large enough to justify using areal instead of point kriging. Figure 1 illustrates the difference between the simple kriging prediction standard errors with a known exponential covariance model, calculated using a) the data averaged on lines of different lengths, shown at the bottom of the graph, and b) the center points of the same lines. The horizontal axis shows the distance between the prediction location and the center of the lines. The kriging prediction standard error is smaller for areal kriging for all reasonable distances because the part of the line which is closer to the prediction location has a stronger correlation with it than the center point has. This illustration shows that areal kriging should be preferred for polygons of any size, and the standard errors converge to the same value as the areas of the polygons go to zero (that is, when the polygons converge to points).

The searching neighborhood dialog of the Geostatistical Analyst Wizard in figure 2 on the left shows a subset of the populated places with Cs-137 soil contamination values measured in Belarus in 1992 and the preview of Gaussian areal kriging predictions. The semivariogram modeling is shown in the right part of figure 2. The average distance between polygons is calculated using the centers of the cells of the overlapping grid, and the crosses are the empirical semivariogram values averaged within bins of similar distances. The blue line is the estimated point semivariogram model, and the red bars are the 95 percent confidence intervals estimated in two steps:

1) The expected polygonal semivariogram $\gamma(Z(A_i),Z(A_j))$ is calculated using the expression

$$\gamma\left(Z(A_i),Z(A_j)\right) = \frac{1}{2}\left(\sigma^2(A_i) + \sigma^2(A_j)\right) - \operatorname{cov}\left(Z(A_i),Z(A_j)\right).$$

Its value is the center of the confidence interval bar.

2) Assuming that the empirical semivariogram values are normally distributed and uncorrelated, the variance of $\hat{\gamma}(Z(A_i),Z(A_j))$ can be calculated as shown in geostatistical textbooks (see, for example, Cressie, 1993). Although the above-mentioned assumptions are never satisfied, we use that estimated variance to calculate and display the confidence intervals.

Figure 1: Comparison of the prediction standard errors for point (red) and areal (blue) Gaussian kriging

More accurate estimation can be done using simulations, but it is time-consuming and it seems that the approximation described above is a sufficient first diagnostic in practice (the next step in the model verification is, as usual, cross-validation and validation diagnostics). The estimated point semivariogram model in figure 2b does not coincide with the estimated empirical semivariogram values for the polygons. In this case, areal kriging should produce more accurate predictions than point kriging with values assigned to the polygons’ centroids.
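The discretized block covariances of this section, and the point-versus-areal comparison that Figure 1 illustrates, can be sketched as follows. This is a minimal one-dimensional sketch under stated assumptions, not the Geostatistical Analyst implementation: the exponential covariance parameterization, the equal weights, the 50-point discretization, and all function names are illustrative choices.

```python
import numpy as np

def exp_cov(h, sill=1.0, range_=1.0):
    """Exponential point covariance (assumed parameterization)."""
    return sill * np.exp(-np.abs(h) / range_)

def block_cov(points_a, points_b, cov=exp_cov):
    """Equal-weight discretized covariance between two supports.

    Each support is represented by 1-D point coordinates; the double
    sum over point pairs is replaced by the mean of the pairwise
    point covariances.
    """
    a = np.asarray(points_a, dtype=float)[:, None]
    b = np.asarray(points_b, dtype=float)[None, :]
    return cov(a - b).mean()

def sk_standard_error(obs_points, target, cov=exp_cov):
    """Simple kriging standard error at `target` from one observation
    whose support is discretized by `obs_points`."""
    c00 = block_cov(obs_points, obs_points, cov)   # variance of the averaged datum
    c0 = block_cov(obs_points, [target], cov)      # datum-to-target covariance
    return float(np.sqrt(cov(0.0) - c0 ** 2 / c00))

# A line of length 1 centered at 0, discretized by 50 cell midpoints.
line = (np.arange(50) + 0.5) / 50 - 0.5
se_areal = sk_standard_error(line, target=2.0)
se_point = sk_standard_error([0.0], target=2.0)
```

For this configuration the line-averaged datum gives a slightly smaller standard error than the same datum assigned to the line's center, in line with the behavior described for Figure 1.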

Figure 2: a) Populated places with measured Cs-137 soil contamination values (in Ci/sq.km) and preview of the Gaussian areal kriging prediction map in the Geostatistical Analyst Wizard. The Cs-137 values in the highlighted polygons are used for prediction in the center of the black cross. b) Deconvoluted point semivariogram (line) and re-estimated empirical semivariogram values for polygons (crosses) and their 95% confidence intervals (vertical lines)
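The default partitioning rule of this section can be illustrated with a short sketch. It reads the default cell size as the square root of the root-mean-square polygon area divided by $m$, which is an assumption about the exact form of that expression; the function names and the area-based block count are likewise hypothetical.

```python
import math

def default_cell_size(areas, m):
    """Cell side length, assuming cell area = RMS(polygon areas) / m."""
    n = len(areas)
    rms_area = math.sqrt(sum(a * a for a in areas) / n)
    return math.sqrt(rms_area / m)

def represent_by_centroid(area, cell_size):
    """True when the polygon would hold fewer than 3 blocks.

    The block count is approximated here by area / cell area; an exact
    count would depend on the polygon's shape and grid alignment.
    """
    return area / cell_size ** 2 < 3

size = default_cell_size([100.0, 100.0, 100.0], m=4)
```

With equal polygon areas of 100 and $m = 4$, the cell side is 5, so each polygon holds about four blocks, which matches the statement that equal-area polygons are partitioned into approximately $m$ blocks.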

3. Overdispersed Poisson Areal Kriging Model

Monestiez et al. (2006) described a modification of classical kriging for event counts taken in polygons of uniform size with varying observation times. For each measurement location $s$, the model requires the number of counts $Z_s$ and the observation time $t_s$. The model predicts the underlying density $Y(s) = Y_s$ of the phenomenon that is being counted. The model assumes that $Z_s$ has a conditional Poisson distribution with $t_s Y_s$ as the mean parameter: $E[Z_s|Y_s] = \operatorname{Var}[Z_s|Y_s] = t_s Y_s$. This distributional assumption alters the formula for the empirical semivariogram:

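$$\hat{\gamma}_Y(h) = \frac{1}{2\sum_{\alpha}\sum_{\beta}\frac{t_\alpha t_\beta}{t_\alpha + t_\beta}} \sum_{\alpha}\sum_{\beta}\left[\frac{t_\alpha t_\beta}{t_\alpha + t_\beta}\left(\frac{Z_\alpha}{t_\alpha} - \frac{Z_\beta}{t_\beta}\right)^2 - \hat{m}_Y\right], \qquad (1)$$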

where $Z_\alpha$, $\alpha = 1, \dots, N$, are the counts at location $\alpha$ over time $t_\alpha$ and $\hat{m}_Y$ is an estimate of the expected value of $Y_s$.

In this paper, we extend the results to a more general distributional assumption about the relationship between the conditional mean and the conditional variance of the observed counts: $E[Z_s|Y_s] = t_s Y_s$ and $\operatorname{Var}[Z_s|Y_s] = k\,t_s Y_s + l\,(t_s Y_s)^2$, where $k \ge 1$ and $l \ge 0$ are estimable parameters that are constant for all locations. $Y_s$ is equal to the density of the measured phenomenon at location $s$. It is assumed that $Y$ is a second-order stationary positive random field with mean $m_Y$ and variance $\sigma_Y^2$. We assume the conditional independence of $Z_s|Y_s$ for all locations $s$. Letting $k = 1$ and $l = 0$ corresponds to the model described above. Similarly, letting $l = 0$ and $k > 1$ corresponds to the quasi-Poisson distribution, and letting $k = 1$ and $l = \frac{1}{r} > 0$ corresponds to the negative binomial distribution. Other values of $k$ and $l$ allow for more general types of dispersion, but their simultaneous estimation is problematic. The literature suggests that the negative binomial distribution is the most common model for count data that are overdispersed (relative to the Poisson distribution), so all discussion below is based on that distribution. Formulas for the overdispersed Poisson areal kriging are provided in the appendix. Note that, unlike Monestiez et al. (2006), we allow polygons of different sizes, and, therefore, estimates of the density must incorporate the areas of the polygons.

We illustrate the use of overdispersed Poisson areal kriging with data on the incidence of wildlife mortality events (primarily in migratory birds and endangered species) in the United States collected in the U.S. Geological Survey National Wildlife Health Center's EPIZOO database. This is a long-term project that documents over 30 years of information on epidemics in wildlife. We use a subset which contains the counts of wildlife mortality events caused by avian botulism by county in the central northern part of the United States. The data are shown in figure 3 at left. White-colored counties are those where the count values are zero. There are 359 such counties out of 433, and in this case the negative binomial distribution is usually more appropriate than the Poisson distribution. Figure 3 at right shows the estimated stable point covariance model (this model is sometimes called power exponential) and the covariance modeling diagnostic, 90 percent confidence intervals for the re-estimated data density in the counties. The estimated power parameter $\theta_e$ of the covariance model $\theta_s\left(1 - e^{-3\left(\|h\|/\theta_r\right)^{\theta_e}}\right)$ is equal to 0.48 and the estimated overdispersion parameter $l$ is equal to 0.69. This means that the estimated overdispersion is large (according to the literature, the parameter $l$ can be between 0 and 4; see, for example, Hilbe, 2007).
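The stable model and the mean-variance relationship of this section are easy to evaluate numerically. The sketch below is illustrative only: the function and argument names are hypothetical, and the range and sill values in the example are made up, while the power 0.48 and the overdispersion 0.69 are the estimates quoted above.

```python
import numpy as np

def stable_model(h, partial_sill, range_, power):
    """Stable (power exponential) model theta_s * (1 - exp(-3 (|h|/theta_r)^theta_e))."""
    h = np.abs(np.asarray(h, dtype=float))
    return partial_sill * (1.0 - np.exp(-3.0 * (h / range_) ** power))

def conditional_variance(mean_count, k=1.0, l=0.0):
    """Var[Z|Y] = k*mu + l*mu^2 with mu = t*Y.

    k=1, l=0 gives Poisson; l=0, k>1 gives quasi-Poisson;
    k=1, l=1/r gives the negative binomial.
    """
    mu = np.asarray(mean_count, dtype=float)
    return k * mu + l * mu ** 2

at_range = stable_model(10.0, partial_sill=2.0, range_=10.0, power=0.48)
var_poisson = conditional_variance(5.0)
var_negbin = conditional_variance(5.0, k=1.0, l=0.69)
```

For a mean count of 5, the negative binomial variance with $l = 0.69$ is 22.25 against 5 for the Poisson model, which gives a sense of how large that estimated overdispersion is.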

Figure 3: a) The avian botulism wildlife mortality data. b) The estimated stable point covariance and the fitting diagnostic

Figure 4 shows the predicted density of the wildlife mortality (left) and the associated prediction standard error (right). The prediction uncertainty is large in the areas with large data values, as dictated by the assumed data distribution, and on the edges of the data domain where data are not available.

Figure 4: The predicted density (a) and the prediction standard error (b) of the wildlife mortality
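The empirical semivariogram estimator in equation (1) can be sketched for a single lag bin as follows. This is an illustration under stated assumptions rather than the authors' code: the pairs are selected by a distance bin `[lag_min, lag_max)`, each unordered pair is counted once (which leaves the ratio unchanged), and the function and argument names are hypothetical.

```python
import numpy as np

def poisson_semivariogram(z, t, coords, lag_min, lag_max, m_hat):
    """Equation (1) for the pairs whose separation falls in one lag bin.

    z: counts, t: observation times, coords: (n, d) locations,
    m_hat: estimated mean of the density Y.
    """
    z = np.asarray(z, dtype=float)
    t = np.asarray(t, dtype=float)
    coords = np.asarray(coords, dtype=float)
    i, j = np.triu_indices(len(z), k=1)            # all unordered pairs
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    in_bin = (dist >= lag_min) & (dist < lag_max)
    i, j = i[in_bin], j[in_bin]
    if i.size == 0:
        return float("nan")                        # no pairs at this lag
    w = t[i] * t[j] / (t[i] + t[j])                # pair weights
    sq_diff = (z[i] / t[i] - z[j] / t[j]) ** 2     # squared rate differences
    return float(np.sum(w * sq_diff - m_hat) / (2.0 * np.sum(w)))

gamma = poisson_semivariogram(
    z=[6, 2], t=[2, 2], coords=[[0, 0], [1, 0]],
    lag_min=0.5, lag_max=1.5, m_hat=1.0,
)
```

Subtracting `m_hat` removes the part of the squared rate difference that is due to Poisson sampling noise, so the estimate can be negative for noisy bins.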

4. Binomial Areal Kriging Model

Another common source of count data is binomial sampling within polygons. This entails randomly sampling individuals within a polygon and recording the number of individuals with a particular characteristic (such as individuals with lung cancer). In this case, the counts should be scaled by the total number of individuals sampled rather than by the time of observation. One of the most active researchers in epidemiological data interpolation in the last five years, Pierre Goovaerts, refers to Monestiez et al. (2006) and then, without justification, replaces the time of observation in equation (1) with the population in the regions (see, for example, Goovaerts, 2008, Goovaerts, 2010, and Kerry et al., 2010), so that equation (1) becomes

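$$\hat{\gamma}_Y(h) = \frac{1}{2\sum_{\alpha}\sum_{\beta}\frac{n_\alpha n_\beta}{n_\alpha + n_\beta}} \sum_{\alpha}\sum_{\beta}\left[\frac{n_\alpha n_\beta}{n_\alpha + n_\beta}\left(\frac{Z_\alpha}{n_\alpha} - \frac{Z_\beta}{n_\beta}\right)^2 - \hat{m}_Y\right]. \qquad (2)$$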

Formula (2) and the ordinary kriging modification in Goovaerts (2008) indeed correspond to the situation in which the cancer mortality rates are small and the population in the administrative regions is large, that is, to the Poisson approximation to the binomial distribution (see formula (3) below). The Poisson distribution is a limiting case of both the negative binomial and binomial distributions. Note that all count data used in the case studies in this paper are overdispersed, which is a common situation in modeling epidemiological and crime data. Therefore, more general models should be used until a simpler model is justified.

McNeill and Lajaunie (1991) developed a model for binomial kriging, motivated by the inherent error in estimating binomial risk (the number of observed cases $N_i$ divided by the total sample size $P_i$ for polygon $i$). This error is often large, and it varies from polygon to polygon. The true risk at location $s$, $R(s)$, is assumed to be a stationary positive random field with constant mean and variance. The average risk $R_i$ in the polygon $i$ is the average of $R(s)$ over polygon $A_i$:

$$R_i = \frac{1}{|A_i|}\int_{A_i} R(s)\,ds$$

for uniform population density, and

$$R_i = \frac{\int_{A_i} w(s)\,R(s)\,ds}{\int_{A_i} w(s)\,ds},$$

where $w(s)$ describes the variability of $R(s)$, for non-uniform population density.

Binomial areal interpolation makes the following assumptions:

• $Z_i$ are independent binomial variables, $\frac{1}{P_i}\mathrm{Binomial}(R_i, P_i)$;

• the observed rates are the sum of the true but unknown risk $R_i$ and the measurement error $\varepsilon_i$, $Z_i = \frac{N_i}{P_i} = R_i + \varepsilon_i$;

• $R(s)$ is the only reason for correlation between the rates.

Areal interpolation for binomial counts seeks to map the unobserved risk $R(s)$ for all points $s$ in the data domain. McNeill and Lajaunie derived the formula for the empirical semivariogram:

$$\gamma_{i,j}^{R} = \gamma_{i,j}^{Z} - \frac{1}{2}\left(\frac{1}{P_i} + \frac{1}{P_j}\right)\mu(1 - \mu) + \frac{1}{2}\left(\frac{\sigma_i^2}{P_i} + \frac{\sigma_j^2}{P_j}\right), \qquad (3)$$

where $\mu$ and $\sigma_i^2$ are the mean rate and the rate variance over the region $A_i$. Formulas for the binomial areal kriging used in this paper are provided in the appendix. Note that the areas of the polygons only influence the estimate of the point covariances, as described in the appendix.

We demonstrate binomial areal kriging using lip cancer rates from 56 districts of Scotland (Kemp et al., 1985). Waller and Gotway (2004) employed the proportion of workers in agriculture, fishing, and forestry as a covariate because sun exposure is known to be correlated with lip cancer. Figure 5 shows the fitted covariances of the variable of interest (lip cancer rates) and of the secondary variable (the percentage of the workers who spent considerable time outside), and the cross-covariance between these variables, assuming that the secondary variable is normally distributed.
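The correction in formula (3) can be evaluated for one pair of polygons as in the sketch below. The function name, the argument names, and the numeric inputs are illustrative assumptions, not values from the lip cancer case study.

```python
def risk_semivariogram(gamma_z, p_i, p_j, mu, var_i, var_j):
    """Formula (3): semivariogram of the risk from that of the observed rates.

    gamma_z: semivariogram value of the observed rates for the pair (i, j)
    p_i, p_j: sample sizes; mu: mean rate; var_i, var_j: rate variances.
    """
    sampling_noise = 0.5 * (1.0 / p_i + 1.0 / p_j) * mu * (1.0 - mu)
    variance_term = 0.5 * (var_i / p_i + var_j / p_j)
    return gamma_z - sampling_noise + variance_term

gamma_r = risk_semivariogram(
    gamma_z=0.01, p_i=100, p_j=400, mu=0.1, var_i=0.002, var_j=0.002
)
```

The correction shrinks as the sample sizes grow, so for large populations the semivariogram of the observed rates approaches that of the risk.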

Figure 5: K-Bessel covariance of the primary variable (a), secondary variable (b), and cross-covariance between the variables (c)

Figure 6 shows cokriging prediction and prediction standard error maps. As expected, they are related, and the prediction standard errors are relatively large. The histogram on the bottom left shows the distribution of the secondary variable, which is unlikely to be Gaussian. Figure 7 shows cokriging predictions and prediction standard errors assuming that the secondary variable also has a binomial distribution. This distributional assumption seems more appropriate, and the prediction standard errors are clearly smaller, even though the prediction map looks similar to the map in figure 6a.

Figure 6: Cokriging predictions of the lip cancer rates (a) and prediction standard errors (b) assuming that the primary variable has a binomial distribution and the secondary variable is Gaussian. The histogram on the bottom left shows the distribution of the percentage of the workers who spent considerable time outside

Figure 7: Cokriging predictions of the lip cancer rates (a) and prediction standard errors (b) assuming that both variables have a binomial distribution

Figure 8a shows cokriging prediction standard errors assuming that the underlying process is overdispersed Poisson (the secondary variable, the percentage of the work force, is assumed to be a Gaussian random variable) with counts scaled by population size as in Goovaerts, 2008. The estimated value of the overdispersion parameter is equal to 0.31. Comparison of this map with the maps in figures 6b and 7b shows that the binomial model produces much smaller prediction standard errors (note that the legend of the map in figure 8a differs from the legends in figures 6b and 7b). Figure 8b shows a prediction standard error map for the Poisson process; that is, we force the overdispersion parameter to be equal to 0. The prediction standard error is reduced in comparison with the map in figure 8a, and the surface becomes bumpy due to the absence of the overdispersion smoothing effect.

Figure 8: Cokriging prediction standard errors of the lip cancer rates. a) Overdispersed Poisson distribution with estimated parameter l=0.31; b) Poisson distribution.

A decision about the best areal cokriging model for interpolation of the lip cancer data should be made together with epidemiologists, using their specific knowledge and the available model diagnostics. Our goal was to show that there is a choice between the binomial and the overdispersed Poisson models.

5. Conclusions

There are many ways to downscale averaged and aggregated data, and it is important to have a choice of statistical models for various cases. We provide an interactive software environment with a choice of the data distribution, spatial correlation analysis, prediction surface preview, and cross-validation diagnostic graphs and tables.

Traditionally, researchers prefer ordinary over simple kriging. However, statistical areal interpolation requires specification of the mean value and, therefore, the ordinary kriging constraint on the sum of the weights becomes an unnecessary addition to the model. Also, the simple kriging model can be used for simulating new surfaces conditional on the observed aggregated or averaged data, while the ordinary kriging model is not well suited for simulations. Simulations can be useful, for example, in modeling disease outbreaks by allowing the analysis of hypothetical situations. Therefore, the simple kriging model should be preferred.

We plan to continue developing the areal interpolation models. We see the following possibilities:

• The data averaging error can be known, and it can be an input parameter of the Gaussian model.

• The covariance/semivariogram modeling diagnostic can be improved using simulation techniques, as mentioned at the end of the section on Gaussian areal kriging. It is worth looking at Victor de Oliveira's presentation at JSM 2011.

• The algorithm for locating points inside the polygons can be improved.

• Kriging weights can reflect the heterogeneous population, as was mentioned in the section on the binomial areal kriging model.

• Model generalization using the mixed linear model (universal kriging with external trend) framework is possible.

• A Bayesian generalization can be very useful for areal kriging because the uncertainty of the covariance/semivariogram modeling is higher here than in the case of classical point kriging. In particular, we see potential in empirical Bayesian kriging.

• The areal cokriging model can be extended to space-time.

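Appendix: Definitions and Derivations

Definitions

$Y_u(s)$, $u \in \{1,2\}$, are correlated Gaussian fields with mean $m_u$ over all locations $s$.

$$\operatorname{cov}\left(Y_u(s),Y_{u'}(s')\right) = C_{uu'}(s' - s\,|\,\theta),$$

where $\theta$ is the set of covariance model parameters.

$$Y_u(A_{ui}) = \frac{1}{|A_{ui}|}\int_{A_{ui}} Y_u(s)\,ds$$

is the mean of $Y_u(s)$ over polygon $i$ of process $u$ with area $|A_{ui}|$.

$$\operatorname{cov}\left(Y_u(A_{ui}),Y_{u'}(A_{u'i'})\right) = C_{uu'}(A_{ui},A_{u'i'}\,|\,\theta) = \frac{1}{|A_{ui}|\,|A_{u'i'}|}\iint_{A_{ui},A_{u'i'}} \operatorname{cov}\left(Y_u(s),Y_{u'}(s')\right)\,ds\,ds'.$$

For Gaussian data, $Z_{ui}$ is the observation of process $Y_u(s)$ in polygon $i$ with measurement error $\varepsilon_{ui}$.

$$Z_{ui} = Y_u(A_{ui}) + \varepsilon_{ui},$$

$$E[Z_{ui}\,|\,Y_u(A_{ui})] = Y_u(A_{ui}),$$

$$\operatorname{Var}[Z_{ui}\,|\,\theta] = C_{uu}(A_{ui},A_{ui}\,|\,\theta) + \operatorname{Var}[\varepsilon_{ui}].$$

The mean value of the process is estimated by a weighted sample mean of $Z_{ui}$ for all polygons:

$$\hat{m}_u = \frac{\sum_i |A_{ui}| \cdot Z_{ui}}{\sum_i |A_{ui}|}.$$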

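For overdispersed Poisson data, $N_{ui}$ is the number of events observed from process $u$ in polygon $i$ over time $t_{ui}$.

$$E[N_{ui}] = Y_u(A_{ui}) \cdot |A_{ui}| \cdot t_{ui},$$

$$\operatorname{Var}[N_{ui}] = k \cdot Y_u(A_{ui}) \cdot |A_{ui}| \cdot t_{ui} + l \cdot \left(Y_u(A_{ui}) \cdot |A_{ui}| \cdot t_{ui}\right)^2,$$

where $k$ and $l$ are estimable overdispersion parameters.

$$Z_{ui} = \frac{N_{ui}}{|A_{ui}| \cdot t_{ui}},$$

$$E[Z_{ui}\,|\,Y_u(A_{ui})] = Y_u(A_{ui}),$$

$$\operatorname{Var}[Z_{ui}\,|\,\theta] = E\left[\operatorname{Var}[Z_{ui}\,|\,Y_u(A_{ui}),\theta]\right] + \operatorname{Var}\left[E[Z_{ui}\,|\,Y_u(A_{ui}),\theta]\right],$$

$$\operatorname{Var}[Z_{ui}\,|\,Y_u(A_{ui}),\theta] = \frac{k \cdot Y_u(A_{ui})}{|A_{ui}| \cdot t_{ui}} + l \cdot \left(Y_u(A_{ui})\right)^2,$$

$$E\left[\operatorname{Var}[Z_{ui}\,|\,Y_u(A_{ui}),\theta]\right] = \frac{k \cdot m_u}{|A_{ui}| \cdot t_{ui}} + l \cdot \left(C_{uu}(A_{ui},A_{ui}\,|\,\theta) + m_u^2\right),$$

$$\operatorname{Var}\left[E[Z_{ui}\,|\,Y_u(A_{ui}),\theta]\right] = C_{uu}(A_{ui},A_{ui}\,|\,\theta),$$

$$\operatorname{Var}[Z_{ui}\,|\,\theta] = C_{uu}(A_{ui},A_{ui}\,|\,\theta) + \frac{k \cdot m_u}{|A_{ui}| \cdot t_{ui}} + l \cdot \left(C_{uu}(A_{ui},A_{ui}\,|\,\theta) + m_u^2\right).$$

The mean value of the process is estimated by a weighted sample mean of $Z_{ui}$ for all polygons:

$$\hat{m}_u = \frac{\sum_i |A_{ui}| \cdot t_{ui} \cdot Z_{ui}}{\sum_i |A_{ui}| \cdot t_{ui}}.$$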

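For binomial data, $N_{ui}$ is the number of successes from a sample of size $P_{ui}$ from process $u$ in polygon $i$.

$$Z_{ui} = \frac{N_{ui}}{P_{ui}},$$

$$E[Z_{ui}\,|\,Y_u(A_{ui})] = Y_u(A_{ui}),$$

$$\operatorname{Var}[Z_{ui}\,|\,\theta] = E\left[\operatorname{Var}[Z_{ui}\,|\,Y_u(A_{ui}),\theta]\right] + \operatorname{Var}\left[E[Z_{ui}\,|\,Y_u(A_{ui}),\theta]\right],$$

$$\operatorname{Var}[Z_{ui}\,|\,Y_u(A_{ui}),\theta] = \frac{Y_u(A_{ui}) \cdot \left(1 - Y_u(A_{ui})\right)}{P_{ui}},$$

$$E\left[\operatorname{Var}[Z_{ui}\,|\,Y_u(A_{ui}),\theta]\right] = \frac{m_u \cdot (1 - m_u)}{P_{ui}} - \frac{C_{uu}(A_{ui},A_{ui}\,|\,\theta)}{P_{ui}},$$

$$\operatorname{Var}\left[E[Z_{ui}\,|\,Y_u(A_{ui}),\theta]\right] = C_{uu}(A_{ui},A_{ui}\,|\,\theta),$$

$$\operatorname{Var}[Z_{ui}\,|\,\theta] = C_{uu}(A_{ui},A_{ui}\,|\,\theta)\left(1 - \frac{1}{P_{ui}}\right) + \frac{m_u \cdot (1 - m_u)}{P_{ui}}.$$

The mean value of the process is estimated by a weighted sample mean of $Z_{ui}$ for all polygons:

$$\hat{m}_u = \frac{\sum_i N_{ui}}{\sum_i P_{ui}}.$$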
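The marginal variances derived above for the overdispersed Poisson and binomial cases can be written as two small functions. The sketch is only a numerical restatement of those formulas; the function and argument names are hypothetical, and `c_block` stands for the within-polygon block covariance $C_{uu}(A_{ui},A_{ui}|\theta)$.

```python
def marginal_variance_overdispersed(c_block, mean, area, time, k=1.0, l=0.0):
    """Var[Z|theta] = C + k*m/(|A|*t) + l*(C + m^2)."""
    exposure = area * time
    return c_block + k * mean / exposure + l * (c_block + mean ** 2)

def marginal_variance_binomial(c_block, mean, sample_size):
    """Var[Z|theta] = C*(1 - 1/P) + m*(1 - m)/P."""
    return c_block * (1.0 - 1.0 / sample_size) + mean * (1.0 - mean) / sample_size

v_poisson = marginal_variance_overdispersed(0.5, mean=2.0, area=4.0, time=1.0)
v_negbin = marginal_variance_overdispersed(0.5, mean=2.0, area=4.0, time=1.0, l=0.5)
v_binomial = marginal_variance_binomial(0.01, mean=0.2, sample_size=100)
```

In both cases the sampling term vanishes as the exposure or the sample size grows, leaving the block covariance (plus the overdispersion term in the first case).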

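Estimating parameters $\theta$ based on covariance or semivariogram models

The covariance model is

$$F_{uu'}(\theta) = \sum_{u \ne u' \text{ or } i \ne i'} \omega_{uiu'i'}\left[(Z_{ui} - m_u)(Z_{u'i'} - m_{u'}) - C_{uu'}(A_{ui},A_{u'i'}\,|\,\theta)\right]^2 + \delta_{uu'}\sum_i \omega_{uiui}\left((Z_{ui} - m_u)^2 - \operatorname{Var}[Z_{ui}\,|\,\theta]\right)^2,$$

where

$$\delta_{uu'} = \begin{cases} 1, & \text{if } u = u' \\ 0, & \text{otherwise,} \end{cases} \qquad \sum \omega_{uiu'i'} = 1.$$

The semivariogram model is

$$F_{uu}(\theta) = \sum \omega_{uiui'}\left[\frac{1}{2}(Z_{ui} - Z_{ui'})^2 - \gamma_{uiui'}\right]^2,$$

where

$$\gamma_{uiui'} = E\left[\left.\frac{1}{2}(Z_{ui} - Z_{ui'})^2\,\right|\,\theta\right] = \frac{1}{2}\left(\operatorname{Var}[Z_{ui}\,|\,\theta] + \operatorname{Var}[Z_{ui'}\,|\,\theta]\right) - C_{uu}(A_{ui},A_{ui'}\,|\,\theta), \qquad \sum \omega_{uiui'} = 1.$$

Between two datasets, when all polygons are the same, it is possible to use the cross-semivariogram. In this case only one set of polygons $\{A_i\}$ has to be defined.

$$F_{uu'}(\theta) = \sum \omega_{uiu'i'}\left[\frac{1}{2}(Z_{ui} - Z_{ui'})(Z_{u'i} - Z_{u'i'}) - \gamma_{uiu'i'}\right]^2,$$

where

$$\gamma_{uiu'i'} = E\left[\left.\frac{1}{2}(Z_{ui} - Z_{ui'})(Z_{u'i} - Z_{u'i'})\,\right|\,\theta\right] = \frac{1}{2}\left(C_{uu'}(A_i,A_i\,|\,\theta) + C_{uu'}(A_{i'},A_{i'}\,|\,\theta) - C_{uu'}(A_i,A_{i'}\,|\,\theta) - C_{uu'}(A_{i'},A_i\,|\,\theta)\right), \qquad \sum \omega_{uiu'i'} = 1.$$

If the polygons are not the same for the two processes, then only the covariance model can be used.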

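The parameters $\theta$ are estimated by

$$\hat{\theta} = \arg\min_{\theta}\left[\sum_{u \le u'} F_{uu'}(\theta)\right].$$

The optimization is done similarly to Gribov et al., 2006. The main differences are the following: 1) there is no averaging of covariance/semivariogram pairs and 2) optimization is done in two iterations.

On the first iteration we use the following weights.

For Gaussian data, all weights are equal.

For overdispersed Poisson data, the weights are:

for covariance

$$\omega_{uiu'i'} \propto \begin{cases} |A_{ui}| \cdot t_{ui} \cdot |A_{u'i'}| \cdot t_{u'i'}, & \text{if } u \ne u' \text{ or } i \ne i' \\ \left(|A_{ui}| \cdot t_{ui}\right)^2 / 2, & \text{otherwise,} \end{cases}$$

for semivariogram

$$\omega_{uiui'} \propto \left(\frac{|A_{ui}| \cdot t_{ui} \cdot |A_{ui'}| \cdot t_{ui'}}{|A_{ui}| \cdot t_{ui} + |A_{ui'}| \cdot t_{ui'}}\right)^2,$$

and for cross-semivariogram

$$\omega_{uiu'i'} \propto \left(\frac{|A_{ui}| \cdot t_{ui} \cdot |A_{ui'}| \cdot t_{ui'}}{|A_{ui}| \cdot t_{ui} + |A_{ui'}| \cdot t_{ui'}}\right) \cdot \left(\frac{|A_{u'i}| \cdot t_{u'i} \cdot |A_{u'i'}| \cdot t_{u'i'}}{|A_{u'i}| \cdot t_{u'i} + |A_{u'i'}| \cdot t_{u'i'}}\right).$$

For binomial data, the weights are:

for covariance

$$\omega_{uiu'i'} \propto \begin{cases} P_{ui} \cdot P_{u'i'}, & \text{if } u \ne u' \text{ or } i \ne i' \\ P_{ui}^2 / 2, & \text{otherwise,} \end{cases}$$

for semivariogram

$$\omega_{uiui'} \propto \left(\frac{P_{ui} \cdot P_{ui'}}{P_{ui} + P_{ui'}}\right)^2,$$

and for cross-semivariogram

$$\omega_{uiu'i'} \propto \left(\frac{P_{ui} \cdot P_{ui'}}{P_{ui} + P_{ui'}}\right) \cdot \left(\frac{P_{u'i} \cdot P_{u'i'}}{P_{u'i} + P_{u'i'}}\right).$$

For the second iteration, the weights are defined as follows:

for covariance

$$\omega_{uiu'i'} \propto \frac{1}{\operatorname{Var}[Z_{ui}\,|\,\theta] \cdot \operatorname{Var}[Z_{u'i'}\,|\,\theta] + C_{uu'}^2(A_{ui},A_{u'i'}\,|\,\theta)},$$

for semivariogram

$$\omega_{uiui'} \propto \frac{1}{\gamma_{uiui'}^2},$$

and for cross-semivariogram

$$\omega_{uiu'i'} \propto \frac{1}{\gamma_{uiui'} \cdot \gamma_{u'iu'i'} + \gamma_{uiu'i'}^2}.$$

Note that the weights on the first iteration can be obtained from the weights on the second iteration by assuming that the Gaussian fields are constant. Also, in the case of the overdispersed Poisson model, the overdispersion is ignored on the first iteration.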
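The two sets of semivariogram weights for binomial data, and the weighted least squares objective they enter, can be sketched as follows. This is a simplified illustration and not the optimization routine of Gribov et al., 2006: the model semivariogram values are passed in as a plain array (computing them from block covariances is outside the sketch), and all function names are hypothetical.

```python
import numpy as np

def first_iteration_weights(sample_sizes):
    """Binomial semivariogram weights (P_i*P_i' / (P_i + P_i'))^2, normalized to sum to 1."""
    p = np.asarray(sample_sizes, dtype=float)
    i, j = np.triu_indices(len(p), k=1)          # unordered polygon pairs
    w = (p[i] * p[j] / (p[i] + p[j])) ** 2
    return i, j, w / w.sum()

def second_iteration_weights(gamma_model):
    """Weights proportional to 1 / gamma^2 from the first-iteration fit."""
    w = 1.0 / np.asarray(gamma_model, dtype=float) ** 2
    return w / w.sum()

def semivariogram_objective(z, i, j, weights, gamma_model):
    """F(theta) = sum of w * (0.5*(Z_i - Z_i')^2 - gamma_ii')^2 over pairs."""
    z = np.asarray(z, dtype=float)
    empirical = 0.5 * (z[i] - z[j]) ** 2
    return float(np.sum(weights * (empirical - gamma_model) ** 2))

i, j, w1 = first_iteration_weights([10, 10, 30])
w2 = second_iteration_weights([1.0, 2.0, 2.0])
f = semivariogram_objective([0.1, 0.3, 0.2], i, j, w1, np.array([0.02, 0.005, 0.005]))
```

In a full fit, the objective would be minimized over $\theta$ once with the first set of weights and then again with weights recomputed from the resulting model semivariogram.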

References

Cressie, N.A.C. (1993) Statistics for Spatial Data. Revised ed. John Wiley & Sons, New York.

Eicher, C. L., and C. A. Brewer. 2001. Dasymetric mapping and areal interpolation: implementation and evaluation. Cartography and Geographic Information Science 28 (2): 125-38.

Flowerdew, R. and M. Green (1992). Developments in areal interpolating methods and GIS. Annals of Regional Science, 26, 67-78.

Gandin, L.S. and Kagan R.L. 1962. The accuracy of determining the mean depth of snow cover from discrete data. Trudy GGO, 130, 3-10 (In Russian).

Goodchild, M. F. and N. S.-N. Lam (1980). Areal interpolation: a variant of the traditional spatial problem. Geo-Processing, 1, 297–312.

Goovaerts, P. (2008). ‘‘Kriging and Semivariogram Deconvolution in the Presence of Irregular Geographical Units.’’ Mathematical Geosciences 40, 101–28.

Goovaerts, P. (2010), Geostatistical Analysis of County-Level Lung Cancer Mortality Rates in the Southeastern United States. Geographical Analysis, 42: 32–52.

Gotway CA, Young LJ: A geostatistical approach to linking geographically-aggregated data from different sources. In Technical report # 2004-012 Department of Statistics, University of Florida; 2004.

Gotway CA, Young LJ: Combining incompatible spatial data. Journal of the American Statistical Association 2002, 97: 632-648.

Gribov, A., Krivoruchko, K., and Ver Hoef, J.M. 2006. Modeling the semivariogram: New approach, methods comparison, and simulation study. In T.C. Coburn, J.M. Yarus, and R.L. Chambers, eds., Stochastic modeling and geostatistics: Principles, methods, and Case Studies, Volume II: AAPG Computer Applications in Geology 5, p. 45-57.

Hilbe, J.M. 2007. Negative binomial regression. Cambridge University Press, Cambridge.

Journel AG, Huijbregts CJ (1978) Mining geostatistics. Academic Press, London, 600 p.

Kagan R.L. 1997. Averaging of Meteorological Fields. Kluwer Academic Publishers. (Original Russian edition: Gidrometeoizdat, St. Petersburg, Russia, 1979).

Kelsall J, Wakefield J: Modeling spatial variation in disease risk: a geostatistical approach. Journal of the American Statistical Association 2002, 97(459):692-701.

Kerry, R., Goovaerts, P., Haining, R. P. and Ceccato, V. (2010), Applying Geostatistical Analysis to Crime Data: Car-Related Thefts in the Baltic States. Geographical Analysis, 42: 53–77.

Krivoruchko, K., Gribov, A., and Krause, E., 2011. Multivariate Areal Interpolation for Continuous and Count Data. Procedia Environmental Sciences, Volume 3, p. 14-19.

Krivoruchko, K. 2011. Spatial statistical data analysis for GIS users. Esri Press. 928 pp.

Lajaunie C. (1991) Local risk estimation for a rare non contagious disease based on observed frequencies. Centre de Geostatistique de l’Ecole des Mines de Paris, Fontainebleau, Note N-36/91/G.

Matheron, G. (1968) Osnovy prikladnoi geostatistiki (Principles of Applied Geostatistics). Mir, Moscow, 408 pp. (In Russian).

McNeill L. (1991) Interpolation and smoothing of binomial data for the southern african bird atlas project. South African Statist. J. vol.25, pp. 129-136.

Mennis, J. and Hultgren, T., 2006. Intelligent dasymetric mapping and its application to areal interpolation. Cartography and Geographic Information Science, 33(3): 179-194.

Monestiez P, Dubroca L, Bonnin E, Durbec JP, Guinet C: Geostatistical modeling of spatial distribution of Balenoptera physalus in the northwestern Mediterranean Sea from sparse count data and heterogeneous observation efforts. Ecological Modelling 2006, 193:615-628.

Preobrazenski, A. J. 1954. Dorewolucjonnyje i sovietskije karty razmieszczenija nasielenia. Woprosy Geografii. Kartografia 34:134-149, Moscow.

Tobler, W. (1979). Smooth pycnophylactic interpolation for geographical regions (with discussion). Journal of the American Statistical Association, 74, 519–536.

Webster R, Oliver MA, Munir KR, Mann JR: Kriging the local risk of a rare disease from a register of diagnoses. Geographical Analysis 1994, 26(2):168-185.

Waller, L.A. and Gotway, C.A. (2004) Applied Spatial Statistics for Public Health Data. John Wiley & Sons, New York.

Waller, L.A., Zhu, L., Gotway, C.A., Gorman, D.M., Gruenewald, P.J. (2007) Quantifying geographic variations in associations between alcohol distribution and violence: A comparison of geographically weighted regression and spatially varying coefficient models. Stochastic Environmental Research and Risk Assessment 21 (5), pp. 573-588.

Yoo, E.-H., Kyriakidis P., Tobler W. (2010) Reconstructing population density surfaces from areal data: a comparison of Tobler’s pycnophylactic interpolation method and area-to-point kriging. Geographical Analysis, Volume 42, Issue 1, p 78-98.
