Page 1: Spatial modelling an introduction Duncan Lee, Adrian Bowman and Marian Scott Environmental statistics course August 2008

Spatial modelling an introduction

Duncan Lee, Adrian Bowman and Marian Scott

Environmental statistics course

August 2008

Outline

• Spatial point processes

• Areal unit data

• Geostatistics

• Spatio-temporal modelling

1. Spatial point processes

• ‘A Spatial point process is a set of locations, irregularly distributed within a designated region and presumed to have been generated by some form of stochastic mechanism’ - Diggle (2003).

• A realisation from a spatial point process is termed a spatial point pattern – a countable collection of events at locations {ui}.

• Here the locations of the events {ui} are random and are themselves the data; no other variable is collected.

Example 1

Example 2

Notation

• A spatial point process is defined for a region A.

• Sub-regions within A are denoted A1, A2,……

• Single locations within A are denoted u1,u2,……

• We denote by N(A) the random variable representing the number of events in the region A. Similar definitions apply to N(Ak) and N(uk).

Question of interest

“Does the point pattern have any spatial dependence?”

Three general types of structure are possible.

• Complete spatial randomness (CSR) – events occur at random.

• Clustered process – events occur close to existing events.

• Regular process – events occur away from existing events.

Complete Spatial Randomness

CSR asserts that:

(i) For any sub-region Ak, N(Ak) ~ Poisson(λ|Ak|).

(ii) For disjoint sub-regions (A1, A2), N(A1) and N(A2) are independent.

λ is termed the intensity and is the expected number of events per unit of area, so that λ|A| is the expected number of events in A.

A process satisfying (i) and (ii) is called a homogeneous Poisson process (with intensity λ).
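
As an illustration of conditions (i) and (ii), the sketch below simulates CSR on a rectangle: it draws the number of events from a Poisson distribution with mean equal to the intensity times the area, then places the events uniformly. The function names, the rectangular region and the parameter values are assumptions made for this example only.

```python
import random

def poisson_draw(mean, rng):
    """Draw one Poisson(mean) value by counting unit-rate exponential
    arrivals that occur before time `mean`."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def simulate_csr(intensity, width, height, rng):
    """Simulate a homogeneous Poisson process on [0, width] x [0, height]."""
    # Condition (i): N(A) ~ Poisson(intensity * |A|).
    n_events = poisson_draw(intensity * width * height, rng)
    # Given N(A), the events are independent and uniform over A.
    return [(rng.uniform(0, width), rng.uniform(0, height))
            for _ in range(n_events)]

rng = random.Random(1)
pattern = simulate_csr(intensity=50, width=2.0, height=1.0, rng=rng)
print(len(pattern))  # a random count whose expectation is 50 * 2 = 100
```

Repeating the simulation many times and averaging the counts should give a value close to the intensity times the area.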

Mean and covariance

For a CSR process N(A) ~ Poisson(λ|A|).

Mean – Constant across A. Therefore at a single location u1, with area 1, the mean of N(u1) equals λ.

Covariance – The spatial dependence of the process between two points (u1, u2) is determined by the second order intensity function λ2(u1, u2).

However the latter is hard to work with.

K function

Instead of working with the second order intensity function λ2(u1, u2) to measure spatial dependence, we work with the K function

K(t) = E{N0(t)} / λ

where N0(t) is the number of further events within a distance t of an arbitrary event.

Why is the K-function useful?

Recall that

K(t) = E(number of further events within distance t of an arbitrary event) / λ

• For a CSR process, K(t) = πt².

• For a clustered process we would expect more points close together than under CSR, so for small t, K(t) > πt².

• For a regular process we would expect fewer points close together than under CSR, so for small t, K(t) < πt².
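
A short sketch of why the CSR benchmark is the area of a disc, ignoring edge effects near the boundary of A (an assumption made here for simplicity). Under CSR the events are independent, so conditioning on one event does not change the distribution of the others, and the number of further events in a disc of radius t around it is Poisson with mean equal to the intensity times the disc area:

$$
E\{N_0(t)\} = \lambda \, \pi t^2
\quad\Longrightarrow\quad
K(t) = \frac{E\{N_0(t)\}}{\lambda} = \pi t^2 .
$$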

Determining if CSR holds

• Step 1 – estimate the intensity by λ̂ = N(A)/|A|.

• Step 2 – estimate K(t) for a given distance t by calculating the average, over all events in the pattern, of the number of other events within distance t of that event, and dividing by λ̂.

• Step 3 – plot the theoretical function for CSR, K(t) = πt², against t, and add a second line for the estimated K function for the point pattern. If CSR is reasonable they will be very similar.
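
The three steps can be sketched as below. This is a naive estimator with no edge correction, which is a simplifying assumption of this example; the function names are hypothetical, and packages such as spatstat provide edge-corrected estimators.

```python
import math

def estimate_k(points, area, t):
    """Naive estimate of K(t), with no edge correction."""
    n = len(points)
    intensity_hat = n / area                      # Step 1
    neighbour_count = 0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j and math.hypot(xi - xj, yi - yj) <= t:
                neighbour_count += 1
    mean_neighbours = neighbour_count / n         # Step 2
    return mean_neighbours / intensity_hat

def csr_k(t):
    """Theoretical K function under CSR, used as the reference line in Step 3."""
    return math.pi * t ** 2

# Four events on the corners of the unit square (a very regular pattern).
corners = [(0, 0), (0, 1), (1, 0), (1, 1)]
for t in (0.5, 1.0, 1.5):
    print(t, estimate_k(corners, area=1.0, t=t), csr_k(t))
```

For this regular pattern the estimate stays below the CSR curve at every distance shown, as expected for a regular process.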

Some examples

[Figure: example plots of K(t) against t.]

Further models for Spatial point processes

If CSR does not hold for the data in question there are other models that can be used. For example

• Poisson cluster process – models clusters.

• Inhomogeneous Poisson process – spatially varying intensity.

• Cox process – the intensity is itself a random process.

• Inhibition process – models regular processes.

Another example

Implementing point process models

Point process models (including CSR and others) can be implemented in R using the add-on packages

• spatstat

• splancs

For further details see http://lib.stat.cmu.edu/R/CRAN/.

2. Areal unit data

• The region of interest A is split into n non-overlapping sub-regions A1,…,An .

• The random variable of interest is only available as an aggregated average or total for each sub-region, and is represented by Z1,…,Zn .

• The sub-regions are fixed, and it is the variable being measured for each region that is random.

• In comparison, for point processes no variable Z was measured, as it was the location of the event that was random.

Motivating example

Lip cancer rates for the 56 counties in Scotland. Two possible questions of interest:

1. Does any environmental variable affect the number of new cases?

2. Is there an outbreak of lip cancer cases in any part of Scotland?

Map taken from a paper by Wakefield, Biostatistics 2007, 158-183.

Modelling areal unit data

When modelling areal unit data z1,…,zn from sub-regions A1,…,An consider the following:

• Response distribution – normal, Poisson, binomial, etc.

• Regression variables – e.g. sunlight in the lip cancer example.

• Spatial dependence – are areas close together related?

• Method of analysis – frequentist or Bayesian methods.

Spatial dependence

Spatial dependence quantifies how the values of z1,…,zn are related to each other. There are three general types of dependence.

1. Independence – the values of z1,…,zn are not related.

2. Negative dependence – if areas i and j are close together then zi and zj will have different values.

3. Positive dependence – if areas i and j are close together then zi and zj will have similar values.

Modelling positive dependence

A common method for modelling positive dependence is based on a neighbourhood or weight matrix W.

• A matrix of 1’s and 0’s, where element ij is 1 if areas i and j are neighbours and 0 otherwise.

• Neighbours can be defined in many ways including:

– Areas sharing a common border.

– Areas less than a distance d apart.

– Area i is one of the closest areas in terms of distance to area j.
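
As a small illustration of the distance-based definition, the sketch below builds W from area centroids. Using centroids to measure the distance between areas, and the function name, are assumptions made for this example.

```python
import math

def neighbourhood_matrix(centroids, d):
    """Return W where w[i][j] = 1 if distinct areas i and j have centroids
    less than a distance d apart, and 0 otherwise."""
    n = len(centroids)
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(centroids[i], centroids[j]) < d:
                w[i][j] = 1
    return w

# Three areas in a row plus one isolated area.
centroids = [(0, 0), (1, 0), (2, 0), (5, 5)]
for row in neighbourhood_matrix(centroids, d=1.5):
    print(row)
```

The resulting matrix is symmetric with a zero diagonal, and the isolated area has no neighbours, which matters for models that divide by the number of neighbours.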

Conditional autoregressive (CAR) models

For simplicity assume that z1,…,zn are normally distributed and there are no covariates. The CAR model is then given by

$$ Z_i \mid Z_{-i} \sim N\left( \frac{\sum_{j} w_{ij} z_j}{n_i}, \; \frac{\sigma^2}{n_i} \right) $$

where Z-i denotes all the values except Zi. So the expected value of zi is equal to the mean of its neighbours, as ni is the number of neighbours of area i.
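
The conditional moments of the CAR model can be computed directly from W. The sketch below assumes the common form in which the conditional mean of area i is the average of its neighbours and the conditional variance is a common variance divided by the number of neighbours; the function name and the example values are hypothetical.

```python
def car_conditional(z, w, sigma2):
    """Return the (mean, variance) of Z_i given all other values, for each
    area i. Areas with no neighbours get None, since the conditional
    distribution is not defined by this formula."""
    moments = []
    for row in w:
        n_i = sum(row)                      # number of neighbours of area i
        if n_i == 0:
            moments.append(None)
            continue
        mean_i = sum(w_ij * z_j for w_ij, z_j in zip(row, z)) / n_i
        moments.append((mean_i, sigma2 / n_i))
    return moments

# Three areas in a chain: area 1 neighbours both area 0 and area 2.
w = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
z = [2.0, 5.0, 4.0]
print(car_conditional(z, w, sigma2=1.0))
```

Areas with more neighbours have a smaller conditional variance, because more information is borrowed from the surrounding areas.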

3. Geostatistical data

• For a fixed region A, the variable of interest could be measured at any location.

• However due to time/cost constraints it has only been measured at n locations u1,…, un , which are typically chosen and not random.

• The random variables measured at all n locations are denoted by Z(u1),…, Z(un) .

• Therefore this is different from

– Point processes, where the locations are the random variable.

– Areal data, where the variable can only be measured as n aggregated averages (or totals) for sub-regions A1,…, An.

Goals of geostatistics

Given observations Z(u1),…, Z(un), there are three general goals of a geostatistical analysis.

1. How best to model the data?

2. How to estimate Z(u0) where u0 is an unobserved location?

3. How to draw a map of Z(u) for all points u in the region?

Modelling geostatistical data

When modelling geostatistical data consider the following:

• Response distribution – normal, Poisson, binomial, etc.

• Spatial trend – e.g. regression variables or other trends.

• Spatial dependence – how are locations close to each other related?

• Method of analysis – frequentist or Bayesian methods.

General geostatistical model

A general model for data Z=(Z(u1),…, Z(un)), is

Z = µ + S

• The data Z are assumed to be normally distributed.

• µ is the mean function and models spatial trend.

• S is a stochastic process and models spatial dependence.

Modelling spatial trend

A spatial trend is a systematic change in the mean function µ over the area of interest. It is generally smooth, although it may change abruptly in response to environmental forcing variables (e.g., bedrock geology). It can be modelled in numerous ways.

• Regression variables such as geology.

• Polynomials in the co-ordinates u1,…,un.

• Modelled within the spatial dependence component S (non-stationary).

Spatial dependence

• For the remainder of this course we assume that any spatial trend has been removed by the mean function µ.

• We assume positive spatial dependence rather than negative, that is the closer two points are the more similar their values of the variable will be.

Modelling spatial dependence

A common model for spatial dependence is

S ~ N(0 , C)

which implies the data are normally distributed.

Here C is the variance-covariance matrix, and is a transformed correlation matrix. If all observations have the same variance, then C simplifies to C = σ²V, where

• V is the correlation matrix.

• σ² is the common variance of each observation.

Correlation matrix V

The correlation matrix typically has the following characteristics.

• The diagonal elements equal 1, as they represent the correlation of an observation with itself.

• The ijth element of V is close to one if locations ui and uj are close.

• As locations ui and uj get further apart, the ijth element gets closer to zero.

• Negative dependence (i.e. negative values in V) is rarely seen in geostatistical data.
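
A minimal sketch of a matrix V with these characteristics, and of the covariance matrix obtained by scaling it with a common variance. The exponential decay of correlation with distance and the parameter name phi are assumptions chosen for illustration; other correlation functions have the same qualitative behaviour.

```python
import math

def exponential_correlation_matrix(locations, phi):
    """V[i][j] = exp(-||u_i - u_j|| / phi): equal to 1 on the diagonal and
    decaying towards 0 as the two locations get further apart."""
    return [[math.exp(-math.dist(ui, uj) / phi) for uj in locations]
            for ui in locations]

def covariance_matrix(v, sigma2):
    """Scale the correlation matrix by the common variance sigma2."""
    return [[sigma2 * v_ij for v_ij in row] for row in v]

locations = [(0, 0), (1, 0), (3, 0)]
v = exponential_correlation_matrix(locations, phi=2.0)
c = covariance_matrix(v, sigma2=4.0)
for row in v:
    print([round(value, 3) for value in row])
```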

Simplifying V or C

The covariance / correlation (spatial dependence) structure in the data can have two simplifying properties.

Stationarity – The covariance (or correlation) between Z(ui) and Z(uj) depends only on the difference ui – uj, so the locations of the two points do not matter, only their distance and direction from each other.

Isotropy – The covariance (or correlation) between Z(ui) and Z(uj) depends only on the magnitude of the difference ||ui – uj||, so the locations of the two points do not matter, only their distance apart.

Assuming the spatial dependence is stationary and isotropic, the covariance function between two points Z(u) and Z(u + t) simplifies to

$$ C(t) = \mathrm{Cov}(Z(u), Z(u+t)) $$

a function of the scalar distance between the two points. Similarly the correlation function is given by

$$ \rho(t) = \frac{C(t)}{\sigma^2} $$

• Where σ² is the variance, also denoted by C(0).

Semi-variogram modelling

However in the geostatistical literature spatial dependence is modelled in terms of the semi-variogram

γ(t) = 0.5Var(Z(u+t) – Z(u)) = C(0) – C(t)

= σ2 - C(t)
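
The link between the semi-variogram and the covariance function follows from expanding the variance of a difference, using the fact that under stationarity every Z(u) has the same variance C(0):

$$
\gamma(t) = \tfrac{1}{2}\,\mathrm{Var}\{Z(u+t) - Z(u)\}
= \tfrac{1}{2}\left[\mathrm{Var}\{Z(u+t)\} + \mathrm{Var}\{Z(u)\} - 2\,\mathrm{Cov}\{Z(u+t), Z(u)\}\right]
= \tfrac{1}{2}\left[2C(0) - 2C(t)\right]
= C(0) - C(t).
$$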

rather than the covariance function.

Estimating the semi-variogram

The semi-variogram for data Z(u1),…, Z(un) can be estimated by calculating

$$ \hat{\gamma}(t) = \frac{1}{2|N(t)|} \sum_{(u_i, u_j) \in N(t)} [z(u_i) - z(u_j)]^2 $$

for any value of t. Here N(t) is the set of pairs of points (ui, uj) that are distance t apart, and |N(t)| is the number of such pairs. This function is called the empirical semi-variogram, and it can be plotted against t to see the general shape.
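
A minimal sketch of the empirical semi-variogram. Because few pairs of points are ever exactly a distance t apart, this version collects all pairs whose separation is within a tolerance of t; that tolerance, like the function name, is an assumption of this example.

```python
import math

def empirical_semivariogram(locations, values, t, tol):
    """Average of 0.5 * (z_i - z_j)^2 over all pairs of points whose
    separation is within `tol` of the distance t. Returns None if no
    pair falls in that distance band."""
    total = 0.0
    n_pairs = 0
    for i in range(len(locations)):
        for j in range(i + 1, len(locations)):
            if abs(math.dist(locations[i], locations[j]) - t) <= tol:
                total += (values[i] - values[j]) ** 2
                n_pairs += 1
    if n_pairs == 0:
        return None
    return total / (2 * n_pairs)

# Four equally spaced locations along a line.
locations = [(0, 0), (1, 0), (2, 0), (3, 0)]
values = [1.0, 2.0, 4.0, 7.0]
for t in (1.0, 2.0, 3.0):
    print(t, empirical_semivariogram(locations, values, t, tol=0.1))
```

In this toy data set the estimate increases with distance, which is the pattern expected under positive spatial dependence.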

Alternatively, you could plot the semi-variogram cloud, which is a plot of

$$ \gamma(t_{ij}) = 0.5\,[z(u_i) - z(u_j)]^2 $$

against

$$ t_{ij} = \|u_i - u_j\| $$

for all pairs of points. This form gives more than one value for each distance t, so it is a scatterplot.

What should a semi-variogram look like?

• The nugget is the limiting value of the semi-variogram as the distance t approaches zero. It quantifies the amount of spatial variability at very small spatial scales (those less than the separation between observations) and also measurement error.

• The sill is the horizontal asymptote of the variogram, if it exists, and represents the overall variance of the random process.

• The range is the distance t* at which the semi-variogram reaches the sill. Pairs of points that are further apart than the range are uncorrelated.

But what about in practice?

Sometimes the semi-variogram only approaches the sill asymptotically, and in this case we define the practical range as the lag t* at which

γ(t*) = 0.95 × sill

= 0.95 × σ²

Modelling spatial dependence

Spatial dependence in the data can now be modelled in two stages.

1. Plot the empirical semi-variogram and determine which family of semi-variogram models it resembles.

2. Estimate the parameters (sill, nugget, range) of the chosen semi-variogram model by least squares methods.
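
Stage 2 can be sketched as a least squares search over candidate parameter values. The example assumes an exponential-type model with no nugget, gamma(t) = sill * (1 - exp(-t / a)), and uses a simple grid search rather than a numerical optimiser; both choices, and all names, are assumptions made to keep the sketch self-contained.

```python
import math

def exp_model(t, sill, a):
    """Exponential-type semi-variogram with no nugget (assumed form)."""
    return sill * (1.0 - math.exp(-t / a))

def fit_by_least_squares(lags, gamma_hat, sill_grid, range_grid):
    """Return the (sill, a) pair on the grid that minimises the sum of
    squared differences between the model and the empirical values."""
    best = None
    for sill in sill_grid:
        for a in range_grid:
            sse = sum((g - exp_model(t, sill, a)) ** 2
                      for t, g in zip(lags, gamma_hat))
            if best is None or sse < best[0]:
                best = (sse, sill, a)
    return best[1], best[2]

# Synthetic "empirical" values generated from known parameters.
lags = [0.5, 1.0, 2.0, 3.0, 4.0]
gamma_hat = [exp_model(t, sill=2.0, a=1.5) for t in lags]
sill_grid = [0.5 * k for k in range(1, 9)]      # 0.5, 1.0, ..., 4.0
range_grid = [0.25 * k for k in range(1, 13)]   # 0.25, 0.5, ..., 3.0
print(fit_by_least_squares(lags, gamma_hat, sill_grid, range_grid))
```

With real data the empirical values are noisy, and estimates at large lags are usually based on fewer pairs, so weighted least squares is often preferred.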

Semi-variogram models

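A number of semi-variogram models exist that can be used.

Nugget – random data

$$ \gamma(h) = \begin{cases} 0, & \text{if } h = 0 \\ c, & \text{otherwise} \end{cases} $$

Spherical

$$ \gamma(h) = \begin{cases} c\left[1.5\,\dfrac{h}{a} - 0.5\left(\dfrac{h}{a}\right)^3\right], & \text{if } 0 \le h \le a \\ c, & \text{if } h \ge a \end{cases} $$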

Exponential

$$ \gamma(h) = \begin{cases} \tau^2 + \sigma^2\,(1 - \exp(-h/\phi)), & \text{if } h > 0 \\ 0, & \text{otherwise} \end{cases} $$

Although these models may not fit the data particularly well.
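
The sketch below codes two of these model families as functions of distance. The parameterisations are assumptions: a spherical model with sill c and range a, and an exponential model with a nugget, a partial sill and a range parameter phi. With no nugget, the exponential model reaches 95% of its sill at a distance of about 3 * phi, which is its practical range.

```python
import math

def spherical(h, c, a):
    """Spherical model: rises from 0 and reaches the sill c exactly at h = a."""
    if h <= 0:
        return 0.0
    if h >= a:
        return c
    r = h / a
    return c * (1.5 * r - 0.5 * r ** 3)

def exponential(h, nugget, partial_sill, phi):
    """Exponential model: approaches nugget + partial_sill asymptotically."""
    if h <= 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-h / phi))

# Practical range of the exponential model with no nugget:
# solve 1 - exp(-h / phi) = 0.95 for h.
phi = 2.0
practical_range = -phi * math.log(0.05)
print(practical_range / phi)                        # close to 3
print(exponential(practical_range, 0.0, 1.0, phi))  # close to 0.95
```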

Spatial prediction

Once a trend and spatial dependence model have been fitted, it is of interest to estimate Z at some unobserved location u0. There are many methods for doing this including:

• Regression modelling using generalised least squares.

• Inverse distance weighted interpolation.

• Kriging.

The majority of these approaches predict z*(u0), the variable at location u0, by a weighted average of the form

$$ z^*(u_0) = \sum_{i=1}^{n} \lambda_i z(u_i) $$

The main difference between the methods is how the weights λi are estimated. A map can then be produced by predicting the surface at a regular grid of points.
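
Inverse distance weighted interpolation is the simplest instance of such a weighted average: each weight is proportional to an inverse power of the distance from the prediction location, and the weights are normalised to sum to one. The power of 2 used as a default below is a common choice but an assumption here, as is the function name.

```python
import math

def idw_predict(u0, locations, values, power=2.0):
    """Predict the value at u0 as a weighted average of the observations,
    with weights proportional to distance ** (-power)."""
    raw_weights = []
    for u_i, z_i in zip(locations, values):
        d = math.dist(u0, u_i)
        if d == 0.0:
            return z_i            # u0 coincides with an observed location
        raw_weights.append(d ** -power)
    total = sum(raw_weights)
    return sum(w / total * z for w, z in zip(raw_weights, values))

locations = [(0.0, 0.0), (2.0, 0.0)]
values = [1.0, 3.0]
print(idw_predict((1.0, 0.0), locations, values))  # midpoint: equal weights
print(idw_predict((0.5, 0.0), locations, values))  # closer to the first point
```

Unlike kriging, these weights depend only on distances to the prediction location and ignore the fitted spatial dependence model.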

137Cs deposition maps in SW Scotland prepared by different European teams (ECCOMAGS, 2002)

Kriging 1

Ordinary Kriging

1. First, the trend is estimated using least squares methods.

2. Then the observed values can be de-trended by subtracting the estimated trend from the data.

3. Finally a model for the variogram is fitted to the de-trended data and used to generate the weights for the prediction.
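
The final step, turning a fitted variogram model into prediction weights, can be sketched as follows. The example assumes the data have already been de-trended, so that the mean is constant, and that a semi-variogram function gamma has been chosen; it then solves the standard kriging equations, which force the weights to sum to one. The small linear solver and all names are part of this illustration only.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with
    partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def krige(u0, locations, values, gamma):
    """Return the kriging prediction at u0 and the weights used."""
    n = len(locations)
    # Semi-variogram between every pair of observed locations, bordered by
    # a row and column of ones for the sum-to-one constraint.
    a = [[gamma(math.dist(ui, uj)) for uj in locations] + [1.0]
         for ui in locations]
    a.append([1.0] * n + [0.0])
    # Semi-variogram between each observed location and u0.
    b = [gamma(math.dist(ui, u0)) for ui in locations] + [1.0]
    weights = solve(a, b)[:n]     # the last unknown is a Lagrange multiplier
    return sum(w * z for w, z in zip(weights, values)), weights

def gamma(h):
    """Assumed exponential semi-variogram with unit sill and no nugget."""
    return 1.0 - math.exp(-h)

prediction, weights = krige((1.0, 0.0), [(0.0, 0.0), (2.0, 0.0)], [1.0, 3.0], gamma)
print(prediction, weights)
```

With no nugget, predicting at an observed location returns the observed value, so the kriging surface passes through the data.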

Kriging 2

• There are a number of other kriging methods, such as block kriging, indicator kriging and co-kriging.

• Some interesting issues concern the uncertainty in the prediction. We can use the kriging procedure to produce uncertainty maps, and recent work has developed approaches to incorporate the uncertainty in the variogram model.

Kriging in R

There are routines to do kriging in the R libraries:

• geoR

• fields

• gstat

• sgeostat

• spatstat

• spatdat

Choosing the locations u1…un

The desired set of locations depends on the goal of the analysis.

Point prediction – Locate points on a regular grid so that all prediction locations will be highly correlated with a few observed data points.

Average estimation – If the aim is to estimate the average value of Z over the region A, then correlated points provide redundant information. Therefore you want the distance between pairs of points to be roughly the variogram range.

4. Spatio-temporal statistical modelling

Spatio-temporal statistical modelling is a real challenge because:

• usually very large data sets, and one ‘dimension’ may be richer than the other

– lots of stations, limited measurement in time.

– few stations, monitored very frequently in time.

• need to combine the techniques found in time series and spatial analysis.

Modelling spatial and temporal dependence

One major difficulty concerns how to jointly model

– correlation through time

– correlation over space

Is correlation through space constant over time, and correlation through time constant over space?

– if yes, then we have a ‘separable’ and stationary process.

– if not, then we need to build a space-time correlation structure (hard work).

General approach

The general approach to spatio-temporal models is through stochastic spatio-temporal processes

Z(u,t) - where u represents space and t represents time

which may be a combination of a spatial and a time series process.

Simplifying assumptions

• Stationarity – natural extension from time series and spatial models.

• Isotropy – natural extension from spatial models.

• Separability – The covariance function of Z(u,t) can be split into space and time parts, i.e.

cov[Z(u1, t1), Z(u2, t2)] = Cu(u1, u2) CT(t1, t2)

which means we can use the tools we have met previously.
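
A minimal sketch of what separability buys in practice: the covariance between any two space-time observations is just the product of a purely spatial term and a purely temporal term. The exponential spatial covariance, the autoregressive-style temporal correlation and all parameter values below are assumptions chosen for illustration.

```python
import math

def c_space(u1, u2, sigma2=1.0, phi=2.0):
    """Assumed spatial covariance: exponential decay with distance."""
    return sigma2 * math.exp(-math.dist(u1, u2) / phi)

def c_time(t1, t2, rho=0.5):
    """Assumed temporal correlation: geometric decay with the time lag."""
    return rho ** abs(t1 - t2)

def separable_covariance(sites, times):
    """Covariance matrix of Z(u, t) over every (site, time) pair, built as
    the product of the spatial and temporal parts."""
    pairs = [(u, t) for u in sites for t in times]
    cov = [[c_space(u1, u2) * c_time(t1, t2) for (u2, t2) in pairs]
           for (u1, t1) in pairs]
    return pairs, cov

sites = [(0.0, 0.0), (1.0, 0.0)]
times = [0, 1, 2]
pairs, cov = separable_covariance(sites, times)
print(len(cov), len(cov[0]))   # one row and column per (site, time) pair
```

The full matrix has one row per space-time pair, but it is determined entirely by the smaller spatial and temporal matrices, which is why the earlier spatial tools and standard time series tools can be reused.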

Spatial Analysis Across Time

At each time point a plane across space was fitted and Gaussian variograms of the residuals were computed. The averages of the variogram parameter estimates were used to obtain the spatial covariance matrix.

[Figure: four contour maps plotted against Longitude and Latitude: observed values of SO2, May 1991; estimated trend of SO2, May 1991; observed values of SO2, September 1998; estimated trend of SO2, September 1998.]

Non-separable processes

• A much harder problem, and still the basis of much statistical research.