2017 descriptive spatio-temporal wikle, c.k. modeling...

Descriptive Spatio-Temporal Modeling: Part I (Covariance-Based Approach)

ASA Boston Short Course: Copyright C.K. Wikle, 2017

Learning Points

Recall, spatio-temporal modeling typically seeks to accomplish one

of four things given potentially incomplete and noisy observations

in space and time:

• predicting a plausible value at some location in space and

time and reporting the uncertainty of that prediction

• forecasting the future value at some location (and its

uncertainty)

• performing scientific inference about parameters in a model

given errors with spatio-temporal dependence

• data assimilation with mechanistic models

An Introduction to Spatio-Temporal Statistics: Descriptive Modeling © Christopher K. Wikle


Learning Points

We consider models of the form:

observations = true process + observation error

true process = trend term + dependent random process

Here we focus on the more traditional “descriptive” approach that

considers the dependent random process in terms of the first-order

and second-order moments (means, variances, and covariances) of

its marginal distribution.



Learning Points

In this portion of the course we focus specifically on:

• Optimal Linear Prediction for Gaussian (normal) Data and

Latent Gaussian Processes

• Basis Functions and Random Effects Parameterizations

• Non-Gaussian Data Models with Latent Gaussian Processes



Spatio-Temporal Prediction

As an example, consider prediction (“interpolation”) of maximum temperature at location “x” in the central US given data from 138 measurement locations.

NOAA maximum temperature observations for 8 July 1993.



Spatio-Temporal Prediction

But, we have data from other times as well; Tobler’s “law”

suggests that we should consider “nearby” observations in both

space AND time to help with the prediction (interpolation).

NOAA maximum temperature observations for 4, 8, and 12 July 1993.



Spatio-Temporal Prediction: Notation

$\{Y(\mathbf{s}; t) : \mathbf{s} \in D_s, t \in D_t\}$ denotes a latent (unobserved) spatio-temporal random process with spatial location $\mathbf{s}$ in spatial domain $D_s$ (a subset of d-dimensional real space), and time index $t$ in temporal domain $D_t$ (along the one-dimensional real line).

Data: $\{z(\mathbf{s}_{ij}; t_j)\}$ for spatial locations $\{\mathbf{s}_{ij} : i = 1, \ldots, m_j\}$ and times $\{t_j : j = 1, \ldots, T\}$.

It is convenient to collect all observations at a given time (say, $t_j$) in the vector:

$$ \mathbf{z}_{t_j} = (z(\mathbf{s}_{1j}; t_j), \ldots, z(\mathbf{s}_{m_j j}; t_j))' $$



Spatio-Temporal Prediction: Data Model

We represent the data in terms of the latent spatio-temporal

process of interest plus an error:

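$$ z(\mathbf{s}_{ij}; t_j) = Y(\mathbf{s}_{ij}; t_j) + \varepsilon(\mathbf{s}_{ij}; t_j), \quad i = 1, \ldots, m_j; \; j = 1, \ldots, T. $$

Assumptions:

• $\{\varepsilon(\mathbf{s}_{ij}; t_j)\}$ are independent and identically distributed (iid) mean-zero observation errors that are independent of $Y$ and have variance $\sigma^2_\varepsilon$

• the errors are assumed to have a normal (Gaussian) distribution here: $\varepsilon(\mathbf{s}_{ij}; t_j) \sim \text{iid } N(0, \sigma^2_\varepsilon)$ (or, we might write this as $\varepsilon(\mathbf{s}_{ij}; t_j) \sim \text{iid } Gau(0, \sigma^2_\varepsilon)$)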



Spatio-Temporal Prediction: Goal

We seek to predict the “true” (latent) process at some location

(s0; t0) (for which we might not have data): Y (s0; t0) given all of

the observations {z(sij , tj) : i = 1, . . . ,mj ; j = 1, . . . ,T}.

Consider the linear predictor of the form:

$$ \hat{Y}(\mathbf{s}_0; t_0) = \sum_{j=1}^{T} \sum_{i=1}^{m_j} w_{ij}\, z(\mathbf{s}_{ij}; t_j) $$

How do we choose these weights?

• they typically reflect Tobler’s “first law of geography”; the closer an observation is to location $(\mathbf{s}_0; t_0)$, the larger the weight

• this is similar to kernel-weighted regression, but the “optimal” weights (see below) consider spatio-temporal dependence more directly



Spatio-Temporal Prediction: Cartoon

Cartoon showing spatio-temporal prediction of latent process Y (s0; t0)

given observations at time t0 and before and after. The solid arrows

depict the prediction weights wij for the various observations.



Spatio-Temporal Prediction: Latent Process Model

We assume that the latent process follows the model

Y (s; t) = µ(s; t) + η(s; t),

for all (s; t) in our space-time domain of interest (e.g., Ds × Dt).

• µ(s; t) represents the process mean

• we may choose to let µ(s; t) be: (i) known, (ii) constant but

unknown, or (iii) modeled in terms of p covariates,

µ(s; t) = x(s; t)′β, where the p-dimensional vector β is

unknown, which corresponds to simple, ordinary, and universal

spatio-temporal kriging, respectively

• η(s; t) represents a zero-mean Gaussian process (we will

define this in a bit) with spatial and temporal dependence



Spatio-Temporal Prediction: Gaussian Processes

What is the difference between a Gaussian “distribution” and a

Gaussian “process”?

• Gaussian Distribution: a distribution over vectors

• Fully specified by a mean vector and covariance matrix: e.g.,

Y ∼ Gau(µ,Σ) (i.e., a multivariate normal distribution)

• In spatio-temporal applications, the elements of Y are indexed

by space-time location, e.g., Yi = Y (si ; ti )

• Gaussian Process: a distribution over functions

• Fully specified by a mean function (m) and covariance function

(c): e.g., Y ∼ GP(m, c); thus, we will need to specify a mean

function and covariance function to define a GP

• In spatio-temporal applications, the argument of the function

is the spatio-temporal location: Y (s; t)



Spatio-Temporal Prediction: Gaussian Processes

Why GPs?

• A GP is an infinite-dimensional object

• This will allow us to make a prediction anywhere in our

space-time domain (if we know the mean and covariance

functions)

• But, in practice we will only ever need to consider finite-dimensional distributions; why?

• We only have a finite set of data and are eventually only

interested in a finite set of prediction locations

• Any finite collection of GP random variables has a joint

Gaussian distribution

• This allows us to use the traditional machinery of multivariate

normal distributions and linear mixed models to do calculations



Spatio-Temporal Prediction: Gaussian Distributions

Technical Review:

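Recall the famous result relating joint and conditional Gaussian random variables:

$$ \begin{pmatrix} \mathbf{v} \\ \mathbf{v}_* \end{pmatrix} \sim Gau\left( \begin{pmatrix} \boldsymbol{\mu} \\ \boldsymbol{\mu}_* \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma} & \boldsymbol{\Sigma}_* \\ \boldsymbol{\Sigma}_*' & \boldsymbol{\Sigma}_{**} \end{pmatrix} \right) $$

where $E(\mathbf{v}) = \boldsymbol{\mu}$, $E(\mathbf{v}_*) = \boldsymbol{\mu}_*$, $\boldsymbol{\Sigma} = \text{cov}(\mathbf{v}, \mathbf{v})$, $\boldsymbol{\Sigma}_* = \text{cov}(\mathbf{v}, \mathbf{v}_*)$, and $\boldsymbol{\Sigma}_{**} = \text{cov}(\mathbf{v}_*, \mathbf{v}_*)$.

Then,

$$ \mathbf{v}_* \mid \mathbf{v} \sim Gau\left( \boldsymbol{\mu}_* + \boldsymbol{\Sigma}_*' \boldsymbol{\Sigma}^{-1} (\mathbf{v} - \boldsymbol{\mu}), \; \boldsymbol{\Sigma}_{**} - \boldsymbol{\Sigma}_*' \boldsymbol{\Sigma}^{-1} \boldsymbol{\Sigma}_* \right) $$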
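As a small illustration of the conditional-Gaussian result, the following sketch computes the conditional mean and covariance with NumPy. The function name `conditional_gaussian` and all numbers in the toy example are assumptions made for illustration; they are not part of the course material.

```python
import numpy as np

def conditional_gaussian(mu, mu_star, Sigma, Sigma_star, Sigma_starstar, v):
    """Mean and covariance of v* | v for jointly Gaussian (v, v*).

    Sigma          = cov(v, v)    (n x n)
    Sigma_star     = cov(v, v*)   (n x k)
    Sigma_starstar = cov(v*, v*)  (k x k)
    """
    # Solve Sigma A = Sigma_star rather than forming Sigma^{-1} explicitly.
    A = np.linalg.solve(Sigma, Sigma_star)        # A = Sigma^{-1} Sigma_star
    cond_mean = mu_star + A.T @ (v - mu)          # mu* + Sigma*' Sigma^{-1} (v - mu)
    cond_cov = Sigma_starstar - Sigma_star.T @ A  # Sigma** - Sigma*' Sigma^{-1} Sigma*
    return cond_mean, cond_cov

# Toy bivariate example (illustrative numbers only).
mu = np.array([0.0])
mu_star = np.array([1.0])
Sigma = np.array([[2.0]])
Sigma_star = np.array([[1.0]])
Sigma_starstar = np.array([[1.5]])
v = np.array([2.0])

cond_mean, cond_cov = conditional_gaussian(
    mu, mu_star, Sigma, Sigma_star, Sigma_starstar, v
)
print(cond_mean, cond_cov)
```

Using a linear solve instead of an explicit inverse is a common numerical choice; the result is algebraically the same as the formula above.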



Spatio-Temporal Prediction: Optimal Linear Prediction

Let’s get back to the problem at hand:

Recall we are interested in predicting $Y(\mathbf{s}_0; t_0)$ (i.e., at some arbitrary location in space-time) given the data $\mathbf{z} \equiv (z(\mathbf{s}_{11}; t_1), \ldots, z(\mathbf{s}_{m_T T}; t_T))'$. As we mentioned, we are looking for linear predictors of the form:

$$ \hat{Y}(\mathbf{s}_0; t_0) = \sum_{j=1}^{T} \sum_{i=1}^{m_j} w_{ij}\, z(\mathbf{s}_{ij}; t_j) = \mathbf{w}_0' \mathbf{z}. $$

Thus, we seek the conditional distribution:

$$ [Y(\mathbf{s}_0; t_0) \mid \mathbf{z}]. $$

Why not [Z(s0; t0) | z ]?




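It helps to rewrite the data model in vector form:

$$ \mathbf{z} = \mathbf{Y} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim Gau(\mathbf{0}, \mathbf{C}_\varepsilon) $$

where $\mathbf{Y}$ and $\boldsymbol{\varepsilon}$ are ordered in the same way as the data vector $\mathbf{z}$.

Now, $\text{cov}(\mathbf{Y}, \mathbf{Y}) \equiv \mathbf{C}_y$, $\text{cov}(\boldsymbol{\varepsilon}, \boldsymbol{\varepsilon}) \equiv \mathbf{C}_\varepsilon$, and $\text{cov}(\mathbf{z}, \mathbf{z}) \equiv \mathbf{C}_z = \mathbf{C}_y + \mathbf{C}_\varepsilon$.

In addition, $\mathbf{c}_0' \equiv \text{cov}(Y(\mathbf{s}_0; t_0), \mathbf{z})$, $c_{0,0} \equiv \text{var}(Y(\mathbf{s}_0; t_0))$, and $\mathbf{X}$ is the $(\sum_{j=1}^{T} m_j) \times p$ matrix of known covariates. Then, consider the joint Gaussian distribution:

$$ \begin{bmatrix} Y(\mathbf{s}_0; t_0) \\ \mathbf{z} \end{bmatrix} \sim Gau\left( \begin{bmatrix} \mathbf{x}(\mathbf{s}_0; t_0)' \\ \mathbf{X} \end{bmatrix} \boldsymbol{\beta}, \; \begin{bmatrix} c_{0,0} & \mathbf{c}_0' \\ \mathbf{c}_0 & \mathbf{C}_z \end{bmatrix} \right). $$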




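Using the joint/conditional Gaussian distribution formula from before (and assuming for the moment that $\boldsymbol{\beta}$ is known; i.e., simple spatio-temporal kriging) we get

$$ Y(\mathbf{s}_0; t_0) \mid \mathbf{z} \sim Gau\left( \mathbf{x}(\mathbf{s}_0; t_0)' \boldsymbol{\beta} + \mathbf{c}_0' \mathbf{C}_z^{-1} (\mathbf{z} - \mathbf{X}\boldsymbol{\beta}), \; c_{0,0} - \mathbf{c}_0' \mathbf{C}_z^{-1} \mathbf{c}_0 \right). $$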




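The prediction formula is given by the mean of this distribution:

$$ \hat{Y}(\mathbf{s}_0; t_0) = \mathbf{x}(\mathbf{s}_0; t_0)' \boldsymbol{\beta} + \mathbf{c}_0' \mathbf{C}_z^{-1} (\mathbf{z} - \mathbf{X}\boldsymbol{\beta}), $$

and the prediction variance is then $\sigma^2_{Y_0} = c_{0,0} - \mathbf{c}_0' \mathbf{C}_z^{-1} \mathbf{c}_0$.

This predictor is the optimal predictor in that it minimizes the mean squared prediction error:

$$ MSPE = E[(Y(\mathbf{s}_0; t_0) - \hat{Y}(\mathbf{s}_0; t_0))^2]. $$

Thus, in this case, the weights are given by $\mathbf{w}_0' = \mathbf{c}_0' \mathbf{C}_z^{-1}$ and they are applied to the “mean corrected data,” $(\mathbf{z} - \mathbf{X}\boldsymbol{\beta})$.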
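The simple kriging formulas can be sketched numerically as follows. This is a minimal illustration under stated assumptions: a separable exponential covariance in one spatial dimension, a constant known mean, and four made-up observations. The helper names (`sep_exp_cov`, `simple_kriging`) and all parameter values are hypothetical.

```python
import numpy as np

def sep_exp_cov(s1, t1, s2, t2, sigma2=1.0, a_s=1.0, a_t=1.0):
    """Separable exponential covariance in 1-D space and time (an assumed model)."""
    h = np.abs(s1[:, None] - s2[None, :])      # spatial lags
    tau = np.abs(t1[:, None] - t2[None, :])    # temporal lags
    return sigma2 * np.exp(-h / a_s) * np.exp(-tau / a_t)

def simple_kriging(z, X, beta, C_z, c0, c00, x0):
    """Return the predictor, its variance, and the weights w0 = C_z^{-1} c0."""
    w0 = np.linalg.solve(C_z, c0)              # avoids forming C_z^{-1} explicitly
    pred = x0 @ beta + w0 @ (z - X @ beta)     # x0' beta + c0' C_z^{-1} (z - X beta)
    var = c00 - c0 @ w0                        # c00 - c0' C_z^{-1} c0
    return pred, var, w0

# Four made-up observations on a 2 x 2 space-time grid.
s_obs = np.array([0.0, 1.0, 0.0, 1.0])
t_obs = np.array([0.0, 0.0, 1.0, 1.0])
z = np.array([20.0, 22.0, 21.0, 24.0])
X = np.ones((4, 1))                            # constant mean
beta = np.array([21.0])                        # assumed known (simple kriging)
sigma2_eps = 0.1                               # measurement-error variance

C_y = sep_exp_cov(s_obs, t_obs, s_obs, t_obs)
C_z = C_y + sigma2_eps * np.eye(4)             # C_z = C_y + C_eps

s0, t0 = np.array([0.9]), np.array([0.9])      # prediction location
c0 = sep_exp_cov(s_obs, t_obs, s0, t0)[:, 0]   # cov(Y(s0; t0), z)
c00 = 1.0                                      # var(Y(s0; t0)) = sigma2
x0 = np.array([1.0])

pred, var, w0 = simple_kriging(z, X, beta, C_z, c0, c00, x0)
print("prediction:", pred, "variance:", var)
print("weights:", w0)
```

In this toy setup the observation closest to the prediction location in both space and time receives the largest weight, in line with Tobler's law.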




Let’s take a closer look at the predictive distribution:

(Note, it is simple to predict at many new locations, say Y0. The

formulas are slightly modified to include matrices for X0, C0 and c0,0.)



Spatio-Temporal Prediction: Optimal Prediction Examples

Simulated data in 1-D space and time, with predictions and standard errors. Observation locations are given by “x”. Note that the prediction standard errors are much smaller where data are “nearby” in space and time.



Spatio-Temporal Prediction: Optimal Prediction

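In most real-world problems, one does not know $\boldsymbol{\beta}$.

It is straightforward to show that the optimal universal spatio-temporal kriging predictor of $Y(\mathbf{s}_0; t_0)$ is given by:

$$ \hat{Y}(\mathbf{s}_0; t_0) = \mathbf{x}(\mathbf{s}_0; t_0)' \hat{\boldsymbol{\beta}}_{gls} + \mathbf{c}_0' \mathbf{C}_z^{-1} (\mathbf{z} - \mathbf{X}\hat{\boldsymbol{\beta}}_{gls}), $$

where the generalized-least-squares (gls) estimator of $\boldsymbol{\beta}$ is given by

$$ \hat{\boldsymbol{\beta}}_{gls} \equiv (\mathbf{X}' \mathbf{C}_z^{-1} \mathbf{X})^{-1} \mathbf{X}' \mathbf{C}_z^{-1} \mathbf{z}. $$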



Spatio-Temporal Prediction: Optimal Prediction

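The associated universal spatio-temporal kriging variance is given by

$$ \sigma^2_{Y,uk}(\mathbf{s}_0; t_0) = c_{0,0} - \mathbf{c}_0' \mathbf{C}_z^{-1} \mathbf{c}_0 + \kappa, $$

where

$$ \kappa \equiv (\mathbf{x}(\mathbf{s}_0; t_0) - \mathbf{X}' \mathbf{C}_z^{-1} \mathbf{c}_0)' (\mathbf{X}' \mathbf{C}_z^{-1} \mathbf{X})^{-1} (\mathbf{x}(\mathbf{s}_0; t_0) - \mathbf{X}' \mathbf{C}_z^{-1} \mathbf{c}_0) $$

represents the additional uncertainty brought to the prediction (relative to simple kriging) due to the estimation of $\boldsymbol{\beta}$.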
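The universal kriging predictor, the GLS estimator, and the extra variance term κ can be sketched together. The covariance model, the locations, and the helper name `universal_kriging` are assumptions for illustration. The data are generated as an exact linear trend, so the GLS step should recover the trend coefficients.

```python
import numpy as np

def universal_kriging(z, X, C_z, c0, c00, x0):
    """Universal kriging at one location: (prediction, variance, beta_gls, kappa)."""
    Cinv_X = np.linalg.solve(C_z, X)        # C_z^{-1} X
    Cinv_z = np.linalg.solve(C_z, z)        # C_z^{-1} z
    Cinv_c0 = np.linalg.solve(C_z, c0)      # C_z^{-1} c0
    XtCX = X.T @ Cinv_X                     # X' C_z^{-1} X
    beta_gls = np.linalg.solve(XtCX, X.T @ Cinv_z)
    pred = x0 @ beta_gls + Cinv_c0 @ (z - X @ beta_gls)
    sk_var = c00 - c0 @ Cinv_c0             # simple kriging variance
    r = x0 - X.T @ Cinv_c0                  # x0 - X' C_z^{-1} c0
    kappa = r @ np.linalg.solve(XtCX, r)    # extra uncertainty from estimating beta
    return pred, sk_var + kappa, beta_gls, kappa

def cov(s1, t1, s2, t2, a=0.5):
    """Assumed separable exponential covariance with unit variance."""
    h = np.abs(s1[:, None] - s2[None, :])
    tau = np.abs(t1[:, None] - t2[None, :])
    return np.exp(-h / a) * np.exp(-tau / a)

rng = np.random.default_rng(0)
n = 6
s = rng.uniform(0.0, 1.0, n)                # hypothetical 1-D spatial locations
t = rng.uniform(0.0, 1.0, n)                # hypothetical observation times
C_z = cov(s, t, s, t) + 0.1 * np.eye(n)     # C_y + C_eps

s0, t0 = np.array([0.5]), np.array([0.5])
c0 = cov(s, t, s0, t0)[:, 0]
c00 = 1.0

X = np.column_stack([np.ones(n), s])        # intercept plus a spatial covariate
x0 = np.array([1.0, 0.5])
beta_true = np.array([1.0, 2.0])
z = X @ beta_true                           # exact linear trend, no noise

pred, uk_var, beta_gls, kappa = universal_kriging(z, X, C_z, c0, c00, x0)
print(beta_gls, pred, uk_var, kappa)
```

Because κ is a quadratic form in a positive-definite matrix, the universal kriging variance is never smaller than the simple kriging variance.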



Spatio-Temporal Prediction: Optimal Prediction Example

Universal kriging predictions of NOAA maximum temperature (in degrees

Fahrenheit) within a square box enclosing the domain of interest for six

days in July 1993, using the R package gstat. Data for 14 July 1993

were omitted from the original dataset.



Spatio-Temporal Prediction: Optimal Prediction Example

Universal kriging standard errors for predictions of NOAA maximum

temperature (in degrees Fahrenheit) within a square box enclosing the

domain of interest for six days in July 1993, using the R package gstat.

Data for 14 July 1993 were omitted from the original dataset.



Covariance Functions


Spatio-Temporal Prediction: Covariance Functions

Recall that we considered the spatio-temporal process as a

Gaussian Process here because we need the covariance function

between ANY space-time locations. This requires that we specify a

mean function (easy) and a covariance function (not so easy).

So far, we assumed that we knew the variances and covariances that make up $\mathbf{C}_y$, $\mathbf{C}_\varepsilon$, $\mathbf{c}_0$, and $c_{0,0}$. In the real world we would rarely (if ever) know these.




Seemingly simple solution: specify covariance functions in terms

of parameters, θ, that can give a covariance between any pairs of

space-time coordinates, and then estimate the parameters (see

below).

In spatio-temporal data analysis, as with spatial statistics, the

specification of these covariance functions becomes one of the

most challenging components of the problem. Why?

• Covariance functions must be positive semi-definite (this

ensures that prediction variances are non-negative!)

• Covariance functions should be realistic (that is, they should

account for realistic dependencies)




In practice, classical kriging implementations typically make a

stationarity assumption.

For example, if the random process is assumed to be second-order (weakly) stationary, then it has constant expectation and a covariance function that can be expressed in terms of the spatial and temporal lags:

$$ c(\mathbf{h}; \tau) \equiv c(\mathbf{s}_{ij} - \mathbf{s}_{k\ell}; t_j - t_\ell), \tag{1} $$

where $\mathbf{h} \equiv \mathbf{s}_{ij} - \mathbf{s}_{k\ell}$ and $\tau \equiv t_j - t_\ell$ are the spatial and temporal lags, respectively.

Recall from our exploratory analyses that if the covariance depends on the spatial lag only through its length and not its direction (i.e., through $h = \|\mathbf{h}\|$), we say there is spatial isotropy.




Two biggest benefits of the stationarity assumption:

• allows for more parsimonious parameterizations of the

covariance function

• provides pseudo-replication of dependencies at given lags in

space and time, both of which facilitate estimation

The next question is how do we specify (or construct) such valid

covariance functions? Traditionally,

• separable covariance functions

• sums-and-products formulations

• spectral construction

• stochastic partial differential equation solutions




Separable Covariance Functions:

$$ c(\mathbf{h}; \tau) \equiv c^{(s)}(\mathbf{h}) \cdot c^{(t)}(\tau), $$

which is valid if both the spatial covariance, $c^{(s)}(\mathbf{h})$, and the temporal covariance, $c^{(t)}(\tau)$, are valid.




There are a large number of classes of valid spatial and temporal covariance functions in the literature (e.g., the Matérn, powered exponential, and Gaussian classes, to name but a few). For example, the exponential covariance function is given by

$$ c^{(t)}(\tau) = \sigma^2 \exp\{ -|\tau| / a \}, $$

where $\sigma^2$ is the variance parameter and $a$ is the dependence parameter.

Exponential covariance function for time lag, $\tau$; $\sigma^2 = 2.5$ and $a = 2$.
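The exponential covariance function with the caption's parameter values can be evaluated directly. The sketch below also builds the covariance matrix on an arbitrary grid of time points (the grid is an assumption) and checks numerically that its smallest eigenvalue is positive, which is what validity requires of any such matrix.

```python
import numpy as np

def exp_cov(tau, sigma2=2.5, a=2.0):
    """Exponential covariance c(tau) = sigma2 * exp(-|tau| / a)."""
    return sigma2 * np.exp(-np.abs(tau) / a)

# Covariance matrix on an assumed grid of time points.
t = np.arange(0.0, 10.0, 0.5)
C_t = exp_cov(t[:, None] - t[None, :])

# A valid covariance function yields a matrix with no negative eigenvalues.
min_eig = np.linalg.eigvalsh(C_t).min()
print(exp_cov(0.0), exp_cov(2.0), min_eig)
```

At lag zero the function returns the variance σ², and at lag τ = a it has decayed by a factor of e.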




Benefit of Separable Covariance Functions:

Separable models help with computation in problems with many

spatial and/or temporal locations (e.g., matrix inverses).

Technical Note:

For example, assume that $Z(\mathbf{s}_{ij}; t_j)$ is observed at each of $m_j = m$ locations and each time point, $j = 1, \ldots, T$.

Separability allows us to write $\mathbf{C}_z = \mathbf{C}_z^{(t)} \otimes \mathbf{C}_z^{(s)}$, where $\otimes$ is the Kronecker product, $\mathbf{C}_z^{(s)}$ is the $m \times m$ spatial covariance matrix, and $\mathbf{C}_z^{(t)}$ is the $T \times T$ temporal covariance matrix.

In this case,

$$ \mathbf{C}_z^{-1} = (\mathbf{C}_z^{(t)})^{-1} \otimes (\mathbf{C}_z^{(s)})^{-1}, $$

which shows that one only has to invert $m$- and $T$-dimensional matrices to get the inverse of the much larger $(mT) \times (mT)$ matrix.
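The Kronecker identity in the technical note can be checked numerically. In this sketch the spatial and temporal covariance matrices are built from exponential covariances with arbitrary, assumed range parameters. The Kronecker ordering corresponds to stacking the data time by time, with the m locations varying fastest.

```python
import numpy as np

m, T = 4, 3                                   # small sizes so the check is cheap
s = np.linspace(0.0, 1.0, m)                  # assumed 1-D spatial locations
t = np.arange(T, dtype=float)                 # assumed time points

C_s = np.exp(-np.abs(s[:, None] - s[None, :]) / 0.5)   # m x m spatial covariance
C_t = np.exp(-np.abs(t[:, None] - t[None, :]) / 2.0)   # T x T temporal covariance

C_z = np.kron(C_t, C_s)                       # (mT) x (mT) separable covariance

# Direct inverse of the large matrix ...
C_z_inv_direct = np.linalg.inv(C_z)
# ... versus the Kronecker product of the two small inverses.
C_z_inv_kron = np.kron(np.linalg.inv(C_t), np.linalg.inv(C_s))

print(np.allclose(C_z_inv_direct, C_z_inv_kron))
```

For realistic sizes the saving is substantial: inverting an m × m and a T × T matrix costs on the order of m³ + T³ operations, compared with (mT)³ for the full matrix.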




Separable Covariance Functions:

Consider the NOAA maximum temperature dataset. The left panel

below shows a contour plot of the empirical spatio-temporal

covariance function and the right panel shows the product of the

empirical spatial (c(s)(‖h‖; 0)) and temporal (c(t)(0; τ)) marginal

covariance functions. These plots suggest that separability may be

a reasonable assumption for these data.




Separability is unusual in spatio-temporal processes; it says that

temporal evolution of the process at a given spatial location does

not depend directly on the process’ temporal evolution at other

locations. That is, separability comes from a lack of

spatio-temporal interaction in Y (·; ·).

What other options do we have to specify S-T covariance

functions?

• Sums and products of covariance functions

• Construction (e.g., spectrally, via Bochner’s Theorem, which

relates the spectral representation to the covariance

representation)

• Solving a stochastic partial differential equation (SPDE)




Other Options – Sums and Products of Covariance Functions:

Because the products and sums of valid covariance functions are

valid, we can define:

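$$ c(\mathbf{h}; \tau) \equiv p\, c_1^{(s)}(\mathbf{h}) \cdot c_1^{(t)}(\tau) + q\, c_2^{(s)}(\mathbf{h}) + r\, c_2^{(t)}(\tau), $$

for $p > 0$, $q \geq 0$, $r \geq 0$, and where $c_1^{(s)}(\mathbf{h})$ and $c_2^{(s)}(\mathbf{h})$ are valid spatial covariance functions and $c_1^{(t)}(\tau)$ and $c_2^{(t)}(\tau)$ are valid temporal covariance functions.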

In general, it is not clear if realistic processes can be modeled by such

structures. A special case, sometimes known as a “fully symmetric

spatio-temporal covariance function” implies

cov(Y (s; t),Y (s′; t′)) = cov(Y (s; t′),Y (s′; t)),

which is not realistic for many processes.




Other Options – Spectral Construction:

Bochner’s Theorem establishes a direct connection between the

Fourier spectral (frequency-based) representation of a process and

its covariance function (see Cressie and Wikle, 2011) and makes it

easier to construct valid covariance functions.

For example, by showing that they only needed to choose a

one-dimensional valid function for the time lag, Cressie and Huang (1999)

derived the valid non-separable S-T covariance function (among others):

$$ c(\mathbf{h}; \tau) = \sigma^2 \exp\{ -b^2 \|\mathbf{h}\|^2 / (a^2 \tau^2 + 1) \} / (a^2 \tau^2 + 1)^{d/2}, $$

where $\sigma^2 = c(\mathbf{0}; 0)$, $d$ corresponds to the spatial dimension, and $a \geq 0$, $b \geq 0$ are time and space scaling parameters, respectively.
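The Cressie-Huang covariance function above is easy to code, and a quick numerical check shows that it is not separable: the covariance at a given lag pair differs from the product of its spatial and temporal margins. The function name `cressie_huang_cov` and the chosen evaluation lags are assumptions for illustration; the default parameters (a = 2, b = 0.2, d = 1, σ² = 1) are taken from the contour-plot caption in these slides.

```python
import numpy as np

def cressie_huang_cov(h, tau, sigma2=1.0, a=2.0, b=0.2, d=1):
    """Nonseparable covariance; h is the spatial lag length ||h||, tau the time lag."""
    denom = a**2 * tau**2 + 1.0
    return sigma2 * np.exp(-b**2 * h**2 / denom) / denom ** (d / 2)

h, tau = 5.0, 1.0                               # arbitrary lags chosen for the check
c_joint = cressie_huang_cov(h, tau)
c_space = cressie_huang_cov(h, 0.0)             # spatial margin c(h; 0)
c_time = cressie_huang_cov(0.0, tau)            # temporal margin c(0; tau)
c_zero = cressie_huang_cov(0.0, 0.0)            # variance c(0; 0) = sigma2

# Under separability, c(h; tau) * c(0; 0) would equal c(h; 0) * c(0; tau).
print(c_joint * c_zero, c_space * c_time)
```

The two printed values differ, which is the numerical signature of space-time interaction in this model.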




Spectral Construction:

Left: Contour plot of the nonseparable Cressie-Huang covariance function

for d = 1, a = 2, b = 0.2, and σ = 1. Right: Equivalent separable

representation based on the product of the spatial and temporal marginal

correlation functions.




Other Options – SPDE-Derived Covariance Functions:

The SPDE approach to deriving spatio-temporal covariance

functions was originally inspired by statistical physics, where

physical equations (PDEs) forced by random processes that

describe advective, diffusive, and decay behavior can be used, in

principle, to describe the second moments of macro-scale processes.

Most famous examples: the ubiquitous Matérn class of spatial

covariance functions can be derived from a linear diffusion SPDE

(e.g., Whittle 1962; see Guttorp and Gneiting (2006) for historical

perspective) and the reaction-diffusion model of Heine (1955).

Although these approaches can suggest nonseparable

spatio-temporal covariance functions, only a few special (simple)

models lead to closed forms.


Descriptive Modeling: Part II (Estimation)


Spatio-Temporal Prediction: Estimation

The spatio-temporal covariance models presented above depend on

unknown parameters; these must be estimated from the data.

There is a history in spatial statistics of fitting such models directly

to the empirical estimates (e.g., by using a least squares or

weighted least squares approach; see Cressie 1993, Section 2.6 for

an overview).

In the spatio-temporal context, one typically considers covariance

models and estimates parameters through likelihood-based

methods or through fully Bayesian methods. This follows closely

the approaches in linear mixed model parameter estimation (for an

overview see McCulloch and Searle, 2001).




Likelihood Estimation:

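Recall $\mathbf{C}_z = \mathbf{C}_y + \mathbf{C}_\varepsilon$, which depends on parameters $\boldsymbol{\theta} \equiv \{\boldsymbol{\theta}_y, \boldsymbol{\theta}_\varepsilon\}$ for the latent and error processes, respectively. The likelihood can then be written

$$ L(\boldsymbol{\beta}, \boldsymbol{\theta}) \propto |\mathbf{C}_z(\boldsymbol{\theta})|^{-1/2} \exp\left\{ -\frac{1}{2} (\mathbf{z} - \mathbf{X}\boldsymbol{\beta})' (\mathbf{C}_z(\boldsymbol{\theta}))^{-1} (\mathbf{z} - \mathbf{X}\boldsymbol{\beta}) \right\}, $$

and we maximize this likelihood with respect to $\{\boldsymbol{\beta}, \boldsymbol{\theta}\}$, thus obtaining the maximum likelihood estimates (MLEs).

The fact that the covariance parameters appear in the matrix inverse and determinant prohibits analytical maximization, but numerical methods (e.g., Newton-Raphson) can be used if the dimension is not too large.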
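To make the numerical maximization concrete, here is a minimal sketch that evaluates the Gaussian negative log-likelihood and searches over a single covariance parameter. It deliberately swaps in a crude grid search for Newton-Raphson, profiles out β with the GLS estimator at each grid value, and assumes a purely temporal exponential-plus-nugget covariance with the variances held fixed. All function names and settings are hypothetical.

```python
import numpy as np

def neg_log_lik(z, X, beta, C_z):
    """Gaussian negative log-likelihood, including the 2*pi constant."""
    r = z - X @ beta
    _, logdet = np.linalg.slogdet(C_z)            # stable log-determinant
    quad = r @ np.linalg.solve(C_z, r)            # (z - X beta)' C_z^{-1} (z - X beta)
    return 0.5 * (logdet + quad + len(z) * np.log(2.0 * np.pi))

def gls_beta(z, X, C_z):
    """GLS estimator (X' C_z^{-1} X)^{-1} X' C_z^{-1} z."""
    Cinv_X = np.linalg.solve(C_z, X)
    return np.linalg.solve(X.T @ Cinv_X, Cinv_X.T @ z)

def exp_plus_nugget(t, a, sigma2=1.0, sigma2_eps=0.1):
    """Assumed model: C_z = sigma2 * exp(-|tau| / a) + sigma2_eps * I."""
    tau = np.abs(t[:, None] - t[None, :])
    return sigma2 * np.exp(-tau / a) + sigma2_eps * np.eye(len(t))

def grid_mle(z, X, t, a_grid):
    """Profile out beta by GLS and pick the grid value of a with the smallest NLL."""
    results = []
    for a in a_grid:
        C_z = exp_plus_nugget(t, a)
        beta = gls_beta(z, X, C_z)
        results.append((neg_log_lik(z, X, beta, C_z), a, beta))
    return min(results, key=lambda res: res[0])

# Simulate one series from the assumed model with a = 3 and mean 5.
rng = np.random.default_rng(1)
t = np.arange(40.0)
X = np.ones((40, 1))
L = np.linalg.cholesky(exp_plus_nugget(t, a=3.0))
z = 5.0 + L @ rng.standard_normal(40)

a_grid = np.linspace(0.5, 10.0, 20)
nll_hat, a_hat, beta_hat = grid_mle(z, X, t, a_grid)
print("selected a:", a_hat, "beta:", beta_hat, "NLL:", nll_hat)
```

A gradient-based optimizer would replace the loop in practice; the grid keeps the example dependency-free and makes the shape of the profile likelihood easy to inspect.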




Restricted Maximum Likelihood Estimation (REML):

REML considers the likelihood of a linear transformation of the

data vector such that the errors are orthogonal to the Xs that

make up the mean function.

Numerical maximization of the associated likelihood, which is only

a function of the θ parameters, is typically more computationally

efficient than for MLEs (and, conventional wisdom is that the bias

properties of the REML covariance parameter estimates are better

than MLEs for linear mixed models.)
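For reference, one standard textbook way to write the restricted log-likelihood is given below. This form is not taken from the slides, and it assumes that X has full column rank. Here the GLS estimator is evaluated at the current value of θ, so the expression depends on θ only.

```latex
\ell_R(\boldsymbol{\theta}) = \text{const}
  - \tfrac{1}{2} \log |\mathbf{C}_z(\boldsymbol{\theta})|
  - \tfrac{1}{2} \log |\mathbf{X}' \mathbf{C}_z(\boldsymbol{\theta})^{-1} \mathbf{X}|
  - \tfrac{1}{2} (\mathbf{z} - \mathbf{X}\hat{\boldsymbol{\beta}}_{gls}(\boldsymbol{\theta}))'
      \mathbf{C}_z(\boldsymbol{\theta})^{-1}
      (\mathbf{z} - \mathbf{X}\hat{\boldsymbol{\beta}}_{gls}(\boldsymbol{\theta})),
\qquad
\hat{\boldsymbol{\beta}}_{gls}(\boldsymbol{\theta}) =
  (\mathbf{X}' \mathbf{C}_z(\boldsymbol{\theta})^{-1} \mathbf{X})^{-1}
  \mathbf{X}' \mathbf{C}_z(\boldsymbol{\theta})^{-1} \mathbf{z}.
```

Compared with the profile log-likelihood, the only additional term is the log-determinant of X′C_z⁻¹X, which accounts for the degrees of freedom used in estimating β.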




Fully (Hierarchical) Bayesian Estimation:

Recall, we can decompose an arbitrary joint distribution in terms of

a hierarchical sequence of conditional distributions and a marginal

distribution, e.g., [A,B,C ] = [A|B,C ][B|C ][C ].

In the context of the general spatio-temporal model,

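$$ \begin{aligned} {}[\mathbf{z}, \mathbf{Y}, \boldsymbol{\beta}, \boldsymbol{\theta}] &= [\mathbf{z} \mid \mathbf{Y}, \boldsymbol{\theta}, \boldsymbol{\beta}]\,[\mathbf{Y} \mid \boldsymbol{\beta}, \boldsymbol{\theta}]\,[\boldsymbol{\beta} \mid \boldsymbol{\theta}]\,[\boldsymbol{\theta}] \\ &= [\mathbf{z} \mid \mathbf{Y}, \boldsymbol{\theta}_\varepsilon]\,[\mathbf{Y} \mid \boldsymbol{\beta}, \boldsymbol{\theta}_y]\,[\boldsymbol{\theta}]\,[\boldsymbol{\beta}], \end{aligned} $$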

where θ contains all of the variance and covariance parameters

from the data model and the process model.

Note that the first equality is based on the probability

decomposition and the second equality is based on writing

θ = {θε,θy} and assuming the priors on β and θ are independent.




Fully (Hierarchical) Bayesian Estimation:

Now, Bayes’ rule gives the posterior distribution

[Y,β,θ|z] ∝ [z|Y,θε][Y|β,θy ][θ][β],

where [z|Y,θε] is the data model and [Y|β,θy ] is the latent

process model. The prior distributions for the parameters [θ] and

[β] are then specified according to the specific choices one makes

for the error and process covariance functions.

In general, the normalizing constant required to fully specify the

posterior distribution is not available analytically and numerical

sampling methods (e.g., Markov chain Monte Carlo, MCMC) must

be used.

Advantage of the Bayesian hierarchical model (BHM) approach:

parameter uncertainty accounted for directly



Contour plot of the empirical covariance function for NOAA maximum

temperature data (top-left panel), empirical covariance function obtained

by taking the product of c(s)(‖h‖) and c(t)(|τ |) (top-right panel), fitted

covariance function using a spatio-temporal separable model (bottom-left

panel) and the fitted covariance function using an anisotropic covariance

model (bottom-right panel).



Descriptive Modeling: Part III (Basis Functions and Random Effects Parameterizations)


Spatio-Temporal Basis Function and Random Effects Models

Two problems:

• difficulty in working with large spatio-temporal covariance

matrices (e.g., C−1z ) in situations with large numbers of

prediction or observation locations

• realistic covariance structure

Possible solution:

Expand the spatio-temporal process in terms of basis functions and

associated random effects and take advantage of conditional

specifications that a hierarchical modeling framework allows.




One can consider classical linear mixed models from either a

conditional perspective, where one conditions the response on

random effects, or from a marginal perspective, where the random

effects have been averaged out.

Example: in a longitudinal-data-analysis setting, one might allow

there to be subject-specific intercepts or slopes corresponding to a

drug treatment effect over time. The variation in the subject

intercepts and slopes are specified by random effects.




Longitudinal study with two treatments and a control (Verdonck et al. 1998).

• subject level (conditional) inference

• treatment effect (marginal) inference

Key: shared random effects induce dependence!




As in traditional linear mixed models, consider:

z = Y + ε,

Y = Xβ + Φα,

where ε ∼ Gau(0,Cε), and α ∼ Gau(0,Cα).

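Alternatively, we can write this conditionally on the random effects:

$$ \mathbf{z} \mid \boldsymbol{\alpha} \sim Gau(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\Phi}\boldsymbol{\alpha}, \mathbf{C}_\varepsilon), $$

$$ \boldsymbol{\alpha} \sim Gau(\mathbf{0}, \mathbf{C}_\alpha). $$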

Integrating out the random effects (α) we get the marginal

representation:

z ∼ Gau(Xβ,ΦCαΦ′ + Cε).

Thus, averaging across the random effects induces marginal

dependence (e.g., ΦCαΦ′ + Cε versus just Cε).
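The way shared random effects induce marginal dependence can be seen in a tiny example. The sketch below uses a hypothetical Φ with an intercept column and a slope column (as in the longitudinal example), a diagonal C_α, and independent errors. All values are illustrative assumptions.

```python
import numpy as np

n = 6
x = np.linspace(0.0, 1.0, n)
Phi = np.column_stack([np.ones(n), x])     # hypothetical shared intercept and slope
C_alpha = np.diag([1.0, 0.5])              # independent random effects
C_eps = 0.2 * np.eye(n)                    # independent observation errors

# Conditional on alpha, the covariance of z is C_eps (diagonal).
# Marginally, integrating alpha out gives Phi C_alpha Phi' + C_eps.
C_marginal = Phi @ C_alpha @ Phi.T + C_eps

off_diag = C_marginal - np.diag(np.diag(C_marginal))
print(np.round(C_marginal, 3))
print("largest off-diagonal covariance:", np.abs(off_diag).max())
```

Even though both C_α and C_ε are diagonal here, the marginal covariance matrix is dense because every observation shares the same two random effects.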




In the context of spatio-temporal models, we can use this

conditional perspective to specify common random effects that

simplify modeling by either reducing the dimension, simplifying the

conditional dependence structure, or speeding up computation.

Note that in spatial and spatio-temporal cases, we sometimes

include an additional random effect, e.g.,

Y = Xβ + Φα+ ν,

where ν ∼ Gau(0,Cν) and Φ is a matrix of “basis functions”

(broadly speaking).

Before considering this model in more detail, we provide some

basis function intuition.




In recent years, basis-expansion representations have become quite

popular in functional data analysis, nonparametrics, data mining,

and spatial and spatio-temporal statistics. Why? Induced

sparseness, low-rank matrix inverses, non-stationarity,

multi-resolution behavior

In general, we can write

$$ Y(x) \approx \sum_{k=1}^{p} \phi_k(x)\, \alpha_k, $$

where $\{\phi_k(x), \alpha_k : k = 1, \ldots, p\}$ are basis functions and the associated random expansion coefficients (weights) over the support of $x$.

For a set of n locations in the domain of x , let Y = (Y (x1), . . . ,Y (xn))′; we

can write the expansion as: Y ≈ Φα.
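A basis expansion of this kind can be sketched in a few lines. The example assumes local Gaussian-kernel basis functions on a 1-D domain, with 21 equally spaced centers and independent standard normal weights. The function name `gaussian_basis`, the domain, and the kernel width are illustrative assumptions.

```python
import numpy as np

def gaussian_basis(x, centers, width):
    """n x p matrix Phi whose column k is a Gaussian kernel centered at centers[k]."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

x = np.linspace(0.0, 10.0, 101)            # n = 101 evaluation locations
centers = np.linspace(0.0, 10.0, 21)       # p = 21 local basis functions
Phi = gaussian_basis(x, centers, width=0.5)

rng = np.random.default_rng(42)
alpha = rng.standard_normal(centers.size)  # random expansion coefficients
Y = Phi @ alpha                            # one realization of Y = Phi alpha

print(Phi.shape, Y.shape)
```

Each weight scales the height of its kernel, and the realization is the sum of the scaled kernels. Drawing a new α gives a new curve from the same basis.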




Consider the case in 1-D where the basis functions are local

functions (“curves”; e.g., Gaussian kernels); in this case, the

random weights are referenced by x and give the height of the

curves.

Top panel: two individual components, $\phi_k(x)\alpha_k$; Bottom panel: sum of components, $\sum_{k=1}^{2} \phi_k(x)\alpha_k$.




Consider the same case of local basis functions but with 21 basis

functions:




Now consider the case in 1-D where the basis functions are

“global” curves; in this case, the random weights are not

referenced by x but give the importance of the 8 curves




Finally, consider two realizations of a process with 2-D basis

functions that are “global” surfaces




For modeling spatio-temporal processes we can consider three main

basis function approaches:

• spatio-temporal basis functions: the basis functions are

indexed in space and time and the random effects have no

specific location index

• temporal basis functions: the basis functions are indexed by

time and the random effects are indexed by space

• spatial basis functions: the basis functions are indexed by

space and the random effects are indexed by time




Spatio-Temporal Basis Functions:

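Rewrite the process model in terms of fixed and random effects ($\boldsymbol{\beta}$ and $\{\alpha_i : i = 1, \ldots, n_\alpha\}$, respectively):

$$ Y(\mathbf{s}; t) = \mathbf{x}(\mathbf{s}; t)' \boldsymbol{\beta} + \sum_{i=1}^{n_\alpha} \phi_i(\mathbf{s}; t)\, \alpha_i + \nu(\mathbf{s}; t), $$

where $\{\phi_i(\mathbf{s}; t) : i = 1, \ldots, n_\alpha\}$ are (typically) specified basis functions corresponding to location $(\mathbf{s}; t)$, $\{\alpha_i\}$ are random effects, and $\nu(\mathbf{s}; t)$ is sometimes needed to represent left-over small-scale spatio-temporal random effects.

Let $\boldsymbol{\alpha} \sim Gau(\mathbf{0}, \mathbf{C}_\alpha)$, where $\boldsymbol{\alpha} \equiv (\alpha_1, \ldots, \alpha_{n_\alpha})'$.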




Note, we are interested in the Y -process at ny spatio-temporal

locations, which we denote by the vector Y.

The process model then becomes

Y = Xβ + Φα+ ν,

where the ith column of the ny × nα matrix Φ corresponds to the

ith basis function, φi (·; ·), at all of the spatio-temporal locations

ordered as given in Y, and the vector ν also corresponds to the

spatio-temporal ordering given in Y such that ν ∼ Gau(0,Cν).




The marginal distribution of Y is then

Y ∼ Gau(Xβ,ΦCαΦ′ + Cν),

so that CY = ΦCαΦ′ + Cν .

In this case, the spatio-temporal dependence is primarily accounted

for by the spatio-temporal basis functions, Φ, and the random

effects covariance matrix Cα; in general, this could accommodate

non-separable dependence.

The benefits of this approach are that the modeling effort then

focuses on the random effects, α (which are typically of relatively

low dimension and do not have spatial or temporal dependence).




Low Rank Representation:

In situations where nα is much smaller than the dimension of Y

(i.e., a low-rank representation), then obvious benefits come from

being able to perform matrix inverses in terms of nα-dimensional

matrices (through well-known matrix algebra relationships).

Technical Note:

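Specifically, let $\mathbf{V} \equiv \mathbf{C}_\nu + \mathbf{C}_\varepsilon$; then $\mathbf{C}_z$ in the spatio-temporal kriging formulas above can be written $\mathbf{C}_z = \boldsymbol{\Phi} \mathbf{C}_\alpha \boldsymbol{\Phi}' + \mathbf{V}$. Thus, using the well-known Sherman-Morrison-Woodbury matrix identities,

$$ \mathbf{C}_z^{-1} = \mathbf{V}^{-1} - \mathbf{V}^{-1} \boldsymbol{\Phi} (\boldsymbol{\Phi}' \mathbf{V}^{-1} \boldsymbol{\Phi} + \mathbf{C}_\alpha^{-1})^{-1} \boldsymbol{\Phi}' \mathbf{V}^{-1}. $$

If $\mathbf{V}$ and $\mathbf{C}_\alpha$ (or $\mathbf{C}_\alpha^{-1}$) have simple structure (i.e., sparse) and $n_\alpha \ll n_y$, then this inverse is easy to calculate.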
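The Sherman-Morrison-Woodbury identity can be verified numerically on a small problem. The sketch below assumes a diagonal V and a diagonal C_α with arbitrary positive entries and a random Φ. The sizes and values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(7)
n_y, n_alpha = 50, 5                              # n_alpha much smaller than n_y

Phi = rng.standard_normal((n_y, n_alpha))         # hypothetical basis matrix
C_alpha = np.diag(rng.uniform(0.5, 2.0, n_alpha)) # random-effects covariance
v = rng.uniform(0.5, 1.5, n_y)
V = np.diag(v)                                    # V = C_nu + C_eps, diagonal here
V_inv = np.diag(1.0 / v)                          # trivial inverse of a diagonal matrix

# Only an n_alpha x n_alpha system needs to be solved.
inner = Phi.T @ V_inv @ Phi + np.linalg.inv(C_alpha)
C_z_inv_smw = V_inv - V_inv @ Phi @ np.linalg.solve(inner, Phi.T @ V_inv)

# Brute-force inverse of the full n_y x n_y matrix for comparison.
C_z = Phi @ C_alpha @ Phi.T + V
C_z_inv_direct = np.linalg.inv(C_z)

print(np.allclose(C_z_inv_smw, C_z_inv_direct))
```

The expensive step shrinks from an n_y × n_y inverse to an n_α × n_α solve, which is the source of the computational benefit in low-rank models.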




It is important to note that even in the full-rank (nα equal to the

dimension of Y) and over-complete (nα larger than the dimension

of Y) cases, there can still be dramatic computational benefits

through induced sparsity in Cα and efficient matrix multiplication

routines that utilize multi-resolution algorithms.

In addition, such implementations typically assume that ν = 0, and the basis functions are often orthonormal, so that Φ Φ′ = I, which simplifies calculation significantly.

Finally, we note that specific basis functions and methodologies

take advantage of other properties of the various matrices; e.g.,

sparse structure on the random effects covariance matrix, Cα, or

the random effects precision matrix, C−1α .




Predictions of NOAA maximum temperature for six days spanning the temporal window of the data, 1 July 1993 to 20 July 1993, using spatio-temporal bi-square basis functions implemented in the FRK (fixed rank kriging) R package.



Prediction standard errors of NOAA maximum temperature for six days spanning the temporal window of the data, 1 July 1993 to 20 July 1993, using spatio-temporal bi-square basis functions implemented in the FRK (fixed rank kriging) R package.



Random Effects with Temporal Basis Functions:

We can also express the spatio-temporal random process in terms

of temporal basis functions and spatially-indexed random effects

$$Y(s;t) = x(s;t)'\beta + \sum_{i=1}^{n_\alpha} \phi_i(t)\,\alpha_i(s) + \nu(s;t),$$

where $\{\phi_i(t) : i = 1, \ldots, n_\alpha;\ t \in D_t\}$ are temporal basis functions and $\alpha_i(s)$ are spatially-indexed random effects.

In this case, one could model $\{\alpha_i(s)\}$ using the methods from spatial statistics. That is, one could use either $n_\alpha$ separate spatial processes or an $n_\alpha$-dimensional multivariate spatial process (see Cressie and Wikle, 2011, Ch. 4).




NOAA maximum temperature data temporal basis function example. Top panel: Basis functions $\phi_1(t)$, $\phi_2(t)$, and $\phi_3(t)$, where the first basis function is a constant and the latter two were obtained from the temporal EOFs of the data matrix. Bottom panels: $E(\alpha_1(s) \mid z)$, $E(\alpha_2(s) \mid z)$, and $E(\alpha_3(s) \mid z)$. Note: modeling and prediction using temporal basis functions can be carried out with the SpatioTemporal R package.
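A rough sketch of how such a temporal basis might be built is given below. It uses synthetic data rather than the NOAA data, takes the temporal EOFs to be the leading left singular vectors of a centered time-by-space data matrix, and estimates the coefficients by ordinary least squares at each location. That last step is a simplification: it is not the conditional expectation under a spatial model, and it is not the SpatioTemporal package's procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
T, m = 20, 30                                   # hypothetical: 20 time points, 30 stations

# Synthetic stand-in for a space-time data matrix (rows = times, columns = locations).
t = np.arange(T)
Z = (25.0
     + np.outer(np.sin(2 * np.pi * t / T), rng.standard_normal(m))
     + np.outer(t / T, rng.standard_normal(m))
     + 0.1 * rng.standard_normal((T, m)))

# Remove the temporal mean at each location, then take the SVD.
Z_centered = Z - Z.mean(axis=0, keepdims=True)
U, svals, Vt = np.linalg.svd(Z_centered, full_matrices=False)

# Temporal basis: a constant plus the two leading temporal EOFs (left singular vectors).
Phi_t = np.column_stack([np.ones(T), U[:, 0], U[:, 1]])     # T x 3

# Least-squares coefficients alpha_i(s_j), one column per location j.
alpha_hat, *_ = np.linalg.lstsq(Phi_t, Z, rcond=None)       # 3 x m

Z_fit = Phi_t @ alpha_hat
rel_err = np.linalg.norm(Z - Z_fit) / np.linalg.norm(Z)
print(alpha_hat.shape, rel_err)
```

Each row of `alpha_hat` is a spatial field of coefficients, which is the quantity one would then model with spatial methods.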




Random Effects with Spatial Basis Functions

We can also express the spatio-temporal random process in terms

of spatial basis functions and temporally-indexed random effects

$$Y(s;t) = x(s;t)'\beta + \sum_{i=1}^{n_\alpha} \phi_i(s)\,\alpha_t(i) + \nu(s;t),$$

where $\{\phi_i(s) : i = 1, \ldots, n_\alpha;\ s \in D_s\}$ are spatial basis functions and $\alpha_t(i)$ are temporal random effects.

The focus is then on modeling the temporal random-effects vectors, $\alpha_t \equiv (\alpha_t(1), \ldots, \alpha_t(n_\alpha))'$.




If the αt vectors are independent in time, then the marginal

dependence structure of {Y (s; t)} is typically unrealistic.

Interesting spatio-temporal dependence arises when these effects

are dependent; such models are simplified by assuming conditional

temporal dependence (dynamics!), as we will see in the next

section when we talk about dynamical representations of

spatio-temporal processes.
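A short covariance calculation makes the point concrete. It is a sketch under assumptions not stated on the slide: $\nu(s;t)$ is taken to be uncorrelated with $\alpha_t$, and $\phi(s) \equiv (\phi_1(s), \ldots, \phi_{n_\alpha}(s))'$ is shorthand introduced here.

$$\mathrm{Cov}\big(Y(s;t),\, Y(s';t')\big) = \phi(s)'\,\mathrm{Cov}(\alpha_t, \alpha_{t'})\,\phi(s') + \mathrm{Cov}\big(\nu(s;t),\, \nu(s';t')\big).$$

If the $\alpha_t$ are independent in time, then $\mathrm{Cov}(\alpha_t, \alpha_{t'}) = 0$ for $t \neq t'$, so all temporal dependence in $Y$ must come from $\nu$. If instead, purely as an illustration, the coefficients follow a stationary first-order vector autoregression $\alpha_t = M\alpha_{t-1} + \eta_t$ with marginal covariance $C_\alpha$, then for lag $\tau \geq 0$

$$\mathrm{Cov}(\alpha_t, \alpha_{t-\tau}) = M^{\tau} C_\alpha,$$

and the first term above is nonzero across time, with the form of the space-time interaction governed by $M$.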




Basis Functions:

• Many choices for basis functions: e.g., orthogonal

polynomials, wavelets, splines, Wendland, Galerkin (finite

element), EOFs, bi-squares, discrete kernel convolutions,

“factor” loadings, “predictive processes”, Moran’s I bases, etc.

• Basis function decisions:

• Fixed or “estimated” (parameterized)

• reduced rank (nα ≪ n), complete (nα = n), or overcomplete (nα > n)

• expansion coefficients in physical space or not

• discrete or continuous space

• stationary or non-stationary

• truncation error

• what kind of distribution to place on the random effects α: iid, or dependent via a covariance matrix or precision matrix
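As one concrete instance from the list above, a bisquare basis function centered at $c$ with radius $r$ is commonly written as $\phi(s) = \{1 - (\lVert s - c \rVert / r)^2\}^2$ for $\lVert s - c \rVert \leq r$ and zero otherwise. The sketch below evaluates a small, fixed, reduced-rank set of such functions; the locations, the grid of centers, and the radius are hypothetical choices, and the helper `bisquare_basis` is introduced here for illustration (it is not an FRK function).

```python
import numpy as np

def bisquare_basis(locs, centers, radius):
    """Evaluate bisquare basis functions at a set of locations.

    locs:    (n, d) array of locations
    centers: (n_alpha, d) array of basis-function centers
    radius:  scalar; each function is exactly zero beyond this distance
    Returns an (n, n_alpha) matrix Phi.
    """
    dist = np.linalg.norm(locs[:, None, :] - centers[None, :, :], axis=2)
    return np.where(dist <= radius, (1.0 - (dist / radius) ** 2) ** 2, 0.0)

# Hypothetical setup: 50 random locations on the unit square, a 3 x 3 grid of centers.
rng = np.random.default_rng(2)
locs = rng.uniform(0.0, 1.0, size=(50, 2))
grid = np.linspace(0.0, 1.0, 3)
centers = np.array([(cx, cy) for cx in grid for cy in grid])

Phi = bisquare_basis(locs, centers, radius=0.6)
print(Phi.shape)                   # (50, 9): reduced rank, n_alpha = 9 < n = 50
print(float((Phi == 0).mean()))    # fraction of exact zeros from the compact support
```

The compact support is what produces exact zeros in Φ, which is one way sparsity enters these computations.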




There is very little guidance on which bases or representation to

select!

• people have their favorites and there is a lot of dogma

• practical experience and recent research suggest that in most cases with just spatial data, it probably doesn’t matter too much (Bradley et al. 2016)

• in dynamical spatio-temporal model settings, the mechanistic

process can sometimes suggest an appropriate basis set




Confounding:

It is important to mention potential confounding that can occur

between the fixed effects and random effects in dependence

models. This can be particularly important when one is interested

in inference on the β (fixed effects) parameters.

Consider the general mixed-effects representation

Y = Xβ + Φα + ν, ν ∼ N(0, Cν).

Although the columns of Φ are basis functions, they are indexed in space and time just as the columns of X are.




Confounding:

It is quite possible then that, depending on the structure of the columns in these two matrices, the random effects can be confounded with the fixed effects, in much the same way that extreme collinearity can affect the estimation of fixed effects in traditional regression.

As with collinearity, if the columns of Φ and X are linearly

independent, then there is no worry from confounding.

This has led to mitigation strategies that tend to restrict the

random effects by selecting basis functions in Φ that are

orthogonal to the column space of X (or, approximately so); see

Hanks et al. (2015) for an overview.
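One simple way to obtain basis functions orthogonal to the column space of X is to project Φ onto the orthogonal complement of that space. The sketch below illustrates the general idea only; it is not claimed to be the specific procedure reviewed by Hanks et al. (2015), and the dimensions and simulated matrices are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, n_alpha = 100, 3, 6                      # hypothetical dimensions

# Fixed-effects design matrix and a basis matrix deliberately correlated with it.
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
Phi = rng.standard_normal((n, n_alpha)) + X[:, [1]]

# Project Phi onto the orthogonal complement of the column space of X.
Q, _ = np.linalg.qr(X)                         # orthonormal basis for col(X)
Phi_restricted = Phi - Q @ (Q.T @ Phi)         # (I - P_X) Phi

cross_before = np.max(np.abs(X.T @ Phi))
cross_after = np.max(np.abs(X.T @ Phi_restricted))
print(cross_before, cross_after)               # large before, round-off level after
```

Restricting the random effects in this way changes the interpretation of β, because any variation in the direction of X is now attributed to the fixed effects.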



Descriptive Modeling: Part IV

(Non-Gaussian Data Models with

Latent Gaussian Processes)


Non-Gaussian Data and Latent Gaussian Processes

The discussion presented so far has assumed Gaussian error and

random-effects distributions.

Many spatio-temporal problems of interest deal with distinctly

non-Gaussian data (e.g., counts, binary responses, extreme-value

distributions).

One of the most useful aspects of the hierarchical modeling paradigm is that it allows one to accommodate non-Gaussian data models fairly easily, so long as the observations are conditionally independent (e.g., as with Generalized Linear Mixed Models (GLMMs) and Generalized Additive Models (GAMs)).




As an example, consider the data model

$$z(s;t) \mid Y(s;t), \gamma \;\sim\; \text{ind. } EF(Y(s;t), \gamma),$$

where EF corresponds to a distribution from the exponential family with scale parameter $\gamma$ and mean $Y(s;t)$.

Then, we consider a transformation of the mean response modeled in terms of fixed and random effects,

$$g(Y(s;t)) = x(s;t)'\beta + \eta(s;t), \qquad (2)$$

where $g(\cdot)$ is a specified monotonic link function, $x(s;t)$ is a $p$-dimensional vector of covariates for spatial location $s$ and time $t$, and $\eta(s;t)$ is a spatio-temporal GP.



Note that the latent spatio-temporal GP can be modeled in terms

of spatio-temporal covariances, basis function expansions (e.g.,

GAMs), or dynamic spatio-temporal processes (as we will discuss

in the next part of the course).

The same modeling issues associated with this latent Gaussian

spatio-temporal process are present here as with the Gaussian data

case, but estimation is typically more complicated given the

non-Gaussian data model.




As an illustration, a simple model involving spatio-temporal count data could be represented by

$$z_t \mid Y_t \;\sim\; \text{ind. } \mathrm{Poi}(Y_t),$$

$$\log(Y_t) = X_t\beta + \Phi_t\alpha + \nu_t,$$

where $z_t$ is an $m_t \times 1$ data vector of counts at $m_t$ spatial locations, $Y_t$ represents the latent spatio-temporal mean process at $m_t$ locations, $\Phi_t$ is an $m_t \times n_\alpha$ matrix of spatio-temporal basis functions, and the associated random coefficients are modeled as $\alpha \sim \mathrm{Gau}(0, C_\alpha)$, with small-scale error term $\nu_t \sim \mathrm{Gau}(0, \sigma_\nu^2 I)$.
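To make the two-level structure concrete, the sketch below simulates counts from this model at a single time point. Every numerical value (dimensions, β, the covariance of α, the small-scale standard deviation) and the random stand-in for the basis matrix are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
m_t, n_alpha = 40, 5                           # hypothetical sizes for a single time t

X_t = np.column_stack([np.ones(m_t), rng.uniform(-1.0, 1.0, m_t)])   # covariates
beta = np.array([1.0, 0.5])                                          # fixed effects (made up)
Phi_t = rng.uniform(0.0, 1.0, size=(m_t, n_alpha))                   # stand-in basis matrix

C_alpha = 0.3 * np.eye(n_alpha)                # random-effects covariance (diagonal here)
sigma_nu = 0.1                                 # small-scale standard deviation

# Process model: latent Gaussian structure on the log scale.
alpha = rng.multivariate_normal(np.zeros(n_alpha), C_alpha)
nu_t = sigma_nu * rng.standard_normal(m_t)
log_Y_t = X_t @ beta + Phi_t @ alpha + nu_t

# Data model: conditionally independent Poisson counts given the latent mean.
Y_t = np.exp(log_Y_t)                          # positive by construction
z_t = rng.poisson(Y_t)

print(z_t[:10])
```

The log link keeps the Poisson mean positive, while all of the spatio-temporal dependence sits in the latent Gaussian terms.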



Non-Gaussian Data and Latent Gaussian Processes:

Estimation

Implicit in the estimation associated with the linear Gaussian

spatio-temporal model discussed previously is that the covariance

and fixed effects parameters can be estimated more easily when we

marginalize out the latent Gaussian spatio-temporal process.

More generally, we represent this by integrating out the random effects,

$$[z \mid \theta, \beta] = \int [z \mid Y, \theta]\,[Y \mid \theta, \beta]\, dY.$$

For linear Gaussian mixed models, this can be done analytically;

this is not true for non-Gaussian data models.
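For the linear Gaussian case, the closed form can be written out explicitly. As a sketch, assume for simplicity that the data are observed at the same locations and times as the process, that the trend is $X\beta$, and write $n$ for the length of $z$ (a symbol introduced here). With $C_z$ the marginal covariance matrix of the data, which depends on $\theta$,

$$z \mid \theta, \beta \;\sim\; \mathrm{Gau}(X\beta,\; C_z),$$

$$\log\,[z \mid \theta, \beta] = -\tfrac{n}{2}\log(2\pi) - \tfrac{1}{2}\log\lvert C_z \rvert - \tfrac{1}{2}(z - X\beta)'\, C_z^{-1}\,(z - X\beta).$$

No such expression is available when the data model is, for example, Poisson, because the product of a Poisson likelihood and a Gaussian density does not integrate to a standard form.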




Estimation

The integral in the marginal likelihood given above can be solved

numerically in principle, in which case one can estimate the

relatively few fixed effects and covariance parameters (θ,β)

through usual numerical methods (e.g., Newton-Raphson).

However, this is complicated in practice for spatio-temporal models by the high dimensionality of the integral (e.g., recall that $Y$ is a $\left(\sum_{t=1}^{T} m_t\right)$-dimensional vector).

Traditional approaches to this problem are facilitated by the usual

conditional independence assumption in the data model and by

exploiting the latent Gaussian nature of the random effects.
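In one dimension the numerical integration is easy, which helps show what becomes hard in high dimensions. The sketch below approximates the marginal likelihood of a single Poisson count whose log-mean is $\mu + \eta$ with $\eta \sim \mathrm{Gau}(0, \sigma^2)$, using Gauss-Hermite quadrature, and compares it with a brute-force sum on a fine grid. The values of $z$, $\mu$, and $\sigma$ are made up, and the function names are introduced here for illustration.

```python
import math
import numpy as np

def poisson_pmf(z, mean):
    """Poisson probability mass function, evaluated elementwise in `mean`."""
    return np.exp(z * np.log(mean) - mean - math.lgamma(z + 1))

def marginal_lik_gh(z, mu, sigma, n_nodes=40):
    """Gauss-Hermite approximation of int Poi(z | exp(mu + eta)) N(eta | 0, sigma^2) d eta."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)   # nodes and weights for weight exp(-x^2)
    eta = np.sqrt(2.0) * sigma * x                    # change of variables to the N(0, sigma^2) scale
    return float(np.sum(w * poisson_pmf(z, np.exp(mu + eta))) / np.sqrt(np.pi))

def marginal_lik_grid(z, mu, sigma, n_grid=20001, half_width=10.0):
    """Brute-force Riemann-sum check on a fine grid of eta values."""
    eta = np.linspace(-half_width * sigma, half_width * sigma, n_grid)
    dens = np.exp(-0.5 * (eta / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    step = eta[1] - eta[0]
    return float(np.sum(poisson_pmf(z, np.exp(mu + eta)) * dens) * step)

z, mu, sigma = 3, 1.0, 0.5                            # made-up values
lik_gh = marginal_lik_gh(z, mu, sigma)
lik_grid = marginal_lik_grid(z, mu, sigma)
print(lik_gh, lik_grid)
```

A tensor-product version of this rule would need `n_nodes` raised to the dimension of the integral, which is why direct quadrature is not feasible when the latent vector has one element per observation.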




Estimation

Estimation approaches include methods such as quasi-likelihood,

generalized estimating equations, pseudo-likelihood, penalized

quasi-likelihood and Bayesian methods.

Although these methods can be used successfully in the spatio-temporal context (e.g., the use of penalized quasi-likelihood (PQL) in estimating GAMs), there has been much more focus in the literature on Bayesian estimation approaches for spatio-temporal models (e.g., Bayesian hierarchical models, Integrated Nested Laplace Approximations (INLA)).

Regardless, some type of approximation is needed for all of these methods (approximating integrals, approximating the models, approximating the posterior distribution). [For examples, see the mgcv and INLA R packages.]


Non-Gaussian Data and Latent Gaussian Processes: Example

Log intensity of Carolina Wren BBS counts in Missouri on a grid for t =

1 (1994) and t = 21 (2014). The log of the observed count is shown in

circles using the same color scale. This was fit using a Poisson data

model and GAM basis functions in the mgcv R package.



Non-Gaussian Data and Latent Gaussian Processes: Example

Standard error of log intensity of Carolina Wren BBS counts in Missouri

on a grid for t = 1 (1994) and t = 21 (2014). The log of the observed

count is shown in circles using the same color scale. This was fit using a

Poisson data model and GAM basis functions in the mgcv R package.



Recap

• We focused here on so-called “descriptive” methods for spatio-temporal prediction; that is, methods that are concerned with modeling first-order and second-order dependencies, with the primary goal of performing prediction in space and time (although parameter inference is covered in this context as well).

• A big challenge with these methods is the realistic

specification of valid spatio-temporal covariance functions.



Recap

• Another challenge is related to the curse-of-dimensionality in

these models.

• Both the dimensionality challenge and the realism challenge

necessitate the consideration of random effects that are

random coefficients in a basis expansion.

• We also considered how it is fairly straightforward to extend these modeling approaches to non-Gaussian data models, so long as the data are independent conditional upon a latent Gaussian spatio-temporal process (although computation/estimation is more difficult).



Next

• One of the most challenging aspects of characterizing the

spatio-temporal dependence structure, either in the

covariance-based perspective or the basis function perspective,

is the ability to model realistically the interactions that occur

across time and space in most real-world processes.

• That is, most processes are best described by spatial fields that change through time according to “rules” that govern their behavior. In other words, they represent some stochastic dynamical system.

• As we will see in the next part of the course, spatio-temporal models that explicitly account for these dynamics are generally more realistic, are better for forecasting, and can simplify model construction and estimation through conditioning.
