Journal of Statistical Planning and Inference 137 (2007) 567–588. www.elsevier.com/locate/jspi

Estimation of parameterized spatio-temporal dynamic models

Ke Xu, Christopher K. Wikle∗

Department of Statistics, University of Missouri-Columbia, 146 Math Science Building, Columbia, MO 65211, USA

Received 22 July 2004; received in revised form 13 December 2005; accepted 21 December 2005. Available online 13 March 2006.

Abstract

Spatio-temporal processes are often high-dimensional, exhibiting complicated variability across space and time. Traditional state-space model approaches to such processes in the presence of uncertain data have been shown to be useful. However, estimation of state-space models in this context is often problematic since parameter vectors and matrices are of high dimension and can have complicated dependence structures. We propose a spatio-temporal dynamic model formulation with parameter matrices restricted based on prior scientific knowledge and/or common spatial models. Estimation is carried out via the expectation–maximization (EM) algorithm or general EM algorithm. Several parameterization strategies are proposed and analytical or computational closed form EM update equations are derived for each. We apply the methodology to a model based on an advection–diffusion partial differential equation in a simulation study and also to a dimension-reduced model for a Palmer Drought Severity Index (PDSI) data set.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Dynamic; EM algorithm; General EM; State-space; Time series; Spatial; Spatio-temporal

1. Introduction

Spatio-temporal statistical models are essential tools for performing inference and prediction for processes in the physical, environmental, and biological sciences. Such processes are often complicated in that the dependence structure across space and time is non-trivial, often non-separable and non-stationary in space or time. In addition, it is often the case that the number of spatial locations at which inference is desired is quite large. Furthermore, data are often collected with substantial observational uncertainty and it is not uncommon to have missing observations at various spatial and temporal locations.

Various approaches have been proposed to model spatio-temporal processes (e.g., see Kyriakidis and Journel, 1999 for a review). If one considers time as an extra dimension, then traditional spatial statistics techniques can be applied (Cressie, 1993). However, such approaches ignore the fundamental differences between space and time, principally that time is naturally ordered and space is not. Alternatively, one can consider the spatio-temporal problem from a multivariate geostatistical perspective which requires space-time covariance functions be specified. Traditionally this approach has been limited in that the known class of valid spatio-temporal covariance functions is quite small, although

∗ Corresponding author. Tel.: +1 573 882 9659; fax: +1 573 884 5524. E-mail address: [email protected] (C.K. Wikle).

0378-3758/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.jspi.2005.12.005


in recent years, several authors have extended this class of functions (e.g., Cressie and Huang, 1999; Gneiting, 2002; Stein, 2005). Nevertheless, this approach is still limited by the fact that such covariance functions are often not realistic for complicated dynamical processes and dimensionality can prohibit practical implementation.

Spatio-temporal processes can also be considered from the multiple time series perspective (e.g., see Kyriakidis and Journel, 1999 for a review). That is, each spatial location is associated with a time series. Then, multivariate time series techniques can be transferred to the space-time problem. However, such approaches ignore the fundamental differences between space and time and one's ability to predict at locations for which data were not observed is limited. Such approaches do not in general explicitly account for uncertainty in the observed data. Perhaps more critically, such methods are difficult to implement in cases where the dimensionality of the state vector (i.e., the number of spatial locations) is high.

A natural approach to spatio-temporal modeling for complex dynamical processes is a combination of spatial and time series techniques, which is accomplished by a spatio-temporal dynamic model formulation (e.g., see Cressie and Wikle, 2002 for a brief review). However, estimation in this context can be problematic due to the high dimensionality of the state process. Several modeling strategies have been proposed to address this problem. One approach is to reduce dimensionality by projecting the state-process on some set of spectral basis functions (e.g., Mardia et al., 1998; Wikle and Cressie, 1999). Alternatively, one might specify very simple, random walk dynamics (e.g., Stroud et al., 2001; Huerta et al., 2004). Another approach is to incorporate physical or biological models directly into the parameterization (Wikle et al., 2001; Wikle, 2003). Even in the case of physically or biologically motivated dynamic models, it is seldom the case for statistical problems (unlike some engineering problems) that we know explicitly the model parameters. These must be estimated, but in the presence of known constraints on the dynamical formulation. Furthermore, due to the high dimensionality, measurement error and process covariance matrices typically have too many parameters to estimate outright. Thus, these matrices must be parameterized as well.

Estimation in the spatio-temporal dynamical model setting is best accomplished through a state-space framework. Given parameters, the unobserved state-process can be estimated via the Kalman filter or Kalman smoother (e.g., see Cressie and Wikle, 2002 for a brief review of spatio-temporal Kalman filter implementations). However, in the more usual setting where model parameters are unknown, the standard approach following Shumway and Stoffer (1982) is to use the expectation–maximization (EM) algorithm to estimate parameters. As mentioned above, the spatio-temporal problem typically requires restrictions on the parameter matrices. Shumway and Stoffer (1982) discuss modifications to their algorithm to accommodate fully restricted parameter matrices. However, it is not clear how they account for partially restricted or parameterized model matrices in this framework. Examination of Shumway (1988, pp. 323–332) implies that one approach to deal with partially restricted parameter matrices is to set initial parameters (in the EM algorithm) to agree with the known values. Then in the M-step, only those parameters that require estimation are updated, so that the fixed parameters do not change. Alternatively, one can update all parameters but then immediately impute the known values for the fixed parameters. Although these approaches are relatively easy to implement, it is not clear that they give the maximum likelihood estimates under the state-space model assumptions. Another approach, considered here, is to develop general EM (GEM) algorithms to account directly for the restricted or partially restricted model matrices.

In this paper we describe efficient estimation approaches for spatio-temporal dynamic models in which the parameter matrices and/or noise covariance matrices are highly parameterized (or restricted). We utilize GEM algorithms to carry out this estimation. In Section 2, we give some necessary background for spatio-temporal dynamic models and GEM algorithms. In Sections 3–5, we propose several methods of parameterization and derive the EM update formula for each. In Section 6, we consider two examples. Finally, Section 7 contains a brief summary and conclusion.

2. Background

2.1. Spatio-temporal dynamic model formulation

Let $\mathbf{z}_t = (z(\mathbf{s}_1; t), \ldots, z(\mathbf{s}_{m_t}; t))'$ be an $m_t \times 1$ vector containing the data values at $m_t$ spatial locations, $\mathbf{s}_i$, at time $t$. Let $\mathbf{y}_t = (y(\mathbf{s}_1; t), \ldots, y(\mathbf{s}_n; t))'$ be an $n \times 1$ vector for an unobservable spatio-temporal state process at some fixed network of locations $\mathbf{s}_1, \ldots, \mathbf{s}_n$ at time $t$. This state process is our primary interest. The two sets of spatial locations, $\mathbf{s}_i \in S$, where $S$ is some domain in $\mathbb{R}^d$, need not be the same. Write

$$\mathbf{z}_t = \mathbf{K}_t \mathbf{y}_t + \boldsymbol{\varepsilon}_t, \qquad (1a)$$

$$\mathbf{y}_t = \mathbf{H}\mathbf{y}_{t-1} + \boldsymbol{\eta}_t, \qquad (1b)$$

for $t = 1, \ldots, T$, where (1a) is called the measurement equation and (1b) the state equation. Let $\mathbf{K}_t$ be a known $m_t \times n$ matrix that maps the data $\mathbf{z}_t$ to the process $\mathbf{y}_t$. The measurement noise $\boldsymbol{\varepsilon}_t$ is zero-mean, uncorrelated in time and Gaussian with $m_t \times m_t$ covariance matrix $\mathbf{R}_t$. The dynamics are described in the state equation (1b) via a first-order Markov process with the transition or propagator matrix $\mathbf{H}$. We also assume there are shocks $\boldsymbol{\eta}_t$ to the system, which are spatially colored, temporally white and Gaussian with mean zero and a common $n \times n$ covariance matrix $\mathbf{Q}$. For completeness, we assume the process starts with $\mathbf{y}_0$, which is a Gaussian spatial process with mean $\boldsymbol{\mu}_0$ and $n \times n$ covariance matrix $\boldsymbol{\Sigma}_0$. Such a model in the spatio-temporal context is not new, nor is it the most general. However, such models have received a considerable amount of attention in the environmental literature in recent years and have been shown to be quite useful (e.g., see the review in Cressie and Wikle, 2002).

The parameters for the model (1) are $\Theta = \{\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0, \mathbf{H}, \mathbf{Q}, \mathbf{R}_t\}$. The major challenge in fitting this model lies in the high dimensionality of most space-time applications. For example, a model for Pacific sea surface temperature (SST) might have $m_t = n = 2261$ (Berliner et al., 2000), which requires estimation of a $2261 \times 2261$ matrix $\mathbf{H}$. Often, researchers resort to Bayesian hierarchical approaches for dealing with this dimensionality problem by considering restrictions to $\Theta$, assigning priors and then using MCMC (Wikle et al., 1998). This paper shows that it is often possible to fit such models and estimate their parameters via the convenient Kalman and EM algorithms. Again, the key is to assign a structure to the model parameters. Though still limited for many problems, such a formulation is useful in many settings. For example, in the early stage of model building, one might consider such an implementation, since it may be fast and easy to implement relative to MCMC. In addition, there are situations where there is little scientific theory or previous empirical evidence to suggest prior parameterizations for a fully Bayesian model. In these cases, if the model is sufficiently parameterized, the KF/EM approach is a reasonable alternative.

2.2. Kalman filter and smoother

Suppose we know the value of all parameters, $\Theta$; then one can use a set of recursions known as the Kalman filter and Kalman smoother to obtain the conditional mean and covariance of the state variable, $\mathbf{y}_t$ (Kalman, 1960; Shumway and Stoffer, 1982; West and Harrison, 1997). These recursions are well-known but we present them here to define notation and for completeness. Our overview follows Shumway and Stoffer (2000) with various notational modifications. First, define the conditional mean $\mathbf{y}_t^s = E(\mathbf{y}_t \mid \mathbf{z}_1, \ldots, \mathbf{z}_s)$. In particular, $\mathbf{y}_t^{t-1}$, $\mathbf{y}_t^{t}$, and $\mathbf{y}_t^{T}$ are called the predicted, filtered and smoothed values, respectively. Also define the conditional variance-covariance matrix $\mathbf{P}_t^s = \mathrm{var}(\mathbf{y}_t \mid \mathbf{z}_1, \ldots, \mathbf{z}_s)$ and lag-one covariance matrix $\mathbf{P}_{t,t-1}^s = \mathrm{cov}(\mathbf{y}_t, \mathbf{y}_{t-1} \mid \mathbf{z}_1, \ldots, \mathbf{z}_s)$.

To get predicted and filtered values, one evaluates the following set of recursions for $t = 1, \ldots, T$, which is called the Kalman filter:

$$\mathbf{y}_t^{t-1} = \mathbf{H}\mathbf{y}_{t-1}^{t-1},$$

$$\mathbf{P}_t^{t-1} = \mathbf{H}\mathbf{P}_{t-1}^{t-1}\mathbf{H}' + \mathbf{Q},$$

$$\mathbf{G}_t = \mathbf{P}_t^{t-1}\mathbf{K}_t'(\mathbf{K}_t\mathbf{P}_t^{t-1}\mathbf{K}_t' + \mathbf{R}_t)^{-1},$$

$$\mathbf{y}_t^{t} = \mathbf{y}_t^{t-1} + \mathbf{G}_t(\mathbf{z}_t - \mathbf{K}_t\mathbf{y}_t^{t-1}),$$

$$\mathbf{P}_t^{t} = \mathbf{P}_t^{t-1} - \mathbf{G}_t\mathbf{K}_t\mathbf{P}_t^{t-1},$$

and where $\mathbf{y}_0^0$ and $\mathbf{P}_0^0$ are specified. To get smoothed values, one runs the following backward recursion for $t = T, T-1, \ldots, 1$, which is sometimes called the Kalman smoother:

$$\mathbf{J}_{t-1} = \mathbf{P}_{t-1}^{t-1}\mathbf{H}'(\mathbf{P}_t^{t-1})^{-1},$$

$$\mathbf{y}_{t-1}^{T} = \mathbf{y}_{t-1}^{t-1} + \mathbf{J}_{t-1}(\mathbf{y}_t^{T} - \mathbf{y}_t^{t-1}),$$

$$\mathbf{P}_{t-1}^{T} = \mathbf{P}_{t-1}^{t-1} + \mathbf{J}_{t-1}(\mathbf{P}_t^{T} - \mathbf{P}_t^{t-1})\mathbf{J}_{t-1}'.$$


To get the smoothed lag-one covariance, one runs the backward recursion for $t = T, T-1, \ldots, 2$ on

$$\mathbf{P}_{t-1,t-2}^{T} = \mathbf{P}_{t-1}^{t-1}\mathbf{J}_{t-2}' + \mathbf{J}_{t-1}(\mathbf{P}_{t,t-1}^{T} - \mathbf{H}\mathbf{P}_{t-1}^{t-1})\mathbf{J}_{t-2}',$$

where

$$\mathbf{P}_{T,T-1}^{T} = (\mathbf{I} - \mathbf{G}_T\mathbf{K}_T)\mathbf{H}\mathbf{P}_{T-1}^{T-1}.$$
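For concreteness, a minimal NumPy sketch of these recursions (the function and variable names are ours, not from the original; dense matrix inversion is used for clarity rather than numerical efficiency):

```python
import numpy as np

def kalman_filter_smoother(z, K, H, Q, R, y0, P0):
    """Kalman filter and smoother for the state-space model (1a)-(1b).

    z, K, R are length-T lists (z[t-1] is the data at time t, possibly of
    varying dimension m_t).  Returns smoothed means ys[t] = y_t^T,
    covariances Ps[t] = P_t^T (t = 0,...,T) and lag-one covariances
    Plag[t] = P_{t,t-1}^T (t = 1,...,T).
    """
    T, n = len(z), len(y0)
    yf = [y0] + [None] * T          # filtered means   y_t^t
    Pf = [P0] + [None] * T          # filtered covs    P_t^t
    yp = [None] * (T + 1)           # predicted means  y_t^{t-1}
    Pp = [None] * (T + 1)           # predicted covs   P_t^{t-1}
    G = [None] * (T + 1)            # Kalman gains
    for t in range(1, T + 1):
        yp[t] = H @ yf[t - 1]
        Pp[t] = H @ Pf[t - 1] @ H.T + Q
        Kt, Rt, zt = K[t - 1], R[t - 1], z[t - 1]
        G[t] = Pp[t] @ Kt.T @ np.linalg.inv(Kt @ Pp[t] @ Kt.T + Rt)
        yf[t] = yp[t] + G[t] @ (zt - Kt @ yp[t])
        Pf[t] = Pp[t] - G[t] @ Kt @ Pp[t]
    # backward (smoothing) recursion
    ys, Ps = yf[:], Pf[:]
    J = [None] * (T + 1)
    for t in range(T, 0, -1):
        J[t - 1] = Pf[t - 1] @ H.T @ np.linalg.inv(Pp[t])
        ys[t - 1] = yf[t - 1] + J[t - 1] @ (ys[t] - yp[t])
        Ps[t - 1] = Pf[t - 1] + J[t - 1] @ (Ps[t] - Pp[t]) @ J[t - 1].T
    # smoothed lag-one covariances
    Plag = [None] * (T + 1)
    Plag[T] = (np.eye(n) - G[T] @ K[T - 1]) @ H @ Pf[T - 1]
    for t in range(T, 1, -1):
        Plag[t - 1] = (Pf[t - 1] @ J[t - 2].T
                       + J[t - 1] @ (Plag[t] - H @ Pf[t - 1]) @ J[t - 2].T)
    return ys, Ps, Plag
```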

2.3. EM estimation

One can estimate the parameters $\Theta$ by the method of moments and then plug them in (1) to implement the Kalman filter (Wikle and Cressie, 1999). Alternatively, one can run the Kalman recursion and recognize that a byproduct of the Kalman algorithm is that the likelihood can be computed from the filtered values with little extra effort. That is, define an innovation $\boldsymbol{\varepsilon}_t$ and its covariance $\boldsymbol{\Sigma}_t$ as $\boldsymbol{\varepsilon}_t = \mathbf{z}_t - \mathbf{K}_t\mathbf{y}_t^{t-1}$ and $\boldsymbol{\Sigma}_t = \mathbf{K}_t\mathbf{P}_t^{t-1}\mathbf{K}_t' + \mathbf{R}_t$, respectively. Then, the log likelihood value up to a constant is simply (Shumway and Stoffer, 2000):

$$-2\log L_Z(\Theta) = \sum_{t=1}^{T}\log|\boldsymbol{\Sigma}_t(\Theta)| + \sum_{t=1}^{T}\boldsymbol{\varepsilon}_t(\Theta)'\boldsymbol{\Sigma}_t(\Theta)^{-1}\boldsymbol{\varepsilon}_t(\Theta). \qquad (2)$$

Thus, we might perform maximum likelihood estimation, either numerically (Gupta and Mehra, 1974) or by the EM algorithm (Shumway and Stoffer, 1982, 2000). In this paper we focus on the EM algorithm.

Consider $\{\mathbf{y}_0, \mathbf{y}_1, \ldots, \mathbf{y}_T, \mathbf{z}_1, \ldots, \mathbf{z}_T\}$ as the "complete data" and denote its likelihood $L_{Y,Z}$. An EM iteration consists of two steps: an E-step and an M-step. Given the current value of the parameters, $\Theta^{(j-1)}$, the E-step computes the expected value of the complete data likelihood, which is of the following form (for details see Shumway and Stoffer, 2000):

$$
\begin{aligned}
g(\Theta \mid \Theta^{(j-1)}) &\equiv -2\,E(\log L_{Y,Z} \mid \mathbf{z}_1, \ldots, \mathbf{z}_T; \Theta^{(j-1)}) \\
&\propto \log|\boldsymbol{\Sigma}_0| + \mathrm{tr}\{\boldsymbol{\Sigma}_0^{-1}[\mathbf{P}_0^T + (\mathbf{y}_0^T - \boldsymbol{\mu}_0)(\mathbf{y}_0^T - \boldsymbol{\mu}_0)']\} \\
&\quad + T\log|\mathbf{Q}| + \mathrm{tr}\{\mathbf{Q}^{-1}[\mathbf{S}_{11} - \mathbf{S}_{10}\mathbf{H}' - \mathbf{H}\mathbf{S}_{10}' + \mathbf{H}\mathbf{S}_{00}\mathbf{H}']\} \\
&\quad + \sum_{t=1}^{T}\log|\mathbf{R}_t| + \sum_{t=1}^{T}\mathrm{tr}\{\mathbf{R}_t^{-1}[(\mathbf{z}_t - \mathbf{K}_t\mathbf{y}_t^T)(\mathbf{z}_t - \mathbf{K}_t\mathbf{y}_t^T)' + \mathbf{K}_t\mathbf{P}_t^T\mathbf{K}_t']\}, \qquad (3)
\end{aligned}
$$

where $\mathbf{S}_{11} = \sum_{t=1}^{T}(\mathbf{y}_t^T\mathbf{y}_t^{T\prime} + \mathbf{P}_t^T)$, $\mathbf{S}_{10} = \sum_{t=1}^{T}(\mathbf{y}_t^T\mathbf{y}_{t-1}^{T\prime} + \mathbf{P}_{t,t-1}^T)$ and $\mathbf{S}_{00} = \sum_{t=1}^{T}(\mathbf{y}_{t-1}^T\mathbf{y}_{t-1}^{T\prime} + \mathbf{P}_{t-1}^T)$. Note $\mathbf{y}_t^T$, $\mathbf{P}_t^T$ and $\mathbf{P}_{t,t-1}^T$ depend on $\Theta^{(j-1)}$.

In the M-step, an update $\Theta^{(j)}$ is chosen such that $g(\Theta^{(j)} \mid \Theta^{(j-1)}) < g(\Theta^{(j-1)} \mid \Theta^{(j-1)})$. This will guarantee that the likelihood increases monotonically. When the likelihood function is bounded, the iterates will eventually converge to the MLE. If $\Theta^{(j)}$ is also the minimum of (3), we have the standard EM algorithm. Otherwise, the algorithm is known as General EM (GEM) (McLachlan and Krishnan, 1997).

In the case of $\mathbf{R}_t = \mathbf{R}$ there exists a closed form EM update formula for all parameters. Minimizing (3) with respect to the parameters yields the M-step update formulas for our model (Shumway and Stoffer, 1982):

$$\mathbf{H}^{(j)} = \mathbf{S}_{10}\mathbf{S}_{00}^{-1}, \qquad (4a)$$

$$\mathbf{Q}^{(j)} = T^{-1}(\mathbf{S}_{11} - \mathbf{S}_{10}\mathbf{S}_{00}^{-1}\mathbf{S}_{10}'), \qquad (4b)$$

$$\mathbf{R}^{(j)} = T^{-1}\mathbf{B}, \qquad (4c)$$

$$\boldsymbol{\mu}_0^{(j)} = \mathbf{y}_0^T, \qquad (4d)$$

where

$$\mathbf{B} = \sum_{t=1}^{T}[(\mathbf{z}_t - \mathbf{K}_t\mathbf{y}_t^T)(\mathbf{z}_t - \mathbf{K}_t\mathbf{y}_t^T)' + \mathbf{K}_t\mathbf{P}_t^T\mathbf{K}_t']. \qquad (5)$$
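A sketch of the corresponding sufficient statistics and the unrestricted M-step updates (4a)–(4d), again with our own naming and assuming the smoother output of the previous sketch; the update (4c) for $\mathbf{R}$ assumes a common data dimension $m_t = m$:

```python
import numpy as np

def em_mstep(z, K, ys, Ps, Plag):
    """Sufficient statistics S11, S10, S00 and unrestricted updates (4a)-(4d)."""
    T = len(z)
    S11 = sum(np.outer(ys[t], ys[t]) + Ps[t] for t in range(1, T + 1))
    S10 = sum(np.outer(ys[t], ys[t - 1]) + Plag[t] for t in range(1, T + 1))
    S00 = sum(np.outer(ys[t - 1], ys[t - 1]) + Ps[t - 1] for t in range(1, T + 1))
    H_new = S10 @ np.linalg.inv(S00)                                  # Eq. (4a)
    Q_new = (S11 - S10 @ np.linalg.inv(S00) @ S10.T) / T              # Eq. (4b)
    B = sum(np.outer(z[t - 1] - K[t - 1] @ ys[t],
                     z[t - 1] - K[t - 1] @ ys[t])
            + K[t - 1] @ Ps[t] @ K[t - 1].T for t in range(1, T + 1)) # Eq. (5)
    R_new = B / T                                                     # Eq. (4c), R_t = R
    mu0_new = ys[0]                                                   # Eq. (4d)
    return H_new, Q_new, R_new, mu0_new, (S11, S10, S00, B)
```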


Note that $\boldsymbol{\Sigma}_0$ is not updated, since $\boldsymbol{\mu}_0$ and $\boldsymbol{\Sigma}_0$ are essentially nuisance parameters and they cannot be estimated simultaneously (Shumway and Stoffer, 1982). We choose to update $\boldsymbol{\mu}_0$ rather than the covariance matrix $\boldsymbol{\Sigma}_0$ since in general we do not have enough data to justify estimating a covariance matrix (we have only one observation for the initial vector).

As mentioned previously, for our applications, the spatially indexed data vector, $\mathbf{z}_t$, is usually high dimensional. As a result, our parameters are often of high dimension as well. Hence, some form of dimension reduction is called for. One approach is to parameterize $\Theta$ by exploiting the special structure of the process. We propose several approaches for specifying realistic submodels for $\mathbf{R}_t$, $\mathbf{Q}$ and $\mathbf{H}$, thereby substantially easing the burden of estimation.

Our methods rely heavily on GEM algorithms, since we shall see later that in many cases parameters are not "separable", which means the joint "best" update $\Theta^{(j)}$, such as (4), is not available. It is also the case that sometimes the analytical closed form "best" update formula cannot be derived for some of the parameters. As a result, we often must settle for a "better" update, which only need ensure that the likelihood move monotonically. The price to be paid for this generality is that it takes more iterations to converge than what we would experience with the traditional types of EM estimation for state-space models. Two of the most useful GEM algorithms are described in the following section.

Although we do not explicitly describe algorithms for obtaining standard error estimates for $\Theta$, they can be computed in various ways. For example, it is sometimes possible to evaluate the Hessian matrix after convergence (Shumway and Stoffer, 2000). Alternatively, one may obtain estimates of the standard error by perturbing the likelihood function (2) and using numerical differentiation (e.g., Shumway and Stoffer, 2000; Tanner, 1996). However, although the likelihood-based parameter estimates are consistent and asymptotically normal, the asymptotics are often not applicable for the relatively small sample sizes one encounters in spatio-temporal applications. In that case, a bootstrap procedure is appropriate and is very simple to implement. For example, Stoffer and Wall (1991) describe a simple bootstrap sampling algorithm for parameter estimates in general state-space models that is appropriate for the spatio-temporal setting discussed here. In addition, Wall and Stoffer (2002) describe how bootstrap resampling in this context can also give appropriate estimates of conditional forecast accuracy.

2.4. Two general EM (GEM) algorithms

2.4.1. Expectation–conditional maximization (ECM) algorithm

An ECM algorithm consists of an expectation (E) step and conditional maximization (CM) steps (McLachlan and Krishnan, 1997). Sometimes the M-step update is difficult to obtain, so we replace the M-step with several simple CM-steps. As an example, suppose the parameter of interest consists of two parts, i.e., $\Theta = \{\Theta_1, \Theta_2\}$. An ECM algorithm updates the two sub-parameters sequentially or conditionally. That is, given the current value $\Theta^{(j-1)}$, we obtain the update $\Theta^{(j)}$ via two CM-steps subject to the conditionally maximizing requirement at each CM-step:

• CM-step 1: update $\Theta_1^{(j)}$ with $\Theta_2 = \Theta_2^{(j-1)}$ such that $g(\{\Theta_1^{(j)}, \Theta_2^{(j-1)}\}) < g(\{\Theta_1^{(j-1)}, \Theta_2^{(j-1)}\})$,
• CM-step 2: update $\Theta_2^{(j)}$ with $\Theta_1 = \Theta_1^{(j)}$ such that $g(\{\Theta_1^{(j)}, \Theta_2^{(j)}\}) < g(\{\Theta_1^{(j)}, \Theta_2^{(j-1)}\})$.

Note $g(\{\Theta_1^{(j)}, \Theta_2^{(j)}\}) < g(\{\Theta_1^{(j)}, \Theta_2^{(j-1)}\}) < g(\{\Theta_1^{(j-1)}, \Theta_2^{(j-1)}\})$. That is, the final update satisfies $g(\Theta^{(j)}) < g(\Theta^{(j-1)})$, so the likelihood value increases after the final update. Clearly ECM qualifies as a GEM algorithm. If there is need to further divide the parameters into additional parts, the update simply takes more CM-steps.

2.4.2. GEM based on one Newton–Raphson step

Next, consider $\Theta = [\boldsymbol{\Phi}, \theta]$, where $\theta$ is a scalar parameter. This is for notational ease and illustration, as the algorithm described below also works for vector parameters. If the first two derivatives of $g(\Theta)$ with respect to $\theta$ exist in closed form, we can use a procedure called "GEM based on one Newton–Raphson step" to update $\theta$ (McLachlan and Krishnan, 1997). The update $\theta^{(j)}$ has the form

$$\theta^{(j)} = \theta^{(j-1)} + a^{(j-1)}\delta^{(j-1)},$$

where $0 < a^{(j-1)} \le 1$ and

$$\delta^{(j-1)} = -\left[\frac{\partial^2 g(\Theta)}{\partial\theta^2}\right]^{-1}_{\theta=\theta^{(j-1)}}\left[\frac{\partial g(\Theta)}{\partial\theta}\right]_{\theta=\theta^{(j-1)}}.$$

For $a^{(j-1)}$ sufficiently small, this will guarantee that $g([\boldsymbol{\Phi}, \theta^{(j)}]) < g([\boldsymbol{\Phi}, \theta^{(j-1)}])$, so this procedure is a GEM algorithm. In practice choosing $a^{(j-1)} = 1$ will suffice when near the minimum (Lange, 1999). Since this step satisfies the conditionally maximizing requirement, it works well with ECM.
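A minimal sketch of such a step (our own helper, not from the original): take the Newton step toward a minimum of $g$ and halve the step size until $g$ actually decreases, which is all the GEM requirement asks for.

```python
def newton_gem_step(theta, g, dg, d2g, max_halvings=20):
    """One damped Newton-Raphson GEM update for a scalar parameter theta.

    g, dg, d2g are callables returning g(theta) and its first two derivatives
    (all other parameters held fixed at their current values).
    """
    delta = -dg(theta) / d2g(theta)   # full Newton step toward a minimum of g
    a = 1.0
    g_old = g(theta)
    for _ in range(max_halvings):
        candidate = theta + a * delta
        if g(candidate) < g_old:      # GEM only requires that g decreases
            return candidate
        a *= 0.5                      # shrink the step and try again
    return theta                      # no improvement found; keep current value
```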

2.5. Convergence criteria

The EM algorithm is said to converge when one of the two following conditions is met (Tanner, 1996):

$$\|\Theta^{(i)} - \Theta^{(i-1)}\| < \delta_\Theta \quad \text{or} \quad \left|(-2\log L_Z(\Theta^{(i)})) - (-2\log L_Z(\Theta^{(i-1)}))\right| < \delta_L$$

for some small positive $\delta_\Theta$, $\delta_L$, where $\|\Theta\| \equiv \sum_i \theta_i^2$ and the $\theta_i$ are (scalar) elements of $\Theta$. Since the EM algorithm converges to a stationary point, which can be a saddle point, local minimum, or global minimum (McLachlan and Krishnan, 1997), it is advisable to check the result with several different starting values.

2.6. Starting values

To achieve fast convergence, one should choose reasonable starting values. One simple method is to use moment-based estimates. Suppose the data vectors $\mathbf{z}_t$ are of the same size for all $t$ and $T > n$. It is straightforward to calculate the sample estimates of the first two moments:

$$\boldsymbol{\mu}_z = \frac{1}{T}\sum_{t=1}^{T}\mathbf{z}_t, \qquad (6a)$$

$$\mathbf{C}_0 = \frac{1}{T}\sum_{t=1}^{T}(\mathbf{z}_t - \boldsymbol{\mu}_z)(\mathbf{z}_t - \boldsymbol{\mu}_z)', \qquad (6b)$$

$$\mathbf{C}_1 = \frac{1}{T}\sum_{t=1}^{T-1}(\mathbf{z}_{t+1} - \boldsymbol{\mu}_z)(\mathbf{z}_t - \boldsymbol{\mu}_z)', \qquad (6c)$$

where $\mathbf{C}_1$ denotes the lag-one covariance estimate.

As a crude guess, assume $\mathbf{z}_t$ follows a VAR(1) model. We then use the obtained estimates as starting values: $\boldsymbol{\mu}_0^{(1)} = \boldsymbol{\mu}_z$, $\mathbf{H}^{(1)} = \mathbf{C}_1\mathbf{C}_0^{-1}$ and $\mathbf{Q}^{(1)} = \mathbf{C}_0 - \mathbf{C}_1\mathbf{C}_0^{-1}\mathbf{C}_1'$. We typically specify the measurement noise covariance matrix by $\mathbf{R}^{(1)} = \sigma_R^2\mathbf{I}$, where $\sigma_R^2$ is obtained from an assessment of the measuring instrument or from the estimate of the "nugget effect" from a spatial variogram (e.g., Cressie, 1993).

Note that in cases with $T \le n$, the estimates $\mathbf{C}_0$ and $\mathbf{C}_1$ are not positive definite and one cannot obtain estimates of $\mathbf{H}^{(1)}$ and $\mathbf{Q}^{(1)}$ as described above. Alternatively, one can fit individual univariate autoregressive models of order 1 for each spatial location and let $\mathbf{H}^{(1)}$ and $\mathbf{Q}^{(1)}$ be diagonal matrices with estimates of the autoregressive parameters and conditional variances on the diagonal, respectively.
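A sketch of these moment-based starting values (our naming; assumes the $T > n$ case and a common spatial support so that the data form a $T \times m$ matrix):

```python
import numpy as np

def starting_values(Z, sigma2_R):
    """Moment-based starting values from a T x m data matrix Z (rows are z_t')."""
    T, m = Z.shape
    mu_z = Z.mean(axis=0)
    Zc = Z - mu_z
    C0 = Zc.T @ Zc / T                        # Eq. (6b)
    C1 = Zc[1:].T @ Zc[:-1] / T               # Eq. (6c), lag-one covariance
    H1 = C1 @ np.linalg.inv(C0)               # VAR(1) propagator guess
    Q1 = C0 - C1 @ np.linalg.inv(C0) @ C1.T   # VAR(1) innovation covariance guess
    R1 = sigma2_R * np.eye(m)                 # instrument- or nugget-based guess
    return mu_z, H1, Q1, R1
```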

3. Algorithms for parameterizations of the Rt matrix

3.1. White noise

Without site-specific information about the measurement error process, it is often realistic to assume that measurement error is independent and identically distributed white noise for all data locations, especially if the domain of interest is relatively homogeneous. For example, researchers have modeled monthly temperature in the U.S. corn belt with an i.i.d. measurement error for all sites (Wikle et al., 1998). In this simple case, we reduce the error matrix $\mathbf{R}_t$ to a product of a scalar and the identity matrix:

$$\mathbf{R}_t(\sigma_\varepsilon^2) = \sigma_\varepsilon^2\mathbf{I}_{m_t}. \qquad (7)$$

One can show that a closed form M-step update formula exists for $\sigma_\varepsilon^2$.

Proposition 3.1. The M-step update of $\sigma_\varepsilon^2$ for model (7) is

$$\sigma_\varepsilon^{2(j)} = \frac{1}{\sum_{t=1}^{T}m_t}\sum_{t=1}^{T}\mathrm{tr}\{(\mathbf{z}_t - \mathbf{K}_t\mathbf{y}_t^T)(\mathbf{z}_t - \mathbf{K}_t\mathbf{y}_t^T)' + \mathbf{K}_t\mathbf{P}_t^T\mathbf{K}_t'\}. \qquad (8)$$

Proof. Rewrite Eq. (3) as

$$g(\mathbf{R}_t) \propto \sum_{t=1}^{T}m_t\log(\sigma_\varepsilon^2) + \frac{1}{\sigma_\varepsilon^2}\sum_{t=1}^{T}\mathrm{tr}\{(\mathbf{z}_t - \mathbf{K}_t\mathbf{y}_t^T)(\mathbf{z}_t - \mathbf{K}_t\mathbf{y}_t^T)' + \mathbf{K}_t\mathbf{P}_t^T\mathbf{K}_t'\}.$$

Differentiating with respect to $\sigma_\varepsilon^2$,

$$\frac{\partial g(\mathbf{R}_t)}{\partial\sigma_\varepsilon^2} = \frac{1}{\sigma_\varepsilon^2}\sum_{t=1}^{T}m_t - \frac{1}{(\sigma_\varepsilon^2)^2}\sum_{t=1}^{T}\mathrm{tr}\{(\mathbf{z}_t - \mathbf{K}_t\mathbf{y}_t^T)(\mathbf{z}_t - \mathbf{K}_t\mathbf{y}_t^T)' + \mathbf{K}_t\mathbf{P}_t^T\mathbf{K}_t'\}.$$

Setting the above to zero and solving for $\sigma_\varepsilon^2$ gives the result. A second derivative test shows that this is indeed the minimum. □

3.2. Truncated basis function representation

In some cases, the measurement error does depend on spatial location, so assuming i.i.d. error is no longer appropriate. However, if we have some knowledge about the measurement error, say from historical data or from a reformulation of the measurement equation, then we can incorporate that information into the model. First, we assume the measurement error covariance is not time dependent, $\mathbf{R}_t = \mathbf{R}$. Now, consider a basis function expansion for the $\mathbf{R}$ matrix (Berliner et al., 2000; Harville, 1997):

$$\mathbf{R}(c) = c\mathbf{I} + \sum_{i=1}^{I}\lambda_i\mathbf{A}_i, \qquad (9)$$

where the $\mathbf{A}_i$ are symmetric and idempotent matrices such that $\mathbf{A}_i\mathbf{A}_j = \mathbf{0}$ for all $i \neq j$, and $c > 0$, $\lambda_i \ge 0$. We assume that the $\mathbf{A}_i$ and $\lambda_i$ are all known and the positive scalar $c$ is the only unknown.

This model makes use of an incomplete matrix decomposition such as an eigenvalue decomposition. In this case we know the $I$ dominant matrix bases for $\mathbf{R}$ and use the $c$ term to represent smaller scale variability and randomness, to ensure that $\mathbf{R}(c)$ is positive definite. Estimation of $c$ with the EM algorithm in this case involves a numerical step as suggested by the following proposition.

Proposition 3.2. The M-step update of $c$ for model (9) is

$$c^{(j)} = \text{the (positive) root of } f(c),$$

where

$$f(c) = \frac{1}{c}\left[Tm - T\sum_{i=1}^{I}\mathrm{tr}(\mathbf{A}_i)\right] + \frac{1}{c^2}\left[\sum_{i=1}^{I}\mathrm{tr}(\mathbf{A}_i\mathbf{B}) - \mathrm{tr}(\mathbf{B})\right] + \sum_{i=1}^{I}\frac{T\,\mathrm{tr}(\mathbf{A}_i)}{c + \lambda_i} - \sum_{i=1}^{I}\frac{\mathrm{tr}(\mathbf{A}_i\mathbf{B})}{(c + \lambda_i)^2} \qquad (10)$$

and $\mathbf{B}$ is given by (5).


Proof. First note that $\mathbf{R}(c)$ is positive definite for any $c > 0$ and (Harville, 1997)

$$\mathbf{R}^{-1}(c) = d\mathbf{I} + \sum_{i=1}^{I}\gamma_i\mathbf{A}_i,$$

where $d = 1/c$ and $\gamma_i = -\lambda_i/(c(c + \lambda_i))$. From (3) we have

$$g(\mathbf{R}) \propto T\log|\mathbf{R}| + \mathrm{tr}(\mathbf{R}^{-1}\mathbf{B}). \qquad (11)$$

Taking the first derivative with respect to $c$ gives

$$
\begin{aligned}
\frac{\partial g(\mathbf{R})}{\partial c} &= T\frac{\partial\log|\mathbf{R}|}{\partial c} + \frac{\partial\,\mathrm{tr}(\mathbf{R}^{-1}\mathbf{B})}{\partial c} \qquad (12)\\
&= T\,\mathrm{tr}(\mathbf{R}^{-1}) - \mathrm{tr}(\mathbf{R}^{-1}\mathbf{R}^{-1}\mathbf{B}) \qquad (13)\\
&= T\,\mathrm{tr}\left(d\mathbf{I} + \sum_{i=1}^{I}\gamma_i\mathbf{A}_i\right) - \mathrm{tr}\left[\left(d^2\mathbf{I} + 2d\sum_{i=1}^{I}\gamma_i\mathbf{A}_i + \sum_{i=1}^{I}\gamma_i^2\mathbf{A}_i\right)\mathbf{B}\right]\\
&= Tdm + T\sum_{i=1}^{I}\gamma_i\,\mathrm{tr}(\mathbf{A}_i) - d^2\,\mathrm{tr}(\mathbf{B}) - \sum_{i=1}^{I}(2d\gamma_i + \gamma_i^2)\,\mathrm{tr}(\mathbf{A}_i\mathbf{B})\\
&= \frac{Tm}{c} + T\sum_{i=1}^{I}\left(\frac{1}{c+\lambda_i} - \frac{1}{c}\right)\mathrm{tr}(\mathbf{A}_i) - \frac{\mathrm{tr}(\mathbf{B})}{c^2} - \sum_{i=1}^{I}\left[\frac{1}{(c+\lambda_i)^2} - \frac{1}{c^2}\right]\mathrm{tr}(\mathbf{A}_i\mathbf{B}).
\end{aligned}
$$

The last line follows from $d = 1/c$ and $\gamma_i = 1/(c + \lambda_i) - 1/c$. Collecting terms, we get (10). □

In general we cannot find the closed form solution of (10), so we have to resort to numerical methods. Fortunately, most modern software packages have routines for finding the roots of a function of one variable if the user supplies an initial search bracket. However, if $I = 1$ or $I = 2$ in (9), then it can be shown that (10) is a polynomial in $c$ of degree 3 or 5. Then, we can use standard routines for solving polynomials. In that case, estimation is fully specified. In the event of multiple positive roots, we simply evaluate the objective (11) at each root and select the one giving the minimum. Alternatively, we could employ one Newton–Raphson step to update $c$.
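For example, one could bracket and solve $f(c) = 0$ with a standard one-dimensional root finder such as scipy.optimize.brentq. A sketch with our own naming; the bracket [c_lo, c_hi] must be supplied by the user and contain a sign change:

```python
import numpy as np
from scipy.optimize import brentq

def update_c(A_list, lam, B, T, m, c_lo=1e-8, c_hi=1e3):
    """Solve f(c) = 0 of Eq. (10) for the R(c) = c I + sum_i lam_i A_i model.

    A_list : list of known symmetric idempotent basis matrices A_i
    lam    : array of known nonnegative coefficients lambda_i
    B      : matrix B of Eq. (5); T, m : number of times and data dimension
    """
    trA = np.array([np.trace(A) for A in A_list])
    trAB = np.array([np.trace(A @ B) for A in A_list])

    def f(c):                                   # Eq. (10)
        return ((T * m - T * trA.sum()) / c
                + (trAB.sum() - np.trace(B)) / c**2
                + np.sum(T * trA / (c + lam))
                - np.sum(trAB / (c + lam)**2))

    return brentq(f, c_lo, c_hi)                # user-supplied bracket
```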

4. Algorithms for parameterization of the Q matrix

First let us derive the update formula for the unparameterized case, since the general case yields a different result than (4b).

4.1. General case

The EM update for the general Q case (but with parameterized H) is given by the following proposition.

Proposition 4.1. The $j$th update formula for the general $\mathbf{Q}$ is

$$\mathbf{Q}^{(j)} = T^{-1}(\mathbf{S}_{11} - \mathbf{S}_{10}\mathbf{H}' - \mathbf{H}\mathbf{S}_{10}' + \mathbf{H}\mathbf{S}_{00}\mathbf{H}'). \qquad (14)$$

Proof. First note that for any $\mathbf{X}$ and $\mathbf{B}$ we have

$$\frac{\partial\,\mathrm{tr}(\mathbf{X}^{-1}\mathbf{B})}{\partial\mathbf{X}} = -(\mathbf{X}^{-1}\mathbf{B}\mathbf{X}^{-1})'.$$

Let $\mathbf{A} = \mathbf{S}_{11} - \mathbf{S}_{10}\mathbf{H}' - \mathbf{H}\mathbf{S}_{10}' + \mathbf{H}\mathbf{S}_{00}\mathbf{H}'$ and rewrite (3) as a function of $\mathbf{Q}$:

$$g(\mathbf{Q}) \propto T\log|\mathbf{Q}| + \mathrm{tr}(\mathbf{Q}^{-1}\mathbf{A}).$$

Differentiating with respect to $\mathbf{Q}$ gives

$$\frac{\partial g(\mathbf{Q})}{\partial\mathbf{Q}} = T(\mathbf{Q}^{-1})' - (\mathbf{Q}^{-1}\mathbf{A}\mathbf{Q}^{-1})'.$$

Setting the above to zero and solving for $\mathbf{Q}$ yields the result. The second derivative test confirms this is a minimum. □

Remark 4.1. When $\mathbf{H}$ is also not parameterized, which implies $\mathbf{H}^{(j)}$ is (4a), then replacing $\mathbf{H}$ with $\mathbf{H}^{(j)}$ in Proposition 4.1 will yield (4b), i.e., $\mathbf{Q}^{(j)} = T^{-1}(\mathbf{S}_{11} - \mathbf{S}_{10}\mathbf{S}_{00}^{-1}\mathbf{S}_{10}')$.

4.2. Diagonal case

Assume that $\mathbf{Q}$ is a diagonal matrix with diagonal elements in the vector $\boldsymbol{\tau}$:

$$\mathbf{Q}(\boldsymbol{\tau}) = \mathrm{diag}(\tau_1, \ldots, \tau_n). \qquad (15)$$

Such a model is especially appropriate when the state variable is in the spectral domain, since the state process elements are often approximately decorrelated in that setting (e.g., Wikle, 2002). The following proposition gives the update equation for $\boldsymbol{\tau}$. It is simply the diagonal vector of $\mathbf{Q}^{(j)}$ for the general case as given in (14).

Proposition 4.2. The $j$th update formula of $\boldsymbol{\tau}$ for model (15) is

$$\boldsymbol{\tau}^{(j)} = \frac{1}{T}\mathbf{a}, \qquad (16)$$

where $\mathbf{a}$ is the diagonal of $\mathbf{A} = \mathbf{S}_{11} - \mathbf{S}_{10}\mathbf{H}' - \mathbf{H}\mathbf{S}_{10}' + \mathbf{H}\mathbf{S}_{00}\mathbf{H}'$.

Proof. From (3) we have

$$g(\boldsymbol{\tau}) \propto T\log|\mathbf{Q}| + \mathrm{tr}(\mathbf{Q}^{-1}\mathbf{A}) = T\log\prod_{i=1}^{n}\tau_i + \sum_{i=1}^{n}\frac{1}{\tau_i}a_{ii}.$$

Taking partial derivatives with respect to $\tau_i$ gives

$$\frac{\partial g(\boldsymbol{\tau})}{\partial\tau_i} = T\frac{1}{\tau_i} - \frac{1}{\tau_i^2}a_{ii}.$$

Setting the above to zero we obtain

$$\tau_i = \frac{1}{T}a_{ii},$$

which holds for $i = 1, \ldots, n$. The second derivative test confirms that this is a minimum. □

4.3. Conditional autoregressive (CAR) model

Consider a CAR model for $\boldsymbol{\eta}_t$ in (1b) by assuming the following (e.g., He and Sun, 2000):

$$\eta_t(\mathbf{s}_i) \mid \eta_t(\mathbf{s}_k), k \neq i \;\sim\; N\!\left(\rho\sum_{k\neq i}C_{ik}\,\eta_t(\mathbf{s}_k),\;\delta\right), \qquad (17)$$

where $C_{ik} = 1$ if locations $\mathbf{s}_i$ and $\mathbf{s}_k$ are neighbors, and $C_{ik} = 0$ otherwise. Define the adjacency matrix $\mathbf{C} = (C_{ik})$. It can be shown that the covariance matrix of the joint distribution of $\boldsymbol{\eta}_t$ is $\delta(\mathbf{I} - \rho\mathbf{C})^{-1}$ (e.g., He and Sun, 2000). In other words, the model for $\mathbf{Q}$ is

$$\mathbf{Q}(\delta, \rho) = \delta(\mathbf{I} - \rho\mathbf{C})^{-1}. \qquad (18)$$

Let $\lambda_1 \le \cdots \le \lambda_n$ be the eigenvalues of $\mathbf{C}$. Sun et al. (2000) showed that $\lambda_1 < 0$, $\lambda_n > 0$ and that in order for the covariance matrix to be positive definite, $\lambda_1^{-1} < \rho < \lambda_n^{-1}$.

Proposition 4.3. The M-step update of $\rho$ and $\delta$ for model (18) is

$$\rho^{(j)} = \text{the root of } f(\rho), \qquad (19a)$$

$$\delta^{(j)} = \frac{1}{Tn}[\mathrm{tr}(\mathbf{A}) - \rho^{(j)}\,\mathrm{tr}(\mathbf{C}\mathbf{A})], \qquad (19b)$$

where

$$f(\rho) = T\sum_{i=1}^{n}\frac{\lambda_i}{1 - \rho\lambda_i} - \frac{Tn\,\mathrm{tr}(\mathbf{C}\mathbf{A})}{\mathrm{tr}(\mathbf{A}) - \rho\,\mathrm{tr}(\mathbf{C}\mathbf{A})}, \qquad (20)$$

and

$$\mathbf{A} = \mathbf{S}_{11} - \mathbf{S}_{10}\mathbf{H}' - \mathbf{H}\mathbf{S}_{10}' + \mathbf{H}\mathbf{S}_{00}\mathbf{H}'. \qquad (21)$$

Proof. Starting from (3) and using the fact that $|\mathbf{I} - \rho\mathbf{C}| = \prod_{i=1}^{n}(1 - \rho\lambda_i)$,

$$
\begin{aligned}
g(\mathbf{Q}) &\propto T\log|\mathbf{Q}| + \mathrm{tr}(\mathbf{Q}^{-1}\mathbf{A})\\
&= -T\log|\mathbf{Q}^{-1}| + \mathrm{tr}(\mathbf{Q}^{-1}\mathbf{A})\\
&= -T\log\{\delta^{-n}|\mathbf{I} - \rho\mathbf{C}|\} + \frac{1}{\delta}\mathrm{tr}\{(\mathbf{I} - \rho\mathbf{C})\mathbf{A}\}\\
&= Tn\log\delta - T\sum_{i=1}^{n}\log(1 - \rho\lambda_i) + \frac{1}{\delta}\mathrm{tr}\,\mathbf{A} - \frac{\rho}{\delta}\mathrm{tr}(\mathbf{C}\mathbf{A}).
\end{aligned}
$$

Taking the first derivative with respect to $\rho$ and $\delta$, respectively, gives

$$\frac{\partial g(\mathbf{Q})}{\partial\rho} = T\sum_{i=1}^{n}\frac{\lambda_i}{1 - \rho\lambda_i} - \frac{1}{\delta}\mathrm{tr}(\mathbf{C}\mathbf{A}), \qquad (22a)$$

$$\frac{\partial g(\mathbf{Q})}{\partial\delta} = \frac{Tn}{\delta} - \frac{1}{\delta^2}\mathrm{tr}\,\mathbf{A} + \frac{\rho}{\delta^2}\mathrm{tr}(\mathbf{C}\mathbf{A}). \qquad (22b)$$

Setting (22b) to zero yields

$$\delta^{(j)} = \frac{1}{Tn}[\mathrm{tr}\,\mathbf{A} - \rho\,\mathrm{tr}(\mathbf{C}\mathbf{A})]. \qquad (23)$$

Substituting (23) into (22a) yields $f(\rho)$, or (20), the root of which is $\rho^{(j)}$. Substituting $\rho^{(j)}$ in (23) we obtain (19b). The second derivative test confirms that this is a minimum. □

Note that to find $\rho^{(j)}$ one needs to perform a numerical search on a line. Fortunately, the initial search bracket is always known, i.e., $(\lambda_1^{-1}, \lambda_n^{-1})$, since $\rho$ has to be constrained as mentioned earlier. Therefore these update formulas are fully specified and their implementation is automatic.
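A sketch of this CM-step (our naming; the eigenvalues of the adjacency matrix supply the search bracket, and a small offset keeps the search strictly inside it):

```python
import numpy as np
from scipy.optimize import brentq

def update_car_Q(A, C, T):
    """CM-step for Q(delta, rho) = delta * (I - rho*C)^{-1}, Eqs. (19a)-(20)."""
    n = C.shape[0]
    lam = np.linalg.eigvalsh(C)                  # eigenvalues of the symmetric adjacency C
    trA, trCA = np.trace(A), np.trace(C @ A)

    def f(rho):                                  # Eq. (20)
        return (T * np.sum(lam / (1.0 - rho * lam))
                - T * n * trCA / (trA - rho * trCA))

    eps = 1e-6                                   # stay strictly inside the bracket
    rho_new = brentq(f, 1.0 / lam[0] + eps, 1.0 / lam[-1] - eps)
    delta_new = (trA - rho_new * trCA) / (T * n) # Eq. (19b)
    Q_new = delta_new * np.linalg.inv(np.eye(n) - rho_new * C)
    return rho_new, delta_new, Q_new
```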

4.4. Exponential covariogram model

It is common in spatial statistics to impose a parametric model on the spatial random field. We consider the commonly used exponential covariogram model:

$$\mathbf{Q}(\sigma_\eta^2, \phi) = \sigma_\eta^2\mathbf{C}(\phi), \qquad (24)$$

where the correlation matrix $\mathbf{C}(\phi)$ is governed by the exponential correlation function $c(d; \phi) = \exp(-\phi^2 d)$, $d$ is the distance between two locations and $\phi$ is the spatial dependence parameter (Cressie, 1993). It is important to recognize that there exists an analytical form for the first and second derivatives of this correlation function $c(d; \phi)$ with respect to $\phi$. This enables us to obtain the closed form update formula as given by the following proposition.

Proposition 4.4. The update formula of $\sigma_\eta^2$ and $\phi$ for the model (24) is

$$\sigma_\eta^{2(j)} = \frac{1}{Tn}\mathrm{tr}[\mathbf{C}^{-1}(\phi^{(j-1)})\mathbf{A}], \qquad (25a)$$

$$\phi^{(j)} = \phi^{(j-1)} - a^{(j-1)}\frac{g'(\phi^{(j-1)})}{g''(\phi^{(j-1)})}, \qquad (25b)$$

where

$$\mathbf{A} = \mathbf{S}_{11} - \mathbf{S}_{10}\mathbf{H}' - \mathbf{H}\mathbf{S}_{10}' + \mathbf{H}\mathbf{S}_{00}\mathbf{H}',$$

$$g'(\phi) = T\,\mathrm{tr}\!\left(\mathbf{C}^{-1}\frac{\partial\mathbf{C}}{\partial\phi}\right) - \frac{1}{\sigma_\eta^{2(j)}}\mathrm{tr}\!\left(\mathbf{C}^{-1}\frac{\partial\mathbf{C}}{\partial\phi}\mathbf{C}^{-1}\mathbf{A}\right),$$

$$
\begin{aligned}
g''(\phi) &= T\,\mathrm{tr}\!\left(\mathbf{C}^{-1}\frac{\partial^2\mathbf{C}}{\partial\phi^2}\right) - T\,\mathrm{tr}\!\left(\mathbf{C}^{-1}\frac{\partial\mathbf{C}}{\partial\phi}\mathbf{C}^{-1}\frac{\partial\mathbf{C}}{\partial\phi}\right) - \frac{1}{\sigma_\eta^{2(j)}}\mathrm{tr}\!\left(\mathbf{C}^{-1}\frac{\partial^2\mathbf{C}}{\partial\phi^2}\mathbf{C}^{-1}\mathbf{A}\right)\\
&\quad + \frac{2}{\sigma_\eta^{2(j)}}\mathrm{tr}\!\left(\mathbf{C}^{-1}\frac{\partial\mathbf{C}}{\partial\phi}\mathbf{C}^{-1}\frac{\partial\mathbf{C}}{\partial\phi}\mathbf{C}^{-1}\mathbf{A}\right),
\end{aligned}
$$

and $0 < a^{(j-1)} \le 1$.

Proof. Starting from (3),

$$g(\mathbf{Q}) \propto T\log|\mathbf{Q}| + \mathrm{tr}(\mathbf{Q}^{-1}\mathbf{A}) = Tn\log\sigma_\eta^2 + T\log|\mathbf{C}| + \frac{1}{\sigma_\eta^2}\mathrm{tr}(\mathbf{C}^{-1}\mathbf{A}).$$

Taking the first derivative with respect to $\sigma_\eta^2$ yields

$$\frac{\partial g}{\partial\sigma_\eta^2} = \frac{Tn}{\sigma_\eta^2} - \frac{1}{(\sigma_\eta^2)^2}\mathrm{tr}(\mathbf{C}^{-1}\mathbf{A}).$$

Setting the above to zero and evaluating $\mathbf{C}$ at $\phi^{(j-1)}$ yields the update formula (25a). Note this is an ECM step. To update $\phi$, we focus on the following function:

$$g(\phi) = T\log|\mathbf{C}(\phi)| + \frac{1}{\sigma_\eta^{2(j)}}\mathrm{tr}(\mathbf{C}(\phi)^{-1}\mathbf{A}).$$

Then, we use the GEM algorithm based on one Newton–Raphson step to obtain (25b). □

Remark 4.2. The algorithm given by Proposition 4.4 is appropriate for any covariogram model which has an analytical form for the first and second derivatives of the correlation function with respect to the spatial dependence parameter $\phi$.
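For the exponential model $c(d; \phi) = \exp(-\phi^2 d)$ the derivative matrices are elementwise products with the distance matrix, so the quantities in Proposition 4.4 can be assembled directly. A sketch with our own naming (D is the matrix of inter-location distances):

```python
import numpy as np

def update_exp_covariogram(A, D, phi_prev, T, a=1.0):
    """One ECM/GEM update (25a)-(25b) for Q = sigma2_eta * C(phi),
    with C(phi)_{kl} = exp(-phi^2 * d_{kl}) and D the matrix of distances d_{kl}."""
    n = D.shape[0]
    C = np.exp(-phi_prev**2 * D)
    Cinv = np.linalg.inv(C)
    sigma2_new = np.trace(Cinv @ A) / (T * n)                 # Eq. (25a)
    dC = -2.0 * phi_prev * D * C                              # dC/dphi (elementwise)
    d2C = (-2.0 * D + 4.0 * phi_prev**2 * D**2) * C           # d2C/dphi2
    M, M2 = Cinv @ dC, Cinv @ d2C
    g1 = T * np.trace(M) - np.trace(M @ Cinv @ A) / sigma2_new
    g2 = (T * np.trace(M2) - T * np.trace(M @ M)
          - np.trace(M2 @ Cinv @ A) / sigma2_new
          + 2.0 * np.trace(M @ M @ Cinv @ A) / sigma2_new)
    phi_new = phi_prev - a * g1 / g2                          # Eq. (25b)
    return sigma2_new, phi_new
```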

5. Parameterization for transition matrix H

The transition (or propagator) matrix, $\mathbf{H}$, is the most critical part of the spatio-temporal dynamic model (1), since it governs the evolution of the process. Each row of $\mathbf{H}$ contains essentially the location-wise weights applied to the process at the previous time for the spatial location corresponding to that row. It can be shown that the M-step update formula for the unparameterized $\mathbf{H}$ is given in Eq. (4a) regardless of the parameterization of $\mathbf{Q}$ and $\mathbf{R}_t$. Here we propose a simple but powerful parameterization. Assume that an entry of $\mathbf{H}$ is either zero or $\theta_i$, $i = 1, \ldots, m$. The positions of the 0's and the $\theta_i$'s in the matrix $\mathbf{H}$ are fixed. We can write

$$\mathbf{H} = \mathbf{H}(0, \boldsymbol{\theta}), \qquad (26)$$

where $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_m)'$ and $m$ is known. In the next proposition we derive the closed form update formula for $\boldsymbol{\theta}$.

Proposition 5.1. The $j$th iteration update of $\boldsymbol{\theta}$ for model (26) is

$$\boldsymbol{\theta}^{(j)} = \boldsymbol{\Gamma}^{-1}\mathbf{b},$$

where $\boldsymbol{\Gamma} = (\gamma_{ij})$, $\mathbf{b} = (b_i)$,

$$\gamma_{ij} = \mathrm{tr}\left\{\mathbf{Q}^{-1}\frac{\partial\mathbf{H}}{\partial\theta_j}\mathbf{S}_{00}\left(\frac{\partial\mathbf{H}}{\partial\theta_i}\right)'\right\} \quad \text{and} \quad b_i = \mathrm{tr}\left\{\mathbf{Q}^{-1}\frac{\partial\mathbf{H}}{\partial\theta_i}\mathbf{S}_{10}'\right\}.$$

Proof. Starting from Eq. (3),

$$g(\mathbf{H}) \propto \mathrm{tr}\{\mathbf{Q}^{-1}[\mathbf{S}_{11} - \mathbf{S}_{10}\mathbf{H}' - \mathbf{H}\mathbf{S}_{10}' + \mathbf{H}\mathbf{S}_{00}\mathbf{H}']\} \propto -2\,\mathrm{tr}\{\mathbf{Q}^{-1}\mathbf{H}\mathbf{S}_{10}'\} + \mathrm{tr}\{\mathbf{Q}^{-1}\mathbf{H}\mathbf{S}_{00}\mathbf{H}'\}.$$

Defining $\mathbf{D}(i) \equiv \partial\mathbf{H}/\partial\theta_i$ and differentiating $g(\mathbf{H})$ with respect to $\theta_i$ gives

$$\frac{\partial g(\mathbf{H})}{\partial\theta_i} = -2\,\mathrm{tr}\{\mathbf{Q}^{-1}\mathbf{D}(i)\mathbf{S}_{10}'\} + 2\,\mathrm{tr}\{\mathbf{Q}^{-1}\mathbf{H}\mathbf{S}_{00}\mathbf{D}'(i)\}.$$

Setting the above to zero and using the fact that $\mathbf{H} = \theta_1\mathbf{D}(1) + \cdots + \theta_m\mathbf{D}(m)$, we get

$$\theta_1\,\mathrm{tr}\{\mathbf{Q}^{-1}\mathbf{D}(1)\mathbf{S}_{00}\mathbf{D}'(i)\} + \cdots + \theta_m\,\mathrm{tr}\{\mathbf{Q}^{-1}\mathbf{D}(m)\mathbf{S}_{00}\mathbf{D}'(i)\} = \mathrm{tr}\{\mathbf{Q}^{-1}\mathbf{D}(i)\mathbf{S}_{10}'\}.$$

Therefore, we have $i = 1, \ldots, m$ linear equations. Writing these in matrix form and solving for $\boldsymbol{\theta}$ gives the result. □

The formula in Proposition 5.1 implies that to update $\mathbf{H}$ we need to know $\mathbf{Q}^{(j)}$. This is impossible with the standard EM algorithm. Instead we employ the ECM algorithm to update the parameters sequentially (see Section 2.4). To illustrate, suppose $\mathbf{Q}$ is defined in the general case. Then, we can update $\mathbf{H}$ and $\mathbf{Q}$ together as suggested by the following remark.

Remark 5.1. The ECM update for model (26) and general $\mathbf{Q}$ is

1. First update $\mathbf{H}^{(j)}$ with Proposition 5.1 by letting $\mathbf{Q} = \mathbf{Q}^{(j-1)}$.
2. Then update $\mathbf{Q}^{(j)}$ with Proposition 4.1 by letting $\mathbf{H} = \mathbf{H}^{(j)}$.
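A sketch of this pair of CM-steps (our naming; the structure of $\mathbf{H}$ is encoded by known 0/1 indicator matrices $\mathbf{D}(i) = \partial\mathbf{H}/\partial\theta_i$ marking where each $\theta_i$ appears):

```python
import numpy as np

def update_H_theta(D_list, Q, S00, S10):
    """CM-step of Proposition 5.1: solve Gamma theta = b for the free entries of H."""
    m = len(D_list)
    Qinv = np.linalg.inv(Q)
    Gamma = np.empty((m, m))
    b = np.empty(m)
    for i, Di in enumerate(D_list):
        b[i] = np.trace(Qinv @ Di @ S10.T)
        for j, Dj in enumerate(D_list):
            Gamma[i, j] = np.trace(Qinv @ Dj @ S00 @ Di.T)
    theta = np.linalg.solve(Gamma, b)
    H = sum(t * D for t, D in zip(theta, D_list))   # H = theta_1 D(1) + ... + theta_m D(m)
    return theta, H

def ecm_H_then_Q(D_list, Q_prev, S11, S10, S00, T):
    """Remark 5.1: update H given Q^(j-1), then the general Q given the new H."""
    theta, H_new = update_H_theta(D_list, Q_prev, S00, S10)
    Q_new = (S11 - S10 @ H_new.T - H_new @ S10.T + H_new @ S00 @ H_new.T) / T
    return theta, H_new, Q_new
```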

6. Illustrative examples

6.1. Advection–diffusion PDE: a simulation study

6.1.1. Background

In an ecological study, researchers used a diffusion PDE to predict the spread of the house finch in the eastern United States with a hierarchical Bayesian model (Wikle, 2003). In cases where one does not have strong a priori belief that the diffusion parameter varies with space, such a model lends itself well to the methodology we just developed. For illustration, consider a one-dimensional advection–diffusion equation for the spatio-temporal state process $u_t(x)$, at spatial location $x$ and time $t$:

$$\frac{\partial u}{\partial t} + \alpha\frac{\partial u}{\partial x} = \delta\frac{\partial^2 u}{\partial x^2}, \qquad \delta \ge 0, \quad x_0 \le x \le x_L, \qquad (27)$$


where $\alpha$ is the advection coefficient and $\delta$ is the diffusion coefficient. Following basic finite difference approaches to the numerical solution of partial differential equations (e.g., Haberman, 1987), we can apply first-order forward differences in time ($\partial u/\partial t \approx (u_t(x) - u_{t-\Delta_t}(x))/\Delta_t$) and centered differences in space ($\partial u/\partial x \approx (u_t(x + \Delta_x) - u_t(x - \Delta_x))/2\Delta_x$ and $\partial^2 u/\partial x^2 \approx (u_t(x + \Delta_x) - 2u_t(x) + u_t(x - \Delta_x))/\Delta_x^2$), where these centered differences are valid for any time $t$, to get

$$u_t(x) = u_{t-\Delta_t}(x)\left(1 - \frac{2\Delta_t\delta}{\Delta_x^2}\right) + u_{t-\Delta_t}(x + \Delta_x)\left(-\frac{\Delta_t\alpha}{2\Delta_x} + \frac{\Delta_t\delta}{\Delta_x^2}\right) + u_{t-\Delta_t}(x - \Delta_x)\left(\frac{\Delta_t\alpha}{2\Delta_x} + \frac{\Delta_t\delta}{\Delta_x^2}\right) + \eta_t(x), \qquad (28)$$

where $\Delta_t$ and $\Delta_x$ are the temporal and spatial increments, respectively, and we thus discretize the spatial domain so that $x_0, x_1, \ldots, x_n, x_{n+1}$ have equal spacing, $\Delta_x$. Note $\eta_t(x)$ was added to (28) to make up for the loss from discretization and to introduce extra stochastic forcing (Wikle, 2003). Furthermore, if we let

$$\theta_1 = 1 - \frac{2\Delta_t\delta}{\Delta_x^2}, \qquad \theta_2 = -\frac{\Delta_t\alpha}{2\Delta_x} + \frac{\Delta_t\delta}{\Delta_x^2}, \qquad \theta_3 = \frac{\Delta_t\alpha}{2\Delta_x} + \frac{\Delta_t\delta}{\Delta_x^2},$$

$\Delta_x = 1$ (since spatial locations are equally spaced, this is convenient notationally) and $\Delta_t = 1$, then (28) becomes

$$u_t(x) = u_{t-1}(x)\theta_1 + u_{t-1}(x+1)\theta_2 + u_{t-1}(x-1)\theta_3 + \eta_t(x). \qquad (29)$$

Writing in matrix form, we have

$$\mathbf{u}_t = \mathbf{H}(\boldsymbol{\theta})\mathbf{u}_{t-1} + \mathbf{H}_B(\boldsymbol{\theta})\mathbf{u}_{t-1}^{B} + \boldsymbol{\eta}_t, \qquad (30)$$

where $\mathbf{u}_t = (u_t(x_1), \ldots, u_t(x_n))'$ is the interior process and $\mathbf{u}_t^B = (u_t(x_0), u_t(x_{n+1}))'$ is the boundary process, respectively. Furthermore,

$$\mathbf{H}_B(\boldsymbol{\theta}) \equiv \begin{bmatrix} \theta_3 & 0\\ 0 & 0\\ \vdots & \vdots\\ 0 & 0\\ 0 & \theta_2 \end{bmatrix} \qquad (31)$$

is the propagator matrix for the boundary process, and

$$\mathbf{H}(\boldsymbol{\theta}) \equiv \begin{bmatrix}
\theta_1 & \theta_2 & 0 & \cdots & 0 & 0 & 0\\
\theta_3 & \theta_1 & \theta_2 & \cdots & 0 & 0 & 0\\
0 & \theta_3 & \theta_1 & \cdots & 0 & 0 & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots\\
0 & 0 & 0 & \cdots & \theta_1 & \theta_2 & 0\\
0 & 0 & 0 & \cdots & \theta_3 & \theta_1 & \theta_2\\
0 & 0 & 0 & \cdots & 0 & \theta_3 & \theta_1
\end{bmatrix} \qquad (32)$$

is a tri-diagonal propagator matrix for the interior process. This matrix is in the form of (26) due to the structural zeros. For simplicity, we let the boundaries in (30) be zeros (i.e., $\mathbf{u}_{t-1}^B = (0, 0)'$ for all $t$). Thus, (30) becomes the usual state equation. In addition, we specify a measurement equation similar to (1a) with noise covariance matrix (7). We have specified a spatio-temporal dynamic model for the diffusion PDE process:

$$\mathbf{z}_t = \mathbf{K}_t\mathbf{u}_t + \boldsymbol{\varepsilon}_t, \qquad (33a)$$

$$\mathbf{u}_t = \mathbf{H}(\boldsymbol{\theta})\mathbf{u}_{t-1} + \boldsymbol{\eta}_t, \qquad (33b)$$

where $\mathrm{cov}(\boldsymbol{\varepsilon}_t) = \mathbf{R}_t = \sigma_\varepsilon^2\mathbf{I}$. We assume that the process error, $\boldsymbol{\eta}_t$, has an exponential covariance matrix as described by (24), i.e., $\mathrm{cov}(\boldsymbol{\eta}_t) = \mathbf{Q}(\sigma_\eta^2, \phi) = \sigma_\eta^2\mathbf{C}(\phi)$. We assume both noise processes are Gaussian with zero mean.
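A sketch of how the discretized PDE coefficients map into the structured propagator (26) (our naming; the returned indicator matrices are exactly the $\mathbf{D}(i)$ needed by the Proposition 5.1 update):

```python
import numpy as np

def diffusion_H(alpha, delta, n, dt=1.0, dx=1.0):
    """Tri-diagonal propagator H(theta) of Eq. (32) from advection alpha and diffusion delta."""
    theta1 = 1.0 - 2.0 * dt * delta / dx**2
    theta2 = -dt * alpha / (2.0 * dx) + dt * delta / dx**2
    theta3 = dt * alpha / (2.0 * dx) + dt * delta / dx**2
    D1 = np.eye(n)                       # structural positions of theta1 (main diagonal)
    D2 = np.eye(n, k=1)                  # positions of theta2 (super-diagonal)
    D3 = np.eye(n, k=-1)                 # positions of theta3 (sub-diagonal)
    H = theta1 * D1 + theta2 * D2 + theta3 * D3
    return H, (theta1, theta2, theta3), [D1, D2, D3]
```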


Of course, in most "real-world" applications in which an advection–diffusion process would be appropriate (e.g., processes in the atmosphere, ocean, or ecological processes) one would not know the parameters $\alpha$ and $\delta$ (and thus, $\theta_1$, $\theta_2$ and $\theta_3$). Thus, we seek to estimate them. We demonstrate such estimation by simulating the process and comparing estimates to the known parameters.

6.1.2. Simulating the data set

To illustrate our estimation methodology, we simulate a data set according to the above specified model. Table 1 summarizes the actual values of the parameters and other simulation set-up values. As a way of gauging the estimation performance, we withhold a certain amount of data for validation. This missing data set-up is achieved easily with an incidence matrix $\mathbf{K}_t$ in the measurement equation.

Fig. 1(a) shows the map of the simulated spatio-temporal diffusion data. There is a noticeable pattern of propagation of spatial features to the left through time. This is the result of the special structure of the propagation matrix $\mathbf{H}(\boldsymbol{\theta})$ and the chosen value of $\boldsymbol{\theta}$. Another way of looking at the data is to examine the time series plot for some locations (see Fig. 2).

Table 1
Simulation set-up for diffusion data used in Table 2 and Fig. 1. (Signal-to-noise ratio (SNR) = σ²_η/σ²_ε; m_t = number of observations at time t; n = spatial dimension of the state process; φ = spatial dependence parameter; θ = parameters in the propagator matrix.)

θ1     θ2     θ3     σ²_ε    φ²     σ²_η    SNR    Missing    m_t     n     T
0.3    0.6    0.1    1       10     5       5      10%        ≈ 18    20    100

Fig. 1. Simulated diffusion data and estimation (panels share axes of spatial location, horizontal, and time, vertical): (a) simulated data, $\mathbf{z}_t$ (see Table 1 for true parameter values); (b) smoothed values $\mathbf{u}_t^T$; (c) true process $\mathbf{u}_t$; (d) prediction error $\mathbf{u}_t - \mathbf{u}_t^T$; (e) standard deviation of the prediction error, $[\mathrm{diag}(\mathbf{P}_t^T)]^{1/2}$.


Fig. 2. Simulated diffusion data (squares), true process $\mathbf{u}_t$ (line) and prediction $\mathbf{u}_t^T$ (dots) for two selected locations (locations 4 and 8). See Fig. 1 for data at all locations.

We cannot easily detect spatio-temporal propagation from such time series plots, but they give us information about the temporal structure and stability of the signal.

6.1.3. ECM estimation

Estimation is carried out with an ECM algorithm. For this model we need to update the following parameters: $\Theta = \{\boldsymbol{\theta}, \sigma_\eta^2, \phi, \sigma_\varepsilon^2, \boldsymbol{\mu}_0\}$. Given the current iterate $\Theta^{(j-1)}$, five CM-steps are used to obtain the update $\Theta^{(j)}$:

• CM-step 1: update $\boldsymbol{\theta}^{(j)}$ by Proposition 5.1 with $\sigma_\eta^2 = \sigma_\eta^{2(j-1)}$, $\phi = \phi^{(j-1)}$, $\sigma_\varepsilon^2 = \sigma_\varepsilon^{2(j-1)}$ and $\boldsymbol{\mu}_0 = \boldsymbol{\mu}_0^{(j-1)}$,
• CM-step 2: update $\sigma_\eta^{2(j)}$ by Eq. (25a) of Proposition 4.4 with $\boldsymbol{\theta} = \boldsymbol{\theta}^{(j)}$, $\phi = \phi^{(j-1)}$, $\sigma_\varepsilon^2 = \sigma_\varepsilon^{2(j-1)}$ and $\boldsymbol{\mu}_0 = \boldsymbol{\mu}_0^{(j-1)}$,
• CM-step 3: update $\phi^{(j)}$ by Eq. (25b) in Proposition 4.4 with $\boldsymbol{\theta} = \boldsymbol{\theta}^{(j)}$, $\sigma_\eta^2 = \sigma_\eta^{2(j)}$, $\sigma_\varepsilon^2 = \sigma_\varepsilon^{2(j-1)}$ and $\boldsymbol{\mu}_0 = \boldsymbol{\mu}_0^{(j-1)}$,
• CM-step 4: update $\sigma_\varepsilon^{2(j)}$ by Proposition 3.1 with $\boldsymbol{\theta} = \boldsymbol{\theta}^{(j)}$, $\sigma_\eta^2 = \sigma_\eta^{2(j)}$, $\phi = \phi^{(j)}$, and $\boldsymbol{\mu}_0 = \boldsymbol{\mu}_0^{(j-1)}$,
• CM-step 5: update $\boldsymbol{\mu}_0^{(j)}$ by Eq. (4d) with $\boldsymbol{\theta} = \boldsymbol{\theta}^{(j)}$, $\sigma_\eta^2 = \sigma_\eta^{2(j)}$, $\phi = \phi^{(j)}$, and $\sigma_\varepsilon^2 = \sigma_\varepsilon^{2(j)}$.

These parameters can be updated in a different order if desired. To update $\phi$, we use the "GEM based on one Newton–Raphson step" algorithm discussed in Section 2.4 (McLachlan and Krishnan, 1997).
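Putting the pieces together, a skeleton of the resulting ECM loop (our naming, reusing the earlier sketches kalman_filter_smoother, update_H_theta and update_exp_covariogram; this is an illustration of the five CM-steps, not the authors' implementation):

```python
import numpy as np

def ecm_diffusion(z, K, D_list, Dmat, theta, sigma2_eta, phi, sigma2_eps, mu0, P0,
                  max_iter=500, tol=1e-3):
    """Skeleton of the five CM-step ECM loop of Section 6.1.3.

    z, K   : length-T lists of data vectors and incidence matrices
    D_list : indicator matrices giving the tri-diagonal structure of H(theta)
    Dmat   : n x n matrix of distances between state locations (for C(phi))
    """
    T = len(z)
    for j in range(max_iter):
        H = sum(t * D for t, D in zip(theta, D_list))
        Q = sigma2_eta * np.exp(-phi**2 * Dmat)                        # Q(sigma2_eta, phi)
        R = [sigma2_eps * np.eye(Kt.shape[0]) for Kt in K]
        # E-step: smoothed moments given the current parameter values
        ys, Ps, Plag = kalman_filter_smoother(z, K, H, Q, R, mu0, P0)
        S11 = sum(np.outer(ys[t], ys[t]) + Ps[t] for t in range(1, T + 1))
        S10 = sum(np.outer(ys[t], ys[t - 1]) + Plag[t] for t in range(1, T + 1))
        S00 = sum(np.outer(ys[t - 1], ys[t - 1]) + Ps[t - 1] for t in range(1, T + 1))
        old = np.r_[theta, sigma2_eta, phi, sigma2_eps]
        theta, H = update_H_theta(D_list, Q, S00, S10)                 # CM-step 1
        A = S11 - S10 @ H.T - H @ S10.T + H @ S00 @ H.T
        sigma2_eta, phi = update_exp_covariogram(A, Dmat, phi, T)      # CM-steps 2-3
        tr_sum = sum(np.trace(np.outer(z[t - 1] - K[t - 1] @ ys[t],
                                       z[t - 1] - K[t - 1] @ ys[t])
                              + K[t - 1] @ Ps[t] @ K[t - 1].T)
                     for t in range(1, T + 1))
        sigma2_eps = tr_sum / sum(Kt.shape[0] for Kt in K)             # CM-step 4, Eq. (8)
        mu0 = ys[0]                                                    # CM-step 5, Eq. (4d)
        new = np.r_[theta, sigma2_eta, phi, sigma2_eps]
        if np.sum((new - old) ** 2) < tol:    # convergence check of Section 2.5 (mu0 omitted)
            break
    return theta, sigma2_eta, phi, sigma2_eps, mu0
```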

ECM gives updates that move the likelihood in the right direction, but each is usually not the "best" update. Therefore it takes more iterations to converge as compared to the common EM. For this example, it takes about 270 iterations to converge (see Table 2). The final estimates are reasonably close to the truth. Fig. 1 shows the space-time plot of the smoothed values of $\mathbf{u}_t$ as well as the associated smoothing error. We look at smoothed values because we are interested in prediction at locations with missing data. There is very little apparent difference between the prediction map and the true process map. A closer examination in Fig. 2 for two locations reveals minor discrepancies. Finally, we note that the prediction variance (see Fig. 1e) is larger in the locations with missing data, as expected.

6.1.4. Sampling distribution of parameter estimates

To gain insight into the sampling distribution properties of the estimates, we conduct a small simulation study with the same parameter values as given in Table 1 except for different values of the signal variance $\sigma_\eta^2$ (hence different SNRs, $\mathrm{SNR} = \sigma_\eta^2/\sigma_\varepsilon^2$) as well as different amounts of missing data. To be specific, we conduct four experiments: two levels of SNR (one strong, SNR = 5, and one weak, SNR = 0.5) with two levels of missing data (few, 10%, missing and substantial, 40%, missing).


Table 2
ECM iterates for simulated diffusion data. Θ = {µ0, H(θ), Q(σ²_η, φ), R_t(σ²_ε)}. a^(j) is the Newton–Raphson step size for updating φ. We consider the stopping criterion δ_Θ = 0.001. See Fig. 1 for a plot of the data.

j      −log L^(j)   Δ log L     θ1^(j)   θ2^(j)   θ3^(j)   σ²_ε^(j)   σ²_η^(j)   φ²^(j)   a^(j)   ‖µ0^(j)‖   ‖Δ‖
1      13766.03     0.00        0.30     0.30     0.30     0.2000     1.00       1.00     1.00    6.66       0.00000
2      2825.46      10940.57    0.33     0.53     0.20     1.8420     5.51       1.14     1.00    22.73      28.69474
3      2643.62      181.83      0.33     0.59     0.14     2.6336     6.57       1.33     1.00    22.43      9.18374
4      2612.22      31.41       0.32     0.62     0.11     2.8971     6.81       1.54     1.00    22.21      4.57350
5      2597.93      14.28       0.31     0.63     0.09     2.9249     6.74       1.79     1.00    22.12      2.63997
61     2492.28      0.03        0.28     0.61     0.10     1.1671     4.78       9.13     1.00    26.67      0.03425
121    2491.71      0.00        0.28     0.61     0.10     0.9806     5.00       9.46     1.00    27.07      0.00839
181    2491.64      0.00        0.28     0.61     0.10     0.9196     5.08       9.56     1.00    27.19      0.00328
241    2491.62      0.00        0.28     0.60     0.10     0.8937     5.11       9.61     1.00    27.24      0.00148
265    2491.62      0.00        0.28     0.60     0.10     0.8878     5.12       9.62     1.00    27.25      0.00110
266    2491.62      0.00        0.28     0.60     0.10     0.8876     5.12       9.62     1.00    27.25      0.00108
267    2491.62      0.00        0.28     0.60     0.10     0.8874     5.12       9.62     1.00    27.25      0.00107
268    2491.62      0.00        0.28     0.60     0.10     0.8871     5.12       9.62     1.00    27.25      0.00106
269    2491.62      0.00        0.28     0.60     0.10     0.8869     5.12       9.62     1.00    27.25      0.00105
270    2491.62      0.00        0.28     0.60     0.10     0.8867     5.12       9.62     1.00    27.25      0.00103
271    2491.62      0.00        0.28     0.60     0.10     0.8865     5.12       9.62     1.00    27.25      0.00102
272    2491.62      0.00        0.28     0.60     0.10     0.8863     5.12       9.62     1.00    27.25      0.00101
Truth                           0.30     0.60     0.10     1.0000     5.00       10.00

Each experiment includes 1000 cycles, where in each cycle we simulate a data set and then run the ECM algorithm to get the estimates. As shown in Table 3, the estimates are generally centered around the true values with small deviances. Several findings are worth noting. First, the estimates of the $\theta$'s are not sensitive to the amount of missing data, yet the variance parameter estimates are more uncertain if the amount of missing data is large. Second, and not surprisingly, large SNR values yield more accurate estimates for most of the parameters except for the measurement noise variance $\sigma_\varepsilon^2$.

The findings from this simulation study do not necessarily generalize. However, they do give us a picture of how this estimation procedure might perform in practice for a process that is very realistic in many environmental applications.

6.2. Palmer Drought Severity Index (PDSI)

6.2.1. Background

Drought poses a serious problem for every society. One measure of drought is the PDSI, which is typically a monthly valued index (Heim, 2002). The typical value of PDSI ranges from −6 to +6 with negative values denoting dry spells and positive values denoting wet spells.

We obtain the monthly PDSI for 107 locations in the central U.S. from January 1900 to December 1997. Fig. 3 displays the data for two typical months. We can see that there is significant spatial correlation in the data. Indeed, dry and wet spells occur with substantial spatial coherence across the region. Therefore there is no need to model 107 "stations" individually. A more concise representation should suffice for this data set. Thus, we consider a dimension-reduced spatio-temporal approach to model PDSI.

6.2.2. Dimension reduction

First, we introduce the idea of spatio-temporal dimension reduction. The key is to recast the state vector in a much lower dimensional space by using a spectral basis (Wikle and Cressie, 1999). Let

$$\mathbf{y}_t = \boldsymbol{\Psi}\mathbf{a}_t + \boldsymbol{\nu}_t, \qquad (34)$$

where $\boldsymbol{\Psi} = [\boldsymbol{\psi}_1 \cdots \boldsymbol{\psi}_K]$ is an $n \times K$ matrix of spectral basis functions, $K \ll n$, and $\boldsymbol{\nu}_t$ contains the residual process induced by the truncation. We treat $\mathbf{a}_t$ as our new state vector, which follows a first-order Markov process if $\mathbf{y}_t$ follows such a process. Typically, $\boldsymbol{\nu}_t$ is a non-dynamic (uncorrelated in time) spatial process.


Table 3
Sampling distribution of MLEs for the diffusion model. Four cases: two levels of signal-to-noise ratio (SNR) at two levels of missing data. (2.5th, Median and 97.5th are percentiles of the sampling distribution.)

MLE      Truth      Mean       Std       2.5th     Median     97.5th

Missing 10 percent; SNR = 5.0
θ1       0.3000     0.2969     0.0226    0.2513    0.2972     0.3414
θ2       0.6000     0.5969     0.0214    0.5533    0.5976     0.6358
θ3       0.1000     0.1002     0.0177    0.0670    0.1007     0.1352
σ²_ε     1.0000     1.0017     0.2646    0.4980    1.0006     1.5347
σ²_η     5.0000     4.9559     0.4026    4.1962    4.9583     5.7083
φ²       10.0000    9.9400     0.8749    8.3339    9.9068     11.8073

Missing 40 percent; SNR = 5.0
θ1       0.3000     0.3000     0.0299    0.2403    0.2997     0.3640
θ2       0.6000     0.5964     0.0243    0.5489    0.5966     0.6414
θ3       0.1000     0.0983     0.0222    0.0527    0.0981     0.1414
σ²_ε     1.0000     1.0001     0.3612    0.3612    0.9999     1.7156
σ²_η     5.0000     4.9164     0.4913    3.9663    4.9162     5.8190
φ²       10.0000    10.0229    1.2314    7.8129    9.9793     12.6944

Missing 10 percent; SNR = 0.5
θ1       0.3000     0.3038     0.0480    0.2049    0.3027     0.3922
θ2       0.6000     0.5915     0.0347    0.5237    0.5922     0.6615
θ3       0.1000     0.0976     0.0300    0.0371    0.0973     0.1574
σ²_ε     1.0000     0.9875     0.0761    0.8293    0.9872     1.1360
σ²_η     0.5000     0.5017     0.0697    0.3737    0.4997     0.6481
φ²       10.0000    10.0044    1.8650    6.8860    9.7791     14.3477

Missing 40 percent; SNR = 0.5
θ1       0.3000     0.3033     0.0582    0.1802    0.3063     0.4126
θ2       0.6000     0.5935     0.0396    0.5157    0.5916     0.6736
θ3       0.1000     0.0964     0.0386    0.0187    0.0977     0.1695
σ²_ε     1.0000     0.9810     0.1018    0.7674    0.9829     1.1621
σ²_η     0.5000     0.5006     0.0910    0.3359    0.4962     0.6835
φ²       10.0000    10.2322    2.7400    6.2384    9.8888     16.7922

Now, rewrite the spatio-temporal model (1) in the light of dimension reduction,

$$\mathbf{z}_t = \mathbf{K}_t\boldsymbol{\Psi}\mathbf{a}_t + \boldsymbol{\varepsilon}_t, \qquad (35a)$$

$$\mathbf{a}_t = \mathbf{H}\mathbf{a}_{t-1} + \boldsymbol{\eta}_t, \qquad (35b)$$

where $\boldsymbol{\varepsilon}_t$ contains measurement error as well as the truncation error $\boldsymbol{\nu}_t$, and as is typical, we assume that $\boldsymbol{\varepsilon}_t$ and $\boldsymbol{\eta}_t$ are mean zero Gaussian processes that are temporally independent. This model is in the same form as (1) with slightly different notation. We proceed to specify the covariance of $\boldsymbol{\varepsilon}_t$ after taking into account the truncation error:

$$\mathbf{R} = c\mathbf{I} + \sum_{i=K+1}^{K+k}\lambda_i\boldsymbol{\psi}_i\boldsymbol{\psi}_i'. \qquad (36)$$

This is a simplified version of a formulation by Berliner et al. (2000). By using the next $k$ basis functions $\boldsymbol{\psi}_i$, this formulation amounts to a second dimension reduction. If the $\boldsymbol{\psi}_i$ are orthonormal, then it is evident that (36) is an example of model (9) with $\mathbf{A}_i = \boldsymbol{\psi}_i\boldsymbol{\psi}_i'$. We assume the covariance matrix of the model error process $\boldsymbol{\eta}_t$ is diagonal, i.e., $\mathbf{Q} = \mathbf{Q}(\boldsymbol{\tau})$. This is reasonable since the spectral decomposition typically leads to decorrelation in spectral space.
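A sketch of this EOF-based reduction and of the covariance (36) (our naming; the truncation levels K and k are user choices, discussed for the PDSI data below):

```python
import numpy as np

def eof_reduction(Z, K, k):
    """Leading EOFs Psi, a crude projection of the data, and R(c) of Eq. (36).

    Z : T x n data matrix (rows are z_t'); assumes a common spatial support.
    """
    Zc = Z - Z.mean(axis=0)
    S = Zc.T @ Zc / Z.shape[0]                 # estimated spatial covariance
    lam, V = np.linalg.eigh(S)                 # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]             # reorder: dominant EOFs first
    Psi = V[:, :K]                             # basis for the reduced state a_t
    A_proj = Zc @ Psi                          # crude projection of the data
    Psi_trunc = V[:, K:K + k]                  # next k EOFs absorb truncation error
    lam_trunc = lam[K:K + k]

    def R_of_c(c):                             # Eq. (36), c the only unknown
        return c * np.eye(Z.shape[1]) + (Psi_trunc * lam_trunc) @ Psi_trunc.T

    return Psi, A_proj, lam_trunc, Psi_trunc, R_of_c
```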


Fig. 3. PDSI data (left column) and prediction (right column) for two months (7/1988 and 7/1993). Dark (open) circles correspond to negative (positive) PDSI values. The size of each circle is proportional to the magnitude of the PDSI value.

If one is interested in predicting the process $\mathbf{y}_t$ at locations for which one does not have data, then it is important to consider $\boldsymbol{\nu}_t$ explicitly as suggested by Wikle and Cressie (1999) and Cressie and Wikle (2002). However, if one is primarily interested in the dynamic process and/or its parameters ($\mathbf{a}_t$, $\mathbf{H}$, $\mathbf{Q}$) then it is simpler to consider $\boldsymbol{\nu}_t$ through its marginal covariance ($\mathbf{R}$ in this case). This is directly analogous to traditional mixed models where, if one is interested in inference on the fixed effects, then one integrates out the random effects and considers the so-called marginal formulation. However, if one is interested in predicting the random effects then one considers the random effects directly (the so-called conditional specification). In this application, we are interested in forecasting the dynamic component so it is reasonable to consider the effects of $\boldsymbol{\nu}_t$ marginally as indicated above.

Although we could use any set of orthonormal basis functions (e.g., Fourier, wavelets, empirical), we choose to use empirical orthogonal functions (EOFs) for this example, since they are widely used in meteorological studies. EOFs are meteorologists' name for the familiar principal components analysis for spatio-temporal data (see Wikle, 1996 for an overview). We obtain the EOFs, $\boldsymbol{\psi}_i$, by performing an eigenvalue decomposition of the estimated spatial covariance matrix of the data. Fig. 4 shows the percent variability accounted for by each basis function. Note the steep decline up to the 10th EOF. The remaining EOFs explain very little of the variability in the data. Therefore, we fix the truncation parameter $K$ at 10. Instead of modeling a 107-dimensional state vector, we now model a 10-dimensional state vector, which is a much easier task both statistically and computationally. Finally, we choose $k = 20$ in (36) since the next 20 EOFs account for about 10% of the variability and adding more EOFs does not add much spatial structure.


Fig. 4. Percentages (solid line) and cumulative percentages (dashed line) of total variability accounted for by the EOFs of the PDSI data.

Table 4
EM iterates for PDSI data. Θ = {µ0, H, Q(τ), R(c)}. The stopping criterion is δ_Θ = 0.001.

j     −log L^(j)   Δ log L    c^(j)     ‖H^(j)‖   ‖τ^(j)‖   ‖µ0^(j)‖   ‖Δ‖
1     70736.15     0.00       1.0000    0.98      27.55     0.00       0.000
2     64755.44     5980.72    0.6198    0.99      25.35     15.37      17.073
3     64653.15     102.29     0.5862    0.99      25.52     17.65      2.661
4     64649.81     3.34       0.5832    0.99      25.59     17.98      0.431
5     64649.16     0.65       0.5830    0.99      25.60     18.02      0.090
6     64648.98     0.18       0.5830    0.99      25.60     18.02      0.037
7     64648.93     0.05       0.5831    0.99      25.59     18.02      0.020
8     64648.91     0.01       0.5831    0.99      25.59     18.02      0.011
9     64648.91     0.00       0.5831    0.99      25.59     18.02      0.006
10    64648.91     0.00       0.5831    0.99      25.59     18.02      0.003
11    64648.91     0.00       0.5831    0.99      25.59     18.02      0.002

6.2.3. EM estimationThe parameters for model (35) are H, �, µ0 and c. The M-step update formulas are Eq. (4a), Proposition (4.2), Eq.

(4d) and Proposition (3.2), respectively. The EM iteration history is shown in Table 4. The algorithm converges muchfaster than the ECM example discussed in Section 6.1.

To assess the fit of the model, we examine plots of the predicted values of the state vector mapped back into the original space via the transformation y_t^{t−1} = Φ a_t^{t−1}. Fig. 3 shows the one-month-ahead prediction along with the observed data for each of two months. As we can see, the predictions capture the major spatial patterns fairly well. Fig. 5 shows time series plots for three stations over a 10-year period around the major Midwest flood year of 1993. In general, the model does a reasonable job of predicting the next month's PDSI values even amidst the unusual flood event. However, it is clear that there are periods during which the predictions are biased relative to the (noisy) observed data, suggesting that a state-dependent transition matrix might be a more appropriate model. The prediction standard error map (Fig. 6) shows that the prediction error is roughly of the same order across the spatial domain and for different time periods.
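
For concreteness, the back-transformation and the approximate 95% intervals underlying Figs. 5 and 6 can be computed from the one-step-ahead forecast mean a_t^{t−1} and covariance P_t^{t−1} in the reduced space. The sketch below uses our own (hypothetical) names and assumes numpy; it is an illustration, not the authors' code.

import numpy as np

def predict_original_space(a_pred, P_pred, Phi, R=None):
    # a_pred: (K,) forecast mean a_t^{t-1};  P_pred: (K, K) forecast covariance P_t^{t-1}.
    y_pred = Phi @ a_pred                      # y_t^{t-1} = Phi a_t^{t-1}
    V = Phi @ P_pred @ Phi.T                   # forecast covariance mapped to the original space
    if R is not None:                          # optionally include the marginal error covariance R
        V = V + R
    se = np.sqrt(np.diag(V))                   # prediction standard errors (cf. Fig. 6)
    return y_pred, y_pred - 1.96 * se, y_pred + 1.96 * se   # forecast and approximate 95% bounds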

Fig. 5. PDSI data (x), prediction y_t^{t−1} (solid line), and approximate 95% prediction intervals (dashed lines) for three stations (lon −102.30, lat 48.50; lon −95.90, lat 39.60; lon −92.30, lat 31.20).

Fig. 6. Prediction standard error of the PDSI data for two months (July 1988 and July 1993). Size of the circle is proportional to the value. See Fig. 3 for the corresponding data and predictions.

7. Summary and conclusion

We have proposed several parameterizations for a spatio-temporal dynamic model. The strategy is to make use of valid spatial statistical models and to make simple physical assumptions about the process, which lead to partial restrictions on the transition matrix and/or the covariance matrices. We also derive the relevant (General) EM update formulas for these restrictions. We demonstrate this methodology with a simulation study in which the true state process follows an advection–diffusion process. In addition, we apply the methodology to the problem of spatio-temporal modeling of monthly Palmer Drought Severity Index values over the central U.S.

It is important to point out that although this development was motivated by spatio-temporal problems, the parameterization/GEM approach is quite flexible, and the ideas contained in this paper can be used for other parameterizations as well. In particular, the state-space framework is useful for many multivariate time series problems (e.g., see applications in Shumway, 1988). However, the fact that we prefer closed form update formulas does put a limit on our choices for parameterizations. In addition, since there could be different parameterizations for the same data that are reasonable from a scientific perspective, it is desirable to have a consistent way of performing model selection (e.g., Bengtsson, 2000).

It is reasonable to ask what are the advantages and disadvantages of the EM/GEM approach presented here compared to a fully Bayesian approach (e.g., Wikle, 2003; Berliner et al., 2000). In the spatio-temporal dynamical model framework, the fully Bayesian approach is most useful when one has some prior understanding (either from scientific theory, or from previous empirical studies) about the process dynamics, particularly the evolution operator (propagator). For example, Wikle (2003) considered a discretized PDE-based model analogous to the advection–diffusion example in Section 6. However, in that case the ecological problem at hand suggested that the dynamics were largely controlled by spatially-varying (yet unknown) diffusion coefficients that corresponded to population spread. In that context, it was appropriate, given the relationship between heterogeneous population spread and habitat, that the diffusion parameters be spatially dependent and should thus have a spatial prior distribution. However, for a process such as Example 6.1, in which one does not expect the advection and diffusion coefficients to be spatially varying, it is certainly reasonable to assume no spatial dependence and thus estimate the relatively few parameters empirically through the EM/GEM approach outlined here (with uncertainty in the parameter estimates accounted for by bootstrapping).

In summary, when the model complexity increases, and/or when one has significant prior knowledge about the dynamics, one should use a fully Bayesian (MCMC) approach. When one has a relatively simple model and little prior knowledge, then it is reasonable to use the EM/GEM approach. However, the EM/GEM approach is of limited utility if the parameter space is high-dimensional, as the convergence of the EM/GEM algorithm is likely to be problematic. One may then consider the fully Bayesian approach, with additional parameterization at the lower levels of the model hierarchy. Of course, there still may be convergence issues in the MCMC implementation in that setting as well. In general, the approach described here is useful for relatively simple spatio-temporal state-space models that are effectively parameterized by relatively few parameters.

Acknowledgments

This research was made possible by National Science Foundation Grants ATM-0222057 and DMS-0139903. We thank the anonymous reviewers and the AE for their helpful comments on an earlier draft.

References

Bengtsson, T., 2000. Time series discrimination, signal comparison testing, and model selection in the state-space framework. Ph.D. Thesis, University of Missouri-Columbia.
Berliner, L.M., Wikle, C.K., Cressie, N., 2000. Long-lead prediction of Pacific SST via Bayesian dynamic modeling. J. Climate 13, 3953–3968.
Cressie, N.A.C., 1993. Statistics for Spatial Data, revised ed. Wiley, New York.
Cressie, N., Huang, H.-C., 1999. Classes of nonseparable, spatio-temporal stationary covariance functions. J. Amer. Statist. Assoc. 94 (448), 1330–1340.
Cressie, N., Wikle, C., 2002. Space-time Kalman filter. In: El Shaarawi, A., Piegorsch, W. (Eds.), Encyclopedia of Environmetrics, vol. 4. Wiley, New York, pp. 2045–2049.
Gneiting, T., 2002. Nonseparable, stationary covariance functions for space-time data. J. Amer. Statist. Assoc. 97, 590–600.
Gupta, N., Mehra, R., 1974. Computational aspects of maximum likelihood estimation and reduction in sensitivity function calculations. IEEE Trans. Automat. Control 19, 774–783.
Haberman, R., 1987. Elementary Applied Partial Differential Equations, second ed. Prentice-Hall, New Jersey.
Harville, D.A., 1997. Matrix Algebra from a Statistician's Perspective. Springer, New York.
He, Z., Sun, D., 2000. Hierarchical Bayes estimation of hunting success rates with spatial correlations. Biometrics 56, 360–367.
Heim Jr., R.R., 2002. A review of twentieth-century drought indices used in the United States. Bull. Amer. Meteorol. Soc. 83, 1149–1165.
Huerta, G., Sanso, B., Stroud, J., 2004. A spatiotemporal model for Mexico City ozone levels. J. Roy. Statist. Soc. Ser. C 53, 231–248.
Kalman, R.E., 1960. A new approach to linear filtering and prediction problems. J. Basic Eng. 82 (D), 35–45.
Kyriakidis, P.C., Journel, A.G., 1999. Geostatistical space-time models: a review. Math. Geol. 31 (6), 651–684.
Lange, K., 1999. Numerical Analysis for Statisticians. Springer, New York.
Mardia, K., Goodall, C., Redfern, E., Alonso, F., 1998. The kriged Kalman filter. Test 7, 217–285 (with discussion).
McLachlan, G.J., Krishnan, T., 1997. The EM Algorithm and Extensions. Wiley, New York.
Shumway, R., 1988. Applied Statistical Time Series Analysis. Prentice-Hall, Englewood Cliffs, NJ.
Shumway, R.H., Stoffer, D.S., 1982. An approach to time series smoothing and forecasting using the EM algorithm. J. Time Ser. Anal. 3 (4), 253–264.
Shumway, R.H., Stoffer, D.S., 2000. Time Series Analysis and its Applications. Springer, New York.
Stein, M., 2005. Space-time covariance functions. J. Amer. Statist. Assoc. 100 (469), 310–321.
Stoffer, D., Wall, K., 1991. Bootstrapping state-space models: Gaussian maximum likelihood estimation and the Kalman filter. J. Amer. Statist. Assoc. 86, 1024–1033.
Stroud, J., Mueller, P., Sanso, B., 2001. Dynamic models for spatio-temporal data. J. Roy. Statist. Soc. Ser. B 63, 673–689.
Sun, D., Tsutakawa, R.K., Speckman, P.L., 2000. Bayesian inference for CAR(1) models with noninformative priors. Biometrika 86, 341–350.
Tanner, M.A., 1996. Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. Springer, New York.
Wall, K., Stoffer, D., 2002. A state space approach to bootstrapping conditional forecasts in ARMA models. J. Time Ser. Anal. 23, 733–751.
West, M., Harrison, J., 1997. Bayesian Forecasting and Dynamic Models, second ed. Springer, New York.
Wikle, C.K., 1996. Spatio-temporal statistical models with applications to atmospheric processes. Ph.D. Thesis, Iowa State University.
Wikle, C.K., 2002. Spatial modeling of count data: a case study in modelling breeding bird survey data on large spatial domains. In: Lawson, A.B., Denison, D.G.T. (Eds.), Spatial Cluster Modelling. Chapman & Hall, London, pp. 199–209.
Wikle, C.K., 2003. Hierarchical Bayesian models for predicting the spread of ecological processes. Ecology 84, 1382–1394.
Wikle, C.K., Cressie, N., 1999. A dimension-reduced approach to space-time Kalman filtering. Biometrika 86 (4), 815–829.
Wikle, C.K., Berliner, L.M., Cressie, N., 1998. Hierarchical Bayesian space-time models. J. Environ. Ecol. Statist. 5, 117–154.
Wikle, C.K., Milliff, R., Nychka, D., Berliner, L., 2001. Spatiotemporal hierarchical Bayesian modeling: tropical ocean surface winds. J. Amer. Statist. Assoc. 96, 382–397.