Title
Model-based Geostatistics
Presented by Tonglin Zhang
Department of Statistics
Purdue University
November 18, 2014
Tonglin Zhang, Department of Statistics, Purdue University Model-based Geostatistics
References
Major References
- Diggle, P.J., Tawn, J.A., Moyeed, R.A. (1998). Model-based geostatistics. Applied Statistics, 47, 299-350.
- Diggle, P.J., Ribeiro, P.J. (2007). Model-based geostatistics. Springer.
Minor References
- Elliott, P., Wakefield, J., Best, N., Briggs, D. (2000). Spatial Epidemiology, Chapters 6 & 7. Oxford University Press.
- Agresti, A. (2002). Categorical Data Analysis. Wiley.
Outline
- Geostatistical Model
- Generalized Linear Mixed-Effect Model
- Moment Formulae
- Examples
- Generalized Linear Prediction
- Estimation
Geostatistical Model
Spatial Gaussian Process
A process $Z(s)$ on $\mathbb{R}^d$ is a Gaussian process if $z = (Z(s_1), \ldots, Z(s_n))$ is a multivariate normal random vector for any distinct $s_1, \ldots, s_n \in \mathbb{R}^d$. It is often assumed that
- $E[Z(s)] = 0$.
- $\mathrm{Cov}[Z(s), Z(s+h)] = c_{\theta_z}(\|h\|)$, where $c_{\theta_z}$ is a parametric covariance family.
- Then, we can write $c_{\theta_z}(\|h\|) = \tau_1^2 \rho_{\theta_z}(\|h\|)$, where $\rho_{\theta_z}$ is a parametric correlation function.
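As a concrete illustration of a parametric covariance family, the sketch below uses the exponential correlation function $\rho(h) = e^{-h/\phi}$. The choice of this family, the range parameter `phi`, and the site coordinates are assumptions made for the example; the slides do not fix a particular family.

```python
import math

def exp_correlation(h, phi):
    """Exponential correlation rho(h) = exp(-h / phi) for a distance h >= 0."""
    return math.exp(-h / phi)

def covariance_matrix(sites, tau1_sq, phi):
    """Matrix with entries tau1_sq * rho(||s_i - s_j||) for the given sites."""
    return [[tau1_sq * exp_correlation(math.dist(si, sj), phi) for sj in sites]
            for si in sites]

# Three hypothetical sites in the plane.
sites = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
C = covariance_matrix(sites, tau1_sq=2.0, phi=1.5)
```

The diagonal of `C` equals `tau1_sq`, and off-diagonal entries decay with the distance between sites.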
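The geostatistical model is often proposed as
$$Y(s) = x(s)\beta + Z(s) + \epsilon(s),$$
where $x(s)$ is the vector of explanatory variables, $Z(s)$ is a Gaussian random field, and $\epsilon(s)$ is the white noise error term (i.e., $\epsilon(s_1), \ldots, \epsilon(s_n)$ are iid $N(0, \tau_2^2)$ for distinct $s_1, \ldots, s_n$). Let $\sigma^2 = \tau_1^2 + \tau_2^2$ and
$$\theta = \left(\frac{\tau_1^2}{\sigma^2}, \theta_z\right).$$
Then, the correlation between $Y(s)$ and $Y(s')$ is completely determined by $\theta$, and the covariance by $\theta$ together with $\sigma^2$.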
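The role of the ratio $\tau_1^2/\sigma^2$ can be seen by building the correlation matrix of the responses. The sketch below assumes an exponential correlation function with a hypothetical range parameter `phi` and calls the ratio `nu`; both names are illustrative and not from the slides.

```python
import math

def response_correlation(sites, nu, phi):
    """Cor(Y(s_i), Y(s_j)): 1 on the diagonal, nu * exp(-dist / phi) elsewhere."""
    n = len(sites)
    R = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                R[i][j] = nu * math.exp(-math.dist(sites[i], sites[j]) / phi)
    return R

tau1_sq, tau2_sq = 3.0, 1.0          # made-up variance components
sigma_sq = tau1_sq + tau2_sq         # total variance
nu = tau1_sq / sigma_sq              # first component of theta
R = response_correlation([(0, 0), (1, 0), (3, 4)], nu, phi=2.0)
```

Because of the white noise term, the correlation between two distinct sites never exceeds `nu`, however close the sites are.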
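Let
$$y = \begin{pmatrix} Y(s_1) \\ \vdots \\ Y(s_n) \end{pmatrix}, \qquad X = \begin{pmatrix} x(s_1) \\ \vdots \\ x(s_n) \end{pmatrix}.$$
In matrix form,
$$y = X\beta + z + \epsilon.$$
Then,
$$y \sim N(X\beta, \sigma^2 R),$$
where $R = \mathrm{Cor}(y)$. The kriging prediction of $Y(s_0)$ is
$$Y^*(s_0) = E(Y(s_0) \mid y) = x(s_0)\beta + c_0' R^{-1}(y - X\beta),$$
where $c_0 = \mathrm{Cor}(y, Y(s_0))$.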
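A minimal sketch of the kriging predictor with known parameters, assuming NumPy is available and an exponential correlation with nugget ratio `nu` and range `phi` (hypothetical names). The sites, covariates, and responses are made up for illustration.

```python
import numpy as np

def corr(A, B, nu, phi):
    """Cor(Y(a), Y(b)) for rows a of A and b of B: nu * exp(-dist / phi), 1 at distance 0."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.where(d == 0.0, 1.0, nu * np.exp(-d / phi))

def krige(s0, x0, sites, X, y, beta, nu, phi):
    """x(s0) beta + c0' R^{-1} (y - X beta), using a linear solve instead of an inverse."""
    R = corr(sites, sites, nu, phi)                # n x n correlation of y
    c0 = corr(sites, s0[None, :], nu, phi)[:, 0]   # Cor(y, Y(s0))
    return x0 @ beta + c0 @ np.linalg.solve(R, y - X @ beta)

sites = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
X = np.column_stack([np.ones(4), sites[:, 0]])     # intercept and first coordinate
beta = np.array([1.0, 0.5])
y = np.array([1.3, 1.2, 0.7, 2.4])
pred = krige(np.array([0.5, 0.5]), np.array([1.0, 0.5]), sites, X, y, beta, nu=0.8, phi=1.0)
```

Two sanity checks follow from the formula: at an observed site the predictor returns the observed value, and far from all sites it falls back to the trend $x(s_0)\beta$.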
Estimation
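The log-likelihood function is
$$\ell(\beta, \sigma^2, \theta) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2}\log|\det(R_\theta)| - \frac{1}{2\sigma^2}(y - X\beta)' R_\theta^{-1} (y - X\beta).$$
Given $\theta$, the maximizers are
$$\hat\beta = \hat\beta_\theta = (X' R_\theta^{-1} X)^{-1} X' R_\theta^{-1} y,$$
$$\hat\sigma^2 = \hat\sigma^2_\theta = \frac{1}{n}\left[y' R_\theta^{-1} y - y' R_\theta^{-1} X (X' R_\theta^{-1} X)^{-1} X' R_\theta^{-1} y\right].$$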
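The two closed-form estimates for a fixed $\theta$ can be sketched as follows, assuming NumPy. The function name `gls_estimates` and the toy data are illustrative; the variance estimate is computed from the residual quadratic form, which is algebraically equal to the bracketed expression above.

```python
import numpy as np

def gls_estimates(y, X, R):
    """Return (beta_hat, sigma_sq_hat) that maximise the log-likelihood for a fixed R."""
    Rinv_X = np.linalg.solve(R, X)   # R^{-1} X without forming the inverse
    Rinv_y = np.linalg.solve(R, y)   # R^{-1} y
    beta_hat = np.linalg.solve(X.T @ Rinv_X, X.T @ Rinv_y)
    resid = y - X @ beta_hat
    sigma_sq_hat = resid @ np.linalg.solve(R, resid) / len(y)
    return beta_hat, sigma_sq_hat

# Toy data (made up for illustration): 5 sites on a line, exponential correlation.
coords = np.array([0.0, 1.0, 2.5, 4.0, 6.0])
R = np.exp(-np.abs(coords[:, None] - coords[None, :]) / 2.0)
X = np.column_stack([np.ones(5), coords])
y = np.array([1.1, 1.9, 3.2, 4.1, 6.3])
beta_hat, sigma_sq_hat = gls_estimates(y, X, R)
```

With $R$ equal to the identity matrix, these reduce to ordinary least squares and the residual sum of squares divided by $n$.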
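Then, $\theta$ can be estimated by maximizing the profile log-likelihood
$$\ell_P(\theta) = -\frac{n}{2}\left[1 + \log\left(\frac{2\pi}{n}\right)\right] - \frac{1}{2}\log|\det(R_\theta)| - \frac{n}{2}\log(y' M_\theta y),$$
where
$$M_\theta = R_\theta^{-1} - R_\theta^{-1} X (X' R_\theta^{-1} X)^{-1} X' R_\theta^{-1}.$$
- Since the maximizer has no closed form, numerical methods are often used.
- Each evaluation requires the determinant and inverse of the $n \times n$ matrix $R_\theta$, which is expensive if $n$ is large.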
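One simple numerical method is a grid search over $\theta$. The sketch below assumes NumPy, an exponential correlation with nugget ratio `nu` and range `phi` (hypothetical parameterisation), and simulated data; a real analysis would more likely use a dedicated optimiser.

```python
import numpy as np

def profile_loglik(y, X, R):
    """Profile log-likelihood l_P(theta) for the correlation matrix R = R_theta."""
    n = len(y)
    Rinv_X = np.linalg.solve(R, X)
    Rinv_y = np.linalg.solve(R, y)
    # y' M y, with M = R^{-1} - R^{-1} X (X' R^{-1} X)^{-1} X' R^{-1}
    yMy = y @ Rinv_y - (X.T @ Rinv_y) @ np.linalg.solve(X.T @ Rinv_X, X.T @ Rinv_y)
    _, logdet = np.linalg.slogdet(R)
    return -0.5 * n * (1 + np.log(2 * np.pi / n)) - 0.5 * logdet - 0.5 * n * np.log(yMy)

def corr_matrix(sites, nu, phi):
    """nu * exp(-dist / phi) off the diagonal, 1 on it (sites assumed distinct)."""
    d = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)
    R = nu * np.exp(-d / phi)
    np.fill_diagonal(R, 1.0)
    return R

# Simulated data with unit variance, purely for illustration.
rng = np.random.default_rng(0)
sites = rng.uniform(0, 10, size=(40, 2))
X = np.column_stack([np.ones(40), sites[:, 0]])
R_true = corr_matrix(sites, nu=0.8, phi=2.0)
y = X @ np.array([1.0, 0.5]) + np.linalg.cholesky(R_true) @ rng.standard_normal(40)

grid = [(nu, phi) for nu in (0.2, 0.5, 0.8) for phi in (0.5, 1.0, 2.0, 4.0)]
nu_hat, phi_hat = max(grid, key=lambda t: profile_loglik(y, X, corr_matrix(sites, *t)))
```

Every grid point needs a fresh factorisation of an $n \times n$ matrix, which is the cost the second bullet refers to.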
Generalized Linear Mixed-Effect Model
Non-Gaussian Data
In applications, the response may be a count, so the data are non-Gaussian. Most often $Y(s_i)$ follows either a binomial or a Poisson distribution, and the spatially correlated effect is modelled by a Gaussian random field.
Binomial Data
Assume $Y(s) \sim \mathrm{Bin}(n(s), p(s))$. Then, a logistic model-based geostatistical model is
$$\log\frac{p(s)}{1 - p(s)} = x(s)\beta + Z(s),$$
where $Z(s)$ is a Gaussian random field. It is often assumed that, given $(p(s_1), \ldots, p(s_n))$, the components of $y = (Y(s_1), \ldots, Y(s_n))$ are independent. This is called the conditional independence assumption.
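Conditional independence means that, once the field values are fixed, each site's count can be drawn separately. A small simulation sketch under that assumption; the values of `xb` (standing in for $x(s_i)\beta$) and `z` are made up.

```python
import math
import random

def inv_logit(eta):
    """Map a linear predictor eta to a probability p = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def simulate_binomial(n_trials, xb, z, rng):
    """Given the field values z, draw Y_i ~ Bin(n_i, p_i) independently across sites."""
    counts = []
    for n_i, xb_i, z_i in zip(n_trials, xb, z):
        p_i = inv_logit(xb_i + z_i)
        # A binomial draw is the number of successes in n_i Bernoulli(p_i) trials.
        counts.append(sum(rng.random() < p_i for _ in range(n_i)))
    return counts

rng = random.Random(42)
n_trials = [10, 20, 15]
xb = [-0.5, 0.0, 0.8]   # hypothetical values of x(s_i) beta
z = [0.3, -0.2, 0.1]    # hypothetical realisation of Z(s_i)
y = simulate_binomial(n_trials, xb, z, rng)
```

Marginally the counts are still dependent, because nearby values of $Z(s_i)$ are correlated.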
Poisson Data
Assume $Y(s) \sim \mathrm{Poisson}(\lambda(s))$. Then, a log-linear model-based geostatistical model is
$$\log\lambda(s) = x(s)\beta + Z(s).$$
Conditional independence is also often assumed.
Moment Formulae
Agresti (2002, pages 563-564) presents useful mean, variance, and covariance formulae. We have
$$E(Y(s)) = E[E(Y(s) \mid Z(s))],$$
$$V(Y(s)) = E[V(Y(s) \mid Z(s))] + V[E(Y(s) \mid Z(s))],$$
and
$$\mathrm{Cov}(Y(s), Y(s')) = E[\mathrm{Cov}(Y(s), Y(s') \mid Z(s), Z(s'))] + \mathrm{Cov}[E(Y(s) \mid Z(s)), E(Y(s') \mid Z(s'))].$$
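Under the conditional independence assumption, the covariance formula simplifies for distinct locations. The short step below uses only the formulae above:

```latex
% For s \neq s', conditional independence gives
\operatorname{Cov}\bigl(Y(s), Y(s') \mid Z(s), Z(s')\bigr) = 0,
% so the covariance formula reduces to
\operatorname{Cov}\bigl(Y(s), Y(s')\bigr)
  = \operatorname{Cov}\bigl[\,E(Y(s) \mid Z(s)),\; E(Y(s') \mid Z(s'))\,\bigr].
```

All marginal dependence between responses at different sites therefore comes from the random field.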
Specification: Poisson Model
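For the Poisson model, writing $\sigma^2$ for the variance of $Z(s)$, we have
$$E(Y(s)) = e^{x(s)\beta + \sigma^2/2},$$
$$V(Y(s)) = e^{x(s)\beta + \sigma^2/2} + e^{2x(s)\beta}\left(e^{2\sigma^2} - e^{\sigma^2}\right),$$
and, for $s \neq s'$,
$$\mathrm{Cov}(Y(s), Y(s')) = e^{x(s)\beta + x(s')\beta}\,\mathrm{Cov}\left(e^{Z(s)}, e^{Z(s')}\right) = e^{x(s)\beta + x(s')\beta}\left[e^{\sigma^2}\left(e^{\sigma^2\rho_\theta(\|s - s'\|)} - 1\right)\right].$$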
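The three Poisson moment formulae translate directly into code. In this sketch `xb` stands for the scalar $x(s)\beta$ and `rho` for the correlation of the field between two sites; the function names are illustrative.

```python
import math

def poisson_mean(xb, sigma_sq):
    """E(Y(s)) = exp(x(s) beta + sigma^2 / 2)."""
    return math.exp(xb + sigma_sq / 2)

def poisson_var(xb, sigma_sq):
    """V(Y(s)): the Poisson part plus the part induced by the random field."""
    field_part = math.exp(2 * xb) * (math.exp(2 * sigma_sq) - math.exp(sigma_sq))
    return poisson_mean(xb, sigma_sq) + field_part

def poisson_cov(xb1, xb2, sigma_sq, rho):
    """Cov(Y(s), Y(s')) for two distinct sites whose field values have correlation rho."""
    return math.exp(xb1 + xb2) * math.exp(sigma_sq) * (math.exp(sigma_sq * rho) - 1)

mean = poisson_mean(0.5, 0.4)
var = poisson_var(0.5, 0.4)
cov = poisson_cov(0.5, 0.5, 0.4, rho=0.6)
```

The variance always exceeds the mean when $\sigma^2 > 0$ (overdispersion), and the covariance is zero when `rho` is zero.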
Examples
Tree Locations: Ambrosia Dumosa Plants
- The Ambrosia dumosa data consist of the locations and several important measurements of 4358 Ambrosia dumosa plants in a square area in the Colorado Desert in 1984.
- Other measurements are
  - the height of the plant canopy;
  - the length of the major axis of the plant canopy;
  - the length of the minor axis of the plant canopy;
  - the volume of the plant canopy.
[Figure: Ambrosia dumosa locations in 1984 in the Colorado Desert. Both axes range from 0 to 100.]
Generalized Linear Mixed-Effect Model
We model the locations of the plants by
$$\log\frac{p(s)}{1 - p(s)} = \alpha + Z(s),$$
where $p(s)$ is the intensity function.
Disease Mapping: Cancer Clusters
Assume an area is partitioned into $m$ units. For $i = 1, \ldots, m$, $y(s_i)$ is the disease count and $\xi_i$ is the at-risk population of unit $i$.
[Figure: Male Colorectal Cancer Rate (per 100,000) in Indiana 2003-2007. Legend classes: 56-91, 95-103, 104-114, 115-129, 130-151, 154-258. The map labels Marion and has a 40-mile scale bar.]
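We may assume
$$y(s_i) \sim \mathrm{Poisson}(\xi_i\theta_i)$$
with
$$\log\theta_i = \mu + Z(s_i),$$
so that the log of the expected count is $\mu + \log(\xi_i) + Z(s_i)$, with $\log(\xi_i)$ acting as an offset.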
Extensions in Poisson Case
The first model below is the baseline; the other three are extensions:
- Stationary models: $Z(s)$ is a stationary Gaussian random field.
- CAR (conditional autoregressive) model: the variance of $Z(s_i)$ is inversely proportional to $\xi_i$.
- SAR (spatial autoregressive) model: the distribution of $Z(s)$ depends only on its neighbours.
- Quasi-Poisson model: the expected value of $Y(s)$ depends only on $\beta$, not on $Z(s)$.
Note: CAR and SAR models are not typical geostatistical approaches.
Generalized Linear Prediction
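For any unobserved point $s_0$, the prediction of the response is
$$\hat{y}(s_0) = g(x(s_0)\beta + \hat{Z}(s_0)),$$
where $g$ maps the linear predictor to the mean (the inverse of the link function) and $\hat{Z}(s_0)$ is the prediction of $Z(s_0)$. To compute $\hat{Z}(s_0)$, it is recommended to use
$$\hat{Z}(s_0) = E[Z(s_0) \mid y].$$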
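Once a predicted field value is available, the plug-in step is a one-liner. The sketch below assumes the predicted value `z0_hat` has already been obtained by some other means (for example from posterior draws); it only illustrates applying the inverse link, and the dictionary of links is an illustrative choice.

```python
import math

# Inverse link functions for the two models discussed: log link and logit link.
INVERSE_LINKS = {
    "log": math.exp,
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),
}

def predict_response(x0, beta, z0_hat, link="log"):
    """Plug-in prediction g(x(s0) beta + z0_hat), with g the inverse link."""
    eta = sum(xj * bj for xj, bj in zip(x0, beta)) + z0_hat
    return INVERSE_LINKS[link](eta)

# Hypothetical covariates, coefficients, and predicted field value.
rate = predict_response([1.0, 2.0], [0.2, 0.1], z0_hat=0.3, link="log")
prob = predict_response([1.0, 2.0], [0.2, 0.1], z0_hat=0.3, link="logit")
```

For the binomial model the result is the predicted probability $p(s_0)$; multiplying by $n(s_0)$ gives a predicted count.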
Estimation
Likelihood Function
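Let $Y_i = Y(s_i)$, $x_i = x(s_i)$, and $Z_i = Z(s_i)$. Write $f_i(Y_i \mid Z_i, \beta, \theta, \sigma^2)$ for the PMF or PDF at location $s_i$. The conditional likelihood function is
$$L(\beta, \theta, \sigma^2 \mid z) = \prod_{i=1}^{n} f_i(Y_i \mid Z_i, \beta, \theta, \sigma^2).$$
The marginal likelihood function is
$$L(\beta, \theta, \sigma^2) = \int_{\mathbb{R}^n} L(\beta, \theta, \sigma^2 \mid z)\, g(z \mid \theta, \sigma^2)\, dz,$$
where $g(z \mid \theta, \sigma^2)$ is the density of $z$. This is hard to compute. Therefore, an MCMC (Markov chain Monte Carlo) algorithm is used.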
Specification: Poisson Example
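- The conditional PMF is
$$f_i(Y_i \mid Z_i, \beta, \theta, \sigma^2) = \frac{1}{Y_i!}\, e^{Y_i(x_i\beta + Z_i) - e^{x_i\beta + Z_i}}.$$
- The conditional likelihood function is
$$L(\beta, \theta, \sigma^2 \mid z) = \prod_{i=1}^{n} \frac{1}{Y_i!}\, e^{Y_i(x_i\beta + Z_i) - e^{x_i\beta + Z_i}}.$$
- The likelihood function is
$$L(\beta, \theta, \sigma^2) = \int_{\mathbb{R}^n} \left[\prod_{i=1}^{n} \frac{1}{Y_i!}\, e^{Y_i(x_i\beta + Z_i) - e^{x_i\beta + Z_i}}\right] \frac{1}{(2\pi\sigma^2)^{n/2} |\det(R_\theta)|^{1/2}}\, e^{-\frac{1}{2\sigma^2} z' R_\theta^{-1} z}\, dz.$$
This integral has no closed form.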
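To see what evaluating this integral involves, consider the simplest case of a single site ($n = 1$), where the integral is one-dimensional and a plain Monte Carlo average can be checked against quadrature. This is only an illustrative sketch under that simplification; the function names and parameter values are made up, and for realistic $n$ neither approach scales well.

```python
import math
import random

def poisson_pmf_given_z(y, xb, z):
    """Conditional PMF (1 / y!) exp(y (xb + z) - exp(xb + z))."""
    return math.exp(y * (xb + z) - math.exp(xb + z) - math.lgamma(y + 1))

def marginal_mc(y, xb, sigma_sq, draws, rng):
    """Monte Carlo estimate: average the conditional PMF over z ~ N(0, sigma_sq)."""
    sd = math.sqrt(sigma_sq)
    return sum(poisson_pmf_given_z(y, xb, rng.gauss(0.0, sd)) for _ in range(draws)) / draws

def marginal_quadrature(y, xb, sigma_sq, half_width=8.0, steps=4000):
    """Trapezoidal rule over z in [-half_width * sd, half_width * sd]."""
    sd = math.sqrt(sigma_sq)
    h = 2 * half_width * sd / steps
    total = 0.0
    for k in range(steps + 1):
        z = -half_width * sd + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        density = math.exp(-z * z / (2 * sigma_sq)) / math.sqrt(2 * math.pi * sigma_sq)
        total += weight * poisson_pmf_given_z(y, xb, z) * density * h
    return total

mc = marginal_mc(3, 1.0, 0.5, draws=20000, rng=random.Random(0))
quad = marginal_quadrature(3, 1.0, 0.5)
```

With $n$ sites the quadrature grid grows exponentially in $n$, which is why simulation-based methods are used instead.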
Specification: MCMC algorithm
- Generate $\theta$, $\beta$, and $\sigma^2$ from their prior distributions.
- Generate $z$ from $N(0, \sigma^2 R_\theta)$.
- Compute the conditional likelihood function $L(\beta, \theta, \sigma^2 \mid z)$.
- The posterior is proportional to the product of the conditional likelihood and the priors. The posterior means of $\sigma^2$, $\beta$, and $\theta$ can then be approximated by averaging the generated parameter values with weights given by the conditional likelihood.
- Disadvantage: convergence is slow.
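Read literally, the steps above amount to sampling from the prior and weighting by the conditional likelihood, which is closer to importance sampling than to a full Markov chain sampler. The sketch below illustrates that scheme under strong simplifying assumptions that are not from the slides: an intercept-only Poisson model, a fixed $\sigma^2$, $R_\theta$ equal to the identity (so the $Z_i$ are independent), a normal prior on $\beta$, and made-up counts.

```python
import math
import random

def log_cond_lik(y, beta, z):
    """Log of prod_i Poisson(y_i | exp(beta + z_i)) for an intercept-only model."""
    return sum(yi * (beta + zi) - math.exp(beta + zi) - math.lgamma(yi + 1)
               for yi, zi in zip(y, z))

def prior_weighted_posterior_mean(y, draws, rng, sigma_sq=0.25, prior_sd=2.0):
    """Approximate the posterior mean of beta by likelihood-weighted prior draws."""
    betas, log_w = [], []
    for _ in range(draws):
        beta = rng.gauss(0.0, prior_sd)                        # draw beta from its prior
        z = [rng.gauss(0.0, math.sqrt(sigma_sq)) for _ in y]   # draw z (assumes R = I)
        betas.append(beta)
        log_w.append(log_cond_lik(y, beta, z))                 # conditional log-likelihood
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]                     # subtract max for stability
    return sum(wi * bi for wi, bi in zip(w, betas)) / sum(w)   # weighted average

y = [3, 5, 4, 6, 2]   # made-up counts
beta_mean = prior_weighted_posterior_mean(y, draws=20000, rng=random.Random(1))
```

Most prior draws receive negligible weight, which is one way to see why convergence is slow; practical samplers such as Metropolis-Hastings propose new values near the current state instead.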
Summary
- Model-based geostatistics can be used to analyze non-Gaussian data.
- Estimation of parameters is difficult because the marginal likelihood is an $n$-dimensional integral with no closed form.
- Generalized linear prediction is used to compute the predicted response.