This article was downloaded by: [University of Connecticut]
On: 09 June 2014, At: 08:40
Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK
Journal of Spatial Science
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/tjss20
Mapping soil organic matter with limited sample data using geographically weighted regression
K. Wang (a), C.R. Zhang (b), W.D. Li (b), J. Lin (b) & D.X. Zhang (b)
(a) Department of Geographical Science, Minjiang University, Fuzhou, Fujian 350108, China
(b) Department of Geography and Center for Environmental Sciences and Engineering, University of Connecticut, Storrs, CT, USA
Published online: 23 Jul 2013.
To cite this article: K. Wang, C.R. Zhang, W.D. Li, J. Lin & D.X. Zhang (2014) Mapping soil organic matter with limited sample data using geographically weighted regression, Journal of Spatial Science, 59:1, 91-106, DOI: 10.1080/14498596.2013.812024
To link to this article: http://dx.doi.org/10.1080/14498596.2013.812024
Mapping soil organic matter with limited sample data using geographically weighted regression
K. Wang (a), C.R. Zhang (b)*, W.D. Li (b), J. Lin (b) and D.X. Zhang (b)
(a) Department of Geographical Science, Minjiang University, Fuzhou, Fujian 350108, China; (b) Department of Geography and Center for Environmental Sciences and Engineering, University of Connecticut, Storrs, CT, USA
The spatial information of soil organic matter (SOM) is crucial for precision agriculture and environmental modeling. It is, however, difficult to obtain the regional details of SOM by dense sampling due to the high cost. Although a variety of interpolation methods are available for mapping SOM at regional scales, accurate prediction usually needs densely distributed samples and requires the interpolated variable to meet some constraints such as spatial stationarity. This paper introduces the Geographically Weighted Regression (GWR) technique as an alternative approach for SOM mapping. We interpolated the spatial distribution of SOM based on a limited number of samples with the incorporation of multiple independent variables. We also compared GWR with the ordinary least squares regression approach in mapping SOM. Results indicated that GWR could capture more local details and improve the prediction accuracy. However, more attention should be paid to the selection of independent variables.
Keywords: soil organic matter; mapping; geographically weighted regression; ordinary least squares; environmental modeling
1. Introduction
Soil is a complex natural body, and it is jointly
influenced by many factors, such as topography,
vegetation, time, climate, parent materials,
and human factors. Soil organic matter (SOM)
varies continuously over space (Bai et al. 2005;
Huang et al. 2007) and is influenced by soil
formation factors. The spatial distribution
information of SOM in a certain region is
important because it regulates the functioning
of local ecosystems and the health of soils,
therefore having strong effects on agricultural
productivity and global climate change. SOM is
an important indicator of soil fertility and
quality as well, having great influence on soil
physical, chemical, and biological processes. It
is, however, difficult to acquire detailed SOM
data for a given region at low cost. The high cost
of collecting SOM data through dense sampling
across landscapes has created a need for
methods of inferring spatial distribution infor-
mation of SOM such as interpolation methods.
It is possible to obtain spatial distribution data
of SOM in a given region by interpolating
existing sample data (Mabit et al. 2008;
Sumfleth & Duttmann 2008). A number of
methods can be used to interpolate SOM from
discrete sampling points into a spatially
continuous surface map. These methods
include linear regression (e.g. Moore et al.
1993; Gessler et al. 1995), inverse distance
weighting (IDW) (e.g. Bregt et al. 1992),
© 2013 Mapping Sciences Institute, Australia and Surveying and Spatial Sciences Institute
*Corresponding author. Email: [email protected]
various kriging methods (e.g. Stein & Corsten
1991; Baxter & Oliver 2005; Bishop & Lark
2006; Robinson & Schumacker 2006; Nerini
et al. 2010), generalised linear models (e.g.
McKenzie & Austin 1993; Dobson & Barnett,
2008), and regression trees (e.g. Pachepsky
et al. 2001). To account for the fact that soil
attributes are related to soil processes and
landscape variables, the ordinary least squares
based multiple linear regression (MLR) has
been used to predict the variation of soil
attributes using a variety of environmental
variables (McKenzie & Austin 1993; Gessler
et al. 1995). Results from several investigations
suggested that regression might give more
precise prediction than cokriging in some
circumstances (e.g. Lesch et al. 1995; Hughson
et al. 1996; Dungan, 1998). In addition, MLR
might have advantages over cokriging where
the second-order component of the intrinsic
hypothesis of stationarity fails (Lark 2000).
However, MLR is a non-spatial statistical
method. One of its assumptions – indepen-
dence of observations – is often unrealistic
due to the existence of autocorrelations in
spatial data, which leads to a biased
estimation of standard errors of model
parameters and, consequently, misleads
significance tests. Instead of showing the
variation of the relationships between soil
attributes and soil process variables, MLR
can only provide an ‘average’ of the
relationships. Recently, an approach named
Geographically Weighted Regression (GWR)
has been used to estimate the relationships
among spatially non-stationary variables
(Leung et al. 2000; Brunsdon et al. 2002;
Tu & Xia 2008; Propastin 2009; Young et al.
2009). GWR is a ‘local’ regression procedure
that was developed to deal with non-
stationarity issues faced by MLR (Fother-
ingham et al. 2002). The major advantage of
GWR is that it can incorporate a number of
correlated environmental variables while
taking the local variation of the primary
variable and secondary variables (through
local correlation coefficients) into account.
However, to our knowledge, studies using
GWR to predict the spatial distribution of soil
attributes are still rare (Mishra et al. 2010).
The objective of this study is to explore the
feasibility of using GWR as an alternative spatial
interpolation method to predict the spatial
distribution of SOM at a regional scale by
accounting for soil process variables, particularly
whether the approach is feasible for sparse
sample data. The commonly used MLR method
serves as the baseline for comparison.
2. Data and methods
Study area
The study area is located in Longyan, Fujian
province, China (115°10′–115°40′E, 26°0′–26°40′N)
(Figure 1), covering a total area of
1260 km². The elevation of this area decreases
from the north, east, and west towards the south
and centre, with altitude ranging from 1200 m
to 140 m. It has a subtropical monsoon humid
climate with abundant but seasonally uneven
rainfall. The annual mean precipitation is
1600 mm but over half falls during the wet
season (April–June) when storms are frequent.
Drought conditions occur in autumn, as both
frequency and amount of precipitation decrease
significantly. The annual mean temperature is
19.1°C, the coldest temperature is 3.4°C in
January, and the hottest is 35.4°C in July. The
main geomorphic types are river valley alluvial
plains, red earthy hills, red sandstone hills, and
granite and metamorphic rock mountains. The
area is dominated by red soil (Humic Acrisols)
and paddy soil (Stagnic Anthrosols), and dotted
with yellow soil (Ferralic Cambisols) and
purple soil (Eutric-Rhodic Cambisols).
Data
To predict the spatial distribution of SOM using
the GWR approach, environmental variables,
which are factors influencing formation and
decomposition of SOM and are assumed to be
independent of each other, should be determined
first. There are many factors contributing to
SOM in soils, including natural factors and
anthropogenic factors, such as land use, tillage,
irrigation, and fertilisation. According to the
theory of pedogenesis proposed by V. V.
Dokuchaev, soil can be derived from interaction
of five factors: topography, vegetation, climate,
soil parent materials, and time. Since SOM is
an important soil constituent, these
factors also influence the spatial distribution of
SOM. However, because climate and time often
exert control at larger scales than those of
interest here, these two factors were not
considered in this study. We selected independent
environmental variables from the soil formation
factors of topography, vegetation, and soil
parent materials.
Field sample data
Surface soil samples from 0 to 20 cm horizon
were collected at the randomly selected sites in
the study region during May 2004. Random
selection was performed by using a random-
number generator to generate the coordinates
of a number of sample locations. Surveyors
went to the field to find the selected sampling
sites using a GPS. Sometimes an approximate
location had to be chosen in a field if the
specified location was inaccessible. The
exact coordinates and elevation of each sampled
location were recorded using the GPS when the
sample was taken. The total number of the
selected sample sites is 110 (Figure 1). Sample
sites were mainly in paddy lands (30 samples),
open forest lands (35 samples), dry farm lands
(19 samples), and forest lands (14 samples).
One thousand grams of mixed soil at each site
were collected and then processed in the
laboratory. SOM was measured using the
potassium dichromate volumetric method after
air drying, grinding, and sieving.
Environmental variable data
Environmental indicators used in this study
include elevation, slope, normalised difference
vegetation index (NDVI), distances from
Figure 1. The study area and sampling sites
sample points to water body, soil erosion
intensity, and ferrous minerals index. We
derived these data from several collected
maps, including a digital elevation model
(DEM) (grid format, 30 × 30 m resolution)
from the Fujian Provincial Geomatics Center,
an ETM+ remotely sensed satellite image
(17 June 2000), and a soil erosion intensity
map (shapefile format). The elevation data,
which range from 140 m to 1200 m, were
acquired directly from the DEM. The slope and
aspect data were also calculated from the DEM
using the Spatial Analyst module of ArcGIS.
Slope is expressed in degrees and ranges
from 0 to 70. Aspect data are cyclic,
representing directions and ranging from 0 to
360 degrees measured from north. We used
the cosine function to transform aspect data to
the range from −1 to 1, so that the cosine of
aspect indicates the degree of northness
(Lian et al. 2006). NDVI data were
derived from the ETM+ image using
(Band4 − Band3) / (Band4 + Band3) (Chen
et al. 2004; Alejandro & Kenji 2007).
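The two derived variables described above can be sketched in a few lines of NumPy. This is only an illustration of the formulas given in the text, not the ArcGIS or ERDAS workflow used in the study; the function names `ndvi` and `cos_aspect` are hypothetical, and the handling of a zero denominator (returning NaN) is an assumption.

```python
import numpy as np

def ndvi(band4, band3):
    """NDVI = (Band4 - Band3) / (Band4 + Band3).

    Cells where Band4 + Band3 == 0 are returned as NaN (an assumption;
    the paper does not say how such cells were treated).
    """
    nir = np.asarray(band4, dtype=float)
    red = np.asarray(band3, dtype=float)
    denom = nir + red
    out = np.full(denom.shape, np.nan)
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out

def cos_aspect(aspect_deg):
    """Map aspect in degrees from north to [-1, 1]; 1 means north-facing."""
    return np.cos(np.deg2rad(np.asarray(aspect_deg, dtype=float)))
```

For example, an aspect of 0 or 360 degrees maps to 1 (north-facing) and 180 degrees maps to −1 (south-facing), which removes the artificial jump between 359 and 0 degrees.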
Accumulation and decomposition of SOM
are greatly influenced by groundwater level
(Jiang et al. 1987), so groundwater level may
be used as a factor to analyse the variation of
SOM. However, groundwater level data are
difficult to obtain for a large area, so the
distances of sample locations from the rivers
and the elevation data were jointly used to
describe the groundwater level. In general, if a
sample site is close to a river and located at a
lower elevation, it can be considered that its
groundwater level is shallow, and vice versa.
Soil erosion intensity was classified into four
grades based on the results of the soil erosion
investigation in this area in 2000, that is, non-
apparent erosion, slight erosion, moderate
erosion, and severe erosion (Wang et al.
2009). Ferrous minerals index data were
acquired from the ETM+ image by calculating
‘Band5 / Band4’ in the ERDAS
IMAGINE 8.5 environment. The index reflects
the mineralisation intensity of soil
constituents: the more intense the
mineralisation the soil experiences, the less
SOM remains (Zech et al. 1997).
Methods
GWR is an extension of the traditional
regression in which variations in rates of
change are allowed in order that regression
coefficients are specific to a location rather
than being global estimates (Brunsdon et al.
1998a, 1998b; Fotheringham et al. 1998).
Suppose there are a series of explanatory
variables $\{x_{ij}\}$ and dependent variables $\{y_i\}$,
$i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$. A conventional
multiple linear regression (MLR) fitted by the
ordinary least squares method is expressed as:

$$ y_i = \beta_0 + \sum_{j=1}^{n} \beta_j x_{ij} + \varepsilon_i \qquad (1) $$

where $y_i$ is the value of dependent variable y at
location i, $\beta_0$ is the intercept, $\beta_j$ is the slope
parameter for independent predictor variable
$x_j$, and $\varepsilon_i$ represents the error terms, which are
generally assumed to be independent and
normally distributed with zero means and
constant variance $\sigma^2$. In this model, each of
the parameters can be thought of as the
relationship between one of the independent
variables and the dependent variable. This type
of regression is known as global because of the
spatial stationarity of its parameter estimates,
which means that a single model is fitted to all
of the sample data and is applied equally to the
whole study area of interest. The regression
model and its coefficients are constant across
the study area, assuming the relationships
between the dependent and independent
variables to be spatially constant.
However, variations or spatial non-statio-
narity in relationships between the dependent
and independent variables over space com-
monly exist in spatial data sets and the
assumption of stationarity or structural stab-
ility over space may be unrealistic (Fother-
ingham et al. 1997). So, when analysing
spatial data, we should take spatial non-
stationarity into account. The local regression
approach, known as GWR, recognises expli-
citly that the parameter estimates in a
regression model can vary across the space in
which the regression model is calibrated.
GWR allows the parameter estimates to be a
function of location. The local estimation of
the parameters with GWR is expressed by the
following equation:
$$ y_i = \beta_0(u_i, v_i) + \sum_{j=1}^{n} \beta_j(u_i, v_i)\, x_{ij} + \varepsilon_i \qquad (2) $$

where $(u_i, v_i)$ is the spatial location of the ith
observation and $\beta_j(u_i, v_i)$ is the value of the jth
parameter at point location i. The regression
parameters of this equation are estimated at
each location i, $(u_i, v_i)$. Note that the MLR
model (see Eq. (1)) with constant parameters is
a special case of the GWR model. The
parameters in the GWR model can be
calibrated using the weighted least squares
approach. In matrix form, the parameters of
the GWR model at each location i are
estimated from:
$$ \beta(u_i, v_i) = [X^{T} W(u_i, v_i) X]^{-1} X^{T} W(u_i, v_i) Y \qquad (3) $$

where $W(u_i, v_i)$ is an (m × m) spatial weighting
diagonal matrix, X is an [m × (n + 1)]
independent data matrix, and Y is an (m × 1)
dependent data vector.
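As a minimal sketch of the weighted least squares step in Eq. (3), the following NumPy function solves for the local coefficient vector at a single location, given the weights that location assigns to the m observations. The function name `gwr_local_fit` and its argument layout are hypothetical and do not reflect the ArcGIS implementation used in the study.

```python
import numpy as np

def gwr_local_fit(X, y, w):
    """Solve Eq. (3) at one location: beta = (X'WX)^(-1) X'Wy.

    X : (m, n) array of predictors, without the intercept column
    y : (m,) array of the dependent variable
    w : (m,) array holding the diagonal of W(u_i, v_i)
    Returns the (n + 1,) coefficient vector, intercept first.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    Xd = np.column_stack([np.ones(len(y)), X])  # m x (n + 1) design matrix
    XtW = Xd.T * w                              # same as Xd.T @ diag(w)
    # Solve the normal equations rather than forming an explicit inverse.
    return np.linalg.solve(XtW @ Xd, XtW @ y)
```

With all weights equal, the result coincides with the ordinary least squares solution, which illustrates the remark that the MLR model is a special case of the GWR model.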
To estimate parameters in the GWR
model, it is important to decide on the spatial
weighting matrix. The spatial weighting
matrix can be calculated by different methods.
One method is to specify $W(u_i, v_i)$ as a
continuous and monotonic decreasing function
of the distance $d_{ij}$ between point i and point j
(Fotheringham et al. 1997). Several functions
have been proposed to determine the weighting
matrix (Fotheringham et al. 2002). For
example, the weight of each point can be
calculated by applying the Gaussian function:
$$ w_{ij} = \exp\left[-0.5\,(d_{ij}/h)^2\right] \qquad (4) $$

where $w_{ij}$ is the weight of observed data at
location j for estimating the dependent variable
at location i, and h is referred to as the bandwidth.
This function is a distance decay function, in
which the weight of location j decreases with
its distance from the location i being regressed.
However, the weights obtained by this function
are nonzero for all data points, no matter how
far they are from the centre i (Fotheringham
et al. 2002; Bickford & Laffan 2006). Another
commonly used weighting function is the
bisquare distance decay kernel function
$$ w_{ij} = \begin{cases} \left[1 - (d_{ij}/h)^2\right]^2 & \text{when } d_{ij} \le h \\ 0 & \text{otherwise} \end{cases} \qquad (5) $$

where $w_{ij} = 1$ at the centre i ($d_{ij} = 0$) and
$w_{ij} = 0$ when the distance equals or is larger than
the bandwidth.
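The two weighting functions in Eqs. (4) and (5) are easy to compare numerically. The sketch below is a direct transcription of the formulas; the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(d, h):
    """Eq. (4): w = exp(-0.5 * (d / h)^2); nonzero at every distance."""
    d = np.asarray(d, dtype=float)
    return np.exp(-0.5 * (d / h) ** 2)

def bisquare_kernel(d, h):
    """Eq. (5): w = [1 - (d / h)^2]^2 for d <= h, and 0 otherwise."""
    d = np.asarray(d, dtype=float)
    w = (1.0 - (d / h) ** 2) ** 2
    return np.where(d <= h, w, 0.0)
```

Evaluating both at the same distances shows the practical difference: the Gaussian weight decays smoothly but never reaches zero, while the bisquare weight is exactly zero at and beyond the bandwidth, so distant observations drop out of the local regression entirely.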
The bandwidth may be fixed or adaptive. A
potential problem of the kernel function with a
fixed bandwidth is that for some locations in
the study area there may be only a few data
points available to calibrate the regression
model if the sample data are sparse around the
centre locations, that is, a ‘weak data’ problem.
Alternatively, an adaptive bandwidth can be
used to reduce the weak data problem. The
adaptive bandwidth is selected such that the
number of observations with nonzero weights
is the same at each location i across the study
area. Thus, the kernel function with adaptive
bandwidth adapts itself in the neighbourhood
size to the variations in the density of data. It
has larger bandwidths where the data are
sparse and smaller ones where the data are
denser (Foody 2003; Fotheringham et al.
2002). In this study, adaptive bandwidth was
adopted for SOM estimation, and the neigh-
bourhood size used for GWR analysis in the
ArcGIS environment was 68.
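One common way to realise an adaptive bandwidth is to set the bandwidth at each location to the distance of its k-th nearest observation. The sketch below assumes that convention, with each point counted as its own first neighbour; the exact rule applied by ArcGIS is not stated in the text, and the function name `adaptive_bandwidths` is hypothetical.

```python
import numpy as np

def adaptive_bandwidths(coords, k):
    """Bandwidth per location = distance to its k-th nearest observation.

    coords : (m, 2) array of point coordinates
    k      : neighbourhood size, counting the point itself as the first
             neighbour (an assumed convention)
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))  # m x m Euclidean distances
    return np.sort(dist, axis=1)[:, k - 1]
```

Under this convention, locations in densely sampled parts of the study area receive small bandwidths and isolated locations receive large ones, which is the behaviour described above.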
Parameter estimation in GWR is highly
dependent on the weighting function and the
bandwidth used. As the bandwidth increases,
the parameter estimates will gradually
approach the estimates from a global model.
The selection of the weighting function and
bandwidth can be determined using the cross-
validation approach (Fotheringham et al.
2002), the generalised cross-validation cri-
terion (Loader 1999), or the Akaike Infor-
mation Criterion (AIC) (Hurvich et al. 1998).
The AIC method has the advantage of taking
into account the fact that the degrees of
freedom may vary among the models centred
on different locations. In this study, selection
of the weighting function and the optimal
bandwidth h was accomplished by minimising
the corrected AIC, which could decrease
weights in the densely sampled places and
increase their values in the sparse places
(Jaimes et al. 2010).
Because the number of samples in this study is
limited, we selected 18 sites as validation
points based on the sample density; that is, where
the sampling points are dense, some of them were
selected as validation points (Figure 1). We
used the other 92 samples to develop the GWR
and MLR models. Conventional statistical
analyses were performed using
PASW Statistics 18.0.
Evaluation
Prediction accuracy of the two approaches was
evaluated by comparing the validation data,
which were not used for interpolation, with the
predicted data. Mean error (ME) and root mean
square error (RMSE) were used to verify
prediction accuracy. The ME index is defined as

$$ ME = \frac{1}{n} \sum_{i=1}^{n} \left[ z(x_i, y_i) - z^{*}(x_i, y_i) \right] \qquad (6) $$

where n is the number of SOM observations in
the validation data set, $z(x_i, y_i)$ and $z^{*}(x_i, y_i)$ are
values of the observed and the predicted SOM,
respectively; $x_i$, $y_i$ are the location coordinates.
RMSE is defined as

$$ RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left[ z(x_i, y_i) - z^{*}(x_i, y_i) \right]^2 } \qquad (7) $$
In addition, R 2 values between the observed
and the predicted value at the 18 validation sites
are also used for the purpose of comparison.
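The three validation measures can be computed as in the sketch below. ME and RMSE follow Eqs. (6) and (7) directly; the R² value is taken here as the squared Pearson correlation between observed and predicted values, which is an assumption because the text does not define it. The function name `validation_metrics` is hypothetical.

```python
import numpy as np

def validation_metrics(observed, predicted):
    """Return (ME, RMSE, R2) for paired observed and predicted values.

    ME and RMSE follow Eqs. (6) and (7). R2 is the squared Pearson
    correlation (an assumed definition, not confirmed by the paper).
    """
    z = np.asarray(observed, dtype=float)
    z_hat = np.asarray(predicted, dtype=float)
    resid = z - z_hat
    me = resid.mean()
    rmse = np.sqrt((resid ** 2).mean())
    r2 = np.corrcoef(z, z_hat)[0, 1] ** 2
    return me, rmse, r2
```

Note that ME can be close to zero even when individual errors are large, because positive and negative residuals cancel; RMSE does not have this property, which is why the two are reported together.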
3. Results
Exploratory data analysis
Table 1 shows the statistical summary of the
environmental factors and SOM. From the
table one can see that SOM varies from 3.47 to
71.9 g/kg, which means that there are great
variations in the spatial distribution of SOM in
Table 1. Statistical summary of SOM samples and the environmental variables
         Min       Max        Mean      Std. deviation   Kurtosis   Skewness
SOM      3.47      71.9       20.7      13.11            1.36       2.74
NDVI     −0.12     0.35       0.07      0.10             0.56       −0.07
SDR      0.00      1302.50    296.93    303.42           1.68       1.98
ELE      149.60    1100.10    273.33    159.18           2.63       8.82
SEI      1.00      6.00       1.99      1.39             1.23       0.62
Slope    0.00      37.40      6.44      7.20             1.63       3.24
CosA     −0.98     1.00       0.13      0.69             −0.23      −1.47
FMI      0.30      1.54       0.99      0.22             −0.23      0.11
SOM: soil organic matter (g/kg); NDVI: normalised difference vegetation index; SDR: sample distance to river; ELE: elevation; SEI: soil erosion intensity; CosA: cosine value of aspect; FMI: ferrous minerals index.
this region. From the table we also can see that
several environmental variables, such as SDR
(sample distance to river), elevation and slope,
have large standard deviations. This means
that there are great spatial variations from the
averages of samples of the environmental
factors. The values of kurtosis and skewness,
however, are small except for elevation. This
indicates that the histograms of the data are
symmetric and approach normal distributions.
The results of the Kolmogorov-Smirnov (K-S)
tests also confirm the normal distributions of
the data.
To find out the relationships between SOM
and the environmental variables, Pearson’s
correlation coefficients among variables were
calculated and are provided in Table 2. One
can see that there are significant positive
correlations between SOM and elevation, and
significant negative correlations between SOM
and soil erosion intensity and ferrous minerals
index. SOM also has positive correlations with
NDVI, distance to river and slope, and a weak
negative correlation with aspect, but
these do not reach the significant level. There
are quite a few significant correlations among
environmental variables, such as the corre-
lations between NDVI and elevation, slope and
ferrous mineral index.
To check for multi-collinearity
in the linear regression, the tolerance and
Variance Inflation Factor (VIF) were tested
among all of the environmental factors
(Table 3). The test results show that all
variables have VIF values less than 7.5, and
tolerance values close to 1. This indicates that
multi-collinearity is not a problem among
these variables although some variables have
relatively high Pearson’s correlation values
(Robinson & Schumacker 2009).
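Tolerance and VIF can be obtained by regressing each predictor on all the others: tolerance is one minus the R² of that auxiliary regression, and VIF is its reciprocal (in Table 3, for instance, 1/0.751 ≈ 1.332 for NDVI). The sketch below illustrates this standard definition with NumPy; it is not the PASW procedure used in the study, and the function name `tolerance_and_vif` is hypothetical.

```python
import numpy as np

def tolerance_and_vif(X):
    """Tolerance and VIF for each column of an (m, n) predictor matrix.

    For column j, tolerance_j = 1 - R_j^2, where R_j^2 comes from an OLS
    regression of column j on the other columns plus an intercept, and
    VIF_j = 1 / tolerance_j.
    """
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    tol = np.empty(n)
    for j in range(n):
        target = X[:, j]
        others = np.column_stack([np.ones(m), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        # Residual sum of squares over total sum of squares equals 1 - R^2.
        tol[j] = (resid ** 2).sum() / ((target - target.mean()) ** 2).sum()
    return tol, 1.0 / tol
```

Uncorrelated predictors give a tolerance and VIF of 1; as a predictor becomes better explained by the others, its tolerance falls towards 0 and its VIF grows.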
To better understand the effects of the
environment factors on the spatial distribution
of SOM, we selected three different combi-
nations of the environmental factors (see
Table 4) to construct the MLR and GWR
Table 2. The Pearson correlation matrix between the SOM and environmental variables
Variables   NDVI    SDR     ELE      SEI       Slope    CosA    FMI
SOM         .111    .183    .486**   −.282**   .200     −.043   −.329**
NDVI        1       .122    .250*    −.077     .326**   −.187   −.407**
SDR                 1       .431**   −.018     .186     −.054   −.312**
ELE                         1        −.234*    .418**   .114    −.450**
SEI                                  1         −.121    .082    .155
Slope                                          1        .056    −.336**
CosA                                                    1       .010
* Significant at the 0.05 level; ** significant at the 0.01 level.
Table 3. Results of the tolerance and variance inflation factor tests among the environmental variables
                       F       R²      Sig     t        Tolerance   VIF
Regression equation    4.953   0.292   .000
Constant coefficient                   .004    2.982
NDVI                                   .438    −.780    .751        1.332
SDR                                    .633    −.479    .774        1.292
ELE                                    .000    3.657    .594        1.684
SEI                                    .106    −1.633   .917        1.091
Slope                                  .894    −.133    .760        1.316
CosA                                   .325    −.991    .905        1.104
FMI                                    .151    −1.451   .678        1.475
F: value of F-test; R²: coefficient of determination; Sig: significance; t: value of t-test; VIF: variance inflation factor.
models for interpolating the spatial distri-
bution of SOM. Based on the results of the
three combinations, an optimised set was
chosen as the final one for the interpolation.
MLR model analysis
Table 5 shows the values of AIC (Akaike
Information Criterion), R² (coefficient of
determination), ADJ-R² (adjusted R²), and
RSS (residual sums of squares) for each MLR
combination. The AIC index is a measure of
the relative goodness of fit of a statistical
model. Models with smaller AIC values are
preferable to models with higher values. The
R 2 index measures the proportion of the total
variation in the dependent variable explained
by the model. The value of R² can be
influenced by the number of predictor
variables, because adding variables never
decreases the R² value
(Quinino et al. 2012). The adjusted R² index is
a preferable measure because it contains
certain adjustment when extra explanatory
variables are added to a given regression
model. The RSS index is a measure of the
discrepancy between the sample data and the
regression models, and a small RSS indicates a
good fit of the model to the observed data. In
this study, the criteria for model selection
mainly depend on AIC, RSS and ADJ-R 2: the
best model should have the smallest values of
AIC and RSS and the largest value of ADJ-R 2.
R² was used only as a reference index when the
above indices are very close.
Among the three combinations (Table 4),
combination 1 has the smallest AIC and the
highest ADJ-R 2, but its RSS is also the largest
one. The indices of combination 2 are close to
those of the combination 1, and the indices of
combination 3 also do not show much
deviation from those of combination 2. This
means it is difficult to judge which combi-
nation is the optimal one. To compare with the
GWR approach, we selected combination 2
(AIC ¼ 648.3, ADJ-R 2 ¼ 0.2516) for the
MLR analysis because its indices fall between
those of combination 1 and combination 3. The
MLR mathematical model of the selected
combination is expressed as:
$$ SOM = 22.376 - 0.0027 \times SDR + 0.0364 \times ELE - 1.4308 \times SEI - 0.0405 \times Slope - 7.9442 \times FMI \qquad (8) $$

Figure 2a is the scatter plot between the
observed SOM and the predicted SOM by the
Table 4. Three selected combinations of environmental factors for MLR and GWR model construction
Combinations     Variables
Combination 1    Elevation, soil erosion intensity, ferrous minerals index
Combination 2    Elevation, soil erosion intensity, ferrous minerals index, slope, sample distance to river
Combination 3    Elevation, soil erosion intensity, ferrous minerals index, slope, sample distance to river, NDVI, aspect
Table 5. AIC, R², ADJ-R² and RSS for selected regression combinations of independent variables
Combinations     Methods   AIC     R²       ADJ-R²   RSS
Combination 1    MLR       644.6   0.2944   0.2676   10418.3
                 GWR       642.0   0.4734   0.3691   7774.9
Combination 2    MLR       648.3   0.2973   0.2516   10376.4
                 GWR       645.6   0.5263   0.3859   6994.4
Combination 3    MLR       649.8   0.3175   0.2538   10077.5
                 GWR       652.0   0.4407   0.3136   8258.3
MLR model of Equation (8). The MLR model
obviously underestimates the SOM values
when they are high and overestimates them
when they are low. The largest standard
deviation of residuals is 2.54 g/kg, accounting
for 12.4 percent of the mean of the SOM
values. From the distribution map of the
standard deviation of residuals (Figure 3a), it
can be seen that large deviations mainly occur
in high elevation regions, which are covered
with natural vegetation, mainly forest. Small
residuals are mostly distributed in low
elevation or flat regions, which are mainly
tilled lands.
GWR model analysis
The values of AIC, R2, ADJ-R2, and RSS for
the three GWR combinations are also listed in
Table 5. Combination 2 has the largest R 2 and
ADJ-R 2, and the smallest RSS. Although not
the smallest, its AIC value is very close to the
Figure 2. Scatter plots of observed SOM vs predicted SOM by MLR (a) and GWR (b)
Figure 3. Standard deviation of residuals by MLR (a) and GWR (b)
smallest one. So combination 2 was selected
for GWR analysis.
From Table 5, it can be seen that GWR can
greatly improve prediction accuracy. For
combination 2 of the independent variables,
the GWR approach has improvement over the
MLR approach based on the following indices:
R² increases from 0.2973 to 0.5263, ADJ-R²
improves from 0.2516 to 0.3859, and RSS
decreases from 10376.4 to 6994.4.
Figure 2b illustrates the scatter plot of
observed SOM and predicted SOM by GWR.
Although GWR has similar overestimation and
underestimation problems as MLR has, the
regression line with GWR is closer to the
diagonal line (the dashed line), which means that
its errors are less than those of MLR. This
indicates that the fitting results have been
improved by the GWR approach. Large
standard deviations of residuals generated by
GWR mainly occur in high altitude regions
with good vegetation cover, but occasionally
also appear in lower altitude mountain regions
with moderate vegetation cover (Figure 3b).
However, both methods tend to generate small
prediction errors in low altitude and tilled
regions.
Interpolated SOM maps
Figures 4 and 5 illustrate the spatial distri-
bution maps of SOM interpolated by MLR and
GWR, both using combination 2 of environ-
mental variables.
The value range of SOM contents interpolated
by MLR is 2.17–56.0 g/kg (Figure 4),
and that by GWR is 1.94–71.1 g/kg (Figure 5).
In general, the overall values estimated by
GWR are higher than those by MLR. Compared
with the observed sample data, MLR apparently
underestimated the maximum value.
Figure 4. The spatial distribution map of SOM interpolated by MLR
Values estimated by GWR, however, fit the
range of the observed values much better.
Comparing the two images, the spatial
distribution map estimated by MLR has more
smoothing effects than that estimated by GWR.
This means that MLR is less capable of
effectively predicting the local variation of
SOM because it only provides an ‘average’ of
the relationships between SOM and environ-
mental variables. GWR, however, is more
capable of showing local variation (see panels
B and C in Figure 4 and Figure 5). That is to say,
MLR renders only the spatial trends of SOM
and is poor in accounting for local variations
induced by local factors, but GWR can better
handle local variations. It also can be found
from the two maps that the values estimated by
MLR are relatively higher than those estimated
by GWR in the northern region, where the
altitude is high and the vegetation coverage is
good (see panel A in Figures 4 and 5).
To further compare the prediction accuracy of the two methods, a sample data set of 18 observed points, excluded from model calibration, was used to verify the prediction results of MLR and GWR. Table 6 shows the comparison results. Most of the SOM residuals generated by GWR are smaller than those generated by MLR. The maximum positive error generated by MLR is larger than that generated by GWR, while the maximum negative errors of the two methods are equal (Table 7). Both ME and RMSE produced by GWR are smaller than those produced by MLR. These statistics indicate that the prediction accuracy of GWR is relatively higher than that of MLR. The R² values between observed and predicted SOM data at the 18 validation sites also indicate that GWR performed better (R² = 0.094) than MLR (R² = 0.0009).
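The ME and RMSE comparison can be recomputed from the observed and predicted values listed in Table 6. The sketch below assumes that ME is the mean of the residuals (observed minus predicted) and that RMSE is the square root of the mean squared residual; these definitions and the function names are assumptions made for illustration. Under them, the results should land close to the values reported in Table 7, with small differences possible because the tabulated residuals are rounded.

```python
import math

# Observed SOM and predictions at the 18 validation sites (Table 6, g/kg).
observed = [33.61, 19.70, 27.43, 25.97, 19.62, 19.33, 11.85, 15.88, 18.64,
            17.63, 8.13, 29.60, 15.86, 18.10, 6.67, 23.06, 30.97, 23.09]
mlr_pred = [15.08, 15.56, 26.93, 11.39, 15.47, 11.05, 15.43, 20.32, 15.28,
            17.92, 17.00, 18.25, 17.39, 18.59, 20.15, 17.18, 19.66, 17.34]
gwr_pred = [17.61, 17.91, 27.02, 14.55, 16.12, 15.87, 14.81, 17.16, 17.44,
            16.83, 14.30, 15.97, 15.06, 16.17, 20.19, 15.65, 20.84, 18.81]


def mean_error(obs, pred):
    """Mean of the residuals (observed minus predicted); assumed definition of ME."""
    return sum(o - p for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean squared residual; assumed definition of RMSE."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


for name, pred in (("MLR", mlr_pred), ("GWR", gwr_pred)):
    print(f"{name}: ME = {mean_error(observed, pred):.3f} g/kg, "
          f"RMSE = {rmse(observed, pred):.3f} g/kg")
```

A positive ME under this sign convention means that a method underestimates SOM on average at the validation sites.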
Figure 5. The spatial distribution map of SOM interpolated by GWR
In addition, in order to find the sources of the errors, 10 samples each were randomly selected from forest soils and paddy soils to compare the differences between the estimated and the observed values (i.e., residuals). The forest soils were formed naturally without human disturbance, while the paddy soils experienced strong anthropogenic disturbance. Table 8 shows the comparison results.
Both methods generated smaller residuals in the forest soils and larger residuals in the paddy soils. This means that the major estimation errors arose in paddy soils, which further implies that it is more difficult to accurately predict the spatial distribution of SOM in human-disturbed soils with either method when anthropogenic factors are not considered. The fact that GWR generated even larger errors than MLR in paddy soil areas (see Table 8) indicates that, if some key anthropogenic factors are absent, GWR may not outperform MLR in human-impacted areas, although it can more reasonably capture local spatial variations.
4. Discussion and conclusions
The results of this study show that GWR can be used as an alternative spatial interpolation method to predict the spatial distribution of SOM, even with limited samples, by incorporating the impacts of environmental variables.
Table 6. Prediction accuracies of MLR and GWR at the 18 selected observation samples
No.   SOM     MLR     GWR     SOM-MLR   SOM-GWR
1     33.61   15.08   17.61    18.53     16.0
2     19.70   15.56   17.91     4.14      1.79
3     27.43   26.93   27.02     0.50      0.41
4     25.97   11.39   14.55    14.58     11.4
5     19.62   15.47   16.12     4.15      3.5
6     19.33   11.05   15.87     8.28      3.46
7     11.85   15.43   14.81    -3.58     -2.96
8     15.88   20.32   17.16    -4.44     -1.28
9     18.64   15.28   17.44     3.36      1.2
10    17.63   17.92   16.83    -0.29      0.8
11     8.13   17.00   14.30    -8.87     -6.17
12    29.60   18.25   15.97    11.35     13.6
13    15.86   17.39   15.06    -1.53      0.8
14    18.10   18.59   16.17    -0.49      1.93
15     6.67   20.15   20.19   -13.48    -13.5
16    23.06   17.18   15.65     5.88      7.41
17    30.97   19.66   20.84    11.31     10.13
18    23.09   17.34   18.81     5.75      4.28
SOM: observed SOM values; MLR: predicted SOM values by MLR; GWR: predicted SOM values by GWR; SOM-MLR: residuals of SOM generated by MLR; SOM-GWR: residuals of SOM generated by GWR.
Table 7. Comparison of performances of MLR and GWR using ME, RMSE, maximum (- and +) errors of predictions and their coefficients of regression (R²) with 18 validation data
       Maximum (-) error (g/kg)   Maximum (+) error (g/kg)   ME (g/kg)   RMSE (g/kg)   R²
MLR    -13.5                      18.5                       3.064       8.467         0.0009
GWR    -13.5                      16.0                       2.933       7.496         0.094
We used a sparse sample dataset to develop the regression models in this research. The sample density may not be sufficient for generating high-quality maps through sample interpolation without using ancillary data in the study area. However, GWR still generated a detailed spatial distribution map of SOM by utilising the small number of samples and the available secondary information, such as elevation, soil erosion intensity, and ferrous minerals index. Although more samples will help in accurately estimating local coefficients, GWR does not necessarily require a very dense network of sampling data to capture the non-stationarity of the spatial distribution of SOM. Compared with MLR, which uses global regression coefficients, GWR can predict the spatial distribution of SOM with higher accuracy and more detail by using local coefficients, although the neighbourhood area used has to be large when samples are very sparse. Thus it may be used as a more flexible alternative interpolation method when the relationships between SOM and its formation factors vary from place to place within a region. In addition, by combining both spatial location and environmental factors, GWR may provide some insights into the spatially varying relationships between SOM and environmental factors.
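The contrast between global and local coefficients can be illustrated with a minimal sketch. It is not the implementation used in this study: the Gaussian kernel, the fixed bandwidth, the function names `fit_global` and `fit_local`, and the synthetic data are all assumptions chosen for illustration. The global fit returns one coefficient vector for the whole region, while the local fit solves a distance-weighted least squares problem at each target location.

```python
import numpy as np


def fit_global(x, y):
    """Ordinary least squares with an intercept: one coefficient vector for the region."""
    design = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


def fit_local(coords, x, y, target, bandwidth):
    """Weighted least squares at `target`; nearer samples get larger weights.

    The Gaussian kernel and fixed bandwidth are illustrative assumptions.
    """
    dist = np.linalg.norm(coords - target, axis=1)
    weights = np.exp(-0.5 * (dist / bandwidth) ** 2)
    design = np.column_stack([np.ones(len(y)), x])
    root_w = np.sqrt(weights)  # scaling rows by sqrt(w) turns WLS into OLS
    beta, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return beta


# Synthetic data (not from the study): the slope of y on x grows from west to east.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 10.0, size=(200, 2))
x = rng.normal(size=200)
true_slope = 1.0 + 0.2 * coords[:, 0]
y = 5.0 + true_slope * x + rng.normal(scale=0.1, size=200)

global_beta = fit_global(x, y)
west_beta = fit_local(coords, x, y, np.array([1.0, 5.0]), bandwidth=1.5)
east_beta = fit_local(coords, x, y, np.array([9.0, 5.0]), bandwidth=1.5)
wide_beta = fit_local(coords, x, y, np.array([5.0, 5.0]), bandwidth=1e6)

print("global slope:", global_beta[1])
print("local slope (west):", west_beta[1])
print("local slope (east):", east_beta[1])
print("local slope (very large bandwidth):", wide_beta[1])
```

With a very large bandwidth all weights approach one and the local fit collapses to the global one, which mirrors the trade-off noted above: a sparse sample forces a large neighbourhood, and a large neighbourhood smooths away local variation.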
Among the samples used in the study, more than half were collected from paddy land and open forest land, where anthropogenic disturbances such as tillage and fertilisation activities may impact the accumulation of SOM (Viaud et al. 2011). However, the factors we used as independent variables are related only to the natural environment, and no anthropogenic factors were considered in the regression models. Therefore relatively large estimation errors are unavoidable, especially in regions of intense human disturbance, no matter which method is used. To obtain an accurate estimate of the spatial distribution of SOM, more representative independent variables are needed. That is to say, besides meeting certain quantity requirements, samples should be collected based on a variety of factors, such as topography, land use, soil type, and vegetation coverage. Unfortunately, anthropogenic factors are difficult to collect in the study area, because the paddy lands are owned by individual farmers, who usually adopt different tillage management methods (e.g., fertilisation, straw-treating, tilling manners) in different patches.
Table 8. Observed data and estimated values by MLR and GWR at different sites for forest and paddy soils
No.       SOM     Estimated by MLR   |MLR-SOM|   Estimated by GWR   |GWR-SOM|
Samples collected from natural soils
1         27.27   32.07               4.8        30.58               3.31
2         22.1    18.9                3.2        19.9                2.2
3         19.09   21.78               2.69       22.79               3.7
4         33.78   33.97               0.19       32.18               1.6
5         52.94   49.2                3.74       47.16               5.78
6         15.88   19.71               3.83       18.8                2.92
7         36.88   29.37               7.51       28.25               8.63
8         32.99   29.46               3.53       31.84               1.15
9         16.04   17.22               1.18       16.65               0.61
10        19.55   21.53               1.98       22.09               2.54
Average                               3.26                           3.24
Samples collected from paddy soils
1         36.29   20.1               16.19       19.35              16.94
2         21.27   15.8                5.47       17.2                4.07
3          6.81   15.2                8.39       13.5                6.69
4         10.85   18.6                7.75       17.9                7.05
5         23.8    14.56               9.24       13.6               10.2
6         17.62   12.19               5.43       12.86               4.76
7         24.81   15.6                9.21       14.21              10.6
8         25.34   15.8                9.54       13.3               12.04
9         23.48   15.7                7.78       15.1                8.38
10        30.5    19.12              11.38       20.74               9.76
Average                               9.04                          12.29
|MLR-SOM|: absolute values of SOM residuals generated by MLR; |GWR-SOM|: absolute values of SOM residuals generated by GWR.
This study also shows that it is critical to select appropriate factors when developing regression models with GWR. Inappropriate factor selection may fail to improve the prediction accuracy, and may sometimes even produce wrong results. If samples are not extremely sparse and predictor variables are properly selected, GWR should be suitable for mapping the spatial distribution of SOM and shows apparent advantages over the conventionally used MLR method.
Acknowledgements
Support for this work was provided by the National Natural Science Foundation of China (No. 41271232), the Natural Science Foundation of Fujian Province, China (No. 2012J01179), and the Science and Technological Project of the Educational Commission of Fujian Province, China (No. JA11202). We thank the two anonymous reviewers for their constructive comments.
References
Alejandro, M.D.A., & Kenji, O. (2007) Estimation of vegetation parameter for modeling soil erosion using linear Spectral Mixture Analysis of Landsat ETM data. ISPRS Journal of Photogrammetry and Remote Sensing, vol. 62, pp. 309–324.
Bai, J.H., Ouyang, H., Deng, W., Zhu, Y.M., Zhang, X.L., & Wang, Q.G. (2005) Spatial distribution characteristics of organic matter and total nitrogen of marsh soils in river marginal wetlands. Geoderma, vol. 124, pp. 181–192.
Baxter, S.J., & Oliver, M.A. (2005) The spatial prediction of soil mineral N and potentially available N using elevation. Geoderma, vol. 128, pp. 325–339.
Bickford, S., & Laffan, S. (2006) Multi-extent analysis of the relationship between pteridophyte species richness and climate. Global Ecology and Biogeography, vol. 15, pp. 588–601.
Bishop, T.F.A., & Lark, R.M. (2006) The geostatistical analysis of experiments at the landscape-scale. Geoderma, vol. 133, pp. 87–106.
Bregt, A.K., Gesing, H.J., & Alkasuma, M. (1992) Mapping the conditional probability of soil variables. Geoderma, vol. 53, pp. 15–29.
Brunsdon, C., Fotheringham, S., & Charlton, M. (1998a) Geographically weighted regression - modelling spatial non-stationarity. The Statistician, vol. 47, pp. 431–443.
Brunsdon, C., Fotheringham, S., & Charlton, M. (1998b) Spatial nonstationarity and autoregressive models. Environment and Planning A, vol. 30, pp. 957–973.
Brunsdon, C., Fotheringham, S., & Charlton, M. (2002) Geographically weighted summary statistics: a framework for localized exploratory data analysis. Computers, Environment and Urban Systems, vol. 26, pp. 501–524.
Chen, X.X., Vierling, L., Rowell, E., & DeFelice, T. (2004) Using lidar and effective LAI data to evaluate IKONOS and Landsat 7 ETM+ vegetation cover estimates in a ponderosa pine forest. Remote Sensing of Environment, vol. 91, pp. 14–26.
Dobson, A.J., & Barnett, A.G. (2008) An Introduction to Generalized Linear Models, Chapman and Hall, London.
Dungan, J. (1998) Spatial prediction of vegetation quantities using ground and image data. International Journal of Remote Sensing, vol. 19, pp. 267–285.
Foody, G.M. (2003) Geographical weighting as a further refinement to regression modeling: an example focused on the NDVI-rainfall relationship. Remote Sensing of Environment, vol. 88, pp. 283–293.
Fotheringham, S., Charlton, M., & Brunsdon, C. (1997) Measuring spatial variations in relationships with geographically weighted regression. In: Fischer, M.M., & Getis, A., eds. Recent Developments in Spatial Analysis, Springer-Verlag, Berlin, pp. 60–82.
Fotheringham, S., Charlton, M., & Brunsdon, C. (1998) Geographically weighted regression: a natural evolution of the expansion method for spatial data. Environment and Planning A, vol. 30, pp. 1905–1927.
Fotheringham, S., Brunsdon, C., & Charlton, M. (2002) Geographically Weighted Regression: The Analysis of Spatially Varying Relationships, John Wiley & Sons, New York.
Gessler, P.E., Moore, I.D., McKenzie, N.J., & Ryan, P.J. (1995) Soil-landscape modelling and spatial prediction of soil attributes. International Journal of Geographical Information Systems, vol. 9, pp. 421–432.
Huang, B., Sun, W.X., Zhao, Y.C., Zhu, J., Yang, R.Q., Zou, Z., Ding, F., & Su, J.P. (2007) Temporal and spatial variability of soil organic matter and total nitrogen in an agricultural ecosystem as affected by farming practices. Geoderma, vol. 139, pp. 336–345.
Hughson, L., Huntley, D., & Razack, M. (1996) Cokriging limited transmissivity data using widely sampled specific capacity from pump tests in an alluvial aquifer. Ground Water, vol. 34, pp. 12–18.
Hurvich, C.M., Simonoff, J.S., & Tsai, C.L. (1998) Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. Journal of Royal Statistical Society Series B, vol. 60, pp. 271–293.
Jaimes, N.B.P.J., Sendra, J.B., Delgado, M.G., & Plata, R.F. (2010) Exploring the driving forces behind deforestation in the state of Mexico (Mexico) using geographically weighted regression. Applied Geography, vol. 30, pp. 576–591.
Jiang, J.R., Zhong, S.L., Yuan, Z.P., Xiao, R.L., & Zhang, Y.Z. (1987) Effect of different cropping system and underwater level on content of soil organic matter and ferric oxide in paddy soil. Journal of Hunan Agricultural University (Natural Sciences), vol. 3, pp. 25–30.
Lark, R.M. (2000) Regression analysis with spatially autocorrelated error: simulation studies and application to mapping of soil organic matter. International Journal of Geographical Information Science, vol. 14, pp. 247–264.
Lesch, S.M., Strauss, D.J., & Rhoades, J.D. (1995) Spatial prediction of soil-salinity using electromagnetic induction techniques. I. Statistical prediction models - a comparison of multiple linear regression and cokriging. Water Resources Research, vol. 31, pp. 373–386.
Leung, Y., Mei, C.L., & Zhang, W.X. (2000) Statistical tests for spatial nonstationarity based on the geographically weighted regression model. Environment and Planning A, vol. 32, pp. 9–32.
Lian, G., Guo, X.D., Fu, B.J., & Hu, C.X. (2006) Spatial variability and prediction of soil organic matter at county scale on the Loess Plateau (in Chinese). Progress in Geography, vol. 25, pp. 112–123.
Loader, C. (1999) Local Regression and Likelihood, Springer, New York. 290 pp.
Mabit, L., Bernard, C., Makhlouf, M., & Laverdiere, M.R. (2008) Spatial variability of erosion and soil organic matter content estimated from 137Cs measurements and geostatistics. Geoderma, vol. 145, pp. 245–251.
McKenzie, N.J., & Austin, M.P. (1993) A quantitative Australian approach to medium and small scale surveys based on soil stratigraphy and environmental correlation. Geoderma, vol. 57, pp. 329–355.
Mishra, U., Lal, R., Liu, D.S., & Marc, V.M. (2010) Predicting the spatial variation of the soil organic carbon pool at a regional scale. Soil Science Society of America Journal, vol. 74, pp. 906–914.
Moore, I.D., Gessler, P.E., Nielsen, G.A., & Peterson, G.A. (1993) Soil attribute prediction using terrain analysis. Soil Science Society of America Journal, vol. 57, pp. 443–452.
Nerini, D., Monestiez, P., & Mante, C. (2010) Cokriging for spatial functional data. Journal of Multivariate Analysis, vol. 101, pp. 409–418.
Pachepsky, Y.A., Timlin, D.J., & Rawls, W.J. (2001) Soil water retention as related to topographic variables. Soil Science Society of America Journal, vol. 65, pp. 1787–1795.
Propastin, P.A. (2009) Spatial non-stationarity and scale-dependency of prediction accuracy in the remote estimation of LAI over a tropical rainforest in Sulawesi, Indonesia. Remote Sensing of Environment, vol. 113, pp. 2234–2242.
Quinino, R.C., Reis, E.A., & Bessegato, L.F. (2012) Using the coefficient of determination R2 to test the significance of multiple linear regression. Teaching Statistics, vol. 35, pp. 84–88.
Robinson, C., & Schumacker, R.E. (2009) Interaction effects: centering, variance inflation factor, and interpretation issues. Multiple Linear Regression Viewpoints, vol. 35, pp. 6–11.
Stein, A., & Corsten, L.C.A. (1991) Universal kriging and cokriging as a regression procedure. Biometrics, vol. 47, pp. 575–587.
Sumfleth, K., & Duttmann, R. (2008) Prediction of soil property distribution in paddy soil landscapes using terrain data and satellite information as indicators. Landscape Ecology, vol. 8, pp. 485–501.
Tu, J., & Xia, Z.G. (2008) Examining spatially varying relationships between land use and water quality using geographically weighted regression I: model design and evaluation. The Science of the Total Environment, vol. 407, pp. 358–378.
Viaud, V., Angers, D.A., Parnaudeau, V., Morvan, T., & Menasseri, A.S. (2011) Response of organic matter to reduced tillage and animal manure in a temperate loamy soil. Soil Use and Management, vol. 27, pp. 84–93.
Wang, K., Wang, H.J., Shi, X.Z., Weindorf, D.C., Yu, D.S., Liang, Y., & Shi, D.M. (2009) Landscape analysis of dynamic soil erosion in Subtropical China: a case study in Xingguo County, Jiangxi Province. Soil and Tillage Research, vol. 105, pp. 313–321.
Young, L.J., Gotway, C.A., Yang, J., Kearney, G., & DuClos, C. (2009) Linking health and environmental data in geographical analysis: it’s so much more than centroids. Spatial and Spatio-temporal Epidemiology, vol. 1, pp. 73–84.
Zech, W., Senesi, N., Guggenberger, G., Kaiser, K., Lehmann, J., Miano, T.M., Miltner, A., & Schroth, G. (1997) Factors controlling humification and mineralization of soil organic matter in the tropics. Geoderma, vol. 79, pp. 117–161.