
This article was downloaded by: [University of Connecticut]
On: 09 June 2014, At: 08:40
Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of Spatial Science
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/tjss20

Mapping soil organic matter with limited sample data using geographically weighted regression
K. Wang (a), C.R. Zhang (b), W.D. Li (b), J. Lin (b) & D.X. Zhang (b)

(a) Department of Geographical Science, Minjiang University, Fuzhou, Fujian 350108, China
(b) Department of Geography and Center for Environmental Sciences and Engineering, University of Connecticut, Storrs, CT, USA
Published online: 23 Jul 2013.

To cite this article: K. Wang, C.R. Zhang, W.D. Li, J. Lin & D.X. Zhang (2014) Mapping soil organic matter with limited sample data using geographically weighted regression, Journal of Spatial Science, 59:1, 91-106, DOI: 10.1080/14498596.2013.812024

To link to this article: http://dx.doi.org/10.1080/14498596.2013.812024



Mapping soil organic matter with limited sample data using geographically weighted regression

K. Wang (a), C.R. Zhang (b)*, W.D. Li (b), J. Lin (b) and D.X. Zhang (b)

(a) Department of Geographical Science, Minjiang University, Fuzhou, Fujian 350108, China; (b) Department of Geography and Center for Environmental Sciences and Engineering, University of Connecticut, Storrs, CT, USA

The spatial information of soil organic matter (SOM) is crucial for precision agriculture and environmental modeling. It is, however, difficult to obtain the regional details of SOM by dense sampling due to the high cost. Although a variety of interpolation methods are available for mapping SOM at regional scales, accurate prediction usually needs densely distributed samples and requires the interpolated variable to meet some constraints such as spatial stationarity. This paper introduces the Geographically Weighted Regression (GWR) technique as an alternative approach for SOM mapping. We interpolated the spatial distribution of SOM based on a limited number of samples with the incorporation of multiple independent variables. We also compared GWR with the ordinary least squares regression approach in mapping SOM. Results indicated that GWR could capture more local details and improve the prediction accuracy. However, more attention should be paid to the selection of independent variables.

Keywords: soil organic matter; mapping; geographically weighted regression; ordinary least squares; environmental modeling

1. Introduction

Soil is a complex natural body, and it is jointly influenced by many factors, such as topography, vegetation, time, climate, parent materials, and human factors. Soil organic matter (SOM) varies continuously over space (Bai et al. 2005; Huang et al. 2007) and is influenced by soil formation factors. Information on the spatial distribution of SOM in a region is important because SOM regulates the functioning of local ecosystems and the health of soils, and therefore has strong effects on agricultural productivity and global climate change. SOM is an important indicator of soil fertility and quality as well, having great influence on soil physical, chemical, and biological processes. It is, however, difficult to acquire detailed SOM data for a given region at low cost. The high cost of collecting SOM data through dense sampling across landscapes has created a need for methods of inferring the spatial distribution of SOM, such as interpolation methods.

It is possible to obtain spatial distribution data of SOM in a given region by interpolating existing sample data (Mabit et al. 2008; Sumfleth & Duttmann 2008). A number of methods can be used to interpolate SOM from discrete sampling points into a spatially continuous surface map. These methods include linear regression (e.g. Moore et al. 1993; Gessler et al. 1995), inverse distance weighting (IDW) (e.g. Bregt et al. 1992), various kriging methods (e.g. Stein & Corsten 1991; Baxter & Oliver 2005; Bishop & Lark 2006; Robinson & Schumacker 2006; Nerini et al. 2010), generalised linear models (e.g. McKenzie & Austin 1993; Dobson & Barnett 2008), and regression trees (e.g. Pachepsky et al. 2001). To account for the fact that soil attributes are related to soil processes and landscape variables, the ordinary least squares based multiple linear regression (MLR) has been used to predict the variation of soil attributes using a variety of environmental variables (McKenzie & Austin 1993; Gessler et al. 1995). Results from several investigations suggested that regression might give more precise prediction than cokriging in some circumstances (e.g. Lesch et al. 1995; Hughson et al. 1996; Dungan 1998). In addition, MLR might have advantages over cokriging where the second-order component of the intrinsic hypothesis of stationarity fails (Lark 2000).

© 2013 Mapping Sciences Institute, Australia and Surveying and Spatial Sciences Institute
*Corresponding author. Email: [email protected]
Journal of Spatial Science, 2014, Vol. 59, No. 1, 91–106, http://dx.doi.org/10.1080/14498596.2013.812024

However, MLR is a non-spatial statistical method. One of its assumptions, independence of observations, is often unrealistic because spatial data are autocorrelated, which biases the estimated standard errors of model parameters and, consequently, makes significance tests misleading. Instead of showing how the relationships between soil attributes and soil process variables vary over space, MLR can only provide an ‘average’ of the relationships. Recently, an approach named Geographically Weighted Regression (GWR) has been used to estimate the relationships among spatially non-stationary variables (Leung et al. 2000; Brunsdon et al. 2002; Tu & Xia 2008; Propastin 2009; Young et al. 2009). GWR is a ‘local’ regression procedure that was developed to deal with non-stationarity issues faced by MLR (Fotheringham et al. 2002). The major advantage of GWR is that it can incorporate a number of correlated environmental variables while taking the local variation of the primary variable and secondary variables (through local correlation coefficients) into account. However, to our knowledge, studies using GWR to predict the spatial distribution of soil attributes are still rare (Mishra et al. 2010).

The objective of this study is to explore the feasibility of using GWR as an alternative spatial interpolation method to predict the spatial distribution of SOM at a regional scale by accounting for soil process variables, and in particular whether the approach is feasible for sparse sample data. The commonly used MLR method is employed for comparison.

2. Data and methods

Study area

The study area is located in Longyan, Fujian province, China (115°10′–115°40′E, 26°0′–26°40′N) (Figure 1), covering a total area of 1260 km². The elevation of this area decreases from the north, east, and west towards the south and centre, with altitude ranging from 1200 m to 140 m. It has a subtropical monsoon humid climate with abundant but seasonally uneven rainfall. The annual mean precipitation is 1600 mm, but over half falls during the wet season (April–June) when storms are frequent. Drought conditions occur in autumn, as both frequency and amount of precipitation decrease significantly. The annual mean temperature is 19.1°C, the coldest temperature is 3.4°C in January, and the hottest is 35.4°C in July. The main geomorphic types are river valley alluvial plains, red earthy hills, red sandstone hills, and granite and metamorphic rock mountains. The area is dominated by red soil (Humic Acrisols) and paddy soil (Stagnic Anthrosols), and dotted with yellow soil (Ferralic Cambisols) and purple soil (Eutric-Rhodic Cambisols).

Data

To predict the spatial distribution of SOM using the GWR approach, environmental variables, which are factors influencing the formation and decomposition of SOM and are assumed to be independent of each other, should be determined first. There are many factors contributing to SOM in soils, including natural factors and anthropogenic factors, such as land use, tillage, irrigation, and fertilisation. According to the theory of pedogenesis proposed by V. V. Dokuchaev, soil is derived from the interaction of five factors: topography, vegetation, climate, soil parent materials, and time. Since SOM is an important soil constituent, these factors also influence the spatial distribution of SOM. However, because climate and time often exert control at larger scales than that of interest here, these two factors were not considered in this study. We selected independent environmental variables from the soil formation factors of topography, vegetation, and soil parent materials.

Field sample data

Surface soil samples from the 0 to 20 cm horizon were collected at randomly selected sites in the study region during May 2004. Random selection was performed by using a random-number generator to generate the coordinates of a number of sample locations. Surveyors went to the field to find the selected sampling sites using a GPS. Sometimes an approximate location had to be chosen in the field if the specified location was inaccessible. The accurate coordinates and elevation of each sampled location were recorded using the GPS when the sample was taken. The total number of the selected sample sites is 110 (Figure 1). Sample sites were mainly in paddy lands (30 samples), open forest lands (35 samples), dry farm lands (19 samples), and forest lands (14 samples). One thousand grams of mixed soil were collected at each site and then processed in the laboratory. SOM was measured using the potassium dichromate volumetric method after air drying, grinding, and sieving.

Environmental variable data

Figure 1. The study area and sampling sites

Environmental indicators used in this study include elevation, slope, normalised difference vegetation index (NDVI), distances from sample points to water body, soil erosion intensity, and ferrous minerals index. We derived these data from several collected maps, including a digital elevation model (DEM) (grid format, 30 × 30 m resolution) from the Fujian Provincial Geomatics Center, an ETM+ remotely sensed satellite image (17 June 2000), and a soil erosion intensity map (shapefile format). The elevation data, which range from 140 m to 1200 m, were acquired directly from the DEM. The slope and aspect data were also calculated from the DEM using the Spatial Analyst module of ArcGIS. Slope is expressed in degrees and ranges from 0 to 70. Aspect data are cyclic, representing directions and ranging from 0 to 360 degrees measured from north. We used the cosine function to transform aspect data to the range from −1 to 1. Note that the values of cosine aspect indicate how strongly a slope faces north (Lian et al. 2006). NDVI data were derived from the ETM+ image using (Band4 − Band3) / (Band4 + Band3) (Chen et al. 2004; Alejandro & Kenji 2007).
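The two raster transformations described above can be sketched in a few lines. This is only an illustration, not the authors' ArcGIS or ERDAS workflow: the function names are hypothetical, the bands are assumed to be already loaded as NumPy arrays, and aspect is assumed to be measured in degrees from north.

```python
import numpy as np

def ndvi(band4, band3):
    """NDVI = (Band4 - Band3) / (Band4 + Band3); NaN where the denominator is zero."""
    nir = np.asarray(band4, dtype=float)
    red = np.asarray(band3, dtype=float)
    denom = nir + red
    out = np.full(denom.shape, np.nan)
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out

def cos_aspect(aspect_deg):
    """Map aspect in degrees from north to [-1, 1]: 1 faces north, -1 faces south."""
    return np.cos(np.deg2rad(np.asarray(aspect_deg, dtype=float)))

# Toy values, not data from the study
print(ndvi([0.5, 0.2], [0.1, 0.2]))
print(cos_aspect([0, 90, 180, 360]))
```

Because the cosine maps both 0 and 360 degrees to 1, the discontinuity of the cyclic aspect variable disappears, which is why the transformed value can enter a linear regression directly.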

Accumulation and decomposition of SOM are greatly influenced by groundwater level (Jiang et al. 1987), so groundwater level may be used as a factor to analyse the variation of SOM. However, groundwater level data are difficult to obtain for a large area, so the distances of sample locations from the rivers and the elevation data were jointly used to describe the groundwater level. In general, if a sample site is close to a river and located at a lower elevation, its groundwater level can be considered shallow, and vice versa. Soil erosion intensity was classified into four grades based on the results of the soil erosion investigation in this area in 2000, that is, non-apparent erosion, slight erosion, moderate erosion, and severe erosion (Wang et al. 2009). Ferrous minerals index data were acquired from the ETM+ image using the calculation of ‘Band5 / Band4’ in the ERDAS IMAGINE 8.5 environment. The index reflects the mineralisation intensity of soil constituents: the more intense the mineralisation the soil experiences, the less SOM remains (Zech et al. 1997).

Methods

GWR is an extension of traditional regression in which variations in rates of change are allowed, so that regression coefficients are specific to a location rather than being global estimates (Brunsdon et al. 1998a, 1998b; Fotheringham et al. 1998). Suppose there are a series of explanatory variables $\{x_{ij}\}$ and a dependent variable $\{y_i\}$, with $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$. A conventional multiple linear regression (MLR) fitted by the ordinary least squares method is expressed as:

$$y_i = \beta_0 + \sum_{j=1}^{n} \beta_j x_{ij} + \varepsilon_i \qquad (1)$$

where $y_i$ is the value of the dependent variable $y$ at location $i$, $\beta_0$ is the intercept, and $\beta_j$ is the slope parameter for the independent predictor variable $x_j$. $\varepsilon_i$ represents the error terms, which are generally assumed to be independent and normally distributed with zero mean and constant variance $\sigma^2$. In this model, each parameter can be thought of as describing the relationship between one of the independent variables and the dependent variable. This type of regression is known as global because of the spatial stationarity of its parameter estimates, which means that a single model is fitted to all of the sample data and is applied equally to the whole study area of interest. The regression model and its coefficients are constant across the study area, assuming the relationships between the dependent and independent variables to be spatially constant.

However, variations or spatial non-stationarity in relationships between the dependent and independent variables commonly exist in spatial data sets, and the assumption of stationarity or structural stability over space may be unrealistic (Fotheringham et al. 1997). So, when analysing spatial data, we should take spatial non-stationarity into account. The local regression approach, known as GWR, recognises explicitly that the parameter estimates in a regression model can vary across the space in which the regression model is calibrated. GWR allows the parameter estimates to be a function of location. The local estimation of the parameters with GWR is expressed by the following equation:

$$y_i = \beta_0(u_i, v_i) + \sum_{j=1}^{n} \beta_j(u_i, v_i)\, x_{ij} + \varepsilon_i \qquad (2)$$

where $(u_i, v_i)$ is the spatial location of the $i$th observation and $\beta_j(u_i, v_i)$ is the value of the $j$th parameter at point location $i$. The regression parameters of this equation are estimated at each location $i$ $(u_i, v_i)$. Note that the MLR model (see Eq. (1)) with constant parameters is a special case of the GWR model. The parameters in the GWR model can be calibrated using the weighted least squares approach. In matrix form, the parameters of the GWR model at each location $i$ are estimated from:

$$\beta(u_i, v_i) = [X^{T} W(u_i, v_i) X]^{-1} X^{T} W(u_i, v_i) Y \qquad (3)$$

where $W(u_i, v_i)$ is an $(m \times m)$ spatial weighting diagonal matrix, $X$ is an $[m \times (n + 1)]$ independent data matrix, and $Y$ is an $(m \times 1)$ dependent data vector.
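Equation (3) is an ordinary weighted least squares solve, repeated once per regression location. A minimal NumPy sketch follows; the function name and the toy data are illustrative assumptions, and the weights are passed as the diagonal of $W(u_i, v_i)$ rather than as a full matrix.

```python
import numpy as np

def gwr_local_coefficients(X, y, weights):
    """Solve Eq. (3) at one location: beta = (X^T W X)^{-1} X^T W y.

    X       : (m, n + 1) design matrix whose first column is all ones
    y       : (m,) observed dependent variable
    weights : (m,) diagonal of W(u_i, v_i) for this regression location
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    XtW = X.T * w                      # equals X.T @ diag(w) without building W
    return np.linalg.solve(XtW @ X, XtW @ y)

# Toy data: y is exactly linear in x, so any positive weights recover [2, 3]
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x
print(gwr_local_coefficients(X, y, weights=[1.0, 0.8, 0.4, 0.1]))
```

Setting every weight to 1 reduces the solve to the global MLR estimate, which is the special case noted in the text.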

To estimate parameters in the GWR model, it is important to decide on the spatial weighting matrix. The spatial weighting matrix can be calculated by different methods. One method is to specify $W(u_i, v_i)$ as a continuous and monotonic decreasing function of the distance $d_{ij}$ between point $i$ and point $j$ (Fotheringham et al. 1997). Several functions have been proposed to determine the weighting matrix (Fotheringham et al. 2002). For example, the weight of each point can be calculated by applying the Gaussian function:

$$w_{ij} = e^{-0.5 (d_{ij}/h)^2} \qquad (4)$$

where $w_{ij}$ is the weight of observed data at location $j$ for estimating the dependent variable at location $i$, and $h$ is referred to as the bandwidth. This function is a distance decay function, in which the weight of location $j$ decreases with its distance from the location $i$ being regressed. However, the weights obtained by this function are nonzero for all data points, no matter how far they are from the centre $i$ (Fotheringham et al. 2002; Bickford & Laffan 2006). Another commonly used weighting function is the bisquare distance decay kernel function

$$w_{ij} = \begin{cases} \left[1 - (d_{ij}/h)^2\right]^2 & \text{when } d_{ij} \le h \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

where $w_{ij} = 1$ at the centre $i$ ($d_{ij} = 0$) and $w_{ij} = 0$ when the distance equals or is larger than the bandwidth.
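The two kernels in Eqs. (4) and (5) translate directly into vectorised functions. The sketch below uses hypothetical function names and toy distances; it is meant only to make the difference between the two weighting schemes concrete.

```python
import numpy as np

def gaussian_kernel(d, h):
    """Eq. (4): weights decay smoothly with distance and never reach zero."""
    d = np.asarray(d, dtype=float)
    return np.exp(-0.5 * (d / h) ** 2)

def bisquare_kernel(d, h):
    """Eq. (5): weights fall to exactly zero at and beyond the bandwidth h."""
    d = np.asarray(d, dtype=float)
    w = (1.0 - (d / h) ** 2) ** 2
    return np.where(d <= h, w, 0.0)

distances = np.array([0.0, 500.0, 1000.0, 2000.0])   # toy distances
print(gaussian_kernel(distances, h=1000.0))
print(bisquare_kernel(distances, h=1000.0))
```

With the bisquare kernel, observations beyond the bandwidth drop out of the local regression entirely, whereas the Gaussian kernel keeps every observation with a small but nonzero weight.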

The bandwidth may be fixed or adaptive. A potential problem of the kernel function with a fixed bandwidth is that for some locations in the study area there may be only a few data points available to calibrate the regression model if the sample data are sparse around the centre locations, that is, a ‘weak data’ problem. Alternatively, an adaptive bandwidth can be used to reduce the weak data problem. The adaptive bandwidth is selected such that the number of observations with nonzero weights is the same at each location i across the study area. Thus, the kernel function with adaptive bandwidth adapts itself in the neighbourhood size to the variations in the density of data. It has larger bandwidths where the data are sparse and smaller ones where the data are denser (Foody 2003; Fotheringham et al. 2002). In this study, adaptive bandwidth was adopted for SOM estimation, and the neighbourhood size used for GWR analysis in the ArcGIS environment was 68.
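One common way to realise an adaptive bandwidth is to set the bandwidth at each location to the distance of its k-th nearest neighbour. The sketch below follows that convention as an assumption; how ArcGIS counts the 68 neighbours internally is not stated in the text, and the function name and coordinates are illustrative.

```python
import numpy as np

def adaptive_bandwidths(coords, k):
    """Distance from each point to its k-th nearest neighbour (itself excluded).

    coords : (m, 2) array of point coordinates, assumed to contain no duplicates
    k      : neighbourhood size, must be smaller than m
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))   # (m, m) pairwise distances
    dist_sorted = np.sort(dist, axis=1)       # column 0 is the zero self-distance
    return dist_sorted[:, k]

# Toy layout: three clustered points and one isolated point
pts = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [10.0, 0.0]]
print(adaptive_bandwidths(pts, k=1))
```

The isolated point receives the largest bandwidth, which is the behaviour described above: wide kernels where samples are sparse, narrow ones where they are dense.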



Parameter estimation in GWR is highly dependent on the weighting function and the bandwidth used. As the bandwidth increases, the parameter estimates will gradually approach the estimates from a global model. The weighting function and bandwidth can be selected using the cross-validation approach (Fotheringham et al. 2002), the generalised cross-validation criterion (Loader 1999), or the Akaike Information Criterion (AIC) (Hurvich et al. 1998). The AIC method has the advantage of taking into account the fact that the degrees of freedom may vary among the models centred on different locations. In this study, selection of the weighting function and the optimal bandwidth h was accomplished by minimising the corrected AIC, which could decrease weights in the densely sampled places and increase their values in the sparse places (Jaimes et al. 2010).
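To illustrate how a bandwidth can be chosen from data, the sketch below implements the simpler leave-one-out cross-validation criterion mentioned above, not the corrected AIC that the study actually minimised. It assumes a fixed Gaussian kernel; the function names, candidate bandwidths, and toy data are hypothetical.

```python
import numpy as np

def loocv_score(coords, X, y, h):
    """Sum of squared leave-one-out prediction errors for a fixed bandwidth h."""
    coords, X, y = (np.asarray(a, dtype=float) for a in (coords, X, y))
    score = 0.0
    for i in range(len(y)):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (d / h) ** 2)   # Gaussian weights around location i
        w[i] = 0.0                        # leave observation i out of its own fit
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ y)
        score += (y[i] - X[i] @ beta) ** 2
    return score

def select_bandwidth(coords, X, y, candidates):
    """Return the candidate bandwidth with the smallest cross-validation score."""
    scores = [loocv_score(coords, X, y, h) for h in candidates]
    return candidates[int(np.argmin(scores))]

# Toy data on a 3 x 2 grid, one predictor plus intercept
coords = np.array([[0, 0], [1, 0], [2, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones(6), x])
y = 1.0 + 2.0 * x + np.array([0.1, -0.1, 0.05, -0.05, 0.1, -0.1])
print(select_bandwidth(coords, X, y, [1.0, 2.0, 4.0]))
```

Excluding the observation at the regression point matters: without it, the score would always favour the smallest bandwidth, because each point would then be predicted almost entirely from itself.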

Because the number of samples in this study is limited, we selected 18 sites as validation points based on sample density; that is, where the sampling points are dense, some of them were selected as validation points (Figure 1). We used the other 92 samples to develop the GWR and MLR models. Conventional statistical analyses were performed using PASW Statistics 18.0.

Evaluation

Prediction accuracy of the two approaches was evaluated by comparing the validation data, which were not used for interpolation, with the predicted data. Mean error (ME) and root mean square error (RMSE) were used to verify prediction accuracy. The ME index is defined as

$$\mathrm{ME} = \frac{1}{n} \sum_{i=1}^{n} \left[ z(x_i, y_i) - z^{*}(x_i, y_i) \right] \qquad (6)$$

where $n$ is the number of SOM observations in the validation data set, $z(x_i, y_i)$ and $z^{*}(x_i, y_i)$ are values of the observed and the predicted SOM, respectively, and $x_i$, $y_i$ are the location coordinates. RMSE is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left[ z(x_i, y_i) - z^{*}(x_i, y_i) \right]^2} \qquad (7)$$

In addition, R² values between the observed and the predicted values at the 18 validation sites are also used for the purpose of comparison.
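The three validation measures can be computed as follows. Eqs. (6) and (7) are implemented as written; for R², the sketch assumes the squared Pearson correlation between observed and predicted values, since the text does not give a formula. Function names and numbers are illustrative.

```python
import numpy as np

def mean_error(obs, pred):
    """Eq. (6): mean of observed minus predicted values."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    return float(np.mean(obs - pred))

def rmse(obs, pred):
    """Eq. (7): root mean square of observed minus predicted values."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def r_squared(obs, pred):
    """Squared Pearson correlation between observed and predicted (assumed definition)."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    return float(np.corrcoef(obs, pred)[0, 1] ** 2)

# Toy values, not the validation data of the study
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 2.0, 6.0]
print(mean_error(observed, predicted), rmse(observed, predicted))
```

ME keeps the sign of the errors, so positive and negative residuals can cancel; RMSE does not, which is why the two are reported together.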

3. Results

Exploratory data analysis

Table 1 shows the statistical summary of the environmental factors and SOM. From the table one can see that SOM varies from 3.47 to 71.9 g/kg, which means that there are great variations in the spatial distribution of SOM in this region. From the table we can also see that several environmental variables, such as SDR (sample distance to river), elevation and slope, have large standard deviations. This means that there are great spatial variations from the averages of samples of the environmental factors. The values of kurtosis and skewness, however, are small except for elevation. This indicates that the histograms of the data are symmetric and approach normal distributions. The results of the Kolmogorov-Smirnov (K-S) tests also confirm the normal distributions of the data.

Table 1. Statistical summary of SOM samples and the environmental variables

Variable  Min     Max      Mean    Std. deviation  Kurtosis  Skewness
SOM       3.47    71.9     20.7    13.11           1.36      2.74
NDVI      −0.12   0.35     0.07    0.10            0.56      −0.07
SDR       0.00    1302.50  296.93  303.42          1.68      1.98
ELE       149.60  1100.10  273.33  159.18          2.63      8.82
SEI       1.00    6.00     1.99    1.39            1.23      0.62
Slope     0.00    37.40    6.44    7.20            1.63      3.24
CosA      −0.98   1.00     0.13    0.69            −0.23     −1.47
FMI       0.30    1.54     0.99    0.22            −0.23     0.11

SOM: soil organic matter (g/kg); NDVI: normalised difference vegetation index; SDR: sample distance to river; ELE: elevation; SEI: soil erosion intensity; CosA: cosine value of aspect; FMI: ferrous minerals index.

To find out the relationships between SOM and the environmental variables, Pearson’s correlation coefficients among variables were calculated and are provided in Table 2. One can see that there is a significant positive correlation between SOM and elevation, and significant negative correlations between SOM and soil erosion intensity and ferrous minerals index. SOM also has positive correlations with NDVI, distance to river, and slope, and a weak negative correlation with aspect, but these do not reach the significance level. There are quite a few significant correlations among environmental variables, such as the correlations of NDVI with elevation, slope and ferrous minerals index.

To guard against multi-collinearity in the linear regression, the tolerance and Variance Inflation Factor (VIF) were tested among all of the environmental factors (Table 3). The test results show that all variables have VIF values less than 7.5, and tolerance values close to 1. This indicates that multi-collinearity is not a problem among these variables, although some variables have relatively high Pearson’s correlation values (Robinson & Schumacker 2009).
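The tolerance and VIF reported in Table 3 follow the standard definitions: each predictor is regressed on all the others, tolerance is 1 − R² of that auxiliary regression, and VIF is its reciprocal. A minimal sketch under those definitions is given below; the function name and the random demonstration data are illustrative and do not reproduce the PASW output.

```python
import numpy as np

def vif_and_tolerance(X):
    """Return (vif, tolerance) for each column of X.

    X : (m, p) matrix of predictors without an intercept column
    """
    X = np.asarray(X, dtype=float)
    m, p = X.shape
    vif = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(m), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        vif[j] = 1.0 / (1.0 - r2)
    return vif, 1.0 / vif

# Random demonstration data with the same number of rows as the calibration set
rng = np.random.default_rng(0)
vif, tol = vif_and_tolerance(rng.normal(size=(92, 3)))
print(vif, tol)
```

The reciprocal relationship can be checked against Table 3 itself: a tolerance of .751 for NDVI corresponds to a VIF of about 1.332.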

To better understand the effects of the environmental factors on the spatial distribution of SOM, we selected three different combinations of the environmental factors (see Table 4) to construct the MLR and GWR models for interpolating the spatial distribution of SOM. Based on the results of the three combinations, an optimised set was chosen as the final one for the interpolation.

Table 2. The Pearson correlation matrix between the SOM and environmental variables

Variables  NDVI  SDR   ELE     SEI      Slope   CosA   FMI
SOM        .111  .183  .486**  −.282**  .200    −.043  −.329**
NDVI       1     .122  .250*   −.077    .326**  −.187  −.407**
SDR              1     .431**  −.018    .186    −.054  −.312**
ELE                    1       −.234*   .418**  .114   −.450**
SEI                            1        −.121   .082   .155
Slope                                   1       .056   −.336**
CosA                                            1      .010

* Significant at the 0.05 level; ** significant at the 0.01 level.

Table 3. Results of the tolerance and variance inflation factor tests among the environmental variables

Term                  F      R²     Sig   t       Tolerance  VIF
Regression equation   4.953  0.292  .000
Constant coefficient                .004  2.982
NDVI                                .438  −.780   .751       1.332
SDR                                 .633  −.479   .774       1.292
ELE                                 .000  3.657   .594       1.684
SEI                                 .106  −1.633  .917       1.091
Slope                               .894  −.133   .760       1.316
CosA                                .325  −.991   .905       1.104
FMI                                 .151  −1.451  .678       1.475

F: value of F-test; R²: coefficient of determination; Sig: significance; t: value of t-test; VIF: variance inflation factor.

MLR model analysis

Table 5 shows the values of AIC (Akaike Information Criterion), R² (coefficient of determination), ADJ-R² (adjusted R²), and RSS (residual sum of squares) for each MLR combination. The AIC index is a measure of the relative goodness of fit of a statistical model. Models with smaller AIC values are preferable to models with higher values. The R² index measures the proportion of the total variation in the dependent variable explained by the model. The value of R² can be influenced by the number of predictor variables, because adding variables never decreases the R² value (Quinino et al. 2012). The adjusted R² index is a preferable measure because it adjusts for the extra explanatory variables added to a given regression model. The RSS index is a measure of the discrepancy between the sample data and the regression models, and a small RSS indicates a good fit of the model to the observed data. In this study, the criteria for model selection mainly depend on AIC, RSS and ADJ-R²: the best model should have the smallest values of AIC and RSS and the largest value of ADJ-R². R² was used only as a reference index when the above indices are very close.

Table 4. Three selected combinations of environmental factors for MLR and GWR model construction

Combinations   Variables
Combination 1  Elevation, soil erosion intensity, ferrous minerals index
Combination 2  Elevation, soil erosion intensity, ferrous minerals index, slope, sample distance to river
Combination 3  Elevation, soil erosion intensity, ferrous minerals index, slope, sample distance to river, NDVI, aspect

Table 5. AIC, R², ADJ-R² and RSS for selected regression combinations of independent variables

Combinations   Methods  AIC    R²      ADJ-R²  RSS
Combination 1  MLR      644.6  0.2944  0.2676  10418.3
               GWR      642.0  0.4734  0.3691  7774.9

Among the three combinations (Table 4), combination 1 has the smallest AIC and the highest ADJ-R², but its RSS is also the largest. The indices of combination 2 are close to those of combination 1, and the indices of combination 3 also do not deviate much from those of combination 2. This means it is difficult to judge which combination is the optimal one. To compare with the GWR approach, we selected combination 2 (AIC = 648.3, ADJ-R² = 0.2516) for the MLR analysis because its indices fall between those of combination 1 and combination 3. The MLR mathematical model of the selected combination is expressed as:

$$\mathrm{SOM} = 22.376 - 0.0027\,\mathrm{SDR} + 0.0364\,\mathrm{ELE} - 1.4308\,\mathrm{SEI} - 0.0405\,\mathrm{Slope} - 7.9442\,\mathrm{FMI} \qquad (8)$$

Figure 2a is the scatter plot between the observed SOM and the SOM predicted by the MLR model of Equation (8). The MLR model obviously underestimates the SOM values when they are high and overestimates them when they are low. The largest standard deviation of residuals is 2.54 g/kg, accounting for 12.4 percent of the mean of the SOM values. From the distribution map of the standard deviation of residuals (Figure 3a), it can be seen that large deviations mainly occur in high elevation regions, which are covered with natural vegetation, mainly forest. Small residuals are mostly distributed in low elevation or flat regions, which are mainly tilled lands.
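As a quick plausibility check on Equation (8), the global model can be evaluated at the sample means listed in Table 1. The function name below is hypothetical, and the inputs are assumed to be in the same units as Table 1.

```python
def predict_som_mlr(sdr, ele, sei, slope, fmi):
    """Evaluate the global MLR model of Eq. (8); returns SOM in g/kg."""
    return (22.376 - 0.0027 * sdr + 0.0364 * ele
            - 1.4308 * sei - 0.0405 * slope - 7.9442 * fmi)

# Sample means of the predictors taken from Table 1
som_at_means = predict_som_mlr(sdr=296.93, ele=273.33, sei=1.99, slope=6.44, fmi=0.99)
print(f"{som_at_means:.2f}")  # prints 20.55
```

The result, about 20.55 g/kg, lies close to the mean observed SOM of 20.7 g/kg in Table 1, as expected for a least squares fit, which passes through the means of its calibration data. The small gap is plausible because Table 1 summarises the sampled sites while the model was calibrated on the 92 non-validation samples.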

GWR model analysis

The values of AIC, R², ADJ-R², and RSS for the three GWR combinations are also listed in Table 5. Combination 2 has the largest R² and ADJ-R², and the smallest RSS. Although not the smallest, its AIC value is very close to the smallest one. So combination 2 was selected for GWR analysis.

Figure 2. Scatter plots of observed SOM vs predicted SOM by MLR (a) and GWR (b)

Figure 3. Standard deviation of residuals by MLR (a) and GWR (b)

From Table 5, it can be seen that GWR can greatly improve prediction accuracy. For combination 2 of the independent variables, the GWR approach improves on the MLR approach in the following indices: R² increases from 0.2973 to 0.5263, ADJ-R² improves from 0.2526 to 0.3859, and RSS decreases from 10376.2 to 6994.4.

Figure 2b illustrates the scatter plot of observed SOM and SOM predicted by GWR. Although GWR has overestimation and underestimation problems similar to those of MLR, the regression line with GWR is closer to the diagonal line (the dashed line), which means that its errors are smaller than those of MLR. This indicates that the fitting results have been improved by the GWR approach. Large standard deviations of residuals generated by GWR mainly occur in high altitude regions with good vegetation cover, but occasionally also appear in lower altitude mountain regions with moderate vegetation cover (Figure 3b). However, both methods tend to generate small prediction errors in low altitude and tilled regions.

Interpolated SOM maps

Figures 4 and 5 illustrate the spatial distribution maps of SOM interpolated by MLR and GWR, both using combination 2 of environmental variables.

The value range of SOM contents interp-

olated by MLR is 2.17–56.0 g/kg (Figure 4),

and that by GWR is 1.94–71.1 g/kg (Figure 5).

In general, the overall values estimated by

GWRare higher than those byMLR.Compared

with the observed sample data, MLR appar-

ently underestimated the maximum value.

Figure 4. The spatial distribution map of SOM interpolated by MLR


Values estimated by GWR, however, fit the range of the observed values much better. Comparing the two maps, the one estimated by MLR is smoother than the one estimated by GWR. This means that MLR is less capable of predicting the local variation of SOM, because it provides only an 'average' of the relationships between SOM and the environmental variables. GWR, by contrast, is better at showing local variation (see panels B and C in Figures 4 and 5). That is, MLR renders only the spatial trends of SOM and accounts poorly for variations induced by local factors, whereas GWR handles them better. The two maps also show that the values estimated by MLR are relatively higher than those estimated by GWR in the northern region, where the altitude is high and the vegetation coverage is good (see panel A in Figures 4 and 5).
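The contrast between a single 'average' relationship and locally varying coefficients can be illustrated with a small sketch. It fits one covariate by weighted least squares, with weights that decay with distance from the prediction location. The Gaussian kernel, the fixed bandwidth, the single covariate and the synthetic data are all assumptions made for illustration; this is not the calibration used in the study.

```python
import math


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    x_mean = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_mean = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_mean) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_mean) * (yi - y_mean) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return y_mean - b * x_mean, b


def local_fit_at(target, coords, x, y, bandwidth):
    """Local (GWR-style) fit at `target` using Gaussian distance-decay weights."""
    w = [math.exp(-0.5 * (math.dist(target, c) / bandwidth) ** 2) for c in coords]
    return weighted_line_fit(x, y, w)


def global_fit(x, y):
    """Global (MLR-style) fit: every sample gets the same weight."""
    return weighted_line_fit(x, y, [1.0] * len(x))


# Synthetic example: the covariate-response slope is 1 in the west and 3 in the east.
covariate = [1.0, 2.0, 3.0, 4.0, 5.0]
coords = [(float(i), 0.0) for i in range(5)] + [(100.0 + i, 0.0) for i in range(5)]
x = covariate + covariate
y = [10.0 + 1.0 * c for c in covariate] + [10.0 + 3.0 * c for c in covariate]

a_global, b_global = global_fit(x, y)
a_west, b_west = local_fit_at((2.0, 0.0), coords, x, y, bandwidth=10.0)
a_east, b_east = local_fit_at((102.0, 0.0), coords, x, y, bandwidth=10.0)
print(f"global slope {b_global:.2f}, west slope {b_west:.2f}, east slope {b_east:.2f}")
```

In this toy setting the global fit returns a slope midway between the two regional slopes, while the local fits recover each regional slope. This is the sense in which a global model averages out relationships that vary over space.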

To further compare the prediction accuracy of the two methods, a data set of 18 observed points, excluded from model calibration, was used to verify the predictions of MLR and GWR. Table 6 shows the comparison. Most of the SOM residuals generated by GWR are smaller than those generated by MLR. The maximum positive error generated by MLR is larger than that generated by GWR (Table 7). Both the mean error (ME) and the root mean square error (RMSE) produced by GWR are smaller than those produced by MLR. These statistics indicate that the prediction accuracy of GWR is relatively higher than that of MLR. The R² values between observed and predicted SOM at the 18 validation sites also indicate that GWR performed better (R² = 0.094) than MLR (R² = 0.0009).
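The validation statistics can be recomputed from the 18 validation records in Table 6, as in the sketch below. The definitions used here are assumptions, since this section does not state them: residual = observed minus predicted, ME as the mean signed residual, RMSE as the root of the mean squared residual, and R² as the squared Pearson correlation between observed and predicted values. The function and variable names are illustrative.

```python
import math

# Observed SOM and predictions at the 18 validation sites, transcribed from Table 6 (g/kg)
observed = [33.61, 19.70, 27.43, 25.97, 19.62, 19.33, 11.85, 15.88, 18.64,
            17.63, 8.13, 29.60, 15.86, 18.10, 6.67, 23.06, 30.97, 23.09]
mlr_pred = [15.08, 15.56, 26.93, 11.39, 15.47, 11.05, 15.43, 20.32, 15.28,
            17.92, 17.00, 18.25, 17.39, 18.59, 20.15, 17.18, 19.66, 17.34]
gwr_pred = [17.61, 17.91, 27.02, 14.55, 16.12, 15.87, 14.81, 17.16, 17.44,
            16.83, 14.30, 15.97, 15.06, 16.17, 20.19, 15.65, 20.84, 18.81]


def validation_stats(obs, pred):
    """Summarise prediction errors (residual = observed - predicted)."""
    n = len(obs)
    residuals = [o - p for o, p in zip(obs, pred)]
    me = sum(residuals) / n
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    # Squared Pearson correlation between observed and predicted values
    mean_o, mean_p = sum(obs) / n, sum(pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return {
        "me": me,
        "rmse": rmse,
        "max_neg": min(residuals),  # largest overestimation (most negative residual)
        "max_pos": max(residuals),  # largest underestimation (most positive residual)
        "r2": cov ** 2 / (var_o * var_p),
    }


mlr_stats = validation_stats(observed, mlr_pred)
gwr_stats = validation_stats(observed, gwr_pred)
for name, stats in (("MLR", mlr_stats), ("GWR", gwr_stats)):
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

Under these assumed definitions, the MLR values of ME and RMSE agree with Table 7 to the printed precision. The GWR values come out close to the tabulated ones, with small differences that are consistent with rounding in the printed residuals.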

Figure 5. The spatial distribution map of SOM interpolated by GWR

In addition, to identify the sources of the errors, 10 samples were randomly selected from forest soils and 10 from paddy soils to compare the differences between the estimated and the observed values (i.e., residuals). The forest soils formed naturally without human disturbance, while the paddy soils experienced strong anthropogenic disturbance. Table 8 shows the comparison.

Both methods generated smaller residuals in the forest soils and larger residuals in the paddy soils. The major estimation errors therefore came from the paddy soils, which implies that it is more difficult to predict the spatial distribution of SOM accurately in human-disturbed soils with either method when anthropogenic factors are not considered. The fact that GWR generated even larger errors than MLR in paddy soil areas (see Table 8) indicates that, if key anthropogenic factors are absent, GWR may not outperform MLR in human-impacted areas, although it captures local spatial variations more reasonably.

4. Discussion and conclusions

The results of this study show that GWR can be used as an alternative spatial interpolation method to predict the spatial distribution of SOM, even with limited samples, by incorporating the impacts of environmental variables.

Table 6. Prediction accuracies of MLR and GWR at the 18 selected observation samples

No.   SOM     MLR     GWR     SOM-MLR   SOM-GWR
1     33.61   15.08   17.61   18.53     16.0
2     19.70   15.56   17.91   4.14      1.79
3     27.43   26.93   27.02   0.50      0.41
4     25.97   11.39   14.55   14.58     11.4
5     19.62   15.47   16.12   4.15      3.5
6     19.33   11.05   15.87   8.28      3.46
7     11.85   15.43   14.81   -3.58     -2.96
8     15.88   20.32   17.16   -4.44     -1.28
9     18.64   15.28   17.44   3.36      1.2
10    17.63   17.92   16.83   -0.29     0.8
11    8.13    17.00   14.30   -8.87     -6.17
12    29.60   18.25   15.97   11.35     13.6
13    15.86   17.39   15.06   -1.53     0.8
14    18.10   18.59   16.17   -0.49     1.93
15    6.67    20.15   20.19   -13.48    -13.5
16    23.06   17.18   15.65   5.88      7.41
17    30.97   19.66   20.84   11.31     10.13
18    23.09   17.34   18.81   5.75      4.28

SOM: observed SOM values; MLR: predicted SOM values by MLR; GWR: predicted SOM values by GWR; SOM-MLR: residuals of SOM generated by MLR; SOM-GWR: residuals of SOM generated by GWR.

Table 7. Comparison of performances of MLR and GWR using ME, RMSE, maximum (- and +) errors of predictions and their coefficients of regression (R²) with 18 validation data

Method   Maximum (-) error (g/kg)   Maximum (+) error (g/kg)   ME (g/kg)   RMSE (g/kg)   R²
MLR      -13.5                      18.5                       3.064       8.467         0.0009
GWR      -13.5                      16.0                       2.933       7.496         0.094


We used a sparse sample dataset to develop the regression models in this research. The sample density may not be sufficient for generating high-quality maps of the study area through sample interpolation without ancillary data. However, GWR still generated a detailed spatial distribution map of SOM by using the small number of samples together with the available secondary information, such as elevation, soil erosion intensity, and the ferrous minerals index. Although more samples would help to estimate the local coefficients accurately, GWR does not necessarily require a very dense sampling network to predict the non-stationarity of the spatial distribution of SOM. Compared with MLR, which uses global regression coefficients, GWR can predict the spatial distribution of SOM with higher accuracy and more detail by using local coefficients, although the neighbourhood used has to be large when samples are very sparse. It may thus serve as a more flexible alternative interpolation method when the relationships between SOM and its formation factors vary from place to place within a region. In addition, by combining spatial location with environmental factors, GWR may provide some insight into the spatially varying relationships between SOM and environmental factors.

More than half of the samples used in the study were collected from paddy land and open forest land, where anthropogenic disturbances such as tillage and fertilisation may affect the accumulation of SOM (Viaud et al. 2011). However, the factors we used as independent variables relate only to the natural environment, and no anthropogenic factors were considered in the regression models. Relatively large estimation errors are therefore unavoidable, especially in regions of intense human disturbance, whichever method is used. To estimate the spatial distribution of SOM accurately, more representative independent variables are needed. That is, besides meeting certain quantity requirements, samples should be collected based on a variety of factors, such as topography, land use, soil type, and vegetation coverage. Unfortunately, anthropogenic factors are difficult to collect in the study area, because the paddy lands are owned by individual farmers, who usually adopt different tillage management methods (e.g., fertilisation, straw treatment, tilling practices) in different patches.

Table 8. Observed data and estimated values by MLR and GWR at different sites for forest and paddy soils

No.       SOM     Estimated by MLR   |MLR-SOM|   Estimated by GWR   |GWR-SOM|

Samples collected from natural soils
1         27.27   32.07              4.8         30.58              3.31
2         22.1    18.9               3.2         19.9               2.2
3         19.09   21.78              2.69        22.79              3.7
4         33.78   33.97              0.19        32.18              1.6
5         52.94   49.2               3.74        47.16              5.78
6         15.88   19.71              3.83        18.8               2.92
7         36.88   29.37              7.51        28.25              8.63
8         32.99   29.46              3.53        31.84              1.15
9         16.04   17.22              1.18        16.65              0.61
10        19.55   21.53              1.98        22.09              2.54
Average                              3.26                           3.24

Samples collected from paddy soils
1         36.29   20.1               16.19       19.35              16.94
2         21.27   15.8               5.47        17.2               4.07
3         6.81    15.2               8.39        13.5               6.69
4         10.85   18.6               7.75        17.9               7.05
5         23.8    14.56              9.24        13.6               10.2
6         17.62   12.19              5.43        12.86              4.76
7         24.81   15.6               9.21        14.21              10.6
8         25.34   15.8               9.54        13.3               12.04
9         23.48   15.7               7.78        15.1               8.38
10        30.5    19.12              11.38       20.74              9.76
Average                              9.04                           12.29

|MLR-SOM|: absolute values of SOM residuals generated by MLR; |GWR-SOM|: absolute values of SOM residuals generated by GWR.

This study also shows that it is critical to select appropriate factors when developing regression models with GWR. Inappropriate factor selection may fail to improve the prediction accuracy and may sometimes even produce wrong results. If samples are not extremely sparse and predictor variables are properly selected, GWR should be suitable for mapping the spatial distribution of SOM and shows apparent advantages over the conventionally used MLR method.

Acknowledgements

Support for this work was provided by the National Natural Science Foundation of China (No. 41271232), the Natural Science Foundation of Fujian Province, China (No. 2012J01179), and the Science and Technological Project of the Educational Commission of Fujian Province, China (No. JA11202). We thank the two anonymous reviewers for their constructive comments.

References

Alejandro, M.D.A., & Kenji, O. (2007) Estimation of vegetation parameter for modeling soil erosion using linear Spectral Mixture Analysis of Landsat ETM data. ISPRS Journal of Photogrammetry and Remote Sensing, vol. 62, pp. 309–324.

Bai, J.H., Ouyang, H., Deng, W., Zhu, Y.M., Zhang, X.L., & Wang, Q.G. (2005) Spatial distribution characteristics of organic matter and total nitrogen of marsh soils in river marginal wetlands. Geoderma, vol. 124, pp. 181–192.

Baxter, S.J., & Oliver, M.A. (2005) The spatial prediction of soil mineral N and potentially available N using elevation. Geoderma, vol. 128, pp. 325–339.

Bickford, S., & Laffan, S. (2006) Multi-extent analysis of the relationship between pteridophyte species richness and climate. Global Ecology and Biogeography, vol. 15, pp. 588–601.

Bishop, T.F.A., & Lark, R.M. (2006) The geostatistical analysis of experiments at the landscape-scale. Geoderma, vol. 133, pp. 87–106.

Bregt, A.K., Gesing, H.J., & Alkasuma, M. (1992) Mapping the conditional probability of soil variables. Geoderma, vol. 53, pp. 15–29.

Brunsdon, C., Fotheringham, S., & Charlton, M. (1998a) Geographically weighted regression - modelling spatial non-stationarity. The Statistician, vol. 47, pp. 431–443.

Brunsdon, C., Fotheringham, S., & Charlton, M. (1998b) Spatial nonstationarity and autoregressive models. Environment and Planning A, vol. 30, pp. 957–973.

Brunsdon, C., Fotheringham, S., & Charlton, M. (2002) Geographically weighted summary statistics: a framework for localized exploratory data analysis. Computers, Environment and Urban Systems, vol. 26, pp. 501–524.

Chen, X.X., Vierling, L., Rowell, E., & DeFelice, T. (2004) Using lidar and effective LAI data to evaluate IKONOS and Landsat 7 ETM+ vegetation cover estimates in a ponderosa pine forest. Remote Sensing of Environment, vol. 91, pp. 14–26.

Dobson, A.J., & Barnett, A.G. (2008) An Introduction to Generalized Linear Models, Chapman and Hall, London.

Dungan, J. (1998) Spatial prediction of vegetation quantities using ground and image data. International Journal of Remote Sensing, vol. 19, pp. 267–285.

Foody, G.M. (2003) Geographical weighting as a further refinement to regression modeling: an example focused on the NDVI-rainfall relationship. Remote Sensing of Environment, vol. 88, pp. 283–293.

Fotheringham, S., Charlton, M., & Brunsdon, C. (1997) Measuring spatial variations in relationships with geographically weighted regression. In: Fischer, M.M., & Getis, A., eds. Recent Developments in Spatial Analysis, Springer-Verlag, Berlin, pp. 60–82.

Fotheringham, S., Charlton, M., & Brunsdon, C. (1998) Geographically weighted regression: a natural evolution of the expansion method for spatial data. Environment and Planning A, vol. 30, pp. 1905–1927.

Fotheringham, S., Brunsdon, C., & Charlton, M. (2002) Geographically Weighted Regression: The Analysis of Spatially Varying Relationships, John Wiley & Sons, New York.

Gessler, P.E., Moore, I.D., McKenzie, N.J., & Ryan, P.J. (1995) Soil-landscape modelling and spatial prediction of soil attributes. International Journal of Geographical Information Systems, vol. 9, pp. 421–432.

Huang, B., Sun, W.X., Zhao, Y.C., Zhu, J., Yang, R.Q., Zou, Z., Ding, F., & Su, J.P. (2007) Temporal and spatial variability of soil organic matter and total nitrogen in an agricultural ecosystem as affected by farming practices. Geoderma, vol. 139, pp. 336–345.

Hughson, L., Huntley, D., & Razack, M. (1996) Cokriging limited transmissivity data using widely sampled specific capacity from pump tests in an alluvial aquifer. Ground Water, vol. 34, pp. 12–18.

Hurvich, C.M., Simonoff, J.S., & Tsai, C.L. (1998) Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. Journal of Royal Statistical Society Series B, vol. 60, pp. 271–293.

Jaimes, N.B.P.J., Sendra, J.B., Delgado, M.G., & Plata, R.F. (2010) Exploring the driving forces behind deforestation in the state of Mexico (Mexico) using geographically weighted regression. Applied Geography, vol. 30, pp. 576–591.

Jiang, J.R., Zhong, S.L., Yuan, Z.P., Xiao, R.L., & Zhang, Y.Z. (1987) Effect of different cropping system and underwater level on content of soil organic matter and ferric oxide in paddy soil. Journal of Hunan Agricultural University (Natural Sciences), vol. 3, pp. 25–30.

Lark, R.M. (2000) Regression analysis with spatially autocorrelated error: simulation studies and application to mapping of soil organic matter. International Journal of Geographical Information Science, vol. 14, pp. 247–264.

Lesch, S.M., Strauss, D.J., & Rhoades, J.D. (1995) Spatial prediction of soil-salinity using electromagnetic induction techniques. I. Statistical prediction models - a comparison of multiple linear regression and cokriging. Water Resources Research, vol. 31, pp. 373–386.

Leung, Y., Mei, C.L., & Zhang, W.X. (2000) Statistical tests for spatial nonstationarity based on the geographically weighted regression model. Environment and Planning A, vol. 32, pp. 9–32.

Lian, G., Guo, X.D., Fu, B.J., & Hu, C.X. (2006) Spatial variability and prediction of soil organic matter at county scale on the Loess Plateau (in Chinese). Progress in Geography, vol. 25, pp. 112–123.

Loader, C. (1999) Local Regression and Likelihood, Springer, New York. 290 pp.

Mabit, L., Bernard, C., Makhlouf, M., & Laverdiere, M.R. (2008) Spatial variability of erosion and soil organic matter content estimated from 137Cs measurements and geostatistics. Geoderma, vol. 145, pp. 245–251.

McKenzie, N.J., & Austin, M.P. (1993) A quantitative Australian approach to medium and small scale surveys based on soil stratigraphy and environmental correlation. Geoderma, vol. 57, pp. 329–355.

Mishra, U., Lal, R., Liu, D.S., & Marc, V.M. (2010) Predicting the spatial variation of the soil organic carbon pool at a regional scale. Soil Science Society of America Journal, vol. 74, pp. 906–914.

Moore, I.D., Gessler, P.E., Nielsen, G.A., & Peterson, G.A. (1993) Soil attribute prediction using terrain analysis. Soil Science Society of America Journal, vol. 57, pp. 443–452.

Nerini, D., Monestiez, P., & Mante, C. (2010) Cokriging for spatial functional data. Journal of Multivariate Analysis, vol. 101, pp. 409–418.

Pachepsky, Y.A., Timlin, D.J., & Rawls, W.J. (2001) Soil water retention as related to topographic variables. Soil Science Society of America Journal, vol. 65, pp. 1787–1795.

Propastin, P.A. (2009) Spatial non-stationarity and scale-dependency of prediction accuracy in the remote estimation of LAI over a tropical rainforest in Sulawesi, Indonesia. Remote Sensing of Environment, vol. 113, pp. 2234–2242.

Quinino, R.C., Reis, E.A., & Bessegato, L.F. (2012) Using the coefficient of determination R2 to test the significance of multiple linear regression. Teaching Statistics, vol. 35, pp. 84–88.

Robinson, C., & Schumacker, R.E. (2009) Interaction effects: centering, variance inflation factor, and interpretation issues. Multiple Linear Regression Viewpoints, vol. 35, pp. 6–11.

Stein, A., & Corsten, L.C.A. (1991) Universal kriging and cokriging as a regression procedure. Biometrics, vol. 47, pp. 575–587.

Sumfleth, K., & Duttmann, R. (2008) Prediction of soil property distribution in paddy soil landscapes using terrain data and satellite information as indicators. Landscape Ecology, vol. 8, pp. 485–501.

Tu, J., & Xia, Z.G. (2008) Examining spatially varying relationships between land use and water quality using geographically weighted regression I: model design and evaluation. The Science of the Total Environment, vol. 407, pp. 358–378.

Viaud, V., Angers, D.A., Parnaudeau, V., Morvan, T., & Menasseri, A.S. (2011) Response of organic matter to reduced tillage and animal manure in a temperate loamy soil. Soil Use and Management, vol. 27, pp. 84–93.

Wang, K., Wang, H.J., Shi, X.Z., Weindorf, D.C., Yu, D.S., Liang, Y., & Shi, D.M. (2009) Landscape analysis of dynamic soil erosion in Subtropical China: a case study in Xingguo County, Jiangxi Province. Soil and Tillage Research, vol. 105, pp. 313–321.

Young, L.J., Gotway, C.A., Yang, J., Kearney, G., & DuClos, C. (2009) Linking health and environmental data in geographical analysis: it’s so much more than centroids. Spatial and Spatio-temporal Epidemiology, vol. 1, pp. 73–84.

Zech, W., Senesi, N., Guggenberger, G., Kaiser, K., Lehmann, J., Miano, T.M., Miltner, A., & Schroth, G. (1997) Factors controlling humification and mineralization of soil organic matter in the tropics. Geoderma, vol. 79, pp. 117–161.
