
MSc Artificial Intelligence

Track: Machine learning

Master Thesis

Quantifying the predictive value of soil moisture for vegetation growth using

neural networks

by

Robert Leenders

10811548

42 ECTS

April 2016 – September 2016

Supervisor:

Dr R de Jeu

Assessor:

Dr M Welling

Machine Learning Group, University of Amsterdam

Abstract

Soil moisture is a crucial constraint for vegetation growth and therefore potentially
has predictive value. However, the strength of this predictive value is still largely
unknown. This thesis quantifies the predictive value of soil moisture for vegetation
growth. New methods are introduced to predict vegetation growth using satellite-based
soil moisture observations. These methods are based on neural networks and are
evaluated over mainland Australia. Analysis of the predictions of our 3-layer neural
network revealed that (a) soil moisture provides a strong predictive value for
vegetation, (b) soil moisture can be used to reliably predict vegetation up to two
months in advance, and (c) soil moisture has a strong local spatial relation with
vegetation. The accuracy of the vegetation predictions depends on the magnitude of
soil moisture: the quality of the predictions is higher in dry regions than in wet
areas.

Contents

1 Introduction
2 Background
  2.1 Data
    2.1.1 Soil moisture
    2.1.2 NDVI
    2.1.3 Preprocessing
  2.2 Machine learning models
    2.2.1 Neural Networks
    2.2.2 Bayesian Neural Networks
3 Predicting NDVI
  3.1 Using neural networks
  3.2 Analyzing the performance
    3.2.1 Performance on different areas
    3.2.2 Performance on different time periods
    3.2.3 Why wetness decreases performance
    3.2.4 Adaptability on anomalies
  3.3 Predicting further into the future
  3.4 Locally connected methods
4 Predicting NDVI with uncertainty
  4.1 Bayesian neural networks
  4.2 Analyzing the uncertainty
    4.2.1 Uncertainty in different areas
    4.2.2 Uncertainty in different time periods
5 Conclusion
Bibliography

Chapter 1

Introduction

Vegetation is the assemblage of plant species and their ground cover. It plays an

important role in our ecosystem where it regulates various biogeochemical cycles such

as water, carbon, and nitrogen. It converts carbon to oxygen, converts solar energy

into biomass, is the basis of all food chains, and provides wildlife habitat and food.

Understanding the impact of climate change on vegetation dynamics is crucial in

understanding ecosystem dynamics. This is also the reason why vegetation dynamics

are observed for analyzing climate change. Besides the importance of vegetation for

our ecosystem, vegetation is being used in a wide range of important problems, most

notably in forecasting and monitoring. Examples of such problems are climate change

monitoring [Bounoua et al., 2000], agricultural productivity (crop yield [Teal et al.,

2006]), drought monitoring [Peters et al., 2002], and forest fire detection [Illera et al.,

1996].

The effect of climate change on vegetation dynamics is complex and is influenced

by a wide range of different climatic constraints. The three strongest climatic con-

straints are water availability, solar radiation, and temperature [Stephenson, 1990,

Churkina and Running, 1998, Nemani et al., 2003]. The impact of these three
components on vegetation is relatively well studied, with water availability being the
least well studied [Lotsch et al., 2003, Mercado et al., 2009]. This is peculiar as

more than half of the world’s ecosystems are substantially limited by the availability

of water [Heimann and Reichstein, 2008]. A decrease in water availability reduces

the ability of vegetation to convert carbon-dioxide to oxygen due to a restriction in

stomatal conductance and a limited availability of root water [van der Molen et al.,

2011].

The climatic constraint of water availability on vegetation consists mainly of pre-

cipitation and soil moisture, with soil moisture being more strongly related to plant

growth dynamics than precipitation. There are three important factors that make
soil moisture crucial to plants. First, it provides water and nutrients to the plants,
allowing them to grow. Second, it creates a buffer and ensures water availability to
plants, even in the absence of precipitation. Finally, it enhances the soil chemical
processes, which aids the availability of macro-nutrients such as nitrogen. Besides

the influence on vegetation, soil moisture is also of fundamental importance to many

other hydrological and biological processes.

Observed soil moisture is arguably the key variable for modulating the complex

dynamics of the climate-soil-vegetation system and controlling the spatial and tempo-

ral patterns of vegetation [Porporato and Rodriguez-Iturbe, 2002]. However, instead

of using soil moisture observations to study the relation between vegetation dynamics

and water availability, often proxies are used such as model based soil moisture and

drought indices [Hirschi et al., 2011, Lotsch et al., 2003]. Near surface soil moisture

can be accurately observed at a regional and global scale using passive and active mi-

crowave sensing instruments [Owe et al., 2008, Liu et al., 2012, 2011, Miralles et al.,

2010]. The combination of passive and active observations gives a robust observed

satellite based soil moisture product [De Jeu et al., 2008, Dorigo et al., 2010].

Satellite observed soil moisture has been used to show a strong positive relation

between soil moisture and vegetation at large spatial and long-term temporal scales

over mainland Australia [Chen et al., 2014], with dry regions that have low vegetation

density being more sensitive to soil moisture and with vegetation lagging about one

month behind soil moisture. However, the details of the relationship between soil

moisture and vegetation are not yet clear.

The main objective of this thesis is to quantify the predictive value of satellite

based soil moisture for vegetation by forecasting vegetation maps. To forecast vege-

tation maps powerful machine learning techniques are used to model the relationship

between satellite based soil moisture and vegetation. The predictive value of satellite

based soil moisture for vegetation will be analyzed for different spatial and temporal

regions, and for different soil moisture levels. Additionally, it will be analyzed how
far into the future soil moisture has predictive value.

Long term satellite soil moisture data from ESA CCI [Liu et al., 2011] and satellite

vegetation proxies as described by the normalized difference vegetation index (NDVI)

[Rouse, 1973] are used. Neural networks are used as our machine learning model to

forecast vegetation maps. The neural networks will take satellite soil moisture as

input and produce NDVI maps as output. Neural networks are a powerful set of

models that can model complex non-linear relationships between input and output.

A deep neural network is a neural network which is composed of multiple hidden

layers. By stacking layers, which represent linear and non-linear transformations,


deep neural networks can learn increasingly complex abstractions of the data. Deep

neural networks have become very popular over the last couple of years, especially

under the term deep learning [Hinton et al., 2012, Collobert and Weston, 2008, LeCun

et al., 2015].

To quantify the predictive value of soil moisture for vegetation, the accuracy of
the neural networks is analyzed. The analysis is done over different spatial
regions as well as different temporal regions. To quantify how far into the future soil
moisture has predictive value for vegetation, a lag period between input and output
samples is introduced. The accuracy of the models with different lag periods is then
analyzed to quantify how far into the future soil moisture has predictive value. To

quantify the spatial relation between soil moisture and vegetation locally connected

neural networks are used, and their accuracies are analyzed. Finally a measure of

uncertainty is introduced to the models, thereby introducing another way to possibly

quantify the predictive value of soil moisture for vegetation. Having a measure of

uncertainty is also useful for the practical applicability of our models.

Several studies have set up methodologies to predict vegetation (NDVI). However,
none of them used soil moisture as input. Indeje et al. [2006] predicted NDVI in
Kenya using seasonal rainfall from global climate models (GCMs). They assume
that climate variability, especially precipitation, drives variability in NDVI. The
authors apply a correction to the GCM output using the model output statistics (MOS)
approach, and then predict NDVI using a combination of empirical orthogonal
functions (EOF), singular value decomposition (SVD), or canonical correlation analysis
(CCA), and multiple linear regression. They report that NDVI can be skillfully
predicted (with ≥ 0.6 correlation); however, they do not report any error
characterization such as the mean squared error (MSE).

Jiang et al. [2016] studied the spatiotemporal variability and predictability of
NDVI in Alberta, Canada. They showed that vegetation in southern Alberta is
predominantly driven by precipitation. Instead of predicting NDVI, they predict
smoothed NDVI (sNDVI). The authors use a linear regression model and an artificial
neural network model calibrated by a genetic algorithm (ANN-GA) to predict sNDVI.
Similar to our findings, they found that the non-linear model (ANN-GA) performed
better than the linear model. This study takes a similar approach, but with a direct
focus on soil moisture and using more advanced neural networks over Australia.

In this study the focus will be on both the influence of soil moisture on NDVI

(as already investigated by Chen et al. [2014]) and the predictive value of soil mois-

ture. This allows for a deeper analysis on the relationship between soil moisture

and vegetation and allows for a better look at the predictive value of soil moisture


for vegetation. The main contribution of this thesis, presented in chapter 3, is the

analysis and prediction of vegetation with high accuracy using neural networks. The

focus is on analyzing the predictability of vegetation in different temporal and spatial

regions. Also analyzed is the effect of introducing a lag period between the soil mois-

ture observations and the vegetation predictions. Furthermore, the performance of a

regular neural network and an ensemble of small locally connected neural networks

is compared. Finally, chapter 4 will focus on improving the predictions by adding a
measure of uncertainty to them. This is done using a Bayesian approach: the neural
network is replaced with a Bayesian neural network based on the work of Louizos and
Welling [2016].


Chapter 2

Background

2.1 Data

2.1.1 Soil moisture

The soil moisture dataset [Liu et al., 2012, 2011, Wagner et al., 2012] is provided by the
CCI project, which is part of the ESA programme on global monitoring of essential
climate variables. The dataset was retrieved from
http://www.esa-soilmoisture-cci.org/node/145 in April 2016. It provides surface soil
moisture maps at a 0.25° resolution from 1972 to 2014. It uses active as well as passive
microwave sensors and combines these two data streams into one final dataset.
Observations are available daily; however, not every area has a valid soil moisture
observation every day. In other words, daily maps are incomplete. To help with this
issue, and to make the data consistent with the NDVI dataset, the observations are
averaged over the first fifteen days of a month and over the remaining days of that
month. This results in two soil moisture maps per month. Figure 2.1 shows an example
of a 15 day soil moisture map.
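The half-month averaging described above can be sketched as follows. This is a minimal illustration, not the thesis code: the function name `half_month_means` and the convention that missing observations are stored as NaN are assumptions made for the example.

```python
import numpy as np

def half_month_means(daily_maps, days):
    """Average daily maps into two composites: days 1-15 and days 16-end.

    daily_maps: array of shape (n_days, height, width); NaN marks a missing
                observation (an assumption of this sketch).
    days:       day of month (1-based) for each daily map.
    Returns an array of shape (2, height, width). A cell stays NaN only if it
    has no valid observation in the whole half-month window.
    """
    daily_maps = np.asarray(daily_maps, dtype=float)
    days = np.asarray(days)
    halves = []
    for mask in (days <= 15, days > 15):
        window = daily_maps[mask]
        valid = ~np.isnan(window)
        count = valid.sum(axis=0)
        total = np.where(valid, window, 0.0).sum(axis=0)
        # Divide only where at least one valid observation exists.
        mean = np.full(count.shape, np.nan)
        np.divide(total, count, out=mean, where=count > 0)
        halves.append(mean)
    return np.stack(halves)
```

Averaging over the valid observations only is what fills most gaps: a cell needs just one valid day per half month to receive a value.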

2.1.2 NDVI

To quantify vegetation the normalized difference vegetation index (NDVI) [Rouse,

1973] is used. NDVI is an index that captures the amount of live green vegetation

or photosynthetic activity in an area and was first introduced in 1973 by Rouse et

al. It is a popular index that has found a wide range of applications in areas such as

vegetation dynamics, biomass production, and crop yield prediction.

Figure 2.1: Example of a soil moisture map. White indicates no soil moisture
information is available for that area, blue indicates dry areas, and red indicates wet
areas. Even when averaged there are still areas without data (e.g. the white areas in
South America).

NDVI uses visible light and near infrared light to distinguish between healthy
and unhealthy vegetation. It uses the concept that in general healthy vegetation will
absorb most of the visible light while it reflects more of the near infrared light. In
contrast, unhealthy vegetation reflects more visible light and less near infrared light.
This leads to the following fraction:

\[
\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{RED}}{\mathrm{NIR} + \mathrm{RED}}
\]

where NIR is the near infrared reflectance value for a cell and RED is the red

reflectance value for that cell. The near infrared reflectance and red reflectance values

for cells are captured using satellite instruments. In general NDVI values range from

-1 to +1 with larger values indicating more vegetation.
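As a small illustration of the fraction above, the following sketch computes NDVI per cell from two reflectance arrays. The function name `ndvi` and the choice to return NaN where NIR + RED is zero are assumptions of this example, not part of the GIMMS processing chain.

```python
import numpy as np

def ndvi(nir, red):
    """Per-cell NDVI = (NIR - RED) / (NIR + RED); NaN where the sum is zero."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    out = np.full(denom.shape, np.nan)
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out

# A cell reflecting mostly near infrared gets a positive value,
# a cell reflecting mostly red light gets a negative value.
values = ndvi([0.5, 0.1], [0.1, 0.5])
```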

NDVI is not the only index that measures live green vegetation. Other indices

such as the soil-adjusted vegetation index (SAVI) or the enhanced vegetation index

(EVI) also try to measure live green vegetation. For this study NDVI is chosen due

to its wide recognition within the science community.

The NDVI data is obtained from the GIMMS AVHRR Global NDVI dataset
[Pinzon and Tucker, 2014]. The dataset was retrieved from
https://ecocast.arc.nasa.gov/data/pub/gimms/3g.v0/ in April 2016. The dataset is
assembled from a collection of observations from NOAA's Advanced Very High
Resolution Radiometers. It provides twice-monthly observations at a 1/8° resolution
from 1981 to 2014. To avoid resolution incompatibility with the soil moisture data, the
NDVI dataset is resampled to the same 0.25° resolution as the soil moisture dataset.
Figure 2.2 shows an example of a 15 day NDVI map.
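The text does not state how the NDVI grid is brought to the coarser resolution. One simple option, shown here purely as an assumption, is block averaging: going from 1/8° to 0.25° corresponds to averaging non-overlapping 2 by 2 blocks. The name `block_mean` is hypothetical.

```python
import numpy as np

def block_mean(grid, factor=2):
    """Coarsen a 2D grid by averaging non-overlapping factor x factor blocks."""
    grid = np.asarray(grid, dtype=float)
    height, width = grid.shape
    if height % factor or width % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    # Split each axis into (coarse index, position inside block).
    blocks = grid.reshape(height // factor, factor, width // factor, factor)
    return blocks.mean(axis=(1, 3))
```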


Figure 2.2: Example of an NDVI map. Blue indicates an NDVI value of -1 and red

indicates an NDVI value of +1.

2.1.3 Preprocessing

Before feeding the data to the models, three preprocessing transformations were
performed. The first transformation is the aggregation of 10 observations (2 per month)
into a single observation, effectively introducing a notion of history to all our
observations. This is motivated by the work of Chen et al. [2014], which shows that a
soil moisture observation influences the NDVI for up to the following 5 months.
Performing this preprocessing step results in a small increase in performance of 2-3%.
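The text does not specify whether the 10 observations are concatenated or combined in some other way. The sketch below assumes concatenation of the current observation with the 9 preceding ones; the function name `add_history` is hypothetical.

```python
import numpy as np

def add_history(samples, history=10):
    """Concatenate each observation with the (history - 1) preceding ones.

    samples: time-ordered array of shape (n_samples, n_features).
    Returns an array of shape (n_samples - history + 1, history * n_features);
    the first (history - 1) observations have no complete history and are dropped.
    """
    samples = np.asarray(samples, dtype=float)
    n_samples = samples.shape[0]
    if n_samples < history:
        raise ValueError("need at least `history` samples")
    windows = [samples[t - history + 1:t + 1].ravel()
               for t in range(history - 1, n_samples)]
    return np.stack(windows)
```

With two observations per month, a window of 10 observations covers the 5 months over which soil moisture is reported to influence NDVI.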

The second transformation is normalizing the input data. This is commonly done

as it often leads to faster convergence and better local optima. To normalize the

input data the mean is subtracted from the input data and the result is divided by

the standard deviation of the input data.
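A minimal sketch of this normalization is given below. It assumes a single global mean and standard deviation (the text does not say whether statistics are computed globally or per cell), and the name `standardize` is hypothetical. Returning the statistics allows the same values to be reused on held-out data.

```python
import numpy as np

def standardize(x, mean=None, std=None):
    """Subtract the mean and divide by the standard deviation.

    If mean/std are not given they are computed from x, so that the
    statistics of the training data can be reused for other data.
    """
    x = np.asarray(x, dtype=float)
    mean = x.mean() if mean is None else mean
    std = x.std() if std is None else std
    return (x - mean) / std, mean, std
```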

The third transformation is removing the seasonality so our neural network can

focus on learning anomalies. The data contains a strong seasonality, in other words,

vegetation maps of the same months (or adjacent months) look similar. Note that

every month is represented by two samples, such that there are 24 samples in a year.


Each sample then spans a period of roughly two weeks. To remove the seasonality

the mean is computed for each period and this mean is then subtracted from each

sample within that same period. An example of this transformation on Australia can

be seen in figure 2.3.


Figure 2.3: The first map is the original output, the second map is the average of

all samples of that same time period, and the third map is the difference between

the first two maps. Orange indicates a difference of zero, red means an increase of

vegetation, blue means a decrease of vegetation.
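The seasonality removal described above can be sketched as follows. The function name `remove_seasonality` is hypothetical, and the sketch assumes the samples are time-ordered, start at the first period of a year, and contain at least one sample for each of the 24 periods.

```python
import numpy as np

def remove_seasonality(maps, periods_per_year=24):
    """Subtract the per-period mean map from every sample.

    maps: time-ordered array of shape (n_samples, height, width).
    Returns (anomalies, climatology), where climatology[p] is the mean map
    of all samples that fall in period p of the year.
    """
    maps = np.asarray(maps, dtype=float)
    period = np.arange(maps.shape[0]) % periods_per_year
    climatology = np.stack([maps[period == p].mean(axis=0)
                            for p in range(periods_per_year)])
    anomalies = maps - climatology[period]
    return anomalies, climatology
```

Adding `climatology[period]` back to a predicted anomaly map recovers a prediction on the original NDVI scale.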

This transformation is also applied on the input data. Note that this might remove

some important information, most importantly the scale. Consider two samples from

different months, that have completely different averages. It is then possible that after

applying this transformation two (input) samples have the same difference maps but

have completely different output maps.

To evaluate this transformation two neural networks were trained, one on a dataset
without this transformation ('non-anomaly') and one with this transformation applied
('anomaly'). Figure 2.4 shows the test error of both neural networks. The non-anomaly
neural network has an error of 0.002310 and the anomaly neural network has an error
of 0.002075, an improvement of approximately 10%. The differences are small, but the
anomaly neural network consistently outperforms the non-anomaly neural network.
Therefore, this transformation will be used as an extra preprocessing step on the
input dataset.

2.2 Machine learning models

In this study a few machine learning models are used including ridge regression, neural

networks, and Bayesian neural networks. Familiarity with ridge regression is assumed

so that the next two sections can focus on giving an overview of neural networks and

Bayesian neural networks.

Figure 2.4: Mean squared error for original dataset (non-anomaly) and transformed
dataset (anomaly), plotted over time from 2009 to 2013.

2.2.1 Neural Networks

To explain what a neural network is, it is important to first understand what a
perceptron is, which is a type of artificial neuron. The perceptron takes several
inputs x_1, x_2, ..., x_n and multiplies each input by a weight w_1, w_2, ..., w_n. It
then sums these values, and if the sum is larger than a certain threshold it outputs
a 1, otherwise a 0. To be more precise:

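\[
\text{output} =
\begin{cases}
0, & \text{if } \sum_{i=1}^{n} x_i w_i \le \text{threshold} \\
1, & \text{if } \sum_{i=1}^{n} x_i w_i > \text{threshold}
\end{cases}
\tag{2.1}
\]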

By varying the weights and the threshold the perceptron will learn to make different
decisions. In recent years other artificial neurons are often used instead of
perceptrons. They still use the same idea of weights, but they often apply a
non-linearity such as the sigmoid function to the resulting sum instead of comparing it
to a threshold. By stacking these perceptrons a more powerful model called the
multilayer perceptron is obtained.
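The perceptron rule above fits in a few lines of plain Python. The weights and threshold in the example are hand-picked for illustration (they make the perceptron act as a logical AND) and are not taken from the thesis.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

# With weights (1, 1) and threshold 1.5 the perceptron fires only
# when both binary inputs are 1.
and_outputs = [perceptron((a, b), (1.0, 1.0), 1.5)
               for a in (0, 1) for b in (0, 1)]
```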

One example of a multilayer perceptron is shown in figure 2.5. The first column
of nodes is usually the input, the second column is the first layer of perceptrons, the
third column is the second layer of perceptrons, etcetera. By having multiple layers
of perceptrons, increasingly difficult decisions can be made. A multilayer perceptron
is a certain type of neural network, one where the artificial neurons are perceptrons;
however, as will become clear in the next section, it is possible to have different
kinds of neurons. A neuron is often called a (hidden) unit. A neural network then
has input units, hidden units (in the multilayer perceptron case, the perceptrons), and
output units. To be clear, a multilayer perceptron is a neural network, but a neural
network is not necessarily a multilayer perceptron.

Figure 2.5: An example of a multilayer perceptron, with an input layer (Input #1 to
#3), two hidden layers (#1 and #2), and an output layer with a single output.

Equation 2.1 can be rewritten in a more general form:

\[
\text{output} = f(\mathbf{w} \cdot \mathbf{x} + b) \tag{2.2}
\]

where x and w are vectors of inputs and weights, and b is a bias term. The bias
term is simply the threshold moved to the left-hand side of the inequality, so that
b = −threshold. The function f defines what kind of artificial neuron it is. Given
the function:

\[
f(x) =
\begin{cases}
0, & \text{if } x \le 0 \\
1, & \text{if } x > 0
\end{cases}
\tag{2.3}
\]

the neuron corresponds to a perceptron and equation 2.2 is equal to equation 2.1.
However, f can be any kind of function, such as a sigmoid, a tanh, or a rectified
linear function. It is important to note that f is often a non-linear function, as this
makes the neural network more powerful. Below, three non-linearities are highlighted.
Firstly, the sigmoid function, which squashes inputs to a value between 0 and 1, as can
be seen in figure 2.6. The equation is as follows:

\[
\sigma(x) = \frac{1}{1 + e^{-x}}
\]

Secondly, the tanh function, which is similar to the sigmoid function except that it
squashes inputs to a value between -1 and +1. It is also plotted in figure 2.6, and
the equation is as follows:

\[
\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}
\]

Finally, the rectified linear function; units that use it are often called rectified
linear units or ReLUs. This function returns the input if it is larger than zero,
otherwise it returns zero. It is plotted in figure 2.7 and the equation is as follows:

\[
\mathrm{relu}(x) =
\begin{cases}
x, & \text{if } x > 0 \\
0, & \text{if } x \le 0
\end{cases}
\tag{2.4}
\]
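The three non-linearities can be written down directly from their equations. The sketch below is a scalar, textbook transcription for illustration; it makes no attempt at numerical robustness (for example, `math.exp` overflows for inputs of very large magnitude).

```python
import math

def sigmoid(x):
    """Squash x to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squash x to the interval (-1, 1), written as in the text."""
    return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))

def relu(x):
    """Return x if it is larger than zero, otherwise zero."""
    return x if x > 0 else 0.0
```

The two squashing functions are closely related: tanh(x) = 2σ(2x) − 1, so tanh is a rescaled and shifted sigmoid.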

Figure 2.6: The sigmoid and tanh functions, σ(x) = 1/(1 + e^−x) and
tanh(x) = (1 − e^−2x)/(1 + e^−2x), plotted for x from −5 to 5.

The output of a neural network could be a unit in which f is the identity function,
which is often used for regression problems. There are other possible options such as
a softmax, which is often used for classification problems. The problem considered
in this thesis has as many output units as there are pixels in the NDVI map that is
being predicted. Each output unit has to predict a real value between -1 and 1 (the
range of NDVI values), so for this problem f is set to the identity function for the
output units.

Figure 2.7: The ReLU function, relu(x) = max(0, x), plotted for x from −5 to 5.

By changing the weights of the neural network, the decisions made by the neural
network change, but how should one change these weights? One would like
to have an algorithm that changes the weights and the biases of the neural network so
that it outputs correct answers, based on some training data. To quantify how correct
an answer is, a cost function (or error measure) is defined. One example of a cost
function is the mean squared error. The learning algorithm then tries to minimize this
cost function by changing the weights and biases. One of the most common learning
algorithms is gradient descent, which computes the gradient of the error with respect
to the weights and biases and then updates them in the direction that decreases the
error. Computing this gradient is often done using backpropagation [Rumelhart
et al., 1985].
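To make the update rule concrete, the sketch below runs full-batch gradient descent on the mean squared error of a single linear unit, where the gradient can be written by hand instead of obtained through backpropagation. The function name, learning rate, and toy data are assumptions chosen for the example.

```python
import numpy as np

def gradient_descent(x, y, lr=0.05, steps=5000):
    """Fit y ~ x @ w + b by minimizing the mean squared error.

    x: array of shape (n_samples, n_features); y: array of shape (n_samples,).
    """
    n_samples, n_features = x.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(steps):
        err = x @ w + b - y                       # prediction error per sample
        grad_w = 2.0 / n_samples * (x.T @ err)    # d(MSE)/dw
        grad_b = 2.0 * err.mean()                 # d(MSE)/db
        w -= lr * grad_w                          # step against the gradient
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1.
x_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 2.0 * x_toy[:, 0] + 1.0
w_fit, b_fit = gradient_descent(x_toy, y_toy)
```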

As gradient descent requires computation of the gradient over the complete dataset,
which is expensive, stochastic gradient descent (SGD) is often used. SGD is a
stochastic approximation of gradient descent that, instead of computing the gradient
over the complete dataset, computes the gradient over a subset of the dataset. SGD is
widely used but is inefficient when it comes to optimizing objectives that contain other
sources of noise than data subsampling. Adam [Kingma and Ba, 2014] is a learning
algorithm that tries to be efficient at optimizing these stochastic objectives. The
advantage of using Adam over SGD is that it is invariant to rescaling of gradients and
robust to noisy and sparse gradients, while having little memory and computational
overhead. In this thesis Adam is used as the optimizer for all our experiments.
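For reference, a single Adam update following Kingma and Ba [2014] can be sketched as below. The function name `adam_step` is hypothetical and the default hyperparameters are those suggested in the Adam paper, not necessarily the ones used in the experiments.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter.

    m and v are running averages of the gradient and the squared gradient.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)   # bias correction for the zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Two parameters whose gradients differ by a factor of 100 in magnitude.
theta0 = np.array([1.0, -1.0])
grad0 = np.array([10.0, -0.1])
theta1, m1, v1 = adam_step(theta0, grad0, np.zeros(2), np.zeros(2), t=1)
```

In the example both parameters move by roughly `lr` on the first step despite the very different gradient magnitudes, which illustrates the invariance to gradient rescaling mentioned above.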

Another crucial part of a neural network is its architecture. The architecture of

a neural network consists of layers, each containing a number of units. Usually, the

first layer is the input, the final layer is the output, and all layers in between are

hidden layers. For example, the neural network in figure 2.5 has 4 layers. The first

layer has 3 input units, the two following layers are hidden layers one with 4 units

and one with 5 units, and the final layer has a single output unit.
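The architecture of figure 2.5 (3 inputs, hidden layers of 4 and 5 units, 1 output) can be sketched as a forward pass in NumPy. The ReLU hidden units, identity output, and random initialization are illustrative assumptions, and the helper names are hypothetical.

```python
import numpy as np

def init_params(layer_sizes, rng):
    """Create one (weights, biases) pair per pair of consecutive layers."""
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = rng.normal(0.0, 0.1, size=(n_in, n_out))
        biases = np.zeros(n_out)
        params.append((weights, biases))
    return params

def forward(params, x):
    """ReLU hidden layers followed by an identity (linear) output layer."""
    h = x
    for weights, biases in params[:-1]:
        h = np.maximum(0.0, h @ weights + biases)
    weights, biases = params[-1]
    return h @ weights + biases

def count_parameters(params):
    return sum(w.size + b.size for w, b in params)

rng = np.random.default_rng(0)
net = init_params([3, 4, 5, 1], rng)
y = forward(net, np.ones((2, 3)))   # a batch of 2 samples with 3 inputs each
```

This small network has (3·4 + 4) + (4·5 + 5) + (5·1 + 1) = 47 trainable parameters.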

Finally, dropout [Srivastava et al., 2014] is briefly discussed. Dropout is a
regularization technique that randomly drops units during training. To be precise, during
training the output of randomly selected units is set to zero, while during testing
all outputs are scaled by a factor that depends on the probability that a unit is
dropped. Usually the probability that a unit is dropped is set per layer. The hope is
that this prevents units from co-adapting too much. Dropout is a very simple technique
but surprisingly effective.
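The train/test asymmetry of dropout can be sketched as follows. This follows the formulation described above (zeroing at training time, scaling at test time); many libraries instead use the equivalent "inverted" variant that rescales during training. Function names are hypothetical.

```python
import numpy as np

def dropout_train(h, p_drop, rng):
    """Set each unit's output to zero with probability p_drop."""
    keep_mask = rng.random(h.shape) >= p_drop
    return h * keep_mask

def dropout_test(h, p_drop):
    """Scale outputs by the keep probability so that the expected
    activation matches what the next layer saw during training."""
    return h * (1.0 - p_drop)

rng = np.random.default_rng(0)
activations = np.ones(10000)
train_out = dropout_train(activations, 0.5, rng)
test_out = dropout_test(activations, 0.5)
```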

2.2.2 Bayesian Neural Networks

A disadvantage of neural networks is that they do not provide any kind of uncertainty

measure with their output, in other words, they do not provide any confidence inter-

vals. This is especially important for problems where key decisions are being made

based on the output.

To introduce confidence levels to neural networks, Bayesian methods are applied
to them. In the Bayesian treatment of neural networks we marginalize over the
distribution of parameters in order to make a prediction. In other words, a probability

distribution is put over the weights w. As a neural network is highly non-linear and

complex an exact Bayesian treatment is practically impossible. Therefore, approxi-

mation methods are used to approximate the distribution. This section focuses on

a family of approximation methods called variational inference. Another family of

approximation methods are the Markov Chain Monte Carlo (MCMC) methods. The

advantage of variational inference methods over MCMC methods is that variational

inference methods do not require any sampling and hence are fast and deterministic.

Variational inference is a family of methods that cast inference in a distribution

as an optimization problem. This is done by minimizing the Kullback-Leibler (KL)

divergence [Kullback and Leibler, 1951] between the approximate posterior and the

true posterior. In other words, the distribution p(y|x), which is too complicated

to evaluate directly, is approximated by a simpler distribution q(y). Usually this

simpler distribution q makes more independence assumptions than p. The problem

then becomes which simpler distribution q to select. This is done by defining a family


of distributions Q that are all simple enough to evaluate, then the q in Q that best

approximates p is selected (this is the optimization part). To evaluate how well q

approximates p the KL divergence is often used.
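For reference, and using the notation p(y|x) and q(y) from the paragraph above, the KL divergence in the direction commonly used in variational inference, together with the resulting optimization problem, can be written as:

\[
\mathrm{KL}(q \,\|\, p) = \int q(y) \, \log \frac{q(y)}{p(y \mid x)} \, dy,
\qquad
q^{*} = \operatorname*{arg\,min}_{q \in Q} \mathrm{KL}(q \,\|\, p)
\]

The KL divergence is non-negative and equals zero only when q and p agree almost everywhere. It is not symmetric, so KL(q || p) and KL(p || q) generally lead to different approximations.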

There exist a lot of variational inference methods such as loopy belief propaga-

tion, mean-field approximation, and expectation propagation. Recent research in this

area, with a focus on applications in neural networks, includes work from [Graves,

2011, Hernandez-Lobato and Adams, 2015, Blundell et al., 2015, Kingma et al., 2015,

Louizos and Welling, 2016]. This thesis will use the variational Bayesian neural net-

work method defined in this last paper, called VMG (Variational Matrix Gaussian).

All recent approaches mentioned above, besides the approach of Louizos and

Welling [2016], assume a fully factorized posterior distribution over the neural network

weights. In other words, they treat each weight of the weight matrix independently.

In contrast, Louizos and Welling [2016] treat the whole weight matrix as one using
a matrix variate Gaussian distribution [Gupta and Nagar, 1999]. This leads to

a reduction in variance parameters to estimate, better weight posterior uncertainty

estimation, more information sharing between weights, and an easier learning task.
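To illustrate the reduction in variance parameters, the sketch below samples a weight matrix from a matrix variate Gaussian whose row and column covariances are diagonal. The diagonal restriction and the function names are assumptions made for this illustration. Under this assumption the variance of weight (i, j) is the product of a row variance and a column variance, so an n by m matrix needs n + m variance parameters instead of the n·m of a fully factorized posterior.

```python
import numpy as np

def sample_matrix_gaussian(mean, row_var, col_var, rng):
    """Draw W = M + diag(sqrt(row_var)) @ E @ diag(sqrt(col_var)),
    where E has independent standard normal entries."""
    noise = rng.standard_normal(mean.shape)
    return mean + np.sqrt(row_var)[:, None] * noise * np.sqrt(col_var)[None, :]

def variance_parameter_counts(n_rows, n_cols):
    """(fully factorized, matrix variate with diagonal covariances)."""
    return n_rows * n_cols, n_rows + n_cols
```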


Chapter 3

Predicting NDVI

In this chapter NDVI is predicted using soil moisture. The performance of different

neural network architectures is analyzed, as well as the influence of different areas and

time periods on performance. Afterwards a lag period is introduced between input

and prediction to quantify how far into the future soil moisture has predictive value.

Finally, the performance and benefits of using ensembles of locally connected neural

networks over a single large neural network are discussed.

All our models take a map with soil moisture levels as input and produce a map
with vegetation levels as output. To represent vegetation the normalized difference
vegetation index (NDVI) is used; throughout this chapter the terms vegetation and
NDVI are used somewhat interchangeably. Both the CCI Soil Moisture dataset and the
GIMMS NDVI dataset provide maps that cover the entire world. Many areas are
not interesting to look at because water is always readily available there, so soil
moisture has a small impact on vegetation. Furthermore, predicting vegetation for the
entire world instead of a single country is computationally more expensive. Hence,
only a single country is considered in the experiments, namely Australia. Australia
was chosen, and not for example South Africa, because it is a well-studied area and
because the data available for Australia is of high quality. To focus on Australia,
image patches of 140 by 180 pixels containing just Australia were extracted and used
as the new input and output datasets.

Figure 3.1 shows a few examples of input, output, and prediction maps (predictions

were done by our best performing model). Visually the predictions look similar to

the expected output, with areas bordering water looking more similar, while a few

spots in the middle of Australia prove more difficult to predict.

Figure 3.1: Input, output and prediction for a few samples in the test set: (a) the
first month in the test set, (b) the sixth month in the test set, (c) the twelfth month
in the test set, and (d) the wettest (and worst) month in the test set. For the
first column blue represents dry areas, while red represents wet areas. For the other
two columns blue represents no vegetation, while red represents high vegetation.

3.1 Using neural networks

In this thesis the focus is on using neural networks as a prediction model; however, it is helpful to establish a baseline using a few simpler models. This is done by applying two linear models to the problem: linear regression and ridge regression. The linear regression model has no hyperparameters and the ridge regression model uses a weight penalty of α = 100. Figure 3.2 shows that the ridge regression model performs significantly better than linear regression, suggesting that overfitting is a problem. The fact that ridge regression performs similarly to our best neural network suggests that the relation between soil moisture and NDVI might be of a linear nature.
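As a minimal sketch of the two baselines, assuming each soil moisture map is flattened into a feature vector and each vegetation map into a target vector, the closed-form ridge solution can be written with NumPy as below. The function name `fit_ridge`, the toy shapes, and the synthetic data are illustrative assumptions and not the code used in the experiments. Ordinary least squares corresponds to α = 0; a very small α stands in for it here because with fewer samples than features the unregularized problem is ill-posed.

```python
import numpy as np

def fit_ridge(X, Y, alpha):
    """Closed-form ridge regression: W = (X^T X + alpha I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

rng = np.random.default_rng(0)
# Hypothetical toy data: fewer samples than features, as with flattened maps.
X_train = rng.standard_normal((30, 50))
true_W = 0.1 * rng.standard_normal((50, 4))
Y_train = X_train @ true_W + 0.1 * rng.standard_normal((30, 4))

W_small = fit_ridge(X_train, Y_train, alpha=1e-3)   # nearly unregularized
W_ridge = fit_ridge(X_train, Y_train, alpha=100.0)  # weight penalty from the text

# A larger penalty shrinks the learned weights towards zero.
norm_small = np.linalg.norm(W_small)
norm_ridge = np.linalg.norm(W_ridge)
```

The shrinkage of the weight norm is what limits overfitting when the number of input pixels far exceeds the number of training samples.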

A variety of different neural network architectures were tested on our problem; the performance of the most interesting ones, together with the performance of the two baseline models, can be found in table 3.1. The neural network with the best performance has 2 hidden layers, each with 2500 ReLU units, and uses heavy ℓ1 and ℓ2 regularization. It performs about 39% better than the ridge regression model. Table 3.1 also contains the performance of a model that simply predicts all zeros, in

other words, it predicts there are no anomalies and so the predicted vegetation map

is equal to the average of all vegetation maps in the training data of that same time

period. The best neural network model performs about 20% better than this model.

The next few paragraphs will elaborate more on how these neural networks were

initialized, trained, how their architectures impacted performance, what properties

worked best and why.
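The relative improvements quoted above can be checked with a few lines of arithmetic. This is a small sketch: the helper name `relative_improvement` is an assumption, and the error values are the mean squared errors reported in the tables of this chapter (the best network's error of 0.00186817 is the value listed in table 3.3).

```python
def relative_improvement(baseline_error, model_error):
    """Fractional reduction in error relative to the baseline."""
    return (baseline_error - model_error) / baseline_error

# Mean squared errors as reported in this chapter.
ridge_error = 0.00305129
all_zeros_error = 0.00234371
best_nn_error = 0.00186817

vs_ridge = relative_improvement(ridge_error, best_nn_error)
vs_zeros = relative_improvement(all_zeros_error, best_nn_error)
print(f"vs ridge: {vs_ridge:.1%}, vs all zeros: {vs_zeros:.1%}")
```

Both values round to the figures given in the text, about 39% and about 20% respectively.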

All neural network weights were initialized by drawing from a zero-mean normal distribution with standard deviation 0.01, as described in Alex’ One Weird Trick Paper [Krizhevsky et al., 2012]. All biases were initialized to a constant value of 0.1.
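The initialization described above can be sketched as follows. This is an illustration only: the function name `init_layer` and the small layer sizes are assumptions chosen to keep the example light, whereas the real networks map 140 · 180 = 25200 input pixels to 2500 hidden units.

```python
import numpy as np

def init_layer(n_in, n_out, rng, weight_std=0.01, bias_value=0.1):
    """Zero-mean Gaussian weights and constant biases for one dense layer."""
    weights = rng.normal(loc=0.0, scale=weight_std, size=(n_in, n_out))
    biases = np.full(n_out, bias_value)
    return weights, biases

rng = np.random.default_rng(42)
# Hypothetical small layer; the sizes are placeholders, not the thesis values.
weights, biases = init_layer(n_in=252, n_out=25, rng=rng)
```

The small positive bias keeps ReLU units active at the start of training, which is a common motivation for this choice.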

Each neural network was optimized using the Adam [?] optimizer with a learning

rate of 0.0001. The Adam optimizer was initialized with the following parameters:

β1 = 0.9, β2 = 0.999, ε = 10^-8. The minibatch size was set to 24 at the beginning

of training and was slowly increased to the size of the complete training set. In total

there are about 520 samples in the dataset, of which 390 (75%) are used for training

purposes. The training dataset is small enough to allow for a full non-stochastic

gradient update.

So far no details have been given on what exactly is being optimized. The output data (the vegetation maps) are essentially matrices of real values, and so any matrix similarity measure might be used as an error measure. This work uses the squared ℓ2 norm $\|A - B\|_2^2$ divided by the number of entries $n$, which is simply the mean of the squared differences between two matrices: $E(A, B) = \frac{1}{n} \sum_{i=1}^{n} (A_i - B_i)^2$.

[Figure 3.2 plot: x-axis Time (2009 to 2013), y-axis Error; series: LR, Ridge, NN.]

Figure 3.2: Time series with the mean squared error of linear regression, ridge regression, and the best neural network, on a period from Aug 2008 to 2014.

Model               Parameters             Error
All zeros           -                      0.00234371
Linear regression   -                      0.0100653
Ridge regression    α = 100                0.00305129
Neural network      1x500 ReLU             0.00192698
Neural network      2x2500 ReLU            0.00186817
Neural network      2x2500 ReLU, dropout   0.00186956
Neural network      2x2500 tanh            0.0021368
Neural network      2x2500 sigmoid         0.00222824
Neural network      2x3500 ReLU, dropout   0.00186964

Table 3.1: The performance of several models. The error column contains the mean squared error for each model. The best performing model is an NN with parameters 2x2500 ReLU, meaning it has 2 hidden layers each with 2500 ReLU units and does not use dropout.

Heavy regularization was required for good generalization; strong ℓ1 regularization was especially important. This is probably due to the strong spatial relation

present in the data, i.e. vegetation growth in the south hardly depends on the soil

moisture levels in the north. For our models, regularization rates of ℓ1 = 10^-6 and ℓ2 = 10^-5 worked well as a general rule. These were also the exact values used as regularization constants in our best performing neural network. For smaller networks the regularization parameters were decreased by a factor of 10 to 100, and for larger networks they were increased by a factor of 10 to 100.
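To make the role of the two regularization constants concrete, the sketch below combines the mean squared error with ℓ1 and ℓ2 weight penalties. The exact form of the penalties (plain sums of absolute and squared weights, without a factor of one half) is an assumption, as is the function name `regularized_loss`; the text only specifies the two rates.

```python
import numpy as np

def regularized_loss(prediction, target, weight_matrices, l1=1e-6, l2=1e-5):
    """Mean squared error plus l1 and l2 penalties on all weight matrices."""
    mse = np.mean((prediction - target) ** 2)
    l1_penalty = sum(np.abs(w).sum() for w in weight_matrices)
    l2_penalty = sum((w ** 2).sum() for w in weight_matrices)
    return mse + l1 * l1_penalty + l2 * l2_penalty

# Tiny hypothetical example: a perfect prediction leaves only the penalties.
target = np.array([[0.1, -0.2], [0.0, 0.3]])
weights = [np.array([[1.0, -2.0], [0.0, 3.0]])]
loss = regularized_loss(target, target, weights)
```

The ℓ1 term pushes individual weights to exactly zero, which fits the observation that an output pixel should depend on only a small neighbourhood of input pixels.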

Out of the three different non-linearities tested the ReLU non-linearity performed

best. As table 3.1 shows, it is about 15% to 20% more accurate than the sigmoid or tanh non-linearities. The problem of predicting vegetation using soil moisture can be seen as a problem where essentially one map is morphed into another, both

containing real values that are somehow correlated. By using a non-linearity that

squashes values such as the sigmoid or tanh (see figure 2.6) the model loses valuable

information about the input signal.

Table 3.1 shows that having a neural network with one hidden layer only decreases

performance by 5%, again suggesting that the relation between soil moisture and

NDVI has a linear nature. Having three or more hidden layers did not improve

performance and having more than five hidden layers actually decreased performance.

This is probably due to the small amount of available data; deep neural networks

require a lot of data samples to train on. Experiments showed that 2500 hidden units

per layer was the sweet spot. Adding any additional hidden units did not result in an

increase of performance. Neural networks with fewer hidden units (e.g. 500) were able to predict general vegetation levels of large areas correctly, but could not accurately

predict smaller areas.

Finally, adding dropout [Srivastava et al., 2014] to the neural networks hardly

impacted the test error, which is surprising. It did, however, improve the training

error. Dropout makes a lot of sense because, as mentioned above, there is a strong

spatial relation, where an output pixel only depends on the surrounding input pixels.

By dropping a lot of pixels from the input image the model effectively gets rid of a lot

of noise. One possible reason for the lack of improvement in test error is the strong

`1 regularization all neural networks have. Another possible explanation would be

that our neural network architectures are not very deep, which is often a requirement for dropout to work satisfactorily.

One architecture that has not been discussed so far is that of convolutional net-

works. Convolutional networks did not perform as well as conventional neural net-

works and due to their high computational cost no further research was done in this

direction. There are two problems with the convolutional neural networks: the first one


is that filters are only applicable locally and therefore a lot of them are a waste of time

and space, and the second one is that pooling removes a lot of important information.

This is a problem as the model needs to learn exactly which pixels have increased or

decreased vegetation, and by what amount, and not just the general areas which have

increased or decreased vegetation. Another problem with using convolutional neural

networks is that they are inherently deep, which due to the amount of available data,

is a problem.

3.2 Analyzing the performance

In this section the performance of our best model, a 3 layer neural network with 2500 ReLU units per hidden layer, is analyzed. First, the performance of the prediction method over different areas in Australia is analyzed. Second, the impact of timing on the predictive skill is investigated. As will be shown in the following two sections, the prediction performance depends on the quantity of soil moisture available

where more soil moisture means worse performance. Finally, in the last subsection an

attempt will be made to explain why an increase in soil moisture results in a decrease

in test accuracy.

3.2.1 Performance on different areas

In this section different regions of Australia are analyzed to see if vegetation is more

difficult to predict in any of the regions. Measuring the difficulty of predicting vege-

tation of a region will be done by aggregating the mean squared error per pixel over

the test data.

The resulting map can be seen in figure 3.3a, and shows that there are a few small

spots in the middle of Australia where the errors are concentrated. These seem to be

spots where the model generalizes poorly. To get a more realistic look at areas that are troublesome, another map, in which the maximum error is bounded by 0.02, is shown in figure 3.3b.
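The per-pixel aggregation and the bounding used for the second map can be sketched as follows. The array shapes, the synthetic data, and the function name `pixel_error_map` are assumptions for illustration; the real maps are 140 by 180 pixels and the test set holds 130 samples.

```python
import numpy as np

def pixel_error_map(predictions, targets, max_error=None):
    """Mean squared error per pixel, averaged over the sample axis.

    predictions, targets: arrays of shape (n_samples, height, width).
    max_error: optional upper bound applied for visualisation only.
    """
    error_map = np.mean((predictions - targets) ** 2, axis=0)
    if max_error is not None:
        error_map = np.minimum(error_map, max_error)
    return error_map

rng = np.random.default_rng(0)
# Hypothetical stand-ins: 130 test samples on a reduced 14 by 18 grid.
targets = rng.normal(0.0, 0.05, size=(130, 14, 18))
predictions = targets + rng.normal(0.0, 0.04, size=targets.shape)
predictions[:, 7, 9] += 0.5  # one artificially hard pixel

unbounded = pixel_error_map(predictions, targets)
bounded = pixel_error_map(predictions, targets, max_error=0.02)
```

Bounding the map prevents a handful of extreme pixels from dominating the colour scale, so that moderate regional differences remain visible.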

Figure 3.3b shows that the errors concentrate in east Australia and that the north-

west of Australia contains the least amount of errors. It is probable that this happens

due to sudden changes in wetness in these areas. Figure 3.4 shows the average soil

moisture over the entire dataset and for the year 2011, which is the wettest period. Looking at these figures one can see that, for example, in 2011 there is a spike in soil moisture in south Australia, which is an area with high error. Similar reasoning

can be applied to other spots where there is a large difference between average soil


moisture and soil moisture in 2011, for example, the spots in mid Australia.

[Figure 3.3 panels: (a) Error map with unbounded errors. (b) Error map with the error bounded to a maximum of 0.02.]

Figure 3.3: Average error maps where each pixel represents the average mean squared

error for that pixel.

[Figure 3.4 panels: (a) Average soil moisture over the entire dataset. (b) Average soil moisture over the wettest year (2011).]

Figure 3.4: Maps with the average soil moisture for Australia

However, this does not explain the lack of error in the north west of Australia

which also has an increased soil moisture during 2011. The average NDVI over the

entire dataset and the average NDVI over 2011 are shown in figure 3.5. From these

two figures one can clearly see that even though there were increased soil moisture

levels in the north west there was hardly any increase in NDVI. Intuitively, this can

be explained by the idea that different areas respond differently to an increase of soil

moisture. As our neural network has not encountered anything this extreme before it


has to guess which areas respond strongly to the increased soil moisture and which do

not respond at all. Furthermore, areas with less NDVI response have less variation

and are thus easier to predict.

[Figure 3.5 panels: (a) Average NDVI over the entire dataset. (b) Average NDVI over the wettest year (2011).]

Figure 3.5: Maps with the average NDVI for Australia

3.2.2 Performance on different time periods

The analysis in the previous section hinted at a strong relation between soil moisture

levels and prediction performance. This section further explores this theory and

analyzes the driest and wettest periods and their performances.

The five driest periods in the test set are the samples from the periods 1-15 Aug 2008, 1-15 Nov 2009, 1-15 Oct 2009, 16-31 Oct 2009, and 16-30 Nov 2009; these are the 1st, 32nd, 30th, 31st, and 33rd samples in the test set. These samples have the 3rd, 34th, 20th, 21st and 26th best performances (out of 130 samples). These performances are not spectacular, but nearly all of them are in the top 20% of performances. In contrast, the five wettest periods are all in the bottom 8% of performances. The five wettest periods are: 1-15 Mar 2011, 16-28 Feb 2011, 16-30 Mar 2011, 1-15 Apr 2011, and 1-15 Feb 2011. They have the 3rd, 1st, 2nd, 5th and 10th worst prediction performances; four of them are even among the 5 worst performances.

To further solidify the relation between soil moisture and prediction performance

two time series are shown in figure 3.6. The green line represents the error of the

neural network and is scaled to be on the same scale as that of the blue line which

represents soil moisture. Besides the clear visual relation between the two time series, they have a Pearson correlation of 0.816. In other words, 66.5% of

the error variance can be explained by a simple linear regression on the average soil


moisture. A scatter plot with the error on the y-axis and the average soil moisture on

the x-axis can be seen in figure 3.7. This scatter plot shows a clear positive relation

between the average soil moisture and the prediction error of the neural network.
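The link between the Pearson correlation and the share of explained variance can be sketched in plain Python. The per-sample values below are hypothetical placeholders and the helper `pearson_r` is an assumption for illustration; only the correlation of 0.816 is taken from the text. Squaring the rounded value gives 0.816² ≈ 0.666, in line with the roughly 66.5% quoted above given that the reported correlation is itself rounded.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-sample values, for illustration only.
avg_soil_moisture = [-0.2, -0.1, 0.0, 0.3, 0.8]
test_error = [0.0010, 0.0014, 0.0013, 0.0025, 0.0060]

r = pearson_r(avg_soil_moisture, test_error)
explained = r ** 2  # share of variance explained by a simple linear fit

# For the correlation reported in the text:
reported_r = 0.816
reported_r_squared = reported_r ** 2
```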

[Figure 3.6 plot: x-axis Time (2009 to 2013); series: Average SM, Error.]

Figure 3.6: The average soil moisture and (scaled) test error per sample over a period

from Aug 2008 to 2014. The graphs move similarly, suggesting a relation between soil

moisture levels and errors.

Combining the observations of this section with the observations of the previous

sections one can conclude there is a strong negative relation between wetness and

prediction performance. The next section will attempt to explain why wetness de-

creases performance.

3.2.3 Why wetness decreases performance

The previous two sections showed that an increase of soil moisture often results in

an increase of test error or a decrease in test accuracy. This section discusses two

reasons why this is.

Firstly, an increase in soil moisture often results in an increase in vegetation which

leads to greater variability. This variability is difficult to predict as each area reacts

differently and as there is little data available. Another possible issue is that the

soil moisture differential maps fed to the neural networks are relative to the average

soil moisture for that month. In other words, the inputs are relative values, not, say,

percentages of change, which means that one input map might represent a 5% increase


[Figure 3.7 plot: x-axis Average soil moisture, y-axis Error.]

Figure 3.7: Scatter plot comparing average soil moisture and error

in soil moisture for one month but 15% for another month. Further research using

percentages of change as input to the models did not yield any positive performance

improvements, suggesting that this is not the issue.

Secondly, vegetation is a complex process that is influenced by a lot of different

factors. Soil moisture, which is part of water availability, is only one such factor.

Other factors include sunlight availability, geography of the area, and temperature. During drier periods, when water is scarce, soil moisture is one of the most important factors, because water availability is the limiting factor. During wetter periods, when water availability is high, soil moisture becomes a less important factor and other

factors such as sunlight availability become more important.

3.2.4 Adaptability on anomalies

In the experiments conducted above the models are simply trained on 75% of the

available data and tested on the remaining 25%. However, this is not a very realistic

scenario. In practice, the machine learning models would be continuously re-trained

whenever new data becomes available. Therefore, for practical applicability of the models, it might be interesting to see if the model could have predicted the anomaly (the wettest year) better had it been trained on all data available up until just before this anomaly. Furthermore, it might be interesting to see when the neural

network adapts and learns how vegetation reacts to these wet circumstances.

To investigate how quickly the neural network adapts to wet circumstances several

neural networks have been trained. Each was given one more training sample (from 2011, the wettest year) than the previous one. Figure 3.8 shows the performance of four

neural networks on the data for the year 2011 - 2012. The blue line corresponds to the

error of the original neural network that has not seen any extra data, the green line

corresponds to the error of the neural network that has been trained up until 2011,

the red line corresponds to the error of the neural network that has been trained up

to the first vertical black line, and the cyan line corresponds to the error of the neural network that has been trained up until the second vertical black line.

[Figure 3.8 plot: x-axis Time, y-axis Error; series: Original, End 2011, BL 1, BL 2.]

Figure 3.8: Error on 2011 - 2012 period for different training datasets. Original is

the original 75% training dataset, End 2011 is trained until the end of 2011, BL 1 is

trained till the first vertical black line, and BL 2 is trained until the second vertical

black line.

The results show that training until 2011 gives a performance boost for the first

two months of 2011, which was expected; however, it does not improve performance for

the wettest period from March to May. Training until just before the wettest period

results in a 28.5% performance improvement over the original neural network for

the wettest period. Extending the training data by including the first peak of the


wettest period results in a 45% performance improvement over the original neural network for the second peak in the wettest period. One can conclude that the neural

network has trouble predicting the anomaly until just before it happens, and that

once reaching that peak it quickly adapts. This is useful for practical applicability

of the neural network because one could be confident it quickly adapts to current

trends.

3.3 Predicting further into the future

In the previous experiments the models were trying to predict vegetation for the

same period as the given soil moisture. In other words the lag between an input and

output sample was zero. To better quantify the predictive value of soil moisture, a

lag period between input and output samples is introduced. This lag period ranges

from zero to four months and allows for analysis on where relevant soil moisture

information is being stored. In other words, it allows for quantification of the future

predictive value of soil moisture for vegetation.

Multiple neural networks, each with the same architecture, parameters and opti-

mization methods as described in the previous section, but each with a different lag

period have been trained. In total four neural networks were trained, one with a lag

of 1 month, one with a lag of 2 months, one with a lag of 3 months, and one with a

lag of 4 months.
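Constructing the training pairs for a given lag amounts to shifting the output sequence relative to the input sequence. The sketch below assumes two samples per month (the half-month periods mentioned earlier), so that a lag of one month is a shift of two samples; the function name `make_lagged_pairs` and the placeholder strings standing in for maps are illustrative assumptions.

```python
def make_lagged_pairs(inputs, outputs, lag_steps):
    """Pair input sample t with output sample t + lag_steps."""
    if lag_steps < 0:
        raise ValueError("lag_steps must be non-negative")
    if lag_steps == 0:
        return list(zip(inputs, outputs))
    return list(zip(inputs[:-lag_steps], outputs[lag_steps:]))

# Assumption: two samples per month, so one month of lag is two steps.
SAMPLES_PER_MONTH = 2

soil_moisture = [f"sm_{t}" for t in range(8)]  # placeholders for input maps
ndvi = [f"ndvi_{t}" for t in range(8)]         # placeholders for output maps

pairs_no_lag = make_lagged_pairs(soil_moisture, ndvi, 0)
pairs_lag_1_month = make_lagged_pairs(soil_moisture, ndvi, 1 * SAMPLES_PER_MONTH)
```

Note that each additional month of lag removes two usable pairs from the ends of the series, so longer lags are trained on slightly less data.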

The mean squared error per lag has been plotted in figure 3.9, and table 3.2

shows the combined mean squared error per lag. Looking at the wet anomalies in

2011, figure 3.9 clearly shows that most important information is contained within

the first month. For time periods where the behavior is relatively normal, all lags

seem to perform equally well; this is probably due to the strong seasonality present in the problem. Looking at the mean squared error, there seems to be a roughly linear increase

in error per increased lag period.

A model which always predicts all zeros is added to table 3.2, all zeros meaning

it predicts the average NDVI recorded for that biweek. This clearly shows that the

neural network with a lag of 3 months is only marginally better than predicting

all zeros, and the neural network with a lag of 4 months performs even worse than

predicting all zeros. This suggests that all relevant information is present in the three

months preceding the month one wants to predict.

To further analyze the future predictive value of soil moisture the average vegeta-

tion per time period is shown in figure 3.10, with table 3.2 showing the corresponding

Pearson correlation between the averages of true vegetation maps and each lag period. The Pearson correlation coefficients show that neural networks with a lag up to

two months can still predict the average vegetation trends well, however, with a lag

of 3 months or more these predictions become inaccurate. This reinforces the idea

that all relevant information is present in the three months preceding the month one

wants to predict.

[Figure 3.9 plot: x-axis Time (2009 to 2013), y-axis Error; series: No lag, Lag of 1 month, Lag of 2 months, Lag of 3 months, Lag of 4 months, All zeros.]

Figure 3.9: The mean squared error per lag period from 2008 to 2014.

Lag               Error        Pearson correlation
Zero lag          0.00186817    0.800
Lag of 1 month    0.00205393    0.581
Lag of 2 months   0.00213289    0.234
Lag of 3 months   0.00230171   -0.032
Lag of 4 months   0.00237467   -0.095
All zeros         0.00234371    -

Table 3.2: The mean squared error and Pearson correlation of different lag periods


[Figure 3.10 plot: x-axis Time (2009 to 2013), y-axis Average NDVI; series: Target averages, No lag, Lag of 1 month, Lag of 2 months, Lag of 3 months, Lag of 4 months.]

Figure 3.10: The correct average vegetation and the predicted average vegetation for

each lag period from 2008 to 2014.

3.4 Locally connected methods

The previous sections showed there is a strong spatial relation present in the data.

This section will take advantage of this by using multiple neural networks that each

predict a single pixel of the NDVI map. Instead of having one neural network that

predicts the entire vegetation map at once, 140 · 180 = 25200 neural networks were

trained each predicting one pixel of the vegetation map. Instead of receiving the

entire soil moisture map as input each neural network will receive a 16 by 16 patch

surrounding the pixel it tries to predict as input. At 25 km per pixel, a 16 by 16 pixel patch covers an area of 400 km by 400 km, which should contain all the important information.
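Extracting the input patch for a single output pixel can be sketched as follows. How pixels near the border were handled is not stated in the text; zero-padding is an assumption made here (for anomaly inputs a zero corresponds to the average), as are the function name `extract_patch` and the dummy map. Because the patch size is even, the target pixel cannot sit exactly in the centre.

```python
import numpy as np

def extract_patch(soil_moisture_map, row, col, size=16):
    """Return a size-by-size patch roughly centred on (row, col).

    The map is zero-padded so that border pixels also get a full patch.
    """
    half = size // 2
    padded = np.pad(soil_moisture_map, half, mode="constant", constant_values=0.0)
    # In padded coordinates the pixel sits at (row + half, col + half); the
    # window covers `half` pixels before it and `half - 1` pixels after it.
    return padded[row:row + size, col:col + size]

# Dummy 140 by 180 map whose values encode their own position.
sm_map = np.arange(140 * 180, dtype=float).reshape(140, 180)

patch = extract_patch(sm_map, row=70, col=90)
corner_patch = extract_patch(sm_map, row=0, col=0)
n_networks = sm_map.size  # one small network per output pixel
```

Each of the 25200 small networks then receives a flattened vector of 16 · 16 = 256 values instead of the full 25200-pixel map.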

The first experiment will use a one layer neural network, essentially a perceptron,

as architecture. These neural networks are initialized and trained similarly to previous

experiments. There are two notable differences in the hyperparameters: one is that the regularization parameters have been changed to ℓ1 = 10^-3 and ℓ2 = 10^-3, and the second one is that the learning rate has been decreased by a factor of 10 to 10^-5. These hyperparameters have been adjusted because the neural networks are much smaller.

The second experiment is similar to the first one, but changes the neural network architecture to a two layer neural network with 16 ReLU units. Again these neural networks are initialized and trained like before. The learning rate is the same as in our first experiment, 10^-5; however, the regularization parameters have changed to ℓ1 = 10^-5 and ℓ2 = 10^-2.

Model                                 Error
Neural network                        0.00186817
Ensemble of 1 layer neural networks   0.00194662
Ensemble of 2 layer neural networks   0.00184438

Table 3.3: The mean squared error of our best neural network and of the two ensembles.

Table 3.3 shows the test error of our standard neural network, our most optimized

neural network and these two ensembles of neural networks. Both ensemble models

perform better than the standard neural network, with the one layer neural network

ensemble performing worse than the optimized neural network and the two layer

neural network ensemble performing a tiny bit better. This reinforces the intuition

that there is a strong spatial relation in the data. The fact that the ensemble of

one layer neural networks is able to predict with such high accuracy again suggests

that there are a lot of areas where there is a linear relation between soil moisture

and vegetation. The fact that the 2 layer neural network ensemble performs better

suggests that there are at least a few areas where there is a non-linear relation between soil

moisture and vegetation. Regardless of what kind of relation it is, it is clear that

these ensembles perform equally well or better than a single large neural network.

A plot containing the average error per sample can be seen in figure 3.11. It is

interesting to see that although the ensemble of 2 layer neural networks outperforms

the other neural networks on average, it does not outperform them on the wet periods

in 2011-2012. In fact the ensemble of one layer neural networks performs best there.

One disadvantage of having to train so many neural networks is that manual tuning

of the hyperparameters is infeasible, most importantly the regularization parameters.

In a single neural network, as all the weights are regularized together, the neural

network is able to make intelligent decisions about which neurons require more weight,

and which can do with less. This is in contrast with the ensemble of neural networks

where it is highly likely that some neural networks will overfit and some will underfit,

as the hyperparameters are not optimized per neural network separately. This makes

selecting a good set of hyperparameters for the neural networks in the ensemble extra

important. Due to computational constraints, in our experiments above we gave each


[Figure 3.11 plot: x-axis Time (2009 to 2013), y-axis Error; series: NN, LC 1 layer, LC 2 layers.]

Figure 3.11: The mean squared error of our best neural network and of the two

ensembles. The green line represents the error of the ensemble of 1 layer neural

networks, and the red line the ensemble of 2 layer neural networks.

neural network in the ensemble the same hyperparameters. The performance of these

ensembles could be improved by applying techniques such as grid search or random

search to select a good set of hyperparameters for each neural network in the ensemble

individually.

Having an ensemble of neural networks also offers some advantages. Besides the

improved performance, the neural networks are a lot smaller and can be trained

very quickly, and as they are all independent they can be trained in parallel. For

our datasets, where the resolution of input maps is still manageable, this might not

seem very important, especially with the support for multi-CPU/GPU and distributed

training in deep learning frameworks. However, when this resolution scales up by a

large factor this becomes a problem. This is mainly due to memory constraints and

the computational overhead. Remember that our input maps contain 25200 pixels and that one pixel corresponds to an area of 25 km by 25 km. There are efforts, for example by VanderSat, to improve the quality of these maps such that one pixel corresponds to an area of 100 m by 100 m. This refines each axis by a factor of 250, leading to input maps of roughly 1.6 billion pixels (25200 · 250² = 1575000000).

For input maps of this magnitude these locally connected methods might be highly

beneficial.


Chapter 4

Predicting NDVI with uncertainty

The previous chapter focused on predicting vegetation using soil moisture. In this

chapter vegetation will again be predicted using soil moisture, however, this time

models are used that also try to assign a level of uncertainty to the predictions. A

disadvantage of the neural networks used in the previous chapter is that they do not

provide any kind of uncertainty measure with their output, in other words, they do

not provide any confidence intervals. This is especially important for problems where

key decisions are being made based on the predictions.

To add a level of confidence to the predictions Bayesian methods are used. The

focus here will be on Bayesian neural networks. Many different kinds of Bayesian neural network models exist; in this work the focus will be on a variational one called the

Variational Matrix Gaussian introduced by Louizos and Welling [2016].

Again the GIMMS NDVI dataset is used for vegetation and the CCI SM dataset is

used for soil moisture. The same data preprocessing steps as in the previous chapter

have been applied. Figure 4.1 shows a few examples of input, output, and prediction

maps (predictions were done by the best performing Bayesian neural network). Again

the predictions look very accurate, albeit a bit worse than the non-Bayesian neural

network.

The next section will focus on different architectures and hyperparameters and

investigate how each one affects performance on our problem. The remaining sections

will focus on the analysis of uncertainty levels for different areas and time regions.

4.1 Bayesian neural networks

This section presents the results of the Variational Matrix Gaussian model for various

architectures and hyperparameters. The performance of various models is analyzed, and non-Bayesian neural networks and Bayesian neural

[Figure 4.1 panels, each showing input, output, and prediction maps: (a) The first month in the test set. (b) The sixth month in the test set. (c) The twelfth month in the test set. (d) The wettest (and worst) month in the test set.]

Figure 4.1: Input, output and prediction for a few samples in the test set. For the

first column blue represents dry areas, while red represents wet areas. For the other

two columns blue represents no vegetation, while red represents high vegetation.


networks are compared.

All models are trained with the Adam optimizer using the following parameters:

β1 = 0.9, β2 = 0.999, ε = 10^-8. A learning rate of 0.01 and a batch size of 72 were

used. Each Bayesian neural network was trained for 100 epochs. The same initializa-

tion and parameterization as described in the regression experiments of Louizos and

Welling [2016] were used. This means all models were initialized using the default

he2 initialization scheme [He et al., 2015] for the mean of each matrix variate Gaussian. A Gamma prior $p(\tau) = \mathrm{Gamma}(a_0 = 6, b_0)$ was introduced, as was a posterior $q(\tau) = \mathrm{Gamma}(a_1, b_1)$, for the precision of the Gaussian likelihood. The matrix variate Gaussian prior for each layer was parametrized as $p(W) = \mathcal{MN}(0, \tau_r^{-1} I, \tau_c^{-1} I)$, where $p(\tau_r)$ and $p(\tau_c)$ equal $\mathrm{Gamma}(a_0 = 1, b_0 = 0.5)$, with posteriors $q(\tau_r) = \mathrm{Gamma}(a_r, b_r)$ and $q(\tau_c) = \mathrm{Gamma}(a_c, b_c)$. The pseudo-data was initialized using samples from the entries of A, B. One difference is that instead of using one posterior sample to

estimate the expected log-likelihood used to update the parameters, five posterior

samples were used.
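The idea of averaging over several posterior samples can be illustrated with a deliberately simplified sketch. It uses a single linear layer with a fully factorized Gaussian posterior over the weights, which is an assumption made for brevity and not the matrix variate Gaussian posterior of Louizos and Welling [2016]; the function names and the toy data are likewise hypothetical.

```python
import numpy as np

def gaussian_log_likelihood(y, y_pred, tau):
    """Log density of y under N(y_pred, 1/tau), summed over all entries."""
    n = y.size
    return (0.5 * n * (np.log(tau) - np.log(2.0 * np.pi))
            - 0.5 * tau * np.sum((y - y_pred) ** 2))

def mc_expected_log_likelihood(x, y, w_mean, w_std, tau, rng, n_samples=5):
    """Average the log-likelihood over weight samples drawn from q(W)."""
    total = 0.0
    for _ in range(n_samples):
        # Stand-in posterior: independent Gaussian per weight (assumption).
        w = w_mean + w_std * rng.standard_normal(w_mean.shape)
        total += gaussian_log_likelihood(y, x @ w, tau)
    return total / n_samples

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3))
w_mean = rng.standard_normal((3, 2))
y = x @ w_mean  # targets generated exactly by the posterior mean

estimate = mc_expected_log_likelihood(
    x, y, w_mean, np.full((3, 2), 0.1), tau=6.0, rng=rng)
```

Using five samples instead of one lowers the variance of the estimate, at the cost of five forward passes per parameter update.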

Table 4.1 shows the test error for the most interesting architecture and hyperparameter combinations. All models perform similarly to their non-Bayesian counterpart,

albeit a little bit worse. The best Bayesian neural network has 3 hidden layers each

with 2500 ReLU units, uses 500 pseudo data pairs, and has a variational dropout rate

of 0.05.

The performance of various architectures for Bayesian neural networks follows a

similar pattern to that of non-Bayesian neural networks. For example increasing the

number of hidden units past 2500 did not result in any performance improvements.

One difference is that, in contrast with the non-Bayesian neural network, the Bayesian neural network with 3 hidden layers performs better than the Bayesian neural network with 2 hidden layers.

Selecting the number of pseudo-data pairs is a trade-off between increased performance and increased training time. In a sense it is a limiting factor as increasing

the number of pseudo-data pairs never decreases performance. By manual search 500

pseudo-data pairs was selected as a good balance. Similarly by a simple linear search

the variational dropout rate resulting in the best performance was found to be 0.05.

It is interesting to compare the errors of the best performing Bayesian neural network and the best non-Bayesian neural network. Both error time series are therefore shown in figure 4.2. The plot shows that the two curves are very similar: the models perform well on the same time periods and poorly on the same time periods. The Bayesian neural network almost always performs slightly worse than the non-Bayesian neural network, but never much worse.


Model                     Parameters     pdp   vdr   Error
Neural network            2x2500 ReLU    -     -     0.00186817
Bayesian neural network   1x500 ReLU     5     0.1   0.002008188

Table 4.1: The mean squared error of several models. pdp stands for pseudo-data pairs, and vdr stands for variational dropout rate. The best Bayesian NN has 3 hidden layers, each with 2500 ReLU units.

[Plot: Error against Time (2009 to 2013); legend: NN, Bayes NN.]

Figure 4.2: The mean squared error for the best Bayesian neural network and best regular neural network on a period from Aug 2008 to 2014.

To show how similar the predictions are, figure 4.3 contains the expected output, the prediction of the non-Bayesian neural network, and the prediction of the Bayesian neural network for three samples in the test dataset. All images have the same scale, with red representing higher than average vegetation and blue representing lower than average vegetation. Because the Bayesian neural network performs similarly to the non-Bayesian neural networks and has the added benefit of providing confidence levels, it might be a suitable alternative to non-Bayesian neural networks.

[Figure 4.3 panels, three maps each: (a) Period of 1-15 August 2009. (b) Period of 16-30 November 2009. (c) Period of 16-31 March 2013.]

Figure 4.3: The output, non-Bayesian NN prediction, and Bayesian NN prediction for a few samples in the test set. Blue represents no vegetation, while red represents high vegetation.


4.2 Analyzing the uncertainty

This section analyzes the performance of the best performing Bayesian neural network. As a strong relationship between soil moisture and the models’ accuracy was already established in the previous chapter, the focus of this section is on the relationship between soil moisture and uncertainty, and between error levels and uncertainty. A strong relationship between soil moisture and uncertainty would further support the claim that soil moisture has a strong predictive value for vegetation, and a strong relationship between error levels and uncertainty would demonstrate the effectiveness of Bayesian neural networks. Additionally, the difference in uncertainty between different areas and time periods is analyzed.

4.2.1 Uncertainty in different areas

This section analyzes the average error, uncertainty, soil moisture, and vegetation maps, which are shown in figure 4.4. The average error, soil moisture, and vegetation maps are created in the same way as in the previous chapter. The average uncertainty map represents the standard deviation of 1000 drawn samples for each pixel.
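As a minimal sketch of how such a map could be computed, assume a hypothetical `sample_fn` that returns one stochastic prediction map per call (for example one forward pass with freshly sampled weights). The per-pixel uncertainty is then the standard deviation across the stacked samples. Averaging the per-period maps over time is an assumption about how the average map was formed, not something stated in the text.

```python
import numpy as np

def uncertainty_map(sample_fn, num_samples=1000):
    """Per-pixel standard deviation across repeated stochastic predictions."""
    # Stack to shape (num_samples, height, width) and reduce over the sample axis.
    samples = np.stack([sample_fn() for _ in range(num_samples)], axis=0)
    return samples.std(axis=0)

def average_uncertainty_map(per_period_maps):
    """Assumed aggregation: mean of the per-period uncertainty maps."""
    return np.mean(np.stack(per_period_maps, axis=0), axis=0)
```

A deterministic `sample_fn` yields an all-zero map, which is a quick sanity check that the sampling is actually stochastic.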

Ideally the error map and the uncertainty map would be roughly equal. Unfortunately, this is not the case. Many areas in central and southern Australia with high error have low uncertainty, an undesirable result. Similarly, many areas with low error have relatively high uncertainty, for example southwest Australia.

The average soil moisture map resembles the average uncertainty map more closely than the error map does, with higher soil moisture levels corresponding to more uncertainty. This suggests that soil moisture has a stronger predictive value when soil moisture levels are low, which is in line with the results seen so far. Areas with high soil moisture levels and low error generally also have low uncertainty. This suggests that soil moisture is the leading factor for uncertainty unless the models’ predictions are accurate. The vegetation map and the uncertainty map share the same general structure but do not seem to have any clear relation.

Although these maps seem to confirm soil moisture’s predictive value, they fail to show that the uncertainty estimates are working well. The next section focuses on uncertainty in different time periods and shows a clearer relationship between soil moisture levels, error levels, and vegetation levels.

[Figure 4.4 panels: (a) The average error map bounded by 0.02. (b) The average uncertainty map. (c) The average soil moisture. (d) Average vegetation.]

Figure 4.4: The error, uncertainty, average soil moisture, and average vegetation maps for the best Bayesian neural network. All maps are averaged over a period from Aug 2008 to 2014.

4.2.2 Uncertainty in different time periods

The average maps in the previous section failed to show a clear relation between the uncertainty map and the error and soil moisture maps. This section analyzes the time series of the average soil moisture levels, error levels, and uncertainty levels, as well as specific time periods. The time series are shown in figure 4.5. All three time series are scaled to lie between 0 and 1 to make them easier to compare.

[Plot: Average SM, Error, and Uncertainty against Time (2009 to 2013).]

Figure 4.5: The average uncertainty, average error, and average soil moisture levels of each map from Aug 2008 to 2014.

The average soil moisture levels and error levels seem similar to those of the non-Bayesian neural network, and are highly correlated. The uncertainty levels can be roughly categorized into three groups: large uncertainty, medium uncertainty, and low uncertainty. Visually, uncertainty seems to be more related to soil moisture than to error levels. This is confirmed by the Pearson correlation coefficients, which are 0.724 for soil moisture and uncertainty, and 0.525 for error and uncertainty. This provides extra evidence that soil moisture’s predictive value becomes stronger as soil moisture becomes scarce. When soil moisture is readily available, other ecological factors such as temperature and solar radiation become more important for vegetation levels. As this information is unavailable to the models, the uncertainty increases, and potentially also the error. This explains why uncertainty is more closely related to soil moisture than to the error levels.
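The scaling and correlation steps can be sketched as follows, using made-up toy series rather than the thesis data; the helper names are hypothetical. Note that the Pearson coefficient is unchanged by min-max scaling, so the scaling only affects how the curves are plotted, not the correlations computed from them.

```python
import numpy as np

def min_max_scale(series):
    """Rescale a non-constant series linearly to the range [0, 1]."""
    series = np.asarray(series, dtype=float)
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo)

def pearson(a, b):
    """Pearson correlation coefficient between two equally long series."""
    return float(np.corrcoef(a, b)[0, 1])

# Toy stand-ins for two per-period averages (not the thesis data).
soil_moisture = [1.0, 2.0, 3.0, 4.0, 5.0]
uncertainty = [2.0, 1.0, 4.0, 3.0, 5.0]

r_raw = pearson(soil_moisture, uncertainty)
r_scaled = pearson(min_max_scale(soil_moisture), min_max_scale(uncertainty))
```

Here `r_raw` and `r_scaled` agree up to floating-point error, because min-max scaling is a positive linear transformation of each series.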

From the analysis it is clear that there is a strong positive relation between soil moisture and uncertainty. However, as shown in the previous section, it is not clear that the uncertainty is concentrated in the right areas. To further investigate the relations between uncertainty and soil moisture, and between uncertainty and error, a few sample cases are analyzed. Differential soil moisture, error, and uncertainty maps for a few selected samples are shown in figure 4.6.

Although the Bayesian neural network captures general uncertainty levels well, it is unable to consistently estimate high uncertainty for areas with high error. In other words, the model is able to detect the existence of areas with relatively high error, but it is not able to estimate exactly where they are. Analysis of the error maps and uncertainty maps for all test cases showed that uncertainty was most accurately estimated when there was little overall uncertainty. There are only a handful of test cases where this is the case; one such example is shown in figure 4.6d. There the model correctly estimates high uncertainty in mid-west Australia, the area where there is also high error. More commonly, however, the model is not able to estimate where these areas are. The result is a very grainy uncertainty map where areas with high error have low uncertainty and areas with low error have high uncertainty. A few such examples are shown in figures 4.6a, 4.6b, and 4.6c.

The relation between uncertainty and soil moisture suffers from the same problem as the relation between uncertainty and error levels: the model is unable to consistently estimate high uncertainty for areas with large amounts of soil moisture, which would be expected given the strong positive relation between soil moisture and uncertainty levels. It could have been that these areas simply have low error. However, as can be seen in figure 4.6, this is not the case.

In conclusion, the Bayesian neural network is able to capture the general uncertainty, but it cannot pinpoint the location of this uncertainty. Further research is required to investigate why the uncertainty estimates are so poorly localized.

[Figure 4.6 panels, three maps each: (a) Period of 1-15 January 2010. (b) Period of 1-15 April 2011. (c) Period of 16-30 September 2013. (d) Period of 1-15 November 2013.]

Figure 4.6: Maps showing the soil moisture (column 1), uncertainty (column 2), and error (column 3) for a few selected test cases. Both the error and uncertainty maps represent percentage change from the average for that period.


Chapter 5

Conclusion

The previous chapters have shown that vegetation can be successfully predicted using satellite based soil moisture, thereby showing that soil moisture has a strong predictive value for vegetation. They have also shown that soil moisture has a stronger predictive value when it is available in limited quantities. In other words, soil moisture has a stronger predictive value in dry regions than in wet regions. Furthermore, they showed that soil moisture can be used to reliably predict vegetation up to two months in advance and that it has a strong local spatial relation with vegetation.

The performance of our ensemble of small neural networks confirms that soil moisture has a strong local spatial relation with vegetation. This locality allows us to use ensembles of small neural networks instead of a single large neural network without sacrificing performance. These ensembles allow us to scale up to a much higher spatial resolution, which is difficult to do with a single neural network without sacrificing performance. In fact, due to the local properties of soil moisture, these ensembles perform better. Furthermore, they allow for faster training, as they are easily parallelizable.

Finally, the addition of an uncertainty measure to our predictions was analyzed. To produce uncertainty levels with our predictions, the variational Bayesian neural network model introduced by Louizos and Welling [2016] was used. Analysis showed that its prediction performance was slightly worse than that of the non-Bayesian neural networks, but still reasonably good. The Bayesian neural network produces good general uncertainty levels that mimic the levels of error, but has limited success in pinpointing exactly where this uncertainty should be.

Within this research the following follow-up steps are recommended:

1. Apply semi-supervised machine learning techniques. There was a lot of incom-

plete data, which was aggregated into a more complete but smaller dataset. It’s

possible that this data is of more use for semi-supervised learning models.


2. Include the constants and history size in the learning strategy.

3. Weight the training examples. The climate is a very dynamic system, and currently training examples from 1992 are as important as examples from 2010, even though the examples from 2010 are more closely related to the climate that exists today.

4. Apply different Bayesian models. Although the test accuracy of our Bayesian neural network was only slightly worse than that of the non-Bayesian neural network, the uncertainty maps were not accurate. More research into why this is the case, and into whether other Bayesian models work better, would be helpful.


Bibliography

Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight

uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.

L Bounoua, GJ Collatz, SO Los, PJ Sellers, DA Dazlich, CJ Tucker, and DA Randall. Sensitivity of climate to changes in NDVI. Journal of Climate, 13(13):2277–2292, 2000.

T Chen, RAM De Jeu, YY Liu, GR Van der Werf, and AJ Dolman. Using satellite based soil moisture to quantify the water driven variability in NDVI: A case study over mainland Australia. Remote Sensing of Environment, 140:330–338, 2014.

Galina Churkina and Steven W Running. Contrasting climatic controls on the esti-

mated productivity of global terrestrial biomes. Ecosystems, 1(2):206–215, 1998.

Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.

RAM De Jeu, W Wagner, TRH Holmes, AJ Dolman, NC Van De Giesen, and

J Friesen. Global soil moisture patterns observed by space borne microwave ra-

diometers and scatterometers. Surveys in Geophysics, 29(4-5):399–420, 2008.

Wouter A Dorigo, Klaus Scipal, Robert M Parinussa, YY Liu, Wolfgang Wagner,

Richard AM De Jeu, and Vahid Naeimi. Error characterisation of global active and

passive microwave soil moisture datasets. Hydrology and Earth System Sciences,

14(12):2605–2616, 2010.

Alex Graves. Practical variational inference for neural networks. In Advances in

Neural Information Processing Systems, pages 2348–2356, 2011.

Arjun K Gupta and Daya K Nagar. Matrix variate distributions, volume 104. CRC

Press, 1999.


Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

Martin Heimann and Markus Reichstein. Terrestrial ecosystem carbon dynamics and

climate feedbacks. Nature, 451(7176):289–292, 2008.

Jose Miguel Hernandez-Lobato and Ryan P Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. arXiv preprint arXiv:1502.05336, 2015.

Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed,

Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N

Sainath, et al. Deep neural networks for acoustic modeling in speech recogni-

tion: The shared views of four research groups. IEEE Signal Processing Magazine,

29(6):82–97, 2012.

Martin Hirschi, Sonia I Seneviratne, Vesselin Alexandrov, Fredrik Boberg, Constanta Boroneant, Ole B Christensen, Herbert Formayer, Boris Orlowsky, and Petr Stepanek. Observational evidence for soil-moisture impact on hot extremes in southeastern Europe. Nature Geoscience, 4(1):17–21, 2011.

P Illera, A Fernandez, and JA Delgado. Temporal evolution of the NDVI as an indicator of forest fire danger. International Journal of Remote Sensing, 17(6):1093–1105, 1996.

Matayo Indeje, M Neil Ward, Laban J Ogallo, Glyn Davies, Maxx Dilley, and Assaf Anyamba. Predictability of the normalized difference vegetation index in Kenya and potential applications as an indicator of Rift Valley fever outbreaks in the Greater Horn of Africa. Journal of Climate, 19(9):1673–1687, 2006.

Rengui Jiang, Jiancang Xie, Hailong He, Chun-Chao Kuo, Jiwei Zhu, and Mingxiang Yang. Spatiotemporal variability and predictability of normalized difference vegetation index (NDVI) in Alberta, Canada. International Journal of Biometeorology, pages 1–15, 2016.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv

preprint arXiv:1412.6980, 2014.

Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the

local reparameterization trick. arXiv preprint arXiv:1506.02557, 2015.


Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

Solomon Kullback and Richard A Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):

436–444, 2015.

Yi Y Liu, RM Parinussa, Wouter A Dorigo, Richard AM De Jeu, Wolfgang Wagner,

AIJM Van Dijk, Matthew F McCabe, and JP Evans. Developing an improved

soil moisture dataset by blending passive and active microwave satellite-based re-

trievals. Hydrology and Earth System Sciences, 15(2):425–436, 2011.

YY Liu, Wouter A Dorigo, RM Parinussa, Richard AM de Jeu, Wolfgang Wagner,

Matthew F McCabe, JP Evans, and AIJM Van Dijk. Trend-preserving blending of

passive and active microwave soil moisture retrievals. Remote Sensing of Environ-

ment, 123:280–297, 2012.

Alexander Lotsch, Mark A Friedl, Bruce T Anderson, and Compton J Tucker. Cou-

pled vegetation-precipitation variability observed from satellite and climate records.

Geophysical Research Letters, 30(14), 2003.

Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix Gaussian posteriors. In International Conference on Machine Learning (ICML), 2016.

Lina M Mercado, Nicolas Bellouin, Stephen Sitch, Olivier Boucher, Chris Hunting-

ford, Martin Wild, and Peter M Cox. Impact of changes in diffuse radiation on the

global land carbon sink. Nature, 458(7241):1014–1017, 2009.

Diego G Miralles, Wade T Crow, and Michael H Cosh. Estimating spatial sampling

errors in coarse-scale soil moisture estimates derived from point-scale observations.

Journal of Hydrometeorology, 11(6):1423–1429, 2010.

Ramakrishna R Nemani, Charles D Keeling, Hirofumi Hashimoto, William M Jolly, Stephen C Piper, Compton J Tucker, Ranga B Myneni, and Steven W Running. Climate-driven increases in global terrestrial net primary production from 1982 to 1999. Science, 300(5625):1560–1563, 2003.

Manfred Owe, Richard de Jeu, and Thomas Holmes. Multisensor historical clima-

tology of satellite-derived global land surface moisture. Journal of Geophysical

Research: Earth Surface, 113(F1), 2008.


Albert J Peters, Elizabeth A Walter-Shea, Lei Ji, Andres Vina, Michael Hayes, and Mark D Svoboda. Drought monitoring with NDVI-based standardized vegetation index. Photogrammetric Engineering and Remote Sensing, 68(1):71–75, 2002.

Jorge E Pinzon and Compton J Tucker. A non-stationary 1981–2012 AVHRR NDVI3g time series. Remote Sensing, 6(8):6929–6960, 2014.

Amilcare Porporato and Ignacio Rodriguez-Iturbe. Ecohydrology-a challenging multidisciplinary research perspective/ecohydrologie: une perspective stimulante de recherche multidisciplinaire. Hydrological Sciences Journal, 47(5):811–821, 2002.

JW Rouse Jr, RH Haas, JA Schell, and DW Deering. Monitoring the vernal advancement and retrogradation (green wave effect) of natural vegetation. Prog. Rep. RSC 1978-1, page 93p, 1973.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal

representations by error propagation. Technical report, DTIC Document, 1985.

Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan

Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting.

Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Nathan L Stephenson. Climatic control of vegetation distribution: the role of the

water balance. American Naturalist, pages 649–670, 1990.

RK Teal, B Tubana, K Girma, KW Freeman, DB Arnall, O Walsh, and WR Raun.

In-season prediction of corn grain yield potential using normalized difference vege-

tation index. Agronomy Journal, 98(6):1488–1494, 2006.

Michiel K van der Molen, Albertus J Dolman, Philippe Ciais, T Eglin, Nadine Gobron,

Beverly E Law, Patrick Meir, Wouter Peters, Oliver L Phillips, M Reichstein, et al.

Drought and ecosystem carbon cycling. Agricultural and Forest Meteorology, 151

(7):765–773, 2011.

W Wagner, Wouter Dorigo, Richard de Jeu, Diego Fernandez, Jerome Benveniste,

Eva Haas, and Martin Ertl. Fusion of active and passive microwave observations to

create an essential climate variable data record on soil moisture. In Proceedings of

the XXII International Society for Photogrammetry and Remote Sensing (ISPRS)

Congress, Melbourne, Australia, volume 25, 2012.
