
Neural Networks on Steroids

Adam Blevins

April 10, 2015


Preface

This report is an introduction to Neural Networks: what Neural Networks are, how they work, simple examples, their limitations and current solutions to these limitations.

Chapter 1 considers the motivation behind research into Neural Networks. The basic architecture of simple Neural Networks is investigated and the most common learning algorithm in practice, the Backpropagation algorithm, is discussed.

Chapter 2 applies the theory from Chapter 1 to an example Neural Network designed to map x ↦ x², followed by analysis and evaluation of the results produced.

Chapter 3 discusses the Universal Approximation Theorem, a theorem stating that a certain type of simple Neural Network has the capability to approximate any function under particular conditions. It goes on to describe how training such a network has its own difficulties, and we provide solutions for avoiding these problems.

Chapter 4 motivates the desire to use more complicated Neural Networks by increasing their size. Unsurprisingly this yields two fundamental training problems with the Backpropagation algorithm, called the Exploding Gradient problem and the Vanishing Gradient problem.

Chapter 5 considers a particular training technique called greedy unsupervised layerwise pre-training, which avoids the Exploding and Vanishing Gradient problems. This gives us a much more reliable and accurate Neural Network model.

Chapter 6 concludes the investigation of this report and suggests possible topics for future work.


Declaration

“This piece of work is a result of my own work except where it forms an assessment based on group project work. In the case of a group project, the work has been prepared in collaboration with other members of the group. Material from the work of others not involved in the project has been acknowledged and quotations and paraphrases suitably indicated.”


Contents

1 An Introduction to Neural Networks
  1.1 What is a Neural Network?
  1.2 Uses of Neural Networks
  1.3 The Perceptron and the Multi-Layer Perceptron (MLP)
    1.3.1 Perceptron
    1.3.2 Multi-Layer Perceptron (MLP)
  1.4 The Workings of an MLP
    1.4.1 The Backpropagation Training Algorithm
    1.4.2 Initial Setup of a Neural Network

2 A Single Layer MLP for Function Interpolation
  2.1 Aim of this Example
    2.1.1 Software and Programming Used
  2.2 Method
  2.3 Results and Discussion
    2.3.1 Great Result
    2.3.2 Incredulous Result
    2.3.3 Common Poor Result
    2.3.4 Other Interesting Results
    2.3.5 Run Errors
  2.4 Conclusions

3 Universal Approximation Theorem
  3.1 Theorem
  3.2 Sketched Proof
    3.2.1 Proof of Theorem 1
    3.2.2 Discussion
  3.3 Overfitting and Underfitting
    3.3.1 A Statistical View on Network Training: Bias-Variance Tradeoff
    3.3.2 Applying Overfitting and Underfitting to our Example
    3.3.3 How to Avoid Overfitting and Underfitting
  3.4 Conclusions

4 Multiple Hidden Layer MLPs
  4.1 Motivation for Multiple Hidden Layers
  4.2 Example Problems for Multiple Hidden Layer MLPs
    4.2.1 Image Recognition
    4.2.2 Facial Recognition
  4.3 Vanishing and Exploding Gradient Problem
    4.3.1 Further Results
    4.3.2 Conclusions

5 Autoencoders and Pre-training
  5.1 Motivation and Aim
  5.2 The Autoencoder
    5.2.1 What is an Autoencoder?
    5.2.2 Training an Autoencoder
    5.2.3 Dimensionality Reduction and Feature Detection
  5.3 Autoencoders vs Denoising Autoencoders
    5.3.1 Stochastic Gradient Descent (SGD)
    5.3.2 The Denoising Autoencoder
  5.4 Stacked Denoising Autoencoders for Pre-training
  5.5 Summary of the Pre-training Algorithm
  5.6 Empirical Evidence to Support Pre-training with Stacked Denoising Autoencoders
  5.7 Conclusions

6 Conclusion
  6.1 Conclusion
  6.2 Future Work
  6.3 Acknowledgements

A Single Layer MLP Python Code

B Python Output Figures from Section 2.3.4


Chapter 1

An Introduction to Neural Networks

1.1 What is a Neural Network?

Neural Networks (strictly speaking, 'artificial' Neural Networks, henceforth referred to as ANNs) are so called because they resemble the mammalian cerebral cortex of the brain [1]. The nodes (sometimes called neurons, units or processing elements) of an ANN represent the neurons of the brain. The weighted interconnections between nodes symbolise the communicative electrical pulses.

The following image represents a basic, directed ANN:

[Figure 1.1: An example of a simple directed ANN, showing an Input layer, a Hidden layer and an Output layer; the arrows between nodes are weighted connections.]

As shown above, ANNs consist of an input layer, some hidden layers and an output layer. Depending on the use of the network, one can have as many nodes in each layer and as many hidden layers as desired (although we will see later the disadvantages of having too many hidden layers or nodes).


One use of such a network is image recognition. Say you have an image and you would like to classify it: for example, is there an orange in this picture or a banana? An image is made up of a number of pixels, each with an associated RGB (red-green-blue) colour code. We can define our network to have the same number of input nodes as pixels in the image, allowing one and only one pixel to enter each input node. The hidden nodes process the information received via the weighted connections, potentially picking out which colour is most prominent in the entire image, and then hopefully the output would give you the correct label - banana or orange. This idea could be extrapolated to handwriting recognition, such as the technology available in the Samsung Galaxy Note, which allows a user to write with a stylus and recognises these inputs as letters to produce computerised documents.

This is where machine learning comes in. ANNs are built to adapt and learn from the information they are given. If we extend the orange-banana example, we could teach our network to tell the difference between a banana and an orange by giving it an arbitrary number of images with a banana or orange in them, and telling the network what the target label is. This set of images would be called a training set, and this method of learning would be called supervised learning (i.e. we give the network an input and a target). In this example, the network will likely hone in on the colour difference to deliver its verdict. Once trained, all future images given to the network should receive a reliable label.

1.2 Uses of Neural Networks

Neural Networks are everywhere in technology. Some additional examples, beyond the image recognition above, include:

1. The autocorrect on smartphones. Neural Networks learn to adapt to the training set given to them and therefore, if storable on a smartphone, have the ability to adapt a dictionary to a user, like autocorrect.

2. Character recognition. This is extremely popular with the idea of handwriting with a stylus on tablets and phones these days, as mentioned before.

3. Speech recognition. This has become more powerful in recent years. Bing has utilised Neural Networks to double the speed of their voice recognition on their Windows phones [2].

4. A quirky use includes, and I quote, “a real-time system for the characterisation of sheep feeding phases from acoustic signals of jaw sounds” [3]. This was an actual research article in the Australian Journal of Intelligent Information Processing Systems (AJIIPS), Vol 5, No. 2 in 1998 by Anthony Zaknich and Sue K Baker. Eric Roberts' Sophomore Class of 2000 reported online that radio microphones attached to the head of the sheep allow chewing sounds to be transmitted, and comparing this with the time of day allows an ANN to predict future eating times [4]. If anything this demonstrates the versatility of ANNs.

5. Generally, a Neural Network can take a large number of variables which appear to have no conceivable pattern and find associations or regularities. This could be something extremely unusual, like football results coinciding with a person watching (or not watching) the game.


1.3 The Perceptron and the Multi-Layer Perceptron (MLP)

1.3.1 Perceptron

The perceptron is the simplest ANN. It consists of only an input and an output layer. It was first conceived by Rosenblatt in 1957 [5]. It was named as such because its invention was to model perceptual activities, for example responses of the retina, which was the basis of Rosenblatt's research at the time. Quoted from a journal entry by Herve Abdi from 1994 [6]:

“The main goal was to associate binary configurations (i.e. patterns of [0, 1] values) presented as inputs on a (artificial) retina with specific binary outputs. Hence, essentially, the perceptron is made of two layers: the input layer (i.e., the “retina”) and the output layer.”

The perceptron can successfully fulfil this need and works as such:

[Figure 1.2: The perceptron with two input nodes and one output node. Inputs x_1 and x_2 feed the output node through weights w_1j and w_2j; the node computes in_j = Σ_i x_i w_ij and outputs o_j = a_j.]

where:

• in_j is the total input of the j-th node, which in this case is the sum of the inputs multiplied by their respective connection weights:

  in_j = Σ_i x_i w_ij    (1.3.1)

• x_i is the input of the i-th connection

• w_ij is the weight of the connection from node i to node j

• o_j is the output of the j-th node

• a_j is some activation function chosen by the programmer, e.g. tanh(in_j). Its role becomes more apparent in an MLP, where there are a greater number of layers.

Usually, with one output node as in Figure 1.2, o_j = in_j. Notice that there are two input nodes because the intended inputs were binary configurations of the form [0, 1].

If the data is linearly separable, the perceptron convergence theorem proves that a solution can always be found. This proof is omitted (but can be found via the link in the bibliography [7]). However, the output node is a linear combination of the input nodes and hence can only differentiate between linearly separable data. This is a severe limitation.
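To make the computation concrete, here is a minimal Python sketch of a perceptron output o_j = Σ_i x_i w_ij, thresholded to a binary response. The function names, the threshold value and the AND example are illustrative assumptions for this sketch, not code from this report.

    # A minimal perceptron sketch (illustrative only).
    def perceptron_output(x, w, threshold=0.0):
        """Weighted sum in_j = sum_i x_i * w_ij, thresholded to a binary output."""
        in_j = sum(xi * wi for xi, wi in zip(x, w))
        return 1 if in_j > threshold else 0

    # AND is linearly separable, so a single perceptron can represent it:
    w_and = [1.0, 1.0]
    for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
        print(x, perceptron_output(x, w_and, threshold=1.5))   # prints 0, 0, 0, 1

The XOR function, by contrast, admits no choice of weights that makes this work, as the argument below shows.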


We will use the logical XOR function as an example to show that non-linearly separable functions have no perceptron solution. The logical XOR function is defined as:

[0, 0] ↦ [0]
[0, 1] ↦ [1]
[1, 0] ↦ [1]
[1, 1] ↦ [0]    (1.3.2)

We can see it is not linearly separable by the following diagram:

[Figure 1.3: The four XOR inputs plotted in the plane, with (0,0) and (1,1) taking value [0] and (0,1) and (1,0) taking value [1]. There is no way to separate the grey dots and the black dots with a single straight line.]

The original proof showing perceptrons could not learn non-linearly separable data was by Marvin Minsky and Seymour Papert, published in a book called “Perceptrons” in 1969 [8]. However, the following proof showing the perceptron's inability to differentiate between non-linearly separable data is quoted from Herve Abdi's journal entry in the Journal of Biological Systems in 1994, page 256 [6]:

Take a perceptron of the form of Figure 1.2 and define our weights as w_1 and w_2. The inputs are of the form [x_1, x_2]. The association of the input [1, 0] ↦ [1] implies that:

w_1 > 0    (1.3.3)

The association of the input [0, 1] ↦ [1] implies that:

w_2 > 0    (1.3.4)

Adding together Equations 1.3.3 and 1.3.4 gives:

w_1 + w_2 > 0    (1.3.5)

Now if the perceptron gives the response 0 to the input pattern [1, 1], this implies that:

w_1 + w_2 ≤ 0    (1.3.6)

Clearly, the last two equations contradict each other, hence no set of weights can solve the XOR problem.

Due to the severe limitations of the perceptron, the multi-layer perceptron (MLP) was introduced.


1.3.2 Multi-Layer Perceptron (MLP)

The MLP is an extension of the perceptron. Hidden layers are now added between the input and output layers to give the ANN greater flexibility in its calculations. Each additional weighted connection is another parameter, which allows for more complex problems and calculations for which linear solutions just cannot help. Every aspect of an MLP is the same as in a perceptron, including the underlying node functions, but most importantly an MLP allows non-linearly separable data to be processed.

An MLP features a feedforward mechanism. A feedforward ANN is one which is directed such that each layer may only send data to following layers, not within layers. Additionally, the MLP is fully connected from one layer to the next. The following diagram is one of the simplest MLPs possible for the logical XOR function, taken from a Rumelhart et al. report from 1985 [9]:

[Figure 1.4: An example of an MLP solution to the logical XOR function. The two inputs x_1 and x_2 each connect with weight +1 to a hidden unit (containing the number 1.5) and to the output unit (containing the number 0.5); the hidden unit connects to the output unit with weight −2. The hidden unit computes f(in_j) and the output unit computes g(in_k). Explanation in the following text.]

All of these nodes work exactly like Figure 1.2. Each node takes the sum of its inputs multiplied by the weights on the connections. However, you will notice that two of the nodes contain numbers. This is an indication of the output function on these specific nodes, o_j = f(in_j) and o_k = g(in_k). Sometimes called the threshold function, the hidden unit's output is basically saying the following:

f(in_j) = 1 if in_j > 1.5, and 0 if in_j < 1.5    (1.3.7)

Similarly, the threshold function for the output unit is:

g(in_k) = 1 if in_k > 0.5, and 0 if in_k < 0.5    (1.3.8)

The threshold function is effectively deciding whether the node should be activated, and thus the node's output function is referred to as an activation function. The superiority of the MLP over the simple perceptron is immediately clear. Having added just one extra unit to the architecture, we obtain a successful ANN which solves the logical XOR function, something the perceptron cannot do.
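The weights and thresholds described for Figure 1.4 can be checked directly. The following sketch (the function names are my own; the weights are the ones quoted above) evaluates the network on all four XOR patterns.

    # Checking the Figure 1.4 XOR network by hand: all connections have weight +1
    # except the hidden-to-output connection, which has weight -2.
    def step(value, threshold):
        """Threshold activation: 1 if the total input exceeds the threshold, else 0."""
        return 1 if value > threshold else 0

    def xor_mlp(x1, x2):
        # Hidden unit: receives x1 and x2 with weight +1 each, threshold 1.5.
        hidden = step(x1 + x2, 1.5)              # fires only for the input [1, 1]
        # Output unit: receives x1, x2 (weight +1) and the hidden unit (weight -2), threshold 0.5.
        return step(x1 + x2 - 2 * hidden, 0.5)

    for pattern in ([0, 0], [0, 1], [1, 0], [1, 1]):
        print(pattern, '->', xor_mlp(*pattern))   # prints 0, 1, 1, 0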

Due to the greater flexibility of the MLP over the perceptron, it is difficult to see immediately the best values for the weights, and sometimes even how many hidden nodes are ideal. Typically, an ANN has to learn these values, and there are a number of ways to train MLPs to find them (such as dropout, unsupervised learning, autoencoders etc.). Before considering training methods, let's first understand the inner calculations of an MLP.

1.4 The Workings of an MLP

I have described how the nodes calculate the total input in_j and explained, with notation, the idea of a threshold function as an activation function. Although a threshold function can be extremely useful in certain situations, the activation functions commonly used are one of the two following sigmoids:

a_j(in_j) = 1 / (1 + e^(−in_j))    (logistic function)    (1.4.1)

and

a_j(in_j) = tanh(in_j)    (hyperbolic tangent)    (1.4.2)

These functions are usually applied to the hidden layer nodes but can be applied to the output nodes as desired. It should be noted that the logistic function is bounded between 0 and 1, and the hyperbolic tangent is bounded between -1 and 1. This is especially important when initialising the weights of a network, because if you expect to get a large output you may need large weights to compensate for the activation function. The idea of using such functions allows for non-linear solutions to be produced, which in turn allows a more competent and functional ANN.
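For later reference, both sigmoids and the derivatives that the Backpropagation algorithm will require can be written down in a few lines. This is a small illustrative sketch using the standard derivative formulas, not code from Appendix A.

    import math

    def logistic(x):
        """Logistic sigmoid, bounded between 0 and 1 (Equation 1.4.1)."""
        return 1.0 / (1.0 + math.exp(-x))

    def logistic_prime(x):
        """Derivative of the logistic function: sigma(x) * (1 - sigma(x))."""
        s = logistic(x)
        return s * (1.0 - s)

    def tanh_prime(x):
        """Derivative of tanh; bounded, as Backpropagation requires."""
        return 1.0 - math.tanh(x) ** 2

    print(logistic(0.0), math.tanh(0.0))   # 0.5 and 0.0 -- note the different ranges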

Now that our networks can use these activation functions that allow non-linear solutions, it is helpful to add nodes which allow for a linear change in the solution, i.e. bias nodes. The action of such a node is to add a constant input from one layer to the following layer. It is connected like all nodes, via a weighted connection, to allow the network to correct its influence as required. The bias can take any value, but a value of 1 is common. The notation we shall use to represent a bias node is b, used much like the threshold function notation, i.e. the b shall be written within the node on a diagram.

There are a number of ways to train an ANN, and we will now investigate the most commonly used training algorithm, Backpropagation, which is short for "backpropagation of errors". ANNs which employ this algorithm are usually referred to as Backpropagation Neural Networks (BPNNs).

1.4.1 The Backpropagation Training Algorithm

The Backpropagation algorithm was first applied to Neural Networks in a 1974 PhD thesis by Paul J. Werbos [10], but its importance was not fully understood until David Rumelhart, Geoffrey Hinton and Ronald Williams published a 1986 paper called “Learning representations by back-propagating errors” [11]. It was this paper that detailed the usefulness of the algorithm and its functionality, and it was these men that sparked a comeback within the neural network community for ANNs, inspiring successful deep learning (i.e. successful learning of ANNs with hidden layers). They were able to show that the backpropagation algorithm could train a network so much faster than earlier developments of learning algorithms that previously unsolvable problems were now solvable. The large increase in efficiency meant a massive set of training data was not essential, allowing for a more attainable training set for such problems.

The remainder of this subsection is a detailed description of the Backpropagation training algorithm. It is heavily based upon the 1986 Rumelhart, Hinton and Williams article mentioned above [11]; however, notation has been altered for consistency and explanations have been expanded. It should also be noted that one is not expected to understand the following equations immediately; naturally it takes significant experience and examples to fully understand their meaning.

Our aim is to find appropriate weights such that the input vector delivered to the ANN results in an output vector sufficiently close to our target output vector for the entirety of our training data. The ANN will then have the ability to “fill in the gaps” between our training set to provide a smooth, accurate function fitted to purpose. Defining the input layer as layer 0, let's define each individual weight with the notation w^(l)_ij, where l represents the layer the weighted connection is entering (i.e. a weight with index l = 1 means the weighted connection enters a node in layer 1), i represents a node in layer l − 1 and j represents a node in layer l. Thus the total input for each node, defined as the function in Equation 1.3.1, can be rewritten as:

in^(l)_j = Σ_i a^(l−1)_i w^(l)_ij    (1.4.3)

where a^(l−1)_i represents the activation function output from node i in layer l − 1. This algorithm eventually differentiates the activation function to implement the fundamental concept behind the Backpropagation algorithm, gradient descent. Any activation function can be used, as long as it has a bounded derivative. Also note that the resulting value of the activation function for a node is the respective output of that node.

Finally, calling our ultimate layer l = L, we shall define the output vector of the network as a^(L)(x_i), where x_i is the corresponding input vector and this a still refers to the activation function. This yields our error function of the network for a single training sample:

E_i = (1/2) ‖t_i − a^(L)(x_i)‖²    (1.4.4)

where i represents an input-target case from our training data and t_i is our respective target vector for x_i.

The idea of gradient descent is to take advantage of the chain rule by finding the partial derivative of the error, E_i, with respect to each weight, w^(l)_ij. This will then help minimise E. We will then update the weights as follows:

Δw^(l)_ij = −η ∂E_i/∂w^(l)_ij    (1.4.5)

for some suitable value of η (referred to as the learning rate). The learning rate is there to control the magnitude of the weight change. The negative sign is by convention, to indicate that the direction of change should be towards a minimum and not a maximum. Ideally, we want to minimise the error via the weight changes to get an ideal solution. To see this equation visually, let's imagine we have a special network in which the Error function depends on only one weight. Then the Error function could look something like this:

Figure 1.5: An example Error function which depends on one weight

To get the smallest possible E we need to find the global minimum. It is possible for there to be many local minima that you want to avoid. The idea behind the learning rate is to find a balance between converging on the global minimum and “jumping” out of or over the local minima. Too small a learning rate and you might remain stuck, as illustrated by the orange dot; too big and you may overshoot the global minimum entirely, as shown by the blue dot.

Unfortunately it is difficult to see the ideal learning rate from the outset, and it is commonplace to use trial and error to find the optimal η. A regular starting point would be a value between 0.25 and 0.75, but it is not unusual to be as small as 0.001 if you have a simple function to approximate.
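The situation of Figure 1.5 can be imitated with a one-variable toy example. The error function below is made up purely for this sketch (it is not the network's Error function); it simply has one local and one global minimum, so the update Δw = −η ∂E/∂w either stalls, settles in the nearby trough, or reaches the global minimum depending on η and on where it starts.

    # Gradient descent on a made-up one-weight "error" with a local and a global minimum.
    def error(w):
        return (w * w - 1.0) ** 2 + 0.2 * w + 0.3    # local minimum near w = +0.97, global near w = -1.02

    def gradient(w):
        return 4.0 * w * (w * w - 1.0) + 0.2         # dE/dw

    def descend(w, eta, epochs=1000):
        for _ in range(epochs):
            w -= eta * gradient(w)                   # Delta w = -eta * dE/dw (Equation 1.4.5)
        return w

    print(descend(2.0, eta=0.0005, epochs=50))   # eta too small for the budget: still well above w = 1, not at any minimum
    print(descend(0.5, eta=0.05))                # settles in the nearby local minimum (w close to +0.97)
    print(descend(-0.5, eta=0.05))               # a different start reaches the global minimum (w close to -1.02)
    # error() at the local minimum is visibly larger than at the global one;
    # too large an eta would instead overshoot and oscillate rather than settle.

The dependence on the starting point already hints at how important the initial (random) weights will turn out to be in the next chapter.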

Now to find the actual values of Equation 1.4.5. First of all we need to calculate ∂E/∂w^(l)_ij for our output nodes. For each output node we have the error:

∂E/∂a^(L)_j = −(t_j − a^(L)_j)    (1.4.6)

where j corresponds to the j-th output node, recalling that a^(L)_j is the output of the j-th output node. This is simple for the output layer of the ANN: it is just the output of the node minus the target of the node. Calculating for previous layers becomes more difficult.


The next layer to consider is the penultimate layer of the ANN. Recall the total input for a node:

in^(l)_j = Σ_i a^(l−1)_i w^(l)_ij    (1.4.7)

where a^(l−1)_i is the output of the i-th node in layer l − 1. Using this equation we calculate the following for our penultimate layer:

∂E_i/∂w^(L−1)_ij = (∂E/∂in^(L−1)_j) (∂in^(L−1)_j/∂w^(L−1)_ij) = (∂E/∂in^(L−1)_j) a^(L−2)_i    (1.4.8)

We must now figure out the value of ∂E/∂in^(L−1)_j, which is as follows:

∂E/∂in^(L−1)_j = (∂E/∂a^(L−1)_j) (∂a^(L−1)_j/∂in^(L−1)_j) = (∂E/∂a^(L−1)_j) a_j′(in^(L−1)_j)    (1.4.9)

where a_j′(in^(L−1)_j) is the derivative of the chosen activation function. Recalling that the activation function depends only on the respective node's input (for example, the hyperbolic tangent activation function from Equation 1.4.2 was just a_j = tanh(in_j)), this is easily calculated. The layer of a_j′ is the same as the layer associated with in_j (i.e. the layer the input is entering), but we drop this index from a_j′ to avoid a bigger mess of indices.

From here until the end of the subsection, credit must also be given to R. Rojas [12] and A. Venkataraman [13] in addition to Rumelhart et al. [11]. Now we just need to calculate ∂E/∂a^(L−1)_j in the above equation. Taking E as a function of the inputs to all nodes k ∈ K = {1, 2, ..., n} receiving input from node j:

∂E/∂a^(L−1)_j = Σ_{k∈K} [ (∂E/∂in^(L)_k) (∂in^(L)_k/∂a^(L−1)_j) ]
             = Σ_{k∈K} [ (∂E/∂a^(L)_k) (∂a^(L)_k/∂in^(L)_k) (∂in^(L)_k/∂a^(L−1)_j) ]
             = Σ_{k∈K} [ (∂E/∂a^(L)_k) a_k′(in^(L)_k) w^(L)_jk ]    (1.4.10)

This same formula can be used for weights connecting to layers before the penultimate layer. Thus we can now find how to change any weight in the whole network. We can therefore conclude the following:

∂E/∂w^(l)_ij = (∂E/∂a^(l)_j · ∂a^(l)_j/∂in^(l)_j) a^(l−1)_i = (∂E/∂a^(l)_j) a_j′(in^(l)_j) a^(l−1)_i

with:

∂E/∂a^(l)_j = (a^(L)_j − t_j)   if j is a node in the output layer,
∂E/∂a^(l)_j = Σ_{k∈K} [ (∂E/∂a^(l+1)_k) a_k′(in^(l+1)_k) w^(l+1)_jk ]   if j is a node in any other layer.    (1.4.11)


The appropriate terms are substituted into Equation 1.4.5, which allows for full training of the system. The time it takes the network to run through all input-target cases once is defined as an epoch. The most common way to update the weights is after each epoch. After a pre-defined number of epochs the network will stop training. This could be any number, but 1000 is a good stopping point: it prevents the system taking too long to train while still allowing plenty of time for convergence to the ideal solution.
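To see how Equations 1.4.5 and 1.4.11 fit together in practice, here is a compact sketch of the algorithm for one hidden layer of logistic nodes and a single linear output node, with the weights updated once per epoch as described above. It is an illustrative reimplementation under those assumptions, not the code from Appendix A, and the variable names are my own.

    import math, random

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    def train(data, n_hidden=2, eta=0.1, epochs=1000):
        """Backpropagation for a 1-input, n_hidden-logistic, 1-linear-output MLP with bias nodes."""
        random.seed()                                   # unseeded start: each run gives different weights
        w_h = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(2)]  # rows: input, bias
        w_o = [random.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]                  # hidden..., bias
        for _ in range(epochs):
            grad_h = [[0.0] * n_hidden for _ in range(2)]
            grad_o = [0.0] * (n_hidden + 1)
            for x, t in data:                           # accumulate dE/dw over the whole epoch
                ins = [x * w_h[0][j] + w_h[1][j] for j in range(n_hidden)]
                act = [logistic(v) for v in ins]        # hidden outputs a_j
                out = sum(a * w for a, w in zip(act, w_o[:-1])) + w_o[-1]
                d_out = out - t                         # dE/da for the linear output node (Eq. 1.4.6)
                for j in range(n_hidden):
                    # dE/da_j = d_out * w_jk, times a_j'(in_j) for the logistic node (Eq. 1.4.11)
                    d_hid = d_out * w_o[j] * act[j] * (1.0 - act[j])
                    grad_h[0][j] += d_hid * x
                    grad_h[1][j] += d_hid               # the bias input is 1
                    grad_o[j] += d_out * act[j]
                grad_o[-1] += d_out                     # output-layer bias weight
            for j in range(n_hidden):                   # Delta w = -eta * dE/dw (Eq. 1.4.5), once per epoch
                w_h[0][j] -= eta * grad_h[0][j]
                w_h[1][j] -= eta * grad_h[1][j]
                w_o[j] -= eta * grad_o[j]
            w_o[-1] -= eta * grad_o[-1]
        return w_h, w_o

    # Example usage on a tiny set of (x, t) pairs; a modest eta keeps this small batch stable.
    weights = train([(0.0, 0.0), (0.5, 0.25), (1.0, 1.0)], eta=0.1, epochs=1000)

As the next chapter shows, whether a run like this converges well depends heavily on the randomised starting weights and on η.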

1.4.2 Initial Setup of a Neural Network

An ANN will be programmed into a computer and can be written in any language. This is fortunate because we don't have to worry about calculating the gruelling equations in the previous subsection by hand. In the following chapter we will see an example of an MLP with its own solutions and the problems it faces, but there are a few things we should consider first. From the outset it is not always clear exactly how to initialise the network: you will find it difficult to guess the ideal weights, it may be difficult to estimate the number of epochs to run through before ceasing training, etc. Here are a few ideas to help get your head around it.

• It is ideal to set up your network with weights that are randomised between certain values. The random values should not come from a fixed seed, because letting each run give a different set of results helps in finding the best possible setup for your ANN. You can only estimate a sensible weight range based upon previous examples you may have seen and the expected outputs of your network.

• With regards to the epochs, the right number will depend entirely upon the size of your training set and the power of your computer. The more epochs the better for converging networks, but not so many that you have to wait too long for a solution; time constraints as well as data constraints are what caused ANNs to fall out of popularity in the '70s.

• The learning rate is a difficult one to guess, but it is best to start small. This is because you are almost guaranteed to converge to some minimum given enough epochs. Learning rates that are too large have the opportunity to find the global minimum, but they can also jump back out of it.

• Bias nodes are recommended as one per layer (except the output layer). You won't need more than one, as the network can adjust the influence of the bias through its weight. This allows a lot more freedom for error correction.

• With regards to the training data, you will want to leave maybe 10-20% of it aside to test the network with once it is trained. This helps to establish how accurate the network is. The training data should also be normalised to lie within the interval [-1, 1]. This helps stabilise the ANN with regards to the activation functions used and allows for smaller weights, and smaller weights mean greater accuracy. Thinking back to how the weights are adjusted, a bigger weight has a greater likelihood of being adjusted by a significant percentage, and with a lot of weights to adjust this can impact convergence and training time. (A minimal setup sketch follows this list.)
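The following is a small, hedged sketch of the kind of setup these points describe: random, unseeded weight initialisation within a bounded interval, normalisation of the data to [-1, 1], and a train/test split. The interval, the 20% split and the helper names are illustrative assumptions, not prescriptions from this report.

    import random

    def initialise_weights(n_weights, low=-0.5, high=0.5):
        """Random starting weights in a bounded interval; no fixed seed, so each run differs."""
        return [random.uniform(low, high) for _ in range(n_weights)]

    def normalise(values):
        """Rescale data linearly into [-1, 1] (assumes the values are not all equal)."""
        lo, hi = min(values), max(values)
        return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

    def train_test_split(data, test_fraction=0.2):
        """Hold back a portion of the data (here 20%) to test the trained network."""
        shuffled = data[:]
        random.shuffle(shuffled)
        cut = int(len(shuffled) * (1.0 - test_fraction))
        return shuffled[:cut], shuffled[cut:]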

The next chapter focusses on an example network written in Python and the problems faced in finding the convergence on the global minimum that we need for an accurate network using the MLP architecture.


Chapter 2

A Single Layer MLP for Function Interpolation

2.1 Aim of this Example

The aim is to teach a single layer feedforward MLP to accurately map x ↦ x² ≡ f(x), where x ∈ [0, 1] ⊂ ℝ, using the Backpropagation algorithm for training.

2.1.1 Software and Programming Used

The code has been adapted in Enthought Canopy [14] and written in the language Python, using the version distributed by the Enthought Python Distribution [15]. The code, heavily based on a Backpropagation network originally written by Neil Schemenauer [16], is attached in Appendix A, fully annotated.

2.2 Method

1. The ANN is to learn via the Backpropagation algorithm as discussed in Section 1.4.1 and therefore needs training data. The training data used is the following:

   x ∈ {0, 0.1, 0.2, ..., 1} and their respective targets f(x) = x²    (2.2.1)

   For simplicity, the entire training set is used to train the network and then used once again to test the network. Python outputs these test results once the network completes its training, in the format "([0], '->', [0.02046748392039354])" for each input datum. A graph is then generated showing the network output at 100 equally spaced points for x ∈ (0, 1), compared with the function f(x) = x².

2. The sigmoid function for this network is the logistic function from Equation 1.4.1.

3. The learning rate is set to η = 0.5.

4. The number of epochs is set to 10000. The code tells Python to print the current error of the network every 1000th epoch, to 9 decimal places.

5. The weights are initialised using the random module [17], seeded via its seed() function; called with no argument, this seeds the pseudo-random number generator from the system time (or other system state) at the moment the network is run, so every run starts differently. The weights from the input nodes to the hidden nodes are specified to be randomly distributed in the interval (−0.5, 0.5), and similarly the weights connecting hidden and output nodes are randomly distributed in the interval (−5, 5). The latter weights have a greater randomisation range because the logistic function has an upper bound of 1; thus for larger outputs we need larger weights, and from experimentation these values can provide very accurate results. (A minimal sketch of this setup follows the list.)
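The following sketch reproduces the data and weight initialisation just described. It is an illustration under the stated assumptions (two hidden nodes, intervals (−0.5, 0.5) and (−5, 5)); the variable names are my own, not those of the Appendix A code.

    import random

    random.seed()                                                       # seeded from system state, so each run differs

    training_data = [(x / 10.0, (x / 10.0) ** 2) for x in range(11)]    # x in {0, 0.1, ..., 1}, target x^2

    n_hidden = 2
    input_to_hidden = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]   # small range for stability
    hidden_to_output = [random.uniform(-5.0, 5.0) for _ in range(n_hidden)]  # wider range: logistic output is at most 1
    # (bias-node weights, not shown, are assumed to be initialised in the same way)

    eta = 0.5          # learning rate used in this example
    epochs = 10000     # the error is printed every 1000th epoch in the actual code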

The network structure of this ANN is based upon an example from Kasper Peeters' unpublished book, Machine Learning and Computer Vision [18], and takes the following architecture:

[Architecture diagram: a single input node x plus a bias node b = 1 in the input layer (l = 0); two hidden nodes with activation σ plus a bias node b = 1 in the hidden layer (l = 1); and a single output node producing f(x) in the output layer (l = L = 2).]

where b = 1 indicates that the node outputs a bias of value 1, σ indicates that the node's activation function is the logistic function, and l references the layer of the network, with l = L corresponding to the final layer. It should be noted that many different structures could be used, for example the inclusion of a greater number of hidden nodes, but this simple structure succeeds.

2.3 Results and Discussion

The following are examples of results obtained from running the Neural Network program.

2.3.1 Great Result

Figure 2.1: Example of a great learning result

Figure 2.1 shows a graph that plots x against f(x) = x². The magenta line represents x² and the black dotted line represents the network's prediction for the 100 equidistant points between 0 and 1 after training. Generally, a result like this is generated from a final error of less than 2 × 10⁻⁴. This is an especially successful case, which can be seen in Figure 2.2. The figure shows the network results for the test data, and delightfully each gives the correct result when rounded to two decimal places.


Figure 2.2: The “Great Result’s” respective Python output

Turning to the respective Python output in Figure 2.2, it is interesting to note the size of the weights. They are all relatively small (less than 5). However, the input to hidden weights have significantly increased (recalling that they were initialised randomly in the interval (−0.5, 0.5)) and the hidden to output weights have relatively decreased (initialised between (−5, 5)). Fascinatingly, if the initialisation of the weights were switched so that the input to hidden weights were also randomly in the interval (−5, 5), the network becomes significantly more unreliable, to the extent that in 100 attempts no run had error less than 2 × 10⁻⁴. But why? Presumably this is because the increased size of the weights increases the inputs to the hidden nodes significantly. This means the logistic function's output is generally larger, which leads to a larger network output, further away from our target data. This would cause a larger error and therefore a greater magnitude of change in the weights of the network, which can affect the network's ability to converge on the global minimum. This indicates that keeping the weights initialised smaller allows for a more stable training algorithm. Another notable detail is the first error output: it is relatively small at just over 3, and this plays a part in the comparison with the other results.

2.3.2 Incredulous Result

Figure 2.3: Example of an incredulous result

This solution is clearly not ideal, but it does provide some very interesting insight into the problems faced when teaching an ANN. Figure 2.3 only appears to be accurate for 2 of the 100 points tested on the trained network. However, positively, this type of result only occurred once in 100 runs. A result like this can only be generated from a large final error, with the training beginning with initial convergence but ultimately diverging. In this case, it causes the network to train to the shape of the logistic function, the sigmoid function chosen for this network. The figure also shows that the bulk of the 100 points tested give an output between 0 and 0.5. Referring to Figure 2.4, the results for the test data show that for 0.7 and above the network overestimates the intended output significantly, and the test data for 0.6 and under are generally underestimated. This produces the significant bias towards a smaller output from the network. In comparison to the aforementioned “Great Result”, the initial output error for this run through was significantly higher, an increase of over 900%!

Figure 2.4: The “Incredulous Result's” respective Python output

But why such a different result? Both networks are initialised with random weights, which means both begin with a different Error function. This means each network is attempting to find a different global minimum using the same learning rate. Although the learning rate η was ideal for the “Great Result”, it was clearly not ideal for this initial set of weights, for which the learning rate was unable to escape a local minimum in the Error function. Alternatively, the weight adjustment calculation as in Equation 1.4.5 gives a large partial derivative term. This causes a huge change in weights, which in turn causes the error to jump away from the global minimum we're aiming for. It is difficult to tell whether this meant the learning rate was too large or too small. However, if we compare these theories to the error output in Figure 2.4, we see the error converges initially but then starts to diverge. The initial weight change allowed us to reach close to a minimum, but the ultimate divergence suggests this was a local minimum and hence the learning rate was in fact too small to overcome the entrapment of this trough.

As discussed before, increasing the learning rate raises the likelihood of jumping away from the global minimum altogether. If the number of epochs were increased for this run through, eventually the local minimum could be overcome, but the convergence is too slow for this to be worthwhile. It is for this reason that in practice multiple run throughs of the same network are undertaken.

One should also notice the difference in weights between the two results. The “Great Result” had smaller weights, suggesting a stable learning curve. This “Incredulous Result” did not have a stable learning curve, due to the divergence, and this is clear in the significant weight increases. All the input to hidden weights in Figure 2.4 are greater than the respective weights in Figure 2.2. This demonstrates the idea that larger initialised weights do not lead to more accurate results and actually lead to a more unstable network. Furthermore, the hidden to output weights for the “Incredulous Result” are very small, which leads the bulk of the network outputs to lie between 0 and 0.5.

2.3.3 Common Poor Result

Figure 2.5: Example of a poor learning result

This next result is almost as common as the “Great Result”. It seems that, despite the implications of the “Incredulous Result”, this Neural Network setup is naturally able to predict outputs with greater accuracy and consistency nearer x = 1 than x = 0, and thus we regularly get a tail below f(x) = 0 when approaching x = 0. Generally for function interpolation, the network will attempt to find a linear result to correlate the training data to its target data, and therefore the network can regularly attempt to draw a best fit line, due to the initialised weights, as opposed to the quadratic curve we require. The logistic function gives us the ability to find non-linear solutions, but if the total inputs to the logistic nodes become too large then the output can be on the tails of the logistic function, which are effectively linear, as shown:

Figure 2.6: The logistic function

This once more demonstrates the desire to initialise smaller weights for a network, as well as normalising the training data.

The test data results from Figure 2.7 show that for inputs greater than 0.2 the network output has consistently overestimated, which corresponds to Figure 2.5. This can again be associated with the starting error. The initialised weights were accurate enough to allow the network to almost instantly fall into a trough near a minimum. Unfortunately, the lack of any real convergence to a smaller error suggests that this was a local minimum. The initial error was so small that Δw^(l)_ij could be negligible for some of the weights. This can cause the network to get stuck in a local minimum, unable to jump out, leaving no chance to find the global minimum.

Figure 2.7: The poor result's respective Python output

As was the case with the “Incredulous Result”, there are some weights which are very large relative to our “Great Result's” respective weights. Once more this can cause instability in the learning algorithm, which stagnates convergence. If one weight is much larger than the others (which is the case here) it has the controlling proportion of the input to the following node, almost making all other inputs obsolete. This gives a lot of bias to one input and therefore makes accurate training a lot more difficult for the smaller weights around it. Fortunately this is likely down to the initialised weights: one weight could have been randomised much higher relative to the others, which causes this downfall. This once more highlights the importance of the range in which the weights are randomised.


2.3.4 Other Interesting Results

(a) A network whose error converged for the first step, but diverged slowly from then on.

(b) A network whose error started very small and converged slowly.

Figure 2.8: Two more examples of an inaccurate interpolation by the network. Their respective Python outputs are placed in Appendix B.

Figure 2.8a shows accuracy for half the data, but after the error started increasing halfway through training, the network tries to correct itself almost through a jump. This occurs because the learning rate is now too small to escape the local minima in the restricted number of epochs, and thus the error continues to increase. The algorithm terminates before escape from the local minima and yields a generally large error for the system at 3.1 × 10⁻³. We can rectify this by increasing the learning rate slightly, but overall this comes down to the initialised weights.

Figure 2.8b describes a network in which the initialised weights gave a relatively small first error of roughly 0.14. The learning rate η is small and the error changes with respect to the weight changes are small. This means Δw^(l)_ij from Equation 1.4.5 will be very small, and thus convergence will be slow. This can be overcome by an increased learning rate or by increasing the number of epochs, but once again the most important factor is the weight initialisation.

2.3.5 Run Errors

Occasionally the network fails to train entirely and the algorithm terminates prematurely. These are called run errors, and ours occur because the math range available to Python for computing equations is bounded. Specifically, the largest float Python can represent is about 1.8 × 10³⁰⁸ (this can be found by typing "import sys" into the Python command line, followed by a second command "sys.float_info"), so the exponential in the logistic function overflows once e^(−in_j) exceeds this value. The following Python output shows such an error occurring; the final line states "math range error". The remaining jargon describes the error route through the Python code, starting with the initialisation command and ending at the logistic function definition:


Figure 2.9: An example run error Python output from the network

Mathematically, we can see this happening. In the lines numbered 85-87, which are boxed in white, the code has defined what we called in^(l)_j. For this network, we only have one layer using the sigmoid function, i.e. l = 1. Recall the following equations:

in_j = Σ_i x_i w_ij    (2.3.1)

and also

sig(sum) = a_j(in_j) = 1 / (1 + e^(−in_j)),  where sum = in_j    (2.3.2)

If in_j is a sufficiently large negative number then e^(−in_j) exceeds the float range, causing the premature termination of the network's learning algorithm. If we consider this in terms of limits:

lim_{in_j → −∞} sig(sum) = 0    (2.3.3)

Therefore as in_j → −∞ the output layer would only receive an input from the bias node, causing a giant error. As the bias node is only a constant, the network cannot converge on the targets for our input data because our target is quadratic. Similarly, this is the case for the upper bound:

lim_{in_j → +∞} sig(sum) = 1    (2.3.4)

22

Page 24: Neural Networks on Steroids

If in_j is sufficiently large then the logistic nodes essentially become their own bias nodes and the same problem occurs. It should be noted that the size of the input to the logistic nodes is based upon the initialised weights, which are randomly distributed between −0.5 and 0.5, while the training data only has inputs between 0 and 1. Therefore, to cause such an overflow, the adjustment of the weights in training must cause significant change, and as this adjustment, Δw_ij, depends on the change in error with respect to the weights, the network must either be diverging (and hence changing the weights more and more each epoch) or be stuck in a local minimum, effectively just adding a constant weight change epoch after epoch. The former can be seen in Figure 2.10a and the latter in Figure 2.10b.

(a) A run error caused by divergence away from the global minimum of the Error function.

(b) A run error caused by a network stuck in a relatively large-valued local minimum.

Figure 2.10: Two more examples of run errors

In summary, run errors can occur, but this just indicates a network that would have been extremely inaccurate. This comes down to the weight initialisation causing divergence from the global minimum, which further indicates the need for multiple run throughs to find the ideal solution.

2.4 Conclusions

In general, there are a great number of variables and factors that influence how efficient and accurate an ANN will be. The learning rate is vital, the number of epochs plays a part in accuracy, the initialisation of the weights is paramount and the size of the training data set impacts the success of the Backpropagation algorithm.

The most influential factor is certainly the range in which the weights are initialised. Depending on the set of starting weights, the system can either converge very quickly or diverge significantly, to the extent of a run error. This was a clear problem even in a small network. Now, if we imagine an even bigger network with more weights to randomise and train, what will happen? Will an increase in hidden nodes allow for a more accurate network, or will the training become substantially more difficult with the increased number of weights to be altered? We will investigate an answer to these questions in the following chapter.

It can be argued that the second most important factor regarding successful training is the learning rate. The “Incredulous Result” was stuck in a minimum of high error, but the learning rate was too small, with regards to Equation 1.4.5, to escape. Each network has different requirements, and predicting an ideal learning rate is extremely difficult. The learning rate chosen for this example was based on experimentation to find an η that results in a trained network similar to the “Great Result” for as high a proportion of run throughs as possible. One cannot consider this the most important factor because it is a lot harder to predict than the weight initialisation range: one can simply guess, based on the bounds of the sigmoid function, the weights needed to output the magnitude of the target data, whereas the learning rate only has an impact after such an initialisation, on completion of the first epoch, and therefore depends on this randomisation. Therefore the weights are the most important factor. This begs the question - are there more appropriate ways to initialise the weights than a random distribution within a bounded interval? There certainly are, and this concept will be revisited later.

One should note that the training data is of significant importance too. The reason for rating its importance lower than the learning rate and weight initialisation is that it is unlikely one would want to build a network with a very limited training set to begin with, on the basis that the result would be highly unreliable. If the training set is too small then naturally the Backpropagation algorithm will struggle to get a good picture of the function we are trying to teach it. This would indicate the need for a large number of epochs to ensure accuracy. Unfortunately this leads to another problem called Overfitting: the network is taught to accurately predict the training data, but can become inaccurate at all other data points. For example, take the network from above. Instead of outputting a curve close to x², it could output a curve similar to a sine curve with period 0.2, waving through all the training points but being a completely inaccurate estimation for all other points between 0 and 1. Extensive details of this phenomenon shall be discussed in Chapter 3.

Due to such a number of impacting factors, research is very active. To conquer the problems faced, we must first discover the limitations of a single layer feedforward MLP and then consider methods of countering them. We begin with the Universal Approximation Theorem.


Chapter 3

Universal Approximation Theorem

3.1 Theorem

The Universal Approximation Theorem (UAT) formally states:

Let σ(·) be a non-constant, bounded and monotonically increasing continuous function. Let I_n denote the n-dimensional unit hypercube [0, 1]^n, and define the space of continuous functions on the unit hypercube as C(I_n). Then for any f(x) ∈ C(I_n) with x ∈ I_n and some ε > 0, ∃ N ∈ ℤ such that:

F(x) = Σ_{i=1}^{N} c_i σ( Σ_{j=1}^{n} w_ij x_j + b_i )    (3.1.1)

is an approximate realisation of the function f(x), where c_i, b_i ∈ ℝ and w_i ∈ ℝ^n. Therefore:

|F(x) − f(x)| < ε  ∀ x ∈ I_n    (3.1.2)

Given that our logistic function is a non-constant, bounded and monotonically increasing continuous function, we can directly apply this to a single hidden layer MLP. If we now claim that we have an MLP with n input nodes and N hidden nodes, then F(x) represents the output of such a network, with f(x) our respective target vector given an input vector x. We can appropriately choose our hidden to output connection weights such that they equal c_i, and let b_i represent our bias node in the hidden layer. Finally, normalising our training data to be in the interval [0, 1], we have a fully defined single hidden layer MLP with regards to this theorem.

Therefore we can directly apply the UAT to any single hidden layer MLP that uses a sigmoid function in its hidden layer, and can conclude that any function f(x) ∈ C(I_n) can be approximated by such a network. This is extremely powerful. In Chapter 1 we began with a perceptron, which had just an input and output layer. This was unable to distinguish non-linearly separable data. By adding in this one hidden layer in a feedforward network we can now not only distinguish between non-linearly separable data but, under certain assumptions on our activation function, approximate any continuous function with a finite number of hidden nodes.

3.2 Sketched Proof

Cybenko in 1989 was able to detail a proof of the UAT in a paper named “Approximation by Superpositions of a Sigmoidal Function” [19]. The aim of his paper was to find the assumptions necessary for equations of the form of Equation 3.1.1 to be dense in C(I_n). In our theorem we explain that f(x) can be approximated ε-close in C(I_n), and hence F(x) gives the dense property. We will now investigate Cybenko's 1989 paper to prove the conditions for the theorem to hold.

Definition: First of all, let's define what it means for σ to be sigmoidal, as in Cybenko's paper [19]. σ is sigmoidal if:

σ(x) → 1 as x → +∞, and σ(x) → 0 as x → −∞    (3.2.1)

Notice this is exactly the case for the logistic function as shown in the previous chapter.

We can describe why Cybenko's result should hold via a logical argument, with reference to a post made by Matus Telgarsky [20]. Firstly we recall that a continuous function on a compact set is uniformly continuous. I_n is clearly compact, and as any target function f ∈ C(I_n) is continuous on this set, f is uniformly continuous and thus can be approximated by a piecewise constant function. In his post, Telgarsky describes how a piecewise constant function can then be represented by a Neural Network as follows [20]:

• An indicator function is defined as follows: given a set X and a subset Y ⊆ X, then for any x ∈ X,

  I_Y(x) = 1 if x ∈ Y, and 0 otherwise.    (3.2.2)

• For each constant region of the piecewise constant function we can form a node within a Neural Network that effectively acts as an indicator function, and multiply the node's output by a weighted connection equal to the constant required.

• We want to form this Neural Network such that it uses a sigmoidal function, as in the definition of F(x) in Equation 3.1.1. To form such an indicator function using sigmoidal nodes we can take advantage of the limits defined above: the weighted connections acting as input to such a node can be made large, positively or negatively, so that the output is arbitrarily close to 1 or 0 respectively.

• The final layer of the Neural Network needs just a single node whose output is the sum of these “indicators” multiplied by the appropriately chosen weights, approximating the piecewise constant function.

This is what we shall now attempt to show mathematically.
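Before doing so, the construction can be seen numerically: a pair of very steep logistic nodes behaves like the indicator of an interval, and a weighted sum of such pairs gives a piecewise constant approximation. The sketch below is one-dimensional, with made-up interval widths and only eight hidden sigmoids, purely for illustration; it is nonetheless exactly of the form F(x) in Equation 3.1.1.

    import math

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    def soft_indicator(x, a, b, steepness=200.0):
        """Approximately 1 for x in (a, b) and 0 outside, built from two steep sigmoids."""
        return logistic(steepness * (x - a)) - logistic(steepness * (x - b))

    def piecewise_approx(x):
        # Approximate f(x) = x^2 on [0, 1] by a constant value on each of four intervals.
        pieces = [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
        heights = [((a + b) / 2.0) ** 2 for a, b in pieces]      # one constant per region
        return sum(c * soft_indicator(x, a, b) for (a, b), c in zip(pieces, heights))

    for x in (0.1, 0.4, 0.6, 0.9):
        print(x, round(piecewise_approx(x), 3), round(x * x, 3))   # crude but recognisable approximation

With more, narrower pieces (i.e. more hidden nodes) the approximation error shrinks, which is precisely the content of the theorem.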

Defining M(I_n) as the space of finite Borel measures on I_n, we are in a position to explain what it means for σ to be discriminatory.

Definition: σ is discriminatory if, for a measure µ ∈ M(I_n), the condition that for all w ∈ ℝ^n and b ∈ ℝ:

∫_{I_n} σ( Σ_{j=1}^{n} w_j x_j + b ) dµ(x) = 0    (3.2.3)

implies that µ = 0. Notice this integrand takes the same form as in Equation 3.1.1. With this definition we are now in a position to consider the first Theorem of Cybenko's paper [19]:


Theorem 1: Let σ be a continuous discriminatory function. Then given any f ∈ C(I_n), ∃ F(x) of the following form:

F(x) = Σ_{i=1}^{N} c_i σ( Σ_{j=1}^{n} w_ij x_j + b_i )    (3.2.4)

such that for some ε > 0:

|F(x) − f(x)| < ε  ∀ x ∈ I_n    (3.2.5)

This theorem is extremely close to the UAT, but it does not yet impose the conditions on σ required for the application to Neural Networks. We will now prove this theorem.

3.2.1 Proof of Theorem 1

To fully understand and investigate the proof in Cybenko's 1989 paper we must begin by considering two theorems, the Hahn-Banach theorem and the Riesz-Markov-Kakutani representation theorem, whose proofs will be omitted.

The Hahn-Banach Theorem [21, 22]: Let V be a real vector space, p : V → ℝ a sublinear function (i.e. p(λx) = λp(x) ∀ λ ∈ ℝ⁺, x ∈ V, and p(x + y) ≤ p(x) + p(y) ∀ x, y ∈ V), and φ : U → ℝ a linear function on a linear subspace U ⊆ V which is dominated by p on U (i.e. φ(x) ≤ p(x) ∀ x ∈ U). Then there exists a linear extension ψ : V → ℝ of φ to the whole space V such that:

ψ(x) = φ(x)  ∀ x ∈ U    (3.2.6)

ψ(x) ≤ p(x)  ∀ x ∈ V    (3.2.7)

Now defining C_c(X) as the space of continuous, compactly supported, complex-valued functions on a locally compact Hausdorff space X (a Hausdorff space means any two distinct points of X can be separated by neighbourhoods [23]), we can state the Representation Theorem.

Riesz-Markov-Kakutani Representation Theorem [24, 25, 26]: Let X be a locally compact Hausdorff space. Then for any positive linear functional ψ on C_c(X) there exists a unique Borel measure µ on X such that:

ψ(f) = ∫_X f(x) dµ(x)  ∀ f ∈ C_c(X)    (3.2.8)

With these two theorems we are now in a position to understand Cybenko's proof of Theorem 1, quoted below with the notation adapted to ours from his paper [19]:

27

Page 29: Neural Networks on Steroids

Let S ⊂ C(I_n) be the set of functions of the form F(x) as in Equation 3.1.1. Clearly S is a linear subspace of C(I_n). We claim that the closure of S is all of C(I_n).

Here, Telgarsky helps find our route of argument [20]. The single node in the output layer, as defined in our logical argument earlier, is a linear combination of the elements in the previous layer. The nodes in the hidden layer are functions, and thus this linear combination from the output node is also a function, contained in the subspace of functions spanned by the hidden layer's outputs. This subspace has the same properties as the space spanned by the hidden node functions, but we need to show it is closed. Thus Cybenko argues, by means of contradiction, that this subspace is not only closed but contains all continuous functions.

Assume that the closure of S is not all of C(I_n). Then the closure of S, say R, is a closed proper subspace of C(I_n). By the Hahn-Banach theorem, there is a bounded linear functional on C(I_n), call it L, with the property that L ≠ 0 but L(R) = L(S) = 0.

By the Riesz(-Markov-Kakutani) Representation Theorem, this bounded linear functional, L, is of the form:

L(h) = ∫_{I_n} h(x) dµ(x)    (3.2.9)

for some µ ∈ M(I_n), for all h ∈ C(I_n). In particular, since σ( Σ_{j=1}^{n} w_j x_j + b ) is in R for all w and b, we must have that:

∫_{I_n} σ( Σ_{j=1}^{n} w_j x_j + b ) dµ(x) = 0    (3.2.10)

for all w and b.

However, we assumed that σ was discriminatory, so this condition implies that µ = 0, contradicting the fact that L ≠ 0. Hence, the subspace S must be dense in C(I_n).

This proof shows that if σ is continuous and discriminatory then Theorem 1 holds. All we need to do now is show that any continuous sigmoidal function σ is discriminatory. This will then give us all the ingredients to prove the Universal Approximation Theorem.

Cybenko gives us the following Lemma to Theorem 1:

Lemma 1: Any bounded, measurable sigmoidal function, σ, is discriminatory. In particular, any continuous sigmoidal function is discriminatory.

The proof of this Lemma relies heavily on measure theory and is omitted from this report; however, it can be found in Cybenko's 1989 paper [19].

Finally, we can state Cybenko's second theorem, which gives us the Universal Approximation Theorem and shows that an MLP with only one hidden layer and an arbitrary continuous sigmoidal activation function allows for the approximation of any function f ∈ C(I_n) to arbitrary precision:

Theorem 2: Let σ be any continuous sigmoidal function. Then given any f ∈ C(I_n) and any ε > 0, there exists F(x) of the following form:

\[
F(x) = \sum_{i=1}^{N} c_i\, \sigma\left(\sum_{j=1}^{n} w_{ij} x_j + b_i\right) \tag{3.2.11}
\]

such that

\[
|F(x) - f(x)| < \varepsilon \quad \forall x \in I_n \tag{3.2.12}
\]

The proof for this theorem is a combination of Theorem 1 and Lemma 1.

3.2.2 Discussion

Interestingly, Cybenko mentions in his 1989 paper that in Neural Network applications, sigmoidal activation functions are typically taken to be monotonically increasing. We assume this for the UAT, but for the results in Cybenko's paper and the two theorems we investigated, monotonicity is not needed. Although this appears an unnecessary condition on the UAT, a monotonically increasing activation function allows for simpler approximating and is therefore a sensible condition to include. If the activation function were not monotonically increasing, it can be assumed that training a network would generally either take longer or struggle to converge on the global minimum of the Error function. This is because the activation function would have minima of its own, causing problems in the Backpropagation algorithm when updating weights, since some weight changes could be 0 at the minima of the activation function.

In addition to this proof from Cybenko, it is worth noting that two years later, in 1991, Kurt Hornik published a paper called "Approximation Capabilities of Multilayer Feedforward Networks" in which he proved that the ability of a single hidden layer MLP to approximate all continuous functions is down to its architecture rather than the choice of the activation function [27].

With the Universal Approximation Theorem in mind we can conclude that if a single hidden layer MLP fails to learn a mapping under the defined constraints, it is not down to the architecture of the network but to the parameters that define it. For example, this could be poorly initialised weights, the learning rate, or an insufficient number of hidden nodes, leaving too few degrees of freedom to produce a complex enough approximation.

Another thing to note is that this theorem only tells us we have the ability to approximate any continuous function using such an MLP given a finite number of hidden nodes. It does not tell us what this finite number actually is, or even give us a bound. However, when we want to compute extremely complex problems we are going to require a large number of hidden nodes to cope with the number of mappings represented. The number of hidden nodes is important with regards to an accurate network, and we will now consider the consequences of a poorly chosen hidden layer size.

3.3 Overfitting and Underfitting

Once we require a huge number of nodes to solve complex approximations, the number of calculations the network has to do increases significantly in an MLP. Considering i input nodes and o output nodes, adding one more hidden node increases the number of weighted connections by i + o. Presumably, if a large number of hidden nodes are required, a large number of input and output nodes are also present. Ideally we want to minimise this number of additional calculations, because a Neural Network that takes days, weeks or even months to train is completely inefficient, and this is one of the reasons Neural Networks fell out of popularity in the '70s. However, balancing training time and efficiency with a network that can train accurately is surprisingly difficult.

Given a set of data points, the idea of a Neural Network is to teach itself an appropriate approximation to the data whilst generalising accurately to unseen data. Unlike in Chapter 2, in which the network was trained using perfect targets for the training data, in reality the training data is likely to have noise. Noise can be defined as the error from the ideal solution. For example, using the mapping x², our training data could in fact contain the mappings 1 ↦ 1.04 and 0.5 ↦ 0.24. These are not precise, but in practice not every data set can be completely accurate. The noise is what causes this inaccuracy. We want our Neural Network to find the underlying function of the training data despite the noise, as such:

Figure 3.1: An example of a curve fitting to noisy data. Image taken from a lecture by Bullinaria in 2004 [28]

The blue curve represents the underlying function, similar to how our Example had the underlying function x², and the circles represent the noisy data points forming the training set. We want a Neural Network to approximate a function, using this noisy data set, as close to the blue curve as possible.

However, two things may occur in the process:

1. Underfitting is a concept in which a Neural Network has been "lazy" and has not learned how to fit the training data at all, let alone generalise to unseen data. This yields an all-round poor approximator.


2. Overfitting is the opposite of Underfitting. In this concept a Neural Network has worked extremely hard to learn the training data. Unfortunately, although the training data may have been approximated perfectly, the network has poorly "filled in the gaps" and thus generalised incompetently. This yields a poor approximator to unseen data.

To understand these concepts, consider the following figure that illustrates these cases to their extremes:

Figure 3.2: An illustration of Underfitting (left) and Overfitting (right) of an ANN. Image taken from a lecture by Bullinaria in 2004 [28]

The graph on the left shows extreme Underfitting of the training data, and clearly the red best fit line would be a poor approximation for almost all points on the blue curve. The graph on the right shows extreme Overfitting, in which the network has effectively generalised using a "dot-to-dot" method and again provides a poor approximation to all points outside the training set.

Why might this occur? First let's consider the concepts behind the error of a Neural Network.

3.3.1 A Statistical View on Network Training: Bias-Variance Tradeoff

Our aim here is to identify the expected prediction error of a trained Neural Network when presented with a previously unseen data point. Ideally we want to minimise the error between the network's approximated function and the underlying function of the data. Due to the addition of noise to our training data, we may not want to truly minimise the error of our output compared to our target, which was:

\[
E_i = \frac{1}{2}\left(t_i - a^{(L)}(x_i)\right)^2 \tag{3.3.1}
\]

If we have noisy data then the minimum of this error could cause Overfitting. We want to ensure the ANN is able to generalise beyond the noise to the underlying function, and this generalisation will not give the minimal error on a data point by data point basis.

In 2013, Dustin Stansbury wrote an article called "Model Selection: Underfitting, Overfitting, and the Bias-Variance Tradeoff" [29] and we shall follow his argument investigating Underfitting, Overfitting and the Bias-Variance Tradeoff with adapted notation, paying close attention to the section named "Expected Prediction Error and the Bias-variance Tradeoff". Firstly, let f(x) be the underlying function we wish to accurately approximate and let F(x) be the approximating function generated by the Neural Network. Recalling that x are the training data inputs and t their respective targets, this F(x) has been fit using our x–t pairs. Therefore we can define an expected approximation over all data points we could present the network with from F(x) as such:

\[
\text{Expected approximation over all data points} = E\left[F(x)\right] \tag{3.3.2}
\]

Similar to Stansbury, our overall goal is to minimise the error between our approximation and the underlying function f(x) on previously unseen data points. Therefore we want to find the expected prediction error of a new data point (x*, t* = f(x*) + ε), where ε accounts for the noise in the new data point. Thus we can naturally define our expected prediction error as:

\[
\text{Expected prediction error} = E\left[\left(F(x^*) - t^*\right)^2\right] \tag{3.3.3}
\]

To achieve our overall goal, we therefore intend to minimise Equation 3.3.3 instead of minimising Equation 3.3.1. This does not affect the Backpropagation learning algorithm imposed in Chapter 1, and theoretically these errors will be roughly the same given a successfully trained Neural Network.

Let's investigate our expected prediction error further. First we will take the following statistical equations for granted, which can also be found in Stansbury's article [29]:

\[
\text{Bias of the approximation } F(x) = E\left[F(x)\right] - f(x) \tag{3.3.4}
\]

\[
\text{Variance of the approximation } F(x) = E\left[\left(F(x) - E\left[F(x)\right]\right)^2\right] \tag{3.3.5}
\]

\[
E\left[X^2\right] - E\left[X\right]^2 = E\left[\left(X - E\left[X\right]\right)^2\right] \tag{3.3.6}
\]

The bias of the approximation function represents the deviation between the expected approximation over all data points, E[F(x)], and our underlying function f(x). If we have a large bias, then we can conclude that our approximation function is generally a long way from the underlying function we are aiming for. If the bias is small then we have an accurate representation of the underlying function in the form of our approximation function.

The variance of the approximation function is the average squared difference between an approximation function based on a single data set (i.e. F(x)) and the expected approximation over all data sets (i.e. E[F(x)]). A large variance indicates that our single data set gives a poor approximation across all data sets; a small variance indicates that it gives a good one.

Preferably we would like as small a bias and variance as possible, to allow us the best approximation to the underlying function. Equation 3.3.6 is a well known Lemma relating these quantities; the statement and proof can be found posted online by Stansbury [30] as an extension of his argument towards the Bias-Variance Tradeoff [29].

Now to investigate our expected prediction error further. This argument follows Stansbury's argument in his section named "Expected Prediction Error and the Bias-variance Tradeoff" [29] with adapted notation:

\[
\begin{aligned}
E\left[(F(x^*) - t^*)^2\right] &= E\left[F(x^*)^2 - 2F(x^*)t^* + t^{*2}\right] \\
&= E\left[F(x^*)^2\right] - 2E\left[F(x^*)t^*\right] + E\left[t^{*2}\right] \\
&= E\left[\left(F(x^*) - E\left[F(x^*)\right]\right)^2\right] + E\left[F(x^*)\right]^2 - 2E\left[F(x^*)\right]f(x^*) + f(x^*)^2 + E\left[\left(t^* - f(x^*)\right)^2\right] \\
&= E\left[\left(F(x^*) - E\left[F(x^*)\right]\right)^2\right] + \left(E\left[F(x^*)\right] - f(x^*)\right)^2 + E\left[\left(t^* - f(x^*)\right)^2\right] \\
&= \text{variance of } F(x^*) + \left(\text{bias of } F(x^*)\right)^2 + \text{variance of the target noise} \tag{3.3.7}
\end{aligned}
\]

As Stansbury notes, the variance of the target data noise provides a lower bound on the expected prediction error. Logically this makes sense: it indicates that if our data set has noise, and hence isn't accurate, we will have some error in the prediction. Now we can see the effects of bias and variance on our expected prediction error, which is what we want to minimise.

Similarly to Bullinaria in his 2004 lecture [28], we can now investigate our extreme examples with regards to Equation 3.3.7. If we pretend our network has Underfitted extremely and take F(x) = c, where c is some constant, then we are going to have a huge bias. However, our variance will be zero, which overall will give a large expected prediction error. Alternatively, assume our network has Overfitted extremely and F(x) is a very complicated function of large order such that it fits our training data perfectly. Then our bias is zero, but our variance on the data is equal to the variance of the target noise. This variance could be huge in practice, depending on the data set you present your trained Neural Network with. This defines the Bias-Variance Tradeoff.

Preferentially we wanted to minimise both the bias and the variance. However, this explanation shows that as one increases, the other decreases, and vice versa. This means there is a point at which the bias and variance together provide the smallest expected prediction error, and finding it completes our aim. If we favour bias or variance too much then we risk running into the problems of Underfitting and Overfitting described above.
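To make the tradeoff tangible, here is a small Monte Carlo sketch of Equation 3.3.7. It repeatedly samples noisy training sets from the underlying function f(x) = x² and fits models of different complexity; purely to keep the sketch short, polynomial fits stand in for Neural Networks of different hidden layer sizes, so the specific numbers are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                       # underlying function
x_train = np.linspace(0.0, 1.0, 21)        # training inputs
x_star, noise_sd = 0.37, 0.05              # unseen point and noise level (arbitrary)

def fit_and_predict(complexity):
    # Fit one model to a freshly sampled noisy training set and return F(x*).
    t = f(x_train) + rng.normal(0.0, noise_sd, x_train.size)
    coeffs = np.polyfit(x_train, t, complexity)
    return np.polyval(coeffs, x_star)

for complexity in (0, 2, 9):               # underfit, about right, overfit
    preds = np.array([fit_and_predict(complexity) for _ in range(500)])
    bias_sq = (preds.mean() - f(x_star)) ** 2
    variance = preds.var()
    print(f"complexity {complexity}: bias^2 = {bias_sq:.5f}, variance = {variance:.5f}")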

3.3.2 Applying Overfitting and Underfitting to our Example

Having now researched the concepts of Overfitting and Underfitting and the reasons for them occurring, we can illustrate them using our Example in Chapter 2. To put this into practice, a couple of alterations had to be made to the code used for Chapter 2 in Appendix A:

1. Firstly, to ensure clarity in the Overfitting and Underfitting, noise was added to the data points in our training set. I also added more data points to the training set, to finally give the following set of training data:

(0, -0.05), (0.05, -0.0026), (0.1, 0.011), (0.15, 0.022), (0.2, 0.041), (0.25, 0.0613), (0.3, 0.093), (0.35, 0.12), (0.4, 0.165), (0.45, 0.2125), (0.5, 0.26), (0.55, 0.295), (0.6, 0.364), (0.65, 0.4225), (0.7, 0.49), (0.75, 0.553), (0.8, 0.651), (0.85, 0.72), (0.9, 0.805), (0.95, 0.9125), (1, 1.04)

Recalling our original data set consisted of x ∈ {0, 0.1, 0.2, ..., 1}.

2. The number of epochs was increased from 10000 to 40000.

3. The Neural Network was originally defined to produce a network with 2 input nodes, 3 hidden nodes and 1 output node, which includes the bias nodes on the input and hidden layer. For the following examples the number of hidden nodes will be changed appropriately to yield the desired results. This number will be specified for each example and will include the bias node.

Everything else remains the same, including the learning rate and the number of test data points to be plotted on the graphs.
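The code for these alterations lives in Appendix A and is not reproduced here; the snippet below is only a minimal sketch of how a noisy x² training set of this kind could be generated, with the noise scale chosen arbitrarily for illustration.

import numpy as np

rng = np.random.default_rng(42)

# 21 evenly spaced inputs on [0, 1], as in the enlarged training set above.
x = np.linspace(0.0, 1.0, 21)

# Targets are x^2 plus a small amount of noise (the noise scale is an
# arbitrary illustrative choice, not the report's exact values).
t = x ** 2 + rng.normal(0.0, 0.03, size=x.size)

training_set = list(zip(np.round(x, 2), np.round(t, 4)))
print(training_set)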

Let’s begin by considering a control graph:

Figure 3.3: A graph to show convergence can occur with the new code setup. This network used 4 hidden nodes.

Figure 3.3 shows a graph generated by an MLP which used 4 hidden nodes and proves that the new setup for our Example Network can produce results extremely similar to the "Great Result" from Chapter 2.

Now to demonstrate Overfitting and Underfitting of the network. Overfitting can occur if we allow the bias to be low, and we can generate this problem by building the network to be too complex. Therefore the network was built with an excessive number of hidden nodes, 11 in total. Similarly, we can illustrate Underfitting by allowing the network to be too simple and thus unable to appropriately map each training data point to its target:


(a) A network whose training has caused Overfitting. A total of 11 hidden nodes were used.

(b) A network whose training has caused Underfitting. A total of 2 hidden nodes were used.

Figure 3.4: Two graphs representing our Example Network's ability to Overfit (left) or Underfit (right) under a poorly chosen size of hidden layer.

Figure 3.4a, under inspection with the training set, demonstrates Overfitting, although not to the extreme discussed earlier. The excessive number of hidden nodes allows the network to form a much more complex approximation function than a quadratic, and therefore it attempts to hit all the noisy training data points as best as possible. This could be exaggerated further by increasing the number of epochs, allowing the network to train longer and become even more suited to the training data. However, for fairness and continuity's sake I did not change this for the Overfitting example.

Figure 3.4b shows clear Underfitting. Having just 2 hidden nodes in total, of which only the sigmoid node has the ability to produce a function that isn't linear, is nowhere near enough to map our 21 data points to any sort of accuracy. It therefore compromises by effectively approximating a step function.

This clearly indicates the necessity for an appropriately sized hidden layer in a single hidden layer MLP, or in fact any Neural Network which can exhibit Underfitting and Overfitting. Too many hidden nodes allows the network to Overfit, and too few gives the network no ability to map all data points accurately and hence Underfits. So, in general, how can we avoid these problems?

3.3.3 How to Avoid Overfitting and Underfitting

Bullinaria highlights some important analysis of methods for preventing Overfitting and Underfitting [28]. To help prevent the possibility of Overfitting we can consider the following when building our network:

• Do not build the network with too many hidden nodes. If Overfitting is clearly occurring during run-throughs, reducing the number of hidden nodes should help.

• The network can be instructed to cease training when there is evidence of Overfitting beginning to occur. If the error on a test set of data, measured after each epoch of training, increases for a pre-defined threshold of consecutive measurements, say 10 epochs, then the network can be commanded to stop training and use the solution prior to the increase of the errors. It is important not to use the training data for this, because Overfitting can occur on it without being noticed. A minimal sketch of such an early-stopping check appears after this list.

• Adding noise to the training data is in fact recommended by Bullinaria [28] because it gives the network the chance to find a smoothed out approximation, whereas if one data point were an anomaly in the data set (and could be described as the only point with noise) then this could dramatically affect the network's ability to predict unseen data in a neighbourhood of this anomaly.
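As promised in the second point above, here is a minimal sketch of that early-stopping check. The network, train_one_epoch and validation_error objects are hypothetical placeholders standing in for the report's actual training code; only the stopping logic is the point.

def train_with_early_stopping(network, train_one_epoch, validation_error,
                              max_epochs=40000, patience=10):
    # Stop once the validation error has risen for `patience` consecutive
    # epochs, and return the best weights seen so far.
    # `network` is assumed (hypothetically) to expose get_weights/set_weights.
    best_error = float("inf")
    best_weights = network.get_weights()
    bad_epochs = 0

    for epoch in range(max_epochs):
        train_one_epoch(network)            # one pass of Backpropagation
        error = validation_error(network)   # error on held-out data, not training data

        if error < best_error:
            best_error, best_weights = error, network.get_weights()
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:      # e.g. 10 consecutive increases
                break

    network.set_weights(best_weights)       # roll back to the best solution
    return network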

To assist in Underfitting avoidance, being mindful of the following helps:

• If Underfitting is occurring, increasing the number of parameters in the system is necessary. This can be done by increasing the size of the hidden layer or even adding in other layers. If the system has too few parameters to represent all the mappings, it will be unable to accurately approximate each mapping.

• The length of training must be long enough to allow for suitable convergence to the global minimum of the Error function. If you only train your network for 1 epoch, chances are your approximation function will be incredibly inaccurate. On the other hand, you don't want to train too long and risk Overfitting.

3.4 Conclusions

In this Chapter we learned about the Universal Approximation Theorem, which informally states that we can approximate any continuous function to arbitrary accuracy using a single hidden layer MLP with an arbitrary sigmoidal activation function. We then investigated Cybenko's 1989 paper [19] detailing his proof on the matter, and became aware of the fact that it is the architecture of the MLP that allows this, rather than the choice of activation function, as proved by Kurt Hornik in 1991 [27].

Furthermore, the Universal Approximation Theorem does not provide us with any sort of bound or indication on the number of hidden nodes necessary to do this; it just provides the knowledge that we can approximate any continuous function. This brought into question the considerations necessary for appropriately choosing the size of our hidden layer.

Moreover, we investigated the problems associated with this choice, most notably Overfitting and Underfitting. We discovered that the Bias and Variance of our approximation function were closely linked, and we had to find a favourable Tradeoff in which the sum of the Bias squared and the Variance was minimised, in order to minimise the expected prediction error on a new data point presented to the trained network. Tending towards a low Variance causes Underfitting, but tending towards a low Bias causes Overfitting.

Finally, we illustrated, using an adapted version of our Example Network from Chapter 2, that under these conditions we can show evidence of Underfitting and Overfitting occurring. We were able to show Underfitting by minimising the number of nodes in the hidden layer, preventing the network from having the parameters required to map all of our data accurately. Similarly, excessively increasing the size of the hidden layer allowed too great a number of mappings, which resulted in Overfitting.


It is fair to say that the choice of the number of hidden nodes is quite delicate. There are no concrete theorems to suggest how many hidden nodes to choose for an MLP with regards to input and output layer sizes and training data set sizes. Therefore we can conclude that slowly guessing and increasing the number of nodes in a single hidden layer is practically useless when faced with a complex problem, for example image recognition. Given a 28x28 pixel picture, we already need 784 input nodes to begin with. Combining this with the length of time it can take to train a network of such huge size calls into question the efficiency of a single hidden layer MLP. Additionally, considering that preventing Underfitting essentially comes down to a sensible training time and a suitable number of parameters to encompass all the mappings necessary, perhaps it is useful to contemplate a greater number of hidden layers, especially after appreciating the improvement the single hidden layer MLP had on the perceptron. This shall be our next destination.


Chapter 4

Multiple Hidden Layer MLPs

4.1 Motivation for Multiple Hidden Layers

In Chapter 3 we ascertained that to prevent Underfitting we essentially just need to ensure there are enough parameters in the MLP to allow for all mappings of our data (and train the network for suitably long). We also theorised that one layer may cause a problem in approximation, due to the unknown number of hidden nodes required for an approximation to be accurate to our underlying continuous function, as in the Universal Approximation Theorem. As well as the problem of approximation, a large number of hidden nodes would require a long training time, as each connection requires a calculation by the Backpropagation algorithm to update the associated weight.

We shall now consider the impact of adding one extra hidden node to an MLP and then compare this with other MLPs that have the same total number of hidden nodes but a slightly different architecture (i.e. more hidden layers). We will then compare the number of calculations each MLP would have to undergo in training, via the number of weighted connections in each MLP, and simultaneously compare the complexity the MLPs are able to model. Overall, we aim to minimise the number of calculations undertaken by the Neural Network and maximise its flexibility in the number of mappings it can consider. In theory, the smaller the number of calculations the Neural Network has to make, the shorter training will be. Similarly, maximising the flexibility of the network by finding the simplest architecture for a certain flexibility requirement should also decrease training time. This is because the simplest architecture really means the fewest total nodes in the network, and therefore fewer calculations between inputting x and receiving an output F(x).

First let's define our base MLP, the one to which we will add a hidden node before restructuring the hidden nodes for comparison. Let the number of connections := #connections. Similarly, let the number of routes through the network (i.e. any path from any input node to any output node) := #routes. Then our base MLP will have the structure of 2 input nodes, 4 hidden nodes and 2 output nodes (i.e. structure = 2-4-2):


Figure 4.1: Our base MLP (layers l = 0, l = 1, l = L = 2). Structure: 2-4-2, #connections = 16, #routes = 16.

We can simply check the number of connections and routes by hand to find the numbers are correct. Recalling that our input layer is l = 0 and our output layer is l = L allows us to define the following equations for calculating #connections and #routes, where |l| := the number of nodes in layer l:

\[
\#\text{connections} := \sum_{l=0}^{L-1} |l|\,|l+1| \tag{4.1.1}
\]

\[
\#\text{routes} := \prod_{l=0}^{L} |l| \tag{4.1.2}
\]

If we check these equations against Figure 4.1 we find the numbers match up. We can understand where these equations come from by a simple logical argument. For the number of connections, we know an MLP must be fully connected, as this is one of its defining features. Therefore every node in one layer connects to every node in the following layer, and this provides Equation 4.1.1. For the number of routes, we have 2 choices for the first node on our path in Figure 4.1. We then have 4 choices for the second node on our path and 2 choices for our output node to complete our path. As an MLP is fully connected, this generates Equation 4.1.2.
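Equations 4.1.1 and 4.1.2 are easy to check in code. The sketch below computes #connections and #routes for the structures discussed in this section and reproduces the figures quoted alongside them.

from functools import reduce

def connections(layers):
    # Equation 4.1.1: sum over adjacent layer pairs of |l| * |l+1|.
    return sum(a * b for a, b in zip(layers, layers[1:]))

def routes(layers):
    # Equation 4.1.2: product of the layer sizes.
    return reduce(lambda a, b: a * b, layers, 1)

for structure in ([2, 4, 2], [2, 5, 2], [2, 4, 1, 2], [2, 3, 2, 2],
                  [2, 2, 2, 1, 2], [2, 2, 1, 1, 1, 2], [2, 2, 2, 2]):
    print(structure, connections(structure), routes(structure))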

Next we intend to add one more node to the hidden layer of our base MLP and then go about restructuring our hidden nodes. This will give us a total of 5 hidden nodes to work with, and we can see the increase in #connections and #routes:


Figure 4.2: An MLP with one additional hidden node compared to our base MLP. Structure: 2-5-2, #connections = 20, #routes = 20.

#connections has increased by 4 and similarly #routes has increased by 4. The addition of this node increases both the number of mappings available and the number of calculations for training, as expected. Of course this is not what we are aiming for, and it's helpful to notice that all we've done is increase our risk of Overfitting.

Now let's investigate what happens if we change the structure of our hidden nodes by putting our additional hidden node into its own second hidden layer, as such:

Figure 4.3: A restructured MLP of Figure 4.2. Structure: 2-4-1-2, #connections = 14, #routes = 16.

We find the number of routes this network provides is the same as the number our base MLP provides. However, in our new MLP with 2 hidden layers, we only have 14 connections in comparison to the original 16. This is good news, because we achieve the same flexibility for mappings, but there are fewer calculations required in the Backpropagation algorithm and hence faster training.

Further, let's consider a few more different architectures of the MLP with a total of 5 hidden nodes:

Figure 4.4: A second restructure of the MLP in Figure 4.2. Structure: 2-3-2-2, #connections = 16, #routes = 24.


Figure 4.5: A third restructure of the MLP in Figure 4.2. Structure: 2-2-2-1-2, #connections = 12, #routes = 16.

Figure 4.6: A fourth restructure of the MLP in Figure 4.2. Structure: 2-2-1-1-1-2, #connections = 10, #routes = 8.

Figure 4.4 is arguably the most favourable choice of structure for 5 hidden nodes, given the size of the input and output layers we have. This is because it gives us the greatest ratio of routes to connections, at 3:2. This maximises the complexity of the network with minimal calculations, and hence training time, while keeping the network simple.

Figure 4.5 improves on Figure 4.3 in that it gives the same number of routes but even fewer connections. This appears as if it could also be an optimal choice of multiple hidden layer network for our base MLP.

As one would expect, stretching the hidden nodes across more and more layers results in poorer networks for our aims. Naturally, when building a network, one would not believe a series of one-node layers would have any benefit to us. This would just cause serious Underfitting, and hence Figure 4.6 is here to demonstrate how increasing a network by more and more layers isn't beneficial.

We can conclude that the architecture of the network has interesting benefits when the intention is to minimise training time by minimising connections, and hence calculations, without losing the approximation capabilities our Neural Network is able to produce. One may think applying our logic to our base MLP could show similar benefits, so let's check:

Figure 4.7: An advantageous restructure of our base MLP in Figure 4.1. Structure: 2-2-2-2, #connections = 12, #routes = 16.

As one may have surmised, we can theoretically decrease training time without loss of generality in the number of mappings available. This form of our base MLP in Figure 4.1 gives us the same number of routes but with 4 fewer connections. Figure 4.7 actually yields identical results to Figure 4.5, except that Figure 4.7 uses fewer nodes and hence should be even faster in training, because there is one fewer node to undertake calculations.

In conclusion, the advantages to adding more hidden layers include:

• A simplified architecture without loss of mapping ability

• Shorter training times due to a decrease in calculations within the learning algorithm

These advantages are significantly amplified if we need millions of hidden nodes to satisfy an approximation function ε-close to an underlying function, as in the Universal Approximation Theorem. Breaking a large single hidden layer into multiple layers will yield benefits. For an example, assume we have n input nodes, n output nodes and 10^6 hidden nodes. Then for a single hidden layer MLP we would have:

\[
\#\text{connections} = 2n \cdot 10^6, \qquad \#\text{routes} = n^2 \cdot 10^6
\]

If we wanted to find an MLP with 2 hidden layers with the same mapping ability, then we would need the sizes of our two hidden layers, l = 1 and l = 2, to satisfy |l = 1| · |l = 2| = 10^6. Assuming they are the same size for simplicity, we can let each hidden layer contain just 1000 nodes (because #routes = n · 10³ · 10³ · n = n² · 10^6, as before). We have already decreased the number of nodes in the network by 10^6 − 2000 = 998,000. This is a significant decrease in the number of calculations required before an output of the network is given.

This implies that

\[
\#\text{connections} = n \cdot 10^3 + 10^3 \cdot 10^3 + 10^3 \cdot n = 2000n + 10^6 < 2n \cdot 10^6 \iff n > \tfrac{1000}{1998} \approx 0.5
\]

and hence there are fewer connections, as well as fewer nodes in the network, for any chosen n ≥ 1. The greater n gets, the greater the decrease in connections. This describes a network whose training time would significantly decrease if a second hidden layer were implemented instead of using just a single hidden layer MLP.
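Reusing the same counting rules, the short sketch below checks this comparison numerically; the value n = 784 is an arbitrary example choice of input/output size, not one used elsewhere in this report.

def connections(layers):
    # Equation 4.1.1
    return sum(a * b for a, b in zip(layers, layers[1:]))

def routes(layers):
    # Equation 4.1.2
    prod = 1
    for size in layers:
        prod *= size
    return prod

n = 784                          # arbitrary example size for input and output layers
single = [n, 10**6, n]           # one hidden layer of a million nodes
double = [n, 1000, 1000, n]      # two hidden layers of a thousand nodes each

print(connections(single), routes(single))   # 2n*10^6 connections, n^2*10^6 routes
print(connections(double), routes(double))   # 2000n + 10^6 connections, same routes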

One may ask, "Why doesn't everyone just employ large multiple hidden layer MLPs?". As one may have also suspected, training an MLP with a large number of hidden layers has its own problems. We will investigate these problems after first considering an example problem in which 2+ hidden layers truly are useful.

4.2 Example Problems for Multiple Hidden Layer MLPs

Generally we can rely on 2 hidden layer MLPs to solve most problems we can expect to come across in which Neural Networks would be an ideal instrument for a solution. For example, finding obscure patterns in huge data sets, such as a pattern between how likely someone is to get a job and the number of cups of coffee they drink a week. However, it sometimes pays off to include a larger number of layers.

4.2.1 Image Recognition

Let's consider simple image recognition. If we consider a U.K. passport photo, its size is 45mm by 35mm [31], which equals 1575mm². The U.K. demands professional printing, which could be variable in pixels-per-millimetre ratios, so basing this on the U.S.A. requirements for picture quality, which demand a minimum of 12 pixels per millimetre [32], gives 144 pixels per mm², and therefore a standard U.K. passport photo contains at least 226,800 pixels. If an image recognition system was using the passport photo database to recognise a headshot of a criminal being searched for, then the Neural Network would need 226,800 input nodes, one for each pixel, to teach the system how to do so. It would then require enough hidden nodes and layers to recognise the colours and facial features etc. to accurately identify the person. With this construct, the advantage of applying multiple layers to allow this feature detection is clear to see. Reducing the number of connections without losing mapping complexity would be extremely useful, allowing quicker training and running times.

4.2.2 Facial Recognition

Figure 4.8: A picture demonstrating the use of multiple hidden layers in a feedforward Neural Network

Facial recognition is a further step up from image recognition because it jumps from 2 dimensions to 3 dimensions. Facial recognition requires a Neural Network to learn extremely complex shapes, such as the contours and structure of your face, as well as eye, skin and lip colour. To teach a Neural Network to recognise such complex shapes, it must be taught to recognise simple features like colour and lines before moving onto shapes and contours, which can be done layer by layer. Figure 4.8 is an image constructed by Nicola Jones in an article called "Computer science: The learning machines" from 2014 [33]. The images within are courtesy of an article by Honglak Lee et al. [34].

The image from Jones briefly details how each layer of this Neural Network can be seen as a feature detector, and the features to be detected get more complex the further into the network we travel, until eventually the Neural Network is able to construct faces. As can be seen from the final image, the constructed faces still lack a significant amount of detail. This is mainly due to the lack of understanding of how to teach a network such a task, but more layers could again be included to allow even greater refinement.


4.3 Vanishing and Exploding Gradient Problem

In 1963, Arthur Earl Bryson et al. published an article regarding "Optimal Programming Problems with Inequality Constraints", to which the first arguable claim to the invention of the Backpropagation algorithm can be assigned [35]. The theories made in this paper were applied solely to programming, and not until 1974 did Paul J. Werbos apply it to Neural Networks in his PhD thesis "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences" [10]. Unfortunately, the lack of computing power to make the application efficient enough for use meant the Backpropagation algorithm went quiet until 1986, the year of the publication by Rumelhart, Hinton and Williams, in which a breakthrough was made in the efficiency of the algorithm, helped by the upturn in computing power available to run it [11]. Then, 28 years on from the Bryson article, Sepp Hochreiter finished his Diploma thesis on the "Fundamental Problem in Deep Learning" [36]. This thesis described the underlying issues with training large, deep Neural Networks. These underlying issues defined significant research routes through the '90s and '00s, and although usually applied to what are called recurrent Neural Networks (which can be described as feedforward Neural Networks that also include the concept of time and therefore allow loops within layers), we can apply this to our MLPs as they are a form of deep Neural Network.

The underlying issues are now famously known as the Vanishing and Exploding Gradient Problems. The Backpropagation algorithm utilises gradient descent as the crux of its operation. As the number of hidden layers increases, the number of gradient factors the learning algorithm has to multiply through in order to update weights early in the network grows, and the resulting products behave exponentially. This either leads to a vanishing gradient, in which the weight updates become negligible towards the start of the network, or an exploding gradient, in which the weight updates become exponentially bigger as we reach the start of the network. The former causes issues with extracting critical features from the input data, which leads to Underfitting of the training data, and the latter leaves the learning algorithm entirely unstable and reduces the chances of finding a suitable set of weights.

Let's investigate why these two problems occur within the Backpropagation algorithm and understand that they are a consequence of using this learning algorithm. The backbone of this algorithm boils down to iteratively making small adjustments to our weights in an attempt to find a minimum. Ideally we would like to converge on the global minimum, but there is always the chance we will converge on a local minimum. Considering the complexity of multiple hidden layer MLPs with hundreds of nodes, chances are a local minimum will be found. First of all, let's remind ourselves of the weight update equations from Chapter 1:

\[
\Delta w_{ij}^{(l)} = -\eta\, \frac{\partial E_i}{\partial w_{ij}^{(l)}} \tag{4.3.1}
\]

represents our actual weight update equation. This was then determined by its Error function derivative term, which is:

\[
\frac{\partial E}{\partial w_{ij}^{(l)}} = \left(\frac{\partial E}{\partial a_j^{(l)}}\, \frac{\partial a_j^{(l)}}{\partial in_j^{(l)}}\right) a_i^{(l-1)} = \left(\frac{\partial E}{\partial a_j^{(l)}}\right) a_j'(in_j^{(l)})\, a_i^{(l-1)}
\]

with:

\[
\frac{\partial E}{\partial a_j^{(l)}} =
\begin{cases}
a_j^{(L)} - t_j & \text{if } j \text{ is a node in the output layer} \\[4pt]
\displaystyle\sum_{k \in K} \left(\frac{\partial E}{\partial a_k^{(l+1)}}\, a_k'(in_k^{(l+1)})\, w_{jk}^{(l+1)}\right) & \text{if } j \text{ is a node in any other layer}
\end{cases} \tag{4.3.2}
\]


recalling that a_j^{(l)} is the activation of the j-th node in layer l and a_j'(in_j^{(l)}) is the derivative of the activation function of the j-th node in layer l with respect to the input to that node. For a node not in the output layer, we can simplify this equation by defining the error of the j-th node in layer l to be δ_j^{(l)}:

\[
\delta_j^{(l)} = \frac{\partial E}{\partial a_j^{(l)}}\, a_j'(in_j^{(l)})
= a_j'(in_j^{(l)}) \cdot \sum_{k \in K} \left(\frac{\partial E}{\partial a_k^{(l+1)}}\, a_k'(in_k^{(l+1)})\, w_{jk}^{(l+1)}\right)
= a_j'(in_j^{(l)}) \cdot \sum_{k} \delta_k^{(l+1)} w_{jk}^{(l+1)} \tag{4.3.3}
\]

This is an especially important way of phrasing the weight adjustments. We notice that the gradients δ_j^{(l)} of layer l depend on all gradients of the layers ahead of it (i.e. layers l + 1, l + 2, ..., L). This is the cause of the Vanishing/Exploding Gradient Problem.
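A minimal numeric sketch of the recursion in Equation 4.3.3, assuming a chain of logistic nodes with one node per layer and randomly initialised weights (all values are illustrative): it shows how the magnitude of δ typically shrinks as it is propagated back through many layers.

import numpy as np

rng = np.random.default_rng(1)

def logistic_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

n_layers = 30
delta = 0.1                                   # illustrative output-layer error term
for layer in range(n_layers, 0, -1):
    w = rng.normal(0.0, 1.0)                  # a typical randomly initialised weight
    z = rng.normal(0.0, 1.0)                  # a typical pre-activation input
    delta = logistic_deriv(z) * w * delta     # Equation 4.3.3 with one node per layer
    if layer % 10 == 0:
        print(f"layer {layer:2d}: |delta| = {abs(delta):.3e}")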

In his soon to be published book "Neural Networks and Deep Learning" [37], Michael Nielsen offers a simple example to demonstrate these problems. We will now adapt his example here by considering a 3 hidden layer MLP with a single node in each layer:

x → σ → σ → σ → f(x), with weights w1, w2, w3 and w4 on the successive connections.

Let's calculate the change in error with respect to the first weight using our equations from above:

\[
\frac{\partial E}{\partial w_1} = x \cdot a'(in_1^{(1)}) \cdot w_2 \cdot a'(in_2^{(2)}) \cdot w_3 \cdot a'(in_3^{(3)}) \cdot w_4 \cdot a'(in_4^{(4)}) \cdot E \tag{4.3.4}
\]

where E is the error term from the output node and, in this case, a^{(0)} = x represents the output from the input layer. From this we can see that there are two very distinctive dependencies in the changes to w1: the weights, which will have been randomly initialised, and the choice of our activation function. The weights can ultimately decide which problem we have, Vanishing or Exploding. If our weights are small then the Vanishing Gradient Problem may occur, but if we choose large weights then we can end up with an Exploding Gradient Problem. Generally, the Vanishing Gradient is more common, due to our desire to minimise our parameters: smaller weights, normalised input data and as small an output error as possible.

However, let's also consider the impact of the choice of activation function. In Chapter 1 we looked at two in particular: the logistic function, as used in Chapter 2 for our Example, and the hyperbolic tangent, which we have pretty much ignored until now. The impact of the activation function's derivative on the change to w1 could be significant, so let's consider the following figure:


(a) A graph representing the derivative of the logistic function

(b) A graph representing the derivative of tanh(x)

Figure 4.9: Graphs representing the derivatives of our two previously seen activation functions

Figure 4.9 shows the form of the derivatives of the logistic function and tanh(x). The logistic function's derivative satisfies σ'(x) ≤ 0.25, with its maximum at x = 0. In comparison, the hyperbolic tangent's derivative has a maximum value of 1, also at x = 0. If we consider weights that are all defined such that |w_i| ≤ 1 for i ∈ {1, 2, 3, 4}, then for the logistic activation function:

\[
\left|\frac{\partial E}{\partial w_1}\right| \le \frac{1}{4^4}\, E \cdot x \tag{4.3.5}
\]

which, considering we normalise our input data x ∈ [0, 1], is going to be a negligible change to the weight unless our learning rate is huge. Recalling that we don't want to choose a high learning rate, because this causes serious problems with convergence on a minimum, it is simple to realise that this problem is very much unavoidable. The maximum value of the activation function's derivative is one of the reasons supporting the use of the hyperbolic tangent function, as well as it satisfying the conditions for the Universal Approximation Theorem.
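The derivative maxima quoted above, and the resulting four-factor bound, can be checked numerically; the sketch below is purely illustrative.

import numpy as np

def logistic_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def tanh_deriv(z):
    return 1.0 - np.tanh(z) ** 2

z = np.linspace(-10, 10, 10001)
print("max logistic derivative:", logistic_deriv(z).max())   # ~0.25, at z = 0
print("max tanh derivative:    ", tanh_deriv(z).max())       # ~1.0,  at z = 0

# Best case for the four-factor chain in Equation 4.3.4 with |w_i| <= 1:
print("logistic four-layer factor:", 0.25 ** 4)              # 1/4^4 ≈ 0.0039
print("tanh four-layer factor:    ", 1.0 ** 4)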

Although this was a very simple example to demonstrate the exponential problem with the Backpropagation algorithm, it holds for much more complex networks in which the number of hidden nodes and layers are increased. The greater the number of layers, the greater the impact of the exponential weight change from the repeating pattern of the gradients.

Adding in more nodes and layers increases the complexity, and this can actually exacerbate the problem. With increased complexity there is a chance of increasing the frequency and severity of local minima, as noted by James Martens in a presentation from 2010 [38]. If we have a greater number of local minima, then we have a greater chance of slipping into one. The error may be large at such a minimum, but its gradient will cause only a small change to weights near the end of the network, which means even less of a change to weights in the early layers of the network.

Therefore, as stipulated by Hochreiter originally in his thesis [36], the fundamental problem with gradient descent based learning algorithms like the Backpropagation algorithm is their characteristically unstable nature, arising from each gradient's dependency on all gradients of the subsequent layers [37]. It is extremely difficult to avoid this. As Nielsen points out, the pattern we saw in Equation 4.3.4 was the recurrence of a factor of the form w a_j'(in_j^{(l)}), and thus to prevent the Vanishing Gradient Problem in that simple example we would need:

\[
\left| w\, a_j'(in_j^{(l)}) \right| \ge 1 \tag{4.3.6}
\]

Therefore, with regards to the logistic function, we would need |w| ≥ 4. But a_j'(in_j^{(l)}) also depends on the weight w and, with reference to Figure 4.9a, a large w leads to a small a_j'(in_j^{(l)}).

We therefore need to balance these two effects, and it turns out the band of flexibility is extremely thin. For the logistic function, which requires |w| ≥ 4 to prevent the Vanishing Gradient Problem, the widest range of inputs over which Equation 4.3.6 can be satisfied is obtained at [37]:

\[
|w| \approx 6.9, \quad \text{giving a range of inputs of width only} \approx 0.45 \tag{4.3.7}
\]

The chances of us initialising our weights (and receiving inputs) within this band are next to none, and that's in a hugely simplified network. The Vanishing/Exploding Gradient Problems are therefore almost certainly unavoidable in any Neural Network implementing a gradient descent learning algorithm such as the Backpropagation algorithm.

4.3.1 Further Results

In 2010, Glorot and Bengio published a paper in which they investigated the use of sigmoid functions within large Neural Networks which were likely to exhibit the Vanishing Gradient Problem [39]. They trained networks for image recognition, set up with weights randomly initialised within a logical bound and learning with the Backpropagation algorithm. Their results showed that the logistic function caused problems for the learning algorithm because, very early on in training, the sigmoid nodes in the final hidden layer would saturate near 0, and this stunted the gradient's ability to flow back through the layers and make more than negligible weight updates.

To rectify this, Glorot and Bengio further investigated two other activation functions that would hopefully avoid this saturation problem. Firstly, they considered tanh(x). Because tanh(x) takes values between −1 and 1, it can output 0 without saturating, and therefore should allow the gradients to continue training the weights [39]. This also supports our earlier argument for using tanh(x) instead of the logistic function: its derivative has a higher maximum value, which can help lessen the impact of the Vanishing Gradient Problem.

The second activation function they considered is called the softsign function, first proposed by Bergstra et al. in 2009 [40]:

\[
\text{softsign}(x) := \frac{x}{1 + |x|} \tag{4.3.8}
\]


Figure 4.10: The softsign Function

The function takes the same form as tanh(x), with the same limits, but the idea behind the proposal was inspired by the nature of the tails. The tails are quadratic polynomials, which differs from the logistic function's and the hyperbolic tangent's tails, which are exponential. This means the softsign function approaches its asymptotes much more slowly, and this avoids problems such as saturation [39].
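A small sketch comparing how quickly the three functions approach their asymptotes; the sample points are arbitrary.

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def softsign(x):
    return x / (1.0 + np.abs(x))

# 1 - logistic(x) and 1 - tanh(x) decay exponentially,
# while 1 - softsign(x) = 1 / (1 + x) decays only polynomially for x > 0.
for x in (2.0, 5.0, 10.0):
    print(f"x = {x:4.1f}: 1-logistic = {1 - logistic(x):.2e}, "
          f"1-tanh = {1 - np.tanh(x):.2e}, "
          f"1-softsign = {1 - softsign(x):.2e}")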

The conclusions of this paper highlighted that tanh(x) and the logistic function fare a lot worse than the softsign function under random initialisation of weights within a logically chosen bound. Glorot and Bengio also noted that once the input data was normalised, tanh(x) performed much better than the logistic function, but was still outperformed by the softsign function [39].

4.3.2 Conclusions

All in all, with the help of Nielsen, we have shown that the Vanishing/Exploding Gradient Problems are all but impossible to avoid with the Backpropagation algorithm in large networks. With small MLPs, like our Chapter 2 Example MLP, the problem doesn't come into play, but when tens or hundreds of layers come together it certainly plays a part. This is the answer to the question "Why doesn't everyone just employ large multiple hidden layer MLPs?".

In addition to this, Glorot and Bengio were able to extend the argument to the choice of activation function. The logistic sigmoid fares poorly in comparison to the softsign function and the hyperbolic tangent, and therefore using the softsign or hyperbolic tangent as the activation function would help prevent saturation of hidden nodes in large MLPs.

Even with the new choice of activation function inspired by Bergstra et al. on top of Glorot and Bengio's research, it seems that no matter what problems we solve with our MLP, a new, more complicated problem arises. With the problems of network architecture, learning times, Underfitting/Overfitting and the Vanishing/Exploding Gradient Problem all coming down to a common factor of weight initialisation, surely there is a method by which we can initialise the weights with a greater probability of success than logical bound estimates. Or perhaps the Backpropagation algorithm served its purpose but soon became obsolete. Fortunately, the Backpropagation algorithm still holds strong (it will soon be used for fine-tuning a network) and there does exist a method to initialise the weights with a greater chance of success. This method is called pre-training.


Chapter 5

Autoencoders and Pre-training

5.1 Motivation and Aim

In Chapter 4 we motivated the desire for multiple hidden layers in an MLP. We now intend to find a solution for effective training of large MLPs which avoids the Vanishing/Exploding Gradient Problem. The concept we will consider is called pre-training, in which the MLP undergoes unsupervised learning (in which no target for an input is given to the network) before being fine-tuned by supervised learning under the Backpropagation algorithm. The intention of pre-training is to discover an appropriate weight initialisation, before supervised learning occurs, which will increase the chances of convergence onto the global minimum and which only needs minor weight updates to fine-tune the network to a problem. If the initialised weights are closer to the ideal set of weights to begin with, training will be efficient, quick and accurate. There are a number of methods of pre-training a network, but we shall focus on one in particular: greedy layerwise pre-training using stacked Autoencoders. First, what is an Autoencoder?

5.2 The Autoencoder

5.2.1 What is an Autoencoder?

An Autoencoder (also known as an Autoassociator or Diabolo Network [41]) is a type of Neural Network which carries an important feature: the size of the input layer must equal the size of the output layer. We shall consider one of the simplest forms of Autoencoder, a feedforward, fully connected Autoencoder. The following example represents such an Autoencoder, and the similarities with the MLP become clear:


Figure 5.1: A simple example of an Autoencoder: four input nodes x1, ..., x4, a hidden layer of three sigmoid nodes (σ), and four output nodes producing the reconstructions x̂1, ..., x̂4.

Here, x_i represent the elements of an input vector x and x̂_i represent the reconstructed elements of the inputs, forming a vector x̂, for i ∈ {1, 2, 3, 4}. As usual, σ represents a sigmoid function. An Autoencoder may have many more layers and hidden nodes, and it need not be symmetrical, as long as the input and output layers have the same number of nodes (not including bias nodes). A bias node in the input layer need not be replicated in the output layer because it does not actually take input from the training set.

The intention of this architecture is to allow an Autoencoder to learn to reconstruct its own inputs. So instead of having a target vector like the MLP, the targets are in fact the inputs themselves.

5.2.2 Training an Autoencoder

Instead of minimising the error between outputs and targets, the idea is to minimise the reconstruction error of the inputs. We do this by changing our Error equation from:

\[
E_i = \frac{1}{2} \left\| t_i - a^{(L)}(x_i) \right\|_2^2 \tag{5.2.1}
\]

which was our equation in the Backpropagation algorithm for our MLP, to

\[
E_i = \frac{1}{2} \left\| x_i - \hat{x}_i \right\|_2^2 \tag{5.2.2}
\]

Other than this change, the Autoencoder can be trained using the Backpropagation algorithm just like an MLP. Of course, given too many hidden layers or too many hidden nodes, the Autoencoder will be challenged with all the same problems we've discussed for the MLP, such as Overfitting, Underfitting and the Vanishing/Exploding Gradient Problems. However, the method in which we eventually intend to use them will avoid all these problems. But why have we introduced them?
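For concreteness, here is a minimal sketch of a 4-3-4 Autoencoder in the spirit of Figure 5.1, trained by Backpropagation to minimise the reconstruction error of Equation 5.2.2. The layer sizes, learning rate and training length are arbitrary choices, and bias nodes are omitted to keep the sketch short.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A 4-3-4 Autoencoder: weights for input->hidden and hidden->output.
W1 = rng.normal(0.0, 0.5, (3, 4))
W2 = rng.normal(0.0, 0.5, (4, 3))

X = rng.uniform(0.0, 1.0, (50, 4))   # 50 training vectors; targets are the inputs
eta = 0.5                            # learning rate (arbitrary)

for epoch in range(2000):
    for x in X:
        # Forward pass.
        h = sigmoid(W1 @ x)          # hidden activations
        x_hat = sigmoid(W2 @ h)      # reconstruction

        # Backward pass for E = 0.5 * ||x - x_hat||^2 (Equation 5.2.2).
        delta_out = (x_hat - x) * x_hat * (1.0 - x_hat)
        delta_hid = (W2.T @ delta_out) * h * (1.0 - h)

        W2 -= eta * np.outer(delta_out, h)
        W1 -= eta * np.outer(delta_hid, x)

reconstructions = sigmoid(W2 @ sigmoid(W1 @ X.T)).T
print("mean reconstruction error:",
      np.mean(0.5 * np.sum((X - reconstructions) ** 2, axis=1)))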

5.2.3 Dimensionality Reduction and Feature Detection

In 2006, Hinton and Salakhutdinov published a paper named "Reducing the Dimensionality of Data with Neural Networks" [42]. They introduced an idea for unsupervised learning in which a pre-training stage teaches a network the underlying patterns of the input data before subjecting the network to conventional supervised learning methods like the Backpropagation algorithm. The intention is to use an Autoencoder as a means for dimensionality reduction, which should cause the key features of the input data to be extracted and represented in the reconstructed input (i.e. the output of the Autoencoder). Here's how it works:

• Begin with an MLP in which the first hidden layer is smaller than the input layer.

• Continue subsequent hidden layers to decrease further in size until the output layer, which should be the smallest of them all.

• We can now form an Autoencoder by mirroring the input layer and hidden layers about the output layer.

Figure 5.2: An example of creating an Autoencoder from an MLP, showing the original MLP together with the mirrored layers and connections; the inputs x_i ∈ X are reconstructed as x̂_i ∈ X̂.

The black nodes and connections represent the original MLP we want to train, and the magenta nodes and connections represent the mirrored layers and connections.

• The smaller number of nodes in the first hidden layer forces the input data to be compressed, thus reducing its dimensionality. The second hidden layer then forces the output of the first hidden layer to be compressed further, and so on until the smallest layer.

• The compressed data at this point then runs through the mirrored layers until the initial training data set X is reconstructed in the mirrored input layer, giving a new data set X̂.

• The Autoencoder undergoes learning via the Backpropagation algorithm using Equation 5.2.2.
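As a small illustration of the mirroring step in the list above, the sketch below derives the Autoencoder's layer sizes from an MLP's layer sizes; the example sizes are arbitrary and not taken from this report.

def mirror_mlp(layer_sizes):
    # Mirror an MLP's input and hidden layers about its output layer,
    # producing the layer sizes of the corresponding Autoencoder.
    # layer_sizes = [input, hidden_1, ..., hidden_k, output]
    return layer_sizes + layer_sizes[-2::-1]

# An MLP that compresses 784 inputs down to a 30-node output layer...
mlp = [784, 256, 64, 30]
# ...becomes an Autoencoder that reconstructs the 784 inputs.
print(mirror_mlp(mlp))   # [784, 256, 64, 30, 64, 256, 784]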

When undergoing dimensionality reduction, the input data should be forced to give up its important information. If we were classifying images based on whether they contain an orange or a banana, the key features that may be extracted include the shape in the image, or the colour, or the size. The reconstructed data X̂ is therefore a simplified version of the training set X which retains these features. This pre-training should then allow key features to be extracted quickly and efficiently during the fine-tuning of our original MLP with the Backpropagation algorithm (in which the target vectors are used).

It is important to note that this dimensionality reduction will only occur if the hidden layers are smaller than the input layer. If the hidden layers of an Autoencoder have a greater number of nodes than the input layer, there is a chance of the MLP learning the identity function, as mentioned in the 2007 paper "Greedy Layer-Wise Training of Deep Networks" by Bengio et al. [43]. This is because the training algorithm attempts to minimise the reconstruction error, and could do this perfectly by learning the identity function.

However, Bengio et al. did find some interesting results when experimenting with this. Although the Autoencoder would have the ability to learn the identity function, the experiments in [43] showed that the Autoencoder was still able to generalise well, and the authors hypothesised that this was because, and I quote:

“...optimization falls in a local minimum which corresponds to a good transformation of the input(that provides a good initialization for supervised training of the whole net).”

Despite this being an interesting addition to the theory, generally Autoencoders are used for pre-training for dimensionality reduction with the interest of feature detection in mind.

5.3 Autoencoders vs Denoising Autoencoders

Furthering our investigation of Autoencoders: we mentioned in Chapter 4 that generally we can expect any previously unseen input data to contain noise. The idea of the Neural Network was to learn an approximation to the underlying function using noisy data, and thus be able to predict an accurate output for previously unseen noisy input data. We shall now investigate a method of training an Autoencoder with a non-noisy data set by forcing noise upon the inputs. This will constitute a Denoising Autoencoder, which in theory should be a better approximator because the Autoencoder should learn how to deal with noise.

5.3.1 Stochastic Gradient Descent (SGD)

A Denoising Autoencoder is essentially a stochastic version of an Autoencoder. So, to get a better grasp of stochastic variants of this technology, let's first consider the stochastic variant of gradient descent (GD).

In our Backpropagation algorithm, GD is imposed as the basis of learning. After all the training data has been put through the network, the total Error is calculated as an average of the errors for each individual sample from the training data set. Only then does the Backpropagation algorithm perform updates on the weights. If the training data set is huge, say 10^6 samples, then it is going to take an incredibly long time before the network learns from just one epoch. This is computationally expensive and also time consuming. If you then intend to train your network over thousands or millions of epochs then perhaps GD is nonsensical.

To adjust this approach, Stochastic Gradient Descent (SGD) was introduced. Instead of running all training samples through the network before learning occurs, a pre-defined number of samples randomly chosen from the training set are run through the network. This smaller number of samples is then used to form an estimate of the true gradient over the entire training set. This estimated gradient is then used to update the weights. If more than one training sample is used before learning occurs, this is referred to as mini-batch SGD. This ensures that learning starts occurring a lot sooner than in GD, and thus requires less training time and is cheaper computationally.
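
To make the mechanics concrete, here is a minimal sketch of one epoch of mini-batch SGD on a generic weight vector. It is not the code used elsewhere in this report: the helper grad(w, X_batch, t_batch), the batch size and the learning rate N are all illustrative assumptions.

import numpy as np

def sgd_epoch(w, X, t, grad, batch_size=32, N=0.5):
    # X and t are numpy arrays of training inputs and targets; grad returns the
    # gradient of the error over the given batch (an assumption for this sketch)
    idx = np.random.permutation(len(X))          # visit the samples in a random order
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]    # a small random subset of the training set
        g = grad(w, X[batch], t[batch])          # estimate of the true gradient
        w = w - N * g                            # weights update after every mini-batch,
                                                 # not once per full pass through the data
    return w

With batch_size equal to the size of X this reduces to ordinary GD, and with batch_size=1 it is plain SGD.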

The big question is: does this affect the learning result? In practice, SGD finds a set of weights close to a minimum much faster, because it starts adjusting the weights much sooner [44]. This means that SGD is far more common than GD in practice. For more information, Andrew Ng’s lecture notes [44] contain evidence of this with a practical example, and a 2010 paper by Bottou called “Large-Scale Machine Learning with Stochastic Gradient Descent” shows empirical evidence for the success of SGD on massive data sets [45].

5.3.2 The Denoising Autoencoder

As previously mentioned, a Denoising Autoencoder is a stochastic Autoencoder, and the difference between the two is similar to the variation between SGD and GD which we have just discussed. The difference here is that instead of limiting the number of input samples seen before a weight update, we limit the number of input elements that take values other than 0. Naturally, not all previously unseen data will have perfect input values for the underlying function we want to predict; this unseen data may have noise. By adding noise to the inputs ourselves during training, we force the Autoencoder to learn even more robust features of the input data.

So how do we do this? Given an input vector, the intention is to corrupt it by randomly setting a pre-defined proportion of its elements to 0 each time, so that the Autoencoder has to predict the values of the corrupted elements using the uncorrupted ones. The target vector remains the uncorrupted input vector [41]. As previously mentioned, it is ideal to normalise your input data so that x ∈ [0, 1]^d for some input vector x in d dimensions. This makes it simple to set some of the elements of x to 0 to simulate noise.
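
As a concrete illustration, the corruption step might look like the following minimal sketch; the function name corrupt and the default corruption_level are our own choices rather than anything prescribed in [41] or [46].

import numpy as np

def corrupt(x, corruption_level=0.25):
    # zero a randomly chosen, fixed proportion of the elements of the normalised input
    # vector x; the uncorrupted x remains the reconstruction target
    x = np.asarray(x, dtype=float)
    n_corrupt = int(round(corruption_level * x.size))
    idx = np.random.choice(x.size, size=n_corrupt, replace=False)
    x_tilde = x.copy()
    x_tilde[idx] = 0.0
    return x_tilde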

This concept was firmly investigated by Vincent et al. [46], in which they hypothesise that, and I quote:

“partially destructed inputs should yield almost the same representation”

as if the inputs were unchanged. They found that Denoising Autoencoders actually avoid the potential to learn the identity function discussed in Section 5.2.3. To be able to predict the corrupted input vector elements, the Denoising Autoencoder needs to be capable of finding the statistical dependencies between the input elements. The perspectives from which this can be considered are complex but are all covered in [46].

All we need to take away from this is that an algorithm exists to cope with noisy data when training an Autoencoder, which in fact eliminates the problem of learning the identity function and therefore allows our pre-training technique to be used on any MLP irrespective of hidden layer sizes.

5.4 Stacked Denoising Autoencoders for Pre-training

Finally, we have all the instruments necessary to implement our pre-training technique, which can avoid the Vanishing/Exploding Gradient Problems. We shall begin with an MLP which we would naturally train using supervised learning under the Backpropagation algorithm with a training set X which contains input vectors x and their respective targets t.

In Figure 5.2 we demonstrated how we can take an MLP and mirror the input and hidden layers around the output layer to form an Autoencoder. This is essentially our pre-training technique, except that instead of mirroring the entire MLP, we will consider it two layers at a time (i.e. one layer of weights at a time), starting with the input layer and first hidden layer. This is what constitutes the greedy layerwise approach to pre-training. Our aim now is to show that we can represent our MLP by stacking Denoising Autoencoders. Take the following MLP:

Figure 5.3: An example MLP of structure 6-4-3-1 before greedy layerwise pre-training (inputs x1 to x5 plus a bias b, a first hidden layer of three σ nodes plus a bias, a second hidden layer of two σ nodes plus a bias, and a single output f(x)).

Recall that b represents a bias and σ represents a sigmoid function (e.g. logistic, tanh, softsign). We can now form our first Autoencoder by taking the input layer and first hidden layer and, similarly to Figure 5.2, mirroring the input layer about the first hidden layer as follows:

Figure 5.4: An Autoencoder formed from the input layer and first hidden layer of Figure 5.3 (one panel shows the original input and first hidden layer, whose hidden nodes output h^{(1)}_1, h^{(1)}_2, h^{(1)}_3; the other shows the mirrored input layer and connections).

where h^{(1)}_i represents the output of hidden node i in layer l = 1, i ∈ {1, 2, 3}. The mirrored layers and connections are illustrated by faded versions of the originals, and this pattern will continue. Now that we have our Autoencoder, we can train it using the Denoising Autoencoder method. The training data X will be the inputs that the Autoencoder tries to reconstruct. This Denoising Autoencoder will attempt to minimise the reconstruction error and, due to the decrease in layer size, the hidden layer forces dimensionality reduction on the input space, which should cause the reconstructed inputs to contain specific features of the input space. Once this Denoising Autoencoder is trained, the updated weights should be better able to extract these specific features and thus be a better initialisation of the weights for the supervised learning which will occur after pre-training.
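
The following is a minimal sketch of one such Denoising Autoencoder layer, written with numpy rather than with the Appendix A classes; the class name, the use of tied (mirrored) weights, the Bernoulli-mask corruption and the plain gradient-descent update are our own illustrative assumptions, not choices prescribed by [43] or [46].

import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAutoencoderLayer:
    # One weight layer of the MLP, mirrored about its hidden layer (cf. Figure 5.4).
    def __init__(self, n_in, n_hidden, rng=np.random):
        self.W = rng.uniform(-0.5, 0.5, size=(n_in, n_hidden))    # shared (tied) weights
        self.b_hidden = np.zeros(n_hidden)                        # hidden-layer bias
        self.b_visible = np.zeros(n_in)                           # mirrored-layer bias

    def encode(self, x):
        return sig(x @ self.W + self.b_hidden)

    def reconstruct(self, h):
        return sig(h @ self.W.T + self.b_visible)                 # mirrored connections

    def train_step(self, x, corruption_level=0.25, N=0.1):
        x = np.asarray(x, dtype=float)
        # corrupt the input by zeroing elements at random (Section 5.3.2)
        x_tilde = x * (np.random.rand(x.size) >= corruption_level)
        h = self.encode(x_tilde)
        x_hat = self.reconstruct(h)
        err = x_hat - x                                           # error against the uncorrupted input
        d_vis = err * x_hat * (1 - x_hat)                         # sigmoid derivative on the mirrored layer
        d_hid = (d_vis @ self.W) * h * (1 - h)                    # backpropagated to the hidden layer
        self.W -= N * (np.outer(d_vis, h) + np.outer(x_tilde, d_hid))
        self.b_visible -= N * d_vis
        self.b_hidden -= N * d_hid
        return 0.5 * np.sum(err ** 2)                             # squared reconstruction error

Once such a layer has been trained on every sample in X, the set H^{(1)} is simply the collection of encode(x) vectors for x in X.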

We must now move on to the first and second hidden layers. Very much the same concept is used: we mirror the first hidden layer about the second hidden layer to form our Denoising Autoencoder and then train it as before. However, to do this we need a set of inputs to train it with. We cannot use our initial training set X because it is of the wrong dimension. Instead we will use the final outputs of the hidden layer of our first Denoising Autoencoder in Figure 5.4 (i.e. h^{(1)} ∈ H^{(1)}, where H^{(1)} denotes the set of vectors output by the hidden layer in Figure 5.4). This training set, H^{(1)}, is therefore a condensed representation of our training set X; it is the expected input to the first hidden layer at this point. We therefore get the following:

Figure 5.5: An Autoencoder formed from the first and second hidden layers of Figure 5.3 (one panel shows the original first and second hidden layers, whose second-layer nodes output h^{(2)}_1, h^{(2)}_2; the other shows the mirrored first hidden layer and connections).

where, similarly to before, h^{(2)}_j represents the output of hidden node j in layer l = 2, j ∈ {1, 2}. Therefore h^{(2)} ∈ H^{(2)}, where H^{(2)} denotes the set of vectors output by the hidden layer in Figure 5.5. The set H^{(2)} can now be used to train the final two layers of our MLP in Figure 5.3, the second hidden layer and output layer.

As these are the final two layers to be pre-trained, it now makes sense to change to supervised learning and use the target vectors from our original training set X as the targets for H^{(2)}. We therefore do not form a Denoising Autoencoder for the final two layers, but instead train them with the Backpropagation algorithm:

Figure 5.6: The final two layers of Figure 5.3 (the second hidden layer, whose nodes output h^{(2)}_1 and h^{(2)}_2, plus a bias b, and the output f(x)), trained with the supervised Backpropagation algorithm rather than as a Denoising Autoencoder.

Once all of this pre-training is completed, we discard the mirrored parts of our Denoising Autoencoders and rebuild our initial MLP from Figure 5.3 by stacking their remains. This forms our original MLP, except that instead of randomly initialised weights, the weights have been pre-trained so that the Error function is already in the vicinity of a minimum:

Figure 5.7: The rebuilt MLP of Figure 5.3 with pre-trained weights, to be fine-tuned by Backpropagation.

We finish off training the MLP by applying our usual Backpropagation algorithm (with SGD instead of GD if the training set is sufficiently large) as an optimisation technique to fine-tune these weights towards the minimum in whose vicinity they currently lie.

5.5 Summary of the Pre-training Algorithm

We begin with an MLP with L + 1 layers (noting that our first layer is l = 0) such that L ≥ 2, so that we have at least one hidden layer, and a training set X ⊆ [0, 1]^d for some d ∈ N which contains input vectors x and their respective target vectors t. Denote by H^{(l)} the final set of vectors output from a trained Denoising Autoencoder’s hidden layer, where l denotes the layer these vectors are output from. For example, H^{(1)} represents the set of vectors output from the hidden layer of our first Denoising Autoencoder. A code sketch of the full procedure follows the list below.

1. Take the first two layers l = 0, 1 and the weighted connections between them.

2. Mirror layer l = 0 about layer l = 1 to form the first Denoising Autoencoder.

3. Using the original training set X, train this first Denoising Autoencoder to reconstruct its own inputs as described in Section 5.3.2.

4. Once trained, the final output of the hidden layer of this Denoising Autoencoder for each training sample from X is saved in H^{(1)}.

5. Repeat steps 1-4 for all pairs of layers l, l + 1 with l ∈ {1, 2, . . . , L − 2}, replacing l = 0 with l and X with the appropriate H^{(l)}, each time forming a new training set H^{(l+1)} for the following Denoising Autoencoder.

6. Train the final two layers l = L − 1 and l = L using the Backpropagation algorithm, with H^{(L−1)} as the input data and the target vectors in X as the targets.


7. Discard all mirrored connections and layers from the Denoising Autoencoders and rebuild the original MLP from their remains, giving the original MLP with pre-trained weights.

8. Fine-tune the network using the original Backpropagation algorithm and the training set X.
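
Putting the steps together, the following is a compact sketch of steps 1-5 (with steps 6-8 indicated in comments). It reuses the illustrative DenoisingAutoencoderLayer class sketched in Section 5.4, and the epoch count, corruption level and learning rate are placeholder values rather than recommendations from the cited papers.

import numpy as np

def pretrain_greedy_layerwise(layer_sizes, X, epochs=10, corruption_level=0.25, N=0.1):
    # layer_sizes lists the node counts per layer (excluding biases), e.g. [5, 3, 2, 1];
    # every weight layer except the last is pre-trained as a Denoising Autoencoder.
    layers = []
    H = np.asarray(X, dtype=float)                       # step 3 starts from the original inputs
    for n_in, n_hidden in zip(layer_sizes[:-2], layer_sizes[1:-1]):
        dae = DenoisingAutoencoderLayer(n_in, n_hidden)  # steps 1-2: mirror two layers
        for _ in range(epochs):
            for x in H:
                dae.train_step(x, corruption_level, N)   # step 3: denoising training
        H = np.array([dae.encode(x) for x in H])         # step 4: save H^(l)
        layers.append(dae)                               # step 5: move on to the next pair
    return layers, H

# Steps 6-8 (sketched): train the final weight layer on (H^(L-1), targets) with ordinary
# Backpropagation, initialise the MLP's weight matrices from [dae.W for dae in layers]
# plus that final layer, and then fine-tune the whole network on X with Backpropagation.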

5.6 Empirical Evidence to Support Pre-training with Stacked Denoising Autoencoders

In 2006, Hinton et al. proposed a method of initialising the weights of a deep Neural Network so that Gradient Descent could provide a more accurate solution than was available with random weight initialisations [42]. They called this greedy layerwise pre-training and used the same method as above, but instead of stacking Autoencoders, Restricted Boltzmann Machines were stacked. Restricted Boltzmann Machines (RBMs) are made up of just an input and a hidden layer and are taught to learn a probability distribution over the training inputs, with the intention of maximising these probabilities for an accurate approximation network. As RBMs consist of just two layers which learn feature activations in the form of these probabilities, they were easily applicable to stacking.

In 2007, Bengio et al. [43] expanded on the greedy layerwise pre-training proposed by Hinton et al. [42] by applying it to Autoencoders instead of the more complex Restricted Boltzmann Machines. The comparisons were made by setting up Autoencoders to classify digits from the MNIST database [47].

As an aside, the MNIST database contains 60,000 training images and 10,000 test images of handwritten digits from 0-9 and was constructed from NIST’s Special Database 3 and Special Database 1, which contain an even larger set of binary handwritten digits [47]. This database was created to help standardise digit classification research within the Neural Network community and is used regularly for comparisons between different training formats on different structures of Neural Networks.
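
For reference, the raw MNIST files use a simple binary (IDX) format, so they can be read with a short helper like the minimal sketch below; it assumes the files have already been downloaded from http://yann.lecun.com/exdb/mnist/ and decompressed, and the local file names shown in the usage comment are assumptions about your setup.

import struct
import numpy as np

def load_idx_images(path):
    with open(path, 'rb') as f:
        magic, n, rows, cols = struct.unpack('>IIII', f.read(16))  # big-endian header
        assert magic == 2051, 'not an IDX image file'
        pixels = np.frombuffer(f.read(), dtype=np.uint8)
    # scale intensities to [0, 1], matching the normalisation used elsewhere in this report
    return pixels.reshape(n, rows * cols).astype(np.float64) / 255.0

def load_idx_labels(path):
    with open(path, 'rb') as f:
        magic, n = struct.unpack('>II', f.read(8))
        assert magic == 2049, 'not an IDX label file'
        return np.frombuffer(f.read(), dtype=np.uint8)

# Example usage (assumed local file names):
# X_train = load_idx_images('train-images-idx3-ubyte')
# t_train = load_idx_labels('train-labels-idx1-ubyte')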

Bengio et al. established that greedy unsupervised layerwise pre-training gave much better results on MNIST digit classification than deep Neural Networks trained in the standard way with a form of gradient descent such as Backpropagation. Given that a difference of more than 0.1% is significant on this database, the deep Neural Networks pre-trained with Stacked Autoencoders had a test error of 1.4% compared with a test error of 2.4% for a deep Neural Network without pre-training [43]. This is a significant improvement.

Furthermore, in 2007 the then-best algorithms for classifying images using variations of the MNIST digit classification problem [47] were rigorously studied by Larochelle et al. [48]. In a 2008 paper, Vincent et al. compared an algorithm using Stacked Denoising Autoencoders to the results from the Larochelle et al. paper [46].

Vincent et al. experimented with different levels of noise in their Denoising Autoencoders, setting from 0% to 50% of the input vector elements to zero. The Stacked Denoising Autoencoders with the best classification results were then compared with the best results from Larochelle et al.’s paper [46, 48]. The Stacked Denoising Autoencoder performed better than all the previously best algorithms on 7 of the 8 classification tasks. The explicit results can be found in Section 5 of the paper by Vincent et al., and I would encourage reading this section for full understanding [46].

Simultaneously, Vincent et al. compared the Stacked Denoising Autoencoder with the normal Stacked Autoencoder and found that, and I quote:

“...the corruption+denoising training works remarkably well as an initialization step, and in most cases yields significantly better classification performance than basic autoencoder stacking with no noise.”

This concretely confirmed the advantage of Stacked Denoising Autoencoders over standard Stacked Autoencoders, as well as over all the other learning algorithms in [48].

5.7 Conclusions

The intention of this training method was to avoid the Vanishing/Exploding Gradient Problems. Pre-training successfully eradicates these problems because we only ever train two layers at a time until the fine-tuning of the weights. Fine-tuning is an appropriate term because, by this point, the pre-training has done the bulk of the work of updating the weights, and the fine-tuning is only used to make very small corrections to them. These small corrections are therefore not affected by the Vanishing/Exploding Gradient Problems.

Importantly, we also avoid the potential for Overfitting. This is because we use Stacked Denoising Autoencoders, no bigger than a total of 3 layers each, which attempt to reconstruct the inputs rather than learn how to send the inputs to their appropriate targets. As this is a form of unsupervised learning, the MLP can easily avoid Overfitting by ceasing training appropriately, as discussed in Section 3.3.3. There is a risk of Underfitting in training the MLP if the number of parameters within the MLP is not great enough to accommodate all the mappings to begin with. However, this is easily avoided by ensuring a sufficient number of hidden nodes and layers.

Another important feature of this method was the choice to stack Denoising Autoencoders instead of plain Autoencoders. Although Autoencoders could do almost as good a job as long as the hidden layers reduce in size, Denoising Autoencoders run no risk of learning the identity function irrespective of layer sizes [46], and therefore this algorithm can be applied to all MLPs.

The empirical evidence investigated shows that Stacked Denoising Autoencoders perform much better than standard Stacked Autoencoders [46]. It also shows that pre-training algorithms such as the greedy unsupervised layerwise pre-training embodied by Stacked Denoising Autoencoders perform significantly better than deep Neural Network counterparts which receive no form of pre-training [43].

We therefore finally have a method of training our MLPs which can accurately yield approximations and classifications no matter the size of the network and its hidden layers. In fact, the only real constraints on the size of the MLP are the time it takes to train it and the cost of computing this training.


Chapter 6

Conclusion

6.1 Conclusion

In this report we have formed a good knowledge base of Neural Networks, with emphasis on feedforward networks such as the Multilayer Perceptron and Autoencoders. We have built up from the simplest form of Neural Network, the perceptron, to one of the most successful training algorithms for deep feedforward Neural Networks of any size (up to computational cost). We began by introducing the structure of an MLP, a fully connected feedforward Neural Network with at least a total of 3 layers, and described in detail the most common learning algorithm for MLPs, the Backpropagation algorithm, which is a form of Gradient Descent (GD). We finished our introduction to machine learning by considering the parameters associated with building a Neural Network.

In Chapter 2 we investigated a practical example of an MLP and the types of problems that can be faced when setting the initial parameters of a network, such as an appropriate weight initialisation, the learning rate and the size of the training data set. A number of interesting situations occurred, most notably the effect the random weight initialisation truly has on a Neural Network. As important as the learning rate can be in ensuring convergence to a “good” local minimum or the global minimum, we gave evidence to suggest that, no matter the learning rate, it is not always possible to escape “bad” local minima because the random weight initialisation generates a different Error function for each network. This caused the learning rate to be appropriate for some initialisations but a poor choice for others. We were able to conclude that the most important factor in a network’s setup is the weight initialisation technique. Unfortunately, this is a very difficult problem to solve and is still very active in research today, so before tackling common concepts to battle the weight initialisation we set about discovering the limitations of the MLP.

We continued with the Universal Approximation Theorem in Chapter 3. Informally, this states that any continuous function can be approximated ε-close by a single hidden layer MLP, assuming the activation function is non-constant, bounded and monotonically increasing. With the help of Cybenko’s paper [19], we sketched a proof and found that this theorem only told us that the approximation of any continuous function was possible. It did not give us any suggestion of how many hidden nodes would be necessary. This brought two more problems with the architecture of a single hidden layer MLP: Overfitting and Underfitting. Too few nodes in the hidden layer risked too few parameters, which could prevent the network from learning all the mappings of the training data appropriately, forming a poor approximator. Similarly, too many nodes could allow the network to form extremely complex functions and learn the training data too well, causing poor generalisation and Overfitting. To date, there are no concrete theorems giving the number of hidden nodes to set a network up with, and thus experimentation is still required. However, as we have no bound, there is a significant concern that the hidden layer could require millions of nodes for ε-close approximation, and thus we went on to consider multiple hidden layers to alleviate computation time and cost.

Chapter 4 investigated reasons and examples for using MLPs with multiple hidden layers, for example Facial Recognition, which requires shapes to be built up through the network, becoming more complex layer by layer. We found that by restructuring the hidden nodes from a single layer into two or more hidden layers, the number of connections and nodes could drastically decrease while still allowing the network to hold enough mappings to generalise just as well. This is extremely useful for massive networks because fewer connections and nodes mean fewer calculations during training, which decreases computational cost as well as training time. However, the increase in hidden layers brought about another problem with training: the Vanishing/Exploding Gradient Problems. These problems arise from Gradient Descent’s update system. The algorithm causes exponential slowdown in weight updates as we approach the early layers of the network, up to the point of negligible change (or extreme change in the Exploding case). This causes a network to become unstable and learn extremely poorly unless the weights are initialised in a very narrow band of avoidance, a statistically unlikely event.

To find a solution to the Vanishing/Exploding Gradient Problems, in Chapter 5 we turned to a pre-training concept first proposed by Hinton in 2006 [42]. This method was applicable to Autoencoders, and we first researched a method for Autoencoders to deal with noisy input data. This was found in Denoising Autoencoders, which have a greater ability to generalise and yield better representations [46]. Importantly, Vincent et al. established that Denoising Autoencoders are capable of avoiding the identity function, making them invaluable. Finally, we noticed that we could represent any MLP as a set of Stacked Denoising Autoencoders and, combined with the avoidance of the identity function, any MLP could be pre-trained using this technology. This algorithm was explored before concluding with empirical evidence that MLPs with greedy unsupervised layerwise pre-training using structures such as Stacked Denoising Autoencoders fare significantly better than MLPs without any form of pre-training [42, 43, 46, 48].

So, all in all, this project has started from the simplest Neural Network, the perceptron, which was incapable of classifying data that is not linearly separable, and peaked at the ability to successfully train an MLP of any size using greedy unsupervised layerwise pre-training techniques, which can approximate any continuous function under simple conditions on the activation function.

6.2 Future Work

There are still a number of open issues within the Neural Network community. Pre-training performs extremely well, but research is still heavily limited by computational cost (hence the introduction of Stochastic Gradient Descent over Gradient Descent) and the time it takes to train Neural Networks. There are always directions to head in to improve the accuracy of networks even further. Moreover, the kinds of software Neural Networks could be used for, such as voice recognition on the Windows Phone [2], in most cases struggle to include the storage space necessary to contain or run the programs. I will briefly mention a few further methods of improving Neural Networks which could be the basis of future work.


In Chapter 5, we considered stacking Autoencoders, despite the proposal of Hinton et al. originally suggesting the use of Restricted Boltzmann Machines (RBMs) [42]. RBMs have a distinct advantage over Autoencoders because they find a probability distribution over the set of inputs, using energy configurations applied to the input vectors to best encode and represent the input space. Autoencoders only have the ability to minimise the reconstruction error on the inputs, which is based on the Backpropagation algorithm, an algorithm well documented to have significant flaws. RBMs are essentially a stochastic version of Autoencoders, and we have seen how stochastic measures in Neural Networks reap rewards in the form of Stochastic Gradient Descent and Denoising Autoencoders; intuitively, RBMs should also provide a distinct advantage. Therefore, investigation into the true workings of RBMs and their computational power relative to Autoencoders would be a natural next step of research.

A significant portion of this report focussed on avoiding learning problems such as Overfitting. Overfitting is a common problem when small training sets are used with large Neural Networks. Sometimes large Neural Networks are required to represent the entire possible input space, but only a small training set may be available for the Neural Network to learn to generalise from. A concept put forward by Hinton et al. in 2012 is called dropout, in which a proportion of the network’s nodes are “turned off”, eliminating their connections too, as in Figure 6.1:

Figure 6.1: A standard 2 layer MLP: untouched on the left, and on the right an example of the dropout concept (image taken from [49]).

Generally no more than half of the nodes in each layer of the network are “dropped out”. However, in a 2014 paper called “Dropout: A simple way to prevent neural networks from overfitting”, Srivastava et al. found that the optimal probability of retention in the input layer is closer to 1 than 0.5 [49]. Hinton et al. found in 2012 that, and I quote:

“Overfitting is greatly reduced by randomly omitting half of the feature detectors (nodes) on each training case”.
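
As a rough illustration of the mechanics, the effect of dropout on a layer’s activations can be sketched as below; dropout is not implemented anywhere else in this report, the function name is ours, and the test-time scaling by the retention probability follows the scheme described in [49].

import numpy as np

def dropout(a, p_retain=0.5, training=True):
    # During training each node is kept with probability p_retain and silenced otherwise;
    # at test time all nodes are kept but their outputs are scaled by p_retain so that the
    # expected input to the next layer matches what was seen during training.
    a = np.asarray(a, dtype=float)
    if training:
        mask = (np.random.rand(*a.shape) < p_retain).astype(float)
        return a * mask
    return a * p_retain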

Further investigation into the concept of dropout, and into combining it with greedy unsupervised layerwise pre-training techniques, is of further interest and is still being researched by leading Neural Network experts such as Hinton himself.


6.3 Acknowledgements

Firstly, I would like to acknowledge the efforts of my project supervisor, Dr. Kasper Peeters, for his advice and guidance throughout this report. He receives my wholehearted thanks.

Acknowledgement must also go to the following software that made my report possible:

• Enthought Canopy: used for the scripting requirements of this project.

• Enthought Python Distribution: used for the coding requirements and graphical data representations in this report.

• LaTeX: used for the production of this report.

• tikz: used to generate the Neural Network figures throughout the report.


Bibliography

[1] University of Wisconsin. Introduction to neural networks, unknown. Available at: http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html.

[2] James Plafke. Bing uses deep neural networks to double speed of Windows Phone voice recognition, 2013.

[3] Anthony Zacknich and Sue K Baker. An analysis of sheep rumination and mastication. 1998.

[4] Eric Roberts’ Sophomore College. NN applications, 2000. Available at: http://cs.stanford.edu/people/eroberts/courses/soco/projects/2000-01/neural-networks/Applications/miscellaneous.html.

[5] F Rosenblatt. The perceptron: a perceiving and recognizing automaton, 1957.

[6] H Abdi. A neural network primer. 1994. Available at: http://eebweb.arizona.edu/faculty/dornhaus/courses/materials/papers/Abdi

[7] Jacob Janecek. The simple perceptron, 2007. Available at: http://aass.oru.se/~lilien/ml/seminars/2007_02_01b-Janecek-Perceptron.pdf.

[8] Marvin Minsky and Seymour Papert. Perceptrons. 1969.

[9] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation, 1985.

[10] Paul J. Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences, 1974.

[11] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986. Available at: http://www.cs.toronto.edu/~hinton/absps/naturebp.pdf.

[12] Raul Rojas. Neural networks: a systematic introduction. Springer, 1996.

[13] Anand Venkataraman. The backpropagation algorithm, 1999. Available at: http://pandamatak.com/people/anand/771/html/node37.html.

[14] Enthought. Enthought Canopy, 2014. Software, Version 1.3.0.

[15] Enthought. Enthought Python Distribution, 2012. Software, Version 7.3.

[16] Neil Schemenauer. Back-propagation neural networks, 2014. Available at: http://arctrix.com/nas/python/bpnn.py.

[17] Python. Random, 2014. Available at: https://docs.python.org/2/library/random.html.

[18] Kasper Peeters. Machine learning and computer vision. Durham University report, 2014.

[19] G Cybenko. Approximation by Superpositions of a Sigmoidal Function, pages 303–314. 1989.

[20] Matus Telgarsky. Universal approximation theorem - neural networks, 2013. Available at: http://cstheory.stackexchange.com/questions/17545/universal-approximation-theorem-neural-networks.

[21] Hans Hahn. Über lineare Gleichungssysteme in linearen Räumen. Journal für die reine und angewandte Mathematik, 157:214–229, 1927.

[22] Stefan Banach. Sur les fonctionnelles linéaires. Studia Mathematica, 1(1):211–216, 1929.

[23] Felix Hausdorff. Set theory, volume 119. American Mathematical Soc., 1957.

[24] Frederic Riesz. Sur les opérations fonctionnelles linéaires. Comptes Rendus Acad. Sci. Paris, 149:974–977, 1909.

[25] Andrei Markov. On mean values and exterior densities. 4:165–190, 1938.

[26] Shizuo Kakutani. Concrete representation of abstract (M)-spaces (a characterization of the space of continuous functions). Annals of Mathematics, pages 994–1024, 1941.

[27] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. pages 251–257, 1991.

[28] Bias and variance, under-fitting and over-fitting, 2004. Available at: http://www.cs.bham.ac.uk/~jxb/NN/l9.pdf.

[29] Dustin Stansbury. Model selection: Underfitting, overfitting, and the bias-variance tradeoff, 2013. Available at: https://theclevermachine.wordpress.com/2013/04/21/model-selection-underfitting-overfitting-and-the-bias-variance-tradeoff/.

[30] Dustin Stansbury. Supplemental proof 1, 2013. Available at: https://theclevermachine.wordpress.com/2013/04/21/supplemental-proof-1/.

[31] UK Government. Passport photo requirements, 2014. Available at: https://www.gov.uk/photos-for-passports.

[32] U.S. Department of State. Passport photo requirements, 2014. Available at: http://travel.state.gov/content/passports/english/passports/photos/photos.html.

[33] Nicola Jones. Computer science: The learning machines, 2014. Available at: http://www.nature.com/news/computer-science-the-learning-machines-1.14481#/face.

[34] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 609–616. ACM, 2009.

[35] Arthur E Bryson, Walter F Denham, and Stewart E Dreyfus. Optimal programming problems with inequality constraints. AIAA Journal, 1(11):2544–2550, 1963.

[36] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen, 1991.

[37] Michael A. Nielsen. Neural Networks and Deep Learning. 2015. Available at: http://neuralnetworksanddeeplearning.com/chap5.html.

[38] James Martens. Deep learning via Hessian-free optimization, 2010. Available at: http://www.cs.toronto.edu/~asamir/cifar/HFO_James.pdf.

[39] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[40] James Bergstra, Guillaume Desjardins, Pascal Lamblin, and Yoshua Bengio. Quadratic polynomials learn better image features. Technical Report 1337, Département d’Informatique et de Recherche Opérationnelle, Université de Montréal, 2009.

[41] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

[42] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[43] Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 19:153, 2007.

[44] Andrew Ng. CS229 lecture notes, 2012. Available at: http://cs229.stanford.edu/notes/cs229-notes1.pdf.

[45] Leon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.

[46] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.

[47] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The MNIST database of handwritten digits, 1998. Available at: http://yann.lecun.com/exdb/mnist/.

[48] Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, pages 473–480. ACM, 2007.

[49] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.


Appendix A

Single Layer MLP Python Code

#This code is heavily based upon "http://arctrix.com/nas/python/bpnn.py". I will explicitly say
#before each definition which bits I've changed from the original, which from now on will be
#referred to as "bpnn.py". All annotations are of my own words and input.
#One main difference is that I eliminated the momentum factor from this code as we are not using one.
#I have disregarded the "import string" from bpnn.py as it isn't used either in bpnn.py or in
#the following code.

import math
import random
import matplotlib.pyplot as plt  #for plotting the error function later <- the coding for all the
                                 #plotting is original
import numpy as np  #used in plotting later

random.seed()  #this method sets the integer starting value for generating random numbers. We
               #have ignored its argument to allow the neural network the ability to fail in
               #finding a global minimum

#(Definition quoted from bpnn.py) calculate a random number for weight initialisation later
#with: a <= rand < b
def rand(a, b):
    return (b-a)*random.random() + a

#(Definition quoted from bpnn.py) need to define matrix generation for calculations later such
#as weight correction
def makeMatrix(I, J, fill=0.0):
    m = []
    for i in range(I):
        m.append([fill]*J)  #the append function just adds data to a list, so here
                            #it is adding rows of values to my matrix
    return m

#This definition is different because bpnn.py uses the tanh sigmoid function, not the logistic
#sigmoid function used here.
def sig(x):
    return 1/(1 + math.exp(-x))

def dsig(x):
    return sig(x)*(1-sig(x))

#ni = number of input nodes, nh = hidden, no = output
#__init__ is a constructor so later NeuralNetwork(ni, nh, no) will build my net
#self is an 'object' which refers to the instance itself; self.ni is used like math.exp()
class NeuralNetwork:
    #Definition quoted from bpnn.py
    def __init__(self, ni, nh, no):
        self.ni = ni + 1  #+1 for bias
        self.nh = nh + 1  #+1 for bias
        self.no = no

        #(Definition quoted from bpnn.py) activations for nodes
        self.ai = [1.0]*self.ni
        self.ah = [1.0]*self.nh
        self.ao = [1.0]*self.no

        #(Definition based upon bpnn.py - I have modified the matrix prefix from "wi -> wih"
        #and "wo -> who" for personal clarity) create weights with random values
        self.wih = makeMatrix(self.ni, self.nh)  #wih means weights from input to hidden
        self.who = makeMatrix(self.nh, self.no)  #who means weights from hidden to output
        for i in range(self.ni):
            for j in range(self.nh):
                self.wih[i][j] = rand(-0.5, 0.5)
        for j in range(self.nh):
            for k in range(self.no):
                self.who[j][k] = rand(-5, 5)  #larger than before because the sigmoid function has
                                              #an output bound between 0 and 1 and hence can't reach
                                              #large numbers without larger weights

    #(Definition based upon bpnn.py - changed "update -> nodeoutput" for personal clarity) this
    #defines the outputs of each node
    def nodeoutput(self, inputs):
        #for the input nodes: the -1 is because of the bias; i only varies over the variable
        #inputs, we haven't defined the number of input nodes yet.
        for i in range(self.ni-1):
            self.ai[i] = inputs[i]

        #hidden nodes, -1 again for bias.
        for j in range(self.nh-1):
            sum = 0.0  #sum = 0.0 so that it doesn't recall the previous sum when iterating.
            for i in range(self.ni):
                sum = sum + self.ai[i] * self.wih[i][j]
            self.ah[j] = sig(sum)

        #output nodes
        for k in range(self.no):
            sum = 0.0
            for j in range(self.nh):
                sum = sum + self.ah[j] * self.who[j][k]
            self.ao[k] = sum  #modified from bpnn.py to reflect our output nodes not using the
                              #logistic function

        #this returns the outputs as an array as expected
        return self.ao[:]

    #(Definition based upon bpnn.py - modified to reflect my use of the logistic function only
    #in the hidden layer) now to define our backpropagation algorithm. N = learning rate.
    def backPropagate(self, targets, N):
        #error terms for the output nodes
        output_deltas = [0.0] * self.no
        for k in range(self.no):
            error = targets[k]-self.ao[k]
            output_deltas[k] = error

        #error terms for hidden nodes
        hidden_deltas = [0.0] * self.nh
        for j in range(self.nh):
            error = 0.0  #error = 0.0 again so the system doesn't remember the error from the
                         #previous iteration
            for k in range(self.no):
                error = error + output_deltas[k]*self.who[j][k]
            hidden_deltas[j] = dsig(self.ah[j]) * error

        #update who[j][k] i.e. weights from hidden to output nodes
        for j in range(self.nh):
            for k in range(self.no):
                change = output_deltas[k]*self.ah[j]
                self.who[j][k] = self.who[j][k] + N*change

        #update wih[i][j]
        for i in range(self.ni):
            for j in range(self.nh):
                change = hidden_deltas[j]*self.ai[i]
                self.wih[i][j] = self.wih[i][j] + N*change

        #calculate the output error in the first place
        error = 0.0  #for reasons as before
        for k in range(len(targets)):  #len() is the length function where "targets" is a vector
                                       #formed by the target outputs
            error = error + 0.5*(targets[k]-self.ao[k])**2  #this is the squared error function.
        return error

    #(Definition quoted from bpnn.py)
    def train(self, patterns, epochs=10001, N=0.5):
        for i in range(epochs):
            error = 0.0
            for p in patterns:
                inputs = p[0]  #when defining my input and output from an array like
                               #[ [[1], [1]], [[2], [4]] ] etc., p[0] means the first element of each pair
                targets = p[1]
                self.nodeoutput(inputs)
                error = error + self.backPropagate(targets, N)
            if i % 1000 == 0:  #this is equivalent to saying "if i = 0 (mod 1000)" remembering
                               #that i is the current epoch.
                print('The error: %.9f' % error)

    #(Definition quoted from bpnn.py) test function
    def test(self, patterns):
        for p in patterns:
            print(p[0], '->', self.nodeoutput(p[0]))  #e.g. should print something like "1 -> 1"

    #original code
    def plot(self, patterns):
        for p in patterns:
            w, = plt.plot(p[0], self.nodeoutput(p[0]), 'ko', label='Network Prediction')
            #plot the points defined later and the network's prediction of each point with
            #black circles and the label shown
        x = np.linspace(0, 1, num=1000)  #generates x as 1000 equally spaced numbers
                                         #between 0 and 1
        y = x**2
        z, = plt.plot(x, y, 'm-', label='f(x)=x^2', linewidth=2.5)  #plot x and y as above with a
                                                                    #magenta solid line and the label shown
        plt.axis([-0.1, 1.2, -0.1, 1.2])  #axis definition ([xmin, xmax, ymin, ymax])
        plt.legend([z, w], ['f(x) = x^2', 'Network Prediction'], loc='upper left')
        plt.grid(True)
        plt.show()

#Definition based upon bpnn.py - changed to reflect our need for the x**2 function.
def trial():
    #Teach the network f(x) = x^2 with values between 0 and 1 due to the bound on the sigmoid
    #function image
    pattrain = [
        [[0], [0]],
        [[0.1], [0.01]],
        [[0.2], [0.04]],
        [[0.3], [0.09]],
        [[0.4], [0.16]],
        [[0.5], [0.25]],
        [[0.6], [0.36]],
        [[0.7], [0.49]],
        [[0.8], [0.64]],
        [[0.9], [0.81]],
        [[1], [1]]
    ]

    n = NeuralNetwork(1, 2, 1)
    n.train(pattrain)
    print(' ')  #puts a gap between the errors and the test results
    n.test(pattrain)
    print(' ')  #puts a gap between the test results and the weights

    #(Based upon bpnn.py at "def weights(self)" - I've put it in a different place so that it
    #actually works) this will print out the weights for my neural network immediately after
    #the errors and test results
    print('Input to Hidden weights: ')
    for i in range(n.ni):
        print(n.wih[i])
    print(' ')  #leaves a gap between the two sets of weights
    print('Hidden to Output weights: ')
    for j in range(n.nh):
        print(n.who[j])

    #this will find the outputs of my network for 100 numbers equally spaced between 0 and 1,
    #stored in the same format as pattrain, for use with the plot method defined in my
    #NeuralNetwork class
    z = []
    for i in np.linspace(0, 1, num=100):
        z.append([[i], n.nodeoutput([i])])  #note that n.nodeoutput([i]) returns a number, say b,
                                            #in the format [b] so no extra square brackets are needed
    n.plot(z)  #from original coding

if __name__ == '__main__':
    trial()

[16]


Appendix B

Python Output Figures from Section 2.3.4

Figure B.1: The Python output code corresponding to Figure 2.8a


Figure B.2: The Python output code corresponding to Figure 2.8b
