8 Neural networks
Data mining would not be the same without neural networks, which lie at the root of certain
descriptive and predictive methods of data mining. These networks have become widely used,
owing to their modelling power (they can approximate any sufficiently regular function), with
excellent results across a broad range of problems, even when faced with complex phenomena,
irregular forms, and data that are difficult to grasp and follow no particular probability law.
In some cases, however, their use is impeded by certain difficulties in implementation, such as
the ‘black box’ nature of the networks, the delicacy of the necessary adjustments, the amount
of computing power required, and especially the risks of overfitting and convergence to a
globally non-optimal solution.
This chapter has been placed before the chapters on clustering, classification and
prediction methods, because neural networks are used both for clustering (Kohonen networks)
and classification and prediction (perceptrons, radial basis function networks). Any reader not
interested in the details of these methods may skip this chapter.
8.1 General information on neural networks
Following the initial description of a formal neuron by McCulloch and Pitts in 1943, the first
neural networks appeared in 1958 with the ‘perceptron’ of Rosenblatt. They were developed
rapidly in the 1980s and have been used widely in industry since the 1990s. A neural network
has an architecture based on that of the brain, organized in neurons and synapses, and takes the
form of a set of interconnected units (or formal neurons), with each continuous input variable
corresponding to a unit at a first level, called the input layer, and each category of a qualitative
variable also corresponding to a unit of the input layer. In some cases, when the network is
used in a predictive technique, there may be one or more dependent variables: in this case each
of them corresponds to one unit (or several units in the case of qualitative variables – see
below) at a final level, called the output layer. Predictive networks are called ‘supervised
learning’ networks, and descriptive networks are called ‘unsupervised learning’ networks.
Data Mining and Statistics for Decision Making, First Edition. Stéphane Tufféry.
© 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd. ISBN: 978-0-470-68829-8
Units belonging to an intermediate level, the hidden layer, are sometimes connected between
the input layer and the output layer. There may be several hidden layers.
A unit receives values at its input and returns 0 to n values at the output. All these values
are normalized so that they lie between 0 and 1 (or sometimes between −1 and 1, depending
on the limits of the transfer function described below). A combination function calculates
a first value from the units connected at the input and the weight of the connections. Thus,
in the most widely used networks, this is the weighted sum Σi ni pi of the input values ni of
the units. To determine an output value, a second function, called the transfer function
(or activation function), is applied to this value. The units in the input layer are simple, in the
sense that they do not create any combinations but only transmit the values of the variables
corresponding to them.
Thus a perceptron unit takes the form shown in Figure 8.1. The notation used in this
diagram is as follows:

- ni is the value of unit i at the preceding level (the summation over i corresponds to all the
units at the preceding level connected to the unit being observed);
- pi is the weight associated with the connection between unit i and the observed unit;
- f is the transfer function associated with the observed unit.
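As an illustration, the unit of Figure 8.1 can be sketched in a few lines of Python. This is a minimal sketch, not the book's own code: the function name is illustrative, and the logistic sigmoid is used as the default transfer function since it is the most common choice described below.

```python
import math

def unit_output(inputs, weights, f=lambda x: 1.0 / (1.0 + math.exp(-x))):
    """Formal neuron: combination function (weighted sum of the input
    values ni with the connection weights pi), then transfer function f."""
    combined = sum(n * p for n, p in zip(inputs, weights))
    return f(combined)
```

Passing `f=lambda x: x` gives the linear transfer function mentioned later in this section; the default logistic keeps the output in [0, 1].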
The learning of the neural network takes place on the basis of a sample of the population
under study; it uses the individuals in the sample to adjust the weights of the connections
between the units. In the course of learning, the value delivered by the output unit is compared
with the actual value, and the weights pi of all the units are adjusted so as to improve the
prediction, by a mechanism which depends on the type of neural network. One mechanism
which is still widely used is ‘gradient back-propagation’, but there are more recent and more
effective ones such as the Levenberg–Marquardt, quasi-Newton, conjugate gradient, quick
propagation, and genetic algorithms (see Section 11.13). The network runs through the
learning sample many (often several thousand) times. Learning is completed when an optimal
solution1 has been found and the weights pi are no longer modified significantly, or when
a previously specified number of iterations have been run. At the end of the learning phase,
the network forms a function which associates the variables with each other. In the perceptron,
the transfer function may be the linear function f(x) = x, but it is best to choose a function that
behaves linearly in the neighbourhood of 0 (when the weights of the units are small) and non-
linearly at the limits, so that both linear and non-linear phenomena can be modelled. In almost
Figure 8.1 Unit of a neural network: the inputs n1, ..., nn are combined as Σ ni pi, and the transfer function f is applied to give the output f(Σ ni pi).
1 This optimum may be global or only local: see Section 8.7.1 on the multilayer perceptron.
all cases, a sigmoid function is chosen, more specifically the logistic sigmoid
s(x) = 1/(1 + e^(−x)) (which we will meet again in the section on logistic regression), or the
tangent sigmoid, which is simply the hyperbolic tangent function:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
There are other sigmoid functions, but the performance is very similar regardless of which
is chosen. Although these functions are constantly increasing, they can approach any
continuous function when they are combined with each other (see Section 8.7.1), in other
words when the activations of several units are combined.
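The similarity of the two sigmoids can be checked directly: the hyperbolic tangent is just an affine rescaling of the logistic, tanh(x) = 2·s(2x) − 1, so the two differ only in their output range ([−1, 1] versus [0, 1]). A quick sketch (function names illustrative):

```python
import math

def logistic(x):
    """Logistic sigmoid s(x) = 1/(1 + e^(-x)), values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_sigmoid(x):
    """Tangent sigmoid (hyperbolic tangent), values in (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))
```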
The capacity to handle non-linear relations between the variables is a major benefit of
neural networks.
The necessity of normalizing the values of the input data can be seen with the logistic
function (Figure 8.2). If this were not done, the data with large values would ‘crush’ the others,
and the adjustments of the weights would have no effect on the value 1/(1 + exp(−Σ ni pi)),
as this value does not vary greatly around 0 or 1 when the absolute value of Σ ni pi is large.
Also, the fact that all the values lie between 0 and 1 (or −1 and 1) means that a unit can receive
the output of a preceding unit at its input without encountering problems due to excessively
large values.
As a general rule, the stages in the implementation of a neural network for prediction or
classification are:
(i) identification of the input and output data;
(ii) normalization of these data;
Figure 8.2 The logistic function 1/(1 + exp(−x)).
(iii) establishment of a network with a suitable structure;
(iv) learning;
(v) testing;
(vi) application of the model generated by learning;
(vii) denormalization of the output data.
8.2 Structure of a neural network
The structure of a neural network, also referred to as its ‘architecture’ or ‘topology’, consists
of the number of layers and units, the way in which the different units are interconnected
(the choice of combination and transfer functions) and the weight adjustment mechanism. The
choice of this structure will largely determine the results that will be obtained, and is the most
critical part of the implementation of a neural network.
The simplest structure is one in which the units are distributed in two layers: an input layer
and an output layer. Each unit in the input layer has a single input and a single output which is
equal to the input (see Figure 8.3). The output unit has all the units of the input layer connected
to its input, with a combination function and a transfer function. There may be more than one
output unit. In this case, the resulting model is a linear or logistic regression, depending on
whether the transfer function is linear or logistic, and the weights of the network are the
regression coefficients.
The predictive power can be increased by adding one or more hidden layers between the
input and output layers (Figure 8.4). Although the predictive power increases with the number
Figure 8.3 Neural network with no hidden layer: the input units ni are connected to the output unit with weights pi, and the output is s(n1p1 + ... + nkpk).
of hidden layers and units in these layers, this number must nevertheless be as small as
possible, to ensure that the neural network does not simply store all the information from the
learning set but can generalize it, thus avoiding what is known as ‘overfitting’ (see Section
11.3.4), which occurs when the weights simply make the system learn the details of the
learning set, instead of discovering general structures. This happens when the size of the
learning set is too small in relation to the complexity of the model, which in this case means
the complexity of the network topology. This is discussed further in Section 8.4 below.
Whether or not a hidden layer is present, the output layer of the network can sometimes
have several units, when there are several classes to predict (Figure 8.5).
8.3 Choosing the learning sample
The learning of the neural network will be improved if it takes place on a sample that is
sufficiently rich to represent all the possible values of all the layers of the network, in other
words all the possible categories of each variable, at the input or at the output. A network can
only learn from the configurations that it has encountered during its learning: customers with
overdrafts of more than €1000 may be very much at risk, but if the network’s learning sample
did not include any of these, then the network will not be able to predict anything about them.
However, we should remember that the learning time increases greatly with the size of the
sample, as the neural network runs through its learning sample many times.
For the output variables, the learning sample must include all the categories in equal
proportions, even if some categories are more frequent in the real population (for example, we
must have as many ‘negative’ events as ‘positive’ ones, even if the ‘negative’ events are much
rarer in reality).
Figure 8.4 Neural network with a hidden layer: input units (age, income, no. of children, ...) are combined in two hidden units as s(Σ ni pi) and s(Σ ni qi), and the output unit returns s[Π.s(Σ ni pi) + Θ.s(Σ ni qi)].
8.4 Some empirical rules for network design
In a back-propagation network, at least 5–10 individuals will be needed to adjust each weight.
To increase the robustness of the network, it is advisable to have a single hidden layer for
a radial basis function network, and one, or in exceptional cases two, for the multilayer
perceptron: instead of adding a third hidden layer, it is better to modify other parameters,
retest with other initial weights, or reprocess the input data.
A network with n input units, a single hidden layer, m units in the hidden layer and k output
units has m(n + k) weights. We therefore need a sample of at least 5m(n + k) individuals for
the learning process. If the number of input units has to be reduced, because the learning
sample is too small, then the number of predictive variables must also be reduced. Suppose
that we wish to reduce 20 predictive variables to 10. We can test all the combinations of
10 variables, changing only two or three of them each time. This is time-consuming, but takes
into account the fact that some variables only reveal their predictive nature when combined
with certain other variables. Another procedure, which is fast and elegant, involves carrying
out a principal component analysis (see Section 7.1) and substituting the first principal
components for the variables at the input of the network. A minor drawback of this technique
is that it is inherently linear, and may conceal important non-linear structures.
The value of m generally lies between n/2 and 2n. Some authors suggest extending the
range to 3n; others recommend 3n/4, and yet others prefer a value between √(nk)/2 and 2√(nk).
The interested reader should consult the report by Iebeling Kaastra and Milton Boyd.2 For
Figure 8.5 Neural network with more than one output unit.
2 Kaastra, I. and Boyd, M. (1996) Designing a neural network for forecasting financial and economic time series.
Neurocomputing, 10, 215–236.
classification, m is generally at least equal to the number of classes to be predicted. It is
best to proceed by conducting a number of tests, measuring the error rate on the test
sample each time, and stopping the increases in m as soon as this rate reaches a minimum, to
avoid overfitting.
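The empirical rules above can be gathered into a small helper. This is a sketch of the heuristics only (the function name and return structure are illustrative, and the rule of 5–10 individuals per weight ignores bias terms, as does the weight count m(n + k)):

```python
import math

def network_sizing(n_inputs, n_outputs, m_hidden, individuals_per_weight=5):
    """Empirical sizing heuristics for a single-hidden-layer network:
    weight count m(n + k), minimum learning-sample size of 5-10
    individuals per weight, and the suggested ranges for m."""
    weights = m_hidden * (n_inputs + n_outputs)
    return {
        "weights": weights,
        "min_sample": individuals_per_weight * weights,
        "m_range_simple": (n_inputs / 2, 2 * n_inputs),          # n/2 to 2n
        "m_range_sqrt": (math.sqrt(n_inputs * n_outputs) / 2,    # sqrt(nk)/2
                         2 * math.sqrt(n_inputs * n_outputs)),   # to 2*sqrt(nk)
    }
```

For example, 10 inputs, 1 output and 8 hidden units give 88 weights, hence at least 440 individuals in the learning sample.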
8.5 Data normalization
You will recall that the data used in a neural network must be numeric and their categories
must lie within the range [0,1]; if this is not already the case, the data must be normalized. To
ensure that the normalization process described below is correct, the learning data set must
of course cover all the values found in the whole population, particularly the extreme values of
continuous variables.
8.5.1 Continuous variables
Even when continuous variables are normalized, the extreme values may still tend to ‘bury’
the normal values. Thus most monthly income levels are in the range from €0 to €10 000, but
if an income exceeds €100 000, the standard normalization of the ‘income’ variable, i.e. its
replacement with the variable

(income − minimum income) / (maximum income − minimum income)

will make the difference between €5000 and €10 000 almost imperceptible, placing it on the
same level as the much less significant difference between €95 000 and €100 000.
There are several ways of normalizing this type of variable correctly. The variable can be
discretized, and replaced with its quartiles, for example. We could normalize the logarithm of
the variable, instead of the variable itself; this would ‘stretch’ the lower part of the scale. We
could normalize the variable in a linear way, as mentioned above, in respect of its values in the
range from −3 to +3 times the standard deviation3 s about the mean m, then change values
lower than m − 3s to 0 and change values greater than m + 3s to 1. In this variant, we can
divide the range [m − 3s, m + 3s] in two if necessary, by setting the mean m to the centre of the range, 0.5, and applying the two half-ranges in a linear way.
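A minimal sketch of this clipped linear normalization over [m − 3s, m + 3s] (function name illustrative; m and s are the mean and standard deviation of the variable on the learning sample):

```python
def normalize_continuous(x, mean, std):
    """Linear normalization over [mean - 3*std, mean + 3*std],
    with values outside the range clipped to 0 or 1."""
    low, high = mean - 3 * std, mean + 3 * std
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)
```

By construction the mean is mapped to the centre of the range, 0.5, and outliers can no longer ‘crush’ the normal values.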
8.5.2 Discrete variables
To normalize discrete variables for which the difference between 0 and 1 is greater than
between 1 and 2, 2 and 3, etc., we can carry out the following translation:
- 0 → 0
- 1 → 1/2
- 2 → 1/2 + 1/4
- . . .
- n → 1/2 + 1/4 + ... + 1/2^n = Σ(k=1 to n) 2^(−k)

3 Note that, when a variable follows a normal distribution with a mean m and standard deviation s, we find 68% of
the observations in the range [m − s, m + s], 95% of the observations in the range [m − 2s, m + 2s] and 99.7% of the
observations in the range [m − 3s, m + 3s].
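The translation above can be written in one line (function name illustrative). By the geometric series, the result has the closed form 1 − 2^(−n), so successive gaps shrink by half each time, as intended:

```python
def normalize_discrete(n):
    """Map a count n to sum of 2**(-k) for k = 1..n, so that the gap
    between 0 and 1 outweighs the gaps 1->2, 2->3, etc."""
    return sum(2.0 ** -k for k in range(1, n + 1))
```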
8.5.3 Qualitative variables
The normalization of qualitative variables poses a problem: it makes an order relationship
appear among their categories, which is often artificial and leads the neural network astray.
A common way of overcoming this difficulty is to make the number of units equal to
the number of categories of the qualitative variables, by creating binary variables (called
‘indicator variables’) whose value of 1 or 0 signifies that the qualitative variable does or does
not have this category. The drawback of this solution is that it requires a larger number of
units, resulting in a more complex network with a longer learning time, as well as an increase
in the size of the sample required for learning.
Before using a neural network on qualitative data, therefore, we should reduce the number
of categories as much as possible.
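The indicator-variable coding described above can be sketched as follows (function name illustrative); each category of the qualitative variable gets its own 0/1 unit:

```python
def to_indicators(value, categories):
    """One input unit per category: indicator (0/1) coding of a
    qualitative variable, with exactly one unit set to 1."""
    if value not in categories:
        raise ValueError(f"unseen category: {value!r}")
    return [1 if value == c else 0 for c in categories]
```

The cost is visible immediately: a variable with ten categories consumes ten input units, which is why the number of categories should be reduced first.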
8.6 Learning algorithms
At the present time, the Levenberg–Marquardt algorithm is often favoured by experts, because
it converges more quickly, and towards a better solution, than the gradient back-propagation
algorithm. However, it needs a large amount of computer memory, proportional to the square
of the number of units. It is therefore limited to small networks with only a few variables. It is
also restricted to a single output unit.
The gradient back-propagation algorithm is the oldest and most widely used method,
especially for large data volumes. But it lacks reliability because of its sensitivity to
local minima.
The conjugate gradient descent algorithm is a good compromise, because its performance
approaches that of the Levenberg–Marquardt algorithm in terms of convergence, but it can be
used on more complex networks with more than one output if necessary.
Finally, I should also mention the quasi-Newton algorithm and the genetic algorithms
which will be examined in Section 11.13.
8.7 The main neural networks
There are various neural networkmodels. Themain ones are themultilayer perceptron (MLP),
the radial basis function (RBF), and the Kohonen network, which are described below. More
recently, the density estimation networks of Specht (1990)4 have been used both for
classification (probabilistic neural networks) and for prediction (general regression neural
networks). There are also networks similar to RBF networks but based on the mathematical
theory of wavelets.
The Kohonen network is an unsupervised learning network used for clustering, while the
other networks mentioned above (MLP, RBF, etc.) are supervised learning networks, used
with one or more dependent variables at the output.
4 Specht, D.F. (1990) Probabilistic neural networks. Neural Networks, 3, 109–118.
8.7.1 The multilayer perceptron
The archetypal neural network is the multilayer perceptron. It is particularly suitable for the
discovery of complex non-linear models. Its power is based on the possibility of approximating
any sufficiently regular function with a sum of sigmoids (Figure 8.6). As its name
indicates, this network is made up of several layers: the input variables, the output variable or
variables, and one or more hidden levels. Each unit at a level is connected to the set of units at
the preceding level.
The number of input units is always equal to the number of variables in the model; if
necessary, these variables may be the ‘indicator’ variables substituted for the original
qualitative variables (see Section 8.5.3). There is usually just one output unit. For the choice
of the number of units in the hidden layer, see Section 8.4.
To explain the operation of the MLP, let us consider the special, but quite common, case of
an MLP using gradient back-propagation.
Each connection has an associated weight, which changes in the course of learning. The
network starts its learning by assigning a random value to each of the weights and calculating
the output value on the basis of a set of records for which the expected output value is known:
this is the learning sample. The network then compares the calculated output value with the
expected value, and calculates an error function e, which can be the sum of squares of the
errors occurring for each individual in the learning sample:

e = Σi Σj (Eij − Oij)²,

where the first summation is performed on the individuals of the learning set, the second
summation is performed on the output units, and Eij (Oij) is the expected (obtained) value of
the jth unit for the ith individual.
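This sum-of-squares error can be computed directly; a minimal sketch (function name illustrative; `expected` and `obtained` are lists of per-individual output vectors):

```python
def sum_squared_error(expected, obtained):
    """e = sum over individuals i and output units j of (E_ij - O_ij)**2."""
    return sum(
        (e_ij - o_ij) ** 2
        for e_i, o_i in zip(expected, obtained)   # loop over individuals
        for e_ij, o_ij in zip(e_i, o_i)           # loop over output units
    )
```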
Figure 8.6 Approximation of a function by a sum of sigmoids (neurons with weight below 1, with weight above 1 and a bias, with weight above 1 and a negative sign, and with a negative sign and a bias).
The network then adjusts the weights of the different units, checking each time to see if the
error function has increased or decreased. As in a conventional regression, this is a matter of
solving a problem of least squares.
If there are n connections in the network, each n-tuple (p1, p2, ..., pn) of weights can be
represented in a space with n + 1 dimensions, the last dimension representing the error
function e. The set of values (p1, p2, ..., pn, e) is a ‘surface’ (or, rather, a hypersurface) in
a space of dimension n + 1, the ‘error surface’, and the adjustment of the weights to minimize
the error function can be seen as a movement on the error surface with the aim of finding the
minimum point. Unlike linear models, in which the error surface is a well-defined and well-
known mathematical object (in the shape of a parabola, for example), and the minimum point
can be found by calculation, neural networks are complex non-linear models where the error
surface has an irregular layout, criss-crossed with hills, valleys, plateaux, deep ravines, and
the like. To find the minimum point on this surface, for which no maps are available, we must
explore it. In the gradient back-propagation algorithm, we move over the error surface by
following the line with the greatest slope, which offers the possibility of reaching the lowest
possible point. We then have to work out how quickly we should travel down the slope. If we
go too quickly, we may pass over the minimum point or set off in the wrong direction; if we go
too slowly, we will need too many iterations in the network to find a solution. When we speak
of an ‘iteration’, this means inputting the whole learning set into the network, comparing the
expected and obtained outputs, and calculating the error function. The range of possible
iterations is very wide, but the order of magnitude is 10 000.
The correct speed is proportional to the slope of the surface and to another important
parameter, namely the learning rate. This rate, between 0 and 1, determines the extent of the
modification of the weights during learning. It is useful to vary this rate, which will be high at
the outset (between 0.7 and 0.9) to allow a speedy exploration of the error surface and a fast
approximation to the best solutions (the minima of the surface), and then decrease at the end
of the learning to bring us as close as possible to an optimal solution. In a situation such as that
shown in Figure 8.7, this decrease in the learning rate will ensure that we do not go from the
local optimum A straight to the local optimum C, possibly with oscillations between A and C,
without reaching the global optimum B.
A second important parameter affects the performance of a multilayer perceptron: this is
the moment (of a neural network), which makes the weights tend to keep the same direction
of change, increasing or decreasing, because a factor incorporates the preceding weight
adjustments. The moment limits oscillations which could be caused by irregularities in the
Figure 8.7 Local optimum and global optimum.
learning examples. The effect of the moment is that, if we move several times successively in
the same direction over the error surface, we tend to continue the movement without being
‘trapped’ by the local minima (such as point A in Figure 8.7) and pass over them to reach the
global minima (such as point B in the same figure). Just as the learning rate decreases as
learning continues, the moment often increases during learning, to enable the network to make
a smooth approach to a globally optimal solution.
To sum up, the learning rate controls the extent of modification of the weights during the
learning process; a higher rate means faster learning, but there is a greater risk that the network
will converge towards a solution other than the globally optimal one. The moment acts as
a damping parameter, reducing oscillations and helping to achieve convergence; with
a smaller moment, the network is better at ‘adapting to its environment’, but extreme data
have more effect on the weights. To some extent, the learning rate controls the speed of
movement and the moment controls the speed of the changes of direction on the error surface;
at the start of the process, we move quickly in all directions, but at the end we slow down and
change direction less often.
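The interaction of the learning rate and the moment can be sketched as a single weight update. This is an illustrative sketch, not the book's own algorithm: real back-propagation computes the gradients layer by layer, but each weight is then moved by a step of this form, where the moment term re-applies a fraction of the previous step:

```python
def momentum_step(weights, gradients, previous_deltas, learning_rate, moment):
    """One update of each weight: delta = -rate * gradient + moment *
    previous delta; the moment term keeps the movement over the error
    surface going in the same direction, damping oscillations."""
    new_weights, new_deltas = [], []
    for w, g, d_prev in zip(weights, gradients, previous_deltas):
        d = -learning_rate * g + moment * d_prev
        new_weights.append(w + d)
        new_deltas.append(d)
    return new_weights, new_deltas
```

Decreasing `learning_rate` and increasing `moment` over the iterations reproduces the schedule described above: fast, wide exploration at the start, smooth convergence at the end.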
The main danger of neural network modelling is obvious: the network may converge
towards a solution that is locally, but not globally, optimal. This risk has led to the
development of graphic tools for real-time display of the error rate in learning and validation,
enabling the learning to be interrupted as soon as there is any sign of overfitting and an
increased error rate in validation (Figure 8.8).
8.7.2 The radial basis function network
An RBF network is a supervised learning network, like the multilayer perceptron, which it
resembles in someways. However, it works with only one hidden layer, and, when calculating
Figure 8.8 Graphic monitoring of a neural network with SAS Enterprise Miner.
the value of each unit in the hidden layer for an observation, it uses the distance in space
between this observation and the centre of the unit, instead of the sum of the weighted values
of the units of the preceding level. Unlike the weights of a multilayer perceptron, the centres of
the hidden layer of an RBF network are not adjusted at each iteration during learning (but
some may be added if the space is not sufficiently covered). In a perceptron, the modification
of a synaptic weight makes it necessary to re-evaluate all the others, but in an RBF network the
hidden neurons share the space and are virtually independent of each other. This makes for
faster convergence of RBF networks in the learning phase, which is one of their strong points.
Now, the response surface (the set of values) of a unit of a hidden layer of a multilayer
perceptron, before the application of the (generally non-linear) transfer function, is a hyperplane
Σi pi Xi = K; similarly, the response surface of a unit of the hidden layer of an RBF network
is a hypersphere Σi (Xi − oi)² = R², and the response of the unit to an individual (xi) is a
decreasing function G of the distance between the individual and this hypersphere. As this
function G is generally a Gaussian function, the response surface of the unit, after the application
of the transfer function, is a Gaussian surface, in other words a ‘bell-shaped’ surface
(Figure 8.9). We speak of a radial function for G, i.e. a function symmetrical about a centre.
Comparing the MLP and RBF networks, we find the differences listed in Table 8.1.
Finally, the global response of the network to each individual (xi) presented to it is:

Σ(k=1 to no. of hidden units) lk exp[ −(1/(2sk²)) Σ(i=1 to no. of input units) (xi − oki)² ]
The learning of an RBF is a matter of determining the number of units in the hidden layer,
i.e. the number of radial functions, their centres Ok = (oki), their radii sk, and the coefficients
lk. The critical point in learning is the choice of the number of radial functions, their centres
and their radii. When this has been done, the coefficients lk are determined in a supervised
way, as simply as in a linear regression. The coefficients can be limited if required, as in a ridge
regression (see Section 11.7.2). This is known as ‘weight decay’.
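The global response formula above can be evaluated directly; a minimal sketch (function name illustrative; `centres`, `radii` and `coefficients` correspond to the Ok, sk and lk of the text):

```python
import math

def rbf_output(x, centres, radii, coefficients):
    """Global response: sum over hidden units k of
    lk * exp(-||x - Ok||^2 / (2 * sk^2))."""
    total = 0.0
    for centre, sigma, lam in zip(centres, radii, coefficients):
        sq_dist = sum((xi - oi) ** 2 for xi, oi in zip(x, centre))
        total += lam * math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total
```

An individual sitting exactly on a centre receives that unit's full coefficient; the contribution falls off as a Gaussian with the distance, which is why the centres must cover the data space.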
Figure 8.9 Response surface of a radial unit.
The number of units is generally specified by the user, even if the network can create others
to improve the accuracy of the results. A sufficiently high number must be provided, generally
more than in a multilayer perceptron, to enable the data structure to be modelled correctly. The
units may commonly number several hundred. This is because the fast decrease of the Gaussian
means that the RBF network has a lower extrapolation capacity when farther from the centres of
the units of the hidden layer. These units must therefore be sufficiently numerous to ensure that
at least one unit is activated for each observation; in other words, at least one radial function
must have a non-negligible value in any region where data are present. This is an evident
drawback of the RBF network as compared with the multilayer perceptron, even if it is also
better protected against certain risky extrapolations found with the multilayer perceptron. In
fact, the complexity of the RBF network increases exponentially with the number of input
variables, because the radial function space has to be filled. As the number of variables
increases, therefore, the calculation time of the RBF network increases, together with the
number of observations needed for learning. This is one of its major weaknesses. It is therefore
essential to select the input variables of an RBF network with great care.
When the number of units has been chosen, we must consider their centres. Some
networks position the centres in a random way. However, the results can be improved by using
the moving centres method (see Section 9.9.1) or Kohonen networks (see below) to divide the
space into clusters (partitions) according to the distribution of the data. Thus, if the data
are distributed in packets, the centres of these packets will be chosen as the centres of the
RBF network. Also, more centres will be positioned in areas with a high observation density
(input adaptation), or in areas where the result to be predicted varies more rapidly (output
adaptation), in order to reduce the output error. Output adaptation is not very often used in
learning, but if it is used it gives rise to the problem of not always being compatible with the
input adaptation. This is because the response distribution (the dependent variable) may not
coincide with the data density distribution, and may lead to the determination of other centres
for the radial function.

Table 8.1 Comparison of MLP and RBF networks.

Network →                      MLP                                RBF
‘Weight’                       Weight pi                          Centre oi
Hidden layer(s)
  Combination function         Scalar product Σi pi xi            Euclidean distance Σi (xi − oi)²
  Transfer function            Logistic                           Gaussian
                               s(X) = 1/(1 + exp(−X))             G(X) = exp(−X²/2s²)
  Number of hidden layers      ≥ 1                                = 1
Output layer
  Combination function         Scalar product Σk pk xk            Linear combination of Gaussians
                                                                  Σk lk Gk (see below)
  Transfer function            Logistic                           Linear function f(X) = X
                               s(X) = 1/(1 + exp(−X))
Speed                          Faster in ‘model                   Faster in ‘model
                               application’ mode                  learning’ mode
Advantage                      Better generalization              Less risk of non-optimal
                                                                  convergence
With the exception of the possible application of output adaptation, the search for the
centres is unsupervised and can also be carried out on observations for which the response is
not always known, since it is then simply a matter of estimating the data probability density. It
may be useful to be able to train the RBF network on observations for which the dependent
variable is not always known: this enables us to use a larger number of observations. In this
case, we speak of semi-supervised learning. This is used whenever the collection of labelled
observations is more difficult or costly than the collection of unlabelled observations.
However, this semi-supervised learning has the drawback of being more sensitive to noise.
When areas of high density are sought in order to determine their centres without considering
the dependent variable, the input variables which are not related to the dependent variable are
not distinguished from the related variables, and may introduce noise.
The final aspect of parameter setting relates to the radii of the units of the hidden layer,
which are the standard deviations of the Gaussian distributions. A simple solution is to choose
radii equal to twice the mean distance between centres. If they are too large, the network will
lack structural detail and its precision will be reduced. If they are too small, the space will be
poorly covered by the Gaussian surfaces, and the network will have to interpolate between
these surfaces, which will decrease the capacity for generalizing the results of the learning
phase. As for the centres, the radii will be chosen so that they are smaller in areas with a high
density of observations, or in areas in which the result to be predicted varies more rapidly.
There are several ways of determining the radii more precisely. A useful method is based
on the k nearest neighbours (see Section 11.2): for each unit centre, its k nearest
neighbouring centres are located (k being chosen appropriately by the user), and the mean
distance to these k neighbours is taken as the radius. This method has the merit of adapting to
the structure of the data. The radii are not necessarily equal to each other, but this is sometimes
assumed to decrease the number of network parameters.
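The k-nearest-neighbour rule for the radii can be sketched as follows (the function name and the choice k = 2 are illustrative assumptions):

```python
import numpy as np

def rbf_radii(centres, k=2):
    """Radius of each RBF unit: the mean distance from its centre to its k
    nearest neighbouring centres, so radii shrink where centres are dense."""
    centres = np.asarray(centres, dtype=float)
    d = np.sqrt(((centres[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)          # a centre is not its own neighbour
    nearest = np.sort(d, axis=1)[:, :k]  # k smallest distances per centre
    return nearest.mean(axis=1)

# three centres on a line: the middle centre has closer neighbours
radii = rbf_radii([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]], k=2)
```

Note how the middle centre, which has closer neighbours, receives a smaller radius than the isolated one: the radii adapt to the local density of centres, as described above.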
Compared with the MLP network, the RBF network has the major advantage of needing
only a single hidden layer, and using linear combination and transfer functions in the output
layer in most cases – except in certain sophisticated variants (see Table 8.1). This makes for
faster learning and far fewer problems of complicated parameter adjustment for the user. It
also avoids the risk, inherent in the back-propagation mechanism of the multilayer perceptron,
of convergence towards a locally, but not globally, optimal solution. From this point of view,
the sequential search for the centres of the radial functions, their radii, and then the
coefficients of their linear combination is an advantage: it provides greater simplicity and
faster learning, and decreases the risk of overfitting by comparison with a search for global
optimization by gradient descent.
The weakness of the radial basis function network, compared with the multilayer perceptron, is that
it may need a large number of units in its hidden layer, which increases the execution time of
the network without always yielding perfect modelling of complex structures and irregular
data. This happens when the number of input variables is too large, and it is desirable to reduce
this number as far as possible. This problem is due to the fact that the RBF network, even more
than the multilayer perceptron, requires a learning set which covers all the configurations and
all the categories of variables which may be found when it is applied to the whole population
to be studied. The advantages and disadvantages of the RBF network tend to be those that are
generally found in networks for probability density estimation. The MLP network offers the
best generalization capacity, especially for noisy data.
8.7.3 The Kohonen network
The Kohonen network is the most widely used unsupervised learning network. It can also be
called a self-adaptive or self-organizing network, because it ‘self-organizes’ around the data.
Other synonyms are ‘Kohonen map’ and ‘self-organizing map’.
Like any neural network, it is made up of layers of units and connections between these
units. Themajor difference from the networks described above is that there is no variable to be
predicted. The purpose of the network is to ‘learn’ the structure of the data so that it can
distinguish clusters in them.
The Kohonen network is composed of two levels (Figure 8.10):
. the input layer, with a unit for each of the n variables used in the clustering;
. an output layer, whose units are arranged as a generally square or rectangular
(sometimes hexagonal) grid of l × m units (in some cases l and m ≠ n), each of these
l × m units being connected to each of the n units of the input layer, the connection
having a certain weight pijk (i ∈ [1,l], j ∈ [1,m], k ∈ [1,n]).
The units of the output layer are not interconnected, but a distance is defined between them,
such that we can speak of the ‘neighbourhood’ of a unit.
Figure 8.10 Kohonen network (individuals 1 to N presented to the n-unit input layer, each input unit connected by a weight pijk to every unit of the output grid).
The units of the input layer correspond to the variables of the individuals to be clustered,
and this layer is used to present the individuals; the states of its units are the values of the
variables characterizing the individuals to be clustered. This is why this layer contains n units,
where n is the number of variables used in the clustering.
The grid on which the output units are placed is called the ‘topological map’. The shape
and size of this grid are generally chosen by the user, but they may also change in the course of
learning. Each output unit (i,j) is associated with a weight vector (pijk), k ∈ [1,n], and the
response of this unit to an individual (xk), k ∈ [1,n], is therefore defined as the squared
Euclidean distance

$$d_{ij}(x) = \sum_{k=1}^{n} \left(x_k - p_{ijk}\right)^2.$$
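The response above is simply the squared Euclidean distance between the input vector and the unit's weight vector; as a brief illustration (the helper name is my own):

```python
import numpy as np

def unit_response(x, w):
    """dij(x): squared Euclidean distance between an input individual x
    and the weight vector w = (pijk) of output unit (i, j)."""
    x, w = np.asarray(x, dtype=float), np.asarray(w, dtype=float)
    return float(((x - w) ** 2).sum())

d = unit_response([1.0, 2.0], [0.0, 0.0])  # (1-0)^2 + (2-0)^2 = 5.0
```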
So how does a Kohonen network learn? First of all, the weights pijk are initialized
randomly. Then the responses of the l × m units of the output layer are calculated for each
individual (xk) in the learning sample. The unit chosen to represent (xk) is the unit (i,j) for
which dij(x) has the minimum value. We say that this unit is 'activated' (Figure 8.11). This
unit and all the neighbouring units have their weights adjusted to bring them closer to the
individual at the input.

Figure 8.11 Activation of a unit of a Kohonen network (inputs such as income, age, number of children; the 3 × 3 neighbourhood around the winning unit (i,j)).

For example, the neighbouring units of (i,j) are the eight units (i−1,j), (i+1,j), (i,j−1),
(i,j+1), (i+1,j+1), (i+1,j−1), (i−1,j+1), (i−1,j−1). The
size of the neighbourhood generally decreases during learning: at the beginning, the
neighbourhood can be the whole grid; by the end, it may be reduced to the unit itself. These
adjustments form part of the network parameters.
The new weights of a neighbour (I,J) of the 'winner' (i,j) are

$$p_{IJk} + \eta \cdot f(i,j;I,J) \cdot (x_k - p_{IJk}) \quad \text{for every } k \in [1,n],$$

where f(i,j;I,J) is a decreasing function of the distance between the units (i,j) and (I,J), such
that f(i,j;i,j) = 1. It may also be a Gaussian function: exp(−distance(i,j;I,J)²/2σ²). The
parameter η ∈ [0,1] is a learning rate which, as in the case of a multilayer perceptron,
changes during learning by decreasing linearly or exponentially.
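Pulling these steps together, here is a minimal sketch of the learning phase. The function name, grid size, decay schedules and toy data are my own choices, not the book's: random weight initialisation, winner search, then a Gaussian-neighbourhood update with a linearly decaying rate.

```python
import numpy as np

def train_som(X, l=4, m=4, n_epochs=20, eta0=0.5, sigma0=2.0, seed=0):
    """Sketch of Kohonen learning: for each individual, activate the
    winning unit, then pull it and its neighbours towards the input, with
    a Gaussian neighbourhood and a linearly decaying learning rate."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.normal(size=(l, m, n))              # weights pijk, random init
    ii, jj = np.meshgrid(np.arange(l), np.arange(m), indexing="ij")
    for t in range(n_epochs):
        eta = eta0 * (1 - t / n_epochs)         # linearly decaying rate
        sigma = sigma0 * (1 - t / n_epochs) + 0.5  # shrinking neighbourhood
        for x in X:
            d = ((W - x) ** 2).sum(axis=2)      # response dij(x) of each unit
            i, j = np.unravel_index(d.argmin(), d.shape)  # winning unit
            # Gaussian neighbourhood f(i,j;I,J), equal to 1 at the winner
            g = np.exp(-((ii - i) ** 2 + (jj - j) ** 2) / (2 * sigma ** 2))
            W += eta * g[:, :, None] * (x - W)  # adjust winner and neighbours
    return W

# toy data: two well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
W = train_som(X)
```

In practice, the grid dimensions l × m and the decay schedules for η and σ are parameters the user must tune; here they are fixed arbitrarily for illustration.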
It is the extension of the weight adjustment to the whole neighbourhood of the ‘winning’
unit that brings the neighbouring units of (i,j) close to the individual (xk) at the input, and
enables the individuals that are close together in variable space to be represented by identical
or neighbouring units in the layer, just as neighbouring neurons respond to nearby stimuli in
the cerebral cortex. The whole process takes place as though the Kohonen network was made
of rubber and was deformed to make the cloud of individuals pass over it while approaching as
closely as possible to the individuals. By contrast with the factor plane (see Section 7.1), the
projection concerned is non-linear.
When all the individuals in the learning sample have been presented to the network and all
the weights have been adjusted, the learning is complete.
To summarize, during the network’s learning:
. For each individual, only one output unit (the ‘winner’) is activated.
. The weights of the winner and its neighbours are adjusted.
. The adjustment is such that two closely placed output units correspond to two closely
placed individuals.
. Groups (clusters) of units are formed at the output.
In the application phase, the Kohonen network operates by representing each input
individual by the unit of the network which is closest to it in terms of the distance defined
above. This unit will be the cluster of the individual.
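In the application phase, cluster assignment thus reduces to a nearest-unit lookup; a sketch assuming a trained weight grid W of shape (l, m, n) (the helper name and toy weights are illustrative):

```python
import numpy as np

def assign_cluster(x, W):
    """Application phase: the cluster of individual x is the output
    unit (i, j) whose weight vector is closest to x."""
    d = ((W - np.asarray(x, dtype=float)) ** 2).sum(axis=2)
    return np.unravel_index(d.argmin(), d.shape)

# toy 2 x 2 grid of weight vectors (illustrative values)
W = np.array([[[0.0, 0.0], [0.0, 5.0]],
              [[5.0, 0.0], [5.0, 5.0]]])
cluster = assign_cluster([4.8, 0.3], W)  # nearest unit: (1, 0)
```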
This algorithm has some similarities with the moving centres and k-means methods (see
Section 9.9.1). However, there is an important difference. In the k-means method, the
introduction of a new individual into a cluster only results in the recalculation of the centre
of gravity of the cluster, without any effect on the other centres of gravity. But the introduction
of a new individual into a Kohonen network results in the adjustment of not just the unit
nearest to the individual, but also the neighbouring units. The neighbourhood of the ‘winner’
unit is significant, while the neighbourhood of the ‘winner’ centre of gravity is not.
Another major difference between Kohonen networks and the moving centres and k-means
methods is that, unlike these methods, the Kohonen clustering takes place by reducing the
number of dimensions of the variable space, as in factor analysis, the new working space
generally being of dimension 2, as in my description, or, exceptionally, of dimension 3 or 1.