8

Neural networks

Data mining would not be the same without neural networks, which lie at the root of certain descriptive and predictive methods of data mining. These networks have become widely used, owing to their modelling power (they can approximate any sufficiently regular function), with excellent results across a broad range of problems, even when faced with complex phenomena, irregular forms, and data that are difficult to grasp and follow no particular probability law. In some cases, however, their use is impeded by certain difficulties in implementation, such as the ‘black box’ nature of the networks, the delicacy of the necessary adjustments, the amount of computing power required, and especially the risks of overfitting and convergence to a globally non-optimal solution.

This chapter has been placed before the chapters on clustering, classification and prediction methods, because neural networks are used both for clustering (Kohonen networks) and for classification and prediction (perceptrons, radial basis function networks). Any reader not interested in the details of these methods may skip this chapter.

8.1 General information on neural networks

Following the initial description of a formal neuron by McCulloch and Pitts in 1943, the first neural networks appeared in 1958 with the ‘perceptron’ of Rosenblatt. They were developed rapidly in the 1980s and have been used widely in industry since the 1990s. A neural network has an architecture based on that of the brain, organized in neurons and synapses, and takes the form of a set of interconnected units (or formal neurons), with each continuous input variable corresponding to a unit at a first level, called the input layer, and each category of a qualitative variable also corresponding to a unit of the input layer. In some cases, when the network is used in a predictive technique, there may be one or more dependent variables: in this case each of them corresponds to one unit (or several units in the case of qualitative variables – see below) at a final level, called the output layer. Predictive networks are called ‘supervised learning’ networks, and descriptive networks are called ‘unsupervised learning’ networks.

Data Mining and Statistics for Decision Making, First Edition. Stéphane Tufféry. © 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd. ISBN: 978-0-470-68829-8

Units belonging to an intermediate level, the hidden layer, are sometimes connected between the input layer and the output layer. There may be several hidden layers.

A unit receives values at its input and returns 0 to n values at the output. All these values are normalized so that they lie between 0 and 1 (or sometimes between −1 and 1, depending on the limits of the transfer function described below). A combination function calculates a first value from the units connected at the input and the weights of the connections. Thus, in the most widely used networks, this is the weighted sum ∑i nipi of the values ni of the input units. To determine an output value, a second function, called the transfer function (or activation function), is applied to this value. The units in the input layer are simple, in the sense that they do not create any combinations but only transmit the values of the variables corresponding to them.

Thus a perceptron unit takes the form shown in Figure 8.1. The notation used in this diagram is as follows:

• ni is the value of unit i at the preceding level (the summation over i corresponds to all the units at the preceding level connected to the unit being observed);
• pi is the weight associated with the connection between unit i and the observed unit;
• f is the transfer function associated with the observed unit.

[Figure 8.1: Unit of a neural network. Inputs 1 to n feed the unit, which forms the weighted sum ∑ nipi and outputs f(∑ nipi).]

The learning of the neural network takes place on the basis of a sample of the population under study; it uses the individuals in the sample to adjust the weights of the connections between the units. In the course of learning, the value delivered by the output unit is compared with the actual value, and the weights pi of all the units are adjusted so as to improve the prediction, by a mechanism which depends on the type of neural network. One mechanism which is still widely used is ‘gradient back-propagation’, but there are more recent and more effective ones such as the Levenberg–Marquardt, quasi-Newton, conjugate gradient, quick propagation, and genetic algorithms (see Section 11.13). The network runs through the learning sample many (often several thousand) times. Learning is completed when an optimal solution¹ has been found and the weights pi are no longer modified significantly, or when a previously specified number of iterations have been run. At the end of the learning phase, the network forms a function which associates the variables with each other.

In the perceptron, the transfer function may be the linear function f(x) = x, but it is best to choose a function that behaves linearly in the neighbourhood of 0 (when the weights of the units are small) and non-linearly at the limits, so that both linear and non-linear phenomena can be modelled. In almost all cases, a sigmoid function is chosen, more specifically the logistic sigmoid σ(x) = 1/(1 + e^−x) (which we will meet again in the section on logistic regression), or the tangent sigmoid, which is simply the hyperbolic tangent function tanh(x) = (e^x − e^−x)/(e^x + e^−x). There are other sigmoid functions, but the performance is very similar regardless of which is chosen. Although these functions are constantly increasing, they can approach any continuous function when they are combined with each other (see Section 8.7.1), in other words when the activations of several units are combined.

The capacity to handle non-linear relations between the variables is a major benefit of neural networks.

¹ This optimum may be global or only local: see Section 8.7.1 on the multilayer perceptron.
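As a small illustration of the unit in Figure 8.1, the following sketch (our own naming, not the book's) forms the weighted sum ∑ nipi and applies a logistic or tangent-sigmoid transfer function:

```python
import math

def logistic(x):
    # logistic sigmoid s(x) = 1 / (1 + e^-x), with values in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh_sigmoid(x):
    # tangent sigmoid tanh(x) = (e^x - e^-x) / (e^x + e^-x), values in (-1, 1)
    return math.tanh(x)

def unit_output(inputs, weights, transfer=logistic):
    # combination function: weighted sum of the incoming values n_i * p_i
    combined = sum(n * p for n, p in zip(inputs, weights))
    # transfer (activation) function applied to the combined value
    return transfer(combined)

# with small weights the combined value stays near 0,
# where the sigmoid behaves almost linearly
print(unit_output([0.2, 0.5, 0.1], [0.1, -0.3, 0.2]))
```

With zero inputs the logistic unit returns exactly 0.5, the midpoint of its range; large combined values saturate near the limits 0 and 1.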

The necessity of normalizing the values of the input data can be seen with the logistic function (Figure 8.2). If this were not done, the data with large values would ‘crush’ the others, and the adjustments of the weights would have no effect on the value 1/(1 + exp(−∑ nipi)), as this value does not vary greatly around 0 or 1 when the absolute value of ∑ nipi is large. Also, the fact that all the values lie between 0 and 1 (or −1 and 1) means that a unit can receive the output of a preceding unit at its input without encountering problems due to excessively large values.

[Figure 8.2: The logistic function 1/(1 + exp(−x)).]

As a general rule, the stages in the implementation of a neural network for prediction or classification are:

(i) identification of the input and output data;
(ii) normalization of these data;
(iii) establishment of a network with a suitable structure;
(iv) learning;
(v) testing;
(vi) application of the model generated by learning;
(vii) denormalization of the output data.
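Stages (ii) and (vii), normalization and denormalization, can be sketched with a simple min-max mapping (function names are ours):

```python
def normalize(values):
    # min-max normalization to [0, 1]; the learning set must cover the extremes
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values], (lo, hi)

def denormalize(scaled, bounds):
    # stage (vii): map network outputs back to the original scale
    lo, hi = bounds
    return [s * (hi - lo) + lo for s in scaled]

incomes = [1200.0, 2500.0, 800.0, 4000.0]
scaled, bounds = normalize(incomes)
restored = denormalize(scaled, bounds)
print(scaled)     # all values now lie in [0, 1]
print(restored)   # the round trip recovers the original values
```

The bounds must be kept from the learning phase so that new observations and outputs are mapped consistently.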

8.2 Structure of a neural network

The structure of a neural network, also referred to as its ‘architecture’ or ‘topology’, consists of the number of layers and units, the way in which the different units are interconnected (the choice of combination and transfer functions) and the weight adjustment mechanism. The choice of this structure will largely determine the results that will be obtained, and is the most critical part of the implementation of a neural network.

The simplest structure is one in which the units are distributed in two layers: an input layer and an output layer. Each unit in the input layer has a single input and a single output which is equal to the input (see Figure 8.3). The output unit has all the units of the input layer connected to its input, with a combination function and a transfer function. There may be more than one output unit. In this case, the resulting model is a linear or logistic regression, depending on whether the transfer function is linear or logistic, and the weights of the network are the regression coefficients.

[Figure 8.3: Neural network with no hidden layer. Input units n1 to n5 are connected, with weights p1 to p5, to an output unit computing s(n1p1 + … + nkpk).]

The predictive power can be increased by adding one or more hidden layers between the input and output layers (Figure 8.4). Although the predictive power increases with the number of hidden layers and units in these layers, this number must nevertheless be as small as possible, to ensure that the neural network does not simply store all the information from the learning set but can generalize it, thus avoiding what is known as ‘overfitting’ (see Section 11.3.4), which occurs when the weights simply make the system learn the details of the learning set, instead of discovering general structures. This happens when the size of the learning set is too small in relation to the complexity of the model, which in this case means the complexity of the network topology. This is discussed further in Section 8.4 below.

Whether or not a hidden layer is present, the output layer of the network can sometimes have several units, when there are several classes to predict (Figure 8.5).
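The forward pass through a network with one hidden layer of two units, whose activations are weighted by Π and Θ at the output, might be sketched as follows (all names are ours):

```python
import math

def s(x):
    # logistic transfer function
    return 1.0 / (1.0 + math.exp(-x))

def forward(n, p, q, pi, theta):
    # hidden layer: each unit combines the inputs with its own weights, then applies s
    h1 = s(sum(ni * pw for ni, pw in zip(n, p)))
    h2 = s(sum(ni * qw for ni, qw in zip(n, q)))
    # output unit: s applied to the weighted sum of the hidden activations
    return s(pi * h1 + theta * h2)

y = forward(n=[0.3, 0.7, 0.5], p=[0.2, -0.4, 0.1],
            q=[0.5, 0.3, -0.2], pi=1.5, theta=-0.8)
print(y)  # a value in (0, 1)
```

With the output weights set to zero the output unit receives 0 and returns s(0) = 0.5, whatever the inputs.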

8.3 Choosing the learning sample

The learning of the neural network will be improved if it takes place on a sample that is sufficiently rich to represent all the possible values of all the layers of the network, in other words all the possible categories of each variable, at the input or at the output. A network can only learn from the configurations that it has encountered during its learning: customers with overdrafts of more than €1000 may be very much at risk, but if the network’s learning sample did not include any of these, then the network will not be able to predict anything about them. However, we should remember that the learning time increases greatly with the size of the sample, as the neural network runs through its learning sample many times.

For the output variables, the learning sample must include all the categories in equal proportions, even if some categories are more frequent in the real population (for example, we must have as many ‘negative’ events as ‘positive’ ones, even if the ‘negative’ events are much rarer in reality).
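One simple way to obtain the equal proportions described above is to downsample every category to the size of the rarest one; a sketch under our own naming:

```python
import random

def balance(sample, label_of, seed=0):
    # group individuals by output category
    groups = {}
    for ind in sample:
        groups.setdefault(label_of(ind), []).append(ind)
    # downsample every category to the size of the rarest one
    n = min(len(g) for g in groups.values())
    rng = random.Random(seed)
    balanced = []
    for g in groups.values():
        balanced.extend(rng.sample(g, n))
    return balanced

# 'positive' events are much rarer than 'negative' ones in this toy population
population = [("neg", i) for i in range(95)] + [("pos", i) for i in range(5)]
learning = balance(population, label_of=lambda ind: ind[0])
print(len(learning))  # 10 individuals: 5 'neg' and 5 'pos'
```

Downsampling also shortens the learning time, which matters since the network runs through the sample many times.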

[Figure 8.4: Neural network with a hidden layer. Input units (age, income, nb children, …) feed two hidden units computing s(∑ nipi) and s(∑ niqi); the output unit computes s[Π·s(∑ nipi) + Θ·s(∑ niqi)].]

8.4 Some empirical rules for network design

In a back-propagation network, at least 5–10 individuals will be needed to adjust each weight. To increase the robustness of the network, it is advisable to have a single hidden layer for a radial basis function network, and one, or in exceptional cases two, for the multilayer perceptron: instead of adding a third hidden layer, it is better to modify other parameters, retest with other initial weights, or reprocess the input data.

A network with n input units, a single hidden layer, m units in the hidden layer and k output units has m(n + k) weights. We therefore need a sample of at least 5m(n + k) individuals for the learning process. If the number of input units has to be reduced, because the learning sample is too small, then the number of predictive variables must also be reduced. Suppose that we wish to reduce 20 predictive variables to 10. We can test all the combinations of 10 variables, changing only two or three of them each time. This is time-consuming, but takes into account the fact that some variables only reveal their predictive nature when combined with certain other variables. Another procedure, which is fast and elegant, involves carrying out a principal component analysis (see Section 7.1) and substituting the first principal components for the variables at the input of the network. A minor drawback of this technique is that it is inherently linear, and may conceal important non-linear structures.

The value of m generally lies between n/2 and 2n. Some authors suggest extending the range to 3n; others recommend 3n/4, and yet others prefer a value between √(nk)/2 and 2√(nk). The interested reader should consult the report by Iebeling Kaastra and Milton Boyd.² For classification, m is generally at least equal to the number of classes to be predicted. It is best to proceed by conducting a number of tests, measuring the error rate on the test sample each time, and stopping the increases in m as soon as this rate reaches a minimum, to avoid overfitting.

[Figure 8.5: Neural network with more than one output unit.]

² Kaastra, I. and Boyd, M. (1996) Designing a neural network for forecasting financial and economic time series. Neurocomputing, 10, 215–236.
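These rules of thumb translate directly into a quick feasibility check (a sketch, with our own function names):

```python
import math

def weights_count(n, m, k):
    # a single-hidden-layer network has m(n + k) weights
    return m * (n + k)

def min_sample_size(n, m, k, per_weight=5):
    # at least 5-10 individuals are needed to adjust each weight
    return per_weight * weights_count(n, m, k)

def hidden_unit_ranges(n, k):
    # common suggestions for m: n/2 to 2n, or sqrt(nk)/2 to 2*sqrt(nk)
    return {"n/2 to 2n": (n / 2, 2 * n),
            "sqrt(nk)/2 to 2*sqrt(nk)": (math.sqrt(n * k) / 2, 2 * math.sqrt(n * k))}

# 20 input units, 10 hidden units, 1 output unit
print(weights_count(20, 10, 1))      # 210 weights
print(min_sample_size(20, 10, 1))    # at least 1050 individuals
print(hidden_unit_ranges(20, 1))
```

If the available sample falls short of this figure, the number of input variables (and hence n) must be reduced, for example by the principal component substitution described above.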

8.5 Data normalization

You will recall that the data used in a neural network must be numeric and their values must lie within the range [0,1]; if this is not already the case, the data must be normalized. To ensure that the normalization process described below is correct, the learning data set must of course cover all the values found in the whole population, particularly the extreme values of continuous variables.

8.5.1 Continuous variables

Even when continuous variables are normalized, the extreme values may still tend to ‘bury’ the normal values. Thus most monthly income levels are in the range from €0 to €10 000, but if an income exceeds €100 000, the standard normalization of the ‘income’ variable, i.e. its replacement with the variable

(income − minimum income) / (maximum income − minimum income)

will make the difference between €5000 and €10 000 almost imperceptible, placing it on the same level as the much less significant difference between €95 000 and €100 000.

There are several ways of normalizing this type of variable correctly. The variable can be discretized, and replaced with its quartiles, for example. We could normalize the logarithm of the variable, instead of the variable itself; this would ‘stretch’ the lower part of the scale. We could normalize the variable in a linear way, as mentioned above, in respect of its values in the range from −3 to +3 times the standard deviation³ s about the mean m, then change values lower than m − 3s to 0 and change values greater than m + 3s to 1. In this variant, we can divide the range [m − 3s, m + 3s] in two if necessary, by setting the mean m to the centre of the range, 0.5, and applying the two half-ranges in a linear way.
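The ±3 standard deviations variant described above might be sketched as follows (using the population standard deviation; names are ours):

```python
import statistics

def clipped_normalize(values):
    # linear normalization over [m - 3s, m + 3s];
    # values outside that range are clipped to 0 or 1
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    lo, hi = m - 3 * s, m + 3 * s
    out = []
    for v in values:
        if v <= lo:
            out.append(0.0)
        elif v >= hi:
            out.append(1.0)
        else:
            out.append((v - lo) / (hi - lo))
    return out

incomes = [1000, 2000, 3000, 4000, 100000]  # one extreme income
print(clipped_normalize(incomes))
```

Compared with plain min-max normalization, the extreme income no longer dictates the scale, so the differences among ordinary incomes remain visible to the network.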

8.5.2 Discrete variables

To normalize discrete variables for which the difference between 0 and 1 is greater than that between 1 and 2, 2 and 3, etc., we can carry out the following translation:

• 0 → 0
• 1 → 1/2
• 2 → 1/2 + 1/4
• …
• n → ∑k=1..n 2^−k

³ Note that, when a variable follows a normal distribution with a mean m and standard deviation s, we find 68% of the observations in the range [m − s, m + s], 95% of the observations in the range [m − 2s, m + 2s] and 99.7% of the observations in the range [m − 3s, m + 3s].
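The translation above (n → ∑k=1..n 2^−k) can be sketched directly:

```python
def translate(n):
    # 0 -> 0, 1 -> 1/2, 2 -> 1/2 + 1/4, ..., n -> sum of 2^-k for k = 1..n
    return sum(2.0 ** -k for k in range(1, n + 1))

for n in range(5):
    print(n, translate(n))
# successive gaps shrink: 1/2, 1/4, 1/8, ... so the step from 0 to 1
# weighs more than the step from 1 to 2, and so on, while every
# translated value stays inside [0, 1)
```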

8.5.3 Qualitative variables

The normalization of qualitative variables poses a problem: it makes an order relationship appear among the categories, which is often artificial and leads the neural network astray. A common way of overcoming this difficulty is to make the number of units equal to the number of categories of the qualitative variable, by creating binary variables (called ‘indicator variables’) whose value of 1 or 0 signifies that the qualitative variable does or does not have this category. The drawback of this solution is that it requires a larger number of units, resulting in a more complex network with a longer learning time, as well as an increase in the size of the sample required for learning.

Before using a neural network on qualitative data, therefore, we should reduce the number of categories as much as possible.
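The indicator-variable encoding can be sketched as follows (names are ours):

```python
def indicator_variables(values):
    # one input unit per category: binary 'indicator variables',
    # with a 1 marking the category held by each individual
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories

statuses = ["single", "married", "single", "divorced"]
encoded, cats = indicator_variables(statuses)
print(cats)      # ['divorced', 'married', 'single']
print(encoded)   # exactly one 1 per row
```

Three categories already require three input units instead of one, which illustrates why the number of categories should be reduced beforehand.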

8.6 Learning algorithms

At the present time, the Levenberg–Marquardt algorithm is often favoured by experts, because it converges more quickly, and towards a better solution, than the gradient back-propagation algorithm. However, it needs a large amount of computer memory, proportional to the square of the number of units. It is therefore limited to small networks with only a few variables. It is also restricted to a single output unit.

The gradient back-propagation algorithm is the oldest and most widely used method, especially for large data volumes. But it lacks reliability because of its sensitivity to local minima.

The conjugate gradient descent algorithm is a good compromise, because its performance approaches that of the Levenberg–Marquardt algorithm in terms of convergence, but it can be used on more complex networks with more than one output if necessary.

Finally, I should also mention the quasi-Newton algorithm and the genetic algorithms which will be examined in Section 11.13.

8.7 The main neural networks

There are various neural networkmodels. Themain ones are themultilayer perceptron (MLP),

the radial basis function (RBF), and the Kohonen network, which are described below. More

recently, the density estimation networks of Specht (1990)4 have been used both for

classification (probabilistic neural networks) and for prediction (general regression neural

networks). There are also networks similar to RBF networks but based on the mathematical

theory of wavelets.

The Kohonen network is an unsupervised learning network used for clustering, while the

other networks mentioned above (MLP, RBF, etc.) are supervised learning networks, used

with one or more dependent variables at the output.

4 Specht, D.F. (1990) Probabilistic neural networks. Neural Networks, 3, 109–118.

224 NEURAL NETWORKS

8.7.1 The multilayer perceptron

The archetypal neural network is the multilayer perceptron. It is particularly suitable for the discovery of complex non-linear models. Its power is based on the possibility of approximating any sufficiently regular function with a sum of sigmoids (Figure 8.6). As its name indicates, this network is made up of several layers: the input variables, the output variable or variables, and one or more hidden levels. Each unit at a level is connected to the set of units at the preceding level.

The number of input units is always equal to the number of variables in the model; if necessary, these variables may be the ‘indicator’ variables substituted for the original qualitative variables (see Section 8.5.3). There is usually just one output unit. For the choice of the number of units in the hidden layer, see Section 8.4.

To explain the operation of the MLP, let us consider the special, but quite common, case of an MLP using gradient back-propagation.

Each connection has an associated weight, which changes in the course of learning. The network starts its learning by assigning a random value to each of the weights and calculating the output value on the basis of a set of records for which the expected output value is known: this is the learning sample. The network then compares the calculated output value with the expected value, and calculates an error function e, which can be the sum of squares of the errors occurring for each individual in the learning sample:

e = ∑i ∑j (Eij − Oij)²,

where the first summation is performed on the individuals of the learning set, the second summation is performed on the output units, and Eij (Oij) is the expected (obtained) value of the jth unit for the ith individual.
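The error function e translates directly into code (our own naming):

```python
def sum_of_squared_errors(expected, obtained):
    # e = sum over individuals i and output units j of (E_ij - O_ij)^2
    return sum((e_ij - o_ij) ** 2
               for e_i, o_i in zip(expected, obtained)
               for e_ij, o_ij in zip(e_i, o_i))

# three individuals, one output unit each
E = [[1.0], [0.0], [1.0]]
O = [[0.8], [0.1], [0.6]]
print(sum_of_squared_errors(E, O))  # 0.2^2 + 0.1^2 + 0.4^2
```

It is this quantity that the weight adjustments described below try to minimize.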

[Figure 8.6: Approximation of a function by a sum of sigmoids. The sum combines a neuron with weight < 1, a neuron with weight > 1 and a bias, a neuron with weight > 1 and a negative sign, and a neuron with a negative sign and a bias.]

The network then adjusts the weights of the different units, checking each time to see if the error function has increased or decreased. As in a conventional regression, this is a matter of solving a problem of least squares.

If there are n connections in the network, each n-tuple (p1, p2, . . ., pn) of weights can be represented in a space with n + 1 dimensions, the last dimension representing the error function e. The set of values (p1, p2, . . ., pn, e) is a ‘surface’ (or, rather, a hypersurface) in a space of dimension n + 1, the ‘error surface’, and the adjustment of the weights to minimize the error function can be seen as a movement on the error surface with the aim of finding the minimum point. Unlike linear models, in which the error surface is a well-defined and well-known mathematical object (in the shape of a parabola, for example), and the minimum point can be found by calculation, neural networks are complex non-linear models where the error surface has an irregular layout, criss-crossed with hills, valleys, plateaux, deep ravines, and the like. To find the minimum point on this surface, for which no maps are available, we must explore it. In the gradient back-propagation algorithm, we move over the error surface by following the line with the greatest slope, which offers the possibility of reaching the lowest possible point. We then have to work out how quickly we should travel down the slope. If we go too quickly, we may pass over the minimum point or set off in the wrong direction; if we go too slowly, we will need too many iterations in the network to find a solution. When we speak of an ‘iteration’, this means inputting the whole learning set into the network, comparing the expected and obtained outputs, and calculating the error function. The range of possible iterations is very wide, but the order of magnitude is 10 000.

The correct speed is proportional to the slope of the surface and to another important parameter, namely the learning rate. This rate, between 0 and 1, determines the extent of the modification of the weights during learning. It is useful to vary this rate, which will be high at the outset (between 0.7 and 0.9) to allow a speedy exploration of the error surface and a fast approximation to the best solutions (the minima of the surface), and then decrease at the end of the learning to bring us as close as possible to an optimal solution. In a situation such as that shown in Figure 8.7, this decrease in the learning rate will ensure that we do not go from the local optimum A straight to the local optimum C, possibly with oscillations between A and C, without reaching the global optimum B.

[Figure 8.7: Local optimum and global optimum.]

A second important parameter affects the performance of a multilayer perceptron: this is the moment (of a neural network), which makes the weights tend to keep the same direction of change, increasing or decreasing, because a factor incorporates the preceding weight adjustments. The moment limits oscillations which could be caused by irregularities in the learning examples. The effect of the moment is that, if we move several times successively in the same direction over the error surface, we tend to continue the movement without being ‘trapped’ by the local minima (such as point A in Figure 8.7) and pass over them to reach the global minima (such as point B in the same figure). Just as the learning rate decreases as learning continues, the moment often increases during learning, to enable the network to make a smooth approach to a globally optimal solution.
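The roles of the learning rate and the moment can be seen in the classic weight update, where each adjustment adds a fraction of the previous one (a standard momentum-style update, sketched with our own names rather than the book's notation):

```python
def update_weights(weights, gradients, previous_deltas, learning_rate, moment):
    # delta = -rate * gradient + moment * previous delta:
    # the moment term keeps the weights moving in the same direction,
    # smoothing oscillations and helping to pass over local minima
    new_weights, deltas = [], []
    for w, g, prev in zip(weights, gradients, previous_deltas):
        d = -learning_rate * g + moment * prev
        new_weights.append(w + d)
        deltas.append(d)
    return new_weights, deltas

w, prev = [0.5, -0.2], [0.0, 0.0]
for step in range(3):
    grads = [2 * wi for wi in w]  # gradient of the toy error e = w1^2 + w2^2
    w, prev = update_weights(w, grads, prev, learning_rate=0.1, moment=0.5)
print(w)  # both weights have moved towards the minimum at 0
```

Raising the learning rate speeds the descent but risks overshooting; raising the moment strengthens the tendency to keep moving in the same direction.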

To sum up, the learning rate controls the extent of modification of the weights during the learning process; a higher rate means faster learning, but there is a greater risk that the network will converge towards a solution other than the globally optimal one. The moment acts as a damping parameter, reducing oscillations and helping to achieve convergence; with a smaller moment, the network is better at ‘adapting to its environment’, but extreme data have more effect on the weights. To some extent, the learning rate controls the speed of movement and the moment controls the speed of the changes of direction on the error surface; at the start of the process, we move quickly in all directions, but at the end we slow down and change direction less often.

The main danger of neural network modelling is obvious: the network may converge towards a solution that is locally, but not globally, optimal. This risk has led to the development of graphic tools for real-time display of the error rate in learning and validation, enabling the learning to be interrupted as soon as there is any sign of overfitting and an increased error rate in validation (Figure 8.8).
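The interruption rule that such graphic tools support, stopping as soon as the validation error stops improving, might be sketched as follows (a hypothetical helper, not a tool from the book):

```python
def early_stopping_point(validation_errors, patience=2):
    # stop when the validation error has not improved for `patience`
    # iterations in a row, a sign that the network is starting to
    # overfit the learning set; return the index of the best iteration
    best, best_i, waited = float("inf"), 0, 0
    for i, err in enumerate(validation_errors):
        if err < best:
            best, best_i, waited = err, i, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_i

errors = [0.40, 0.31, 0.25, 0.22, 0.24, 0.27, 0.30]
print(early_stopping_point(errors))  # 3: the minimum before the error rises
```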

[Figure 8.8: Graphic monitoring of a neural network with SAS Enterprise Miner.]

8.7.2 The radial basis function network

An RBF network is a supervised learning network, like the multilayer perceptron, which it resembles in some ways. However, it works with only one hidden layer, and, when calculating the value of each unit in the hidden layer for an observation, it uses the distance in space between this observation and the centre of the unit, instead of the sum of the weighted values of the units of the preceding level. Unlike the weights of a multilayer perceptron, the centres of the hidden layer of an RBF network are not adjusted at each iteration during learning (but some may be added if the space is not sufficiently covered). In a perceptron, the modification of a synaptic weight makes it necessary to re-evaluate all the others, but in an RBF network the hidden neurons share the space and are virtually independent of each other. This makes for faster convergence of RBF networks in the learning phase, which is one of their strong points.

Now, the response surface (the set of values) of a unit of a hidden layer of a multilayer perceptron, before the application of the (generally non-linear) transfer function, is a hyperplane ∑i piXi = K; similarly, the response surface of a unit of the hidden layer of an RBF network is a hypersphere ∑i (Xi − oi)² = R², and the response of the unit to an individual (xi) is a decreasing function G of the distance between the individual and this hypersphere. As this function G is generally a Gaussian function, the response surface of the unit, after the application of the transfer function, is a Gaussian surface, in other words a ‘bell-shaped’ surface (Figure 8.9). We speak of a radial function for G, i.e. a function symmetrical about a centre.

[Figure 8.9: Response surface of a radial unit.]

Comparing the MLP and RBF networks, we find the differences listed in Table 8.1.

Finally, the global response of the network to each individual (xi) presented to it is:

∑k=1..(no. of hidden units) λk exp[−(1/(2σk²)) ∑i=1..(no. of input units) (xi − oki)²]

The learning of an RBF network is a matter of determining the number of units in the hidden layer, i.e. the number of radial functions, their centres Ok = (oki), their radii σk, and the coefficients λk. The critical point in learning is the choice of the number of radial functions, their centres and their radii. When this has been done, the coefficients λk are determined in a supervised way, as simply as in a linear regression. The coefficients can be limited if required, as in a ridge regression (see Section 11.7.2). This is known as ‘weight decay’.
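The global response formula translates directly into code (names are ours; the centres, radii and coefficients are assumed to have been learned already):

```python
import math

def rbf_response(x, centres, radii, coeffs):
    # sum over hidden units k of lambda_k * exp(-||x - o_k||^2 / (2 sigma_k^2))
    total = 0.0
    for o_k, sigma_k, lambda_k in zip(centres, radii, coeffs):
        sq_dist = sum((xi - oi) ** 2 for xi, oi in zip(x, o_k))
        total += lambda_k * math.exp(-sq_dist / (2 * sigma_k ** 2))
    return total

# two hidden units in a two-dimensional input space
centres = [(0.0, 0.0), (1.0, 1.0)]
radii = [0.5, 0.5]
coeffs = [1.0, -1.0]
print(rbf_response((0.0, 0.0), centres, radii, coeffs))  # close to +1 near the first centre
```

The fast decrease of the Gaussian is visible here: an observation far from every centre produces a response near 0, which is why the centres must cover the data.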

The number of units is generally specified by the user, even if the network can create others to improve the accuracy of the results. A sufficiently high number must be provided, generally more than in a multilayer perceptron, to enable the data structure to be modelled correctly. The units may commonly number several hundred. This is because the fast decrease of the Gaussian means that the RBF network has a lower extrapolation capacity when farther from the centres of the units of the hidden layer. These units must therefore be sufficiently numerous to ensure that at least one unit is activated for each observation; in other words, at least one radial function must have a non-negligible value in any region where data are present. This is an evident drawback of the RBF network as compared with the multilayer perceptron, even if it is also better protected against certain risky extrapolations found with the multilayer perceptron. In fact, the complexity of the RBF network increases exponentially with the number of input variables, because the radial function space has to be filled. As the number of variables increases, therefore, the calculation time of the RBF network increases, together with the number of observations needed for learning. This is one of its major weaknesses. It is therefore essential to select the input variables of an RBF network with great care.

When the number of units has been chosen, we must consider their centres. Some networks position the centres in a random way. However, the results can be improved by using the moving centres method (see Section 9.9.1) or Kohonen networks (see below) to divide the space into clusters (partitions) according to the distribution of the data. Thus, if the data are distributed in packets, the centres of these packets will be chosen as the centres of the RBF network. Also, more centres will be positioned in areas with a high observation density (input adaptation), or in areas where the result to be predicted varies more rapidly (output

Table 8.1 Comparison of MLP and RBF networks.

                                      MLP                                  RBF
‘Weight’                              Weight pi                            Centre oi
Hidden layer(s):
  combination function                Scalar product Σi pi xi              Euclidean distance Σi (xi − oi)²
  transfer function                   Logistic σ(X) = 1/(1 + exp(−X))      Gaussian G(X) = exp(−X²/2σ²)
Number of hidden layers               ≥ 1                                  = 1
Output layer:
  combination function                Scalar product Σk pk xk              Linear combination of Gaussians
                                                                           Σk λk Gk (see below)
  transfer function                   Logistic σ(X) = 1/(1 + exp(−X))      Linear function f(X) = X
Speed                                 Faster in ‘model application’ mode   Faster in ‘model learning’ mode
Advantage                             Better generalization                Less risk of non-optimal convergence


adaptation), in order to reduce the output error. Output adaptation is not very often used in

learning, but if it is used it gives rise to the problem of not always being compatible with the

input adaptation. This is because the response distribution (the dependent variable) may not

coincide with the data density distribution, and may lead to the determination of other centres

for the radial function.
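As a concrete illustration of positioning the centres by the moving centres method, here is a minimal numpy sketch (the function name, defaults and random initialization are mine, not the book's); the centres migrate towards the packets of data, so dense areas end up holding centres, as in input adaptation:

```python
import numpy as np

def moving_centres(X, n_centres, n_iter=20, seed=0):
    """Basic moving-centres (k-means) pass to position RBF centres:
    assign each observation to its nearest centre, then move each
    centre to the mean of its cluster; dense packets attract centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=n_centres, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(n_centres):
            members = X[labels == k]
            if len(members):            # leave an empty cluster's centre in place
                centres[k] = members.mean(axis=0)
    return centres
```

If the data fall into well-separated packets, each returned centre settles on one packet's mean, which is exactly the behaviour described above.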

With the exception of the possible application of output adaptation, the search for the

centres is unsupervised and can also be carried out on observations for which the response is

not always known, since it is then simply a matter of estimating the data probability density. It

may be useful to be able to train the RBF network on observations for which the dependent

variable is not always known: this enables us to use a larger number of observations. In this

case, we speak of semi-supervised learning. This is used whenever the collection of labelled

observations is more difficult or costly than the collection of unlabelled observations.

However, this semi-supervised learning has the drawback of being more sensitive to noise.

When areas of high density are sought in order to determine their centres without considering

the dependent variable, the input variables which are not related to the dependent variable are

not distinguished from the related variables, and may introduce noise.

The final aspect of parameter setting relates to the radii of the units of the hidden layer,

which are the standard deviations of the Gaussian distributions. A simple solution is to choose

radii equal to twice the mean distance between centres. If they are too large, the network will

lack structural detail and its precision will be reduced. If they are too small, the space will be

poorly covered by the Gaussian surfaces, and the network will have to interpolate between

these surfaces, which will decrease the capacity for generalizing the results of the learning

phase. As for the centres, the radii will be chosen so that they are smaller in areas with a high

density of observations, or in areas in which the result to be predicted varies more rapidly.

There are several ways of determining the radii as precisely as possible: a useful method finds

the k nearest neighbours (see Section 11.2), and examines each unit centre to see where its k

nearest neighbours are located (k is chosen appropriately by the user), and the mean distance

to these k nearest neighbours is taken to be the radius. This method has the merit of adapting to

the structure of the data. The radii are not necessarily equal to each other, but this is sometimes

assumed to decrease the number of network parameters.
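The k-nearest-neighbour rule for the radii described above can be sketched directly (an illustrative numpy version, with k chosen by the user):

```python
import numpy as np

def knn_radii(centres, k):
    """Radius of each unit = mean distance from its centre to its k
    nearest neighbouring centres, so the radii adapt to local density."""
    d = np.sqrt(((centres[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2))
    d_sorted = np.sort(d, axis=1)       # column 0 is the zero self-distance
    return d_sorted[:, 1:k + 1].mean(axis=1)
```

Centres packed into a dense area lie close to their neighbours, so they automatically receive the smaller radii that the text recommends there.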

Compared with the MLP network, the RBF network has the major advantage of needing

only a single hidden layer, and using linear combination and transfer functions in the output

layer in most cases – except in certain sophisticated variants (see Table 8.1). This makes for

faster learning and far fewer problems of complicated parameter adjustment for the user. It

also avoids the risk, inherent in the back-propagation mechanism of the multilayer perceptron,

of convergence towards a locally, but not globally, optimal solution. From this point of view,

the sequential search for the centres of the radial functions, their radii, and then the

coefficients of their linear combination is an advantage: it provides greater simplicity and

faster learning, and decreases the risk of overfitting by comparison with a search for global

optimization by gradient descent.

The weakness of the radial basis function network, compared with the multilayer perceptron, is that

it may need a large number of units in its hidden layer, which increases the execution time of

the network without always yielding perfect modelling of complex structures and irregular

data. This happens when the number of input variables is too large, and it is desirable to reduce

this number as far as possible. This problem is due to the fact that the RBF network, even more

than the multilayer perceptron, requires a learning set which covers all the configurations and

all the categories of variables which may be found when it is applied to the whole population


to be studied. The advantages and disadvantages of the RBF network tend to be those that are

generally found in networks for probability density estimation. The MLP network offers the

best generalization capacity, especially for noisy data.

8.7.3 The Kohonen network

The Kohonen network is the most widely used unsupervised learning network. It can also be

called a self-adaptive or self-organizing network, because it ‘self-organizes’ around the data.

Other synonyms are ‘Kohonen map’ and ‘self-organizing map’.

Like any neural network, it is made up of layers of units and connections between these

units. The major difference from the networks described above is that there is no variable to be

predicted. The purpose of the network is to ‘learn’ the structure of the data so that it can

distinguish clusters in them.

The Kohonen network is composed of two levels (Figure 8.10):

• the input layer, with a unit for each of the n variables used in the clustering;

• an output layer, whose units are arranged as a generally square or rectangular (sometimes hexagonal) grid of l × m units (in some cases l, m ≠ n), each of these l × m units being connected to each of the n units of the input layer, the connection having a certain weight pijk (i ∈ [1, l], j ∈ [1, m], k ∈ [1, n]).

The units of the output layer are not interconnected, but a distance is defined between them,

such that we can speak of the ‘neighbourhood’ of a unit.

Figure 8.10 Kohonen network.


The units of the input layer correspond to the variables of the individuals to be clustered,

and this layer is used to present the individuals; the states of its units are the values of the

variables characterizing the individuals to be clustered. This is why this layer contains n units,

where n is the number of variables used in the clustering.

The grid on which the output units are placed is called the ‘topological map’. The shape

and size of this grid are generally chosen by the user, but they may also change in the course of

learning. Each output unit (i, j) is associated with a weight vector (pijk), k ∈ [1, n], and therefore the response of this unit to an individual (xk), k ∈ [1, n], is, by definition, the Euclidean distance

dij(x) = Σ (xk − pijk)²   (the sum running over k = 1, …, n).
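Finding the unit with the minimum response is then a one-liner. A small numpy sketch, assuming the weights pijk are stored in an array of shape (l, m, n):

```python
import numpy as np

def winning_unit(weights, x):
    """Grid position (i, j) of the output unit closest to the individual x,
    using dij(x) = sum over k of (x_k - p_ijk)^2; weights has shape (l, m, n)."""
    d = ((weights - x) ** 2).sum(axis=2)         # dij(x) for every unit
    return np.unravel_index(d.argmin(), d.shape)
```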

So how does a Kohonen network learn? First of all, the weights pijk are initialized

randomly. Then the responses of the l × m units of the output layer are calculated for each individual (xk) in the learning sample. The unit chosen to represent (xk) is the unit (i, j) for which dij(x) has the minimum value. We say that this unit is ‘activated’ (Figure 8.11). This unit

and all the neighbouring units have their weights adjusted to bring them closer to the

Figure 8.11 Activation of a unit of a Kohonen network.


individual at the input. For example, the neighbouring units of (i, j) are the eight units (i − 1, j), (i + 1, j), (i, j − 1), (i, j + 1), (i + 1, j + 1), (i + 1, j − 1), (i − 1, j + 1), (i − 1, j − 1). The

size of the neighbourhood generally decreases during learning: at the beginning, the

neighbourhood can be the whole grid; by the end, it may be reduced to the unit itself. These

adjustments form part of the network parameters.

The new weights of a neighbour (I, J) of the ‘winner’ (i, j) are

pIJk + Y · f(i, j; I, J) · (xk − pIJk)   for every k ∈ [1, n],

where f(i, j; I, J) is a decreasing function of the distance between the units (i, j) and (I, J), such that f(i, j; i, j) = 1. It may, for example, be a Gaussian function: exp(−distance(i, j; I, J)²/2σ²). The parameter Y ∈ [0, 1] is a learning rate which, as in the case of a multilayer perceptron, changes during learning by decreasing linearly or exponentially.
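One adjustment step can then be sketched as follows (an illustrative numpy version, not the book's code; the learning rate and the neighbourhood width sigma are passed in explicitly so that they can be decreased between presentations, and f is the Gaussian neighbourhood function of the text):

```python
import numpy as np

def kohonen_step(weights, x, rate, sigma):
    """One learning step: find the winner (i, j), then move every unit (I, J)
    towards x by rate * f(i,j;I,J) * (x - p_IJ), where
    f = exp(-grid_distance^2 / (2 sigma^2)), so f(i,j;i,j) = 1.
    Updates the weight array in place and returns it."""
    d = ((weights - x) ** 2).sum(axis=2)
    i, j = np.unravel_index(d.argmin(), d.shape)
    l, m, _ = weights.shape
    I, J = np.meshgrid(np.arange(l), np.arange(m), indexing='ij')
    f = np.exp(-((I - i) ** 2 + (J - j) ** 2) / (2.0 * sigma ** 2))
    weights += rate * f[:, :, None] * (x - weights)
    return weights
```

Shrinking sigma towards 0 over the presentations reduces the neighbourhood to the winner alone, and decreasing the rate freezes the map, as described above.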

It is the extension of the weight adjustment to the whole neighbourhood of the ‘winning’

unit that brings the neighbouring units of (i,j) close to the individual (xk) at the input, and

enables the individuals that are close together in variable space to be represented by identical

or neighbouring units in the layer, just as neighbouring neurons respond to nearby stimuli in

the cerebral cortex. The whole process takes place as though the Kohonen network was made

of rubber and was deformed to make the cloud of individuals pass over it while approaching as

closely as possible to the individuals. By contrast with the factor plane (see Section 7.1), the

projection concerned is non-linear.

When all the individuals in the learning sample have been presented to the network and all

the weights have been adjusted, the learning is complete.

To summarize, during the network’s learning:

• For each individual, only one output unit (the ‘winner’) is activated.

• The weights of the winner and its neighbours are adjusted.

• The adjustment is such that two closely placed output units correspond to two closely placed individuals.

• Groups (clusters) of units are formed at the output.

In the application phase, the Kohonen network operates by representing each input

individual by the unit of the network which is closest to it in terms of the distance defined

above. This unit will be the cluster of the individual.

This algorithm has some similarities with the moving centres and k-means methods (see

Section 9.9.1). However, there is an important difference. In the k-means method, the

introduction of a new individual into a cluster only results in the recalculation of the centre

of gravity of the cluster, without any effect on the other centres of gravity. But the introduction

of a new individual into a Kohonen network results in the adjustment of not just the unit

nearest to the individual, but also the neighbouring units. The neighbourhood of the ‘winner’

unit is significant, while the neighbourhood of the ‘winner’ centre of gravity is not.

Another major difference between Kohonen networks and the moving centres and k-means

methods is that, unlike these methods, the Kohonen clustering takes place by reducing the

number of dimensions of the variable space, as in factor analysis, the new working space

generally being of dimension 2, as in my description, or, exceptionally, of dimension 3 or 1.
