
Analytica Chimica Acta 348 (1997) 495-501

Linear data projection using a feedforward neural network

Peter Cleij*, Ronald Hoogerbrugge

Division Analytical Chemical Laboratories, National Institute of Public Health and the Environment, P.O. Box 1, 3720 BA Bilthoven, The Netherlands

*Corresponding author.

Received in revised form 8 January 1997; accepted 9 January 1997

Abstract

Linear data projection methods provide useful tools in exploratory data analysis and pattern recognition. The most widely used methods in this respect are principal components analysis (PCA) and linear discriminant analysis (LDA). These methods are able to condense multidimensional information into two (or three) dimensions, allowing us to analyze the underlying structure of the data by visual inspection. This paper describes an alternative linear data projection method using a feedforward neural network. The method uses category information to obtain a projection maximally separating the predefined classes of data patterns (supervised data projection). The projection is constructed by training a feedforward network consisting of two subnetworks. The first subnetwork receives the training input and performs the actual projection, while the second one performs the desired classification using the output of the first subnetwork. After training, the projection subnetwork provides the 2D (or 3D) projection map, whereas the complete network behaves as an ordinary classification network. As an example, the method is applied to a data set consisting of 13 dioxin concentrations for 145 samples of cow's milk, originating from the vicinity of five different sources of dioxin contamination. 2D projection maps were constructed using PCA, LDA and the proposed neural network with different classification subnets and starting values of the network weights. The neural network approach produced a number of significantly different mappings, most of them showing an improved class separation as compared to PCA and LDA.

Keywords: Data projection; Neural network; Principal component analysis; Linear discriminant analysis; Pattern recognition; Dioxin

1. Introduction

Data projection methods are able to condense high-dimensional data into two (or three) dimensions, allowing us to analyze the underlying structure and clustering tendency of the data by visual inspection. Applied this way, data projection methods provide useful tools in exploratory data analysis and pattern recognition. The most widely used projection methods are linear methods: principal components analysis (PCA) and various types of linear discriminant analysis (LDA). PCA aims at retaining maximum variation among the data and requires no category information about the data patterns considered (unsupervised data projection). In contrast, LDA uses such category information to obtain a projection maximally separating the predefined classes of data patterns (supervised data projection).

Artificial neural networks are data processing systems with applications in various fields. Among these applications, the use of feedforward networks (trained according to the least mean squared error principle) as classifiers is the best known.

Less well known is the use of feedforward neural networks for data projection. Data projection with these networks is achieved by taking the number of nodes of some hidden layer equal to the desired dimensionality of the reduced data space. The outputs of these hidden nodes then form the representation in the reduced data space of the data pattern presented as input to the network. The type of target output patterns used during training determines whether an unsupervised or a supervised type of projection is achieved.

Unsupervised data projection/reduction is achieved using a so-called auto-associative network. A network of this type has an equal number of input and output nodes. It is trained using the data patterns to be projected both as input patterns and as target output patterns. Applications of this technique are found, for instance, in the field of image compression [1-4]. In the case of a three-layer linear network it can be shown that the network performs a kind of principal components analysis [5-7]. The output of the n nodes of the hidden layer is a projection of the input onto the subspace spanned by the first n principal components of the input.
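As a concrete illustration of this equivalence, the following Python/numpy sketch (our illustration, not code from the paper; the data and parameter values are hypothetical) trains a small linear auto-associative network by gradient descent and checks that its hidden layer spans approximately the same subspace as the leading principal components.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical data with low-dimensional structure: 200 patterns, 8 variables.
    X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(200, 8))
    X -= X.mean(axis=0)                      # centre the data, as PCA assumes

    n = 2                                    # dimensionality of the reduced space
    W1 = rng.normal(scale=0.1, size=(8, n))  # input -> hidden (the projection)
    W2 = rng.normal(scale=0.1, size=(n, 8))  # hidden -> output (the reconstruction)

    for _ in range(2000):                    # least-mean-squared-error training
        H = X @ W1                           # hidden outputs = reduced representation
        E = H @ W2 - X                       # error: targets equal the inputs
        gW2 = H.T @ E / len(X)
        gW1 = X.T @ (E @ W2.T) / len(X)
        W2 -= 0.01 * gW2
        W1 -= 0.01 * gW1

    # Compare the learned subspace with the leading principal components.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Q, _ = np.linalg.qr(W1)                  # orthonormal basis of the learned projection
    print(np.linalg.norm(Vt[:n] @ Q))        # approaches sqrt(n) as the subspaces align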

Supervised data projection is achieved using (binary or bipolar) category codes as the target output patterns during training, as when training feedforward networks as classifiers. An example of such a network is the linear network of Gallinari et al. [8,9]. They showed that with a number of hidden nodes equal to the number of predefined classes minus one, the hidden layer of a three-layer linear network performs a type of linear discriminant analysis, in which the ratio of the determinants of the total covariance matrix and the between-class covariance matrix of the projected data patterns is optimized.
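For reference, this criterion can be evaluated directly. The hypothetical helper below (our sketch; the exact normalization conventions are our choice) computes the ratio of the determinants of the between-class and total covariance matrices for projected patterns Y with class labels y; under this convention, larger values indicate better-separated class means relative to the overall scatter.

    import numpy as np

    def det_ratio(Y, y):
        # Total covariance of the projected patterns.
        m = Y.mean(axis=0)
        S_T = np.cov(Y, rowvar=False, bias=True)
        # Between-class covariance: weighted scatter of the class means.
        S_B = np.zeros_like(S_T)
        for c in np.unique(y):
            Yc = Y[y == c]
            d = (Yc.mean(axis=0) - m)[:, None]
            S_B += (len(Yc) / len(Y)) * (d @ d.T)
        return np.linalg.det(S_B) / np.linalg.det(S_T)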

Webb and Lowe [10] studied the projection provided by the output of the last hidden layer of networks with linear output nodes and one or more hidden layers. They showed that this projection, a non-linear projection in the case of non-linear hidden nodes, is such that the network discriminant function, as applied to the output of the last hidden layer, is maximized. It was demonstrated that, when applying the usual coding scheme to represent class membership, the network discriminant function measures a type of class separation which is strongly biased in favor of the larger classes. The network of Gallinari et al. was recognized as a special case in which the network discriminant function is equivalent to the ratio-of-determinants criterion (see above).

Webb and Lowe also showed that with an alternative coding scheme (or an equivalent weighting of the network errors) the strong bias in favor of the larger classes is circumvented. The network discriminant function in this case directly relates to the total covariance and between-class covariance matrices of the projected data patterns as used in linear discriminant analysis.

Mao and Jain [11,12] studied the non-linear projection provided by the last hidden layer of feedforward networks with sigmoidal activation used in all layers, including the output layer.

The neural network based projection method presented in this paper is of the linear type and as such competes with PCA and the LDA methods.

1.1. Linear-projection/classification network

Linear data projection is performed by a two-layer feedforward network with linear activation used in all nodes of the output layer (although a network of this type cannot be trained as such to provide a particular projection). Classification is usually achieved by a three- or four-layer network with sigmoid activation for the nodes of hidden and output layers.

Combining two networks of either type, feeding the m output signals of the projection network forward to the input nodes of the classification network, results in a four- or five-layer network with linear activation in the first hidden layer and sigmoidal activation in the subsequent hidden layers and output layer. In this network the output nodes of the projection network combine with the input nodes of the classification network to form the nodes of the first hidden layer (see Fig. 1).

Training such a linear-projection/classification (LP/C) network as an ordinary classification network amounts to a search for a projection optimized for classification by the classification subnetwork, minimizing the network's mean squared error in this way.

Stripping the classification part of the network after training leaves a two-layer network performing the desired linear projection. When choosing m equal to 2 (or 3), a linear projection for visualization purposes is obtained.

Fig. 1. Example of a linear-projection/classification (LP/C) network.

It is expected that an LP/C network provides a linear projection with a generally better class separation than obtained by LDA. In a three-layer linear projection network, providing a kind of linear discriminant analysis (see [10]), the classification subnetwork actually consists of a simple linear perceptron. In an LP/C network this perceptron is replaced by a full multi-layer classification network. So, it might be expected that the increased classification ability of the classification subnetwork will also allow for an improved projection as compared to that obtained by linear discriminant analysis.

Note that the architecture of the LP/C network is easily extended to the forms of non-linear projection studied by Webb and Lowe and by Mao and Jain. This is achieved by replacing the linear projection subnetwork with a non-linear network with sigmoidal activation and zero or more hidden layers.

2. Example data set

At the end of the 1980s, dioxins and furans appeared to be present in Dutch cow's milk from the vicinity of some sources at levels which are high in comparison to toxicological information [13]. Therefore the Dutch authorities initiated a comprehensive study on dioxin levels in cow's milk near incinerators and other potential sources in the country. The government also set a limit on the dioxin level in dairy products for human consumption of 6 pg TEQ/g of fat, based on the tolerable daily intake (TDI).

This study resulted in a comprehensive data set on which principal components analysis was used for pattern recognition, showing that differences between sources can be identified without any kind of supervised classification. To study the potential of the various pattern recognition techniques, a subset of these data was selected, representing milk samples from the vicinity of one of five sources. These are three municipal waste incinerators (Lickebaert, Zaandam and Duiven), a metal reclamation plant (Culemborg) and the presence of mushroom farms (Bommelerwaard).

2.1. Data pre-treatment

The influence of the individual compounds is standardized by dividing the levels by the standard deviation of repeated analyses of a control sample. To obtain relative patterns, the individual samples are normalized by scaling the sum of the levels of the PCDDs and PCDFs in each sample to the same value.
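A minimal Python/numpy sketch of this pre-treatment, assuming a 145 x 13 matrix X of levels and a vector s of control-sample standard deviations (the array names and the target sum are ours; the paper does not state the constant):

    import numpy as np

    def pretreat(X, s, total=100.0):
        # Standardize each compound by the standard deviation of repeated
        # analyses of a control sample.
        Z = X / s
        # Scale each sample so the summed PCDD/PCDF levels are equal,
        # yielding relative patterns.
        return total * Z / Z.sum(axis=1, keepdims=True)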

3. Experimental

3.1. Principal components analysis

Principal components analysis is performed using the DataDesk statistical program. Excel from Microsoft is used for data pre-treatment. The standardized scores are used to classify each sample to the nearest of the five class centroids.
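A sketch of this nearest-centroid rule in Python/numpy (the helper and variable names are ours), applied to the matrix of standardized scores:

    import numpy as np

    def nearest_centroid(scores, y):
        # Class centroids in the space of standardized component scores.
        classes = np.unique(y)
        centroids = np.array([scores[y == c].mean(axis=0) for c in classes])
        # Squared Euclidean distance of every sample to every centroid.
        d2 = ((scores[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return classes[d2.argmin(axis=1)]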

3.2. Linear discriminant analysis

The linear discriminant analysis is performed by a double-stage principal component analysis [14]. From the result of the PCA, the scores on the 9 PCs which each describe more than 1% of the total variance are standardized to unit variance. Then the scores of each sample are replaced by the mean score of its class. On this data set of mean scores again a PCA is performed, which gives the 4 (number of classes minus 1) discriminant functions. Again the samples are classified using the distances to the class centroids. In calculating these distances the scores on the discriminant functions are weighted with their estimated within-class variance.
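Under our reading of this procedure, a Python/numpy sketch looks as follows (the function and variable names are ours; the within-class variance weighting used for the final classification is noted but not shown):

    import numpy as np

    def double_stage_pca(X, y):
        # Stage 1: PCA on the pre-treated data; keep the PCs that each
        # explain more than 1% of the total variance, and standardize
        # their scores to unit variance.
        Xc = X - X.mean(axis=0)
        _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        keep = (S**2) / (S**2).sum() > 0.01
        T = Xc @ Vt[keep].T
        T = T / T.std(axis=0)
        # Stage 2: replace each sample's scores by its class-mean scores
        # and run PCA on these means; with 5 classes this yields 4
        # discriminant axes.
        means = {c: T[y == c].mean(axis=0) for c in np.unique(y)}
        M = np.array([means[c] for c in y])
        _, _, Vt2 = np.linalg.svd(M - M.mean(axis=0), full_matrices=False)
        k = len(np.unique(y)) - 1
        # Scores on the discriminant functions; for classification these
        # would additionally be weighted by their within-class variance.
        return T @ Vt2[:k].T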

3.3. LP/C network

To evaluate the projection performance of an LP/C network for the dioxin data set, an in-house written C software package for the creation, training and operation of feedforward neural networks, running on an HP 9000/HP-UX computer, was used. The network used in this study is a four-layer feedforward network with 13 input nodes, two adaptive nodes in the first hidden layer, a variable number (k) of adaptive nodes in the second hidden layer and five nodes in the output layer. Bias nodes were used in both hidden layers. The identity function was used as activation function in the first hidden layer, while the tanh function was used in the second hidden layer and the output layer.
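A minimal Python/numpy sketch of this architecture (our illustration, not the authors' C package; the initialization rule anticipates the paragraph on weight initialization below, and all names are ours):

    import numpy as np

    rng = np.random.default_rng()

    def init(n_in, n_out):
        # Uniform weights in [-1/sqrt(n), +1/sqrt(n)], with n the number
        # of connections feeding forward into each node (cf. [15]).
        b = 1.0 / np.sqrt(n_in)
        return rng.uniform(-b, b, size=(n_in, n_out))

    def forward(X, W1, W2, W3):
        ones = np.ones((len(X), 1))
        P = X @ W1                                  # linear first hidden layer: the 2D projection
        H = np.tanh(np.hstack([P, ones]) @ W2)      # second hidden layer, bias node appended
        Y = np.tanh(np.hstack([H, ones]) @ W3)      # output layer, bias node appended
        return P, H, Y

    k = 5                                           # adaptive nodes in the second hidden layer
    W1 = init(13, 2)                                # 13 rescaled concentrations -> 2 projection nodes
    W2 = init(2 + 1, k)                             # +1 for the bias node of the first hidden layer
    W3 = init(k + 1, 5)                             # +1 for the bias node of the second hidden layer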

The (pre-treated) data for each dioxin concentration were linearly transformed to obtain rescaled concentrations with zero mean and unit variance, and presented as such to the network. The target output patterns used for training were bipolar category codes (e.g., category 2 represented by the code '-1 +1 -1 -1 -1').

Weight initialization was performed by generating for each weight a random number in the range -1/√n to +1/√n, where n is the number of connections which feed forward to the node to which the weight is assigned [15].

Training was done by gradient descent, using the backpropagation algorithm to calculate the gradient. After some experimentation, a learning rate of 0.005 and a number of training cycles of 500 were chosen.

Classification performance was measured by considering an output signal ≥0.2 as +1, and an output signal ≤-0.2 as -1.
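Continuing the sketch above, a plain batch-gradient-descent training loop with backpropagation, bipolar target codes and the ±0.2 decision band might look as follows (our illustration; the gradient scaling and batch convention are our assumptions, as the paper does not specify them):

    def bipolar_codes(y, n_classes=5):
        # Class c (counted from 0) -> a vector of -1's with +1 at position c,
        # e.g. the second of five classes becomes -1 +1 -1 -1 -1.
        T = -np.ones((len(y), n_classes))
        T[np.arange(len(y)), y] = 1.0
        return T

    def train(X, T, W1, W2, W3, lr=0.005, cycles=500):
        ones = np.ones((len(X), 1))
        for _ in range(cycles):
            P, H, Y = forward(X, W1, W2, W3)
            dY = (Y - T) * (1.0 - Y**2)              # tanh delta at the output layer
            dH = (dY @ W3.T)[:, :-1] * (1.0 - H**2)  # drop the bias column, tanh delta
            dP = (dH @ W2.T)[:, :-1]                 # identity activation: derivative is 1
            W3 -= lr * np.hstack([H, ones]).T @ dY / len(X)
            W2 -= lr * np.hstack([P, ones]).T @ dH / len(X)
            W1 -= lr * X.T @ dP / len(X)
        return W1, W2, W3

    def false_classifications(Y, T):
        # An output >= 0.2 counts as +1, an output <= -0.2 as -1; anything
        # in between matches neither target value.
        D = np.where(Y >= 0.2, 1.0, np.where(Y <= -0.2, -1.0, 0.0))
        return int((D != T).any(axis=1).sum())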

After training, the weights of the first hidden layer were used in a linear projection network with 13 input nodes and two output nodes. Next, a normalization procedure was applied to the weights of this network in order to correct for trivial differences between projection maps. This procedure consists of orthogonalization and normalization (to unit length) of the weight vectors of the two output nodes, followed by a rotation of the resulting weight vectors onto the two principal component axes.
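Under our reading, this normalization can be sketched in a few lines of Python/numpy (the names are ours; sign/reflection ambiguities of QR and SVD are not resolved here):

    import numpy as np

    def normalize_map(W1, X):
        # Orthogonalize the two weight vectors and scale them to unit length.
        Q, _ = np.linalg.qr(W1)
        # Rotate so the projected data lie along their own principal axes,
        # removing trivial rotation differences between maps.
        P = X @ Q
        _, _, Vt = np.linalg.svd(P - P.mean(axis=0), full_matrices=False)
        return Q @ Vt.T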

To study the effect of an increasing classification ability of the classification subnetwork, three LP/C networks were constructed with k = 3, 5 and 10. To study the effect of weight initialization, each of these networks was trained 20 times. This resulted in a total of 60 projection maps.

4. Results

4.1. Principal components analysis

The projections of the samples on the first two principal components are shown in Fig. 2. The classes are not all well separated, with some mixing of Zaandam and Lickebaert and of Duiven and Bommelerwaard.

The first two principal components describe 51.3 and 17.6% of the variance in the data set. The projection result confirms the earlier reported observations after application of PCA to a preliminary data set [13,16] that (1) cow's milk from dairy farms in the vicinity of metal reclamation plants contains relatively more furans than dioxins, when compared to milk from farms in the vicinity of waste incinerators (first PC) and (2) certain waste incinerators at different locations can be distinguished from each other on the basis of the relative amounts of lower (tetra and penta) and higher (hexa and hepta) chlorinated compounds (second PC).

Fig. 2. Projection map obtained by principal components analysis. The classes are Culemborg (x), Duiven (o), Lickebaert (+), Zaandam (A) and Bommelerwaard (I).

Table 1
Classification performance of the principal components and discriminant functions, expressed as the number of false classifications for the training set

Components                n=2    n=3    n=4
Principal components      22     15     14
Discriminant functions    16     6      6

The classification results are shown in Table 1; about 90% of the samples are correctly classified.

4.2. Linear discriminant analysis

The projections of the samples on the first two discriminant functions are shown in Fig. 3. The plot is very similar to that of the scores on the PCs, which corresponds to the fact that the first DF is heavily dominated by the first PC and the second DF by the second PC.

The classification on the basis of two DFs is slightly better than the classification based on the first two PCs. The quality of classification improves considerably when the third DF is also included. This is mainly because the third DF separates the classes of Zaandam and Lickebaert quite well, which reduces the number of misclassifications between them.

Fig. 3. Projection map obtained by linear discriminant analysis. See Fig. 2 for symbols.

Table 2
Classification performance of the three networks, expressed as the number of false classifications for the training set

Network      k=3    k=5    k=10
Average      5.3    1.7    1.6
Minimum      1      1      0
Maximum      11     4      6

4.3. LP/C network

Table 2 shows the classification performance of the three LP/C networks for the training set, expressed as the average, minimum and maximum number of false classifications (out of 145) for 20 training sessions.

The results show a good classification behavior of the LP/C network, with similar performance for k=5 and k=10, and a somewhat lesser performance for k=3.

Inspection of the projection maps showed a variety of mapping solutions which, however, could easily be grouped into sets of similar solutions by visual judgment. The mappings of four of these sets showed a more or less good separation of all classes. The mappings of the two other sets showed a moderate separation, with the class Bommelerwaard mixing with one or more of the other classes. Table 3 gives some details of the sets of similar mapping solutions.

Figs. 4 and 5 show projection maps from sets 1 and 6, both with a good projection quality.

An interesting feature of both projection maps is the outlying data point of the Lickebaert class. Inspection of the original data revealed that the corresponding milk sample showed a strongly deviating concentration for one of the dioxins. Note that the LDA projection does not show this outlying observation (the PCA map shows it less pronounced).

Table 3 also gives an indication of the relation between the number of false classifications for the training set and the projection quality. Both sets with mappings of moderate projection quality also show the largest number of false classifications on average. However, this relation does not always hold in individual cases. Some mappings combine a relatively large number of false classifications with a good projection quality. Mappings with a low number of false classifications sometimes show only a moderate projection quality.

Table 3
Details of the sets of similar mapping solutions, showing the set size (total and split according to the network responsible for the mapping, k=3, 5, 10), the projection quality and the average number of false classifications (FC)

Set    Size    k=3    k=5    k=10    Quality     Av. FC
1      31      8      10     13      Good        1.5
2      17      7      7      3       Moderate    4.7
3      6       1      1      4       Good        2.0
4      3       2      1      0       Good        3.0
5      2       2      0      0       Moderate    11.0
6      1       0      1      0       Good        1.0

Fig. 4. Projection map from solution set 1 (good separation), obtained by an LP/C network with three adaptive nodes in the last hidden layer. Classification performance for the training set: one false classification out of 145. See Fig. 2 for symbols.

Fig. 5. Projection map from solution set 6 (good separation), obtained by an LP/C network with five adaptive nodes in the last hidden layer. Classification performance for the training set: one false classification out of 145. See Fig. 2 for symbols.

4.4. Discussion

The projection results show that the neural network approach may lead to mappings with significantly better class separation than obtained by PCA and LDA. This confirms the expected advantages of a projection driven by a full multi-layer classification network.

The neural network approach also demonstrated the ability to discover multiple (good) solutions to the linear projection problem. This may be considered a (positive) effect of the well-known problem of local minima.

Furthermore, it seems that the neural network approach is less sensitive than LDA to outlying observations, in the sense that such observations are treated as such and do not distort the class separation (too much). This probably relates to the fact that during training a multi-layer classification network (with sigmoidal activation) mainly focuses on class boundaries. Observations at some distance from the class boundaries, including outlying ones, have less influence on the training results.

5. Conclusions

A neural network architecture for supervised linear data projection is presented. This linear-projection/classification network consists of a projection subnet and a classification subnet and is trained as a common feedforward classification network. The use of a full multi-layer feedforward network as the classification subnet enables high quality data projection, and also provides high quality classification based on the projected data.

The application to an example data set of dioxin levels in cow's milk shows that projections are found with an improved class separation as compared to projections obtained by PCA and LDA. The neural network approach was also able to discover different mappings with good class separation.

However, the neural network approach results in a far more laborious procedure for data projection than applying PCA or LDA. At least some of the large number of options involved in a neural network approach have to be explored. Such options, which all influence the results, concern the details of the network architecture, the weight initialization method and the training method. The presented neural network approach to linear data projection is therefore best used as a second option, in case PCA or LDA does not provide satisfying results.

References

[1] G.W. Cottrell, P. Munro and D. Zipser, Proceedings of the 9th Annual Conference of the Cognitive Science Society, Seattle, 1987, p. 462.
[2] N.P. Walker, S.J. Eglen and B.A. Lawrence, GEC J. Res., 11 (1994) 66.
[3] M.A. Abidi, S. Yasuki and P.B. Crilly, IEEE Trans. Consumer Electronics, 40 (1994) 796.
[4] A. Rios and M. Kabuka, Math. Comput. Modelling, 21 (1995) 159.
[5] H. Bourlard and Y. Kamp, Biol. Cybernetics, 59 (1988) 291.
[6] G.W. Cottrell and P. Munro, SPIE Vol. 1001, Visual Communications and Image Processing, 1988, p. 1070.
[7] P. Baldi and K. Hornik, Neural Networks, 2 (1989) 53.
[8] P. Gallinari, S. Thiria and F. Fogelman-Soulie, IEEE Ann. Int. Conf. Neural Networks, 1988, p. I-391.
[9] P. Gallinari, S. Thiria, F. Badran and F. Fogelman-Soulie, Neural Networks, 4 (1991) 349.
[10] A.R. Webb and D. Lowe, Neural Networks, 3 (1990) 367.
[11] J. Mao and A.K. Jain, Proceedings of the IEEE International Conference on Neural Networks, San Francisco, CA, March 1993, Vol. 1, p. 300.
[12] J. Mao and A.K. Jain, IEEE Trans. Neural Networks, 6 (1995) 296.
[13] A.K.D. Liem, R. Hoogerbrugge, P.R. Kootstra, E.G. van der Velde and A.P.J.M. de Jong, Chemosphere, 23 (1991) 1675.
[14] R. Hoogerbrugge, S.J. Willig and P.G. Kistemaker, Anal. Chem., 55 (1983) 1710.
[15] J. Hertz, A. Krogh and R.G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, Redwood City, CA, 1992, p. 129.
[16] R. Hoogerbrugge, A.P.J.M. de Jong, A.K.D. Liem, P.R. Kootstra and H.A. van 't Klooster, in K. Seip and B. Vigerust (Eds.), Multivariate Statistical Techniques for Environmental Sciences, CEC Water Pollution Research Report 22, E. Guyot SA, Brussels, 1990, p. 170.