2013 Sixth International Conference on Advanced Computational Intelligence
October 19-21, 2013, Hangzhou, China
Evolving Neural Network Ensembles Using Variable String Genetic
Algorithm for Pattern Classification
Xiaoyang Fu and Shuqing Zhang
Abstract-In this paper, an evolving neural network ensembles (ENNE) classifier using a variable string genetic algorithm (VGA) is proposed. For neural network ensembles (NNE) trained with the regularized negative correlation learning (RNCL) algorithm, two improvements are adopted: the first is to evolve the appropriate architecture and initial connection weights of the NNE using the VGA algorithm; the second is to optimize the regularization parameter automatically based on gradient descent while evolving the NNE's weights. The effectiveness of the ENNE classifier is demonstrated on a number of benchmark data sets. Compared with the back-propagation multilayer perceptron (BP-MLP) classifier and the NNE classifier with the RNCL algorithm, the ENNE classifier with the hybrid VGA and RNCLgd algorithm shows better classification performance.
I. INTRODUCTION
Neural network classifiers [1] and NNE classifiers [4] have been widely applied to pattern classification. The most widely used neural network model is the multi-layer perceptron (MLP), in which the connection weights are normally trained by the back-propagation (BP) learning algorithm [6]. In the BP learning algorithm, the error is minimized when the network outputs match the desired outputs; the weights are adjusted to reduce the error. The essential character of the BP algorithm is gradient descent, which depends strictly on the shape of the error surface and may therefore become trapped in local minima [5].
A neural network ensemble is a combination of a set of neural networks that tries to cope with a problem in a robust and efficient way [8]. Negative correlation learning (NCL) [10] is a neural network ensemble learning algorithm that introduces a correlation penalty term into the error function of each individual network, so that each neural network minimizes its mean square error (MSE) together with its correlation with the ensemble, and thus alleviates the over-fitting problem to some extent.
Furthermore, the regularized negative correlation learning (RNCL) algorithm [3] incorporates an additional regularization term into the ensemble, with a regularization parameter used to control the trade-off between the MSE and the regularization. This is crucial to improving the ensemble's generalization ability. The RNCL algorithm can be used for pattern classification, regression analysis and load forecasting [11].

Xiaoyang Fu is with the Department of Computer Science and Technology, Zhuhai College of Jilin University, Zhuhai 519041, China (email: [email protected]).
Shuqing Zhang is with the Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences, Changchun 130012, China (email: [email protected].).
Genetic algorithms [2] are randomized optimization techniques guided by the principles of evolution and natural genetics. They are efficient, adaptive and robust in the search process, producing near-optimal solutions and handling large, highly complex and multimodal spaces.

In the RNCL algorithm [3], the initial weights and architecture of the NNE are set to random values or selected empirically, and the regularization parameter is optimized by the cross-validation (RNCLcv) or Bayesian inference (RNCLBI) method. In this paper, we propose improvements in two respects: the first is to evolve the appropriate architecture and initial weights of the NNE using the VGA algorithm; the second is to optimize the regularization parameter automatically based on gradient descent while evolving the NNE's weights.
The rest of this paper is organized as follows: Section II introduces the VGA algorithm, neural network ensembles with the RNCL algorithm, and regularization parameter optimization based on gradient descent. Section III presents the results of the ENNE classifier on a number of benchmark data sets and compares them with the BP-MLP and NNE classifiers. Finally, Section IV concludes the paper.
II. METHODOLOGY
A. VGA algorithm [7]
For a three-layer neural network, the number of input nodes equals the dimensionality of the feature space, while the number of output nodes equals the number of pattern categories; both can be obtained from the training samples. Thus, the architecture of a three-layer neural network depends only on the number of hidden nodes. In order to evolve the architecture of a three-layer neural network using the VGA algorithm, the hidden nodes and connection weights are encoded in a string chromosome.
1) Architecture and weights encoding
Let $A = (A_1, A_2, \ldots, A_i, \ldots, A_M)$, where $A$ is a population made of $M$ individuals $(A_1, \ldots, A_M)$ and

$A_i = (l_i, W_i)$    (1)

In formula (1), $l_i$ and $W_i$ represent the number of hidden nodes and the connection weights of the $i$-th individual, respectively.
By a convenient rule of thumb [5], the number of hidden nodes is chosen such that the total number of weights in the network is roughly $n/10$, or somewhat more, but not more than the total number of training samples $n$. For a three-layer neural network, the total number of weights is $l(m+q)$, where $m$ is the number of feature dimensions and $q$ is the number of pattern categories, so the maximum number of hidden nodes satisfies $l_{\max} \le n/(m+q)$, and the number of hidden nodes $l$ can be chosen from the range $[l_{\max}/10, l_{\max}]$. The number of hidden nodes is encoded in binary, while each weight is encoded as a real number randomly chosen from the range $[-1, +1]$ at initialization, as sketched below.
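As an illustration, the encoding above can be sketched in a few lines of Python. The helper name `random_chromosome` and the NumPy array representation are our own assumptions, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_chromosome(m, q, n):
    """Hypothetical sketch of one individual A_i = (l_i, W_i).

    m: feature dimensions, q: pattern categories, n: training samples.
    """
    l_max = n // (m + q)                                    # l_max <= n / (m + q)
    l = int(rng.integers(max(1, l_max // 10), l_max + 1))   # l in [l_max/10, l_max]
    # One real-coded gene per connection: each hidden node carries m input
    # weights and q output weights, initialized uniformly in [-1, +1].
    weights = rng.uniform(-1.0, 1.0, size=l * (m + q))
    return l, weights

l, W = random_chromosome(m=4, q=3, n=120)   # Iris-like dimensions
print(l, W.shape)
```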
2) Genetic operators: Reference [9] pointed out that it is generally very difficult to apply a crossover operator when evolving connection weights, since crossover tends to destroy the feature detectors found during the evolutionary process. Thus, this paper adopts only the selection and mutation operators.
Selection: the roulette wheel selection procedure has been
adopted to implement a proportional selection strategy.
Mutation: a basic mutation operator is adopted. For the hidden nodes, each bit of each individual is selected with probability $\mu_m$ and flipped from 1 to 0, or vice versa. For the connection weights, each weight of an individual $W_i$ is selected with probability $\mu_m$ and replaced by a real number drawn randomly from $[-1, +1]$.
Suppose the number of hidden nodes $l_i$ of individual $A_i$ changes to $l_i'$ after the mutation operator; three cases then arise:
a) $l_i = l_i'$: the weights of individual $A_i$ need not be adjusted;
b) $l_i > l_i'$: the weights of $A_i$ belonging to hidden nodes beyond $l_i'$ must be deleted;
c) $l_i < l_i'$: some weight terms must be added so that the string length matches $l_i'$; the inserted weight values are drawn randomly from $[-1, +1]$ (see the sketch following this list).
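A minimal sketch of this mutation, including the length-adjustment cases b) and c), might look as follows; the function name `mutate`, the bit-flip scheme for the hidden-node count, and the clamping of $l'$ to a valid range are illustrative assumptions consistent with the description above:

```python
import numpy as np

rng = np.random.default_rng(1)

def mutate(l, weights, mu_m, l_max, m, q):
    """Hypothetical sketch of the VGA mutation operator."""
    # Bit-flip mutation on the binary code of the hidden-node count.
    l_new = l
    for b in range(l_max.bit_length()):
        if rng.random() < mu_m:
            l_new ^= 1 << b                     # flip bit b
    l_new = min(max(l_new, 1), l_max)           # clamp to a valid range (assumed)
    # Weight mutation: replace each selected gene by a value from [-1, +1].
    mask = rng.random(weights.shape) < mu_m
    weights = np.where(mask, rng.uniform(-1.0, 1.0, weights.shape), weights)
    # Adjust the string length to the mutated architecture (cases b and c).
    target = l_new * (m + q)
    if target < weights.size:                   # b) delete surplus weights
        weights = weights[:target]
    elif target > weights.size:                 # c) append random weights
        weights = np.concatenate(
            [weights, rng.uniform(-1.0, 1.0, target - weights.size)])
    return l_new, weights
```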
3) Fitness computation: For each individual with hidden nodes $l_i$ and weights $W_i$, the fitness function is defined as:

$f_i = \frac{1}{1 + \mathrm{MSE}_i} - \beta \cdot \frac{l_i}{l_{\max}}$    (2)
where $f_i$ is the fitness of the $i$-th individual, $l_i$ is its number of hidden nodes, and $\beta$ is a user-defined architecture coefficient in the range $[0, 1]$. Maximizing the fitness therefore minimizes the MSE, while the term $\beta \cdot l_i / l_{\max}$ forces the minimization of the number of hidden nodes.

Now, the basic steps of the VGA algorithm are described as follows:
a) Randomly construct an initial population;
b) Calculate the fitness of each individual in the population;
c) Select parents from the current generation according
to their fitness;
d) Apply the mutation operator to the parents to generate offspring, which form the new generation;
e) Repeat steps b) to d) until some stopping criterion is met or a desired number of generations is reached.
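A compact sketch of the loop a)-e), assuming the hypothetical `random_chromosome` and `mutate` helpers from the previous sketches and a user-supplied `evaluate_mse` that trains and evaluates the decoded network; the fitness line uses the reconstructed form of (2):

```python
import numpy as np

rng = np.random.default_rng(2)

def vga(pop_size, generations, l_max, beta, mu_m, m, q, n, evaluate_mse):
    """Hypothetical VGA driver; evaluate_mse(l, W) -> MSE of the decoded net."""
    pop = [random_chromosome(m, q, n) for _ in range(pop_size)]      # step a)
    for _ in range(generations):                                     # step e)
        mse = np.array([evaluate_mse(l, W) for l, W in pop])
        ls = np.array([l for l, _ in pop])
        fit = 1.0 / (1.0 + mse) - beta * ls / l_max                  # b) eq. (2)
        p = fit - fit.min() + 1e-9
        p /= p.sum()                                                 # proportional selection
        parents = rng.choice(pop_size, size=pop_size, p=p)           # c) roulette wheel
        pop = [mutate(*pop[i], mu_m, l_max, m, q) for i in parents]  # d) offspring
    return min(pop, key=lambda ind: evaluate_mse(*ind))              # best individual
```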
B. Neural Network Ensemble
A neural network ensemble is a combination of a set of neural networks. The simplest way to combine $M$ neural networks for pattern classification is an arithmetic mean of their outputs $O_m$:

$O_{ens}(x_i) = \frac{1}{M} \sum_{m=1}^{M} O_m(x_i)$    (3)

where $x_i$ is the $i$-th input vector.
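Equation (3) in code form; a trivial but concrete sketch, where the array layout is an assumption:

```python
import numpy as np

def ensemble_output(member_outputs):
    """Eq. (3): arithmetic mean over the M members.

    member_outputs: array of shape (M, n_samples) -- assumed layout.
    """
    return np.mean(member_outputs, axis=0)
```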
Negative correlation learning (NCL) was introduced by Liu and Yao [10] with the aim of negatively correlating the error of each network within the ensemble. In this method, instead of training each network separately, a penalty term is introduced to minimize the correlation between the error of the network and the error of the rest of the ensemble.
Given the training set $\{x_i, t_i\}_{i=1}^{n}$, the error function $e_m$ for a network $m$ is defined as:

$e_m = \sum_{i=1}^{n} (O_m(x_i) - t_i)^2 - \lambda \sum_{i=1}^{n} (O_m(x_i) - O_{ens}(x_i))^2$    (4)
where $O_m(x_i)$ is the output of network $m$ for input $x_i$, $t_i$ is its desired response (target output), and $\lambda$ is the weight on the penalty term, controlling the trade-off between the error term and the penalty term. With $\lambda = 0$, each network in the ensemble trains independently. As $\lambda$ increases, more and more emphasis is placed on minimizing the penalty.
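The per-network NCL error (4) is equally direct to express in code; `ncl_error` is an illustrative name:

```python
import numpy as np

def ncl_error(o_m, o_ens, t, lam):
    """Eq. (4): squared error minus lam times the correlation penalty."""
    return np.sum((o_m - t) ** 2) - lam * np.sum((o_m - o_ens) ** 2)
```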
Let $\lambda = 1$. The average error $E_{ens}$ of all the individual networks then satisfies

$E_{ens} = \frac{1}{M} \sum_{m=1}^{M} e_m$    (7)
The ensemble method considered in this work is regularized negative correlation learning (RNCL) [3]. This method improves neural network ensemble performance by adding a regularization term, with the objective of reducing the over-fitting problem.
In the RNCL algorithm, each network $m$ has the following error function:

$e_m = \frac{1}{M} \sum_{i=1}^{n} (O_m(x_i) - t_i)^2 - \frac{1}{M} \sum_{i=1}^{n} (O_m(x_i) - O_{ens}(x_i))^2 + \alpha_m w_m^T w_m$    (8)
where $w_m$ is the weight vector of neural network $m$. In formula (8), the first term is the error of the $m$-th neural network, the second is the negative correlation between each network and the output of the ensemble, and the last term is the regularization term with its parameter $\alpha_m \in [0, 1]$. According to (7) and (8), minimizing the error function of the ensemble is achieved by minimizing the error function of each individual network, and the minimization of $e_m$ depends on the weight vector $w_m$ and the regularization parameter $\alpha_m$, which controls the trade-off between the empirical training error and the regularization.
Based on the gradient descent algorithm, the weights and regularization parameters are adjusted as follows:

$\Delta w_m = -\eta \frac{\partial e_m}{\partial w_m} = -\eta \left\{ \frac{2}{M} \sum_{i=1}^{n} (O_m(x_i) - t_i) \frac{\partial O_m(x_i)}{\partial w_m} - \frac{2}{M} \sum_{i=1}^{n} (O_m(x_i) - O_{ens}(x_i)) \left(1 - \frac{1}{M}\right) \frac{\partial O_m(x_i)}{\partial w_m} + 2 \alpha_m w_m \right\}$    (9)

$\Delta \alpha_m = -\eta \frac{\partial e_m}{\partial \alpha_m} = -\eta \, w_m^T w_m$    (10)

where $\eta$ is the learning rate, $\eta \in [0, 1]$.
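Updates (9) and (10) can be sketched under a simplifying assumption that each member is a linear model $O_m(x) = w_m^T x$, so that $\partial O_m / \partial w_m = x$; a real MLP would backpropagate this derivative instead. Clamping $\alpha_m$ to $[0, 1]$ is our assumption, following the constraint $\alpha_m \in [0, 1]$ stated above:

```python
import numpy as np

def rnclgd_step(w_m, alpha_m, X, t, o_m, o_ens, M, eta):
    """One gradient step of eqs. (9)-(10) for a linear member (assumed model)."""
    err = o_m - t                                # error residual
    corr = (o_m - o_ens) * (1.0 - 1.0 / M)       # negative-correlation residual
    grad_w = (2.0 / M) * (X.T @ err) - (2.0 / M) * (X.T @ corr) \
             + 2.0 * alpha_m * w_m               # bracketed term of eq. (9)
    w_m = w_m - eta * grad_w                     # eq. (9)
    alpha_m = float(np.clip(alpha_m - eta * (w_m @ w_m), 0.0, 1.0))  # eq. (10)
    return w_m, alpha_m
```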
The RNCLgd algorithm for the three-layer network can be summarized in the following four steps:
a) Normalize the training samples and initialize the weights and regularization parameter of each individual network. The weights and architecture of each network come from the VGA algorithm, and all regularization parameters are set to the maximum value 1;
b) For the training set $\{x_i, t_i\}_{i=1}^{n}$, calculate the ensemble output $O_{ens}(x_i) = \frac{1}{M} \sum_{m=1}^{M} O_m(x_i)$;
c) For each network from $m = 1$ to $M$ and all training samples, adjust the weights $w_m$ and the regularization parameter $\alpha_m$ of each network using formulas (9) and (10);
d) Repeat from step b) for a desired number of iterations (epochs).

In the RNCLgd algorithm, the weights and regularization parameters of each network are evolved simultaneously and converge to the optimal solution as the error function of the ensemble is minimized.
Once the NNE has been trained, the test data can be classified using the ensemble output $O_{ens}$.
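Putting steps a)-d) together, a minimal training loop might read as follows, reusing the `rnclgd_step` sketch above and the same linear-member assumption; in the paper, the initial weights and architectures would come from the VGA rather than random initialization:

```python
import numpy as np

def train_rnclgd(X, t, M=10, eta=0.1, epochs=100, seed=0):
    """Hypothetical RNCLgd loop for M linear members (assumed model)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(M, X.shape[1]))  # a) initial weights
    alpha = np.ones(M)                                # a) all alpha_m start at 1
    for _ in range(epochs):                           # d) repeat for epochs
        O = X @ W.T                                   # member outputs, shape (n, M)
        o_ens = O.mean(axis=1)                        # b) ensemble output, eq. (3)
        for m in range(M):                            # c) eqs. (9)-(10) per member
            W[m], alpha[m] = rnclgd_step(W[m], alpha[m], X, t,
                                         O[:, m], o_ens, M, eta)
    return W, alpha                                   # classify with O_ens afterwards
```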
III. RESULTS AND DISCUSSION
In this section, we discuss experimental results of the ENNE classifier compared with the BP-MLP and NNE classifiers on a number of benchmark data sets: the Vowel, Iris and Diabetes data. The Iris and Diabetes data come from the UCI repository (http://archive.ics.uci.edu/ml); the Vowel data come from MIT Lincoln Laboratory (http://www.ll.mit.edu/IST/lnknet). The main characteristics of the data sets are summarized in Table I.

TABLE I
THE MAIN CHARACTERISTICS OF IRIS, DIABETES AND VOWEL

Data set    Size of samples    Classes    Feature dimensions
Iris              150             3               4
Diabetes          750             2               8
Vowel             300            10               2
Following m-fold cross-validation [5], all samples are randomly divided into five sets. The classifier is trained five times, each time with a different set held out as a validation set and the remaining sets used for training. The test accuracy is the average over the five runs.
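This protocol can be written down directly; `train_fn` and `predict_fn` are placeholders standing in for any of the three classifiers below:

```python
import numpy as np

def five_fold_accuracy(X, y, train_fn, predict_fn, seed=0):
    """Average held-out accuracy over five random folds (sketch)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), 5)
    accs = []
    for k in range(5):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        model = train_fn(X[train], y[train])
        pred = predict_fn(model, X[folds[k]])
        accs.append(np.mean(pred == y[folds[k]]))
    return float(np.mean(accs))
```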
A. Implementation Parameters of Classifiers
Three types of supervised classifiers are compared: the BP-MLP classifier, the NNE classifier and the ENNE classifier.
(1) BP-MLP classifier
The number of hidden nodes is set to 20, the learning rate is set to 0.1 for Vowel and 0.02 for Diabetes and Iris, and the algorithm is executed for 1000 epochs.
(2) NNE classifier
The ensemble size M is set to 10, and the regularization parameter $\alpha$ is optimized by the cross-validation method (RNCLcv) over the range [0.001, 1], giving the optimal value 0.035. Each individual network has the same structure and other parameters as the BP-MLP classifier.
(3) ENNE classifier
The number of individuals in the population and the number of evolving generations are set to 20 and 300, respectively. The mutation probability $\mu_m$ and the architecture coefficient $\beta$ vary within the range [0.015, 0.333] during the evolving process. All regularization parameters are initialized to 1.0, while the other parameters are the same as for the NNE classifier.
TABLE II
TRAINING ACCURACY COMPARISON OF CLASSIFIERS

               Training Accuracy (%)
Data set    BP-MLP     NNE     ENNE
Iris          99.0     98.3    98.0
Diabetes      97.2     94.1    79.7
Vowel         81.1     81.0    82.9
B. Performance Comparison of Classifiers

Tables II and III compare the performance of the classifiers on the Iris, Diabetes and Vowel data. The figures are the average training and testing accuracies over the five folds of five-fold cross-validation.

TABLE III
TESTING ACCURACY COMPARISON OF CLASSIFIERS

               Testing Accuracy (%)
Data set    BP-MLP     NNE     ENNE
Iris          92.6     92.6    93.3
Diabetes      71.2     75.5    77.1
Vowel         68.3     70.3    71.3

As seen from Table II, the training accuracy of the BP-MLP classifier is higher than that of NNE and ENNE for Iris and Diabetes, and similar to NNE and ENNE for Vowel. Because regularized negative correlation is not used in BP-MLP, it very easily falls into over-fitting, which harms some testing cases.
As seen from Table III, the testing accuracy of NNE is better than that of BP-MLP for Diabetes and Vowel, and similar to BP-MLP for Iris, while the testing accuracy of ENNE is better than both BP-MLP and NNE on all data sets. This shows that the ENNE classifier, like the NNE classifier, can reduce the over-fitting problem and has better generalization ability. Furthermore, because the VGA algorithm evolves an appropriate architecture and initial weights for the NNE, the ENNE algorithm can search for the optimal solution over a larger scope of the search space, which helps it avoid falling into local minima. It is therefore not surprising that the ENNE classifier achieves better testing accuracy than the BP-MLP and NNE classifiers.
C. Effect of Parameters in the ENNE classifier

The ENNE classifier has several important parameters that influence the structure and weight training of the neural network ensembles: the mutation rate $\mu_m$ and the number of evolving generations in the VGA algorithm, and the learning rate $\eta$, the number of iterations (epochs) and the regularization parameter $\alpha_m$ in the RNCLgd algorithm.
Fig. 1. The optimal hidden nodes by VGA evolving training (number of hidden nodes vs. generations).
Fig. 2. The optimal hidden nodes (Lj1-Lj10) of each individual neural network at 100, 200, 300, 500 and 1000 generations.
Fig. 1 shows the optimal number of hidden nodes ($L_j$) of an individual at each generation during VGA training on the Vowel data, describing how the optimal topology varies with the number of generations. The optimal number of hidden nodes (15) is obtained after only about 15 generations.
Fig. 2 shows the optimal hidden nodes $L_j$ of each individual at 100, 200, 300, 500 and 1000 generations for the Vowel data. The optimal $L_j$ varies with the individual and the evolving generations, and almost all individuals reach the same optimal value of 19 or 20 after 1000 evolving generations, except for the 8th individual.
Fig. 3. The MSE by VGA evolving training (MSE vs. generations).
Fig. 4. The MSE during RNCLgd training (MSE vs. epochs).
The mutation probability $\mu_m$ varies with the number of generations. Initially, $\mu_m$ has a high value, ensuring a lot of diversity in the population; at this stage, the architecture ($L_j$) of the optimal individual changes considerably. As generations pass and the evolving process reaches the vicinity of an optimal solution, $\mu_m$ should be decreased to fine-tune the architecture and weights.

Fig. 3 shows the MSE of the optimal individual at each generation during VGA training, and Fig. 4 shows the MSE at each epoch during RNCLgd training. Although the VGA algorithm greatly reduces the MSE of each neural network, a further improvement in training performance is achieved by the RNCLgd algorithm.
In the ENNE classifier, if only the VGA algorithm is used to evolve the architecture and weights of the neural network ensembles, a special VNNE classifier is formed. Table IV shows the performance of the VNNE classifier on the Iris, Diabetes and Vowel data. As seen from Table IV, the average testing accuracy of the VNNE classifier is still better than that of the NNE and BP-MLP classifiers for the Diabetes data, which shows that the VGA algorithm has almost found the optimal solution of the neural network ensembles for Diabetes. However, for the Vowel and Iris data, the testing accuracy of VNNE is far lower than that of NNE and BP-MLP, so it is necessary to further evolve the weights of the neural network ensembles with the RNCLgd algorithm.

TABLE IV
TESTING ACCURACY COMPARISON OF CLASSIFIERS

               Testing Accuracy (%)
Data set    BP-MLP     NNE     VNNE
Iris          92.6     92.6    86.8
Diabetes      71.2     75.5    76.7
Vowel         68.3     70.3    42.7
In the RNCLgd algorithm, the learning rate $\eta$ and the number of epochs are a pair of interacting parameters: the larger the learning rate, the fewer epochs are needed, and vice versa. These parameters have to be set manually rather than automatically, just as in the BP-MLP and RNCL algorithms.
The regularization parameter is very important: it controls the balance between the empirical training error and the complexity of the network, and it is crucial to improving the ensemble's generalization ability. In the RNCL algorithm, the regularization parameter $\alpha$ is optimized by the cross-validation (RNCLcv) or Bayesian inference (RNCLBI) method. In RNCLBI, the regularization parameter can be adjusted automatically, but its calculation is too complex for real applications because it uses the Hessian matrix.
In RNCLcv, since there is no automatic parameter adjustment, a large amount of training and testing must be done to search for an optimal $\alpha$ parameter, whereas the RNCLgd algorithm offers a better solution to this problem.
The RNCLgd algorithm has the following features:

1) The RNCLgd algorithm can adjust the $\alpha$ parameter automatically, like the RNCLBI algorithm. As seen from formula (10), the calculation of $\Delta\alpha$ is simple and easy to implement in the iterations of the RNCLgd algorithm, so it is much more efficient than RNCLcv, which has to search for the $\alpha$ parameter manually.
2) With respect to the operation mechanism, in the RNCLcv algorithm the regularization ratio $\alpha$, which has been selected manually, keeps the same value during the entire evolving process for all the different weights of each individual neural network, so as to find an appropriate $\alpha$ efficiently. In the RNCLgd algorithm, on the other hand, because the $\alpha$ parameter is adjusted automatically during operation, not only are the $\alpha$ values different for the different weights of each individual, but each $\alpha$ value is also evolved automatically throughout the entire operation process.
In fact, at the start of the evolution process the $\alpha$ value, which is set to a large value, strongly attenuates the weights of the neural networks so that the weights are kept within an appropriate range. As the iterations proceed, $\alpha$ is gradually reduced. During the evolving process, the $\alpha$ value is sometimes even reduced to zero, indicating that the corresponding weights have become small enough that the further attenuation penalty imposed by the $\alpha$ value should be removed.
IV. CONCLUSION
In this paper, we have described a neural network ensembles (NNE) classifier. A neural network ensemble is a combination of a set of neural networks that tries to cope with a problem in a robust and efficient way. The NNE classifier adopts the regularized negative correlation learning (RNCL) algorithm: adding a regularization term can reduce the over-fitting problem, control the complexity of the network, attain the optimal balance between the empirical training error and the weight decay term, and help the network improve its generalization capability. We propose an evolving neural network ensembles (ENNE) classifier using the VGA algorithm, which makes two improvements to the NNE classifier with the RNCL algorithm: the first is to evolve the appropriate architecture and initial weights of the NNE using the VGA algorithm; the second is to optimize the regularization parameter automatically while evolving the NNE's weights. The effectiveness of the ENNE classifier is demonstrated on a number of benchmark data sets. Compared with the BP-MLP and NNE classifiers, the ENNE classifier with the hybrid VGA and RNCLgd algorithm shows better pattern classification performance.
REFERENCES
[1] J. A. Benediktsson, P. H. Swain, and O. K. Ersoy, "Neural network approaches versus statistical methods in classification of multi-spectral remote sensing data," IEEE Transactions on Geoscience and Remote Sensing, vol. 28, pp. 540-552, 1990.
[2] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. New York: Addison-Wesley, 1989.
[3] H. Chen and X. Yao, "Regularized negative correlation learning for neural network ensembles," IEEE Trans. Neural Networks, vol. 20, no. 12, pp. 1962-1979, 2009.
[4] L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 10, pp. 993-1001, 1990.
[5] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. John Wiley & Sons, 2001.
[6] D. Rumelhart, G. Hinton, and R. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, 1986.
[7] S. Bandyopadhyay and S. K. Pal, "Pixel classification using variable string genetic algorithms with chromosome differentiation," IEEE Transactions on Geoscience and Remote Sensing, vol. 39, no. 2, 2001.
[8] X. Yao and Y. Liu, "Making use of population information in evolutionary artificial neural networks," IEEE Trans. Systems, Man and Cybernetics, Part B, vol. 28, no. 3, pp. 417-425, 1998.
[9] X. Yao, "Evolving artificial neural networks," Proceedings of the IEEE, vol. 87, no. 9, pp. 1423-1447, 1999.
[10] Y. Liu and X. Yao, "Ensemble learning via negative correlation," Neural Networks, vol. 12, no. 10, pp. 1399-1404, 1999.
[11] M. De Felice and X. Yao, "Short-term load forecasting with neural network ensembles: a comparative study," IEEE Computational Intelligence Magazine, pp. 47-56, 2011.