2013 Sixth International Conference on Advanced Computational Intelligence
October 19-21, 2013, Hangzhou, China
Evolving Neural Network Ensembles Using Variable String Genetic
Algorithm for Pattern Classification
Xiaoyang Fu and Shuqing Zhang
Abstract-In this paper, an evolving neural network ensembles (ENNE) classifier using a variable string genetic algorithm (VGA) is proposed. For neural network ensembles (NNE) trained with the regularized negative correlation learning (RNCL) algorithm, two improvements are adopted: the first is to evolve the appropriate architecture and initial connection weights of the NNE using the VGA algorithm; the second is to optimize the regularization parameter automatically based on gradient descent while evolving the NNE's weights. The effectiveness of the ENNE classifier is demonstrated on a number of benchmark data sets. Compared with the back-propagation multilayer perceptron (BP-MLP) classifier and the NNE classifier with the RNCL algorithm, the ENNE classifier with the hybrid VGA and RNCLgd algorithm shows better classification performance.
I. INTRODUCTION
Neural network classifiers [1] and NNE classifiers [4] have been widely applied to pattern classification. The most widely used neural network model is the multi-layer perceptron (MLP), in which the connection weights are normally trained by the back-propagation (BP) learning algorithm [6]. In the BP learning algorithm, the error is minimized when the network outputs match the desired outputs; the weights are adjusted to reduce the error. The essential character of the BP algorithm is gradient descent, which depends strictly on the shape of the error surface and may therefore become trapped in local minima [5].
A neural network ensemble is a combination of a set of neural networks that tries to cope with a problem in a robust and efficient way [8]. Negative correlation learning (NCL) [10] is a neural network ensemble learning algorithm that introduces a correlation penalty term into the error function of each individual network, so that each neural network minimizes its mean square error (MSE) together with its correlation with the ensemble, and thus alleviates the over-fitting problem to some extent.
Furthermore, the regularized negative correlation learning (RNCL) algorithm [3] incorporates an additional regularization term into the ensemble, with a regularization parameter used to control the trade-off between the MSE and the regularization. This is crucial to improving the ensemble's generalization ability. The RNCL algorithm can be used for pattern classification, regression analysis and load forecasting [11].

Xiaoyang Fu is with the Department of Computer Science and Technology, Zhuhai College of Jilin University, Zhuhai 519041, China (email: [email protected]).
Shuqing Zhang is with the Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences, Changchun 130012, China (email: [email protected].).
Genetic algorithms [2] are randomized optimization techniques guided by the principles of evolution and natural genetics. They are efficient, adaptive and robust in the search process, producing near-optimal solutions and handling large, highly complex and multimodal spaces.

In the RNCL algorithm [3], the initial weights and architecture of the NNE are set to random values or selected empirically, and the regularization parameter is optimized by the cross-validation (RNCLcv) or Bayesian inference (RNCLBI) method. In this paper, we propose improvements in two respects: the first is to evolve the appropriate architecture and initial weights of the NNE using the VGA algorithm; the second is to optimize the regularization parameter automatically based on gradient descent while evolving the NNE's weights.
The rest of this paper is organized as follows: Section II introduces the VGA algorithm, neural network ensembles with the RNCL algorithm, and regularization parameter optimization based on gradient descent. Section III presents the results of the ENNE classifier on a number of benchmark data sets and compares them with the BP-MLP and NNE classifiers. Finally, Section IV concludes the paper.
II. METHODOLOGY
A. VGA algorithm [7]
For a three-layer neural network, the number of input nodes equals the dimensionality of the feature space, while the number of output nodes equals the number of pattern categories; both can be obtained from the training samples. Thus, the architecture of a three-layer neural network depends only on the number of hidden nodes. In order to evolve the architecture of a three-layer neural network using the VGA algorithm, the hidden nodes and connection weights are encoded in a string chromosome.
1) Architecture and weights encoding
Let $A = (A_1, A_2, \ldots, A_i, \ldots, A_M)$, where $A$ is a population made of $M$ individuals $(A_1, \ldots, A_M)$ and

$A_i = (l_i, W_i)$    (1)

In formula (1), $l_i$ and $W_i$ represent the number of hidden nodes and the connection weights of the $i$-th individual, respectively.
By a convenient rule of thumb [5], the number of hidden nodes is chosen such that the total number of weights in the network is roughly $n/10$, or somewhat more, but not more than the total number of training samples $n$. For a three-layer neural network, the total number of weights is $l(m+q)$, where $m$ is the number of feature dimensions and $q$ is the number of pattern categories, so the maximum number of hidden nodes satisfies $l_{\max} \le n/(m+q)$, and the number of hidden nodes $l$ can be chosen from the range $[l_{\max}/10, l_{\max}]$. The number of hidden nodes is encoded in binary, while each weight is encoded as a real number randomly chosen from the range $[-1, +1]$ at initialization, as sketched below.
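As an illustration, the encoding above can be sketched in a few lines of Python. The helper name `random_chromosome` and the NumPy array representation are our own assumptions, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_chromosome(m, q, n):
    """Hypothetical sketch of one individual A_i = (l_i, W_i).

    m: feature dimensions, q: pattern categories, n: training samples.
    """
    l_max = n // (m + q)                                    # l_max <= n / (m + q)
    l = int(rng.integers(max(1, l_max // 10), l_max + 1))   # l in [l_max/10, l_max]
    # One real-coded gene per connection: each hidden node carries m input
    # weights and q output weights, initialized uniformly in [-1, +1].
    weights = rng.uniform(-1.0, 1.0, size=l * (m + q))
    return l, weights

l, W = random_chromosome(m=4, q=3, n=120)   # Iris-like dimensions
print(l, W.shape)
```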
2) Genetic operators: Reference [9] pointed out that it is generally very difficult to apply a crossover operator when evolving connection weights, since crossover tends to destroy the feature detectors found during the evolutionary process. Thus, this paper adopts only the selection and mutation operators.
Selection: the roulette wheel selection procedure has been
adopted to implement a proportional selection strategy.
Mutation: a basic mutation operator is adopted. For the hidden nodes, each bit of each individual is selected with probability $\mu_m$ and flipped from 1 to 0, or vice versa. For the connection weights, each weight of an individual $W_i$ is selected with probability $\mu_m$ and replaced by a real number drawn randomly from $[-1, +1]$.
Suppose the number of hidden nodes $l_i$ of individual $A_i$ changes to $l_i'$ after the mutation operator; three cases then arise:
a) $l_i = l_i'$: the weights of individual $A_i$ need not be adjusted;
b) $l_i > l_i'$: the weights of $A_i$ belonging to hidden nodes beyond $l_i'$ must be deleted;
c) $l_i < l_i'$: some weight terms must be added so that the string length matches $l_i'$; the inserted weight values are drawn randomly from $[-1, +1]$ (see the sketch following this list).
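A minimal sketch of this mutation, including the length-adjustment cases b) and c), might look as follows; the function name `mutate`, the bit-flip scheme for the hidden-node count, and the clamping of $l'$ to a valid range are illustrative assumptions consistent with the description above:

```python
import numpy as np

rng = np.random.default_rng(1)

def mutate(l, weights, mu_m, l_max, m, q):
    """Hypothetical sketch of the VGA mutation operator."""
    # Bit-flip mutation on the binary code of the hidden-node count.
    l_new = l
    for b in range(l_max.bit_length()):
        if rng.random() < mu_m:
            l_new ^= 1 << b                     # flip bit b
    l_new = min(max(l_new, 1), l_max)           # clamp to a valid range (assumed)
    # Weight mutation: replace each selected gene by a value from [-1, +1].
    mask = rng.random(weights.shape) < mu_m
    weights = np.where(mask, rng.uniform(-1.0, 1.0, weights.shape), weights)
    # Adjust the string length to the mutated architecture (cases b and c).
    target = l_new * (m + q)
    if target < weights.size:                   # b) delete surplus weights
        weights = weights[:target]
    elif target > weights.size:                 # c) append random weights
        weights = np.concatenate(
            [weights, rng.uniform(-1.0, 1.0, target - weights.size)])
    return l_new, weights
```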
3) Fitness computation: For each individual with hidden nodes $l_i$ and weights $W_i$, the fitness function is defined as:

$f_i = \frac{1}{1 + \mathrm{MSE}_i} - \beta \cdot \frac{l_i}{l_{\max}}$    (2)
where $f_i$ is the fitness of the $i$-th individual, $l_i$ is its number of hidden nodes, and $\beta$ is a user-defined architecture coefficient in the range $[0, 1]$. Maximizing the fitness therefore minimizes the MSE, while the term $\beta \cdot l_i / l_{\max}$ forces the minimization of the number of hidden nodes.

Now, the basic steps of the VGA algorithm are described as follows:
a) Randomly construct an initial population;
b) Calculate the fitness of each individual in the population;
c) Select parents from the current generation according
to their fitness;
d) Apply the mutation operator to the parents to generate offspring, which form the new generation;
e) Repeat steps b) to d) until some stopping criterion is met or a desired number of generations is reached.
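A compact sketch of the loop a)-e), assuming the hypothetical `random_chromosome` and `mutate` helpers from the previous sketches and a user-supplied `evaluate_mse` that trains and evaluates the decoded network; the fitness line uses the reconstructed form of (2):

```python
import numpy as np

rng = np.random.default_rng(2)

def vga(pop_size, generations, l_max, beta, mu_m, m, q, n, evaluate_mse):
    """Hypothetical VGA driver; evaluate_mse(l, W) -> MSE of the decoded net."""
    pop = [random_chromosome(m, q, n) for _ in range(pop_size)]      # step a)
    for _ in range(generations):                                     # step e)
        mse = np.array([evaluate_mse(l, W) for l, W in pop])
        ls = np.array([l for l, _ in pop])
        fit = 1.0 / (1.0 + mse) - beta * ls / l_max                  # b) eq. (2)
        p = fit - fit.min() + 1e-9
        p /= p.sum()                                                 # proportional selection
        parents = rng.choice(pop_size, size=pop_size, p=p)           # c) roulette wheel
        pop = [mutate(*pop[i], mu_m, l_max, m, q) for i in parents]  # d) offspring
    return min(pop, key=lambda ind: evaluate_mse(*ind))              # best individual
```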
B. Neural Network Ensemble
A neural network ensemble is a combination of a set of neural networks. The simplest way to combine $M$ neural networks for pattern classification is an arithmetic mean of their outputs $O_m$:

$O_{ens}(x_i) = \frac{1}{M} \sum_{m=1}^{M} O_m(x_i)$    (3)

where $x_i$ is the $i$-th input vector.
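Equation (3) in code form; a trivial but concrete sketch, where the array layout is an assumption:

```python
import numpy as np

def ensemble_output(member_outputs):
    """Eq. (3): arithmetic mean over the M members.

    member_outputs: array of shape (M, n_samples) -- assumed layout.
    """
    return np.mean(member_outputs, axis=0)
```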
Negative correlation learning (NCL) was introduced by Liu and Yao [10] with the aim of negatively correlating the error of each network within the ensemble. In this method, instead of training each network separately, a penalty term is introduced to minimize the correlation between the error of the network and the error of the rest of the ensemble.
Given the training set $\{x_i, t_i\}_{i=1}^{n}$, the error function $e_m$ for a network $m$ is defined as:

$e_m = \sum_{i=1}^{n} (O_m(x_i) - t_i)^2 - \lambda \sum_{i=1}^{n} (O_m(x_i) - O_{ens}(x_i))^2$    (4)
where $O_m(x_i)$ is the output of network $m$ for input $x_i$, $t_i$ is its desired response (target output), and $\lambda$ is the weight on the penalty term, controlling the trade-off between the error term and the penalty term. With $\lambda = 0$, each network in the ensemble trains independently. As $\lambda$ increases, more and more emphasis is placed on minimizing the penalty.
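The per-network NCL error (4) is equally direct to express in code; `ncl_error` is an illustrative name:

```python
import numpy as np

def ncl_error(o_m, o_ens, t, lam):
    """Eq. (4): squared error minus lam times the correlation penalty."""
    return np.sum((o_m - t) ** 2) - lam * np.sum((o_m - o_ens) ** 2)
```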
Let $\lambda = 1$. The average error $E_{ens}$ of all the individual networks then satisfies

$E_{ens} = \frac{1}{M} \sum_{m=1}^{M} e_m$    (7)
The ensemble method considered in this work is regularized negative correlation learning (RNCL) [3]. This method improves neural network ensemble performance by adding a regularization term, with the objective of reducing the over-fitting problem.
In the RNCL algorithm, each network $m$ has the following error function:

$e_m = \frac{1}{M} \sum_{i=1}^{n} (O_m(x_i) - t_i)^2 - \frac{1}{M} \sum_{i=1}^{n} (O_m(x_i) - O_{ens}(x_i))^2 + \alpha_m w_m^T w_m$    (8)
where $w_m$ is the weight vector of neural network $m$. In formula (8), the first term is the error of the $m$-th neural network, the second is the negative correlation between each network and the output of the ensemble, and the last term is the regularization term with its parameter $\alpha_m \in [0, 1]$. According to (7) and (8), minimizing the error function of the ensemble is achieved by minimizing the error function of each individual network, and the minimization of $e_m$ depends on the weight vector $w_m$ and the regularization parameter $\alpha_m$, which controls the trade-off between the empirical training error and the regularization.
Based on the gradient descent algorithm, the weights and regularization parameters are adjusted as follows:

$\Delta w_m = -\eta \frac{\partial e_m}{\partial w_m} = -\eta \left\{ \frac{2}{M} \sum_{i=1}^{n} (O_m(x_i) - t_i) \frac{\partial O_m(x_i)}{\partial w_m} - \frac{2}{M} \sum_{i=1}^{n} (O_m(x_i) - O_{ens}(x_i)) \left(1 - \frac{1}{M}\right) \frac{\partial O_m(x_i)}{\partial w_m} + 2 \alpha_m w_m \right\}$    (9)

$\Delta \alpha_m = -\eta \frac{\partial e_m}{\partial \alpha_m} = -\eta \, w_m^T w_m$    (10)

where $\eta$ is the learning rate, $\eta \in [0, 1]$.
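Updates (9) and (10) can be sketched under a simplifying assumption that each member is a linear model $O_m(x) = w_m^T x$, so that $\partial O_m / \partial w_m = x$; a real MLP would backpropagate this derivative instead. Clamping $\alpha_m$ to $[0, 1]$ is our assumption, following the constraint $\alpha_m \in [0, 1]$ stated above:

```python
import numpy as np

def rnclgd_step(w_m, alpha_m, X, t, o_m, o_ens, M, eta):
    """One gradient step of eqs. (9)-(10) for a linear member (assumed model)."""
    err = o_m - t                                # error residual
    corr = (o_m - o_ens) * (1.0 - 1.0 / M)       # negative-correlation residual
    grad_w = (2.0 / M) * (X.T @ err) - (2.0 / M) * (X.T @ corr) \
             + 2.0 * alpha_m * w_m               # bracketed term of eq. (9)
    w_m = w_m - eta * grad_w                     # eq. (9)
    alpha_m = float(np.clip(alpha_m - eta * (w_m @ w_m), 0.0, 1.0))  # eq. (10)
    return w_m, alpha_m
```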
The RNCLgd algorithm for the three-layer network can be summarized in the following four steps:
a) Normalize the training samples and initialize the weights and regularization parameter of each individual network. The weights and architecture of each network come from the VGA algorithm, and all regularization parameters are set to the maximum value 1;
b) For the training set $\{x_i, t_i\}_{i=1}^{n}$, calculate the ensemble output $O_{ens}(x_i) = \frac{1}{M} \sum_{m=1}^{M} O_m(x_i)$;
c) For each network from $m = 1$ to $M$ and all training samples, adjust the weights $w_m$ and the regularization parameter $\alpha_m$ of each network using formulas (9) and (10);
d) Repeat from step b) for a desired number of iterations (epochs).

In the RNCLgd algorithm, the weights and regularization parameters of each network are evolved simultaneously and converge to the optimal solution as the error function of the ensemble is minimized.
Once the NNE has been trained, the test data can be classified using the ensemble output $O_{ens}$.
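Putting steps a)-d) together, a minimal training loop might read as follows, reusing the `rnclgd_step` sketch above and the same linear-member assumption; in the paper, the initial weights and architectures would come from the VGA rather than random initialization:

```python
import numpy as np

def train_rnclgd(X, t, M=10, eta=0.1, epochs=100, seed=0):
    """Hypothetical RNCLgd loop for M linear members (assumed model)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(M, X.shape[1]))  # a) initial weights
    alpha = np.ones(M)                                # a) all alpha_m start at 1
    for _ in range(epochs):                           # d) repeat for epochs
        O = X @ W.T                                   # member outputs, shape (n, M)
        o_ens = O.mean(axis=1)                        # b) ensemble output, eq. (3)
        for m in range(M):                            # c) eqs. (9)-(10) per member
            W[m], alpha[m] = rnclgd_step(W[m], alpha[m], X, t,
                                         O[:, m], o_ens, M, eta)
    return W, alpha                                   # classify with O_ens afterwards
```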
III. RESULTS AND DISCUSSION
In this section, we discuss experimental results of the ENNE classifier compared with the BP-MLP and NNE classifiers on a number of benchmark data sets: the Vowel, Iris and Diabetes data. The Iris and Diabetes data come from the UCI repository (http://archive.ics.uci.edu/ml); the Vowel data come from MIT Lincoln Laboratory (http://www.ll.mit.edu/IST/lnknet). The main characteristics of the data sets are summarized in Table I.

TABLE I
THE MAIN CHARACTERISTICS OF IRIS, DIABETES AND VOWEL

Data set    Size of samples    Classes    Feature dimensions
Iris              150             3               4
Diabetes          750             2               8
Vowel             300            10               2
Following m-fold cross-validation [5], all samples are randomly divided into five sets. The classifier is trained five times, each time with a different set held out as a validation set and the remaining sets used for training. The test accuracy is the average over the five runs.
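This protocol can be written down directly; `train_fn` and `predict_fn` are placeholders standing in for any of the three classifiers below:

```python
import numpy as np

def five_fold_accuracy(X, y, train_fn, predict_fn, seed=0):
    """Average held-out accuracy over five random folds (sketch)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), 5)
    accs = []
    for k in range(5):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        model = train_fn(X[train], y[train])
        pred = predict_fn(model, X[folds[k]])
        accs.append(np.mean(pred == y[folds[k]]))
    return float(np.mean(accs))
```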
A. Implementation Parameters of Classifiers
Three types of supervised classifiers are compared: the BP-MLP classifier, the NNE classifier and the ENNE classifier.
(1) BP-MLP classifier
The number of hidden nodes is set to 20, the learning rate is set to 0.1 for Vowel and 0.02 for Diabetes and Iris, and the algorithm is executed for 1000 epochs.
(2) NNE classifier
The ensemble size M is set to 10, and the regularization parameter $\alpha$ is optimized by the cross-validation method (RNCLcv) over the range [0.001, 1], giving the optimal value 0.035. Each individual network has the same structure and other parameters as the BP-MLP classifier.
(3) ENNE classifier
The number of individuals in the population and the number of evolving generations are set to 20 and 300, respectively. The mutation probability $\mu_m$ and the architecture coefficient $\beta$ vary within the range [0.015, 0.333] during the evolving process. All regularization parameters are initialized to 1.0, while the other parameters are the same as for the NNE classifier.
TABLE II
TRAINING ACCURACY COMPARISON OF CLASSIFIERS

               Training Accuracy (%)
Data set    BP-MLP     NNE     ENNE
Iris          99.0     98.3    98.0
Diabetes      97.2     94.1    79.7
Vowel         81.1     81.0    82.9
B. Performance Comparison of Classifiers

Tables II and III compare the performance of the classifiers on the Iris, Diabetes and Vowel data. The figures are the average training and testing accuracies over the five folds of five-fold cross-validation.

TABLE III
TESTING ACCURACY COMPARISON OF CLASSIFIERS

               Testing Accuracy (%)
Data set    BP-MLP     NNE     ENNE
Iris          92.6     92.6    93.3
Diabetes      71.2     75.5    77.1
Vowel         68.3     70.3    71.3

As seen from Table II, the training accuracy of the BP-MLP classifier is higher than that of NNE and ENNE for Iris and Diabetes, and similar to NNE and ENNE for Vowel. Because regularized negative correlation is not used in BP-MLP, it very easily falls into over-fitting, which harms some testing cases.
As seen from Table III, the testing accuracy of NNE is better than that of BP-MLP for Diabetes and Vowel, and similar to BP-MLP for Iris, while the testing accuracy of ENNE is better than both BP-MLP and NNE on all data sets. This shows that the ENNE classifier, like the NNE classifier, can reduce the over-fitting problem and has better generalization ability. Furthermore, because the VGA algorithm evolves an appropriate architecture and initial weights for the NNE, the ENNE algorithm can search for the optimal solution over a larger scope of the search space, which helps it avoid falling into local minima. It is therefore not surprising that the ENNE classifier achieves better testing accuracy than the BP-MLP and NNE classifiers.
C. Effect of Parameters in the ENNE classifier

The ENNE classifier has several important parameters that influence the structure and weight training of the neural network ensembles: the mutation rate $\mu_m$ and the number of evolving generations in the VGA algorithm, and the learning rate $\eta$, the number of iterations (epochs) and the regularization parameter $\alpha_m$ in the RNCLgd algorithm.
Fig. 1. The optimal hidden nodes by VGA evolving training (number of hidden nodes vs. generations).
Fig. 2. The optimal hidden nodes (Lj1-Lj10) of each individual neural network at 100, 200, 300, 500 and 1000 generations.
Fig. 1 shows the optimal number of hidden nodes ($L_j$) of an individual at each generation during VGA training on the Vowel data, describing how the optimal topology varies with the number of generations. The optimal number of hidden nodes (15) is obtained after only about 15 generations.
Fig. 2 shows the optimal hidden nodes $L_j$ of each individual at 100, 200, 300, 500 and 1000 generations for the Vowel data. The optimal $L_j$ varies with the individual and the evolving generations, and almost all individuals reach the same optimal value of 19 or 20 after 1000 evolving generations, except for the 8th individual.
Fig. 3. The MSE by VGA evolving training (MSE vs. generations).
Fig. 4. The MSE during RNCLgd training (MSE vs. epochs).
The mutation probability $\mu_m$ varies with the number of generations. Initially, $\mu_m$ has a high value, ensuring a lot of diversity in the population; at this stage, the architecture ($L_j$) of the optimal individual changes considerably. As generations pass and the evolving process reaches the vicinity of an optimal solution, $\mu_m$ should be decreased to fine-tune the architecture and weights.

Fig. 3 shows the MSE of the optimal individual at each generation during VGA training, and Fig. 4 shows the MSE at each epoch during RNCLgd training. Although the VGA algorithm greatly reduces the MSE of each neural network, a further improvement in training performance is achieved by the RNCLgd algorithm.
In the ENNE classifier, if only the VGA algorithm is used to evolve the architecture and weights of the neural network ensembles, a special VNNE classifier is formed. Table IV shows the performance of the VNNE classifier on the Iris, Diabetes and Vowel data. As seen from Table IV, the average testing accuracy of the VNNE classifier is still better than that of the NNE and BP-MLP classifiers for the Diabetes data, which shows that the VGA algorithm has almost found the optimal solution of the neural network ensembles for Diabetes. However, for the Vowel and Iris data, the testing accuracy of VNNE is far lower than that of NNE and BP-MLP, so it is necessary to further evolve the weights of the neural network ensembles with the RNCLgd algorithm.

TABLE IV
TESTING ACCURACY COMPARISON OF CLASSIFIERS

               Testing Accuracy (%)
Data set    BP-MLP     NNE     VNNE
Iris          92.6     92.6    86.8
Diabetes      71.2     75.5    76.7
Vowel         68.3     70.3    42.7
In the RNCLgd algorithm, the learning rate $\eta$ and the number of epochs are a pair of interacting parameters: the larger the learning rate, the fewer epochs are needed, and vice versa. These parameters have to be set manually rather than automatically, just as in the BP-MLP and RNCL algorithms.
The regularization parameter is very important: it controls the balance between the empirical training error and the complexity of the network, and it is crucial to improving the ensemble's generalization ability. In the RNCL algorithm, the regularization parameter $\alpha$ is optimized by the cross-validation (RNCLcv) or Bayesian inference (RNCLBI) method. In RNCLBI, the regularization parameter can be adjusted automatically, but its calculation is too complex for real applications because it uses the Hessian matrix.
In RNCLcv, since there is no automatic parameter adjustment, a large amount of training and testing must be done to search for an optimal $\alpha$ parameter, whereas the RNCLgd algorithm offers a better solution to this problem.
The RNCLgd algorithm has the following features:

1) The RNCLgd algorithm can adjust the $\alpha$ parameter automatically, like the RNCLBI algorithm. As seen from formula (10), the calculation of $\Delta\alpha$ is simple and easy to implement in the iterations of the RNCLgd algorithm, so it is much more efficient than RNCLcv, which has to search for the $\alpha$ parameter manually.
2) With respect to the operation mechanism, in the RNCLcv algorithm the regularization ratio $\alpha$, which has been selected manually, keeps the same value during the entire evolving process for all the different weights of each individual neural network, so as to find an appropriate $\alpha$ efficiently. In the RNCLgd algorithm, on the other hand, because the $\alpha$ parameter is adjusted automatically during operation, not only are the $\alpha$ values different for the different weights of each individual, but each $\alpha$ value is also evolved automatically throughout the entire operation process.
In fact, at the start of the evolution process the $\alpha$ value, which is set to a large value, strongly attenuates the weights of the neural networks so that the weights are kept within an appropriate range. As the iterations proceed, $\alpha$ is gradually reduced. During the evolving process, the $\alpha$ value is sometimes even reduced to zero, indicating that the corresponding weights have become small enough that the further attenuation penalty imposed by the $\alpha$ value should be removed.
IV. CONCLUSION
In this paper, we have described a neural network ensembles (NNE) classifier. A neural network ensemble is a combination of a set of neural networks that tries to cope with a problem in a robust and efficient way. The NNE classifier adopts the regularized negative correlation learning (RNCL) algorithm: adding a regularization term can reduce the over-fitting problem, control the complexity of the network, attain the optimal balance between the empirical training error and the weight decay term, and help the network improve its generalization capability. We propose an evolving neural network ensembles (ENNE) classifier using the VGA algorithm, which makes two improvements to the NNE classifier with the RNCL algorithm: the first is to evolve the appropriate architecture and initial weights of the NNE using the VGA algorithm; the second is to optimize the regularization parameter automatically while evolving the NNE's weights. The effectiveness of the ENNE classifier is demonstrated on a number of benchmark data sets. Compared with the BP-MLP and NNE classifiers, the ENNE classifier with the hybrid VGA and RNCLgd algorithm shows better pattern classification performance.
REFERENCES
[1] J. A. Benediktsson, P. H. Swain, and O. K. Ersoy, "Neural network approaches versus statistical methods in classification of multi-spectral remote sensing data," IEEE Transactions on Geoscience and Remote Sensing, vol. 28, pp. 540-552, 1990.
[2] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. New York: Addison-Wesley, 1989.
[3] H. Chen and X. Yao, "Regularized negative correlation learning for neural network ensembles," IEEE Trans. Neural Networks, vol. 20, no. 12, pp. 1962-1979, 2009.
[4] L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 10, pp. 993-1001, 1990.
[5] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. John Wiley & Sons, 2001.
[6] D. Rumelhart, G. Hinton, and R. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, 1986.
[7] S. Bandyopadhyay and S. K. Pal, "Pixel classification using variable string genetic algorithms with chromosome differentiation," IEEE Transactions on Geoscience and Remote Sensing, vol. 39, no. 2, 2001.
[8] X. Yao and Y. Liu, "Making use of population information in evolutionary artificial neural networks," IEEE Trans. Systems, Man and Cybernetics, Part B, vol. 28, no. 3, pp. 417-425, 1998.
[9] X. Yao, "Evolving artificial neural networks," Proceedings of the IEEE, vol. 87, no. 9, pp. 1423-1447, 1999.
[10] Y. Liu and X. Yao, "Ensemble learning via negative correlation," Neural Networks, vol. 12, no. 10, pp. 1399-1404, 1999.
[11] M. De Felice and X. Yao, "Short-term load forecasting with neural network ensembles: a comparative study," IEEE Computational Intelligence Magazine, pp. 47-56, 2011.