
Pattern Recognition 45 (2012) 1373–1385


Coevolutionary learning of neural network ensemble for complex classification tasks

Jin Tian, Minqiang Li*, Fuzan Chen, Jisong Kou

School of Management, Tianjin University, Tianjin 300072, PR China

Article info

Article history:

Received 28 January 2010
Received in revised form 18 August 2011
Accepted 26 September 2011
Available online 1 October 2011

Keywords:

Ensemble learning

Neural network

Coevolutionary algorithm

Classification

0031-3203/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.patcog.2011.09.012

* Corresponding author.
E-mail addresses: [email protected] (J. Tian), [email protected] (M. Li), [email protected] (F. Chen), [email protected] (J. Kou).

Abstract

Ensemble approaches to classification have attracted a great deal of interest recently. This paper presents a novel method for designing the neural network ensemble using a coevolutionary algorithm. The bootstrap resampling procedure is employed to obtain different training subsets that are used to estimate different component networks of the ensemble. Then the cooperative coevolutionary algorithm is developed to optimize the ensemble model via the divide-and-cooperative mechanism. All component networks are coevolved in parallel in the scheme of interacting co-adapted subpopulations. The fitness of an individual from a particular subpopulation is assessed by associating it with the representatives from other subpopulations. In order to promote the cooperation of all component networks, the proposed method considers both the accuracy and the diversity among the component networks, which are evaluated using the multi-objective Pareto optimality measure. A hybrid output-combination method is designed to determine the final ensemble output. Experimental results illustrate that the proposed method is able to obtain neural network ensemble models with better classification accuracy in comparison with currently popular ensemble algorithms.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Ensemble learning, one of the most efficient approaches to complex classification problems in the last decade, proceeds by building a combination of several neural networks (NNs) or other types of classifiers. The neural network ensemble (NNE) [1] consists of a finite number of NNs and has been intensively studied in recent years. The principle of the NNE is to use a group of component networks to better capture important patterns in the data by combining the capabilities of multiple component networks that are trained for the same task. The most important requirement of the NNE is the diversity of the component networks, which are estimated by parallel or sequential heuristics.

Previous theoretical and empirical literature has shown that an ensemble is often more accurate than a single classifier in the ensemble [2,3]. Hansen and Salamon proved that the ensemble error rate can be reduced to zero as the number of component networks goes to infinity if each network has an average error rate of less than 50% and all component network responses are independent [1]. Theoretically, the classification accuracy of the ensemble and the diversity of the component networks are the two major objectives that should be simultaneously maximized in the ensemble design. Standard ensemble techniques, such as Boosting [4] and Bagging [5], rely on resampling techniques to obtain different training subsets for the component networks. Various ensemble learning methods based on evolutionary algorithms (EAs) have been developed, mainly adopting Genetic Programming [6,7]. However, component networks are often designed independently or sequentially in these ensemble learning algorithms, and little attention is paid to the interaction among the component networks in the training process, whereas the cooperation among all component networks helps preserve the component diversities.

The cooperative coevolutionary algorithm (Co-CEA) is a popular augmentation of standard EAs, where adaptive fitness evaluation is used to realize the coevolutionary mechanism. In contrast with other stochastic optimization approaches, the Co-CEA resorts to the parallel evolution of multiple subpopulations that depend on one another in fitness evaluation. The Co-CEA model adopts the divide-and-cooperative strategy to co-evolve all subpopulations simultaneously. An individual in a subpopulation represents a particular component of the complete solution, so that a complete solution is assembled from representative individuals of the subpopulations [8]. An individual in a subpopulation is evaluated by associating it with representatives from other subpopulations, and it is rewarded when the complete solution performs well, or punished when the complete solution performs poorly. When the Co-CEA paradigm was used to optimize the NNE design in previous research, component networks were encoded and coevolved in a single population, and the ensemble was constructed by eliciting heterogeneous individuals from the population. The output of the ensemble was the linear combination of all component networks, and the combination weights were coevolved in another population. In this way, the component networks and the weights were evolved in two interacting populations concurrently.

In this paper, we propose a novel approach for NNE design based on the Co-CEA, which aims at a high-accuracy and compact NNE for complex classification problems by evolving multiple subpopulations of the Co-CEA. Real-world complex classification problems often contain a huge number of instances with noise, and there are multiple classes of instances that are usually nonlinearly separable. These features make the designed classification models complicated. The NNE is a suitable approach for building robust learning models on multidimensional classification problems.

The proposed approach takes advantage of the cooperative coevolutionary paradigm. Suppose that the NNE consists of M component networks. A subpopulation of individuals is generated for each component network, and thus the proposed model has M interacting co-adapted subpopulations. All the subpopulations are evolved by EAs, but the fitness of an individual from a particular subpopulation is evaluated by associating it with representatives from other subpopulations. Furthermore, a hybrid output-combination method is designed for the ensemble output, instead of calculating combination weights. By introducing a threshold, the hybrid output-combination method not only avoids the defect that the majority of components with smaller outputs dominate the minority with larger outputs, but also combines the winner-takes-all rule with the voting rule effectively and reinforces the performance of the ensemble components via outputs aggregated under a winner-takes-all decision rule. The other important aspect of our work is the use of multiobjective optimization for fitness evaluation during the population evolution. In this way, each individual is thoroughly evaluated from different points of view.

This approach has several advantages:

1) Each component network corresponds to a subpopulation of the Co-CEA, and the diversity of all component networks is mainly determined by the diversity of all subpopulations.

2) The independent evolution of the multiple subpopulations increases the opportunity to obtain heterogeneous ensemble components, and thus it offers an effective way of keeping the component networks diverse. The coevolutionary approach to identifying the ensemble model contributes to the explorative search in the feasible space of ensemble networks for better global optimality.

3) The hybrid output-combination method avoids the defect of the majority voting rule that the ensemble output is determined by the majority of small outputs rather than the minority of big outputs, which usually leads samples to be wrongly classified.

The rest of the paper is organized as follows: Section 2 reviews work relevant to ensemble learning. In Section 3, the proposed method is presented in detail. Experimental results based on datasets from the UCI repository are reported in Section 4. Finally, Section 5 summarizes the key points of the paper and points out future research directions.

2. Related work

Many variants of ensemble approaches have been invented in the machine learning literature [9,10]. Boosting and Bagging are the two typical techniques for obtaining training subsets for the component networks. Boosting iteratively constructs a sequence of component networks, where each network focuses on the examples misclassified by the previous ones. Bagging yields a unique training subset for each ensemble component by sampling uniformly with replacement from the original training data. However, approaches based on these two techniques generate component networks independently or sequentially, and the interaction among the components during the training process is rarely considered.

As a useful alternative, evolutionary computation has also been successfully applied to ensemble construction. Yao and Liu proposed constructing the final ensemble by linearly combining all individuals in the last generation [11], and further designed a negative correlation learning paradigm for the NNE, which trained component networks simultaneously by incorporating a correlation penalty in the error function [12]. Chandra and Yao presented the multi-level evolution of ensemble classifiers, called the diverse and accurate ensemble learning system [13]. Ortiz-Boyer et al. used a real-coded genetic algorithm to optimize the weights of components within an ensemble [14]. In these approaches, evolution is another fundamental form of adaptation in addition to learning, which makes adaptation in a dynamic environment much more effective and efficient. The evolution-based NNE can be regarded as a general framework for adaptive systems that can change their architectures and learning rules adaptively without human intervention [15].

Recently, some in-depth studies on NNE learning have been conducted using the cooperative paradigm, which is more compatible with the NNE's divide-and-conquer learning mechanism than the standard EAs.

Islam et al. proposed a constructive NNE algorithm (CNNE) using incremental training to estimate component networks [16]. The CNNE combined the NNE's architecture design with the parameter identification of the component networks in an ensemble. Component networks were trained incrementally for different numbers of epochs on the same training dataset. During the incremental training, a component network was built by adding hidden nodes. Component networks joined the ensemble architecture one by one once they were completely trained, and the output of the ensemble was the average of all component outputs. Negative correlation learning was used for training component networks to facilitate interactions among the component networks in an ensemble.

García-Pedrajas et al. presented a cooperative coevolution algorithm for learning an ensemble by maintaining two populations, i.e., a population of networks and a population of ensembles [17]. The first population was composed of independent subpopulations that encoded component networks. Each subpopulation was evolved using evolutionary programming. In the second population, an individual represented an ensemble formed by combining component networks from the subpopulations of the first population, and it was encoded with the labels of the component networks and the combination weights. The ensemble output was the linear weighted sum of the outputs of all component networks. Thus, the two populations were evolved cooperatively to find the optimal ensemble model.

In comparison, our model offers a multi-subpopulation ensemble learning approach based on the Co-CEA. Each subpopulation evolves a component network of the ensemble independently, whereas the fitness of the individuals in one subpopulation is determined by the ensemble containing the individual and the representatives of the other subpopulations. Thus the components used for fitness calculation are always the representatives of the subpopulations. Multiobjective optimization is introduced with regard to the ensemble accuracy and the accuracy and cooperation of the component networks. Furthermore, we design a hybrid output-combination method by combining the winner-takes-all rule with the voting rule effectively.

In a previous work [18] we used the Co-CEA methodology for the design of radial basis function neural networks (RBFNNs). The hidden nodes of the RBFNN model were partitioned into modules of hidden nodes by a modified K-means method. We generated several subpopulations for the modules of the network structure. An individual of a subpopulation represented a part of the hidden layer structure. Collaborations among the modules were required to obtain complete solutions. The combination of the standard RBFNN training algorithm and the proposed Co-CEA enhanced the strength of both methods. The model, with a matrix-form mixed encoding, obtained good results on several classification problems. Our present approach applies the divide-and-cooperative mechanism of the Co-CEA to the ensemble design. We use a similar matrix-form mixed encoding, but an individual of a subpopulation represents a whole component network of the ensemble rather than a part of the hidden layer structure of an RBFNN.

3. The cooperative coevolutionary neural network ensemble method

This section outlines the architecture of the cooperative coevolutionary neural network ensemble (Co-NNE) method based on multiple subpopulations. The bootstrap resampling method works on the original training data and yields a unique training subset for each component network by sampling uniformly with replacement from the original data. Each component network is a three-layer RBFNN with multiple inputs and multiple outputs. The DRSC method [19] is implemented to create the initial hidden nodes. The output weights between the hidden layer and the output layer of the RBFNN are calculated directly by the pseudo-inverse method [20]. In the Co-NNE, an individual corresponds to a candidate for a component network of the ensemble, and a group of individuals is generated for each component network as a subpopulation in the coevolution mechanism. A matrix-form mixed encoding method is designed for all subpopulations. The motivation is that the ensemble learning task may benefit from evolving the component networks concurrently with separate subpopulations. The subpopulations are evolved separately, and the fitness of an individual in a subpopulation is assessed by associating it with individuals from other subpopulations. The Co-NNE outputs the complete ensemble solution by integrating the best individuals from all subpopulations.

3.1. Initialization

In this paper, we design a matrix-form mixed-encoding chromosome for the individuals in the subpopulations, where the hidden nodes and the radius widths of the RBFNN are encoded as real-valued matrices, and a control vector, which indicates whether the hidden nodes are active or not, is attached to the matrix as a binary column. In this mixed encoding representation, an individual in one subpopulation represents a complete component network structure. Thus the $l$th individual in the $t$th subpopulation, $P_t^l$, is a matrix of size $N_{ct} \times (m+2)$ as below:

$$P_t^l = \left[\, \mathbf{c}_t^l \;\; \boldsymbol{\sigma}_t^l \;\; \mathbf{b}_t^l \,\right] = \begin{bmatrix} c_t^{l1} & \sigma_t^{l1} & b_t^{l1} \\ c_t^{l2} & \sigma_t^{l2} & b_t^{l2} \\ \vdots & \vdots & \vdots \\ c_t^{lN_{ct}} & \sigma_t^{lN_{ct}} & b_t^{lN_{ct}} \end{bmatrix} \qquad (1)$$

where $l=1,2,\ldots,L$ and $t=1,2,\ldots,M$. $L$ is the population size, $M$ is the ensemble size, $m$ is the number of features, and $N_{ct}$ is the initial number of hidden nodes of the $t$th subpopulation. $\mathbf{c}_t^l = [c_t^{li}]_{N_{ct}\times m}$ and $\boldsymbol{\sigma}_t^l = [\sigma_t^{li}]_{N_{ct}\times 1}$ are the centers and widths of the hidden nodes of $P_t^l$; $\mathbf{b}_t^l = [b_t^{li}]_{N_{ct}\times 1}$ is the control vector, where $b_t^{li}=0$ means that the $i$th hidden node of $P_t^l$ is inactive and excluded from the network structure, and $b_t^{li}=1$ denotes a valid hidden node of the network structure, $i=1,2,\ldots,N_{ct}$. The control vector makes the network structure of an individual more adaptable in evolution.
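For concreteness, a minimal NumPy sketch of one individual under this matrix-form mixed encoding is shown below. The function name, the way the seed structure is perturbed, and the guard that keeps at least one node active are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def init_individual(centers, widths, rng):
    """Build one individual P = [c | sigma | b] of shape (Nct, m + 2).

    centers: (Nct, m) initial RBF centers (e.g. from a clustering step)
    widths:  (Nct,)   initial radius widths
    The last column is the binary control vector b (1 = hidden node active).
    """
    n_ct, m = centers.shape
    # perturb the seed structure with standard-normal noise, as in Section 3.1
    c = centers + rng.standard_normal((n_ct, m))
    sigma = np.abs(widths + rng.standard_normal(n_ct))
    b = rng.integers(0, 2, size=n_ct)
    if b.sum() == 0:              # no individual may deactivate every hidden node
        b[rng.integers(n_ct)] = 1
    return np.column_stack([c, sigma, b])

rng = np.random.default_rng(0)
seed_centers = rng.random((5, 3))     # Nct = 5 hidden nodes, m = 3 features
seed_widths = np.full(5, 0.5)
P = init_individual(seed_centers, seed_widths, rng)
print(P.shape)                        # (5, 5) = (Nct, m + 2)
```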

The initialization of the proposed method is done in three steps. First, the total samples are divided into three parts: the training set, the validation set, and the testing set. The bootstrap resampling method is adopted to create M training subsets independently from the original training set. The probability that a sample is selected at least once in $N_{train}$ bootstrap resampling operations is

$$1 - \left(1 - \frac{1}{N_{train}}\right)^{N_{train}} \qquad (2)$$

where $N_{train}$ is the size of the training set. When $N_{train}$ becomes very large, this probability approaches $1 - 1/e \approx 63.2\%$, which indicates that about 63.2% of the distinct samples appear in one bootstrap resample [21].
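The limiting value 1 − 1/e ≈ 63.2% of Eq. (2) can be verified numerically; the following snippet is only an illustrative check, not part of the original experiments.

```python
import math

for n_train in (10, 100, 1000, 10000):
    p = 1.0 - (1.0 - 1.0 / n_train) ** n_train   # Eq. (2)
    print(n_train, round(p, 4))
print("limit:", round(1.0 - 1.0 / math.e, 4))     # ~0.6321
```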

Second, based on the M different training subsets, the M initial component network structures are generated. The initial hidden nodes are generated by the DRSC, and the initial values of the radius widths are calculated according to the clustering information of the sample distribution [22].

Finally, a population of L individuals is generated to form a subpopulation for each component network by adding random variations drawn from the standard normal distribution. The control vector bits of an individual are initialized as 0 or 1 randomly, which indicates that the corresponding hidden nodes are inactive or active. Note that there is at least one individual in each subpopulation whose control vector bits are all 1, and there is no individual whose control vector bits are all 0.

The best individuals in all subpopulations are chosen as the elite representatives, which compose the elite pool $H^* = \{P_1^*, P_2^*, \ldots, P_M^*\}$. The elites in the initial elite pool, $P_1^{*0}, P_2^{*0}, \ldots, P_M^{*0}$, are the RBFNN structures obtained by the DRSC in the initialization process.

3.2. The hybrid output-combination method

Generally, each RBF neuron has a receptive field, and the output for a sample that falls outside the receptive field will be small. Thus, if the majority voting method is used, the output of an ensemble composed of RBFNNs may be decided by the majority of components with small outputs rather than the minority with large outputs. On the other hand, the winner-takes-all method suffers from the limitation that components with outputs close to the maximum are always ignored.

In the proposed method, a hybrid output-combination method is designed to determine the final ensemble output, instead of calculating combination weights. The threshold $\delta$ is predefined as $\delta \in (0,1)$ ($\delta = 0.5$ in this article). For a sample, $\mathbf{y}_t = (y_t^1, y_t^2, \ldots, y_t^n)^T$ is the output vector of the $t$th component network ($t=1,2,\ldots,M$) in the elite pool, where $n$ is the total number of classes in the dataset. The output vectors are then preprocessed before the output of the ensemble is computed.

a) If $y_t^j < \delta$ for all $j \in \{1,2,\ldots,n\}$ and all $t \in \{1,2,\ldots,M\}$, every $\mathbf{y}_t$ is kept intact. Then we compute the linear aggregation of the output vectors of all component networks, and the ensemble output is generated by the winner-takes-all decision rule.

b) Otherwise, each $\mathbf{y}_t$ ($t=1,2,\ldots,M$) is preprocessed as follows: if $y_t^j < \delta$ ($j=1,2,\ldots,n$), then $y_t^j$ is reset to 0. Then we compute the linear aggregation of the reset output vectors of all component networks, and the ensemble output is decided by the winner-takes-all decision rule.

Table 1. Hybrid output-combination method of the ensemble used in Co-NNE.

Input: the M outputs of the component networks $\{\mathbf{y}_t\}$ on a certain sample, $\mathbf{y}_t = (y_t^1, y_t^2, \ldots, y_t^n)^T$, $t = 1,2,\ldots,M$
1  For t = 1 to M {
2    $y_t^{\delta} = \{y_t^j \mid y_t^j < \delta,\; j \in \{1,2,\ldots,n\}\}$ (find the $y_t^j$ in $\mathbf{y}_t$ that are smaller than $\delta$)
3    If $y_t^{\delta} \neq \emptyset$ then
4      $\bar{y}_t^{\delta} = \{y_t^j \mid y_t^j \in \mathbf{y}_t \text{ and } y_t^j \notin y_t^{\delta},\; j = 1,2,\ldots,n\}$ (find the elements that are bigger than $\delta$)
5      $y_t^{\delta} = 0$ (reset the $y_t^j$ ($j \in \{1,2,\ldots,n\}$) smaller than $\delta$ to 0)
6      $\mathbf{y}'_t = y_t^{\delta} \cup \bar{y}_t^{\delta}$ ($\cup$ means to unite the two sets)
7    Otherwise $\mathbf{y}'_t = \mathbf{y}_t$
8  }
9  $S^{\delta} = (s_1^{\delta}, s_2^{\delta}, \ldots, s_n^{\delta})^T = \mathrm{sum}(\mathbf{y}'_t)$, $t = 1,2,\ldots,M$ (sum the reset component outputs)
10 $maxs = \max\{s_1^{\delta}, s_2^{\delta}, \ldots, s_n^{\delta}\}$. If $maxs = 0$, which means that the outputs of all component networks on the sample are smaller than $\delta$, go to step 11; otherwise $cl = \arg\max\{s_1^{\delta}, s_2^{\delta}, \ldots, s_n^{\delta}\}$, go to step 13
11 $S = (s_1, s_2, \ldots, s_n)^T = \mathrm{sum}(\mathbf{y}_t)$, $t = 1,2,\ldots,M$ (sum the original component outputs)
12 $cl = \arg\max\{s_1, s_2, \ldots, s_n\}$
13 Obtain the final class label of the sample, $cl$
Output: $cl$


Therefore, by introducing the threshold, the hybrid output-combination method can avoid the phenomenon that the majority of component networks with smaller outputs dominate the minority with larger outputs. Moreover, the method also combines the winner-takes-all rule with the voting rule effectively and reinforces the performance of the ensemble components via the aggregation mechanism. The hybrid generation of the ensemble combination output is summarized in Table 1.
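A compact Python sketch of the procedure in Table 1 is given below. It assumes each component output is a length-n vector of class scores and uses δ = 0.5 as in the paper; it is an illustrative rendering, not the authors' code.

```python
import numpy as np

def hybrid_output_combination(outputs, delta=0.5):
    """Hybrid output-combination (cf. Table 1).

    outputs: list of M arrays, each of shape (n,), one per component network.
    Returns the predicted class index cl.
    """
    reset = []
    for y in outputs:
        y = np.asarray(y, dtype=float)
        if np.any(y < delta):
            y = np.where(y < delta, 0.0, y)   # zero the entries below the threshold
        reset.append(y)
    s_delta = np.sum(reset, axis=0)           # aggregate the preprocessed outputs
    if s_delta.max() > 0:
        return int(np.argmax(s_delta))        # winner-takes-all on the reset sums
    # every component output was below delta: fall back to the raw sums
    s = np.sum(outputs, axis=0)
    return int(np.argmax(s))

# toy example: three components, three classes
outs = [np.array([0.9, 0.1, 0.2]),
        np.array([0.3, 0.4, 0.2]),
        np.array([0.2, 0.6, 0.1])]
print(hybrid_output_combination(outs))        # class 0 wins (largest aggregated score 0.9)
```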

Furthermore, it is not necessary to calculate every component network's output from scratch; the component outputs that have already been calculated can be reused. Thus a caching mechanism is used here, and the outputs of the components in the elite pool, $\mathbf{y}_t^*$ ($t=1,\ldots,M$), are stored in advance. It is not necessary to update all the elite outputs, but only the output of the elite component that has just been replaced by a more competitive individual. All the other elite outputs remain unchanged. Similarly, the output of the NNE $H_t^l = \{P_1^*, \ldots, P_{t-1}^*, P_t^l, P_{t+1}^*, \ldots, P_M^*\}$ is determined by calculating only the output $\mathbf{y}_t^l$ of the component network $P_t^l$ instead of the M outputs of the NNE $H_t^l$. The ensemble output is generated with the method listed in Table 1 by combining $\mathbf{y}_t^l$ with the other elite outputs $\mathbf{y}_s^*$ ($s=1,\ldots,M$, $s \neq t$). The output caching mechanism is able to reduce the computational complexity, which is traded for a fixed-length storage complexity.

3.3. Multiobjective evaluation of individuals

With the coevolution paradigm, each subpopulation must contribute an individual to construct a complete NNE structure. Thus, individuals in one subpopulation are assigned fitness values in conjunction with individuals from other subpopulations. Generally, the ensemble classification accuracy and the component diversity are the two key objectives that should be simultaneously maximized in the ensemble design. Moreover, every individual in the subpopulations is a component network of the ensemble in the proposed method, which, by the definition of a component network, can perform the classification task independently. Thus the classification abilities of individuals by themselves are also considered in the multiobjective evaluation and individual selection. Consequently, three objectives are used in the proposed method, and all of them have to be maximized.

3.3.1. Ensemble performance

The ensemble combination output is used as the first objective to evaluate the contribution of a component network to the ensemble. This objective of an individual is calculated based on how well it works with the other subpopulations. The $l$th individual in the $t$th subpopulation, $P_t^l$, is evaluated by the performance of the whole NNE structure $H_t^l = \{P_1^*, \ldots, P_{t-1}^*, P_t^l, P_{t+1}^*, \ldots, P_M^*\}$. Thus the objectives of the individual $P_t^l$ are calculated from the NNE structure $H_t^l$. The evaluation process is executed on the validation set. The hybrid output-combination method described in Section 3.2 is used to determine the final ensemble output. The first objective is represented as the ratio of correctly classified samples in the total validation set, i.e. $A_{rv}(P_t^l) = N_{rv}(H_t^l)/N_v$, where $N_{rv}(H_t^l)$ is the number of samples classified correctly on the validation set by the estimated network with hidden node structure $H_t^l$, and $N_v$ is the size of the validation set. In experiments, similar accuracy rates of different individuals usually give rise to small selection pressure in the population. So the objective is modified to increase the selection pressure as

$$f_1(P_t^l) = \alpha (1-\alpha)^{I_1(P_t^l)} \qquad (3)$$

where $I_1(P_t^l)$ is the rank order of $P_t^l$ when the classification accuracies of all $P_t^l$ ($l=1,\ldots,L$) are sorted in descending order. $\alpha \in (0,1)$ is a user-defined real number ($\alpha = 0.4$ by default).

3.3.2. Independent performance

This objective evaluates individuals in a subpopulation when they are required to complete the classification task independently, with no information from other subpopulations. It is represented as the ratio of correctly classified samples in the total training set by the individual, i.e. $A_{rt}(P_t^l) = N_{rt}(P_t^l)/N_{train}$, where $N_{rt}(P_t^l)$ is the number of samples classified correctly in the training set by the individual $P_t^l$, and $N_{train}$ is the size of the training set. When we sample with replacement, some samples may be selected multiple times while others may not be selected at all; that is, some unique samples do not appear in the training subset used to generate the ensemble model. Thus the original training set is used to evaluate the second objective, which avoids using the same validation set as in the first objective evaluation process. Similarly, to avoid the small selection pressure caused by similar accuracy rates of different individuals, the second objective is modified as

$$f_2(P_t^l) = \alpha (1-\alpha)^{I_2(P_t^l)} \qquad (4)$$

3.3.3. Measure of diversity

In the Co-NNE, in order to find a tradeoff between the classification performance and the component diversity, the third objective aims to evaluate the diversity of the ensemble. Diversity of the ensemble means that the component networks have to make independent classification errors. So, with regard to the outputs of the networks on the same sample set, the diversity of the ensemble is usually measured by the difference of the component outputs. We adopt the Pairwise Failure Crediting (PFC) method [13] to obtain the diversity value.

The third objective of the $l$th individual in the $t$th subpopulation, $P_t^l$, measures the diversity of the ensemble composed of $P_t^l$ and the representatives of the other subpopulations, $P_1^*, \ldots, P_{t-1}^*, P_{t+1}^*, \ldots, P_M^*$. For each component network in the ensemble, a failure pattern vector is defined. Essentially, the $k$th element of the vector is equal to 1 or 0, which indicates that the $k$th sample in the validation set is correctly or incorrectly classified by the network. The Hamming distance surely gives a measure of similarity between two networks. However, in order to make this distance useful in computing the component diversity with respect to the ensemble, the Hamming distance is divided by the total failures of both component networks (given by the sum of the numbers of 0s in the failure pattern vectors of both networks). The modified Hamming distance is called the failure credit. Therefore, the failure credit of $P_t^l$ with respect to the $j$th component $P_j^*$ in the elite pool ($j=1,2,\ldots,M$ and $j \neq t$) is defined as

$$h_{tj}^l = \frac{\mathrm{Ham}(\mathbf{o}_t^l, \mathbf{o}_j^*)}{Nf_{tj}^l} = \frac{\mathrm{Xor}(\mathbf{o}_t^l, \mathbf{o}_j^*)}{Nf_{tj}^l} \qquad (5)$$

where $\mathbf{o}_j^* = [o_j^{*1}, o_j^{*2}, \ldots, o_j^{*N_v}]$ is the failure pattern vector of $P_j^*$ and $\mathbf{o}_t^l = [o_t^{l1}, o_t^{l2}, \ldots, o_t^{lN_v}]$ is the failure pattern vector of $P_t^l$. $\mathrm{Ham}(\mathbf{o}_t^l, \mathbf{o}_j^*)$ is the original Hamming distance between the two failure patterns, and $Nf_{tj}^l$ is the sum of the numbers of failures recorded by the two component networks. Specially, we set $h_{tj}^l = 0$ if $\mathrm{Ham}(\mathbf{o}_t^l, \mathbf{o}_j^*) = Nf_{tj}^l = 0$. The maximum value of $h_{tj}^l$ is 1 when the two networks make errors on disjoint portions of the validation set, and the minimum value of $h_{tj}^l$ is zero when the two networks make errors on exactly the same samples of the validation set or make no errors at all [13].

In order to compute the diversity of the component networks in the ensemble, i.e., how different a component is with respect to the others in the ensemble, we take the average of the failure credits of a component network in the ensemble. Therefore, the diversity objective of $P_t^l$ is defined as

$$f_3(P_t^l) = \frac{\sum_{j=1,\ldots,M;\; j \neq t} h_{tj}^l}{M-1} \qquad (6)$$
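A small sketch of the PFC quantities in Eqs. (5) and (6) follows, assuming the failure pattern vectors are 0/1 arrays with 1 meaning the sample was classified correctly; it is illustrative only.

```python
import numpy as np

def failure_credit(o_a, o_b):
    """Pairwise failure credit of Eq. (5): Hamming distance of the failure
    patterns divided by the total number of failures of both networks."""
    o_a, o_b = np.asarray(o_a), np.asarray(o_b)
    ham = int(np.sum(o_a != o_b))                 # Xor / Hamming distance
    n_fail = int(np.sum(o_a == 0) + np.sum(o_b == 0))
    return 0.0 if n_fail == 0 else ham / n_fail

def pfc_diversity(o_l, elite_patterns):
    """Diversity objective of Eq. (6): average failure credit of one individual
    against the M-1 elite representatives of the other subpopulations."""
    credits = [failure_credit(o_l, o_star) for o_star in elite_patterns]
    return sum(credits) / len(credits)

# the two toy networks below fail on disjoint samples, so the failure credit is 1.0
print(failure_credit([1, 0, 1, 1], [1, 1, 0, 1]))
```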

The fitness of individuals is evaluated as a multi-objective optimization task in the proposed method because it is difficult to weigh the different objectives in an aggregating approach. A multiobjective algorithm is therefore adopted to evaluate the fitness of individuals [23,24]. Since the objectives are in conflict with each other, there is usually no solution that maximizes all objectives simultaneously. Rather, solutions based on Pareto optimality are able to guarantee the diversity of the population [17].

3.4. Selection based on Pareto optimality

The selection operation is based on the Pareto ranking previously used in the NSGA-II algorithm [25]. The conflict between multiple objectives motivates the concept of dominance, which is one of the main concepts in Pareto optimization.

Suppose that the objective vector $\mathbf{f}_t^l = (f_1(P_t^l), f_2(P_t^l), f_3(P_t^l))$ represents the fitness of the individual $P_t^l$. The vector $\mathbf{f}_t^l$ dominates another vector $\mathbf{f}_t^{l'} = (f_1(P_t^{l'}), f_2(P_t^{l'}), f_3(P_t^{l'}))$ ($l \neq l'$) if $f_i(P_t^l) \geq f_i(P_t^{l'})$ for all $i \in \{1,2,3\}$, and there exists $j \in \{1,2,3\}$ for which $f_j(P_t^l) > f_j(P_t^{l'})$. A vector $\mathbf{f}_t^l \in D$ (where $D = \{\mathbf{f}_t^1, \mathbf{f}_t^2, \ldots, \mathbf{f}_t^L\}$ is the fitness set of the individuals in one subpopulation) is said to be Pareto-optimal in $D$ if there is no $\mathbf{f}_t^{l'} \in D$ such that $\mathbf{f}_t^{l'}$ dominates $\mathbf{f}_t^l$. The Pareto frontier is then the set of vectors in $D$ that are not dominated by other vectors in $D$.

Subsequent Pareto fronts are obtained, and nondominated individuals are assigned an equal rank. The individuals in the first nondominated front get rank 1. The individuals in the other fronts carry ranks 2, 3, or larger integers. Thus, the individuals of a nondominated front are assigned identical fitness.

To estimate the density of individuals surrounding a particular individual in the population, the crowding distances of the individuals in each front are computed; they measure the average distance of the two individuals on either side of this individual along each of the objectives [25]. This quantity serves as an estimate of the perimeter of the cuboid formed by using the nearest neighbors as the vertices. Tournament selection is utilized to select individuals into the mating pool for the next generation. The individuals in the first nondominated front are selected first. If the number of individuals in the first nondominated front is less than the population size, the individuals in subsequent Pareto fronts are further considered. Specially, if more than one candidate of a certain front is selected during one tour, the one with the maximum crowding distance is selected.
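The dominance test and front ranking used here can be sketched as below with a simple O(L²) ranking; the full NSGA-II crowding and tournament bookkeeping is omitted, and maximization of all objectives is assumed.

```python
def dominates(fa, fb):
    """fa dominates fb if it is no worse in every objective and better in at least one."""
    return all(a >= b for a, b in zip(fa, fb)) and any(a > b for a, b in zip(fa, fb))

def pareto_ranks(fitness):
    """Assign rank 1 to the non-dominated front, rank 2 to the next front, and so on."""
    remaining = set(range(len(fitness)))
    ranks = {}
    rank = 1
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(fitness[j], fitness[i]) for j in remaining if j != i)}
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return [ranks[i] for i in range(len(fitness))]

fits = [(0.9, 0.8, 0.4), (0.7, 0.9, 0.5), (0.6, 0.6, 0.3), (0.9, 0.8, 0.4)]
print(pareto_ranks(fits))   # the third vector is dominated and lands in a later front
```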

In addition, elitist selection is adopted so that the best solutions in all subpopulations survive to the next generation, which keeps the optimal solutions once they are found during the whole coevolution process. If there is an individual $P_t^{l'}$ in the $t$th subpopulation that dominates the corresponding elite $P_t^*$ in the elite pool under Pareto optimality, $P_t^*$ is substituted by $P_t^{l'}$. Thus the new elite pool $H^* = H_t^{l'} = \{P_1^*, \ldots, P_{t-1}^*, P_t^{l'}, P_{t+1}^*, \ldots, P_M^*\}$ is obtained, and the fitness of the new elite is updated as well. With the elitist selection strategy, the proposed method will output the NNE model decoded from the elite pool.

3.5. Crossover

The crossover operation explores the whole search space and aims to find the global optimum. Two-point crossover is used in the subpopulations to exchange information between two individuals picked randomly from the mating pool, producing two offspring. The individuals that undergo the crossover operation are grouped into pairs, and for every pair two crossover points are generated randomly. The genes of the two individuals between the two points are exchanged, and the two offspring are produced.
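A minimal two-point crossover over the matrix representation might look like the following; treating whole hidden-node rows as genes is an assumption about the operator's granularity.

```python
import numpy as np

def two_point_crossover(parent_a, parent_b, rng):
    """Swap the rows (hidden-node genes) lying between two random cut points."""
    n_rows = parent_a.shape[0]
    p1, p2 = sorted(rng.choice(n_rows + 1, size=2, replace=False))
    child_a, child_b = parent_a.copy(), parent_b.copy()
    child_a[p1:p2], child_b[p1:p2] = parent_b[p1:p2].copy(), parent_a[p1:p2].copy()
    return child_a, child_b

rng = np.random.default_rng(1)
a = np.arange(12, dtype=float).reshape(4, 3)      # 4 hidden nodes, (m + 2) = 3 columns
b = -np.arange(12, dtype=float).reshape(4, 3)
ca, cb = two_point_crossover(a, b, rng)
print(ca)
```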

3.6. Mutation

A ratio $p_{ad}$, which decides whether the mutation occurs in the control bit or in the real-valued part, is introduced to accommodate the special chromosome structure of the individuals. Suppose that $P_t^l = [\,\mathbf{c}_t^l \;\; \boldsymbol{\sigma}_t^l \;\; \mathbf{b}_t^l\,]$, where $\mathbf{c}_t^l = [c_t^{li}]_{N_{ct}\times m}$, $\boldsymbol{\sigma}_t^l = [\sigma_t^{li}]_{N_{ct}\times 1}$, and $\mathbf{b}_t^l = [b_t^{li}]_{N_{ct}\times 1}$, $i=1,2,\ldots,N_{ct}$. For a hidden node $c_t^{li}$ in $P_t^l$, a random number $rad$ is generated. If $p_{ad} > rad$, the operation only inverts the control bit (if the original bit is 0, it is mutated to 1, and vice versa). If $p_{ad} \leq rad$ and $b_t^{li} = 1$, the mutation introduces variations to the real-valued genes:

$$c_t^{li\prime} = c_t^{li} + N(0,1) \cdot (c_t^{*,i} - c_t^{li}) \qquad (7)$$

$$\sigma_t^{li\prime} = \sigma_t^{li} + N(0,1) \qquad (8)$$

where $c_t^{li\prime}$ and $\sigma_t^{li\prime}$ are the new values, $c_t^{li}$ and $\sigma_t^{li}$ are the current values, and $c_t^{*,i}$ is the corresponding value of the hidden node in the elite pool. $N(0,1)$ is a random number drawn from the standard normal distribution.

Table 2. Overall procedure of the Co-NNE algorithm.

Step 1 Initialize M subpopulations for the coevolution:
  (1) Obtain M training subsets by the bootstrap resampling method.
  (2) Create the M initial component networks by the DRSC method based on these different training subsets, and calculate the initial values of the radius widths with the clustering information of the sample distribution. The output weights between the hidden layer and the output layer of the RBFNN are calculated directly by the pseudo-inverse method.
  (3) Generate a population of L individuals for every component network to form a subpopulation, by adding random variations drawn from the standard normal distribution. The control vector bits of an individual are initialized as 0 or 1 randomly.
  (4) The initial elite pool is composed of the initial component networks generated by the DRSC. Predefine the number of iterations G, and set the initial generation as g = 1.
Step 2 Calculate the fitness values of the individuals based on the multiple objectives, and assign a rank to every individual based on Pareto optimality. If there are individuals dominating the elites, update the elites with the individuals that dominate them and are in the first Pareto front.
Step 3 Use the selection method to generate the mating pool population.
Step 4 Execute the crossover operation on individuals in the mating pool to produce offspring.
Step 5 Execute the mutation operation on the offspring and form the new subpopulation.
Step 6 If g < G and the convergence condition is not met, set g = g + 1 and go to Step 2; otherwise go to Step 7.
Step 7 Output the individuals in the elite pool as the final estimate of the NNE model.

Table 3. Characteristics of the datasets.

Datasets Train Validation Test Classes Attributes

Breast-w (Breast) 350 175 174 2 9

German 500 250 250 2 24

Glass 107 54 53 6 9

Heart-statlog (Heart) 134 68 68 2 13

Hepatitis (Hepa) 77 39 39 2 19

Horse 300 – 68 2 27

Ionosphere (Iono) 175 88 88 2 34

Lymphography (Lymph) 74 37 37 4 18

New-thyroid (New-thy) 107 54 54 3 5

Pima 384 192 192 2 8

Primary-tumor (Pri-tumor) 169 85 85 21 17

Segment 1162 574 574 7 19

Sonar 104 – 104 2 60

Soybean (Soy) 342 171 170 19 35

Vehicle 422 212 212 4 18

Votes 217 109 109 2 16

Waveform (Wave) 2500 1250 1250 3 21

Wines 89 45 44 3 13

Yeast 742 371 371 10 8

Zoo 53 24 24 7 16


3.7. Main procedure of the Co-NNE algorithm

The overall procedure of the Co-NNE algorithm is formulated in Table 2. The execution of the Co-CEA is defined to be convergent if the proportion of identical individuals exceeds a predefined percentage in all subpopulations.
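Putting the operators together, the generational loop of Table 2 can be summarized by the skeleton below. All function bodies are toy stubs standing in for the components described in Sections 3.1–3.6, so the sketch only illustrates the control flow, not the actual Co-NNE implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
M, L, G = 3, 6, 10          # ensemble size, subpopulation size, generations (toy values)

def evaluate(individual, elites):            # stub: the three objectives of Section 3.3
    return tuple(rng.random(3))

def dominates(fa, fb):
    return all(a >= b for a, b in zip(fa, fb)) and any(a > b for a, b in zip(fa, fb))

def select_crossover_mutate(subpop, fits):   # stub standing in for Sections 3.4-3.6
    return [ind + 0.01 * rng.standard_normal(ind.shape) for ind in subpop]

# Step 1: one subpopulation of random "networks" per ensemble member
subpops = [[rng.random((4, 5)) for _ in range(L)] for _ in range(M)]
elites = [subpop[0] for subpop in subpops]
elite_fits = [evaluate(e, elites) for e in elites]

for g in range(G):                           # Steps 2-6
    for t in range(M):
        fits = [evaluate(ind, elites) for ind in subpops[t]]
        for ind, f in zip(subpops[t], fits):
            if dominates(f, elite_fits[t]):  # elitist update of the elite pool
                elites[t], elite_fits[t] = ind, f
        subpops[t] = select_crossover_mutate(subpops[t], fits)

print("final ensemble:", [e.shape for e in elites])   # Step 7: decode the elite pool
```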

4. Experimental studies

In order to test the validity of the proposed method, we selected 20 datasets from the UCI Machine Learning Repository. These datasets are briefly characterized in Table 3. These real-world datasets differ with respect to the number of available samples (from 101 to 5000), attributes (from 5 to 60), and classes (from 2 to 21). Most datasets were divided into three subsets: 50% of the patterns were used for learning, 25% for validation, and the remaining 25% for testing, except Horse and Sonar. For these two datasets, all samples were prearranged in two exclusive subsets.

For each dataset, 30 runs of the algorithms were performed, and the average classification accuracies were calculated and reported in the following tables. The algorithm was treated as convergent if the proportion of identical individuals exceeded 80% in all subpopulations.

4.1. Experiment 1

The experiments were carried out to evaluate the performance of the Co-NNE. The experiment parameters used in the Co-NNE were set as follows. The population size L was 50, the maximum number of generations G was 200, and the ensemble size M was 25. The crossover probability $p_c$ was 0.8 and the structure mutation rate $p_{ad}$ was 0.6.

First, experiments were conducted to compare the performance of the proposed method with three classification methods. The first was the standard RBFNN. The second was the GA-RBFNN, which trained an RBFNN using the standard Genetic Algorithm (SGA) [26]. The third was the CO-RBFNN proposed in [18]. The evolutionary parameters, such as the crossover rate, the mutation rate, the population size, and the maximum number of generations, were set to the same values as in the Co-NNE. Table 4 reports the testing accuracies and standard deviations of the Co-NNE and the other algorithms. The t-test statistics were computed. The highest accuracy in each row is outlined in bold and is further marked with '*' if it is statistically significant with a confidence level of 95%.

As shown in Table 4, the Co-NNE obtains the best testing accuracies on 15 datasets and performs significantly better on German, Horse, Segment, Sonar, Vehicle, Wave and Yeast. The Co-NNE outperforms the RBFNN, the GA-RBFNN, and the CO-RBFNN on 18, 17 and 8 datasets, respectively, with a confidence level of 95%. The CO-RBFNN performs better than the Co-NNE on 5 datasets, but the differences are not significant except on Lymph. The average rank of the Co-NNE is 1.25. Additionally, the Co-NNE achieves smaller standard deviations than the compared algorithms.

Second, experiments were carried out to compare the performance of the proposed method with several ensemble algorithms: AdaBoost (AB) [27], Bagging (BA) [5], Decision Forests (DF) [28,29] and the Random Subspace method (RS) [30]. The number of iterations performed in AdaBoost, the number of trees generated in the DF, and the ensemble size in Bagging and Random Subspace were all set to 25. The size of the feature subsets of the Random Subspace method was half the total number of features. For a fair comparison, the Co-NNE and the compared ensemble algorithms all used the hybrid output-combination method (HOC) for the output combination. Table 5 reports the average testing accuracies and standard deviations of the Co-NNE with HOC and the compared algorithms with HOC. The t-test statistics were computed. The highest accuracy in each row is outlined in bold and is further marked with '*' if it is statistically significant with a confidence level of 95%.

Table 5 shows that the testing accuracies produced by the proposed algorithm are higher than those of all other ensemble methods with the HOC for the output combination on 9 datasets.

Table 4. Average testing performances of the Co-NNE and three other algorithms.

Datasets Co-NNE RBFNN GA-RBFNN CO-RBFNN

Accu. Std. Accu. Std. Accu. Std. Accu. Std.

Breast 0.9672 0.0115 0.9429 0.0178 0.9554 0.0130 0.9694 0.0140

German 0.7537* 0.0221 0.6999 0.0254 0.7147 0.0196 0.7264 0.0180

Glass 0.7187 0.0492 0.6509 0.0677 0.6633 0.0602 0.7107 0.0378

Heart 0.8368 0.0389 0.8010 0.0353 0.8064 0.0432 0.8299 0.0500

Hepa 0.8350 0.0391 0.8102 0.0521 0.8342 0.0507 0.8239 0.0489

Horse 0.8401* 0.0269 0.7638 0.0355 0.7632 0.0282 0.8201 0.0298

Iono 0.9492 0.0294 0.8292 0.0548 0.8780 0.0398 0.9379 0.0227

Lymph 0.8234 0.0539 0.8132 0.0717 0.7703 0.0580 0.8527* 0.0515

New-thy 0.9564 0.0336 0.9158 0.0336 0.9467 0.0320 0.9212 0.0237

Pima 0.7686 0.0248 0.7455 0.0307 0.7503 0.0320 0.7698 0.0272

Pri-tumor 0.4208 0.0234 0.2867 0.0193 0.3894 0.0307 0.4273 0.0429

Segment 0.9527* 0.0180 0.8397 0.0460 0.8751 0.0232 0.9008 0.0253

Sonar 0.7857* 0.0419 0.7207 0.0707 0.7295 0.0503 0.7599 0.0356

Soybean 0.9273 0.0119 0.7273 0.0215 0.8933 0.0312 0.9265 0.0260

Vehicle 0.7381* 0.0276 0.6520 0.0365 0.6607 0.0294 0.7067 0.0329

Votes 0.9471 0.0243 0.9378 0.0266 0.9425 0.0197 0.9427 0.0260

Wave 0.8633* 0.0083 0.8118 0.0180 0.8384 0.0126 0.8232 0.0147

Wines 0.9659 0.0229 0.9469 0.0273 0.9371 0.0425 0.9689 0.0264

Yeast 0.6134* 0.0115 0.5805 0.0205 0.5749 0.0264 0.5943 0.0208

Zoo 0.9437 0.0462 0.9169 0.0544 0.8972 0.0527 0.9333 0.0459

Ave. 0.8304 0.0283 0.7696 0.0383 0.7910 0.0348 0.8173 0.0310

Rank 1.25 – 3.75 – 3.1 – 1.9 –

* Indicates that the highest accuracy is significantly better than the others with a confidence level of 95%.

Table 5. Average testing performances of the Co-NNE and other ensemble algorithms with the HOC.

Datasets Co-NNE AB(HOC) BA(HOC) DF(HOC) RS(HOC)

Accu. Std. Accu. Std. Accu. Std. Accu. Std. Accu. Std.

Breast 0.9672 0.0115 0.9607 0.0156 0.9659 0.0088 0.9637 0.0145 0.9567 0.0132

German 0.7537 0.0221 0.7424 0.0237 0.7567 0.0228 0.7485 0.0198 0.7403 0.0188

Glass 0.7187 0.0492 0.7013 0.0493 0.7113 0.0180 0.7167 0.0453 0.7130 0.0718

Heart 0.8368 0.0389 0.8167 0.0341 0.8446 0.0291 0.8153 0.0422 0.8411 0.0340

Hepa 0.8350 0.0391 0.8145 0.0555 0.8342 0.0408 0.8034 0.0561 0.8639* 0.0348

Horse 0.8401 0.0269 0.7965 0.0415 0.8426 0.0265 0.8419 0.0336 0.8533 0.0232

Iono 0.9492 0.0294 0.9356 0.0268 0.9125 0.0255 0.9383 0.0219 0.9295 0.0304

Lymph 0.8234 0.0539 0.8165 0.0436 0.8252 0.0630 0.8205 0.0656 0.8037 0.0664

New-thy 0.9564 0.0336 0.8964 0.0252 0.9609 0.0311 0.9453 0.0269 0.9534 0.0495

Pima 0.7686 0.0248 0.7526 0.0249 0.7601 0.0244 0.7482 0.0326 0.7623 0.0198

Pri-tumor 0.4208 0.0234 0.3818 0.0273 0.4348 0.0301 0.4356 0.0527 0.4327 0.0329

Segment 0.9527 0.0180 0.9343 0.0132 0.9498 0.0107 0.9677* 0.0060 0.9319 0.0175

Sonar 0.7857 0.0419 0.7452 0.0356 0.7686 0.0349 0.7993 0.0520 0.7626 0.0397

Soy 0.9273 0.0119 0.7996 0.0126 0.8853 0.0219 0.9198 0.0193 0.9128 0.0239

Vehicle 0.7381 0.0276 0.7121 0.0285 0.6841 0.0278 0.7329 0.0257 0.7210 0.0383

Votes 0.9471 0.0243 0.9550 0.0265 0.9505 0.0222 0.9601 0.0199 0.9531 0.0203

Wave 0.8633* 0.0083 0.8531 0.0369 0.8348 0.0092 0.8325 0.0098 0.8329 0.0084

Wines 0.9659 0.0229 0.9629 0.0266 0.9735 0.0239 0.9728 0.0262 0.9412 0.0325

Yeast 0.6134* 0.0115 0.5596 0.0050 0.5981 0.0216 0.6017 0.0199 0.5733 0.0255

Zoo 0.9437 0.0462 0.8569 0.0339 0.9278 0.0546 0.9297 0.0472 0.8625 0.0433

Ave. 0.8304 0.0283 0.7997 0.0293 0.8211 0.0274 0.8247 0.0319 0.8171 0.0322

Rank 2 – 4.15 – 2.7 – 2.7 – 3.45 –

* Indicates that the highest accuracy is significantly better than the others with a confidence level of 95%.


According to the t-test statistics, the proposed algorithm significantly outperforms the other four methods on 16, 6, 5 and 11 datasets, respectively. The highest accuracies achieved by the proposed method on Wave and Yeast are statistically significant with a confidence level of 95%. Additionally, the proposed algorithm obtains the highest average rank among the five methods.

We further compared the Co-NNE using the majority voting rule (VOT) for the output combination (Co-NNE(VOT)) with the ensemble algorithms in their standard versions. The standard AB adopts a weighted aggregation method to combine the component outputs, while the other compared ensemble methods use the VOT rule in their standard versions. Table 6 reports the average testing accuracies and standard deviations of the Co-NNE(VOT) and the compared algorithms in their standard versions. The highest accuracy in each row is outlined in bold, and it is further marked with '*' if the t-test statistics are significant with a confidence level of 95%.

As shown in Table 6, the Co-NNE(VOT) achieves the highest accuracies on 4 datasets, and the highest accuracy on Wave is statistically significant with a confidence level of 95%. When all methods are compared on the 20 datasets with regard to the average rank, the Co-NNE(VOT) gets 2.95. Considering both the average testing accuracies and the performance ranks, the Co-NNE(VOT) performs better only than the AB and the RS, and worse than the BA and the DF.


When we consider the results in Tables 5 and 6, the testing accuracy produced by an ensemble method with the HOC is usually higher than that of the same method with VOT (or in its standard version) on the same dataset. Thus, the HOC benefits the performance of the four compared ensemble methods. For instance, the BA(HOC) achieved higher testing accuracies than the standard BA on 13 datasets, and the RS(HOC) performed better than the standard RS on 15 datasets. Moreover, the HOC improves the performance of the BA and the RS significantly on 4 and 7 datasets, respectively. Among all methods with the HOC, the Co-NNE achieves the highest average accuracy and performs the best on 9 datasets (shown in Table 5), which indicates that the HOC makes a distinct contribution to the coevolutionary training of ensembles in the Co-NNE. Thus, the Co-NNE performs more competitively on most datasets compared with the other ensemble algorithms.

Table 6. Average testing performances of the Co-NNE(VOT) and the compared algorithms in their standard versions.

Datasets Co-NNE(VOT) AB BA DF RS
Accu. Std. Accu. Std. Accu. Std. Accu. Std. Accu. Std.
Breast 0.9600 0.0147 0.9597 0.0154 0.9652 0.0127 0.9625 0.0272 0.9648 0.0111
German 0.7427 0.0241 0.7293 0.0229 0.7444 0.0242 0.7381 0.0394 0.7276 0.0142
Glass 0.6347 0.0570 0.7002 0.0707 0.6995 0.0629 0.7154 0.0528 0.6940 0.0469
Heart 0.8211 0.0425 0.8055 0.0508 0.8441* 0.0332 0.8011 0.0540 0.8271 0.0284
Hepa 0.8368 0.0438 0.8279 0.0537 0.8588 0.0439 0.8106 0.0617 0.8608 0.0316
Horse 0.8376 0.0389 0.7913 0.0300 0.8623 0.0309 0.8681 0.0453 0.8581 0.0308
Iono 0.9189 0.0297 0.9332 0.0310 0.9143 0.0311 0.9306 0.0571 0.9162 0.0308
Lymph 0.8036 0.0589 0.8174 0.0756 0.8239 0.0439 0.8184 0.0721 0.8113 0.0398
New-thy 0.9442 0.0261 0.9589 0.0430 0.9676 0.0202 0.9510 0.0441 0.9572 0.0198
Pima 0.7556 0.0131 0.7432 0.0347 0.7583 0.0296 0.7418 0.0254 0.7534 0.0214
Pri-tumor 0.4394 0.0347 0.2921 0.0501 0.4271 0.0364 0.4217 0.0080 0.4268 0.0312
Segment 0.9094 0.0160 0.9224 0.0174 0.9613 0.0095 0.9738* 0.0265 0.9025 0.0136
Sonar 0.7692 0.0419 0.8246* 0.0314 0.7624 0.0542 0.7989 0.0345 0.7202 0.0486
Soy 0.8706 0.0467 0.2811 0.0446 0.8397 0.0247 0.9126* 0.0450 0.8873 0.0194
Vehicle 0.7384 0.0027 0.6785 0.0265 0.6623 0.0289 0.7283 0.0224 0.7229 0.0196
Votes 0.9502 0.0186 0.9473 0.0192 0.9476 0.0232 0.9555 0.0347 0.9479 0.0234
Wave 0.8621* 0.0067 0.6636 0.0088 0.8175 0.0120 0.8165 0.0072 0.8190 0.0129
Wines 0.9477 0.0322 0.9746 0.0250 0.9768 0.0181 0.9670 0.0256 0.9245 0.0563
Yeast 0.5993 0.0041 0.4057 0.0214 0.5958 0.0311 0.5889 0.0178 0.5625 0.0281
Zoo 0.9250 0.0562 0.9657* 0.0809 0.9330 0.0337 0.9189 0.0651 0.8289 0.1121
Ave. 0.8133 0.0304 0.7611 0.0376 0.8181 0.0302 0.8210 0.0383 0.8057 0.0320
Rank 2.95 – 3.5 – 2.35 – 2.8 – 3.4 –

* Indicates that the highest accuracy is significantly better than the others with a confidence level of 95%.

Table 7. Average bias of the Co-NNE and other ensemble algorithms with HOC.

Datasets Co-NNE AB(HOC) BA(HOC) DF(HOC) RS(HOC)
Bias Std. Bias Std. Bias Std. Bias Std. Bias Std.
Breast 0.0289* 0.0064 0.0380 0.0103 0.0329 0.0078 0.0493 0.0143 0.0349 0.0081
German 0.1861 0.0135 0.1921 0.0153 0.1915 0.0164 0.1927 0.0151 0.1887 0.0152
Glass 0.2638 0.0369 0.3602 0.0418 0.3044 0.0346 0.2513 0.0299 0.3055 0.0426
Heart 0.1464 0.0215 0.1452 0.0372 0.1313 0.0252 0.1580 0.0259 0.1210* 0.0277
Hepa 0.1309 0.0265 0.1503 0.0300 0.1343 0.0342 0.1136 0.0154 0.1011 0.0306
Horse 0.1545 0.0133 0.1590 0.0200 0.1313 0.0068 0.1546 0.0188 0.1343 0.0235
Iono 0.0686* 0.0180 0.0876 0.0213 0.0971 0.0158 0.0922 0.0139 0.0873 0.0210
Lymph 0.1728 0.0300 0.1533 0.0368 0.1328 0.0403 0.1351 0.0320 0.1436 0.0318
New-thy 0.0362* 0.0299 0.0555 0.0483 0.0523 0.0243 0.0645 0.0171 0.0533 0.0229
Pima 0.1855 0.0151 0.1964 0.0249 0.1818 0.0175 0.1722 0.0160 0.1652 0.0173
Pri-tumor 0.4245 0.0246 0.3996 0.0121 0.3990 0.0235 0.3976 0.0341 0.4231 0.0275
Segment 0.0595* 0.0228 0.0943 0.0085 0.1282 0.0151 0.0756 0.0318 0.1073 0.0063
Sonar 0.1990 0.0225 0.1819 0.0225 0.1616 0.0236 0.1578 0.0231 0.1521 0.0405
Soy 0.1704 0.0102 0.1896 0.0280 0.0925* 0.0150 0.1046 0.0184 0.1068 0.0260
Vehicle 0.1701 0.0264 0.2209 0.0220 0.1907 0.0241 0.1792 0.0126 0.1912 0.0144
Votes 0.0382 0.0102 0.0371 0.0123 0.0320 0.0135 0.0459 0.0265 0.0395 0.0127
Wave 0.1004* 0.0037 0.1163 0.0073 0.1098 0.0071 0.1282 0.0190 0.1193 0.0218
Wines 0.0307* 0.0136 0.0421 0.0162 0.0391 0.0151 0.0474 0.0158 0.0410 0.0225
Yeast 0.3082* 0.0183 0.3159 0.0111 0.3215 0.0182 0.3255 0.0399 0.3698 0.0182
Zoo 0.0535* 0.0195 0.1389 0.0512 0.1511 0.0278 0.0968 0.0165 0.0729 0.0279
Ave. 0.1464 0.0191 0.1637 0.0239 0.1508 0.0203 0.1471 0.0218 0.1479 0.0229

* Indicates that the smallest bias is significantly less than the others with a confidence level of 95%.

contribution to the coevolutionary training of ensembles in theCo-NNE. Thus, the Co-NNE performs more competitively on mostdatasets compared with other ensemble algorithms.

4.2. Experiment 2

The bias-variance decomposition is often used for studyingthe performance of ensemble methods. Originally it was pro-posed for regression, but there are several variants for classi-fication [31]. In this section, we will investigate how the proposedmethod behaves in a bias/variance decomposition [32] test.

ersions.

DF RS

. Std. Accu. Std. Accu. Std.

52 0.0127 0.9625 0.0272 0.9648 0.0111

44 0.0242 0.7381 0.0394 0.7276 0.0142

95 0.0629 0.7154 0.0528 0.6940 0.0469

41n 0.0332 0.8011 0.0540 0.8271 0.0284

88 0.0439 0.8106 0.0617 0.8608 0.0316

23 0.0309 0.8681 0.0453 0.8581 0.0308

43 0.0311 0.9306 0.0571 0.9162 0.0308

39 0.0439 0.8184 0.0721 0.8113 0.0398

76 0.0202 0.9510 0.0441 0.9572 0.0198

83 0.0296 0.7418 0.0254 0.7534 0.0214

71 0.0364 0.4217 0.0080 0.4268 0.0312

13 0.0095 0.9738n 0.0265 0.9025 0.0136

24 0.0542 0.7989 0.0345 0.7202 0.0486

97 0.0247 0.9126n 0.0450 0.8873 0.0194

23 0.0289 0.7283 0.0224 0.7229 0.0196

76 0.0232 0.9555 0.0347 0.9479 0.0234

75 0.0120 0.8165 0.0072 0.8190 0.0129

68 0.0181 0.9670 0.0256 0.9245 0.0563

58 0.0311 0.5889 0.0178 0.5625 0.0281

30 0.0337 0.9189 0.0651 0.8289 0.1121

81 0.0302 0.8210 0.0383 0.8057 0.0320

– 2.8 – 3.4 –

dence level of 95%.

OC) DF(HOC) RS(HOC)

Std. Bias Std. Bias Std.

9 0.0078 0.0493 0.0143 0.0349 0.0081

5 0.0164 0.1927 0.0151 0.1887 0.0152

4 0.0346 0.2513 0.0299 0.3055 0.0426

3 0.0252 0.1580 0.0259 0.1210n 0.0277

3 0.0342 0.1136 0.0154 0.1011 0.0306

3 0.0068 0.1546 0.0188 0.1343 0.0235

1 0.0158 0.0922 0.0139 0.0873 0.0210

8 0.0403 0.1351 0.0320 0.1436 0.0318

3 0.0243 0.0645 0.0171 0.0533 0.0229

8 0.0175 0.1722 0.0160 0.1652 0.0173

0 0.0235 0.3976 0.0341 0.4231 0.0275

2 0.0151 0.0756 0.0318 0.1073 0.0063

6 0.0236 0.1578 0.0231 0.1521 0.0405

5n 0.0150 0.1046 0.0184 0.1068 0.0260

7 0.0241 0.1792 0.0126 0.1912 0.0144

0 0.0135 0.0459 0.0265 0.0395 0.0127

8 0.0071 0.1282 0.0190 0.1193 0.0218

1 0.0151 0.0474 0.0158 0.0410 0.0225

5 0.0182 0.3255 0.0399 0.3698 0.0182

1 0.0278 0.0968 0.0165 0.0729 0.0279

8 0.0203 0.1471 0.0218 0.1479 0.0229

level of 95%.

J. Tian et al. / Pattern Recognition 45 (2012) 1373–1385 1381

Table 8
Average variance of Co-NNE and other ensemble algorithms with HOC.

Datasets Co-NNE AB(HOC) BA(HOC) DF(HOC) RS(HOC)

Var. Std. Var. Std. Var. Std. Var. Std. Var. Std.

Breast 0.0178 0.0046 0.0324 0.0035 0.0076n 0.0023 0.0228 0.0198 0.0318 0.0063
German 0.0938 0.0084 0.1074 0.0069 0.0673n 0.0062 0.0962 0.0120 0.0863 0.0063
Glass 0.1565n 0.0182 0.2725 0.0203 0.1683 0.0149 0.1693 0.0249 0.1697 0.0187
Heart 0.0805n 0.0095 0.0896 0.0077 0.0885 0.0093 0.1099 0.0127 0.1090 0.0100
Hepa 0.0674 0.0131 0.0841 0.0124 0.0625 0.0134 0.0720 0.0264 0.0639 0.0134
Horse 0.1032 0.0137 0.1192 0.0134 0.0851n 0.0044 0.0906 0.0134 0.1090 0.0070
Iono 0.0746 0.0163 0.0955 0.0151 0.0428n 0.0072 0.0620 0.0157 0.0529 0.0071
Lymph 0.1265n 0.0165 0.1421 0.0203 0.1389 0.0084 0.1419 0.0141 0.1427 0.0136
New-thy 0.0179n 0.0097 0.0322 0.0053 0.0285 0.0070 0.0359 0.0225 0.0460 0.0071
Pima 0.0881n 0.0109 0.0951 0.0095 0.0937 0.0069 0.0974 0.0152 0.0980 0.0067
Pri-tumor 0.2611 0.0168 0.2931 0.0111 0.2036 0.0253 0.2679 0.0127 0.1875n 0.0268
Segment 0.0489 0.0182 0.1199 0.0075 0.0460 0.0218 0.0328 0.0070 0.1502 0.0322
Sonar 0.1592 0.0136 0.1278 0.0183 0.1390 0.0113 0.1362 0.0352 0.1266 0.0112
Soy 0.2992 0.0173 0.2878 0.0357 0.0957n 0.0256 0.1972 0.0067 0.1198 0.0276
Vehicle 0.1401 0.0076 0.1270 0.0122 0.1185n 0.0086 0.1402 0.0064 0.1377 0.0168
Votes 0.0288 0.0075 0.0363 0.0045 0.0188n 0.0051 0.0252 0.0097 0.0352 0.0071
Wave 0.0849 0.0084 0.0541 0.0062 0.0357n 0.0025 0.0972 0.0086 0.0693 0.0039
Wines 0.0321n 0.0169 0.0682 0.0164 0.0460 0.0105 0.0496 0.0123 0.0923 0.0117
Yeast 0.1523 0.0071 0.1795 0.0094 0.1112n 0.0104 0.1634 0.0182 0.1456 0.0162
Zoo 0.0554 0.0127 0.1588 0.0156 0.0061n 0.0190 0.0957 0.0048 0.0902 0.0172
Ave. 0.1044 0.0123 0.1261 0.0126 0.0802 0.0110 0.1052 0.0149 0.1032 0.0133

n Shows that the smallest variance is significantly less than others with a confidence level of 95%.

Table 9
Average bias of Co-NNE(VOT) and other ensemble algorithms.

Datasets Co-NNE (VOT) AB BA DF RS

Bias Std. Bias Std. Bias Std. Bias Std. Bias Std.

Breast 0.0291 0.0121 0.0262 0.0103 0.0332 0.0103 0.0580 0.0200 0.0431 0.0085
German 0.1815n 0.0157 0.1917 0.0110 0.1956 0.0121 0.1922 0.0167 0.1975 0.0150
Glass 0.2448 0.0368 0.5131 0.0349 0.2503 0.0344 0.2294 0.0214 0.2376 0.0346
Heart 0.1407 0.0290 0.1232 0.0282 0.1256 0.0281 0.1610 0.0111 0.1310 0.0262
Hepa 0.1297 0.0339 0.0861 0.0369 0.0972 0.0308 0.1002 0.0157 0.1160 0.0283
Horse 0.1550 0.0149 0.1409 0.0265 0.1713 0.0235 0.1609 0.0101 0.1407 0.0265
Iono 0.0803 0.0218 0.0901 0.0348 0.1100 0.0349 0.0921 0.0217 0.1039 0.0193
Lymph 0.1715 0.0444 0.1360 0.0406 0.1366 0.0217 0.1335 0.0177 0.1471 0.0297
New-thy 0.0399n 0.0186 0.0481 0.0170 0.0579 0.0170 0.0716 0.0225 0.0665 0.0224
Pima 0.1771 0.0218 0.1447n 0.0182 0.1545 0.0207 0.1896 0.0169 0.1672 0.0177
Pri-tumor 0.4240 0.0198 0.4958 0.0415 0.4055 0.0308 0.4276 0.0101 0.4516 0.0278
Segment 0.0637 0.0059 0.3563 0.0047 0.0417 0.0229 0.0506 0.0081 0.0467 0.0224
Sonar 0.2008 0.0249 0.1453 0.0353 0.1655 0.0279 0.1478 0.0373 0.1604 0.0376
Soy 0.1783 0.0276 0.5141 0.0245 0.1863 0.0336 0.1071n 0.0166 0.1254 0.0250
Vehicle 0.1602n 0.0166 0.3491 0.0144 0.1765 0.0158 0.1771 0.0068 0.1755 0.0149
Votes 0.0381 0.0117 0.0149n 0.0110 0.0286 0.0154 0.0465 0.0158 0.0360 0.0087
Wave 0.0997n 0.0026 0.1534 0.0041 0.1394 0.0127 0.1319 0.0353 0.1253 0.0200
Wines 0.0359 0.0207 0.0478 0.0192 0.0403 0.0178 0.0503 0.0075 0.0486 0.0209
Yeast 0.3241 0.0187 0.5536 0.0179 0.2874 0.0216 0.2930 0.0426 0.2977 0.0280
Zoo 0.0627n 0.0308 0.3661 0.0374 0.5554 0.0345 0.1260 0.0127 0.1726 0.0345
Ave. 0.1469 0.0214 0.2248 0.0234 0.1679 0.0233 0.1473 0.0183 0.1495 0.0234

n Indicates that the smallest bias is significantly less than others with a confidence level of 95%.

Here we adopt the bias-variance decomposition proposed by Kohavi and Wolpert [33]. The bias measures the squared difference between the target's average output and the algorithm's average output. The contribution of bias to error is the portion of the total error that is made by the central tendency of the algorithm. Given a distribution over training sets, the variance measures the sensitivity of the learning algorithm to changes in the training set, and it is independent of the underlying target. As the algorithm becomes more sensitive to changes in the training set, the variance gets bigger [33]. The contribution of variance is the portion of the total error that is due to deviations from the central tendency [17]. Let $Y_H$ be the random variable representing the label of a sample in the hypothesis space, and $Y_F$ be the random variable representing the label of a sample in the target. Then the bias and the variance are computed, respectively, by

$$\mathrm{bias}^2_x = \frac{1}{2}\sum_{y \in Y}\left[P(Y_F = y \mid x) - P(Y_H = y \mid x)\right]^2 \qquad (9)$$

$$\mathrm{variance}_x = \frac{1}{2}\left(1 - \sum_{y \in Y} P(Y_H = y \mid x)^2\right) \qquad (10)$$
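Equations (9) and (10) can be estimated empirically from the predictions of models trained on repeatedly resampled training sets. The sketch below is not the authors' experimental code; the function name, the one-hot treatment of a noise-free target, and the final averaging over the test samples (to obtain dataset-level figures such as those in Tables 7-10) are our own assumptions.

```python
import numpy as np

def kohavi_wolpert_bias_variance(pred_labels, true_labels, n_classes):
    """Estimate the average bias^2 and variance of Eqs. (9) and (10).

    pred_labels : (n_runs, n_samples) array of labels predicted by models
                  trained on different randomly drawn training sets.
    true_labels : (n_samples,) array of target labels; the target is treated
                  as noise-free, so P(Y_F = y | x) is one-hot (an assumption).
    """
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    n_runs, n_samples = pred_labels.shape

    # Empirical P(Y_H = y | x): fraction of runs that predict each class.
    p_h = np.zeros((n_samples, n_classes))
    for run in range(n_runs):
        p_h[np.arange(n_samples), pred_labels[run]] += 1.0
    p_h /= n_runs

    # P(Y_F = y | x): one-hot distribution on the observed target label.
    p_f = np.zeros((n_samples, n_classes))
    p_f[np.arange(n_samples), true_labels] = 1.0

    bias2 = 0.5 * np.sum((p_f - p_h) ** 2, axis=1)       # Eq. (9), per sample
    variance = 0.5 * (1.0 - np.sum(p_h ** 2, axis=1))    # Eq. (10), per sample
    return bias2.mean(), variance.mean()

# Toy check: three runs, two test samples, two classes.
preds = np.array([[0, 1], [0, 0], [1, 1]])
print(kohavi_wolpert_bias_variance(preds, np.array([0, 1]), n_classes=2))
```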

Tables 7 and 8 report the average bias and the average variance of the Co-NNE with HOC and other ensemble methods with HOC. The smallest bias or variance in each row is shown in bold and is further marked with "n" if the t-test statistic is significant at a confidence level of 95%.



Table 10
Average variance of Co-NNE (with VOT) and other ensemble algorithms.

Datasets Co-NNE (VOT) AB BA DF RS

Var. Std. Var. Std. Var. Std. Var. Std. Var. Std.

Breast 0.0182 0.0060 0.0171 0.0042 0.0185 0.0060 0.0310 0.0194 0.0295 0.0055

German 0.0946 0.0106 0.1125 0.0061 0.1004 0.0101 0.1101 0.0147 0.0802n 0.0077

Glass 0.1590 0.0165 0.1393n 0.0315 0.2264 0.0237 0.1692 0.0209 0.1783 0.0174

Heart 0.0830n 0.0123 0.0920 0.0087 0.1075 0.0124 0.1129 0.0146 0.1404 0.0204

Hepa 0.0679 0.0147 0.0903 0.0138 0.0440 0.0229 0.0718 0.0225 0.0530 0.0224

Horse 0.1095 0.0200 0.1429 0.0117 0.0898 0.0204 0.0601 0.0169 0.0604 0.0200

Iono 0.0763 0.0180 0.0939 0.0119 0.0995 0.0165 0.0561 0.0198 0.0622 0.0152

Lymph 0.1269 0.0206 0.1641 0.0215 0.1427 0.0177 0.1233 0.0146 0.1190 0.0223

New-thy 0.0196n 0.0063 0.0419 0.0062 0.0576 0.0108 0.0509 0.0216 0.0812 0.0108

Pima 0.0913 0.0071 0.0915 0.0084 0.0852n 0.0114 0.1120 0.0169 0.1100 0.0106

Pri-tumor 0.2695 0.0250 0.2868 0.0247 0.2852 0.0189 0.2714 0.0157 0.2143n 0.0261

Segment 0.0480 0.0017 0.3827 0.0094 0.0482 0.0097 0.0323n 0.0055 0.0387 0.0182

Sonar 0.1612 0.0167 0.2064 0.0172 0.1565 0.0203 0.1464 0.0337 0.1423 0.0336

Soy 0.2873 0.0170 0.3022 0.0462 0.2218 0.0071 0.1270 0.0127 0.1272 0.0497

Vehicle 0.1518 0.0056 0.2294 0.0106 0.1265n 0.0087 0.1453 0.0075 0.1552 0.0078

Votes 0.0309 0.0082 0.0216 0.0047 0.0042n 0.0031 0.0313 0.0083 0.0221 0.0083

Wave 0.0829 0.0056 0.1738 0.0013 0.1583 0.0075 0.1041 0.0277 0.0950 0.0152

Wines 0.0403n 0.0225 0.0736 0.0114 0.0798 0.0134 0.0609 0.0144 0.0951 0.0146

Yeast 0.1562 0.0131 0.0609n 0.0068 0.1337 0.0169 0.1866 0.0277 0.1878 0.0265

Zoo 0.0556 0.0245 0.0892 0.0123 0.0047n 0.0017 0.1187 0.0087 0.1647 0.0225

Ave. 0.1065 0.0136 0.1406 0.0134 0.1095 0.0130 0.1061 0.0172 0.1078 0.0187

n Indicates that the smallest variance is significantly less than others with a confidence level of 95%.

Table 11
Testing accuracies of the Co-NNE with different output-combination methods.

Datasets Co-NNE Co-NNE(WTA) Co-NNE(VOT)

Accu. Std. Accu. Std. Accu. Std.

Breast 0.9672n 0.0115 0.9559 0.0166 0.9600 0.0147

German 0.7537 0.0221 0.7485 0.0196 0.7427 0.0241

Glass 0.7187n 0.0492 0.6227 0.0623 0.6347 0.0570

Heart 0.8368 0.0389 0.8172 0.0544 0.8211 0.0425

Hepa 0.8350 0.0391 0.8310 0.0483 0.8368 0.0438

Horse 0.8401 0.0269 0.8278 0.0306 0.8376 0.0389

Iono 0.9492n 0.0294 0.9196 0.0374 0.9189 0.0297

Lymph 0.8234 0.0539 0.7865 0.0440 0.8036 0.0589

New-thy 0.9564 0.0336 0.8867 0.0363 0.9442 0.0261

Pima 0.7686 0.0248 0.7497 0.0236 0.7556 0.0131

Pri-tumor 0.4208 0.0234 0.4053 0.0174 0.4394 0.0347

Segment 0.9527n 0.0180 0.9137 0.0122 0.9094 0.0160

Sonar 0.7857 0.0419 0.7853 0.0493 0.7692 0.0419

Soy 0.9273n 0.0119 0.8706 0.0102 0.8706 0.0467

Vehicle 0.7381 0.0276 0.7013 0.0551 0.7384 0.0027

Votes 0.9471 0.0243 0.9462 0.0208 0.9502 0.0186

Wave 0.8633 0.0083 0.8667 0.0117 0.8621 0.0067

Wines 0.9659 0.0229 0.9576 0.0352 0.9477 0.0322

Yeast 0.6134n 0.0115 0.5885 0.0122 0.5993 0.0041

Zoo 0.9437 0.0462 0.9083 0.0506 0.9250 0.0562

Ave. 0.8304 0.0283 0.8045 0.0324 0.8133 0.0304

n Indicates that the highest accuracy is significantly better than others with a confidence level of 95%.

Table 12
Testing accuracies using 5, 10, 15, 20, 25 and 30 subpopulations of networks.

Datasets Ensemble size

5 10 15 20 25 30

Breast 0.9601 0.9626 0.9637 0.9664 0.9672 0.9693

German 0.7394 0.7513 0.7551 0.7551 0.7537 0.7618

Glass 0.6787 0.7067 0.6647 0.7207 0.7187 0.6907

Heart 0.8206 0.8157 0.8270 0.8324 0.8368 0.8162

Hepa 0.8222 0.8162 0.8205 0.8196 0.8350 0.8179

Horse 0.8029 0.8059 0.8142 0.8269 0.8401 0.8464

Iono 0.9496 0.9511 0.9591 0.9553 0.9492 0.9447

Lymph 0.7486 0.7667 0.7856 0.7892 0.7919 0.7748

New_thy 0.9595 0.9589 0.9516 0.9528 0.9564 0.9625

Pima 0.7766 0.7726 0.7712 0.7763 0.7686 0.7796

Pri_tumor 0.3894 0.3924 0.3909 0.4159 0.4208 0.4004

Segment 0.9210 0.9228 0.9433 0.9416 0.9527 0.9516

Sonar 0.7739 0.7729 0.7857 0.8017 0.7857 0.7921

Soy 0.9260 0.9240 0.9260 0.9267 0.9273 0.9267

Vehicle 0.7031 0.7088 0.7353 0.7350 0.7381 0.7397

Votes 0.9437 0.9459 0.9407 0.9431 0.9471 0.9459

Wave 0.8355 0.8408 0.8664 0.8647 0.8633 0.8659

Wines 0.9614 0.9614 0.9576 0.9682 0.9659 0.9576

Yeast 0.6002 0.6029 0.6101 0.6092 0.6134 0.6047

Zoo 0.9301 0.9383 0.9453 0.9453 0.9439 0.9480

Ave. 0.8121 0.8159 0.8207 0.8273 0.8288 0.8248


Table 7 shows that the Co-NNE achieves the smallest bias on average. It significantly outperforms the other methods with HOC on 8 datasets. Table 8 illustrates that the Co-NNE is advantageous over the other ensemble algorithms with respect to the variance on 6 datasets. Although the BA(HOC) and the RS(HOC) achieve smaller variances on average, their biases are a little larger than that of the Co-NNE.

Tables 9 and 10 report the average bias and the average variance of the Co-NNE with VOT and other ensemble methods in their standard versions. The smallest bias or variance in each row is shown in bold and is further marked with "n" if the t-test statistic is significant at a confidence level of 95%.

As shown in Tables 9 and 10, the Co-NNE(VOT) yields the smallest bias on average, but ranks second regarding the average variance. The Co-NNE(VOT) performs significantly better than the other compared methods on 5 and 3 datasets, respectively, with respect to the bias and the variance.

Fig. 1. Testing accuracy of the Co-NNE based on different ensemble sizes (one panel per dataset; x axis: ensemble size, 5-30; y axis: testing accuracy).

By comparing the results in the four tables above, the HOC improves the performance of the compared ensemble methods on both the bias and the variance. For instance, the average bias produced by the BA with HOC is almost 10% smaller than that of the BA with VOT. The average variance of the BA with HOC drops to 0.0802, which is 26.76% less than that of the BA with VOT. Furthermore, the Co-NNE with either HOC or VOT performs best regarding the average bias among all methods. The Co-NNE also achieves smaller variances, which means that the Co-NNE ensures a more robust learning process on different training sets randomly drawn from a dataset. Moreover, the Co-NNE with HOC outperforms that with VOT on half of the datasets with respect to the bias and on 85% of the datasets with respect to the variance. Therefore, the Co-NNE is able to achieve smaller bias and variance than the other ensemble algorithms on average.
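As a quick consistency check, the 26.76% reduction quoted above can be reproduced from the average variances reported for the BA in Tables 10 and 8 (0.1095 with VOT and 0.0802 with HOC):

$$\frac{0.1095 - 0.0802}{0.1095} = \frac{0.0293}{0.1095} \approx 0.2676 = 26.76\%$$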

4.3. Experiment 3

This experiment was designed to assess the effect of the HOC in the Co-NNE. The Co-NNE was executed with three output-combination methods: VOT, WTA, and HOC. The testing accuracies and standard deviations of the three output-combination methods are listed in Table 11. The highest accuracy in each row is shown in bold and is marked with "n" if the t-test statistic is significant at a confidence level of 95%.

Table 11 illustrates that the Co-NNE with HOC achieves the highest accuracy on 15 datasets among the three methods. The testing accuracies of the Co-NNE with HOC are significantly better than those of the other two on 6 datasets. The Co-NNE with WTA or VOT performs best on only one or four datasets, respectively. Additionally, the Co-NNE with HOC achieves the smallest average standard deviation among the three methods. Thus, the HOC is better suited to output combination in the Co-NNE than the other two methods. Because the HOC helps the Co-CEA select more cooperative components, the Co-NNE is able to build an ensemble model with better classification performance.
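For reference, the two baseline combiners can be written down directly. The sketch below is only an illustration under our own naming and data-layout assumptions (each component network is assumed to output a vector of class scores); it covers VOT and WTA only, since the HOC is the hybrid scheme defined earlier in the paper and is not reproduced here.

```python
import numpy as np

def combine_vot(component_probs):
    """Majority voting (VOT): each component casts one vote for its top class."""
    votes = np.argmax(component_probs, axis=1)                     # (n_components,)
    return int(np.bincount(votes, minlength=component_probs.shape[1]).argmax())

def combine_wta(component_probs):
    """Winner-takes-all (WTA): the single most confident component decides."""
    _, col = np.unravel_index(np.argmax(component_probs), component_probs.shape)
    return int(col)                                                # class of the largest output

# Rows are component networks, columns are class scores for one test sample.
component_probs = np.array([[0.40, 0.35, 0.25],
                            [0.45, 0.30, 0.25],
                            [0.05, 0.05, 0.90]])
print(combine_vot(component_probs), combine_wta(component_probs))  # -> 0 2
```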

4.4. Experiment 4

In this section, we study the effect of the ensemble size on the 20 datasets. The experiment is carried out with 5, 10, 15, 20, 25, and 30 subpopulations of networks. For each size, we performed 30 runs of the algorithm. The average testing accuracies are shown in Table 12.

Table 12 illustrates that adding new networks to the ensemble increases the testing accuracy of the Co-NNE model on some datasets, such as Breast, German, Horse, and Zoo. Fig. 1 gives the histograms for the 20 datasets with different ensemble sizes; the x axis refers to the ensemble size and the y axis is the testing accuracy. The t-test statistics were computed to verify whether there are significant differences among the models with different ensemble sizes. At a confidence level of 95%, there are significant differences only on the German and Horse datasets. This indicates that the Co-NNE can produce equally good ensemble models consisting of various numbers of component networks.
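The 95%-confidence comparisons reported throughout this section can be carried out with a t-test over the per-run testing accuracies. A minimal sketch follows; it assumes the runs for the two ensemble sizes are matched (paired), which the paper does not state, so scipy.stats.ttest_ind would be used instead for independent runs, and the accuracy arrays shown are illustrative placeholders, not the reported results.

```python
import numpy as np
from scipy import stats

# Per-run testing accuracies for two ensemble sizes on one dataset
# (placeholder values; in the experiments each array would hold the
# accuracies of the 30 runs).
acc_size_5 = np.array([0.733, 0.741, 0.736, 0.745, 0.738, 0.729])
acc_size_25 = np.array([0.751, 0.756, 0.749, 0.758, 0.752, 0.747])

t_stat, p_value = stats.ttest_rel(acc_size_5, acc_size_25)
print(t_stat, p_value, p_value < 0.05)  # significant at the 95% confidence level?
```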

5. Conclusion

This paper has presented a cooperative coevolutionary approach to NNE design. Each component network in the ensemble is evolved in its own subpopulation, and the coevolutionary mechanism evolves the component networks concurrently. An individual in a certain subpopulation corresponds to a candidate for that component network and receives fitness based on how well it performs with the other component networks in the ensemble. Additionally, we designed a hybrid output-combination method to determine the final ensemble output. The performance of the Co-NNE was thoroughly investigated on 20 datasets. Experimental results illustrated that the proposed method was able to obtain NNE models with better classification accuracy on complicated classification tasks compared with other learning algorithms. Moreover, the results showed that the hybrid output-combination method contributed to this advantage, and that it was also able to raise the performance of other ensemble models that provide continuous multiple outputs.
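To make the fitness-assignment idea concrete, the following schematic is our own illustration rather than the authors' implementation: a candidate from subpopulation k is assembled with the current representatives of the other subpopulations and credited with the resulting ensemble accuracy on a validation set. The real method additionally folds in the accuracy-diversity Pareto measure, which is omitted here, and the stand-in classifiers merely make the snippet runnable.

```python
import numpy as np

def ensemble_predict(networks, X):
    """Combine component networks (each exposing .predict) by majority vote."""
    votes = np.array([net.predict(X) for net in networks])   # (n_components, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

def cooperative_fitness(candidate, k, representatives, X_val, y_val):
    """Fitness of `candidate` from subpopulation k: accuracy of the ensemble
    formed by the candidate and the representatives of the other subpopulations."""
    team = list(representatives)
    team[k] = candidate                                       # swap the candidate in
    return float(np.mean(ensemble_predict(team, X_val) == y_val))

# Minimal demo with stand-in classifiers in place of trained component networks.
class ConstantClassifier:
    def __init__(self, label):
        self.label = label
    def predict(self, X):
        return np.full(len(X), self.label, dtype=int)

reps = [ConstantClassifier(0), ConstantClassifier(1), ConstantClassifier(1)]
X_val, y_val = np.zeros((4, 2)), np.array([1, 1, 0, 1])
print(cooperative_fitness(ConstantClassifier(1), 0, reps, X_val, y_val))  # -> 0.75
```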

The cooperative coevolution of multiple subpopulations provides a good scheme for constructing the NNE model for complex classification problems. There are many feasible extensions to the work in this paper. The first is the introduction of new fitness measures for evaluating individuals in the coevolutionary algorithm so as to enhance its explorative capability. The second is the incorporation of feature selection techniques to generate more compact and diverse components with the potential to form better ensembles.

Acknowledgments

The work was supported by the National Science Fund for Distinguished Young Scholars of China (Grant no. 70925005) and the General Program of the National Science Foundation of China (Grant no. 71001076). This work was also supported by grants from the Research Fund for the Doctoral Program of Higher Education of China (nos. 20090032120073 and 20090032110065). It was also supported by the Program for Changjiang Scholars and Innovative Research Teams in Universities of China (PCSIRT). The authors are very grateful to all anonymous reviewers whose invaluable comments and suggestions substantially helped improve the quality of the paper.

References

[1] L.K. Hansen, P. Salamon, Neural network ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (10) (1990) 993-1001.

[2] A. Krogh, J. Vedelsby, Neural network ensembles, cross validation, and active learning, in: G. Tesauro, D.S. Touretzky, T.K. Leen (Eds.), Advances in Neural Information Processing Systems, vol. 7, MIT Press, Cambridge, MA, 1995, pp. 231-238.

[3] P.W. Munro, B. Parmanto, Competition among networks improves committee performance, in: M.C. Mozer, M.I. Jordan, T. Petsche (Eds.), Advances in Neural Information Processing Systems, vol. 9, MIT Press, Cambridge, MA, 1997, pp. 592-598.

[4] R. Schapire, The strength of weak learnability, Machine Learning 5 (2) (1990) 197-227.

[5] L. Breiman, Bagging predictors, Machine Learning 24 (2) (1996) 123-140.

[6] G. Folino, C. Pizzuti, G. Spezzano, Ensemble techniques for parallel genetic programming based classifiers, in: C. Ryan, T. Soule, M. Keijzer, et al. (Eds.), Proceedings of the European Conference on Genetic Programming (EuroGP'03), Lecture Notes in Computer Science 2610, Springer, 2003, pp. 59-69.

[7] D. Song, M.I. Heywood, A.N. Zincir-Heywood, Training genetic programming on half a million patterns: an example from anomaly detection, IEEE Transactions on Evolutionary Computation 9 (3) (2005) 225-239.

[8] M.A. Potter, K.A. De Jong, Cooperative coevolution: an architecture for evolving coadapted subcomponents, Evolutionary Computation 8 (1) (2000) 1-29.

[9] G. Martínez-Muñoz, A. Suárez, Switching class labels to generate classification ensembles, Pattern Recognition 38 (2005) 1483-1494.

[10] P. Hore, L.O. Hall, D.B. Goldgof, A scalable framework for cluster ensembles, Pattern Recognition 42 (2009) 676-688.

[11] X. Yao, Y. Liu, Making use of population information in evolutionary artificial neural networks, IEEE Transactions on Systems, Man and Cybernetics, Part B 28 (3) (1998) 417-425.

[12] Y. Liu, X. Yao, T. Higuchi, Evolutionary ensembles with negative correlation learning, IEEE Transactions on Evolutionary Computation 4 (4) (2000) 380-387.

[13] A. Chandra, X. Yao, Ensemble learning using multi-objective evolutionary algorithms, Journal of Mathematical Modelling and Algorithms 5 (4) (2006) 417-425.

[14] D. Ortiz-Boyer, C. Hervás-Martínez, N. García-Pedrajas, CIXL2: a crossover operator for evolutionary algorithms based on population features, Journal of Artificial Intelligence Research 24 (2005) 33-80.

[15] X. Yao, M.M. Islam, Evolving artificial neural network ensembles, IEEE Computational Intelligence Magazine 3 (1) (2008) 31-42.

[16] M.M. Islam, X. Yao, K. Murase, A constructive algorithm for training cooperative neural network ensembles, IEEE Transactions on Neural Networks 14 (2003) 820-834.

[17] N. García-Pedrajas, C. Hervás-Martínez, D. Ortiz-Boyer, Cooperative coevolution of artificial neural network ensembles for pattern classification, IEEE Transactions on Evolutionary Computation 9 (3) (2005) 271-302.

[18] M.Q. Li, J. Tian, F.Z. Chen, Improving multiclass pattern recognition with a co-evolutionary RBFNN, Pattern Recognition Letters 29 (4) (2008) 392-406.

[19] M.R. Berthold, J. Diamond, Boosting the performance of RBF networks with dynamic decay adjustment, in: G. Tesauro, D.S. Touretzky, T.K. Leen (Eds.), Advances in Neural Information Processing Systems, vol. 7, MIT Press, Denver, Colorado, 1995, pp. 512-528.

[20] D. Casasent, X.W. Chen, Radial basis function neural networks for nonlinear Fisher discrimination and Neyman-Pearson classification, Neural Networks 16 (2003) 529-535.

[21] E. Bauer, R. Kohavi, An empirical comparison of voting classification algorithms: bagging, boosting, and variants, Machine Learning 36 (1998) 105-142.

[22] W.X. Zhao, L.D. Wu, RBFN structure determination strategy based on PLS and GAs, Journal of Software 13 (2002) 1450-1455.

[23] E. Zitzler, L. Thiele, Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach, IEEE Transactions on Evolutionary Computation 3 (4) (1999) 257-271.


[24] S.G. Ficici, J.B. Pollack, Pareto optimality in coevolutionary learning, in: J. Kelemen, P. Sosik (Eds.), Proceedings of the European Conference on Artificial Life, Springer, Berlin, 2001, pp. 316-325.

[25] K. Deb, A. Pratap, S. Agarwal, et al., A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Transactions on Evolutionary Computation 6 (2) (2002) 182-197.

[26] J. Tian, M.Q. Li, F.Z. Chen, GA-RBFNN learning algorithm for complex classifications, Journal of Systems Engineering 21 (2006) 163-170.

[27] Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in: Proceedings of the Thirteenth International Conference on Machine Learning, San Francisco, 1996, pp. 148-156.

[28] T.K. Ho, Random decision forests, in: Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, Canada, 1995, pp. 278-282.

[29] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5-32.

[30] T.K. Ho, The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8) (1998) 832-844.

[31] P. Domingos, A unified bias-variance decomposition and its applications, in: Proceedings of the 17th International Conference on Machine Learning, 2000, pp. 231-238.

[32] S. Geman, E. Bienenstock, R. Doursat, Neural networks and the bias/variance dilemma, Neural Computation 4 (1) (1992) 1-58.

[33] R. Kohavi, D.H. Wolpert, Bias plus variance decomposition for zero-one loss functions, in: Proceedings of the 13th International Conference on Machine Learning, Morgan Kaufmann, San Mateo, CA, 1996, pp. 275-283.

Jin Tian received her B.E. degree in Management Information System from Tianjin University, Tianjin, China, in 2002, and the M.E. and Ph.D. degrees in Management Science and Engineering from Tianjin University, Tianjin, China, in 2005 and 2008, respectively. She is currently a lecturer in the Department of Information Management and Management Science, School of Management, Tianjin University, Tianjin, China. She has published several papers in English journals and conferences, such as Pattern Recognition Letters and Neural Computing & Applications, PAKDD2007, ISNN2007, etc.

Minqiang Li received his B.E. degree in Industry and Enterprise Management from Hebei University of Science and Technology, Hebei, China, in 1986, and the M.E. and Ph.D. degrees in Systems Engineering and Management Science from Tianjin University, Tianjin, China, in 1989 and 2000, respectively. He is currently a professor in the Department of Information Management and Management Science, School of Management, Tianjin University, Tianjin, China. He has published over 40 papers in English journals and conferences, such as the European Journal of Operational Research, Journal of Heuristics, Pattern Recognition Letters, Knowledge-Based Systems, Information Sciences, Soft Computing, Applied Soft Computing, Neural Computing & Applications, Computational Optimization and Applications, Sciences in China (Series E/F), etc. He serves on the editorial board of the Chinese Journal of Management Sciences. He is a member of the Association for Information Systems and a member of the Systems Engineering Society of China.

Fuzan Chen received her B.E. degree in Information Management from Tianjin University, Tianjin, China, in 1994, and the M.E. and Ph.D. degrees in Management Science from Tianjin University, Tianjin, China, in 1997 and 2000, respectively. She did her post-doctoral research in Communication and Information Technology from 2000 to 2002. She is currently an associate professor in the Department of Information Management and Management Science, School of Management, Tianjin University, Tianjin, PR China. She has published over 16 papers in English journals and conferences, such as Pattern Recognition Letters, Neural Computing & Applications, etc.

Jisong Kou received his M.E. and Ph.D. degrees in Systems Engineering from Tianjin University, Tianjin, China, in 1982 and 1987, respectively. He is currently a professor in the Institute of Systems Engineering, School of Management, Tianjin University, Tianjin, China. He has published over 50 papers in English journals and conferences, such as the Journal of Heuristics, Soft Computing, Applied Soft Computing, Information Science, Sciences in China (Series E/F), etc. He has been awarded four prizes of the National Science and Technology Advances and more than twenty prizes of the Provincial Science and Technology Advances. He serves as the deputy editor-in-chief of the Chinese Journal of Management Sciences. He is a standing member of the Systems Engineering Society of China.