
Balancing Ensemble Learning through Error Shift

    Yong Liu

Abstract: In neural network learning, it has often been observed that some data are learned extremely well while others are barely learned. Such unbalanced learning often leads to learned neural networks or neural network ensembles that are too strongly biased toward the well-learned data. The stronger bias could contribute to larger variance and poorer generalization on unseen data. It is necessary to prevent a learned model from being strongly biased, especially when the model has unnecessarily large complexity for the application. This paper shows how balanced ensemble learning can guide learning toward being less biased through error shift, and create weak learners in an ensemble.

I. INTRODUCTION

For a two-class classification problem solved by neural networks, the target values for the two classes are often defined as 1 and 0.

The learning error function based on the 1-and-0 target values would force the neural network to continue learning on some data points even if it is already able to classify them correctly. Meanwhile, other data points might be barely learned because some conflicts could exist in the whole data set. Such unbalanced learning often leads to a learned neural network or neural network ensemble that is too strongly biased toward the well-learned data. The stronger bias could contribute to larger variance and poorer generalization on unseen data. It is necessary to prevent a learned model from being strongly biased, especially when the model has unnecessarily large complexity for the application.

This paper shows how balanced ensemble learning [1] can guide learning toward being less biased through error shift, and create weak learners in an ensemble. Balanced ensemble learning was developed from negative correlation learning [2]; in balanced ensemble learning, the target values are set within [1 : 0.5) or (0.5 : 0] in the learned error function. Such an error function lets the ensemble avoid unnecessary further learning on the well-learned data. Therefore, the learning direction can be shifted away from the well-learned data and turned toward other not-yet-learned data. Through shifting away from well-learned data and focusing on not-yet-learned data, a good learning balance can be achieved in the ensemble. Different from bagging [3] and boosting [4], where learners are trained on randomly re-sampled data from the original set of patterns, learners in balanced ensemble learning can be trained on all available patterns. The interesting results presented in this paper suggest that learners can still be weak even if they have been trained on the whole data set.

Yong Liu is with the School of Computer Science and Engineering, The University of Aizu, Aizu-Wakamatsu, Fukushima 965-8580, Japan (e-mail: yliu@u-aizu.ac.jp).

Another difference among these ensemble learning methods is that learners are trained simultaneously in balanced ensemble learning, while they are trained independently in bagging and sequentially in boosting. Besides bagging and boosting, many other ensemble learning approaches have been developed from a variety of backgrounds [5], [6], [7], [8], [2].

The rest of this paper is organized as follows: Section II describes the ideas of balanced ensemble learning. Section III shows how error shift affects learning at both the ensemble level and the individual level. Finally, Section IV concludes with a summary of the paper.

    II. BALANCED ENSEMBLE LEARNING

Balanced ensemble learning [1] was developed by changing the error functions in negative correlation learning (NCL) [2]. In NCL, the output F(n) of a neural network ensemble is formed by simply averaging the outputs F_i(n) of a set of neural networks. Given the training data set D = {(x(1), y(1)), ..., (x(N), y(N))}, all the individual networks in the ensemble are trained on the same training data set D:

F(n) = \frac{1}{M} \sum_{i=1}^{M} F_i(n)    (1)

where F_i(n) is the output of individual network i on the nth training pattern x(n), F(n) is the output of the neural network ensemble on the nth training pattern, and M is the number of individual networks in the neural network ensemble.
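As a concrete illustration of Eq. (1), the following short Python/NumPy sketch (not part of the original paper; the array shapes are assumed for illustration) averages the outputs of the M individual networks into the ensemble output:

import numpy as np

def ensemble_output(individual_outputs):
    """Eq. (1): F(n) = (1/M) * sum_i F_i(n).

    individual_outputs: array of shape (M, N) holding F_i(n)
    for M networks on N training patterns.
    Returns an array of shape (N,) holding F(n).
    """
    return np.mean(individual_outputs, axis=0)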

The idea of NCL [2] is to introduce a correlation penalty term into the error function of each individual network so that all the individual networks can be trained simultaneously and interactively. The error function E_i for individual i on the training data set D in negative correlation learning is defined by

E_i = \frac{1}{N} \sum_{n=1}^{N} E_i(n) = \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{1}{2} (F_i(n) - y(n))^2 + \lambda p_i(n) \right]    (2)

where N is the number of training patterns, E_i(n) is the value of the error function of network i at the presentation of the nth training pattern, and y(n) is the desired output for the nth training pattern. The first term on the right side of Eq. (2) is the mean squared error of individual network i. The second term, p_i(n), is a correlation penalty function. The purpose of minimizing p_i(n) is to negatively correlate each individual's error with the errors of the rest of the ensemble. The parameter λ is used to adjust the strength of the penalty.
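The penalty p_i(n) is not written out in this paper; in the standard NCL formulation of [2] it is usually taken as p_i(n) = (F_i(n) - F(n)) * sum_{j != i} (F_j(n) - F(n)), which equals -(F_i(n) - F(n))^2 because the deviations from the ensemble mean sum to zero. Under that assumption, a minimal Python sketch of the per-pattern error in Eq. (2) for all individuals could look as follows:

import numpy as np

def ncl_pattern_error(outputs_n, target_n, lam):
    """Per-pattern NCL error E_i(n) of Eq. (2) for every individual i.

    outputs_n: array of shape (M,) with F_i(n) for one pattern n
    target_n:  scalar desired output y(n)
    lam:       penalty strength (lambda in Eq. (2))
    Assumes the standard penalty p_i(n) = -(F_i(n) - F(n))**2 from [2].
    """
    F_n = outputs_n.mean()                      # ensemble output F(n), Eq. (1)
    mse_term = 0.5 * (outputs_n - target_n) ** 2
    penalty = -(outputs_n - F_n) ** 2           # p_i(n)
    return mse_term + lam * penalty             # E_i(n) for all i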


Fig. 1. Average error rate on the training set (left) and the testing set (right) of the diabetes data at both the ensemble and individual levels with 10 hidden nodes. (a) Average error rate of individual networks with the shifting parameter from 0 to 0.4; (b) average error rate of the learned ensembles with the shifting parameter from 0 to 0.4. (Plot omitted; each panel shows curves for shifting parameter values 0.0-0.4 over 4000 training epochs.)

The partial derivative of E_i with respect to the output of individual i on the nth training pattern is

\frac{\partial E_i(n)}{\partial F_i(n)} = F_i(n) - y(n) - \lambda (F_i(n) - F(n)) = (1 - \lambda)(F_i(n) - y(n)) + \lambda (F(n) - y(n))    (3)

In the case of 0 < λ < 1, both F(n) and F_i(n) are trained to go closer to the target output y(n) by NCL. λ = 0 and λ = 1 are the two special cases. At λ = 0, there is no correlation penalty function, and each individual network is just trained independently based on

\frac{\partial E_i(n)}{\partial F_i(n)} = F_i(n) - y(n)    (4)

At λ = 1, the derivative of the error function is given by

\frac{\partial E_i(n)}{\partial F_i(n)} = F(n) - y(n)    (5)

where the error signal is decided by F(n) - y(n), i.e., the difference between F(n) and y(n). For classification problems, it is unnecessary to have the smallest difference between F(n) and y(n).
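To make the role of λ in Eqs. (3)-(5) concrete, the following Python sketch (illustrative only, not from the paper) computes the per-pattern error signal of Eq. (3) for every individual; λ = 0 reduces it to Eq. (4) and λ = 1 to Eq. (5):

import numpy as np

def ncl_error_signal(outputs_n, target_n, lam):
    """Eq. (3): dE_i(n)/dF_i(n) = (1 - lam)*(F_i(n) - y(n)) + lam*(F(n) - y(n)).

    outputs_n: array of shape (M,) with the individual outputs F_i(n)
    target_n:  desired output y(n)
    lam:       correlation penalty strength (0 gives Eq. (4), 1 gives Eq. (5))
    """
    F_n = outputs_n.mean()  # ensemble output F(n)
    return (1.0 - lam) * (outputs_n - target_n) + lam * (F_n - target_n)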


Fig. 2. Average MSE on the training set (left) and the testing set (right) of the diabetes data at both the ensemble and individual levels with 10 hidden nodes. (a) Average MSE of individual networks with the shifting parameter from 0 to 0.4; (b) average MSE of the learned ensembles with the shifting parameter from 0 to 0.4. (Plot omitted; each panel shows curves for shifting parameter values 0.0-0.4 over 4000 training epochs.)

For an example of a two-class problem, the target value y on a data point can be set to 1.0 or 0.0 depending on which class the data point belongs to. As long as F is larger than 0.5 at y = 1.0, or smaller than 0.5 at y = 0.0, the data point will be correctly classified.

In balanced ensemble learning [1], the error function for each individual on each data point is defined based on whether the ensemble has learned that data point or not. If the ensemble has learned to classify the data point correctly, a shifting parameter γ with values 0 ≤ γ < 0.5 is introduced into the derivative of the error function (Eq. (5)) for each individual:

\frac{\partial E_i(n)}{\partial F_i(n)} = F(n) - |y(n) - \gamma|    (6)

Otherwise, an enforcing parameter η with values η ≥ 1 is added to the derivative of the error function for each individual:

\frac{\partial E_i(n)}{\partial F_i(n)} = \eta (F(n) - y(n))    (7)

By shifting and enforcing the derivative of the error function, the ensemble no longer needs to learn every data point too well, and it is prevented from learning the hard data points too slowly.
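The following Python sketch is a minimal illustration of this rule (the symbols γ and η for the shifting and enforcing parameters, as well as the default values below, are assumptions made for illustration). It selects between the shifted signal of Eq. (6) and the enforced signal of Eq. (7) depending on whether the ensemble already classifies the pattern correctly:

def balanced_error_signal(F_n, y_n, gamma=0.3, eta=1.5, threshold=0.5):
    """Error signal shared by all individuals on pattern n.

    F_n:    ensemble output F(n)
    y_n:    target y(n), assumed to be 1.0 or 0.0 for a two-class problem
    gamma:  shifting parameter, 0 <= gamma < 0.5 (Eq. (6))
    eta:    enforcing parameter, eta >= 1 (Eq. (7))
    """
    correctly_classified = (F_n > threshold) == (y_n > threshold)
    if correctly_classified:
        # Eq. (6): shift the target away from 1.0/0.0 so that
        # well-learned points stop dominating the learning.
        return F_n - abs(y_n - gamma)
    # Eq. (7): enforce learning on points the ensemble still gets wrong.
    return eta * (F_n - y_n)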


Fig. 3. (a) Average overlapping rates of output between every two individual networks with 10 hidden nodes in the ensemble on both the training set (left) and the testing set (right); (b) average correlation of output between every two individual networks with 10 hidden nodes in the ensemble on both the training set (left) and the testing set (right). (Plot omitted; each panel shows curves for shifting parameter values 0.0-0.4 over 4000 training epochs.)

    III. EXPERIMENTAL RESULTS

The good performance of balanced ensemble learning has been shown on four real-world problems [1]: the Australian credit card assessment problem, the heart disease problem, the diabetes problem, and the cancer problem. After knowing the performance of the learned ensembles, it is interesting to see how and why they achieved such good performance.

In this section, each individual neural network in the ensembles learned by balanced ensemble learning has been measured by four values: the average error rates, the average mean squared error (MSE), the average overlapping rates, and the average correlations of output among the individual networks in the ensembles. The first value represents the average performance of the learned ensembles and individuals, while the second value measures the distances of the sample points to the decision boundaries. The third and fourth values show how similar the individuals are. An overlapping rate of 1 means that every two learners give the same classification on the sample data points, while an overlapping rate of 0 implies that every two learners give different classifications on the sample data points.


Fig. 4. Average error rate on the training set (left) and the testing set (right) of the diabetes data at both the ensemble and individual levels with one hidden node. (a) Average error rate of individual networks with the shifting parameter from 0 to 0.4; (b) average error rate of the learned ensembles with the shifting parameter from 0 to 0.4. (Plot omitted; each panel shows curves for shifting parameter values 0.0-0.4 over 4000 training epochs.)

Because of the limitation of pages, the results for these four values are given only for the heart disease problem. The purpose of the heart disease data set is to predict the presence or absence of heart disease given the results of various medical tests carried out on a patient. This database contains 13 attributes, which have been extracted from a larger set of 75. The database originally contained 303 examples, but 6 of these contained missing class values and were discarded, leaving 297. 27 of these were retained in case of dispute, leaving a final total of 270. There are two classes: presence and absence of heart disease.
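As an illustration of the two similarity measures described above, a minimal Python sketch for a pair of learners could look as follows. The paper does not give formulas for these measures, so the decision threshold of 0.5 and the use of the Pearson correlation coefficient for the output correlation are assumptions made here for illustration:

import numpy as np

def overlapping_rate(outputs_a, outputs_b, threshold=0.5):
    """Fraction of sample points on which two learners give the same
    classification (1 = identical decisions, 0 = always different)."""
    labels_a = outputs_a > threshold
    labels_b = outputs_b > threshold
    return np.mean(labels_a == labels_b)

def output_correlation(outputs_a, outputs_b):
    """Correlation of the raw outputs of two learners
    (assumed here to be the Pearson correlation coefficient)."""
    return np.corrcoef(outputs_a, outputs_b)[0, 1]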

10-fold cross-validation was used in the experiments. Five runs of 10-fold cross-validation were conducted to calculate the average results.

Two ensemble architectures were tested in the experiments: one architecture with 10 feedforward networks, each with one hidden layer and 10 hidden nodes, and another architecture with 50 feedforward networks, each with one hidden layer and one hidden node. The number of training epochs was set to 4000.
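A sketch of this evaluation protocol (hypothetical variable names; the data loading and the ensemble training itself are omitted and replaced by placeholders) could be organized as follows:

import numpy as np
from sklearn.model_selection import KFold

# Two ensemble configurations, 5 runs of 10-fold cross-validation,
# and 4000 training epochs per ensemble, as described above.
configurations = [
    {"n_networks": 10, "hidden_nodes": 10},
    {"n_networks": 50, "hidden_nodes": 1},
]
n_runs, n_folds, n_epochs = 5, 10, 4000

X = np.random.rand(270, 13)           # placeholder for the heart disease data
y = np.random.randint(0, 2, 270)      # placeholder class labels

for config in configurations:
    for run in range(n_runs):
        kfold = KFold(n_splits=n_folds, shuffle=True, random_state=run)
        for train_idx, test_idx in kfold.split(X):
            pass  # build the ensemble per `config` and train it for n_epochs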

Fig. 1 shows the average error rates for both the learned ensembles and the individuals with 10 hidden nodes for the different shifting parameters through the learning process. The results suggest that the larger the shifting parameter is, the lower the error rates the learned ensembles had on both the training set and the testing set.


Fig. 5. Average MSE on the training set (left) and the testing set (right) of the diabetes data at both the ensemble and individual levels with one hidden node. (a) Average MSE of individual networks with the shifting parameter from 0 to 0.4; (b) average MSE of the learned ensembles with the shifting parameter from 0 to 0.4. (Plot omitted; each panel shows curves for shifting parameter values 0.0-0.4 over 4000 training epochs.)

For example, with the shifting parameter γ increasing from 0 to 0.4, the average error rates of the ensembles were reduced from 0.18% to 0 on the training set, and improved from 3.7% to 2.66% on the testing set. No overfitting was observed.

In contrast to the ensembles, the individuals had higher average error rates on both the training set and the testing set with the larger shifting parameters. The average error rates on the training set rose from 2.57% at γ = 0 to 38.3% at γ = 0.4. The average error rates on the testing set increased from 6.45% at γ = 0 to 38.8% at γ = 0.4. This suggests that the strong individuals at the lower γ were turned into weak ones at the higher γ.

The averages of MSE for both the learned ensembles and the individuals with 10 hidden nodes for the different shifting parameters through the learning process are displayed in Fig. 2. Similar trends appeared on both the training and testing sets. For the learned ensembles, the higher γ is, the larger the MSE became. The MSE fell significantly in the initial few learning steps, and much more slowly in the rest of the learning process. In comparison, the MSE of the individuals increased more and more at the beginning with increased γ, and then fell gradually. It should be noticed that the learned ensembles with the larger MSE actually had the lower error rates. This suggests that the learning was shifted away from learning some data points further, and turned toward those that were not yet learned well.


Fig. 6. (a) Average overlapping rates of output between every two individual networks with one hidden node in the ensemble on both the training set (left) and the testing set (right); (b) average correlation of output between every two individual networks with one hidden node in the ensemble on both the training set (left) and the testing set (right). (Plot omitted; each panel shows curves for shifting parameter values 0.0-0.4 over 4000 training epochs.)

Such balancing prevents learning from being biased toward the easily learned data.

The changes of the average overlapping rates and the average correlations among the individual networks with 10 hidden nodes on both the training set and the testing set through the learning process are shown in Fig. 3. The average overlapping rates were near 0.5 on both the training and testing sets at γ = 0.3 and 0.4. This implies that the individuals are rather independent under balanced ensemble learning with the higher values of γ. The small negative values of the correlations among the individual networks suggest that the individual networks trained by balanced ensemble learning were negatively correlated.

The four measured values for the learned ensembles and individuals with one hidden node for the different shifting parameters through the learning process are shown in Fig. 4, Fig. 5, and Fig. 6. Similar results were obtained for the ensembles consisting of small neural networks. From the average error rates, it can be seen that the ensembles with small neural networks could not learn as well at γ = 0. With balanced ensemble learning, the ensembles with small neural networks could learn as well as the ensembles with big neural networks.


In other words, balanced ensemble learning could help to create learning machines with less complexity.

    IV. CONCLUSIONS

Error shift led to balanced learning with higher MSE and lower error rates. This is certainly not to say that the higher the MSE an ensemble has, the better the performance it could achieve. The higher MSE produced by balanced ensemble learning reflects that balanced ensemble learning could learn each data point rather equally. Without error shift, learning often tends to fit the easily learned data points better and better. Such a tendency might prevent the ensembles from learning the other, harder data points. With error shift, more learning effort is given to those data points that were not learned or not learned well. Therefore, balanced learning could create a less biased ensemble in which the individual learners become rather weak.

REFERENCES

[1] Y. Liu, "A balanced ensemble learning with adaptive error functions," Lecture Notes in Computer Science, vol. 5370, pp. 1-8, 2008.
[2] Y. Liu and X. Yao, "Simultaneous training of negatively correlated neural networks in an ensemble," IEEE Trans. on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 29, no. 6, pp. 716-725, 1999.
[3] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123-140, 1996.
[4] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, pp. 197-227, 1990.
[5] L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993-1001, 1990.
[6] D. Sarkar, "Randomness in generalization ability: a source to improve it," IEEE Trans. on Neural Networks, vol. 7, no. 3, pp. 676-685, 1996.
[7] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, pp. 79-87, 1991.
[8] R. A. Jacobs, M. I. Jordan, and A. G. Barto, "Task decomposition through competition in a modular connectionist architecture: the what and where vision task," Cognitive Science, vol. 15, pp. 219-250, 1991.

