
Balancing Ensemble Learning through Error Shift

    Yong Liu

Abstract: In neural network learning, it has often been observed that some data are learned extremely well while others are barely learned. Such unbalanced learning often leads to learned neural networks or neural network ensembles that are too strongly biased toward the well-learned data. The stronger bias could contribute to larger variance and poorer generalization on unseen data. It is necessary to prevent a learned model from being strongly biased, especially when the model has unnecessarily large complexity for the application. This paper shows how balanced ensemble learning can guide learning toward being less biased through error shift, and create weak learners in an ensemble.

I. INTRODUCTION

For a two-class classification problem solved by neural networks, the target values for the two classes are often defined as 1 and 0.

The learning error function based on the 1-and-0 target values would force the neural network to continue learning on some data points even if it is already able to classify them correctly. Meanwhile, other data points might be barely learned because some conflicts could exist in the whole data set. Such unbalanced learning often leads to a learned neural network or neural network ensemble that is too strongly biased toward the well-learned data. The stronger bias could contribute to larger variance and poorer generalization on unseen data. It is necessary to prevent a learned model from being strongly biased, especially when the model has unnecessarily large complexity for the application.

This paper shows how balanced ensemble learning [1] can guide learning toward being less biased through error shift, and create weak learners in an ensemble. Balanced ensemble learning was developed from negative correlation learning [2]; in balanced ensemble learning, the target values are set within [1 : 0.5) or (0.5 : 0] in the learned error function. Such an error function lets the ensemble avoid unnecessary further learning on the well-learned data. Therefore, the learning direction can be shifted away from the well-learned data and turned toward other not-yet-learned data. Through shifting away from well-learned data and focusing on not-yet-learned data, a good learning balance can be achieved in the ensemble. Different from bagging [3] and boosting [4], where learners are trained on randomly re-sampled data from the original set of patterns, learners in balanced ensemble learning can be trained on all available patterns. The interesting results presented in this paper suggest that learners can still be weak even if they have been trained on the whole data set.

Yong Liu is with the School of Computer Science and Engineering, The University of Aizu, Aizu-Wakamatsu, Fukushima 965-8580, Japan (e-mail: yliu@u-aizu.ac.jp).

Another difference among these ensemble learning methods is that learners are trained simultaneously in balanced ensemble learning, while they are trained independently in bagging and sequentially in boosting. Besides bagging and boosting, many other ensemble learning approaches have been developed from a variety of backgrounds [5], [6], [7], [8], [2].

The rest of this paper is organized as follows: Section II describes the ideas of balanced ensemble learning. Section III shows how error shift affects learning at both the ensemble level and the individual level. Finally, Section IV concludes with a summary of the paper.

    II. BALANCED ENSEMBLE LEARNING

Balanced ensemble learning [1] was developed by changing the error functions in negative correlation learning (NCL) [2]. In NCL, the output F(n) of a neural network ensemble is formed by simply averaging the outputs F_i(n) of a set of neural networks. Given the training data set D = {(x(1), y(1)), ..., (x(N), y(N))}, all the individual networks in the ensemble are trained on the same training data set D:

F(n) = \frac{1}{M} \sum_{i=1}^{M} F_i(n)    (1)

where F_i(n) is the output of individual network i on the nth training pattern x(n), F(n) is the output of the neural network ensemble on the nth training pattern, and M is the number of individual networks in the neural network ensemble.
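As a concrete illustration of Eq. (1), the following short Python/NumPy sketch (not part of the original paper; the array shapes are assumed for illustration) averages the outputs of the M individual networks into the ensemble output:

import numpy as np

def ensemble_output(individual_outputs):
    """Eq. (1): F(n) = (1/M) * sum_i F_i(n).

    individual_outputs: array of shape (M, N) holding F_i(n)
    for M networks on N training patterns.
    Returns an array of shape (N,) holding F(n).
    """
    return np.mean(individual_outputs, axis=0)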

The idea of NCL [2] is to introduce a correlation penalty term into the error function of each individual network so that all the individual networks can be trained simultaneously and interactively. The error function E_i for individual i on the training data set D in negative correlation learning is defined by

E_i = \frac{1}{N} \sum_{n=1}^{N} E_i(n) = \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{1}{2} (F_i(n) - y(n))^2 + \lambda p_i(n) \right]    (2)

where N is the number of training patterns, E_i(n) is the value of the error function of network i at the presentation of the nth training pattern, and y(n) is the desired output for the nth training pattern. The first term on the right side of Eq. (2) is the mean squared error of individual network i. The second term, p_i(n), is a correlation penalty function. The purpose of minimizing p_i(n) is to negatively correlate each individual's error with the errors of the rest of the ensemble. The parameter λ is used to adjust the strength of the penalty.
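The penalty p_i(n) is not written out in this paper; in the standard NCL formulation of [2] it is usually taken as p_i(n) = (F_i(n) - F(n)) * sum_{j != i} (F_j(n) - F(n)), which equals -(F_i(n) - F(n))^2 because the deviations from the ensemble mean sum to zero. Under that assumption, a minimal Python sketch of the per-pattern error in Eq. (2) for all individuals could look as follows:

import numpy as np

def ncl_pattern_error(outputs_n, target_n, lam):
    """Per-pattern NCL error E_i(n) of Eq. (2) for every individual i.

    outputs_n: array of shape (M,) with F_i(n) for one pattern n
    target_n:  scalar desired output y(n)
    lam:       penalty strength (lambda in Eq. (2))
    Assumes the standard penalty p_i(n) = -(F_i(n) - F(n))**2 from [2].
    """
    F_n = outputs_n.mean()                      # ensemble output F(n), Eq. (1)
    mse_term = 0.5 * (outputs_n - target_n) ** 2
    penalty = -(outputs_n - F_n) ** 2           # p_i(n)
    return mse_term + lam * penalty             # E_i(n) for all i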


Fig. 1. Average error rate on the training set (left) and the testing set (right) of the diabetes data at both the ensemble and individual levels with 10 hidden nodes. (a) Average error rate of individual networks with the shifting parameter from 0 to 0.4; (b) average error rate of the learned ensembles with the shifting parameter from 0 to 0.4. (Plot omitted; each panel shows curves for shifting parameter values 0.0-0.4 over 4000 training epochs.)

The partial derivative of E_i with respect to the output of individual i on the nth training pattern is

\frac{\partial E_i(n)}{\partial F_i(n)} = F_i(n) - y(n) - \lambda (F_i(n) - F(n)) = (1 - \lambda)(F_i(n) - y(n)) + \lambda (F(n) - y(n))    (3)

In the case of 0 < λ < 1, both F(n) and F_i(n) are trained to go closer to the target output y(n) by NCL. λ = 0 and λ = 1 are the two special cases. At λ = 0, there is no correlation penalty function, and each individual network is just trained independently based on

\frac{\partial E_i(n)}{\partial F_i(n)} = F_i(n) - y(n)    (4)

At λ = 1, the derivative of the error function is given by

\frac{\partial E_i(n)}{\partial F_i(n)} = F(n) - y(n)    (5)

where the error signal is decided by F(n) - y(n), i.e., the difference between F(n) and y(n). For classification problems, it is unnecessary to have the smallest difference between F(n) and y(n).
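To make the role of λ in Eqs. (3)-(5) concrete, the following Python sketch (illustrative only, not from the paper) computes the per-pattern error signal of Eq. (3) for every individual; λ = 0 reduces it to Eq. (4) and λ = 1 to Eq. (5):

import numpy as np

def ncl_error_signal(outputs_n, target_n, lam):
    """Eq. (3): dE_i(n)/dF_i(n) = (1 - lam)*(F_i(n) - y(n)) + lam*(F(n) - y(n)).

    outputs_n: array of shape (M,) with the individual outputs F_i(n)
    target_n:  desired output y(n)
    lam:       correlation penalty strength (0 gives Eq. (4), 1 gives Eq. (5))
    """
    F_n = outputs_n.mean()  # ensemble output F(n)
    return (1.0 - lam) * (outputs_n - target_n) + lam * (F_n - target_n)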


Fig. 2. Average MSE on the training set (left) and the testing set (right) of the diabetes data at both the ensemble and individual levels with 10 hidden nodes. (a) Average MSE of individual networks with the shifting parameter from 0 to 0.4; (b) average MSE of the learned ensembles with the shifting parameter from 0 to 0.4. (Plot omitted; each panel shows curves for shifting parameter values 0.0-0.4 over 4000 training epochs.)

For an example of a two-class problem, the target value y on a data point can be set to 1.0 or 0.0 depending on which class the data point belongs to. As long as F is larger than 0.5 at y = 1.0, or smaller than 0.5 at y = 0.0, the data point will be correctly classified.

In balanced ensemble learning [1], the error function for each individual on each data point is defined based on whether the ensemble has learned that data point or not. If the ensemble has learned to classify the data point correctly, a shifting parameter γ with values 0 ≤ γ < 0.5 is introduced into the derivative of the error function (Eq. (5)) for each individual:

\frac{\partial E_i(n)}{\partial F_i(n)} = F(n) - |y(n) - \gamma|    (6)

Otherwise, an enforcing parameter η with values η ≥ 1 is added to the derivative of the error function for each individual:

\frac{\partial E_i(n)}{\partial F_i(n)} = \eta (F(n) - y(n))    (7)

By shifting and enforcing the derivative of the error function, the ensemble no longer needs to learn every data point too well, and it is prevented from learning the hard data points too slowly.
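The following Python sketch is a minimal illustration of this rule (the symbols γ and η for the shifting and enforcing parameters, as well as the default values below, are assumptions made for illustration). It selects between the shifted signal of Eq. (6) and the enforced signal of Eq. (7) depending on whether the ensemble already classifies the pattern correctly:

def balanced_error_signal(F_n, y_n, gamma=0.3, eta=1.5, threshold=0.5):
    """Error signal shared by all individuals on pattern n.

    F_n:    ensemble output F(n)
    y_n:    target y(n), assumed to be 1.0 or 0.0 for a two-class problem
    gamma:  shifting parameter, 0 <= gamma < 0.5 (Eq. (6))
    eta:    enforcing parameter, eta >= 1 (Eq. (7))
    """
    correctly_classified = (F_n > threshold) == (y_n > threshold)
    if correctly_classified:
        # Eq. (6): shift the target away from 1.0/0.0 so that
        # well-learned points stop dominating the learning.
        return F_n - abs(y_n - gamma)
    # Eq. (7): enforce learning on points the ensemble still gets wrong.
    return eta * (F_n - y_n)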


Fig. 3. (a) Average overlapping rates of output between every two individual networks with 10 hidden nodes in the ensemble on both the training set (left) and the testing set (right); (b) average correlation of output between every two individual networks with 10 hidden nodes in the ensemble on both the training set (left) and the testing set (right). (Plot omitted; each panel shows curves for shifting parameter values 0.0-0.4 over 4000 training epochs.)

    III. EXPERIMENTAL RESULTS

The good performance of balanced ensemble learning has been shown on four real-world problems [1]: the Australian credit card assessment problem, the heart disease problem, the diabetes problem, and the cancer problem. After knowing the performance of the learned ensembles, it is interesting to see how and why they achieved such good performance.

In this section, each individual neural network in the ensembles learned by balanced ensemble learning has been measured by four values: the average error rates, the average mean squared error (MSE), the average overlapping rates, and the average correlations of output among the individual networks in the ensembles. The first value represents the average performance of the learned ensembles and individuals, while the second value measures the distances of the sample points to the decision boundaries. The third and fourth values show how similar the individuals are. An overlapping rate of 1 means that every two learners give the same classification on the sample data points, while an overlapping rate of 0 implies that every two learners give different classifications on the sample data points.


Fig. 4. Average error rate on the training set (left) and the testing set (right) of the diabetes data at both the ensemble and individual levels with one hidden node. (a) Average error rate of individual networks with the shifting parameter from 0 to 0.4; (b) average error rate of the learned ensembles with the shifting parameter from 0 to 0.4. (Plot omitted; each panel shows curves for shifting parameter values 0.0-0.4 over 4000 training epochs.)

Because of the limitation of pages, the results for these four values are given only for the heart disease problem. The purpose of the heart disease data set is to predict the presence or absence of heart disease given the results of various medical tests carried out on a patient. This database contains 13 attributes, which have been extracted from a larger set of 75. The database originally contained 303 examples, but 6 of these contained missing class values and were discarded, leaving 297. 27 of these were retained in case of dispute, leaving a final total of 270. There are two classes: presence and absence of heart disease.
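As an illustration of the two similarity measures described above, a minimal Python sketch for a pair of learners could look as follows. The paper does not give formulas for these measures, so the decision threshold of 0.5 and the use of the Pearson correlation coefficient for the output correlation are assumptions made here for illustration:

import numpy as np

def overlapping_rate(outputs_a, outputs_b, threshold=0.5):
    """Fraction of sample points on which two learners give the same
    classification (1 = identical decisions, 0 = always different)."""
    labels_a = outputs_a > threshold
    labels_b = outputs_b > threshold
    return np.mean(labels_a == labels_b)

def output_correlation(outputs_a, outputs_b):
    """Correlation of the raw outputs of two learners
    (assumed here to be the Pearson correlation coefficient)."""
    return np.corrcoef(outputs_a, outputs_b)[0, 1]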

10-fold cross-validation was used in the experiments. Five runs of 10-fold cross-validation were conducted to calculate the average results.

Two ensemble architectures were tested in the experiments: one architecture with 10 feedforward networks, each with one hidden layer and 10 hidden nodes, and another architecture with 50 feedforward networks, each with one hidden layer and one hidden node. The number of training epochs was set to 4000.
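A sketch of this evaluation protocol (hypothetical variable names; the data loading and the ensemble training itself are omitted and replaced by placeholders) could be organized as follows:

import numpy as np
from sklearn.model_selection import KFold

# Two ensemble configurations, 5 runs of 10-fold cross-validation,
# and 4000 training epochs per ensemble, as described above.
configurations = [
    {"n_networks": 10, "hidden_nodes": 10},
    {"n_networks": 50, "hidden_nodes": 1},
]
n_runs, n_folds, n_epochs = 5, 10, 4000

X = np.random.rand(270, 13)           # placeholder for the heart disease data
y = np.random.randint(0, 2, 270)      # placeholder class labels

for config in configurations:
    for run in range(n_runs):
        kfold = KFold(n_splits=n_folds, shuffle=True, random_state=run)
        for train_idx, test_idx in kfold.split(X):
            pass  # build the ensemble per `config` and train it for n_epochs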

Fig. 1 shows the average error rates for both the learned ensembles and the individuals with 10 hidden nodes for the different shifting parameters through the learning process. The results suggest that the larger the shifting parameter is, the lower the error rates the learned ensembles had on both the training set and the testing set.


Fig. 5. Average MSE on the training set (left) and the testing set (right) of the diabetes data at both the ensemble and individual levels with one hidden node. (a) Average MSE of individual networks with the shifting parameter from 0 to 0.4; (b) average MSE of the learned ensembles with the shifting parameter from 0 to 0.4. (Plot omitted; each panel shows curves for shifting parameter values 0.0-0.4 over 4000 training epochs.)

For example, with the shifting parameter γ increasing from 0 to 0.4, the average error rates of the ensembles were reduced from 0.18% to 0 on the training set, and improved from 3.7% to 2.66% on the testing set. No overfitting was observed.

In contrast to the ensembles, the individuals had higher average error rates on both the training set and the testing set with the larger shifting parameters. The average error rates on the training set rose from 2.57% at γ = 0 to 38.3% at γ = 0.4. The average error rates on the testing set increased from 6.45% at γ = 0 to 38.8% at γ = 0.4. This suggests that the strong individuals at the lower γ were turned into weak ones at the higher γ.

The averages of MSE for both the learned ensembles and the individuals with 10 hidden nodes for the different shifting parameters through the learning process are displayed in Fig. 2. Similar trends appeared on both the training and testing sets. For the learned ensembles, the higher γ is, the larger the MSE became. The MSE fell significantly in the initial few learning steps, and much more slowly in the rest of the learning process. In comparison, the MSE of the individuals increased more and more at the beginning with increased γ, and then fell gradually. It should be noticed that the learned ensembles with the larger MSE actually had the lower error rates. This suggests that the learning was shifted away from learning some data points further, and turned toward those that were not yet learned well.


Fig. 6. (a) Average overlapping rates of output between every two individual networks with one hidden node in the ensemble on both the training set (left) and the testing set (right); (b) average correlation of output between every two individual networks with one hidden node in the ensemble on both the training set (left) and the testing set (right). (Plot omitted; each panel shows curves for shifting parameter values 0.0-0.4 over 4000 training epochs.)

Such balancing prevents learning from being biased toward the easily learned data.

The changes of the average overlapping rates and the average correlations among the individual networks with 10 hidden nodes on both the training set and the testing set through the learning process are shown in Fig. 3. The average overlapping rates were near 0.5 on both the training and testing sets at γ = 0.3 and 0.4. This implies that the individuals are rather independent under balanced ensemble learning with the higher values of γ. The small negative values of the correlations among the individual networks suggest that the individual networks trained by balanced ensemble learning were negatively correlated.

The four measured values for the learned ensembles and individuals with one hidden node for the different shifting parameters through the learning process are shown in Fig. 4, Fig. 5, and Fig. 6. Similar results were obtained for the ensembles consisting of small neural networks. From the average error rates, it can be seen that the ensembles with small neural networks could not learn as well at γ = 0. With balanced ensemble learning, the ensembles with small neural networks could learn as well as the ensembles with big neural networks.


In other words, balanced ensemble learning could help to create learning machines with less complexity.

    IV. CONCLUSIONS

Error shift led to balanced learning with higher MSE and lower error rates. This is certainly not to say that the higher the MSE an ensemble has, the better the performance it could achieve. The higher MSE produced by balanced ensemble learning reflects that balanced ensemble learning could learn each data point rather equally. Without error shift, learning often tends to fit the easily learned data points better and better. Such a tendency might prevent the ensembles from learning the other, harder data points. With error shift, more learning effort is given to those data points that were not learned or not learned well. Therefore, balanced learning could create a less biased ensemble in which the individual learners become rather weak.

REFERENCES

[1] Y. Liu, "A balanced ensemble learning with adaptive error functions," Lecture Notes in Computer Science, vol. 5370, pp. 1-8, 2008.
[2] Y. Liu and X. Yao, "Simultaneous training of negatively correlated neural networks in an ensemble," IEEE Trans. on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 29, no. 6, pp. 716-725, 1999.
[3] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123-140, 1996.
[4] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, pp. 197-227, 1990.
[5] L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993-1001, 1990.
[6] D. Sarkar, "Randomness in generalization ability: a source to improve it," IEEE Trans. on Neural Networks, vol. 7, no. 3, pp. 676-685, 1996.
[7] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, pp. 79-87, 1991.
[8] R. A. Jacobs, M. I. Jordan, and A. G. Barto, "Task decomposition through competition in a modular connectionist architecture: the what and where vision task," Cognitive Science, vol. 15, pp. 219-250, 1991.

