ORIGINAL PAPER
Robust least squares support vector machine based on recursive outlier elimination
Wen Wen • Zhifeng Hao • Xiaowei Yang
Published online: 9 December 2009
© Springer-Verlag 2009
Abstract To achieve robust estimation for noisy data sets,
a recursive outlier elimination-based least squares support
vector machine (ROELS-SVM) algorithm is proposed in
this paper. In this algorithm, statistical information from
the error variables of least squares support vector machine
is recursively learned and a criterion derived from robust
linear regression is employed for outlier elimination.
Besides, decremental learning technique is implemented in
the recursive training–eliminating stage, which ensures that
the outliers are eliminated with low computational cost.
The proposed algorithm is compared with re-weighted least
squares support vector machine on multiple data sets and
the results demonstrate the remarkably robust performance
of the ROELS-SVM.
Keywords Least squares · Support vector machines · Regression · Robust estimation · Outliers
1 Introduction
Support vector machine (SVM) is a supervised learning
method developed by Vapnik (1995) and Cortes (1993),
with a solid foundation in statistical learning theory. It
can be viewed as a particular version of artificial neural
networks, one which strikes a balance between the structural
risk and the empirical risk to obtain robust classification or
function estimation. It has been widely used in solving
classification problems, and a series of good results have
been reported (Scholkopf et al. 1997; Burges 1998).
In the field of nonlinear function estimation, SVM
exhibits its powerful ability as well. Similar to SVM classification,
SVM regression, i.e., SVM for nonlinear function
approximation, includes both the empirical risk and the structural
risk in its minimization objective. Yet a different
description of ‘‘empirical risk’’ is employed in SVM
regression. Loss functions, such as the epsilon-insensitive loss
function, the Laplacian loss function, Huber’s robust loss function,
etc., are used for describing the empirical risk in SVM
regression (Smola and Scholkopf 1998; Mangasarian and
Musicant 2000). They are derived from robust statistics and
play the role of preserving the robustness and sparseness of SVM
regression. Besides, researchers have found that SVM regression
has a strong connection with regularization networks, and to
some extent, both SVM regression and the regularization
network can be interpreted under the framework of Bayes’
theorem (Evgeniou et al. 2000). Now SVM regression has
been widely used in real-world problems, such as time series
prediction (Cao and Tay 2003) and function approximation
(Wu 2004), and the results are quite satisfactory.
However, outliers [‘‘An outlying observation, or outlier,
is one that appears to deviate markedly from other mem-
bers of the sample in which it occurs’’ (Grubbs 1969)]
strongly affect the performance of SVM regression. For
example, least squares support vector machine (LS-SVM),
a modified version of SVM (Suykens and Vandewalle
1999), employs the sum of squared errors (SSE) as the loss
function and converts the inequality constraints in classical
SVM into equality ones. Fast training speed and excellent
learning results have been reported on LS-SVM regression for
noiseless data (Jiang et al. 2006). However, when there are
W. Wen (✉) · Z. Hao
School of Computer, Guangdong University of Technology,
510006 Guangzhou, China
e-mail: [email protected]; [email protected]
X. Yang
School of Mathematical Science,
South China University of Technology,
510641 Guangzhou, China
Soft Comput (2010) 14:1241–1251
DOI 10.1007/s00500-009-0535-9
outliers in the training samples, estimation will be distorted
and the accuracy of LS-SVM regression is largely reduced
(Suykens et al. 2002). Such problems also exist in other
versions of SVM (Chuang et al. 2002). Some researchers
have realized this problem and they have done some work
to deal with it. The solutions can be generally classified
into three categories. One is to directly eliminate the
samples that are probably outliers (Tian and Huang 2002).
In this method, SVM regression is first trained using all
of the samples. Then a given threshold is set to determine
which samples are outliers and which are not. Finally,
constraints in SVM regression are modified so that the
‘‘outliers’’ are eliminated in the successive training proce-
dure. Negative impacts of outliers are eliminated when the
threshold is chosen appropriately. But unfortunately, it is
not easy to find such an ‘‘appropriate’’ threshold unless
there is sufficient prior knowledge of the training samples.
Another is to softly eliminate the outliers; namely, outliers
are not directly eliminated, but are assigned appropriately small
weights. The small weights on the outliers reduce their
negative influence to the training procedure of SVM
(Suykens et al. 2002, 2003; Zhang and Gao 2005). But how
to assign the weights is actually a difficult problem, espe-
cially when there are many noises and outliers in the
samples. Besides, it is difficult to know what exact effects
the small weights will bring about; we just qualitatively
know that their negative influences are partly reduced.
Brabanter (2004) has done some valuable work on this issue.
But whether the ‘‘soft’’ elimination is enough for a correct
estimation still remains unknown. The last category is to
use a robust back-propagation procedure. This approach,
proposed by Chuang et al. (2002), is named the robust
support vector regression (RSVR) network. It consists of
two steps. In the first step, a traditional SVM regression
network is trained; in the second step, robust back propa-
gation procedure is employed to adjust the weights. This
method is effective to some extent, yet it is highly
dependent on the performance of the training results of
traditional SVM regression. That is, if the traditional SVM
regression obtains very bad results, the back propagation
will not produce good results.
From the statistical perspective, traditional linear
regression, which employs the SSE loss function, is also sensitive
to outliers. The breakdown point of linear regression
is equal to 1/n (n is the number of samples) (Rousseeuw
and Leroy 1987). This means that even a single outlier has
the potential to completely distort the regression line. To
solve this problem, Rousseeuw (1984) proposed to use
the median of squared residuals as the loss function instead of SSE.
The estimator produced by this method is named the least median
of squares (LMS) estimator, and it is claimed that LMS
achieves a breakdown point of about 1/2 (namely, the
regression line will not be distorted unless the number of
outliers exceeds half of the total number of samples).
Furthermore, Rousseeuw and Driessen (2006) proposed the
least trimmed squares (LTS) estimator, a linear
estimator that minimizes the sum of the h smallest squared
residuals. According to Rousseeuw and Driessen (2006),
LTS is less sensitive to local effects than LMS and produces
better statistical efficiency.
Enlightened by the LTS estimator, we propose a
recursive outlier-eliminating algorithm. To avoid the subjective
determination of a threshold as in Tian and
Huang (2002), a quantitative criterion based on the idea of
LTS is implemented to discriminate between samples that can be
eliminated and those that cannot. The proposed algorithm is
employed in the framework of LS-SVM, and a decremental
learning technique is introduced to accelerate the computation.
Experimental results show that this algorithm
can correctly eliminate the outliers at a very low computational
cost.
The rest of this paper is organized as follows. Section 2
provides a brief review of LS-SVM regression and the
weighted LS-SVM regression. In Sect. 3, we first present
the recursive outlier elimination-based least squares sup-
port vector machine (ROELS-SVM) for robust regression
and then introduce the decremental learning strategy to the
recursive training–eliminating stage for computational
complexity reduction. Experimental results from simulated
instances and real-world data sets are presented in Sect. 4.
Finally, some concluding remarks are given in Sect. 5.
2 Brief review of LS-SVM regression and weighted
LS-SVM regression
Given a training set of N samples $\{(x_i, y_i)\}_{i=1}^{N}$ with input
features $x_i \in \mathbb{R}^d$ and output value $y_i \in \mathbb{R}$, the standard
LS-SVM regression can be formulated as the following
optimization problem in the primal space (Suykens and
Vandewalle 1999):

$$\min\; J(w, e) = \frac{1}{2} w^T w + \frac{1}{2}\gamma \sum_{i=1}^{N} e_i^2 \quad \text{s.t.}\; y_i = w^T \varphi(x_i) + b + e_i,\; i = 1, \ldots, N \qquad (1)$$

In formula (1), $\varphi(\cdot): \mathbb{R}^d \to \mathbb{R}^{\tilde{d}}$ is a function which maps
the input space into a higher-dimensional feature space,
$w \in \mathbb{R}^{\tilde{d}}$ is the weight vector in the primal weight space, $b$ is the
bias term and $e_i\; (i = 1, \ldots, N)$ are error variables.
Using Lagrange multipliers and matrix transformation,
optimization problem (1) can be converted to a set of linear
equations in the dual space, as in formula (2):

$$\begin{bmatrix} 0 & 1_v^T \\ 1_v & \Omega + \frac{1}{\gamma} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix} \qquad (2)$$
where $y = [y_1, \ldots, y_N]^T$, $1_v = [1, \ldots, 1]^T$, $\alpha = [\alpha_1, \ldots, \alpha_N]^T$
and $\Omega_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j)$ for $i = 1, \ldots, N$,
$j = 1, \ldots, N$. $K$ is the kernel function, for example, a linear
kernel, a polynomial kernel or an RBF kernel. This implies that
the training procedure of LS-SVM regression just requires
solving a linear system.
In the primal weight space one has the model for the estimation
of the output value $y$:

$$y(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b \qquad (3)$$
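As a concrete illustration, the following is a minimal NumPy sketch of this training procedure: it assembles and solves the dual system (2) with an RBF kernel and evaluates model (3). It is not the authors' code; the function names (`lssvm_train`, `lssvm_predict`) and the dense-solver approach are our assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    # K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_train(X, y, gamma, sigma):
    """Solve the dual linear system (2) for (alpha, b)."""
    N = len(y)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                                   # 1_v^T
    A[1:, 0] = 1.0                                   # 1_v
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[1:], sol[0]                           # alpha, b

def lssvm_predict(X_train, alpha, b, X_new, sigma):
    # model (3): y(x) = sum_i alpha_i K(x, x_i) + b
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b
```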
In LS-SVM, the noisier a sample is, the larger $\alpha_i$ it will
have. When there are outliers that deviate very far from the
regression curve, the large values of $\alpha$ produce an
overwhelming force that distorts the regression curve [see
formula (3)]. In order to improve the robustness of LS-SVM,
Suykens et al. (2002) proposed a modified version of LS-SVM,
named the weighted LS-SVM. The optimization problem of the
weighted LS-SVM can be formulated as follows:

$$\min\; J^*(w^*, e^*) = \frac{1}{2}\|w^*\|^2 + \frac{1}{2} C \sum_{i=1}^{N} v_i e_i^{*2} \quad \text{s.t.}\; y_i = w^{*T} \varphi(x_i) + b^* + e_i^*,\; i = 1, \ldots, N \qquad (4)$$

where $v_i$ is determined by the following formula:

$$v_k = \begin{cases} 1 & \text{if } |e_i/s| \le c_1, \\ \dfrac{c_2 - |e_i/s|}{c_2 - c_1} & \text{if } c_1 \le |e_i/s| \le c_2, \\ 10^{-4} & \text{otherwise.} \end{cases} \qquad (5)$$
$s$ can be given by formulas (6) and (7):

$$s = \frac{\mathrm{IQR}}{2 \times 0.6745} \qquad (6)$$

where IQR stands for the interquartile range, i.e., the
difference between the 75th percentile and the 25th percentile.
Or

$$s = 1.483\, \mathrm{MAD}(x_i) \qquad (7)$$

where $\mathrm{MAD}(x_i)$ stands for the median absolute deviation.
According to Suykens et al. (2002), the procedures (4)
and (5) can be repeated iteratively, but in practice one single
additional weighted LS-SVM step will often be sufficient.
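A small sketch of this weighting step, assuming the error variables `e` of a trained LS-SVM are available and using the IQR-based scale (6); the cut-off values `c1 = 2.5` and `c2 = 3.0` are common choices from the robust-statistics literature, used here as assumptions rather than values fixed by this paper:

```python
import numpy as np

def robust_scale(e):
    # formula (6): s = IQR / (2 * 0.6745)
    q75, q25 = np.percentile(e, [75, 25])
    return (q75 - q25) / (2 * 0.6745)

def wlssvm_weights(e, c1=2.5, c2=3.0):
    """Weights v_k of formula (5), computed from the error variables e."""
    r = np.abs(e) / robust_scale(e)
    v = np.ones_like(r)
    mid = (r > c1) & (r <= c2)
    v[mid] = (c2 - r[mid]) / (c2 - c1)
    v[r > c2] = 1e-4          # practically eliminates the sample
    return v
```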
3 Recursive outlier elimination-based LS-SVM
regression
3.1 The recursive training–eliminating procedure
for robust regression
The major difference between LS-SVM and weighted
LS-SVM is that weighted LS-SVM has weights on the
error variables. For the weighted LS-SVM, the first step is
training LS-SVM to obtain $e_i$ and $\alpha_i$ $(i = 1, \ldots, N)$. Then
weights $v_i$ $(i = 1, \ldots, N)$ are calculated according to the
statistical distribution of the error variables $e_i$ $(i = 1, \ldots, N)$:
samples with large error variables are set small weights.
In particular, if the error variable exceeds a given statistical
value, i.e., (6) or (7), the corresponding sample is set a
very small weight ($10^{-4}$), which is in fact equivalent to
eliminating this sample. Finally, the weighted LS-SVM is
trained to obtain $e_i^*$, $\alpha_i^*$ and $b^*$. This method is effective to
deal with the majority of data sets containing a handful of
outliers. However, it produces unsatisfying results when
dealing with data sets containing a lot of outliers, espe-
cially when the distribution of outliers is highly different
from a Gaussian distribution. Figure 1 illustrates an experimental
result on a noisy data set. The data set contains 196
samples; 45 of them are outliers and the rest are ‘‘clean’’
samples generated by a sine function. As shown in Fig. 1,
the seriously outlying samples, i.e., the samples labeled with round
circles and triangles, are weighted with very small values
($10^{-4}$) or values between $10^{-4}$ and 1.0. This implies the
weight-setting procedure is able to detect samples that are
seriously outlying, which demonstrates that error variables
in LS-SVM can be used as a scale of the outlying degree of
a sample. However, it also reveals the negative side of the
weighted LS-SVM. For samples that are not so seriously
‘‘outlying’’, the weighted LS-SVM has quite limited ability
to detect them: about one-third of the outliers, which are
relatively near to the noiseless samples, are still weighted
1.0; they are harmful enough to influence the correct
estimation.
This naturally leads to a question: is there any improved
method that will provide better discrimination between the
Fig. 1 Results of the weighted LS-SVM on a noisy data set mixed
with approximately 20% outliers. Samples labeled by circles are
weighted $10^{-4}$ and triangles are weighted between $10^{-4}$ and 1.0
outliers and the useful samples? The results of the weighted
LS-SVM imply that the error variables of the LS-SVM do
provide useful information on the outlying degree of
samples (this is exactly what the weighted LS-SVM uses for
finding samples that are very outlying). If the outlying samples
can be correctly eliminated, training the LS-SVM on
the remaining samples will certainly produce an estimation
closer to the real curve and, furthermore, provide more correct
information indicating the outlying degree of the remaining
samples. Therefore, a closed-loop procedure, as shown in Fig. 2,
is implemented to detect outliers, eliminate them and meanwhile
correct the estimation of LS-SVM. We name this algorithm the
ROELS-SVM.
In the ROELS-SVM, LS-SVM is trained using all of the
samples in the initial stage, and statistical information from
the error variables is used for eliminating samples that are
particularly outlying. Then the remaining samples are used for
the next LS-SVM training, and the error variables are analyzed
for further outlier elimination. The loop continues until a
given condition is reached. However, two critical problems
should be carefully considered before this algorithm
becomes feasible. One is which samples should be eliminated
in each loop. The other is when the training–eliminating
procedure can be stopped.
For the first problem, enlightened by the LTS regression
in Rousseeuw and Leroy (1987) and Rousseeuw (1984), we
set an eliminating criterion as follows:

$$Q_1 = \sum_{i=1}^{h} \left| e_{(i:N_l)} \right| \qquad (8)$$

where $N_l$ is the number of training samples in loop $l$,
$|e_{(1:N_l)}| \le |e_{(2:N_l)}| \le \cdots \le |e_{(N_l:N_l)}|$ are the ordered absolute
residuals (i.e., absolute values of the error variables), and
$h < N_l$ is an adaptive parameter, which excludes the largest
error variables from the summation in each loop. After LS-SVM
is trained, the samples are ordered according to their
absolute residuals. The $n$ samples with the largest absolute
residuals are eliminated while the remaining $h$ samples are kept
in the training data set. The preceding $h$ smallest absolute
residuals are summed to produce $Q_1$. Here, $n$ and $h$ satisfy
$n + h = N_l$.
In the successive loop, LS-SVM is retrained on the $h$
samples, leading to updated error variables $e_i'$ $(i = 1, 2, \ldots, h)$.
$Q_2$ is calculated according to formula (9):

$$Q_2 = \sum_{i=1}^{h} \left| e_i' \right| \qquad (9)$$
Since $e_i$ reflects the outlying degree of sample $i$,
eliminating the samples with the largest error variables is
equivalent to eliminating the samples that are most outlying.
Before these outlying samples are eliminated, the estimation
will be distorted by them, therefore producing relatively
large $|e_{(i:N_l)}|$ for the ‘‘clean’’ samples. When some of the
outliers, especially the most outlying ones, are eliminated,
the estimation will be corrected. So the corresponding $|e_i'|$ for the
‘‘clean’’ samples will be reduced, thus making the value of
$Q_2$ smaller than that of $Q_1$.
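In code, this criterion amounts to comparing trimmed residual sums before and after an elimination round; a minimal sketch (the function name is ours):

```python
import numpy as np

def trimmed_residual_sum(e, h):
    """Q of formulas (8)/(9): sum of the h smallest absolute residuals."""
    return np.sort(np.abs(e))[:h].sum()

# One round of the decision, assuming e_before holds the error variables
# on N_l samples and e_after those after retraining on the h kept samples:
#   Q1 = trimmed_residual_sum(e_before, h)
#   Q2 = trimmed_residual_sum(e_after, h)
#   keep eliminating while Q2 <= Q1; otherwise roll back and stop.
```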
Then when should the training–eliminating procedure be
stopped? After several rounds of elimination, all the outliers
will eventually be eliminated. If the procedure goes on, samples
will be overly eliminated, i.e., samples with useful information
are also eliminated. In this situation, the regression
accuracy decreases and the absolute error variables of the
‘‘clean’’ samples increase accordingly. Therefore, the value
of $Q_2$ becomes larger than that of $Q_1$. In this case, the
training–eliminating procedure should be stopped and the
algorithm should roll back to the result of the previous
elimination.
Different data sets contain different amounts of outliers.
Therefore, information from Suykens’ weighted LS-SVM is
used as a heuristic to determine the number of
samples that should be eliminated. To make sure every
step is cautious, we suggest that just a few samples
(usually equal to a small percentage of the outlier number
found by the weighted LS-SVM) be eliminated in each
loop. This helps to avoid a subjective selection of the
quantity of eliminated samples.
3.2 Speed up the learning procedure
Training the LS-SVM is a time-consuming task. Especially
in the proposed algorithm, it is not known in advance how many
loops will be repeated before the algorithm stops. The
LS-SVM retraining would become the overwhelmingly
time-consuming part of the whole algorithm. To accelerate the
algorithm, an iterative decremental learning algorithm,
similar to the methods proposed in Zhao and Keong (2004)
and Cawley and Talbot (2004), is implemented.
Fig. 2 Architecture of the ROELS-SVM algorithm: training samples → LS-SVM training → outlier elimination; while the given conditions are not met, the rest of the samples re-enter LS-SVM training, otherwise end

Denote $A = \begin{bmatrix} 0 & 1_v^T \\ 1_v & \Omega + \frac{1}{\gamma} I \end{bmatrix}$ in formula (2). When a
sample is eliminated, a certain row and column are
removed from $A$. Let $A_m$ be the corresponding matrix after
eliminating $m$ samples, and suppose it is the $k$th column
and the $k$th row of $A_m$ that are removed in the successive
elimination. Denote $A_m = \big(a^{(m)}_{ij}\big)$, $A_m^{-1} = \big(\tilde{a}^{(m)}_{ij}\big)$ and
$A_{m+1} = \big(a^{(m+1)}_{ij}\big)_{i,j \ne k}$. Then, according to the Sherman–Morrison–Woodbury formula,

$$\tilde{a}^{(m+1)}_{ij} = \tilde{a}^{(m)}_{ij} - \tilde{a}^{(m)}_{i,k}\, \tilde{a}^{(m)}_{k,j} \big/ \tilde{a}^{(m)}_{k,k}, \quad i, j = 1, \ldots, N - m,\; i, j \ne k \qquad (10)$$

A detailed derivation of formula (10) can be found in the
literature (Zhao and Keong 2004; Cawley and Talbot
2004).
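A sketch of this decremental update of the stored inverse, assuming `A_inv` holds the current $\tilde{A} = A_m^{-1}$ and `k` is the row/column to delete (the function name is ours):

```python
import numpy as np

def remove_sample(A_inv, k):
    """Apply formula (10): update A^{-1} for the deletion of row/column k,
    at O(N^2) cost instead of re-inverting at O(N^3)."""
    col, row, pivot = A_inv[:, k], A_inv[k, :], A_inv[k, k]
    updated = A_inv - np.outer(col, row) / pivot   # rank-one correction
    # drop the k-th row and column of the corrected inverse
    return np.delete(np.delete(updated, k, axis=0), k, axis=1)
```

Each call costs $O((N-m)^2)$, so eliminating $n$ samples in a loop costs $O(nN_l^2)$, in line with the complexity analysis that follows.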
Therefore, if we set $A_0 = A$, the retraining stage of
LS-SVM can be substituted by a decremental learning
procedure. That is, in each iteration, all the elements of
matrix $A$, except those corresponding to samples that have
already been eliminated, are updated according to formula
(10). Using such a decremental learning procedure, the
computational complexity is dramatically reduced, from
$O(N_l^3)$ to $O(n N_l^2)$ for each loop ($n$ is the number of samples
that are eliminated in each loop). Since $n \ll N_l$, $N_l < N$,
and the training–eliminating loops are certain to stop before
all the samples are eliminated, with the decremental learning
the proposed algorithm actually requires much less time than a
single extra training of LS-SVM (whose computational
complexity is $O(N^3)$). What is additionally needed is just the
storage of the inverse of matrix $A$.

Thus, the enhanced ROELS-SVM regression for noisy
data sets can be summarized as follows.
The enhanced ROELS-SVM algorithm:

1. Use tenfold cross-validation or leave-one-out validation to train LS-SVM and find the optimal hyperparameters $(\gamma, \sigma)$.
2. Under the optimal hyperparameters, use Suykens' weighted LS-SVM to find the samples whose weights equal $10^{-4}$, and let $m$ be the total number of these samples. Let $n = \max(\mathrm{int}(m \times p), 1)$, where $p$ is a given parameter; $n$ is a fixed number in the following steps.
3. Train LS-SVM using all of the samples under the optimal hyperparameters. Record the training results in $W_1 = \{\tilde{\alpha}, b\}$ and store the inverse of matrix $A$ in $\tilde{A}$.
4. Let $l = 1$ and $N_0 = N$, where $N$ is the total number of samples in the training data set. In the $l$th loop, carry out the following procedures:
   - 4.1. Back up $W_1$ in $W_2$, that is, let $W_2 = W_1$.
   - 4.2. Let $h = N_{l-1} - n$; order the samples according to the absolute values of their error variables (i.e., absolute residuals). Calculate $Q_1$ according to formula (8).
   - 4.3. Select the $n$ samples with the largest absolute residuals and record their indices in an index set $E_{idx}$.
   - 4.4. Sequentially pick a sample from $E_{idx}$, remove the corresponding column and row from $\tilde{A}$ and update the remaining elements of $\tilde{A}$ according to formula (10), until all samples in $E_{idx}$ have been eliminated. Calculate the training results and record them in $W_1 = \{\tilde{\alpha}', b'\}$. Then reset $E_{idx} = \emptyset$ and calculate $Q_2$ according to formula (9).
   - 4.5. If $Q_2 \le Q_1$, let $l = l + 1$, $N_l = h$ and go back to 4.1; else output the training results $W_2$.
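The loop of steps 4.1–4.5 can be sketched as follows, reusing `remove_sample` and `trimmed_residual_sum` from the sketches above. The sketch reads the error variables off the dual solution via the LS-SVM optimality condition $\alpha_i = \gamma e_i$; the bookkeeping of which original samples remain (needed for prediction) is omitted for brevity, and the function name is ours:

```python
import numpy as np

def roels_svm_loop(A_inv, y_aug, gamma, n):
    """Recursive training-eliminating loop (steps 4.1-4.5).
    A_inv: inverse of the full matrix A from (2); y_aug = [0, y_1..y_N]."""
    while True:
        sol = A_inv @ y_aug                   # b = sol[0], alpha = sol[1:]
        alpha, b = sol[1:], sol[0]
        e = alpha / gamma                     # error variables (alpha_i = gamma * e_i)
        h = len(e) - n
        Q1 = trimmed_residual_sum(e, h)       # formula (8)
        W2 = (alpha, b)                       # step 4.1: backup for rollback
        worst = np.argsort(np.abs(e))[-n:]    # step 4.3: n largest residuals
        for k in sorted(worst + 1, reverse=True):   # +1 skips the bias row
            A_inv = remove_sample(A_inv, k)   # step 4.4 via formula (10)
            y_aug = np.delete(y_aug, k)
        e_new = (A_inv @ y_aug)[1:] / gamma
        Q2 = trimmed_residual_sum(e_new, h)   # formula (9)
        if Q2 > Q1:
            return W2                         # step 4.5: roll back and stop
```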
To further reduce the computational complexity, step 3
can be omitted as long as the training results under the optimal
hyperparameters, i.e., $\tilde{\alpha}_{opt}$, $b_{opt}$ and the inverse of
matrix $A_{opt}$, have been recorded in the parameter-selection
stage (step 1). In this situation, the ROELS-SVM achieves
running speed comparable to the weighted LS-SVM, for
the former requires less than one extra training of LS-SVM
while the latter requires a whole training step of the
weighted LS-SVM.
Besides the inputs required by the LS-SVM, the
ROELS-SVM additionally requires the parameter p as input.
Intuitively, an appropriately small p means a
cautious eliminating step. And the last step of the ROELS-SVM
algorithm guarantees that whenever the data set
is over-eliminated, there is a chance to roll back and correct it.
4 Experiments
Two indexes are introduced to evaluate the performance of
the compared algorithms. One of them is the mean absolute
error (MAE), defined as (11):

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \qquad (11)$$
However, statisticians have noticed that MAE is
sensitive to outliers in the testing data set. That is, MAE
probably makes biased judgements when outliers are also
used as testing samples. Therefore, noiseless testing
samples are desirable when using MAE as the evaluation
criterion. To solve this problem, statisticians suggested
using another statistic (Kvalseth 1985; Rousseeuw and Leroy 1987),
which is more ‘‘resistant’’, namely more robust, for evaluating
regression models in noisy circumstances. The
‘‘resistant’’ statistic is defined as (12):

$$R^2 = 1 - \left( \frac{\mathrm{med} \left| y_i - \hat{y}_i \right|}{\mathrm{mad}(y_i)} \right)^2 \qquad (12)$$
Here, ‘‘mad’’ stands for the median absolute deviation,
defined as (13):

$$\mathrm{mad}(y_i) = \mathrm{med}_i \left| y_i - \mathrm{med}_j\, y_j \right| \qquad (13)$$

and $\mathrm{med}|y_i - \hat{y}_i|$ stands for the median of the absolute errors.
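Both indexes are straightforward to compute; a short sketch following definitions (11)–(13) (function names are ours):

```python
import numpy as np

def mae(y_true, y_pred):
    # formula (11)
    return np.mean(np.abs(y_true - y_pred))

def resistant_r2(y_true, y_pred):
    """Robust R^2 of formula (12), with mad(y) from formula (13)."""
    med_abs_err = np.median(np.abs(y_true - y_pred))
    mad_y = np.median(np.abs(y_true - np.median(y_true)))
    return 1.0 - (med_abs_err / mad_y) ** 2
```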
$R^2$ defined as (12) essentially excludes the outliers in the
testing data set and therefore has the robust property of
measuring how well the regression model fits the useful
samples. Generally, $0 \le R^2 \le 1$ for reasonable models;
$R^2 = 1$ corresponds to a perfect fit and $R^2 < 0$ corresponds to
a ‘‘bad’’ fit, which is also called the ‘‘breakdown point’’. When
the testing samples are noiseless, both MAE and $R^2$ can be
used to indicate the fitness degree of an algorithm. But
when it is impossible to have completely ‘‘clean’’ testing
samples, $R^2$ is the better choice for evaluating the regression
algorithm, for it makes a more objective and robust
judgement.
To make a fair comparison, four important models of
LS-SVM are taken into account: the classical LS-SVM, the
weighted LS-SVM (WLS-SVM), the ROELS-SVM, and a
recursive version of the weighted LS-SVM (RWLS-SVM).
In the RWLS-SVM, a reweighting and relearning procedure
of LS-SVM is implemented. The weighting formula
is the same as in the WLS-SVM (see Sect. 2) and the stop
condition is that the weights do not change for almost all
samples (a threshold $\Delta = 0.005$ is used to judge whether
a weight has changed). Besides, an iteratively updating
algorithm (Wen et al. 2008) is used for the training
of the RWLS-SVM; it is one of the fastest methods for
this purpose. In our experiments, both
the testing accuracy and the runtime of these four
algorithms are carefully investigated. All the
experiments are carried out on a PC with a 2.36 GHz CPU
and 1 GB of memory.
Besides, tenfold cross-validation is used to select the
hyperparameters $(\sigma, \gamma)$. In the tenfold cross-validation, the
data set is randomly divided into ten slices; each time,
nine slices are used as the training samples and the
rest are used as validation samples. Considering that we are
dealing with data sets with outliers, to avoid the negative
effects caused by outliers in the validation samples, we
use the median absolute error as the tuning criterion.
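A sketch of this tuning criterion, reusing the `lssvm_train`/`lssvm_predict` sketch from Sect. 2; the fold handling, seed, and function name are our assumptions:

```python
import numpy as np

def cv_median_abs_error(X, y, gamma, sigma, n_folds=10, seed=0):
    """Tenfold CV score: mean over folds of the median absolute
    validation error, which down-weights outliers in the validation set."""
    idx = np.random.default_rng(seed).permutation(len(y))
    scores = []
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        alpha, b = lssvm_train(X[train], y[train], gamma, sigma)
        pred = lssvm_predict(X[train], alpha, b, X[fold], sigma)
        scores.append(np.median(np.abs(y[fold] - pred)))
    return np.mean(scores)
```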
4.1 Simulation study
Eight simulated data sets, which contain different amount of
outliers, are investigated. Training data set consists of two
category samples: one are uncontaminated samples generated
by sinc function f ; the other are random outliers obeying uni-
form distribution. All of the eight instances contain 151
uncontaminated samples. But outliers are randomly distributed
across the input space (xoutlier 2 ½�15; 15�; youtlier 2 ½0; 3�) and
these instances contain different proportional outliers,
approximately ranging from 10 to 45%.
f ðxÞ ¼ sin x
xx 2 ½�15; 15� ð14Þ
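A sketch of how such a training set can be generated; the random seed and exact sample layout are our assumptions, and Instance 3 (45 outliers) is used as the example:

```python
import numpy as np

rng = np.random.default_rng(0)

x_clean = np.linspace(-15, 15, 151)
y_clean = np.sinc(x_clean / np.pi)     # np.sinc(t) = sin(pi t)/(pi t), so this is sin(x)/x

n_outliers = 45                         # e.g., Instance 3
x_out = rng.uniform(-15, 15, n_outliers)
y_out = rng.uniform(0, 3, n_outliers)

X = np.concatenate([x_clean, x_out]).reshape(-1, 1)
y = np.concatenate([y_clean, y_out])
```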
Testing samples are also generated by $f$, excluding the
uncontaminated samples in the training data set. There are
in total 197 samples in the testing data set. Experiments
show that under this criterion, the first five instances have
the same optimal hyperparameters $(1.0, 2.5)$, and the last
three data sets (Instances 6, 7 and 8) have optimal
hyperparameters $(2.5, 1.0)$. The testing results are listed in
Tables 1 and 2.
Results in Tables 1 and 2 show that LS-SVM is easily
influenced by the outliers: the testing errors drastically increase
as the outliers increase, and it breaks down when the outlier
amount reaches 30. The WLS-SVM partly reduces the
negative influence of the outliers, but the effect is quite
limited, especially when there is a large amount of outliers.
As for the RWLS-SVM, its results are much better than
those of the former two algorithms but slightly worse than those
of the ROELS-SVM. Both the RWLS-SVM and the ROELS-SVM
obtain good records of MAE. The testing errors
Table 1 Accuracy on the simulated data sets when $p = 0.1$

| Data set (no. of outliers) | LS-SVM MAE | WLS-SVM (a) | WLS-SVM (b) | WLS-SVM MAE | RWLS-SVM Iter. | RWLS-SVM MAE | ROELS-SVM Elm. | ROELS-SVM MAE |
|---|---|---|---|---|---|---|---|---|
| Inst. 1 (15 outliers) | 0.1215 | 12 | 4 | 0.0152 | 2 | 0.0118 | 15 | **0.0080** |
| Inst. 2 (30 outliers) | 0.1978 | 18 | 1 | 0.0278 | 3 | 0.0092 | 31 | **0.0083** |
| Inst. 3 (45 outliers) | 0.3191 | 25 | 4 | 0.0872 | 5 | 0.0089 | 45 | **0.0084** |
| Inst. 4 (60 outliers) | 0.3753 | 24 | 4 | 0.1333 | 7 | 0.0110 | 58 | **0.0078** |
| Inst. 5 (75 outliers) | 0.4564 | 27 | 5 | 0.1903 | 16 | 0.0820 | 73 | **0.0081** |
| Inst. 6 (90 outliers) | 0.5400 | 2 | 9 | 0.4817 | 11 | 0.3085 | 45 | **0.1522** |
| Inst. 7 (105 outliers) | 0.5726 | 1 | 8 | 0.5283 | 14 | 0.4113 | 56 | **0.1450** |
| Inst. 8 (120 outliers) | 0.6211 | 0 | 2 | 0.6207 | 14 | 0.5833 | 79 | **0.1000** |

(a), (b): the number of samples having weights equal to 1e−4, and the number having weights between 1e−4 and 1.0, in the weighted LS-SVM. Iter.: the number of reweighting iterations of the RWLS-SVM. Elm.: the total number of outliers eliminated by the ROELS-SVM. Notations are similar hereinafter.
The bold values indicate the best MAE obtained among the compared algorithms
remain quite small and very stable until the outlier number
increases from 90 to 105. The records of $R^2$ demonstrate the
same trends: the RWLS-SVM and the ROELS-SVM stably
achieve a good fit and have a very high ‘‘breakdown point’’.
This is a satisfactory result.
Besides, the testing accuracy of the ROELS-SVM and
the RWLS-SVM does not necessarily decrease as the outliers
increase. This is contrary to the other two methods. An
important reason might be that both the ROELS-SVM and
the RWLS-SVM implement a recursive learning structure
and can adjust the learning results during the training stage.
This makes it possible for some chance factors (for example,
the distribution of outliers), besides the number of outliers,
to influence the training results. But the ROELS-SVM
seems to have a more stable testing accuracy. This is
probably because the ROELS-SVM detects
the most suspicious outliers, eliminates them, corrects the
training result and then comes into the next loop. This makes it
possible for the ROELS-SVM to locate the critical outliers
one by one, and to keep the adjustments correct through the
retraining procedure. On the contrary, though the RWLS-SVM
also implements a recursive learning strategy, it
changes the weights of more than one sample in each loop.
This makes it difficult to locate the critical outliers, and
incorrect adjustments may occur during the recursive learning
stage, especially when there is a large amount of outliers in
the data set.
As for the runtime, though the fast training algorithm (Wen
et al. 2008) has been used for the RWLS-SVM, the results still
reveal that the ROELS-SVM has the more stable performance:
on almost all of the simulated data sets, the ROELS-SVM
requires less runtime than the RWLS-SVM.
Especially when the data sets contain a large amount of
outliers, the RWLS-SVM tends to require unexpectedly many
iterations, which cost it more time.
In the ROELS-SVM, an additional parameter $p$ is
introduced, which decides how many samples are eliminated
in each loop. We change $p$ from 0.1 to 1.0 with a
uniform step and investigate the changes in the testing results.
The results are recorded in Tables 3 and 4, in which
$(\sigma, \gamma) = (1.0, 2.5)$. As shown in these two tables, different
values of $p$ hardly influence the testing MAE and just
slightly affect the number of eliminated samples. This may be
explained by the fact that the ROELS-SVM uses a fairly
mild pruning strategy and the elimination procedure
stops once the result becomes worse. However, the detailed
results show that small values of $p$ $(p \le 0.4)$ bring relatively
stable and accurate results for all of the simulated
instances. This is because a smaller $p$ means a more cautious
eliminating strategy. Here, we have to point out that until
Table 2 $R^2$ and runtime on the simulated data sets when $p = 0.1$

| Data set | LS-SVM $R^2$ | LS-SVM Time | WLS-SVM $R^2$ | WLS-SVM Time | RWLS-SVM $R^2$ | RWLS-SVM Time | ROELS-SVM $R^2$ | ROELS-SVM Time |
|---|---|---|---|---|---|---|---|---|
| Inst. 1 | 0.4253 | 0.19 | 0.9924 | 0.25 | 0.9937 | 0.30 | **0.9961** | 0.29 |
| Inst. 2 | -2.600 | 0.22 | 0.9644 | 0.35 | 0.9958 | 0.59 | **0.9961** | 0.58 |
| Inst. 3 | -9.640 | 0.28 | 0.6030 | 0.42 | **0.9960** | 0.83 | **0.9960** | 0.66 |
| Inst. 4 | -15.14 | 0.35 | 0.0067 | 0.51 | 0.9918 | 1.08 | **0.9963** | 0.78 |
| Inst. 5 | -26.82 | 0.39 | -2.636 | 0.59 | 0.3936 | 1.27 | **0.9961** | 0.89 |
| Inst. 6 | -32.31 | 0.46 | -16.32 | 0.64 | -7.462 | 0.87 | **-0.667** | 0.78 |
| Inst. 7 | -34.46 | 0.53 | -26.57 | 0.70 | -16.68 | 1.58 | **-0.118** | 1.23 |
| Inst. 8 | -43.68 | 0.57 | -40.06 | 0.77 | -37.70 | 1.65 | **-0.061** | 1.47 |

Time: the time cost by the corresponding algorithm
The bold values indicate the best $R^2$ obtained among the compared algorithms
Table 3 Testing MAE for various $p$

| Data set | p = 0.1 | p = 0.2 | p = 0.3 | p = 0.4 | p = 0.5 | p = 0.6 | p = 0.7 | p = 0.8 | p = 0.9 | p = 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| Inst. 1 | 0.0080 | 0.0083 | 0.0080 | 0.0080 | 0.0081 | 0.0081 | 0.0081 | 0.0081 | 0.0082 | 0.0083 |
| Inst. 2 | 0.0083 | 0.0083 | 0.0099 | 0.0084 | 0.0112 | 0.0087 | 0.0096 | 0.0104 | 0.0115 | 0.0117 |
| Inst. 3 | 0.0084 | 0.0091 | 0.0091 | 0.0091 | 0.0091 | 0.0087 | 0.0098 | 0.0126 | 0.0132 | 0.0125 |
| Inst. 4 | 0.0078 | 0.0084 | 0.0078 | 0.0078 | 0.0149 | 0.0129 | 0.0151 | 0.0103 | 0.0098 | 0.0088 |
| Inst. 5 | 0.0081 | 0.0086 | 0.0080 | 0.0084 | 0.0087 | 0.0093 | 0.0097 | 0.0131 | 0.0150 | 0.0115 |
| Inst. 6 | 0.0081 | 0.0081 | 0.0081 | 0.0081 | 0.0081 | 0.0081 | 0.0081 | 0.0081 | 0.0081 | 0.0083 |
| Inst. 7 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 |
| Inst. 8 | 0.3238 | 0.3238 | 0.3238 | 0.3238 | 0.3238 | 0.3238 | 0.3238 | 0.3238 | 0.3238 | 0.3238 |
now we cannot give an optimal $p$ through theoretical analysis.
But according to our empirical studies, a small $p$, such as
$p = 0.1$, usually produces satisfactory estimation results.
4.2 Real-world data sets
To further investigate the proposed algorithm, we conduct
detailed experiments on 16 real-world data sets. Some
basic information about these data sets is presented in
Table 5, and the download websites are given in its footnotes.
Each data set is tested using tenfold cross-validation,
and Table 6 records the average results over the tenfold
cross-validation under the optimal hyperparameters for each
algorithm (the optimal hyperparameters $(\sigma_{opt}, \gamma_{opt})$ for each
algorithm are selected by grid search on the same 2D
parameter set). The results in Table 6 demonstrate that for all
the real-world data sets, both the RWLS-SVM and the
ROELS-SVM produce better accuracy than the LS-SVM
and the WLS-SVM. But the comparison between the
RWLS-SVM and the ROELS-SVM is a bit difficult: for 13
data sets the ROELS-SVM outperforms the RWLS-SVM,
for two data sets it is the contrary, and for one data set both
algorithms perform equally well.
To make a fair comparison, a non-parametric statistical
test, the Wilcoxon signed-ranks test (Demsar 2006), is
implemented to evaluate the performance of the ROELS-SVM
Table 4 The number of eliminated samples for various $p$

| Data set | p = 0.1 | p = 0.2 | p = 0.3 | p = 0.4 | p = 0.5 | p = 0.6 | p = 0.7 | p = 0.8 | p = 0.9 | p = 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| Inst. 1 | 15 | 16 | 15 | 16 | 18 | 19 | 20 | 21 | 22 | 24 |
| Inst. 2 | 31 | 31 | 34 | 33 | 37 | 30 | 32 | 34 | 36 | 36 |
| Inst. 3 | 45 | 45 | 46 | 45 | 49 | 55 | 59 | 45 | 47 | 50 |
| Inst. 4 | 58 | 60 | 59 | 60 | 72 | 66 | 72 | 62 | 68 | 72 |
| Inst. 5 | 73 | 72 | 75 | 77 | 79 | 75 | 81 | 90 | 75 | 81 |
| Inst. 6 | 89 | 89 | 89 | 89 | 89 | 89 | 89 | 89 | 89 | 90 |
| Inst. 7 | 87 | 87 | 87 | 87 | 87 | 87 | 87 | 87 | 87 | 87 |
| Inst. 8 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 |
Table 5 Basic information about the real-world data sets

| Data set | Sample size | Number of attributes | Attribute characteristics |
|---|---|---|---|
| Chwirut^a | 214 | 2 | Real |
| Motorcycle | 133 | 2 | Real |
| Servo^b | 167 | 5 | Categorical, real |
| Nelson^a | 128 | 3 | Integer, real |
| Boston Housing^b | 506 | 14 | Categorical, integer, real |
| Auto MPG^c | 392 | 8 | Categorical, real |
| Bodyfat^c | 252 | 15 | Real |
| Triazines^c | 186 | 60 | Categorical, real |
| Pollution^d | 60 | 16 | Integer, real |
| Enso^a | 168 | 2 | Real |
| Gauss3^a | 250 | 2 | Integer, real |
| Heart disease^b | 400 | 4 | Integer, real |
| Balloon^d | 2,001 | 2 | Real |
| Crabs^d | 200 | 7 | Categorical, integer, real |
| Compass^e | 108 | 3 | Integer, real |
| Bolts^e | 40 | 8 | Integer, real |

a NIST statistical reference datasets. http://www.itl.nist.gov/div898/strd/nls/nls_main.shtml
b UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/datasets.html
c LIBSVM data. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
d StatLib datasets. http://stat.cmu.edu/datasets/
e Datasets for statistical analysis. http://www.sci.usq.edu.au/staff/dunn/Datasets/index.html
and the RWLS-SVM. Following Demsar (2006), let $d_i$ be the
difference between the performance scores of these two
algorithms on the $i$th of $N$ data sets, and rank the differences
according to their absolute values (average ranks
are assigned in case of ties). Let $R^+$ be the sum of ranks for
the data sets on which the second algorithm outperformed
the first, and $R^-$ the sum of ranks for the opposite. Ranks
of $d_i = 0$ are split evenly among the sums. That is,

$$R^+ = \sum_{d_i > 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i), \qquad R^- = \sum_{d_i < 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i)$$

Let $T = \min(R^+, R^-)$; then the statistic

$$z = \frac{T - \frac{1}{4} N (N + 1)}{\sqrt{\frac{1}{24} N (N + 1)(2N + 1)}}$$

is distributed approximately normally.
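A sketch of this test computed directly from the two score vectors; scipy's average ranking handles ties as described, and the function name is ours:

```python
import numpy as np
from scipy.stats import rankdata

def wilcoxon_z(scores_first, scores_second):
    """Wilcoxon signed-ranks z statistic as defined above (Demsar 2006)."""
    d = np.asarray(scores_second) - np.asarray(scores_first)
    ranks = rankdata(np.abs(d))           # average ranks in case of ties
    r_plus = ranks[d > 0].sum() + 0.5 * ranks[d == 0].sum()
    r_minus = ranks[d < 0].sum() + 0.5 * ranks[d == 0].sum()
    T = min(r_plus, r_minus)
    N = len(d)
    return (T - N * (N + 1) / 4) / np.sqrt(N * (N + 1) * (2 * N + 1) / 24)
```

With $N = 16$ and $T = 22.5$ this yields $z \approx -2.35$, matching the value computed below.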
The differences and ranks of $R^2$ on each data set are
recorded in Table 7, and the corresponding differences and
ranks of computation time are recorded in Table 8. From
Table 7, we have

$$T_{R^2} = \min(R^+, R^-) = R^- = 9 + 13 + 0.5 = 22.5$$
Table 6 Results on the real-world instances

| Data set | LS-SVM $R^2$ | LS-SVM Time | WLS-SVM $R^2$ | WLS-SVM Time | RWLS-SVM $R^2$ | RWLS-SVM Time (iter.) | ROELS-SVM $R^2$ | ROELS-SVM Time (elm.) |
|---|---|---|---|---|---|---|---|---|
| Chwirut | 0.9732 | 0.30 | 0.9819 | 0.36 | 0.9821 | 0.47 (3.6) | **0.9823** | 0.46 (15.8) |
| Motorcycle | 0.7007 | 0.04 | 0.7248 | 0.08 | 0.7398 | 0.20 (7.6) | **0.7489** | 0.18 (13.3) |
| Servo | 0.3566 | 0.08 | 0.7172 | 0.10 | **0.8697** | 0.35 (9.7) | 0.8619 | 0.16 (26.8) |
| Nelson | 0.6621 | 0.06 | 0.6996 | 0.08 | **0.7154** | 0.19 (6.9) | 0.6655 | 0.12 (11.2) |
| Boston Housing | 0.8888 | 3.51 | 0.8968 | 4.74 | 0.8981 | 12.5 (12.8) | **0.9050** | 6.09 (22.3) |
| Auto MPG | 0.9011 | 0.89 | 0.9038 | 1.05 | 0.9076 | 2.68 (14.3) | **0.9123** | 1.60 (23.0) |
| Bodyfat | 0.9994 | 0.38 | 0.9995 | 0.77 | **0.9998** | 3.69 (15.6) | **0.9998** | 1.06 (10.9) |
| Triazines scale | 0.1420 | 0.33 | 0.3791 | 0.49 | 0.4048 | 1.68 (9.7) | **0.4111** | 0.56 (7.4) |
| Pollution scale | 0.3172 | 0.03 | 0.4495 | 0.04 | 0.4774 | 0.09 (5.2) | **0.4820** | 0.07 (5.8) |
| Enso | 0.1421 | 0.10 | 0.1423 | 0.14 | 0.1423 | 0.14 (1.0) | **0.2703** | 0.19 (9.9) |
| Gauss3 | 0.9980 | 0.41 | 0.9983 | 0.43 | 0.9984 | 0.61 (8.7) | **0.9985** | 0.61 (3.9) |
| Heart disease | -3.500 | 0.88 | 0.0369 | 1.21 | 0.5045 | 3.12 (25) | **0.6677** | 1.98 (133.3) |
| Balloon | 0.1318 | 321.83 | 0.1506 | 350.29 | 0.2337 | 381.70 (65) | **0.8025** | 424.28 (711) |
| Crabs | 0.9886 | 0.16 | 0.9894 | 0.21 | 0.9897 | 0.67 (10.9) | **0.9899** | 0.29 (7.6) |
| Compass | 0.7253 | 0.05 | 0.8673 | 0.06 | 0.9204 | 0.11 (5.7) | **0.9610** | 0.09 (35.6) |
| Bolts | 0.8208 | 0.01 | 0.8229 | 0.02 | 0.8312 | 0.04 (3.4) | **0.8605** | 0.03 (2.1) |

Iter.: the number of iterations needed by the RWLS-SVM; elm.: the total number of outliers eliminated by the ROELS-SVM
The bold values indicate the best $R^2$ obtained among the compared algorithms
Table 7 Comparison of $R^2$ for RWLS-SVM and ROELS-SVM

| Data set | RWLS-SVM $R^2$ | ROELS-SVM $R^2$ | Difference ($R^2_{ROE} - R^2_{RW}$) | Rank |
|---|---|---|---|---|
| Chwirut | 0.9821 | **0.9823** | 0.0002 | 3.5 |
| Motorcycle | 0.7398 | **0.7489** | 0.0091 | 10 |
| Servo | *0.8697* | 0.8619 | -0.0078 | 9 |
| Nelson | *0.7154* | 0.6655 | -0.0499 | 13 |
| Boston Housing | 0.8981 | **0.9050** | 0.0069 | 8 |
| Auto MPG | 0.9076 | **0.9123** | 0.0047 | 6 |
| Bodyfat | **0.9998** | **0.9998** | 0.0000 | 1 |
| Triazines scale | 0.4048 | **0.4111** | 0.0063 | 7 |
| Pollution scale | 0.4774 | **0.4820** | 0.0046 | 5 |
| Enso | 0.1423 | **0.2703** | 0.1280 | 14 |
| Gauss3 | 0.9984 | **0.9985** | 0.0001 | 2 |
| Heart disease | 0.5045 | **0.6677** | 0.1632 | 15 |
| Balloon | 0.2337 | **0.8025** | 0.5688 | 16 |
| Crabs | 0.9897 | **0.9899** | 0.0002 | 3.5 |
| Compass | 0.9204 | **0.9610** | 0.0406 | 12 |
| Bolts | 0.8312 | **0.8605** | 0.0293 | 11 |

The bold values indicate the best $R^2$ obtained by one of the two algorithms
The italic values indicate the cases in which the RWLS-SVM outperforms the ROELS-SVM
Since $N = 16$,

$$z_{R^2} = \frac{T - \frac{1}{4} N (N + 1)}{\sqrt{\frac{1}{24} N (N + 1)(2N + 1)}} = -2.35 < -1.96$$

Similarly, from Table 8, we have

$$T_{Time} = \min(R^+, R^-) = R^- = 6.5 + 3 + 0.5 = 10$$

Therefore,

$$z_{Time} = \frac{T - \frac{1}{4} N (N + 1)}{\sqrt{\frac{1}{24} N (N + 1)(2N + 1)}} = -3.00 < -1.96$$

Therefore, with $\alpha = 0.05$, the null hypothesis that both
algorithms perform equally well can be rejected. That is,
we may claim that the ROELS-SVM performs remarkably
better than the RWLS-SVM both in test accuracy and
in computation time.
5 Conclusions
To achieve robust estimation in noisy environments, the
ROELS-SVM algorithm is proposed in this paper. It
implements a quantitative criterion derived from LTS, a
method for robust linear regression, for eliminating outliers
within a recursive framework. A decremental learning strategy
is introduced into the LS-SVM retraining stage, which is
fast and effective. Experiments demonstrate that the
ROELS-SVM eliminates the great majority of the outliers and
obtains robust estimation in most cases.
Traditional robust algorithms for LS-SVM focus on
reducing the influence of a bundle of potential outliers within
a single iteration and, therefore, tend to make mistakes
when identifying outliers very close to the normal samples.
The ROELS-SVM works in a different way: it does
not judge many outliers within one iteration, but just
a few, for example, two or three outliers per
iteration; it eliminates them, updates the training results and
then comes into the next iteration. This is different from the
RWLS-SVM and makes it more likely for the ROELS-SVM
to adjust the training results in the right
direction.
More interestingly, the ROELS-SVM can be viewed as a
back-propagation procedure for obtaining an uncontaminated
set of samples for LS-SVM regression, which can be
used as an initial pruning step for sparsification. Besides,
applications of the ROELS-SVM in the field of outlier
detection would also be very interesting. Future work can
concentrate on these points.
Table 8 Comparison of the computation time of RWLS-SVM and ROELS-SVM

| Data set | T(LS-SVM) | T(RWLS-SVM) | T(ROELS-SVM) | Difference (T_ROE − T_RW)/T_LS-SVM | Rank |
|---|---|---|---|---|---|
| Chwirut | 0.30 | 0.47 | **0.46** | -0.03 | 2 |
| Motorcycle | 0.04 | 0.20 | **0.18** | -0.50 | 6.5 |
| Servo | 0.08 | 0.35 | **0.16** | -2.38 | 13.5 |
| Nelson | 0.06 | 0.19 | **0.12** | -1.17 | 9 |
| Boston Housing | 3.51 | 12.5 | **6.09** | -1.83 | 12 |
| Auto MPG | 0.89 | 2.68 | **1.60** | -1.21 | 10 |
| Bodyfat | 0.38 | 3.69 | **1.06** | -6.92 | 16 |
| Triazines scale | 0.33 | 1.68 | **0.56** | -3.39 | 15 |
| Pollution scale | 0.03 | 0.08 | **0.07** | -0.33 | 4 |
| Enso | 0.10 | **0.14** | 0.19 | *0.50* | 6.5 |
| Gauss3 | 0.41 | **0.61** | **0.61** | 0.00 | 1 |
| Heart disease | 0.88 | 3.12 | **1.98** | -1.30 | 11 |
| Balloon | 321.83 | **381.7** | 424.28 | *0.13* | 3 |
| Crabs | 0.16 | 0.67 | **0.29** | -2.38 | 13.5 |
| Compass | 0.05 | 0.11 | **0.09** | -0.40 | 5 |
| Bolts | 0.01 | 0.04 | **0.03** | -1.00 | 8 |

Since the computation time is influenced by the sample size, the difference of computation time is calculated by (T_ROE − T_RW)/T_LS-SVM. This avoids unfair ranks caused by the influence of sample size
The bold values indicate the lesser computation time of the RWLS-SVM and the ROELS-SVM
The italic values highlight the positive differences; in these cases the RWLS-SVM outperforms the ROELS-SVM
References
Brabanter JD (2004) LS-SVM regression modelling and its applica-
tions. Ph.D. thesis. ftp://ftp.esat.kuleuven.ac.be/pub/SISTA//
debrabant
Burges CJC (1998) A tutorial on support vector machines for pattern
recognition. Data Min Knowl Discov 2(2):955–974
Cao LJ, Tay FEH (2003) Support vector machine with adaptive
parameters in financial time series forecasting. IEEE Trans
Neural Netw 14(6):1506–1518
Cawley GC, Talbot NLC (2004) Fast exact leave-one-out cross-
validation of sparse least-squares support vector machines.
Neural Netw 17:1467–1475
Chuang CC, Su SF, Jeng JT, Hsiao CC (2002) Robust support vector
regression networks for function approximation with outliers.
IEEE Trans Neural Netw 13(6):1322–1330
Cortes C (1993) Prediction of generalization ability in learning
machines. Ph.D. thesis. http://homepage.mac.com/corinnacortes/
Demsar J (2006) Statistical comparisons of classifiers over multiple
data sets. J Mach Learn Res 7:1–30
Evgeniou T, Pontil M, Poggio T (2000) Regularization networks and
support vector machines. Adv Comput Math 13:1–50
Grubbs FE (1969) Procedures for detecting outlying observations in
samples. Technometrics 11(1):1–21
Jiang JQ, Song CY, Wu CG, Maurizio M, Liang YC (2006) Support
vector machine regression algorithm based on chunking incre-
mental learning. In: Proceedings of ICCS’06. Lecture notes in
computer science, vol 3991. Springer, Berlin, pp 547–554
Kvalseth TO (1985) Cautionary note about R2. Am Stat 39(4):279–285
Mangasarian OL, Musicant DR (2000) Robust linear and support
vector regression. IEEE Trans Pattern Anal Mach Intell
22(9):950–955
Rousseeuw PJ (1984) Least median of squares regression. J Am Stat
Assoc 79:871–880
Rousseeuw PJ, Driessen KV (2006) Computing LTS Regression for
large data sets. Data Min Knowl Discov 12:29–45
Rousseeuw PJ, Leroy A (1987) Robust regression and outlier
detection. Wiley, New York, pp 9–11
Scholkopf B, Sung KK, Burges CJC, Girosi F, Niyogi P, Poggio T,
Vapnik V (1997) Comparing support vector machines with
Gaussian kernels to radial basis function classifiers. IEEE Trans
Signal Process 45(11):2758–2765
Smola AJ, Scholkopf B (1998) A tutorial on support vector
regression. NeuroCOLT2 Technical Report NC2-TR-1998-030
Suykens JAK, Vandewalle J (1999) Least squares support vector
machine classifiers. Neural Process Lett 9:293–300
Suykens JAK, Brabanter JD, Lukas L, Vandewalle J (2002) Weighted
least squares support vector machines: robustness and sparse
approximation. Neurocomputing 48:85–105
Tian SF, Huang HK (2002) A simplification algorithm to support
vector machines for regression. J Softw 13(6):1169–1172
Vapnik V (1995) The nature of statistical learning theory. Wiley, New
York
Wen W, Hao ZF, Yang XW (2008) A heuristic weight-setting strategy
and iteratively updating algorithm for weighted least-squares
support vector regression. Neurocomputing 71(16–18):3096–
3103
Wu CH (2004) Travel-time prediction with support vector regression.
IEEE Trans Intell Transp Syst 5(4):276–281
Zhang JS, Gao G (2005) Reweighted robust support vector regression
method. Chin J Comput Sci 28(7):1171–1177
Zhao Y, Keong KC (2004) Fast leave-one-out evaluation and
improvement on inference for LS-SVMs. In: Proceedings of
ICPR'04, vol 3, pp 494–497