ORIGINAL PAPER
Robust least squares support vector machine based on recursive outlier elimination
Wen Wen • Zhifeng Hao • Xiaowei Yang
Published online: 9 December 2009
© Springer-Verlag 2009
Abstract To achieve robust estimation for noisy data sets,
a recursive outlier elimination-based least squares support
vector machine (ROELS-SVM) algorithm is proposed in
this paper. In this algorithm, statistical information from
the error variables of least squares support vector machine
is recursively learned and a criterion derived from robust
linear regression is employed for outlier elimination.
Besides, decremental learning technique is implemented in
the recursive training–eliminating stage, which ensures that
the outliers are eliminated with low computational cost.
The proposed algorithm is compared with re-weighted least
squares support vector machine on multiple data sets and
the results demonstrate the remarkably robust performance
of the ROELS-SVM.
Keywords Least squares · Support vector machines · Regression · Robust estimation · Outliers
1 Introduction
Support vector machine (SVM) is a supervised learning
method developed by Vapnik (1995) and Cortes (1993),
with a solid foundation in statistical learning theory. It
can be viewed as a particular version of artificial neural
networks, one which strikes a balance between the structural
risk and the empirical risk to obtain robust classification or
function estimation. It has been widely used in solving
classification problems, and a series of good results have
been reported (Scholkopf et al. 1997; Burges 1998).
In the field of nonlinear function estimation, SVM
exhibits its powerful ability as well. Similar to SVM classification,
SVM regression, i.e., SVM for nonlinear function
approximation, includes both the empirical risk and the structural
risk in its minimization objective. Yet a different
description of ‘‘empirical risk’’ is employed in SVM
regression. Loss functions, such as the epsilon-insensitive loss
function, the Laplacian loss function, Huber’s robust loss function,
etc., are used for describing the empirical risk in SVM
regression (Smola and Scholkopf 1998; Mangasarian and
Musicant 2000). They are derived from robust statistics and
play the role of preserving the robustness and sparseness of SVM
regression. Besides, researchers have found that SVM regression
has a strong connection with regularization networks, and to
some extent, both SVM regression and the regularization
network can be interpreted under the framework of Bayes’
theorem (Evgeniou et al. 2000). Now SVM regression has
been widely used in real-world problems, such as time series
prediction (Cao and Tay 2003) and function approximation
(Wu 2004), and the results are quite satisfactory.
However, outliers [‘‘An outlying observation, or outlier,
is one that appears to deviate markedly from other mem-
bers of the sample in which it occurs’’ (Grubbs 1969)]
strongly affect the performance of SVM regression. For
example, least squares support vector machine (LS-SVM),
a modified version of SVM (Suykens and Vandewalle
1999), employs the sum of squared errors (SSE) as the loss
function and converts the inequality constraints in classical
SVM into equality ones. Fast training speed and excellent
learning results have been reported on LS-SVM regression for
noiseless data (Jiang et al. 2006). However, when there are
W. Wen (✉) · Z. Hao
School of Computer, Guangdong University of Technology,
510006 Guangzhou, China
e-mail: [email protected]; [email protected]
X. Yang
School of Mathematical Science,
South China University of Technology,
510641 Guangzhou, China
Soft Comput (2010) 14:1241–1251
DOI 10.1007/s00500-009-0535-9
outliers in the training samples, estimation will be distorted
and the accuracy of LS-SVM regression is largely reduced
(Suykens et al. 2002). Such problems also exist in other
versions of SVM (Chuang et al. 2002). Some researchers
have realized this problem and they have done some work
to deal with it. The solutions can be generally classified
into three categories. One is to directly eliminate the
samples that are probably outliers (Tian and Huang 2002).
In this method, SVM regression is first trained using all
of the samples. Then a given threshold is set to determine
which samples are outliers and which are not. Finally,
constraints in SVM regression are modified so that the
‘‘outliers’’ are eliminated in the successive training proce-
dure. Negative impacts of outliers are eliminated when the
threshold is chosen appropriately. But unfortunately, it is
not easy to find such an ‘‘appropriate’’ threshold unless
there is sufficient prior knowledge of the training samples.
Another is to softly eliminate the outliers; namely, outliers
are not directly eliminated, but are assigned appropriately small
weights. The small weights on the outliers reduce their
negative influence to the training procedure of SVM
(Suykens et al. 2002, 2003; Zhang and Gao 2005). But how
to assign the weights is actually a difficult problem, espe-
cially when there are many noises and outliers in the
samples. Besides, it is difficult to know what exact effects
the small weights will bring about; we just qualitatively
know that their negative influences are partly reduced.
Brabanter (2004) has done some valuable work on this issue.
But whether the ‘‘soft’’ elimination is enough for a correct
estimation still remains unknown. The last category is to
use a robust back-propagation procedure. This approach,
proposed by Chuang et al. (2002), is named the robust
support vector regression (RSVR) network. It consists of
two steps. In the first step, a traditional SVM regression
network is trained; in the second step, robust back propa-
gation procedure is employed to adjust the weights. This
method is effective to some extent, yet it is highly
dependent on the performance of the training results of
traditional SVM regression. That is, if the traditional SVM
regression obtains very bad results, the back propagation
will not produce good results.
From the statistical perspective, traditional linear
regression, which employs the SSE loss function, is also sensitive
to outliers. The breakdown point of linear regression
is equal to 1/n (n is the number of samples) (Rousseeuw
and Leroy 1987). This means that even a single outlier has
the potential to completely distort the regression line. To
solve this problem, Rousseeuw (1984) proposed to use
the median of squared residuals as the loss function instead of SSE.
The estimator produced by this method is named the least median
of squares (LMS) estimator, and it is claimed that LMS
achieves a breakdown point of about 1/2 (namely, the
regression line will not be distorted unless the number of
outliers exceeds half of the total number of samples).
Furthermore, Rousseeuw and Driessen (2006) proposed the
least trimmed squares (LTS) estimator, a linear
estimator that minimizes the sum of the h smallest squared
residuals. According to Rousseeuw and Driessen (2006),
LTS is less sensitive to local effects than LMS and produces
better statistical efficiency.
Enlightened by the LTS estimator, we propose a
recursive outlier-eliminating algorithm. To avoid the subjective
determination of a threshold as in Tian and
Huang (2002), a quantitative criterion based on the idea of
LTS is implemented to discriminate between samples that can be
eliminated and those that cannot. The proposed algorithm is
employed in the framework of LS-SVM, and a decremental
learning technique is introduced to accelerate the computation.
Experimental results show that this algorithm
can correctly eliminate the outliers at a very low computational
cost.
The rest of this paper is organized as follows. Section 2
provides a brief review of LS-SVM regression and the
weighted LS-SVM regression. In Sect. 3, we first present
the recursive outlier elimination-based least squares sup-
port vector machine (ROELS-SVM) for robust regression
and then introduce the decremental learning strategy to the
recursive training–eliminating stage for computational
complexity reduction. Experimental results from simulated
instances and real-world data sets are presented in Sect. 4.
Finally, some concluding remarks are given in Sect. 5.
2 Brief review of LS-SVM regression and weighted
LS-SVM regression
Given a training set of N samples $\{(x_i, y_i)\}_{i=1}^{N}$ with input
features $x_i \in \mathbb{R}^d$ and output value $y_i \in \mathbb{R}$, the standard
LS-SVM regression can be formulated as the following
optimization problem in the primal space (Suykens and
Vandewalle 1999):

$$\min\; J(w, e) = \frac{1}{2} w^T w + \frac{1}{2}\gamma \sum_{i=1}^{N} e_i^2 \quad \text{s.t.}\; y_i = w^T \varphi(x_i) + b + e_i,\; i = 1, \ldots, N \qquad (1)$$

In formula (1), $\varphi(\cdot): \mathbb{R}^d \to \mathbb{R}^{\tilde{d}}$ is a function which maps
the input space into a higher-dimensional feature space,
$w \in \mathbb{R}^{\tilde{d}}$ is the weight vector in the primal weight space, $b$ is the
bias term and $e_i\; (i = 1, \ldots, N)$ are error variables.
Using Lagrange multipliers and matrix transformation,
optimization problem (1) can be converted to a set of linear
equations in the dual space, as in formula (2):

$$\begin{bmatrix} 0 & 1_v^T \\ 1_v & \Omega + \frac{1}{\gamma} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix} \qquad (2)$$
where $y = [y_1, \ldots, y_N]^T$, $1_v = [1, \ldots, 1]^T$, $\alpha = [\alpha_1, \ldots, \alpha_N]^T$
and $\Omega_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j)$ for $i = 1, \ldots, N$,
$j = 1, \ldots, N$. $K$ is the kernel function, for example, a linear
kernel, a polynomial kernel or an RBF kernel. This implies that
the training procedure of LS-SVM regression just requires
solving a linear system.
In the primal weight space one has the model for the estimation
of the output value $y$:

$$y(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b \qquad (3)$$
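As a concrete illustration, the following is a minimal NumPy sketch of this training procedure: it assembles and solves the dual system (2) with an RBF kernel and evaluates model (3). It is not the authors' code; the function names (`lssvm_train`, `lssvm_predict`) and the dense-solver approach are our assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    # K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_train(X, y, gamma, sigma):
    """Solve the dual linear system (2) for (alpha, b)."""
    N = len(y)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                                   # 1_v^T
    A[1:, 0] = 1.0                                   # 1_v
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[1:], sol[0]                           # alpha, b

def lssvm_predict(X_train, alpha, b, X_new, sigma):
    # model (3): y(x) = sum_i alpha_i K(x, x_i) + b
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b
```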
In LS-SVM, the noisier a sample is, the larger $\alpha_i$ it will
have. When there are outliers that deviate very far from the
regression curve, the large values of $\alpha$ produce an
overwhelming force that distorts the regression curve [see
formula (3)]. In order to improve the robustness of LS-SVM,
Suykens et al. (2002) proposed a modified version of LS-SVM,
named the weighted LS-SVM. The optimization problem of the
weighted LS-SVM can be formulated as follows:

$$\min\; J^*(w^*, e^*) = \frac{1}{2}\|w^*\|^2 + \frac{1}{2} C \sum_{i=1}^{N} v_i e_i^{*2} \quad \text{s.t.}\; y_i = w^{*T} \varphi(x_i) + b^* + e_i^*,\; i = 1, \ldots, N \qquad (4)$$

where $v_i$ is determined by the following formula:

$$v_k = \begin{cases} 1 & \text{if } |e_i/s| \le c_1, \\ \dfrac{c_2 - |e_i/s|}{c_2 - c_1} & \text{if } c_1 \le |e_i/s| \le c_2, \\ 10^{-4} & \text{otherwise.} \end{cases} \qquad (5)$$
$s$ can be given by formulas (6) and (7):

$$s = \frac{\mathrm{IQR}}{2 \times 0.6745} \qquad (6)$$

where IQR stands for the interquartile range, i.e., the
difference between the 75th percentile and the 25th percentile.
Or

$$s = 1.483\, \mathrm{MAD}(x_i) \qquad (7)$$

where $\mathrm{MAD}(x_i)$ stands for the median absolute deviation.
According to Suykens et al. (2002), the procedures (4)
and (5) can be repeated iteratively, but in practice one single
additional weighted LS-SVM step will often be sufficient.
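A small sketch of this weighting step, assuming the error variables `e` of a trained LS-SVM are available and using the IQR-based scale (6); the cut-off values `c1 = 2.5` and `c2 = 3.0` are common choices from the robust-statistics literature, used here as assumptions rather than values fixed by this paper:

```python
import numpy as np

def robust_scale(e):
    # formula (6): s = IQR / (2 * 0.6745)
    q75, q25 = np.percentile(e, [75, 25])
    return (q75 - q25) / (2 * 0.6745)

def wlssvm_weights(e, c1=2.5, c2=3.0):
    """Weights v_k of formula (5), computed from the error variables e."""
    r = np.abs(e) / robust_scale(e)
    v = np.ones_like(r)
    mid = (r > c1) & (r <= c2)
    v[mid] = (c2 - r[mid]) / (c2 - c1)
    v[r > c2] = 1e-4          # practically eliminates the sample
    return v
```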
3 Recursive outlier elimination-based LS-SVM
regression
3.1 The recursive training–eliminating procedure
for robust regression
The major difference between LS-SVM and weighted
LS-SVM is that weighted LS-SVM has weights on the
error variables. For the weighted LS-SVM, the first step is
training LS-SVM to obtain $e_i$ and $\alpha_i$ $(i = 1, \ldots, N)$. Then
weights $v_i$ $(i = 1, \ldots, N)$ are calculated according to the
statistical distribution of the error variables $e_i$ $(i = 1, \ldots, N)$:
samples with large error variables are set small weights.
In particular, if the error variable exceeds a given statistical
value, i.e., (6) or (7), the corresponding sample is set a
very small weight ($10^{-4}$), which is in fact equivalent to
eliminating this sample. Finally, the weighted LS-SVM is
trained to obtain $e_i^*$, $\alpha_i^*$ and $b^*$. This method is effective to
deal with the majority of data sets containing a handful of
outliers. However, it produces unsatisfying results when
dealing with data sets containing a lot of outliers, espe-
cially when the distribution of outliers is highly different
from a Gaussian distribution. Figure 1 illustrates an experimental
result on a noisy data set. The data set contains 196
samples; 45 of them are outliers and the rest are ‘‘clean’’
samples generated by a sine function. As shown in Fig. 1,
the seriously outlying samples, i.e., the samples labeled with round
circles and triangles, are weighted with very small values
($10^{-4}$) or values between $10^{-4}$ and 1.0. This implies the
weight-setting procedure is able to detect samples that are
seriously outlying, which demonstrates that error variables
in LS-SVM can be used as a scale of the outlying degree of
a sample. However, it also reveals the negative side of the
weighted LS-SVM. For samples that are not so seriously
‘‘outlying’’, the weighted LS-SVM has quite limited ability
to detect them: about one-third of the outliers, which are
relatively near to the noiseless samples, are still weighted
1.0; they are harmful enough to influence the correct
estimation.
This naturally leads to a question: is there any improved
method that will provide better discrimination between the
Fig. 1 Results of the weighted LS-SVM on a noisy data set mixed
with approximately 20% outliers. Samples labeled by circles are
weighted $10^{-4}$ and triangles are weighted between $10^{-4}$ and 1.0
outliers and the useful samples? The results of the weighted
LS-SVM imply that the error variables of the LS-SVM do
provide useful information on the outlying degree of
samples (this is exactly what the weighted LS-SVM uses for
finding samples that are very outlying). If the outlying samples
can be correctly eliminated, training the LS-SVM on
the remaining samples will certainly produce an estimation
closer to the real curve and, furthermore, provide more correct
information indicating the outlying degree of the remaining
samples. Therefore, a closed-loop procedure, as shown in Fig. 2,
is implemented to detect outliers, eliminate them and meanwhile
correct the estimation of LS-SVM. We name this algorithm the
ROELS-SVM.
In the ROELS-SVM, LS-SVM is trained using all of the
samples in the initial stage, and statistical information from
the error variables is used for eliminating samples that are
particularly outlying. Then the remaining samples are used for
the next LS-SVM training, and the error variables are analyzed
for further outlier elimination. The loop continues until a
given condition is reached. However, two critical problems
should be carefully considered before this algorithm
becomes feasible. One is which samples should be eliminated
in each loop. The other is when the training–eliminating
procedure can be stopped.
For the first problem, enlightened by the LTS regression
in Rousseeuw and Leroy (1987) and Rousseeuw (1984), we
set an eliminating criterion as follows:

$$Q_1 = \sum_{i=1}^{h} \left| e_{(i:N_l)} \right| \qquad (8)$$

where $N_l$ is the number of training samples in loop $l$,
$|e_{(1:N_l)}| \le |e_{(2:N_l)}| \le \cdots \le |e_{(N_l:N_l)}|$ are the ordered absolute
residuals (i.e., absolute values of the error variables), and
$h < N_l$ is an adaptive parameter, which excludes the largest
error variables from the summation in each loop. After LS-SVM
is trained, the samples are ordered according to their
absolute residuals. The $n$ samples with the largest absolute
residuals are eliminated while the remaining $h$ samples are kept
in the training data set. The preceding $h$ smallest absolute
residuals are summed to produce $Q_1$. Here, $n$ and $h$ satisfy
$n + h = N_l$.
In the successive loop, LS-SVM is retrained on the $h$
samples, leading to updated error variables $e_i'$ $(i = 1, 2, \ldots, h)$.
$Q_2$ is calculated according to formula (9):

$$Q_2 = \sum_{i=1}^{h} \left| e_i' \right| \qquad (9)$$
Since $e_i$ reflects the outlying degree of sample $i$,
eliminating the samples with the largest error variables is
equivalent to eliminating the samples that are most outlying.
Before these outlying samples are eliminated, the estimation
will be distorted by them, therefore producing relatively
large $|e_{(i:N_l)}|$ for the ‘‘clean’’ samples. When some of the
outliers, especially the most outlying ones, are eliminated,
the estimation will be corrected. So the corresponding $|e_i'|$ for the
‘‘clean’’ samples will be reduced, thus making the value of
$Q_2$ smaller than that of $Q_1$.
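In code, this criterion amounts to comparing trimmed residual sums before and after an elimination round; a minimal sketch (the function name is ours):

```python
import numpy as np

def trimmed_residual_sum(e, h):
    """Q of formulas (8)/(9): sum of the h smallest absolute residuals."""
    return np.sort(np.abs(e))[:h].sum()

# One round of the decision, assuming e_before holds the error variables
# on N_l samples and e_after those after retraining on the h kept samples:
#   Q1 = trimmed_residual_sum(e_before, h)
#   Q2 = trimmed_residual_sum(e_after, h)
#   keep eliminating while Q2 <= Q1; otherwise roll back and stop.
```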
Then when should the training–eliminating procedure be
stopped? After several rounds of elimination, all the outliers
will eventually be eliminated. If the procedure goes on, samples
will be overly eliminated, i.e., samples with useful information
are also eliminated. In this situation, the regression
accuracy decreases and the absolute error variables of the
‘‘clean’’ samples increase accordingly. Therefore, the value
of $Q_2$ becomes larger than that of $Q_1$. In this case, the
training–eliminating procedure should be stopped and the
algorithm should roll back to the result of the previous
elimination.
Different data sets contain different amounts of outliers.
Therefore, information from Suykens’ weighted LS-SVM is
used as a heuristic to determine the number of
samples that should be eliminated. To make sure every
step is cautious, we suggest that just a few samples
(usually equal to a small percentage of the outlier number
found by the weighted LS-SVM) be eliminated in each
loop. This helps to avoid a subjective selection of the
quantity of eliminated samples.
3.2 Speed up the learning procedure
Training the LS-SVM is a time-consuming task. Especially
in the proposed algorithm, it is not known in advance how many
loops will be repeated before the algorithm stops. The
LS-SVM retraining would become the overwhelmingly
time-consuming part of the whole algorithm. To accelerate the
algorithm, an iterative decremental learning algorithm,
similar to the methods proposed in Zhao and Keong (2004)
and Cawley and Talbot (2004), is implemented.
Fig. 2 Architecture of the ROELS-SVM algorithm: training samples → LS-SVM training → outlier elimination; while the given conditions are not met, the rest of the samples re-enter LS-SVM training, otherwise end

Denote $A = \begin{bmatrix} 0 & 1_v^T \\ 1_v & \Omega + \frac{1}{\gamma} I \end{bmatrix}$ in formula (2). When a
sample is eliminated, a certain row and column are
removed from $A$. Let $A_m$ be the corresponding matrix after
eliminating $m$ samples, and suppose it is the $k$th column
and the $k$th row of $A_m$ that are removed in the successive
elimination. Denote $A_m = \big(a^{(m)}_{ij}\big)$, $A_m^{-1} = \big(\tilde{a}^{(m)}_{ij}\big)$ and
$A_{m+1} = \big(a^{(m+1)}_{ij}\big)_{i,j \ne k}$. Then, according to the Sherman–Morrison–Woodbury formula,

$$\tilde{a}^{(m+1)}_{ij} = \tilde{a}^{(m)}_{ij} - \tilde{a}^{(m)}_{i,k}\, \tilde{a}^{(m)}_{k,j} \big/ \tilde{a}^{(m)}_{k,k}, \quad i, j = 1, \ldots, N - m,\; i, j \ne k \qquad (10)$$

A detailed derivation of formula (10) can be found in the
literature (Zhao and Keong 2004; Cawley and Talbot
2004).
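A sketch of this decremental update of the stored inverse, assuming `A_inv` holds the current $\tilde{A} = A_m^{-1}$ and `k` is the row/column to delete (the function name is ours):

```python
import numpy as np

def remove_sample(A_inv, k):
    """Apply formula (10): update A^{-1} for the deletion of row/column k,
    at O(N^2) cost instead of re-inverting at O(N^3)."""
    col, row, pivot = A_inv[:, k], A_inv[k, :], A_inv[k, k]
    updated = A_inv - np.outer(col, row) / pivot   # rank-one correction
    # drop the k-th row and column of the corrected inverse
    return np.delete(np.delete(updated, k, axis=0), k, axis=1)
```

Each call costs $O((N-m)^2)$, so eliminating $n$ samples in a loop costs $O(nN_l^2)$, in line with the complexity analysis that follows.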
Therefore, if we set $A_0 = A$, the retraining stage of
LS-SVM can be substituted by a decremental learning
procedure. That is, in each iteration, all the elements of
matrix $A$, except those corresponding to samples that have
already been eliminated, are updated according to formula
(10). Using such a decremental learning procedure, the
computational complexity is dramatically reduced, from
$O(N_l^3)$ to $O(n N_l^2)$ for each loop ($n$ is the number of samples
that are eliminated in each loop). Since $n \ll N_l$, $N_l < N$,
and the training–eliminating loops are certain to stop before
all the samples are eliminated, with the decremental learning
the proposed algorithm actually requires much less time than a
single extra training of LS-SVM (whose computational
complexity is $O(N^3)$). What is additionally needed is just the
storage of the inverse of matrix $A$.

Thus, the enhanced ROELS-SVM regression for noisy
data sets can be summarized as follows.
The enhanced ROELS-SVM algorithm:

1. Use tenfold cross-validation or leave-one-out validation to train LS-SVM and find the optimal hyperparameters $(\gamma, \sigma)$.
2. Under the optimal hyperparameters, use Suykens' weighted LS-SVM to find the samples whose weights equal $10^{-4}$, and let $m$ be the total number of these samples. Let $n = \max(\mathrm{int}(m \times p), 1)$, where $p$ is a given parameter; $n$ is a fixed number in the following steps.
3. Train LS-SVM using all of the samples under the optimal hyperparameters. Record the training results in $W_1 = \{\tilde{\alpha}, b\}$ and store the inverse of matrix $A$ in $\tilde{A}$.
4. Let $l = 1$ and $N_0 = N$, where $N$ is the total number of samples in the training data set. In the $l$th loop, carry out the following procedures:
   - 4.1. Back up $W_1$ in $W_2$, that is, let $W_2 = W_1$.
   - 4.2. Let $h = N_{l-1} - n$; order the samples according to the absolute values of their error variables (i.e., absolute residuals). Calculate $Q_1$ according to formula (8).
   - 4.3. Select the $n$ samples with the largest absolute residuals and record their indices in an index set $E_{idx}$.
   - 4.4. Sequentially pick a sample from $E_{idx}$, remove the corresponding column and row from $\tilde{A}$ and update the remaining elements of $\tilde{A}$ according to formula (10), until all samples in $E_{idx}$ have been eliminated. Calculate the training results and record them in $W_1 = \{\tilde{\alpha}', b'\}$. Then reset $E_{idx} = \emptyset$ and calculate $Q_2$ according to formula (9).
   - 4.5. If $Q_2 \le Q_1$, let $l = l + 1$, $N_l = h$ and go back to 4.1; else output the training results $W_2$.
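The loop of steps 4.1–4.5 can be sketched as follows, reusing `remove_sample` and `trimmed_residual_sum` from the sketches above. The sketch reads the error variables off the dual solution via the LS-SVM optimality condition $\alpha_i = \gamma e_i$; the bookkeeping of which original samples remain (needed for prediction) is omitted for brevity, and the function name is ours:

```python
import numpy as np

def roels_svm_loop(A_inv, y_aug, gamma, n):
    """Recursive training-eliminating loop (steps 4.1-4.5).
    A_inv: inverse of the full matrix A from (2); y_aug = [0, y_1..y_N]."""
    while True:
        sol = A_inv @ y_aug                   # b = sol[0], alpha = sol[1:]
        alpha, b = sol[1:], sol[0]
        e = alpha / gamma                     # error variables (alpha_i = gamma * e_i)
        h = len(e) - n
        Q1 = trimmed_residual_sum(e, h)       # formula (8)
        W2 = (alpha, b)                       # step 4.1: backup for rollback
        worst = np.argsort(np.abs(e))[-n:]    # step 4.3: n largest residuals
        for k in sorted(worst + 1, reverse=True):   # +1 skips the bias row
            A_inv = remove_sample(A_inv, k)   # step 4.4 via formula (10)
            y_aug = np.delete(y_aug, k)
        e_new = (A_inv @ y_aug)[1:] / gamma
        Q2 = trimmed_residual_sum(e_new, h)   # formula (9)
        if Q2 > Q1:
            return W2                         # step 4.5: roll back and stop
```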
To further reduce the computational complexity, step 3
can be omitted as long as the training results under the optimal
hyperparameters, i.e., $\tilde{\alpha}_{opt}$, $b_{opt}$ and the inverse of
matrix $A_{opt}$, have been recorded in the parameter-selection
stage (step 1). In this situation, the ROELS-SVM achieves
running speed comparable to the weighted LS-SVM, for
the former requires less than one extra training of LS-SVM
while the latter requires a whole training step of the
weighted LS-SVM.
Besides the inputs required by the LS-SVM, the
ROELS-SVM additionally requires the parameter p as input.
Intuitively, an appropriately small p means a
cautious eliminating step. And the last step of the ROELS-SVM
algorithm guarantees that whenever the data set
is over-eliminated, there is a chance to roll back and correct it.
4 Experiments
Two indexes are introduced to evaluate the performance of
the compared algorithms. One of them is the mean absolute
error (MAE), defined as (11):

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \qquad (11)$$
However, statisticians have noticed that MAE is
sensitive to outliers in the testing data set. That is, MAE
probably makes biased judgements when outliers are also
used as testing samples. Therefore, noiseless testing
samples are desirable when using MAE as the evaluation
criterion. To solve this problem, statisticians suggested
using another statistic (Kvalseth 1985; Rousseeuw and Leroy 1987),
which is more ‘‘resistant’’, namely more robust, for evaluating
regression models in noisy circumstances. The
‘‘resistant’’ statistic is defined as (12):

$$R^2 = 1 - \left( \frac{\mathrm{med} \left| y_i - \hat{y}_i \right|}{\mathrm{mad}(y_i)} \right)^2 \qquad (12)$$
Here, ‘‘mad’’ stands for the median absolute deviation,
defined as (13):

$$\mathrm{mad}(y_i) = \mathrm{med}_i \left| y_i - \mathrm{med}_j\, y_j \right| \qquad (13)$$

and $\mathrm{med}|y_i - \hat{y}_i|$ stands for the median of the absolute errors.
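Both indexes are straightforward to compute; a short sketch following definitions (11)–(13) (function names are ours):

```python
import numpy as np

def mae(y_true, y_pred):
    # formula (11)
    return np.mean(np.abs(y_true - y_pred))

def resistant_r2(y_true, y_pred):
    """Robust R^2 of formula (12), with mad(y) from formula (13)."""
    med_abs_err = np.median(np.abs(y_true - y_pred))
    mad_y = np.median(np.abs(y_true - np.median(y_true)))
    return 1.0 - (med_abs_err / mad_y) ** 2
```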
$R^2$ defined as (12) essentially excludes the outliers in the
testing data set and therefore has the robust property of
measuring how well the regression model fits the useful
samples. Generally, $0 \le R^2 \le 1$ for reasonable models;
$R^2 = 1$ corresponds to a perfect fit and $R^2 < 0$ corresponds to
a ‘‘bad’’ fit, which is also called the ‘‘breakdown point’’. When
the testing samples are noiseless, both MAE and $R^2$ can be
used to indicate the fitness degree of an algorithm. But
when it is impossible to have completely ‘‘clean’’ testing
samples, $R^2$ is the better choice for evaluating the regression
algorithm, for it makes a more objective and robust
judgement.
To make a fair comparison, four important models of
LS-SVM are taken into account: the classical LS-SVM, the
weighted LS-SVM (WLS-SVM), the ROELS-SVM, and a
recursive version of the weighted LS-SVM (RWLS-SVM).
In the RWLS-SVM, a reweighting and relearning procedure
of LS-SVM is implemented. The weighting formula
is the same as in the WLS-SVM (see Sect. 2) and the stop
condition is that the weights do not change for almost all
samples (a threshold $\Delta = 0.005$ is used to judge whether
a weight has changed). Besides, an iteratively updating
algorithm (Wen et al. 2008) is used for the training
of the RWLS-SVM; it is one of the fastest methods for
this purpose. In our experiments, both
the testing accuracy and the runtime of these four
algorithms are carefully investigated. All the
experiments are carried out on a PC with a 2.36 GHz CPU
and 1 GB of memory.
Besides, tenfold cross-validation is used to select the
hyperparameters $(\sigma, \gamma)$. In the tenfold cross-validation, the
data set is randomly divided into ten slices; each time,
nine slices are used as the training samples and the
rest are used as validation samples. Considering that we are
dealing with data sets with outliers, to avoid the negative
effects caused by outliers in the validation samples, we
use the median absolute error as the tuning criterion.
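A sketch of this tuning criterion, reusing the `lssvm_train`/`lssvm_predict` sketch from Sect. 2; the fold handling, seed, and function name are our assumptions:

```python
import numpy as np

def cv_median_abs_error(X, y, gamma, sigma, n_folds=10, seed=0):
    """Tenfold CV score: mean over folds of the median absolute
    validation error, which down-weights outliers in the validation set."""
    idx = np.random.default_rng(seed).permutation(len(y))
    scores = []
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        alpha, b = lssvm_train(X[train], y[train], gamma, sigma)
        pred = lssvm_predict(X[train], alpha, b, X[fold], sigma)
        scores.append(np.median(np.abs(y[fold] - pred)))
    return np.mean(scores)
```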
4.1 Simulation study
Eight simulated data sets, which contain different amount of
outliers, are investigated. Training data set consists of two
category samples: one are uncontaminated samples generated
by sinc function f ; the other are random outliers obeying uni-
form distribution. All of the eight instances contain 151
uncontaminated samples. But outliers are randomly distributed
across the input space (xoutlier 2 ½�15; 15�; youtlier 2 ½0; 3�) and
these instances contain different proportional outliers,
approximately ranging from 10 to 45%.
f ðxÞ ¼ sin x
xx 2 ½�15; 15� ð14Þ
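A sketch of how such a training set can be generated; the random seed and exact sample layout are our assumptions, and Instance 3 (45 outliers) is used as the example:

```python
import numpy as np

rng = np.random.default_rng(0)

x_clean = np.linspace(-15, 15, 151)
y_clean = np.sinc(x_clean / np.pi)     # np.sinc(t) = sin(pi t)/(pi t), so this is sin(x)/x

n_outliers = 45                         # e.g., Instance 3
x_out = rng.uniform(-15, 15, n_outliers)
y_out = rng.uniform(0, 3, n_outliers)

X = np.concatenate([x_clean, x_out]).reshape(-1, 1)
y = np.concatenate([y_clean, y_out])
```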
Testing samples are also generated by $f$, excluding the
uncontaminated samples in the training data set. There are
in total 197 samples in the testing data set. Experiments
show that under this criterion, the first five instances have
the same optimal hyperparameters $(1.0, 2.5)$, and the last
three data sets (Instances 6, 7 and 8) have optimal
hyperparameters $(2.5, 1.0)$. The testing results are listed in
Tables 1 and 2.
Results in Tables 1 and 2 show that LS-SVM is easily
influenced by the outliers: the testing errors drastically increase
as the outliers increase, and it breaks down when the outlier
amount reaches 30. The WLS-SVM partly reduces the
negative influence of the outliers, but the effect is quite
limited, especially when there is a large amount of outliers.
As for the RWLS-SVM, its results are much better than
those of the former two algorithms but slightly worse than those
of the ROELS-SVM. Both the RWLS-SVM and the ROELS-SVM
obtain good records of MAE. The testing errors
Table 1 Accuracy on the simulated data sets when $p = 0.1$

| Data set (no. of outliers) | LS-SVM MAE | WLS-SVM (a) | WLS-SVM (b) | WLS-SVM MAE | RWLS-SVM Iter. | RWLS-SVM MAE | ROELS-SVM Elm. | ROELS-SVM MAE |
|---|---|---|---|---|---|---|---|---|
| Inst. 1 (15 outliers) | 0.1215 | 12 | 4 | 0.0152 | 2 | 0.0118 | 15 | **0.0080** |
| Inst. 2 (30 outliers) | 0.1978 | 18 | 1 | 0.0278 | 3 | 0.0092 | 31 | **0.0083** |
| Inst. 3 (45 outliers) | 0.3191 | 25 | 4 | 0.0872 | 5 | 0.0089 | 45 | **0.0084** |
| Inst. 4 (60 outliers) | 0.3753 | 24 | 4 | 0.1333 | 7 | 0.0110 | 58 | **0.0078** |
| Inst. 5 (75 outliers) | 0.4564 | 27 | 5 | 0.1903 | 16 | 0.0820 | 73 | **0.0081** |
| Inst. 6 (90 outliers) | 0.5400 | 2 | 9 | 0.4817 | 11 | 0.3085 | 45 | **0.1522** |
| Inst. 7 (105 outliers) | 0.5726 | 1 | 8 | 0.5283 | 14 | 0.4113 | 56 | **0.1450** |
| Inst. 8 (120 outliers) | 0.6211 | 0 | 2 | 0.6207 | 14 | 0.5833 | 79 | **0.1000** |

(a), (b): the number of samples having weights equal to 1e−4, and the number having weights between 1e−4 and 1.0, in the weighted LS-SVM. Iter.: the number of reweighting iterations of the RWLS-SVM. Elm.: the total number of outliers eliminated by the ROELS-SVM. Notations are similar hereinafter.
The bold values indicate the best MAE obtained among the compared algorithms
remain quite small and very stable until the outlier number
increases from 90 to 105. The records of $R^2$ demonstrate the
same trends: the RWLS-SVM and the ROELS-SVM stably
achieve a good fit and have a very high ‘‘breakdown point’’.
This is a satisfactory result.
Besides, the testing accuracy of the ROELS-SVM and
the RWLS-SVM does not necessarily decrease as the outliers
increase. This is contrary to the other two methods. An
important reason might be that both the ROELS-SVM and
the RWLS-SVM implement a recursive learning structure
and can adjust the learning results during the training stage.
This makes it possible for some chance factors (for example,
the distribution of outliers), besides the number of outliers,
to influence the training results. But the ROELS-SVM
seems to have a more stable testing accuracy. This is
probably because the ROELS-SVM detects
the most suspicious outliers, eliminates them, corrects the
training result and then comes into the next loop. This makes it
possible for the ROELS-SVM to locate the critical outliers
one by one, and to keep the adjustments correct through the
retraining procedure. On the contrary, though the RWLS-SVM
also implements a recursive learning strategy, it
changes the weights of more than one sample in each loop.
This makes it difficult to locate the critical outliers, and
incorrect adjustments may occur during the recursive learning
stage, especially when there is a large amount of outliers in
the data set.
As for the runtime, though the fast training algorithm (Wen
et al. 2008) has been used for the RWLS-SVM, the results still
reveal that the ROELS-SVM has the more stable performance:
on almost all of the simulated data sets, the ROELS-SVM
requires less runtime than the RWLS-SVM.
Especially when the data sets contain a large amount of
outliers, the RWLS-SVM tends to require unexpectedly many
iterations, which cost it more time.
In the ROELS-SVM, an additional parameter $p$ is
introduced, which decides how many samples are eliminated
in each loop. We change $p$ from 0.1 to 1.0 with a
uniform step and investigate the changes in the testing results.
The results are recorded in Tables 3 and 4, in which
$(\sigma, \gamma) = (1.0, 2.5)$. As shown in these two tables, different
values of $p$ hardly influence the testing MAE and just
slightly affect the number of eliminated samples. This may be
explained by the fact that the ROELS-SVM uses a fairly
mild pruning strategy and the elimination procedure
stops once the result becomes worse. However, the detailed
results show that small values of $p$ $(p \le 0.4)$ bring relatively
stable and accurate results for all of the simulated
instances. This is because a smaller $p$ means a more cautious
eliminating strategy. Here, we have to point out that until
Table 2 $R^2$ and runtime on the simulated data sets when $p = 0.1$

| Data set | LS-SVM $R^2$ | LS-SVM Time | WLS-SVM $R^2$ | WLS-SVM Time | RWLS-SVM $R^2$ | RWLS-SVM Time | ROELS-SVM $R^2$ | ROELS-SVM Time |
|---|---|---|---|---|---|---|---|---|
| Inst. 1 | 0.4253 | 0.19 | 0.9924 | 0.25 | 0.9937 | 0.30 | **0.9961** | 0.29 |
| Inst. 2 | -2.600 | 0.22 | 0.9644 | 0.35 | 0.9958 | 0.59 | **0.9961** | 0.58 |
| Inst. 3 | -9.640 | 0.28 | 0.6030 | 0.42 | **0.9960** | 0.83 | **0.9960** | 0.66 |
| Inst. 4 | -15.14 | 0.35 | 0.0067 | 0.51 | 0.9918 | 1.08 | **0.9963** | 0.78 |
| Inst. 5 | -26.82 | 0.39 | -2.636 | 0.59 | 0.3936 | 1.27 | **0.9961** | 0.89 |
| Inst. 6 | -32.31 | 0.46 | -16.32 | 0.64 | -7.462 | 0.87 | **-0.667** | 0.78 |
| Inst. 7 | -34.46 | 0.53 | -26.57 | 0.70 | -16.68 | 1.58 | **-0.118** | 1.23 |
| Inst. 8 | -43.68 | 0.57 | -40.06 | 0.77 | -37.70 | 1.65 | **-0.061** | 1.47 |

Time: the time cost by the corresponding algorithm
The bold values indicate the best $R^2$ obtained among the compared algorithms
Table 3 Testing MAE for various $p$

| Data set | p = 0.1 | p = 0.2 | p = 0.3 | p = 0.4 | p = 0.5 | p = 0.6 | p = 0.7 | p = 0.8 | p = 0.9 | p = 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| Inst. 1 | 0.0080 | 0.0083 | 0.0080 | 0.0080 | 0.0081 | 0.0081 | 0.0081 | 0.0081 | 0.0082 | 0.0083 |
| Inst. 2 | 0.0083 | 0.0083 | 0.0099 | 0.0084 | 0.0112 | 0.0087 | 0.0096 | 0.0104 | 0.0115 | 0.0117 |
| Inst. 3 | 0.0084 | 0.0091 | 0.0091 | 0.0091 | 0.0091 | 0.0087 | 0.0098 | 0.0126 | 0.0132 | 0.0125 |
| Inst. 4 | 0.0078 | 0.0084 | 0.0078 | 0.0078 | 0.0149 | 0.0129 | 0.0151 | 0.0103 | 0.0098 | 0.0088 |
| Inst. 5 | 0.0081 | 0.0086 | 0.0080 | 0.0084 | 0.0087 | 0.0093 | 0.0097 | 0.0131 | 0.0150 | 0.0115 |
| Inst. 6 | 0.0081 | 0.0081 | 0.0081 | 0.0081 | 0.0081 | 0.0081 | 0.0081 | 0.0081 | 0.0081 | 0.0083 |
| Inst. 7 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 | 0.0278 |
| Inst. 8 | 0.3238 | 0.3238 | 0.3238 | 0.3238 | 0.3238 | 0.3238 | 0.3238 | 0.3238 | 0.3238 | 0.3238 |
now we cannot give an optimal $p$ through theoretical analysis.
But according to our empirical studies, a small $p$, such as
$p = 0.1$, usually produces satisfactory estimation results.
4.2 Real-world data sets
To further investigate the proposed algorithm, we conduct
detailed experiments on 16 real-world data sets. Some
basic information about these data sets is presented in
Table 5, and the download websites are given in its footnotes.
Each data set is tested using tenfold cross-validation,
and Table 6 records the average results over the tenfold
cross-validation under the optimal hyperparameters for each
algorithm (the optimal hyperparameters $(\sigma_{opt}, \gamma_{opt})$ for each
algorithm are selected by grid search on the same 2D
parameter set). The results in Table 6 demonstrate that for all
the real-world data sets, both the RWLS-SVM and the
ROELS-SVM produce better accuracy than the LS-SVM
and the WLS-SVM. But the comparison between the
RWLS-SVM and the ROELS-SVM is a bit difficult: for 13
data sets the ROELS-SVM outperforms the RWLS-SVM,
for two data sets it is the contrary, and for one data set both
algorithms perform equally well.
To make a fair comparison, a non-parametric statistical
test, the Wilcoxon signed-ranks test (Demsar 2006), is
implemented to evaluate the performance of the ROELS-SVM
Table 4 The number of eliminated samples for various $p$

| Data set | p = 0.1 | p = 0.2 | p = 0.3 | p = 0.4 | p = 0.5 | p = 0.6 | p = 0.7 | p = 0.8 | p = 0.9 | p = 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| Inst. 1 | 15 | 16 | 15 | 16 | 18 | 19 | 20 | 21 | 22 | 24 |
| Inst. 2 | 31 | 31 | 34 | 33 | 37 | 30 | 32 | 34 | 36 | 36 |
| Inst. 3 | 45 | 45 | 46 | 45 | 49 | 55 | 59 | 45 | 47 | 50 |
| Inst. 4 | 58 | 60 | 59 | 60 | 72 | 66 | 72 | 62 | 68 | 72 |
| Inst. 5 | 73 | 72 | 75 | 77 | 79 | 75 | 81 | 90 | 75 | 81 |
| Inst. 6 | 89 | 89 | 89 | 89 | 89 | 89 | 89 | 89 | 89 | 90 |
| Inst. 7 | 87 | 87 | 87 | 87 | 87 | 87 | 87 | 87 | 87 | 87 |
| Inst. 8 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 |
Table 5 Basic information about the real-world data sets

| Data set | Sample size | Number of attributes | Attribute characteristics |
|---|---|---|---|
| Chwirut^a | 214 | 2 | Real |
| Motorcycle | 133 | 2 | Real |
| Servo^b | 167 | 5 | Categorical, real |
| Nelson^a | 128 | 3 | Integer, real |
| Boston Housing^b | 506 | 14 | Categorical, integer, real |
| Auto MPG^c | 392 | 8 | Categorical, real |
| Bodyfat^c | 252 | 15 | Real |
| Triazines^c | 186 | 60 | Categorical, real |
| Pollution^d | 60 | 16 | Integer, real |
| Enso^a | 168 | 2 | Real |
| Gauss3^a | 250 | 2 | Integer, real |
| Heart disease^b | 400 | 4 | Integer, real |
| Balloon^d | 2,001 | 2 | Real |
| Crabs^d | 200 | 7 | Categorical, integer, real |
| Compass^e | 108 | 3 | Integer, real |
| Bolts^e | 40 | 8 | Integer, real |

a NIST statistical reference datasets. http://www.itl.nist.gov/div898/strd/nls/nls_main.shtml
b UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/datasets.html
c LIBSVM data. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
d StatLib datasets. http://stat.cmu.edu/datasets/
e Datasets for statistical analysis. http://www.sci.usq.edu.au/staff/dunn/Datasets/index.html
and the RWLS-SVM. Following Demsar (2006), let $d_i$ be the
difference between the performance scores of these two
algorithms on the $i$th of $N$ data sets, and rank the differences
according to their absolute values (average ranks
are assigned in case of ties). Let $R^+$ be the sum of ranks for
the data sets on which the second algorithm outperformed
the first, and $R^-$ the sum of ranks for the opposite. Ranks
of $d_i = 0$ are split evenly among the sums. That is,

$$R^+ = \sum_{d_i > 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i), \qquad R^- = \sum_{d_i < 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i)$$

Let $T = \min(R^+, R^-)$; then the statistic

$$z = \frac{T - \frac{1}{4} N (N + 1)}{\sqrt{\frac{1}{24} N (N + 1)(2N + 1)}}$$

is distributed approximately normally.
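A sketch of this test computed directly from the two score vectors; scipy's average ranking handles ties as described, and the function name is ours:

```python
import numpy as np
from scipy.stats import rankdata

def wilcoxon_z(scores_first, scores_second):
    """Wilcoxon signed-ranks z statistic as defined above (Demsar 2006)."""
    d = np.asarray(scores_second) - np.asarray(scores_first)
    ranks = rankdata(np.abs(d))           # average ranks in case of ties
    r_plus = ranks[d > 0].sum() + 0.5 * ranks[d == 0].sum()
    r_minus = ranks[d < 0].sum() + 0.5 * ranks[d == 0].sum()
    T = min(r_plus, r_minus)
    N = len(d)
    return (T - N * (N + 1) / 4) / np.sqrt(N * (N + 1) * (2 * N + 1) / 24)
```

With $N = 16$ and $T = 22.5$ this yields $z \approx -2.35$, matching the value computed below.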
The differences and ranks of $R^2$ on each data set are
recorded in Table 7, and the corresponding differences and
ranks of computation time are recorded in Table 8. From
Table 7, we have

$$T_{R^2} = \min(R^+, R^-) = R^- = 9 + 13 + 0.5 = 22.5$$
Table 6 Results on the real-world instances

| Data set | LS-SVM $R^2$ | LS-SVM Time | WLS-SVM $R^2$ | WLS-SVM Time | RWLS-SVM $R^2$ | RWLS-SVM Time (iter.) | ROELS-SVM $R^2$ | ROELS-SVM Time (elm.) |
|---|---|---|---|---|---|---|---|---|
| Chwirut | 0.9732 | 0.30 | 0.9819 | 0.36 | 0.9821 | 0.47 (3.6) | **0.9823** | 0.46 (15.8) |
| Motorcycle | 0.7007 | 0.04 | 0.7248 | 0.08 | 0.7398 | 0.20 (7.6) | **0.7489** | 0.18 (13.3) |
| Servo | 0.3566 | 0.08 | 0.7172 | 0.10 | **0.8697** | 0.35 (9.7) | 0.8619 | 0.16 (26.8) |
| Nelson | 0.6621 | 0.06 | 0.6996 | 0.08 | **0.7154** | 0.19 (6.9) | 0.6655 | 0.12 (11.2) |
| Boston Housing | 0.8888 | 3.51 | 0.8968 | 4.74 | 0.8981 | 12.5 (12.8) | **0.9050** | 6.09 (22.3) |
| Auto MPG | 0.9011 | 0.89 | 0.9038 | 1.05 | 0.9076 | 2.68 (14.3) | **0.9123** | 1.60 (23.0) |
| Bodyfat | 0.9994 | 0.38 | 0.9995 | 0.77 | **0.9998** | 3.69 (15.6) | **0.9998** | 1.06 (10.9) |
| Triazines scale | 0.1420 | 0.33 | 0.3791 | 0.49 | 0.4048 | 1.68 (9.7) | **0.4111** | 0.56 (7.4) |
| Pollution scale | 0.3172 | 0.03 | 0.4495 | 0.04 | 0.4774 | 0.09 (5.2) | **0.4820** | 0.07 (5.8) |
| Enso | 0.1421 | 0.10 | 0.1423 | 0.14 | 0.1423 | 0.14 (1.0) | **0.2703** | 0.19 (9.9) |
| Gauss3 | 0.9980 | 0.41 | 0.9983 | 0.43 | 0.9984 | 0.61 (8.7) | **0.9985** | 0.61 (3.9) |
| Heart disease | -3.500 | 0.88 | 0.0369 | 1.21 | 0.5045 | 3.12 (25) | **0.6677** | 1.98 (133.3) |
| Balloon | 0.1318 | 321.83 | 0.1506 | 350.29 | 0.2337 | 381.70 (65) | **0.8025** | 424.28 (711) |
| Crabs | 0.9886 | 0.16 | 0.9894 | 0.21 | 0.9897 | 0.67 (10.9) | **0.9899** | 0.29 (7.6) |
| Compass | 0.7253 | 0.05 | 0.8673 | 0.06 | 0.9204 | 0.11 (5.7) | **0.9610** | 0.09 (35.6) |
| Bolts | 0.8208 | 0.01 | 0.8229 | 0.02 | 0.8312 | 0.04 (3.4) | **0.8605** | 0.03 (2.1) |

Iter.: the number of iterations needed by the RWLS-SVM; elm.: the total number of outliers eliminated by the ROELS-SVM
The bold values indicate the best $R^2$ obtained among the compared algorithms
Table 7 Comparison of $R^2$ for RWLS-SVM and ROELS-SVM

| Data set | RWLS-SVM $R^2$ | ROELS-SVM $R^2$ | Difference ($R^2_{ROE} - R^2_{RW}$) | Rank |
|---|---|---|---|---|
| Chwirut | 0.9821 | **0.9823** | 0.0002 | 3.5 |
| Motorcycle | 0.7398 | **0.7489** | 0.0091 | 10 |
| Servo | *0.8697* | 0.8619 | -0.0078 | 9 |
| Nelson | *0.7154* | 0.6655 | -0.0499 | 13 |
| Boston Housing | 0.8981 | **0.9050** | 0.0069 | 8 |
| Auto MPG | 0.9076 | **0.9123** | 0.0047 | 6 |
| Bodyfat | **0.9998** | **0.9998** | 0.0000 | 1 |
| Triazines scale | 0.4048 | **0.4111** | 0.0063 | 7 |
| Pollution scale | 0.4774 | **0.4820** | 0.0046 | 5 |
| Enso | 0.1423 | **0.2703** | 0.1280 | 14 |
| Gauss3 | 0.9984 | **0.9985** | 0.0001 | 2 |
| Heart disease | 0.5045 | **0.6677** | 0.1632 | 15 |
| Balloon | 0.2337 | **0.8025** | 0.5688 | 16 |
| Crabs | 0.9897 | **0.9899** | 0.0002 | 3.5 |
| Compass | 0.9204 | **0.9610** | 0.0406 | 12 |
| Bolts | 0.8312 | **0.8605** | 0.0293 | 11 |

The bold values indicate the best $R^2$ obtained by one of the two algorithms
The italic values indicate the cases in which the RWLS-SVM outperforms the ROELS-SVM
Since $N = 16$,

$$z_{R^2} = \frac{T - \frac{1}{4} N (N + 1)}{\sqrt{\frac{1}{24} N (N + 1)(2N + 1)}} = -2.35 < -1.96$$

Similarly, from Table 8, we have

$$T_{Time} = \min(R^+, R^-) = R^- = 6.5 + 3 + 0.5 = 10$$

Therefore,

$$z_{Time} = \frac{T - \frac{1}{4} N (N + 1)}{\sqrt{\frac{1}{24} N (N + 1)(2N + 1)}} = -3.00 < -1.96$$

Therefore, with $\alpha = 0.05$, the null hypothesis that both
algorithms perform equally well can be rejected. That is,
we may claim that the ROELS-SVM performs remarkably
better than the RWLS-SVM both in test accuracy and
in computation time.
5 Conclusions
To achieve robust estimation in noisy environments, the
ROELS-SVM algorithm is proposed in this paper. It
implements a quantitative criterion derived from LTS, a
method for robust linear regression, for eliminating outliers
within a recursive framework. A decremental learning strategy
is introduced into the LS-SVM retraining stage, which is
fast and effective. Experiments demonstrate that the
ROELS-SVM eliminates the great majority of the outliers and
obtains robust estimation in most cases.
Traditional robust algorithms for LS-SVM focus on
reducing the influence of a bundle of potential outliers within
a single iteration and, therefore, tend to make mistakes
when identifying outliers very close to the normal samples.
The ROELS-SVM works in a different way: it does
not judge many outliers within one iteration, but just
a few, for example, two or three outliers per
iteration; it eliminates them, updates the training results and
then comes into the next iteration. This is different from the
RWLS-SVM and makes it more likely for the ROELS-SVM
to adjust the training results in the right
direction.
More interestingly, the ROELS-SVM can be viewed as a
back-propagation procedure for obtaining an uncontaminated
set of samples for LS-SVM regression, which can be
used as an initial pruning step for sparsification. Besides,
applications of the ROELS-SVM in the field of outlier
detection would also be very interesting. Future work can
concentrate on these points.
Table 8 Comparison of the computation time of RWLS-SVM and ROELS-SVM

| Data set | T(LS-SVM) | T(RWLS-SVM) | T(ROELS-SVM) | Difference (T_ROE − T_RW)/T_LS-SVM | Rank |
|---|---|---|---|---|---|
| Chwirut | 0.30 | 0.47 | **0.46** | -0.03 | 2 |
| Motorcycle | 0.04 | 0.20 | **0.18** | -0.50 | 6.5 |
| Servo | 0.08 | 0.35 | **0.16** | -2.38 | 13.5 |
| Nelson | 0.06 | 0.19 | **0.12** | -1.17 | 9 |
| Boston Housing | 3.51 | 12.5 | **6.09** | -1.83 | 12 |
| Auto MPG | 0.89 | 2.68 | **1.60** | -1.21 | 10 |
| Bodyfat | 0.38 | 3.69 | **1.06** | -6.92 | 16 |
| Triazines scale | 0.33 | 1.68 | **0.56** | -3.39 | 15 |
| Pollution scale | 0.03 | 0.08 | **0.07** | -0.33 | 4 |
| Enso | 0.10 | **0.14** | 0.19 | *0.50* | 6.5 |
| Gauss3 | 0.41 | **0.61** | **0.61** | 0.00 | 1 |
| Heart disease | 0.88 | 3.12 | **1.98** | -1.30 | 11 |
| Balloon | 321.83 | **381.7** | 424.28 | *0.13* | 3 |
| Crabs | 0.16 | 0.67 | **0.29** | -2.38 | 13.5 |
| Compass | 0.05 | 0.11 | **0.09** | -0.40 | 5 |
| Bolts | 0.01 | 0.04 | **0.03** | -1.00 | 8 |

Since the computation time is influenced by the sample size, the difference of computation time is calculated by (T_ROE − T_RW)/T_LS-SVM. This avoids unfair ranks caused by the influence of sample size
The bold values indicate the lesser computation time of the RWLS-SVM and the ROELS-SVM
The italic values highlight the positive differences; in these cases the RWLS-SVM outperforms the ROELS-SVM
References
Brabanter JD (2004) LS-SVM regression modelling and its applica-
tions. Ph.D. thesis. ftp://ftp.esat.kuleuven.ac.be/pub/SISTA//
debrabant
Burges CJC (1998) A tutorial on support vector machines for pattern
recognition. Data Min Knowl Discov 2(2):955–974
Cao LJ, Tay FEH (2003) Support vector machine with adaptive
parameters in financial time series forecasting. IEEE Trans
Neural Netw 14(6):1506–1518
Cawley GC, Talbot NLC (2004) Fast exact leave-one-out cross-
validation of sparse least-squares support vector machines.
Neural Netw 17:1467–1475
Chuang CC, Su SF, Jeng JT, Hsiao CC (2002) Robust support vector
regression networks for function approximation with outliers.
IEEE Trans Neural Netw 13(6):1322–1330
Cortes C (1993) Prediction of generalization ability in learning
machines. Ph.D. thesis. http://homepage.mac.com/corinnacortes/
Demsar J (2006) Statistical comparisons of classifiers over multiple
data sets. J Mach Learn Res 7:1–30
Evgeniou T, Pontil M, Poggio T (2000) Regularization networks and
support vector machines. Adv Comput Math 13:1–50
Grubbs FE (1969) Procedures for detecting outlying observations in
samples. Technometrics 11(1):1–21
Jiang JQ, Song CY, Wu CG, Maurizio M, Liang YC (2006) Support
vector machine regression algorithm based on chunking incre-
mental learning. In: Proceedings of ICCS’06. Lecture notes in
computer science, vol 3991. Springer, Berlin, pp 547–554
Kvalseth TO (1985) Cautionary note about R2. Am Stat 39(4):279–285
Mangasarian OL, Musicant DR (2000) Robust linear and support
vector regression. IEEE Trans Pattern Anal Mach Intell
22(9):950–955
Rousseeuw PJ (1984) Least median of squares regression. J Am Stat
Assoc 79:871–880
Rousseeuw PJ, Driessen KV (2006) Computing LTS Regression for
large data sets. Data Min Knowl Discov 12:29–45
Rousseeuw PJ, Leroy A (1987) Robust regression and outlier
detection. Wiley, New York, pp 9–11
Scholkopf B, Sung KK, Burges CJC, Girosi F, Niyogi P, Poggio T,
Vapnik V (1997) Comparing support vector machines with
Gaussian kernels to radial basis function classifiers. IEEE Trans
Signal Process 45(11):2758–2765
Smola AJ, Scholkopf B (1998) A tutorial on support vector
regression. NeuroCOLT2 Technical Report NC2-TR-1998-030
Suykens JAK, Vandewalle J (1999) Least squares support vector
machine classifiers. Neural Process Lett 9:293–300
Suykens JAK, Brabanter JD, Lukas L, Vandewalle J (2002) Weighted
least squares support vector machines: robustness and sparse
approximation. Neurocomputing 48:85–105
Tian SF, Huang HK (2002) A simplification algorithm to support
vector machines for regression. J Softw 13(6):1169–1172
Vapnik V (1995) The nature of statistical learning theory. Wiley, New
York
Wen W, Hao ZF, Yang XW (2008) A heuristic weight-setting strategy
and iteratively updating algorithm for weighted least-squares
support vector regression. Neurocomputing 71(16–18):3096–
3103
Wu CH (2004) Travel-time prediction with support vector regression.
IEEE Trans Intell Transp Syst 5(4):276–281
Zhang JS, Gao G (2005) Reweighted robust support vector regression
method. Chin J Comput Sci 28(7):1171–1177
Zhao Y, Keong KC (2004) Fast leave-one-out evaluation and
improvement on inference for LS-SVMs. In: Proceedings of
ICPR'04, vol 3, pp 494–497