

Neurocomputing 71 (2008) 3096–3103


A heuristic weight-setting strategy and iteratively updating algorithm for weighted least-squares support vector regression

Wen Wen a,*, Zhifeng Hao b, Xiaowei Yang b

a College of Computer Science and Engineering, South China University of Technology, Guangzhou 510641, PRC
b School of Mathematical Science, South China University of Technology, Guangzhou 510641, PRC

Article info

Available online 20 June 2008

Keywords:

Support vector machines

Least squares

Outlier mining

Regression

Iterative update

doi:10.1016/j.neucom.2008.04.022

* Corresponding author. Tel.: +86 20 39782623. E-mail address: [email protected] (W. Wen).

Abstract

Weighted least-squares support vector machine (WLS-SVM) is an improved version of least-squares support vector machine (LS-SVM). It adds weights on the error variables to correct the biased estimation of LS-SVM. The traditional weight-setting algorithm for WLS-SVM depends on the results of unweighted LS-SVM and requires retraining of WLS-SVM. In this paper, a heuristic weight-setting method is proposed. This method derives from the idea of outlier mining and is independent of unweighted LS-SVM. More importantly, a fast iterative updating algorithm is presented, which reaches the final results of WLS-SVM through a few updating steps instead of directly retraining WLS-SVM. Detailed experiments on simulated instances and real-world datasets are conducted, demonstrating comparable results of the proposed WLS-SVM and encouraging performance of the fast iterative updating algorithm.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Support vector machine (SVM), introduced by Vapnik, is a useful tool for data mining, especially in the fields of pattern recognition and regression. During the past few years, its solid theoretical foundation and good generalization behavior have attracted a number of researchers, and it has been demonstrated to be an effective method for solving real-life problems [25,26].

According to Vapnik's "The Nature of Statistical Learning Theory" [21], using tactics such as introducing a kernel function, both nonlinear pattern recognition problems and regression problems can be converted into linear ones and finally reduced to quadratic programming (QP) problems. This category of SVM uses inherently sparse loss functions, such as the epsilon-insensitive loss function, the Laplacian loss function, Huber's robust loss function and so on [22]. These loss functions are derived from statistical tools and theories, leading to sparse and robust approximations of certain problems [8,16,17]. However, this approach requires solving a QP with inequality constraints, which is complicated and time consuming, and to keep the sparseness and robustness of the estimation, the loss function should be carefully chosen depending on the problem.

Suykens and Vandewalle [19], from another perspective, proposed least-squares SVMs (LS-SVM), which instead use a non-sparse loss function: the sum squared error (SSE). This trick converts the inequality constraints in classical SVM to equality ones. The solution follows directly from solving a set of linear equations, which is much less complex than quadratic programming. This idea is quite similar to ridge regression, a widely used method in the field of linear regression [15]; the difference is that LS-SVM for regression is the kernel version of linear ridge regression. Excellent results were reported on LS-SVM regression for noiseless data [10]. However, since it uses the SSE loss function, it is less robust, in other words, sensitive to noise. This is a drawback not only of LS-SVM but also of any other learning method using the SSE loss function. In order to avoid this drawback, Suykens et al. [20] proposed a weighted version of LS-SVM (WLS-SVM), in which different weights are put on the error variables. They first trained the samples using classical LS-SVM, then calculated the weight for each sample according to its error variable, and finally solved the WLS-SVM. This idea, which is probably derived from the weighted least-squares estimator in classical linear regression, provides a novel approach to finding a robust LS-SVM regressor. However, in this method, there are two points deserving consideration: since WLS-SVM has a fairly similar form to LS-SVM, is there an enhancing algorithm to reach WLS-SVM instead of retraining it? And although Suykens' weight-setting procedure depends on the results of the unweighted LS-SVM, is there a weight-setting method, independent of the unweighted LS-SVM, that provides useful statistical information about the dataset and thus leads to a comparable WLS-SVM?

Some other researchers are also working on robust estimators by improving the standard SVM. Lin and Wang proposed a fuzzy version of SVM (FSVM) to deal with noisy data [11–13]. Their basic idea is to assign a fuzzy membership to each sample,


which is determined by the importance of the sample. To some extent, FSVM is similar to WLS-SVM: both of them improve the classical SVM by adding different weights to the loss function (the fuzzy memberships in FSVM can be viewed as the weights in WLS-SVM). The only difference is that FSVM uses a heuristic method to determine the fuzzy memberships while WLS-SVM depends on pre-training the classical LS-SVM. Researchers have developed various methods for determining the fuzzy memberships [9]. Most of them concentrate on classification problems and few address regression problems. Besides, Zhang and Gao [27] proposed a recursive weighted LS-SVM, and Chuang et al. [5] suggested using a back-propagation procedure to improve the robustness of SVM. However, both of these methods greatly increase the computational complexity.

Essentially, one of the issues we care about when designing WLS-SVM is how to effectively find the noisy characteristics underlying the training data. Another is how to use information from the original training procedure to speed up the computation of WLS-SVM. In this paper, we make an attempt to deal with these two issues. The paper is organized as follows: Section 2 provides a brief review of WLS-SVM. In Section 3 we first propose a novel outlier-labeling strategy in one-dimensional space and then present a heuristic weight-setting strategy for WLS-SVM. Furthermore, based on the unweighted LS-SVM, a fast training algorithm for WLS-SVM is presented in Section 4. Experiments on simulated datasets and real-world instances are conducted in Section 5. Finally, Section 6 draws conclusions from our work.

Fig. 1. Example.

2. A brief review of WLS-SVM

In order to obtain robust estimation from noisy data, Suykens et al. [20] proposed WLS-SVM. This model can be stated as follows:

$$\min_{w,b,e}\ \frac{1}{2}\|w\|^2 + \frac{1}{2}C\sum_{k=1}^{l} v_k e_k^2
\quad \text{s.t.} \quad y_k = w^T\varphi(x_k) + b + e_k,\ \ k = 1,\ldots,l \qquad (1)$$

where $v_k$ is determined by the following formula:

$$v_k =
\begin{cases}
1 & \text{if } |e_k/s| \le c_1,\\[4pt]
\dfrac{c_2 - |e_k/s|}{c_2 - c_1} & \text{if } c_1 < |e_k/s| \le c_2,\\[4pt]
10^{-4} & \text{otherwise.}
\end{cases} \qquad (2)$$

Under the assumption of a normal Gaussian distribution of $e_k$, $s$ can be given by formula (3) or formula (4):

$$s = \frac{\mathrm{IQR}}{2 \times 0.6745} \qquad (3)$$

where IQR stands for the interquartile range, that is, the difference between the 75th percentile and the 25th percentile, or

$$s = 1.483\,\mathrm{MAD}(x_i) \qquad (4)$$

where MAD($x_i$) stands for the median absolute deviation. To obtain the values of $v_k$, one should first use the training dataset to train a classical LS-SVM and then compute $s$ from the distribution of the $e_k$. From the statistical perspective of maximum likelihood estimation, the SSE cost function is optimal under the assumption of a normal Gaussian distribution. Therefore, Suykens et al. suggest using formula (2) to correct for this assumption when the distribution of $e_k$ is not Gaussian. However, there are three important points demanding careful consideration. Firstly, the weights, which largely depend on the original regression errors from the unweighted LS-SVM, might be unreliable for correcting the regression results, especially when the estimate is seriously bent towards the samples having large deviations. Secondly, in formula (2), some weights are simply set to $10^{-4}$, and in this case, important information contained in these samples is completely lost. Finally, apart from the slight differences in the weights, WLS-SVM is quite similar to unweighted LS-SVM. Therefore, there is probably an approach to speed up the training procedure based on the results already obtained from the unweighted LS-SVM.
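To make the reviewed weight-setting concrete, the following is a minimal sketch (ours, not the authors' code) of computing Suykens-style weights from the residuals of an unweighted LS-SVM, using the IQR-based scale estimate of formula (3); the constants c1 = 2.5 and c2 = 3.0 are assumed typical values, not taken from this paper.

```python
import numpy as np

def suykens_weights(e, c1=2.5, c2=3.0):
    """Weights of formula (2); e holds the residuals e_k of the unweighted LS-SVM.

    c1 and c2 are assumed constants; s is the robust scale estimate of formula (3).
    """
    q75, q25 = np.percentile(e, [75, 25])
    s = (q75 - q25) / (2 * 0.6745)          # formula (3): IQR / (2 * 0.6745)
    r = np.abs(e / s)
    return np.where(r <= c1, 1.0,
           np.where(r <= c2, (c2 - r) / (c2 - c1), 1e-4))
```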

3. The proposed weight-setting method

In this section, we propose a heuristic algorithm that mines prior knowledge of the noisy characteristics of the training samples and then uses this knowledge to set weights for the WLS-SVM. In this algorithm, the weight-setting procedure is independent of the original regression results of the unweighted LS-SVM; it can therefore, to some extent, be considered a prior step that mines the extra noise information in the dataset.

3.1. A simple method to label the outlierness of samples

The basic idea of our method is to label the "outlierness" of a sample according to its distances from other samples. Generally, the more outlying a sample is, the more large distances it has to other samples. Thus, we first determine a threshold describing what counts as a "large" distance. In our method, we define half of the maximum distance between the given sample and the other samples as the threshold of "large" distance. To make this clearer, we take Fig. 1 as an example. A, B, C, D, E are five samples with coordinates 1, 2, 3, 6, 8, respectively. A, B, C are samples generated according to the same distribution. It is visible that D and E are probably outliers, and that the "outlierness" of E is larger than that of D. Consider their distance matrix (samples A, B, ..., E are the ith sample in sequence, i = 1, 2, ..., 5):

$$\mathrm{Dist} =
\begin{bmatrix}
0 & 1 & 2 & 5 & 7\\
1 & 0 & 1 & 4 & 6\\
2 & 1 & 0 & 3 & 5\\
5 & 4 & 3 & 0 & 2\\
7 & 6 & 5 & 2 & 0
\end{bmatrix}$$

For row i, if we set

$$d_i = \frac{1}{2}\max_{j=1,\ldots,5}\{\mathrm{dist}(i,j)\} \qquad (5)$$

and define

$$O_i^{near} = \{\, j \mid \mathrm{dist}(i,j) \le d_i \,\} \qquad (6)$$

$$O_i^{far} = \{\, j \mid \mathrm{dist}(i,j) > d_i \,\} \qquad (7)$$

$$N_i^{near} = |O_i^{near}| \qquad (8)$$

$$N_i^{far} = |O_i^{far}| \qquad (9)$$

We may find that for i = 1, 2, 3, $N_i^{near} > N_i^{far}$, while for i = 4, 5, $N_i^{near} < N_i^{far}$. This suggests that outlying samples have more "far" terms than "near" terms in the distance matrix, which motivates labeling the "outlierness" of sample i according to $N_i^{near}$ and $N_i^{far}$.

Furthermore, considering samples D and E, we find that $N_4^{near} = N_5^{near}$ and $N_4^{far} = N_5^{far}$, but it is obvious that sample E is more outlying than sample D. Simply using the information from $N_i^{near}$ and $N_i^{far}$ cannot distinguish the different "outlierness" of D and E. Therefore, the distance comparison of formula (10) is introduced into the labeling procedure:

$$B_i = \frac{\max_{k \in O_i^{far}}\{d(i,k)\} - \min_{k \in O_i^{far}}\{d(i,k)\}}{\max_{k \in O_i^{far}}\{d(i,k)\}} \qquad (10)$$

The more outlying a sample is, the larger its distances are compared to those of the "un-outlying" samples, and therefore the smaller the value of $B_i$. Considering samples D and E, the maximum distance among the "un-outlying" samples A, B, C is |AC|, which is equal to $\max_{k \in O_i^{far}}\{d(i,k)\} - \min_{k \in O_i^{far}}\{d(i,k)\}$ for i = 4, 5. Therefore, we have $B_4 = 2/5$ and $B_5 = 2/7$. As a matter of fact, the numerator of $B_i$ describes the aggregative degree of the "un-outlying" samples: when $\max_{k \in O_i^{far}}\{d(i,k)\}$ is fixed, the more aggregated the "un-outlying" samples are, the more outlying sample i is.

Jointly considering $N_i^{near}$, $N_i^{far}$ and $B_i$ leads to the outlyingness label of a sample:

$$g_i = \frac{N_i^{near}}{N_i^{far}}\, B_i \qquad (11)$$

The more outlying a sample is, the smaller $g_i$ is.
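As an illustration (our own sketch, not part of the original paper), the following code applies formulas (5)–(11) to the five samples of Fig. 1. It reproduces $B_4 = 2/5$ and $B_5 = 2/7$ and gives $g_4 \approx 0.27$ and $g_5 \approx 0.19$, so E receives the smallest label, as expected.

```python
import numpy as np

# Coordinates of the five samples A, B, C, D, E from Fig. 1.
x = np.array([1.0, 2.0, 3.0, 6.0, 8.0])
dist = np.abs(x[:, None] - x[None, :])          # distance matrix Dist

for i in range(len(x)):
    d_i = dist[i].max() / 2.0                   # formula (5)
    near = dist[i] <= d_i                       # O_i^near, formula (6)
    far = ~near                                 # O_i^far,  formula (7)
    n_near, n_far = near.sum(), far.sum()       # formulas (8) and (9)
    B_i = (dist[i][far].max() - dist[i][far].min()) / dist[i][far].max()  # formula (10)
    g_i = (n_near / n_far) * B_i                # formula (11)
    print(f"sample {i + 1}: N_near={n_near}, N_far={n_far}, B={B_i:.3f}, g={g_i:.3f}")
```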

3.2. The proposed weight-setting method for WLS-SVM

A dataset suitable for regression usually has the property that most samples are distributed close to the objective regression curve, while outliers have large deviations. If the input features are limited to an appropriately small interval θ, the change of y should be mild, at least not drastic: it usually consists of a reasonable change driven by the input features plus a bit of noise within acceptable boundaries. If the change of y is drastic, the sample is probably an outlier. This implies that we can use the method proposed in Section 3.1 to label the "outlierness" of samples and hence set appropriate weights on them.

Given a training dataset $\{s_i \mid s_i = (x_i, y_i),\ i = 1,2,\ldots,l\}$, where $x_i = (x_{i1}, x_{i2},\ldots,x_{iD})$ is the input vector and $y_i$ is the observed value, we consider in this paper only the situation where $y_i$ is one-dimensional. Derived from the outlier-labeling procedure proposed in Section 3.1, the heuristic weight-setting algorithm is presented as follows:

Algorithm (the proposed weight-setting method for WLS-SVM).

1. Use the fast leave-one-out cross-validation [4,28] to obtain the optimal hyperparameters for the unweighted LS-SVM.

2. Initialize an appropriate threshold θ for the distances in the input space.

3. Calculate the input and output distances between samples according to formula (12) and formula (13), and store them in the input distance matrix DX and the output distance matrix DY, respectively:

$$DX(i,j) = \sqrt{\sum_{d=1}^{D}(x_{id} - x_{jd})^2} \qquad (12)$$

$$DY(i,j) = |y_i - y_j| \qquad (13)$$

4. For the ith row of DX, find the sample group $O_i$ that satisfies

$$O_i = \{\, j \mid DX(i,j) < \theta \,\} \qquad (14)$$

For $k \in O_i$, let $d_i = \frac{1}{2}\max_{k \in O_i}\{DY(i,k)\}$.

Record $O_i^{near} = \{\, j \mid DY(i,j) \le d_i \,\}$ and $O_i^{far} = \{\, j \mid DY(i,j) > d_i \,\}$.

Count the sample numbers $N_i^{near} = |O_i^{near}|$ and $N_i^{far} = |O_i^{far}|$.

Let

$$v_i =
\begin{cases}
1 & \text{if } N_i^{near} \ge N_i^{far},\\[4pt]
\dfrac{N_i^{near}}{N_i^{far}}\, B_i & \text{otherwise,}
\end{cases}
\qquad \text{where } B_i = \frac{\max_{k \in O_i^{far}}\{DY(i,k)\} - \min_{k \in O_i^{far}}\{DY(i,k)\}}{\max_{k \in O_i^{far}}\{DY(i,k)\}}$$

5. Use the weighted samples $\{s'_i \mid s'_i = (x_i, y_i, v_i),\ i = 1,2,\ldots,l\}$ to train the WLS-SVM.

Different from Suykens' WLS-SVM, the proposed weight-setting method is independent of the unweighted LS-SVM: it mines the noisy characteristics in its own procedure and finds useful statistical information that may not be discovered by the unweighted LS-SVM. It is derived from our weighted algorithm in [24], but is more practical because it has fewer parameters to set: only one parameter, the input threshold θ, has to be determined. According to our empirical studies, θ should be chosen so that within such a small interval there are approximately four to ten samples on average; this can easily be determined by simple arithmetic before running the weight-setting procedure (a code sketch of the procedure is given below).
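The following is a minimal sketch of steps 2–4 of the proposed weight-setting algorithm, under the assumption of Euclidean input distances and a user-chosen threshold θ; the function name heuristic_weights and the guard for an empty "far" set are ours, not the paper's.

```python
import numpy as np

def heuristic_weights(X, y, theta):
    """Compute the weights v_i of the proposed method (formulas (12)-(14) and step 4).

    X : (l, D) array of input vectors, y : (l,) array of observed outputs,
    theta : threshold for input-space distances.
    """
    l = len(y)
    DX = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # formula (12)
    DY = np.abs(y[:, None] - y[None, :])                           # formula (13)
    v = np.ones(l)
    for i in range(l):
        O_i = np.where(DX[i] < theta)[0]            # neighbourhood in input space, (14)
        d_i = DY[i, O_i].max() / 2.0
        near = O_i[DY[i, O_i] <= d_i]               # O_i^near
        far = O_i[DY[i, O_i] > d_i]                 # O_i^far
        if len(far) == 0 or len(near) >= len(far):
            continue                                # v_i stays 1
        B_i = (DY[i, far].max() - DY[i, far].min()) / DY[i, far].max()
        v[i] = (len(near) / len(far)) * B_i
    return v
```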

As for the computational complexity, in Suykens' weight-setting procedure the major computational effort lies in the calculation of the IQR, the difference between the 75th and 25th percentiles of the errors, which requires computing and sorting the $e_k$ ($k = 1,2,\ldots,l$). For our method, the major computations come from the calculation of the distance matrices and the search for samples within the threshold interval. Both can be finished within $O(l^2)$ computations (although, in high-dimensional input cases, our method may take somewhat longer).

4. A fast training algorithm for WLS-SVM

From the original QP problem of WLS-SVM (1), it is easy to obtain the KKT system of weighted LS-SVM:

$$\begin{bmatrix}
\Omega + V_\gamma & 1_v\\
1_v^T & 0
\end{bmatrix}
\begin{bmatrix}
\alpha\\ b
\end{bmatrix}
=
\begin{bmatrix}
y\\ 0
\end{bmatrix} \qquad (15)$$

Here, $V_\gamma = \mathrm{diag}\{1/(\gamma v_1),\ldots,1/(\gamma v_l)\}$ is a diagonal matrix. It differs from unweighted LS-SVM, in which $\tilde{V}_\gamma = \mathrm{diag}\{1/\gamma,\ldots,1/\gamma\}$. In order to present our method clearly, we denote

$$A = \begin{bmatrix} \Omega + V_\gamma & 1_v\\ 1_v^T & 0 \end{bmatrix},\qquad
\tilde{A} = \begin{bmatrix} \Omega + \tilde{V}_\gamma & 1_v\\ 1_v^T & 0 \end{bmatrix},\qquad
X = \begin{bmatrix} \alpha\\ b \end{bmatrix},\qquad
Y = \begin{bmatrix} y\\ 0 \end{bmatrix}.$$
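For reference, solving the KKT system (15) directly looks like the following sketch (ours, not the authors' C++ implementation); the fast algorithm developed below avoids repeating this full solve after the weights change.

```python
import numpy as np

def solve_wlssvm(K, y, v, gamma):
    """Directly solve A [alpha; b] = [y; 0] of formula (15).

    K : (l, l) kernel matrix Omega, y : targets, v : weights v_k, gamma : regularization.
    """
    l = len(y)
    A = np.zeros((l + 1, l + 1))
    A[:l, :l] = K + np.diag(1.0 / (gamma * v))   # Omega + V_gamma
    A[:l, l] = 1.0                               # column 1_v
    A[l, :l] = 1.0                               # row 1_v^T
    Y = np.append(y, 0.0)
    X = np.linalg.solve(A, Y)
    return X[:l], X[l]                           # alpha, b
```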

Our motivation is to obtain $A^{-1}$ from $\tilde{A}^{-1}$ in far fewer steps than inverting $A$ directly. The Sherman–Morrison–Woodbury formula [2] is introduced here for the derivation.

4.1. Sherman–Morrison–Woodbury formula

Given an invertible matrix $A$ and column vectors $u_1$ and $u_2$, assuming $1 + u_2^T A^{-1} u_1 \ne 0$, the following equation holds:

$$(A + u_1 u_2^T)^{-1} = A^{-1} - \frac{A^{-1} u_1 u_2^T A^{-1}}{1 + u_2^T A^{-1} u_1} \qquad (16)$$

Enlightened by the Sherman–Morrison–Woodbury formula (16), we attempt to use an iterative method to find the inverse of $A$ starting from $\tilde{A}^{-1}$. Let $A_0^{-1} = \tilde{A}^{-1}$ and $\tilde{X} = [\tilde{\alpha};\ \tilde{b}]$, where $\tilde{\alpha}, \tilde{b}$ are the results of the unweighted LS-SVM under the optimal hyperparameters. In each iteration we calculate the inverse of a matrix that differs from the previous one only in the element on its kth row and kth column; that is, in the kth iteration,

$$A_k = A_{k-1} +
\begin{bmatrix}
0 & \cdots & 0 & \cdots & 0\\
\vdots & \ddots & \vdots & & \vdots\\
0 & \cdots & \dfrac{1}{\gamma}\left(\dfrac{1}{v_k} - 1\right) & \cdots & 0\\
\vdots & & \vdots & \ddots & \vdots\\
0 & \cdots & 0 & \cdots & 0
\end{bmatrix} \qquad (17)$$

where the only nonzero entry lies in the kth row and kth column.

It is obvious that when $k = l$ ($l$ is the total number of training samples), we have $A_l = A$.

Let $u_1(k) = (0,\ldots,0,\ 1/\gamma,\ 0,\ldots,0)^T$ and $u_2(k) = (0,\ldots,0,\ (1/v_k) - 1,\ 0,\ldots,0)^T$, where the kth element of $u_1(k)$ is $1/\gamma$ and the kth element of $u_2(k)$ is $(1/v_k) - 1$; then $A_k = A_{k-1} + u_1(k)\,u_2(k)^T$. Therefore, according to the Sherman–Morrison–Woodbury formula (16), we have

$$A_k^{-1} = A_{k-1}^{-1} - \frac{A_{k-1}^{-1}\, u_1(k)\, u_2(k)^T\, A_{k-1}^{-1}}{1 + u_2(k)^T\, A_{k-1}^{-1}\, u_1(k)}. \qquad (18)$$

Let $a_{ij}^{(k)}$ denote the element on the ith row and jth column of $A_k^{-1}$; then

$$a_{ij}^{(k)} = a_{ij}^{(k-1)} - \frac{a_{ik}^{(k-1)}\,\dfrac{1}{\gamma}\left(\dfrac{1}{v_k} - 1\right) a_{kj}^{(k-1)}}{1 + \dfrac{1}{\gamma}\left(\dfrac{1}{v_k} - 1\right) a_{kk}^{(k-1)}} \qquad (19)$$

Since $A_l = A$, we have $A^{-1} = A_l^{-1}$. The updated $X$ can then be calculated by

$$X = A_l^{-1}\, Y \qquad (20)$$

which gives the values of $\alpha$ and $b$ for WLS-SVM. The proposed iterative updating algorithm can be summarized as follows:

Algorithm (the proposed fast training algorithm of WLS-SVM).

1. Use leave-one-out cross-validation to train LS-SVM and obtain the optimal hyperparameters. For the optimal hyperparameters, store the matrix $\tilde{A}^{-1}$ in M and set the iteration counter k = 1.

2. Use Suykens' method or the proposed method to find a weight for each sample, that is, to find $v_k$ for the kth sample (k = 1, 2, ..., l).

3. For the kth sample, if $v_k < 1.0$, update the elements of M according to formula (19); else go to step 4.

4. Let k = k + 1 and return to step 3 until k > l.

5. Calculate $\alpha$ and $b$ according to formula (20) and output them as the training results of WLS-SVM.

Only when $v_k < 1$ do the iterative updating steps 3 and 4 have to be carried out. Each update requires $l^2$ calculations. If we let m denote the number of samples having $v_k < 1$, we need in total $m\,l^2$ calculations to reach the final results of WLS-SVM. In both Suykens' weight-setting method and ours, a large percentage of samples actually have weights equal to 1 because they are not outliers, which makes $m \ll l$. Therefore, the proposed fast iterative updating algorithm costs much less time than directly retraining WLS-SVM (which requires at least $O(l^3)$ computations). For example, if there are 10% outliers in the dataset, the fast algorithm requires roughly a tenth of the computations of directly retraining WLS-SVM. A minimal code sketch of this update is given below.
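The following is a minimal sketch (ours, assuming NumPy and the notation above) of the fast update: starting from the stored inverse M = Ã⁻¹, it applies the rank-one correction of formula (19) once per sample with v_k < 1 and then solves formula (20).

```python
import numpy as np

def fast_wlssvm_update(M, v, gamma, y):
    """Iterative update of the inverse KKT matrix (formula (19)) followed by formula (20).

    M     : inverse of the unweighted KKT matrix A~, shape (l+1, l+1)
    v     : weights v_k of the l training samples
    gamma : LS-SVM regularization parameter
    y     : training targets (the right-hand side is Y = [y; 0])
    """
    M = M.copy()
    for k in range(len(v)):
        if v[k] >= 1.0:                      # only samples with v_k < 1 need an update
            continue
        delta = (1.0 / gamma) * (1.0 / v[k] - 1.0)      # change of the (k, k) entry of A
        M -= delta * np.outer(M[:, k], M[k, :]) / (1.0 + delta * M[k, k])   # formula (19)
    X = M @ np.append(y, 0.0)                # formula (20): X = A_l^{-1} Y
    return X[:-1], X[-1]                     # alpha, b
```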

Table 1. Symbols

Symbol     Brief remark                                                          Definition
SSEloo     Sum squared error of leave-one-out cross-validation                   $SSE_{loo} = \sum_{i=1}^{l}(y_i - \hat{y}_i^{loo})^2$
SSE        Sum squared error of testing                                          $SSE = \sum_{i=1}^{m}(y_i - \hat{y}_i)^2$
SST        Sum squared deviation of testing samples                              $SST = \sum_{i=1}^{m}(y_i - \bar{y})^2$
SSR        Sum squared deviation that can be explained by the estimator          $SSR = \sum_{i=1}^{m}(\hat{y}_i - \bar{y})^2$
SSE/SST    Ratio between sum squared error and sum squared deviation of testing samples          $SSE/SST = \sum_{i=1}^{m}(y_i - \hat{y}_i)^2 / \sum_{i=1}^{m}(y_i - \bar{y})^2$
SSR/SST    Ratio between interpretable sum squared deviation and real sum squared deviation of testing samples    $SSR/SST = \sum_{i=1}^{m}(\hat{y}_i - \bar{y})^2 / \sum_{i=1}^{m}(y_i - \bar{y})^2$

5. Experiments and results

To check the validity of the proposed algorithms, programs including the weight-setting procedure and the fast and standard WLS-SVM training modules were written in C++, using Microsoft's Visual C++ 6.0 compiler. The experiments were run on an HP personal computer with a 3.06 GHz Pentium IV processor and a maximum of 512 MB of memory available. In order to evaluate the performance of the proposed algorithms, simulated instances and real-world instances are used as test cases. Before presenting the experimental results, the evaluation criteria are specified in the following subsection.

5.1. Symbols and performance evaluation criteria

Let $l$ be the number of training samples, $y_i$ be the real value of sample i, and $\hat{y}_i^{loo}$ be the prediction of $y_i$ when sample i is left out of the training dataset. Denote by $m$ the number of testing samples, by $\hat{y}_i$ the predicted value of $y_i$, and by $\bar{y}$ the average value of $y_1,\ldots,y_m$. We then use the criteria in Table 1 for algorithm evaluation.

Detailed remarks on the symbols:

SSEloo: Also denoted PRESS (Predicted Residual Sum of Squares), it is usually used to evaluate the predictive ability of an estimator [1]. In our experiments it is used as the criterion in the hyperparameter-tuning stage; the smaller SSEloo is, the better the regression performance of the LS-SVM.

SSE: It represents the fitting precision; the smaller SSE is, the better the fit. However, when outliers are also used as testing samples, a very small SSE probably indicates overfitting of the regressor.

SST: It reflects the underlying variance of the testing samples, which usually comprises the variance caused by noise and that caused by the change of input values.

SSR: From the statistical perspective, SSR reflects the explanatory ability of the regressor. The larger SSR is, the more statistical information it captures from the testing samples [23].

SSE/SST and SSR/SST: Two important criteria used for evaluating the performance of regression algorithms [3,18,23]. In most cases, a small SSE/SST means good agreement between estimates and real values, and obtaining a smaller SSE/SST is usually accompanied by an increase of SSR/SST. However, an extremely small SSE/SST is in fact not good, for it probably indicates overfitting of the regressor. A good estimator should therefore strike a balance between SSE/SST and SSR/SST. We use SSE/SST and SSR/SST as the major evaluation criteria in our experiments.
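As a small illustration (ours, not from the paper), the two major criteria can be computed as follows.

```python
import numpy as np

def sse_sst_ssr(y_true, y_pred):
    """Return SSE/SST and SSR/SST as defined in Table 1 for m testing samples."""
    y_bar = y_true.mean()
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_bar) ** 2)
    ssr = np.sum((y_pred - y_bar) ** 2)
    return sse / sst, ssr / sst
```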


5.2. Simulated datasets

Firstly, three simulated instances are tested. Training data are generated by a contaminated sinc function f plus three kinds of noise (Table 2) across the input space: Gaussian noise with zero mean, transformed χ² noise with non-zero mean, and heterogeneous-variance Gaussian noise with non-zero mean (the variance of the noise changes as the input coordinate changes). To avoid biased comparison, for each kind of noise, 10 groups of noisy samples are generated using the Matlab toolbox, which together make up 30 training datasets. The testing data are uniformly sampled from the objective regression function f*.

Fig. 2. Training datasets and test dataset: top-left, top-right and bottom-left illustrate the three kinds of simulated datasets with Gaussian noise, transformed χ² noise and heterogeneous-variance noise, respectively. Bottom-right is the testing dataset, which is randomly sampled from the sinc function. The numbers of training and testing samples are 600 and 200, respectively.

Table 2. Testing results and computation time of sinc datasets with different noises

Sinc dataset      Average testing SSE                       Computation time (s)
                  Proposed   Suykens'   Unweighted          Retrain   Fast
Gaussian          0.0867     0.1750     1.0753              5.87      0.43
χ²                0.0862     0.1127     2.6491
Heter-Gaussian    0.0876     0.1058     2.2242

Table 3. Comparative evaluation criteria of sinc datasets

Sinc instances    Gaussian noises        χ² noises              Heter-Gaussian noises
                  SSR/SST    SSE/SST     SSR/SST    SSE/SST     SSR/SST    SSE/SST
Proposed          0.9223     0.0040      0.9245     0.0041      0.9181     0.0040
Suykens'          0.8856     0.0091      0.9289     0.0060      0.9089     0.0049
Unweighted        0.7591     0.0576      1.0077     0.1339      0.8313     0.0876

Fig. 3. Comparative results on three simulated instances: (a1 and a2) testing results on the dataset with Gaussian noise; (b1 and b2) testing results on the dataset with χ² noise; and (c1 and c2) testing results on the dataset with Heter-Gaussian noise. Figures in the left column illustrate comparative results using unweighted LS-SVM, Suykens' WLS-SVM and the proposed WLS-SVM; figures in the right column enlarge the results of the two WLS-SVMs for clearer illustration. (a1) Dataset with Gaussian noise, (a2) enlarged part, (b1) dataset with χ² noise, (b2) enlarged part, (c1) dataset with Heter-Gaussian noise and (c2) enlarged part.

The training datasets and the testing dataset are illustrated in Fig. 2.

$$f(x) = \frac{\sin x}{x} + \xi, \quad x \in [-15, 15]$$

Here, ξ is a random fluctuation variable between −0.05 and 0.05, and

$$f^{*}(x) = \frac{\sin x}{x}$$

The RBF kernel is used in the experiments. The testing SSE values in Table 2 show that both Suykens' WLS-SVM and the proposed WLS-SVM are effective in producing robust estimates from noisy sinc samples. In particular, for the Heter-Gaussian noisy dataset and the transformed χ² noisy dataset, the unweighted LS-SVM is seriously influenced by the noise and produces biased estimates, while the WLS-SVMs dramatically reduce this side effect. For the Gaussian noisy dataset, the unweighted LS-SVM produces a relatively satisfying estimate, yet the two WLS-SVMs still further improve the results. Table 3 lists the average values of the other evaluation criteria over the 10 tests for each noisy sinc dataset: the unweighted LS-SVM produces a larger SSE/SST in all of the tests, while both Suykens' WLS-SVM and ours have a very small SSE/SST together with an increased SSR/SST. This indicates that the statistical information in the training dataset is well represented by the WLS-SVM with fairly small regression errors. Additionally, Fig. 3 provides detailed information about the tests and shows that the proposed WLS-SVM has slightly better precision and is mildly more stable.

In addition to the simulated sinc samples, a Mackey–Glass system plus Gaussian noise is also used as a test case. The Mackey–Glass system is highly chaotic even in the noiseless case, which makes it a fairly challenging regression problem [8]. We mix the 1385 samples with about 10 percent Gaussian noise and then randomly split them into two groups of samples in the proportion of 4:1. The large group is used as the training dataset, and the small one as the testing dataset. Table 4 shows the comparative regression performance on this instance. Since the testing dataset is mixed with noise, the testing SSE is no longer meaningful, so we simply present the other three evaluation criteria in Table 4. The results show that, with almost the same regression SSE, the proposed WLS-SVM has the largest value of SSR/SST, indicating that it is able to reflect more statistical information of the training dataset, while Suykens' WLS-SVM is in second place and the unweighted LS-SVM has the smallest SSR/SST. Since the differences among the SSE/SST values are fairly slight, it is reasonable to believe that the proposed algorithm is superior at capturing the statistical information of the training dataset.

As for the computation time, both Tables 2 and 4 show that the fast training algorithm takes no more than 10 percent of the time needed to retrain WLS-SVM (0.43 versus 5.87 s and 2.74 versus 33.02 s), indicating its efficient performance.

Table 4. Comparative results and computation time of the M-G system

M-G system     SSR/SST    SSE/SST    Computation time (s)
                                     Retrain    Fast
Proposed       0.8821     0.2957     33.02      2.74
Suykens'       0.7423     0.2945
Unweighted     0.7057     0.2861

5.3. Real-world datasets

For further evaluation, we test four real-world datasets. One is the Motorcycle dataset, a well-known benchmark dataset in statistics [7]. It contains 133 samples; the input values are time measurements in milliseconds after a simulated impact and the output values are measurements of head acceleration. The data are heteroscedastic and form a challenging test case. Another is the Boston Housing dataset. It consists of 506 samples; each sample has 13 features that designate the quantities influencing the price of a house in a Boston suburb and an output feature which is the house price in thousands of dollars. The last two are the Chwirut dataset and the Servo dataset. The Chwirut dataset is the result of a NIST study involving ultrasonic calibration [6]. It contains 214 samples, including a few slight outliers caused by measurement errors. The Servo dataset consists of 167 samples and covers an extremely nonlinear phenomenon: predicting the rise time of a servomechanism in terms of two continuous gain settings and two discrete choices of mechanical linkages [14]. For these four datasets, we randomly split the samples into two groups in the proportion of 4:1; the large group is used as the training dataset and the small group as the testing dataset. The optimal hyperparameters are selected using the fast leave-one-out algorithm [4,28] and are set to γ = 25, σ = 6.6 for the Motorcycle dataset, γ = 25, σ = 1 for the Boston Housing dataset, γ = 75, σ = 2 for the Chwirut dataset and γ = 625, σ = 2 for the Servo dataset.

Table 5 shows the comparative results of the proposed algorithms. For the Motorcycle dataset, the proposed WLS-SVM obtains the largest SSR/SST, Suykens' WLS-SVM is in second place and the unweighted LS-SVM produces the smallest. The value of SSE/SST increases accordingly as SSR/SST increases. However, on closer observation, we find that the proposed WLS-SVM has the smallest SSE/SSR, that is, the largest SSR/SST relative to its SSE/SST. From this perspective, it is fairly reasonable to believe that the proposed WLS-SVM is comparatively suitable for Motorcycle. As for the Boston Housing dataset, although both the proposed WLS-SVM and Suykens' WLS-SVM gain a larger SSR/SST, they also produce a larger SSE/SST. This indicates that they obtain additional statistical information at the cost of a larger loss of regression precision; whether it is worthwhile to use these two WLS-SVMs thus still requires further investigation and additional prior knowledge of the dataset. As for the computational time, the results on these two datasets again report an excellent performance of the fast iterative updating algorithm. Results on the Chwirut dataset are also fairly satisfying: the proposed weighting algorithm achieves a large SSR/SST while keeping a small SSE/SST. The Servo dataset is a particular case. It covers an extremely nonlinear phenomenon, and Suykens' weighting method fails to set correct weights on the samples, causing a drastic reduction of SSR/SST and a serious increase of SSE/SST. The proposed weighting method detects the extremely nonlinear phenomenon and sets the weights of all samples to 1.0, and therefore produces the same SSR/SST and SSE/SST as the unweighted LS-SVM.

Table 5. Comparative results and computation time of real-world datasets

Real-world dataset   Motorcycle        Boston housing    Chwirut           Servo
                     C1       C2       C1       C2       C1       C2       C1       C2
Proposed             0.8937   0.2205   0.9541   0.0670   0.9854   0.0197   0.9628   0.0177
Suykens'             0.8143   0.2026   0.9476   0.0598   0.9706   0.0192   0.7794   0.1635
Unweighted           0.7851   0.2014   0.9271   0.0541   0.9688   0.0188   0.9628   0.0177

Time (s)
Retrain              0.059             3.688             0.219             0.122
Fast                 0.007             0.128             0.015             0.008

Notes: C1 represents the criterion SSR/SST and C2 represents the criterion SSE/SST.

As for the computational time, the records in Table 5 demonstrate the encouraging performance of the fast training algorithm: it performs much better than directly retraining the LS-SVM, saving considerable computation and requiring less than a tenth of the calculation time of direct retraining.

6. Conclusions

A weight-setting method independent of the unweighted LS-SVM is proposed for WLS-SVM. This method is based on a simple idea from the research field of outlier mining, and it obtains results comparable to Suykens' WLS-SVM in most of the test cases. This study indicates that algorithms mining prior knowledge of the dataset can be effectively used in the weight-setting stage of WLS-SVM and may produce satisfying results. More impressively, a fast training algorithm for WLS-SVM is presented. It is based on numerical techniques and can be viewed as an iterative updating procedure starting from the unweighted LS-SVM. The fast algorithm reaches the final results of WLS-SVM in much less computational time and can be easily carried out. The fast updating algorithm opens a new way to enhance the training procedure of recursive weighted SVM, which is what we are interested in for future work.

Acknowledgements

This work has been supported by the National Natural Science Foundation of China (10471045, 60433020), the Program for New Century Excellent Talents in University (NCET-05-0734), the Natural Science Foundation of Guangdong Province (031360, 04020079), the Excellent Young Teachers Program of the Ministry of Education of China, the Fok Ying Tong Education Foundation (91005), the Social Science Research Foundation of MOE (2005-241), the Key Technology Research and Development Program of Guangdong Province (2005B10101010, 2005B70101118), the Key Technology Research and Development Program of Tianhe District (051G041), the Natural Science Foundation of South China University of Technology (B13-E5050190), the Open Research Fund of the National Mobile Communications Research Laboratory (A200605), and the Open Research Fund of the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education (93K-17-2006-03).

References

[1] D.M. Allen, The relationship between variable selection and prediction, Technometrics 16 (1974) 125–127.
[2] M.S. Bartlett, An inverse matrix adjustment arising in discriminant analysis, Ann. Math. Stat. 22 (1) (1951) 107–111.
[3] D.M. Bates, D.G. Watts, Nonlinear Regression Analysis and its Applications, Wiley, New York, 1988.
[4] G.C. Cawley, N.L.C. Talbot, Fast exact leave-one-out cross-validation of sparse least squares support vector machines, Neural Networks 17 (10) (2004) 1467–1475.
[5] C.C. Chuang, S.F. Su, J.T. Jeng, C.C. Hsiao, Robust support vector regression networks for function approximation with outliers, IEEE Trans. Neural Networks 13 (6) (2002) 1322–1330.
[6] D. Chwirut, Chwirut dataset, NIST nonlinear regression dataset, 1975. Available at www.itl.nist.gov/div898/strd/nls/data/chwirut1.shtml.
[7] R.L. Eubank, Nonparametric Regression and Spline Smoothing, second ed., Statistics: Textbooks and Monographs, Vol. 157, Marcel Dekker, New York, 1999.
[8] G.W. Flake, S. Lawrence, Efficient SVM regression training with SMO, Machine Learning, Vol. 46, Kluwer Academic Publishers, Netherlands, 2002, pp. 271–290.
[9] H.P. Huang, Y.H. Liu, Fuzzy support vector machines for pattern recognition and data mining, Int. J. Fuzzy Systems 4 (2002) 826–835.
[10] J.Q. Jiang, C.Y. Song, C.G. Wu, M. Maurizio, Y.C. Liang, Support vector machine regression algorithm based on chunking incremental learning, in: Proceedings of ICCS'06, Lecture Notes in Computer Science, Vol. 3991, Springer, Berlin, 2006, pp. 547–554.
[11] C.F. Lin, S.D. Wang, Fuzzy support vector machines, IEEE Trans. Neural Networks 13 (2002) 464–471.
[12] C.F. Lin, S.D. Wang, Training algorithms for fuzzy support vector machines with noisy data, Pattern Recognition Lett. 25 (2004) 1647–1656.
[13] C.F. Lin, S.D. Wang, Fuzzy support vector machines with automatic membership setting, StudFuzz 17 (2005) 233–254.
[14] J.R. Quinlan, Servo dataset, UCI Machine Learning Database Repository, 1993. Available at www.ics.uci.edu/~mlearn/databases/servo.
[15] C. Saunders, A. Gammerman, V. Vovk, Ridge regression learning algorithm in dual variables, in: Proceedings of ICML'98, 1998, pp. 2758–2765.
[16] B. Scholkopf, A.J. Smola, R.C. Williamson, P.L. Bartlett, New support vector algorithms, Neural Comput. 12 (4) (2000) 1207–1245.
[17] A.J. Smola, B. Scholkopf, A tutorial on support vector regression, Statist. Comput. 14 (3) (2004) 199–222.
[18] R.G. Staudte, S.J. Sheather, Robust Estimation and Testing, Wiley Series in Probability and Mathematical Statistics, Wiley, New York, 1990.
[19] J.A.K. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett. 9 (1999) 293–300.
[20] J.A.K. Suykens, J. De Brabanter, L. Lukas, J. Vandewalle, Weighted least squares support vector machines: robustness and sparse approximation, Neurocomputing 48 (2002) 85–105.
[21] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[22] V. Vapnik, S. Golowich, A. Smola, Support vector method for function approximation, regression estimation and signal processing, Adv. Neural Inform. Process. Systems 9, MIT Press, 1997, pp. 281–287.
[23] S. Weisberg, Applied Linear Regression, second ed., Wiley, New York, 1985.
[24] W. Wen, Z.F. Hao, Z.F. Shao, X.W. Yang, M. Chen, A heuristic weight-setting algorithm for robust weighted least squares support vector regression, in: Proceedings of ICONIP'06, Vol. 4232, Springer, Berlin, 2006, pp. 773–781.
[25] C.H. Wu, Travel-time prediction with support vector regression, IEEE Trans. Intelligent Transp. Systems 5 (2004) 276–281.
[26] H.Q. Yang, L.W. Chan, I. King, Support vector machine regression for volatile stock market prediction, in: Proceedings of IDEAL'02, Lecture Notes in Computer Science, Vol. 2412, Springer, Berlin, 2002, pp. 391–396.
[27] J.S. Zhang, G. Gao, Reweighted robust support vector regression method, Chin. J. Comput. Sci. 28 (7) (2005) 1171–1177.
[28] Y. Zhao, K.C. Keong, Fast leave-one-out evaluation and improvement on inference for LS-SVMs, in: Proceedings of ICPR'04, 2004.

Wen Wen received her B.S. degree in Applied Mathematics from South China University of Technology in 2003. From 2003 to 2008, she studied in the College of Computer Science and Engineering, South China University of Technology, and she is expected to receive the Ph.D. degree in July 2008. Her current research interests include support vector machines, kernel methods, evolutionary computation and image processing.

Zhifeng Hao received his B.S. degree in mathematics from Zhongshan University in 1990. He studied at Nanjing University from 1990 to 1995 and received the Ph.D. degree in mathematics in 1995. In 2001, he was a visiting researcher at Rutgers, the State University of New Jersey. Since 1995, he has been with the faculty of the School of Mathematical Sciences at South China University of Technology, where he was promoted to Associate Professor in 1997 and Full Professor in 2000. He is now an authorized Ph.D. supervisor in the College of Computer Science and Engineering, South China University of Technology. He has published more than 60 academic papers on applied mathematics, evolutionary computation, support vector machines and bioinformatics. His current research interests include support vector machines, neural networks, evolutionary computation and bioinformatics.

Xiaowei Yang obtained his Ph.D. degree in solid mechanics from the Department of Engineering Mechanics, Jilin University in 2000, the Master degree in computational mechanics from the Department of Mathematics, Jilin University in 1996, and the B.A. degree in theoretical mechanics and applications from the Department of Mathematics, Jilin University in 1991. From January to May 2003, he was a research fellow at the National University of Singapore. From February to August 2006, he was a visiting scholar at the University of Technology, Sydney. He is currently an Associate Professor at the School of Mathematical Science, South China University of Technology. He has over 70 publications in journals and conference proceedings. His research interests are mainly in the fields of support vector machines, intelligent computation, topology optimization and computational mechanics.