
F.L. Wang et al. (Eds.): AICI 2010, Part I, LNAI 6319, pp. 266–272, 2010. © Springer-Verlag Berlin Heidelberg 2010

A New Smooth Support Vector Machine

Jinjin Liang 1 and De Wu 2

1 Department of Mathematical Sciences, Xi'an Shiyou University, Xi'an, Shaanxi, China
[email protected]
2 Department of Computer Sciences, Xidian University, Xi'an, Shaanxi, China
[email protected]

Abstract. A new Smooth Support Vector Machine (SSVM) is proposed, called NSSVM for short. Different from the traditional SSVM, which treats a perturbation formulation of SVM, NSSVM treats the standard 2-norm error soft margin SVM. Different from the traditional SSVM, which uses the 2-norm of the Lagrangian multipliers vector to roughly substitute that of the weight of the separating hyperplane and thereby makes the obtained smooth model unequal to the primal program, NSSVM takes into account the connotative relation between the primal and dual programs to transform the original program into a new smooth one. Numerical experiments on several UCI datasets demonstrate that NSSVM achieves higher precision than existing methods.

Keywords: Smooth Support Vector Machine; 2-norm error soft margin SVM; connotative relation; primal and dual program.

1 Introduction

Support Vector Machine (SVM), developed by V. Vapnik and his co-workers, is a promising method for data classification and regression based on statistical learning theory and dual programming [1,2]. SVM is regarded as a machine learning method suited to small-sample sets and is built on the structural risk minimization principle [3, 4], which minimizes an upper bound on the generalization error. When the points are separable, we compute hard margin classifiers that completely separate the two classes while maximizing their margin; but the points are often non-separable, so soft margin classifiers are introduced, which tolerate misclassified points. The commonly used formulations are the 1-norm and the 2-norm error soft margin SVM. The former is called the box constrained SVM, since the Lagrangian multipliers vector satisfies $0 \le \alpha \le C$; the latter is called the diagonal weighted SVM, since the Lagrangian multipliers vector satisfies $0 \le \alpha = C\xi$.

This paper focuses on making use of smoothing techniques, which have been extensively used for solving important mathematical programming problems [5, 6], to


obtain a fast and iterative algorithm. Reference [7] first applied smoothing techniques to a specific reformulation of SVM, obtained by appending an additional bias term $\gamma^2/2$ to the standard 2-norm error soft margin SVM, a technique that does cause changes in accuracy [8]; the resulting model is SSVM. Based on SSVM, diverse smooth methods have been proposed [9,10,11,12]. These methods aim at designing new smooth functions or at generalizing them to the regression case, but none of them proves the equivalence between the obtained models and the original program. In fact, in all of the above methods the 2-norm of the error is roughly used to substitute that of the weight of the hyperplane, which makes the obtained smooth models unequal to the primal program.

* Supported by National Science Foundation of China under Grant No. 60574075, 60674108.

This paper investigates a new Smooth Support Vector Machine (NSSVM), which overcomes the above mentioned disadvantages. The problem treated is the standard 2-norm error soft margin SVM. We begin with the linear case to obtain an unconstrained optimization problem; we then take into account the connotative relation between the primal and dual programs to generalize it to the kernel case. Upon obtaining the smooth formulations, we apply the Newton algorithm to find the optimal solution. Various numerical experiments on UCI datasets demonstrate its applicability.

2 New Linear Smooth SVM

Denote the training set by $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$ with $x_i \in X \subseteq R^n$, and the label set by $L = \{y_1, y_2, \ldots, y_m\}$ with $y_i \in \{1, -1\}$. Training the standard 2-norm error soft margin SVM amounts to solving the following program:

$$\min_{(w,\gamma,\xi) \in R^{n+1+m}} \ \frac{1}{2} w^T w + \frac{v}{2} \sum_{i=1}^{m} \xi_i^2 \quad \text{s.t. } y_i (w^T x_i - \gamma) \ge 1 - \xi_i, \ i = 1, \ldots, m. \qquad (1)$$

Here $w$ is the normal to the separating plane, $\gamma$ is the offset, and $\xi_i$ is the error of the $i$-th training sample; the linear separating hyperplane is $w^T x - \gamma = 0$. Writing the training set and the label set in matrix form as $A \in R^{m \times n}$ and $D = Diag(L) \in R^{m \times m}$, program (1) is equivalent to (2):

$$\min_{(w,\gamma,\xi) \in R^{n+1+m}} \ \frac{1}{2} w^T w + \frac{v}{2} \xi^T \xi \quad \text{s.t. } D(Aw - e\gamma) \ge e - \xi, \ \ \xi \ge 0. \qquad (2)$$

Write the error as $\xi = \max(e - D(Aw - e\gamma), 0) = (e - D(Aw - e\gamma))_+$. Using this equation, we can convert (2) into its equivalent unconstrained formulation:

$$\min_{(w,\gamma) \in R^{n+1}} \ \frac{1}{2} w^T w + \frac{v}{2} \| (e - D(Aw - e\gamma))_+ \|^2. \qquad (3)$$
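To make the reformulation concrete, the following sketch (Python/NumPy; an illustration rather than the authors' MATLAB implementation, with toy data, the function name objective_3 and the choice $v = 1$ being ours) evaluates the objective of (3) for a given pair $(w, \gamma)$:

import numpy as np

def objective_3(w, gamma, A, y, v):
    # unconstrained objective (3): 0.5*w'w + (v/2)*||(e - D(Aw - e*gamma))_+||^2
    D = np.diag(y)                                       # D = Diag(L), labels are +/-1
    e = np.ones(A.shape[0])
    xi = np.maximum(e - D @ (A @ w - e * gamma), 0.0)    # error vector via the plus function
    return 0.5 * w @ w + 0.5 * v * xi @ xi

# toy usage
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 2))
y = np.sign(A[:, 0] + 0.1)
print(objective_3(np.array([1.0, 0.0]), 0.0, A, y, v=1.0))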


It is easy to prove that this is a convex program, but the objective function is not differentiable, which precludes the use of a fast Newton method. Defining the entropy penalty function as in (4), we have two lemmas.

$$P_\beta(t) = t + \beta^{-1} \ln(1 + \exp(-\beta t)). \qquad (4)$$

Lemma 1. $P_\beta(t)$ is a strictly convex function.

Lemma 2. [7] For $x \in R$ and $|x| < \rho$: $P_\beta(x)^2 - (x_+)^2 \le \left(\frac{\ln 2}{\beta}\right)^2 + \frac{2\rho}{\beta}\ln 2$, where $P_\beta(x)$ is defined as in (4) with smoothing parameter $\beta > 0$.

Corollary 1. The entropy penalty function $P_\beta(t) = t + \beta^{-1}\ln(1 + \exp(-\beta t))$ converges to the plus function $t_+ = \max\{0, t\}$ as $\beta \to +\infty$, where $\beta > 0$ is the smoothing parameter.
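A quick numerical check of Corollary 1 (an illustrative Python sketch; the test points and $\beta$ values are arbitrary choices of ours):

import numpy as np

def p_beta(t, beta):
    # entropy penalty (4): t + (1/beta)*ln(1 + exp(-beta*t)),
    # evaluated in the equivalent overflow-safe form (1/beta)*ln(1 + exp(beta*t))
    return np.logaddexp(0.0, beta * t) / beta

t = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
for beta in (1.0, 10.0, 100.0, 1000.0):
    gap = np.max(np.abs(p_beta(t, beta) - np.maximum(t, 0.0)))
    print(beta, gap)   # the largest gap (attained at t = 0) equals ln(2)/beta and vanishes as beta grows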

Then we obtain the smooth model of the 2-norm error soft margin SVM as follows.

$$\min_{(w,\gamma) \in R^{n+1}} \ \frac{1}{2} w^T w + \frac{v}{2} \| P_\beta(e - D(Aw - e\gamma)) \|^2. \qquad (5)$$
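In code, the smooth model (5) simply replaces the plus function in (3) by $P_\beta$; a minimal sketch (again illustrative Python with our own helper names, not the authors' code):

import numpy as np

def p_beta(t, beta):
    # smooth plus function (4), overflow-safe form
    return np.logaddexp(0.0, beta * t) / beta

def objective_5(w, gamma, A, y, v, beta):
    # smooth objective (5): 0.5*w'w + (v/2)*||P_beta(e - D(Aw - e*gamma))||^2
    D = np.diag(y)
    e = np.ones(A.shape[0])
    s = p_beta(e - D @ (A @ w - e * gamma), beta)
    return 0.5 * w @ w + 0.5 * v * s @ s

This objective is smooth for any $\beta > 0$ and, by Corollary 1, approaches the objective of (3) as $\beta$ grows.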

3 New Kernel Smooth SVM

In the kernel case, suppose a nonlinear map $\phi(x)$ is used to map the original data into a high dimensional feature space. The optimization program is as follows:

$$\min_{(w,\gamma,\xi) \in R^{n+1+m}} \ \frac{1}{2} w^T w + \frac{v}{2} \sum_{i=1}^{m} \xi_i^2 \quad \text{s.t. } y_i (w^T \phi(x_i) - \gamma) \ge 1 - \xi_i, \ i = 1, \ldots, m. \qquad (6)$$

As in the linear smooth case, we aim at deriving the corresponding smooth model in the kernel case for the 2-norm error soft margin SVM. Taking into account the connotative relation between the primal program and the dual one, we transform the original program into a smooth model and prove the equivalence between the obtained model and the original one.

It is known that, at the optimal solution, $w$ is a linear combination of the training data:

$$w = \sum_{i=1}^{m} \alpha_i y_i \phi(x_i). \qquad (7)$$

Making use of the above relation, we have the following two equations:

$$y_i w^T \phi(x_i) = y_i \sum_{j=1}^{m} \alpha_j y_j \phi(x_j)^T \phi(x_i) = \sum_{j=1}^{m} Q_{ij} \alpha_j = Q_i \alpha, \qquad (8)$$

$$w^T w = \sum_{i=1}^{m} \alpha_i y_i \phi(x_i)^T w = \sum_{i=1}^{m} \alpha_i Q_i \alpha = \alpha^T Q \alpha, \qquad (9)$$

where $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ and $Q_i$ denotes the $i$-th row of $Q$.
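The substitution in (7)-(9) can be checked numerically; the sketch below uses the linear kernel $\phi(x) = x$ and the definition of $Q$ implied by (8) (an illustration on toy data of ours, not part of the paper):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))                   # rows are the training points x_i
y = np.where(rng.random(8) < 0.5, -1.0, 1.0)
alpha = rng.random(8)

D = np.diag(y)
K = A @ A.T                                   # linear kernel: K(x_i, x_j) = x_i' x_j
Q = D @ K @ D                                 # Q_ij = y_i y_j phi(x_i)' phi(x_j)

w = A.T @ (alpha * y)                         # w = sum_i alpha_i y_i phi(x_i), equation (7)
print(np.allclose(y * (A @ w), Q @ alpha))    # equation (8)
print(np.allclose(w @ w, alpha @ Q @ alpha))  # equation (9)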

So (6) can be expressed as follows:


$$\min_{(\alpha,\gamma,\xi) \in R^{m+1+m}} \ F(\alpha,\gamma,\xi) = \frac{1}{2} \alpha^T Q \alpha + \frac{v}{2} \xi^T \xi \quad \text{s.t. } Q\alpha - Y\gamma \ge e - \xi, \qquad (10)$$

where $Y = De$ is the label vector.

Though (10) is different from (6), we will prove that for any optimal $\alpha^*$ of (10), $w^* = \sum_{i=1}^{m} \alpha_i^* y_i \phi(x_i)$ is an optimal solution of (6).

Theorem 1. Denote $w = \sum_{i=1}^{m} \alpha_i y_i \phi(x_i)$. If $(\alpha^*, \gamma^*, \xi^*)$ is the optimal solution of (10), then $(w^*, \gamma^*, \xi^*)$ is the optimal solution of (6).

Proof: Equations (8) and (9) show that a point $(w, \gamma, \xi)$ with $w$ given by (7) is feasible for (6) if and only if the corresponding $(\alpha, \gamma, \xi)$ is feasible for (10); in particular, $(w^*, \gamma^*, \xi^*)$ is feasible for (6) and $(\alpha^*, \gamma^*, \xi^*)$ is feasible for (10).

Now we prove that the optimality of $(\alpha^*, \gamma^*, \xi^*)$ for (10) implies that $(w^*, \gamma^*, \xi^*)$ is the optimal solution of (6).

Using equation (7), we have (11) for any feasible $(\alpha, \gamma, \xi)$ of (10) and the corresponding feasible $(w, \gamma, \xi)$ of (6), with $w$ given by (7):

$$\frac{1}{2} \alpha^T Q \alpha + \frac{v}{2} \xi^T \xi = \frac{1}{2} w^T w + \frac{v}{2} \xi^T \xi \ge \frac{1}{2} (w^*)^T w^* + \frac{v}{2} (\xi^*)^T \xi^* = \frac{1}{2} (\alpha^*)^T Q \alpha^* + \frac{v}{2} (\xi^*)^T \xi^*. \qquad (11)$$

Thus, $(w^*, \gamma^*, \xi^*)$ is the optimal solution to (6).

Having Theorem 1, we can obtain the optimal solution of (6) by solving (10). According to Corollary 1, (10) is equivalent to the following smooth formulation in the kernel case:

$$\min_{(\alpha,\gamma) \in R^{m+1}} \ \frac{1}{2} \alpha^T Q \alpha + \frac{v}{2} \| P_\beta(e - (Q\alpha - Y\gamma)) \|^2. \qquad (12)$$
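A sketch of the kernel objective (12), in the same illustrative Python style (helper names are ours; $Q = DKD$ follows the definition implied by (8), so this is meant only to show the structure of the formula, not the authors' implementation):

import numpy as np

def p_beta(t, beta):
    return np.logaddexp(0.0, beta * t) / beta           # smooth plus function (4)

def objective_12(alpha, gamma, K, y, v, beta):
    # kernel smooth objective (12): 0.5*a'Qa + (v/2)*||P_beta(e - (Q a - Y gamma))||^2
    D = np.diag(y)
    Q = D @ K @ D                                       # Q_ij = y_i y_j K(x_i, x_j)
    Y = y                                               # Y = De, the label vector
    e = np.ones(len(y))
    s = p_beta(e - (Q @ alpha - Y * gamma), beta)
    return 0.5 * alpha @ Q @ alpha + 0.5 * v * s @ s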

4 NSSVM Implementation

To apply the Newton algorithm, we first have to prove the convexity of (5) and (12) and derive the corresponding gradient vectors and Hessian matrices, which can be obtained directly by computing the first and second order derivatives of the two programs. The computation is straightforward, so the detailed process is omitted.

For any vector $z \in R^m$, let $Diag(z)$ denote the diagonal matrix whose diagonal entries are the components of $z$. Introducing the notations $t = e - D(Aw - e\gamma)$, $v = \exp(-\beta t)$ (taken componentwise; the penalty parameter is written as $C$ below), $Q = DA$, $Y = De$ and $M = I + \beta\, Diag(v)\, Diag(P_\beta(t))$, we obtain the formulas of the gradient vector and the Hessian matrix of NSSVM in the linear case.


$$\nabla F_\beta(x) = \begin{bmatrix} w \\ 0 \end{bmatrix} + C \begin{bmatrix} -Q^T \\ Y^T \end{bmatrix} Diag(P_\beta(t)) \,(Diag(e+v))^{-1} e, \qquad (13)$$

where $x = (w^T, \gamma)^T$.

$$\nabla^2 F_\beta(x) = \begin{bmatrix} I & 0 \\ 0 & 0 \end{bmatrix} + C \begin{bmatrix} -Q^T \\ Y^T \end{bmatrix} (Diag(e+v))^{-2}\, M \begin{bmatrix} -Q & Y \end{bmatrix}. \qquad (14)$$
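The two formulas translate directly into code. The sketch below is an illustrative NumPy transcription of (13) and (14) for the linear case (the function name is ours, the $(-Q, Y)$ sign convention follows the formulas above, and for very large $\beta$ the limiting forms discussed in Section 5 should be used instead of forming $\exp(-\beta t)$ directly):

import numpy as np

def grad_hess_linear(w, gamma, A, y, C, beta):
    # gradient (13) and Hessian (14) of the smooth objective, linear case
    m, n = A.shape
    D = np.diag(y)
    e = np.ones(m)
    Q, Y = D @ A, D @ e                          # Q = DA, Y = De
    t = e - D @ (A @ w - e * gamma)              # t = e - D(Aw - e*gamma)
    v = np.exp(-beta * t)
    p = np.logaddexp(0.0, beta * t) / beta       # P_beta(t)
    J = np.hstack([-Q, Y[:, None]])              # Jacobian of t w.r.t. x = (w, gamma) is [-Q, Y]
    grad = np.concatenate([w, [0.0]]) + C * (J.T @ (p / (1.0 + v)))
    M_diag = 1.0 + beta * v * p                  # diagonal of M = I + beta*Diag(v)*Diag(P_beta(t))
    H0 = np.zeros((n + 1, n + 1))
    H0[:n, :n] = np.eye(n)                       # the [I 0; 0 0] block
    hess = H0 + C * (J.T @ (J * (M_diag / (1.0 + v) ** 2)[:, None]))   # equation (14)
    return grad, hess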

To obtain the formulas of NSSVM in the kernel case, we define $t = e - (DK(A, A^T)\alpha - De\gamma)$ and $Q = DK(A, A^T)$; then the gradient vector and the Hessian matrix take the following forms, in which $v$, $Y$ and $M$ are the same as in the linear case.

$$\nabla F_\beta(x) = \begin{bmatrix} Q\alpha \\ 0 \end{bmatrix} + C \begin{bmatrix} -Q^T \\ Y^T \end{bmatrix} Diag(P_\beta(t)) \,(Diag(e+v))^{-1} e, \qquad (15)$$

where now $x = (\alpha^T, \gamma)^T$.

$$\nabla^2 F_\beta(x) = \begin{bmatrix} Q & 0 \\ 0 & 0 \end{bmatrix} + C \begin{bmatrix} -Q^T \\ Y^T \end{bmatrix} (Diag(e+v))^{-2}\, M \begin{bmatrix} -Q & Y \end{bmatrix}. \qquad (16)$$

Using the expressions of $\nabla F_\beta((\alpha^T, \gamma)^T)$ and $\nabla^2 F_\beta((\alpha^T, \gamma)^T)$, we can easily deduce Theorem 2.

Theorem 2. Programs (5) and (12) are both convex programs.

Having Theorem 2, we can apply the Newton algorithm with an Armijo stepsize to compute the optimal solution.
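A minimal Newton-Armijo loop in the same illustrative style (the tolerance, backtracking constants and safeguard ridge are our choices, not the authors'); it can be driven, for example, by pairing the grad_hess_linear sketch above with a matching objective such as objective_5, via small wrappers that split $x$ into $(w, \gamma)$:

import numpy as np

def newton_armijo(x0, fun, grad_hess, tol=1e-6, max_iter=50, sigma=1e-4):
    # damped Newton method with Armijo backtracking for a smooth convex objective;
    # fun(x) returns the objective value, grad_hess(x) returns (gradient, Hessian)
    x = x0.copy()
    for _ in range(max_iter):
        g, H = grad_hess(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(H + 1e-10 * np.eye(len(x)), -g)   # Newton direction (tiny ridge for safety)
        step = 1.0
        while fun(x + step * d) > fun(x) + sigma * step * (g @ d) and step > 1e-8:
            step *= 0.5                                        # Armijo backtracking
        x = x + step * d
    return x

Using a constant unit step instead of the backtracking loop roughly corresponds to switching the Armijo line search off, as examined in the experiments below.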

5 Numerical Experiments

We now demonstrate the effectiveness and speed of NSSVM on several real-world datasets from the UCI machine learning repository.

All the experiments are carried out on a PC with a P4 3.06 GHz CPU and 1 GB memory, using MATLAB 7.01. When computing the gradient vector $\nabla F_\beta((\alpha^T, \gamma)^T)$ and the Hessian matrix $\nabla^2 F_\beta((\alpha^T, \gamma)^T)$, the limits of $Diag(P_\beta(t))\,(Diag(e+v))^{-1}$ and $(Diag(e+v))^{-2}(I + \beta\, Diag(v)\, Diag(P_\beta(t)))$ are used.
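The reason limits are needed is that $\exp(-\beta t)$ overflows for large $\beta$ and negative $t$. One illustrative way to evaluate the two diagonal weights stably (a sketch of ours; the exact safeguards used in the authors' MATLAB code are not specified in the paper):

import numpy as np

def stable_weights(t, beta):
    # w1 = P_beta(t) / (1 + exp(-beta*t))                              (gradient weight in (13), (15))
    # w2 = (1 + beta*exp(-beta*t)*P_beta(t)) / (1 + exp(-beta*t))**2   (Hessian weight in (14), (16))
    p = np.logaddexp(0.0, beta * t) / beta                             # P_beta(t), overflow-safe
    sig = 1.0 / (1.0 + np.exp(-np.clip(beta * t, -700.0, 700.0)))      # sigmoid(beta*t) = 1/(1+exp(-beta*t))
    w1 = p * sig
    w2 = sig * sig + beta * p * sig * (1.0 - sig)   # uses exp(-bt)/(1+exp(-bt))^2 = sig*(1-sig)
    return w1, w2

As $\beta t \to -\infty$ both weights tend to 0, and as $\beta t \to +\infty$ they tend to $P_\beta(t)$ and 1 respectively; these are the limits referred to above.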

First, we carry out the experiments in the linear case. Three moderately sized datasets are used: the Bupa Liver, the Ionosphere and the Pima Indians. To further verify the advantages of NSSVM, we compare it with RLP, SVM||1||, SVM||2|| and FSV in terms of the ten-fold training and testing accuracies.


Table 1. Comparisons with Various Methods

Data     NSSVM            RLP              SVM||1||         SVM||2||         FSV
Bupa     71.60 / 70.19    68.98 / 64.34    67.83 / 64.03    70.57 / 69.86    71.21 / 69.81
Iono.    94.94 / 89.18    94.78 / 86.04    88.92 / 86.10    92.96 / 89.17    94.87 / 86.76
Pima     78.13 / 77.99    76.48 / 76.16    75.52 / 74.47    77.92 / 77.07    77.91 / 76.96

(Each cell gives the ten-fold training / testing accuracy in %.)

The results are reported in Table 1; NSSVM attains the highest ten-fold training and testing accuracies on all three datasets. On the Ionosphere data, its training and testing accuracies are about 0.16% and 3.14% higher than RLP, about 6.02% and 3.08% higher than SVM||1||, about 1.98% and 0.01% higher than SVM||2||, and about 0.07% and 2.42% higher than FSV.

In fact, SSVM has been shown to be more effective than RLP, SVM||1||, SVM||2||, SOR, FSV, SMO and SVMlight [12], so in the following we compare the performance of NSSVM with that of SSVM on the Bupa Liver dataset, using the Gaussian radial basis kernel, to further demonstrate its effectiveness.

Table 2. Comparisons with SSVM

Method   Space   Arm.   Iter.   Time   Tr. Cor.   Ts. Cor.
SSVM     Lin.    1      4.1     0.01   70.11%     68.08%
SSVM     Lin.    0      4.3     0.01   70.08%     67.83%
SSVM     Ker.    1      2       0.21   100%       60.36%
SSVM     Ker.    0      2       0.20   100%       60.29%
NSSVM    Lin.    1      4.4     0.01   70.53%     69.56%
NSSVM    Lin.    0      4.3     0.01   70.37%     68.43%
NSSVM    Ker.    1      3.3     0.73   100%       61.12%
NSSVM    Ker.    0      3.6     0.67   100%       60.91%

In the above table, NSSVM attains training and testing accuracies at least as high as SSVM in every corresponding setting.

Apparently, NSSVM has higher accuracies than SSVM in both the linear and the kernel case, which confirms that removing the bias term $\gamma^2/2$ from the SSVM model does increase the accuracy, while the number of iterations and the training time remain almost unchanged. Although the Armijo stepsize guarantees the global and quadratic convergence of the Newton algorithm, it brings only slight increases in accuracy and slightly longer training time, so in practice it can be turned off. Using a kernel function, much better results are obtained with fewer iterations.


6 Conclusions

This paper investigates a new smooth support vector machine, NSSVM, which is an unconstrained smooth formulation of the standard 2-norm error soft margin SVM, and proposes a Newton-Armijo algorithm to compute the optimal solution. By using the connotative relation between the primal and dual programs, rather than roughly replacing the 2-norm of the weight of the hyperplane with that of the Lagrangian multipliers vector, NSSVM can be easily extended to the kernel case. Numerical results show that NSSVM has higher accuracies than existing methods. Future work includes finding other smooth penalty functions and searching for new efficient algorithms to solve the unconstrained smooth model.

Acknowledgement

The authors give sincere thanks to the kind editors and anonymous reviewers for their valuable suggestions and comments, which have helped to improve the manuscript considerably.

References

1. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (2000)
2. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, 121–167 (1998)
3. Yu, H., Han, J., Chang, K.C.: PEBL: Positive-example based learning for Web page classification using SVM. In: Proc. 8th Int. Conf. Knowledge Discovery and Data Mining, Edmonton, Canada (2002)
4. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge University Press, Cambridge (2000)
5. Chen, X., Qi, L., Sun, D.: Global and superlinear convergence of the smoothing Newton method and its application to general box constrained variational inequalities. Mathematics of Computation 67, 519–540 (1998)
6. Qi, L., Sun, D.: Smoothing functions and smoothing Newton method for complementarity and variational inequality problems. Journal of Optimization Theory and Applications 113(1), 121–147 (2002)
7. Lee, Y.-J., Mangasarian, O.L.: SSVM: A smooth support vector machine for classification. Computational Optimization and Applications 22(1), 5–21 (2001)
8. Hsu, C.-W., Lin, C.-J.: A simple decomposition method for support vector machines. Machine Learning 46, 291–314 (2002)
9. Yuan, Y.-b., Yan, J., Xu, C.-x.: Polynomial smooth support vector machine (PSSVM). Chinese Journal of Computers 28(1), 9–71 (2005) (in Chinese)
10. Lee, Y.J., Hsieh, W.F., Huang, C.M.: SSVR: A smooth support vector machine for insensitive regression. IEEE Transactions on Knowledge and Data Engineering 17(5), 5–22 (2005)
11. Yan-Feng, F., De-Xian, Z., Hua-Can, H.: Smooth SVM Research: A Polynomial-based Approach. In: Proceedings of the Sixth Conference on Information, Communications & Signal Processing, pp. 1–5 (2007)
12. Jin-zhi, X., Jin-lian, H., Hua-qiang, Y., Tian-ming, H., Guang-ming, L.: Research on a New Class of Functions for Smoothing Support Vector Machines. Acta Electronica Sinica 35(2), 366–370 (2007)