Estimation of the Regularization Parameter for Support Vector Regression

E.M. Jordaan*, G.F. Smits†

*Department of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands
†Materials Sciences and Information Research, Dow Benelux B.V., Terneuzen, The Netherlands.
Abstract - Support Vector Machines use a regularization parameter C to regulate the trade-off between the complexity of the model and the empirical risk of the model. Most of the techniques available for determining the optimal value of C are very time consuming. For industrial applications of the SVM method, there is a need for a fast and robust method to estimate C. In this paper a method based on the characteristics of the kernel, the range of output values and the size of the ε-insensitive zone is proposed.
I INTRODUCTION
The Support Vector Machine as a learning machine was first suggested by Vapnik in the early 1990's [7]. Originally it was derived for classification applications, but since the mid 1990's it has been applied to regression and feature selection problems as well. The SVM formulation in the case of regression, for a given learning or training data set $\{(x_i, y_i)\}_{i=1}^{\ell}$ with $x_i \in X \subseteq \mathbb{R}^n$ and $y_i \in \mathbb{R}$, is to minimize

$$\frac{1}{2}\|w\|^2 + \frac{C}{\ell}\sum_{i=1}^{\ell}\left(\xi_i^2 + \xi_i^{*\,2}\right) \tag{1}$$

subject to

$$y_i - \big((w \cdot x_i) + b\big) \le \varepsilon + \xi_i, \qquad i = 1, \ldots, \ell,$$

$$\big((w \cdot x_i) + b\big) - y_i \le \varepsilon + \xi_i^*, \qquad i = 1, \ldots, \ell,$$

$$\xi_i,\ \xi_i^* \ge 0, \qquad i = 1, \ldots, \ell.$$
Since the optimization problem in (1) is a quadratic programming problem, it has the dual formulation: Maximise

$$-\frac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\left(K(x_i, x_j) + \frac{\delta_{i,j}}{2C}\right) + \sum_{i=1}^{\ell} y_i\,(\alpha_i - \alpha_i^*) - \varepsilon\sum_{i=1}^{\ell}(\alpha_i + \alpha_i^*) \tag{2}$$

subject to

$$\sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*) = 0, \qquad \alpha_i \ge 0,\ \alpha_i^* \ge 0, \quad \text{for } i = 1, \ldots, \ell.$$

The SVM model, in terms of the Lagrange Multipliers $(\alpha_i, \alpha_i^*)$, is defined as

$$f(x_{\mathrm{new}}) = \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)\,K(x_i, x_{\mathrm{new}}) + b, \tag{3}$$
where the bias b is determined by using the constraints in (1). The input data vectors that correspond with positive Lagrange Multipliers are referred to as support vectors. Note that the loss term in (1) is quadratic, but (1) can also be expressed in terms of linear loss. For linear loss, the second term in (1) becomes $\frac{C}{\ell}\sum_{i=1}^{\ell}(\xi_i + \xi_i^*)$ and the Lagrange Multipliers in (2) are bounded from above by C. More information on Support Vector Machines can be found in [1] and [7].
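A minimal sketch of an ε-SVR fit, using scikit-learn's SVR (which implements the linear ε-insensitive loss, i.e. the variant in which the multipliers are bounded above by C); the toy data and hyperparameter values are arbitrary illustrations, not the settings used in this paper:

```python
import numpy as np
from sklearn.svm import SVR

# Minimal epsilon-SVR fit on toy data; C, epsilon and the kernel K are the
# quantities discussed in the text.
X = np.linspace(-1.0, 1.0, 40).reshape(-1, 1)
y = np.sin(np.pi * X).ravel()
model = SVR(kernel="rbf", C=100.0, epsilon=0.1).fit(X, y)
# The fitted model has the form of eq. (3): a kernel expansion over the
# support vectors, whose indices are exposed as model.support_.
print(len(model.support_), "support vectors out of", len(X), "points")
```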
The parameter C in (1) controls the trade-off between the complexity of the model, $\|w\|^2$, and the training error, $\frac{1}{\ell}\sum_{i=1}^{\ell}(\xi_i + \xi_i^*)$ [7]. C is also called the regularization parameter, since it corresponds to the parameter $\gamma$ of the regularization method for solving ill-posed problems as $C = 1/\gamma$.
Finding the optimal value for C still remains a problem. Many researchers have suggested that C should be varied through a wide range of values, with the optimal performance then measured by using a separate validation set or other techniques such as cross-validation or bootstrapping [1], [5]. Vapnik mentioned in [7] three methods for choosing the optimal regularization parameter, namely, the L-Curve method [2], the method of effective number of parameters [4] and the effective VC-dimension method [8]. Each of these methods uses a different approach for measuring the performance and complexity of the model and originates from a different theory. One common problem with many of the suggested approaches is that they are not suitable for large-scale problems. The computational effort of determining the eigenvalues of large matrices, or of resampling, limits their use in online applications.
In particular, if one needs to make a quick assessment of whether a given data set can be solved with the SVM method, or whether a given kernel function is an appropriate choice, a fast estimation method is extremely useful. Furthermore, since the C parameter is known to be a rather robust parameter, determining the true optimal value is often not worth the effort. In SVM literature it is often suggested that C should be chosen sufficiently large. But what value is large enough? If an estimation method can give a good indication of the magnitude of C, one can at least start from an informed guess.
It is known that the scale of the regularization parameter is affected by several factors. It has been shown by Smola [6] that the optimal regularization parameter depends on the value of ε. Since ε is used to control the complexity of the model and depends on the noise level in the data, the choice of the optimal value of C assumes some knowledge about the underlying noise distribution as well as the inherent complexity of the model. Often, this knowledge is not available. In [1] the authors indicate that the regularization parameter C is also affected by the choice of feature space. The consequence of this is very significant, since the feature space is determined by the specified kernel, which is in fact an operator associated with smoothness. Therefore, the choice of regularization parameter cannot be based on one factor alone, but on the combined influence of several. None of the heuristic estimation methods in the literature does that. The research was therefore aimed at deriving an estimation rule that combines the characteristics of the feature space, the expected noise level, and some other contributing factors.
The rest of the paper consists of four sections. In section two, useful results from the L-Curve method are discussed. In the third section a method is derived that estimates the value of C from a priori parameters. The performance of this method is shown in section four.
II RESULTS FROM THE L-CURVE METHOD
The L-Curve method is derived from the theory of solving ill-posed problems [2]. It is a well-established method and one of the few approaches in regularization theory that takes into account both the norm of the solution and the norm of the error [3]. Vapnik has shown in [7] that the L-Curve method can be applied to Support Vector Machines for regression with a quadratic loss function. The resulting terms for the norm of the solution and the norm of the error are then given by the following two functions,
$$\eta(\gamma) = \sum_{i \in N}\sum_{k \in N}(\alpha_i - \alpha_i^*)(\alpha_k - \alpha_k^*)\,K(x_i, x_k) \tag{4}$$

and

$$\rho(\gamma) = \frac{1}{\ell}\sum_{i=1}^{\ell}\Big(y_i - \sum_{k \in N}(\alpha_k - \alpha_k^*)\,K(x_k, x_i) - b\Big)^2, \tag{5}$$

where N is the index set of the support vectors. The L-Curve is the log-log plot of $\eta(\gamma)$ against $\rho(\gamma)$. The distinct L-shape of the curve is shown in Figure 1.

Figure 1: The form of the L-Curve is shown in graph (a). Graphs (b), (c) and (d) show models for various values of C.

The L-Curve method is a very useful graphical
tool which is used to display the trade-off between the complexity and the error. If too little complexity is used, the right 'leg' of the L-Curve is dominant and the model typically underfits (see Figure 1(b)). When the left 'leg' of the L-Curve is dominant, the model uses too much complexity and starts to overfit, as seen in Figure 1(d). The corner point of the L-Curve corresponds to the optimal value of the regularization parameter, for which the model has the right balance between the complexity and the error term.

Finding the corner point of the L-Curve involves finding the minimum of the functional

$$H(\gamma) = \rho(\gamma)\,\eta(\gamma). \tag{6}$$
In regularization theory, the corner point of the L-Curve is normally found by determining the curvature of the L-Curve. In [3] an expression for the curvature of the L-Curve is derived in terms of $\rho(\gamma)$ and $\eta(\gamma)$ and their derivatives. As part of the derivation of the curvature expression, an important relation between the derivatives of $\rho(\gamma)$ and $\eta(\gamma)$ emerged, and it is this relation we are interested in.

Consider the following minimization problem:

$$w_\gamma = \arg\min_w \left\{\, \|Aw - b\|_2^2 + \gamma\,\|w\|_2^2 \,\right\}, \tag{7}$$

where A is a symmetric positive (semi)definite coefficient matrix and b the given output data. Using the SVD decomposition of A, the norms of the solution and of the error can be written as

$$\eta(\gamma) = \sum_{i=1}^{\ell} f_i^2\,\frac{(u_i^T b)^2}{\sigma_i^2} \tag{8}$$

and

$$\rho(\gamma) = \sum_{i=1}^{\ell} (1 - f_i)^2\,(u_i^T b)^2, \tag{9}$$

where $u_i$ are the singular vectors, $\sigma_i$ the singular values, and $f_i$ the Tikhonov filter factors, which depend on $\sigma_i$ and $\gamma$ as follows:

$$f_i = \frac{\sigma_i^2}{\sigma_i^2 + \gamma}. \tag{10}$$

The derivatives of $\eta(\gamma)$ and $\rho(\gamma)$ with respect to $\gamma$ are then given by

$$\eta'(\gamma) = -\frac{2}{\gamma}\sum_{i=1}^{\ell} (1 - f_i)\,f_i^2\,\frac{(u_i^T b)^2}{\sigma_i^2} \tag{11}$$

and

$$\rho'(\gamma) = \frac{2}{\gamma}\sum_{i=1}^{\ell} f_i\,(1 - f_i)^2\,(u_i^T b)^2. \tag{12}$$

(Note that in [3] these equations were derived using $\gamma^2$, which resulted in a factor 4 instead of 2 in each equation.) Rewriting $\rho'(\gamma)$ and using the fact that

$$\frac{f_i}{\sigma_i^2} = \frac{1 - f_i}{\gamma}$$

leads to a very important relation:

$$\rho'(\gamma) = -\gamma\,\eta'(\gamma). \tag{13}$$
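The relation (13) is easy to verify numerically. A minimal sketch, assuming an arbitrary toy Tikhonov problem (the matrix A, the data b and all names below are our choices): it evaluates (8) and (9) through the filter factors (10) and differentiates by finite differences.

```python
import numpy as np

# Toy check of rho'(gamma) = -gamma * eta'(gamma), eq. (13), for Tikhonov
# regularization with a symmetric positive semidefinite A, as in (7).
rng = np.random.default_rng(0)
M = rng.standard_normal((40, 40))
A = M @ M.T
b = rng.standard_normal(40)

U, s, _ = np.linalg.svd(A)
beta2 = (U.T @ b) ** 2                      # the coefficients (u_i^T b)^2

def eta(gamma):
    # eq. (8): f_i^2 (u_i^T b)^2 / sigma_i^2 with f_i from eq. (10),
    # written as sigma_i^2 (u_i^T b)^2 / (sigma_i^2 + gamma)^2
    return np.sum(s**2 / (s**2 + gamma) ** 2 * beta2)

def rho(gamma):
    # eq. (9): (1 - f_i)^2 (u_i^T b)^2 = gamma^2 (u_i^T b)^2 / (sigma_i^2 + gamma)^2
    return np.sum((gamma / (s**2 + gamma)) ** 2 * beta2)

g, h = 1.0, 1e-6
eta_p = (eta(g + h) - eta(g - h)) / (2 * h)
rho_p = (rho(g + h) - rho(g - h)) / (2 * h)
print(rho_p, -g * eta_p)                    # the two numbers coincide
```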
III ESTIMATE FOR C
The relation (13) also applies to the Support Vector Machine formulation with quadratic loss when the implicit feature space, defined by the kernel, is considered.¹ In this section, the relation (13), combined with (6), will be used to derive an estimate. First, consider the functional (6). In order to find the optimal regularization parameter $\gamma$, (6) has to be minimized, that is, $H'(\gamma) = 0$. The derivative of $H(\gamma)$ is given by

$$H'(\gamma) = \rho'(\gamma)\,\eta(\gamma) + \rho(\gamma)\,\eta'(\gamma). \tag{14}$$

Rewriting the relation (13), given in the previous section, such that $\gamma$ stands alone, and using (14), leads to

$$\gamma = \frac{\rho(\gamma)}{\eta(\gamma)}.$$

Now using the fact that $C = 1/\gamma$, we arrive at

$$C = \frac{\eta(\gamma)}{\rho(\gamma)}. \tag{15}$$

This equation forms the basis of the estimate.²

Since the true solution, and therefore the true error, is unknown, we will use upper and lower bounds in terms of the a priori parameters. From Support Vector theory it is known that the norm of the solution satisfies $\|w\|^2 < R^2$, where R is the radius of the ball centred at the origin in the feature space, which can be computed as $R^2 = \max_{1 \le i \le \ell} K(x_i, x_i)$. Therefore,

$$\eta < \max_{1 \le i \le \ell} K(x_i, x_i). \tag{16}$$
¹It is also interesting to note the close resemblance between the derivation of the expression for the curvature of the L-Curve, which uses the SVD decomposition, and the use of the eigenvalues and eigenvectors in the method of the Effective Number of Parameters that was suggested in statistics for estimating parameters for ridge regression [4].

²Vapnik derived in Chapter 7 of [7] a similar relation for $\gamma$ as in (15), as part of the proof of a theorem. Vapnik used, however, an entirely different approach. The relation $\rho(Af_\ell, Af) \le 2d\sqrt{\gamma_\ell}$ can be rewritten as $\gamma_\ell \ge \rho^2(Af_\ell, Af)/4d^2$. Here A is an operator in a Hilbert space and the function $\rho$ is a metric measuring the distance between the true output $Af = F$ and the predicted output $Af_\ell$ of the optimal solution $f_\ell$. Finally, d is such that $\|f\| \le d$.
Now, consider the term for the norm of the error, $\rho$. Let $\hat{y}_i$ be the predicted output value of $y_i$ given by the SVM model. Since the SVM for Regression uses an ε-insensitive loss function, residuals inside the ε-zone do not contribute, and

$$\rho = \frac{1}{\ell}\sum_{i=1}^{\ell}\Big(\Big|y_i - \sum_{k \in N}(\alpha_k - \alpha_k^*)\,K(x_k, x_i) - b\Big| - \varepsilon\Big)_+^2 = \frac{1}{\ell}\sum_{i=1}^{\ell}\big(|y_i - \hat{y}_i| - \varepsilon\big)_+^2. \tag{17}$$

It is clear from (17) that a lower bound in terms of a priori information should involve the number of data points, the range of the output data and the value of ε. Since no such bound exists in the literature, one was derived from a number of assumptions about the error and from experimental results.

Let us assume that the resulting model will be a relatively good model, such that the ε-insensitive zone is smaller than half the range of the output values, and that there is an equal number of support vectors above and below the ε-insensitive zone. Then a very loose lower bound on (17) can be given by

$$\rho > \frac{1}{\ell}\left(\frac{\mathrm{Range}(y)}{2} - \varepsilon\right)^2. \tag{18}$$

From experimental observations it was found that a power of four gives a more accurate estimation. This leads to the proposed estimate, given by

$$\hat{C} = \frac{\displaystyle\max_{1 \le i \le \ell} K(x_i, x_i)}{\displaystyle\frac{1}{\ell}\left(\frac{\mathrm{Range}(y)}{2} - \varepsilon\right)^4}. \tag{19}$$
IV EXPERIMENTAL RESULTS
In this section the estimated value of C, using (19), is compared to the value of C determined by using the L-Curve. Several data sets with varying sizes, noise levels and dimensions were used. The results for $f(z_1, z_2) = z_1 z_2 + 1$ with $(z_1, z_2) \in [-1, 1]^2$ (equivalent to a continuous version of the 2D XOR problem) are presented in this section. The learning data consisted of a random sampling of this function after noise of level N(0, 0.05) was added.
Figure 2: Results for an RBF kernel of width 0.2. The (near) optimal value of C is indicated by an asterisk and the estimated value of C by a circle. (a) Error statistics for each iteration step in the L-Curve method. The L-Curve is shown in (b) and the corner point of the L-Curve in (c). In (d) and (e) the performances of the model using the optimal C and the model using the estimated C are shown, respectively.
In Figure 2 the results from the L-Curve approach are compared to the estimated value of C using an RBF kernel with a width of 0.2 and an ε of 0.05. The L-Curve approach requires the building of several models for increasing values of C. The range of values for C considered needs to be large enough, otherwise the corner point of the L-Curve cannot be seen. Therefore, the C-values used were chosen on a logarithmic scale. Figure 2(a) shows various error statistics of models for increasing values of C. The resulting L-Curve is plotted in Figure 2(b). The area between the vertical dashed lines in Figure 2(a) corresponds to the area in the corner of the L-Curve, as shown in Figure 2(b). The area around the corner point of the L-Curve is shown more clearly in Figure 2(c). In Figure 2(a) and Figure 2(c) the location of the optimal C-value is indicated by the asterisk, and the circle shows the location of the estimated C-value. Finally, Figure 2(d) and Figure 2(e) show the performance of the models built using the (near) optimal C and the estimated C, respectively. At first glance, one might think that an estimated value of C = 340 is far from the (near) optimal value of C = 1151 from the L-Curve. However, from Figure 2(c) it is clear that C is a rather robust parameter. Therefore, the estimate needs only to predict a value of C close to the corner of the L-Curve.
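A minimal sketch of this construction with off-the-shelf tools, assuming linear rather than quadratic loss (which is what scikit-learn's SVR implements) and our own mapping of the width-0.2 RBF to gamma = 1/(2 · 0.2²) = 12.5; the sample size and C grid are arbitrary. It fits one model per C and prints the log-log coordinates (ρ, η), from which the corner can be read off.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVR

# Trace an (approximate) L-Curve: one SVR per C on a logarithmic grid.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = X[:, 0] * X[:, 1] + 1.0 + rng.normal(0.0, 0.05, size=300)

gamma, eps = 12.5, 0.05                       # width-0.2 RBF, epsilon = 0.05
for C in np.logspace(0, 4, 9):
    m = SVR(kernel="rbf", gamma=gamma, C=C, epsilon=eps).fit(X, y)
    beta = m.dual_coef_.ravel()               # alpha_i - alpha_i^*
    K_sv = rbf_kernel(m.support_vectors_, m.support_vectors_, gamma=gamma)
    eta = float(beta @ K_sv @ beta)           # solution norm, cf. eq. (4)
    resid = np.maximum(np.abs(y - m.predict(X)) - eps, 0.0)
    rho = float(np.mean(resid**2))            # eps-insensitive error, cf. eq. (17)
    print(f"C={C:10.1f}  log rho={np.log(rho + 1e-12):8.3f}"
          f"  log eta={np.log(eta + 1e-12):8.3f}")
```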
Figure 3: Results for a polynomial kernel of degree 2. (a) The L-Curve-determined C and the estimated C plotted against the percentage of support vectors of each model. (b) The R² statistic of predictions made by the models. (c) The Root Mean Square Error of Prediction of the models.
Figure 3 shows the results of SVM with a polynomial kernel for various values of ε, ranging from 0 to 0.125. For each value of ε, the C value was generated by the L-Curve method and also estimated by (19).
Figure 3(a) shows the value of C determined by the L-Curve method and the estimated value plotted against the resulting percentage of support vectors of each model. The performance of models using the optimal C and the estimated C is compared using the R²-statistic and the Root Mean Square Error of Prediction (RMSEP).³ In Figure 3(b) and Figure 3(c) it is clear that the estimated value of C produces models with error statistics that compare well with the error statistics of a model using the optimal value of C.
The CPU time for determining the (near) optimal value of C through the L-Curve method was on average around 90 seconds. For the estimation method, the CPU time was less than 1 second. The computational advantage speaks for itself. The estimated value of C can also help to speed up the L-Curve method, since it provides a good initial guess for a starting point of the algorithm.
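A minimal sketch of this use, assuming the data-generating setup of this section and an arbitrary one-decade grid around the estimate; the names and grid width are our choices:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Centre a small search for C on the a priori estimate instead of scanning
# many decades of values.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = X[:, 0] * X[:, 1] + 1.0 + rng.normal(0.0, 0.05, size=300)

C_hat = 340.0                                          # a priori estimate of C
grid = {"C": list(C_hat * np.logspace(-1.0, 1.0, 5))}  # one decade either side
search = GridSearchCV(SVR(kernel="rbf", gamma=12.5, epsilon=0.05), grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```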
V CONCLUSIONS
A method for estimating the regularization parameter C for Support Vector Regression problems is presented. The estimation is based on results from the analysis of the L-Curve method. It was mentioned in the introduction that choosing a value for C should involve taking into account several factors, including the kernel function and the noise level. These factors are all present in the proposed heuristic.

Comparing the values of C obtained from the L-Curve method with the values determined by the estimate, using several data sets, showed that the estimated C-values are in close proximity to the optimal C. Furthermore, the difference in performance between a model using the C-value determined by the L-Curve and a model using the C estimated by the method is very small and often negligible.

The computation time needed to determine a good estimate of the optimal C is a fraction of the time needed to determine the (near) optimal value of C by means of the L-Curve method. Therefore, the proposed estimation method can be used for online applications in industry. In particular, if one needs to make a quick assessment of whether a given data set can be solved with the SVM method, or whether a given kernel function is an appropriate choice, the fast and robust estimation method is extremely useful.

In this paper only the ε-Support Vector Machine with quadratic loss was considered, assuming that ε is known a priori. Future work includes deriving similar estimates for the ε-SVM with linear loss as well as for the ν-Support Vector Machine [1], where the expected ratio of support vectors is used instead of ε.
References

[1] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.

[2] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Kluwer Academic Publishers, Hingham, MA, 1996.

[3] P. C. Hansen, "The L-Curve and its use in the numerical treatment of inverse problems", invited paper in P. Johnston (Ed.), Computational Inverse Problems in Electrocardiology, pp. 119-142, WIT Press, Southampton, 2001.

[4] T. J. Hastie and R. J. Tibshirani, Generalized Linear Models, Chapman and Hall, London, UK, 1990.

[5] B. Schölkopf, C. J. Burges, and A. J. Smola, Advances in Kernel Methods: Support Vector Learning, MIT Press, London, 1998.
[6] A. J. Smola, Regression Estimation with Support Vector Learning Machines, Master's Thesis, TU Berlin, 1996.
[7] V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.

[8] V. N. Vapnik, E. Levin, and Y. Le Cun, "Measuring the VC Dimension of a Learning Machine", Neural Computation, Vol. 6:5, 1994.
³The RMSEP is the relative error multiplied by the standard deviation of the predicted test data.