Estimation of the Regularization Parameter for Support Vector Regression

E.M. Jordaan*, G.F. Smits†

*Department of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands
†Materials Sciences and Information Research, Dow Benelux B.V., Terneuzen, The Netherlands.
Abstract - Support Vector Machines use a regularization parameter C to regulate the trade-off between the complexity of the model and the empirical risk of the model. Most of the techniques available for determining the optimal value of C are very time consuming. For industrial applications of the SVM method, there is a need for a fast and robust method to estimate C. In this paper a method based on the characteristics of the kernel, the range of output values and the size of the ε-insensitive zone is proposed.
I INTRODUCTION
The Support Vector Machine as a learning machine was first suggested by Vapnik in the early 1990's [7]. Originally it was derived for classification applications, but since the mid 1990's it has been applied to regression and feature selection problems as well. The SVM formulation in the case of regression, for a given learning or training data set $\{(x_i, y_i)\}_{i=1}^{\ell}$ with $x_i \in X \subseteq \mathbb{R}^n$ and $y_i \in \mathbb{R}$, is to minimize

$$\frac{1}{2}\|w\|^2 + \frac{C}{\ell}\sum_{i=1}^{\ell}\left(\xi_i^2 + \xi_i^{*\,2}\right) \tag{1}$$

subject to

$$y_i - \big((w \cdot x_i) + b\big) \le \varepsilon + \xi_i, \qquad i = 1, \ldots, \ell,$$

$$\big((w \cdot x_i) + b\big) - y_i \le \varepsilon + \xi_i^*, \qquad i = 1, \ldots, \ell,$$

$$\xi_i,\ \xi_i^* \ge 0, \qquad i = 1, \ldots, \ell.$$
Since the optimization problem in (1) is a quadratic programming problem, it has the dual formulation: Maximise

$$-\frac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\left(K(x_i, x_j) + \frac{\delta_{i,j}}{2C}\right) + \sum_{i=1}^{\ell} y_i\,(\alpha_i - \alpha_i^*) - \varepsilon\sum_{i=1}^{\ell}(\alpha_i + \alpha_i^*) \tag{2}$$

subject to

$$\sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*) = 0, \qquad \alpha_i \ge 0,\ \alpha_i^* \ge 0, \quad \text{for } i = 1, \ldots, \ell.$$

The SVM model, in terms of the Lagrange Multipliers $(\alpha_i, \alpha_i^*)$, is defined as

$$f(x_{\mathrm{new}}) = \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)\,K(x_i, x_{\mathrm{new}}) + b, \tag{3}$$
where the bias b is determined by using the constraints in (1). The input data vectors that correspond with positive Lagrange Multipliers are referred to as support vectors. Note that the loss term in (1) is quadratic, but (1) can also be expressed in terms of linear loss. For linear loss, the second term in (1) becomes $\frac{C}{\ell}\sum_{i=1}^{\ell}(\xi_i + \xi_i^*)$ and the Lagrange Multipliers in (2) are bounded from above by C. More information on Support Vector Machines can be found in [1] and [7].
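A minimal sketch of an ε-SVR fit, using scikit-learn's SVR (which implements the linear ε-insensitive loss, i.e. the variant in which the multipliers are bounded above by C); the toy data and hyperparameter values are arbitrary illustrations, not the settings used in this paper:

```python
import numpy as np
from sklearn.svm import SVR

# Minimal epsilon-SVR fit on toy data; C, epsilon and the kernel K are the
# quantities discussed in the text.
X = np.linspace(-1.0, 1.0, 40).reshape(-1, 1)
y = np.sin(np.pi * X).ravel()
model = SVR(kernel="rbf", C=100.0, epsilon=0.1).fit(X, y)
# The fitted model has the form of eq. (3): a kernel expansion over the
# support vectors, whose indices are exposed as model.support_.
print(len(model.support_), "support vectors out of", len(X), "points")
```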
The parameter C in (1) controls the trade-off between the complexity of the model, $\|w\|^2$, and the training error, $\frac{1}{\ell}\sum_{i=1}^{\ell}(\xi_i + \xi_i^*)$ [7]. C is also called the regularization parameter, since it corresponds to the parameter $\gamma$ of the regularization method for solving ill-posed problems as $C = 1/\gamma$.
Finding the optimal value for C still remains a problem. Many researchers have suggested that C should be varied through a wide range of values, with the optimal performance then measured by using a separate validation set or other techniques such as cross-validation or bootstrapping [1], [5]. Vapnik mentioned in [7] three methods for choosing the optimal regularization parameter, namely, the L-Curve method [2], the method of effective number of parameters [4] and the effective VC-dimension method [8]. Each of these methods uses a different approach for measuring the performance and complexity of the model and originates from a different theory. One common problem with many of the suggested approaches is that they are not suitable for large-scale problems. The computational effort of determining the eigenvalues of large matrices, or of resampling, limits their use in online applications.
In particular, if one needs to make a quick assessment of whether a given data set can be solved with the SVM method, or whether a given kernel function is an appropriate choice, a fast estimation method is extremely useful. Furthermore, since the C parameter is known to be a rather robust parameter, determining the true optimal value is often not worth the effort. In SVM literature it is often suggested that C should be chosen sufficiently large. But what value is large enough? If an estimation method can give a good indication of the magnitude of C, one can at least start from an informed guess.
It is known that the scale of the regularization parameter is affected by several factors. It has been shown by Smola [6] that the optimal regularization parameter depends on the value of ε. Since ε is used to control the complexity of the model and depends on the noise level in the data, the choice of the optimal value of C assumes some knowledge about the underlying noise distribution as well as the inherent complexity of the model. Often, this knowledge is not available. In [1] the authors indicate that the regularization parameter C is also affected by the choice of feature space. The consequence of this is very significant, since the feature space is determined by the specified kernel, which is in fact an operator associated with smoothness. Therefore, the choice of regularization parameter cannot be based on one factor alone, but on the combined influence of several. None of the heuristic estimation methods in the literature does that. The research was therefore aimed at deriving an estimation rule that combines the characteristics of the feature space, the expected noise level, and some other contributing factors.
The rest of the paper consists of four sections. In section two, useful results from the L-Curve method are discussed. In the third section a method is derived that estimates the value of C from a priori parameters. The performance of this method is shown in section four.
II RESULTS FROM THE L-CURVE METHOD
The L-Curve method is derived from the theory of solving ill-posed problems [2]. It is a well-established method and one of the few approaches in regularization theory that takes into account both the norm of the solution and the norm of the error [3]. Vapnik has shown in [7] that the L-Curve method can be applied to Support Vector Machines for regression with a quadratic loss function. The resulting terms for the norm of the solution and the norm of the error are then given by the following two functions,
$$\eta(\gamma) = \sum_{i \in N}\sum_{k \in N}(\alpha_i - \alpha_i^*)(\alpha_k - \alpha_k^*)\,K(x_i, x_k) \tag{4}$$

and

$$\rho(\gamma) = \frac{1}{\ell}\sum_{i=1}^{\ell}\Big(y_i - \sum_{k \in N}(\alpha_k - \alpha_k^*)\,K(x_k, x_i) - b\Big)^2, \tag{5}$$

where N is the index set of the support vectors. The L-Curve is the log-log plot of $\eta(\gamma)$ against $\rho(\gamma)$. The distinct L-shape of the curve is shown in Figure 1.

Figure 1: The form of the L-Curve is shown in graph (a). Graphs (b), (c) and (d) show models for various values of C.

The L-Curve method is a very useful graphical
tool which is used to display the trade-off between the complexity and the error. If too little complexity is used, the right 'leg' of the L-Curve is dominant and the model typically underfits (see Figure 1(b)). When the left 'leg' of the L-Curve is dominant, the model uses too much complexity and starts to overfit, as seen in Figure 1(d). The corner point of the L-Curve corresponds to the optimal value of the regularization parameter, for which the model has the right balance between the complexity and the error term.

Finding the corner point of the L-Curve involves finding the minimum of the functional

$$H(\gamma) = \rho(\gamma)\,\eta(\gamma). \tag{6}$$
In regularization theory, the corner point of the L-Curve is normally found by determining the curvature of the L-Curve. In [3] an expression for the curvature of the L-Curve is derived in terms of $\rho(\gamma)$ and $\eta(\gamma)$ and their derivatives. As part of the derivation of the curvature expression, an important relation between the derivatives of $\rho(\gamma)$ and $\eta(\gamma)$ emerged, and it is this relation we are interested in.

Consider the following minimization problem:

$$w_\gamma = \arg\min_w \left\{\, \|Aw - b\|_2^2 + \gamma\,\|w\|_2^2 \,\right\}, \tag{7}$$

where A is a symmetric positive (semi)definite coefficient matrix and b the given output data. Using the SVD decomposition of A, the norms of the solution and of the error can be written as

$$\eta(\gamma) = \sum_{i=1}^{\ell} f_i^2\,\frac{(u_i^T b)^2}{\sigma_i^2} \tag{8}$$

and

$$\rho(\gamma) = \sum_{i=1}^{\ell} (1 - f_i)^2\,(u_i^T b)^2, \tag{9}$$

where $u_i$ are the singular vectors, $\sigma_i$ the singular values, and $f_i$ the Tikhonov filter factors, which depend on $\sigma_i$ and $\gamma$ as follows:

$$f_i = \frac{\sigma_i^2}{\sigma_i^2 + \gamma}. \tag{10}$$

The derivatives of $\eta(\gamma)$ and $\rho(\gamma)$ with respect to $\gamma$ are then given by

$$\eta'(\gamma) = -\frac{2}{\gamma}\sum_{i=1}^{\ell} (1 - f_i)\,f_i^2\,\frac{(u_i^T b)^2}{\sigma_i^2} \tag{11}$$

and

$$\rho'(\gamma) = \frac{2}{\gamma}\sum_{i=1}^{\ell} f_i\,(1 - f_i)^2\,(u_i^T b)^2. \tag{12}$$

(Note that in [3] these equations were derived using $\gamma^2$, which resulted in a factor 4 instead of 2 in each equation.) Rewriting $\rho'(\gamma)$ and using the fact that

$$\frac{f_i}{\sigma_i^2} = \frac{1 - f_i}{\gamma}$$

leads to a very important relation:

$$\rho'(\gamma) = -\gamma\,\eta'(\gamma). \tag{13}$$
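The relation (13) is easy to verify numerically. A minimal sketch, assuming an arbitrary toy Tikhonov problem (the matrix A, the data b and all names below are our choices): it evaluates (8) and (9) through the filter factors (10) and differentiates by finite differences.

```python
import numpy as np

# Toy check of rho'(gamma) = -gamma * eta'(gamma), eq. (13), for Tikhonov
# regularization with a symmetric positive semidefinite A, as in (7).
rng = np.random.default_rng(0)
M = rng.standard_normal((40, 40))
A = M @ M.T
b = rng.standard_normal(40)

U, s, _ = np.linalg.svd(A)
beta2 = (U.T @ b) ** 2                      # the coefficients (u_i^T b)^2

def eta(gamma):
    # eq. (8): f_i^2 (u_i^T b)^2 / sigma_i^2 with f_i from eq. (10),
    # written as sigma_i^2 (u_i^T b)^2 / (sigma_i^2 + gamma)^2
    return np.sum(s**2 / (s**2 + gamma) ** 2 * beta2)

def rho(gamma):
    # eq. (9): (1 - f_i)^2 (u_i^T b)^2 = gamma^2 (u_i^T b)^2 / (sigma_i^2 + gamma)^2
    return np.sum((gamma / (s**2 + gamma)) ** 2 * beta2)

g, h = 1.0, 1e-6
eta_p = (eta(g + h) - eta(g - h)) / (2 * h)
rho_p = (rho(g + h) - rho(g - h)) / (2 * h)
print(rho_p, -g * eta_p)                    # the two numbers coincide
```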
III ESTIMATE FOR C
The relation (13) also applies to the Support Vector Machine formulation with quadratic loss when the implicit feature space, defined by the kernel, is considered.¹ In this section, the relation (13), combined with (6), will be used to derive an estimate. First, consider the functional (6). In order to find the optimal regularization parameter $\gamma$, (6) has to be minimized, that is, $H'(\gamma) = 0$. The derivative of $H(\gamma)$ is given by

$$H'(\gamma) = \rho'(\gamma)\,\eta(\gamma) + \rho(\gamma)\,\eta'(\gamma). \tag{14}$$

Rewriting the relation (13), given in the previous section, such that $\gamma$ stands alone, and using (14), leads to

$$\gamma = \frac{\rho(\gamma)}{\eta(\gamma)}.$$

Now using the fact that $C = 1/\gamma$, we arrive at

$$C = \frac{\eta(\gamma)}{\rho(\gamma)}. \tag{15}$$

This equation forms the basis of the estimate.²

Since the true solution, and therefore the true error, is unknown, we will use upper and lower bounds in terms of the a priori parameters. From Support Vector theory it is known that the norm of the solution satisfies $\|w\|^2 < R^2$, where R is the radius of the ball centred at the origin in the feature space, which can be computed as $R^2 = \max_{1 \le i \le \ell} K(x_i, x_i)$. Therefore,

$$\eta < \max_{1 \le i \le \ell} K(x_i, x_i). \tag{16}$$
¹It is also interesting to note the close resemblance between the derivation of the expression for the curvature of the L-Curve, which uses the SVD decomposition, and the use of the eigenvalues and eigenvectors in the method of the Effective Number of Parameters that was suggested in statistics for estimating parameters for ridge regression [4].

²Vapnik derived in Chapter 7 of [7] a similar relation for $\gamma$ as in (15), as part of the proof of a theorem. Vapnik used, however, an entirely different approach. The relation $\rho(Af_\ell, Af) \le 2d\sqrt{\gamma_\ell}$ can be rewritten as $\gamma_\ell \ge \rho^2(Af_\ell, Af)/4d^2$. Here A is an operator in a Hilbert space and the function $\rho$ is a metric measuring the distance between the true output $Af = F$ and the predicted output $Af_\ell$ of the optimal solution $f_\ell$. Finally, d is such that $\|f\| \le d$.
Now, consider the term for the norm of the error, $\rho$. Let $\hat{y}_i$ be the predicted output value of $y_i$ given by the SVM model. Since the SVM for Regression uses an ε-insensitive loss function, residuals inside the ε-zone do not contribute, and

$$\rho = \frac{1}{\ell}\sum_{i=1}^{\ell}\Big(\Big|y_i - \sum_{k \in N}(\alpha_k - \alpha_k^*)\,K(x_k, x_i) - b\Big| - \varepsilon\Big)_+^2 = \frac{1}{\ell}\sum_{i=1}^{\ell}\big(|y_i - \hat{y}_i| - \varepsilon\big)_+^2. \tag{17}$$

It is clear from (17) that a lower bound in terms of a priori information should involve the number of data points, the range of the output data and the value of ε. Since no such bound exists in the literature, one was derived from a number of assumptions about the error and from experimental results.

Let us assume that the resulting model will be a relatively good model, such that the ε-insensitive zone is smaller than half the range of the output values, and that there is an equal number of support vectors above and below the ε-insensitive zone. Then a very loose lower bound on (17) can be given by

$$\rho > \frac{1}{\ell}\left(\frac{\mathrm{Range}(y)}{2} - \varepsilon\right)^2. \tag{18}$$

From experimental observations it was found that a power of four gives a more accurate estimation. This leads to the proposed estimate, given by

$$\hat{C} = \frac{\displaystyle\max_{1 \le i \le \ell} K(x_i, x_i)}{\displaystyle\frac{1}{\ell}\left(\frac{\mathrm{Range}(y)}{2} - \varepsilon\right)^4}. \tag{19}$$
IV EXPERIMENTAL RESULTS
In this section the estimated value of C, using (19), is compared to the value of C determined by using the L-Curve. Several data sets with varying sizes, noise levels and dimensions were used. The results for $f(z_1, z_2) = z_1 z_2 + 1$ with $(z_1, z_2) \in [-1, 1]^2$ (equivalent to a continuous version of the 2D XOR problem) are presented in this section. The learning data consisted of a random sampling of this function after noise of level N(0, 0.05) was added.
Figure 2: Results for an RBF kernel of width 0.2. The (near) optimal value of C is indicated by an asterisk and the estimated value of C by a circle. (a) Error statistics for each iteration step in the L-Curve method. The L-Curve is shown in (b) and the corner point of the L-Curve in (c). In (d) and (e) the performances of the model using the optimal C and the model using the estimated C are shown, respectively.
In Figure 2 the results from the L-Curve approach are compared to the estimated value of C using an RBF kernel with a width of 0.2 and an ε of 0.05. The L-Curve approach requires the building of several models for increasing values of C. The range of values for C considered needs to be large enough, otherwise the corner point of the L-Curve cannot be seen. Therefore, the C-values used were chosen on a logarithmic scale. Figure 2(a) shows various error statistics of models for increasing values of C. The resulting L-Curve is plotted in Figure 2(b). The area between the vertical dashed lines in Figure 2(a) corresponds to the area in the corner of the L-Curve, as shown in Figure 2(b). The area around the corner point of the L-Curve is shown more clearly in Figure 2(c). In Figure 2(a) and Figure 2(c) the location of the optimal C-value is indicated by the asterisk, and the circle shows the location of the estimated C-value. Finally, Figure 2(d) and Figure 2(e) show the performance of the models built using the (near) optimal C and the estimated C, respectively. At first glance, one might think that an estimated value of C = 340 is far from the (near) optimal value of C = 1151 from the L-Curve. However, from Figure 2(c) it is clear that C is a rather robust parameter. Therefore, the estimate needs only to predict a value of C close to the corner of the L-Curve.
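A minimal sketch of this construction with off-the-shelf tools, assuming linear rather than quadratic loss (which is what scikit-learn's SVR implements) and our own mapping of the width-0.2 RBF to gamma = 1/(2 · 0.2²) = 12.5; the sample size and C grid are arbitrary. It fits one model per C and prints the log-log coordinates (ρ, η), from which the corner can be read off.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVR

# Trace an (approximate) L-Curve: one SVR per C on a logarithmic grid.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = X[:, 0] * X[:, 1] + 1.0 + rng.normal(0.0, 0.05, size=300)

gamma, eps = 12.5, 0.05                       # width-0.2 RBF, epsilon = 0.05
for C in np.logspace(0, 4, 9):
    m = SVR(kernel="rbf", gamma=gamma, C=C, epsilon=eps).fit(X, y)
    beta = m.dual_coef_.ravel()               # alpha_i - alpha_i^*
    K_sv = rbf_kernel(m.support_vectors_, m.support_vectors_, gamma=gamma)
    eta = float(beta @ K_sv @ beta)           # solution norm, cf. eq. (4)
    resid = np.maximum(np.abs(y - m.predict(X)) - eps, 0.0)
    rho = float(np.mean(resid**2))            # eps-insensitive error, cf. eq. (17)
    print(f"C={C:10.1f}  log rho={np.log(rho + 1e-12):8.3f}"
          f"  log eta={np.log(eta + 1e-12):8.3f}")
```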
Figure 3: Results for a polynomial kernel of degree 2. (a) The L-Curve-determined C and the estimated C plotted against the percentage of support vectors of each model. (b) The R² statistic of predictions made by the models. (c) The Root Mean Square Error of Prediction of the models.
Figure 3 shows the results of SVM with a polynomial kernel for various values of ε, ranging from 0 to 0.125. For each value of ε, the C value was generated by the L-Curve method and also estimated by (19).
Figure 3(a) shows the value of C determined by the L-Curve method and the estimated value plotted against the resulting percentage of support vectors of each model. The performance of models using the optimal C and the estimated C is compared using the R²-statistic and the Root Mean Square Error of Prediction (RMSEP).³ In Figure 3(b) and Figure 3(c) it is clear that the estimated value of C produces models with error statistics that compare well with the error statistics of a model using the optimal value of C.
The CPU time for determining the (near) optimal value of C through the L-Curve method was on average around 90 seconds. For the estimation method, the CPU time was less than 1 second. The computational advantage speaks for itself. The estimated value of C can also help to speed up the L-Curve method, since it provides a good initial guess for a starting point of the algorithm.
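A minimal sketch of this use, assuming the data-generating setup of this section and an arbitrary one-decade grid around the estimate; the names and grid width are our choices:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Centre a small search for C on the a priori estimate instead of scanning
# many decades of values.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = X[:, 0] * X[:, 1] + 1.0 + rng.normal(0.0, 0.05, size=300)

C_hat = 340.0                                          # a priori estimate of C
grid = {"C": list(C_hat * np.logspace(-1.0, 1.0, 5))}  # one decade either side
search = GridSearchCV(SVR(kernel="rbf", gamma=12.5, epsilon=0.05), grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```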
V CONCLUSIONS
A method for estimating the regularization parameter C for Support Vector Regression problems is presented. The estimation is based on results from the analysis of the L-Curve method. It was mentioned in the introduction that choosing a value for C should involve taking into account several factors, including the kernel function and the noise level. These factors are all present in the proposed heuristic.

Comparing the values of C obtained from the L-Curve method with the values determined by the estimate, using several data sets, showed that the estimated C-values are in close proximity to the optimal C. Furthermore, the difference in performance between a model using the C-value determined by the L-Curve and a model using the C estimated by the method is very small and often negligible.

The computation time needed to determine a good estimate of the optimal C is a fraction of the time needed to determine the (near) optimal value of C by means of the L-Curve method. Therefore, the proposed estimation method can be used for online applications in industry. In particular, if one needs to make a quick assessment of whether a given data set can be solved with the SVM method, or whether a given kernel function is an appropriate choice, the fast and robust estimation method is extremely useful.

In this paper only the ε-Support Vector Machine with quadratic loss was considered, assuming that ε is known a priori. Future work includes deriving similar estimates for the ε-SVM with linear loss as well as for the ν-Support Vector Machine [1], where the expected ratio of support vectors is used instead of ε.
References

[1] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.

[2] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Kluwer Academic Publishers, Hingham, MA, 1996.

[3] P. C. Hansen, "The L-Curve and its use in the numerical treatment of inverse problems", invited paper in P. Johnston (Ed.), Computational Inverse Problems in Electrocardiology, pp. 119-142, WIT Press, Southampton, 2001.

[4] T. J. Hastie and R. J. Tibshirani, Generalized Linear Models, Chapman and Hall, London, UK, 1990.

[5] B. Schölkopf, C. J. Burges, and A. J. Smola, Advances in Kernel Methods: Support Vector Learning, MIT Press, London, 1998.
[6] A. J. Smola, Regression Estimation with Support Vector Learning Machines, Master's Thesis, TU Berlin, 1996.
[7] V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.

[8] V. N. Vapnik, E. Levin, and Y. Le Cun, "Measuring the VC Dimension of a Learning Machine", Neural Computation, Vol. 6:5, 1994.
³The RMSEP is the relative error multiplied by the standard deviation of the predicted test data.