
Estimation of the Regularization Parameter for Support Vector Regression

E.M. Jordaan*, G.F. Smits†

* Department of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands
† Materials Sciences and Information Research, Dow Benelux B.V., Terneuzen, The Netherlands

Abstract - Support Vector Machines use a regularization parameter C to regulate the trade-off between the complexity of the model and the empirical risk of the model. Most of the techniques available for determining the optimal value of C are very time consuming. For industrial applications of the SVM method, there is a need for a fast and robust method to estimate C. In this paper a method based on the characteristics of the kernel, the range of the output values and the size of the ε-insensitive zone is proposed.

I INTRODUCTION

The Support Vector Machine as a learning machine was first suggested by Vapnik in the early 1990's [7]. Originally it was derived for classification applications, but since the mid 1990's it has been applied to regression and feature selection problems as well. The SVM formulation in the case of regression, for a given learning or training data set $x_i \in X \subseteq \mathbb{R}^n$, $y_i \in \mathbb{R}$, $i = 1, \ldots, \ell$, is to minimize

$$\frac{1}{2}\|w\|^2 + \frac{C}{\ell}\sum_{i=1}^{\ell}\left(\xi_i^2 + \xi_i^{*2}\right) \qquad (1)$$

subject to

$$y_i - (w \cdot x_i + b) \le \varepsilon + \xi_i, \quad i = 1, \ldots, \ell,$$
$$(w \cdot x_i + b) - y_i \le \varepsilon + \xi_i^*, \quad i = 1, \ldots, \ell,$$
$$\xi_i, \ \xi_i^* \ge 0, \quad i = 1, \ldots, \ell.$$

Since the optimization problem in (1) is a quadratic programming problem, it has the dual formulation: maximise

$$-\frac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\left(K(x_i, x_j) + \frac{\ell}{2C}\,\delta_{ij}\right) + \sum_{i=1}^{\ell} y_i(\alpha_i - \alpha_i^*) - \varepsilon\sum_{i=1}^{\ell}(\alpha_i + \alpha_i^*) \qquad (2)$$

subject to

$$\alpha_i\,\alpha_i^* = 0, \quad \alpha_i \ge 0, \quad \alpha_i^* \ge 0, \quad \text{for } i = 1, \ldots, \ell.$$

The SVM model, in terms of the Lagrange multipliers $(\alpha_i, \alpha_i^*)$, is defined as


$$f(x_{\mathrm{new}}) = \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)\,K(x_i, x_{\mathrm{new}}) + b, \qquad (3)$$

where the bias $b$ is determined by using the constraints in (1). The input data vectors that correspond with positive Lagrange multipliers are referred to as support vectors. Note that the loss term in (1) is quadratic, but (1) can also be expressed in terms of linear loss. For linear loss, the second term in (1) becomes $\frac{C}{\ell}\sum_{i=1}^{\ell}(\xi_i + \xi_i^*)$ and the Lagrange multipliers in (2) are bounded from above by C. More information on Support Vector Machines can be found in [1] and [7].

The parameter C in (1) controls the trade-off between the complexity of the model $\|w\|^2$ and the training error $\frac{1}{\ell}\sum_{i=1}^{\ell}(\xi_i^2 + \xi_i^{*2})$ [7]. C is also called the regularization parameter since it corresponds to


the parameter $\gamma$ of the regularization method for solving ill-posed problems as $C = 1/\gamma$.
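As an illustration of this trade-off (not from the paper itself), the short sketch below fits SVR models with increasing C on noisy data and reports the empirical error and the number of support vectors. Note that scikit-learn's SVR implements the linear ε-insensitive loss variant mentioned above, in which the multipliers are bounded by C, so it only stands in for the quadratic-loss formulation (1).

```python
# Illustrative sketch only: how C trades model complexity against empirical risk.
# scikit-learn's SVR uses the linear epsilon-insensitive loss (multipliers bounded
# by C), i.e. the linear-loss variant of (1), not the quadratic-loss formulation.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1.0, 1.0, 80)).reshape(-1, 1)
y = np.sinc(2.0 * x).ravel() + rng.normal(0.0, 0.05, 80)      # noisy 1-D target

for C in (0.1, 10.0, 1000.0):                                  # small ... large C
    model = SVR(kernel="rbf", gamma=12.5, C=C, epsilon=0.05)   # gamma = 1/(2*0.2^2), RBF width 0.2
    model.fit(x, y)
    mse = np.mean((model.predict(x) - y) ** 2)                 # empirical (training) risk
    print(f"C={C:7.1f}  training MSE={mse:.4f}  support vectors={len(model.support_)}")
```

With a very small C the model stays flat and underfits; with a very large C it chases the noise, which is exactly the balance the regularization parameter controls.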

Finding the optimal value for C still remains a problem. Many researchers suggested that C should be varied through a wide range of values and the optimal performance is then measured by using a separate validation set or other techniques such as cross-validation or boot-strapping [1], [5]. Vapnik mentioned in [7] three methods for choosing the optimal regularization parameter, namely, the L-Curve method [2], the method of effective number of parameters [4] and the effective VC-dimension method [8]. Each of these methods uses a different approach for measuring the performance and complexity of the model and originates from different theories. One common problem with many of the suggested approaches is that they are not suitable for large-scale problems. The computational effort to determine the eigenvalues of large matrices or the use of resampling limits their use in online applications.

In particular, if one needs to make a quick assessment whether a given data set can be solved with the SVM method or if a given kernel function is an appropriate choice, a fast estimation method is extremely useful. Furthermore, since the C parameter is known to be a rather robust parameter, determining the true optimal value is often not worth the effort. In SVM literature it is often suggested that C should be chosen sufficiently large. But what value is large enough? If an estimation method can give a good indication of the magnitude of C, one can at least start from an informed guess.

It is known that the scale of the regularization parameter is affected by several factors. It has been shown by Smola [6] that the optimal regularization parameter depends on the value of ε. Since ε is used to control the complexity of the model and depends on the noise level in the data, the choice of the optimal value of C assumes some knowledge about the underlying noise distribution as well as the inherent complexity of the model. Often, this knowledge is not available. In [1] the authors indicate that the regularization parameter C is also affected by the choice of feature space. The consequence of this is very significant, since the feature space is determined by the specified kernel, which is in fact an operator associated with smoothness. Therefore, the choice of regularization parameter cannot be based on one factor alone, but on the combined influence. None of the heuristics or estimation methods in the literature does that. The research was therefore aimed at deriving an estimation rule that combines the characteristics of the feature space, the expected noise level, and some other contributing factors.

The rest of the paper consists of four sections. In section two, useful results from the L-Curve method are discussed. In the third section a method is derived that estimates the value of C from a priori parameters. The performance of this method is shown in section four.

II RESULTS FROM THE L-CURVE METHOD

The L-Curve method is derived from the theory of solving ill-posed problems [2]. It is a well-established method and one of the few approaches in regularization theory that takes into account both the norm of the solution and the norm of the error [3]. Vapnik has shown in [7] that the L-Curve method can be applied for Support Vector Machines for regression with a quadratic loss function. The resulting terms for the norm of the solution and the norm of the error are then given by the following two functions,

$$\eta(\gamma) = \sum_{i,j \in N}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\,K(x_i, x_j) \qquad (4)$$

and

$$\rho(\gamma) = \frac{1}{\ell}\sum_{i=1}^{\ell}\Big(y_i - \sum_{k=1}^{\ell}(\alpha_k - \alpha_k^*)\,K(x_k, x_i) - b\Big)^2, \qquad (5)$$

where N is the index set of the support vectors. The L-Curve is the log-log plot of $\eta(\gamma)$ against $\rho(\gamma)$. The distinct L-shape of the curve is shown in Figure 1. The L-Curve method is a very useful graphical


tool which is used to display the trade-off between the complexity and the error. If too little complexity is used, the right 'leg' of the L-Curve is dominant and the model typically underfits (see Figure 1(b)). When the left 'leg' of the L-Curve is dominant, the model uses too much complexity and starts to overfit, as seen in Figure 1(d). The corner point of the L-Curve corresponds to the optimal value of the regularization parameter, for which the model has the right balance between complexity and the error term.

Figure 1: The form of the L-Curve is shown in graph (a). Graphs (b), (c) and (d) show models for various values of C.

Finding the corner point of the L-Curve involves finding the minimum of the functional

$$H(\gamma) = \eta(\gamma)\,\rho(\gamma). \qquad (6)$$

In regularization theory, the corner point of the L-Curve is normally found by determining the curvature of the L-Curve. In [3] an expression for the curvature of the L-Curve is derived in terms of $\rho(\gamma)$ and $\eta(\gamma)$ and their derivatives. As part of the derivation of the curvature expression, an important relation between the derivative of $\rho(\gamma)$ and $\eta(\gamma)$ emerged. And it is this relation we are interested in.

Consider the following minimization problem

$$x_\gamma = \arg\min_x \left\{\|Ax - b\|_2^2 + \gamma\,\|x\|_2^2\right\}, \qquad (7)$$

where A is a symmetric positive (semi-)definite coefficient matrix and b the given output data. Using the SVD decomposition of A, the norms of the solution and error can be written as

$$\eta(\gamma) = \|x_\gamma\|_2^2 = \sum_i f_i^2\,\frac{(u_i^T b)^2}{\sigma_i^2} \qquad (8)$$

and

$$\rho(\gamma) = \|Ax_\gamma - b\|_2^2 = \sum_i (1 - f_i)^2\,(u_i^T b)^2, \qquad (9)$$

where $u_i$ are the singular vectors, $\sigma_i$ the singular values, and $f_i$ the Tikhonov filter factors, that depend on $\sigma_i$ and $\gamma$ as follows,

$$f_i = \frac{\sigma_i^2}{\sigma_i^2 + \gamma}. \qquad (10)$$

The derivatives of $\eta(\gamma)$ and $\rho(\gamma)$ with respect to $\gamma$ are then given by

$$\eta'(\gamma) = -2\sum_i \frac{\sigma_i^2\,(u_i^T b)^2}{(\sigma_i^2 + \gamma)^3} \qquad (11)$$

and

$$\rho'(\gamma) = 2\gamma\sum_i \frac{\sigma_i^2\,(u_i^T b)^2}{(\sigma_i^2 + \gamma)^3}. \qquad (12)$$

(Note that in [3] these equations were derived using $\gamma^2$, which resulted in having a factor 4 instead of 2 in each equation.) Rewriting $\rho'(\gamma)$ and using the fact that $1 - f_i = \gamma f_i/\sigma_i^2$ leads to a very important relation,

$$\rho'(\gamma) = -\gamma\,\eta'(\gamma). \qquad (13)$$
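The following is a small numerical check of the reconstructed relation (13), added here purely for illustration: it builds an arbitrary symmetric positive semi-definite A and right-hand side b, evaluates η(γ) and ρ(γ) through the Tikhonov filter factors (8)-(10), and compares a finite-difference ρ'(γ) with -γη'(γ).

```python
# Numerical check of relation (13), rho'(gamma) = -gamma * eta'(gamma),
# for the Tikhonov problem (7). Matrix and data are arbitrary test values.
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(30, 30))
A = M @ M.T                                   # symmetric positive semi-definite
b = rng.normal(size=30)

U, s, _ = np.linalg.svd(A)
beta = U.T @ b                                # coefficients u_i^T b

def eta_rho(gamma):
    f = s**2 / (s**2 + gamma)                 # Tikhonov filter factors (10)
    eta = np.sum(f**2 * beta**2 / s**2)       # squared solution norm (8)
    rho = np.sum((1.0 - f)**2 * beta**2)      # squared residual norm (9)
    return eta, rho

gamma, h = 0.5, 1e-6
eta_p = (eta_rho(gamma + h)[0] - eta_rho(gamma - h)[0]) / (2 * h)
rho_p = (eta_rho(gamma + h)[1] - eta_rho(gamma - h)[1]) / (2 * h)
print(rho_p, -gamma * eta_p)                  # agree up to finite-difference error
```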


III ESTIMATE FOR C

The relation (13) also applies to the Support Vector Machine formulation with quadratic loss when the implicit feature space, defined by the kernel, is considered. In this section, the relation (13), combined with (6), will be used to derive an estimate.¹ First, consider the functional (6). In order to find the optimal regularization parameter $\gamma$, (6) has to be minimized, that is, to set $H'(\gamma) = 0$. The derivative of $H(\gamma)$ is given by

$$H'(\gamma) = \rho'(\gamma)\,\eta(\gamma) + \rho(\gamma)\,\eta'(\gamma). \qquad (14)$$

Rewriting the relation (13), given in the previous section, such that $\gamma$ stands alone, and using (14), leads to

$$\gamma = \frac{\rho(\gamma)}{\eta(\gamma)}.$$

Now using the fact that $C = 1/\gamma$, we arrive at

$$C = \frac{\eta(\gamma)}{\rho(\gamma)}. \qquad (15)$$

This equation forms the basis of the estimate.²

Since the true solution, and therefore the true error, is unknown, we will use upper and lower bounds in terms of the a priori parameters. From Support Vector theory it is known that the norm of the solution satisfies $\|w\|^2 \le R^2$, where R is the radius of the ball centred at the origin in the feature space, which can be computed as $R^2 \le \max_{1 \le i \le \ell} K(x_i, x_i)$. Therefore, $\eta(\gamma) \le R^2$.

¹ It is also interesting to note the close resemblance between the derivation of the expression for the curvature of the L-Curve, which uses the SVD decomposition, and the use of the eigenvalues and eigenvectors in the method of the Effective Number of Parameters that was suggested in statistics for estimating parameters for ridge regression [4].

² Vapnik derived in Chapter 7 of [7] a similar relation for $\gamma$ as in (15), as part of the proof of a theorem. Vapnik used, however, an entirely different approach. The relation $\rho(Af_\ell, Af) \le 2d\sqrt{\gamma_\ell}$ can be rewritten to $\gamma_\ell \ge \rho^2(Af_\ell, Af)/4d^2$. A is an operator in a Hilbert space and the function $\rho$ is a metric measuring the distance between the true output $Af = F$ and the predicted output $Af_\ell$ of the optimal solution $f_\ell$. Finally, d is such that $\|f_\ell\| \le d$.

Now, consider the term for the norm of the error, $\rho$. Let $\hat{y}_i$ be the predicted output value of $y_i$ of the SVM model. Since the SVM for regression uses an $\varepsilon$-insensitive loss function,

$$\rho = \frac{1}{\ell}\sum_{i=1}^{\ell}\Big|y_i - \sum_{k=1}^{\ell}(\alpha_k - \alpha_k^*)\,K(x_k, x_i) - b\Big|_\varepsilon^2 \qquad (16)$$

$$\;\; = \frac{1}{\ell}\sum_{i=1}^{\ell}\big|y_i - \hat{y}_i\big|_\varepsilon^2, \qquad (17)$$

where $|u|_\varepsilon = \max(0, |u| - \varepsilon)$ denotes the $\varepsilon$-insensitive loss. It is clear from (17) that a lower bound in terms of a priori information should involve the number of data points, the range of the output data and the value of $\varepsilon$. Since no such bound exists in the literature, one was derived from a number of assumptions about the error and experimental results.

Let us assume that the resulting model will be a relatively good model, such that the $\varepsilon$-insensitive zone is smaller than half the range of the output values, and that there is an equal number of support vectors above and below the $\varepsilon$-insensitive zone. Then a very loose lower bound on (17) can be given by

$$\rho > \frac{1}{\ell}\left(\frac{\mathrm{Range}(y)}{2} - \varepsilon\right)^2. \qquad (18)$$

From experimental observations, it was found that a power of four gives a more accurate estimation. This leads to the proposed estimate given by

$$\hat{C} = \frac{R^2}{\frac{1}{\ell}\left(\frac{\mathrm{Range}(y)}{2} - \varepsilon\right)^4}. \qquad (19)$$
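A minimal sketch of how such an estimate could be evaluated follows, assuming the reconstruction of (18)-(19) above (the original equations are garbled in this copy, so this is one consistent reading rather than the authors' exact formula or code); $R^2$ is taken as the largest kernel diagonal entry, as in the bound quoted earlier.

```python
# Sketch of the reconstructed estimate (19):
#   C_hat = R^2 / [ (1/l) * (Range(y)/2 - epsilon)^4 ],  with R^2 = max_i K(x_i, x_i).
# This follows the reconstruction above, not a verbatim transcription of the paper.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def estimate_C(X, y, epsilon, kernel=rbf_kernel, **kernel_params):
    K = kernel(X, X, **kernel_params)
    R2 = np.max(np.diag(K))                           # bound on the solution norm
    half_range = (y.max() - y.min()) / 2.0            # half the output range
    rho_lower = (half_range - epsilon) ** 4 / len(y)  # reconstructed lower bound on rho
    return R2 / rho_lower

# Example with the 2D XOR-like function used in the experiments (no noise here):
rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(100, 2))
y = X[:, 0] * X[:, 1] + 1.0
print(estimate_C(X, y, epsilon=0.05, gamma=12.5))     # gamma = 12.5, i.e. RBF width 0.2
```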

    IV EXPERIMENTAL RESULTS

In this section the estimated value of C, using (19), is compared to the value of C determined by using the L-Curve. Several data sets with varying sizes,


noise levels and dimensions were used. The results for $f(z_1, z_2) = z_1 z_2 + 1$ with $(z_1, z_2) \in [-1, 1]^2$ (equivalent to a continuous version of the 2D XOR problem) are presented in this section. The learning data consisted of a random sampling of this function after noise of $N(0, 0.05)$ was added.

Figure 2: Results for an RBF kernel of width 0.2. The (near) optimal value of C is indicated by an asterisk and the estimated value of C by a circle. (a) Error statistics for each iteration step in the L-Curve method. The L-Curve is shown in (b) and the corner point of the L-Curve in (c). In (d) and (e) the performances of the model using the optimal C and the model using the estimated C are shown, respectively.

In Figure 2 the results from the L-Curve approach are compared to the estimated value of C, using an RBF kernel with a width of 0.2 and an ε of 0.05. The L-Curve approach requires the building of several models for increasing values of C. The range of values for C considered needs to be large enough, otherwise the corner point of the L-Curve cannot be seen. Therefore, the C-values used were chosen on a logarithmic scale. Figure 2(a) shows various error statistics of models for increasing values of C. The resulting L-Curve is plotted in Figure 2(b). The area between the vertical dashed lines in Figure 2(a) corresponds to the area in the corner of the L-Curve, as shown in Figure 2(b). The area around the corner point in the L-Curve is shown more clearly in Figure 2(c). In Figure 2(a) and Figure 2(c) the location of the optimal C-value is indicated by the asterisk and the circle shows the location of the estimated C-value. Finally, Figure 2(d) and Figure 2(e) show the performance of the models built using the (near) optimal C and the estimated C, respectively. At first glance, one might think that an estimated value of C = 340 is far from the (near) optimal value of C = 1151 from the L-Curve. However, from Figure 2(c) it is clear that C is a rather robust parameter. Therefore, the estimation needs only to predict a value of C close to the corner of the L-Curve.
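For reference, the sketch below illustrates the kind of sweep described above (an illustration with assumed details, not the authors' code): SVR models are fitted over a logarithmic grid of C values, the solution norm η of (4) and the ε-insensitive error norm ρ of (16)-(17) are computed for each, and the corner is taken as the minimiser of the product ηρ, i.e. the functional (6). scikit-learn's linear-loss SVR stands in for the quadratic-loss machine used in the paper.

```python
# Sketch of the L-Curve sweep: fit SVR for a logarithmic grid of C values,
# compute eta = ||w||^2 and the eps-insensitive error norm rho for each model,
# and take the corner as the minimiser of eta*rho (functional (6)).
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = X[:, 0] * X[:, 1] + 1.0 + rng.normal(0.0, 0.05, 200)   # noisy 2D XOR-like data

eps, gamma = 0.05, 12.5                   # eps-zone 0.05, RBF width 0.2
curve = []
for C in np.logspace(-1, 4, 26):          # logarithmic grid of C values
    m = SVR(kernel="rbf", gamma=gamma, C=C, epsilon=eps).fit(X, y)
    a = m.dual_coef_.ravel()              # alpha_i - alpha_i^* of the support vectors
    sv = m.support_vectors_
    eta = a @ rbf_kernel(sv, sv, gamma=gamma) @ a          # ||w||^2 in feature space
    resid = np.abs(m.predict(X) - y)
    rho = np.mean(np.maximum(resid - eps, 0.0) ** 2)       # eps-insensitive error norm
    curve.append((C, eta, rho))

C_corner, _, _ = min(curve, key=lambda t: t[1] * t[2])
print("corner of the L-Curve near C =", C_corner)
```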

Figure 3: Results for a polynomial kernel of degree 2. (a) The L-Curve-determined C and the estimated C are plotted against the percentage of support vectors of each model. (b) The R² statistic of predictions made by the models. (c) The Root Mean Square Error of Prediction of the models.

Figure 3 shows the results of SVM with a polynomial kernel for various values of ε, ranging from 0 to 0.125. For each value of ε, the C value was generated by the L-Curve method and also estimated by (19).


Figure 3(a) shows the determined value of C from the L-Curve method and the estimated value, plotted against the resulting percentage of support vectors of each model. The performance of models using the optimal C and the estimated C is compared by using the R²-statistic and the Root Mean Square Error of Prediction (RMSEP).³ In Figure 3(b) and Figure 3(c) it is clear that the estimated value of C produces models with error statistics that compare well with the error statistics of a model using the optimal value of C.

The CPU time for determining the (near) optimal value of C through the L-Curve method was on average around 90 seconds. For the estimation method, the CPU time was less than 1 second. The computational advantage speaks for itself. The estimated value of C can also help to speed up the L-Curve method, since one can get a good initial guess for a starting point of the algorithm.

    V CONCLUSIONS

A method for estimating the regularization parameter C for Support Vector Regression problems is presented. The estimation is based on results from the analysis of the L-Curve method. It was mentioned in the introduction that choosing a value for C should involve taking into account several factors, including the kernel function and the noise level. These factors are all present in the heuristic proposed.

Comparing the values of C obtained from the L-Curve method with the values determined by the estimate, using several data sets, showed that the estimates of the C-values are in close proximity to the optimal C. Furthermore, the difference in performance between a model using the C-value determined by the L-Curve and a model using the C estimated by the method is very small and often negligible.

The computation time needed to determine a good estimate of the optimal C is a fraction of the time needed to determine the (near) optimal value of C by means of the L-Curve method. Therefore, the proposed estimation method can be used for online applications in industry. In particular, if one needs to make a quick assessment whether a given data set can be solved with the SVM method or if a given kernel function is an appropriate choice, the fast and robust estimation method is extremely useful.

In this paper, only the ε-Support Vector Machine was considered with quadratic loss, assuming that ε is known a priori. Future work includes deriving similar estimates for the ε-SVM with linear loss as well as for the ν-Support Vector Machine [1], where the expected ratio of support vectors is used instead of ε.

³ The RMSEP is the relative error multiplied by the standard deviation of the predicted test data.

References

[1] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.

[2] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Kluwer Academic Publishers, Hingham, MA, 1996.

[3] P. C. Hansen, "The L-Curve and its use in the numerical treatment of inverse problems", invited paper for P. Johnston (Ed.), Computational Inverse Problems in Electrocardiology, pp. 119-142, WIT Press, Southampton, 2001.

[4] T. J. Hastie and R. J. Tibshirani, Generalized Linear Models, Chapman and Hall, London, UK, 1990.

[5] B. Schölkopf, C. J. Burges, and A. J. Smola, Advances in Kernel Methods: Support Vector Learning, MIT Press, London, 1998.

[6] A. J. Smola, Regression Estimation with Support Vector Learning Machines, Master's Thesis, TU Berlin, 1996.

[7] V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.

[8] V. N. Vapnik, E. Levin, and Y. Le Cun, "Measuring the VC Dimension of a Learning Machine", Neural Computation, Vol. 6:5, 1994.
