

A supervised decision rule for multiclass problems minimizing a loss function

Nisrine Jrad, Edith Grall-Maës, Pierre Beauseroy

Université de Technologie de Troyes, ICD (FRE CNRS 2848), LM2S, 12 rue Marie Curie, BP 2060, 10010 Troyes cedex - France

{nisrine.jrad, edith.grall, pierre.beauseroy}@utt.fr

Abstract

A multiclass learning method which minimizes a loss function is proposed. The loss function is defined by costs associated to the decision options, which may include classes, subsets of classes if partial rejection is considered, and all classes if total rejection is introduced. A formulation of the general problem is given, a decision rule based on the ν-1-SVMs trained on each class is defined, and a learning method is proposed. The latter optimizes all the ν-1-SVM parameters and all the decision rule parameters jointly in order to minimize the loss function. To extend the search space of the ν-1-SVM parameters and keep the processing time under control, the ν-1-SVM regularization path is derived for each class and used during the learning process. Experimental results on artificial data sets and some benchmark data sets are provided to assess the effectiveness of the approach.

1 Introduction

In supervised learning of decision rules, the support vector machine (SVM) is a powerful and widely used tool. It was originally designed for binary classification, and much ongoing research has studied how to effectively extend SVM classification methods to multiclass classification. Currently, there are two types of approaches. The first one is based on the "decomposition-reconstruction" approach, which divides the multiclass problem into several binary classification problems. The generalization step is based on a vote among the binary classifiers to derive the winning class, and the voting strategy is defined by the "decomposition-reconstruction" method in use. Several methods were developed according to this approach; the most widely used are one-vs-one [21, 14, 12], one-vs-all [21, 12, 2] and DAGSVM [21, 12, 16]. The second is the "all-together" approach, which considers all data in one optimization formulation [12, 4, 22]. Although it is computationally more expensive to solve "all-together" problems, it is considered a promising approach since it deals with all the samples from all the classes in one optimization formulation and thus makes it possible to optimize the overall solution. Moreover, it needs a lower number of support vectors and achieves higher performances, but yields a larger optimization problem.

Recent research proposed another approach to solve multiclass problems using Support Vector Domain Description (SVDD), a dual formulation of the One-Class SVM (1-SVM). SVDD maps data from one class into a higher dimensional space and defines a bounding region that contains as much of the data as possible while minimizing its volume. The corresponding margin error can be controlled by the SVDD parameters. Two approaches based on SVDD were proposed in [23, 9]. The concept behind them is to determine one SVDD for each class of the training set and to tune its parameters to improve the accuracy of each classifier. The generalization step is based on maximizing a decision or membership function. The method proposed in [23] trains each SVDD separately by solving an optimization problem, while the one proposed in [9] trains all the SVDDs by formulating one optimization problem; new samples are classified according to a membership function.

However, there are still some restrictions in these approaches. Up to now, multiclass SVMs which take into account classification costs that depend on the class have not yet been considered. Biomedical applications such as cancer diagnosis are good examples of such cases: for these problems, taking a wrong decision on a sick patient costs more than taking a wrong decision on a healthy patient. In the case of binary problems, solutions based on SVM have been proposed [1, 3, 5]. Multiclass decision problems may also require asymmetric costs, as in image indexing or face identification applications. This paper proposes a new method to determine a multiclass decision rule which minimizes a general loss function. This loss function can be defined according to the Bayesian risk function or it can be considered in the general framework of class selective rejection [8, 11]. Class selective rejection consists in rejecting some patterns from one, some, or all classes in order to ensure a higher reliability. In this case, the loss function allows flexibility in defining different penalties on wrong decisions and on partially correct decisions.

The proposed method is based on the "decomposition-reconstruction" approach coupled with the regularization path search of each ν-1-SVM [10, 17] and a prediction function. It enables the global optimization of the decision rule and only requires the training of one ν-1-SVM [18, 20, 19] for each class, which leads to solving one relatively small optimization problem per class. So that the prediction function, used to decide the winning class, takes the cost asymmetry into consideration, a weighted "distance measure" is defined. The parameters of all the ν-1-SVMs are optimized jointly in order to minimize the loss function. Taking advantage of the regularization path method, the entire parameter search space is considered. Since the search space is widely extended, the selected decision rule is more likely to be the optimal one.

This paper is outlined as follows. A brief description of the ν-1-SVM and its regularization path is given in section 2. Section 3 presents the multiclass decision problem and describes the proposed training algorithm, based on the ν-1-SVM and the regularization path, which are used to determine the decision rule. In section 4, experiments on an artificial data set and benchmark data sets are reported and the performances of the proposed method are given. Finally, a conclusion and perspectives are presented in section 5.

2 ν-1-SVM and regularization path

This section describes the ν-1-SVM concept [18, 20, 19] and briefly explains the derivation of the entire regularization path [10, 17], which makes it possible to obtain rapidly all ν-1-SVM models for a wide range of values of ν.

2.1 ν-One class Support Vector Machine

Considering a set of n vectors X = {x_1, x_2, ..., x_n} drawn from an input space X, the ν-1-SVM computes a decision function f^λ_X(.) and a real number b^λ in order to determine the region R^λ in X such that f^λ_X(x) − b^λ ≥ 0 if x ∈ R^λ and f^λ_X(x) − b^λ < 0 otherwise. The decision function f^λ_X(.) is parameterized by λ = νn (with 0 ≤ ν < 1) to control the number of outliers. It is designed by minimizing the volume of R^λ under the constraint that all the vectors of X, except the fraction ν of outliers, must lie in R^λ. In order to determine R^λ, the space of possible functions f^λ_X(.) is reduced to a Reproducing Kernel Hilbert Space (RKHS) with kernel function K(., .). Let φ : X → H be the mapping defined over the input space X and let ⟨., .⟩_H be a dot product defined in H. The kernel K(., .) over X × X is defined by:

$$\forall (x_p, x_q) \in \mathcal{X} \times \mathcal{X}, \quad K(x_p, x_q) = \langle \phi(x_p), \phi(x_q) \rangle_{\mathcal{H}}$$

Without loss of generality, K(., .) is supposed normalized such that for any x ∈ X, K(x, x) = 1. Thus, all the mapped vectors φ(x_p), p = 1...n, lie in a subset of a hypersphere with radius one and center O. Provided K(., .) is always positive, φ(X) is a subset of the positive orthant of the hypersphere. A common choice of K(., .) is the Gaussian RBF kernel K(x_p, x_q) = exp(−‖x_p − x_q‖²_X / 2σ²), with σ the parameter of the Gaussian RBF kernel. The ν-1-SVM consists of separating the training vectors in H from the center O with a hyperplane W^λ. Finding the hyperplane W^λ is equivalent to finding the decision function f^λ_X(.) such that f^λ_X(x) − b^λ = ⟨w^λ, φ(x)⟩_H − b^λ ≥ 0 for the (1 − ν)n mapped training vectors, while W^λ is the hyperplane with maximum margin b^λ/‖w^λ‖_H, with w^λ the normal vector of W^λ. This yields f^λ_X(.) as the solution of the following convex quadratic optimization problem:

$$\min_{w^\lambda, b^\lambda, \xi_p} \; \sum_{p=1}^{n} \xi_p - \lambda b^\lambda + \frac{\lambda}{2} \| w^\lambda \|_{\mathcal{H}}^{2}$$
$$\text{subject to } \langle w^\lambda, \phi(x_p) \rangle_{\mathcal{H}} \geq b^\lambda - \xi_p \;\text{ and }\; \xi_p \geq 0, \quad \forall p = 1 \ldots n \qquad (1)$$

where the ξ_p are the slack variables. This optimization problem is solved by introducing Lagrange multipliers α_p. As a consequence of the Kuhn-Tucker conditions, w^λ is given by:

$$w^\lambda = \frac{1}{\lambda} \sum_{p=1}^{n} \alpha_p \phi(x_p)$$

which results in:

$$f^\lambda_X(.) - b^\lambda = \frac{1}{\lambda} \sum_{p=1}^{n} \alpha_p K(x_p, .) - b^\lambda$$

The dual formulation of (1) is obtained by introducing the Lagrange multipliers as:

$$\min_{\alpha_1, \ldots, \alpha_n} \; \frac{1}{2} \sum_{p=1}^{n} \sum_{q=1}^{n} \alpha^\lambda_p \alpha^\lambda_q K(x_p, x_q) \quad \text{with} \quad \sum_{p=1}^{n} \alpha^\lambda_p = \lambda \;\text{ and }\; 0 \leq \alpha^\lambda_p \leq 1, \; \forall p = 1 \ldots n \qquad (2)$$

A geometrical interpretation of the solution in the RKHS is given in figure 1. f^λ_X(.) and b^λ define a hyperplane W^λ orthogonal to w^λ. The hyperplane W^λ separates the φ(x_p) from the sphere center while having b^λ/‖w^λ‖_H maximum, which is equivalent to minimizing the portion S^λ of the hypersphere bounded by W^λ that contains the set {φ(x) s.t. x ∈ R^λ}.
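For a fixed value of ν, problem (1)-(2) can be solved with an off-the-shelf one-class SVM. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses scikit-learn's OneClassSVM, whose ν parameterization corresponds to ν = λ/n and whose RBF kernel matches the normalized Gaussian kernel above when gamma = 1/(2σ²); the synthetic data are placeholders.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Stand-in training set for one class (the real data would be the samples of one class).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

sigma, nu = 1.0, 0.1                      # nu plays the role of lambda / n
clf = OneClassSVM(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), nu=nu).fit(X)

# decision_function(x) plays the role of f(x) - b: non-negative inside the
# region R, negative for (at most) a fraction nu of training outliers.
outlier_fraction = float(np.mean(clf.decision_function(X) < 0))
print(f"fraction of training outliers: {outlier_fraction:.2f}")
```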


[Figure 1. Training data mapped into the feature space on a portion S^λ of a hypersphere (radius R = 1, center O), showing the angle θ^λ, the margin related to ‖w^λ‖_H, and the margin SVs, non-margin SVs and non-SVs.]

2.2 ν-1-SVM regularization path

The regularization path was first introduced by Hastie et al. [10] for the binary SVM. Later, Davy and Rakotomamonjy [17] developed the entire regularization path for the ν-1-SVM. The basic idea of the ν-1-SVM regularization path is that the parameter vector of a ν-1-SVM is a piecewise linear function of λ. Thus the principle of the method is to start with a large λ (i.e. λ = n − ε) and decrease it towards zero, keeping track of the breaks that occur as λ varies.

As λ decreases, ‖w^λ‖_H increases and hence the distance between the sphere center and W^λ decreases. Points move from being outside the portion S^λ (non-margin SVs with α^λ_p = 1 in figure 1) to inside it (non-SVs with α^λ_p = 0). By continuity, points must linger on the hyperplane W^λ (margin SVs with 0 < α^λ_p < 1) while their α^λ_p decrease from 1 to 0. The α^λ_p are piecewise-linear in λ, and break points occur when a point moves from one position to another. Since the α^λ_p are piecewise-linear in λ, f^λ(.) and b^λ are also piecewise-linear in λ. Thus, after initializing the regularization path (computing the α^λ_p by solving (2) for λ = n − ε), almost all the α^λ_p are computed by solving linear systems. Only for a few integer values of λ smaller than n are the α^λ_p computed by solving (2), according to [17].

Using simple linear interpolation, this algorithm makes it possible to determine very rapidly the ν-1-SVM corresponding to any value of λ.
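Once the breakpoints (λ, α^λ) of the path have been stored, the model at any λ follows by linear interpolation. The sketch below assumes the breakpoints have already been computed by the path algorithm of [10, 17]; the function names and array layouts are illustrative, not the authors' code.

```python
import numpy as np

def alpha_at(lambda_breaks, alpha_breaks, lam):
    """Recover alpha(lambda) at an arbitrary lambda by linear interpolation
    between stored breakpoints of the regularization path.

    lambda_breaks : (m,) breakpoint values of lambda
    alpha_breaks  : (m, n) alpha vectors stored at those breakpoints
    """
    order = np.argsort(lambda_breaks)            # np.interp needs increasing x
    lb = np.asarray(lambda_breaks)[order]
    ab = np.asarray(alpha_breaks)[order]
    return np.array([np.interp(lam, lb, ab[:, p]) for p in range(ab.shape[1])])

def f_minus_b(alpha, b, k_row, lam):
    """Decision value f_lambda(x) - b_lambda = (1/lambda) * sum_p alpha_p K(x_p, x) - b_lambda,
    where k_row contains the kernel evaluations K(x_p, x)."""
    return float(k_row @ alpha) / lam - b
```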

3 Multiclass SVM based on ν-1-SVMs

This section presents the decision problem and describes the proposed method, based on the ν-1-SVM, to determine the decision rule that minimizes any loss function.

3.1 Multiclass decision problem

Assuming that the multiclass decision problem deals with N classes noted w_1 ... w_N and that any vector x belongs to one class, a decision rule consists in a partition Z of R^d into I sets Z_i corresponding to the different decision options. In the simple classification scheme, the options are defined by the N classes. In the class selective rejection scheme, the options are defined by the classes and the subsets of classes (i.e. assigning x to {1, 3} means that x is assigned to classes w_1 and w_3 with ambiguity).

The problem consists in finding the decision rule Z* that minimizes a given loss function c(Z) defined by:

$$c(Z) = \sum_{i=1}^{I} \sum_{j=1}^{N} c_{ij} P_j P(D_i/w_j) \qquad (3)$$

where c_{ij} is the cost of assigning an element x to the i-th decision option when it belongs to class w_j. The values of the c_{ij} being relative, since the aim is to minimize c(Z), they can be defined in the interval [0; 1] without loss of generality. P_j is the a priori probability of class w_j and P(D_i/w_j) is the probability that elements of class w_j are assigned to the i-th option.
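As a concrete reading of (3), the loss is a cost-weighted sum of the probabilities of assigning each class to each decision option. A minimal sketch, with an assumed array layout:

```python
import numpy as np

def expected_loss(costs, priors, cond_probs):
    """Compute c(Z) = sum_i sum_j c_ij * P_j * P(D_i / w_j)  (equation (3)).

    costs      : (I, N) matrix of costs c_ij
    priors     : (N,) class priors P_j (or their empirical estimates)
    cond_probs : (I, N) matrix of P(D_i / w_j), e.g. estimated as the fraction
                 of samples of class j assigned to option i
    """
    return float(np.sum(costs * priors[None, :] * cond_probs))
```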

3.2 Training approach

To design a supervised decision rule Z, N ν-1-SVMs are trained, one on the data of each of the N classes. It was shown in [18] that ν is an upper bound on the fraction of outliers and a lower bound on the fraction of SVs. Thus, besides tuning the kernel parameter σ, tuning ν, or equivalently λ, is a crucial point since it makes it possible to control the margin error. Changing λ leads to solving the optimization problem formulated in (2). In [12, 23] a smooth grid search was supplied in order to choose the optimal values of σ and λ, and the N values of λ were chosen equal in order to reduce the parameter search space and the computational costs. In the proposed approach, all the λ_j with j = 1...N corresponding to the N ν-1-SVMs are optimized and the entire space is explored. The optimal vector is the one which minimizes an estimator of c(Z) using a training data set.
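A minimal sketch of this per-class training step, reusing the off-the-shelf one-class SVM stand-in from Section 2 (not the authors' regularization-path implementation) and allowing a different λ_j per class:

```python
from sklearn.svm import OneClassSVM

def train_class_models(X, y, sigma, lambdas):
    """Fit one one-class SVM per class; class j uses its own lambda_j,
    i.e. nu_j = lambda_j / n_j where n_j is the class sample size.

    X, y    : numpy arrays of samples and class labels
    lambdas : dict mapping class label -> lambda_j
    """
    gamma = 1.0 / (2.0 * sigma ** 2)
    models = {}
    for j, lam in lambdas.items():
        X_j = X[y == j]
        models[j] = OneClassSVM(kernel="rbf", gamma=gamma, nu=lam / len(X_j)).fit(X_j)
    return models
```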

3.3 Decision rule

In order to determine the decision rule, a prediction function should first decide the winning option. A "distance measure" between x and the training set of class w_j, using the ν-1-SVM parameterized by λ_j, is defined as follows:

$$d^{\lambda_j}(x) = \frac{\cos(w^{\lambda_j}, \phi(x))}{\cos(\theta^{\lambda_j})} = \frac{\| w^{\lambda_j} \|_{\mathcal{H}}}{b^{\lambda_j}} \cos(w^{\lambda_j}, \phi(x)) \qquad (4)$$

where θ^{λ_j} is the angle delimited by w^{λ_j} and the support vectors, as shown in figure 1. cos(θ^{λ_j}) is a normalizing factor which is used to make all the d^{λ_j}(x) comparable.


Using ‖φ(x)‖ = 1 in (4) leads to the following:

$$d^{\lambda_j}(x) = \frac{\langle w^{\lambda_j}, \phi(x) \rangle_{\mathcal{H}}}{b^{\lambda_j}} = \frac{\frac{1}{\lambda_j} \sum_{p=1}^{n_j} \alpha^{\lambda_j}_p K(x_p, x)}{b^{\lambda_j}} \qquad (5)$$

The "distance measure" d^{λ_j}(x) is inspired from [6]. When the data are distributed in a unimodal form, d^{λ_j}(x) will be a decreasing function with respect to the distance between a vector x and the data mean. The probability density function will also be a decreasing function with respect to the distance from the mean. Thus, d^{λ_j}(x) preserves distribution order relations. In such a case, the use of d^{λ_j}(x) should reach the same performances as those obtained using the distribution.

In the simplest case of multiclass problems, where the loss function is defined as the error probability, x is assigned to the class maximizing d^{λ_j}(x).
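If each class model is the scikit-learn one-class SVM stand-in sketched earlier, the ratio in (5) can be rebuilt from the fitted model; this is a sketch assuming the usual libsvm convention that the stored intercept equals −b, not the authors' implementation.

```python
import numpy as np

def distance_measure(clf, X):
    """Sketch of equation (5): d(x) = <w, phi(x)> / b, rebuilt from a fitted
    sklearn OneClassSVM, assuming decision_function(x) = <w, phi(x)> - b
    (so that clf.intercept_ stores -b)."""
    b = -clf.intercept_[0]
    return (clf.decision_function(np.atleast_2d(X)) + b) / b
```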

To extend the multiclass prediction process to the class selective rejection scheme, a weighted form of the "distance measure" is proposed. A weight β_j is associated to d^{λ_j}, which allows each class to be treated differently and helps solve problems with different costs c_{ij} on the classification decisions.

Finally, in the general case where the loss function is considered in the class selective rejection scheme, the prediction process can be defined as follows: x is assigned to the i-th option if and only if:

$$\sum_{j=1}^{N} c_{ij} P_j \beta_j d^{\lambda_j}(x) \;\leq\; \sum_{j=1}^{N} c_{lj} P_j \beta_j d^{\lambda_j}(x), \qquad \forall l = 1 \ldots I, \; l \neq i$$

Thus, in contrast to previous multiclass SVMs, which construct the maximum margin between classes and locate the decision hyperplane in the middle of the margin, the proposed approach is closer to the robust Bayesian classifier. The distribution of each class is considered in our approach and the optimal decision is slightly deviated toward the class with the smaller variance.

3.4 Solution of optimal parameters

The proposed decision rule depends on σ, λj and βj forj = 1...N . These parameters have to be tuned in order tominimize the loss function. Since the problem is describedby a sample set, an estimator c(Z) of c(Z) given by (3) isused:

c(Z) =I∑

i=1

N∑

j=1

cij Pj P (Di/wj) (6)

where Pj and P (Di/wj) are the empirical estimators of Pj

and P (Di/wj) respectively.The optimal rule is obtained by tuning λj , βj and σ so

that the estimated loss c(Z) is minimum. This is accom-plished by employing a global search for λj and βj and an
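A sketch of this search loop follows; fit_models and estimate_loss are assumed helper interfaces standing in for the path-based interpolation and the loss estimator (6) described above, and the alternating coordinate search is one plausible realization of the "alternate optimization", not the authors' exact procedure.

```python
import numpy as np

def search_parameters(fit_models, estimate_loss, sigma_grid, lambda_grid,
                      beta_grid, n_classes, n_sweeps=3):
    """Grid over sigma; for each sigma, alternate a coordinate search over the
    lambda_j and beta_j; keep the configuration minimizing the estimated loss.

    fit_models(sigma, lambdas) -> list/dict of per-class models
    estimate_loss(models, betas) -> empirical loss on the training set
    """
    best_loss, best_cfg = np.inf, None
    for sigma in sigma_grid:
        lambdas = [lambda_grid[0]] * n_classes
        betas = [1.0] * n_classes
        for _ in range(n_sweeps):                       # alternate optimization
            for j in range(n_classes):
                # update lambda_j with all other parameters fixed
                def loss_for_lambda(l):
                    trial = lambdas[:j] + [l] + lambdas[j + 1:]
                    return estimate_loss(fit_models(sigma, trial), betas)
                lambdas[j] = min(lambda_grid, key=loss_for_lambda)
                # update beta_j with all other parameters fixed
                models = fit_models(sigma, lambdas)
                def loss_for_beta(b):
                    return estimate_loss(models, betas[:j] + [b] + betas[j + 1:])
                betas[j] = min(beta_grid, key=loss_for_beta)
        loss = estimate_loss(fit_models(sigma, lambdas), betas)
        if loss < best_loss:
            best_loss, best_cfg = loss, (sigma, list(lambdas), list(betas))
    return best_loss, best_cfg
```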

[Figure 2. Probability densities and theoretical partition in function of x, showing the regions assigned to the options {1}, {2}, {3}, {1,2}, {1,3}, {2,3} and {1,2,3}.]

[Figure 3. Supervised partition in function of x, with the same decision options.]

4 Experimental results

In this section, two experiments are reported in order to assess the performance of the proposed approach. First, in order to show the ability of the proposed algorithm to solve multiclass problems with a general loss function, a 2D artificial problem with class selective rejection is studied. Second, benchmark data sets with a loss function defined as the probability of error are used to compare the proposed method with others.

For both experiments, the loss function was minimized by determining the optimal parameters β_j and λ_j for j = 1...N for a given kernel parameter σ and by testing different values of σ in the set {2^{-2.5}, 2^{-2}, 2^{-1.5}, 2^{-1}, 2^{-0.5}, 1, 2^{0.5}, 2, 2^{1.5}, 2^{2}}. Finally, the decision rule which minimizes the loss function is kept.
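In code, this kernel-width grid is simply the following (it can be passed as the sigma grid of the search sketched in Section 3.4):

```python
# Kernel-width grid used in the experiments, as listed above.
sigma_grid = [2.0 ** e for e in (-2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0)]
```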

4.1 Artificial data sets

The considered problem is given by 300 patterns x ∈ R^2 drawn from three equiprobable Gaussian distributions G_1(m_1, Σ_1), G_2(m_2, Σ_2) and G_3(m_3, Σ_3). Their means are m_1 = [−1.5  0]^T, m_2 = [1  0.4]^T and m_3 = [−0.5  2.8]^T. Their covariance matrices are Σ_1 = 0.9I, Σ_2 = 0.5I and Σ_3 = I, where I is the identity matrix. They are represented on figure 2.
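A sketch of the data generation under these parameters (the equal per-class count of 100 samples is an assumption consistent with 300 equiprobable samples in total):

```python
import numpy as np

rng = np.random.default_rng(0)

means = [np.array([-1.5, 0.0]), np.array([1.0, 0.4]), np.array([-0.5, 2.8])]
covs = [0.9 * np.eye(2), 0.5 * np.eye(2), np.eye(2)]
n_per_class = 100                      # 300 equiprobable samples in total

X = np.vstack([rng.multivariate_normal(m, c, n_per_class) for m, c in zip(means, covs)])
y = np.repeat([0, 1, 2], n_per_class)
```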

Table 1. Benchmark data sets description.

Data set    Training data    Testing data    Classes    Attributes
Iris        150              0               3          4
Wine        178              0               3          13
Segment     2310             0               7          19
Satimage    4435             2000            6          36

The problem of classification is described by the following concerns:

• Seven decision options: ψ_1 = {1}, ψ_2 = {2}, ψ_3 = {3}, ψ_4 = {1, 2}, ψ_5 = {1, 3}, ψ_6 = {2, 3} and ψ_7 = {1, 2, 3}.

• The average loss function:

$$\begin{aligned}
c(Z) = {} & 0.8 \times \{P_2 P(D_1/w_2) + P_2 P(D_3/w_2)\} \\
        & + 0.6 \times \{P_1 P(D_2/w_1) + P_1 P(D_3/w_1)\} \\
        & + P_3 P(D_1/w_3) + P_3 P(D_2/w_3) \\
        & + 0.2 \times \{P_1 P(D_4/w_1) + P_2 P(D_4/w_2) + P_1 P(D_5/w_1)\} \\
        & + P_3 P(D_5/w_3) + P_2 P(D_6/w_2) + P_3 P(D_6/w_3) \\
        & + P_3 P(D_4/w_3) + P_2 P(D_5/w_2) + P_1 P(D_6/w_1) \\
        & + 0.3 \times \{P_1 P(D_7/w_1) + P_2 P(D_7/w_2) + P_3 P(D_7/w_3)\}.
\end{aligned}$$

The decision rule which minimizes the decision loss function c(Z) is constructed in the statistical framework according to the optimal decision rule defined in [7, 13]. The partition associated with the theoretical classification rule as a function of x is represented in figure 2. The optimal decision cost obtained is 0.0651.
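For reference, with the densities known, the theoretical rule of [7, 13] amounts to choosing the option with minimal expected cost. The sketch below illustrates that computation; the cost-matrix layout is an assumption matching equation (3).

```python
import numpy as np
from scipy.stats import multivariate_normal

def theoretical_option(x, costs, priors, means, covs):
    """Assign x to the option i minimizing sum_j c_ij * P_j * p(x | w_j),
    the class-selective Bayes-type rule with known Gaussian densities.

    costs : (I, N) cost matrix c_ij; priors : (N,) class priors
    """
    densities = np.array([multivariate_normal(m, c).pdf(x) for m, c in zip(means, covs)])
    return int(np.argmin(costs @ (priors * densities)))
```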

For the design of the supervised rule using the proposed approach, a set of 300 samples was used. The value of the loss obtained with the supervised rule was computed according to equation (6). However, instead of using an estimation of P(D_i/w_j), the theoretical densities, which correspond to a validation sample set of infinite size, were used. The minimal loss obtained is equal to 0.0701. The decision boundaries obtained with the sample set are reported on figure 3.

The loss value corresponding to the learned rule is similar to the optimal one, which assesses the quality of the learning procedure.

4.2 Benchmark data sets

The proposed method was tested on benchmark data sets selected from the UCI data repository [15]. To compare the performances of the proposed method with those of previous classification approaches tested on the UCI repository [12, 23], the loss function is considered as the probability of error. These latter approaches are based on binary SVMs and single one-class SVMs respectively. Since [12] is a comparative study of standard multiclass SVMs based on binary SVMs, our results are compared to all existing approaches including one-vs-one, one-vs-all, DAGSVM, C&S and all-together methods. Table 1 describes the data sets used. All the training data were scaled to have a mean of zero and a variance of one.

For the satimage data set, a training set and a validation set are available. For iris, wine, and segment only training sets are available. The performance of the proposed method is measured by evaluating its accuracy rate and it is compared to the results obtained in [12, 23].

Two methods were used in order to compute the generalization accuracy, as in [12, 23]. For the data sets where a training set and a validation set are available, the performance was measured by using 70% of the training set to determine the classification rule and the other 30% of the training set to test the performance of the rule. Then the whole training set was used to build the decision rule using the optimal parameters and the whole validation set was used to compute the accuracy. For the other, smaller data sets where validation data are not available, a 5-fold cross-validation was used on the whole training data and the best cross-validation rate was reported.

The obtained accuracy rates are reported in Table 2 and compared with the OC-k-SVM method [23] and the best multiclass method based on binary SVMs cited in [12].

For the problems with large data sets (satimage and segment), the proposed approach achieves competitive performance compared with the results reported in [12, 23].

For the problems with smaller sets (iris and wine), the results obtained with the proposed algorithm outperform those listed in [23] and are similar to those given in [12].

5 Conclusion

In this paper, a multiclass learning algorithm with loss function minimization is proposed. It introduces two novelties. First, it is based on a "decomposition-reconstruction" strategy but, contrary to other methods which use this strategy, the proposed method performs a global optimization of the decision rule. Second, it makes it possible to solve decision problems which are defined by any Bayesian loss function, and it can be applied to decision problems which take into account partial or total rejection options.

The "decomposition-reconstruction" strategy is applied by using one ν-1-SVM per class, which allows the learning task to be performed for each class separately. Global optimization is obtained by using the ν-1-SVM regularization path, which allows all the ν-1-SVM models of each class to be considered while optimizing the decision rule.

The performance of the proposed method was assessed through experiments on a 2-D artificial data set and on benchmark data sets. On artificial data, it was shown that the method is suitable for determining a supervised learning rule that minimizes a loss function with asymmetric, class-dependent costs and with partial and total rejection options. On benchmark data sets, the method was compared with previous works. It outperforms the existing multiclass approaches based on SVDD and it presents satisfactory results compared with the multiclass rules based on binary SVMs.

Table 2. Comparison of the numerical experiment results: best accuracy rate computed using the proposed algorithm, the OC-k-SVM algorithm [23] and binary multiclass SVM classifiers [12].

Data set    Proposed algorithm    OC-k-SVM [23]    Best of [12] (corresponding method)
Iris        94.68                 90.67            97.33 (one-vs-one, C&S and all-together)
Wine        96.60                 54.49            99.43 (one-vs-one)
Segment     98.15                 98.57            97.57 (all-together)
Satimage    90.80                 90.26            92.35 (C&S)

The proposed approach has several advantages. It allows the design of a supervised learning rule which minimizes a loss function defined according to the Bayesian risk or to more general loss functions. The whole parameter domain of the decision rule is considered, with great computational savings due to the use of the regularization path. The obtained decision rule resembles the robust Bayesian classifier in the transformed Hilbert space. Since it is based on kernel techniques, it enables the processing of high-dimensional vectors. It can be noted that this approach is adapted to problems in which the number of classes can be modified: when a new class has to be added, the design of the decision rule only requires training the new class and determining the optimal parameters that minimize the loss function.

For future research, multiclass problems with performance constraints have to be tackled, where these constraints are defined by bounds on linear combinations of conditional probabilities of decision. The aim of the predicted rule is then to satisfy the constraints while minimizing a given loss function. A formulation of such problems was given in [7, 13]. Our proposed approach can be interesting for designing supervised learning rules for these problems, since it is able to minimize any given loss function.

References

[1] F. R. Bach, D. Heckerman, and E. Horvitz. On the path to an ideal ROC curve: considering cost asymmetry in learning classifiers. In AISTATS, 2005.
[2] L. Bottou, C. Cortes, J. Denker, H. Drucker, I. Guyon, L. Jackel, Y. LeCun, U. Muller, E. Sackinger, P. Simard, and V. Vapnik. Comparison of classifier methods: a case study in handwriting digit recognition. In International Conference on Pattern Recognition, pages 77-87, 1994.
[3] H. Chew, R. E. Bogner, and C. Lim. Dual ν-support vector machine with error rate and training size biasing. In ICASSP '01, pages 1269-1272, Washington, DC, USA, 2001. IEEE Computer Society.
[4] K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. In Computational Learning Theory, pages 35-46, 2000.
[5] M. A. Davenport. The 2ν-SVM: A cost-sensitive extension of the ν-SVM. Technical report, TREE 0504, Rice University, Dept. of Elec. and Comp. Engineering, October 2005.
[6] M. Davy, F. Desobry, A. Gretton, and C. Doncarli. An online support vector machine for abnormal events detection. Signal Process., 86(8):2009-2025, 2006.
[7] E. Grall, P. Beauseroy, and A. Bounsiar. Quality assessment of a supervised multilabel classification rule with performance constraints. In EUSIPCO'06, Italy, 2006.
[8] T. M. Ha. The optimum class-selective rejection rule. IEEE Trans. Pattern Anal. Mach. Intell., 19(6):608-615, 1997.
[9] P. Hao and Y. Lin. A new multiclass support vector machine with multisphere in the feature space. In IEA/AIE, pages 756-765, 2007.
[10] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. J. Mach. Learn. Res., 5:1391-1415, 2004.
[11] T. Horiuchi. Class selective rejection rule to minimize the maximum distance between selected classes. Pattern Recognition, 31(10):1579-1588, October 1998.
[12] C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13:415-425, 2002.
[13] N. Jrad, E. Grall, and P. Beauseroy. Supervised learning rule selection for multiclass decision with performance constraints. In IEEE conference ICPR, USA, 2008.
[14] U. H.-G. Kressel. Pairwise classification and support vector machines. Advances in Kernel Methods: Support Vector Learning, pages 255-268, 1999.
[15] C. B. D. Newman and C. Merz. UCI repository of machine learning databases, 1998.
[16] J. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In S. Solla, T. Leen, and K.-R. Mueller, editors, Advances in Neural Information Processing Systems 12, pages 547-553, 2000.
[17] A. Rakotomamonjy and M. Davy. One-class SVM regularization path and comparison with alpha seeding. In ESANN 2007, pages 221-224, Brugge, Belgium, April 2007.
[18] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443-1471, 2001.
[19] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
[20] D. Tax. One-class classification: concept learning in the absence of counter-examples. PhD thesis, Technische Universiteit Delft, 2001.
[21] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, September 1998.
[22] J. Weston and C. Watkins. Multiclass support vector machines. In ESANN, 1999.
[23] X. Yang, J. Liu, M. Zhang, and K. Niu. A new multiclass SVM algorithm based on one-class SVM. In ICCS 2007, pages 677-684, Beijing, China, May 27-30, 2007.
