# Discriminant Support Vector Data Description


Third International Workshop on Advanced Computational Intelligence, August 25-27, 2010, Suzhou, Jiangsu, China

Discriminant Support Vector Data Description

Zhe Wang and Daqi Gao

Abstract-Support Vector Data Description (SVDD) was designed to construct a minimum hypersphere that encloses all the data of the target class in the one-class classification case. In this paper, we propose a novel Discriminant Support Vector Data Description (DSVDD). The proposed DSVDD adopts relevant metric learning instead of the original Euclidean distance metric of SVDD, where relevant metric learning can take the relationship between data into account. By incorporating both the positive and negative equivalence information, the presented DSVDD assigns large weights to the relevant features and tightens the similar data. More importantly, we introduce discriminant prior knowledge into the proposed algorithm by considering the negative equivalence information. The experiments show that the proposed DSVDD achieves more accurate classification performance than the conventional SVDD for all the tested data.

I. INTRODUCTION

In one-class classification, generally only one certain class, named the target class, is available, which differs from multi-class classification. Support Vector Domain Description (SVDD), a popular one-class classifier, was proposed by Tax and Duin [7], [8], [9]. In the SVDD model, a hypersphere is constructed such that it encloses as many target objects as possible while minimizing the chance of accepting non-target data, named the outlier objects. It is well known that the original SVDD model adopts the Euclidean distance metric [7], [8], [9]. However, an important problem in learning algorithms based on the Euclidean distance metric is the scale of the input variables. In the Euclidean case, SVDD treats all the features of the target-class data as equivalent in training. As a result, irrelevant features of the data might be considered in training and mislead the data description of the SVDD model into an irrelevant hypersphere. Simultaneously, the SVDD with the Euclidean distance metric cannot consider the prior relationship among the target data.

Relevant metric learning was first developed as a form of Mahalanobis distance metric learning [3], [5], [6]. One of its special algorithms is Relevant Component Analysis (RCA) [6]. RCA is an effective linear transformation for unsupervised learning. It constructs a Mahalanobis distance metric by using the positive equivalence relationship. The positive

Zhe Wang and Daqi Gao are with the Department of Computer Science & Engineering, East China University of Science & Technology, Shanghai, China (email: {wangzhe, gaodaqi}@ecust.edu.cn).

This work was supported by the Natural Science Foundation of China under Grant No. 60903091, the High-Tech Development Program of China (863) under Grant No. 2006AA10Z315, and the Specialized Research Fund for the Doctoral Program of Higher Education under Grant No. 20090074120003. This work was also supported by the Open Projects Program of the National Laboratory of Pattern Recognition and the Fundamental Research Funds for the Central Universities.

978-1-4244-6337-4/10/$26.00 ©2010 IEEE

equivalence relationship is obtained from the covariance matrices of the positive equivalence data. In RCA, the positive equivalence data are selected from the same chunklet. Each chunklet is a set in which the data come from the same class but carry no explicit class labels. Through the transformation based on a group of chunklets, RCA can assign large weights to relevant features and low weights to irrelevant features [6]. Unfortunately, RCA cannot make use of negative equivalence constraints or discriminant information. To this end, Yeung and Chang [10] extended RCA with both positive and negative equivalence relationships. Specifically, the extended RCA was achieved by designing the so-called within-chunklet and between-chunklet covariance matrices. In doing so, both positive and negative equivalence constraints can be used.

In this paper, we introduce the extended RCA distance metric [10], rather than the original Euclidean distance metric, into SVDD and thereby propose a novel Discriminant Support Vector Data Description (DSVDD). In doing so, the presented DSVDD inherits the advantages of the extended RCA. In practice, the proposed DSVDD can reduce the influence of the input variable scale owing to the Mahalanobis distance metric from the extended RCA. Simultaneously, the proposed DSVDD can easily incorporate a priori discriminant knowledge owing to the consideration of both the positive and negative equivalence data through the within-chunklet and between-chunklet covariance matrices. In order to validate the effectiveness of the proposed DSVDD algorithm, we give experimental results on both synthetic and real data sets. The experimental results show that the proposed DSVDD brings a more accurate description than the conventional SVDD for all the tested data.

The rest of this paper is organized as follows. Section II gives the structure of the proposed DSVDD. Section III experimentally shows that the proposed DSVDD brings a more accurate description for all the tested target cases than the conventional SVDD. Following that, both the conclusion and future work are given in Section IV.

II. DISCRIMINANT SUPPORT VECTOR DATA DESCRIPTION (DSVDD)

Suppose that there is a set of one-class training samples $\{x_i\}_{i=1}^{N} \subset \mathbb{R}^n$. SVDD seeks a hypersphere that contains all the samples $\{x_i\}_{i=1}^{N}$ while minimizing the volume of the hypersphere, through the following optimization formulation:

$$\min_{R,\,a,\,\xi} \; J = R^2 + C \sum_{i=1}^{N} \xi_i \tag{1}$$

$$\text{subject to} \quad (x_i - a)^T M^{-1} (x_i - a) \le R^2 + \xi_i, \tag{2}$$

$$\xi_i \ge 0, \quad i = 1 \ldots N \tag{3}$$

where the parameters $R \in \mathbb{R}$ and $a \in \mathbb{R}^n$ are the radius and the center of the optimized hypersphere, respectively; the regularization parameter $C \in \mathbb{R}$ gives the tradeoff between the volume of the hypersphere and the errors; and the $\xi_i \in \mathbb{R}$ are slack variables. Since SVDD adopts the Euclidean distance metric, the matrix $M \in \mathbb{R}^{n \times n}$ is the identity matrix, with all diagonal elements 1 and the others 0.
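As an illustration of the formulation above, note that for fixed $R$ and $a$ the optimal slack is $\xi_i = \max(0, (x_i - a)^T M^{-1}(x_i - a) - R^2)$, so the primal objective (1)-(3) can be evaluated directly for a candidate solution. The following NumPy sketch is our own illustration, not code from the paper; the function name is hypothetical:

```python
import numpy as np

def svdd_primal_objective(X, a, R, C, M=None):
    """Evaluate J = R^2 + C * sum(xi_i) from Eqs. (1)-(3), substituting the
    optimal slack xi_i = max(0, (x_i-a)^T M^{-1} (x_i-a) - R^2).
    M defaults to the identity, i.e. the Euclidean metric of original SVDD."""
    Minv = np.eye(X.shape[1]) if M is None else np.linalg.inv(M)
    # squared metric distance of each sample to the center a
    d2 = np.einsum('ij,jk,ik->i', X - a, Minv, X - a)
    xi = np.maximum(0.0, d2 - R**2)  # slack for samples outside the sphere
    return R**2 + C * xi.sum()
```

For example, a unit sphere at the origin that leaves one sample at distance 3 outside pays $R^2 + C\,(9-1)$.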

It can be found that SVDD views all the features of the samples as equivalent. In contrast, our proposed DSVDD framework assigns large weights to the relevant features and small weights to the irrelevant features by introducing relevant metric learning instead of the Euclidean metric. In the proposed DSVDD framework, we adopt the relevant metric learning defined in [10]. Firstly, the whole sample set $\{x_i\}_{i=1}^{N}$ is divided into chunklets without replacement. Each chunklet is made up of data with the positive equivalence relationship: if $x_i$ and $x_j$ belong to the same chunklet, both should have the same, but unknown, class label. Following [10], we give the so-called within-chunklet covariance matrix $M_1$ and between-chunklet covariance matrix $M_2$ as follows:

$$M_1 = \frac{1}{N} \sum_{d=1}^{D} \sum_{j=1}^{n_d} (x_{jd} - \bar{x}_d)(x_{jd} - \bar{x}_d)^T \tag{4}$$

$$M_2 = \frac{1}{N(D-1)} \sum_{d=1}^{D} \sum_{p=1,\,p \ne d}^{D} \sum_{j=1}^{n_d} (x_{jd} - \bar{x}_p)(x_{jd} - \bar{x}_p)^T \tag{5}$$

where $D$ is the number of chunklets, $n_d$ is the number of samples in the $d$-th chunklet, and $\bar{x}_d$ is the mean of the $d$-th chunklet. Since the sample set $\{x_i\}_{i=1}^{N}$ is divided into $D$ chunklets without replacement, $N = \sum_{d=1}^{D} n_d$. $M_1$ carries the positive equivalence information and $M_2$ carries the negative equivalence information. We define the following matrix:

$$Q = M_1^{-\frac{1}{2}} M_2 M_1^{-\frac{1}{2}} \tag{6}$$
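Equations (4)-(5) can be sketched in NumPy as follows. This is an illustrative implementation, not the authors' code; the function name and the encoding of chunklets as index arrays are our own conventions:

```python
import numpy as np

def chunklet_matrices(X, chunklets):
    """Within-chunklet matrix M1 (Eq. (4)) and between-chunklet matrix M2
    (Eq. (5)). `chunklets` is a list of index arrays that partition the
    rows of X without replacement."""
    N, n = X.shape
    D = len(chunklets)
    means = [X[idx].mean(axis=0) for idx in chunklets]
    M1 = np.zeros((n, n))
    M2 = np.zeros((n, n))
    for d, idx in enumerate(chunklets):
        diff = X[idx] - means[d]          # x_jd - mean of own chunklet
        M1 += diff.T @ diff
        for p in range(D):
            if p == d:
                continue
            diffp = X[idx] - means[p]     # x_jd - mean of other chunklet p
            M2 += diffp.T @ diffp
    return M1 / N, M2 / (N * (D - 1))
```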

Then, by taking the above matrix $Q$ in place of the matrix $M^{-1}$ in equation (2), the objective function of the proposed DSVDD is obtained. To further explore the proposed DSVDD, the matrix $Q$ can be decomposed as $Q = (M_2^{\frac{1}{2}} M_1^{-\frac{1}{2}})^T (M_2^{\frac{1}{2}} M_1^{-\frac{1}{2}})$. Therefore, equation (2) with $Q$ can be rewritten as

$$(x_i - a)^T Q (x_i - a) = \left[ M_2^{\frac{1}{2}} M_1^{-\frac{1}{2}} (x_i - a) \right]^T \left[ M_2^{\frac{1}{2}} M_1^{-\frac{1}{2}} (x_i - a) \right] \le R^2 + \xi_i, \quad i = 1 \ldots N.$$


In this case, each $x_i$ can be viewed as being linearly transformed by $M_2^{\frac{1}{2}} M_1^{-\frac{1}{2}}$. $M_1$ and $M_2$ play roles similar to reducing the within-class scatter and increasing the between-class scatter in Fisher discriminant analysis, as also demonstrated in [10]. Since the between-class information can be introduced in this way, we call the proposed method Discriminant Support Vector Data Description (DSVDD).
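The metric $Q$ of Eq. (6) and the linear transform $M_2^{\frac{1}{2}} M_1^{-\frac{1}{2}}$ can be computed via eigendecomposition, assuming $M_1$ is (regularized to be) positive definite. A minimal sketch with names of our choosing, not taken from the paper:

```python
import numpy as np

def relevant_metric(M1, M2, eps=1e-8):
    """Return Q = M1^{-1/2} M2 M1^{-1/2} (Eq. (6)) and the transform
    T = M2^{1/2} M1^{-1/2}, so that (x-a)^T Q (x-a) = ||T (x-a)||^2.
    Matrix square roots via eigendecomposition; eps guards tiny eigenvalues."""
    def sqrt_and_invsqrt(M):
        w, V = np.linalg.eigh(M)          # symmetric eigendecomposition
        w = np.clip(w, eps, None)
        return (V @ np.diag(np.sqrt(w)) @ V.T,
                V @ np.diag(1.0 / np.sqrt(w)) @ V.T)
    _, M1_invsqrt = sqrt_and_invsqrt(M1)
    M2_sqrt, _ = sqrt_and_invsqrt(M2)
    T = M2_sqrt @ M1_invsqrt
    Q = T.T @ T                           # equals M1^{-1/2} M2 M1^{-1/2}
    return Q, T
```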

In order to optimize the parameters $R$, $a$, and $\xi_i$, we construct the Lagrangian function by introducing the Lagrange multipliers $\alpha_i$, $\gamma_i$ and taking equations (2), (3), and (6) into (1), thus obtaining

$$L = R^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[ R^2 + \xi_i - (x_i - a)^T Q (x_i - a) \right] - \sum_{i=1}^{N} \gamma_i \xi_i \tag{7}$$

where $\alpha_i \ge 0$ and $\gamma_i \ge 0$. Setting the partial derivatives of $L$ with respect to $R$, $a$, and $\xi_i$ to 0, we get

$$\frac{\partial L}{\partial R} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i = 1 \tag{8}$$

$$\frac{\partial L}{\partial a} = 0 \;\Rightarrow\; a = \sum_{i=1}^{N} \alpha_i x_i \tag{9}$$

$$\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \gamma_i = C - \alpha_i \tag{10}$$

Further, taking the constraints (8), (9), and (10) into the Lagrangian function (7), we obtain the maximized criterion as follows:

$$\max_{\alpha} \; L = \sum_{i=1}^{N} \alpha_i \, x_i^T Q x_i - \sum_{i,j=1}^{N} \alpha_i \alpha_j \, x_i^T Q x_j \tag{11}$$

$$\text{s.t.} \quad 0 \le \alpha_i \le C, \quad i = 1 \ldots N \tag{12}$$

$$\sum_{i=1}^{N} \alpha_i = 1 \tag{13}$$

The maximization of equation (11) can be solved through Quadratic Programming (QP) [1].
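Problem (11)-(13) is a small QP over the simplex with box constraints. As a minimal stand-in for a dedicated QP solver, one may hand it to SciPy's SLSQP routine; this generic solver choice and the function name are ours, since the paper only states that QP is used:

```python
import numpy as np
from scipy.optimize import minimize

def solve_dsvdd_dual(X, Q, C=1.0):
    """Maximize Eq. (11) subject to 0 <= alpha_i <= C (Eq. (12)) and
    sum_i alpha_i = 1 (Eq. (13)). Equivalently, minimize the negated
    objective alpha^T K alpha - diag(K) . alpha with K_ij = x_i^T Q x_j."""
    K = X @ Q @ X.T
    N = len(X)
    obj = lambda a: a @ K @ a - a @ np.diag(K)
    cons = ({'type': 'eq', 'fun': lambda a: a.sum() - 1.0},)
    a0 = np.full(N, 1.0 / N)              # feasible uniform start
    res = minimize(obj, a0, bounds=[(0.0, C)] * N, constraints=cons)
    return res.x
```

For two samples with the Euclidean metric, the optimum places equal weight on both, so the center (9) is their midpoint.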

A test sample $z \in \mathbb{R}^n$ is classified as the target class when the relevant distance $\|z - a\|_Q$ between the sample $z$ and the center $a$ of the hypersphere is no larger than the radius $R$, i.e.,

$$\|z - a\|_Q^2 = (z - a)^T Q (z - a) \le R^2 \tag{14}$$

The radius $R$ can be calculated as the distance from the center $a$ of the hypersphere to any sample on the hypersphere boundary. Mathematically, the radius $R$ is given as

$$R^2 = (x_i - a)^T Q (x_i - a) \tag{15}$$

where $x_i$ is a sample from the set of support vectors, i.e., its Lagrange multiplier satisfies $0 < \alpha_i < C$.
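The decision rule (14)-(15) can then be sketched as follows, assuming the multipliers $\alpha_i$ from the dual are available; the function name and tolerance are our own:

```python
import numpy as np

def dsvdd_predict(z, X, alpha, Q, C=1.0, tol=1e-6):
    """Accept z as target iff (z-a)^T Q (z-a) <= R^2 (Eq. (14)), with
    a = sum_i alpha_i x_i (Eq. (9)) and R^2 computed from any support
    vector strictly inside the box, 0 < alpha_i < C (Eq. (15))."""
    a = alpha @ X                                  # center of the hypersphere
    dist2 = lambda x: (x - a) @ Q @ (x - a)        # squared relevant distance
    on_bound = (alpha > tol) & (alpha < C - tol)   # boundary support vectors
    R2 = dist2(X[np.flatnonzero(on_bound)[0]])
    return dist2(z) <= R2
```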

III. EXPERIMENTS

In order to validate the effectiveness of the proposed DSVDD, we compare DSVDD with the original SVDD on both synthetic and UCI data sets [2]. Both DSVDD and SVDD adopt the linear kernel $k(x_i, x_j) = x_i^T x_j$, the polynomial kernel (Poly) $k(x_i, x_j) = (x_i^T x_j + 1)^p$, and the radial


Fig. 1. The classification boundaries of the SVDD and the proposed DSVDD with D = 2, 4, 50, respectively. Sub-figure (a) corresponds to SVDD with the classification result e = [0.1500, 0.0500], f = [0.9444, 0.8500]; sub-figure (b) corresponds to DSVDD with D = 2 and the classification result e = [0.0300, 0.0500], f = [0.9510, 0.9700]; sub-figure (c) corresponds to DSVDD with D = 4 and the classification result e = [0.0300, 0.0500], f = [0.9510, 0.9700]; sub-figure (d) corresponds to DSVDD with D = 50 and the classification result e = [0.0200, 0.0600], f = [0.9423, 0.9800].

TABLE I
THE AVERAGE AUC VALUES AND THEIR CORRESPONDING STANDARD DEVIATIONS OF TEN INDEPENDENT RUNS FOR TAE, WATER AND SONAR. THE LARGER THE VALUE OF THE AUC, THE BETTER THE PERFORMANCE OF THE CORRESPONDING ONE-CLASS CLASSIFIER.

| Class No. | SVDD Linear | SVDD Poly | SVDD RBF | DSVDD Linear | DSVDD Poly | DSVDD RBF |
|---|---|---|---|---|---|---|
| **TAE** | | | | | | |
| 1 | 0.61±0.17 | 0.60±0.17 | 0.69±0.20 | 0.73±0.14 | 0.67±0.16 | **0.83±0.11** |
| 2 | 0.45±0.19 | 0.47±0.17 | **0.54±0.14** | 0.48±0.19 | 0.50±0.17 | 0.53±0.14 |
| 3 | 0.47±0.17 | 0.43±0.17 | 0.55±0.15 | 0.62±0.15 | 0.51±0.19 | **0.93±0.10** |
| Total | 0.5100 | 0.5000 | 0.5933 | 0.6134 | 0.5575 | **0.7641** |
| **WATER** | | | | | | |
| 1 | 0.52±0.29 | 0.63±0.34 | 0.88±0.11 | 0.74±0.19 | 0.74±0.24 | **0.91±0.13** |
| 2 | 0.81±0.16 | 0.65±0.27 | 0.89±0.07 | 0.91±0.10 | 0.68±0.19 | **0.96±0.06** |
| Total | 0.6650 | 0.6400 | 0.8850 | 0.8233 | 0.7102 | **0.9357** |
| **SONAR** | | | | | | |
| 1 | 0.53±0.17 | 0.61±0.12 | 0.63±0.18 | 0.64±0.20 | 0.63±0.19 | **0.7?±0.19** |
| 2 | 0.50±0.25 | 0.50±0.19 | 0.61±0.22 | 0.68±0.19 | 0.76±0.18 | **0.80±0.16** |
| Total | 0.5180 | 0.5549 | 0.6202 | 0.6588 | 0.6962 | **0.7622** |


basis kernel (RBF) $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \sigma^2)$. All computations were run on a Pentium IV 2.10-GHz processor running Windows XP Professional in the MATLAB environment.

First, we implement the experiments on synthetic data. For the one-class classification problem here, we adopt the vectors $e, f \in \mathbb{R}^2$ to measure the performance of the one-class classifier, where $e(1)$ gives the False Negative (FN) rate (the error on the target class), $e(2)$ gives the False Positive (FP) rate (the error on the outlier class), $f(1)$ gives the ratio between the number of correct target predictions and the number of target predictions, and $f(2)$ gives the ratio between the number of correct target predictions and the number of target samples.
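Under a 1 = target / 0 = outlier label convention (our assumption for this sketch), the measures $e$ and $f$ reduce to the four confusion counts:

```python
def ef_measures(y_true, y_pred):
    """e = [FN rate, FP rate]; f = [precision on the target class,
    recall on the target class], as described in the text."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    e = [fn / (tp + fn), fp / (fp + tn)]   # errors on target / outlier class
    f = [tp / (tp + fp), tp / (tp + fn)]   # precision / recall on targets
    return e, f
```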

The synthetic data used here form a two-dimensional two-class data set, where the target class is generated as a banana-shaped distribution with 100 samples and the outlier class is generated from a normal distribution with mean 1 and standard deviation sqrt(1.5). The target data are uniformly distributed along the banana and are superimposed with a normal distribution. Figure 1 gives the classification boundaries of the SVDD and the proposed DSVDD with the number of chunklets D = 2, 4, 50, respectively. From Figure 1, we can find that 1) the DSVDD has a significant advantage over SVDD in terms of FN; 2) the performance of the DSVDD is not sensitive to the parameter D.

Further, we also report the experimental results of the proposed DSVDD and SVDD on the real data sets TAE (3 classes/151 samples/5 features) [2], WATER (2 classes/116 samples/38 features) [2], and SONAR (2 classes/208 samples/60 features), which is available at ftp://ftp.cs.cmu.edu/afs/cs/project/connect/bench/. The number D of chunklets in each classification problem is set to the number of classes. Here, we adopt the average value of the Area Under the Receiver Operating Characteristic Curve (AUC) as the measure criterion for the performance of one-class classifiers [4]. It is known that a good one-class classifier should have a small FP rate and a high True Positive (TP) rate [7], [8], [9]. Thus, we prefer a classifier with higher AUC to one with lower AUC: for a specific FP threshold, the TP is higher for the first classifier than for the second. Hence, the larger the value of the AUC, the better the corresponding one-class classifier. In our experiments, the value of the AUC belongs to the range [0, 1]. Table I gives the average AUC values and their corresponding

standard deviations of the proposed DSVDD and SVDD over ten independent runs on these data sets. The best AUC values are denoted in bold. Both DSVDD and SVDD adopt the linear, polynomial, and radial basis kernels. The label of the target data class is indicated in the first column. In each classification, we take one class as the target class and the other classes as the outlier data. From this table, it can be found that the proposed DSVDD achieves significantly superior classification to SVDD in all the tested cases. The results validate that the discriminant prior knowledge is effectively induced into the proposed DSVDD framework.
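As a reference for the evaluation protocol, AUC can be computed by its rank statistic: the fraction of (target, outlier) pairs ranked correctly, with ties counting one half. A minimal sketch, not the authors' implementation:

```python
def auc_score(y_true, scores):
    """Rank-based AUC: probability that a randomly drawn target sample
    scores higher than a randomly drawn outlier (ties count 1/2).
    y_true: 1 = target, 0 = outlier; scores: larger = more target-like."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For DSVDD, a natural score is the negated relevant distance to the hypersphere center, so that samples deeper inside the sphere rank as more target-like.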

IV. CONCLUSION AND FUTURE WORK

In this paper, we propose a novel Discriminant SVDD named DSVDD. DSVDD adopts relevant metric learning instead of the original Euclidean distance metric learning. In doing so, the proposed DSVDD assigns large weights to the relevant features and tightens the similar data by incorporating both the positive and negative equivalence prior knowledge. The experimental results validate that the proposed DSVDD significantly improves the effectiveness of the one-class classifier. In the future, we plan to extend our work to large-scale classification cases and make further exploration.

REFERENCES

[1] F. Alizadeh and D. Goldfarb, "Second-order cone programming," Mathematical Programming, vol. 95, pp. 3-51, 2003.

[2] A. Asuncion and D. Newman, UCI Machine Learning Repository [http://www.ics.uci.edu/mlearn/MLRepository.html], School of Information and Computer Science, University of California, Irvine, CA, 2007.

[3] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning distance functions using equivalence relations," Proceedings of the International Conference on Machine Learning, 2003.

[4] A. P. Bradley, "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recognition, vol. 30, no. 7, pp. 1145-1159, 1997.

[5] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighbourhood components analysis," Advances in Neural Information Processing Systems, 2005.

[6] N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment learning and relevant component analysis," Proceedings of the European Conference on Computer Vision, 2002.

[7] D. Tax and R. P. W. Duin, "Support vector domain description," Pattern Recognition Letters, vol. 20, no. 14, pp. 1191-1199, 1999.

[8] D. Tax and R. P. W. Duin, "Support vector data description," Machine Learning, vol. 54, pp. 45-66, 2004.

[9] D. Tax and P. Juszczak, "Kernel whitening for one-class classification," International Journal of Pattern Recognition and Artificial Intelligence, vol. 17, no. 3, pp. 333-347, 2003.

[10] D. Yeung and H. Chang, "Extending the relevant component analysis algorithm for metric learning using both positive and negative equivalence constraints," Pattern Recognition, vol. 39, no. 5, pp. 1007-1010, 2006.
