
Neural Networks 69 (2015) 99–110


A new robust model of one-class classification by interval-valued training data using the triangular kernel

Lev V. Utkin a,∗, Anatoly I. Chekh b

a Department of Control, Automation and System Analysis, Saint Petersburg State Forest Technical University, Russia
b Department of Computer Science, Saint Petersburg State Electrotechnical University, Russia

∗ Corresponding author. E-mail addresses: [email protected] (L.V. Utkin), [email protected] (A.I. Chekh).

Article info

Article history:
Received 17 October 2014
Received in revised form 7 February 2015
Accepted 27 May 2015
Available online 9 June 2015

Keywords:
One-class classification
Novelty detection
Support vector machine
Kernel
Interval-valued data
Minimax strategy
Linear programming
Extreme points

Abstract

A robust one-class classification model, extending Campbell and Bennett's (C–B) novelty detection model to the case of interval-valued training data, is proposed in the paper. It is shown that the dual of the linear program in the C–B model has a useful property allowing it to be represented as a set of simple linear programs. It is also proposed to replace the Gaussian kernel in the obtained linear support vector machines by the well-known triangular kernel, which can be regarded as an approximation of the Gaussian kernel. This replacement allows us to obtain a finite set of simple linear optimization problems for dealing with interval-valued data. Numerical experiments with synthetic and real data illustrate the performance of the proposed model.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

One of the problems of statistical machine learning is to classify objects into classes in accordance with their properties or features. At the same time, we often need to detect abnormal examples, that is, to solve a one-class classification (OCC) or novelty detection problem. A lot of papers are devoted to this important problem (Campbell, 2002; Campbell & Bennett, 2001; Cherkassky & Mulier, 2007; Manevitz & Yousef, 2001; Scholkopf, Platt, Shawe-Taylor, Smola, & Williamson, 2001; Scholkopf, Williamson, Smola, Shawe-Taylor, & Platt, 2000; Zhang & Zhou, 2013). Various reviews of OCC can be found in the machine learning literature, for example, the reviews by Markou and Singh (2003), Bartkowiak (2011), Khan and Madden (2010), and Hodge and Austin (2004). OCC aims to detect anomalous or abnormal observations and to separate them from the so-called normal examples (Chandola, Banerjee, & Kumar, 2007, 2009; Steinwart, Hush, & Scovel, 2005).

A common way of solving the OCC problem is to model the support of the unknown data distribution directly from data, that is, to estimate a binary-valued function $f$ that is positive in a region where the density is high and negative elsewhere. Sample points outside this region can be regarded as anomalous observations.

Some models of OCC are based on the framework of the support vector machine (SVM). These models are called OCC SVMs. We mark out three main approaches to constructing OCC SVMs. The first approach was proposed by Tax and Duin (1999, 2004). This is one of the well-known OCC models, and it can be regarded as an unsupervised learning problem. According to this approach, training the one-class SVM consists of determining the smallest hyper-sphere containing the training data. An alternative way to geometrically enclose a fraction of the training data is via a hyperplane and its relationship to the origin, proposed by Scholkopf et al. (2001, 2000). Under this approach, a hyperplane is used to separate the training data from the origin with the maximal margin, i.e., the objective is to separate off the region containing the data points from the surface region containing no data. It should be noted that both approaches provide the same results when a symmetric kernel is used. The third approach, which will be considered in detail in this paper, is the linear programming approach to OCC proposed by Campbell and Bennett (2001). The model proposed by Campbell and Bennett uses linear programming techniques.

It should be noted that there are other interesting novelty detection or OCC models (see, for instance, Bicego & Figueiredo, 2009; Hodge & Austin, 2004; Kwok, Tsang, & Zurada, 2007; Li, 2011). Each of these models is suited to particular applications.


Fig. 1. Examples of interval-valued data with small and large intervals.

All these OCC models are based on a training set consisting of precise or point-valued data. However, training examples in many real applications can be obtained only in interval form. Interval-valued data may result from imperfection of measurement tools or imprecision of expert information. There may also be missing data when some features of an example are not observed (Pelckmans, De Brabanter, Suykens, & De Moor, 2005).

Many methods in machine learning have been developed for dealing with interval-valued data (Ishibuchi, Tanaka, & Fukuoka, 1990; Nivlet, Fournier, & Royer, 2001; Silva & Brito, 2006) owing to the practical importance of this setting. In some methods, interval-valued observations are replaced by precise values based on some additional assumptions, for example, by taking the middle points of the intervals (Lima Neto & de Carvalho, 2008). This approach can be successfully used when the intervals are not large and the area produced by the interval intersections is rather small (see, for example, the left picture in Fig. 1). However, if the intervals are very large (see, for example, the right picture in Fig. 1), then the replacement of intervals by point-valued data may lead to large classification errors.

Another group of methods uses standard interval analysis for constructing classification and regression models (Angulo, Anguita, Gonzalez-Abril, & Ortega, 2008; Hao, 2009). A series of interesting models for dealing with interval-valued and fuzzy observations in classification and regression can be found in works by Carrizosa, Gordillo, and Plastria (2007a, 2007b) and Forghani and Yazdi (2014). However, these models, as well as standard interval analysis, are restricted to the linear case, i.e., to separating or regression functions that are linear.

Do and Poulet (2005) proposed an interesting and very simple method based on replacing the Euclidean distance between two data points in the Gaussian kernel function by the Hausdorff distance between the two hyper-rectangles produced by intervals from the sample data. The method can be used in classification and regression analyses as well as in OCC problems. The main condition of its use is the assumption of the Gaussian kernel (or kernels based on the distance between points) in the corresponding SVM. In spite of its simplicity, the method has an important obstacle to its application: it is not clear how to interpret the classification results. Moreover, in dealing with interval-valued data, we usually implicitly or explicitly select a point in every interval in accordance with some decision strategy, and this point can be regarded as a "typical" point of the interval under the accepted decision strategy. The method using the Hausdorff distance allows many different data points in the intervals simultaneously; namely, the pairwise distances between three intervals may correspond to different points in every interval. Another disadvantage of the approach based on the Hausdorff distance is the lack of a justified decision-making strategy for dealing with imprecise data. In other words, when using the Hausdorff distance it is not obvious how to interpret, from the classification point of view, the points which determine the distance between intervals. The Hausdorff distance has also been used in clustering with imprecise data; for example, Chavent (2004) and Chavent, de Carvalho, Lechevallier, and Verde (2006) proposed a partitional dynamic clustering method for interval data based on adaptive Hausdorff distances. A city-block distance function as a distance of a special form for solving clustering problems with interval-valued data was studied by de Souza and de Carvalho (2004). Pedrycz, Park, and Oh (2008) exploited a concept of the Hausdorff distance that determines a distance between an information granule and a numeric pattern (a point in a high-dimensional feature space) for constructing classifiers from interval and fuzzy data. It should be noted that other distance measures have been successfully applied to machine learning problems. For example, Schollmeyer and Augustin (2013) proposed another distance measure for solving regression problems with interval data. The authors (Schollmeyer & Augustin, 2013) argued that their measure might be better in some problems because the Hausdorff distance does not match points of the two sets but compares all points of the two sets to each other.

Another interesting approach to constructing a classifier from interval-valued data was proposed by Bhadra, Nath, Ben-Tal, and Bhattacharyya (2009). The authors presented a novel methodology using Bernstein bounding schemes for constructing classifiers which are robust to interval-valued uncertainty in examples. According to the methodology, the uncertain examples are classified correctly with high probability. A binary linear classification model for interval data, different from the models using the point-valued representation of intervals, was proposed by Ghaoui, Lanckriet, and Natsoulis (2003). The authors developed a robust classifier by minimizing the worst-case value of a given loss function over all possible choices of the data in the multi-dimensional intervals.

Following the idea of the robust model provided by Ghaoui et al. (2003), we propose a robust model which is based on three main ideas implemented in order to construct a new OCC model dealing with interval-valued training data.

1. Interval-valued observations produce a set of expected classification risk measures such that the lower and upper risk measures can be determined by minimizing and maximizing the risk measure over values in the intervals.

2. There are many variants of OCC SVMs. It is proposed to use the linear programming OCC SVM of Campbell and Bennett (2001), for which the constraints in the dual form do not depend on the vectors of observations. This allows us to represent the dual optimization problem as a set of simple optimization problems.

3. It is proposed to replace the Gaussian kernel by the well-known triangular kernel, which can be regarded as an approximation of the Gaussian kernel. This replacement allows us to get a set of linear optimization problems with variables $x_i$ restricted by the intervals $A_i$, $i = 1, \ldots, n$.


The triangular kernel has been used by Utkin, Zhuk, and Chekh (2014) as an approximation of the Gaussian kernel in the framework of an SVM based on Schölkopf's approach to modeling OCC. However, the main difficulty in using that method is the extremely hard computation required to enumerate all vertices of the hyper-rectangles produced by the interval-valued data, especially when the training examples have many interval-valued features. In the present paper, we also apply the triangular kernel to approximate the Gaussian kernel. The advantage of the proposed model is that its computational complexity does not substantially depend on the number of features. At the same time, we cannot assert that the model is computationally simple when the number of observations is large.

It is important to point out that the proposed model eventually deals with specific points within the interval-valued data, which can be called optimal to some extent. However, in contrast to the models where intervals are replaced by points, the proposed model searches for optimal precise points by applying the robust or maximin strategy. In fact, we select a single probability distribution, or a point in the interval of expected risk values, in accordance with a certain decision strategy, instead of selecting points in the intervals of the training data directly.

The paper is organized as follows. Section 2 presents a short introduction to the Campbell–Bennett novelty detection model proposed in Campbell and Bennett (2001). An approach for extending the Campbell–Bennett model to the case of interval-valued data is provided in Section 3. Numerical experiments with synthetic and real data illustrating the accuracy of the proposed algorithm are given in Section 4. Concluding remarks are made in Section 5.

2. Campbell and Bennett’s novelty detection linear model

Suppose we have unlabeled training data $x_1, \ldots, x_n \subset \mathcal{X}$, where $n$ is the number of observations and $\mathcal{X}$ is some set, for instance, a compact subset of $\mathbb{R}^l$. Let $D = \{x_i\}_{i=1}^{n}$ be drawn i.i.d. from a distribution on $\mathcal{X}$. The sample space $D$ is finite and discrete. According to Scholkopf et al. (2001, 2000), a well-known novelty detection or OCC model aims to construct a function $f$ which takes the value $+1$ in a "small" region capturing most of the data points and $-1$ elsewhere. This can be done by mapping the data into the feature space corresponding to a kernel and separating them from the origin with the maximum margin.

Let $\varphi$ be a feature map $\mathcal{X} \to \mathcal{G}$ such that the data points are mapped into an alternative higher-dimensional feature space $\mathcal{G}$. In other words, this is a map into an inner product space $\mathcal{G}$ such that the inner product in the image of $\varphi$ can be computed by evaluating some simple kernel $K(x, y) = (\varphi(x), \varphi(y))$, such as the Gaussian kernel
$$K(x, y) = \exp\left(-\|x - y\|^2 / \gamma^2\right).$$

Here $\gamma$ is the kernel parameter determining the geometrical structure of the mapped samples in the kernel space. It is pointed out by Wang, Lu, Plataniotis, and Lu (2009) that the problem of selecting a proper parameter $\gamma$ is very important in classification. When a very small $\gamma$ is used ($\gamma \to 0$), $K(x, y) \to 0$ for all $x \neq y$ and all mapped samples tend to be orthogonal to each other, despite their class labels. In this case, both between-class and within-class variations are very large. On the other hand, when a very large $\gamma$ is chosen ($\gamma^2 \to \infty$), $K(x, y) \to 1$ and all mapped samples converge to a single point. This is obviously not desired in a classification task. Therefore, too large or too small a $\gamma$ will not result in more separable samples in $\mathcal{G}$.
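As a small numerical illustration of this effect (an addition for this text, not part of the original formulation), the following sketch evaluates the Gaussian kernel for a fixed pair of distinct points under very small and very large values of $\gamma$; the kernel value approaches 0 and 1, respectively.

```python
import numpy as np

def gaussian_kernel(x, y, gamma):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / gamma^2)."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / gamma ** 2)

x, y = np.array([0.0, 0.0]), np.array([1.0, 2.0])
for gamma in (0.1, 1.0, 100.0):
    print(gamma, gaussian_kernel(x, y, gamma))
# gamma = 0.1  -> K close to 0 (mapped samples nearly orthogonal)
# gamma = 100  -> K close to 1 (mapped samples collapse towards one point)
```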

We consider the linear programming approach to novelty detection proposed by Campbell and Bennett (2001). The authors start from the hard margin case, when any training point $x_j$ lying outside some predefined surface restricting the training points is viewed as abnormal. This surface is defined as the level set, $f(z) = 0$, of some nonlinear function. In feature space, $f(z) = \sum_i \varphi_i K(z, x_i) + b$, and this corresponds to a hyperplane which is pulled onto the mapped data points with the restriction that the margin always remains positive or zero (Campbell & Bennett, 2001). Here $\varphi = (\varphi_1, \ldots, \varphi_n)$ are the parameters of the function $f$ in the feature space, or Lagrange multipliers.

A criterion for constructing the optimal function $f(z)$ proposed by Campbell and Bennett is to minimize the mean value of the output of the function, i.e., $\frac{1}{n}\sum_i f(x_i)$. This is achieved by minimizing
$$W(\varphi, b) = \frac{1}{n}\sum_{i=1}^{n}\left(\sum_{j=1}^{n}\varphi_j K(x_i, x_j) + b\right),$$
subject to
$$\sum_{j=1}^{n}\varphi_j K(x_i, x_j) + b \geq 0, \quad i = 1, \ldots, n, \qquad \sum_{i=1}^{n}\varphi_i = 1, \quad \varphi_i \geq 0. \qquad (1)$$

The bias $b$ is simply treated as an additional parameter in the minimization process. The added constraints on $\varphi$ restrict the class of models to be considered. As indicated by Campbell and Bennett (2001), these constraints amount to a choice of scale for the weight vector normal to the hyperplane in feature space and hence do not impose a restriction on the model. Also, these constraints ensure that the problem is well-posed and that an optimal solution with $\varphi \neq 0$ exists. Other constraints on the class of functions are possible, e.g., $\|\varphi\|_1 = 1$ with no restriction on the sign of $\varphi_i$.

It is important to point out here that Campbell and Bennett propose to use the mean value of the output of the function. It follows from the form of $W(\varphi, b)$ that the empirical probability distribution $(1/n, \ldots, 1/n)$ is assumed in computing the mean value $W(\varphi, b)$. The multiplier $1/n$ can be omitted because it does not change the optimal values of the variables $\varphi$ and $b$.

To handle noise and outliers, a soft margin is introduced in analogy to the usual approach used with support vector machines (Cherkassky & Mulier, 2007; Scholkopf & Smola, 2002; Smola & Scholkopf, 2004; Vapnik, 1998). In this case, the following function has to be minimized:
$$W(\varphi, b) = \frac{1}{n}\sum_{i=1}^{n}\left(\sum_{j=1}^{n}\varphi_j K(x_i, x_j) + b\right) + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i, \qquad (2)$$
subject to (1) and
$$\sum_{j=1}^{n}\varphi_j K(x_i, x_j) + b \geq -\xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, n. \qquad (3)$$

The parameter $\nu \in [0, 1]$ controls the extent of margin errors (smaller $\nu$ means fewer outliers are ignored: $\nu \to 0$ corresponds to the hard margin limit). It is analogous to the parameter $\nu$ of the standard $\nu$-SVM method (Cherkassky & Mulier, 2007). The slack variables $\xi = (\xi_1, \ldots, \xi_n)$ are used to allow points to violate the margin constraints.

In what follows, we refer to Campbell and Bennett's novelty detection model as the C–B model for short.

It should be noted that $W(\varphi, b)$ in (2) can be viewed as the expected risk measure. This is important for the subsequent consideration of interval-valued data.
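As an illustration of how the soft-margin problem (2)–(3) can be assembled and solved for point-valued data, the sketch below builds the linear program with an off-the-shelf solver. The use of NumPy and scipy.optimize.linprog, the function names, and the variable ordering are illustrative assumptions, not the authors' implementation (their experiments in Section 4 were coded in Java). The constraint set follows the matrix form used later in Section 3.1, i.e., the normalization from (1) together with the slack constraints (3).

```python
import numpy as np
from scipy.optimize import linprog

def gaussian_kernel_matrix(X, gamma):
    """K[i, j] = exp(-||x_i - x_j||^2 / gamma^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / gamma ** 2)

def cb_model(X, nu=0.1, gamma=1.0):
    """Solve the soft-margin C-B linear program for point-valued data X (n x d).

    Variables are ordered as (phi_1..phi_n, xi_1..xi_n, b); b is free.
    """
    n = X.shape[0]
    K = gaussian_kernel_matrix(X, gamma)
    # objective: sum_j phi_j * (mean_i K_ij) + b + 1/(nu*n) * sum_i xi_i
    c = np.concatenate([K.mean(axis=0), np.full(n, 1.0 / (nu * n)), [1.0]])
    # constraints (3): -(K phi) - xi - b <= 0
    A_ub = np.hstack([-K, -np.eye(n), -np.ones((n, 1))])
    b_ub = np.zeros(n)
    # normalization from (1): sum_i phi_i = 1
    A_eq = np.concatenate([np.ones(n), np.zeros(n), [0.0]])[None, :]
    b_eq = [1.0]
    bounds = [(0, None)] * (2 * n) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    phi, b = res.x[:n], res.x[-1]
    return phi, b, K

def decision_function(z, X, phi, b, gamma):
    """f(z) = sum_i phi_i K(z, x_i) + b; f(z) >= 0 is treated as 'normal'."""
    k = np.exp(-((X - z) ** 2).sum(axis=1) / gamma ** 2)
    return float(k @ phi + b)
```

In this sketch a test point z would be accepted as normal when f(z) is non-negative, which is consistent with the indicator decision function used later in Section 4.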

3. C–B model and interval-valued data

Let us consider how the C–B model can be modified under the condition that the examples in the training set are interval-valued.


We consider classification problems where the input variables (patterns) $x$ may be interval-valued. Suppose that we have a training set $A_i$, $i = 1, \ldots, n$. Here $A_i \subset \mathbb{R}^m$ is the Cartesian product of $m$ intervals $[\underline{a}_i^{(k)}, \overline{a}_i^{(k)}]$, $k = 1, \ldots, m$, which again are not restricted and so could even include intervals $(-\infty, \infty)$. In other words, every feature of every observation or training example is interval-valued. We aim to construct a function $f$ which takes the value $+1$ in a "small" region capturing most of the interval-valued examples and $-1$ elsewhere.

Suppose that every set of points $x_1, \ldots, x_n$ belonging to $A_1, \ldots, A_n$, respectively, produces a decision function $f(x)$ by means of the C–B model, i.e., by solving the linear programming problem (2)–(3). All possible combinations of points $x_1, \ldots, x_n$ from the sets $A_1, \ldots, A_n$ produce a set of decision functions. Our first aim is to define a single function from this set of decision functions corresponding to a certain robust decision strategy.

The robust classifier in this case is the one obtained under the largest value of the expected risk measure. The robust strategy here can also be viewed as a minimax strategy which selects the "worst" combination of points from the intervals $A_1, \ldots, A_n$, providing the largest value of the expected risk. The minimax strategy can be interpreted as an insurance against the worst case because it aims at minimizing the expected loss in the least favorable case (Robert, 1994).

Using the pessimistic strategy, we can write the optimization problem as follows:
$$W(\varphi, b) = \sup_{x_i \in A_i,\; i=1,\ldots,n}\;\; \min_{\varphi, b, \xi}\left[\sum_{i=1}^{n}\left(\sum_{j=1}^{n}\varphi_j K(x_i, x_j) + b\right) + \frac{1}{\nu}\sum_{i=1}^{n}\xi_i\right], \qquad (4)$$

subject to (1) and (3).

In order to solve the above optimization problem, we sketch the following steps:

1. We fix the values $x_1, \ldots, x_n$ and write a set of dual forms of problem (4).
2. We solve every dual problem by means of extreme points which do not depend on $x_1, \ldots, x_n$.
3. We reduce the optimization problems over the sets of values $x_1, \ldots, x_n$ to linear ones by introducing a new kernel.
4. The set of linear problems with variables $x_1, \ldots, x_n$ is solved by means of standard methods.

Below, each step is described in a separate subsection.

3.1. A set of dual optimization problems

Below we show that the initial optimization problem (4) can be represented as a finite set of simplified programming problems.

Let us fix values $x_1, \ldots, x_n$ in (1), (3) and (4). Then we get a linear programming problem with variables $\varphi = (\varphi_1, \ldots, \varphi_n)$, $\xi = (\xi_1, \ldots, \xi_n)$, $b$ for every fixed set of values $x_1, \ldots, x_n$. It can be written in matrix form as follows:
$$\min_{\psi}\; (c\psi), \quad \text{subject to } A\psi \geq q.$$
Here $\psi = (\varphi, \xi, b_0, b_1)$ is the vector of $2n + 2$ non-negative optimization variables. The unconstrained variable $b$ is replaced by two non-negative variables $b_0$ and $b_1$. The elements of the vector $c$ are
$$c_j = \begin{cases} \sum_{i=1}^{n} K(x_i, x_j), & j = 1, \ldots, n,\\ \nu^{-1}, & j = n+1, \ldots, 2n,\\ n, & j = 2n+1,\\ -n, & j = 2n+2.\end{cases}$$
The elements of the first and the second rows of the matrix $A$ correspond to constraint (1) and are
$$a_{1j} = \begin{cases} 1, & j = 1, \ldots, n,\\ 0, & j = n+1, \ldots, 2n+2,\end{cases} \qquad a_{2j} = \begin{cases} -1, & j = 1, \ldots, n,\\ 0, & j = n+1, \ldots, 2n+2.\end{cases}$$
The other $n$ rows of the matrix $A$ consist of the matrix $\left[K(x_i, x_j)\right]_{n \times n}$, the unit matrix $I_n$ and two elements $-1$, $1$. The vector $q$ has its first element equal to $1$, its second element equal to $-1$ and $n$ zeros elsewhere.

Let us write the dual form of this linear optimization problem by means of the well-known technique. It can be written in matrix form as
$$\max_{\phi}\; (\phi q), \quad \text{subject to } \phi A \leq c^{\mathrm{T}}.$$
Here $\phi$ is the vector of $n + 2$ non-negative optimization variables such that $\phi = (c, d, \beta_1, \ldots, \beta_n)$, where $\beta = (\beta_1, \ldots, \beta_n)$ is a vector of non-negative optimization variables and $c$ and $d$ are also non-negative optimization variables. After substituting the elements of $A$, $q$ and $c$ into the above dual problem, we can rewrite it as follows:
$$c - d \to \max_{c, d, \beta},$$
subject to
$$c - d + \sum_{i=1}^{n} K(x_i, x_j)\beta_i \leq \sum_{i=1}^{n} K(x_i, x_j), \quad j = 1, \ldots, n,$$
$$\beta_i \leq \frac{1}{\nu}, \quad i = 1, \ldots, n,$$
$$\sum_{i=1}^{n}\beta_i \leq n, \qquad -\sum_{i=1}^{n}\beta_i \leq -n.$$

Finally, we can write
$$c - d \to \max_{c, d, \beta}, \qquad (5)$$
subject to
$$c - d - \sum_{i=1}^{n} (1 - \beta_i)K(x_i, x_j) \leq 0, \quad j = 1, \ldots, n, \qquad (6)$$
$$0 \leq \beta_i \leq \frac{1}{\nu}, \quad i = 1, \ldots, n, \qquad \sum_{i=1}^{n}\beta_i = n. \qquad (7)$$

Let us rewrite constraints (6) as
$$c - d \leq \sum_{i=1}^{n} (1 - \beta_i)K(x_i, x_j), \quad j = 1, \ldots, n.$$
Note that all of the above inequalities have to be satisfied. This is achieved when the left-hand side is less than or equal to the smallest right-hand side. Hence, we can replace the $n$ constraints (6) by the single constraint
$$c - d \leq \min_{j=1,\ldots,n} \sum_{i=1}^{n} (1 - \beta_i)K(x_i, x_j).$$

Let us replace the variables $\beta_1, \ldots, \beta_n$ by $\alpha_1, \ldots, \alpha_n$ such that $\alpha_i = \beta_i / n$. Since there are no other restrictions on $c - d$ in (5)–(7) except for (6), problem (5)–(7) can be rewritten as a set of $n$ optimization problems
$$\max_{\alpha} \left[\sum_{i=1}^{n} (1 - n\alpha_i)K(x_i, x_j)\right] \to \min_{j=1,\ldots,n}, \qquad (8)$$
subject to
$$0 \leq \alpha_i \leq \frac{1}{\nu n}, \quad i = 1, \ldots, n, \qquad \sum_{i=1}^{n}\alpha_i = 1. \qquad (9)$$

In other words, we have to solve $n$ optimization problems such that the final solution is determined by choosing the smallest objective function value in (8). The main advantage of the above optimization problem is that the constraints (9) do not depend on $x_1, \ldots, x_n$ and they are linear. Therefore, the problem solution is defined only by the extreme points of the set of feasible $\alpha_1, \ldots, \alpha_n$, which also do not depend on $x_1, \ldots, x_n$. This peculiarity is very important and gives us an opportunity to replace the optimization problem by a finite set of unconstrained optimization problems.

3.2. Extreme points of the polytope produced by constraints

It should be noted that, for fixed $x_1, \ldots, x_n$, every problem (8)–(9) can be solved by using the extreme points of the polytope produced by constraints (9). Suppose that we have $T$ extreme points denoted as $\alpha^{(1)}, \ldots, \alpha^{(T)}$, where $\alpha^{(l)} = (\alpha_1^{(l)}, \ldots, \alpha_n^{(l)})$. Hence, we can rewrite every problem (8)–(9) as follows:
$$\max_{l=1,\ldots,T} \left[\sum_{i=1}^{n} \left(1 - n\alpha_i^{(l)}\right)K(x_i, x_j)\right] \to \min_{j=1,\ldots,n}. \qquad (10)$$

If we knew the point-valued training data $x_1, \ldots, x_n$, then we would get $nT$ very simple objective functions without constraints. Let us determine the extreme points and their number.

Proposition 1. Let $\alpha_1, \ldots, \alpha_n$ with $n \in \mathbb{N}$ be variables and let $\mathcal{M}_\nu$ be the set of vectors $(\alpha_1, \ldots, \alpha_n)$ produced by the conditions
$$0 \leq \alpha_i \leq \frac{1}{\nu n}, \quad i = 1, \ldots, n, \qquad \sum_{i=1}^{n}\alpha_i = 1.$$
1. If $\nu \geq (n-1)n^{-1}$, then the set $\mathcal{M}_\nu$ has $T = n$ extreme points of the following form: the $k$th element is given by $\nu^{-1}(n^{-1} + \nu - 1)$ and the other $n - 1$ elements are equal to $\nu^{-1}n^{-1}$, for $k = 1, \ldots, n$.
2. If $n^{-1} < \nu < (n-1)n^{-1}$, then the set $\mathcal{M}_\nu$ has $T = s\binom{n}{s}$ extreme points, where $s \in \mathbb{N}$ is defined by the inequality
$$\frac{1}{n-s+1} \leq \frac{1}{\nu n} \leq \frac{1}{n-s}.$$
The extreme points have the same form: $n - s$ elements have the value $\nu^{-1}n^{-1}$, one element is given by $1 - (n-s)\nu^{-1}n^{-1}$, and the other $s - 1$ elements are equal to zero.
3. If $\nu \leq n^{-1}$, then $\mathcal{M}_\nu$ coincides with the unit simplex, whose vertices have one element equal to 1 and $n - 1$ elements equal to zero.

The proof of a similar proposition can be found in Utkin (2014). The above proposition provides a simple way of constructing the set of extreme points $\alpha^{(1)}, \ldots, \alpha^{(T)}$. We now have $nT$ objective functions which have to be maximized over $l = 1, \ldots, T$ for every $j$ and minimized over $j = 1, \ldots, n$. Of course, this does not mean that problem (8)–(9) is reduced to a set of plain objective functions, because every problem still depends on the fixed $x_1, \ldots, x_n$. Therefore, the next task is to consider how problem (8)–(9) can be solved by taking into account the sets of values $x_1, \ldots, x_n$.
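For illustration, the extreme points described in Proposition 1 can be enumerated directly; the sketch below (function names are chosen here for illustration) also evaluates problem (10) for a fixed kernel matrix, i.e., for fixed $x_1, \ldots, x_n$.

```python
import math
from itertools import combinations

def extreme_points(n, nu):
    """Enumerate the extreme points of the polytope M_nu of Proposition 1:
    M_nu = {alpha : 0 <= alpha_i <= 1/(nu*n), sum_i alpha_i = 1}.
    Duplicates may appear at boundary values of nu; a set() of the tuples removes them.
    """
    cap = 1.0 / (nu * n)
    if nu <= 1.0 / n:
        # Case 3: the unit simplex; vertices are the unit coordinate vectors.
        return [tuple(1.0 if i == k else 0.0 for i in range(n)) for k in range(n)]
    m_at_cap = min(n - 1, math.floor(nu * n + 1e-12))  # components fixed at the cap
    residual = 1.0 - m_at_cap * cap                    # value of the single free component
    points = []
    for at_cap in combinations(range(n), m_at_cap):
        for r in (i for i in range(n) if i not in at_cap):
            p = [cap if i in at_cap else 0.0 for i in range(n)]
            p[r] = residual
            points.append(tuple(p))
    return points

def objective_10(K, alphas):
    """Evaluate (10) for a fixed kernel matrix K[i][j] = K(x_i, x_j):
    maximize over the extreme points for every j, then take the minimum over j."""
    n = len(K)
    per_j = []
    for j in range(n):
        per_j.append(max(sum((1.0 - n * a[i]) * K[i][j] for i in range(n))
                         for a in alphas))
    return min(per_j)
```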

3.3. The triangular kernel

It was assumed in optimization problem (10) that there is a fixed set of points $x_1, \ldots, x_n$ from the intervals $A_1, \ldots, A_n$, respectively. Now we relax this condition and try to solve the optimization problem. The main idea is to replace the Gaussian kernel $K(x, y)$ by a new kernel function
$$T(x, y) = \max\left\{0,\; 1 - \|x - y\|_1 / \gamma^2\right\}.$$

This is the well-known triangular kernel. It can be regarded as an approximation of the Gaussian kernel. The introduced kernel is bounded by 0 and 1, and its largest value, 1, is attained when $x = y$. The main peculiarity of the function is that it is piecewise linear, which allows us to represent optimization problem (10) as a set of linear programming problems of a special form.

It should be noted that the triangular kernel is a conditionally positive definite kernel, but the convergence of SVMs remains guaranteed with this type of kernel (Scholkopf et al., 2001). Fleuret and Sahbi (2003) show that triangular kernels have a very interesting property: if both training and testing data are scaled by the same factor, the response of the classification function remains the same. This is the scale-invariance property of SVMs based on the triangular kernel. The property has been successfully applied in some real applications (see, for example, Ferecatu, Boujemaa, & Crucianu, 2008; Musdholifah & Hashim, 2013; Sahbi, 2007).
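The following short sketch (added here for illustration) compares the Gaussian kernel with the triangular kernel for points at increasing distance; it shows that the triangular kernel behaves as a bounded, piecewise linear approximation of the Gaussian kernel.

```python
import numpy as np

def gaussian_kernel(x, y, gamma):
    """K(x, y) = exp(-||x - y||_2^2 / gamma^2)."""
    return np.exp(-np.sum((x - y) ** 2) / gamma ** 2)

def triangular_kernel(x, y, gamma):
    """T(x, y) = max(0, 1 - ||x - y||_1 / gamma^2): bounded by 0 and 1,
    equal to 1 at x = y, and piecewise linear in x, which is what makes
    the interval-valued problem reducible to linear programs."""
    return max(0.0, 1.0 - np.sum(np.abs(x - y)) / gamma ** 2)

x = np.array([0.0, 0.0])
for t in np.linspace(0.0, 3.0, 7):
    y = np.array([t, t])
    print(f"dist={t:.1f}  gaussian={gaussian_kernel(x, y, 2.0):.3f}  "
          f"triangular={triangular_kernel(x, y, 2.0):.3f}")
```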

3.4. Constructing the linear programming problem with the triangular kernel

The next step is to show that the use of the triangular kernel leads to a linear programming problem which can be simply solved by means of available standard methods.

Introduce the optimization variables $G_{ij} = T(x_i, x_j)$ and $H_{ij}^{(k)} = \left|x_i^{(k)} - x_j^{(k)}\right|$. Then there holds
$$G_{ij} = \max\left\{0,\; 1 - \sum_{k=1}^{m} H_{ij}^{(k)} / \gamma^2\right\}.$$
Hence, we can write the following objective function for computing optimal values of $G_{ij}$, $H_{ij}^{(k)}$, $x_i$:
$$\max_{x_i, G_{ij}, H_{ij}^{(k)}} \left[\sum_{i=1}^{n}\left(1 - n\alpha_i^{(k)}\right) G_{ij}\right] \to \max_{k=1,\ldots,T}\; \min_{j=1,\ldots,n}.$$

Let us consider two main cases for the elements of the extreme points $\alpha_i^{(k)}$. Denote $N = \{1, \ldots, n\}$.

Case 1. For every $k$, we define the set of indices $U_k$ such that $1 - n\alpha_i^{(k)} < 0$ for every $i \in U_k$. In this case, the maximization of the objective function entails the minimization of $G_{ij}$. Then we have simple constraints for $G_{ij}$:
$$G_{ij} \geq 0, \qquad G_{ij} \geq 1 - \sum_{k=1}^{m} H_{ij}^{(k)}/\gamma^2, \quad i \in U_k.$$

At the same time, the minimization of $G_{ij}$ entails the maximization of every $H_{ij}^{(k)}$. The main problem here is to introduce the absolute value $\left|x_i^{(k)} - x_j^{(k)}\right|$ into the constraints. In order to do that, we use a result proposed by Beaumont (1998), represented as a lemma. We give its simplified form.

Lemma 2 (Beaumont, 1998). If $[\underline{x}, \overline{x}] \subset \mathbb{R}$, $\underline{x} < \overline{x}$, and if
$$u = \frac{|\overline{x}| - |\underline{x}|}{\overline{x} - \underline{x}}, \qquad v = \frac{\overline{x}\,|\underline{x}| - \underline{x}\,|\overline{x}|}{\overline{x} - \underline{x}},$$
then
$$\forall x \in [\underline{x}, \overline{x}], \quad |x| \leq ux + v.$$
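A quick numerical check of Lemma 2 (added for illustration; the helper name is chosen here): the coefficients $u$ and $v$ define the chord of $|x|$ over the interval, so the bound holds on the whole interval and is tight at both endpoints.

```python
import numpy as np

def chord_coefficients(lo, hi):
    """Coefficients u, v of Lemma 2 for the interval [lo, hi] (lo < hi):
    the chord of |x| over the interval, so |x| <= u*x + v for x in [lo, hi]."""
    u = (abs(hi) - abs(lo)) / (hi - lo)
    v = (hi * abs(lo) - lo * abs(hi)) / (hi - lo)
    return u, v

lo, hi = -2.0, 5.0
u, v = chord_coefficients(lo, hi)
xs = np.linspace(lo, hi, 1001)
assert np.all(np.abs(xs) <= u * xs + v + 1e-12)   # the bound holds on the whole interval
print(u, v, u * lo + v, u * hi + v)               # chord equals |x| at both endpoints
```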

Let us determine lower and upper bounds for the difference $x_i^{(k)} - x_j^{(k)}$. Its lower bound is $\underline{x}_{ij}^{(k)} = \underline{a}_i^{(k)} - \overline{a}_j^{(k)}$. The upper bound can be obtained in the same way: $\overline{x}_{ij}^{(k)} = \overline{a}_i^{(k)} - \underline{a}_j^{(k)}$. Hence, in accordance with Lemma 2, we get the following constraints:
$$H_{ij}^{(k)} \leq \overline{h}_{ij}^{(k)}\left(x_i^{(k)} - x_j^{(k)}\right) + \underline{h}_{ij}^{(k)}, \quad i \in U_k,$$
where
$$\overline{h}_{ij}^{(k)} = \frac{\left|\overline{x}_{ij}^{(k)}\right| - \left|\underline{x}_{ij}^{(k)}\right|}{\overline{x}_{ij}^{(k)} - \underline{x}_{ij}^{(k)}}, \qquad \underline{h}_{ij}^{(k)} = \frac{\overline{x}_{ij}^{(k)}\left|\underline{x}_{ij}^{(k)}\right| - \underline{x}_{ij}^{(k)}\left|\overline{x}_{ij}^{(k)}\right|}{\overline{x}_{ij}^{(k)} - \underline{x}_{ij}^{(k)}}.$$

Case 2. For every $k$, we define the set of indices $N \setminus U_k$ such that $1 - n\alpha_i^{(k)} \geq 0$ for every $i \in N \setminus U_k$. In this case, the maximization of the objective function entails the maximization of $G_{ij}$. Denote the non-zero part of the expression for $G_{ij}$ as $w$, i.e., $G_{ij} = \max(0, w)$. Rewrite the last expression as follows:
$$\max(0, w) = w/2 + \max(-w/2, w/2) = w/2 + |w/2|.$$
Indeed, if $w$ is negative, then $w/2 + |w/2| = 0$. If $w$ is positive, then $w/2 + |w/2| = w$.

We again use the result of Beaumont (1998) represented in Lemma 2. Let us determine lower and upper bounds for
$$w_{ij} = 1 - \sum_{k=1}^{m} H_{ij}^{(k)}/\gamma^2.$$

First, we find bounds for the absolute value $H_{ij}^{(k)}$. Its lower bound is
$$\underline{H}_{ij}^{(k)} = \begin{cases} \min\left(\left|\underline{x}_{ij}^{(k)}\right|, \left|\overline{x}_{ij}^{(k)}\right|\right), & \underline{x}_{ij}^{(k)} \cdot \overline{x}_{ij}^{(k)} > 0,\\ 0, & \text{otherwise}.\end{cases}$$
Here $\underline{x}_{ij}^{(k)} = \underline{a}_i^{(k)} - \overline{a}_j^{(k)}$ and $\overline{x}_{ij}^{(k)} = \overline{a}_i^{(k)} - \underline{a}_j^{(k)}$ have been defined above. It can be seen that the lower bound for $H_{ij}^{(k)}$ depends on the relative position of the $i$th and $j$th intervals. In particular, if $\overline{a}_i^{(k)} < \underline{a}_j^{(k)}$ (the intervals do not intersect and the $j$th interval follows the $i$th interval), then the closest points of the intervals are $\overline{a}_i^{(k)}$ and $\underline{a}_j^{(k)}$ and the lower bound is $\left|\overline{x}_{ij}^{(k)}\right|$. If $\overline{a}_j^{(k)} < \underline{a}_i^{(k)}$, then the lower bound is $\left|\underline{x}_{ij}^{(k)}\right|$. However, when the intervals intersect (condition $\underline{x}_{ij}^{(k)} \cdot \overline{x}_{ij}^{(k)} \leq 0$), there are points belonging to both intervals such that the difference is 0. This difference is the smallest possible value because $H_{ij}^{(k)} \geq 0$.

The upper bound can be obtained in the same way:
$$\overline{H}_{ij}^{(k)} = \max\left(\left|\underline{x}_{ij}^{(k)}\right|, \left|\overline{x}_{ij}^{(k)}\right|\right).$$

Finally, the lower and upper bounds for $w_{ij}$ are
$$\underline{w}_{ij} = 1 - \sum_{k=1}^{m} \overline{H}_{ij}^{(k)}/\gamma^2, \qquad \overline{w}_{ij} = 1 - \sum_{k=1}^{m} \underline{H}_{ij}^{(k)}/\gamma^2,$$
respectively, and the additional constraint is
$$w_{ij} \leq \overline{W}_{ij}\left(1 - \sum_{k=1}^{m} H_{ij}^{(k)}/\gamma^2\right) + \underline{W}_{ij}, \quad i \in N\setminus U_k,$$
where
$$\overline{W}_{ij} = \frac{\left|\overline{w}_{ij}\right| - \left|\underline{w}_{ij}\right|}{\overline{w}_{ij} - \underline{w}_{ij}}, \qquad \underline{W}_{ij} = \frac{\overline{w}_{ij}\left|\underline{w}_{ij}\right| - \underline{w}_{ij}\left|\overline{w}_{ij}\right|}{\overline{w}_{ij} - \underline{w}_{ij}}.$$

At the same time, the maximization of $G_{ij}$ entails the minimization of every $H_{ij}^{(k)}$. This can be simply realized by means of the constraints
$$H_{ij}^{(k)} \geq x_i^{(k)} - x_j^{(k)}, \qquad H_{ij}^{(k)} \geq x_j^{(k)} - x_i^{(k)}, \quad i \in N\setminus U_k.$$

In sum, we get the following optimization problem:
$$O(j, k) = \max_{x_i, G_{ij}, H_{ij}^{(k)}} \left[\sum_{i \in U_k}\left(1 - n\alpha_i^{(k)}\right)G_{ij} + \frac{1}{2}\sum_{i \in N\setminus U_k}\left(1 - n\alpha_i^{(k)}\right)\left(w_{ij} + 1 - \sum_{k=1}^{m} H_{ij}^{(k)}/\gamma^2\right)\right], \qquad (11)$$
subject to
$$G_{ij} \geq 0, \qquad G_{ij} \geq 1 - \sum_{k=1}^{m} H_{ij}^{(k)}/\gamma^2, \quad i \in U_k, \qquad (12)$$
$$H_{ij}^{(k)} \leq \overline{h}_{ij}^{(k)}\left(x_i^{(k)} - x_j^{(k)}\right) + \underline{h}_{ij}^{(k)}, \quad i \in U_k, \qquad (13)$$
$$w_{ij} \leq \overline{W}_{ij}\left(1 - \sum_{k=1}^{m} H_{ij}^{(k)}/\gamma^2\right) + \underline{W}_{ij}, \quad i \in N\setminus U_k, \qquad (14)$$
$$H_{ij}^{(k)} \geq x_i^{(k)} - x_j^{(k)}, \qquad H_{ij}^{(k)} \geq x_j^{(k)} - x_i^{(k)}, \quad i \in N\setminus U_k, \qquad (15)$$
$$\underline{a}_i^{(k)} \leq x_i^{(k)} \leq \overline{a}_i^{(k)}, \quad k = 1, \ldots, m, \quad i = 1, \ldots, n. \qquad (16)$$

The problem is solved for every $j = 1, \ldots, n$ and for every $k = 1, \ldots, T$.
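To make the construction of constraints (13)–(16) concrete, the sketch below computes, for one pair (i, j) of interval-valued examples, the interval-derived quantities that enter the linear program: the bounds on the feature differences, the bounds on H and w, and the chord coefficients of Lemma 2. The function and its name are added here for illustration under these assumptions; they are not the authors' code.

```python
def pairwise_coefficients(a_lo_i, a_hi_i, a_lo_j, a_hi_j, gamma):
    """Interval-derived coefficients for one pair (i, j) of interval-valued
    examples with m features; a_lo_*, a_hi_* are length-m lists of bounds.

    Returns, per feature k: bounds on x_i^(k) - x_j^(k), bounds on
    H_ij^(k) = |x_i^(k) - x_j^(k)|, the chord coefficients of Lemma 2 used in
    constraint (13), plus bounds on w_ij = 1 - sum_k H_ij^(k)/gamma^2 and the
    chord coefficients used in constraint (14).
    """
    m = len(a_lo_i)

    def chord(lo, hi):
        # Lemma 2 requires lo < hi; for a degenerate interval |x| is a constant.
        if hi == lo:
            return 0.0, abs(hi)
        u = (abs(hi) - abs(lo)) / (hi - lo)
        v = (hi * abs(lo) - lo * abs(hi)) / (hi - lo)
        return u, v

    x_lo = [a_lo_i[k] - a_hi_j[k] for k in range(m)]   # lower bound of x_i - x_j
    x_hi = [a_hi_i[k] - a_lo_j[k] for k in range(m)]   # upper bound of x_i - x_j
    h = [chord(x_lo[k], x_hi[k]) for k in range(m)]    # (slope, intercept) for (13)
    # bounds of H = |x_i - x_j|: 0 if the intervals intersect, else the closest gap
    H_lo = [min(abs(x_lo[k]), abs(x_hi[k])) if x_lo[k] * x_hi[k] > 0 else 0.0
            for k in range(m)]
    H_hi = [max(abs(x_lo[k]), abs(x_hi[k])) for k in range(m)]
    w_lo = 1.0 - sum(H_hi) / gamma ** 2                # lower bound of w_ij
    w_hi = 1.0 - sum(H_lo) / gamma ** 2                # upper bound of w_ij
    W = chord(w_lo, w_hi)                              # (slope, intercept) for (14)
    return x_lo, x_hi, h, H_lo, H_hi, (w_lo, w_hi), W
```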

3.5. An algorithm

A general algorithm can be written as follows.
Step 1. $E(\mathcal{M}_\nu)$ is the set of extreme points $\alpha^{(1)}, \ldots, \alpha^{(T)}$.
Step 2. We select the $k$th extreme point $\alpha^{(k)}$ from $E(\mathcal{M}_\nu)$.
Step 3. For the given $j \in \{1, \ldots, n\}$ and the selected $k \in \{1, \ldots, T\}$, we solve linear programming problem (11)–(16).
Step 4. For the given $j \in \{1, \ldots, n\}$, we select a single value $k_j^*$ such that objective function (11) achieves its maximum, i.e., $k_j^* \leftarrow \arg\max_k O(j, k)$.
Step 5. We select a single value $j^*$ such that objective function (11) achieves its minimum, i.e., $j^* \leftarrow \arg\min_j O(j, k_j^*)$. As a result, we get an optimal vector $(x_1^*, \ldots, x_n^*)$ and an optimal extreme point $\alpha^{(k_{j^*}^*)}$.
Step 6. This step can be realized in two different ways. The first way is to obtain the solution $\varphi$ of problem (4), which can be regarded as the dual of problem (11)–(16). This step can be carried out by means of well-known procedures implemented, for example, in the package "linprog" in R. The second way is simply to solve problem (2)–(3) by substituting the optimal vector $(x_1^*, \ldots, x_n^*)$ into the objective function (2) and constraints (3).
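The outer loop of this algorithm can be organized as in the sketch below (an illustration, not the authors' code). The callables solve_problem_11_16 and cb_fit are hypothetical placeholders: the former is assumed to return the optimal objective value O(j, k) of problem (11)–(16) together with the corresponding optimal points, e.g., built on top of scipy.optimize.linprog, and the latter solves the point-valued problem (2)–(3) as in Section 2.

```python
def robust_occ_fit(intervals, alphas, nu, gamma, solve_problem_11_16, cb_fit):
    """Outer loop of the algorithm in Section 3.5 (a sketch under stated assumptions).

    intervals           -- the training hyper-rectangles A_1, ..., A_n
    alphas              -- extreme points alpha^(1), ..., alpha^(T) from Proposition 1
    solve_problem_11_16 -- placeholder: returns (O(j, k), optimal points x) for a
                           given j and extreme point alpha^(k)
    cb_fit              -- placeholder: solves the point-valued problem (2)-(3)
    """
    n = len(intervals)
    best_j, best_val, best_x = None, None, None
    for j in range(n):                                   # Steps 2-4: max over k for every j
        o_j, x_j = max((solve_problem_11_16(j, alpha, intervals, gamma)
                        for alpha in alphas), key=lambda t: t[0])
        if best_val is None or o_j < best_val:           # Step 5: min over j
            best_j, best_val, best_x = j, o_j, x_j
    # Step 6 (second way): refit the point-valued C-B model at the optimal points
    return best_j, best_x, cb_fit(best_x, nu, gamma)
```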

3.6. Decision strategies for testing

The next question is how to decide whether a testing observation is abnormal or not when it is interval-valued.

The following three decision strategies can be used in order to decide whether an interval-valued observation is normal or abnormal:

1. Classification by using the centers of intervals (CC). According to this strategy, every hyper-rectangle is replaced by its center. This is the simplest strategy.

2. Classification by using all vertices of the hyper-rectangles produced by intervals together with their centers (CA). Here a vertex is a point whose every feature lies at one of the bounds of the corresponding interval.

3. Classification by using half of the points of every hyper-rectangle (CH). According to this strategy, every interval is divided into $k - 1$ subintervals ($k$ points for every feature), so the interval-valued example is viewed as a grid.

These strategies are also used in order to estimate the classification accuracy of models with interval-valued data.


4. Numerical experiments

The model proposed in this paper is illustrated via several examples; all computations have been performed using the programming language Java. We investigate the performance of the proposed model and compare it with the C–B model by considering the accuracy (ACC), which is the proportion of correctly classified cases in a sample of data and is often used to quantify the predictive performance of classification methods. When applying the C–B model, we take the centers of the hyper-rectangles produced by intervals from the sample data.

It should be noted that all strategies replace an interval-valued example by a single point or by a set of points. The indicator decision function of correct classification is of the form
$$C(x) = I(y \cdot f(x) \geq 0),$$
where $x$ is a testing example, $f(x)$ is the value of the separating function at the point $x$, $y \in \{-1, 1\}$ is the class label, and $I$ is the indicator function.

It should be noted that the labels $y$ are unknown to the classifier. However, in order to evaluate it, the testing examples are divided into two classes whose labels are $-1$ for abnormal examples and $1$ for other examples. The class of a point corresponds to the class of the interval-valued example producing this point with respect to the accepted testing strategy.

Let us formally define the decision functions for every strategy:

1. According to the first strategy, an example is supposed to be correctly classified if the point corresponding to its hyper-rectangle center $x_i^*$ is correctly classified, i.e., $S(x_i) = C(x_i^*)$.

2. According to the second strategy, an example is supposed to be correctly classified if all vertices $V_i$ of the corresponding hyper-rectangle as well as its center $x_i^*$ are correctly classified, i.e.,
$$S(x_i) = I\left(1 + |V_i| = \sum_{x \in V_i \cup \{x_i^*\}} C(x)\right).$$

3. According to the third strategy, an example is supposed to be correctly classified if at least half of the points belonging to the predefined grid $G_i$ of the corresponding hyper-rectangle are correctly classified, i.e.,
$$S(x_i) = \begin{cases} 1, & \sum_{x \in G_i} I(C(x) = 1) \geq \sum_{x \in G_i} I(C(x) = 0),\\ 0, & \text{otherwise}.\end{cases}$$

The classification accuracy measure for strategies 1–3 can be computed as follows:
$$\mathrm{ACC} = \frac{1}{|X|}\sum_{i=1}^{|X|} S(x_i),$$
where $X$ is the testing set, $|X|$ is the number of elements in the set $X$, i.e., the number of testing examples, and $S$ is the binary decision function for every strategy such that $S(x_i) = 1$ if the $i$th example is correctly classified and $S(x_i) = 0$ otherwise.

We denote the accuracy measure of the proposed model as $\mathrm{ACC}^{\mathrm{Int}}_S$ and the accuracy measure of the standard C–B model using the center points as $\mathrm{ACC}^{\mathrm{CB}}_S$, where $S$ corresponds to one of the above decision strategies.

4.1. Synthetic data

First of all, we consider the performance of the proposed model on synthetic data having two features $x^{(1)}$ and $x^{(2)}$. The training set, consisting of $N$ examples from two subsets, is generated in accordance with the following rules. All experiments use the triangular kernel with kernel parameter $\gamma$. Different values of the parameter $\gamma$ have been tested, choosing those leading to the best results.

Generation of normal observations. A set of $N_1 = (1 - \varepsilon)N$ normal observations is generated. Here $\varepsilon$ is the portion of abnormal observations.
Step 1. The center of an example $x_i^*$, denoted $(x_1^*, x_2^*)$, is generated from the normal probability distribution with expectations $(m_1^{(1)}, m_2^{(1)})$ and standard deviations $(\sigma_1^{(1)}, \sigma_2^{(1)})$.
Step 2. For every pair $(x_1^*, x_2^*)$, we generate the interval-valued pair $([x_1^* - \Delta_1, x_1^* + \Delta_1]; [x_2^* - \Delta_2, x_2^* + \Delta_2])$. The shifts $(\Delta_1, \Delta_2)$ are generated from the uniform probability distribution with a predefined largest shift $M_\Delta$.
Additionally, we introduce the portion $\beta$ of one-dimensional intervals, i.e., the proportion of observations having one point-valued feature, for which $\Delta_i = 0$.

Generation of abnormal observations. A set of $N_2 = \varepsilon N$ abnormal observations is generated. Two approaches are used to do this.

1. Normal and abnormal observations are concentrated around two centers defined by different mean values $(m_1^{(1)}, m_2^{(1)})$ and $(m_1^{(2)}, m_2^{(2)})$, respectively. The observations are governed by normal probability distributions with identical variances.

2. Normal and abnormal observations are concentrated around one center defined by the mean values $(m_1^{(1)}, m_2^{(1)})$, but with different standard deviations $(\sigma_1^{(1)}, \sigma_2^{(1)})$ and $(r\sigma_1^{(2)}, r\sigma_2^{(2)})$, respectively. The observations are governed by normal probability distributions with identical mean values. Here $r$ is a multiplier used for generating abnormal observations. The part of the abnormal observations located close to the center is removed.

We use various values of the kernel parameter $\gamma$ and the parameter $\nu$ on a predefined grid. However, we report only the values that provide the best classification accuracy. The approaches for generating the testing sets are similar to those for the training sets.

First, we study how the classification accuracy depends on the value of the largest shift $M_\Delta$, i.e., on the size of the intervals. Applying the first approach for generating the training sets, we use the following parameters:
$$N = 50, \quad \varepsilon = 0.2, \quad (m_1^{(1)}, m_2^{(1)}) = (0, 0), \quad (m_1^{(2)}, m_2^{(2)}) = (4, 4),$$
$$(\sigma_1^{(1)}, \sigma_2^{(1)}) = (1, 1), \quad M_\Delta = 0.01, \ldots, 7, \quad \beta = 0.1, \quad \gamma = 4, \quad \nu = 0.02.$$

Results of testing of the proposed model and the standard C–B model for all decision strategies are shown in Table 1. It can be seen from the table that the proposed model outperforms the C–B model for large values of $M_\Delta$, i.e., for the case of large intervals of training data.

It is also interesting to study how the difference between the accuracy measures depends on $M_\Delta$. The relative difference for strategy $S$ is defined as follows:
$$d = \frac{\mathrm{ACC}^{\mathrm{Int}}_S - \mathrm{ACC}^{\mathrm{CB}}_S}{\left(\mathrm{ACC}^{\mathrm{Int}}_S + \mathrm{ACC}^{\mathrm{CB}}_S\right)/2}.$$

Fig. 2 illustrates the dependence of the relative differences between the accuracy measures of the proposed model and the C–B model on the value of $M_\Delta$ for the considered strategies. It can be seen from Fig. 2 that the performance of the proposed model improves as the intervals grow. It is interesting to note that the C–B model outperforms the proposed model at $M_\Delta = 4$ for all strategies except the second one. This behavior of the accuracy measures may be associated with the difference between the expectations of the generated normal and abnormal points.

Fig. 2. Relative differences d as functions of M∆.

Table 1
The classification accuracy of the proposed and C–B models for different strategies and different values of the shift M∆.

M∆      Strategy 1              Strategy 2              Strategy 3
        ACC^CB_CC  ACC^Int_CC   ACC^CB_CA  ACC^Int_CA   ACC^CB_CH  ACC^Int_CH
0.01    0.7823     0.7832       0.7831     0.7830       0.7834     0.7829
0.1     0.7877     0.7864       0.7902     0.7886       0.7878     0.7864
0.5     0.7779     0.7771       0.7897     0.7899       0.7766     0.7758
1.0     0.7855     0.7936       0.7978     0.8075       0.7833     0.7923
1.5     0.7842     0.8142       0.7983     0.8161       0.7805     0.8128
2.0     0.7828     0.8036       0.7973     0.8151       0.7781     0.7994
2.5     0.7762     0.7954       0.7967     0.8159       0.7676     0.7903
3.0     0.7869     0.8005       0.7995     0.8226       0.7730     0.7896
4.0     0.7822     0.7680       0.7975     0.8219       0.7506     0.7236
5.0     0.7808     0.8044       0.7973     0.8221       0.7184     0.7135
6.0     0.7861     0.8561       0.7967     0.8356       0.6698     0.7155
7.0     0.7902     0.8453       0.7977     0.8262       0.5984     0.6248

Table 2
The classification accuracy of the proposed and C–B models for different strategies and different values of the standard deviation of the features.

σ       Strategy 1              Strategy 2              Strategy 3
        ACC^CB_CC  ACC^Int_CC   ACC^CB_CA  ACC^Int_CA   ACC^CB_CH  ACC^Int_CH
0.25    0.7603     0.8542       0.8079     0.9470       0.4705     0.7415
0.5     0.7548     0.8597       0.8029     0.9114       0.6938     0.8638
0.75    0.7598     0.8578       0.8031     0.8861       0.7589     0.8790
1.0     0.7658     0.8419       0.8027     0.8550       0.7697     0.8587
1.5     0.7563     0.7875       0.7936     0.8169       0.7565     0.7915
2.0     0.7522     0.7651       0.7889     0.8000       0.7520     0.7661
2.5     0.7313     0.7309       0.7756     0.7780       0.7318     0.7317
3.0     0.7102     0.7136       0.7645     0.7677       0.7122     0.7158

Second, we study how the classification accuracy depends on the standard deviation $\sigma$ of the features. For generating the training sets, we use the first approach and the following parameters:
$$N = 50, \quad \varepsilon = 0.2, \quad (m_1^{(1)}, m_2^{(1)}) = (0, 0), \quad (m_1^{(2)}, m_2^{(2)}) = (8, 8),$$
$$(\sigma_1^{(1)}, \sigma_2^{(1)}) = (\sigma, \sigma), \quad M_\Delta = 2, \quad \beta = 0.1, \quad \gamma = 4, \quad \nu = 0.02.$$

The experimental results are given in Table 2. One can see from Table 2 that the accuracy measures of the proposed and standard C–B models converge to close values as the standard deviation $\sigma$ increases. At the same time, the proposed model provides better results when the standard deviation is rather small.

Fig. 3 illustrates the dependence of the relative differences between the accuracy measures of the proposed model and the C–B model on the value of the standard deviation for the considered strategies.

Third, we study how the classification accuracy depends on the location of normal and abnormal observations. Using the first approach for generating interval-valued training data (normal and abnormal observations are concentrated around two centers), we investigate the classifiers with the following parameters:
$$N = 50, \quad \varepsilon = 0.2, \quad (m_1^{(1)}, m_2^{(1)}) = (0, 0), \quad (m_1^{(2)}, m_2^{(2)}) = (m, m),$$
$$(\sigma_1^{(1)}, \sigma_2^{(1)}) = (0.5, 0.5), \quad M_\Delta = 2, \quad \beta = 0.1, \quad \gamma = 4, \quad \nu = 0.02.$$
One can see that the mean values of the abnormal observations are taken as $(m, m)$ in order to consider the dependence of the classification accuracy measures on $m$. By changing the mean values $(m_1^{(2)}, m_2^{(2)})$, we change the distance between the normally distributed examples and the abnormal observations. Table 3 shows how the accuracy measures depend on this distance for different decision strategies. It can be seen from the table that the proposed model outperforms the C–B model for large values of $m$.

Relative differences between the accuracy measures as functions of the distance between the expectations of the generated normal and abnormal observations are depicted in Fig. 4. It is interesting to see from Fig. 4 that the difference between the accuracy measures of the proposed model and the C–B model increases for some strategies up to $m = 4$ and then decreases beyond this distance value. This can be explained as follows. When $m$ is rather small, i.e., normal and abnormal observations are very close to each other, the proposed model and the C–B model behave identically and show similarly poor results. This can be seen from Table 3 for $m = 2$. As $m$ increases, one could expect that the C–B model using the center points of the intervals would outperform the proposed model. However, the width of the intervals determined by $M_\Delta = 2$ is comparable with $m$, and this leads to the proposed model outperforming the C–B model for all strategies. When $m$ is rather large in comparison with the given $M_\Delta$, the difference between the models decreases and may be very small.

Similar results can be obtained by increasing the standard deviation of the features to $(\sigma_1^{(1)}, \sigma_2^{(1)}) = (1, 1)$. They are shown in Table 4. It can be seen from the table that the relative quality of the proposed model depends on the strategy. In particular, the second strategy provides the best results in comparison with the other strategies.

It was pointed out in Section 1 that there are other one-class classification models, for example, the well-known model proposed by Tax and Duin (1999, 2004) (the T–D model). Therefore, we also use this model for comparison with the proposed model on interval-valued data. Table 5 illustrates the difference between the accuracy measures of the T–D model ($\mathrm{ACC}^{\mathrm{TD}}_{CC}$) and the proposed model ($\mathrm{ACC}^{\mathrm{Int}}_{CC}$) for the first strategy with the standard deviation of the features $(\sigma_1^{(1)}, \sigma_2^{(1)}) = (0.5, 0.5)$ and different mean values of the abnormal observations. One can observe that the T–D model provides better results in comparison with the C–B model only for some values of $m$ (see Table 3, which provides the corresponding numerical results for the C–B model). However, the character of the relationship between this model and the proposed model is the same.

Fig. 3. Relative differences d as functions of the standard deviation.

Fig. 4. Relative differences d as functions of the distance between expectations of normal and abnormal data.

Table 3
The classification accuracy of the proposed and C–B models for different strategies and different mean values of the abnormal observations for (σ^(1)_1, σ^(1)_2) = (0.5, 0.5).

m       Strategy 1              Strategy 2              Strategy 3
        ACC^CB_CC  ACC^Int_CC   ACC^CB_CA  ACC^Int_CA   ACC^CB_CH  ACC^Int_CH
2.0     0.7842     0.7842       0.7982     0.8149       0.7558     0.7569
2.5     0.7800     0.7869       0.8146     0.7999       0.8333     0.7617
3.0     0.7834     0.8751       0.7988     0.8422       0.7560     0.8790
3.5     0.7890     0.8985       0.7989     0.8569       0.7664     0.9088
4.0     0.7818     0.9029       0.7987     0.8519       0.7542     0.9159
4.5     0.7959     0.9055       0.8011     0.8695       0.7750     0.9174
5.0     0.7868     0.9032       0.7997     0.8625       0.7626     0.9170
6.0     0.7879     0.8115       0.8000     0.8589       0.7628     0.8029
7.0     0.7964     0.7982       0.8025     0.8808       0.7781     0.7796
8.0     0.7545     0.8600       0.8025     0.9112       0.6936     0.8635

The same numerical experiments can be carried out using the second approach for generating abnormal observations, namely, when normal and abnormal observations are concentrated around one center but with different variances. We use the following parameters for generating training data:
$$N = 50, \quad \varepsilon = 0.2, \quad (m_1^{(1)}, m_2^{(1)}) = (0, 0), \quad (\sigma_1^{(1)}, \sigma_2^{(1)}) = (0.5, 0.5),$$
$$M_\Delta = 0.01, \ldots, 7, \quad \beta = 0.1, \quad \gamma = 4, \quad \nu = 0.02, \quad r = 3.$$

Table 4
The classification accuracy of the proposed and C–B models for different strategies and different mean values of the abnormal observations for (σ^(1)_1, σ^(1)_2) = (1, 1).

m       Strategy 1              Strategy 2              Strategy 3
        ACC^CB_CC  ACC^Int_CC   ACC^CB_CA  ACC^Int_CA   ACC^CB_CH  ACC^Int_CH
2.0     0.7801     0.7651       0.7973     0.8007       0.7743     0.7555
2.5     0.7816     0.7862       0.7986     0.8094       0.7747     0.7800
3.0     0.7754     0.7750       0.7956     0.8009       0.7682     0.7701
3.5     0.7818     0.7936       0.7974     0.8101       0.7767     0.7899
4.0     0.7859     0.7931       0.7998     0.8098       0.7808     0.7890
4.5     0.7859     0.8034       0.7999     0.8151       0.7811     0.8014
5.0     0.7844     0.8153       0.7999     0.8285       0.7786     0.8148
6.0     0.7873     0.7840       0.8019     0.8250       0.7876     0.7897
7.0     0.7715     0.8198       0.8016     0.8602       0.7783     0.8401
8.0     0.7651     0.8408       0.8028     0.8553       0.7694     0.8585

Table 5
The classification accuracy of the proposed and T–D models for the first strategy and different mean values of the abnormal observations for (σ^(1)_1, σ^(1)_2) = (0.5, 0.5).

m       ACC^TD_CC  ACC^Int_CC
2.0     0.7785     0.7842
3.0     0.7801     0.8751
4.0     0.8151     0.9029
5.0     0.8401     0.9032
6.0     0.8032     0.8115
7.0     0.7981     0.7982
8.0     0.8322     0.8600

Fig. 5. Relative differences d as functions of M∆ for the second generation approach.

Fig. 6. Relative differences d as functions of r for the second generation approach.

Table 6
The classification accuracy of the proposed and C–B models for different strategies and different values of the shift M∆.

M∆      Strategy 1              Strategy 2              Strategy 3
        ACC^CB_CC  ACC^Int_CC   ACC^CB_CA  ACC^Int_CA   ACC^CB_CH  ACC^Int_CH
0.01    0.8957     0.8956       0.8942     0.8951       0.8960     0.8959
0.1     0.8959     0.9008       0.8874     0.8926       0.8959     0.9005
0.5     0.8968     0.9110       0.8592     0.8770       0.8965     0.9109
1.0     0.8930     0.9057       0.8358     0.8543       0.8952     0.9063
1.5     0.9005     0.8942       0.8289     0.8443       0.9076     0.8928
2.0     0.8948     0.8839       0.8238     0.8408       0.9084     0.8709

The corresponding experimental results are given in Table 6. The relative differences between the accuracy measures as functions of $M_\Delta$ are depicted in Fig. 5.

The next experiment aims to investigate how the parameter $r$ impacts the accuracy measures of the models. We again use the second approach for generating abnormal observations with the parameters:
$$N = 50, \quad \varepsilon = 0.2, \quad (m_1^{(1)}, m_2^{(1)}) = (0, 0), \quad (\sigma_1^{(1)}, \sigma_2^{(1)}) = (0.5, 0.5),$$
$$M_\Delta = 1, \quad \beta = 0.1, \quad \gamma = 4, \quad \nu = 0.02, \quad r = 2, \ldots, 10.$$

Table 7
The classification accuracy of the proposed and C–B models for different strategies and different values of r.

r       Strategy 1              Strategy 2              Strategy 3
        ACC^CB_CC  ACC^Int_CC   ACC^CB_CA  ACC^Int_CA   ACC^CB_CH  ACC^Int_CH
2.0     0.8790     0.8722       0.8294     0.8370       0.8822     0.8713
2.5     0.8907     0.8926       0.8343     0.8488       0.8939     0.8941
3.0     0.9011     0.9132       0.8406     0.8595       0.9031     0.9140
3.5     0.8989     0.9110       0.8397     0.8580       0.8999     0.9122
4.0     0.8979     0.9152       0.8392     0.8607       0.8998     0.9167
5.0     0.9003     0.9235       0.8436     0.8668       0.9014     0.9247
10.0    0.9022     0.9272       0.8556     0.8788       0.9021     0.9272

The corresponding experimental results are given in Table 7. The relative differences between the accuracy measures as functions of $r$ are depicted in Fig. 6.

An example of separating functions computed by means of theproposed model (thick curve) and the C–B model with using cen-ters of hyper-rectangles (dashed curve) is shown in Fig. 7. The ab-normal observations are generated by applying the first approach,i.e., normal and abnormal observations are concentrated aroundtwo centers. Centers of hyper-rectangles corresponding to nor-mal and abnormal observations are depicted by small trianglesand diamonds, respectively. The following parameters are used forgenerating random interval-valued observations: N = 20, ε =0.2, (m(1)

1 ,m(1)2 ) = (0, 0), (m(2)

1 ,m(2)2 ) = (4, 4), (σ (1)

1 , σ(1)2 ) =

(2, 2), M∆ = 2, β = 0.1, γ = 4, ν = 0.02.Another example of similar separating functions is shown in

Fig. 8. The abnormal observations in this case are generated byapplying the second approach, i.e., normal and abnormal ob-servations are concentrated around one center, but with differ-ent variances. The following parameters are used for generating



Fig. 7. An example of separating functions computed by different models for the first generation approach.

Fig. 8. An example of separating functions computed by different models for the second generation approach.

The following parameters are used for generating interval-valued observations: $N = 30$, $\varepsilon = 0.2$, $(m_1^{(1)}, m_2^{(1)}) = (0, 0)$, $(\sigma_1^{(1)}, \sigma_2^{(1)}) = (0.5, 0.5)$, $M_\Delta = 1$, $\beta = 0.1$, $\gamma = 4$, $\nu = 0.02$, $r = 2.5$.

4.2. Real data

The proposed model has been evaluated and investigated on the following publicly available data sets: Indian Liver Patient, Iris, Vertebral Column, Seeds, and Glass Identification. All data sets are from the UCI Machine Learning Repository (Frank & Asuncion, 2010). A brief description of these data sets is given below; more detailed information can be found in the corresponding data resources.

Indian Liver Patient Data set (ILPD) contains 416 liver patient records and 167 non-liver patient records characterized by 10 features. Liver patients are viewed as normal data, non-liver patients are abnormal.

Iris data set contains 3 classes (Iris Setosa, Iris Versicolour, Iris Virginica) of 50 instances each. The number of features is 4 (sepal length, sepal width, petal length, petal width). It is supposed that data points from the Iris Setosa class are abnormal, i.e., the number of abnormal examples is 30.

Vertebral Column data set contains 2 classes of patients. The classes consist of 100 patients considered as normal and 210 patients considered as abnormal. Each patient is represented in the data set by six biomechanical attributes derived from the shape and orientation of the pelvis and lumbar spine.

Seeds data set consists of three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each. Elements are characterized by 7 features. For using the set in one-class classification, elements of the class Canadian are supposed to be abnormal. Other elements are normal.

Table 8
Optimal values of parameters for real data sets.

Data set    $\gamma$   IDC
ILPD        16         2
Vertebral   4          5
Seeds       2          10
Iris        2          5
Glass       2          5

Table 9
Accuracy measures for real data sets.

Data set    Strategy 1                            Strategy 2                            Strategy 3
            $ACC_{CC}^{CB}$   $ACC_{CC}^{Int}$    $ACC_{CA}^{CB}$   $ACC_{CA}^{Int}$    $ACC_{CH}^{CB}$   $ACC_{CH}^{Int}$
ILPD        0.6066            0.7220              0.6070            0.7102              0.6005            0.7124
Vertebral   0.6750            0.6945              0.6750            0.7140              0.6750            0.6815
Seeds       0.6397            0.8065              0.6459            0.8462              0.6460            0.8400
Iris        0.5030            0.7755              0.4886            0.7491              0.4885            0.8450
Glass       0.4168            0.8014              0.4068            0.8126              0.4161            0.8909

Glass Identification data set contains 6 types of glass and 214 examples in total, each having 10 features. Elements of the first and second classes are viewed as normal. Other elements are abnormal.
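As a small illustration of how a multi-class UCI data set is turned into a one-class problem, the sketch below relabels Iris so that the Setosa class is abnormal, as described above. Loading via scikit-learn and the $\pm 1$ encoding are assumptions of the sketch (the experiments subsample the data, so the class sizes obtained here need not match those reported above).

```python
from sklearn.datasets import load_iris
import numpy as np

# Illustrative relabeling of Iris for one-class classification:
# Iris Setosa is treated as abnormal, the other two classes as normal.
iris = load_iris()
setosa_id = list(iris.target_names).index("setosa")
y_occ = np.where(iris.target == setosa_id, -1, 1)   # -1: abnormal, +1: normal
X = iris.data

print("normal examples:", int((y_occ == 1).sum()),
      "abnormal examples:", int((y_occ == -1).sum()))
```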

The following algorithm is used for numerical experiments with real data:

Step 1. All examples of the data set are divided into training and testing subsets such that the training subset contains 40 examples.
Step 2. For all examples and for every feature, the intervals are randomly generated similarly to the generation of intervals for synthetic data. The center of an interval-valued example is the original value of every feature in the real data. The largest shift $M_\Delta$ is equal to the sample standard deviation computed over the total training set multiplied by the introduced interval deviation coefficient (IDC).
Step 3. Separating functions are computed for the standard C–B model and for the proposed model on the basis of the training subset.
Step 4. The classification accuracy for every decision strategy is determined on the basis of the testing subset in accordance with every strategy.

It should be noted that the accuracy measures are computed as average values over many random selections of the training and testing subsets from the data sets.
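The following sketch mirrors Steps 1–4 and the averaging described above. The callables train_cb (C–B model on interval centers), train_proposed (the interval-valued model), and accuracy (a chosen decision strategy) are hypothetical placeholders standing in for the models of the paper; only the splitting, the IDC-based interval generation, and the averaging loop are taken from the text.

```python
import numpy as np

def run_real_data_experiment(X, y, idc, train_cb, train_proposed, accuracy,
                             n_repeats=100, n_train=40, seed=None):
    """Sketch of the experimental protocol for real data (Steps 1-4)."""
    rng = np.random.default_rng(seed)
    acc_cb, acc_int = [], []

    for _ in range(n_repeats):
        # Step 1: random split into a 40-example training subset and a testing subset.
        perm = rng.permutation(len(X))
        train_idx, test_idx = perm[:n_train], perm[n_train:]

        # Step 2: intervals for all examples; the largest shift M_delta is the
        # per-feature sample standard deviation of the data multiplied by the IDC.
        m_delta = idc * X.std(axis=0)
        shifts = rng.uniform(0.0, m_delta, size=X.shape)
        lower, upper = X - shifts, X + shifts

        # Step 3: both models are trained on the training subset.
        f_cb = train_cb((lower[train_idx] + upper[train_idx]) / 2.0)  # C-B model on centers
        f_int = train_proposed(lower[train_idx], upper[train_idx])    # proposed model

        # Step 4: accuracy on the interval-valued testing subset for the chosen strategy.
        acc_cb.append(accuracy(f_cb, lower[test_idx], upper[test_idx], y[test_idx]))
        acc_int.append(accuracy(f_int, lower[test_idx], upper[test_idx], y[test_idx]))

    return np.mean(acc_cb), np.mean(acc_int)
```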

The kernel parameter $\gamma$ for every data set is selected separately. The optimal values of $\gamma$ and the IDC providing the largest classification accuracy are shown in Table 8. The accuracy measures are shown in Table 9.

5. Conclusion

A new OCC model dealing with interval-valued training data has been proposed in the paper. Numerous experiments have shown that the model outperforms the standard methods, especially when mainly large intervals of training data are available. The proposed model reduces to a finite set of simple linear programming problems whose solution presents no difficulties. Another advantage of the proposed model is that we can find "optimal" points of the intervals corresponding to the robust or maximin decision strategy.

At the same time, a bottleneck of the proposed model is the complexity of computing the extreme points $\alpha^{(1)}, \ldots, \alpha^{(T)}$ when $v$ is rather large. It is obvious that the value $T$ may be very large. If we compare this model with the similar model developed by Utkin et al. (2014), which is based on enumerating vertices of the polytopes produced by the intervals $A_1, \ldots, A_n$, then, in contrast to the model of Utkin et al. (2014), its use is efficient when the number of features is rather large and the number of examples is small. This is an important condition for using the model. Indeed, one can see that the number of extreme points $\alpha^{(1)}, \ldots, \alpha^{(T)}$ does not depend on the number of features. Returning to the model in Utkin et al. (2014), it strictly depends on both the number of features and the number of examples.

One of the ideas that allows us to arrive at simple linear problems is the use of the triangular kernel instead of the Gaussian one. However, we have to point out that the obtained optimization problems have many constraints due to the replacement of the absolute values that appear in the triangular kernel. Another idea for avoiding the absolute values is to apply the so-called Epanechnikov kernel, which can be regarded as a quadratic approximation. This idea leads to quadratically constrained linear programming problems. Efficient algorithms for solving these problems are a direction for further work. Another idea is to extend the proposed model and the use of the triangular kernel to the binary classification problem. This is also a direction for further research.
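For reference, the kernels discussed here are often written in the following forms; the scale parameters and norms below are illustrative and may differ from the definitions adopted earlier in the paper:

$$
K_{\text{Gauss}}(x, x') = \exp\bigl(-\gamma \lVert x - x' \rVert^2\bigr), \qquad
K_{\text{tri}}(x, x') = \max\Bigl(0,\ 1 - \tfrac{1}{r}\textstyle\sum_k \lvert x_k - x'_k\rvert\Bigr), \qquad
K_{\text{Epan}}(x, x') = \max\Bigl(0,\ 1 - \tfrac{1}{r^2}\lVert x - x' \rVert^2\Bigr).
$$

In such a form, the triangular kernel involves absolute values of feature differences, which, after the standard replacement by pairs of linear constraints, keeps the problems linear but enlarges the constraint set, while the squared norm in the Epanechnikov-type kernel produces quadratic constraints.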

Acknowledgments

The reported study was partially supported by RFBR, research project No. 15-01-01414-a. The authors would like to express their appreciation to the anonymous referees whose very valuable comments have improved the paper.

References

Angulo, C., Anguita, D., Gonzalez-Abril, L., & Ortega, J. A. (2008). Support vector machines for interval discriminant analysis. Neurocomputing, 71(7–9), 1220–1229.

Bartkowiak, A. M. (2011). Anomaly, novelty, one-class classification: A comprehensive introduction. International Journal of Computer Information Systems and Industrial Management Applications, 3, 61–71.

Beaumont, O. (1998). Solving interval linear systems with linear programming techniques. Linear Algebra and Its Applications, 281, 293–309.

Bhadra, S., Nath, J. S., Ben-Tal, A., & Bhattacharyya, C. (2009). Interval data classification under partial information: A chance-constraint approach. In T. Theeramunkong, B. Kijsirikul, N. Cercone, & T.-B. Ho (Eds.), Lecture notes in computer science: vol. 5476. Advances in knowledge discovery and data mining (pp. 208–219). Berlin Heidelberg: Springer.

Bicego, M., & Figueiredo, M. A. T. (2009). Soft clustering using weighted one-class support vector machines. Pattern Recognition, 42(1), 27–32.

Campbell, C. (2002). Kernel methods: a survey of current techniques. Neurocomputing, 48(1–4), 63–84.

Campbell, C., & Bennett, K. P. (2001). A linear programming approach to novelty detection. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, vol. 13 (pp. 395–401). MIT Press.

Carrizosa, E., Gordillo, J., & Plastria, F. (2007a). Classification problems with imprecise data through separating hyperplanes. Technical report MOSI/33. MOSI Department, Vrije Universiteit Brussel, September.

Carrizosa, E., Gordillo, J., & Plastria, F. (2007b). Support vector regression for imprecise data. Technical report MOSI/35. MOSI Department, Vrije Universiteit Brussel, October.

Chandola, V., Banerjee, A., & Kumar, V. (2007). Anomaly detection: a survey. Technical report TR 07-017. Minneapolis, MN, USA: University of Minnesota.

Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys, 41(3), 1–58.

Chavent, M. (2004). A Hausdorff distance between hyper-rectangles for clustering interval data. In Classification, clustering, and data mining applications (pp. 333–339). Berlin Heidelberg: Springer.

Chavent, M., de Carvalho, F. de A. T., Lechevallier, Y., & Verde, R. (2006). New clustering methods for interval data. Computational Statistics, 21(2), 211–229.

Cherkassky, V., & Mulier, F. M. (2007). Learning from data: concepts, theory, and methods. UK: Wiley-IEEE Press.

de Souza, R. M. C. R., & de Carvalho, F. de A. T. (2004). Clustering of interval data based on city-block distances. Pattern Recognition Letters, 25, 353–365.

Do, Thanh-Nghi, & Poulet, F. (2005). Kernel methods and visualization for interval data mining. In International symposium on applied stochastic models and data analysis, Vol. 5 (pp. 345–355).

Ferecatu, M., Boujemaa, N., & Crucianu, M. (2008). Semantic interactive image retrieval combining visual and conceptual content description. ACM Multimedia Systems Journal, 13(5–6), 309–322.

Fleuret, F., & Sahbi, H. (2003). Scale-invariance of support vector machines based on the triangular kernel. In 3rd international workshop on statistical and computational theories of vision.

Forghani, Y., & Yazdi, H. S. (2014). Robust support vector machine-trained fuzzy system. Neural Networks, 50, 154–165.

Frank, A., & Asuncion, A. (2010). UCI machine learning repository.

Ghaoui, L. E., Lanckriet, G. R. G., & Natsoulis, G. (2003). Robust classification with interval data. Technical report no. UCB/CSD-03-1279. Berkeley, California: University of California.

Hao, P.-Y. (2009). Interval regression analysis using support vector networks. Fuzzy Sets and Systems, 60, 2466–2485.

Hodge, V. J., & Austin, J. (2004). A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2), 85–126.

Ishibuchi, H., Tanaka, H., & Fukuoka, N. (1990). Discriminant analysis of multi-dimensional interval data and its application to chemical sensing. International Journal of General Systems, 16(4), 311–329.

Khan, S. S., & Madden, M. G. (2010). A survey of recent trends in one class classification. In L. Coyle, & J. Freyne (Eds.), Lecture notes in computer science: vol. 6206. Artificial intelligence and cognitive science (pp. 188–197). Berlin/Heidelberg: Springer.

Kwok, J. T., Tsang, I. W.-H., & Zurada, J. M. (2007). A class of single-class minimax probability machines for novelty detection. IEEE Transactions on Neural Networks, 18(3), 778–785.

Li, Y. (2011). Selecting training points for one-class support vector machines. Pattern Recognition Letters, 32(11), 1517–1522.

Lima Neto, E. A., & de Carvalho, F. A. T. (2008). Centre and range method to fitting a linear regression model on symbolic interval data. Computational Statistics and Data Analysis, 52, 1500–1515.

Manevitz, L. M., & Yousef, M. (2001). One-class SVMs for document classification. Journal of Machine Learning Research, 2, 139–154.

Markou, M., & Singh, S. (2003). Novelty detection: a review – part 1: statistical approaches. Signal Processing, 83(12), 2481–2497.

Musdholifah, A., & Hashim, S. Z. M. (2013). Cluster analysis on high-dimensional data: A comparison of density-based clustering algorithms. Australian Journal of Basic and Applied Sciences, 7(2), 380–389.

Nivlet, P., Fournier, F., & Royer, J.-J. (2001). Interval discriminant analysis: An efficient method to integrate errors in supervised pattern recognition. In Second international symposium on imprecise probabilities and their applications (pp. 284–292).

Pedrycz, W., Park, B. J., & Oh, S. K. (2008). The design of granular classifiers: A study in the synergy of interval calculus and fuzzy sets in pattern recognition. Pattern Recognition, 41(12), 3720–3735.

Pelckmans, K., De Brabanter, J., Suykens, J. A. K., & De Moor, B. (2005). Handling missing values in support vector machine classifiers. Neural Networks, 18(5–6), 684–692.

Robert, C. P. (1994). The Bayesian choice. New York: Springer.

Sahbi, H. (2007). Kernel PCA for similarity invariant shape recognition. Neurocomputing, 70(16–18), 3034–3045.

Scholkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471.

Scholkopf, B., & Smola, A. J. (2002). Learning with kernels: support vector machines, regularization, optimization, and beyond. Cambridge, Massachusetts: The MIT Press.

Scholkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., & Platt, J. (2000). Support vector method for novelty detection. In Advances in neural information processing systems (pp. 526–532).

Schollmeyer, G., & Augustin, T. (2013). On sharp identification regions for regression under interval data. In F. Cozman, T. Denœux, S. Destercke, & T. Seidenfeld (Eds.), Proceedings of the eighth international symposium on imprecise probability: theories and applications (pp. 285–294). Compiègne: SIPTA.

Silva, A., & Brito, P. (2006). Linear discriminant analysis for interval data. Computational Statistics, 21, 289–308.

Smola, A. J., & Scholkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14, 199–222.

Steinwart, I., Hush, D., & Scovel, C. (2005). A classification framework for anomaly detection. Journal of Machine Learning Research, 6, 211–232.

Tax, D. M. J., & Duin, R. P. W. (1999). Support vector domain description. Pattern Recognition Letters, 20(11), 1191–1199.

Tax, D. M. J., & Duin, R. P. W. (2004). Support vector data description. Machine Learning, 54(1), 45–66.

Utkin, L. V. (2014). A framework for imprecise robust one-class classification models. International Journal of Machine Learning and Cybernetics, 5(3), 379–393.

Utkin, L. V., Zhuk, Y. A., & Chekh, A. I. (2014). A robust one-class classification model with interval-valued data based on belief functions and minimax strategy. In P. Perner (Ed.), Lecture notes in computer science: vol. 8556. Machine learning and data mining in pattern recognition (pp. 107–118). Springer.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Wang, J., Lu, H., Plataniotis, K. N., & Lu, J. (2009). Gaussian kernel optimization for pattern classification. Pattern Recognition, 42(7), 1237–1247.

Zhang, L., & Zhou, W.-D. (2013). 1-norm support vector novelty detection and its sparseness. Neural Networks, 48, 125–132.