

Convex and Concave Hulls for Classification with

Support Vector Machine

Asdrúbal López Chau1,2, Xiaoou Li1, Wen Yu3,∗

1Departamento de Computacion, CINVESTAV-IPN

Mexico City, 07360, Mexico

2Universidad Autonoma del Estado de Mexico (UAEM)

Zumpango, Estado de Mexico, 55600, Mexico

3Departamento de Control Automatico, CINVESTAV-IPN

Mexico City, 07360, Mexico, [email protected]

∗Corresponding author

Abstract

The training of a support vector machine (SVM) has a time complexity of O(n^3), where n is the number of training data, so standard SVM algorithms are not suitable for the classification of large data sets. The convex hull can simplify SVM training; however, the classification accuracy becomes lower when there are inseparable points. This paper introduces a novel method for SVM classification, called the convex-concave hull. After grid pre-processing, the convex hull and the concave (non-convex) hull are found by the Jarvis march method. The vertices of the convex-concave hull are then used for SVM training. The proposed convex-concave hull SVM classifier has distinctive advantages in dealing with large data sets while keeping high accuracy. Experimental results demonstrate that our approach has good classification accuracy while the training is significantly faster than other training methods.

1 Introduction

Support Vector Machine (SVM) is a popular classification method in machine learning [30]. It minimizes the empirical classification error while maximizing the geometric margin, yielding a hyperplane with the largest separation (or margin) between the two classes [24]. However, such a maximum-margin hyperplane may not exist because of class overlapping or mislabeled examples. The soft margin SVM, by introducing slack variables, can find a hyperplane that splits the examples as cleanly as possible. In order to find a separating hyperplane, it is necessary to solve a Quadratic Programming Problem (QPP). This is computationally expensive: it has O(n^3) time and O(n^2) space complexity for n data [24], so the standard SVM is infeasible for large data sets [25].

Many researchers have tried to find methods to apply SVM classification to large data sets. These methods can be divided into five types: a) reducing the training data set (data selection), b) using geometric properties of SVM, c) modifying the SVM classifier, d) decomposition, and e) other methods. The data selection methods choose the objects which are likely to be support vectors (SVs) [20]; only these data are used for SVM training. Generally, the number of support vectors is very small compared with the whole data set. Clustering is another effective tool to reduce the data set size, for example hierarchical clustering [33] and parallel clustering [26]. The sequential minimal optimization (SMO) [25] breaks the large QP problem into a series of smallest possible QP problems. The projected conjugate gradient (PCG) chunking scales somewhere between linear and cubic in the training set size [8].

The geometric properties of SVM can also be used to reduce the training data. In the separable case, finding the maximum-margin hyperplane is equivalent to finding the nearest points of the convex hulls of the two classes [2]. Neighborhood properties of the objects can be applied to detect SVs. In [18] an active learning model of sample selection is proposed, based on the measurement of neighborhood entropy. In [19] an algorithm is proposed that selects the patterns in the overlap region around the decision boundary. The method presented in [4] follows this trend, but it uses fuzzy C-means clustering to select samples on the boundaries of the class distribution. The most popular geometric methods are the nearest points problem (NPP) and the convex hull.

NPP was first reformulated to solve SVM classification in [18]. Data which are close to the opposite class have a high probability of being SVs; the Euclidean distances between these data and the opposite class are computed to find the shortest distance. The problem with this method is that its computation time is O(n^2). Gilbert's algorithm [12] is one of the first algorithms for solving the minimum norm problem (MNP) in NPP. The MDM algorithm suggested by Mitchell et al. [22] works faster than Gilbert's algorithm. The NPP method has also been extended to solve the SMO-SVM optimization problem in image processing [19]. Since the support vectors are usually located at or near local extrema [1], using extreme examples from the full training set can reduce the training set [14].

The key problem is how to find the border of a data set. The K-nearest neighbor (KNN) method is one good option. The local data distribution and local geometric properties were applied in [31][20]. The minimum enclosing ball is applied to SVM classification in [4]. Approximately optimal solutions of SVM are obtained by the core vector machine (CVM) in [29]. The Schlesinger-Kozinec (SK) algorithm [28] uses the geometric interpretation of SVM and obtains a maximal margin classifier with a given precision for separable data.

The convex hull has been applied to SVM training in [1][2][21]. There are several algorithms to compute the convex hull of a finite point set. Graham scan [13] finds all vertices of the convex hull along its boundary by computing the direction of the cross product of two vectors. Jarvis' march (gift wrapping) [17] identifies the convex hull by angle comparisons, wrapping a string around the point set. The divide-and-conquer method [27] is applicable to the three-dimensional case. The incremental convex hull [23] and quick hull [9] give methods to eliminate useless points.

In many real applications the data are not perfectly separated, and the kernel method is not powerful enough for a nonlinear separation. In these cases, the closest points of the convex hulls are no longer support vectors. The soft margin SVM can be applied directly; however, with the slack variables of the soft margin optimization, the penalty parameter C affects the optimal performance of the SVM, and the optimization becomes a trade-off between a large margin and a small error penalty [24].

Another approach is to use the reduced convex hull [21]. The intersection disappears by reducing the convex hulls, so the inseparable case becomes separable. The main disadvantage of the reduced convex hull is that the convex hull has to be recalculated in each reduction step [2]. The ν-SVM [6] can make the intersection set empty by the choice of its parameters; it is similar to the reduced convex hull, but the computation process is complex.

The concave-convex procedure (CCCP) [32] separates the energy function into a convex function and a concave (non-convex) function. By using a non-convex loss function it forms a non-convex SVM. However, some good properties of SVM, for example the maximum margin, cannot be guaranteed [7], because the intersection parts of the data sets do not satisfy the convexity conditions.

In this paper we propose a new algorithm, called the convex-concave hull, to search for the points that lie on the outer boundaries of a set of points. Using the Jarvis march method, we first compute the convex hulls; then the concave hulls are created starting from the vertices of the convex hull. In this way, the misleading points in the intersection become vertices of the concave hulls. The classification accuracy is increased compared with other geometric methods, such as the reduced convex hull SVM (RCHSVM) and ν-SVM, and the training time is decreased considerably. The experimental results show that the accuracy obtained by our convex-concave hull approach is very close to that of classic SVM methods, while the training time is significantly shorter.

2 Convex and Concave Hull

2.1 Convex Hull

The convex hull (CH) of a set of points X is the minimum convex set that contains X. Mathematically, CH(X) is defined as

$CH(X) = \{\, w = \sum_{i=1}^{n} a_i x_i \mid a_i \ge 0,\ \sum_{i=1}^{n} a_i = 1,\ x_i \in X \,\}$   (1)

For SVM classification, a data set X has the general form

$X = \{\, (x_i, y_i) \mid x_i \in \mathbb{R}^d,\ y_i \in \{+1, -1\} \,\}$   (2)

where d is the number of features and $y_i$ is the label of the object $x_i$.

Let us create two subsets $X^+$ and $X^-$ with $X = X^+ \cup X^-$:

$X^+ = \{\, (x_i, y_i) \mid x_i \in \mathbb{R}^d,\ y_i = +1 \,\}$   (3)

$X^- = \{\, (x_i, y_i) \mid x_i \in \mathbb{R}^d,\ y_i = -1 \,\}$   (4)

If $CH(X^+) \cap CH(X^-) = \emptyset$, there is no intersection and the data set is called linearly separable. The optimal separating hyperplane is determined by the closest pair of points that belong to $X^+$ and $X^-$ [1][2]. The problem can be stated as

$\min \ \big\| \sum_{i=1}^{M} \alpha_i x_i - \sum_{j=1}^{L} \alpha_j x_j \big\|^2$
subject to $\sum_{i=1}^{M} \alpha_i = 1,\ \sum_{j=1}^{L} \alpha_j = 1,\ \alpha_i \ge 0,\ \alpha_j \ge 0$   (5)

where $x_i \in X^+$, $x_j \in X^-$, and $M = |X^+|$ and $L = |X^-|$ are the sizes of the two sets.

In this linearly separable case, the "hard" optimal separating hyperplane is normal to the line which joins the two closest points, see Figure 1.
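As a side illustration (not part of the paper's algorithms), the convex hull vertices of each class, as used in (1)-(5), can be obtained with scipy's Quickhull implementation. The following is a minimal Python sketch on toy data; names and data are illustrative assumptions.

import numpy as np
from scipy.spatial import ConvexHull

def class_hull_vertices(X, y, label):
    """Vertices of the convex hull of the points of one class, cf. eqs. (1)-(4)."""
    P = X[y == label]                    # points of the class, shape (m, 2)
    return P[ConvexHull(P).vertices]     # extreme points V_CH, a subset of P

rng = np.random.default_rng(0)           # toy, partially overlapping classes
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(1.5, 1.0, (200, 2))])
y = np.array([+1] * 200 + [-1] * 200)

V_pos = class_hull_vertices(X, y, +1)    # vertices of CH(X+)
V_neg = class_hull_vertices(X, y, -1)    # vertices of CH(X-)
print(len(V_pos), len(V_neg))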

In most cases, the data are not linearly separable: the convex hulls intersect each other, see Figure 2, and (5) has no solution. It is then necessary to map the objects into a higher-dimensional space and to allow some misclassification. The separating hyperplane is "soft". The soft margin separating hyperplane is shown in Figure 2.


Figure 1: Linearly separable case

2.2 Concave Hull

In the non-linearly separable case, most geometric methods have to shrink or reduce the convex hulls to avoid the overlapping. In our method, the convex hull is not changed; instead, we detect the objects that are close to the exterior boundaries. In order to detect these objects, we use a concave hull. The border points B(X) of a set $X \subset \mathbb{R}^2$ are the vertices of a "convex-concave hull" if they satisfy the following properties:

1. The vertices of the convex-concave hull B(X) are the vertices of the convex hull (CH) together with the points that are close to the edges of CH:

$B(X) = V_{CH(X)} \cup Close_{CH}$   (6)

where $V_{CH(X)}$ is the vertex set of the convex hull and $Close_{CH}$ is the set of points close to the edges of the convex hull. The "closeness" to an edge of CH(X) differs from point to point, so there is no unique set of border points.

2. B(X) is not a convex hull. A set $S \subset \mathbb{R}^2$ is said to be convex if

for $x_1, x_2 \in S$: $\alpha x_1 + (1-\alpha) x_2 \in S$ for all $\alpha \in (0, 1)$   (7)

Generally, a polygon defined by (6) does not satisfy (7).

3. The border points must be part of the data set:

$B(X) \subseteq X$   (8)

It can be proved that (8) implies

$\max |B(X)| = |X| \quad \forall X$   (9)


Figure 2: Linearly inseparable case

where |X| refers to the number of elements in set X.

4. The minimum size of B(X) is defined by

$\min |B(X)| = |V_{CH(X)}|$   (10)

In the following subsection we describe a method to detect the vertices of a convex-

concave hull.

2.3 Searching the vertices of a Convex-Concave Hull

The first step is to find the extreme points by computing CH(X) from a given set $X \subset \mathbb{R}^2$. For any two adjacent vertices $v_i$ and $v_j$ of CH(X), it is clear that all other points of X must be located on one side of the line through $v_i$ and $v_j$. If an edge E is formed by two adjacent vertices $v_i$ and $v_j$ of CH(X), these vertices are used as reference points to detect the concave hull B(X). The procedure is summarized as follows:

1. Compute the convex hull CH(X).

2. Search for the points which are close to each edge of CH(X).

3. Repeat Steps 1 and 2 until all local neighbors have been reached.

These steps are realized by the following algorithm.

Algorithm 1 (Convex-Concave Hull Searching Scheme)

Input:
  X: set of points
  K: number of candidates
Output:
  B(X): convex-concave hull

Begin algorithm
  B(X) ← ∅   // the set of vertices begins empty
  Compute CH(X); let its vertices be v_1, ..., v_n
  for each pair of adjacent vertices (v_i, v_j) do
    θ ← angle defined by (v_i, v_j), computed as $\cos^{-1}\!\big( \frac{\vec{v_i} \cdot \vec{v_j}}{\|\vec{v_i}\| \, \|\vec{v_j}\|} \big)$
    // detect the points "close" to the edge (v_i, v_j)
    CloseCH ← Algorithm 2 applied to (X, (v_i, v_j), K, θ)
    B(X) ← B(X) ∪ CloseCH
  end for
End algorithm

We use Algorithm 2 to search for the points nearest to the edges of CH(X). It works as follows. Given a set of points X and two reference points $v_i$, $v_j$, the algorithm begins at the point $v_i$ and finds its K nearest neighbors. Among these neighbors, we search for the point $v_k$ such that $\vec{v_i v_k}$ has the smallest angle with respect to the line $\vec{v_i v_j}$:

$\theta_{\min} = \min_k \left[ \cos^{-1}\!\left( \frac{\vec{v_i v_k} \cdot \vec{v_i v_j}}{\|\vec{v_i v_k}\| \, \|\vec{v_i v_j}\|} \right) \right]$   (11)

This procedure is repeated until $v_k = v_j$, see Figure 3, which shows an example with K = 3.

Figure 3: Algorithm 2 searches for points that are close to an edge of CH(X)

Algorithm 2 (Search the nearest points)

Input:
  X: a set of points
  S: two adjacent vertices {v_i, v_j} of CH(X)
  K: number of neighbors
  θ: the previous angle between adjacent vertices
Output:
  CloseCH(X): the set of points close to the edge defined by the adjacent vertices {v_i, v_j}

Begin algorithm
  currentPoint ← v_i;
  stopPoint ← v_j;
  CloseCH(X) ← {v_i};
  while currentPoint ≠ v_j do
    candidates ← the K nearest points to currentPoint;
    Compute the angle of every p_i ∈ candidates with respect to the angle θ;
    Sort candidates by increasing angle;
    for each point p_i ∈ candidates do
      Define L as the line segment between p_i and currentPoint;
      if L does not intersect the convex-concave hull then
        CloseCH(X) ← CloseCH(X) ∪ {p_i};
        exit this loop;
      end if
    end for
    if no point was added in the last loop then
      CloseCH(X) ← CloseCH(X) ∪ {v_j};
      return CloseCH(X);
    end if
    Remove currentPoint from X;
    Update θ using the two most recently added points of CloseCH(X);
    if currentPoint = v_j then
      return CloseCH(X);
    end if
    currentPoint ← p_i;
  end while
End algorithm

Figure 4: Algorithm 2 with different K: (a) K = 9, (b) K = 4, (c) K = 6

Figure 4 shows examples of the convex-concave hull B(X) generated via Algorithm 1 and Algorithm 2. It is possible to obtain a different set B(X) by changing the parameter K of Algorithm 2. In Figure 4, (a) K = 9, (b) K = 4, (c) K = 6. The greater the value of K, the more similar B(X) is to the convex hull.
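For concreteness, the following is a minimal Python sketch of the search performed by Algorithms 1 and 2, assuming numpy and scipy are available. It simplifies Algorithm 2 in two ways: the angle of (11) is always measured towards the second extreme point rather than updated from the previously added edge, and the segment-intersection test is omitted. The function names (concave_edge, convex_concave_hull) are illustrative, not from the paper.

import numpy as np
from scipy.spatial import ConvexHull

def angle_to(direction, p, origin):
    """Angle, as in eq. (11), between the vector origin->p and the reference direction."""
    v = p - origin
    c = np.dot(v, direction) / (np.linalg.norm(v) * np.linalg.norm(direction) + 1e-12)
    return np.arccos(np.clip(c, -1.0, 1.0))

def concave_edge(X, vi, vj, K):
    """Walk from vi towards vj through K nearest neighbours, hugging the local boundary."""
    border, current, used = [vi], vi, {tuple(vi)}
    while True:
        remaining = [p for p in X if tuple(p) not in used]
        if not remaining:                       # no candidates left: close the edge at vj
            border.append(vj)
            return border
        remaining.sort(key=lambda p: np.linalg.norm(p - current))
        cand = remaining[:K]                    # the K nearest unused points
        if any(np.allclose(p, vj) for p in cand):
            border.append(vj)                   # the second extreme point is reached: stop
            return border
        direction = vj - current
        nxt = min(cand, key=lambda p: angle_to(direction, p, current))
        border.append(nxt)
        used.add(tuple(nxt))
        current = nxt

def convex_concave_hull(X, K=5):
    """Vertices B(X): convex hull vertices plus points close to each hull edge."""
    hull = ConvexHull(X)
    V = X[hull.vertices]                        # vertices of CH(X), in boundary order
    B = []
    for a, b in zip(V, np.roll(V, -1, axis=0)):  # adjacent vertex pairs of CH(X)
        B.extend(concave_edge(X, a, b, K))
    return np.unique(np.array(B), axis=0)

As in Figure 4, smaller values of K make the walk hug the point cloud more tightly, while larger values give a result closer to the convex hull.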

Remark 1 Algorithm 2 shows how the space between two consecutive extreme points is explored. This idea is similar to Jarvis' march [17], where a string is wrapped around the set of points. Jarvis' march considers all points, whereas Algorithm 2 uses only local neighbors. Algorithm 2 runs until the second extreme point in its argument is reached.

Remark 2 The convex-concave hull has the following advantages over the KNN concave hull [34]: 1) The starting and ending points of our algorithms are pre-set by the convex hull method, which ensures that all vertices of the convex hull are always among the vertices of the convex-concave hull. 2) The angles in [34] depend entirely on the previously computed ones; in our method, the extreme points are used to compute the angles, so it is easier to implement. 3) Our algorithm does not use recursive invocations. The stop point is added to the set of border points when there are no more points, so the search time is less than in [34].

2.4 Properties of Convex-Concave Hull

In this section, we present two useful properties of the convex-concave hull method. These properties can be used to formulate a simple pre-processing algorithm, which helps to find the exterior boundaries of the point set X quickly.

Property 1 The vertex set $V_{CH(X)}$ of the convex hull is a subset of the vertex set of the convex-concave hull B(X):

$V_{CH(X)} \subseteq B(X)$

The searching algorithm first computes CH(X) and then uses two extreme points $v_i$ and $v_j$ of CH(X) to search for the concave points of B(X). At the last iteration, B(X) contains all vertices of CH(X).

Remark 3 Property 1 ensures that the optimal separating hyperplane can be obtained in linearly separable cases if B(X) is used for SVM training.

Property 2 A superset of the convex-concave hull B(X) can be obtained by detecting the border points of partitions of X, i.e.,

$\bigcup_i B(X^i) \supseteq B(X) \supseteq CH(X)$

where the $X^i$ are partitions of X with $\bigcap_i X^i = \emptyset$ and $\bigcup_i X^i = X$.

Let us create j partitions of the input space $X \subset \mathbb{R}^2$, i.e., $\bigcap_i X^i = \emptyset$ and $\bigcup_i X^i = X$, and suppose each partition $X^i$ contains at least three points of X. It is not difficult to see that the vertex set $V_{CH(X)}$ of the convex hull is a subset of $\bigcup_i V_{CH(X^i)}$. For two partitions with convex hulls $CH(X^1)$ and $CH(X^2)$, it is clear that

$V_{CH(X)} \subset V_{CH(X^1)} \cup V_{CH(X^2)}$

Applying Algorithm 1 and Algorithm 2 on $X^1$ and $X^2$, two convex-concave hulls $B(X^1)$ and $B(X^2)$ are generated. Obviously, their union is a superset of the convex-concave hull of X, i.e.,

$B(X) \subset B(X^1) \cup B(X^2)$

Figure 5 illustrates this property.

Figure 5: A superset of B(X) can be obtained by applying Algorithms 1 and 2 on partitions of X

Remark 4 All extreme points of X must be included in B(X). If the convex-concave hull search algorithm is applied on the disjoint partitions of X, the resulting data set must contain all the vertices of the convex-concave hull of X.

3 SVM Classification via Convex-Concave Hull

3.1 Data Set Pre-process

The convex-concave hull searching scheme behaves well if the data set X has a uniform distribution of elements in a region. However, the distribution of a data set is unknown in advance, and computing the convex-concave hull B(X) directly from X is not appropriate for training the SVM classifier. In order to avoid this density problem, we separate X into $X^+$ and $X^-$ and then create partitions. The basic idea is to obtain regions where the distribution is more uniform than in the original set. Property 2 allows us to claim that all vertices of the convex hull are included in the vertices of the convex-concave hull,

$V_{CH(X)} \subseteq V_{B(X)}$

In addition, the points in the intersection of the convex hulls are also included in B(X). This is helpful for linearly inseparable cases of SVM classification. Property 1 guarantees that all vertices of the convex hull are also included. Figure 6 shows the general steps for applying our method to training sets.

Figure 6: General process of the method

The partition of the input space avoids a bottleneck in the border point detection. The subsets $P_i$ can be created quickly by inserting all points into a binary tree of height h and then using the leaves as a reduced version of the original points. Once all points have been inserted into the binary tree, it is possible to look down from a height $h_g < h$ of the tree and take all leaves below a node as a subset $P_i$.

We propose a grid method to pre-process the data set; Figures 7 and 8 show the data set pre-processing. It includes:

1) The partitions $P_i$ are created using a grid G composed of cells $c_l$. Each $P_i$ consists of some adjacent cells.

2) The convex-concave hull search is applied repeatedly on each $P_i$. This recovers the border points of each group of cells.

3) The points recovered in the previous step are joined.

The first step of the proposed method consists of mapping all samples of the data set X into a grid G, which quantizes the input space into a finite number of cells that form the grid.

Figure 7: Partition of the input space using a grid

Figure 8: Binary tree represents the grid

We use four parameters to control the grid process: $h_g$, $h_a$, $h_b$ and K. Here $h_g$ is the granularity of the grid, i.e., the height of the binary tree; $h_a$ is used to create the partitions in higher dimensions; the binary tree is traversed down to the height $h_b$; and K controls the number of neighbor points which are close to the edges of the convex hull. The bigger the granularity, the finer the quantization, but the more memory is required to store the grid. The mapping from X into G also maintains a set of non-repeated points, because repeated points are mapped into the same cells; this helps to reduce the original number of points.
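A minimal Python sketch of this quantization step, assuming numpy: the coordinates are mapped onto a grid of $2^{h_g}$ cells per axis and one representative point is kept per occupied cell. This stands in for the binary-tree grid described above; grid_reduce is an illustrative name, not from the paper.

import numpy as np

def grid_reduce(X, hg):
    """Quantize 2-D points into a 2^hg x 2^hg grid and keep one representative
    point per occupied cell (repeated points collapse into the same cell)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    cells = np.floor((X - lo) / (hi - lo + 1e-12) * (2 ** hg)).astype(int)
    reps = {}
    for cell, x in zip(map(tuple, cells), X):
        reps.setdefault(cell, x)           # first point seen in a cell represents it
    return np.array(list(reps.values()))

The partitions $P_i$ can then be formed by grouping adjacent cells, as described above.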

Now we use two examples to show how our grid method works. Figures 9 and 10 show the results of the pre-processing with granularity 9. Figure 9 shows the result of applying the algorithms on the whole point set ($h_g = 0$), and Figure 10 shows the result with $h_g = 1$. The border point joining is realized by the union set operation.

Figure 9: Example with granularity hg = 0

3.2 SVM Classification

SVM classification problems can be grouped into two types: linearly separable and linearly inseparable cases. A nonlinear separation can be transformed into the linear case via a kernel mapping. In the linearly separable case

$CH(X^+) \cap CH(X^-) = \emptyset$

where $X^+$ and $X^-$ denote the sets of elements of X with label +1 and −1, respectively.

Figure 10: Example with granularity hg = 1

It has been demonstrated that if the data set is linearly separable, then the support vectors must be vertices of $CH(X^+)$ and $CH(X^-)$ [1]. By Property 1, our convex-concave hull can successfully deal with this case, see Figure 11.

In the linearly inseparable case, the convex hulls intersect. Convex hull based methods then do not work well for SVM, because the SVs are generally located on the exterior boundaries of the data distribution, see Figure 12. The vertices of the convex-concave hull, on the other hand, are the border points, and they are likely to be the support vectors [34].

Figure 11: Linearly separable case, hg = 0

Figure 12: Linearly inseparable case, hg = 0

After the vertices of the convex-concave hull $B(X) = V_{CH(X)} \cup V_{NCH(X)}$ have been detected, they are used to train an SVM classifier. Finding an optimal hyperplane is equivalent to solving the following quadratic programming problem (primal problem):

$\min_{w,b} J(w) = \frac{1}{2} w^T w + C \sum_{k=1}^{n} \xi_k$
subject to: $y_k \left[ w^T \varphi(x_k) + b \right] \ge 1 - \xi_k$   (12)

where the slack variables $\xi_k > 0$, $k = 1, \dots, n$, tolerate misclassifications, $C > 0$ is the penalty parameter, $w^T \varphi(x_k) + b = 0$ is the separating hyperplane, and $\varphi(\cdot)$ is a nonlinear mapping. A kernel which satisfies the Mercer condition [30] is $K(x_k, x_i) = \varphi(x_k)^T \varphi(x_i)$. Problem (12) is equivalent to the following quadratic programming problem, the dual problem with Lagrange multipliers $\alpha_k \ge 0$:

$\max_{\alpha} J(\alpha) = -\frac{1}{2} \sum_{k,j=1}^{n} y_k y_j K(x_k, x_j) \alpha_k \alpha_j + \sum_{k=1}^{n} \alpha_k$
subject to: $\sum_{k=1}^{n} \alpha_k y_k = 0, \quad 0 \le \alpha_k \le C$   (13)

It is well known that most of the solutions $\alpha_k$ are zero, i.e., the solution vector is sparse. The data with non-zero $\alpha_k$ are called support vectors. Let V be the index set of the support vectors; then the optimal hyperplane is

$\sum_{k \in V} \alpha_k y_k K(x_k, x) + b = 0$   (14)

The resulting classifier is

$y(x) = \mathrm{sign}\left[ \sum_{k \in V} \alpha_k y_k K(x_k, x) + b \right]$

where b is determined by the Kuhn-Tucker conditions.

In our convex-concave hull SVM classification, the penalty factor C can be selected very small, because almost all misleading points are removed by the concave hull algorithm, and the classification accuracy is thereby improved. The other design parameters, such as the parameters of the kernel, are chosen by the grid search method [24].
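A minimal Python sketch of this training step, assuming scikit-learn and reusing the convex_concave_hull sketch of Section 2.3: the border points of both classes are stacked and fed to a standard soft-margin SVM with an RBF kernel. Parameter values are illustrative.

import numpy as np
from sklearn.svm import SVC

def train_cchsvm(X, y, K=5, C=1.0, gamma=0.9):
    """Train a soft-margin RBF SVM on the convex-concave hull vertices of each class."""
    B_pos = convex_concave_hull(X[y == +1], K)      # border points of the +1 class
    B_neg = convex_concave_hull(X[y == -1], K)      # border points of the -1 class
    X_red = np.vstack([B_pos, B_neg])
    y_red = np.hstack([np.ones(len(B_pos)), -np.ones(len(B_neg))])
    clf = SVC(C=C, kernel="rbf", gamma=gamma)       # solves the dual problem (13)
    return clf.fit(X_red, y_red)                    # decision rule of eq. (14)

Because only the reduced set $X_{red}$ enters the QP solver, the cubic training cost applies to $|B(X)|$ rather than to $|X|$, which is the source of the speed-up analyzed in Section 4.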

3.3 High-Dimension Case

Since the convex-concave hull searching algorithm of this paper must calculate angles, it works in a two-dimensional space. In order to extend our method to high-dimensional data sets, a dimension reduction is needed. Principal component analysis (PCA) is a common method to reduce the dimension of data sets, but it is not suitable for large training sets. The general approach for high-dimensional data is the following:

1) Partition the input space to create a number of partitions $X^i$. These partitions have a more uniform distribution of objects.

2) Select two features from $X^i$ to create a set $Y^i \subset \mathbb{R}^2$. The $Y^i$ can be viewed as $X^i$ with all of its features fixed except two.

3) Apply the convex-concave hull search on $Y^i$.

4) Explore the fixed features for more points.

5) Join the points recovered in the previous steps.

It is well known that computing the convex hull in more than three dimensions is computationally expensive. We avoid this problem by searching for the vertices of the convex-concave hull in several two-dimensional spaces, see Figure 13. In order to extend the convex-concave hull to more than two dimensions, we create partitions on the higher dimensions and temporarily reduce the dimension of the data set in several steps. For each dimension we create one-dimensional clusters, and the center of the corresponding cluster is fixed. The final two-dimensional subsets are formed with respect to the number of partitions of each dimension. We compute the border samples in these two-dimensional subsets and then map them back to their original dimension with respect to the fixed center values.

Figure 13: Partition in higher dimensions

The clustering algorithm is as follows.

Algorithm 3 (Clustering higher dimensions)

Input:
  X: data set
  m: index of a feature of X
Output:
  C: a set of one-dimensional clusters and centers of feature m

Begin algorithm
  $C_1 \leftarrow x_{m,1}$   /* the first cluster is the first arriving sample */
  $\bar{x}_1 \leftarrow x_{m,1}$
  for i = 1 to |X| do
    Use $d_{k,x} = \left( \sum_{r=1}^{n} \left[ \frac{x_r(k) - \bar{x}_j}{x_{r\max} - x_{r\min}} \right]^2 \right)^{1/2}$ to compute the distance from $x_{m,i}$ to the cluster $C_j$
    if $d(k, x_{m,i}) \le L$ then   // L is set to 0.10 times the range of feature m
      $x_{m,i}$ is kept in cluster $C_j$
      Update the center using $\bar{x}_{j,k+1} = \frac{k-1}{k} \bar{x}_{j,k} + \frac{1}{k} x_i(k)$
    else
      $C_{j+1} \leftarrow x_{m,i}$
      $\bar{x}_{j+1} \leftarrow x_{m,i}$
    end if
  end for
  /* merge clusters whose centers are within the required distance L */
  if $\sum_{r=1}^{n} (\bar{x}_p - \bar{x}_q)^2 \le L$ then
    The two clusters $C_p$ and $C_q$ are combined into one cluster; the center of the new cluster may be either of the two centers
  end if
  return C
End algorithm

The effect of this procedure in three dimensions is similar to slicing a cube. An example of the computed vertices of the convex-concave hull is shown in Figure 14.

Figure 14: Example on a toy data set with three dimensions

Algorithm 3 shows how to partition each dimension of a multi-dimensional data set. In this way the multi-dimensional data set is divided into several two-dimensional training data sets, and Algorithm 1 and Algorithm 2 are then used to search for the convex-concave hull. The main advantage of Algorithm 3 is that the training time becomes linear in the size of the training data.
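A minimal Python sketch of the one-dimensional clustering of Algorithm 3, assuming numpy; the threshold handling and the merge step are simplified, and cluster_feature is an illustrative name, not from the paper.

import numpy as np

def cluster_feature(X, m, frac=0.10):
    """One-dimensional online clustering of feature m, in the spirit of Algorithm 3.
    Returns a list of (center, member_indices) pairs."""
    v = X[:, m]
    L = frac * (v.max() - v.min() + 1e-12)           # threshold: 10% of the feature range
    clusters = []                                     # each item: [center, count, indices]
    for i, x in enumerate(v):
        if clusters:
            j = int(np.argmin([abs(x - c[0]) for c in clusters]))
            if abs(x - clusters[j][0]) <= L:          # close enough: join cluster j
                center, k, idx = clusters[j]
                clusters[j] = [(k * center + x) / (k + 1), k + 1, idx + [i]]  # running mean
                continue
        clusters.append([x, 1, [i]])                  # otherwise open a new cluster
    # merge clusters whose centers are within L of each other (center kept as-is)
    merged = []
    for c in sorted(clusters, key=lambda c: c[0]):
        if merged and abs(c[0] - merged[-1][0]) <= L:
            merged[-1][2] += c[2]
        else:
            merged.append(c)
    return [(c[0], c[2]) for c in merged]

Each returned index set fixes one slice of the data; pairing such slices across the fixed dimensions yields the two-dimensional subsets $Y^i$ on which Algorithms 1 and 2 are run.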

4 Performance analysis

It is not easy to analyze the complexity of a normal SVM algorithm precisely. For n input data, the training involves the multiplication of matrices of size n, which has a complexity between O(n^{2.3}) and O(n^{2.83}) in the worst case [11]. In this paper, we assume that the QP implementation of each SVM uses O(n^3) computation time and O(n^2) memory space [11].

4.1 Memory space

In the first step of our method, all examples in the training set X are mapped into the grid G by a binary tree with height $h_g$ (the granularity). Consider the following facts:

• Each example has d dimensions and the data type is double (8 bytes);

• Each internal node of the binary tree occupies about 40 bytes (two references to the child nodes plus internal variables).

The amount of memory needed to manage all examples in G is therefore

$2^{3+h_g} (d + 5) - 40$ bytes

where $h_g$ is the height of the binary tree and d is the number of features of the data set X.

This amount of memory is computed as follows. The maximum number of leaves in a binary tree of height $h_g$ is $2^{h_g}$; each leaf contains a vector of d features and each feature is stored in 8 bytes, so the memory used by the leaves is

$2^{h_g} \cdot 8 \cdot d = 2^{h_g} 2^3 d$

It is well known that the number of internal nodes of a binary tree of height $h_g$ is $2^{h_g} - 1$. Because each internal node in the implemented binary tree uses about 40 bytes, the memory used by the internal nodes is

$40 \cdot (2^{h_g} - 1)$

The total memory used by the binary tree is the sum over the leaves and the internal nodes, i.e.,

$2^{h_g} 2^3 d + 40 \cdot (2^{h_g} - 1) = 2^{h_g} 2^3 d + 2^{h_g} \cdot 40 - 40 = 2^{h_g} 2^3 d + 2^{h_g} 2^3 \cdot 5 - 40$

Grouping terms, we finally obtain $2^{3+h_g}(d+5) - 40$ bytes.
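As an illustrative calculation (not taken from the paper): with the granularity $h_g = 9$ used in Experiment 1 and d = 2 features, the grid needs at most $2^{3+9}(2+5) - 40 = 4096 \cdot 7 - 40 = 28\,632$ bytes, roughly 28 KB, whereas storing 100,000 two-feature examples directly takes $8 \cdot 100\,000 \cdot 2 = 1.6$ MB.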

This amount of memory is an upper bound, reached when all leaves of the binary tree exist. In practice more than one example is usually mapped into the same leaf of the tree, which reduces the amount of memory originally required to store the training set X, namely $8 \cdot |X| \cdot d$ bytes, with |X| = n the size of the training set.

The memory used by Algorithms 1 and 2 can be ignored, because it is small compared with that of the binary tree.


4.2 Computation time

In order to analyze the computational cost of the proposed method, we separate it into the following steps:

• Create the grid G via a binary tree. The computing time for creating G is that of inserting the examples into a binary tree, i.e., $O(n \log_2 n)$ with n = |X|.

• Detect the vertices of the convex-concave hulls in the partitions $Y_i$ by looking down the tree from height $h_a$ down to height $h_b$. Algorithm 1 uses the partition $Y_i$ to obtain the vertices of a convex hull; the cost is $O(|Y_i| \log_2 |Y_i|)$.

To simplify the analysis we consider a uniform distribution of examples; in this case the size of $Y_i$ can be approximated as $\frac{|X|}{2^{h_b}}$. For Algorithm 2, the worst case occurs when all examples of $Y_i$ are considered as vertices, but usually $Y_i$ has only a few elements, so the computing time for these few samples is small. In the general case, Algorithm 2 searches a small subset of $Y_i$ by the K-nearest neighbors (KNN) method. The computing time of the simple KNN is $O(d|X|^2)$; for our case it becomes $O\!\left( d \frac{|X|^2}{2^{2 h_b}} \right)$.

• The total computation time of the pre-processing steps is

$O(|X| \log_2 |X|) + O\!\left( \frac{|X|}{2^{h_b}} \log_2 \frac{|X|}{2^{h_b}} \right) + O\!\left( d \frac{|X|^2}{2^{2 h_b}} \right)$   (15)

• SVM training uses the QPP solver, which takes O(n^3) time and O(n^2) space on X. The number of vertices of the convex-concave hull is much greater than the number of vertices of the convex hull, $|B(X)| \gg |V_{CH(X)}|$; however $|B(X)| \ll |X|$, because not all points lie on the boundary. The training time using B(X) is therefore much smaller than using X, i.e.,

$O(|B(X)|^3) \ll O(|X|^3)$

5 Experiments

We used eight data sets to compare our algorithms with other methods. Four data sets are available from the UCI web site; the others are benchmark and synthetic examples. The data sets from UCI are Four-class, Skin-no-skin, Breast cancer and Haberman's survival. The benchmark data set is a modified Checkerboard (Figure 15); the synthetic data sets are Cross (Figure 16), Rotated-cross (Figure 17), and Balls aDb (Figure 18).

All of them are linearly inseparable. Table 1 summarizes the data sets used in this paper. The synthetic data set Balls aDb has "a" features and size "b" × 1000.

All experiments were run on a computer with the following features: Core i7 2.2 GHz processor, 8.0 GB RAM, Windows 7 operating system. The algorithms are implemented in the Java language, and the maximum amount of random access memory given to the Java virtual machine is set to 2.0 GB. The reported results correspond to 100 runs of each experiment. For each experiment, the training data are chosen randomly as 70% of the data set; the remaining data are used for testing.

Figure 15: Checkerboard data set

Our algorithms are compared with SMO [25], the LIBSVM library [5], and the reduced convex hull SVM (RCHSVM) [21]. SMO and LIBSVM are trained with the original data set, while our convex-concave hull SVM (CCHSVM) uses the border points detected by the proposed algorithms.

The kernel used in all experiments is the radial basis function

$f(x, z) = \exp\!\left( \frac{-(x - z)^T (x - z)}{2\gamma^2} \right)$   (16)

where γ was selected using the grid search method.
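A minimal Python sketch of this parameter selection, assuming scikit-learn and a toy training set; note that scikit-learn's gamma is the coefficient multiplying $\|x - z\|^2$ in the exponent, i.e. $1/(2\gamma^2)$ in the notation of (16), and the grids shown are illustrative, not those of the paper.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y_train = np.array([+1] * 100 + [-1] * 100)

param_grid = {"C": [0.5, 1.0, 3.0, 10.0], "gamma": [0.1, 0.5, 1.0, 2.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)           # the chosen (C, gamma) pair for the final model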


Figure 16: Cross data set

Figure 17: Rotated-cross data set


Figure 18: Balls data set

Table 1: Data sets used in the experiments

Data set Features Size yi = +1 yi = −1 classes

Four-class 2 862 307 555 2

Checkerboard50 2 50,000 13,000 12,000 2

Cross 2 90,000 50,000 40,000 2

Rotated-cross 2 90,000 50,000 40,000 2

Skin-no-Skin 3 245,057 50,859 194,198 2

Haberman’s Survival 3 306 225 81 2

Balls aDb a b b/2 b/2 2

Breast cancer 9 286 201 85 2


5.1 Experiment 1: Size of training set

In this experiment, we use the checkerboard data set [15], which is commonly used for evaluating SVM implementations. It is also useful to illustrate the scaling properties of our algorithms.

We first examine how the training data size affects the training time and the classification accuracy of our CCHSVM. We use 500, 1,000, 2,000, 5,000, 10,000, 50,000 and 100,000 examples to train CCHSVM and LibSVM. The comparison results over 100 runs of the experiments are shown in Table 2, where the columns Avg Acc and Avg training time are the averages of accuracy and training time respectively, and the column stdDev is the standard deviation of the accuracy. In this experiment, the values γ = 0.9 and C = 1.0 are chosen using the grid search method. The parameters of the CCHSVM algorithm defined in Section 3 are K = 50, hg = 9, ha = 4 and hb = 8, which are also selected via a grid method.

We can see that our CCHSVM has a smaller training time than LibSVM when the size of the training set is large (more than 5,000). For small data sets, LibSVM is better than our method. The classification accuracy of our method is slightly lower than that of LibSVM. The size of the training set matters when the set contains thousands of examples, and the pre-processing reduces this size.

When the data size is increased, the training time of LibSVM grows quickly, while that of our method increases only a little. Although the classification accuracy cannot be improved significantly when the data size is very large, it does not get worse, and the testing accuracy is still acceptable.

5.2 Experiment 2: Parameters

The CCHSVM method has four parameters: hg, ha, hb and K. The parameter hg controls the granularity of the grid, i.e., the height of the binary tree. The parameter ha is used to create the partitions in higher dimensions. The binary tree is traversed down to the height hb. The parameter K controls the number of neighbors used to detect border points close to the edges of the convex hull.

We used the synthetic Balls3D40 data set (40,000 examples) to analyze the effect of the parameters of the algorithm. The LibSVM method was used as a reference. Table 3 shows the results. In this experiment, γ = 1.9 and C = 2.5, obtained from the grid method.

The most important effect of the parameter K is to reduce the data size; K does not contribute much to the classification accuracy or the running time.


Table 2: Classification results for checkerboard

Method   Data set size ×1000   Avg training time (ms)   Avg Acc %   stdDev %

LibSVM 0.5 16.66 85.04 0.27

CCHSVM 0.5 28.13 85.32 0.19

LibSVM 1 46.36 91.34 0.27

CCHSVM 1 61.86 91.01 0.19

LibSVM 2 108.10 98.31 0.27

CCHSVM 2 156.46 94.90 0.19

LibSVM 5 362.00 97.12 0.12

CCHSVM 5 336.80 96.35 0.15

LibSVM 10 1023.40 98.27 0.33

CCHSVM 10 792.80 97.15 0.27

LibSVM 50 15,692.93 99.28 0.07

CCHSVM 50 5,185.00 98.60 0.27

LibSVM 100 47,670.80 99.58 0.05

CCHSVM 100 13,434.81 98.32 0.15

The parameters hg and ha have an impact on the training time and accuracy. The higher the value of hg, the deeper the binary tree and the greater the number of cells in the grid. The value of ha affects the number of partitions; a higher value of this parameter involves a larger number of searches for boundary points. The accuracy is degraded when ha is low, because many support vectors are omitted.

5.3 Experiment 3: Comparison with other methods

In this experiment we compare our method with SMO, LibSVM and RCHSVM. We use four benchmark data sets and three synthetic ones. The Balls3D100 data set contains 100,000 examples and has three features. Both the Cross and the Rotated-cross data sets have two features and 100,000 examples.

In this experiment, the parameters are selected using the grid search method. The values for the first data set are: LibSVM γ = 0.68 and C = 1.0, SMO γ = 0.68 and C = 1.0, RCHSVM γ = 0.1 and C = 0.5, CCHSVM γ = 0.9 and C = 3.0.


Table 3: Effect of parameters

Method   ha   hb   hg   K   Avg size reduced   Avg training time (ms)   Avg Acc %   stdDev %

LibSVM - - - - - 1,164,500.00 99.69 0.13

CCHSVM 4 7 12 50 9,277.04 23,443.00 99.83 0.24

CCHSVM 4 7 12 5 9,992.02 23,790.00 99.83 0.20

CCHSVM 4 7 8 50 9,244.01 19,945.00 99.75 0.26

CCHSVM 5 9 12 5 17,585.01 90,130.00 99.92 0.19

CCHSVM 5 9 12 500 8,460.05 20,130.00 99.71 0.22

CCHSVM 4 6 6 100 7,760.80 9,898.20 99.73 0.24

CCHSVM 4 7 8 100 7,920.10 13,260.00 99.76 0.16

CCHSVM 4 5 7 100 4,245.12 5,815.00 92.35 0.19

CCHSVM 4 5 8 100 4,287.33 8,656.60 91.97 0.20

CCHSVM 3 6 8 100 4,500.40 3,079.40 94.55 0.19

CCHSVM 3 6 8 300 4,395.40 3,296.81 92.91 0.53

The parameters of our algorithm in this experiment are K = 5, hg = 10, ha = 5 and hb = 9. For the Skin-NoSkin data, they are K = 7, hg = 12 and ha = 9.

The results are shown in Table 4. RCHSVM does not scale well. LibSVM and SMO produce the best accuracies and standard deviations in practically all cases, but their training times are very large. Our CCHSVM method is able to train the SVM very fast; the accuracy and standard deviation are slightly degraded but still acceptable. For small data sets, our method does not have the best training time.

6 Conclusions

This paper proposes a novel method for SVM classification, CCHSVM. By using the convex-concave hull and a grid method, CCHSVM overcomes the problems of slow SVM training and of the low accuracy of many SVMs based on geometric properties. The key point of our method is that we use the Jarvis march method to determine the concave (non-convex) hull for the inseparable points.


Table 4: Comparison with other methods

Method   Data set   Training time (ms)   Acc %   stdDev %

SMO Four-class 17.00 95.33 0.20

LibSVM Four-class 6.10 99.01 0.0

RCHSVM Four-class 27.00 98.08 0.12

CCHSVM Four-class 22.83 98.55 0.14

SMO Checkerboard50 27,740.00 98.08 0.19

LibSVM Checkerboard50 19,001.00 98.2 0.01

RCHSVM Checkerboard50 48,225.00 92.9 0.24

CCHSVM Checkerboard50 3,791.01 96.82 0.68

SMO Cross 27,740.00 98.08 0.19

LibSVM Cross 19,001.00 98.2 0.01

RCHSVM Cross program crash program crash program crash

CCHSVM Cross 5,234.20 95.82 0.68

SMO Rotated-Cross 27,740.00 98.08 0.19

LibSVM Rotated-Cross 19,001.00 98.2 0.01

RCHSVM Rotated-Cross program crash program crash program crash

CCHSVM Rotated-Cross 4,986.20 95.82 0.68

SMO Skin-NoSkin 3,100,254.00 96.08 0.32

LibSVM Skin-NoSkin 1,860,014.00 96.34 0.09

RCHSVM Skin-NoSkin program crash program crash program crash

CCHSVM Skin-NoSkin 12,780.00 94.72 0.23

SMO Haberman’s survival 9.46 72.31 0.02

LibSVM Haberman’s survival 9.30 72.25 0.12

RCHSVM Haberman’s survival 12.30 72.10 0.13

CCHSVM Haberman’s survival 11.45 73.94 0.10

SMO Balls3D100 2,100,013.25 95.26 0.02

LibSVM Balls3D100 2,000,663.02 96.10 0.10

RCHSVM Balls3D100 program crash program crash program crash

CCHSVM Balls3D100 15,260.50 97.66 0.21

SMO Breast cancer 95.00 96.78 0.01

LibSVM Breast cancer 80.01 96.93 0.13

RCHSVM Breast cancer 128.25 91.35 0.23

CCHSVM Breast cancer 1,026.66 95.13 0.21

Experimental results demonstrate that our approach has good classification accuracy while the training is significantly faster than other SVM classifiers. The classification accuracy is higher than that of other geometric methods such as the reduced convex hull.

The training time of CCHSVM is significantly reduced. For small data sets and high-dimensional data, the accuracy of CCHSVM remains slightly lower than that of the classical SVM classifiers which use the whole data set. However, in the case of larger data sets, the classification accuracy is almost the same or even better than that of the other SVMs, because we do not use the soft margin method to deal with the misleading points; we use the concave hull to solve this problem.

References

[1] K.P. Bennett , E.J. Bredensteiner, Geometry in Learning, Geometry at Work, C. Gorini

editors, Mathematical Association of America, 132-145, 2000

[2] K.P. Bennett , E.J. Bredensteiner, Duality and Geometry in SVM Classifiers, 17th

International Conference on Machine Learning, San Francisco, CA, 2000

[3] M.Berg, O.Cheong, M.Kreveld, M.Overmars, Computational Geometry: Algorithms and

Applications, Springer-Verlag, 2008

[4] J.Cervantes, X.Li, W.Yu, K.Li, Support vector machine classification for large data sets

via minimum enclosing ball clustering, Neurocomputing, Vol.71, 611-619, 2008

[5] C-C.Chang and C-J.Lin, LIBSVM: A library for support vector machines,

http://www.csie.ntu.edu.tw/˜cjlin/libsvm, 2001.

[6] D.J. Crisp, C.J.C. Burges, A Geometric Interpretation of υ-SVM Classifiers, NIPS,

Vol.12, 244-250, 2000

[7] R.Collobert, F.Sinz, J.Weston, L.Bottou, Trading convexity for scalability, 23rd Inter-

national Conference on Machine Learning, 201-208, Pittsburgh, USA; 2006

[8] R.Collobert, S.Bengio, SVMTorch: Support vector machines for large regression prob-

lems, Journal of Machine Learning Research, Vol.1, 143-160, 2001.

[9] W.Eddy, A New Convex Hull Algorithm for Planar Sets, ACM Transactions on Math-

ematical Software Vol.3, No.4, 398–403, 1977


[10] V.Franc, V.Hlavac, An iterative algorithm learning the maximal margin classifier, Pat-

tern Recognition, Vol.36, 1985 –1996, 2003

[11] B.V.Gnedenko, Y.K.Belyayev, and A.D.Solovyev, Mathematical Methods of Reliability

Theory, NewYork: Academic Press, 1969

[12] E. G. Gilbert, An iterative procedure for computing the minimum of a quadratic form

on a convex set, SIAM J. on Control and Optimization, Vol.4, No.1, 61-79, 1966

[13] R.L.Graham, An Efficient Algorithm for Determining the Convex Hull of a Finite Planar Set, Information Processing Letters, Vol.1, 132-133, 1972

[14] G.Guo, J.-S.Zhang, Reducing examples to accelerate support vector regression, Pattern

Recognition Letters, Vol.28, 2173-2183, 2007

[15] T. K. Ho and E. M. Kleinberg. Checkerboard dataset, http://www.cs.wisc.edu/, 1996.

[16] C.-W.Hsu, C.-C.Chang, C.-J.Lin, A Practical Guide to Support Vector Classification,

Bioinformatics, Volume: 1, Issue: 1, 1-16, 2010.

[17] R.A.Jarvis, On the identification of the convex hull of a finite set of points in the plane,

Information Processing Letters, Vol.2, 18-21, 1973

[18] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, A Fast Iter-

ative Nearest Point Algorithm for Support Vector Machine Classifier Design, IEEE

Transactions on Neural Networks, Vol. 11, No.12, 124-137, 2001.

[19] S.S.Keerthi, E.G.Gilbert, Convergence of a Generalized SMO Algorithm for SVM Clas-

sifier Design. Machine Learning, 46, 351–360, 2002

[20] Y.Li, Selecting training points for one-class support vector machines, Pattern Recogni-

tion Letters, Vol.32, 1517-1522, 2011.

[21] M.E.Mavroforakis, S.Theodoridis, A Geometric Approach to Support Vector Machine (SVM) Classification, IEEE Trans. Neural Networks, vol.17, no.3, pp.671-682, 2006.

[22] B. F. Mitchell, V. F. Dem’yanov, and V. N. Malozemov, Finding the point of a polyhe-

dron closest to the origin, Vestinik Leningrad. Gos.Univ., Vol. 13, pp. 38-45, 1971

[23] M.Kallay, The Complexity of Incremental Convex Hull Algorithms, Information Pro-

cessing Letters, Vol.19, No.4, 197-212, 1984.


[24] N.Cristianini, J.Shawe-Taylor , An Introduction to Support Vector Machines and Other

Kernel-based Learning Methods , Cambridge University Press, 2000.

[25] J.Platt, Fast Training of support vector machine using sequential minimal optimization,

Advances in Kernel Methods: Support Vector Machine, MIT Press, Cambridge, MA,

1998.

[26] C.Pizzuti, D.Talia, P-Auto Class: Scalable Parallel Clustering for Mining Large Data

Sets, IEEE Trans. Knowledge and Data Eng., vol.15, no.3, pp.629-641, 2003.

[27] F.P.Preparata, S.J. Hong. Convex Hulls of Finite Sets of Points in Two and Three

Dimensions, Communications of the ACM , vol. 20, no. 2, pp. 87–93, 1977.

[28] M.I. Schlesinger, V.G. Kalmykov, A.A. Suchorukov, Comparative analysis of algorithms synthesising a linear decision rule for analysis of complex hypotheses, Automatika, Vol.1, 3-9, 1981.

[29] I.W. Tsang, J.T. Kwok, P.M.Cheung, Core Vector Machines: Fast SVM Training on

Very Large Data Sets, Journal of Machine Learning Research, Vol. 6, 363-392, 2005

[30] V.Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.

[31] C.Xia, W.Hsu, M.L.Lee, B.C.Ooi, BORDER: Efficient Computation of Boundary

Points, IEEE Trans. Knowledge and Data Eng., vol.18, no.3, pp.289-303, 2006.

[32] A. L. Yuille, A.Rangarajan, The Concave-Convex Procedure, Neural Computation, Vol.

15, No. 4, Pages 915-936, 2003

[33] H.Yu , J.Yang and J.Han , Classifying Large Data Sets Using SVMs with Hierarchical

Clusters, Proc. of the 9th ACM SIGKDD 2003, Washington, DC, USA, 2003.

[34] A.Moreira, M.Y.Santos, Concave Hull: A K-Nearest Neighbours Approach for the Com-

putation of the Region Occupied by a Set of Points, GRAPP (GM/R), Pages 61–68,

2007

[35] W.Yu, X.Li, On-line fuzzy modeling via clustering and support vector machines, Infor-

mation Sciences, Vol.178, 4264-4279, 2008.
