
STUDYING THE POSSIBILITY OF PEAKING PHENOMENON IN LINEAR SUPPORT VECTOR MACHINES WITH NON-SEPARABLE DATA

Sardar Afra, and Ulisses Braga-Neto∗

Department of Electrical Engineering, Texas A&M University, College Station, Texas 77843
[email protected], and [email protected] (∗Corresponding Author)

ABSTRACT

Typically, a peaking phenomenon is observed in the classification error when the feature size increases. In this paper, we study linear support vector machine classifiers in the case where the data is non-separable. A simulation based on synthetic data is implemented to study the possibility of observing the peaking phenomenon; however, no peaking in the expected true error is observed. We also present the performance of three different error estimators as a function of feature and sample size. Based on our study, one might conclude that, when using linear support vector machines, the size of the feature set can be increased safely.

1. INTRODUCTION

Typically (but not always), a peaking phenomenon is observed as the feature size increases [1]: beyond a certain point, the classification error increases as the number of features grows. The peaking phenomenon, first studied in 1968 by Hughes [2], depends on the joint feature-label distribution and on the classification rule [3]. In the present paper, we implement linear support vector machines as the classification rule and discuss the peaking phenomenon in the classification error and in several common error estimators, namely resubstitution, k-fold cross-validation (CV), and bootstrap methods (bootstrap zero and bootstrap 0.632). The paper is organized as follows. Section 2 explains linear support vector machines (LSVMs). Section 3 describes three different error estimation methods. Section 4 presents a simulation based on synthetic data, whose model parameters are estimated from real patient data [4]. Section 5 discusses the simulation results. Finally, Section 6 provides some concluding remarks.

2. LSVM CLASSIFIERS

In two-group statistical pattern recognition, there is a feature vector X ∈ IR^p and a label Y ∈ {−1, 1}. The pair (X, Y) has a joint probability distribution F, which is unknown in practice. Hence, one has to resort to designing classifiers from training data, which consists of a set of n independent points S_n = {(X_1, Y_1), ..., (X_n, Y_n)} drawn from F. A classification rule is a mapping g : {IR^p × {0, 1}}^n × IR^p → {0, 1}, which maps the training data S_n into the designed classifier g(S_n, ·) : IR^p → {0, 1}. The linear discriminant classifier is defined by

$$g(S_n, x) = \begin{cases} 1, & a^T x + a_0 > 0, \\ 0, & \text{otherwise.} \end{cases}$$

Therefore, all training points are correctly classified if Y_i(a^T X_i + a_0) > 0, i = 1, ..., n, where Y_i ∈ {−1, 1} (instead of the usual 0 and 1). The main idea in support vector machines is to adjust the linear discriminant so that the margin is maximal; this is called the Maximal Margin Hyperplane (MMH) algorithm. The points closest to the hyperplane are called the support vectors and determine many of the properties of the classifier. In this paper we consider LSVMs where the data is non-separable.

2.1. Linear Discrimination with a Margin

If we want to have a margin b > 0, the constraint becomes Y_i(a^T X_i + a_0) ≥ b, for i = 1, ..., n. The optimal solution to this problem places all points at a distance of at least b/||a|| from the hyperplane. Since a, a_0, and b can be freely scaled, without loss of generality we can set b = 1, so that Y_i(a^T X_i + a_0) ≥ 1, for i = 1, ..., n. The margin is 1/||a||, and the points at exactly this distance from the hyperplane are called support vectors. The idea is to maximize the margin 1/||a||; for this, it suffices to minimize

$$\frac{1}{2}\,\|a\|^2 = \frac{1}{2}\, a^T a$$

subject to the constraints Y_i(a^T X_i + a_0) ≥ 1, for i = 1, ..., n. The solution vector a* determines the MMH. The corresponding optimal value a_0* will be determined later from a* and the constraints.

2.2. Non-Separable Data

If the data is not linearly separable, it is still possible to formulate the problem and find a solution by introducing slack variables ξ_i, i = 1, ..., n, for each of the constraints, resulting in a new set of 2n constraints:

$$Y_i(a^T X_i + a_0) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0, \quad i = 1, \ldots, n.$$

Therefore, if ξ_i > 0, the corresponding training point is an "outlier," i.e., it can lie closer to the hyperplane than the margin, or even be misclassified. We introduce a penalty term C∑_{i=1}^{n} ξ_i in the functional, which then becomes:

$$\frac{1}{2}\, a^T a + C \sum_{i=1}^{n} \xi_i. \qquad (1)$$


The constant C modulates how large the penalty for the presence of outliers is. If C is small, the penalty is small and a solution is more likely to incorporate outliers; if C is large, the penalty is large and a solution is unlikely to incorporate many outliers. The method of Lagrange multipliers allows us to turn this into an unconstrained problem (unconstrained in a and a_0). Consider the primal Lagrangian functional

$$L_P(a, a_0, \xi, \lambda, \rho) = \frac{1}{2}\, a^T a + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \lambda_i \left( Y_i(a^T X_i + a_0) - 1 + \xi_i \right) - \sum_{i=1}^{n} \rho_i \xi_i,$$

where λ_i and ρ_i, for i = 1, ..., n, are the Lagrange multipliers. It can be shown that the solution to the previous constrained problem can be found by simultaneously minimizing L_P with respect to a and a_0 and maximizing it with respect to λ (hence, we search for a saddle point of L_P). Since L_P is unconstrained in a and a_0, a necessary condition for optimality is that the derivatives of L_P with respect to a and a_0 be zero. This is part of the Karush-Kuhn-Tucker (KKT) conditions for the original problem, which yields the equations

$$a = \sum_{i=1}^{n} \lambda_i Y_i X_i \quad \text{and} \quad \sum_{i=1}^{n} \lambda_i Y_i = 0. \qquad (2)$$

Setting the derivatives of L_P with respect to ξ_i to zero yields

$$C - \lambda_i - \rho_i = 0, \quad i = 1, \ldots, n. \qquad (3)$$

Substituting these equations back into L_P leads to the same expression for the dual Lagrangian functional as in the separable case:

$$L_D(\lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j Y_i Y_j X_i^T X_j,$$

which must be maximized with respect to the λ_i. The first set of constraints comes from Equation (2), namely ∑_{i=1}^{n} λ_i Y_i = 0. One also has 0 ≤ λ_i ≤ C, for i = 1, ..., n, which is derived from Equation (3) and the non-negativity of the Lagrange multipliers, λ_i, ρ_i ≥ 0. The outliers are points for which ξ_i > 0 ⇒ ρ_i = 0 ⇒ λ_i = C. The points for which 0 < λ_i < C are called margin vectors. Once the optimal λ* is found, the solution vector a* is determined by Equation (2):

$$a^* = \sum_{i=1}^{n} \lambda_i^* Y_i X_i = \sum_{i \in S} \lambda_i^* Y_i X_i,$$

where S = {i | λ_i^* > 0} is the support vector index set. The value of a_0^* can be determined from any of the active constraints a^T X_i + a_0 = Y_i with ξ_i = 0, that is, the constraints for which 0 < λ_i < C (the margin vectors), or from their sum:

$$n_m a_0 + (a^*)^T \sum_{i \in S_m} X_i = \sum_{i \in S_m} Y_i.$$

In the previous equation, S_m = {i | 0 < λ_i < C} is the margin vector index set, and n_m = Card(S_m) is the number of margin vectors. From this it follows that

$$a_0^* = -\frac{1}{n_m} \sum_{i \in S} \sum_{j \in S_m} \lambda_i^* Y_i X_i^T X_j + \frac{1}{n_m} \sum_{i \in S_m} Y_i.$$

The optimal discriminant is thus given by

$$(a^*)^T x + a_0^* = \sum_{i \in S} \lambda_i^* Y_i X_i^T x + a_0^*,$$

and the designed classifier assigns the label 1 whenever this quantity is positive.
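To make the construction above concrete, the following Python sketch (not part of the original paper) fits a soft-margin linear SVM on synthetic non-separable data using scikit-learn's SVC, which wraps LIBSVM, and recovers the solution vector a* = ∑_{i∈S} λ_i* Y_i X_i from the dual coefficients returned by the solver. The data, the value C = 0.1, and all variable names are illustrative assumptions.

```python
# Sketch: recover a* and a0* from the dual solution of a soft-margin linear SVM.
import numpy as np
from sklearn.svm import SVC  # scikit-learn's wrapper around LIBSVM

rng = np.random.default_rng(0)
# Two overlapping Gaussian classes => non-separable data, labels in {-1, +1}.
X = np.vstack([rng.normal(-1.0, 1.5, size=(50, 2)),
               rng.normal(+1.0, 1.5, size=(50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])

C = 0.1                                   # slack penalty from Equation (1)
clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_ stores lambda_i^* * Y_i for the support vectors (indices in clf.support_),
# so a* = sum_{i in S} lambda_i^* Y_i X_i is a single matrix product:
a_star = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(a_star, clf.coef_))     # True: matches the primal weight vector
print("a* =", clf.coef_.ravel(), " a0* =", clf.intercept_[0])

# Support vectors with 0 < lambda_i < C are margin vectors; lambda_i = C marks outliers.
lam = np.abs(clf.dual_coef_).ravel()
print("margin vectors:", int(np.sum(lam < C - 1e-8)),
      "outliers (lambda_i = C):", int(np.sum(np.isclose(lam, C))))
```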

3. ERROR ESTIMATION

The true error of a designed classifier is its error rate given the training data set:

$$\varepsilon_n[g\,|\,S_n] = P\big(g(S_n, x) \neq y\big) = E_F\big(|y - g(S_n, x)|\big), \qquad (4)$$

where the notation E_F indicates that the expectation is taken with respect to F; in fact, one can think of (x, y) in the above equation as a random test point (an interpretation that is useful in understanding error estimation). The expected error rate over the data is given by

$$\varepsilon_n[g] = E_{F_n}\big(\varepsilon_n[g\,|\,S_n]\big) = E_{F_n} E_F\big(|y - g(S_n, x)|\big),$$

where F_n is the joint distribution of the training data S_n. This is sometimes called the unconditional error of the classification rule for sample size n. If the underlying feature-label distribution F were known, the true error could be computed exactly via (4). In practice, one is limited to using an error estimator.
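As a brief illustration (not from the paper), when a large independent test sample drawn from F is available, as in the simulation of Section 4, the conditional true error in (4) can be approximated by the test-set misclassification rate. A minimal Python sketch, assuming a fitted classifier with a scikit-learn-style predict method and hypothetical array names:

```python
# Sketch: approximate the true error (4) by the error rate on a large
# independent test set (X_test, y_test) drawn from F.
import numpy as np

def true_error_estimate(clf, X_test, y_test):
    """Monte Carlo approximation of P(g(S_n, x) != y) over held-out test points."""
    return float(np.mean(clf.predict(X_test) != y_test))
```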

3.1. Resubstitution

The simplest and fastest way to estimate the error of a designed classifier in the absence of test data is to compute its error directly on the sample data itself:

$$\hat{\varepsilon}_{\mathrm{resub}} = \frac{1}{n} \sum_{i=1}^{n} |Y_i - g(S_n, X_i)|.$$

This resubstitution estimator, attributed to [5], is very fast, but is usually optimistic (i.e., low-biased) as an estimator of ε_n[g] [6].
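A minimal sketch of the resubstitution estimator, assuming an LSVM classification rule via scikit-learn's SVC (the function name and the default C are illustrative):

```python
# Sketch: resubstitution estimate -- design the classifier on S_n and
# test it on the very same points.
import numpy as np
from sklearn.svm import SVC

def resubstitution_error(X, y, C=0.1):
    clf = SVC(kernel="linear", C=C).fit(X, y)      # design on S_n
    return float(np.mean(clf.predict(X) != y))     # error on S_n itself
```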

3.2. k-fold CV

CV removes the optimism from resubstitution by employing test points not used in classifier design [7]. In k-fold CV, the data set S_n is partitioned into k folds S^(i), for i = 1, ..., k (for simplicity, we assume that k divides n). Each fold is left out of the design process and used as a test set, and the estimate is the overall proportion of errors committed on all folds:

$$\hat{\varepsilon}_{\mathrm{cv}k} = \frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n/k} \left| Y_j^{(i)} - g\big(S_n \setminus S^{(i)}, X_j^{(i)}\big) \right|,$$

where (X_j^(i), Y_j^(i)) is a sample in the i-th fold. The process may be repeated: several cross-validation estimates


are computed using different partitions of the data into folds, and the results are averaged. CV estimators are often pessimistic, since they use smaller training sets to design the classifier. Their main drawback is their variance [6, 8].
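A sketch of the k-fold CV estimator with repeated random partitions, again assuming an LSVM rule through scikit-learn (names and defaults are illustrative, not the paper's code):

```python
# Sketch: k-fold CV error estimate, averaged over several random partitions
# of S_n into folds.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cv_error(X, y, k=5, repetitions=10, C=0.1, seed=0):
    estimates = []
    for r in range(repetitions):
        skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed + r)
        n_errors = 0
        for train_idx, test_idx in skf.split(X, y):
            # Design a surrogate classifier on S_n \ S^(i) and test it on fold S^(i).
            clf = SVC(kernel="linear", C=C).fit(X[train_idx], y[train_idx])
            n_errors += int(np.sum(clf.predict(X[test_idx]) != y[test_idx]))
        estimates.append(n_errors / len(y))      # proportion of errors over all folds
    return float(np.mean(estimates))             # average over the repetitions
```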

3.3. Bootstrap

The bootstrap error estimation technique [9, 10] is based on the notion of an "empirical distribution" F*, which serves as a replacement for the original unknown distribution F. The empirical distribution puts mass 1/n on each of the n available data points. A "bootstrap sample" S_n* from F* consists of n equally-likely draws with replacement from the original data S_n. Hence, some of the sample points will appear multiple times, whereas others will not appear at all. The actual proportion of times a data point (X_i, Y_i) appears in S_n* is

$$P_i^* = \frac{1}{n} \sum_{j=1}^{n} I_{(X_j^*, Y_j^*) = (X_i, Y_i)},$$

where I_S = 1 if the statement S is true and zero otherwise. The basic bootstrap zero estimator [11] is written in terms of the empirical distribution as

$$\hat{\varepsilon}_0 = E_{F^*}\big( |Y - g(S_n^*, X)| : (X, Y) \in S_n \setminus S_n^* \big).$$

In practice, the expectation E_{F*} has to be approximated by a Monte Carlo estimate based on independent replicates S_n^{*b}, for b = 1, ..., B (values of B between 25 and 200 are recommended [11]):

$$\hat{\varepsilon}_0 = \frac{\sum_{b=1}^{B} \sum_{i=1}^{n} |Y_i - g(S_n^{*b}, X_i)|\; I_{P_i^{*b} = 0}}{\sum_{b=1}^{B} \sum_{i=1}^{n} I_{P_i^{*b} = 0}}.$$

The bootstrap zero estimator works like cross-validation: the classifier is designed on the bootstrap sample and tested on the original data points that are left out. It tends to be high-biased as an estimator of ε_n[g], since the number of samples available for designing the classifier is on average only (1 − e^{−1})n ≈ 0.632n. The estimator

$$\hat{\varepsilon}_{b632} = (1 - 0.632)\, \hat{\varepsilon}_{\mathrm{resub}} + 0.632\, \hat{\varepsilon}_0$$

tries to correct this bias by taking a weighted average of the bootstrap zero and resubstitution estimators. It is known as the .632 bootstrap estimator [11], and has been perhaps the most popular bootstrap estimator in data mining [12]. It has low variance, but can be extremely slow to compute. In addition, it can fail when resubstitution is too low-biased [8].
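A sketch of the bootstrap zero and .632 bootstrap estimators following the Monte Carlo approximation above, with an LSVM rule via scikit-learn (B, C, and the guard against degenerate bootstrap samples are illustrative assumptions):

```python
# Sketch: bootstrap zero and .632 bootstrap error estimates with B replicates.
import numpy as np
from sklearn.svm import SVC

def bootstrap_errors(X, y, B=25, C=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    err_sum, n_left_out = 0.0, 0
    for _ in range(B):
        boot_idx = rng.integers(0, n, size=n)            # n draws with replacement
        out_idx = np.setdiff1d(np.arange(n), boot_idx)   # points with P_i^{*b} = 0
        if out_idx.size == 0 or np.unique(y[boot_idx]).size < 2:
            continue                                     # skip degenerate replicates
        clf = SVC(kernel="linear", C=C).fit(X[boot_idx], y[boot_idx])
        err_sum += np.sum(clf.predict(X[out_idx]) != y[out_idx])
        n_left_out += out_idx.size
    eps_zero = err_sum / n_left_out                      # bootstrap zero estimate
    clf_full = SVC(kernel="linear", C=C).fit(X, y)
    eps_resub = np.mean(clf_full.predict(X) != y)        # resubstitution estimate
    eps_632 = (1 - 0.632) * eps_resub + 0.632 * eps_zero # .632 bootstrap estimate
    return float(eps_zero), float(eps_632)
```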

4. SIMULATION

To generate synthetic data, we assume a multivariate Gaussian distribution with parameters directly estimated from a real data set [4]. First, we choose the 500 features that have the largest absolute mean across all samples. Then, we calculate the mean vector and covariance matrix of the chosen features in each class separately and use them to generate the synthetic data set from a multivariate Gaussian distribution. We generate, only once, a large set of 2000 sample points in each class, which is used as a test set for calculating the expected true error. In addition to the test set, a training set of variable size (one parameter of the simulation) of 20, 40, ..., 200 points per class is generated in each Monte Carlo iteration.

[Surface plot omitted: expected true error vs. feature size and samples per class, each ranging from 20 to 200.]

Figure 1. True error as a function of sample size and feature size.

In each Monte Carlo iteration, for a sample of fixed size, we employ the t-test to select the best feature set of variable size (another parameter of the simulation) of 20, 40, ..., 200 features. We design an LSVM classifier with C = 0.1 on the training set and employ the following error estimators: resubstitution, 5-fold CV with 10 repetitions, 10-fold CV with 10 repetitions, and bootstrap zero and bootstrap 0.632 with 25 bootstrap replicates. Note that for the CV and bootstrap error estimators, t-test feature selection is performed anew for every fold and every bootstrap sample, and surrogate classifiers are then designed. We also calculate the true error of the classifier based on the large test data set and the features selected by the t-test. The Monte Carlo simulation is run for 1000 iterations. For the linear SVM classification rule, we use the LIBSVM package [13] with the default settings, setting the slack penalty parameter to 0.1.
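The following Python sketch outlines one Monte Carlo iteration of the simulation described above. It is a reconstruction under stated assumptions, not the authors' code: the class means and covariances (placeholders mean0, cov0, mean1, cov1) would be estimated from the real data set [4], the t-test feature selection uses scipy.stats.ttest_ind, and scikit-learn's SVC stands in for a direct call to LIBSVM.

```python
# Sketch of one Monte Carlo iteration: generate Gaussian training data,
# select features by t-test, design an LSVM with C = 0.1, and evaluate the
# true error on the large fixed test set.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.svm import SVC

def one_iteration(mean0, cov0, mean1, cov1, n_train, n_feat, rng, X_test, y_test):
    # Training set: n_train points per class from the two Gaussian class models.
    X_tr = np.vstack([rng.multivariate_normal(mean0, cov0, n_train),
                      rng.multivariate_normal(mean1, cov1, n_train)])
    y_tr = np.concatenate([-np.ones(n_train), np.ones(n_train)])

    # t-test feature selection: keep the n_feat features with largest |t| statistic.
    t_stat, _ = ttest_ind(X_tr[y_tr == -1], X_tr[y_tr == 1], axis=0)
    selected = np.argsort(-np.abs(t_stat))[:n_feat]

    # Design the LSVM classifier and compute its true error on the test set.
    clf = SVC(kernel="linear", C=0.1).fit(X_tr[:, selected], y_tr)
    return float(np.mean(clf.predict(X_test[:, selected]) != y_test))
```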

5. RESULTS

The expected true error is shown in Fig. 1. Strikingly, the LSVM shows no peaking even up to a feature size of 200. The expected true error decreases as both the sample size and the feature size increase. Therefore, one can safely use a large feature size even when the sample size is small. Figure 2 shows the biases of the error estimators in the same graph, as a function of feature size, for two different fixed sample sizes. The CV error estimators have biases close to zero. As is well known, k-fold CV is a pessimistic estimator of the classification error, and its bias is expected to go to zero as the number of samples increases. For bootstrap zero, we can see that the bias is positive, which means that the estimator is pessimistic, as expected. Bootstrap 0.632 is better than bootstrap zero in terms of bias. In addition, as the sample size increases, both biases go to zero. In Fig. 3, the variances of the error estimator deviations are shown in the same plot for two fixed sample sizes as the feature size changes.

CV has a high-variance deviation compared to the other error estimators, as observed in Fig. 3. Bootstrap 0.632 has a small variance, as expected, and resubstitution tends to be low-variance, having the smallest variance among all the error estimators.


[Plot omitted: bias vs. feature size for CV5, CV10, Boot0, Boot0.632, and resubstitution; sample size 20 per class (solid) and 200 per class (dashed).]

Figure 2. Bias of the error estimators for the sample size of 20 and 200 per class for variable feature size of 20, 40, ..., 200.

[Plot omitted: variance vs. feature size for CV5, CV10, Boot0, Boot0.632, and resubstitution; sample size 20 per class (solid) and 200 per class (dashed).]

Figure 3. Variance of the error estimator deviations for the sample size of 20 and 200 per class for variable feature size of 20, 40, ..., 200.

The RMS of all the error estimators is depicted in Fig. 4 for two different sample sizes. As can be seen, the RMS of bootstrap 0.632 surpasses all the others, as desired. Resubstitution has a large RMS, even though its variance is very close to zero.

6. CONCLUSION

In this paper, we discussed the peaking phenomenon in the expected true error as a function of sample and feature size for LSVMs used as the classification rule when the classes are not linearly separable. In addition, the performance of three error estimators was studied. Surprisingly, we observed no peaking in the expected true error of LSVM classifiers as the number of features increases. This might have two reasons: the first is the complexity of LSVMs, in which overfitting is controlled by the parameter C; the second is the structure of the feature-label distribution. We think that using more features when an LSVM classifier is employed can be safe in terms of the peaking phenomenon.

7. REFERENCES

[1] J. Hua, Z. Xiong, J. Lowey, E. Suh, and E.R. Dougherty, "Optimal number of features as a function of sample size for various classification rules," Bioinformatics, vol. 21, no. 8, pp. 1509-1515, 2005.

[2] G. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55-63, 1968.

[Plot omitted: RMS vs. feature size for CV5, CV10, Boot0, Boot0.632, and resubstitution; sample size 20 per class (solid) and 200 per class (dashed).]

Figure 4. RMS of the error estimators for the sample size of 20 and 200 per class for variable feature size of 20, 40, ..., 200.

[3] C. Sima and E.R. Dougherty, "The peaking phenomenon in the presence of feature-selection," Pattern Recognition Letters, vol. 29, pp. 1667-1674, 2008.

[4] X. Lin, B. Afsari, L. Marchionni, L. Cope, G. Parmigiani, D. Naiman, and D. Geman, "The ordering of expression among a few genes can provide simple cancer biomarkers and signal BRCA1 mutations," BMC Bioinformatics, vol. 10, p. 256, 2009.

[5] C.A.B. Smith, "Some examples of discrimination," Annals of Human Genetics, vol. 13, no. 1, pp. 272-282, 1946.

[6] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Applications of Mathematics, Springer, 1996.

[7] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, Wiley, 2001.

[8] U.M. Braga-Neto and E.R. Dougherty, "Is cross-validation valid for small-sample microarray classification?," Bioinformatics, vol. 20, pp. 374-380, 2004.

[9] B. Efron, "Bootstrap methods: Another look at the jackknife," The Annals of Statistics, vol. 7, no. 1, pp. 1-26, 1979.

[10] B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans, SIAM Monograph 38, NSF-CBMS, 1982.

[11] B. Efron, "Estimating the error rate of a prediction rule: Improvement on cross-validation," Journal of the American Statistical Association, vol. 78, no. 382, pp. 316-331, 1983.

[12] I. Witten and E. Frank, Data Mining, Academic Press, San Diego, CA, 2000.

[13] C.-C. Chang and C.-J. Lin, LIBSVM: Introduction and Benchmarks, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2000.
