

Nuclear Instruments and Methods in Physics Research A 612 (2009) 187–195


A numeric comparison of variable selection algorithms for supervised learning

G. Palombo a,*, I. Narsky b

a University of Milan, Bicocca, Italy
b California Institute of Technology, USA

Article info

Article history: Received 12 June 2009; received in revised form 22 September 2009; accepted 23 September 2009; available online 26 September 2009.

Keywords: StatPatternRecognition; Machine learning; Data analysis; Variable selection; Classification

doi:10.1016/j.nima.2009.09.059

* Corresponding author. E-mail addresses: [email protected] (G. Palombo), [email protected] (I. Narsky).

Abstract

Datasets in modern High Energy Physics (HEP) experiments are often described by dozens or even hundreds of input variables. Reducing a full variable set to a subset that most completely represents information about data is therefore an important task in analysis of HEP data. We compare various variable selection algorithms for supervised learning using several datasets such as, for instance, imaging gamma-ray Cherenkov telescope (MAGIC) data found at the UCI repository. We use classifiers and variable selection methods implemented in the statistical package StatPatternRecognition (SPR), a free open-source C++ package developed in the HEP community (http://sourceforge.net/projects/statpatrec/). For each dataset, we select a powerful classifier and estimate its learning accuracy on variable subsets obtained by various selection algorithms. When possible, we also estimate the CPU time needed for the variable subset selection. The results of this analysis are compared with those published previously for these datasets using other statistical packages such as R and Weka. We show that the most accurate, yet slowest, method is a wrapper algorithm known as generalized sequential forward selection ("Add N Remove R") implemented in SPR.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

A crucial task in analysis of High Energy Physics (HEP) data is separation of signal and background events [1]. In modern HEP analysis, Supervised Machine Learning (SML) techniques are often used to classify observed events as signal or background. SML is an inductive process of learning a function from a given set of events typically collected in a training set. Each event is described by a vector of input variable values (such as momenta, energy, etc.) and a class label (typically signal and background). The task is to derive a classifier that is able to predict accurately class labels for unseen events [2-5]. The measure of the classifier performance is called Figure-Of-Merit (FOM).

The number of input variables can be very high and the analysis can involve a huge number of events. Variable Selection (VS) in classification addresses the problem of reducing the variable set to a subset that most completely represents information about the SML problem [2]. VS is usually based on filter, wrapper, or embedded approaches. In the filter approach, the selection is performed using only information present in the data, without considering information from the underlying learning algorithm [6]. It can be seen as a pre-processing step which involves only intrinsic characteristics of the data. In the wrapper approach, VS directly optimizes the performance of the induction algorithm to select the best subset of variables [7]. To do that, many different possible subsets of variables are generated and then evaluated on an independent test set which is not used during the search. Embedded algorithms are specific to a given algorithm too, but, in contrast to wrapper algorithms, they incorporate VS directly into the learning process [6]. They typically select the most important variables according to a ranking strategy, where the variable rank depends on the relevance of this variable in the learning process. Embedded methods are not new: CART decision trees introduced by Breiman et al. [3] in 1984 already have a built-in mechanism to estimate variable importance. Since they evaluate the variable importance during the learning process, they are faster than wrappers. Filter algorithms are usually the fastest; however, they are typically outperformed by methods which take into account the induction algorithm [6,7].

Removing variables irrelevant for classification can reduce the time needed by the learning process and makes it easier and quicker to interpret and analyze the results. Also, some algorithms are not robust with respect to irrelevant or noisy variables, and their performance degrades when the variable pool is polluted with poor predictors.

In this paper we compare the performance of various VS methods applied to the same sets of data. Section 2 briefly describes the statistical tools used in our study. Section 3 shows the performance of various VS algorithms on several datasets. Section 4 draws conclusions from this analysis.

2. Statistical analysis tools

For the data analysis we use the statistical packages SPR (release 08-00-00) [8], R (release 2.6.2) [9], and Weka (release 3.5.7) [10]. We describe briefly the SPR package; for details on SPR, R, and Weka see the references above.

SPR is a C++ package for SML. It implements linear and quadratic discriminant analysis [11], logistic regression [2], a binary decision split, a bump hunter [12], two flavors of decision trees [3,4,13] (one for fast machine optimization and the other for easy interpretation by humans), a feedforward backpropagation neural net with a logistic activation function [14], several flavors of boosting [15] including the arc-x4 algorithm [16], bagging [17] and random forest [18], and an algorithm for combining classifiers trained on subsets of input variables [19]. The algorithms listed above can only be used for separation of two classes, signal and background. The package also includes two multi-class methods [19], one of which is described later in this paper; these algorithms reduce a problem with an arbitrary number of classes to a set of two-class problems and then convert the solutions to these binary problems into an overall multi-category classification label.

In this study, we mainly use tree-based classifiers. Therefore, we briefly review Decision Trees (DT), Boosted Decision Trees (BDT) and Random Forest (RF).

The structure of a single DT is rather simple. At every step, the algorithm considers all possible binary splits on each variable and splits on the variable which improves the FOM most. It continues splitting daughter nodes into smaller nodes recursively until a stopping criterion is satisfied. A node which is not split into daughter nodes is a leaf. A leaf is labeled as signal if there are more signal events in this node than background events. Various measures are used to search for optimal decision splits; for details look elsewhere [13].

A single DT usually offers weak predictive power. However, ensemble methods can significantly improve the accuracy of weak classifiers such as DT by combining and averaging many weak classifiers. One ensemble method is called boosting. Boosting grows decision trees sequentially, increasing the weights of misclassified events after every tree and growing a new tree on the reweighted training data. Classification labels for test data are then computed by taking a weighted average of all trees grown by the algorithm.

Another ensemble method is RF applied in conjunction with bagging. Many decision trees are grown on bootstrap replicas of the training data, where each bootstrap replica is obtained by sampling N out of N training events with replacement. In addition, RF selects a random subset of input variables for each decision split and thus introduces even more randomness in the algorithm. Classification labels for test data are then obtained by taking a majority vote of the grown trees.
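For illustration, the bootstrap-and-vote logic described above can be sketched as follows. This is a schematic C++ fragment under our own assumptions, not code from SPR: the base learner is hidden behind a generic callback, and the aliases Event, Learner, and Trainer are illustrative.

```cpp
// Minimal sketch of bagging: grow B base learners on bootstrap replicas
// (N events drawn out of N with replacement) and classify by majority vote.
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

using Event = std::vector<double>;
// A trained base learner maps an event to a label (0 = background, 1 = signal).
using Learner = std::function<int(const Event&)>;
// A training routine builds a learner from events and labels.
using Trainer = std::function<Learner(const std::vector<Event>&, const std::vector<int>&)>;

std::vector<Learner> bag(const std::vector<Event>& x, const std::vector<int>& y,
                         const Trainer& train, int B, unsigned seed = 0) {
  std::mt19937 gen(seed);
  std::uniform_int_distribution<std::size_t> pick(0, x.size() - 1);
  std::vector<Learner> ensemble;
  for (int b = 0; b < B; ++b) {
    std::vector<Event> xb;  // bootstrap replica of the training set
    std::vector<int> yb;
    for (std::size_t i = 0; i < x.size(); ++i) {
      const std::size_t j = pick(gen);  // sample with replacement
      xb.push_back(x[j]);
      yb.push_back(y[j]);
    }
    ensemble.push_back(train(xb, yb));
  }
  return ensemble;
}

int majorityVote(const std::vector<Learner>& ensemble, const Event& e) {
  int votes = 0;
  for (const auto& f : ensemble) votes += f(e);  // count signal votes
  return 2 * votes > static_cast<int>(ensemble.size()) ? 1 : 0;
}
```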

The SPR classifiers are implemented in a flexible framework utilizing the full power of object-oriented programming. Because all classifiers inherit from the same abstract interface, one can easily substitute one classifier for another. This modular approach makes the package scalable for highly complex applications. For example, SPR is capable of boosting or bagging an arbitrary sequence of classifiers included in the package.

Besides multivariate classifiers, SPR implements tools for computation of data moments including correlations between variables, bootstrap analysis of data, and others. SPR is distributed under the General Public License (GPL) off Sourceforge [20]. Support is provided for two versions of the package: a standalone version that uses ASCII text for input and output of data, and a ROOT-dependent version [21]. The user can choose between the two versions during installation by setting an appropriate parameter of the configuration script. No graphical tools are offered for the ASCII version of the package; however, one can go through the full analysis chain using ASCII output from SPR executables, as long as one can tolerate digesting information in the form of text tables instead of plots. The ROOT version of the package allows the user to call SPR routines from an interactive ROOT session and plot output of various SPR methods using supplied ROOT scripts for graphical data analysis. Algorithms for VS implemented in SPR are discussed below.

One of the VS algorithms implemented in SPR is a quick filter algorithm which ranks variables by their correlation with the class label (we will call it "Correlations" in this work). The larger the correlation, the more important is this variable for classification.
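As an illustration of such a filter, the sketch below ranks variables by the absolute Pearson correlation between each variable and a numeric class label (e.g. 0 for background, 1 for signal). The correlation definition actually used by SPR may differ, and all function names here are ours.

```cpp
// Sketch of a correlation filter: rank variables by |corr(variable, label)|.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

double pearson(const std::vector<double>& a, const std::vector<double>& b) {
  const std::size_t n = a.size();
  double ma = 0, mb = 0;
  for (std::size_t i = 0; i < n; ++i) { ma += a[i]; mb += b[i]; }
  ma /= n; mb /= n;
  double cov = 0, va = 0, vb = 0;
  for (std::size_t i = 0; i < n; ++i) {
    cov += (a[i] - ma) * (b[i] - mb);
    va  += (a[i] - ma) * (a[i] - ma);
    vb  += (b[i] - mb) * (b[i] - mb);
  }
  return cov / std::sqrt(va * vb);
}

// Returns variable indices sorted from most to least correlated with the label.
std::vector<std::size_t> rankByCorrelation(
    const std::vector<std::vector<double>>& columns,  // one vector per variable
    const std::vector<double>& label) {
  std::vector<std::size_t> order(columns.size());
  std::vector<double> score(columns.size());
  for (std::size_t k = 0; k < columns.size(); ++k) {
    order[k] = k;
    score[k] = std::fabs(pearson(columns[k], label));
  }
  std::sort(order.begin(), order.end(),
            [&](std::size_t i, std::size_t j) { return score[i] > score[j]; });
  return order;
}
```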

Another method for estimation of variable importance is the‘‘Permutations’’ algorithm implemented in SPR. The idea behindthis method is similar to that proposed by Breiman [18] for out-of-bag estimates. A trained classifier is applied to events not includedin the data used for training and a performance measure such as,e.g., quadratic loss

D¼1

N

XN

i ¼ 1

ðyi � f ðxiÞÞ2

ð1Þ

is recorded, where yi is the true class of event i, f ðxiÞ is theresponse of the induced classifier event i, and N is the number ofevents. Then, this classifier is applied to these events with classlabels randomly permuted across each variable in turn and thechange in the performance measure due to this permutation foreach variable is estimated. Random permutation effectively voidsthe original effect of the permuted variable on the response of theclassifier. When the set with one variable permuted, along with allthe other non-permuted variables, is used to estimate classifierperformance, the classifier performance decreases significantly ifthe permuted variable was a strongly relevant variable. The SPRimplementation lets the user specify the number of permutationsacross each variable, producing more accurate estimates as thisnumber increases. Unlike Breiman’s method for the out-of-bagestimate, the classifier is applied to independent test data;furthermore, an ensemble of weak learners such as, e.g., RF isapplied to the test events as a whole, without evaluating the out-of-bag performance for each tree. The ‘‘Permutations’’ method isversatile as it can be used with any classifier, not just ensemblemembers trained on bootstrap replicas.
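A minimal sketch of this permutation procedure is given below: it permutes the values of one variable across the test events and records the average increase of the quadratic loss of Eq. (1). The classifier is abstracted into a scoring callback, and the interface is illustrative rather than SPR's.

```cpp
// Sketch of permutation importance based on the quadratic loss of Eq. (1).
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <random>
#include <vector>

using Event = std::vector<double>;
using Classifier = std::function<double(const Event&)>;  // response f(x)

double quadraticLoss(const Classifier& f, const std::vector<Event>& x,
                     const std::vector<double>& y) {
  double sum = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double d = y[i] - f(x[i]);
    sum += d * d;  // (y_i - f(x_i))^2 as in Eq. (1)
  }
  return sum / x.size();
}

// Importance of variable `var`, averaged over `nPerm` random permutations.
double permutationImportance(const Classifier& f, const std::vector<Event>& x,
                             const std::vector<double>& y, std::size_t var,
                             int nPerm = 10, unsigned seed = 0) {
  const double baseline = quadraticLoss(f, x, y);
  std::mt19937 gen(seed);
  std::vector<std::size_t> idx(x.size());
  std::iota(idx.begin(), idx.end(), 0);
  double increase = 0;
  for (int p = 0; p < nPerm; ++p) {
    std::shuffle(idx.begin(), idx.end(), gen);
    std::vector<Event> xp = x;
    for (std::size_t i = 0; i < x.size(); ++i) xp[i][var] = x[idx[i]][var];
    increase += quadraticLoss(f, xp, y) - baseline;
  }
  return increase / nPerm;  // large increase => important variable
}
```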

For DT and their ensembles, one can estimate variable importance by summing changes in the optimized FOM, such as, e.g., the Gini index, due to splits imposed on each variable [3]. Specifically, each split is made on the variable which improves the FOM the most. Then the sum of FOM improvements over all the splits on each variable is calculated. The greater the overall improvement in the FOM, the more important is that variable. We refer to this popular embedded algorithm as "FOM Importance".
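The bookkeeping behind this ranking is simple and can be sketched as follows, assuming the list of splits made during training (split variable and FOM improvement) is available; the Split record below is an illustrative stand-in for whatever the tree grower stores internally, not an SPR type.

```cpp
// Sketch of "FOM Importance": accumulate the FOM gain of every split,
// indexed by the variable the split was made on.
#include <cstddef>
#include <vector>

struct Split {
  std::size_t variable;   // index of the variable used by this decision split
  double fomImprovement;  // improvement of the optimized FOM due to the split
};

std::vector<double> fomImportance(const std::vector<Split>& splits,
                                  std::size_t nVariables) {
  std::vector<double> importance(nVariables, 0.0);
  for (const Split& s : splits)
    importance[s.variable] += s.fomImprovement;  // sum over all splits
  return importance;                             // larger sum => more important
}
```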

Another algorithm for VS implemented in SPR sorts input variables by their interaction with the rest of the variables ("Interactions"). The interaction between two variable subsets $S_1$ and $S_2$ is defined as

$$r = \frac{\sum_i \left(f(S_{1i}) - \bar{f}(S_1)\right)\left(f(S_{2i}) - \bar{f}(S_2)\right)}{\sqrt{\sum_i \left(f(S_{1i}) - \bar{f}(S_1)\right)^2}\,\sqrt{\sum_i \left(f(S_{2i}) - \bar{f}(S_2)\right)^2}} \qquad (2)$$

where $f(S_1)$ and $f(S_2)$ are classifier responses at a given point integrated over all the variables not included in subsets $S_1$ and $S_2$, respectively, and $\bar{f}(S_1)$ and $\bar{f}(S_2)$ are the means of $f(S_1)$ and $f(S_2)$. The algorithm sorts variables by their interaction in descending order. In a stepwise manner, it chooses the variable interacting most with all other variables except those included in the optimal subset at earlier steps. The order in which the variables are selected gives the variable importance rank. This algorithm needs to compute $K(K-1)/2$ interactions, where $K$ is the dimensionality of the dataset. For datasets with many variables and a huge number of events, this could be very time consuming. To reduce the CPU time needed by the algorithm, the SPR implementation allows the user to choose the number of points, picked randomly from the dataset, used to calculate the interactions.
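Given the values $f(S_1)$ and $f(S_2)$ evaluated at a common set of sampled points (each already integrated over the variables outside the subset), the interaction of Eq. (2) is simply their correlation, as in the sketch below. Computing the integrated classifier responses themselves is classifier-specific and omitted; the function name is ours.

```cpp
// Sketch of the interaction measure of Eq. (2): correlation of the centered
// responses f(S1) and f(S2) over the sampled points.
#include <cmath>
#include <cstddef>
#include <vector>

double interaction(const std::vector<double>& fS1, const std::vector<double>& fS2) {
  const std::size_t n = fS1.size();
  double m1 = 0, m2 = 0;
  for (std::size_t i = 0; i < n; ++i) { m1 += fS1[i]; m2 += fS2[i]; }
  m1 /= n; m2 /= n;                              // means of f(S1) and f(S2)
  double num = 0, d1 = 0, d2 = 0;
  for (std::size_t i = 0; i < n; ++i) {
    num += (fS1[i] - m1) * (fS2[i] - m2);
    d1  += (fS1[i] - m1) * (fS1[i] - m1);
    d2  += (fS2[i] - m2) * (fS2[i] - m2);
  }
  return num / (std::sqrt(d1) * std::sqrt(d2));  // Eq. (2)
}
```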

A well-known wrapper VS method is Forward Stepwise Addition (FSA) [22]. This method, given the selected FOM, adds one variable at a time, trains the classifier, computes the FOM, and chooses the variable with the greatest improvement in the FOM. This addition continues as long as it is possible to improve the FOM. A disadvantage of FSA, as implemented in Weka and R, is that an added variable cannot be removed. For instance, if a variable is part of the best subset with M variables but is not part of the best subset with M+1 variables, FSA is not able to find the optimal subset with M+1 variables.

SPR overcomes the disadvantage of FSA by implementing a more flexible generalization of the forward stepwise addition method, "Add n Rem r". At each step the subset with the smallest loss on the test set is selected by adding n and removing r variables, where the values of "n" and "r" are chosen by the user.
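A schematic version of this "Add n Remove r" search is sketched below. The training of the classifier and the evaluation of the test-set loss for a candidate subset are hidden behind a single callback, and the stopping rule shown (stop once the loss no longer improves) is one reasonable choice rather than the exact SPR criterion.

```cpp
// Sketch of generalized sequential forward selection ("Add n Remove r").
#include <cstddef>
#include <functional>
#include <limits>
#include <set>

// Trains a classifier on the given variable subset and returns its test-set loss.
using SubsetLoss = std::function<double(const std::set<std::size_t>&)>;

std::set<std::size_t> addNRemoveR(std::size_t nVariables, int n, int r,
                                  const SubsetLoss& loss) {
  std::set<std::size_t> current;
  double bestLoss = std::numeric_limits<double>::max();
  while (current.size() < nVariables) {
    // Greedily add n variables, one at a time.
    for (int step = 0; step < n && current.size() < nVariables; ++step) {
      std::size_t bestVar = nVariables;
      double best = std::numeric_limits<double>::max();
      for (std::size_t v = 0; v < nVariables; ++v) {
        if (current.count(v)) continue;
        current.insert(v);
        const double l = loss(current);
        if (l < best) { best = l; bestVar = v; }
        current.erase(v);
      }
      current.insert(bestVar);
    }
    // Greedily remove r variables, possibly dropping ones added at earlier steps.
    for (int step = 0; step < r && current.size() > 1; ++step) {
      std::size_t worstVar = nVariables;
      double best = std::numeric_limits<double>::max();
      for (std::size_t v : std::set<std::size_t>(current)) {
        current.erase(v);
        const double l = loss(current);
        if (l < best) { best = l; worstVar = v; }
        current.insert(v);
      }
      current.erase(worstVar);
    }
    const double l = loss(current);
    if (l >= bestLoss) break;   // stop once the loss no longer improves
    bestLoss = l;
  }
  return current;
}
```

With n = 2 and r = 1 this reproduces the "Add 2 Rem 1" configuration used throughout this paper.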

3. Tests of variable selection methods

We evaluate several methods implemented in the statistical packages SPR, R, and Weka. The datasets used in this analysis are taken from the machine learning repository of the University of California at Irvine (UCI) [23]. We include one astrophysics dataset and four datasets related to medical research. We could have as well performed this study using HEP data. However, HEP data are proprietary and rarely shared among HEP experiments. Moreover, well-documented applications of SML algorithms to HEP data are hard to find. In contrast, datasets at the UCI repository are publicly available and have been thoroughly studied from the SML perspective. The methodology described in this study can be applied to HEP data in exactly the same way.

If a dataset has more than 1000 events, it is split into a training set and a test set in the proportion 2:1. A classifier is optimized using the training set and the performance is evaluated using a Receiver Operating Characteristic (ROC) curve on the test set, where the ROC is a graphical plot of true signal rate (y-axis) vs. false signal rate (x-axis). The larger the area under the ROC, the better the classifier performance. For datasets with fewer than 1000 events, 10-fold cross-validation is used to train the classifier and estimate its accuracy. That is, if we take k as the number of cross-validation folds, the dataset is split into k folds. Then k-1 subsets are used for training and the remaining one is used for testing. This is done k times, with each of the k subsets used exactly once as a test set. At the end, the FOM is obtained by averaging FOMs from all the folds.
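The area under the ROC curve used as a FOM above can be computed from the classifier scores on the test set as in the following sketch, which sweeps a threshold over the sorted scores and accumulates the area with the trapezoid rule; the function name and interface are ours, not SPR's.

```cpp
// Sketch: area under the ROC curve for signal (label 1) vs background (label 0).
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

double rocArea(const std::vector<double>& score, const std::vector<int>& label) {
  std::vector<std::size_t> idx(score.size());
  std::iota(idx.begin(), idx.end(), 0);
  std::sort(idx.begin(), idx.end(),
            [&](std::size_t a, std::size_t b) { return score[a] > score[b]; });
  double nSig = 0, nBkg = 0;
  for (int y : label) (y == 1 ? nSig : nBkg) += 1;
  double tp = 0, fp = 0, area = 0, prevFpr = 0, prevTpr = 0;
  for (std::size_t i : idx) {
    if (label[i] == 1) tp += 1; else fp += 1;
    const double tpr = tp / nSig;   // true signal rate (y-axis)
    const double fpr = fp / nBkg;   // false signal rate (x-axis)
    area += 0.5 * (fpr - prevFpr) * (tpr + prevTpr);  // trapezoid rule
    prevFpr = fpr; prevTpr = tpr;
  }
  return area;
}
```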

To check how sensitive the results obtained by the same algorithm are to the number of cross-validation folds, we also run 3-fold cross-validation. For any dataset, results from 3- and 10-fold cross-validation are consistent within statistical errors. This is why we show only results obtained by 10-fold cross-validation. The error on the classification accuracy estimate obtained by cross-validation is computed using

$$\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(a_i - \bar{a}\right)^2} \qquad (3)$$

where $a_i$ is the classification accuracy for the test subset $i$, $\bar{a}$ is the average accuracy, and $N$ is the number of folds. This estimate is biased [24], because it does not account for correlations between the splits.
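The cross-validation procedure and the error estimate of Eq. (3) can be summarized in a short sketch, with training and accuracy evaluation abstracted into a callback; events are assumed to be in random order, and the names are illustrative.

```cpp
// Sketch of k-fold cross-validation returning the mean accuracy and Eq. (3).
#include <cmath>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Trains on the given event indices and returns the accuracy on the test indices.
using FoldAccuracy = std::function<double(const std::vector<std::size_t>& train,
                                          const std::vector<std::size_t>& test)>;

std::pair<double, double> crossValidate(std::size_t nEvents, std::size_t k,
                                        const FoldAccuracy& run) {
  std::vector<double> acc;
  for (std::size_t fold = 0; fold < k; ++fold) {
    std::vector<std::size_t> train, test;
    for (std::size_t i = 0; i < nEvents; ++i)
      (i % k == fold ? test : train).push_back(i);  // assign event i to a fold
    acc.push_back(run(train, test));
  }
  double mean = 0;
  for (double a : acc) mean += a;
  mean /= k;
  double var = 0;                    // spread of the fold accuracies, Eq. (3)
  for (double a : acc) var += (a - mean) * (a - mean);
  const double err = std::sqrt(var / (k - 1));
  return {mean, err};
}
```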

For the "Interactions" method, we choose 500 points for integration for datasets with more than 500 events and all points for datasets with fewer than 500 events.

For each dataset, we find the best classifier among those available in SPR by comparing their predictive power. If several classifiers show comparable performance, we choose the fastest.

We briefly describe here some of the VS methods used for comparison in this study.

The RuleFit [25] classifier is constructed as a linear combination of rules. Every rule represents a node of a decision tree and is determined by the sequence of splits from the root of the tree to this node. To maximize classifier performance, different weights are given to different rules. RuleFit assigns a high importance rank to a variable if this variable frequently appears in the most important rules. Rule importance is estimated by the change in the classifier performance when that rule is removed from the predictive model.

FOS-MOD is a filter VS method. It selects variables in a stepwise manner by estimating how efficient each variable is at representing all other variables. Dependencies among variables are estimated by computing an average correlation between each variable and all others. At each step, the predictive power of the subset is estimated by using a linear combination of the selected variables. Then the error reduction given by the last variable added is calculated. The search ends when the summed error reduction of all selected variables reaches a certain threshold [26]. The rationale is to remove all variables that contain redundant information.

ReliefF is a filter VS method. The idea behind ReliefF is simple: if nearest neighbors from the same class share a similar value of a given variable, this variable is relevant; otherwise it is not. ReliefF selects events at random and finds their nearest neighbors from the same class and from other classes. Then it ranks variables according to how well their values discriminate among near events [27].
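A simplified, two-class, one-nearest-neighbour version of this idea is sketched below; the full ReliefF algorithm of Ref. [27] additionally samples events at random, uses several neighbours per class, and handles more than two classes.

```cpp
// Relief-style sketch: reward a variable when it separates an event from its
// nearest miss (other class), penalise it when it differs from the nearest hit.
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Event = std::vector<double>;

static double dist2(const Event& a, const Event& b) {
  double d = 0;
  for (std::size_t k = 0; k < a.size(); ++k) d += (a[k] - b[k]) * (a[k] - b[k]);
  return d;
}

std::vector<double> reliefWeights(const std::vector<Event>& x,
                                  const std::vector<int>& y) {
  const std::size_t n = x.size(), nVar = x[0].size();
  std::vector<double> w(nVar, 0.0);
  for (std::size_t i = 0; i < n; ++i) {
    std::size_t hit = n, miss = n;
    double dHit = std::numeric_limits<double>::max(), dMiss = dHit;
    for (std::size_t j = 0; j < n; ++j) {
      if (j == i) continue;
      const double d = dist2(x[i], x[j]);
      if (y[j] == y[i] && d < dHit)  { dHit = d;  hit = j; }
      if (y[j] != y[i] && d < dMiss) { dMiss = d; miss = j; }
    }
    if (hit == n || miss == n) continue;   // need both a hit and a miss
    for (std::size_t k = 0; k < nVar; ++k)
      w[k] += std::fabs(x[i][k] - x[miss][k]) - std::fabs(x[i][k] - x[hit][k]);
  }
  return w;   // larger weight => more relevant variable
}
```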

FCBF stands for "fast correlation-based filter". It operates in two steps. First, it sorts variables by their correlation with the class label in descending order and keeps only variables with correlation values above a certain threshold. Then FCBF goes through the list obtained in the first step. It accepts every variable whose correlation with the class label exceeds its correlation with any variable accepted so far and rejects the variable otherwise [28].
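The two-step logic just described can be sketched as follows, with the correlation measure abstracted into callbacks; the original algorithm of Ref. [28] uses symmetrical uncertainty rather than a plain correlation, and the function names here are ours.

```cpp
// Sketch of the two-step FCBF selection logic.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// corrWithLabel(k): correlation of variable k with the class label.
// corrBetween(j, k): correlation between variables j and k.
std::vector<std::size_t> fcbfSelect(
    std::size_t nVariables, double threshold,
    const std::function<double(std::size_t)>& corrWithLabel,
    const std::function<double(std::size_t, std::size_t)>& corrBetween) {
  // Step 1: keep relevant variables and sort them by correlation with the label.
  std::vector<std::size_t> candidates;
  for (std::size_t k = 0; k < nVariables; ++k)
    if (corrWithLabel(k) >= threshold) candidates.push_back(k);
  std::sort(candidates.begin(), candidates.end(),
            [&](std::size_t a, std::size_t b) { return corrWithLabel(a) > corrWithLabel(b); });
  // Step 2: reject variables dominated by an already-accepted variable.
  std::vector<std::size_t> selected;
  for (std::size_t k : candidates) {
    bool redundant = false;
    for (std::size_t s : selected)
      if (corrBetween(s, k) >= corrWithLabel(k)) { redundant = true; break; }
    if (!redundant) selected.push_back(k);
  }
  return selected;
}
```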

3.1. Magic gamma-ray telescope

This simulated dataset involves a binary classification task with 10 variables and 19 020 events (12 332 signal events and 6688 background events). Performance of several classifiers on these data has been studied in Ref. [29].

We use an RF classifier with 200 trees, at least 1 event per leaf, and with each split chosen among three variables randomly selected at each node. To find important variables, we run "Add 2 Rem 1", adding two and removing one variable every time. The best subset with three variables is obtained by adding variables "Size" and "Width" and removing "Length", a variable that was added at an earlier step of the forward selection. At all the other steps, the removed variable is one of the two added at the same step. Quadratic classification loss versus the number of variables in the model is plotted in Fig. 1.

Fig. 2 shows the entire ROC curves corresponding to configurations with 6 and all 10 variables (left) and a blow-up of the region below 0.01 and 0.4 of background and signal acceptance, respectively (right).

We conclude that the optimal predictive power is obtained using only four variables and use 6 to be conservative. The two ROC curves, for 6 and 10 variables, are in statistical agreement.

We then test 5 other VS methods implemented in SPR: "Add 1 Rem 0", Interactions, Permutations, FOM Importance, and Correlations. We take the best 6 variables from each method. In this case "Add 1 Rem 0" selects the same 6 variables as "Add 2 Rem 1", while the other methods select different subsets with 6 variables. The ROC curves for the best 6 variables obtained by "Add 2 Rem 1", by the Interactions method, and by the FOM Importance method (the two methods with the worst results) are shown in Fig. 3. In this analysis, all methods except FOM Importance give good results.

Fig. 1. Magic telescope dataset: quadratic classification loss as a function of the best subset dimensionality. The best 6 variables selected are: alpha, length, size, width, dist, and conc.

Fig. 2. Magic telescope dataset: ROC distributions for the random forest classifier with 10 and 6 variables. A logarithmic axis is used to magnify the low-acceptance region (left). Blow-up of the ROC curve near the origin (right).

Fig. 3. Magic telescope dataset: ROC curves for the best 6 out of 10 variables for "Add 2 Rem 1", Interactions, and FOM Importance. The ROC curves for all other methods lie between the curves for Interactions and "Add 2 Rem 1". A logarithmic axis is used to magnify the low-acceptance region (left). Blow-up of the ROC curve near the origin (right).

Bock et al. [29] define LOACC as the average SA for BA = 0.01, 0.02, and 0.05, HIACC as the average SA for BA = 0.1 and 0.2, and $Q = SA/\sqrt{BA}$ at SA = 0.5, where SA is the signal acceptance and BA is the background acceptance. In Table 1, the results obtained by an SPR implementation of RF are compared with those from Ref. [29] obtained by Breiman's FORTRAN code [30].

Table 1. Magic telescope dataset: summary of results in this analysis and in Ref. [29].

FOM      FORTRAN   SPR
LOACC    0.452     0.456 ± 0.006
HIACC    0.852     0.837 ± 0.010
Q        2.8       2.91 ± 0.18

3.2. Cardiac arrhythmia

The task of this analysis is to classify a patient into one of the 16 classes of cardiac arrhythmia. We use a slightly modified version of the dataset [31]. After removing 17 variables whose values are zero for each event and removing the 14th variable which has a high incidence of missing values, the final dataset has 261 variables, 12 classes (four classes have no events in the original dataset), and 420 events. We consider the accuracy of 10-fold cross-validation as a measure of the predictive power of the learning algorithm. We use the Allwein-Schapire-Singer algorithm [32], which reduces a multi-class problem to a series of binary problems. The binary classifier is a DT with at least 15 events per node. The Allwein-Schapire-Singer algorithm uses an indicator matrix of size C x L (C rows and L columns), where C is the number of classes and L is the number of binary classifiers. Elements of the indicator matrix can take only three values: 1, -1 and 0. In each column, 1 represents signal, -1 represents background, and 0 means that the corresponding class is not considered for this binary classification problem. The indicator matrix can be built in a number of ways.



In one approach, for instance, each class is compared to all others (One-vs-All). This means that the indicator matrix has C binary classifiers. For each binary classifier, one element is equal to 1 and all others are equal to -1. In another approach, every binary classifier uses a pair of classes and the indicator matrix has C(C-1)/2 columns. Each column has one 1, one -1, and all other elements are zeros. Both the One-vs-All and One-vs-One approaches are implemented in SPR. SPR also lets the user specify an arbitrary indicator matrix.

To classify a test event, we compute the quadratic loss for each row of the indicator matrix with weighted contributions from the constructed binary classifiers:

$$D_c = \frac{\sum_{l=1}^{L} I(I_{cl}\neq 0)\, w_l \left(I_{cl} - f(x_l)\right)^2}{\sum_{l=1}^{L} w_l} \qquad (4)$$

Each of the L classifiers is used to compute the response $f(x_l)$ for a given event, with $l = 1, 2, \ldots, L$. Above, $w_l$ is the relative weight for each classifier, $I_{cl}$ is an element of the indicator matrix (c-th row and l-th column), and $I(I_{cl}\neq 0)$ is an indicator function equal to 1 if the indicator expression is true and 0 otherwise. The loss is evaluated for each row of the indicator matrix and an event is classified into class c if $D_c$ is the smallest loss value across all rows.
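The decision rule of Eq. (4) reduces to a few lines once the indicator matrix, the binary classifier responses, and the weights are available; the sketch below is illustrative and not the SPR code.

```cpp
// Sketch of the multi-class decision rule of Eq. (4).
#include <cstddef>
#include <vector>

std::size_t classify(const std::vector<std::vector<int>>& indicator,  // C x L, entries -1/0/1
                     const std::vector<double>& weights,              // w_l, size L
                     const std::vector<double>& responses) {          // f(x_l), size L
  std::size_t bestClass = 0;
  double bestLoss = 0;
  for (std::size_t c = 0; c < indicator.size(); ++c) {
    double num = 0, den = 0;
    for (std::size_t l = 0; l < responses.size(); ++l) {
      den += weights[l];
      if (indicator[c][l] == 0) continue;        // class not used by classifier l
      const double d = indicator[c][l] - responses[l];
      num += weights[l] * d * d;                 // w_l (I_cl - f(x_l))^2
    }
    const double loss = num / den;               // D_c of Eq. (4)
    if (c == 0 || loss < bestLoss) { bestLoss = loss; bestClass = c; }
  }
  return bestClass;
}
```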

In this analysis we test all three approaches: One-vs-All, One-vs-One, and a user-configured matrix. We obtain the best accuracy by building our own matrix. With the One-vs-All approach, the results are 2-3% worse than those obtained with the user-configured matrix. With the One-vs-One approach, the results are about 10% worse. The user-configured matrix shown in Table 2 first separates class "Normal" (the most numerous class) from the classes that are most likely to be misclassified as "Normal" and then separates the remaining classes among themselves.

The 10-fold cross-validation accuracy for the model with all 261 variables is 75.95%. We then run the "Add 2 Rem 1" method to search for the optimal subset of variables.


Table 2. Indicator matrix used for the arrhythmia dataset.

Row/col    1   2   3   4   5   6   7   8   9  10  11  12  13  14
1          1   0  -1   1  -1  -1  -1  -1  -1  -1  -1  -1  -1   0
2         -1   1   1   0  -1  -1  -1  -1  -1  -1  -1  -1   0   0
3          0   0   0   0   1  -1  -1  -1  -1  -1  -1  -1  -1   0
4         -1   0   0   0  -1   1  -1  -1  -1  -1  -1  -1  -1   1
5         -1   0   0   0  -1  -1   1  -1  -1  -1  -1  -1  -1   0
6          0   0   0   0  -1  -1  -1   1  -1  -1  -1  -1  -1   0
7          0   0   0   0  -1  -1  -1  -1   1  -1  -1  -1  -1  -1
8          0   0   0   0  -1  -1  -1  -1   1  -1  -1  -1  -1   0
9          0   0   0   0  -1  -1  -1  -1  -1   1  -1  -1  -1   0
10         0   0   0  -1  -1  -1  -1  -1  -1  -1   1  -1  -1   0
14        -1   0   0   0  -1  -1  -1  -1  -1  -1  -1   1   0   0
16        -1  -1   0   0  -1  -1  -1  -1  -1  -1  -1  -1   1   0
W          1   7   2   5   1   1   1   1   1   1   1   1   1   5

Rows are classes and columns are binary classifiers. Matrix elements can take only three values: -1, 0 and 1, where 1 means that the corresponding class is treated as signal, -1 as background, and 0 means that this class does not participate in this binary problem. The last row represents weights given to the binary classifiers.

Classification error is plotted against the number of variables included in the best subset in Fig. 4.

Fig. 4. Arrhythmia dataset: classification error versus best subset dimensionality. The best 11 variables selected are: AVFA.Q, AVRW.PDiph, DIIIA.JJ, DIIIW.Defl, HeartRate, V1A.Rapex, V3A.Q, V3A.QRSA, V3A.T, V6A.T, and V6W.R.

Table 3 displays the order in which the variables are added and removed and the corresponding classification loss at each step. Often, at a given step, two new variables are added and one variable, which was part of the best subset with lower dimensionality, is removed. That is, for a dataset with many variables the advantage of a generalized forward addition selection, which takes into account the combined effect of several variables, is more evident.

Table 3. Arrhythmia dataset: part of the output of "Add 2 Rem 1" showing which variables are added and removed at every step.

Variable      Add(+)/Rem(-)   Loss
HeartRate          +          0.37381
V1W.Rapex          +          0.314286
V1W.Rapex          -          0.378571
V1A.Rapex          +          0.316667
V3A.Q              +          0.278571
V3A.Q              -          0.314286
AVRA.T             +          0.27619
V3A.R              +          0.242857
V3A.R              -          0.297619
V3W.R              +          0.245238
V1A.QRSA           +          0.235714
V1A.QRSA           -          0.269048
V3A.T              +          0.235714
DIIW.Q             +          0.22619
V3A.T              -          0.242857
V3W.S              +          0.213889
DIW.TRag           +          0.216667
V3W.S              -          0.233333
AVRW.TDiph         +          0.216667
V3W.S              +          0.202186
DIW.TRag           -          0.221429
DIW.PDiph          +          0.206612
V1A.Q              +          0.195055
AVRW.TDiph         -          0.228571
DIIA.P             +          0.207182
V3A.QRSTA          +          0.217033
V1A.Q              -          0.233333
DIIIW.Defl         +          0.211111
V2W.R              +          0.209945
DIIIW.Defl         -          0.219048
V4A.P              +          0.211538
V4W.R              +          0.211905
V4W.R              -          0.230952



We use the estimates of the classification error obtained by the "Add 2 Rem 1" algorithm and shown in Table 3 to choose the best variable subset. Therefore, we cannot use these numbers as unbiased estimates of the classification error. To obtain unbiased estimates, we retrain the classifier on the 3-, 4- and 11-variable subsets using new random seeds for the 10-fold cross-validation. The best accuracy is obtained using 11 variables. The accuracy obtained with four variables is not statistically different at the 0.05 level from the model with all 261 variables.

We also analyze the performance of the VS algorithms FCBF, ReliefF, and CFS-SF [28]. As a classifier, we use a C4.5 decision tree [33] with at least 15 events per node, the same as in our previous analysis with SPR. All these algorithms are used in the Weka environment [10].

In the top section of Table 4, the accuracy for the best 11 variables from "Add 2 Rem 1" is compared with that for the best 11 variables from "Add 1 Rem 0" and the best three and four variables from "Add 2 Rem 1", as well as with the accuracy of the full model. In the 2nd section from the top, results of the VS methods implemented in Weka are compared with those for the 11-variable subset from "Add 2 Rem 1". In the 3rd section from the top, we give published results [28] obtained by applying the algorithms from Weka to the same dataset. Our results using ReliefF, CFS-SF and FCBF are in good agreement with the published results. In the bottom section of the table, we compare our results with those obtained by the FOS-MOD method with the k-nearest-neighbor (k-NN) algorithm [26].

We use a 10-fold cross-validation paired t test [34] to compare the accuracies obtained by various methods. This test is widely used in machine learning applications and has the advantage of giving results independent of the specific splitting used to separate training and test sets. To perform this test, it is necessary to know not only the average accuracy of 10-fold cross-validation, but also the accuracy of each of the 10 splits. Since the published results report only the average accuracy, we can calculate the p-values only for the results obtained by SPR and Weka.

Results of these tests are shown in Table 4. "Add 2 Rem 1" and the multi-class algorithm implemented in SPR give the best accuracy.

Table 4. Arrhythmia dataset: accuracy and p-value obtained using different VS methods.

Variable set              Accuracy (%)    p-Value
Best 11 Add 2 Rem 1       80.95
Best 11 Add 1 Rem 0       76.19           0.49 (-)
Best 3 Add 2 Rem 1        71.90           0.04 (-)
Best 4 Add 2 Rem 1        75.24           0.22 (-)
All 261 SPR               75.95           0.10 (-)
Best 25 ReliefF           72.14           0.20 (-)
Best 25 CFS-SF            73.33           0.54 (-)
Best 5 FCBF               69.05           0.01 (-)
Full Dataset (*)          68.80
Best 25 ReliefF (*)       68.80
Best 25 CFS-SF (*)        69.02
Best 5 FCBF (*)           71.47
Full Dataset 95-NN (†)    56.92 ± 7.70
Best 96 95-NN (†)         56.92 ± 7.70
Full Dataset 7-NN (†)     65.38 ± 7.20
Best 96 5-NN (†)          63.65 ± 4.39

First two sections from the top: best 11 variables selected by "Add 2 Rem 1" compared with different models (see text for details). (-) indicates that the best 11 variables model is better than the model it is compared with. A p-value smaller than or equal to 0.05 indicates that the model with 11 variables is statistically better at the 0.05 level. Results marked (*) are from Ref. [28]. Results marked (†) are from Ref. [26].

3.3. WDBC

The Wisconsin Diagnostic Breast Cancer (WDBC) dataset has 569 events and 30 variables and involves a binary classification task. The goal is to predict whether a tumor is benign or malignant. We use a random forest classifier with 50 trees, at least two events per leaf, and 23 variables randomly chosen for each DT. The accuracy is evaluated using 10-fold cross-validation.

VS is carried out using the "Add 2 Rem 1", "Add 1 Rem 0", Correlations, Interactions, and FOM Importance methods implemented in SPR, as well as the RuleFit method implemented in R [9].

According to "Add 2 Rem 1", three variables are enough for this problem (see Fig. 5). At each of the first three steps, the variable removed is one of the two added at the same step.

Fig. 5. WDBC dataset: classification error versus best subset dimensionality. The best three variables selected are: WorstArea, WorstSmoothness, and WorstConcavity.

Accuracies obtained by the various methods with the three best variables are compared in Table 5, including the FOS-MOD results [26]. For each method, the accuracy is consistent within statistical error with that obtained using all variables. "Add 1 Rem 0" and "Add 2 Rem 1" select different subsets, with "Add 1 Rem 0" giving a slightly worse accuracy.

"Add 2 Rem 1" with three variables outperforms the FOS-MOD method with 13 variables.

Table 5. WDBC dataset: classification accuracy for different variable sets.

Variable set                  Accuracy (%)
Full Dataset Random Forest    96.25 ± 3.09
Full Dataset RuleFit          95.78
Best 3 Add 2 Rem 1            96.07 ± 2.77
Best 3 Add 1 Rem 0            94.89 ± 4.29
Best 3 Correlations           93.10 ± 2.65
Best 3 Permutations           96.21 ± 2.63
Best 3 Interactions           93.21 ± 3.65
Best 3 RuleFit                93.85
Best 3 FOM Importance         91.25 ± 3.20
Full Dataset (†)              97.94 ± 1.67 (5-NN)
13 Variables (†)              97.04 ± 1.65 (7-NN)

The RuleFit method implemented in R does not compute statistical errors. Results marked (†) are from Ref. [26].


3.4. WBC

The Wisconsin Breast Cancer (WBC) dataset is similar to the WDBC dataset. It has 699 events (458 benign and 241 malignant samples) and nine variables. We use an AdaBoost classifier with 900 binary splits, 100 per variable. We compare the VS methods "Add 2 Rem 1", "Add 1 Rem 0", Correlations, and Interactions from SPR, RuleFit from R, and FOS-MOD. As shown in Fig. 6, "Add 2 Rem 1" selects three variables without losing predictive power. At each of the first three steps, "Add 2 Rem 1" removes one of the two variables added at the same step. The FOS-MOD method needs to select at least four variables if no significant loss of predictive power is required.

Fig. 6. WBC dataset: classification error versus best subset dimensionality. The best three variables are: Uniformity of Cell Shape, Normal Nucleoli, and Bland Chromatin.

We then take the best three variables from "Add 2 Rem 1", "Add 1 Rem 0" (same variables as "Add 2 Rem 1"), Correlations, Interactions, and RuleFit. For each method, the accuracy is not statistically different from the one obtained using all nine variables. In this dataset, which is particularly simple for a binary classification task, all methods give similar results, even when selecting different variables, as shown in Table 6.

Table 6. WBC dataset: accuracy with different variable sets.

Variable set                          Accuracy (%)
Full Dataset Boosted Decision Split   96.24 ± 2.59
Full Dataset RuleFit                  95.82
Best 3 Add 2 Rem 1                    95.77 ± 1.71
Best 3 Add 1 Rem 0                    95.77 ± 1.71
Best 3 Correlations                   95.28 ± 2.54
Best 3 Permutations                   95.72 ± 2.98
Best 3 Interactions                   94.85 ± 3.03
Best 3 RuleFit                        95.61
Full Dataset (†)                      98.16 ± 2.03 (5-NN)
Four variables (†)                    97.42 ± 2.16 (15-NN)

The RuleFit method implemented in R does not compute statistical errors. Results marked (†) are from Ref. [26].


3.5. Colic Horse data

The Colic Horse dataset has 368 events and 22 variables. The task is to predict whether the lesion is surgical or non-surgical. This dataset has a high number of missing values, which are replaced by the respective median values observed in this dataset. We use a DT with at least 15 events per leaf.


The output of "Add 2 Rem 1" (Fig. 7) shows that two variables are enough to obtain an accuracy comparable to the one obtained with all 22 variables. At each of the first two steps, "Add 2 Rem 1" removes one of the two variables added at the same step.

Fig. 7. Colic dataset: classification error versus best subset dimensionality. The two selected variables are: Surgery and Abdominal Distension.

We then evaluate the performance of the VS methods by measuring the classification accuracy for the two most important variables obtained by each VS method. In this comparison, the minimum number of events per leaf of the SPR tree is taken equal to 1. In Table 7 we show the results obtained by the SPR algorithms (top section of the table) together with published results obtained using various VS methods with the classifiers ID3 [7] (2nd section of the table from the top) and C4.5 [35,36] (3rd and bottom sections of the table). Again, we observe the superior performance of the "Add 2 Rem 1" method, which selects fewer variables than any other published result with a similar accuracy. The "Correlations" method gives a high classification accuracy too, while "FOM Importance" obtains the worst result among the SPR methods.


Table 7. Colic dataset: method, accuracy, and number of selected variables.

Method                     Accuracy (%)    No. of sel. var.
All                        85.85 ± 5.25
Add 2 Rem 1                85.85 ± 5.25    2
Add 1 Rem 0                85.85 ± 5.25    2
Correlations               82.05 ± 5.44    2
Permutations               78.59 ± 7.33    2
Interactions               77.43 ± 4.87    2
FOM Importance             61.36 ± 4.21    2
ID3                        81.52 ± 2.0     17.4
ID3 HC-FSS                 83.15 ± 1.1     2.8
ID3 BFS-FSS Forward        82.07 ± 1.5     3.4
ID3 BFS-FSS Backward       82.61 ± 1.7     7.2
C4.5 GA-Wrapper            82.4            13
C4.5 ReliefF-GA-Wrapper    83.8            10
C4.5 ReliefF               85.3            20
C4.5 All                   85.3            22
C4.5-RelFss                85.9            4
C4.5-RelFss'               84.5            12

Classification accuracy obtained by the VS methods implemented in SPR is compared with published results obtained using various VS methods. "Add 2 Rem 1" and "Add 1 Rem 0" select the same variables in this case. A number of selected variables with a decimal point is an average over the 10-fold cross-validation.



3.6. CPU time

In Table 8 we show the CPU time consumed by each algorithm. The SPR implementation of the "Add n Remove r" algorithm can monitor, step by step, the addition and removal of variables, as well as the current classification loss. This feature lets the user select the best subset before the algorithm finishes processing all variables. This proved particularly useful for the arrhythmia dataset with hundreds of variables. For the MAGIC analysis, we use a CPU-expensive classifier, an RF with 200 trees and one event per leaf, in conjunction with a big dataset. All these experiments are conducted on an Intel Dual-Core Xeon 2.8 GHz with 1 GB RAM.

Table 8. CPU elapsed real time in seconds (s) to select the best variable subset.

Dataset      Events   Classes   Sel./Tot. variables   Program   Method           Time (s)
Magic        19 020   2         6/10                  SPR       Add 2 Rem 1      14 283
                                                                 Add 1 Rem 0      5840
                                                                 Correlations     2.53
                                                                 Permutations     178
                                                                 Interactions     21 658
                                                                 FOM Importance   125
Arrhythmia   420      12        11/261                SPR       Add 2 Rem 1      2660
                                                                 Add 1 Rem 0      1150
                                5/261                 Weka      FCBF             2.11
                                25/261                          ReliefF          9.01
                                25/261                          CFS-SF           2.65
WDBC         569      2         3/30                  SPR       Add 2 Rem 1      472
                                                                 Add 1 Rem 0      269
                                                                 Correlations     0.250
                                                                 Permutations     0.637
                                                                 Interactions     440
                                                                 FOM Importance   1.02
                                                      R         Rulefit          2.81
WBC          699      2         3/9                   SPR       Add 2 Rem 1      12.7
                                                                 Add 1 Rem 0      4.90
                                                                 Correlations     0.078
                                                                 Permutations     2.04
                                                                 Interactions     1400
                                                      R         Rulefit          2.24
Colic        368      2         2/22                  SPR       Add 2 Rem 1      2.78
                                                                 Add 1 Rem 0      1.89
                                                                 Correlations     0.142
                                                                 Permutations     0.171
                                                                 Interactions     31
                                                                 FOM Importance   0.071

4. Summary

We have evaluated the performance of 5 VS methods implemented in the statistical package StatPatternRecognition using datasets hosted at the UCI repository. We have compared their performance with that of other methods described in the machine learning literature and available from other software packages. In every experiment, the wrapper method "Add 2 Rem 1" gives the best classification accuracy for variable sets of the same size; this effect is more significant for datasets with many variables. However, the wrapper approach tends to be slower than the filter and embedded methods, especially if it applies a CPU-intensive classifier to a large dataset. In general, the filter and embedded methods are often able to achieve good results, but their reliability is not guaranteed for every dataset.

References

[1] K. Cranmer, Statistics for the LHC: progress, challenges, and future, in: Proceedings of the PhyStat-LHC Workshop, Geneva, 27-29 June 2007, pp. 47-60, http://phystat-lhc.web.cern.ch/phystat-lhc/2008-001.pdf; B.P. Roe, et al., Nuclear Instruments and Methods A 543 (2005) 577.
[2] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics, 2001; A. Webb, Statistical Pattern Recognition, Wiley, New York, 2002.
[3] L. Breiman, J.H. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Wadsworth and Brooks, 1984.
[4] J.R. Quinlan, Machine Learning 1 (1986) 81.
[5] J.H. Friedman, Journal of Classification 23 (2006) 175.
[6] I. Guyon, et al., Journal of Machine Learning Research 3 (2003) 1157.
[7] R. Kohavi, G.H. John, Artificial Intelligence Journal 97 (1997) 273 (Special issue on Relevance).
[8] http://sourceforge.net/projects/statpatrec
[9] http://www.r-project.org
[10] http://www.cs.waikato.ac.nz/ml/weka
[11] R.A. Fisher, Annals of Eugenics 7 (1936) 179.
[12] J. Friedman, N. Fisher, Statistics and Computing 9 (1999) 123.
[13] I. Narsky, StatPatternRecognition: A C++ Package for Statistical Analysis of High Energy Physics Data, arXiv:physics/0507143v1, 20 July 2005.
[14] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, Englewood Cliffs, NJ, 1999.
[15] Y. Freund, R.E. Schapire, Journal of Computer and System Sciences 55 (1997) 119; J. Friedman, T. Hastie, R. Tibshirani, Annals of Statistics 28 (2) (2000) 337.
[16] L. Breiman, The Annals of Statistics 26 (1998) 801.
[17] L. Breiman, Machine Learning 26 (1996) 123; L. Lam, C.Y. Suen, IEEE Transactions on Systems, Man, and Cybernetics 27 (1997) 533.
[18] L. Breiman, Machine Learning 45 (2001) 5.
[19] See README in the package distribution [8].
[20] http://sf.net
[21] http://root.cern.ch
[22] R.B. Bendel, A.A. Afifi, Journal of the American Statistical Association 72 (1977) 46.
[23] http://www.ics.uci.edu/~mlearn/MLRepository.html; A. Asuncion, D.J. Newman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, 2007.
[24] Y. Bengio, Y. Grandvalet, Journal of Machine Learning Research 5 (2004) 1089.
[25] J.H. Friedman, B.E. Popescu, Predictive Learning via Rule Ensembles, Technical Report, Stanford University, 2005.
[26] H.-L. Wei, S.A. Billings, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (1) (2007) 162.
[27] M. Robnik-Sikonja, I. Kononenko, Machine Learning 53 (2003) 23.
[28] L. Yu, H. Liu, Journal of Machine Learning Research 5 (2004) 1205.
[29] R.K. Bock, et al., Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope, Nuclear Instruments and Methods A 516 (2004) 511.
[30] L. Breiman, FORTRAN Program Random Forests, Version 3.1, available at: http://oz.berkeley.edu/users/breiman
[31] S. Perkins, et al., Journal of Machine Learning Research 3 (2003) 1333.
[32] E. Allwein, R.E. Schapire, Y. Singer, Journal of Machine Learning Research 1 (2000) 113.
[33] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, CA, 1993.
[34] T.G. Dietterich, Neural Computation 10 (1998) 1895.
[35] L.-X. Zhang, et al., A novel hybrid feature selection algorithm: using ReliefF estimation for GA-Wrapper search, in: Proceedings of the Second International Conference on Machine Learning and Cybernetics, Xi'an, 2003.
[36] D.A. Bell, H. Wang, Machine Learning 41 (2000) 175.