FAST: A ROC-based Feature Selection Metric for Small Samples and Imbalanced Data Classification Problems

Xue-wen Chen
Department of Electrical Engineering and Computer Science
The University of Kansas
Lawrence, KS 66045, USA
[email protected]

Michael Wasikowski
Department of Electrical Engineering and Computer Science
The University of Kansas
Lawrence, KS 66045, USA
[email protected]

ABSTRACT
The class imbalance problem is encountered in a large number of practical applications of machine learning and data mining, for example, information retrieval and filtering, and the detection of credit card fraud. It has been widely realized that this imbalance raises issues that are either nonexistent or less severe compared to balanced-class cases and often results in a classifier's suboptimal performance. This is even more true when the imbalanced data are also high dimensional. In such cases, feature selection methods are critical to achieving optimal performance. In this paper, we propose a new feature selection method, Feature Assessment by Sliding Thresholds (FAST), which is based on the area under the ROC curve generated by moving the decision boundary of a single-feature classifier with thresholds placed using an even-bin distribution. FAST is compared to two commonly used feature selection methods, correlation coefficient and RELevance In Estimating Features (RELIEF), for imbalanced data classification. The experimental results obtained on text mining, mass spectrometry, and microarray data sets show that the proposed method outperformed both RELIEF and correlation methods on skewed data sets and was comparable on balanced data sets; when a small number of features is preferred, the classification performance of the proposed method was significantly improved compared to correlation- and RELIEF-based methods.

Categories and Subject Descriptors
I.5.2 [Pattern Recognition]: Design Methodology – feature evaluation and selection.

General Terms
Algorithms.

Keywords
Feature selection, imbalanced data classification, ROC.

1. INTRODUCTION
One of the greatest challenges in machine learning and data mining research is the class imbalance problem presented in real-world applications. The class imbalance problem refers to the issues that occur when a dataset is dominated by a class or classes that have significantly more samples than the other classes of the dataset. Imbalanced classes are seen in a variety of domains, and many have major economic, commercial, and environmental concerns. Some examples include text classification, risk management, web categorization, medical diagnosis/monitoring, biological data analysis, credit card fraud detection, and oil spill identification from satellite images.
While the majority of learning methods are designed for well-balanced training data, data imbalance presents a uniquely challenging problem for classifier design when the misclassification costs for the two classes are different (i.e., cost-sensitive classification); accordingly, the overall classification rate is not appropriate for evaluating performance. The class imbalance problem can hinder the performance of standard machine learning methods. For example, it is often possible to achieve high classification accuracy simply by classifying all samples as the majority class. Practical applications of cost-sensitive classification arise frequently, for example, in medical diagnosis [1], in agricultural product inspection [2], in industrial production processes [3], and in automatic target detection [4]. Analyzing imbalanced data thus requires different methods than those used in the past.
The majority of current research on the class-imbalance problem can be grouped into two categories: sampling techniques and algorithmic methods, as discussed in two workshops at the AAAI conference [5] and the ICML conference [6], and later in the sixth issue of SIGKDD Explorations (see, for example, a review by Weiss [7]). The sampling methods involve leveling the class samples so that they are no longer imbalanced. Typically, this is done by under-sampling the larger class [8-9], by over-sampling the smaller one [10-11], or by a combination of these techniques [12]. Algorithmic methods include adjusting the costs associated with misclassification so as to improve performance [13-15], shifting the bias of a classifier to favor the rare class [16-17], creating AdaBoost-like boosting schemes [18-19], and learning from one class [20].
The class imbalance problem is even more severe when the dimensionality is high. For example, in microarray-based cancer classification, the number of features is typically tens of thousands [21]; in text classification, the number of features in a


bag of words is often more than an order of magnitude larger than the number of training documents [22]. Both sampling techniques and algorithmic methods may fail for high dimensional class imbalance problems. Indeed, van der Putten and van Someren analyzed the COIL challenge 2000 datasets and concluded that, to overcome overfitting problems, feature selection is even more important than the choice of classification algorithm [23]. A similar observation was made by Forman for highly imbalanced data classification problems [22]. As pointed out by Forman, "no degree of clever induction can make up for a lack of predictive signal in the input space" [22]. This holds even for the SVM, which is engineered to work with hyper-dimensional datasets. Forman [22] found that the performance of the SVM could be improved by the judicious use of feature selection metrics. It is thus critical to develop effective feature selection methods for imbalanced data classification, especially if the data are also high dimensional.
While feature selection has been extensively studied [24-30], its importance to class imbalance problems in particular was only recently realized and has attracted increasing attention from the machine learning and data mining community. Mladenic and Grobelnik examined the performance of different feature selection metrics in classifying text mining data from the Yahoo hierarchy [31]. After applying one of nine different filters, they tested the classification power of the selected features using naïve Bayes classifiers. Their results showed that the best metrics choose common features and consider the domain and the learning machine's inherent characteristics. Forman found improved results with the use of multiple different metrics, but the best-performing results were those selected by metrics that focused primarily on the results of the minority class [22]. Zheng, Wu, and Srihari empirically tested different ratios of features indicating membership in a class versus features indicating lack of membership in a class [32]. This approach resulted in better accuracy compared to using one-sided metrics that solely score features indicating membership in a class and two-sided metrics that simultaneously score features indicating membership and lack of membership.

One common problem with the standard evaluation statistics used in previous studies, like information gain and odds ratios, is that they depend on the choice of the true positive (TP), false positive (FP), false negative (FN), and true negative (TN) counts. These counts are determined by a preset threshold. Consider imbalanced data classification with two different feature sets. The first feature set may yield a higher TP, but lower TN, than the second feature set. By varying the decision threshold, the second feature set may produce a higher TP and lower TN than the first feature set. Thus, a single threshold cannot tell us which feature set is better. This is an artifact of using a parametric statistic to evaluate a classifier's predictive power [33]. If we vary the classifier's decision threshold, we can compute these statistics for each threshold and see how they vary based on where the threshold is placed. A receiver operating characteristic, or ROC curve, is one such non-parametric measure of a classifier's power that compares the true positive rate with the false positive rate. While the ROC curve has been extensively used for evaluating classification performance in class imbalance problems, it has not been directly applied to feature selection. In this paper, we construct a new feature selection metric based on an ROC curve generated on optimal simple linear discriminants and select those features with the highest area under the curve as the most relevant. Unlike other feature selection metrics, which depend on one particular decision boundary, our metric evaluates features in terms of their performance on multiple decision hyperplanes and is more appropriate for class imbalance problems.

The rest of our paper is organized as follows. Section 2 provides a brief discussion of two commonly used filter methods: the correlation coefficient (CC) and RELevance In Estimating Features (RELIEF). In Section 3, we follow with a description of the proposed new method, Feature Assessment by Sliding Thresholds (FAST). In Section 4, we present results comparing the performance of linear support vector machine (SVM) and 1-nearest neighbor (1-NN) classifiers using features selected by each metric. These results are measured on two microarray, two mass spectrometry, and one text mining data sets. Finally, we give our concluding remarks in Section 5.

2. FEATURE SELECTION METHODS
In this section, we briefly review two commonly used feature selection methods, CC and RELIEF.

2.1 Correlation Coefficient
The correlation coefficient is a statistical test that measures the strength and quality of the relationship between two variables. Correlation coefficients can range from -1 to 1. The absolute value of the coefficient gives the strength of the relationship; absolute values closer to 1 indicate a stronger relationship. The sign of the coefficient gives the direction of the relationship: a positive sign indicates that the two variables increase or decrease with each other, and a negative sign indicates that one variable increases as the other decreases.

In machine learning problems, the correlation coefficient is used to evaluate how accurately a feature predicts the target independent of the context of other features. The features are then ranked based on the correlation score [25]. For problems where the covariance cov(X_i, Y) between a feature (X_i) and the target (Y) and the variances of the feature (var(X_i)) and target (var(Y)) are known, the correlation can be directly calculated:

R(i) = cov(X_i, Y) / sqrt(var(X_i) * var(Y))    (1)

Equation 1 can only be used when the true values of the covariance and variances are known. When these values are unknown, an estimate of the correlation can be made using Pearson's product-moment correlation coefficient over a sample of the population (x_i, y). This formula only requires finding the mean of each feature and of the target:

R(i) = sum_{k=1..m} (x_{k,i} - x̄_i)(y_k - ȳ) / sqrt( sum_{k=1..m} (x_{k,i} - x̄_i)^2 * sum_{k=1..m} (y_k - ȳ)^2 )    (2)

where m is the number of data points.

Correlation coefficients can be used for both regressors and classifiers. When the learning machine is a regressor, the target may take values on any ratio scale. When the learning machine is a classifier, we restrict the range of values for the target to ±1.


We then use the coefficient of determination, R(i)^2, to enforce a ranking of the features according to the goodness of the linear fit between individual features and the target [25].
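As an illustration, the following minimal sketch (a sketch only, assuming NumPy; the function name is ours) scores every feature by the squared Pearson correlation of Equation 2:

    import numpy as np

    def correlation_scores(X, y):
        """Score each feature column X[:, i] by R(i)^2 against labels y in {+1, -1}."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        Xc = X - X.mean(axis=0)                  # x_{k,i} - mean of feature i
        yc = y - y.mean()                        # y_k - mean of target
        num = Xc.T @ yc                          # numerator of Equation 2, one value per feature
        den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
        r = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
        return r ** 2                            # coefficient of determination

    # ranking = np.argsort(-correlation_scores(X, y))   # best-scoring features first

Features would then be picked from the top of this ranking or, as discussed below, in a chosen ratio of positively to negatively correlated features.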

When using the correlation coefficient as a feature selection metric, we must remember that the correlation only finds linear relationships between a feature and the target. Thus, a feature and the target may be perfectly related in a non-linear manner, and yet the correlation could be equal to 0. We may lift this restriction by applying simple non-linear preprocessing techniques to the feature before calculating the correlation coefficient, establishing a goodness of non-linear fit between a feature and the target [25].
Another issue with using correlation coefficients comes from how we rank features. If features are ranked solely on their signed value, with features having a positive score picked first or vice versa, then we risk not choosing the features that have the strongest relationship with the target. Conversely, if features are chosen based on their absolute value, Zheng, Wu, and Srihari argue that we may not select the ratio of positive to negative features that gives the best results given the imbalance in the data [32]. Finding this optimal ratio takes empirical testing, but it can yield extremely strong results.

2.2 RELIEF
RELIEF is a feature selection metric based on the nearest neighbor rule, designed by Kira and Rendell [34]. It evaluates a feature based on how well its values differentiate nearby points. When RELIEF selects any specific instance, it searches for two nearest neighbors: one from the same class (the nearest hit), and one from the other class (the nearest miss). We then calculate the relevance of each attribute A by the rule:

W(A) = P(different value of A | nearest miss) - P(different value of A | nearest hit)    (3)

This is justified by the reasoning that instances of different classes should have vastly different values, while instances of the same class should have very similar values. Because the true probabilities cannot be calculated, we must estimate the difference in Equation 3. This is done by calculating the distance between randomly selected instances and their nearest hits and misses. For discrete variables, the distance is 0 if the values are the same and 1 if they differ; for continuous variables, we use the standard Euclidean distance. We may select any number of instances up to the number in the set, and more selections yield a better approximation [35]. Algorithm 1 details the pseudo-code for implementing RELIEF.

Algorithm 1 (RELIEF):
    Set all W(A) = 0
    FOR i = 1 to m
        Select instance R randomly
        Find nearest hit H and nearest miss M
        FOR A = 1 to number of features
            W(A) = W(A) - dist(A, R, H)/m
            W(A) = W(A) + dist(A, R, M)/m
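For concreteness, a minimal Python sketch of Algorithm 1 might look as follows (assuming NumPy, continuous features, and two classes; the function name and the use of per-feature absolute differences as the one-dimensional Euclidean distance are our own choices):

    import numpy as np

    def relief(X, y, m=None, seed=0):
        """Basic RELIEF (Algorithm 1): weight each feature by how much farther a
        random instance is from its nearest miss than from its nearest hit."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        n_samples, n_features = X.shape
        m = n_samples if m is None else m
        rng = np.random.default_rng(seed)
        W = np.zeros(n_features)
        for _ in range(m):
            r = rng.integers(n_samples)                     # random instance R
            dists = np.linalg.norm(X - X[r], axis=1)
            dists[r] = np.inf                               # exclude R itself
            same = (y == y[r])
            hit = np.where(same, dists, np.inf).argmin()    # nearest hit H
            miss = np.where(~same, dists, np.inf).argmin()  # nearest miss M
            W -= np.abs(X[r] - X[hit]) / m                  # close hits keep the weight high
            W += np.abs(X[r] - X[miss]) / m                 # distant misses raise the weight
        return W

Features would then be ranked by W; the extensions described below average over several hits and misses instead of one.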

The original version of RELIEF suffered from several problems. First, the method searches only for one nearest hit and one nearest miss, so noisy data can make this approximation inaccurate. Second, if there are instances with missing feature values, the algorithm will fail because it cannot calculate the distance between those instances. Kononenko created multiple extensions of RELIEF to address these issues [35]. RELIEF-A allowed the algorithm to check multiple nearest hits and misses. RELIEF-B, C, and D gave the method different ways to address missing values. Finally, RELIEF-E and F found a nearest miss from each different class instead of just one and used this to better estimate the separability of an instance from all other classes. These extensions added to RELIEF's adaptability to different types of problems.

3. METHOD DESCRIPTION: FAST
In this section, we propose to assess features based on the area under a ROC curve, which is determined by training a simple linear classifier on each feature and sliding the decision boundary for optimal classification. The new metric is called FAST (Feature Assessment by Sliding Thresholds).
Most single-feature classifiers set the decision boundary at the mid-point between the means of the two classes [25]. This may not be the best choice for the decision boundary. By sliding the decision boundary, we can increase the number of true positives we find at the expense of classifying more false positives. Alternately, we could slide the threshold to decrease the number of true positives found in order to avoid misclassifying negatives. Thus, no single choice for the decision boundary may be ideal for quantifying the separation between two classes.

We can avoid this problem by classifying the samples at multiple thresholds and gathering statistics about the performance at each boundary. If we calculate the true positive rate and false positive rate at each threshold, we can build a ROC curve and calculate the area under the curve. Because the area under the ROC curve is a strong predictor of performance, especially for imbalanced data classification problems, we can use this score as our feature ranking: we choose those features with the highest areas under the curve because they have the best predictive power for the dataset.

By using a ROC curve as the means to rank features, we have introduced another problem: deciding where to place the thresholds. If there is a large number of samples clustered together in one region, we would like to place more thresholds between these points to find how separated the two classes are within this cluster. Likewise, if there is a region where samples are sparse and spread out, we want to avoid placing multiple, redundant thresholds between these points. One possible solution is to use a histogram to determine where to place the thresholds. A histogram fixes the bin width and varies the number of points in each bin. This does not accomplish the goals detailed above. It may be the case that a particular histogram has multiple neighboring bins that contain very few points; we would prefer that these bins be joined together so that their points are placed into the same bin. Likewise, a histogram may have a bin that contains a significant proportion of the points; we would rather have this bin split into multiple bins so that we could better differentiate inside this cluster of points.
We use a modified histogram, or an even-bin distribution, to correct both of these problems. Instead of fixing the bin width


and varying the number of points in each bin, we fix the number of points that fall in each bin and vary the bin width. This even-bin distribution accomplishes both of the above goals: areas of the feature space that have fewer samples are covered by wider bins, and areas that have many samples are covered by narrower bins. We then take the mean of the samples in each bin as our threshold and classify each sample according to this threshold. Algorithm 2 details the pseudo-code for implementing FAST.

Algorithm 2 (FAST):
    K: number of bins
    N: number of samples in dataset
    M: number of features in dataset
    Split = 0 to N with a step size of N/K
    FOR i = 1 to M
        X is the vector of samples' values for feature i
        Sort X
        FOR j = 1 to K
            bottom = round(Split(j)) + 1
            top = round(Split(j+1))
            MU = mean(X(bottom to top))
            Classify X using MU as threshold
            tpr(i, j) = tp / # positive
            fpr(i, j) = fp / # negative
        Calculate area under ROC from tpr, fpr
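To make the procedure concrete, here is a minimal Python sketch of Algorithm 2 (a sketch only, assuming NumPy and labels in {+1, -1}; the function name, the padding of the ROC curve with its (0,0) and (1,1) endpoints, and the folding of scores below 0.5 into the 0.5-1 range are our own reading of the two-sided behavior described below):

    import numpy as np

    def fast_scores(X, y, K=10):
        """FAST (Algorithm 2): slide a threshold over K even-occupancy bins of each
        feature, collect (fpr, tpr) points, and score the feature by the AUC."""
        X = np.asarray(X, dtype=float)
        pos = (np.asarray(y) == 1)
        n_samples, n_features = X.shape
        n_pos, n_neg = pos.sum(), (~pos).sum()
        split = np.linspace(0, n_samples, K + 1)         # Split = 0 to N with step N/K
        scores = np.zeros(n_features)
        for i in range(n_features):
            order = np.argsort(X[:, i])                  # Sort X
            tpr, fpr = [0.0], [0.0]
            for j in range(K):
                lo, hi = int(round(split[j])), int(round(split[j + 1]))
                if hi <= lo:
                    continue
                mu = X[order[lo:hi], i].mean()           # threshold = mean of bin j
                pred = X[:, i] > mu                      # classify with threshold MU
                tpr.append((pred & pos).sum() / n_pos)
                fpr.append((pred & ~pos).sum() / n_neg)
            tpr.append(1.0)
            fpr.append(1.0)
            pts = np.argsort(fpr)                        # order points along the fpr axis
            auc = np.trapz(np.asarray(tpr)[pts], np.asarray(fpr)[pts])
            scores[i] = max(auc, 1.0 - auc)              # two-sided: either direction counts
        return scores

Features are then ranked by these scores, with the largest areas under the curve selected first.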

One potential issue with this implementation is how it compares to the standard ROC construction that uses every possible sample value as a threshold; the standard approach is simpler but requires more computation. We conducted a pilot study using the CNS dataset to measure the difference between the FAST algorithm and this standard. Our findings showed that with a parameter of K=10, 99% of the FAST scores were within ±0.02 of the exact AUC score, and 50% were within ±0.005. Additionally, the FAST algorithm was nearly ten times as fast. Thus, we concluded that the approximation scores were sufficient.

Note that the FAST method is a two-sided metric. The scores generated by the FAST method may range between 0.5 and 1. If a feature is irrelevant to classification, its score will be close to 0.5. If a feature is highly indicative of membership in the positive class, the negative class, or both, it will have a score closer to 1. Thus, the method has the potential to select both positive and negative features for use in classification.

4. EXPERIMENTAL RESULTS
4.1 Data Sets
We tested the effectiveness of correlation coefficient, RELIEF, and FAST features on five different data sets. Two of the data sets are microarray sets, two are mass spectrometry sets, and one is a bag-of-words set. Each of the microarray and mass spectrometry data sets has a small number of samples, a large number of features, and a significant imbalance between the two classes. The bag-of-words data set also has a small number of samples with a large number of features, but we artificially controlled the class skew to show differences in performance on highly imbalanced classes versus balanced classes. The microarray sets were not preprocessed. The mass spectrometry sets were minimally preprocessed by subtracting the baseline, reducing the amount of noise, trimming the range of inspected mass/charge ratios, and normalizing. The bag-of-words set was constructed using RAINBOW [36] to extract the word counts from text documents. These data sets are summarized in Table 1.

Because the largest data set has 320 samples, we used 10-fold cross-validation to evaluate the trained models. Each fold had a class ratio equal to the ratio of the full set. The results for each fold were combined to obtain test results for the entire data set. To stabilize the results, we repeated the cross-validation 20 times and averaged over the trials.
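As a rough illustration of this protocol (a sketch only, assuming scikit-learn and NumPy arrays; score_features stands in for any of the three metrics, and performing the selection inside each training fold is our assumption about the setup), the evaluation loop could look like:

    import numpy as np
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.svm import LinearSVC

    def evaluate(X, y, score_features, n_features=30):
        """Stratified 10-fold CV, repeated 20 times: select features on the training
        fold only, fit a linear SVM, and report the mean balanced error rate."""
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
        bers = []
        for train, test in cv.split(X, y):
            scores = score_features(X[train], y[train])
            top = np.argsort(-scores)[:n_features]       # keep the best-scoring features
            clf = LinearSVC().fit(X[train][:, top], y[train])
            pred = clf.predict(X[test][:, top])
            bers.append(1.0 - balanced_accuracy_score(y[test], pred))
        return float(np.mean(bers))

Here 1 minus the balanced accuracy is the balanced error rate (BER) defined in Section 4.2.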

Table 1. Data set descriptions

CNS: Central Nervous System Embryonal Tumor Data [37]. This data set contains 90 samples: 60 have medulloblastomas and 30 have other types of tumors or no cancer. There are 7129 genes in this data set.

LYMPH: Lymphoma Data [38]. This data set contains 77 samples: 58 are diffuse large B-cell lymphomas, and 19 are follicular lymphomas. There are 7129 genes in this data set.

OVARY: Ovarian Cancer Data [39]. This data set contains 66 samples: 50 are benign tumors, and 16 are malignant tumors. There are 6000 mass/charge ratios in this data set.

PROST: Prostate Cancer Data [40]. This data set contains 89 samples: 63 have no evidence of cancer, and 26 have prostate cancer. There are 6000 mass/charge ratios in this data set.

NIPS: NIPS Bag-of-Words Data [41]. This data set contains 320 documents: 160 cover neurobiology topics, and 160 cover various application topics. There are 13649 words in this data set. The set was rebalanced to five separate class ratios: 1:1, 1:2, 1:4, 1:8, and 1:16. The neurobiology class was the one shrunk to produce these imbalances.

4.2 Evaluation Statistics
The standard accuracy and error statistics quantify the strength of a classifier over the overall data set. However, these statistics do not take the class distribution into account. Forman argued that this is a problem because a trivial majority classifier can give good results on a very imbalanced distribution [22]. It is more important to classify samples in the minority class correctly, at the potential expense of misclassifying majority samples. However, the converse is true as well: a trivial minority classifier will give great results for the minority class, but such a classifier would have too many false alarms to be usable. An ideal classifier performs well on both the minority and the majority class.
The balanced error rate (BER) statistic looks at the performance of a classifier on both classes. It is defined as the average of the error rates of the two classes, as shown in Equation 4. If the classes are balanced, the BER is equal to the global error rate. It is commonly


used for evaluating imbalanced data classification [42]. We used

this statistic to evaluate trained classifiers on test data.

BER = (1/2) * ( FP / (FP + TN) + FN / (FN + TP) )    (4)
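A small helper mirroring this reconstruction of Equation 4 (an illustrative sketch; the function name is ours, and TP, FP, FN, TN are the usual confusion-matrix counts):

    def balanced_error_rate(tp, fp, fn, tn):
        """Equation 4: average of the error rate on the negative class, FP / (FP + TN),
        and the error rate on the positive class, FN / (FN + TP)."""
        return 0.5 * (fp / (fp + tn) + fn / (fn + tp))

    # e.g. balanced_error_rate(tp=5, fp=10, fn=5, tn=80) == 0.5 * (10/90 + 5/10) ≈ 0.306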

4.3 Results
We evaluated the performance of FAST-selected features by comparing them with features chosen by correlation coefficients and RELIEF. Many researchers have used standard learning algorithms that maximize accuracy to evaluate imbalanced datasets. Zheng [32] used the naïve Bayes classifier and logistic regression methods, and Forman [22] used the linear SVM and noted its superiority over decision trees, naïve Bayes, and logistic regression. The object of study in these papers, and in our research, was the performance of the feature selection metrics and not the induction algorithms. Thus, we chose to evaluate the metrics using the performance of the linear SVM and 1-NN classifiers. These classifiers were chosen for their differing classification philosophies: the 1-NN method is a lazy algorithm that defers computation until classification, whereas the SVM computes a maximum-margin separating hyperplane before classification.
The classification results are summarized in Figs. 1-10, where dashed lines with square markers indicate classifiers using RELIEF-selected features (with one nearest hit and miss), dashed lines with star markers indicate classifiers using correlation-selected features, and dashed lines with diamond markers indicate classifiers using FAST-selected features (with 10 bins). The solid black line indicates the baseline performance where all the features are used for classification.
Figures 1 and 2 show the BER versus the number of features selected using a 1-NN classifier and a linear SVM for the CNS data, respectively. FAST features significantly outperformed RELIEF and correlation features when using the 1-NN classifier. When using the SVM classifier, FAST features performed the best for fewer than 40 features; for more than 40 features, there was little difference between feature sets. In all cases, using a small set of features outperforms the baseline with all the original features. Similar results were obtained for the other datasets. For example, Figures 3 and 4 show the results for the LYMPH data with a 1-NN classifier and a linear SVM, respectively. Due to page limits, we are not able to show the results for all four datasets; instead, we include the average results. Figures 5 and 6 show the BER scores averaged over the four datasets with a 1-NN classifier and a SVM, respectively. For comparison, the baseline performance of the classifier using all features is also included.
Another evaluation statistic commonly used on imbalanced datasets is the area under the ROC curve (AUC). This statistic is similar in nature to the BER in that it weights errors on the two classes differently. In this study, it also lines up well with the design philosophy of FAST: FAST selects features that maximize the AUC, so it is reasonable to believe that a learning method using FAST-selected features would also maximize the AUC. We also used this statistic to evaluate trained classifiers on test data. Figures 7 and 8 show the AUC scores averaged over the four datasets with a 1-NN classifier and a SVM, respectively. Not surprisingly, FAST outperforms CC and RELIEF.

Figure 1. BER for CNS using a 1-NN classifier

Figure 2. BER for CNS using a SVM classifier

Figure 3. BER for LYMPH using a 1-NN classifier


Figure 4. BER for LYMPH using a SVM classifier

Figure 5. BER averaged over CNS, LYMPH, OVARY, and PROST using a 1-NN classifier

Figure 6. BER averaged over CNS, LYMPH, OVARY, and PROST using a SVM classifier

Figure 7. AUC averaged over CNS, LYMPH, OVARY, and PROST using a 1-NN classifier

Figure 8. AUC averaged over CNS, LYMPH, OVARY, and PROST using a SVM classifier

Figure 9. AUC for CNS using a SVM classifier


Figure 10. AUC for PROST using a SVM classifier

Figure 11. Training data distribution of CNS with the two best RELIEF-selected features

The average results in Figures 6 and 8 agree with the belief that SVMs are robust for high-dimensional data. Up to 100 RELIEF-selected features did not improve the BER or the AUC of the SVM. Additionally, up to 100 correlation-selected features did not improve the BER. On the other hand, the SVM using more than 30 FAST-selected features did see a significant improvement in both BER and AUC. Thus, our results agree with the general finding that SVMs are resistant to feature selection, but also with the findings presented by Forman [22] that SVMs can benefit from prudent feature selection. Specific examples of this improvement in our datasets can be seen in Figures 2 and 4, using FAST on the BER scores for the CNS and LYMPH datasets, respectively, and in Figures 9 and 10, using FAST on the AUC scores for the CNS and PROST datasets, respectively.
The results for the 1-NN classifiers, seen in Figures 5 and 7, are even more striking. Both RELIEF- and correlation-selected features improved significantly on the baseline performance of the classifier for a minimum of 45 features selected. FAST-selected features saw a significant jump in performance over that seen using RELIEF- and correlation-selected features; the 1-NN classifiers using only 15 FAST-selected features beat the baseline.

Figure 12. Training data distribution of CNS with the two best correlation-selected features

Figure 13. Training data distribution of CNS with the two best FAST-selected features

Why would FAST features outperform correlation and RELIEF features by such a significant margin for both 1-NN and SVM classifiers? We visualized the features selected by the correlation, RELIEF, and FAST methods to answer this question. We show the training data of the CNS dataset with the two best features. Figures 11-13 show the data using the best two RELIEF features, the best two correlation features, and the best two FAST features, respectively. FAST features appear to separate the two classes and group them into smaller clusters better than correlation and RELIEF features. This may explain why FAST features perform better with both the SVM and 1-NN classifiers; SVMs try to maximize the distance between two classes, and 1-NN classifiers give the best results when similar samples are clustered close together.
Finally, we show the effects of different class ratios on the performance of each feature selection metric. Figures 14 and 15 show the BER versus class ratio for the NIPS dataset with the SVM and 1-NN classifiers, respectively. Not surprisingly, as the class ratio increases, the BER tends to increase accordingly. For both the 1-NN and SVM classifiers, correlation and FAST features performed comparably well for datasets up to a 1:8 class ratio. For the 1:16 ratio, FAST features performed significantly better than correlation features. RELIEF features did not perform well on this dataset for any of the class ratios.


We conclude that FAST features perform better than RELIEF and correlation features; this boost in performance is especially large when the selected feature set is small and when the classes are extremely imbalanced. Because using fewer features helps classifiers avoid overfitting the data when the sample space is small, we believe that the FAST metric is of interest for learning patterns in real-world datasets, especially those that have imbalanced classes and high dimensionality.

Figure 14. BER for NIPS using SVM classifiers

Figure 15. BER for NIPS using 1-NN classifiers

5. CONCLUSION
Classification problems involving a small sample space and a large feature space are especially prone to overfitting. Feature selection methods are often used to increase the generalization potential of a classifier. However, when the dataset to be learned is imbalanced, the most commonly used metrics tend to select less relevant features. In this paper, we proposed and tested a feature selection metric, FAST, that evaluates the relevance of features using the area under the ROC curve obtained by sliding the decision boundary in the one-dimensional feature space. We compared the FAST metric with the commonly used RELIEF and correlation coefficient scores on two mass spectrometry and two microarray datasets that have small sample sizes and imbalanced distributions. FAST features performed considerably better than RELIEF and correlation features; the increase in performance was magnified for smaller feature counts, and this makes FAST a practical candidate for feature selection.
One interesting finding from this research was that correlation features tended to outperform RELIEF features for class imbalance and small sample problems, especially when the SVM classifier was used. This may have occurred because the correlation coefficient takes a global view of whether a feature accurately predicts the target; in contrast, RELIEF, especially when the number of nearest hits and misses selected is small, has a local view of a feature's relevance for predicting the target. If there are small clusters of points that are near each other but far away from the main cluster of points, these points can act as each other's nearest hits while being a great distance from the nearest misses. Thus, features that have this quality can be scored rather high when they are, in fact, highly irrelevant to classification. There is strong evidence for this claim in Fig. 11: there are multiple small clusters of points, some from the majority class and some from the minority class, that are close to each other but a significant distance away from the nearest miss. This would greatly affect the score of these two features and make them appear more relevant. Figures 5-8 clearly point to this deficiency, as the performance of both SVM and 1-NN classifiers using RELIEF features is only marginally better (or worse) than chance and significantly behind classifiers using correlation or FAST features.

Our future work will investigate the use of other metrics for feature evaluation. For example, researchers have recently argued that precision-recall curves are preferable when dealing with highly skewed datasets [43]. Whether precision-recall curves are also appropriate for small sample and imbalanced data problems remains to be examined.

6. ACKNOWLEDGMENTS
This work is supported by US National Science Foundation Award IIS-0644366. We would also like to thank the reviewers for their valuable comments.

7. REFERENCES
[1] Nunez, M. 1991. The use of background knowledge in decision tree induction. Machine Learning, 6, 231-250.
[2] Casasent, D. and Chen, X.-W. 2003. New training strategies for RBF neural networks for X-ray agricultural product inspection. Pattern Recognition, 36(2), 535-547.
[3] Verdenius, F. 1991. A method for inductive cost optimization. Proceedings of the Fifth European Working Session on Learning, EWSL-91, 179-191. New York: Springer-Verlag.
[4] Casasent, D. and Chen, X.-W. 2004. Feature reduction and morphological processing for hyperspectral image data. Applied Optics, 43(2), 1-10.
[5] Japkowicz, N., editor. 2000. Proceedings of the AAAI'2000 Workshop on Learning from Imbalanced Data Sets. AAAI Tech Report WS-00-05.
[6] Chawla, N., Japkowicz, N., and Kolcz, A., editors. 2003. Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Data Sets.
[7] Weiss, G. 2004. Mining with rarity: A unifying framework. SIGKDD Explorations, 6(1), 7-19.
[8] Kubat, M. and Matwin, S. 1997. Addressing the curse of imbalanced data sets: One-sided sampling. In Proc. of the


Fourteenth International Conference on Machine Learning, 179-186.
[9] Chen, X., Gerlach, B., and Casasent, D. 2005. Pruning support vectors for imbalanced data classification. In Proc. of the International Joint Conference on Neural Networks, 1883-1888.
[10] Kubat, M. and Matwin, S. 1997. Learning when negative examples abound. In Proceedings of the Ninth European Conference on Machine Learning, ECML-97, 146-153.
[11] Chawla, N., Bowyer, K., Hall, L., and Kegelmeyer, P. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.
[12] Estabrooks, A., Jo, T., and Japkowicz, N. 2004. A multiple resampling method for learning from imbalanced data sets. Computational Intelligence, 20(1), 18-36.
[13] Domingos, P. 1999. MetaCost: a general method for making classifiers cost-sensitive. Proc. of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 155-164.
[14] Elkan, C. 2001. The foundations of cost-sensitive learning. Proc. of the Seventeenth International Joint Conference on Artificial Intelligence, 973-978.
[15] Fawcett, T. and Provost, F. 1997. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3), 291-316.
[16] Huang, K., Yang, H., King, I., and Lyu, M. 2004. Learning classifiers from imbalanced data based on biased minimax probability machine. Proc. of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2(27), II-558 - II-563.
[17] Ting, K. 1994. The problem of small disjuncts: its remedy in decision trees. Proc. of the Tenth Canadian Conference on Artificial Intelligence, 91-97.
[18] Chawla, N., Lazarevic, A., Hall, L., and Bowyer, K. 2003. SMOTEBoost: Improving prediction of the minority class in boosting. Principles of Knowledge Discovery in Databases, LNAI 2838, 107-119.
[19] Sun, Y., Kamel, M., and Wang, Y. 2006. Boosting for learning multiple classes with imbalanced class distribution. Sixth International Conference on Data Mining, 592-602.
[20] Raskutti, A. and Kowalczyk, A. 2004. Extreme rebalancing for SVMs: a SVM study. SIGKDD Explorations, 6(1), 60-69.
[21] Xiong, H. and Chen, X. 2006. Kernel-based distance metric learning for microarray data classification. BMC Bioinformatics, 7, 299.
[22] Forman, G. 2003. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3, 1289-1305.
[23] Van der Putten, P. and van Someren, M. 2004. A bias-variance analysis of a real world learning problem: the CoIL challenge 2000. Machine Learning, 57(1-2), 177-195.
[24] Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. 2002. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3), 389-422.
[25] Guyon, I. and Elisseeff, A. 2003. An introduction to variable and feature selection. JMLR Special Issue on Variable and Feature Selection, 3, 1157-1182.
[26] Weston, J. et al. 2000. Feature selection for support vector machines. In Advances in Neural Information Processing Systems.
[27] Chen, X. and Jeong, J. 2007. Minimum reference set based feature selection for small sample classifications. Proc. of the 24th International Conference on Machine Learning, 153-160.
[28] Chen, X. 2003. An improved branch and bound algorithm for feature selection. Pattern Recognition Letters, 24, 1925-1933.
[29] Yu, L. and Liu, H. 2004. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5, 1205-1224.
[30] Pudil, P., Novovicova, J., and Kittler, J. 1994. Floating search methods in feature selection. Pattern Recognition Letters, 15, 1119-1125.
[31] Mladenic, D. and Grobelnik, M. 1999. Feature selection for unbalanced class distribution and naïve Bayes. In Proc. of the 16th International Conference on Machine Learning, 258-267.
[32] Zheng, Z., Wu, X., and Srihari, R. 2004. Feature selection for text categorization on imbalanced data. SIGKDD Explorations, 6(1), 80-89.
[33] Lund, O., Nielsen, C., Lundegaard, C., and Brunak, S. 2005. Immunological Bioinformatics, 99-101. The MIT Press.
[34] Kira, K. and Rendell, L. 1992. The feature selection problem: Traditional methods and new algorithms. In Proc. of the 9th International Conference on Machine Learning, 249-256.
[35] Kononenko, I. 1994. Estimating attributes: Analysis and extension of RELIEF. In Proc. of the 7th European Conference on Machine Learning, 171-182.
[36] McCallum, A. 1996. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow.
[37] Pomeroy, S. et al. 2002. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415, 436-442.
[38] Shipp, M. et al. 2002. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine, 8, 68-74.
[39] Petricoin, E. et al. 2002. Use of proteomic patterns in serum to identify ovarian cancer. The Lancet, 359, 572-577.
[40] Petricoin, E. et al. 2002. Serum proteomic patterns for detection of prostate cancer. Journal of the National Cancer Institute, 94, 1576-1578.
[41] Roweis, S. 2008. http://www.cs.toronto.edu/~roweis.
[42] MPS. 2006. Performance prediction challenge – evaluation. http://www.modelselect.inf.ethz.ch/evaluation.php.
[43] Davis, J. and Goadrich, M. 2006. The relationship between precision-recall and ROC curves. In Proc. of the 23rd International Conference on Machine Learning, 30-38.
