
Feature Selection Based on Class-Dependent Densities for High-Dimensional Binary Data

Kashif Javed, Haroon A. Babri, and Mehreen Saeed

K. Javed and H.A. Babri are with the Department of Electrical Engineering, University of Engineering and Technology, Lahore 54890, Pakistan. E-mail: {kashif.javed, babri}@uet.edu.pk.
M. Saeed is with the Department of Computer Science, National University of Computer and Emerging Sciences, Block-B, Faisal Town, Lahore, Pakistan. E-mail: [email protected].
Manuscript received 24 Oct. 2009; revised 11 May 2010; accepted 14 Aug. 2010; published online 21 Dec. 2010. Digital Object Identifier no. 10.1109/TKDE.2010.263.

    Abstract—Data and knowledge management systems employ feature selection algorithms for removing irrelevant, redundant, and

    noisy information from the data. There are two well-known approaches to feature selection, feature ranking (FR) and feature subset

    selection (FSS). In this paper, we propose a new FR algorithm, termed as class-dependent density-based feature elimination (CDFE),

    for binary data sets. Our theoretical analysis shows that CDFE computes the weights, used for feature ranking, more efficiently as

compared to the mutual information measure. In effect, the rankings obtained with the two criteria approximate each other. CDFE

    uses a filtrapper approach to select a final subset. For data sets having hundreds of thousands of features, feature selection with FR

    algorithms is simple and computationally efficient but redundant information may not be removed. On the other hand, FSS algorithms

    analyze the data for redundancies but may become computationally impractical on high-dimensional data sets. We address these

    problems by combining FR and FSS methods in the form of a two-stage feature selection algorithm. When introduced as a

    preprocessing step to the FSS algorithms, CDFE not only presents them with a feature subset, good in terms of classification, but also

    relieves them from heavy computations. Two FSS algorithms are employed in the second stage to test the two-stage feature selection

    idea. We carry out experiments with two different classifiers (naive Bayes’ and kernel ridge regression) on three different real-life data

    sets (NOVA, HIVA, and GINA) of the “Agnostic Learning versus Prior Knowledge” challenge. As a stand-alone method, CDFE shows

    up to about 92 percent reduction in the feature set size. When combined with the FSS algorithms in two-stages, CDFE significantly

    improves their classification accuracy and exhibits up to 97 percent reduction in the feature set size. We also compared CDFE against

    the winning entries of the challenge and found that it outperforms the best results on NOVA and HIVA while obtaining a third position in

    case of GINA.

    Index Terms—Feature ranking, binary data, feature subset selection, two-stage feature selection, classification.


    1 INTRODUCTION

The advancements in data and knowledge management systems have made data collection easier and faster. Raw data are collected by researchers and scientists working in diverse application domains such as engineering (robotics), pattern recognition (face, speech), internet applications (anomaly detection), and medical applications (diagnosis). These data sets may consist of thousands of observations or instances, where each instance may be represented by tens or hundreds of thousands of variables, also known as features. The number of instances and the number of variables determine the size and the dimension of a data set. Data sets such as NOVA [1], a text classification data set consisting of 16,969 features and 19,466 instances, and DOROTHEA [2], a data set used for drug discovery consisting of 100,000 features and 1,950 instances, are not too uncommon these days. Intuitively, having more features implies more discriminative power in classification [3]. However, this is not always true in practical experience, because not all the features present in high-dimensional data sets help in class prediction.

Many features might be irrelevant and possibly detrimental to classification. Also, redundancy among the features is not uncommon [4], [5]. The presence of irrelevant and redundant features not only slows down the learning algorithm but also confuses it by causing it to overfit the training data [4]. In other words, eliminating irrelevant and redundant features makes the classifier's design simple, improves its prediction performance, and improves its computational efficiency [6], [7].

High-dimensional data sets are inherently sparse and, hence, can be transformed to lower dimensions without losing too much information about the classes [8]. This phenomenon, known as the empty space phenomenon [9], is responsible for the well-known issue of the “curse of dimensionality,” a term first coined by Bellman [10] in 1961 to describe the problems faced in the analysis of high-dimensional data. He proved that, to effectively estimate multivariate density functions up to a given degree of accuracy, an increase in data dimensions leads to an exponential growth in the number of data samples. While studying small sample size effects on classifier design, a phenomenon related to the curse of dimensionality was observed by Raudys and Jain [11] and was termed the “peaking phenomenon.” They found that for a given sample size, the accuracy of a classifier first increases with the increase in the number of features, approaches the optimal value, and then starts decreasing.

Problems faced by learning algorithms with high-dimensional data sets have been intensively worked on by researchers. The algorithms that have been developed can be categorized into two broad groups. The algorithms that select, from the original feature set, a subset of features which are highly effective in discriminating classes are categorized as feature selection (FS) methods. Relief [12] is a popular FS method that filters out irrelevant features using the nearest neighbor approach. Another well-known method is the recursive feature elimination support vector machine (RFE-SVM) [13], which selects useful features while training the SVM classifier. The FS algorithms are further discussed in Section 2. On the other hand, algorithms that create a new set of features from the original features, through the application of some transformation or combination of the original features, are termed feature extraction (FE) methods. Among them, principal component analysis (PCA) and linear discriminant analysis (LDA) are the two well-known linear algorithms, which are widely used because of their simplicity and effectiveness [3]. Nonlinear FE algorithms include isomap [14] and locally linear embedding (LLE) [15]. For a comparative review of FE methods, interested readers are referred to [16]. In this paper, our focus is on the problem of supervised feature selection, and we propose a solution that is suitable for binary data sets.

The remainder of the paper is organized as follows. Section 2 describes the theory related to feature selection and presents a literature survey of the existing methods. In Section 3, we propose a new feature ranking (FR) algorithm, termed class-dependent density-based feature elimination (CDFE). Section 4 discusses how to combine CDFE with other feature selection algorithms in two stages. Experimental results on three real-life data sets are discussed in Section 5. The conclusions are drawn in Section 6.

    2 FEATURE  SELECTION

This section describes the theory related to the feature selection problem and surveys the various methods presented in the literature for its solution. Suppose we are given a labeled data set {x^t, C^t}, t = 1, ..., N, consisting of N instances and M features such that x^t ∈ R^M and C^t denotes the class variable of instance t. There can be L classes. Each vector x is, thus, an M-dimensional vector of features; hence, x^t = {F_1^t, F_2^t, ..., F_M^t}. We use F to denote the set comprising all features of a data set, whereas G denotes a feature subset. The feature selection problem is to find a subset G of m features from the set F having M features with the smallest classification error [17], or at least without a significant degradation in performance [6].

A straightforward solution to the feature selection problem is to explore all \binom{M}{m} possible subsets of size m. However, this kind of search is computationally expensive for even moderate values of M and m. Therefore, alternative search strategies have to be designed.
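As a rough illustration of this combinatorial blow-up (the snippet and numbers below are ours, not taken from the paper), picking only 10 features out of a HIVA- or NOVA-sized feature set already yields an astronomically large number of candidate subsets:

```python
from math import comb

# Number of size-m subsets of M features; exhaustive search must evaluate them all.
for M, m in [(100, 10), (1617, 10), (16969, 10)]:
    print(f"C({M:>5}, {m}) = {comb(M, m):.3e}")
# Roughly 1.7e13, 3e25, and 5e35 candidate subsets, respectively.
```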

Generally speaking, the feature selection process may consist of four basic steps, namely, subset generation, subset evaluation, stopping criterion, and result validation [6]. In the subset generation step, the feature space is searched according to a search strategy for a candidate subset, which is evaluated later. The search can begin with either an empty set, which is then successively built up (forward selection), or it can start with the entire feature set from which features are successively eliminated (backward selection). Different search strategies have been devised, such as complete search, sequential search, and random search [18]. The newly generated subset is evaluated either with the help of the classifier performance or with some criterion that does not involve classifier feedback. These two steps are repeated until a stopping criterion is met.

Two well-known classes of feature selection algorithms are feature ranking and feature subset selection (FSS) [7], [19]. Feature ranking methods typically assign weights to features by assessing each feature individually according to some criterion, such as the degree of relevance to the class variable. Correlation, information-theoretic, and probabilistic ranking criteria are discussed in [20]. Features are then sorted according to their weights in descending order. A fixed number of the top-ranked features can comprise the optimal subset; alternatively, a threshold value provided by the user can be set on the ranking criterion to retain or discard features. Thus, FR methods do not engage in an explicit search for the smallest optimal set. They are highly attractive for microarray analysis and text-categorization domains because of their computational efficiency and simplicity [7]. Kira and Rendell's “Relief” algorithm [12] estimates the relevance of a feature using the values of the features of its nearest neighbors. Hall [21] proposes a ranking criterion that evaluates and ranks subsets of features rather than assessing features individually. In [22], a comparison of four feature ranking methods is given. Presenting the most relevant features to a classifier may not produce an optimal result, as the selection may contain redundant features. In other words, the selected m best features may not result in the highest classification accuracy, which can be achieved with the best m features. Yu and Liu [23] suggest analyzing the subset obtained by feature ranking methods for feature redundancy in a separate stage.

Unlike feature ranking methods, feature subset selection methods select subsets of features which together have good predictive power. Guyon and Elisseeff present theoretical examples to illustrate the superiority of the FSS methods over the FR methods in [7]. Feature subset selection methods are divided into three broad categories: filter, wrapper, and embedded methods [7], [19], [20]. A filter acts as a preprocessing step to a learning algorithm and assesses feature subsets without the algorithm's involvement. Fleuret [24] proposes a filtering criterion based on conditional mutual information (MI) for binary data sets. A feature F_i among the unselected features is selected if its mutual information I(C; F_i | F_k), conditioned on every feature F_k already in the selected subset of features, is the largest. This conditional mutual information maximization (CMIM) criterion discards features similar to the already selected ones, as they do not carry additional information about the class. In [25], Peng et al. propose the minimal-redundancy-maximal-relevance (mRMR) criterion, which adds a feature to the final subset if it maximizes the difference between its mutual information with the class and the sum of its mutual information with each of the individual features already selected. Qu et al. [26] suggest a new redundancy measure and a feature subset merit measure based on mutual information concepts to quantify the relevance and redundancy among features. The proposed filter first finds a subset of highly relevant features. Among these features, the most relevant feature having the least redundancy with the already selected features is added to the final subset.

Wrapper methods, motivated by Kohavi and John [17], use the performance of a predetermined learning algorithm for searching an optimal feature subset. They suggest using n-fold cross validation for evaluating feature subsets and find that the best-first search strategy outperforms the hill-climbing technique for forward selection. In practice, wrappers are considered to be computationally more expensive than filters.

In the embedded approach, the feature selection process is integrated into the training process of a given classifier. An example is the recursive feature elimination (RFE) algorithm [13], in which the support vector machine (SVM) is used as the classifier. Features are assigned weights that are estimated by the SVM classifier after being trained with the data set. During each iteration, the feature(s) which decrease the margin of class separation the least are eliminated.

Another class of feature selection algorithms uses the concepts of Bayesian networks [27]. The Markov blanket (MB) of a target variable is a minimal set of variables conditioned on which all other variables are probabilistically independent of the target. The optimal feature subset for classification is the Markov blanket of the class variable. One way of identifying the Markov blanket is through learning the Bayesian network [28]. Another way is to discover the Markov blanket directly from the data [4]. The Markov blanket filtering (MBF) algorithm of Koller and Sahami [4] calculates pairwise correlations between all the features and assumes the K features most highly correlated with a feature to comprise its Markov blanket. During each iteration, expected cross entropy is used to estimate the MB of a feature, and the feature whose MB is best approximated is eliminated. For large values of K, MBF runs into computational and data fragmentation problems. To address these problems, in [29], we propose a Bernoulli mixture model-based Markov blanket filtering (BMM-MBF) algorithm for binary data sets that estimates the expected cross entropy measure via Bernoulli mixture models rather than from the training data set.

3 CLASS-DEPENDENT DENSITY-BASED FEATURE ELIMINATION

In this section, we propose a new feature ranking algorithm, termed class-dependent density-based feature elimination, for binary data sets. Binary data sets are found in a wide variety of applications including document classification [30], binary image recognition [31], drug discovery [32], databases [33], and agriculture [34]. It may be possible to binarize nonbinary data sets in many cases (e.g., binarization of the GINA data set; see Section 5). CDFE uses a measure termed the diff-criterion to estimate the relevance of features. The diff-criterion is a probabilistic measure and assigns weights to features by determining their density value in each class. Mathematically, we show that the computational cost of estimating the weights by the diff-criterion is less than the cost of calculating weights by mutual information. The feature rankings obtained by the two criteria are similar to each other. Instead of using a user-provided threshold value, CDFE determines the final subset with the help of a classifier.

Guyon et al. [35] proposed the Zfilter method to rank the features of a sparse-integer data set. The filter counts the nonzero values of a feature irrespective of the class variable and assigns the sum of the counts as its weight. Features having a weight less than a given threshold value are then removed to obtain the final subset. In earlier work [36], [37], we proposed a similar density-based elimination strategy for high-dimensional binary data sets using the max-criterion (see Definition 3.2). In the following, we suggest a new and more effective density-based ranking criterion (the diff-criterion) and present a formal analysis of its working. The discussion that follows is for binary features and two-class classification problems unless stated otherwise.

Definition 3.1. The density of a binary feature for a given class is the fraction of that class's instances in which the feature takes the value 1.

The density of the ith feature, F_i, in the lth class, C_l, having N_{C_l} instances, is calculated as

d_{i1l} = \frac{\sum_{t=1}^{N_{C_l}} F_i^t}{N_{C_l}} = p(F_i = 1 \mid C_l), \quad \forall i,\ 1 \le i \le M; \quad \forall l,\ 1 \le l \le L.   (1)

Remark 3.1. It follows immediately that 0 ≤ d_{i1l} ≤ 1; the extremes occur when a feature's value remains identical over all the instances in a given class.

Definition 3.2. The max-criterion [36] calculates the density value of a feature in each class and then scores it with the maximum density value over all the classes.

The weight of the ith feature, F_i, using the max-criterion is

W(F_i)_{max} = \max_{l} d_{i1l}.   (2)

Remark 3.2. It follows immediately that 0 ≤ W(F_i)_max ≤ 1. Irrelevant features will be assigned the value W(F_i)_max = 0.

Definition 3.3. The diff-criterion calculates the density value of a feature in each class and then scores it with the difference of the density values over the two classes C_0 and C_1.

The weight of the ith feature, F_i, using the diff-criterion is

W(F_i)_{diff} = |d_{i11} - d_{i10}| = |p(F_i = 1 \mid C_1) - p(F_i = 1 \mid C_0)|.   (3)

Remark 3.3. It follows immediately that 0 ≤ W(F_i)_diff ≤ 1. A feature having W(F_i)_diff = 0 is irrelevant, whereas W(F_i)_diff = 1 means F_i is the most relevant feature of a data set. Features with lower weights are, thus, less relevant than features with higher values of W_diff.
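The quantities in Definitions 3.1, 3.2, and 3.3 reduce to per-class column averages of a binary data matrix. The following NumPy sketch (function names and the toy data are ours, not part of the paper) illustrates the computation:

```python
import numpy as np

def density_per_class(X, y):
    """d_{i1l} = p(F_i = 1 | C_l): per-class fraction of instances in which
    each binary feature takes the value 1 (Definition 3.1)."""
    return np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])  # shape (L, M)

def w_max(X, y):
    """Max-criterion, eq. (2): maximum class-conditional density of each feature."""
    return density_per_class(X, y).max(axis=0)

def w_diff(X, y):
    """Diff-criterion, eq. (3): |p(F_i=1|C_1) - p(F_i=1|C_0)| for two classes."""
    d = density_per_class(X, y)
    return np.abs(d[1] - d[0])

# Toy check: a feature that is 1 in every class-1 instance and 0 in every
# class-0 instance gets W_diff = 1; identical densities give W_diff = 0.
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([0, 0, 1, 1])
print(w_diff(X, y))   # -> [1. 0. 1.]
```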

The class-dependent density-based feature elimination strategy ranks the features using the diff-criterion given by (3) and sorts them according to decreasing relevance. In feature selection algorithms such as [5], [23], [26], where relevance and redundancy are analyzed separately in two steps, a preliminary subset of relevant features is chosen in the first step using a threshold value provided by the user. A high threshold value may result in a very small subset of highly relevant features, whereas with a low value the subset may consist of too many features, including highly relevant features along with less relevant ones. In the former case, a lot of information about the class may be lost, whereas in the latter case the subset will still contain a lot of information irrelevant to the class, thus requiring a computationally expensive second stage for selecting the best features. Feature ranking algorithms such as “Relief” [12] and others also suffer from the same problem with a user-provided threshold value. To address this problem, CDFE uses a filtrapper approach [20] and defines nested sets of features S_1 ⊇ S_2 ⊇ ... ⊇ S_{N_T} in search of the optimal subset. Here, N_T denotes the number of threshold levels. A sequence of increasing W_diff values is used as thresholds to progressively eliminate more and more features of decreasing relevance in the nested subsets. Each feature subset thus generated is evaluated with a classifier, and the final subset is chosen according to the application requirement: either the smallest subset having the same accuracy as attained by the entire feature set is selected, or the one with the best classification accuracy is chosen.

The W(F_i)_diff value of the ith feature, F_i, is determined by counting the number of 1s over the instances of the two classes. Therefore, the time complexity of the diff-criterion is O(NM), where N is the number of instances and M is the number of features in the training set. Consequently, the time complexity of CDFE is O(N M N_T V), where V is the computing time of the classifier used.
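A minimal sketch of this filtrapper search is given below (the interfaces, in particular evaluate_ber, are placeholders of our own, not the authors' implementation; the weights w can come from the w_diff sketch above). It walks the threshold schedule and keeps the subset with the lowest balanced error rate:

```python
import numpy as np

def cdfe_select(w, X, y, X_val, y_val, thresholds, evaluate_ber):
    """Filtrapper search over nested subsets: raising the W_diff threshold
    eliminates progressively more features; each surviving subset is scored
    with a classifier wrapped by `evaluate_ber(Xtr, ytr, Xva, yva)`."""
    best_ber, best_idx = np.inf, np.arange(len(w))
    for tau in thresholds:                  # the N_T threshold levels
        keep = np.where(w > tau)[0]         # features retained at this level
        if keep.size == 0:
            break
        ber = evaluate_ber(X[:, keep], y, X_val[:, keep], y_val)
        if ber < best_ber:
            best_ber, best_idx = ber, keep
    return best_ber, best_idx               # lowest BER and the chosen features
```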

    3.1 Rationale of the Diff-Criterion Measure

In the remainder of this section, theoretical justification for the diff-criterion is provided. We determine mutual information in terms of the diff-criterion and show that the diff-criterion is computationally more efficient.

Definition 3.4. Mutual information is a measure of the amount of information that one variable contains about another variable [38].

It is calculated by finding the relative entropy, or Kullback-Leibler distance, between the joint distribution p(C, F_i) of the two random variables C and F_i and their product distribution p(C) p(F_i) [38]. Being consistent with our notation used in (2) and (3), the weight of the ith feature, F_i, using mutual information is

W(F_i)_{mi} = D_{KL}\big(p(C, F_i) \,\|\, p(C)\,p(F_i)\big) = \sum_{C} \sum_{F_i} p(C, F_i) \log_2 \frac{p(C, F_i)}{p(C)\,p(F_i)}.   (4)

Remark 3.4. Because of the properties of the Kullback-Leibler divergence, W(F_i)_mi ≥ 0, with equality if and only if C and F_i are independent. A larger W_mi value means a feature is more important.

Writing the mutual information given in (4) in terms of class-conditional probabilities,

W(F_i)_{mi} = \sum_{C} \sum_{F_i} p(F_i \mid C)\, p(C) \log_2 \frac{p(F_i \mid C)}{p(F_i)}.   (5)

Since p(F_i = f) = p(F_i = f | C = 0) p(C = 0) + p(F_i = f | C = 1) p(C = 1) and p(C = 0) + p(C = 1) = 1, and using the notation p(C = c) = P_c for the prior probability of the class variable and d_{ifc} = p(F_i = f | C = c) for the class-conditional probabilities of the ith feature, where f, c ∈ {0, 1}, in (5), we get

W(F_i)_{mi} = -P_0\, d_{i00} \log_2 \frac{P_0(d_{i00} - d_{i01}) + d_{i01}}{d_{i00}} - P_0\, d_{i10} \log_2 \frac{P_0(d_{i10} - d_{i11}) + d_{i11}}{d_{i10}} - (1 - P_0)\, d_{i01} \log_2 \frac{P_0(d_{i00} - d_{i01}) + d_{i01}}{d_{i01}} - (1 - P_0)\, d_{i11} \log_2 \frac{P_0(d_{i10} - d_{i11}) + d_{i11}}{d_{i11}}.   (6)

Putting d_{i00} + d_{i10} = 1 and d_{i01} + d_{i11} = 1 in (6), suppressing the index i, and rearranging the terms, we get

W(F_i)_{mi} = P_0 \log_2\big[d_{10}^{d_{10}} (1 - d_{10})^{(1 - d_{10})}\big] + (1 - P_0) \log_2\big[d_{11}^{d_{11}} (1 - d_{11})^{(1 - d_{11})}\big] - \log_2\big[\big(d_{11} - P_0(d_{11} - d_{10})\big)^{(d_{11} - P_0(d_{11} - d_{10}))} \big(1 - (d_{11} - P_0(d_{11} - d_{10}))\big)^{(1 - (d_{11} - P_0(d_{11} - d_{10})))}\big].   (7)

Equation (7) indicates that the mutual information between a feature and the class variable depends on P_0, d_11, d_10, and the diff-criterion measure, (d_11 - d_10). The first term in (7) ranges over [-P_0, 0], with its minimum at d_10 = 0.5 and maxima at d_10 = 0, 1. Similarly, the second term lies in the range [-(1 - P_0), 0], with a minimum value at d_11 = 0.5 and maxima at d_11 = 0, 1. The third term lies in the range [0, 1] and depends on P_0 and (d_11 - d_10). The most significant contribution to the mutual information comes from the third term, which contains the diff-criterion measure. Fig. 1 shows the relationship between mutual information and the diff-criterion for a balanced data set, P_0 = 0.5, a partially unbalanced data set with P_0 = 0.25, and an unbalanced data set having P_0 = 0.035. Mutual information increases as a function of the diff-criterion. The change in mutual information due to different values of d_11 and d_10 but the same value of (d_11 - d_10) is relatively small, as is evident from the standard deviation bars on the three plots in Fig. 1. It is also observed that the maximum value of mutual information, obtained with the diff-criterion (d_11 - d_10) = 1, decreases as P_0 decreases.

Fig. 1. Relationship between the diff-criterion and mutual information for balanced (left), partially unbalanced (middle), and highly unbalanced (right) data.
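As a quick numerical sanity check on (7) (our own snippet; the parameter values are arbitrary and not taken from the paper), the closed form can be compared against mutual information computed directly from the joint distribution p(C, F_i):

```python
import numpy as np

def mi_direct(P0, d10, d11):
    """I(C; F_i) computed from the joint p(C, F_i) built out of
    P0 = p(C=0), d10 = p(F_i=1|C=0), d11 = p(F_i=1|C=1)."""
    P = np.array([[P0 * (1 - d10),       P0 * d10],
                  [(1 - P0) * (1 - d11), (1 - P0) * d11]])   # rows: C, cols: F_i
    pc = P.sum(axis=1, keepdims=True)                        # p(C)
    pf = P.sum(axis=0, keepdims=True)                        # p(F_i)
    nz = P > 0
    return float((P[nz] * np.log2(P[nz] / (pc @ pf)[nz])).sum())

def mi_from_eq7(P0, d10, d11):
    """Right-hand side of (7); assumes d10, d11 strictly between 0 and 1."""
    h = lambda p: p * np.log2(p) + (1 - p) * np.log2(1 - p)  # log2(p^p (1-p)^(1-p))
    q = d11 - P0 * (d11 - d10)                               # this is p(F_i = 1)
    return P0 * h(d10) + (1 - P0) * h(d11) - h(q)

print(mi_direct(0.25, 0.2, 0.7), mi_from_eq7(0.25, 0.2, 0.7))  # the two values agree
```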

Remark 3.5. The mutual information of a feature, F_i, whose value remains the same over the two classes is 0.

Proof. In this case, the diff-criterion (d_11 - d_10) becomes 0, i.e., W(F_i)_diff = 0. Putting this value in (7) and using \lim_{y \to 0,1} \log_2\big(y^y (1 - y)^{(1 - y)}\big) = 0, we get W(F_i)_mi = 0.   □

Theorem 3.1. Mutual information is upper bounded by the entropy of the class variable.



Proof. The most relevant feature has (d_11 - d_10) = 1, i.e., W(F_i)_diff = 1. Putting this value in (7) and using \lim_{y \to 0,1} \log_2\big(y^y (1 - y)^{(1 - y)}\big) = 0, we get

W(F_i)_{mi} = -P_0 \log_2 P_0 - P_1 \log_2 P_1 = -p(C = 0) \log_2 p(C = 0) - p(C = 1) \log_2 p(C = 1) = \mathrm{Entropy}(C).   □

Remark 3.6. The range of the diff-criterion measure is [0, 1], whereas mutual information lies within [0, Entropy(C)].

In other words, features with higher W_diff will reduce the uncertainty of the class variable more than features with lower W_diff, and a feature whose W_diff is 1 will contain all the information required to predict the class variable.
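As a concrete illustration of Theorem 3.1 (the numbers here are ours): for a balanced data set, P_0 = 0.5, Entropy(C) = 1 bit, so a feature with W_diff = 1 attains W_mi = 1; for P_0 = 0.25, Entropy(C) = -(0.25 log2 0.25 + 0.75 log2 0.75) ≈ 0.81 bits; and for P_0 = 0.035, Entropy(C) ≈ 0.22 bits. This is consistent with the decreasing maxima of mutual information seen in Fig. 1 as P_0 decreases.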

Remark 3.7. The diff-criterion is a computationally less expensive measure than mutual information.

If we assume that it takes t_1 units of time to calculate the density term p(F_i = 1 | C), a subtraction operation is performed in t_2, and an absolute-value operation takes t_3 units of time, then the computational cost of W(F_i)_diff given in (3) is 2 t_1 + t_2 + t_3. Further, if we assume that p(C) and p(F_i) take t_4, log_2 takes t_5, a division operation takes t_6, and a multiplication takes t_7 units of time, then the computational cost of W(F_i)_mi given in (5) is 4 t_1 + 4 t_2 + 4 t_4 + 4 t_5 + 4 t_6 + 8 t_7. Comparing the two computational costs, and keeping in mind that logarithm, multiplication, and division are expensive operations, we find that the diff-criterion is a relatively less expensive measure.

    4 TWO-STAGE FEATURE  SELECTION ALGORITHMS

Feature ranking algorithms, while selecting a final subset, ignore redundancies among the features. Without any search strategy, they choose features that are highly relevant to the class variable. Due to their simplicity and computational efficiency, they are highly popular in application domains involving high-dimensional data. On the other hand, feature subset selection algorithms take the redundancies among features into consideration while selecting features but are computationally expensive on data having a very large number of features. In this section, we suggest combining an FR algorithm using a filtrapper approach [20] with an FSS algorithm to overcome these limitations. The idea of designing dimensionality reduction algorithms with more than one stage is not new [25], [39]. However, this kind of combination of FR and FSS algorithms for high-dimensional binary data has not yet been explored. The first stage of the two-stage algorithm is based on a computationally cheap FR measure and selects a preliminary subset with the best classification accuracy. A potentially large number of irrelevant and redundant features are discarded in this phase. This makes the job of an FSS algorithm relatively easy. In the second stage, a “higher performance,” computationally more expensive FSS algorithm is employed to select the most useful features from the reduced feature set produced in the first stage.

4.1 First Stage: Selection of the Preliminary Feature Subset

To evaluate its effectiveness as a preprocessor to FSS algorithms, CDFE is employed in the first stage of our two-stage algorithm. In this capacity, it provides them with a reduced initial feature subset having good classification accuracy as compared to the entire feature set. Besides the irrelevant features, a large number of redundant features are eliminated by CDFE during this stage. The subset thus generated is not only easier for the FSS algorithm in the second stage to manipulate but also improves its classification performance.
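The wiring of the two stages is straightforward; the sketch below shows it with both stages abstracted as callables (our own illustrative interfaces, not the authors' code). Stage 1 could be the cdfe_select sketch above, and stage 2 any FSS routine such as MBF or BMM-MBF:

```python
import numpy as np

def two_stage_select(X, y, stage1_select, stage2_select):
    """Stage 1 (a cheap FR pass) returns indices of a preliminary subset;
    stage 2 (an FSS algorithm) is run only on that reduced subset."""
    stage1 = np.asarray(stage1_select(X, y))               # preliminary subset
    stage2 = np.asarray(stage2_select(X[:, stage1], y))    # refined subset (local indices)
    return stage1[stage2]                                  # indices into the original features
```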

4.2 Second Stage: Selection of the Final Feature Subset

In this paper, we have tested two FSS algorithms in the second stage: Koller and Sahami's Markov blanket filtering algorithm [4], which is an approximation to the theoretically optimal feature selection criterion, and our Bernoulli mixture model-based Markov blanket filtering algorithm [29], which makes MBF computationally more efficient. The two algorithms are briefly described here.

4.2.1 Koller and Sahami's Markov Blanket Filtering Algorithm [4]

Koller and Sahami show that a feature F_i can be safely eliminated from a set without an increase in the divergence from the true class distribution if its Markov blanket, M, can be identified. Practically, it is not possible to exactly pinpoint the true MB of F_i; hence, heuristics have to be applied. MBF is a backward elimination algorithm and is outlined in Table 1. For each feature F_i, a candidate set M_i consisting of the K features which have the highest correlation with F_i is selected. The value of K should be as large as possible to subsume all the information F_i contains about the class and the other features. Then, MBF estimates how close M_i is to being the MB of F_i using the following expected cross entropy measure:

G(F_i \mid M_i) = \sum_{f_{M_i}, f_i} P(M_i = f_{M_i}, F_i = f_i)\, D_{KL}\big(P(C \mid M_i = f_{M_i}, F_i = f_i) \,\|\, P(C \mid M_i = f_{M_i})\big).   (8)

The feature F_i having the smallest value of G(F_i | M_i) is omitted. The output of this algorithm can also be a list of features sorted according to relevance to the class variable. Its time complexity is O(r M K N 2^K L), where r is the number of features to eliminate and L is the total number of classes.
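A simplified sketch of one MBF elimination step is given below (our own illustration, not the authors' implementation: probabilities are estimated by plain counting, correlations with numpy.corrcoef, and no attention is paid to efficiency, so it is only practical for small K and modest feature counts):

```python
import numpy as np
from itertools import product

def expected_cross_entropy(X, y, i, Mi, eps=1e-12):
    """G(F_i | M_i) of eq. (8), estimated by counting over binary training data."""
    N = len(y)
    classes = np.unique(y)
    G = 0.0
    for f_M in product((0, 1), repeat=len(Mi)):
        in_M = np.all(X[:, Mi] == f_M, axis=1)
        for f_i in (0, 1):
            in_both = in_M & (X[:, i] == f_i)
            p_joint = in_both.sum() / N                     # P(M_i = f_M, F_i = f_i)
            if p_joint == 0:
                continue
            p_c_Mi = np.array([(in_both & (y == c)).sum() for c in classes]) / in_both.sum()
            p_c_M  = np.array([(in_M    & (y == c)).sum() for c in classes]) / in_M.sum()
            kl = np.sum(p_c_Mi * np.log2((p_c_Mi + eps) / (p_c_M + eps)))
            G += p_joint * kl
    return G

def mbf_step(X, y, remaining, K=2):
    """One backward-elimination step: each remaining feature's K most
    correlated companions form its candidate blanket; the feature whose
    blanket approximates it best (smallest G) is dropped.
    Assumes no constant columns (otherwise corrcoef yields NaNs)."""
    corr = np.abs(np.corrcoef(X[:, remaining], rowvar=False))
    np.fill_diagonal(corr, -1.0)                            # a feature is not its own blanket
    scores = []
    for pos, i in enumerate(remaining):
        Mi = [remaining[k] for k in np.argsort(-corr[pos])[:K]]
        scores.append(expected_cross_entropy(X, y, i, Mi))
    drop = int(np.argmin(scores))
    return [f for pos, f in enumerate(remaining) if pos != drop]
```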

4.2.2 Bernoulli Mixture Model-Based Markov Blanket Filtering Algorithm [29]

Larger values of K in the MBF algorithm demand heavy computations for calculating the expected cross entropy measure given in (8) from the training data. This issue is addressed by the BMM-MBF algorithm for binary data sets, which estimates the cross entropy measure from a Bernoulli mixture model instead of the training set. A Bernoulli mixture model can be seen as a tool for partitioning an M-dimensional hypercube, identifying regions of high data density on the corners of the hypercube. BMM-MBF is the same as the MBF algorithm given in Table 1, except that Step 2b is replaced by the steps given in Table 2.

BMM-MBF first determines the Bernoulli mixtures (Q_1 and Q_0 of them) from the training data for the positive and negative (C_1 and C_0) classes, respectively. The qth mixture is specified by the prior π_q and the probability vector p_q ∈ [0, 1]^M, 1 ≤ q ≤ Q, where Q = Q_1 + Q_0. These two parameters can be determined by the expectation maximization (EM) algorithm [40]. Then, BMM-MBF thresholds the values of the probability vector to see which corner of the hypercube is represented by this mixture. A probability value greater than 0.5 is taken as a 1, and 0 otherwise. This converts p_q into a feature vector x whose probability of occurrence can be estimated as

p(x \mid q) = \pi_q \prod_{i=1}^{M} p_{qi}^{x_i} (1 - p_{qi})^{1 - x_i},   (9)

where p_{qi} ∈ [0, 1], 1 ≤ i ≤ M, denotes the probability of success of the ith feature in the qth mixture. The feature vector having the highest probability of occurrence according to (9) in the mixture density is termed the “main vector” and is denoted by v,

v = \arg\max_{x \in X} p(x \mid q),

where X represents the set of all binary vectors in {0, 1}^M. Once the main vectors are estimated, the steps given in Table 2 are then followed. Here, the kth mixtures for the positive and negative classes are denoted by q_k^1 and q_k^0, respectively.

The BMM-MBF algorithm has a time complexity of O(r M K Q^2 L). The cross entropy measure in MBF is computed from N × K sized data, and we need to look at 2^K combinations of values. On the other hand, BMM-MBF computes the cross entropy measure from Q × K sized data, where we are only looking at the Q main vectors of the Bernoulli mixtures, resulting in a dramatic reduction in time.
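The thresholding and scoring of (9) amount to a few NumPy lines; the sketch below (variable names and the toy numbers are ours) shows how the main vector of each fitted mixture would be obtained:

```python
import numpy as np

def main_vectors(priors, probs):
    """Given fitted Bernoulli mixtures (priors pi_q and probability vectors p_q,
    e.g., from EM), threshold each p_q at 0.5 to get the hypercube corner it
    represents and score that corner with eq. (9)."""
    corners = (probs > 0.5).astype(int)   # one binary vector per mixture
    occurrence = priors * np.prod(np.where(corners == 1, probs, 1 - probs), axis=1)
    return corners, occurrence

# Toy example: two mixtures over M = 3 binary features.
priors = np.array([0.6, 0.4])
probs = np.array([[0.9, 0.8, 0.1],
                  [0.2, 0.1, 0.7]])
v, p = main_vectors(priors, probs)
print(v)   # [[1 1 0]
           #  [0 0 1]]
print(p)   # [0.3888 0.2016]
```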


TABLE 1: MBF Algorithm [4]

TABLE 2: BMM-MBF Algorithm [29]


    5 EXPERIMENTAL RESULTS

This section first measures the effectiveness of our class-dependent density-based feature elimination algorithm used as a stand-alone method. Then, we evaluate CDFE as a preprocessor to the FSS algorithms described in Section 4, as part of our two-stage algorithm for selecting features from high-dimensional binary data. Experiments are carried out on three different real-life benchmark data sets using two different classifiers. The three data sets are NOVA, GINA, and HIVA, which were collected from the text-mining, handwriting, and medicine domains, respectively, and were introduced in the agnostic learning track of the “Agnostic Learning versus Prior Knowledge” challenge organized by the International Joint Conference on Neural Networks in 2007 [1]. The data sets are summarized in Table 3.

TABLE 3: Summary of the Data Sets [1]. The number of classes for each data set is 2. The train, valid, and test columns show the total number of instances in the corresponding data sets.

Designed for the text classification task, NOVA classifies emails into two classes: politics and religion. The data are a sparse binary representation of a vocabulary of 16,969 words and hence consist of 16,969 features. The positive class is 28.5 percent of the total instances. Thus, NOVA is a partially unbalanced data set.

HIVA is used for predicting the compounds that are active against the AIDS HIV infection. The data are represented as 1,617 sparse binary features, and 3.5 percent of the instances comprise the positive class. HIVA is, thus, an unbalanced data set.

The GINA data set is used for the handwritten digit recognition task, which consists of separating two-digit even numbers from two-digit odd numbers. With sparse continuous input variables, it is designed such that only the unit digit provides information about the classes. The GINA features are integers quantized to 256 grayscale levels. We converted these 256 gray levels into 2 by substituting 1 for the values greater than 0. This is equivalent to converting a grayscale image to a binary image. Data sets with GINA-like feature values can be binarized with this strategy, which does not affect the sparsity of the data. The positive class is 49.2 percent of the total instances. In other words, GINA is balanced between the positive and negative classes.
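The binarization just described is a one-liner; a sketch (ours) over a grayscale feature matrix would be:

```python
import numpy as np

def binarize(X_gray):
    """Map 256 grayscale levels to {0, 1}: any value greater than 0 becomes 1.
    Zeros stay zero, so the sparsity pattern of the data is unchanged."""
    return (X_gray > 0).astype(np.uint8)
```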

The class labels of the test sets of these data sets are not publicly available, but one can make an online submission to obtain the prediction accuracy on the test set. In our experiments, the training and validation sets are combined to train the naive Bayes' and kernel ridge regression (kridge) classifiers. The software implementations of both classifiers given in the Challenge Learning Object Package (CLOP) [35] were used. The classification performance is evaluated by the balanced error rate (BER) over fivefold cross validation. BER is the average of the error rates of the positive and negative classes [20]. Given two classes, if tn denotes the number of negative instances that are correctly labeled by the classifier and fp refers to the number of negative instances that are incorrectly labeled, then the false positive rate is defined as fpr = fp / (tn + fp). Similarly, we can define the false negative rate as fnr = fn / (tp + fn), where fn is the number of positive instances that are incorrectly labeled by the classifier and tp denotes the number of positive instances that are correctly labeled. Thus, BER is given by

BER = 0.5 (fpr + fnr).

For data sets that are unbalanced in cardinality, BER gives a better picture of the error than the simple error rate [20].
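In code, BER is a two-line function of the confusion-matrix counts (the helper and the example counts below are ours):

```python
def balanced_error_rate(tp, fp, tn, fn):
    """BER = 0.5 * (fpr + fnr) with fpr = fp / (tn + fp), fnr = fn / (tp + fn)."""
    return 0.5 * (fp / (tn + fp) + fn / (tp + fn))

# Example: 90 of 100 positives and 60 of 100 negatives correct
# gives BER = 0.5 * (40/100 + 10/100) = 0.25.
print(balanced_error_rate(tp=90, fp=40, tn=60, fn=10))
```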

5.1 Class-Dependent Density-Based Feature Elimination as a Stand-Alone Feature Selection Algorithm

Experiments described in this section test the performance of CDFE as a stand-alone feature selection algorithm. Features are scored using the diff-criterion and are then sorted in descending order according to their weights. Fig. 2 shows the weights (sorted in descending order) assigned to the features by the max-criterion, diff-criterion, and mutual information measures using (2), (3), and (4), respectively. Although each measure assigns a different value to a feature, we are actually interested in their patterns. The curve of the diff-criterion for the three data sets lies in the middle of the curves of the other two measures, while the curve of the max-criterion lies on top of the three curves. It is evident from these patterns that the diff-criterion behaves more similarly to mutual information than the max-criterion does. For NOVA, the W_diff values lie in the range [0, 0.231], with most of the features having values close to zero, as shown by its diff-criterion pattern. Thus, most NOVA features have poor discriminating power. The W_diff values of HIVA range over [0, 0.272]. Compared to NOVA, a larger fraction of HIVA features have good class separation capability, as seen in Fig. 2. In case of GINA, the W_diff values lie within the range [0, 0.471], with a fairly large fraction of features having good discriminating power.

Fig. 2. Comparison of weights assigned to the features for NOVA (left), HIVA (middle), and GINA (right).

To find the final subset, the space of M features is searched with a filtrapper-like approach [20]. We define nested sets of features, progressively eliminating more and more features of decreasing relevance with the help of a sequence of increasing threshold values on W_diff. For a given threshold value, we discard a number of features and retain the remaining ones. The usefulness of every feature subset thus generated is tested using the classification accuracy of a classifier.

In Fig. 3, we look at the effectiveness of a feature ranking method. The CDFE algorithm is compared against a baseline method that generates nested feature subsets but selects features randomly from the unranked data set. From the plots, we observe that ranking the NOVA, HIVA, and GINA features significantly improves the classification accuracy. Besides this, a feature subset of smaller size attains the BER value that is obtained with the set containing all the features.

Next, we compare CDFE against three feature selection algorithms: the mutual information-based ranking method, Koller and Sahami's MBF algorithm, and the BMM-MBF algorithm, using the kridge classifier. The MI-based ranking method assigns weights to features according to the mutual information measure given in (4) and sorts them in order of decreasing weights. The plots are given in Fig. 4, and the results are tabulated in Table 4. Among these algorithms, CDFE is the least expensive. Both MBF and BMM-MBF applied on the entire NOVA feature set become computationally infeasible, as each algorithm involves calculating a correlation matrix of size M × M. For data sets with a large number of features, such calculations render these algorithms impractical. For this reason, we could not compare the performance of CDFE against that of the MBF and BMM-MBF algorithms for NOVA. However, when compared with the MI-based ranking method, CDFE gives better results, as shown in Fig. 4. CDFE reduces the original dimensionality to a set of 3,135 features (81.53 percent reduction) having classification accuracy as good as that attained with the entire feature set. On the other hand, the MI-based ranking method selects a subset of 4,950 features (70.83 percent reduction). For the HIVA data set, CDFE results in higher classification accuracy compared to the other three feature selection algorithms. It generates a subset with about 8.6 percent of the original features. Here, CDFE's performance is close to MBF and outperforms the MI-based ranking method and the BMM-MBF algorithm. In case of GINA, the classification accuracy patterns of CDFE, MBF, and the MI-based ranking method are similar. CDFE generates a subset whose dimensionality is 33 percent of the original feature set and comes third in terms of feature reduction.


Fig. 3. Comparison of CDFE against a baseline method of selecting random features without feature ranking for NOVA (left), HIVA (middle), and GINA (right) using the kridge classifier. Each panel plots BER against the size of the feature subsets; the reference lines mark the BER obtained with all the features (0.0702 for NOVA's 16,969 features, 0.2678 for HIVA's 1,617 features, and 0.1404 for GINA's 970 features).

Fig. 4. Comparison of the CDFE algorithm against the MI-based ranking, MBF, and BMM-MBF algorithms for NOVA (left), HIVA (middle), and GINA (right) using the kridge classifier.


    5.2 Two-Stage Feature Selection Algorithms

This section measures the performance of the two-stage algorithm with CDFE used as a preprocessor to an FSS algorithm (MBF or BMM-MBF) in the second stage. For this purpose, we compare the performance of the two stages used in unison against that of the second-stage feature selection algorithm alone.

    5.2.1 Stage-1: Class-Dependent Density-Based Feature Elimination 

When CDFE is used as a preprocessor, we choose the feature subset resulting in the minimum BER value for the next stage. Table 5 summarizes the minimum BER results for the three data sets obtained from Fig. 4. The NOVA plot indicates that a BER of 6.38 percent is obtained when we eliminate features using a threshold of W_diff


features selected by the two-stage algorithm result in classification accuracy as good as that achieved by all the features. From the HIVA plot, a shift in the optimum BER point of MBF toward the left is evident when it is combined with CDFE. MBF alone results in an optimum BER of 26.8 percent with 185 features, while it attains an optimum BER of 26.4 percent with 140 features in two stages. The smallest subset which attains a BER value equal to that attained with all the HIVA features consists of 96 features when MBF is used as a stand-alone method. It consists of 76 features when MBF is combined with CDFE. In case of GINA, MBF alone performs the classification task with 128 features with an accuracy equal to that obtained with the entire feature set. The size of this subset is reduced to 32 features when CDFE and MBF are combined in two stages.

Fig. 6 investigates the performance of the two-stage algorithm against that of MBF using the kridge classifier. When applied on NOVA, the smallest subset selected by the two-stage algorithm which attains a BER value equal to that obtained with all the features consists of 1,792 features. For HIVA, MBF selects a subset of 64 features, while CDFE and MBF in two stages select 58 features to perform the classification task without any degradation of the accuracy obtained with all the features. The GINA results indicate that 165 features selected by MBF result in a BER that is obtained with the entire feature set. On the other hand, the subset selected by the two-stage algorithm consists of 150 features.

5.2.3 Results of the Two-Stage Algorithm with the BMM-MBF Algorithm in the Second Stage

In this section, the performance of the two-stage algorithm is discussed when CDFE is combined with the BMM-MBF algorithm. We experimented with the BMM-MBF algorithm using different values of K and found that, unlike Koller and Sahami's MBF algorithm, it remains computationally efficient even if it has to search for the Markov blanket of a feature using values of K as large as 40. For each data set, we use the optimal value of K. As with MBF, we evaluated the performance of our two-stage algorithm against the classification accuracy of the entire feature set obtained by the naive Bayes' and kridge classifiers for the NOVA data set. For HIVA and GINA, the two-stage algorithm was compared against the performance of the BMM-MBF algorithm. The empirical results are shown in Figs. 7 and 8 and are summarized in Table 7. We find that the performance of the BMM-MBF algorithm, both in terms of feature reduction and classification accuracy, is significantly improved with the addition of the CDFE algorithm as a first stage.

TABLE 7: Comparison of the Two-Stage (CDFE + BMM-MBF) Algorithm against the BMM-MBF Algorithm


Fig. 6. Comparison of the two-stage (CDFE + MBF) algorithm against the MBF algorithm for NOVA (left), HIVA (middle), and GINA (right) using the kridge classifier. Each panel plots BER against the size of the feature subsets (K = 2 for NOVA and HIVA, K = 1 for GINA); the reference lines mark the BER obtained with all the features.

TABLE 6: Comparison of the Two-Stage (CDFE + MBF) Algorithm against the MBF Algorithm. F is the entire feature set, G is the selected feature subset, and BER is the balanced error rate.

Fig. 7. Comparison of the two-stage (CDFE + BMM-MBF) algorithm against the BMM-MBF algorithm for NOVA (left), HIVA (middle), and GINA (right) using the naive Bayes' classifier.


Fig. 7 compares the performance of the two-stage algorithm and that of the BMM-MBF algorithm using the naive Bayes' classifier. For the NOVA data set, our two-stage algorithm leads to an optimum BER value of 2 percent with 2,048 features, while it selects a subset of 605 features with classification accuracy as good as that obtained by all the features. The HIVA plot indicates that the classification accuracy of BMM-MBF is improved with the introduction of the CDFE stage, in such a manner that almost 8 percent of the original features result in an accuracy obtained with all the features. In case of GINA, we find that BMM-MBF alone performs the classification task with 279 features with an accuracy equal to that attained with all the features. The addition of CDFE to BMM-MBF reduces this subset to 165 features.

In Fig. 8, the results of the kridge classifier when applied on the three data sets are shown. The dimensionality of the NOVA subset selected in the first stage is reduced further to 780 by BMM-MBF without compromising the classification accuracy that is obtained with all the features. From the HIVA plot, we find that the smallest subset selected by BMM-MBF to perform the classification task with a BER value equal to that attained by all the features consists of 817 features. The size of this subset is reduced to 140 when CDFE and BMM-MBF are combined in two stages. When the experiment was run on the GINA data set, BMM-MBF selected 550 features, while the two-stage algorithm selected 279 features.

Fig. 8. Comparison of the two-stage (CDFE + BMM-MBF) algorithm against the BMM-MBF algorithm for NOVA (left), HIVA (middle), and GINA (right) using the kridge classifier.

5.3 Comparison of CDFE Performance against the Top 3 Winning Entries of the Agnostic Learning Track [1]

The organizers of the agnostic learning track of the “Agnostic Learning versus Prior Knowledge” challenge evaluated all the entrants on the basis of the BER on the test sets. We tested CDFE in both capacities, as a stand-alone method and as part of the two-stage algorithm (i.e., as a preprocessor to MBF or BMM-MBF), with the kridge classifier and the classification method given in [36] for NOVA, HIVA, and GINA. Table 8 gives a comparison of CDFE's performance against the top 3 winning entries of the agnostic learning track of the challenge.

TABLE 8: Comparison of CDFE Performance against the Top 3 Winning Entries of the Agnostic Learning Track [1]. BMM is Bernoulli mixture model, PCA is principal component analysis, PSO is particle swarm optimization, and SVM is support vector machine.

In case of NOVA, both of our methods, the stand-alone CDFE algorithm and the two-stage algorithm, outperform the top 3 results. We also find that CDFE performs better in two stages (CDFE + MBF) than in the stand-alone case. For the HIVA data set, we observe that the BER value obtained by CDFE with the kridge classifier outperforms the top 3 results. When combined with MBF in two stages, CDFE results in a performance that is comparable to the three winning BER results. Results obtained on GINA indicate that the two-stage (CDFE + MBF) algorithm beats the second and third winning entries. As a stand-alone method, CDFE obtains the third position in the ranking of the top 3 entries of the challenge.

Feature selection algorithms may behave differently on data sets from different application domains. The main factors that affect the performance include the number of features and samples and the balance of the classes of the training data [20]. NOVA, HIVA, and GINA belong to different application domains. Their ratios of training samples to features are 0.103, 2.378, and 3.251, and the positive class is 28.5, 3.5, and 49.2 percent of the total samples, respectively. The results in Table 8 indicate that CDFE, which is currently limited to the domain of two-class classification with binary-valued features, performs consistently better than the other feature selection algorithms used in the challenge.

    6 CONCLUSIONS

This paper is devoted to feature selection in high-dimensional binary data sets. We proposed a ranking criterion, called the diff-criterion, to estimate the relevance of features using their density values over the classes. We showed that it is equivalent to the mutual information measure but is computationally more efficient. Based on the diff-criterion, we proposed a supervised feature selection algorithm, termed class-dependent density-based feature elimination, to select a subset of useful binary features. CDFE uses a classifier instead of a user-provided threshold value to select the final subset. Our experiments on three real-life data sets demonstrate that CDFE, in spite of its simplicity and computational efficiency, either outperforms other well-known feature selection algorithms or is comparable to them in terms of classification and feature selection performance.

We also found that CDFE can be effectively used as a preprocessing step for other feature selection algorithms to determine compact subsets of features without compromising accuracy on a classification task. It thus provides them with a substantially smaller feature subset having better class separability. Feature selection algorithms such as MBF and BMM-MBF, which involve square matrices of size equal to the number of features, become computationally intractable for high-dimensional data sets. It was shown empirically that CDFE adequately relieves them of this problem and significantly improves their classification and feature selection performance.

Furthermore, we analyzed CDFE's performance by comparing it against the winning entries of the agnostic learning track of the “Agnostic Learning versus Prior Knowledge” challenge. The results indicate that CDFE outperforms the best entries obtained on the NOVA and HIVA data sets and attains the third position on the GINA data set.

    ACKNOWLEDGMENTS

Kashif Javed was supported by a doctoral fellowship at the University of Engineering and Technology, Lahore. The authors would like to thank the anonymous reviewers for their helpful comments.

REFERENCES

[1] I. Guyon, A. Saffari, G. Dror, and G. Cawley, “Agnostic Learning vs. Prior Knowledge Challenge,” Proc. Int'l Joint Conf. Neural Networks (IJCNN), http://www.agnostic.inf.ethz.ch, 2007.
[2] “Feature Selection Challenge by Neural Information Processing Systems Conference (NIPS),” http://www.nipsfsc.ecs.soton.ac.uk, 2003.
[3] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, second ed. Wiley, 2001.
[4] D. Koller and M. Sahami, “Toward Optimal Feature Selection,” Proc. 13th Int'l Conf. Machine Learning, pp. 284-292, 1996.
[5] L. Yu and H. Liu, “Efficient Feature Selection via Analysis of Relevance and Redundancy,” J. Machine Learning Research, vol. 5, pp. 1205-1224, 2004.
[6] M. Dash and H. Liu, “Feature Selection for Classification,” Intelligent Data Analysis, vol. 1, no. 3, pp. 131-156, 1997.
[7] I. Guyon and A. Elisseeff, “An Introduction to Variable and Feature Selection,” J. Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[8] L. Jimenez and D. Landgrebe, “Supervised Classification in High Dimensional Space: Geometrical, Statistical and Asymptotical Properties of Multivariate Data,” IEEE Trans. Systems, Man, and Cybernetics, Part C: Applications and Rev., vol. 28, no. 1, pp. 39-54, Feb. 1998.
[9] D. Scott and J. Thompson, “Probability Density Estimation in Higher Dimensions,” Proc. 15th Symp. Interface, pp. 173-179, 1983.
[10] R. Bellman, Adaptive Control Processes: A Guided Tour. Princeton Univ. Press, 1961.
[11] S. Raudys and A. Jain, “Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 3, pp. 252-264, Mar. 1991.
[12] K. Kira and L.A. Rendell, “A Practical Approach to Feature Selection,” Proc. Ninth Int'l Conf. Machine Learning, pp. 249-256, 1992.
[13] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene Selection for Cancer Classification Using Support Vector Machines,” Machine Learning, vol. 46, pp. 389-422, 2002.
[14] J.B. Tenenbaum, V. de Silva, and J.C. Langford, “A Global Geometric Framework for Nonlinear Dimensionality Reduction,” Science, vol. 290, pp. 2319-2323, 2000.
[15] L.K. Saul and S.T. Roweis, “Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds,” J. Machine Learning Research, vol. 4, pp. 119-155, 2003.
[16] L. van der Maaten, E. Postma, and H. van den Herik, “Dimensionality Reduction: A Comparative Review,” Technical Report TiCC-TR 2009-005, Tilburg Univ., 2009.
[17] R. Kohavi and G. John, “Wrappers for Feature Subset Selection,” Artificial Intelligence, vol. 97, pp. 273-324, Dec. 1997.
[18] H. Liu and L. Yu, “Toward Integrating Feature Selection Algorithms for Classification and Clustering,” IEEE Trans. Knowledge and Data Eng., vol. 17, no. 4, pp. 491-502, Apr. 2005.


    TABLE 8: Comparison of CDFE Performance against Top 3 Winning Entries of the Agnostic Learning Track [1]. (BMM is Bernoulli mixture model, PCA is principal component analysis, PSO is particle swarm optimization, and SVM is support vector machine.)


[19] A.L. Blum and P. Langley, "Selection of Relevant Features and Examples in Machine Learning," Artificial Intelligence, vol. 97, pp. 245-271, 1997.
[20] I. Guyon, S. Gunn, M. Nikravesh, and L.A. Zadeh, Feature Extraction: Foundations and Applications. Springer, 2006.
[21] M. Hall, "Correlation-Based Feature Selection for Discrete and Numeric Class Machine Learning," Proc. 17th Int'l Conf. Machine Learning, 2000.
[22] R. Ruiz and J.S. Aguilar-Ruiz, "Analysis of Feature Rankings for Classification," Proc. Int'l Symp. Intelligent Data Analysis (IDA), pp. 362-372, 2005.
[23] L. Yu and H. Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution," Proc. 20th Int'l Conf. Machine Learning, 2003.
[24] F. Fleuret, "Fast Binary Feature Selection with Conditional Mutual Information," J. Machine Learning Research, vol. 5, pp. 1531-1555, 2004.
[25] H. Peng, F. Long, and C. Ding, "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, Aug. 2005.
[26] G. Qu, S. Hariri, and M. Yousaf, "A New Dependency and Correlation Analysis for Features," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 9, pp. 1199-1207, Sept. 2005.
[27] J. Pearl, Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[28] A. Freno, "Selecting Features by Learning Markov Blankets," Proc. 11th Int'l Conf. Knowledge-Based Intelligent Information and Eng. Systems and XVII Italian Workshop on Neural Networks (KES/WIRN), Part I, pp. 69-76, 2007.
[29] M. Saeed, "Bernoulli Mixture Models for Markov Blanket Filtering and Classification," J. Machine Learning Research, vol. 3, pp. 77-91, 2008.
[30] A. Juan and E. Vidal, "On the Use of Bernoulli Mixture Models for Text Classification," Pattern Recognition, vol. 35, pp. 2705-2710, 2002.
[31] A. Juan and E. Vidal, "Bernoulli Mixture Models for Binary Images," Proc. 17th Int'l Conf. Pattern Recognition (ICPR '04), 2004.
[32] "Annual KDD Cup 2001," http://www.sigkdd.org/kddcup/, 2001.
[33] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int'l Conf. Very Large Databases (VLDB '94), 1994.
[34] J. Wilbur, J. Ghosh, C. Nakatsu, S. Brouder, and R. Doerge, "Variable Selection in High-Dimensional Multivariate Binary Data with Application to the Analysis of Microbial Community DNA Fingerprints," Biometrics, vol. 58, pp. 378-386, 2002.
[35] I. Guyon et al., "CLOP," http://ymer.org/research/files/clop/clop.zip, 2011.
[36] M. Saeed, "Hybrid Learning Using Mixture Models and Artificial Neural Networks," Hands-on Pattern Recognition: Challenges in Data Representation, Model Selection, and Performance Prediction, http://www.clopinet.com/ChallengeBook.html, Microtome, 2008.
[37] M. Saeed and H. Babri, "Classifiers Based on Bernoulli Mixture Models for Text Mining and Handwriting Recognition," Proc. IEEE Int'l Joint Conf. Neural Networks, 2008.
[38] T.M. Cover and J.A. Thomas, Elements of Information Theory. John Wiley and Sons, 1991.
[39] L. Jimenez and D.A. Landgrebe, "Projection Pursuit in High Dimensional Data Reduction: Initial Conditions, Feature Selection and the Assumption of Normality," Proc. IEEE Int'l Conf. Systems, Man and Cybernetics, 1995.
[40] C.M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[41] R.W. Lutz, "Doubleboost," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
[42] V. Nikulin, "Classification with Random Sets, Boosting and Distance-Based Clustering," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
[43] V. Franc, "Modified Multi-Class SVM Formulation; Efficient LOO Computation," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
[44] H.J. Escalante, "Particle Swarm Optimization for Neural Networks," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
[45] J. Reunanen, "Cross-Indexing," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
[46] I.C. ASML team, "Feature Selection with Redundancy Elimination + Gradient Boosted Trees," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.

    Kashif Javed received the BSc and MSc degrees in electrical engineering in 1999 and 2004, respectively, from the University of Engineering and Technology (UET), Lahore, Pakistan, where he is currently working toward the PhD degree. He joined the Department of Electrical Engineering at UET in 1999, where he is currently an assistant professor. His research interests include machine learning, pattern recognition, and ad hoc network security.

    Haroon A. Babri received the BSc degree in electrical engineering from the University of Engineering and Technology (UET), Lahore, Pakistan, in 1981, and the MS and PhD degrees in electrical engineering from the University of Pennsylvania in 1991 and 1992, respectively. He was with the Nanyang Technological University, Singapore, from 1992 to 1998, with Kuwait University from 1998 to 2000, and with the Lahore University of Management Sciences (LUMS) from 2000 to 2004. He is currently a professor of electrical engineering at UET. He has written two book chapters and has more than 60 publications in machine learning, pattern recognition, neural networks, and software reverse engineering.

    Mehreen Saeed received the doctorate degree from the Department of Engineering Mathematics, University of Bristol, United Kingdom, in 1999. She is currently working as an assistant professor in the Department of Computer Science, FAST National University of Computer and Emerging Sciences, Lahore Campus, Pakistan. Her main areas of interest include artificial intelligence, machine learning, and statistical pattern recognition.

