diff criteria

Post on 05-Jul-2018





  • 8/16/2019 Diff Criteria


    Feature Selection Based on Class-Dependent Densities for High-Dimensional Binary Data

    Kashif Javed, Haroon A. Babri, and Mehreen Saeed

    Abstract—Data and knowledge management systems employ feature selection algorithms for removing irrelevant, redundant, and noisy information from the data. There are two well-known approaches to feature selection, feature ranking (FR) and feature subset selection (FSS). In this paper, we propose a new FR algorithm, termed class-dependent density-based feature elimination (CDFE), for binary data sets. Our theoretical analysis shows that CDFE computes the weights used for feature ranking more efficiently than the mutual information measure. Effectively, the rankings obtained from the two criteria approximate each other. CDFE uses a filtrapper approach to select a final subset. For data sets having hundreds of thousands of features, feature selection with FR algorithms is simple and computationally efficient, but redundant information may not be removed. On the other hand, FSS algorithms analyze the data for redundancies but may become computationally impractical on high-dimensional data sets. We address these problems by combining FR and FSS methods in the form of a two-stage feature selection algorithm. When introduced as a preprocessing step to the FSS algorithms, CDFE not only presents them with a feature subset that is good in terms of classification, but also relieves them of heavy computations. Two FSS algorithms are employed in the second stage to test the two-stage feature selection idea. We carry out experiments with two different classifiers (naive Bayes and kernel ridge regression) on three different real-life data sets (NOVA, HIVA, and GINA) of the “Agnostic Learning versus Prior Knowledge” challenge. As a stand-alone method, CDFE shows up to about 92 percent reduction in the feature set size. When combined with the FSS algorithms in two stages, CDFE significantly improves their classification accuracy and exhibits up to 97 percent reduction in the feature set size. We also compared CDFE against the winning entries of the challenge and found that it outperforms the best results on NOVA and HIVA while obtaining third position in the case of GINA.

    Index Terms—Feature ranking, binary data, feature subset selection, two-stage feature selection, classification.




    The advancements in data and knowledge management systems have made data collection easier and faster. Raw data are collected by researchers and scientists working in diverse application domains such as engineering (robotics), pattern recognition (face, speech), internet applications (anomaly detection), and medical applications (diagnosis). These data sets may consist of thousands of observations or instances, where each instance may be represented by tens or hundreds of thousands of variables, also known as features. The number of instances and the number of variables determine the size and the dimension of a data set. Data sets such as NOVA [1], a text classification data set consisting of 16,969 features and 19,466 instances, and DOROTHEA [2], a data set used for drug discovery consisting of 100,000 features and 1,950 instances, are not uncommon these days. Intuitively, having more features implies more discriminative power in classification [3]. However, this is not always true in practical experience, because not all the features present in high-dimensional data sets help in class prediction. Many features might be irrelevant and possibly detrimental to classification. Also, redundancy among the features is not uncommon [4], [5]. The presence of irrelevant and redundant features not only slows down the learning algorithm but also confuses it by causing it to overfit the training data [4]. In other words, eliminating irrelevant and redundant features simplifies the classifier’s design and improves its prediction performance and computational efficiency [6], [7].

    High-dimensional data sets are inherently sparse and hence can be transformed to lower dimensions without losing too much information about the classes [8]. This phenomenon, known as the empty space phenomenon [9], is responsible for the well-known “curse of dimensionality,” a term coined by Bellman [10] in 1961 to describe the problems faced in the analysis of high-dimensional data. He proved that, to estimate multivariate density functions up to a given degree of accuracy, the number of data samples required grows exponentially with the number of data dimensions. While studying the effects of small sample size on classifier design, Raudys and Jain [11] observed a phenomenon related to the curse of dimensionality and termed it the “peaking phenomenon.” They found that, for a given sample size, the accuracy of a classifier first increases with the number of features, approaches the optimal value, and then starts decreasing.
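    The exponential growth mentioned above can be made concrete with a simple counting argument. The sketch below is an illustration only, not Bellman's derivation: it assumes that the range of every feature is divided into a fixed number of bins and that, on average, a fixed number of samples is wanted in each cell of the resulting grid. The function name and its parameters (`bins_per_feature`, `samples_per_cell`) are hypothetical.

```python
def samples_needed(num_features: int,
                   bins_per_feature: int = 10,
                   samples_per_cell: int = 1) -> int:
    """Samples required to average `samples_per_cell` points per grid cell.

    Assumption: each feature is split into `bins_per_feature` bins, so the
    grid has bins_per_feature ** num_features cells.
    """
    return samples_per_cell * bins_per_feature ** num_features


# The requirement is multiplied by `bins_per_feature` for every added feature.
for m in (1, 2, 3, 10):
    print(m, samples_needed(m))
```

    With ten bins per feature, three features already require 1,000 samples under this assumption, and every further feature multiplies the requirement by ten.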

    •   K. Javed and H.A. Babri are with the Department of Electrical Engineering, University of Engineering and Technology, Lahore 54890, Pakistan. E-mail: {kashif.javed, babri}@uet.edu.pk.

    •   M. Saeed is with the Department of Computer Science, National University of Computer and Emerging Sciences, Block-B, Faisal Town, Lahore, Pakistan. E-mail: mehreen.saeed@nu.edu.pk.

    Manuscript received 24 Oct. 2009; revised 11 May 2010; accepted 14 Aug. 2010; published online 21 Dec. 2010. For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2009-10-0734. Digital Object Identifier no. 10.1109/TKDE.2010.263.

    1041-4347/12/$31.00 © 2012 IEEE    Published by the IEEE Computer Society

    Problems faced by learning algorithms with high-dimensional data sets have been intensively worked on by researchers. The algorithms that have been developed can be categorized into two broad groups. Algorithms that select, from the original feature set, a subset of features that is highly effective in discriminating classes are categorized as feature selection (FS) methods. Relief [12] is a popular FS method that filters out irrelevant features using the nearest neighbor approach. Another well-known method is recursive feature elimination support vector machine (RFE-SVM) [13], which selects useful features while training the SVM classifier. The FS algorithms are further discussed in Section 2. On the other hand, algorithms that create a new set of features from the original features, through the application of some transformation or combination of the original features, are termed feature extraction (FE) methods. Among them, principal component analysis (PCA) and linear discriminant analysis (LDA) are two well-known linear algorithms that are widely used because of their simplicity and effectiveness [3]. Nonlinear FE algorithms include isomap [14] and locally linear embedding (LLE) [15]. For a comparative review of FE methods, interested readers are referred to [16]. In this paper, our focus is on the problem of supervised feature selection and we propose a solution that is suitable for binary data sets.
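    The difference between the two groups can be illustrated on a toy binary data set. The sketch below is only a schematic contrast: the helper names `select_features` and `extract_features` are hypothetical, and the weights are arbitrary values chosen for illustration rather than ones learned by PCA, LDA, or any method cited above.

```python
# Toy data: three instances, each described by four binary features.
X = [
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
]


def select_features(rows, keep):
    """Feature selection: keep a subset of the original columns unchanged."""
    return [[row[j] for j in keep] for row in rows]


def extract_features(rows, weights):
    """Feature extraction: each new feature is a linear combination of all columns."""
    return [
        [sum(w * v for w, v in zip(w_row, row)) for w_row in weights]
        for row in rows
    ]


# Selection keeps original features 0 and 2; the values remain binary.
selected = select_features(X, keep=[0, 2])

# Extraction builds two new features from arbitrary (illustrative) weights.
extracted = extract_features(X, weights=[[0.5, 0.5, 0, 0], [0, 0, 0.5, 0.5]])

print(selected)
print(extracted)
```

    Selected features retain their original meaning, whereas extracted features are new quantities that no longer correspond to any single original variable.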

    The remainder of the paper is organized into five sections. Section 2 describes the theory related to feature selection and presents a literature survey of the existing methods. In Section 3, we propose a new feature ranking (FR) algorithm, termed class-dependent density-based feature elimination (CDFE). Section 4 discusses how to combine CDFE with other feature selection algorithms in two stages. Experimental results on three real-life data sets are discussed in Section 5. The conclusions are drawn in Section 6.


    This section describes the theory related to the feature selection problem and surveys the various methods presented in the literature for its solution. Suppose we are given a labeled data set $\{x^t, C^t\}_{t=1}^{N}$ consisting of $N$ instances and $M$ features such that $x^t \in \mathbb{R}^M$ and $C^t$ denotes the class variable of instance $t$. There can be $L$ number of classes. Each vector $x$ is, thus, an $M$-dimensional vector of features; hence, $x^t = \{F_1^t, F_2^t, \ldots, F_M^t\}$. We use $F$ to denote the set comprising all features of a data set, whereas $G$ denotes a feature subset. The feature selection problem is to find a subset $G$ of $m$ features from the set $F$ having $M$ features with the smallest classification error [17] or at least without a significant degradation in the performance [6].
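    The problem statement above can be written compactly as an optimization over subsets. The symbol $\mathrm{err}(G)$ is an assumption introduced here for illustration; it stands for the classification error obtained when only the features in $G$ are used.

```latex
G^{*} = \arg\min_{G \subseteq F,\; |G| = m} \mathrm{err}(G),
\qquad |F| = M, \quad m \le M .
```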

    A straightforward solution to the feature selection problem is to explore all $\binom{M}{m}$ possible subsets of size $m$. However, this kind of search is computationally expensive for even moderate values of $M$ and $m$. Therefore, alternative search strategies have to be designed.
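    The cost of the exhaustive search can be checked directly. The following is a minimal sketch, assuming a caller-supplied `error` function that scores a candidate subset; the helper names and the toy error function in the usage lines are hypothetical and stand in for training and testing a classifier.

```python
from itertools import combinations
from math import comb


def count_subsets(M: int, m: int) -> int:
    """Number of distinct feature subsets of size m drawn from M features."""
    return comb(M, m)


def exhaustive_search(features, m, error):
    """Evaluate every size-m subset and return the one with the smallest error."""
    return min(combinations(features, m), key=error)


# Even a modest problem produces a large number of candidate subsets.
print(count_subsets(20, 10))  # 184756

# Toy usage: four features and an illustrative error function.
best = exhaustive_search(range(4), 2, error=lambda G: abs(sum(G) - 4))
print(best)  # (1, 3)
```

    Because every candidate subset requires its own evaluation, the running time grows with $\binom{M}{m}$, which is why the exhaustive approach is limited to small feature sets.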

    Generally speaking, the feature selection process may consist of four basic steps, namely, subset generation, subset evaluation, stopping criterion, and result validation [6]. In the subset generation step, the feature space is searched according to a search strategy for a candidate subset, which is evaluated later. The search can begin either with an empty set, which is then successively built up (forward selection), or with the entire feature set, from which features are successively eliminated (backward selection). Different search strategi