
Hybrid Feature Selection Method for Biomedical Datasets

Saúl Solorio-Fernández, José Fco. Martínez-Trinidad, Jesús Ariel Carrasco-Ochoa
Computer Science Department, National Institute for Astrophysics, Optics and Electronics
Santa Maria Tonantzintla, Puebla, Mexico
e-mail: {sausolofer,fmartine,ariel}@inaoep.mx

Yan-Qing Zhang
Department of Computer Science, Georgia State University
Atlanta, GA, USA
e-mail: [email protected]

Abstract— Classifying high-dimensional data is currently a very challenging problem. High-dimensional feature spaces affect both the accuracy and the efficiency of supervised learning methods. To address this issue, we present a fast and efficient feature selection algorithm to facilitate classifying high-dimensional datasets such as those appearing in Bioinformatics problems. Our method employs a Laplacian Score ranking to reduce the search space, combined with a simple wrapper strategy to find a good subset of uncorrelated features, resulting in a hybrid feature selection method that is useful for high-dimensional spaces. Experiments carried out on gene microarray datasets demonstrate the effectiveness and robustness of the proposed method.

Feature selection; supervised classification; high-dimensional spaces.

I. INTRODUCTION

Feature selection is the process of identifying a small subset of highly predictive features out of a large set of candidate features, most of which may be irrelevant or redundant [1, 2]. Feature selection plays a fundamental role in pattern recognition, data mining, information retrieval, machine learning, and bioinformatics, among other areas, for a variety of reasons [2].

Currently, new technologies produce large datasets characterized by a large number of features; this is one reason why feature selection has become very important in several scientific disciplines. One of these disciplines is Bioinformatics, where feature selection (also called gene selection) plays a critical role in data classification using microarrays. Here the main goal is to identify a few features/genes, out of thousands, that allow diagnosing diseases. In this context, the number of samples is around one hundred, while the number of features is in the order of thousands or even tens of thousands. Under these circumstances, the performance of most supervised classification algorithms suffers as the number of features becomes excessively large [3,4].

There are three general approaches to feature selection. The first one performs feature selection regardless of the classifier. This approach is known as the filter technique. It computes the importance of each feature and then selects the top-ranked features. Some commonly used ranking metrics are Information Gain, Signal-to-Noise, Fisher Criterion, and T-Statistics, among others. This approach is simple, fast, and easily scales to very high-dimensional data. However, it has some drawbacks, as most of the proposed filter techniques are univariate. This means that each feature is considered and treated separately, ignoring correlation between features. Gene expression datasets, however, contain highly correlated features. As a result, applying filter techniques to gene expression datasets leads to low classification performance.

The second approach performs feature selection taking into account the classifier used in the classification stage. This approach is known as wrapper selection, and it aims to select a subset of features that is useful to build a good classifier or predictor. Sequential forward selection, backward selection, and genetic algorithms [5], among others, are strategies in the wrapper approach for choosing suitable feature subsets for classification. The advantage of this technique is its ability to take into account the correlation between features and the interaction with the classifier. However, it also has drawbacks: it is prone to a high risk of overfitting and it requires very intensive computation. The latter makes this approach infeasible for feature selection in high-dimensional data.

A third approach combines the advantages of the previous two, resulting in hybrid feature selection methods [6,7]. These methods mainly focus on combining filter and wrapper methods to achieve the best possible performance with a particular classifier or learning algorithm, while keeping a runtime comparable to that of filter methods. In this work, we focus on this approach and aim to develop a feature selection method that can effectively solve the feature selection problem for high-dimensional data.

Feature selection for high-dimensional data is considered one of the current challenges in statistical machine learning and data mining [8], and it has been studied broadly in recent years. Several methods following the above approaches have been proposed, see [5,9-20] to name a few. Recently the hybrid approach has demonstrated its ability to solve feature selection in high-dimensional data. As an example of recent works following this approach, we can mention [21], which proposes a selector that identifies the features that are most useful for describing the essential differences among the classes.


The authors present a way to represent essential discriminating characteristics together with sparsity as an optimization objective, and then use Markov random field optimization techniques to solve the formulated objective functions. The method was applied to synthetic data as well as standard real-world datasets, and the experimental results show its effectiveness. However, this method requires specifying the number of features to select and the value of a parameter of the optimization process, and specifying values for these parameters is not trivial. Another method was proposed in [22], where the authors transform the feature selection problem, which is an arbitrarily complex nonlinear problem, into a set of locally linear ones through local learning, and then learn feature relevance globally. This method is based on well-established machine learning and numerical analysis techniques, without making any assumptions about the underlying data distribution, and it is capable of processing many thousands of features while maintaining very high accuracy. The experiments show that its runtime is comparable to that of the Relief [23] and Relief-F [24] algorithms. Although the authors state that parameter tuning is easy, from a user's point of view it is not clear what values should be fixed for the kernel width σ, the regularization parameter λ, and the stopping criterion θ, which makes this method hard to use. Another recent work is presented in [25], where the authors propose a feature extraction method for high-dimensional data, especially microarray data, and construct an improved fuzzy Bayesian classifier similar to the Naive Bayesian classifier. The proposed feature selection algorithm comprises two steps: first, all features are ranked using a scoring scheme based on t and F statistics; then, the features with high scores are retained and used jointly with the improved fuzzy Bayesian classifier. Following a similar idea, [26] proposes a general framework of sample weighting to improve the stability of feature selection methods under sample variations. The framework first weights each sample in a given training set according to its influence on the estimation of feature relevance, and then provides the weighted training set to a feature selection algorithm. The authors also introduce extended versions of SVM-RFE and Relief-F that can work on a weighted training set produced by the proposed sample weighting algorithm. This framework produces good results, mainly improving the stability of representative feature selection methods such as SVM-RFE and Relief-F without sacrificing their classification performance.

This paper builds on the preliminary results reported in [27], which were obtained using small datasets. Here we propose a modification of that method and show how combining the Laplacian Score with a simple wrapper strategy results in a hybrid feature selection method that is useful for feature selection in high-dimensional data; in particular, we show its application to Bioinformatics datasets. Unlike previous feature selection methods, the proposed method does not transform the original feature space, does not require the user to specify parameter values, and the selection it produces can be used by any classifier. The formulation of the proposed method is based on the idea that the local structure of the data space is more important than the global structure; this idea is handled using the Laplacian Score, which, jointly with a simple wrapper strategy, allows finding a good subset of uncorrelated features. The method performs well in high-dimensional feature spaces. Experiments conducted on real-world datasets demonstrate that the method is capable of processing many thousands of features within minutes on a personal computer, while maintaining very high accuracy.

The rest of this paper is organized as follows. Section II describes the proposed method, which consists of two stages. Experiments and results are described and discussed in Section III. Finally, Section IV concludes the paper and outlines further research on this topic.

II. PROPOSED HYBRID FEATURE SELECTION ALGORITHM

The filter stage of the proposed method is based on spectral graph theory and was inspired by previous methods for feature selection in unsupervised classification [29-32,37]. These methods follow the idea that the local structure of the data space is more important than the global structure, based on the observation that two data points are probably related if they are close to each other. Therefore, the importance of features is evaluated according to their agreement with the Laplacian matrix of a similarity graph built over the data. Thus, for each feature, its Laplacian Score is computed to reflect its locality preserving power.

Formally, the Laplacian Score is defined as follows. Given a dataset consisting of m vectors x1,...,xm, we construct an m×m similarity (adjacency) matrix W between each pair of data points xi and xj. Depending on the type of graph used, such as a k-nearest neighbor graph or a fully connected graph [28], the matrix W can be interpreted as a weighted graph whose nodes are the instances and whose edges connect nodes xi and xj with weight wij. The Laplacian matrix, denoted in this paper as L, is defined as

L = D − W    (1)

where D is a diagonal matrix such that dii = ∑_{j=1,...,m} wij. Given a graph G, the Laplacian matrix L is a linear operator on a vector f ∈ ℜ^m [28,29], where

f^T L f = (1/2) ∑_{i,j=1,...,m} wij (fi − fj)^2    (2)

quantifies how much the vector f varies locally in G [30]. This fact motivates the use of L to measure, on the vector of values of a feature, the consistency of that feature with respect to the structure of the graph G. A feature is consistent with the structure of a graph if it takes similar values for instances that are near each other in the graph, and dissimilar values for instances that are far from each other. Thus, a consistent feature should be relevant for separating the classes in a supervised problem [28,29]. The Laplacian Score [31], proposed by X. He et al. [32], assesses the significance of individual features taking into account their locality preserving power and their consistency with the structure of the similarity graph.
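For completeness, identity (2) follows directly from the definition L = D − W; the derivation below is the standard one from spectral graph theory [28], stated here only as a reading aid:

f^T L f = f^T D f − f^T W f = ∑_i dii fi^2 − ∑_{i,j} wij fi fj = (1/2) ∑_{i,j} wij (fi − fj)^2,

using dii = ∑_j wij and the symmetry of W. Hence a small value of f^T L f indicates that f changes little between strongly connected (neighboring) nodes.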

Let fr = (fr1, fr2, ..., frm)^T, with r = 1,...,n, denote the r-th feature and its values for the m instances. Then the Laplacian Score Lr for fr is calculated as


Lr = (f̃r^T L f̃r) / (f̃r^T D f̃r)    (3)

where f̃r = fr − ((fr^T D 1) / (1^T D 1)) 1 is the deviation of fr from its mean over all observations (weighted by D), 1 is the all-ones vector, L is the Laplacian matrix of the graph G, and D is the degree diagonal matrix.

For the Laplacian Score, the local structure of the data space is more important than the global structure [32]. In order to model the local structure, this method constructs a k-nearest neighbor graph, where k is the degree of neighborhood for each instance in the graph (see [32] for details). This value must be specified a priori by the user and it models local neighborhood relations between instances. According to the Laplacian Score, a "good" feature should have a small value for Lr [32]. Thus, the features are arranged in a list according to their relevance. Those features that are at the top of the list are those with smaller values for Lr; these features will be considered as the most important.
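To make the filter stage concrete, the following is a minimal NumPy/scikit-learn sketch of the Laplacian Score computation. It is an illustration only: it assumes a symmetric k-nearest-neighbor graph with heat-kernel weights (one common choice; the exact weighting scheme of [32] may differ), and the function name laplacian_scores is ours.

import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_scores(X, k):
    # X: (m, n) data matrix; k: neighborhood size of the k-NN graph.
    # Returns one score per feature; smaller means more locality preserving.
    m, n = X.shape
    A = kneighbors_graph(X, n_neighbors=k, mode='distance', include_self=False)
    A = A.maximum(A.T).toarray()                  # symmetrized k-NN distances
    t = np.mean(A[A > 0]) ** 2 + 1e-12            # heat-kernel bandwidth (assumed choice)
    W = np.where(A > 0, np.exp(-(A ** 2) / t), 0.0)
    d = W.sum(axis=1)                             # degrees d_ii = sum_j w_ij
    D, L = np.diag(d), np.diag(d) - W             # Eq. (1): L = D - W
    scores = np.empty(n)
    for r in range(n):
        f = X[:, r]
        f_tilde = f - (f @ d) / d.sum()           # deviation from the D-weighted mean
        den = f_tilde @ D @ f_tilde
        scores[r] = (f_tilde @ L @ f_tilde) / den if den > 0 else np.inf
    return scores

Ranking the features by ascending score then yields the ordered list used by the wrapper stage described next.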

By applying the Laplacian Score, we aim to identify those features that are consistent with the structure of the data and to sort them according to their relevance, in order to narrow the search space of possible feature subsets (from 2^n − 1 to only n subsets, as we will see later) and to start the second stage with a good approximation. In the second stage (the wrapper stage), the idea is to evaluate the features as subsets rather than individually; in this stage, the proposed method evaluates n feature subsets: the first subset contains only the top-ranked feature, the second subset contains the two top-ranked features, the third subset contains the three top-ranked features, and so on (see the algorithm in Fig. 1). In [27] we proposed to evaluate as many subsets as there are features in the dataset; however, since in Bioinformatics the number of features is very large, here we propose to evaluate only s subsets with s << n. As we will show in our experiments, this value is fixed to 1000 for our algorithm, so we only evaluate subsets with at most 1000 features. To evaluate the accuracy of the feature subsets, we use the target classifier; the accuracy of the classification model generated by each feature subset is estimated using 2-fold cross-validation. The pseudocode of our hybrid method, named Laplacian Score Feature Selection (LS-FS), is shown in Fig. 1, where Classifier can be any supervised classifier.

III. EXPERIMENTS AND RESULTS

To evaluate the effectiveness of our method, we first compare the runtime of the filter stage of our algorithm against Information Gain (IG) and Relief-F, two of the most popular filter algorithms. Then we show experimentally that feature subsets from the top of the filter-stage ranking produce better accuracies than feature subsets from the bottom of the ranking. Finally, we compare the accuracy reached by the subset selected by our algorithm against the accuracies reached by the subsets obtained by applying the same wrapper stage of our algorithm but using the IG and Relief-F rankings.

A. Microarray Datasets

In our experiments, four public microarray datasets available at http://www.upo.es/eps/aguilar/datasets.html were used. The first one is the reduced lymphoma dataset [33], consisting of 45 samples with 4,026 genes each. This dataset contains samples of two types of lymphoma.

The second dataset is leukemia (training set) [34], consisting of 38 samples with 7129 genes. This dataset contains samples of two types of leukemia, lymphoblastic leukemia and myeloid leukemia.

The third dataset is Global Cancer MAP (training set) [35], consisting of 144 samples and 16,064 genes. This dataset contains samples of fourteen types of cancer.

The last dataset is dataset C of embryonal tumors of the central nervous system [36]. It contains 60 samples with 7,130 genes, belonging to two classes of tumors: medulloblastomas and non-medulloblastomas (other types of tumors).

B. Experiment Settings

In the filter stage of our hybrid method, for computing the Laplacian Score, a k-nearest-neighbor graph is built, where k is the degree of neighborhood for each instance in the graph (see [32] for details). In our experiments we used k = 33% of the number of instances in the dataset since, according to Liu et al. [37], this value is close to the "optimum".

The feature sets selected by each algorithm were tested with three classifiers implemented in Weka 3.6.4 [38]: a decision tree (J48), a probabilistic classifier (Naive Bayes), and an instance-based classifier (KNN). These three classifiers were chosen because they represent three quite different approaches to supervised classification.

Input: Dataset X with m instances and n features
Output: Sbest, the best feature subset
1: Begin
2:   ACCBest ← −∞ and S ← ∅
3:   Calculate the Laplacian Score of each feature fr, r = 1,...,n
4:   Sort the features by relevance (ascending Laplacian Score) and store the resulting feature indexes in indRank
5:   for i = 1 to 1000 do
6:     S ← S ∪ {indRank[i]}
7:     Run Classifier with 2-fold cross-validation on XS and assign the obtained accuracy to tempACC
       if tempACC > ACCBest then
         ACCBest ← tempACC
         Sbest ← S
       end if
8:   end for
9:   return Sbest
10: End
Fig. 1. Pseudocode of the Hybrid Laplacian Score Feature Selection method (LS-FS).
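For illustration only, the following compact Python sketch mirrors this hybrid loop, reusing the laplacian_scores function sketched above; the parameter names classifier, k_ratio, and max_subsets are ours, and scikit-learn's cross_val_score stands in for the 2-fold cross-validation of the target classifier.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def ls_fs(X, y, classifier=None, max_subsets=1000, k_ratio=0.33):
    # Filter stage: rank the features by Laplacian Score (smallest first).
    m, n = X.shape
    if classifier is None:
        classifier = KNeighborsClassifier(n_neighbors=3)
    k = max(1, int(round(k_ratio * m)))        # k = 33% of the instances, as in [37]
    rank = np.argsort(laplacian_scores(X, k))
    # Wrapper stage: evaluate the nested subsets of top-ranked features.
    best_acc, best_subset = -np.inf, None
    for i in range(1, min(max_subsets, n) + 1):
        subset = rank[:i]
        acc = cross_val_score(classifier, X[:, subset], y, cv=2).mean()
        if acc > best_acc:
            best_acc, best_subset = acc, subset
    return best_subset, best_acc

The returned subset can then be evaluated with any classifier on held-out test data, as reported in Section III.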



To evaluate the accuracy with and without feature selection, the percentage of correct classifications (accuracy), averaged over 2-fold cross-validation, was computed. In order to make a fair comparison, for the IG and Relief-F selectors we used the same wrapper stage used in our proposed algorithm. The run-times reported in this paper were obtained on a computer with an Intel Core i7 at 3.4 GHz and 8 GB of RAM.
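As a hedged illustration of this setup, the sketch below evaluates a selected subset with scikit-learn stand-ins for the Weka classifiers (KNN with 3 neighbors, a generic decision tree as a rough analogue of J48, and Gaussian Naive Bayes); X, y, and best_subset are assumed to come from the LS-FS sketch above.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Hypothetical stand-ins for the Weka classifiers used in the paper.
classifiers = {
    "KNN (3 neighbors)": KNeighborsClassifier(n_neighbors=3),
    "Decision tree (J48 analogue)": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
}
for name, clf in classifiers.items():
    # Accuracy averaged over 2-fold cross-validation, expressed as a percentage.
    acc = 100 * cross_val_score(clf, X[:, best_subset], y, cv=2).mean()
    print(f"{name}: {acc:.3f}")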

C. Results

The first experiment was designed to evaluate the filter stage of our hybrid feature selection method. One important aspect of this stage is that it must be fast, since we are interested in problems with high-dimensional datasets. Fig. 2 reports the runtime of the filter stage of our method, together with the time needed by IG and Relief-F, two of the most popular filter methods. Note that our method is clearly faster than IG and Relief-F.

Another important aspect of the filter stage of a hybrid method is that the feature ordering it generates must be useful for the wrapper stage. Therefore, Fig. 3 reports the accuracy reached by the subsets evaluated in the wrapper stage of our method using the 1000 top-ranked features, together with the accuracies reached by the subsets evaluated in the wrapper stage when the 1000 features at the bottom of the ranking were used. We only report the accuracies obtained using the KNN classifier (with 3 neighbors); however, we obtained similar results with the J48 and Naive Bayes classifiers. Note that the subsets containing top-ranked features clearly reach better accuracies than the subsets from the bottom of the ranking.

Finally, Table I reports the accuracy reached on the test data of each dataset by the subset of features selected by our method, together with the accuracies obtained by the IG and Relief-F methods using the same wrapper strategy. These results show that our method, in most cases, achieves good accuracies regardless of the classifier used, whereas the subsets selected using IG or Relief-F in the filter stage are not consistent in terms of accuracy when a different classifier is used.

D. Discussion

From the results presented in Fig. 2, it is evident that the filter stage of our method is clearly faster than the IG and Relief-F methods; this is a desirable characteristic in a feature selection method for problems with high-dimensional datasets such as those appearing in Bioinformatics. Another important point to highlight is that the ranking generated in the filter stage identifies important features (genes) that are really useful for the wrapper stage, since it allows the wrapper stage to narrow down the search space from the 2^n − 1 possible feature subsets to at most n, where n is the number of features. In this way, the wrapper stage only considers a reduced number of feature subsets to find a good uncorrelated set of features (genes). As Fig. 3 shows, using the top-ranked features our method finds subsets with good accuracy employing fewer than one hundred features (genes). Based on the results presented in Table I, using the Laplacian Score ranking in the wrapper stage, our proposed method produces feature subsets that reach good accuracy regardless of the classifier used.

Fig. 2. Average run-time of feature selection methods (in seconds) with KNN, J48 and Naive Bayes classifiers.

IV. CONCLUSIONS AND FUTURE WORK

To address the feature selection problem in high-dimensional data, in this paper we introduced a new hybrid feature selection method. Our method combines the advantages of a filter strategy based on the Laplacian Score with a simple wrapper strategy, resulting in a fast hybrid feature selector that can effectively solve feature selection problems in high-dimensional datasets.

To show the effectiveness of our method, we tested it on several public microarray datasets. The results show that the filter stage of our method is clearly faster than IG and Relief-F, two of the most popular filter methods. This characteristic, together with the simple wrapper strategy, makes our method capable of processing many thousands of features within minutes on a personal computer while achieving good accuracy for any classifier, in contrast to other methods that are specifically designed for a particular classifier.

Our results suggest that the selected values for the parameter k in the filter stage (fixed at 33% as suggested in [37]) and for the number of top-ranked features used to build the subsets in the wrapper stage (fixed at 1000) allow a good compromise between time and accuracy. In this way, the user does not need to specify values that could make our feature selector hard or confusing to use. Of course, a comparison against other hybrid feature selection methods is mandatory, and it is part of the future work of this research. Another interesting direction is a deeper study of wrapper strategies that would allow our method to improve its accuracy without affecting its runtime. There is much work to be done in this direction, which comprises our future work.

REFERENCES

[1] M. Dash and H. Liu, "Feature Selection for Classification," Intelligent Data Analysis, vol. 1, pp. 131–156, 1997.

[2] I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," JMLR, vol. 3, pp. 1157–1182, 2003.


[3] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, “Feature selection for SVMs,” in Proc. 13th Adv. Neu. Info. Proc. Sys., pp. 668–674, 2001.

[4] O. Chapelle, V. Vapnik, O. Bousquet, and S.Mukherjee, “Choosing multiple parameters for support vector machines,” Mach. Learn., vol. 46, no. 1, pp. 131-159, 2002.

[5] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.

[6] S. Das, "Filters, wrappers and a boosting-based hybrid for feature selection," in Proc. 18th International Conference on Machine Learning, pp. 74–81, 2001.

[7] E. Xing, M. Jordan, and R. Karp, "Feature selection for high-dimensional genomic microarray data," in Proc. 18th International Conference on Machine Learning, pp. 601–608, 2001.

[8] J. Lafferty and L. Wasserman, “Challenges in statistical machine learning,” Statistica Sinica, vol. 16, pp. 307–322, 2006.

[9] D. Jiang, C. Tang, and A. Zhang, "Cluster Analysis for Gene Expression Data: A Survey," IEEE Trans. Knowledge and Data Eng., vol. 16, no. 11, pp. 1370–1386, Nov. 2004.

[10] E. Bair and R. Tibshirani, “Machine Learning Methods Applied to DNA Microarray Data Can Improve the Diagnosis of Cancer,” SIGKDD Explorations, vol. 5, no. 2, pp. 48–55, 2003.

[11] F. Model, P. Adorjan, A. Olek, and C. Piepenbrock, “Feature Selection for DNA Methylation Based Cancer Classification,” Bioinformatics, vol. 17, no. 1, pp. 157–164, 2001.

[12] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene Selection for Cancer Classification Using Support Vector Machines,” Machine Learning, vol. 46, pp. 389–422, 2002.

[13] Yuchun Tang, Yan-Qing Zhang, and Zhen Huang, “Development of Two-Stage SVM-RFE Gene Selection Strategy for Microarray Expression Data Analysis,” IEEE/ACM Trans. on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 365–381, July-Sep. 2007.

[14] L. Yu and H. Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution," in T. Fawcett and N. Mishra, eds., Proc. 20th International Conference on Machine Learning, pp. 856–863, 2003.

[15] Y. Wu and A. Zhang, "Feature Selection for Classifying High-Dimensional Numerical Data", in Proc. CVPR (2), pp.251-258, 2004.

[16] J. Biesiada and W. Duch, "Feature Selection for High-Dimensional Data - A Pearson Redundancy Based Filter," presented at Computer Recognition Systems 2, pp.242-249, 2008.

[17] A. Destrero, S. Mosci, C. De Mol, A. Verri, and F. Odone, "Feature selection for high-dimensional data," Computational Management Science, vol. 6, no. 1, pp. 25–40, February 2009.

[18] Z. J. Ding and Y. Q. Zhang, “Additive Noise Analysis on Microarray Data via SVM Classification,” IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, Montreal, Canada, pp. 1-7,2-5May 2010.

[19] A. A. Yahya, A. Osman, A. R. Ramli and A. Balola,“Feature Selection for High Dimensional Data: An Evolutionary Filter Approach,” Journal of Computer Science, Volume 7, Issue 5, pp. 800-820, 2011.

[20] Su-Fen Chen,“Redundant Feature Selection Based on Hybrid GA and BPSO,” 2011 IEEE 3rd International Conference on Communication Software and Networks (ICCSN), pp. 414 – 418,27-29 May 2011.

[21] Qiang Cheng, Hongbo Zhou, and Jie Cheng, “The Fisher-Markov Selector: Fast Selecting Maximally Separable Feature Subset for Multiclass Classification with Applications to High-Dimensional Data,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1217-1233, June 2011.

[22] Y. Sun, S. Todorovic, and S. Goodison, “Local Learning Based Feature Selection for High Dimensional Data Analysis,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1610-1626, September 2010.

[23] K. Kira and L. A. Rendell, “A practical approach to feature selection,” in Proc. 9th Int. Conf. Mach. Learn., pp. 249–256, 1992.

[24] I. Kononenko, “Estimating attributes: analysis and extensions of RELIEF,” in Proc. Eur. Conf. Mach. Learn., pp. 171–182, 1994.

[25] X. Wang, L. Zhang, and J. Du, "A Novel Approach to Select Important Genes from Microarray Data," in Proc. 2011 Chinese Control and Decision Conference (CCDC), pp. 3489–3492, 2011.

[26] Lei Yu, Yue Han, and Michael E Berens,“Stable Gene Selection from Microarray Data via Sample Weighting,”IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol 9, no 1, pp. 262-272, 2012.

[27] Saúl Solorio-Fernández, J. Ariel Carrasco-Ochoa, and José Fco. Martínez-Trinidad,“Hybrid Feature Selection Method for Supervised Classification Based on Laplacian Score Ranking,”in Proc. of the 2nd Mexican Conference on Pattern Recognition (MCPR2010) Puebla, Mexico. Published in Lecture Notes in Computer Science series, 6256, Springer-Verlag, pp. 260-269, 2010.

[28] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing 17(4), 395–416, 2007.

[29] Z. Zhao, H. Liu, “Spectral feature selection for supervised and unsupervised learning,” in Proceedings of the 24th International Conference on Machine learning, New York, NY, USA: ACM, pp.1151–1157, 2007.

[30] D.G. García, R.S. Rodríguez, “Spectral clustering and feature selection for microarray data,” in Proceedings of the Fourth International Conference on Machine Learning and Applications, pp. 425–428, 2009.

[31] S. Niijima and Y. Okuno, "Laplacian linear discriminant analysis approach to unsupervised feature selection," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 4, pp. 605–614, 2009.

[32] X. He, D. Cai, P. Niyogi, “Laplacian Score for feature selection,” in Advances in Neural Information Processing Systems,vol. 18, Y. Weiss, B. Schölkopf,and J. Platt, eds. MIT Press, Cambridge, pp. 507–514, 2006.

[33] Ash A. Alizadeh, Michael B. Eisen, R. Eric Davis, Chi Ma, Izidore S. Lossos, Andreas Rosenwald, Jennifer C. Boldrick, Hajeer Sabet, Truc Tran, Xin Yu, John I. Powell, Liming Yang, Gerald E. Marti, Troy Moore, James Hudson Jr, Lisheng Lu, David B. Lewis, Robert Tibshirani, Gavin Sherlock, Wing C. Chan, Timothy C. Greiner, Dennis D. Weisenburger, James O. Armitage, Roger Warnke, Ronald Levy, Wyndham Wilson, Michael R. Grever, John C. Byrd, David Botstein, Patrick O. Brown & Louis M. Staudt, “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling,” Nature, vol. 403, no. 3, pp. 503–511, Feb. 2000.

[34] T.R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, E. S. Lander,“Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring,” SCIENCE, vol. 286, pp. 531-537, 15 October 1999.

[35] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.-H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J.P. Mesirov, T. Poggio, W. Gerald, M. Loda, E.S. Lander and T.R. Golub,“Multiclass cancer diagnosis using tumor gene expression signatures,” PNAS, VOL 98, nº 26, pp. 15149-15154, December 18, 2001.

[36] Scott L. Pomeroy, Pablo Tamayo, Michelle Gaasenbeek, Lisa M. Sturla, Michael Angelo, Margaret E. McLaughlin, John Y. H. Kim, Liliana C. Goumnerova, Peter M. Black, Ching Lau, Jeffrey C. Allen, David Zagzag, James M. Olson, Tom Curran, Cynthia Wetmore, Jaclyn A. Biegel, Tomaso Poggio, Shayan Mukherjee, Ryan Rifkin, Andrea Califano, Gustavo Stolovitzky, David N. Louis, Jill P. Mesirov, Eric S. Lander & Todd R. Golub,“Prediction of Central Nervous System Embryonal Tumour Outcome based on Gene Expression,” NATURE, VOL 415, pp. 436-442, 24 January 2002.

[37] R. Liu, N. Yang, X. Ding, and L. Ma, "An unsupervised feature selection algorithm: Laplacian Score combined with distance-based entropy measure," in Proc. Workshop on Intelligent Information Technology Applications, vol. 3, pp. 65–68, 2009.

[38] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, "The WEKA Data Mining Software: An Update"; SIGKDD Explorations, Volume 11, Issue 1, 2009.


[Fig. 3 panels: accuracy (ACC, vertical axis) versus number of features (0–1000, horizontal axis) for a) Leukemia, b) Global Cancer MAP, c) Embryonal Tumors, and d) Lymphoma.]

Fig. 3. Accuracy reached by the subsets evaluated in the wrapper stage of our method using the 1000 top-ranked features and the 1000 bottom-ranked features, with the KNN classifier using 3 neighbors.

TABLE I. ACCURACY REACHED BY THE SUBSET OF FEATURES SELECTED BY EACH FEATURE SELECTION METHOD ON THE TEST DATASETS.

                      KNN (3 neighbors)              J48                            Naive Bayes
Datasets              LS-FS    IG       Relief-F     LS-FS    IG       Relief-F     LS-FS    IG       Relief-F
Leukemia-test         91.176   82.353   73.529       91.176   91.176   91.176       97.059   100.000  94.118
GCM-test              41.304   45.652   41.304       34.783   43.478   41.304       58.696   50.000   36.957
EmbrionalTum-test     100.000  100.000  76.923       100.000  100.000  100.000      90.000   92.308   84.615
Lymphoma-test         100.000  100.000  100.000      100.000  100.000  100.000      100.000  100.000  100.000
Average               83.120   82.001   72.939       81.490   83.664   83.120       86.439   85.577   78.922
