This work was supported by the National Natural Science Foundation of China (69675007) and Beijing Municipal Natural Science Foundation (4972008). *Corresponding author. Tel.: +86-10 68907155. E-mail address: email@example.com (H. Zhang).
Pattern Recognition 35 (2002) 701-711
Feature selection using tabu search method
Hongbin Zhang*, Guangyu Sun
Computer Institute of Beijing Polytechnic University, West San Huan North Road 56, 69 Beijing 100044, People's Republic of China
Received 9 February 1999; received in revised form 6 July 2000; accepted 5 December 2000
Selecting an optimal subset from an original large feature set in the design of a pattern classifier is an important and difficult problem. In this paper, we use tabu search to solve this feature selection problem and compare it with classic algorithms, such as sequential methods and the branch and bound method, and with most of the suboptimal methods proposed recently, such as the genetic algorithm and the sequential forward (backward) floating search methods. Based on the results of experiments, tabu search is shown to be a promising tool for feature selection with respect to the quality of the obtained feature subset and computational efficiency. The effects of the parameters in tabu search are also analyzed by experiments. © 2001 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.
Keywords: Feature selection; Tabu search; Pattern classifier; Search methods; Curse of dimensionality
The problem of feature selection is important in designing a pattern classifier; its goal is to acquire an efficient subset so as to reduce the dimension of the feature set. When the number of initial features becomes too large, not only does the cost of collecting features increase, but the performance of the designed classifier cannot be guaranteed in the case of small sample size. From the standpoint of Bayesian decision rules there are no "bad" features, i.e., the performance of a Bayes classifier cannot be improved by eliminating a feature. But in practice, the assumptions made in designing a Bayes classifier are usually not satisfied. As a consequence, for a given number of samples, reducing the number of features may improve a non-ideal classifier's performance.
In applications, there are two forms of feature selection; each form addresses a specific objective and leads to a distinct type of optimization. The first form is to find an optimal subset that has a predefined number of features and yields the lowest error rate of a classifier. This is a non-constrained combinatorial optimization problem. The other form of feature selection aims at seeking the smallest subset of features for which the error rate is below a given threshold. This is a constrained combinatorial optimization problem, in which the error rate serves as a constraint and the smallest subset of features is the primary search criterion.
Let $F = \{f_1, f_2, \ldots, f_n\}$ be the initial set of features with cardinality $n$. $S$ is a subset of $F$, and $|S|$ represents the subset size, i.e., the number of features included in the subset. Let $m$ denote the required subset size in the 1st-form of the feature selection problem and $t$ the tolerable threshold in the 2nd-form of the feature selection problem. Let $\mathrm{error}(S)$ be the relevant error rate, or some other measure of performance such as class separability (e.g., Mahalanobis distance), when $S$ is used to design the classifier. The two forms of the feature selection problem can be represented as follows:
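1st-form of feature selection:
$$\min J(S) = \mathrm{error}(S) \quad \text{s.t.} \quad S \subseteq F,\ |S| = m,\ m < n.$$
2nd-form of feature selection:
$$\min J(S) = |S| \quad \text{s.t.} \quad S \subseteq F,\ \mathrm{error}(S) < t.$$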
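As an illustration of the two formulations above, the following is a minimal Python sketch of solving each form by exhaustive enumeration. It is not taken from the paper: the function names `first_form` and `second_form` and the criterion `toy_error` are hypothetical, and `toy_error` is an invented stand-in for a real classifier error estimate (expressed here as an integer percentage).

```python
from itertools import combinations
from math import comb

# Hypothetical toy criterion (an assumption for illustration only):
# features 0, 2 and 5 are "useful"; every other feature adds a small penalty.
USEFUL = frozenset({0, 2, 5})

def toy_error(subset):
    """Invented error rate in percent for a feature subset."""
    return 50 - 10 * len(subset & USEFUL) + len(subset - USEFUL)

def first_form(features, error, m):
    """1st form: among all subsets with |S| = m, return one minimizing error(S)."""
    best = min(combinations(features, m), key=lambda s: error(frozenset(s)))
    return frozenset(best)

def second_form(features, error, t):
    """2nd form: return a smallest subset S with error(S) < t, or None."""
    for size in range(1, len(features) + 1):      # try smaller subsets first
        for s in combinations(features, size):
            if error(frozenset(s)) < t:
                return frozenset(s)
    return None

if __name__ == "__main__":
    features = list(range(6))
    print(first_form(features, toy_error, 2))     # a best pair of features
    print(second_form(features, toy_error, 25))   # smallest subset under 25%
    # Number of candidate subsets when choosing 10 out of 20 features:
    print(comb(20, 10))
```

Both functions evaluate the criterion on every candidate subset, so their cost grows combinatorially with the number of features; this is only workable for small n.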
0031-3203/01/$22.00 © 2001 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S0031-3203(01)00046-2
Theoretically, both forms of feature selection are NP-hard problems, so the optimal solution cannot be guaranteed without an exhaustive search of the solution space. But exhaustive search is feasible only for small n. For a larger n, the explosive computational cost makes exhaustive search impracticable. The branch and bound search scheme (BB) introduced by Narendra and Fukunaga is more efficient because it prunes many "infeasible" solutions. On the assumption that the objective function is monotonic, this algorithm is potentially capable of examining all feasible solutions, so it is an optimal method. Unfortunately, the monotonicity condition is seldom satisfied. Moreover, even if BB saves 99.9% of the computational cost, the computational complexity remains exponential. Thus, BB and its improved version, relaxed BB (which introduces the concept of approximate monotonicity), are still impractical for a large n.
The need for a trade-off between the optimality and efficiency of algorithms for NP-hard problems was recognized early, and the mainstream of feature selection research was thus directed toward suboptimal but efficient and robust methods. Preliminary work on feature selection started in the early 1960s. The approaches to feature selection at that time were based on probabilistic measures of class separability and on entropies. In some methods the independence of features was assumed and the features were selected on the basis of their individual merits. Such methods ignore the interactions among features. As a result, the selected subsets are not satisfactory. As was pointed out by Cover, the best two independent features do not have to be the two best. The sequential forward selection method (SFS), the sequential backward selection method (SBS), and their generalized versions GSFS and GSBS, invented later, are in essence greedy algorithms. These algorithms begin with a feature subset and sequentially add or remove features until some termination criterion is met. But these methods suffer from the so-called "nesting effect": features that are discarded cannot be re-selected, and features that are selected cannot be removed later. Since these algorithms do not examine all possible feature subsets, they are not guaranteed to produce the optimal result. The plus l take away r (PTA) method was proposed to prevent the nesting effect, but there is no theoretical guidance for determining appropriate values of l and r. The max-min (MM) method proposed by Backer and Shipper is computationally efficient. It evaluates only the individual and pairwise merits of features. The experiments made by many researchers show that this method gives the poorest results [6,7]. This confirms that it is not possible to select a feature subset in a high-dimensional space based on only two-dimensional information measures without a substantial information loss.
In recent years, there have been some developments in
the problem of feature selection. Siedlecki and Sklansky introduced the use of the genetic algorithm (GA) for this problem and obtained good results [8,9]. But premature convergence in GA seems difficult to avoid, and the parameters of GA must be carefully selected and tested. Pudil et al. improved the sequential methods by introducing sequential forward floating selection (SFFS) and sequential backward floating selection (SBFS). These two methods can be understood as plus 1-minus x and minus 1-plus x, where x is changed dynamically according to the backtrack effect. SFFS and SBFS avoid the problem of predefining l and r in PTA, and have a mechanism to control the depth of backtracking. According to the experimental results of Jain and Zongker, SFFS and SBFS achieve results comparable to the optimal algorithm (BB) but are faster than BB.
To conclude, although some progress has been made,
the available feature selection techniques for large feature sets are not yet completely satisfactory. They are either computationally feasible but far from optimal, or optimal or almost optimal but unable to cope with the computational complexity of feature selection problems of realistic size. Recently, there has been a resurgence of interest in feature selection methods, driven by application needs. In applications such as information fusion of data from multiple sensors, integration of multiple models, and data mining, the number of features is usually quite large. In some cases, it may be over 100. It is necessary to research more powerful methods for feature selection, which should give very good results and be computationally more efficient. Sklansky stated that when the initial set contains 20 or more features and the selected subset contains 10 or fewer features, the problem of selecting a best or near-best subset can be quite difficult because we must search a large (often astronomically large) space of subsets of candidate features, and determine a figure of merit for an optimum or near-optimum classifier operating on each subset in this space. He referred to such problems as large-scale feature selection. Nowadays, with the improvement in computer performance, feature selection with about 20 features has become easier, and the feature selection problem with 20-49 features has become a so-called "medium-sized" problem. But for a classifier based on a personal computer, feature selection with a few dozen or over 100 features is still a difficult and challenging problem.
It has been argued that since feature selection is typically done in an off-line manner, the execution time of a particular algorithm is much less important than its ultimate classification performance. This is generally true for feature sets of moderate size. However, if the number of features reaches a few dozen or more, the execution time becomes extremely important, as it may be impractical to run some algorithms even once on such large data sets.
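To make the nesting effect of the sequential methods discussed earlier in this section concrete, the following is a minimal Python sketch of sequential forward selection (SFS) compared with exhaustive search. It is not the authors' implementation: the names `sfs`, `exhaustive`, and `toy_error`, and the error values in `TOY_ERRORS`, are invented for illustration. The toy values are chosen so that the best single feature is not part of the best pair, in line with Cover's observation.

```python
from itertools import combinations

# Hypothetical error rates in percent (assumed values, for illustration only).
# Feature 0 is the best single feature, but the best pair is {1, 2}.
TOY_ERRORS = {
    frozenset({0}): 30, frozenset({1}): 35, frozenset({2}): 35,
    frozenset({0, 1}): 25, frozenset({0, 2}): 25, frozenset({1, 2}): 10,
}

def toy_error(subset):
    """Look up the invented error rate of a feature subset."""
    return TOY_ERRORS[frozenset(subset)]

def sfs(features, error, m):
    """Sequential forward selection: greedily add the feature that gives the
    lowest error; a feature, once selected, is never removed."""
    selected = frozenset()
    while len(selected) < m:
        candidates = [f for f in features if f not in selected]
        best = min(candidates, key=lambda f: error(selected | {f}))
        selected = selected | {best}
    return selected

def exhaustive(features, error, m):
    """Reference solution: evaluate every subset of size m."""
    return frozenset(min(combinations(features, m),
                         key=lambda s: error(frozenset(s))))

if __name__ == "__main__":
    features = [0, 1, 2]
    greedy = sfs(features, toy_error, 2)
    optimal = exhaustive(features, toy_error, 2)
    print(sorted(greedy), toy_error(greedy))
    print(sorted(optimal), toy_error(optimal))
```

In this toy case SFS commits to feature 0 in its first step and cannot drop it afterwards, so it ends with a worse pair than the exhaustive search. Floating methods such as SFFS address this by allowing conditional removal steps after each addition.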
In a recent paper, Kudo and Sklansky carried out a comparative study of algorithms for large-s