[ieee 2009 international conference on artificial intelligence and computational intelligence -...
Feature selection with discrete binary differential evolution
Xingshi He Department of Mathematics Xian polytechnic University
Xian 710048, P.R.China e-mail: email@example.com
Qingqing Zhang Department of Mathematics Xian polytechnic University
Xian 710048, P.R.China e-mail: firstname.lastname@example.org
Na Sun Department of Mathematics Xian polytechnic University
Xian 710048, P.R.China e-mail: email@example.com
Yan Dong Department of Mathematics Xian polytechnic University
Xian 710048, P.R.China e-mail: firstname.lastname@example.org
Abstract: The processing of data from databases with data mining algorithms needs specialized methods. In practice, redundant and irrelevant attributes reduce the performance of data mining, so feature subset selection becomes an important problem in the data mining domain. This paper presents a new algorithm, called the discrete binary differential evolution (BDE) algorithm, to select the best feature subsets. The correlation between attributes is evaluated based on the idea of mutual information. Experiments use the new feature selection method as a preprocessing step for SVM, C&R Tree and RBF network. We find that the method effectively improves the correct classification rate on some datasets and that the BDE algorithm is useful for feature subset selection.
Keywords: differential evolution; data mining; feature selection; mutual information
I INTRODUCTION
The success of data mining on a given task is affected by many factors, and the quality of the data is one of them. If information is irrelevant or redundant, or the data is noisy and unreliable, knowledge discovery is more difficult. Feature subset selection is a process that identifies and removes as much of the irrelevant and redundant information as possible. It is necessary to select a small number of highly predictive features in order to avoid over-fitting the training data. Regardless of whether a learner attempts to select features itself or ignores the issue, feature selection prior to learning can be beneficial. Reducing the dimension of the data reduces the size of the hypothesis space and allows algorithms to operate faster and more effectively. In some cases, the accuracy of future classification can be improved; in others, the result is a more compact, easily interpreted representation of the target concept.

Algorithms that perform feature selection as a preprocessing step prior to learning can generally be placed into one of two broad categories. One approach, referred to as the wrapper, selects useful features depending on the specific problem and learning algorithm. This approach has proved useful but is very slow to execute; for this reason, wrappers do not scale well to large datasets containing many features. Another approach, called the filter, operates independently of any learning algorithm: undesirable features are filtered out of the data before induction commences. Filters have proved to be much faster than wrappers and hence can be applied to large data sets containing many features. Their general nature allows them to be used with any learner, unlike the wrapper, which must be rerun when switching from one learning algorithm to another.

This paper presents a new approach to feature selection, called BDE (discrete binary differential evolution based feature selection). The approach uses a population-based heuristic to evaluate the worth of features. The algorithm is simple and, with suitable correlation measures, fast to execute.

The rest of this paper is organized as follows. The second section describes the BDE algorithm. Section 3 presents experimental results of using BDE as a pre-processor for learning algorithms. The last section summarizes and discusses future work.
II BDE: DISCRETE BINARY DIFFERENTIAL EVOLUTION BASED FEATURE SELECTION
A. Feature evaluation
The purpose of feature selection is to decide which of the initial (possibly large) number of features are included in the final subset and which are ignored. If there are n possible features initially, then there are $2^n$ possible subsets, so heuristic methods are needed to search the feature subset space in reasonable time. In this paper, we use the BDE algorithm to find the best subset. The key problem is to define a rule for evaluating the worth of a subset of features. This worth takes into account the usefulness of individual features for predicting the class label along with the level of inter-correlation among them. We therefore use a new fitness function (Equation 1) to select features.
2009 International Conference on Artificial Intelligence and Computational Intelligence
978-0-7695-3816-7/09 $26.00 © 2009 IEEE. DOI 10.1109/AICI.2009.438
$$f(s) = \frac{k \, r_{cf}}{\sqrt{k + k(k-1) \cdot r_{ff'}}} \qquad (1)$$
where $f(s)$ is the worth of a feature subset S containing k features, used as the fitness function for BDE; $r_{cf}$ is the average feature-class correlation; and $r_{ff'}$ is the average feature-feature inter-correlation. $r_{cf}$ indicates how predictive a group of features is, and $r_{ff'}$ indicates how much redundancy there is among them. The heuristic handles irrelevant features because they are poor predictors of the class; redundant attributes are discriminated against because they are highly correlated with one or more of the other features.
In order to apply equation 1 to estimate the merit of a feature subset, it is necessary to compute the correlation (dependence) between attributes. For discrete class problems, we use information theory to estimate the degree of association between features. If X and Y are discrete random variables, equations 2 and 3 give the entropy of Y before and after observing X.

$$H(y) = -\sum_{y \in Y} p(y) \log_2 p(y) \qquad (2)$$

$$H(y \mid x) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x) \qquad (3)$$

The amount by which the entropy of Y decreases reflects the additional information about Y provided by X and is called the information gain. Information gain is given by equation 4.

$$gain(x, y) = H(y) - H(y \mid x) \qquad (4)$$

Here, we use information gain to indicate the degree of correlation between x and y: if $gain(x, y)$ is large, then the correlation between x and y is high. So we can compute $r_{cf}$ and $r_{ff'}$ in equation 1 using equations 5 and 6 below.
$$r_{cf} = \frac{1}{k} \sum_{i \in S} gain(x_i, y) \qquad (5)$$

$$r_{ff'} = \frac{1}{k(n-k)} \sum_{i \in S} \sum_{j \in \bar{S}} gain(x_i, x_j) \qquad (6)$$

where $\bar{S}$ denotes the set of attributes that do not belong to set S, n is the total number of attributes and y is the class attribute.
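Equations 1 to 6 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names (`entropy`, `conditional_entropy`, `gain`, `merit`), the column-list data layout, the square root in the denominator of equation 1 and the fallback used when no unselected feature remains are all assumptions.

```python
import math
from collections import Counter

def entropy(ys):
    """H(y) of equation 2, estimated from value frequencies."""
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

def conditional_entropy(xs, ys):
    """H(y|x) of equation 3: entropy of y inside each x-group, weighted by p(x)."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return sum((len(g) / n) * entropy(g) for g in groups.values())

def gain(xs, ys):
    """Information gain of equation 4."""
    return entropy(ys) - conditional_entropy(xs, ys)

def merit(columns, labels, selected):
    """f(s) of equation 1; `selected` is a set of indices into `columns`."""
    n, k = len(columns), len(selected)
    if k == 0:
        return 0.0  # assumption: an empty subset has no worth
    rest = [j for j in range(n) if j not in selected]
    r_cf = sum(gain(columns[i], labels) for i in selected) / k  # equation 5
    if rest:
        total = sum(gain(columns[i], columns[j]) for i in selected for j in rest)
        r_ff = total / (k * (n - k))  # equation 6
    else:
        r_ff = 0.0  # assumption: equation 6 is undefined when k = n
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Toy data: feature 0 determines the class, feature 1 is unrelated to it.
columns = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 1, 1]]
labels = [0, 0, 1, 1]
print(merit(columns, labels, {0}), merit(columns, labels, {1}))
```

On this toy data the subset containing the predictive feature receives a higher merit than the subset containing the unrelated one.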
B. Searching the Feature Subset Space using BDE
1) Differential Evolution (DE) algorithm
The DE algorithm is a population-based heuristic search procedure first introduced by Rainer Storn and Kenneth Price in 1997. It starts with NP randomly generated individuals as solution vectors, applies mutation, crossing and selection operations on a numerical encoding, and returns the best individual as the answer to the problem. In our experiments, DE showed exceptional performance compared to other evolutionary algorithms on numerical optimization problems. DE requires hardly any parameter tuning and works very reliably, with excellent overall results over a wide set of benchmark and real-world problems.
Firstly, we generate NP individuals randomly. The individuals have the form

$$x_{i,G} = [x_{1,i,G}, x_{2,i,G}, \ldots, x_{D,i,G}], \quad i = 1, 2, \ldots, NP$$

where G is the generation number and D is the dimension of the problem. In each generation, for each individual $x_{i,G}$, we apply the mutation operation and obtain the vector $v_{i,G+1}$, called the donor vector, according to formula 7.

$$v_{i,G+1} = x_{r_1,G} + F \, (x_{r_2,G} - x_{r_3,G}) \qquad (7)$$

where the mutation factor F is a constant from [0, 2] and the three vectors $x_{r_1,G}$, $x_{r_2,G}$ and $x_{r_3,G}$ are selected randomly such that the indices i, $r_1$, $r_2$ and $r_3$ are mutually distinct and taken from {1, 2, ..., NP}.
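The mutation step of formula 7 might look as follows in Python. This is only a sketch under stated assumptions: the function name `mutate`, the list-of-lists population layout, the default F = 0.5 and the optional `rng` argument are illustrative choices, and the population is assumed to hold at least four individuals so that three distinct partners exist.

```python
import random

def mutate(population, i, F=0.5, rng=random):
    """Donor vector for target index i (formula 7)."""
    # r1, r2, r3 are drawn without replacement from all indices except i.
    candidates = [r for r in range(len(population)) if r != i]
    r1, r2, r3 = rng.sample(candidates, 3)
    x1, x2, x3 = population[r1], population[r2], population[r3]
    # Base vector plus the scaled difference of the two other vectors.
    return [a + F * (b - c) for a, b, c in zip(x1, x2, x3)]

population = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
donor = mutate(population, 0, F=0.8, rng=random.Random(42))
print(donor)
```

Passing a seeded `random.Random` instance makes a run reproducible; the default uses the module-level generator.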
Secondly, we use the crossing operation and obtain the trial vector $u_{i,G+1}$ through formula 8.

$$u_{j,i,G+1} = \begin{cases} v_{j,i,G+1}, & \text{if } rand_{j,i} \le CR \text{ or } j = I_{rand} \\ x_{j,i,G}, & \text{if } rand_{j,i} > CR \text{ and } j \ne I_{rand} \end{cases} \qquad i = 1, 2, \ldots, NP; \; j = 1, 2, \ldots, D \qquad (8)$$

where $rand_{j,i} \sim U(0, 1)$, $I_{rand}$ is a random integer from {1, 2, ..., D}, and CR is a user-defined number in [0, 1] called the crossing factor.
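A possible Python rendering of the crossing (crossover) step in formula 8 is given here. It is a sketch only: the name `crossover`, the default CR = 0.9 and the `rng` argument are assumptions, and the index chosen for I_rand is 0-based rather than 1-based.

```python
import random

def crossover(target, donor, CR=0.9, rng=random):
    """Trial vector built component by component (formula 8)."""
    D = len(target)
    j_rand = rng.randrange(D)  # 0-based counterpart of I_rand
    trial = []
    for j in range(D):
        # Take the donor component with probability CR, and always at j_rand,
        # so the trial vector differs from the target in at least one position.
        if rng.random() <= CR or j == j_rand:
            trial.append(donor[j])
        else:
            trial.append(target[j])
    return trial

target = [0.0] * 6
donor = [1.0] * 6
print(crossover(target, donor, CR=0.5, rng=random.Random(7)))
```

With CR = 1 the trial vector equals the donor vector; with CR close to 0 it typically inherits only the single component at I_rand.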
Lastly, the selection operation produces the next-generation target vector $x_{i,G+1}$: the current target vector $x_{i,G}$ is compared with the trial vector $u_{i,G+1}$, and the one with the lower fitness function value is admitted to the next generation through formula 9. Mutation, crossing and selection continue until some stopping criterion is reached.

$$x_{i,G+1} = \begin{cases} u_{i,G+1}, & \text{if } f(u_{i,G+1}) \le f(x_{i,G}) \\ x_{i,G}, & \text{otherwise} \end{cases} \qquad (9)$$
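The selection rule of formula 9 can be illustrated with a short Python sketch. The names `select`, `next_generation` and `sphere` are hypothetical, the comparison is assumed to be "less than or equal", and the sphere function is merely a stand-in objective for demonstration.

```python
def select(target, trial, fitness):
    """Formula 9: the trial vector survives if its fitness is not higher."""
    return trial if fitness(trial) <= fitness(target) else target

def next_generation(population, trials, fitness):
    """Apply the selection rule to every (target, trial) pair."""
    return [select(x, u, fitness) for x, u in zip(population, trials)]

def sphere(x):
    """Hypothetical test objective: sum of squares, minimal at the origin."""
    return sum(v * v for v in x)

population = [[1.0, 1.0], [0.2, 0.1]]
trials = [[0.5, 0.5], [2.0, 2.0]]
print(next_generation(population, trials, sphere))
# prints [[0.5, 0.5], [0.2, 0.1]]
```

Formula 9 keeps the vector with the lower fitness value, whereas a larger merit f(s) from equation 1 indicates a better subset. One possible convention, assumed here and not stated in the text, is to pass the negated merit as `fitness`.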
2) Feature selection using the BDE algorithm
The DE algorithm uses a numerical encoding for continuous spaces. In this paper, we transform the numerical encoding into a binary encoding with a certain probability using formula 10 below, and we call this method Binary Differential Evolution (BDE). The method can therefore deal with discrete optimization problems.

$$d = \begin{cases} 1, & \text{if } rand() \le \exp(-|x|) \\ 0, & \text{otherwise} \end{cases} \qquad (10)$$

where $rand()$ is a random number between 0 and 1.
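The probabilistic binarization can be sketched in Python as follows, assuming that formula 10 sets a bit to 1 when rand() is at most exp(-|x|) and to 0 otherwise. The comparison direction and the sign inside the exponential are assumptions, as are the names `binarize` and `selected_indices`.

```python
import math
import random

def binarize(x, rng=random):
    """Map a real-valued vector to bits, one independent draw per component."""
    # Assumed reading of formula 10: P(bit = 1) = exp(-|x_j|).
    return [1 if rng.random() <= math.exp(-abs(v)) else 0 for v in x]

def selected_indices(bits):
    """Indices of the features whose bit is 1."""
    return [j for j, b in enumerate(bits) if b == 1]

bits = binarize([0.0, 0.3, 2.5, 0.0], rng=random.Random(3))
print(bits, selected_indices(bits))
```

Under this reading, components near zero are almost always mapped to 1 and components with large magnitude are almost always mapped to 0.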
So each individual is encoded in the following format:
1 0 1 ... 0
where 1 indicates that the feature is selected, and all selected features together form the best feature subset. The pseudo-code is as follows:
Begin