# [IEEE 2009 International Conference on Artificial Intelligence and Computational Intelligence - Shanghai, China (2009.11.7-2009.11.8)] 2009 International Conference on Artificial Intelligence and Computational Intelligence - Feature Selection with Discrete Binary Differential Evolution



Feature selection with discrete binary differential evolution

Xingshi He, Department of Mathematics, Xian Polytechnic University, Xian 710048, P.R. China, e-mail: xingshi_he@163.com

Qingqing Zhang, Department of Mathematics, Xian Polytechnic University, Xian 710048, P.R. China, e-mail: suiyue2959@163.com

Na Sun, Department of Mathematics, Xian Polytechnic University, Xian 710048, P.R. China, e-mail: sunn827@sina.com

Yan Dong, Department of Mathematics, Xian Polytechnic University, Xian 710048, P.R. China, e-mail: dongyan840214@126.com

Abstract-Processing data from databases with data mining algorithms requires special methods. Redundant and irrelevant attributes reduce the performance of data mining, so feature subset selection is an important problem in the data mining domain. This paper presents a new algorithm, called the discrete binary differential evolution (BDE) algorithm, to select the best feature subsets. The relevance of attributes is evaluated using mutual information. Experiments use the new feature selection method as a preprocessing step for SVM, C&R Tree and RBF network. We find that the method improves the correct classification rate on some datasets and that the BDE algorithm is useful for feature subset selection.

Keywords-differential evolution; data mining; feature selection; mutual information

I INTRODUCTION

The success of data mining on a given task is affected by many factors, and the quality of the data is one of them. If information is irrelevant or redundant, or the data is noisy and unreliable, knowledge discovery is more difficult. Feature subset selection is a process that identifies and removes as much of the irrelevant and redundant information as possible. It is necessary to select a small number of highly predictive features in order to avoid over-fitting the training data. Regardless of whether a learner attempts to select features itself or ignores the issue, feature selection prior to learning can be beneficial. Reducing the dimension of the data reduces the size of the hypothesis space and allows algorithms to operate faster and more effectively. In some cases, the accuracy of future classification can be improved; in others, the result is a more compact, easily interpreted representation of the target concept.

Algorithms that perform feature selection as a preprocessing step prior to learning can generally be placed into one of two broad categories. One approach, referred to as the wrapper [1], selects useful features depending on the specific problem and learning algorithm. This approach has proved useful but is very slow to execute. For this reason, wrappers do not scale well to large datasets containing many features. Another approach, called the filter [1], operates independently of any learning algorithm: undesirable features are filtered out of the data before induction commences. Filters have proved to be much faster than wrappers and hence can be applied to large datasets containing many features. Their general nature allows them to be used with any learner, unlike the wrapper, which must be rerun when switching from one learning algorithm to another.

This paper presents a new approach to feature selection, called BDE (discrete binary differential evolution based feature selection). The approach uses a population-based heuristic to evaluate the worth of features. The algorithm is simple and, with suitable correlation measures, fast to execute.

The rest of this paper is organized as follows. Section 2 describes the BDE algorithm. Section 3 presents experimental results of using BDE as a pre-processor for learning algorithms. The last section summarizes and discusses future work.

II BDE: DISCRETE BINARY DIFFERENTIAL EVOLUTION BASED FEATURE SELECTION

A. Feature evaluation

The purpose of feature selection is to decide which of the initial (possibly large) number of features are included in the final subset and which are ignored. If there are n possible features initially, then there are $2^n$ possible subsets, so heuristic methods are needed to search the feature subset space in reasonable time. In this paper, we use the BDE algorithm to find the best subset. The key problem is to define a rule for evaluating the worth of a subset of features. This worth takes into account the usefulness of individual features for predicting the class label along with the level of inter-correlation among them. We therefore use a new fitness function (Equation 1) to select features [2].
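To give a feel for why exhaustive search is impractical, the short Python sketch below counts the candidate subsets for three attribute counts taken from Table 1. The helper name `num_subsets` is hypothetical and only illustrates the arithmetic.

```python
def num_subsets(n):
    """Number of distinct feature subsets of n features (empty set included)."""
    return 2 ** n


# Attribute counts as listed in Table 1.
for name, n in [("Breast", 9), ("Vote", 16), ("Lung", 56)]:
    print(name, n, num_subsets(n))
```

Even the 56-attribute Lung dataset already yields roughly 7.2e16 candidate subsets, which is the motivation for a heuristic search.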


978-0-7695-3816-7/09 $26.00 2009 IEEE. DOI 10.1109/AICI.2009.438


$$f(s) = \frac{k \, r_{cf}}{\sqrt{k + k(k-1)\, r_{ff'}}} \tag{1}$$

Where $f(s)$ is the worth of a feature subset S containing k features and serves as the fitness function for BDE, $r_{cf}$ is the average feature-class correlation, and $r_{ff'}$ is the average feature-feature inter-correlation. $r_{cf}$ indicates how predictive a group of features is, and $r_{ff'}$ indicates how much redundancy there is among them. The heuristic handles irrelevant features, as they will be poor predictors of the class; redundant attributes are discriminated against, as they will be highly correlated with one or more of the other features.
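The effect of the two correlation terms can be illustrated with a minimal Python sketch of Equation 1. The function name `subset_merit` is hypothetical, and the square root in the denominator is an assumption taken from the correlation-based merit of [2]; the handling of an empty subset is likewise an assumption.

```python
import math


def subset_merit(k, r_cf, r_ff):
    """Worth f(s) of a subset with k features, following Equation 1.

    r_cf: average feature-class correlation.
    r_ff: average feature-feature inter-correlation.
    """
    if k == 0:
        # Assumption: an empty subset has no worth.
        return 0.0
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)


print(subset_merit(1, 0.5, 0.0))  # one feature
print(subset_merit(4, 0.5, 0.0))  # four uncorrelated features
print(subset_merit(4, 0.5, 1.0))  # four fully redundant features
```

With equal feature-class correlation, four uncorrelated features score higher than one, while four fully redundant features score the same as a single feature.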

In order to apply Equation 1 to estimate the merit of a feature subset, it is necessary to compute the correlation (dependence) between attributes. For discrete class problems, we use an information-based measure to estimate the degree of association between features. If X and Y are discrete random variables, Equations 2 and 3 give the entropy of Y before and after observing X.

$$H(y) = -\sum_{y \in Y} p(y) \log_2 p(y) \tag{2}$$

$$H(y \mid x) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x) \tag{3}$$

The amount by which the entropy of Y decreases reflects the additional information about Y provided by X and is called the information gain [3]. Information gain is given by Equation 4:

$$\mathrm{gain}(x, y) = H(y) - H(y \mid x) \tag{4}$$

Here, we use information gain to indicate the degree of correlation between x and y: if $\mathrm{gain}(x, y)$ is large, then the correlation between x and y is high. We can therefore compute $r_{cf}$ and $r_{ff'}$ in Equation 1 using Equations 5 and 6 below.
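Equations 2 to 4 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the probabilities are assumed to be estimated by relative frequencies in the sample.

```python
import math
from collections import Counter


def entropy(values):
    """Entropy H(y) in bits of a sequence of discrete values (Equation 2)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(y, x):
    """Entropy H(y|x) of y after observing x (Equation 3)."""
    n = len(y)
    total = 0.0
    for x_value, count in Counter(x).items():
        y_given_x = [yi for xi, yi in zip(x, y) if xi == x_value]
        total += (count / n) * entropy(y_given_x)
    return total


def gain(x, y):
    """Information gain of x about y (Equation 4)."""
    return entropy(y) - conditional_entropy(y, x)


labels = [0, 0, 1, 1]
print(gain([0, 0, 1, 1], labels))  # attribute that determines the class
print(gain([0, 1, 0, 1], labels))  # attribute independent of the class
```

An attribute that fully determines the class removes all of the class entropy, while an independent attribute removes none of it.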

$$r_{cf} = \frac{1}{k} \sum_{i \in S} \mathrm{gain}(x_i, y) \tag{5}$$

$$r_{ff'} = \frac{1}{k(n-k)} \sum_{i \in S} \sum_{j \in \bar{S}} \mathrm{gain}(x_i, x_j) \tag{6}$$

Where $\bar{S}$ denotes the set of attributes that do not belong to S, n is the total number of attributes and y is the class attribute.
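As a hedged illustration of Equations 5 and 6, the sketch below averages precomputed information gains for a given binary feature mask. The function name and the argument layout (`gain_fc[i]` for feature-class gains, `gain_ff[i][j]` for feature-feature gains) are hypothetical, and the return values for an empty or full subset are assumptions, since the source does not discuss those edge cases.

```python
def average_correlations(mask, gain_fc, gain_ff):
    """Return (r_cf, r_ff) for the subset encoded by a 0/1 mask.

    mask:    list of 0/1 flags, 1 meaning the feature is in S.
    gain_fc: gain_fc[i] is gain(x_i, y).
    gain_ff: gain_ff[i][j] is gain(x_i, x_j).
    """
    selected = [i for i, bit in enumerate(mask) if bit == 1]
    unselected = [j for j, bit in enumerate(mask) if bit == 0]
    k, n = len(selected), len(mask)
    if k == 0:
        return 0.0, 0.0  # assumption: empty subset
    r_cf = sum(gain_fc[i] for i in selected) / k  # Equation 5
    if k == n:
        return r_cf, 0.0  # assumption: no unselected attributes left
    # Equation 6: average gain between selected and unselected attributes.
    r_ff = sum(gain_ff[i][j] for i in selected for j in unselected) / (k * (n - k))
    return r_cf, r_ff


mask = [1, 0, 1]
gain_fc = [0.4, 0.1, 0.6]
gain_ff = [[0.0, 0.2, 0.3],
           [0.2, 0.0, 0.5],
           [0.3, 0.5, 0.0]]
print(average_correlations(mask, gain_fc, gain_ff))
```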

B. Searching the Feature Subset Space using BDE

1) Differential Evolution (DE) algorithm

The DE algorithm is a population-based heuristic search procedure first introduced by Rainer Storn and Kenneth Price in 1997 [4]. It starts with NP randomly generated individuals as solution vectors, applies mutation, crossing and selection operations on a numerical encoding, and returns the best individual as the solution to the problem [5]. In our experiments, we noticed that DE performs exceptionally well compared with other evolutionary algorithms on numerical optimization problems. DE requires hardly any parameter tuning and works very reliably, with excellent overall results over a wide set of benchmark and real-world problems.

Firstly, we generate NP individuals randomly. The individuals have the form

$$x_{i,G} = [x_{1,i,G}, x_{2,i,G}, \ldots, x_{D,i,G}], \quad i = 1, 2, \ldots, NP$$

where G is the generation number and D is the dimension of the problem. In each generation, for each individual $x_{i,G}$, the mutation operation produces a vector $v_{i,G+1}$, called the donor vector, according to formula 7:

$$v_{i,G+1} = x_{r_1,G} + F\,(x_{r_2,G} - x_{r_3,G}) \tag{7}$$

where the mutation factor F is a constant from [0, 2] and the three vectors $x_{r_1,G}$, $x_{r_2,G}$ and $x_{r_3,G}$ are selected randomly such that the indices i, $r_1$, $r_2$ and $r_3$ are distinct and taken from {1, 2, ..., NP}.
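The mutation step of formula 7 can be sketched as follows. The function name `mutate` and the `rng` parameter are hypothetical; the sketch assumes a population of at least four individuals so that three distinct indices different from i exist.

```python
import random


def mutate(population, i, F=0.3, rng=random):
    """Donor vector for individual i: x_r1 + F * (x_r2 - x_r3)."""
    # Draw three mutually distinct indices, all different from i.
    candidates = [r for r in range(len(population)) if r != i]
    r1, r2, r3 = rng.sample(candidates, 3)
    x1, x2, x3 = population[r1], population[r2], population[r3]
    return [a + F * (b - c) for a, b, c in zip(x1, x2, x3)]


population = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.5], [0.3, 0.8]]
print(mutate(population, 0))
```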

Secondly, we use the crossing operation to get the trial vector $u_{i,G+1}$ through formula 8:

$$u_{j,i,G+1} = \begin{cases} v_{j,i,G+1} & \text{if } rand_{j,i} \le CR \text{ or } j = I_{rand} \\ x_{j,i,G} & \text{if } rand_{j,i} > CR \text{ and } j \ne I_{rand} \end{cases} \quad i = 1, 2, \ldots, NP;\; j = 1, 2, \ldots, D \tag{8}$$

where $rand_{j,i} \sim U(0, 1)$, $I_{rand}$ is a random integer from {1, 2, ..., D}, and CR is the crossing factor, a user-defined number in [0, 1].
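A minimal sketch of the crossing step in formula 8 is given below. The function name `crossover` and the `rng` parameter are hypothetical; indices are zero-based here, whereas the formula counts from 1.

```python
import random


def crossover(x, v, CR=0.7, rng=random):
    """Trial vector: take v[j] if rand <= CR or j == I_rand, else keep x[j]."""
    D = len(x)
    # I_rand guarantees that at least one component comes from the donor.
    j_rand = rng.randrange(D)
    return [v[j] if (rng.random() <= CR or j == j_rand) else x[j]
            for j in range(D)]


target = [0.0, 0.0, 0.0, 0.0]
donor = [1.0, 1.0, 1.0, 1.0]
print(crossover(target, donor))
```

The forced index is what keeps the trial vector from being an exact copy of the target even when CR is very small.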

Lastly, we use the selection operation: the target vector $x_{i,G}$ is compared with the trial vector $u_{i,G+1}$, and the one with the lowest fitness function value is admitted to the next generation through formula 9:

$$x_{i,G+1} = \begin{cases} u_{i,G+1} & \text{if } f(u_{i,G+1}) \le f(x_{i,G}) \\ x_{i,G} & \text{otherwise} \end{cases} \quad i = 1, 2, \ldots, NP \tag{9}$$

Mutation, crossing and selection continue until some stopping criterion is reached.
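The selection rule of formula 9 reduces to a single comparison, sketched below with a hypothetical function name `select`. The sketch follows the text in keeping the vector with the lower fitness value; if the subset worth of Equation 1 is to be maximized, one would presumably pass its negative as the fitness, which is an assumption not stated in the source.

```python
def select(x, u, fitness):
    """Keep the trial vector u if it is at least as good as the target x."""
    return u if fitness(u) <= fitness(x) else x


# Toy fitness for illustration only: the sum of the components.
toy_fitness = sum
print(select([1, 1], [0, 1], toy_fitness))
```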

2) Feature selection using the BDE algorithm

The DE algorithm uses a numerical encoding for continuous spaces. In this paper, we transform the numerical encoding into a binary encoding according to a probability given by formula 10 below, and we call this method Binary Differential Evolution (BDE). The method can therefore be used to deal with discrete optimization problems.

$$d_i = \begin{cases} 1 & \text{if } rand() \le \exp(-|x_i|) \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

Where $rand()$ is a random number between 0 and 1.
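The discretization of formula 10 might be sketched as below. Two details are assumptions here, because the comparison operator and the sign inside the exponential are not legible in the source: the sketch sets a bit to 1 when rand() <= exp(-|x_i|). The function name `binarize` and the `rng` parameter are hypothetical.

```python
import math
import random


def binarize(x, rng=random):
    """Map a real-valued vector to a 0/1 vector, one random draw per component."""
    # Assumed rule: bit is 1 with probability exp(-|x_i|).
    return [1 if rng.random() <= math.exp(-abs(xi)) else 0 for xi in x]


print(binarize([0.0, 0.5, 2.0, -3.0]))
```

Under this reading, components close to zero are almost always selected and components with large magnitude are rarely selected.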

Each individual is therefore encoded as follows:

1 0 1 ... 0


Where 1 indicates that the feature is selected, and all selected features together form the candidate best feature subset. The pseudo-code is as follows:

Begin
    G = 1;
    Initialize the NP individuals X_{i,G} randomly;
    For G = 1 to Gmax do
        For i = 1 to NP do
            Mutation step: for each individual X_{i,G}, get the donor vector V_{i,G+1} according to formula 7;
            Crossing step: get the trial vector U_{i,G+1} from X_{i,G} and V_{i,G+1} according to formula 8;
            Discrete step: get the binary temp vectors of X_{i,G} and U_{i,G+1} according to formula 10;
            Get the feature subsets and compute J_{i,X} and J_{i,U} from the vectors X_{i,G} and U_{i,G+1} according to formula 1;
            Selection step: get the target vector X_{i,G+1} according to formula 9;
        End for
        G = G + 1;
    End for
End
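The pseudo-code can be turned into a compact Python sketch along the following lines. It is a sketch under stated assumptions, not the authors' implementation: the initial range [-1, 1], the reading of formula 10 as rand() <= exp(-|x_i|), the tracking of the best bit string seen so far, and the toy fitness are all choices made here for illustration. The fitness is minimized, as in formula 9.

```python
import math
import random


def bde(fitness, D, NP=10, F=0.3, CR=0.7, G_max=50, seed=0):
    """Minimise fitness(bits) over 0/1 lists of length D; return (bits, value)."""
    rng = random.Random(seed)

    def to_bits(x):
        # Discrete step (formula 10), comparison and sign assumed.
        return [1 if rng.random() <= math.exp(-abs(xi)) else 0 for xi in x]

    # Assumption: real-valued individuals start uniformly in [-1, 1].
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(NP)]
    best_bits, best_fit = None, float("inf")

    for _ in range(G_max):
        next_pop = []
        for i in range(NP):
            # Mutation step (formula 7).
            r1, r2, r3 = rng.sample([r for r in range(NP) if r != i], 3)
            v = [pop[r1][j] + F * (pop[r2][j] - pop[r3][j]) for j in range(D)]
            # Crossing step (formula 8).
            j_rand = rng.randrange(D)
            u = [v[j] if (rng.random() <= CR or j == j_rand) else pop[i][j]
                 for j in range(D)]
            # Discrete step and evaluation of both candidates.
            bits_x, bits_u = to_bits(pop[i]), to_bits(u)
            fit_x, fit_u = fitness(bits_x), fitness(bits_u)
            # Selection step (formula 9).
            if fit_u <= fit_x:
                next_pop.append(u)
                cand_bits, cand_fit = bits_u, fit_u
            else:
                next_pop.append(pop[i])
                cand_bits, cand_fit = bits_x, fit_x
            if cand_fit < best_fit:
                best_bits, best_fit = cand_bits, cand_fit
        pop = next_pop
    return best_bits, best_fit


# Toy fitness (hypothetical): Hamming distance to a known target mask.
target = [1, 0, 1, 1, 0, 0]
toy_fitness = lambda bits: sum(b != t for b, t in zip(bits, target))
best_bits, best_fit = bde(toy_fitness, D=len(target))
print(best_bits, best_fit)
```

In a feature selection setting, the toy fitness would be replaced by a function that builds the subset from the bit string and evaluates it with formula 1.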

III EXPERIMENTS

In order to evaluate the effectiveness of BDE as a global feature selector for common machine learning algorithms, experiments were performed using six standard datasets from the UCI collection [6]. The datasets and their characteristics are listed in Table 1. The parameters of the BDE algorithm were set as follows: mutation factor F = 0.3, crossing factor CR = 0.7 and maximum number of iterations Gmax = 10*dim (dim is the dimension of the dataset). Three data mining algorithms representing three diverse approaches to learning were used in the experiments: Support Vector Machine (SVM), C&R Tree and RBF network. Each experiment was repeated for 10 runs. In each run, each dataset was randomly divided into two parts: 60% as the training set and the rest as the test set. The correct classification rate was averaged over the 10 runs for the training sets and the test sets. The results are listed in Tables 2, 3 and 4.
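The evaluation protocol described above (10 runs, random 60/40 split, averaged correct rate) can be sketched as follows. The function names and the `train_and_score` callback are hypothetical placeholders; the majority-class scorer merely stands in for a real learner such as SVM, C&R Tree or RBF network and is not part of the source.

```python
import random


def average_accuracy(X, y, train_and_score, runs=10, train_frac=0.6, seed=0):
    """Average test score over repeated random train/test splits."""
    rng = random.Random(seed)
    n = len(X)
    cut = int(round(train_frac * n))
    scores = []
    for _ in range(runs):
        idx = list(range(n))
        rng.shuffle(idx)
        train, test = idx[:cut], idx[cut:]
        scores.append(train_and_score(
            [X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test]))
    return sum(scores) / runs


def majority_scorer(X_train, y_train, X_test, y_test):
    """Placeholder learner: always predicts the majority training class."""
    majority = max(set(y_train), key=y_train.count)
    return sum(1 for label in y_test if label == majority) / len(y_test)


X = [[i] for i in range(10)]
y = [0] * 10
print(average_accuracy(X, y, majority_scorer))
```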

ACKNOWLEDGMENT

This work is supported by Mechanism Design Theory and Application Research through Shaanxi Province Department of Education Research Project (08JK285) and Xian Polytechnic University Postgraduate Innovation Foundation (chx090721).

REFERENCES

[1] John G. H.; Kohavi R.; Pfleger K.: Irrelevant features and the subset selection problem. In Machine Learning: Proceedings of the Eleventh International Conference. Morgan Kaufmann, 1994.

[2] Hall M. A.: Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning. Proceedings of the Seventeenth International Conference on Machine Learning, 2000, Pages: 359-366.

[3] Quinlan J. R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[4] Storn R.; Price K.: Differential Evolution: A simple and efficient adaptive scheme for global optimization over continuous spaces. Journal of Global Optimization, 11, 1997, Pages: 341-359.

[5] Vesterstrom J.; Thomsen R.: A comparative study of differential evolution, particle swarm optimization, and evolutionary algorithms on numerical benchmark problems. Congress on Evolutionary Computation, 2004, Pages: 1980-1987.

[6] Blake C.; Keogh E.; Merz C. J.: UCI Repository of Machine Learning Databases (1998). www.ics.uci.edu/mlearn/MLRepository.html.

Table 1: The structure of the datasets used in the experiments

| Dataset | Number of attributes | Instances | Classes |
|---|---|---|---|
| Vote | 16 | 435 | 2 |
| Zoo | 16 | 101 | 7 |
| Flare | 10 | 1066 | 2 |
| Breast | 9 | 683 | 2 |
| Lung | 56 | 32 | 2 |
| Exactly | 13 | 1000 | 2 |

Table 2: The best feature subsets

| Dataset | Number of features in the best subset |
|---|---|
| Vote | 10 |
| Zoo | 9 |
| Flare | 7 |
| Breast | 5 |
| Lung | 35 |
| Exactly | 9 |


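Table 3: The correct rate of the classification algorithms on the training datasets (Bef = before feature selection, Aft = after feature selection)

| Dataset | C&R Tree Bef | C&R Tree Aft | SVM Bef | SVM Aft | RBF network Bef | RBF network Aft |
|---|---|---|---|---|---|---|
| Vote | 0.9732 | 0.9349 | 1 | 0.9923 | 0.9732 | 0.9272 |
| Zoo | 1 | 1 | 1 | 1 | 1 | 1 |
| Flare | 0.8406 | 0.8219 | 0.8812 | 0.825 | 0.8219 | 0.8215 |
| Breast | 0.9756 | 0.9707 | 1 | 1 | 0.978 | 0.98 |
| Lung | 0.95 | 0.95 | 1 | 1 | 0.95 | 0.95 |
| Exactly | 0.765 | 0.74 | 1 | 0.8767 | 0.7017 | 0.705 |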

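Table 4: The correct rate of the classification algorithms on the test datasets (Bef = before feature selection, Aft = after feature selection)

| Dataset | C&R Tree Bef | C&R Tree Aft | SVM Bef | SVM Aft | RBF network Bef | RBF network Aft |
|---|---|---|---|---|---|---|
| Vote | 0.8483 | 0.8736 | 0.5172 | 0.5172 | 0.8368 | 0.8909 |
| Zoo | 0.85 | 0.85 | 0.8 | 0.85 | 0.9 | 0.875 |
| Flare | 0.8333 | 0.8474 | 0.5609 | 0.5609 | 0.8774 | 0.878 |
| Breast | 0.9194 | 0.9304 | 0.6024 | 0.62 | 0.9487 | 0.9487 |
| Lung | 0.5333 | 0.6333 | 0.4 | 0.42 | 0.5233 | 0.4867 |
| Exactly | 0.7 | 0.685 | 0.4617 | 0.4617 | 0.675 | 0.67 |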
