Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution
Presented by Jingting Zeng, 11/26/2007
Outline
Introduction to Feature Selection
Feature Selection Models
Fast Correlation-Based Filter (FCBF) Algorithm
Experiment
Discussion
Reference
Introduction to Feature Selection
Definition
A process that chooses an optimal subset of features according to an objective function
Objectives
To reduce dimensionality and remove noise
To improve mining performance:
Speed of learning
Predictive accuracy
Simplicity and comprehensibility of mined results
An Example for Optimal Subset
Data set (whole set): five Boolean features
C = F1 ∨ F2, F3 = ¬F2, F5 = ¬F4
Optimal subset: {F1, F2} or {F1, F3}
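The slide's example can be checked mechanically. The following minimal Python sketch (my own illustration, not from the deck) enumerates the data set and verifies that {F1, F2} and {F1, F3} each fully determine the class C, while F1 alone does not:

```python
from itertools import product

# Enumerate the data set from the slide: five Boolean features where
# C = F1 or F2, F3 = not F2, and F5 = not F4 (F4, F5 are irrelevant to C).
rows = []
for f1, f2, f4 in product([0, 1], repeat=3):
    f3, f5 = 1 - f2, 1 - f4
    rows.append((f1, f2, f3, f4, f5, f1 | f2))

def determines(cols):
    """True if no two rows agree on the given feature columns but differ on C."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in cols)
        if seen.setdefault(key, row[-1]) != row[-1]:
            return False
    return True

print(determines([0, 1]))  # {F1, F2} -> True
print(determines([0, 2]))  # {F1, F3} -> True (F3 = not F2 carries the same information)
print(determines([0]))     # F1 alone -> False
```

F3 duplicates the information in F2, which is why either {F1, F2} or {F1, F3} is optimal.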
Models of Feature Selection
Filter model
Separating feature selection from classifier learning
Relying on general characteristics of data (information, distance, dependence, consistency)
No bias toward any learning algorithm; fast

Wrapper model
Relying on a predetermined classification algorithm
Using predictive accuracy as the goodness measure
High accuracy, but computationally expensive
Filter Model
Wrapper Model
Two Aspects of Feature Selection
How to decide whether a feature is relevant to the class or not
How to decide whether such a relevant feature is redundant or not compared to other features
Linear Correlation Coefficient
For a pair of variables (x, y):
r = Σi (xi - x̄)(yi - ȳ) / sqrt( Σi (xi - x̄)² · Σi (yi - ȳ)² )
However, it may not be able to capture non-linear correlations
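A small Python sketch (my own, for illustration) computes r from the definition and shows the limitation the slide mentions: a perfect quadratic dependence yields r = 0:

```python
import math

def pearson_r(x, y):
    """Linear correlation coefficient r for a pair of variables (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / norm

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2 * v + 1 for v in x]))     # perfectly linear -> 1.0
print(pearson_r(x, [(v - 3) ** 2 for v in x]))  # non-linear (quadratic) -> 0.0
```

The second pair is fully dependent (y is a function of x), yet the linear coefficient misses it entirely, which motivates the information-theoretic measures on the next slide.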
Information Measures
Entropy of variable X:
H(X) = -Σi P(xi) log2 P(xi)
Entropy of X after observing Y:
H(X|Y) = -Σj P(yj) Σi P(xi|yj) log2 P(xi|yj)
Information Gain:
IG(X|Y) = H(X) - H(X|Y)
Symmetrical Uncertainty:
SU(X, Y) = 2 · IG(X|Y) / (H(X) + H(Y))
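These measures are straightforward to implement for discrete variables. A self-contained Python sketch (names are my own):

```python
import math
from collections import Counter

def entropy(xs):
    """H(X) = -sum_i P(x_i) log2 P(x_i), estimated from counts."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def cond_entropy(xs, ys):
    """H(X|Y): entropy of X after observing Y."""
    n = len(ys)
    return sum((cy / n) * entropy([x for x, y in zip(xs, ys) if y == yv])
               for yv, cy in Counter(ys).items())

def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), in [0, 1]."""
    ig = entropy(xs) - cond_entropy(xs, ys)
    denom = entropy(xs) + entropy(ys)
    return 2.0 * ig / denom if denom > 0 else 0.0

x = [0, 0, 1, 1]
print(symmetrical_uncertainty(x, x))             # identical variables -> 1.0
print(symmetrical_uncertainty(x, [0, 1, 0, 1]))  # independent here -> 0.0
```

SU normalizes information gain by the two entropies, so it is symmetric and compensates for gain's bias toward features with more values.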
Fast Correlation-Based Filter (FCBF) Algorithm
How to decide whether a feature is relevant to the class C or not:
Find a subset S' such that SU(i, c) ≥ δ for every feature fi in S' (1 ≤ i ≤ N'), where δ is a predefined threshold
How to decide whether such a relevant feature is redundant:
Use the correlation between features and the class as a reference
Definitions
Predominant correlation: the correlation between a feature fi and the class C is predominant iff SU(i, c) ≥ δ and there exists no fj (j ≠ i) such that SU(j, i) ≥ SU(i, c)
Redundant peer (RP): if SU(j, i) ≥ SU(i, c), fj is a RP of fi
Use SPi to denote the set of RPs for fi; SPi is divided into SPi+ (RPs with SU(j, c) > SU(i, c)) and SPi- (RPs with SU(j, c) ≤ SU(i, c))
Three Heuristics
If SPi+ = ∅, treat fi as a predominant feature, remove all features in SPi-, and skip identifying redundant peers for them
If SPi+ ≠ ∅, process all the features in SPi+ first; if none of them becomes predominant, follow the first heuristic
The feature with the largest SU(i, c) value is always a predominant feature and can be a starting point to remove other features
FCBF Algorithm
Time complexity of the first part (relevance analysis): O(N)
FCBF Algorithm (cont.)
Time complexity of the second part (redundancy analysis): O(N log N)
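A compact, illustrative Python implementation of the two-part algorithm (relevance analysis, then redundancy elimination along the heuristics above) might look like the following; the function names and dict-based feature representation are my own, not the paper's:

```python
import math
from collections import Counter
from itertools import product

def _entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def _su(xs, ys):
    # Symmetrical uncertainty SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)).
    n = len(ys)
    h_cond = sum((cy / n) * _entropy([x for x, y in zip(xs, ys) if y == yv])
                 for yv, cy in Counter(ys).items())
    ig = _entropy(xs) - h_cond
    denom = _entropy(xs) + _entropy(ys)
    return 2.0 * ig / denom if denom > 0 else 0.0

def fcbf(features, labels, delta=0.0):
    """features: dict of name -> list of discrete values; labels: class values."""
    # Part 1 (relevance): keep features with SU(i, c) above the threshold delta,
    # ordered by decreasing correlation with the class (by heuristic 3, the
    # top-ranked feature is always predominant).
    relevant = [(name, s) for name, vals in features.items()
                for s in [_su(vals, labels)] if s > delta]
    relevant.sort(key=lambda t: -t[1])
    # Part 2 (redundancy): each selected predominant feature f_i removes every
    # remaining redundant peer f_j with SU(j, i) >= SU(j, c).
    selected, remaining = [], relevant
    while remaining:
        (name_i, _), *remaining = remaining
        selected.append(name_i)
        remaining = [(name_j, su_jc) for name_j, su_jc in remaining
                     if _su(features[name_j], features[name_i]) < su_jc]
    return selected

# The earlier Boolean example: C = F1 or F2, F3 = not F2, F5 = not F4.
rows = [(f1, f2, 1 - f2, f4, 1 - f4, f1 | f2)
        for f1, f2, f4 in product([0, 1], repeat=3)]
features = {f"F{k + 1}": [r[k] for r in rows] for k in range(5)}
labels = [r[5] for r in rows]
print(fcbf(features, labels))  # -> ['F1', 'F2']
```

On the example, F4 and F5 are dropped in part 1 (zero SU with C), and F3 is removed in part 2 as a redundant peer of F2, leaving one of the optimal subsets.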
Experiments
FCBF is compared to ReliefF, CorrSF, and ConsSF
Summary of the 10 data sets
Results
Results (cont.)
Pros and Cons
Advantages
Very fast
Selects fewer features with higher accuracy

Disadvantages
May fail to detect some relevant features
Example: on a data set of 4 features generated by 4 Gaussian functions, with 4 additional redundant features added, FCBF selected only 3 features
Discussion
FCBF compares only individual features with each other
A possible extension: use PCA first to capture groups of correlated features, then apply FCBF to the result
Reference
L. Yu and H. Liu. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. In Proc. 12th Int. Conf. on Machine Learning (ICML-03), pages 856-863, 2003.
J. Biesiada and W. Duch. Feature Selection for High-Dimensional Data: A Kolmogorov-Smirnov Correlation-Based Filter Solution. In Proc. CORES'05, Advances in Soft Computing, Springer Verlag, pages 95-104, 2005.
www.cse.msu.edu/~ptan/SDM07/Yu-Ye-Liu.pdf
www1.cs.columbia.edu/~jebara/6772/proj/Keith.ppt
Thank you!
Q and A