feature selection for high-dimensional data: a fast correlation-based filter solution

Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. Presented by Jingting Zeng, 11/26/2007



Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution

Presented by Jingting Zeng, 11/26/2007


Outline

- Introduction to Feature Selection
- Feature Selection Models
- Fast Correlation-Based Filter (FCBF) Algorithm
- Experiment
- Discussion
- Reference


Introduction to Feature Selection

Definition
- A process that chooses an optimal subset of features according to an objective function

Objectives
- To reduce dimensionality and remove noise
- To improve mining performance:
  - Speed of learning
  - Predictive accuracy
  - Simplicity and comprehensibility of mined results


An Example for Optimal Subset

Data set (whole set)
- Five Boolean features: F1, F2, F3, F4, F5
- C = F1 ∨ F2
- F3 = ¬F2, F5 = ¬F4

Optimal subset: {F1, F2} or {F1, F3}
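The claim about the optimal subsets can be checked by enumerating the truth table. The following is a minimal sketch; the helper name `is_sufficient` and the row layout are assumptions made for illustration, not part of the original slides. A subset counts as sufficient here if no two rows agree on the subset's features but disagree on the class.

```python
from itertools import product

def is_sufficient(subset, rows):
    """Return True if the features in `subset` determine the class in every row."""
    seen = {}
    for features, label in rows:
        key = tuple(features[name] for name in subset)
        # setdefault stores the first label seen for this key and returns it
        if seen.setdefault(key, label) != label:
            return False
    return True

# Build the whole data set: F1, F2, F4 are free; F3 and F5 are their negations.
rows = []
for f1, f2, f4 in product([False, True], repeat=3):
    features = {"F1": f1, "F2": f2, "F3": not f2, "F4": f4, "F5": not f4}
    rows.append((features, f1 or f2))  # C = F1 OR F2

for subset in (["F1", "F2"], ["F1", "F3"], ["F1"], ["F4", "F5"]):
    print(subset, is_sufficient(subset, rows))
```

Both {F1, F2} and {F1, F3} pass the check because F3 carries the same information as F2, while F4 and F5 say nothing about C.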


Models of Feature Selection

Filter model
- Separates feature selection from classifier learning
- Relies on general characteristics of the data (information, distance, dependence, consistency)
- No bias toward any learning algorithm; fast

Wrapper model
- Relies on a predetermined classification algorithm
- Uses predictive accuracy as the goodness measure
- High accuracy, but computationally expensive
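The contrast between the two models can be sketched on a toy problem. Everything in this example is an assumption chosen for illustration: the data set, the filter score (a simple dependence measure, not one used in the slides), and the wrapper's classifier (a majority-vote lookup table evaluated with leave-one-out). The point is only that the filter scores each feature from the data alone, while the wrapper has to train and evaluate a classifier for every candidate subset.

```python
from collections import Counter
from itertools import combinations, product

# Toy data (hypothetical): three Boolean features, class C equals feature 0.
DATA = [((f0, f1, f2), f0) for f0, f1, f2 in product([0, 1], repeat=3)]
N_FEATURES = 3

def filter_score(j, data):
    """Filter: score feature j from the data alone, as |P(C=1|F=1) - P(C=1|F=0)|."""
    rate = {}
    for v in (0, 1):
        labels = [c for x, c in data if x[j] == v]
        rate[v] = sum(labels) / len(labels)
    return abs(rate[1] - rate[0])

def loo_accuracy(subset, data):
    """Wrapper: leave-one-out accuracy of a majority-vote lookup classifier."""
    correct = 0
    for k, (x, c) in enumerate(data):
        rest = data[:k] + data[k + 1:]
        key = tuple(x[j] for j in subset)
        votes = Counter(cc for xx, cc in rest
                        if tuple(xx[j] for j in subset) == key)
        if not votes:  # unseen key: fall back to the overall majority class
            votes = Counter(cc for _, cc in rest)
        correct += votes.most_common(1)[0][0] == c
    return correct / len(data)

def wrapper_select(data):
    """Search all non-empty subsets; prefer higher accuracy, then fewer features."""
    subsets = [s for r in range(1, N_FEATURES + 1)
               for s in combinations(range(N_FEATURES), r)]
    return max(subsets, key=lambda s: (loo_accuracy(s, data), -len(s)))

filter_ranking = sorted(range(N_FEATURES),
                        key=lambda j: filter_score(j, DATA), reverse=True)
print("filter ranking:", filter_ranking)
print("wrapper choice:", wrapper_select(DATA))
```

The filter needs one pass per feature, whereas the wrapper evaluates every subset with a full leave-one-out run, which is where its extra cost comes from.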


Filter Model


Wrapper Model


Two Aspects for Feature Selection

- How to decide whether a feature is relevant to the class or not
- How to decide whether such a relevant feature is redundant or not compared to other features


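Linear Correlation Coefficient

For a pair of variables (x, y):

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the means of x and y.

However, it may fail to capture correlations that are not linear.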
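A small sketch illustrates the limitation. The function name `pearson_r` and the sample values are assumptions chosen for illustration: a perfectly linear relation gives r close to 1, while y = x² on inputs symmetric around zero gives r = 0 even though y is fully determined by x.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # linear dependence
quadratic = [x * x for x in xs]    # non-linear dependence

r_linear = pearson_r(xs, linear)
r_quadratic = pearson_r(xs, quadratic)
print(r_linear, r_quadratic)
```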


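Information Measures

Entropy of variable X:

$$H(X) = -\sum_i P(x_i) \log_2 P(x_i)$$

Entropy of X after observing Y:

$$H(X \mid Y) = -\sum_j P(y_j) \sum_i P(x_i \mid y_j) \log_2 P(x_i \mid y_j)$$

Information Gain:

$$IG(X \mid Y) = H(X) - H(X \mid Y)$$

Symmetrical Uncertainty:

$$SU(X, Y) = 2 \left[ \frac{IG(X \mid Y)}{H(X) + H(Y)} \right]$$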
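These measures can be computed directly from value counts for discrete variables. The sketch below is one possible implementation; the function names are assumptions, and returning 0 when both variables are constant is a convention chosen here to avoid dividing by zero.

```python
import math
from collections import Counter

def entropy(xs):
    """H(X) in bits, estimated from the observed frequencies."""
    n = len(xs)
    return -sum((k / n) * math.log2(k / n) for k in Counter(xs).values())

def conditional_entropy(xs, ys):
    """H(X|Y): entropy of X within each group of equal Y, weighted by P(y)."""
    n = len(xs)
    total = 0.0
    for y, count in Counter(ys).items():
        xs_given_y = [x for x, yy in zip(xs, ys) if yy == y]
        total += (count / n) * entropy(xs_given_y)
    return total

def information_gain(xs, ys):
    """IG(X|Y) = H(X) - H(X|Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)

def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), a value in [0, 1]."""
    denom = entropy(xs) + entropy(ys)
    if denom == 0:  # assumption: two constant variables get SU = 0
        return 0.0
    return 2.0 * information_gain(xs, ys) / denom

x = [0, 0, 1, 1]
print(symmetrical_uncertainty(x, x))             # identical variables
print(symmetrical_uncertainty(x, [0, 1, 0, 1]))  # independent variables

# Unlike the linear correlation coefficient, SU detects y = x^2.
xs = [-2, -1, 0, 1, 2]
print(symmetrical_uncertainty(xs, [v * v for v in xs]))
```

SU is symmetric in its arguments and normalized, which makes it usable both for feature-class and for feature-feature comparisons.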


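Fast Correlation-Based Filter (FCBF) Algorithm

- How to decide whether a feature $f_i$ is relevant to the class C or not
  - Find a subset $S'$ such that $\forall f_i \in S'$, $1 \le i \le N$: $SU_{i,c} \ge \delta$, where $SU_{i,c}$ is the symmetrical uncertainty between $f_i$ and C, and $\delta$ is a relevance threshold
- How to decide whether such a relevant feature is redundant
  - Use the correlation between features and the class as a reference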


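Definitions

Predominant Correlation
- The correlation between a feature $f_i$ and the class C is predominant iff $SU_{i,c} \ge \delta$ and $\forall f_j \in S'$ ($j \ne i$), there exists no $f_j$ such that $SU_{j,i} \ge SU_{i,c}$

Redundant peer (RP)
- If there is an $f_j$ such that $SU_{j,i} \ge SU_{i,c}$, then $f_j$ is an RP of $f_i$
- Use $S_{P_i}$ to denote the set of RPs for $f_i$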
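The two definitions translate into short set computations once the SU values are known. In the sketch below, the function names, the threshold, and the SU numbers are all hypothetical values chosen for illustration; `su_class[j]` stands for $SU_{j,c}$ and `su_pair[j][i]` for $SU_{j,i}$.

```python
def relevant_set(su_class, delta):
    """Indices of the features in S', i.e. those with SU_{j,c} >= delta."""
    return [j for j, su in enumerate(su_class) if su >= delta]

def redundant_peers(i, su_class, su_pair, delta):
    """S_Pi: features f_j in S' (j != i) with SU_{j,i} >= SU_{i,c}."""
    return [j for j in relevant_set(su_class, delta)
            if j != i and su_pair[j][i] >= su_class[i]]

def is_predominant(i, su_class, su_pair, delta):
    """f_i is predominant iff it is relevant and has no redundant peer in S'."""
    return su_class[i] >= delta and not redundant_peers(i, su_class, su_pair, delta)

# Hypothetical SU values for four features.
su_class = [0.9, 0.6, 0.5, 0.2]
su_pair = [
    [1.00, 0.70, 0.10, 0.10],
    [0.70, 1.00, 0.65, 0.10],
    [0.10, 0.65, 1.00, 0.05],
    [0.10, 0.10, 0.05, 1.00],
]
delta = 0.3

for i in range(len(su_class)):
    print(i, redundant_peers(i, su_class, su_pair, delta),
          is_predominant(i, su_class, su_pair, delta))
```

With these numbers, feature 0 has no redundant peers and is predominant, feature 1 has features 0 and 2 as redundant peers, and feature 3 falls below the threshold and is never considered.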

[Figure: a feature $f_i$, its redundant peer sets $S_{P_i}$, and the class C]


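Three Heuristics

As in Yu and Liu's paper, $S_{P_i}^+$ denotes the redundant peers $f_j$ of $f_i$ with $SU_{j,c} > SU_{i,c}$, and $S_{P_i}^-$ denotes those with $SU_{j,c} \le SU_{i,c}$.

1. If $S_{P_i}^+ = \emptyset$, treat $f_i$ as a predominant feature, remove all features in $S_{P_i}^-$, and skip identifying redundant peers for them.
2. If $S_{P_i}^+ \ne \emptyset$, process all the features in $S_{P_i}^+$ first. If none of them becomes predominant, follow the first heuristic.
3. The feature with the largest $SU_{i,c}$ value is always a predominant feature and can be a starting point for removing other features.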

[Figure: a feature $f_i$, its redundant peer sets $S_{P_i}$, and the class C]


FCBF Algorithm

Time complexity: O(N), where N is the number of features


FCBF Algorithm (cont.)

Time complexity: O(N log N)
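The algorithm listing itself did not survive in this transcript, so the following is only a sketch of the two-part procedure the slides describe: first keep the features with $SU_{i,c} \ge \delta$ and order them by decreasing $SU_{i,c}$, then repeatedly take the strongest remaining feature as predominant and drop every later feature that is at least as correlated with it as with the class. The names `su` and `fcbf`, the threshold value 0.1, and the use of joint entropy to compute information gain are assumptions made here. The data reuses the five Boolean features from the optimal-subset example.

```python
import math
from collections import Counter
from itertools import product

def entropy(values):
    """Entropy in bits, estimated from observed frequencies."""
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in Counter(values).values())

def su(xs, ys):
    """Symmetrical uncertainty, using IG(X|Y) = H(X) + H(Y) - H(X, Y)."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:  # assumption: two constant variables get SU = 0
        return 0.0
    info_gain = hx + hy - entropy(list(zip(xs, ys)))
    return 2.0 * info_gain / (hx + hy)

def fcbf(features, labels, delta):
    """Return the names of the selected features, strongest first."""
    # Part 1: relevance. Keep features with SU_{i,c} >= delta, sorted descending.
    su_c = {name: su(col, labels) for name, col in features.items()}
    candidates = sorted((n for n in su_c if su_c[n] >= delta),
                        key=su_c.get, reverse=True)
    # Part 2: redundancy. The head of the list is predominant; remove every
    # remaining feature q with SU_{p,q} >= SU_{q,c}.
    selected = []
    while candidates:
        p = candidates.pop(0)
        selected.append(p)
        candidates = [q for q in candidates
                      if su(features[p], features[q]) < su_c[q]]
    return selected

# Whole data set for the example: C = F1 OR F2, F3 = NOT F2, F5 = NOT F4.
rows = list(product([0, 1], repeat=3))  # columns: F1, F2, F4
features = {
    "F1": [r[0] for r in rows],
    "F2": [r[1] for r in rows],
    "F3": [1 - r[1] for r in rows],
    "F4": [r[2] for r in rows],
    "F5": [1 - r[2] for r in rows],
}
labels = [r[0] | r[1] for r in rows]

selected = fcbf(features, labels, delta=0.1)
print(selected)
```

On this data F4 and F5 are dropped in the relevance step, and exactly one of F2 and F3 is dropped in the redundancy step because they are perfectly correlated with each other. Which of the two survives depends on how ties in $SU_{i,c}$ are ordered.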


Experiments

FCBF is compared with ReliefF, CorrSF, and ConsSF

Summary of the 10 data sets


Results


Results (cont.)


Pros and Cons

Advantages
- Very fast
- Selects fewer features with higher accuracy

Disadvantage
- Cannot detect some features
  - Example: given 4 features generated by 4 Gaussian functions plus 4 additional redundant features, FCBF selected only 3 features


Discussion

FCBF compares only individual features with each other

Try using PCA first to capture groups of features, then apply FCBF to the result.


Reference

L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proc. 20th Int. Conf. on Machine Learning (ICML-03), pages 856–863, 2003.

J. Biesiada and W. Duch. Feature selection for high-dimensional data: A Kolmogorov-Smirnov correlation-based filter solution. CORES'05, Advances in Soft Computing, Springer Verlag, pages 95–104, 2005.

www.cse.msu.edu/~ptan/SDM07/Yu-Ye-Liu.pdf
www1.cs.columbia.edu/~jebara/6772/proj/Keith.ppt


Thank you!

Q and A