Feature selection
Using slides by Gideon Dror, Alon Kaufman and Roy
Learning to Classify
Learning of binary classification
• Given: a set of m examples (x_i, y_i), i = 1, 2, …, m, sampled from some distribution D, where x_i ∈ R^n and y_i ∈ {-1, +1}
• Find: a function f: R^n -> {-1, +1} which classifies ‘well’ examples x_j sampled from D.
Examples:
– Microarray data: separating malignant from healthy tissues
– Text categorization: spam detection
– Face detection: discriminating human faces from non-faces
Learning algorithms: decision trees, nearest neighbors, Bayesian networks, neural networks, Support Vector Machines, …
Advantages of dimensionality reduction
– May improve performance of the classification algorithm by removing irrelevant features
– Defies the curse of dimensionality, giving improved generalization
– The classification algorithm may not scale up to the size of the full feature set, either in space or time
– Allows us to better understand the domain
– Cheaper to collect and store data based on the reduced feature set
Two approaches for dimensionality reduction
– Feature construction
– Feature selection (this talk)
Methods of Feature construction
• Linear methods
– Principal component analysis (PCA)
– ICA
– Fisher linear discriminant
– …
• Non-linear methods
– Non-linear component analysis (NLCA)
– Kernel PCA
– Locally linear embedding (LLE)
– …
Feature selection
• Given examples (x_i, y_i) where x_i ∈ R^n, select a minimal subset of features which maximizes the performance (accuracy, …).
• Exhaustive search is computationally prohibitive, except for a small number of dimensions.
• There are 2^n − 1 possible combinations.
• Basically it is an optimization problem, where the classification error is the function to be minimized.
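As a toy illustration of this search space (a sketch, not from the slides), enumerating every non-empty subset of n = 4 hypothetical features confirms the 2^n − 1 count; for the thousands of features in a microarray this is hopeless:

```python
from itertools import combinations

n = 4                      # toy number of features; real problems have thousands
features = list(range(n))

# All non-empty feature subsets: choose k features for every k = 1..n
subsets = [c for k in range(1, n + 1) for c in combinations(features, k)]

print(len(subsets), 2 ** n - 1)   # both print 15
```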
Feature selection methods
Filter methods
Wrapper methods
Embedded methods
[Diagram: how the feature selection step is combined with the classifier in the filter, wrapper, and embedded approaches]
Filtering
– Order all features according to strength of association with the target yi
– Various measures of association may be used:
• Pearson correlation: R(X_i) = cov(X_i, Y) / (σ_{X_i} σ_Y)
• χ² statistic (for discrete variables X_i)
• Fisher criterion: F(X_i) = |μ_i⁺ − μ_i⁻| / ((σ_i⁺)² + (σ_i⁻)²)
• Golub criterion: F(X_i) = |μ_i⁺ − μ_i⁻| / (σ_i⁺ + σ_i⁻)
• Mutual information: I(X_i, Y) = Σ p(X_i, Y) log( p(X_i, Y) / (p(X_i) p(Y)) )
• …
(μ_i± and σ_i± are the mean and standard deviation of feature X_i within the positive/negative class.)
– Choose the first k features and feed them to the classifier
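A minimal numpy sketch of this filtering scheme using the Pearson correlation criterion (illustrative only; the array names X and y and the toy data are assumptions, not from the slides):

```python
import numpy as np

def pearson_filter(X, y, k):
    """Rank features by |Pearson correlation with the target| and keep the top k.

    X: (m, n) data matrix, y: (m,) labels in {-1, +1}.
    Returns the indices of the k selected features.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # cov(X_i, y) / (sigma_{X_i} * sigma_y), computed for all features at once
    r = (Xc * yc[:, None]).mean(axis=0) / (X.std(axis=0) * y.std() + 1e-12)
    return np.argsort(-np.abs(r))[:k]

# Toy usage: 100 samples, 20 features, labels driven mostly by feature 3
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = np.sign(X[:, 3] + 0.1 * rng.normal(size=100))
print(pearson_filter(X, y, k=5))
```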
Wrappers
Use the classifier as a black box to search, in the space of feature subsets, for the subset which maximizes classification accuracy.
The search is exponentially hard, so heuristic search is used.
A common example of heuristic search is hill climbing: keep adding features one at a time until no further improvement can be achieved (“forward selection”)
Alternatively we can start with the full set of predictors and keep removing features one at a time until no further improvement can be achieved (“backward selection”)
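A minimal sketch of forward selection with the classifier treated as a black box (illustrative only; it assumes scikit-learn is available, wraps a k-NN classifier, and the data X, y are toy assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, clf, max_features=10, cv=5):
    """Greedy forward selection: keep adding the single feature that most
    improves cross-validated accuracy until no feature helps any more."""
    selected, best_score = [], -np.inf
    while len(selected) < max_features:
        candidates = [i for i in range(X.shape[1]) if i not in selected]
        scores = {i: cross_val_score(clf, X[:, selected + [i]], y, cv=cv).mean()
                  for i in candidates}
        i_best = max(scores, key=scores.get)
        if scores[i_best] <= best_score:      # no further improvement
            break
        selected.append(i_best)
        best_score = scores[i_best]
    return selected, best_score

# Toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 15))
y = (X[:, 0] + X[:, 4] > 0).astype(int)
print(forward_selection(X, y, KNeighborsClassifier(n_neighbors=3)))
```

Backward selection works the same way, starting from the full feature set and greedily removing the feature whose removal hurts the score the least.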
Embedded methods: Recursive Feature Elimination - RFE
0. Set V = n (the total number of features)
1. Build a linear Support Vector Machine classifier using the V features
2. Compute the weight vector w = Σ_i α_i y_i x_i of the optimal hyperplane; omit the V/2 features with the lowest |w_i|
3. Repeat steps 1 and 2 until one feature is left
4. Choose the feature subset that gives the best performance (using cross-validation)
(Has strong theoretical justification)
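A minimal sketch along these lines (illustrative, not the original RFE code; it assumes scikit-learn's LinearSVC and a hypothetical X, y, and drops the lowest-|w_i| half of the surviving features each round):

```python
import numpy as np
from sklearn.svm import LinearSVC

def rfe_halving(X, y):
    """Recursive feature elimination: train a linear SVM, drop the half of the
    surviving features with the smallest |w_i|, repeat until one feature is left.
    Returns the nested feature subsets, largest to smallest."""
    surviving = np.arange(X.shape[1])
    subsets = [surviving.copy()]
    while len(surviving) > 1:
        w = LinearSVC(dual=False).fit(X[:, surviving], y).coef_.ravel()
        keep = np.argsort(np.abs(w))[len(surviving) // 2:]   # top half by |w_i|
        surviving = surviving[keep]
        subsets.append(surviving.copy())
    return subsets   # the best subset is then picked by cross-validation (not shown)

# Toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 16))
y = np.where(X[:, 2] - X[:, 7] > 0, 1, -1)
for s in rfe_halving(X, y):
    print(len(s), sorted(s))
```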
Margin Based Feature SelectionTheory and Algorithms
Ran Gilad-Bachrach, Amir Navot and Naftali Tishby
• Feature selection based on the quality of the margins the features induce
• Idea: use the large-margin principle for feature selection
• Supervised classification problem
• Case-study predictor: 1-NN
Margins
• Margins measure the classifier’s confidence
• Sample margin – the distance between the instance and the decision boundary (as in SVM)
• Hypothesis margin – given an instance, the distance between the hypothesis and the closest hypothesis that assigns an alternative label
• In the 1-NN case (Crammer et al. 2002):
– Previous results: the hypothesis margin lower-bounds the sample margin
– The hypothesis margin of an instance x is θ(x) = ½ ( ||x − nearmiss(x)|| − ||x − nearhit(x)|| )
• Motivation: choose the features that induce large margins
Margins
• Given a weight vector w over the features, the weighted hypothesis margin is
θ_w(x) = ½ ( ||x − nearmiss(x)||_w − ||x − nearhit(x)||_w ), where ||z||_w = sqrt( Σ_i w_i² z_i² )
• The evaluation function is defined for any weight vector w over the features:
e(w) = Σ_{x ∈ S} θ^w_{S\x}(x)
[Figure: an instance x shown with its nearhit(x) and nearmiss(x)]
(Crammer et al. 2002, Bachrach et al. 2004)
Margins For 1-NN
θ(x) = ½ ( ||x − nearmiss(x)|| − ||x − nearhit(x)|| )
Iterative Search Based Algorithm (Simba)
• For a set S with m samples and N features:
1. w = (1, 1, …, 1)
2. For t = 1:T (number of iterations):
   a. Pick a random instance x from S
   b. Calculate nearmiss(x) and nearhit(x) considering w
   c. For i = 1:N compute
      Δ_i = ½ ( (x_i − nearmiss(x)_i)² / ||x − nearmiss(x)||_w − (x_i − nearhit(x)_i)² / ||x − nearhit(x)||_w ) · w_i
   d. w = w + Δ
3. w ← w² / ||w²||_∞  (where (w²)_i := w_i²)
• Complexity: Θ(TNm), i.e. Θ(Nm²) for T = m iterations
• The simple, unweighted form of the per-feature update is w_i = w_i + (x_i − nearmiss(x)_i)² − (x_i − nearhit(x)_i)²
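A minimal numpy sketch of this update loop, under the reading of the slide above (illustrative, not the authors' code; the arrays X, y and the toy data are assumptions, and degenerate cases such as a class with a single sample are not handled):

```python
import numpy as np

def w_dist(a, b, w):
    """Weighted distance ||a - b||_w = sqrt(sum_i w_i^2 (a_i - b_i)^2)."""
    return np.sqrt(np.sum((w ** 2) * (a - b) ** 2)) + 1e-12   # epsilon avoids division by zero

def simba(X, y, T=200, seed=0):
    m, N = X.shape
    rng = np.random.default_rng(seed)
    w = np.ones(N)
    for _ in range(T):
        j = rng.integers(m)
        x, label = X[j], y[j]
        # nearest hit / nearest miss of x under the current weights, excluding x itself
        dists = np.array([w_dist(x, X[k], w) if k != j else np.inf for k in range(m)])
        hit = np.argmin(np.where(y == label, dists, np.inf))
        miss = np.argmin(np.where(y != label, dists, np.inf))
        delta = 0.5 * ((x - X[miss]) ** 2 / w_dist(x, X[miss], w)
                       - (x - X[hit]) ** 2 / w_dist(x, X[hit], w)) * w
        w = w + delta
    w2 = w ** 2
    return w2 / np.max(np.abs(w2))   # final feature weights, scaled to [0, 1]

# Toy usage: the informative features (0 and 5) should receive large weights
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = np.sign(X[:, 0] + X[:, 5])
print(np.round(simba(X, y), 2))
```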
Application: Face Images
• AR face database
• 1,456 images of females and males
• 5,100 features
• Train: 1,000 faces; test: 456
Faces – Average Results
Unsupervised feature selection
• Background: Motivation and Methods
• Our solution
– SVD-entropy and the CE criterion
– Three feature selection methods
• Results
R. Varshavsky, A. Gottlieb, M. Linial, D. Horn. ISMB 2006
Background: Motivation
• Gene Expression, Sequence Similarities
• ‘Curse of dimensionality’, dimension reduction, compression
– Thousands to tens of thousands of genes in an array
– Number of proteins in databases > a million
• Noise
The Data: An Example
• Gene expression experiments
[Figure: the gene expression data matrix, with samples along one axis and genes/features along the other]
Background: Methods
• Extraction vs. selection
• Most methods are supervised (i.e., have an objective function)
• Unsupervised:
– Variance
– Projection on the first PC (e.g., ‘gene-shaving’)
– Statistically significant overabundance (Ben-Dor et al., 2001)
SVD in gene expression
Our Solution: SVD-Entropy
• The normalized relative values (Wall et al., 2003)*:
V_j = s_j² / Σ_k s_k²
* s_j² are the eigenvalues of the [n×n] XX' matrix
• SVD-entropy (Alter et al., 2000):
E = −(1 / log N) Σ_{j=1..N} V_j log(V_j)
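A minimal numpy sketch of these two formulas (illustrative; the matrix name X and the toy data are assumptions):

```python
import numpy as np

def svd_entropy(X):
    """SVD-entropy of a data matrix (Alter et al., 2000):
    V_j = s_j^2 / sum_k s_k^2,  E = -(1/log N) * sum_j V_j log V_j."""
    s = np.linalg.svd(X, compute_uv=False)   # singular values; s**2 are the eigenvalues of XX'
    V = s ** 2 / np.sum(s ** 2)
    V = V[V > 0]                             # treat 0 * log 0 as 0
    return -np.sum(V * np.log(V)) / np.log(len(s))

# Toy usage: a nearly rank-1 matrix has low entropy, a random one has high entropy
rng = np.random.default_rng(0)
low = np.outer(rng.normal(size=30), rng.normal(size=8)) + 0.01 * rng.normal(size=(30, 8))
high = rng.normal(size=(30, 8))
print(round(svd_entropy(low), 2), round(svd_entropy(high), 2))
```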
SVD-Entropy (Example)
[Figure: two bar plots of normalized value V_j vs. component #]
A comparison of two eigenvalue distributions; the left has high entropy (0.87) and the right one has low entropy (0.14).
CE – Contribution to the Entropy
• The contribution of the i-th feature to the overall entropy is determined according to a leave-one-out measurement:
CE_i = E(X_[n×m]) − E(X_[n×(m−1)])
Golub AML/ALL data
CEs suggest 3 groups of features
• CE_i > c: high contribution → meaningful (?)
• CE_i ≈ c: average contribution → neutral
• CE_i < c: low contribution → uniformity
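A minimal sketch of the CE computation by leave-one-out over features (illustrative; it repeats the svd_entropy helper from the earlier sketch, treats features as columns following the slide's X_[n×m] notation, and the choice c = mean(CE) is an assumption, not taken from the slides):

```python
import numpy as np

def svd_entropy(X):
    # SVD-entropy, as in the previous sketch
    s = np.linalg.svd(X, compute_uv=False)
    V = s ** 2 / np.sum(s ** 2)
    V = V[V > 0]
    return -np.sum(V * np.log(V)) / np.log(len(s))

def ce_scores(X):
    """CE_i = E(full matrix) - E(matrix with feature/column i removed)."""
    E_full = svd_entropy(X)
    return np.array([E_full - svd_entropy(np.delete(X, i, axis=1))
                     for i in range(X.shape[1])])

# Toy usage: split features into groups around c = mean(CE) (assumed threshold)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 12))
X[:, 0] += np.repeat([0.0, 3.0], 20)       # give feature 0 a two-group structure
ce = ce_scores(X)
c = ce.mean()
print("high:", np.where(ce > c)[0], "low:", np.where(ce < c)[0])
```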
Three Feature Selection Methods
• Simple Ranking (SR)
• Forward Selection (FS), two variants:
1. Aggregate the feature with the highest CE, one at a time (FS1)
2. Select and remove the feature with the highest CE, one at a time (FS2)
• Backward Elimination (BE)
Fauquet virus problem
61 viruses. 18 features (amino-acid compositions of coat proteins of the viruses). Four known classes.
Ranking of the different methods
Test: classification results
Results - Example (Golub et al. 1999)
[Figure: the leukemia expression data matrix — samples vs. genes/features]
• Leukemia
– 72 patients (samples)
– 7,129 genes
– 4 groups
• Two major types: ALL & AML
– T & B cells in ALL
– With/without treatment in AML
Results (Cont’)
Results (Cont’)
[Figure: overlap of features among methods — Venn diagram of the feature sets selected by SR, FS1, and FS2 (numbers shown: 54, 35, 8, 3, 11, 38)]
[Figure: Jaccard score vs. number of features selected (5 to 300), comparing FS1, SR, all features, variance, and random selection]
Results (Cont’)
Clustering Assessment
Jaccard = n11 / (n11 + n01 + n10)
Purity = n11 / (n11 + n01)   (specificity)
Efficiency = n11 / (n11 + n10)   (sensitivity)
• n11 – number of pairs that are classified together, both in the ‘real’ classification and by the algorithm
• n10 – number of pairs that are classified together in the ‘real’ classification, but not by the algorithm
• n01 – number of pairs that are classified together by the algorithm, but not in the ‘real’ classification
[Figure: how pairs are counted as n11, n10, and n01 between the ‘real’ classification and the algorithm’s clustering]
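A minimal sketch of these pair-counting scores (illustrative; the label arrays real and pred are hypothetical):

```python
import numpy as np
from itertools import combinations

def pair_counting_scores(real, pred):
    """Count sample pairs co-clustered in the real labels and/or the algorithm's
    labels, then return (Jaccard, Purity, Efficiency)."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(real)), 2):
        same_real, same_pred = real[i] == real[j], pred[i] == pred[j]
        n11 += same_real and same_pred
        n10 += same_real and not same_pred
        n01 += same_pred and not same_real
    jaccard = n11 / (n11 + n01 + n10)
    purity = n11 / (n11 + n01)        # specificity
    efficiency = n11 / (n11 + n10)    # sensitivity
    return jaccard, purity, efficiency

# Toy usage
real = np.array([0, 0, 0, 1, 1, 1, 2, 2])
pred = np.array([0, 0, 1, 1, 1, 1, 2, 2])
print([round(v, 2) for v in pair_counting_scores(real, pred)])
```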