Feature selection (using slides by Gideon Dror, Alon Kaufman and Roy)


Page 1: Feature selection

Feature selection

Using slides by Gideon Dror, Alon Kaufman and Roy

Page 2: Feature selection

Learning to Classify

Learning of binary classification:
• Given: a set of m examples (xi, yi), i = 1, 2, …, m, sampled from some distribution D, where xi ∈ R^n and yi ∈ {−1, +1}
• Find: a function f: R^n → {−1, +1} which classifies 'well' examples xj sampled from D.

Examples:
– Microarray data: separate malignant from healthy tissues
– Text categorization: spam detection
– Face detection: discriminating human faces from non-faces

Learning algorithms: decision trees, nearest neighbors, Bayesian networks, neural networks, Support Vector Machines, …

Page 3: Feature selection

Advantages of dimensionality reduction

– May improve performance of the classification algorithm by removing irrelevant features

– Defying the curse of dimensionality: improved generalization

– The classification algorithm may not scale up to the size of the full feature set, either in space or in time

– Allows us to better understand the domain

– Cheaper to collect and store data based on the reduced feature set

Page 4: Feature selection

Two approaches for dimensionality reduction

– Feature construction
– Feature selection (this talk)

Page 5: Feature selection

Methods of Feature construction

• Linear methods
– Principal component analysis (PCA)
– ICA
– Fisher linear discriminant
– …

• Non-linear methods
– Non-linear component analysis (NLCA)
– Kernel PCA
– Local linear embedding (LLE)
– …

Page 6: Feature selection

Feature selection

• Given examples (xi, yi), where xi ∈ R^n, select a minimal subset of features which maximizes performance (accuracy, …).

• Exhaustive search is computationally prohibitive except for a small number of dimensions: there are 2^n − 1 possible non-empty feature subsets (a brute-force sketch follows below).

• Essentially this is an optimization problem in which the classification error is the function to be minimized.
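To make the 2^n − 1 blow-up concrete, here is a minimal brute-force sketch (the helper name exhaustive_search is hypothetical; scikit-learn and a 1-NN classifier are assumed as the black-box performance measure). It scores every non-empty subset by cross-validated accuracy and is only workable for a handful of features.

import numpy as np
from itertools import combinations
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def exhaustive_search(X, y, cv=5):
    # Evaluate all 2^n - 1 non-empty feature subsets; feasible only for small n.
    clf = KNeighborsClassifier(n_neighbors=1)
    n = X.shape[1]
    best_subset, best_score = None, -np.inf
    for r in range(1, n + 1):
        for subset in combinations(range(n), r):
            score = cross_val_score(clf, X[:, list(subset)], y, cv=cv).mean()
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score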

Page 7: Feature selection

Feature selection methods

[Diagram: filter methods (feature selection followed by the classifier), wrapper methods (feature selection wrapped around the classifier), embedded methods (selection inside the classifier)]

Page 8: Feature selection

Filtering

– Order all features according to the strength of their association with the target yi

– Various measures of association may be used:
• Pearson correlation: R(Xi) = cov(Xi, Y) / (σ(Xi) σ(Y))
• χ² (for discrete variables Xi)
• Fisher criterion scoring: F(Xi) = |μ+(Xi) − μ−(Xi)| / (σ+(Xi)² + σ−(Xi)²)
• Golub criterion: F(Xi) = |μ+(Xi) − μ−(Xi)| / (σ+(Xi) + σ−(Xi))
• Mutual information: I(Xi, Y) = Σ p(Xi, Y) log( p(Xi, Y) / (p(Xi) p(Y)) )
• …

– Choose the first k features and feed them to the classifier (a ranking sketch follows below)
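As a concrete illustration of filtering, here is a short NumPy sketch (the names golub_score and filter_select are hypothetical) that ranks features by the Golub signal-to-noise criterion above and keeps the k strongest ones; any of the other association measures could be plugged in the same way.

import numpy as np

def golub_score(X, y):
    # Golub criterion per feature: |mu+ - mu-| / (sigma+ + sigma-)
    Xp, Xn = X[y == +1], X[y == -1]
    num = np.abs(Xp.mean(axis=0) - Xn.mean(axis=0))
    den = Xp.std(axis=0) + Xn.std(axis=0) + 1e-12   # guard against zero spread
    return num / den

def filter_select(X, y, k, score=golub_score):
    # Rank all features by the association score and keep the top k.
    scores = score(X, y)
    top = np.argsort(scores)[::-1][:k]
    return top, X[:, top]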

Page 9: Feature selection

Wrappers

Use the classifier as a black box to search the space of feature subsets for the subset which maximizes classification accuracy.

Search is exponentially hard.

A common example of heuristic search is hill climbing: keep adding features one at a time until no further improvement can be achieved (“forward selection”)

Alternatively we can start with the full set of predictors and keep removing features one at a time until no further improvement can be achieved (“backward selection”)
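A minimal greedy forward-selection wrapper might look as follows (the name forward_selection is hypothetical; scikit-learn's cross_val_score and a 1-NN classifier stand in for the black-box classifier). It keeps adding the single best feature until cross-validated accuracy stops improving; backward selection is the mirror image, starting from the full set.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, clf=None, cv=5):
    # Greedy hill climbing over feature subsets ("forward selection").
    clf = clf or KNeighborsClassifier(n_neighbors=1)
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        # Try adding each remaining feature and keep the best candidate.
        scores = [(cross_val_score(clf, X[:, selected + [j]], y, cv=cv).mean(), j)
                  for j in remaining]
        score, j = max(scores)
        if score <= best_score:        # no further improvement: stop
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected, best_score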

Page 10: Feature selection

Embedded methods: Recursive Feature Elimination - RFE

0. Set V = n (the total number of features)
1. Build a linear Support Vector Machine classifier using the V current features
2. Compute the weight vector w = Σi αi yi xi of the optimal hyperplane; omit the V/2 features with the lowest |wi|
3. Repeat steps 1 and 2 until one feature is left
4. Choose the feature subset that gives the best performance (using cross-validation)

(Has strong theoretical justification)
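A sketch of the halving procedure above, assuming scikit-learn's linear SVC and cross_val_score (the helper name rfe_halving is hypothetical): at each round it drops the half of the active features with the smallest |wi| and finally picks the nested subset with the best cross-validated accuracy.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def rfe_halving(X, y, cv=5):
    active = np.arange(X.shape[1])
    subsets = []
    while True:
        subsets.append(active.copy())
        if len(active) == 1:
            break
        svm = SVC(kernel="linear").fit(X[:, active], y)
        w = np.abs(svm.coef_.ravel())            # |w_i| of the separating hyperplane
        keep = np.argsort(w)[len(active) // 2:]  # keep the half with largest |w_i|
        active = active[np.sort(keep)]
    # Pick the nested subset with the best cross-validated accuracy.
    return max(subsets, key=lambda s: cross_val_score(
        SVC(kernel="linear"), X[:, s], y, cv=cv).mean())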

Page 11: Feature selection

Margin Based Feature Selection: Theory and Algorithms

Ran Gilad-Bachrach, Amir Navot and Naftali Tishby

• Feature selection based on the quality of the margins the features induce

• Idea: use of large margin principle for feature selection

• Supervised classification problem

• “study-case” predictor: 1-NN

Page 12: Feature selection

Margins

• Margins measure the classifier's confidence

• Sample-margin: the distance between the instance and the decision boundary (as in SVM)

• Hypothesis-margin: given an instance, the distance between the hypothesis and the closest hypothesis that assigns an alternative label

• In the 1-NN case (Crammer et al. 2002):
– Previous results: the hypothesis margin lower-bounds the sample margin
– θ(x) = ½ ( ||x − nearmiss(x)|| − ||x − nearhit(x)|| )

• Motivation: choose the features that induce large margins

Page 13: Feature selection

Margins

• Given a weight vector w over the features, distances are measured with the weighted norm ||z||w = √( Σi wi² zi² ), giving the weighted margin

θw(x) = ½ ( ||x − nearmiss(x)||w − ||x − nearhit(x)||w )

• The evaluation function defined for any weight vector w over the features is

e(w) = Σ x∈S θw^(S\x)(x)

(the margin of each x is computed with x left out of the sample S)

Page 14: Feature selection

Margins for 1-NN

θ(x) = ½ ( ||x − nearmiss(x)|| − ||x − nearhit(x)|| )

[Figure: an instance x with its nearhit(x) and nearmiss(x) neighbours]

(Crammer et al. 2002, Bachrach et al. 2004)
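A direct NumPy transcription of this 1-NN hypothesis margin (the helper name hypothesis_margin is hypothetical; it assumes x itself appears in the sample and is skipped by its zero distance):

import numpy as np

def hypothesis_margin(x, label, X, y):
    # theta(x) = 0.5 * (||x - nearmiss(x)|| - ||x - nearhit(x)||)
    d = np.linalg.norm(X - x, axis=1)
    same = (y == label) & (d > 0)      # same label, excluding x itself
    other = y != label
    nearhit = d[same].min()            # distance to nearest same-label point
    nearmiss = d[other].min()          # distance to nearest other-label point
    return 0.5 * (nearmiss - nearhit)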

Page 15: Feature selection

Iterative Search Based Algorithm (Simba)

• For a set S with m samples and N features:

1. w = (1, 1, …, 1)
2. For t = 1 … T (number of iterations):
   a. Pick a random instance x from S
   b. Calculate nearmiss(x) and nearhit(x) with respect to w
   c. For i = 1 … N:
      Δi = ½ ( (xi − nearmiss(x)i)² / ||x − nearmiss(x)||w − (xi − nearhit(x)i)² / ||x − nearhit(x)||w ) wi
   d. w = w + Δ
3. w ← w² / ||w²||∞

• Complexity: Θ(TNm) (for T = m this is Θ(Nm²))

• Simplified, unweighted form of the update: wi = wi + (xi − nearmiss(x)i)² − (xi − nearhit(x)i)²
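A compact sketch of one possible implementation (the name simba is hypothetical; NumPy only), following the weighted-distance update above; it is meant to mirror the pseudocode, not to reproduce the authors' exact code.

import numpy as np

def simba(X, y, T=1000, seed=0):
    m, N = X.shape
    w = np.ones(N)
    rng = np.random.default_rng(seed)
    for _ in range(T):
        i = rng.integers(m)
        x, label = X[i], y[i]
        # Weighted distances ||x - z||_w = sqrt(sum_i w_i^2 (x_i - z_i)^2)
        dist = np.sqrt(((X - x) ** 2 * w ** 2).sum(axis=1))
        dist[i] = np.inf                                      # skip x itself
        hit = np.argmin(np.where(y == label, dist, np.inf))   # nearhit(x)
        miss = np.argmin(np.where(y != label, dist, np.inf))  # nearmiss(x)
        delta = 0.5 * ((x - X[miss]) ** 2 / dist[miss]
                       - (x - X[hit]) ** 2 / dist[hit]) * w
        w = w + delta
    return w ** 2 / np.max(w ** 2)     # final normalization w <- w^2 / ||w^2||_inf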

Page 16: Feature selection

Iterative Search Based Algorithm(Simba)

• For a set S with m samples and N features:1. W=(1,1,1…..1)2. For t=1:T (number iterations)

a. Pick a random instance x from Sb. Calculate nearmiss(x) and nearhit(x) considering wc. For i=1:N

d. w=w+

3.

(TNm) / (Nm2)

2 2( ( ) ) ( ( ) )1

2 ( ) ( )i i i i

i ii iw w

x nearmiss x x nearhit xw

x nearmiss x x nearhit x

2 2/w w w

wi=wi+(xi-nearmiss(x)i) 2-(xi-nearhit(x)i)2

Page 17: Feature selection

Application: Face Images

• AR face database
• 1456 images of females and males
• 5100 features
• Train: 1000 faces, test: 456

Page 18: Feature selection

Faces – Average Results

Page 19: Feature selection

Unsupervised feature selection

• Background: Motivation and Methods

• Our solution:
– SVD-entropy and the CE criterion
– Three feature selection methods

• Results

R. Varshavsky, A. Gottlieb, M. Linial, D. Horn. ISMB 2006

Page 20: Feature selection

Background: Motivation

• Gene expression, sequence similarities

• 'Curse of dimensionality', dimension reduction, compression:
– Thousands to tens of thousands of genes in an array
– Number of proteins in databases > one million

• Noise

Page 21: Feature selection

The Data: An Example

• Gene expression experiments
[Figure: data matrix of samples × genes/features]

Page 22: Feature selection

Background: Methods

• Extraction vs. selection

• Most methods are supervised (i.e., have an objective function)

• Unsupervised:
– Variance
– Projection on the first PC (e.g., 'gene-shaving')
– Statistically significant overabundance (Ben-Dor et al., 2001)

Page 23: Feature selection

SVD in genes expression

Page 24: Feature selection

Our Solution: SVD-Entropy

• The normalized relative values (Wall et al., 2003)*:

Vj = sj² / Σk sk²

* sj² are the eigenvalues of the [n×n] XX′ matrix

• SVD-entropy (Alter et al., 2000):

E = −(1 / log N) Σ j=1..N Vj log(Vj)
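A small NumPy sketch of the two definitions above (the name svd_entropy is hypothetical):

import numpy as np

def svd_entropy(X):
    # V_j = s_j^2 / sum_k s_k^2,  E = -(1/log N) * sum_j V_j log V_j
    s = np.linalg.svd(X, compute_uv=False)
    V = s ** 2 / np.sum(s ** 2)
    V = V[V > 0]                      # treat 0*log(0) as 0
    return float(-np.sum(V * np.log(V)) / np.log(len(s)))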

Page 25: Feature selection

SVD-Entropy (Example)

[Figure: two bar charts of normalized value vs. component # (1-5).] A comparison of two eigenvalue distributions; the left has high entropy (0.87) and the right one has low entropy (0.14).

Page 26: Feature selection

CE – Contribution to the Entropy

• The contribution of the i-th feature to the overall entropy is determined by a leave-one-out measurement, where the i-th feature (column) is the one removed:

CEi = E(X[n×m]) − E(X[n×(m−1)])
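Under the same assumptions (features stored as columns of X, svd_entropy as sketched earlier), CE can be computed by leaving each feature out in turn; the helper name ce_scores is hypothetical.

import numpy as np

def ce_scores(X):
    # CE_i = E(X) - E(X with feature/column i removed)
    E_full = svd_entropy(X)
    return np.array([E_full - svd_entropy(np.delete(X, i, axis=1))
                     for i in range(X.shape[1])])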

Page 27: Feature selection

Golub AML/ALL data

Page 28: Feature selection

CEs suggest 3 groups of features

• CEi>c high contribution meaningful (?)

• CEi=c average contribution neutral

• CEi<c low contribution uniformity

Page 29: Feature selection

Three Feature Selection Methods

• Simple Ranking (SR)

• Forward Selection (FS):
1. FS1: aggregate the feature with the highest CE, one at a time
2. FS2: select and remove the feature with the highest CE, one at a time

• Backward Elimination (BE)
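For illustration only, here is one plausible reading of SR and of a greedy CE-driven forward selection (the names simple_ranking and forward_selection_ce are hypothetical, building on the ce_scores sketch above); the exact FS1/FS2 and BE update rules follow Varshavsky et al. 2006 and are not reproduced here.

import numpy as np

def simple_ranking(X, k):
    # SR: rank all features once by CE and keep the k highest.
    return np.argsort(ce_scores(X))[::-1][:k]

def forward_selection_ce(X, k):
    # Greedy variant: repeatedly recompute CE over the remaining candidates
    # and pick the feature with the highest contribution.
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        ce = ce_scores(X[:, remaining])
        best = remaining[int(np.argmax(ce))]
        selected.append(best)
        remaining.remove(best)
    return selected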

Page 30: Feature selection

Fauquet virus problem

61 viruses. 18 features (amino-acid compositions of coat proteins of the viruses). Four known classes.

Page 31: Feature selection

Ranking of the different methods

Page 32: Feature selection

Test: classification results

Page 33: Feature selection

Results - Example (Golub et al. 1999)

• Leukemia:
– 72 patients (samples)
– 7129 genes
– 4 groups: the two major types, ALL & AML, with ALL split into T cells & B cells and AML split into with/without treatment

Page 34: Feature selection

Results (Cont’)

Page 35: Feature selection

Results (Cont’)

[Figures: overlap of selected features among SR, FS1 and FS2, and Jaccard score (y-axis, 0.2-0.8) vs. number of features selected (x-axis, 5-300) for FS1, SR, All Features, Variance and Random]

Page 36: Feature selection

Overlap of features among methods

Page 37: Feature selection

Results (Cont’)

Page 38: Feature selection

Clustering Assessment

Jaccard = n11 / (n11 + n01 + n10)

Purity = n11 / (n11 + n01)   (specificity)

Efficiency = n11 / (n11 + n10)   (sensitivity)

• n11: number of pairs that are classified together, both in the 'real' classification and by the algorithm
• n10: number of pairs that are classified together in the 'real' classification, but not by the algorithm
• n01: number of pairs that are classified together by the algorithm, but not in the 'real' classification

[Figure: illustration of n11, n10 and n01 pairs for a 'real' vs. algorithm grouping of four items]
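The three pair-counting scores can be computed directly from two label vectors; a small self-contained sketch (the names pair_counts and assessment are hypothetical):

from itertools import combinations

def pair_counts(real, algo):
    # Count pairs grouped together in the real labels and/or by the algorithm.
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(real)), 2):
        same_real, same_algo = real[i] == real[j], algo[i] == algo[j]
        n11 += same_real and same_algo
        n10 += same_real and not same_algo
        n01 += same_algo and not same_real
    return n11, n10, n01

def assessment(real, algo):
    n11, n10, n01 = pair_counts(real, algo)
    return {"Jaccard": n11 / (n11 + n01 + n10),
            "Purity": n11 / (n11 + n01),       # specificity
            "Efficiency": n11 / (n11 + n10)}   # sensitivity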