Feature selection
Using slides by Gideon Dror, Alon Kaufman and Roy
Learning to Classify
Learning of binary classification
• Given: a set of m examples (x_i, y_i), i = 1, 2, …, m, sampled from some distribution D, where x_i ∈ R^n and y_i ∈ {-1, +1}
• Find: a function f: R^n -> {-1, +1} which classifies ‘well’ examples x_j sampled from D.
Examples:
– Microarray data: separating malignant from healthy tissues
– Text categorization: spam detection
– Face detection: discriminating human faces from non-faces
Learning algorithms: decision trees, nearest neighbors, Bayesian networks, neural networks, Support Vector Machines, …
Advantages of dimensionality reduction
– May improve performance of the classification algorithm by removing irrelevant features
– Defies the curse of dimensionality, giving improved generalization
– The classification algorithm may not scale up to the size of the full feature set, either in space or time
– Allows us to better understand the domain
– Cheaper to collect and store data based on the reduced feature set
Two approaches for dimensionality reduction
– Feature construction
– Feature selection (this talk)
Methods of Feature construction
• Linear methods
– Principal component analysis (PCA)
– ICA
– Fisher linear discriminant
– …
• Non-linear methods
– Non-linear component analysis (NLCA)
– Kernel PCA
– Locally linear embedding (LLE)
– …
Feature selection
• Given examples (x_i, y_i) where x_i ∈ R^n, select a minimal subset of features which maximizes the performance (accuracy, …).
• Exhaustive search is computationally prohibitive, except for a small number of dimensions.
• There are 2^n − 1 possible combinations.
• Basically it is an optimization problem, where the classification error is the function to be minimized.
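As a toy illustration of this search space (a sketch, not from the slides), enumerating every non-empty subset of n = 4 hypothetical features confirms the 2^n − 1 count; for the thousands of features in a microarray this is hopeless:

```python
from itertools import combinations

n = 4                      # toy number of features; real problems have thousands
features = list(range(n))

# All non-empty feature subsets: choose k features for every k = 1..n
subsets = [c for k in range(1, n + 1) for c in combinations(features, k)]

print(len(subsets), 2 ** n - 1)   # both print 15
```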
Feature selection methods
Filter methods
Wrapper methods
Embedded methods
[Diagram: how the feature selection step is combined with the classifier in the filter, wrapper, and embedded approaches]
Filtering
– Order all features according to strength of association with the target yi
– Various measures of association may be used:
• Pearson correlation: R(X_i) = cov(X_i, Y) / (σ_{X_i} σ_Y)
• χ² statistic (for discrete variables X_i)
• Fisher criterion: F(X_i) = |μ_i⁺ − μ_i⁻| / ((σ_i⁺)² + (σ_i⁻)²)
• Golub criterion: F(X_i) = |μ_i⁺ − μ_i⁻| / (σ_i⁺ + σ_i⁻)
• Mutual information: I(X_i, Y) = Σ p(X_i, Y) log( p(X_i, Y) / (p(X_i) p(Y)) )
• …
(μ_i± and σ_i± are the mean and standard deviation of feature X_i within the positive/negative class.)
– Choose the first k features and feed them to the classifier
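A minimal numpy sketch of this filtering scheme using the Pearson correlation criterion (illustrative only; the array names X and y and the toy data are assumptions, not from the slides):

```python
import numpy as np

def pearson_filter(X, y, k):
    """Rank features by |Pearson correlation with the target| and keep the top k.

    X: (m, n) data matrix, y: (m,) labels in {-1, +1}.
    Returns the indices of the k selected features.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # cov(X_i, y) / (sigma_{X_i} * sigma_y), computed for all features at once
    r = (Xc * yc[:, None]).mean(axis=0) / (X.std(axis=0) * y.std() + 1e-12)
    return np.argsort(-np.abs(r))[:k]

# Toy usage: 100 samples, 20 features, labels driven mostly by feature 3
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = np.sign(X[:, 3] + 0.1 * rng.normal(size=100))
print(pearson_filter(X, y, k=5))
```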
Wrappers
Use the classifier as a black box to search, in the space of feature subsets, for the subset which maximizes classification accuracy.
The search is exponentially hard, so heuristic search is used.
A common example of heuristic search is hill climbing: keep adding features one at a time until no further improvement can be achieved (“forward selection”)
Alternatively we can start with the full set of predictors and keep removing features one at a time until no further improvement can be achieved (“backward selection”)
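A minimal sketch of forward selection with the classifier treated as a black box (illustrative only; it assumes scikit-learn is available, wraps a k-NN classifier, and the data X, y are toy assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, clf, max_features=10, cv=5):
    """Greedy forward selection: keep adding the single feature that most
    improves cross-validated accuracy until no feature helps any more."""
    selected, best_score = [], -np.inf
    while len(selected) < max_features:
        candidates = [i for i in range(X.shape[1]) if i not in selected]
        scores = {i: cross_val_score(clf, X[:, selected + [i]], y, cv=cv).mean()
                  for i in candidates}
        i_best = max(scores, key=scores.get)
        if scores[i_best] <= best_score:      # no further improvement
            break
        selected.append(i_best)
        best_score = scores[i_best]
    return selected, best_score

# Toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 15))
y = (X[:, 0] + X[:, 4] > 0).astype(int)
print(forward_selection(X, y, KNeighborsClassifier(n_neighbors=3)))
```

Backward selection works the same way, starting from the full feature set and greedily removing the feature whose removal hurts the score the least.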
Embedded methods: Recursive Feature Elimination - RFE
0. Set V = n (the total number of features)
1. Build a linear Support Vector Machine classifier using the V features
2. Compute the weight vector w = Σ_i α_i y_i x_i of the optimal hyperplane; omit the V/2 features with the lowest |w_i|
3. Repeat steps 1 and 2 until one feature is left
4. Choose the feature subset that gives the best performance (using cross-validation)
(Has strong theoretical justification)
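A minimal sketch along these lines (illustrative, not the original RFE code; it assumes scikit-learn's LinearSVC and a hypothetical X, y, and drops the lowest-|w_i| half of the surviving features each round):

```python
import numpy as np
from sklearn.svm import LinearSVC

def rfe_halving(X, y):
    """Recursive feature elimination: train a linear SVM, drop the half of the
    surviving features with the smallest |w_i|, repeat until one feature is left.
    Returns the nested feature subsets, largest to smallest."""
    surviving = np.arange(X.shape[1])
    subsets = [surviving.copy()]
    while len(surviving) > 1:
        w = LinearSVC(dual=False).fit(X[:, surviving], y).coef_.ravel()
        keep = np.argsort(np.abs(w))[len(surviving) // 2:]   # top half by |w_i|
        surviving = surviving[keep]
        subsets.append(surviving.copy())
    return subsets   # the best subset is then picked by cross-validation (not shown)

# Toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 16))
y = np.where(X[:, 2] - X[:, 7] > 0, 1, -1)
for s in rfe_halving(X, y):
    print(len(s), sorted(s))
```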
Margin Based Feature SelectionTheory and Algorithms
Ran Gilad-Bachrach, Amir Navot and Naftali Tishby
• Feature selection based on the quality of the margins the features induce
• Idea: use the large-margin principle for feature selection
• Supervised classification problem
• Case-study predictor: 1-NN
Margins
• Margins measure the classifier’s confidence
• Sample margin – the distance between the instance and the decision boundary (as in SVM)
• Hypothesis margin – given an instance, the distance between the hypothesis and the closest hypothesis that assigns an alternative label
• In the 1-NN case (Crammer et al. 2002):
– Previous results: the hypothesis margin lower-bounds the sample margin
– The hypothesis margin of an instance x is θ(x) = ½ ( ||x − nearmiss(x)|| − ||x − nearhit(x)|| )
• Motivation: choose the features that induce large margins
Margins
• Given a weight vector w over the features, the weighted hypothesis margin is
θ_w(x) = ½ ( ||x − nearmiss(x)||_w − ||x − nearhit(x)||_w ), where ||z||_w = sqrt( Σ_i w_i² z_i² )
• The evaluation function is defined for any weight vector w over the features:
e(w) = Σ_{x ∈ S} θ^w_{S\x}(x)
[Figure: an instance x shown with its nearhit(x) and nearmiss(x)]
(Crammer et al. 2002, Bachrach et al. 2004)
Margins For 1-NN
θ(x) = ½ ( ||x − nearmiss(x)|| − ||x − nearhit(x)|| )
Iterative Search Based Algorithm (Simba)
• For a set S with m samples and N features:
1. w = (1, 1, …, 1)
2. For t = 1:T (number of iterations):
   a. Pick a random instance x from S
   b. Calculate nearmiss(x) and nearhit(x) considering w
   c. For i = 1:N compute
      Δ_i = ½ ( (x_i − nearmiss(x)_i)² / ||x − nearmiss(x)||_w − (x_i − nearhit(x)_i)² / ||x − nearhit(x)||_w ) · w_i
   d. w = w + Δ
3. w ← w² / ||w²||_∞  (where (w²)_i := w_i²)
• Complexity: Θ(TNm), i.e. Θ(Nm²) for T = m iterations
• The simple, unweighted form of the per-feature update is w_i = w_i + (x_i − nearmiss(x)_i)² − (x_i − nearhit(x)_i)²
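A minimal numpy sketch of this update loop, under the reading of the slide above (illustrative, not the authors' code; the arrays X, y and the toy data are assumptions, and degenerate cases such as a class with a single sample are not handled):

```python
import numpy as np

def w_dist(a, b, w):
    """Weighted distance ||a - b||_w = sqrt(sum_i w_i^2 (a_i - b_i)^2)."""
    return np.sqrt(np.sum((w ** 2) * (a - b) ** 2)) + 1e-12   # epsilon avoids division by zero

def simba(X, y, T=200, seed=0):
    m, N = X.shape
    rng = np.random.default_rng(seed)
    w = np.ones(N)
    for _ in range(T):
        j = rng.integers(m)
        x, label = X[j], y[j]
        # nearest hit / nearest miss of x under the current weights, excluding x itself
        dists = np.array([w_dist(x, X[k], w) if k != j else np.inf for k in range(m)])
        hit = np.argmin(np.where(y == label, dists, np.inf))
        miss = np.argmin(np.where(y != label, dists, np.inf))
        delta = 0.5 * ((x - X[miss]) ** 2 / w_dist(x, X[miss], w)
                       - (x - X[hit]) ** 2 / w_dist(x, X[hit], w)) * w
        w = w + delta
    w2 = w ** 2
    return w2 / np.max(np.abs(w2))   # final feature weights, scaled to [0, 1]

# Toy usage: the informative features (0 and 5) should receive large weights
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = np.sign(X[:, 0] + X[:, 5])
print(np.round(simba(X, y), 2))
```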
Application: Face Images
• AR face database
• 1,456 images of females and males
• 5,100 features
• Train: 1,000 faces; test: 456
Faces – Average Results
Unsupervised feature selection
• Background: Motivation and Methods
• Our solution
– SVD-entropy and the CE criterion
– Three feature selection methods
• Results
R. Varshavsky, A. Gottlieb, M. Linial, D. Horn. ISMB 2006
Background: Motivation
• Gene Expression, Sequence Similarities
• ‘Curse of dimensionality’, dimension reduction, compression
– Thousands to tens of thousands of genes in an array
– Number of proteins in databases > a million
• Noise
The Data: An Example
• Gene expression experiments
[Figure: the gene expression data matrix, with samples along one axis and genes/features along the other]
Background: Methods
• Extraction vs. selection
• Most methods are supervised (i.e., have an objective function)
• Unsupervised:
– Variance
– Projection on the first PC (e.g., ‘gene-shaving’)
– Statistically significant overabundance (Ben-Dor et al., 2001)
SVD in gene expression
Our Solution: SVD-Entropy
• The normalized relative values (Wall et al., 2003)*:
V_j = s_j² / Σ_k s_k²
* s_j² are the eigenvalues of the [n×n] XX' matrix
• SVD-entropy (Alter et al., 2000):
E = −(1 / log N) Σ_{j=1..N} V_j log(V_j)
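A minimal numpy sketch of these two formulas (illustrative; the matrix name X and the toy data are assumptions):

```python
import numpy as np

def svd_entropy(X):
    """SVD-entropy of a data matrix (Alter et al., 2000):
    V_j = s_j^2 / sum_k s_k^2,  E = -(1/log N) * sum_j V_j log V_j."""
    s = np.linalg.svd(X, compute_uv=False)   # singular values; s**2 are the eigenvalues of XX'
    V = s ** 2 / np.sum(s ** 2)
    V = V[V > 0]                             # treat 0 * log 0 as 0
    return -np.sum(V * np.log(V)) / np.log(len(s))

# Toy usage: a nearly rank-1 matrix has low entropy, a random one has high entropy
rng = np.random.default_rng(0)
low = np.outer(rng.normal(size=30), rng.normal(size=8)) + 0.01 * rng.normal(size=(30, 8))
high = rng.normal(size=(30, 8))
print(round(svd_entropy(low), 2), round(svd_entropy(high), 2))
```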
SVD-Entropy (Example)
[Figure: two bar plots of normalized value V_j vs. component #]
A comparison of two eigenvalue distributions; the left has high entropy (0.87) and the right one has low entropy (0.14).
CE – Contribution to the Entropy
• The contribution of the i-th feature to the overall entropy is determined according to a leave-one-out measurement:
CE_i = E(X_[n×m]) − E(X_[n×(m−1)])
Golub AML/ALL data
CEs suggest 3 groups of features
• CE_i > c: high contribution → meaningful (?)
• CE_i ≈ c: average contribution → neutral
• CE_i < c: low contribution → uniformity
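A minimal sketch of the CE computation by leave-one-out over features (illustrative; it repeats the svd_entropy helper from the earlier sketch, treats features as columns following the slide's X_[n×m] notation, and the choice c = mean(CE) is an assumption, not taken from the slides):

```python
import numpy as np

def svd_entropy(X):
    # SVD-entropy, as in the previous sketch
    s = np.linalg.svd(X, compute_uv=False)
    V = s ** 2 / np.sum(s ** 2)
    V = V[V > 0]
    return -np.sum(V * np.log(V)) / np.log(len(s))

def ce_scores(X):
    """CE_i = E(full matrix) - E(matrix with feature/column i removed)."""
    E_full = svd_entropy(X)
    return np.array([E_full - svd_entropy(np.delete(X, i, axis=1))
                     for i in range(X.shape[1])])

# Toy usage: split features into groups around c = mean(CE) (assumed threshold)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 12))
X[:, 0] += np.repeat([0.0, 3.0], 20)       # give feature 0 a two-group structure
ce = ce_scores(X)
c = ce.mean()
print("high:", np.where(ce > c)[0], "low:", np.where(ce < c)[0])
```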
Three Feature Selection Methods
• Simple Ranking (SR)
• Forward Selection (FS), two variants:
1. Aggregate the feature with the highest CE, one at a time (FS1)
2. Select and remove the feature with the highest CE, one at a time (FS2)
• Backward Elimination (BE)
Fauquet virus problem
61 viruses. 18 features (amino-acid compositions of coat proteins of the viruses). Four known classes.
Ranking of the different methods
Test: classification results
Results - Example (Golub et al. 1999)
[Figure: the leukemia expression data matrix — samples vs. genes/features]
• Leukemia
– 72 patients (samples)
– 7,129 genes
– 4 groups
• Two major types: ALL & AML
– T & B cells in ALL
– With/without treatment in AML
Results (Cont’)
Results (Cont’)
[Figure: overlap of features among methods — Venn diagram of the feature sets selected by SR, FS1, and FS2 (numbers shown: 54, 35, 8, 3, 11, 38)]
[Figure: Jaccard score vs. number of features selected (5 to 300), comparing FS1, SR, all features, variance, and random selection]
Results (Cont’)
Clustering Assessment
Jaccard = n11 / (n11 + n01 + n10)
Purity = n11 / (n11 + n01)   (specificity)
Efficiency = n11 / (n11 + n10)   (sensitivity)
• n11 – number of pairs that are classified together, both in the ‘real’ classification and by the algorithm
• n10 – number of pairs that are classified together in the ‘real’ classification, but not by the algorithm
• n01 – number of pairs that are classified together by the algorithm, but not in the ‘real’ classification
[Figure: how pairs are counted as n11, n10, and n01 between the ‘real’ classification and the algorithm’s clustering]
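A minimal sketch of these pair-counting scores (illustrative; the label arrays real and pred are hypothetical):

```python
import numpy as np
from itertools import combinations

def pair_counting_scores(real, pred):
    """Count sample pairs co-clustered in the real labels and/or the algorithm's
    labels, then return (Jaccard, Purity, Efficiency)."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(real)), 2):
        same_real, same_pred = real[i] == real[j], pred[i] == pred[j]
        n11 += same_real and same_pred
        n10 += same_real and not same_pred
        n01 += same_pred and not same_real
    jaccard = n11 / (n11 + n01 + n10)
    purity = n11 / (n11 + n01)        # specificity
    efficiency = n11 / (n11 + n10)    # sensitivity
    return jaccard, purity, efficiency

# Toy usage
real = np.array([0, 0, 0, 1, 1, 1, 2, 2])
pred = np.array([0, 0, 1, 1, 1, 1, 2, 2])
print([round(v, 2) for v in pair_counting_scores(real, pred)])
```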