TRANSCRIPT
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 1
L11: sequential feature selection
• Feature extraction vs. feature selection
• Search strategy and objective functions
• Objective functions
  – Filters
  – Wrappers
• Sequential search strategies
  – Sequential forward selection
  – Sequential backward selection
  – Plus-L minus-R selection
  – Bidirectional search
  – Floating search
Feature extraction vs. feature selection
• As discussed in L9, there are two general approaches to dimensionality reduction
  – Feature extraction: transform the existing features into a lower-dimensional space
  – Feature selection: select a subset of the existing features without a transformation
  Feature selection:  [x1, x2, …, xN] → [x_i1, x_i2, …, x_iM]
  Feature extraction: [x1, x2, …, xN] → [y1, y2, …, yM] = f([x1, x2, …, xN])
• Feature extraction was covered in L9-10
  – We derived the "optimal" linear features for two objective functions
    • Signal representation: PCA
    • Signal classification: LDA
• Feature selection, also called feature subset selection (FSS) in the literature, will be the subject of the last two lectures
  – Although FSS can be thought of as a special case of feature extraction (think of a sparse projection matrix with a few ones), in practice it is a quite different problem
  – FSS looks at the issue of dimensionality reduction from a different perspective
  – FSS has a unique set of methodologies
Feature subset selection
• Definition
  – Given a feature set X = {x_i | i = 1…N}, find a subset X_M, with M < N, that maximizes an objective function J(X_M), ideally P(correct):

    X_M = {x_i1, x_i2, …, x_iM} = argmax_{M, i_m} J({x_im | m = 1…M})
• Why feature subset selection?
  – Why not use the more general feature extraction methods, and simply project a high-dimensional feature vector onto a low-dimensional space?
• Feature subset selection is necessary in a number of situations
  – Features may be expensive to obtain
    • You evaluate a large number of features (sensors) in the test bed and select only a few for the final implementation
  – You may want to extract meaningful rules from your classifier
    • When you project, the measurement units of your features (length, weight, etc.) are lost
  – Features may not be numeric, a typical situation in machine learning
• In addition, fewer features means fewer model parameters
  – Improved generalization capabilities
  – Reduced complexity and run-time
Search strategy and objective function
• FSS requires
  – A search strategy to select candidate subsets
  – An objective function to evaluate these candidates
• Search strategy
  – Exhaustive evaluation of feature subsets involves C(N, M) combinations for a fixed value of M, and 2^N combinations if M must be optimized as well
    • This number of combinations is unfeasible, even for moderate values of M and N, so a search procedure must be used in practice
    • For example, exhaustive evaluation of 10 out of 20 features involves 184,756 feature subsets; exhaustive evaluation of 10 out of 100 involves more than 10^13 feature subsets [Devijver and Kittler, 1982]
  – A search strategy is therefore needed to direct the FSS process as it explores the space of all possible combinations of features
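For a sense of scale, these counts can be computed directly; a quick sketch using only the standard library (the helper name is our own):

```python
from math import comb

def n_subsets(N, M=None):
    """Subsets to evaluate exhaustively: C(N, M) for a fixed subset
    size M, or 2^N when the subset size must be optimized as well."""
    return comb(N, M) if M is not None else 2 ** N

print(n_subsets(20, 10))   # 184756, the figure quoted above
print(n_subsets(100, 10))  # 17310309456440, i.e. more than 10^13
```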
• Objective function
  – The objective function evaluates candidate subsets and returns a measure of their "goodness", a feedback signal used by the search strategy to select new candidates
[Figure: feature subset selection loop — the search generates candidate feature subsets from the complete feature set, the objective function returns their information content as feedback, and the final feature subset is passed, with the training data, to the PR algorithm]
Objective function
• Objective functions are divided into two groups
  – Filters: evaluate subsets by their information content, e.g., interclass distance, statistical dependence, or information-theoretic measures
  – Wrappers: use a classifier to evaluate subsets by their predictive accuracy (on test data) by statistical resampling or cross-validation
[Figure: filter FSS — the search loop scores each candidate feature subset by its information content, and the final feature subset is handed, with the training data, to the ML algorithm. Wrapper FSS — the search loop scores each candidate subset by the predictive accuracy of a PR algorithm trained on it, and the final subset is used to train the final PR algorithm]
Filter types
• Distance or separability measures
  – These methods measure class separability using metrics such as
    • Distance between classes: Euclidean, Mahalanobis, etc.
    • Determinant of S_W^{-1} S_B (LDA eigenvalues)
• Correlation and information-theoretic measures
  – These methods are based on the rationale that good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other
  – Linear relation measures
    • Linear relationship between variables can be measured using the correlation coefficient:

      J(Y_M) = Σ_{i=1…M} ρ_ic / Σ_{i=1…M} Σ_{j=i+1…M} ρ_ij

    • where ρ_ic is the correlation coefficient between feature i and the class label, and ρ_ij is the correlation coefficient between features i and j
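As an illustration, this ratio can be computed directly from data; the sketch below assumes features arrive as columns of a NumPy array and uses absolute Pearson correlations (the function name and exact normalization are our own choices, not from the lecture):

```python
import numpy as np

def correlation_merit(X, y, subset):
    """Filter score for a feature subset: summed feature-class
    correlation (relevance) divided by summed pairwise
    feature-feature correlation (redundancy)."""
    idx = list(subset)
    relevance = sum(abs(np.corrcoef(X[:, i], y)[0, 1]) for i in idx)
    redundancy = sum(abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                     for a, i in enumerate(idx) for j in idx[a + 1:])
    return relevance / redundancy if redundancy else relevance
```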
  – Non-linear relation measures
    • Correlation is only capable of measuring linear dependence
    • A more powerful measure is the mutual information I(Y_M; C):

      J(Y_M) = I(Y_M; C) = H(C) − H(C|Y_M) = Σ_{c=1…C} ∫ p(Y_M, ω_c) log [ p(Y_M, ω_c) / (p(Y_M) P(ω_c)) ] dY_M

    • The mutual information between the feature vector and the class label, I(Y_M; C), measures the amount by which the uncertainty in the class, H(C), is decreased by knowledge of the feature vector, H(C|Y_M), where H(·) is the entropy function
    • Note that mutual information requires the computation of the multivariate densities p(Y_M) and p(Y_M, C), which is ill-posed for high-dimensional spaces
    • In practice [Battiti, 1994], mutual information is replaced by a heuristic such as

      J(Y_M) = Σ_{i=1…M} I(x_i; C) − β Σ_{i=1…M} Σ_{j=i+1…M} I(x_i; x_j)
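A sketch of this heuristic for discrete features, with mutual information estimated from empirical counts (the function names and the toy encoding are illustrative, not from the lecture):

```python
import numpy as np
from collections import Counter

def mutual_info(a, b):
    """I(a; b) in nats, estimated from empirical joint counts
    of two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * np.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def mifs_score(features, cls, subset, beta=0.5):
    """Battiti-style subset score: summed relevance I(x_i; C) minus
    beta times summed pairwise redundancy I(x_i; x_j)."""
    subset = list(subset)
    relevance = sum(mutual_info(features[i], cls) for i in subset)
    redundancy = sum(mutual_info(features[i], features[j])
                     for a, i in enumerate(subset) for j in subset[a + 1:])
    return relevance - beta * redundancy
```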
Filters vs. wrappers
• Filters
  – Fast execution (+): filters generally involve a non-iterative computation on the dataset, which can execute much faster than a classifier training session
  – Generality (+): since filters evaluate the intrinsic properties of the data, rather than their interactions with a particular classifier, their results exhibit more generality: the solution will be "good" for a larger family of classifiers
  – Tendency to select large subsets (−): since filter objective functions are generally monotonic, the filter tends to select the full feature set as the optimal solution. This forces the user to select an arbitrary cutoff on the number of features to be selected
• Wrappers
  – Accuracy (+): wrappers generally achieve better recognition rates than filters, since they are tuned to the specific interactions between the classifier and the dataset
  – Ability to generalize (+): wrappers have a mechanism to avoid overfitting, since they typically use cross-validation measures of predictive accuracy
  – Slow execution (−): since the wrapper must train a classifier for each feature subset (or several classifiers if cross-validation is used), the method can become unfeasible for computationally intensive methods
  – Lack of generality (−): the solution lacks generality since it is tied to the bias of the classifier used in the evaluation function. The "optimal" feature subset will be specific to the classifier under consideration
Search strategies
• Exponential algorithms (Lecture 12)
  – Evaluate a number of subsets that grows exponentially with the dimensionality of the search space
    • Exhaustive Search (already discussed)
    • Branch and Bound
    • Approximate Monotonicity with Branch and Bound
    • Beam Search
• Sequential algorithms (Lecture 11)
  – Add or remove features sequentially, but have a tendency to become trapped in local minima
    • Sequential Forward Selection
    • Sequential Backward Selection
    • Plus-L Minus-R Selection
    • Bidirectional Search
    • Sequential Floating Selection
• Randomized algorithms (Lecture 12)
  – Incorporate randomness into their search procedure to escape local minima
    • Random Generation plus Sequential Selection
    • Simulated Annealing
    • Genetic Algorithms
Naïve sequential feature selection
• One may be tempted to evaluate each individual feature separately and select the best M features
  – Unfortunately, this strategy RARELY works, since it does not account for feature dependence
• Example
  – The figures show a 4D problem with 5 classes
  – Any reasonable objective function will rank features according to this sequence: J(x1) > J(x2) ≈ J(x3) > J(x4)
    • x1 is the best feature: it separates {ω1, ω2, ω3} from {ω4, ω5}
    • x2 and x3 are equivalent, and separate the classes into three groups
    • x4 is the worst feature: it can only separate ω4 from ω5
  – The optimal feature subset turns out to be {x1, x4}, because x4 provides the only information that x1 needs: discrimination between classes ω4 and ω5
  – However, if we were to choose features according to the individual scores J(x_i), we would certainly pick x1 and either x2 or x3, leaving classes ω4 and ω5 non-separable
    • This naïve strategy fails because it does not consider features with complementary information
Sequential forward selection (SFS)
• SFS is the simplest greedy search algorithm
  – Starting from the empty set, sequentially add the feature x+ that maximizes J(Y_k + x+) when combined with the features Y_k that have already been selected
• Notes
  – SFS performs best when the optimal subset is small
    • When the search is near the empty set, a large number of states can potentially be evaluated
    • Towards the full set, the region examined by SFS is narrower, since most features have already been selected
  – The search space is drawn like an ellipse to emphasize the fact that there are fewer states towards the full or empty sets
  – The main disadvantage of SFS is that it is unable to remove features that become obsolete after the addition of other features

1. Start with the empty set Y_0 = ∅
2. Select the next best feature: x+ = argmax_{x ∉ Y_k} J(Y_k + x)
3. Update Y_{k+1} = Y_k + x+; k = k + 1
4. Go to 2
[Figure: lattice of feature subsets for N = 4, from the empty set (0000) through the single-feature subsets (1000, 0100, 0010, 0001) up to the full set (1111); SFS moves from the empty set towards the full set]
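The four steps above translate almost line by line into code; a minimal sketch, assuming J takes a set of feature indices and returns a score:

```python
def sfs(J, n_features, n_select):
    """Sequential forward selection: grow the subset greedily, adding
    the feature whose inclusion maximizes the objective J."""
    Y = set()
    while len(Y) < n_select:
        # step 2: best feature among those not yet selected
        best = max((x for x in range(n_features) if x not in Y),
                   key=lambda x: J(Y | {x}))
        Y.add(best)  # step 3: update the selected subset
    return Y
```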
Example
• Run SFS to completion for the following objective function:

  J(X) = −2x1x2 + 3x1 + 5x2 − 2x1x2x3 + 7x3 + 4x4 − 2x1x2x3x4

  – where the x_i are indicator variables, which indicate whether the i-th feature has been selected (x_i = 1) or not (x_i = 0)
• Solution
  J(x1)=3  J(x2)=5  J(x3)=7  J(x4)=4
  J(x3x1)=10  J(x3x2)=12  J(x3x4)=11
  J(x3x2x1)=11  J(x3x2x4)=16
  J(x3x2x4x1)=13
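The trace can be checked mechanically; a sketch that evaluates the objective on the 0/1 indicator variables:

```python
# Objective from the example, with x1..x4 as 0/1 indicator variables.
def J(x1, x2, x3, x4):
    return (-2*x1*x2 + 3*x1 + 5*x2 - 2*x1*x2*x3
            + 7*x3 + 4*x4 - 2*x1*x2*x3*x4)

print(J(1,0,0,0), J(0,1,0,0), J(0,0,1,0), J(0,0,0,1))  # 3 5 7 4 -> add x3
print(J(1,0,1,0), J(0,1,1,0), J(0,0,1,1))              # 10 12 11 -> add x2
print(J(1,1,1,0), J(0,1,1,1))                          # 11 16 -> add x4
print(J(1,1,1,1))                                      # 13
```

Note how the score peaks at {x2, x3, x4} with J = 16 and then drops to 13 when x1 is forced in: running SFS "to completion" overshoots the best subset it visited along the way.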
Sequential backward selection (SBS)
• SBS works in the opposite direction of SFS
  – Starting from the full set, sequentially remove the feature x− whose removal least reduces the value of the objective function J(Y_k − x−)
    • Removing a feature may actually increase the objective function, J(Y_k − x−) > J(Y_k); such objective functions are said to be non-monotonic (more on this when we cover Branch and Bound)
• Notes
  – SBS works best when the optimal feature subset is large, since SBS spends most of its time visiting large subsets
  – The main limitation of SBS is its inability to reevaluate the usefulness of a feature after it has been discarded

1. Start with the full set Y_0 = X
2. Remove the worst feature: x− = argmax_{x ∈ Y_k} J(Y_k − x)
3. Update Y_{k+1} = Y_k − x−; k = k + 1
4. Go to 2
[Figure: subset lattice; SBS moves from the full set towards the empty set]
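A matching sketch of SBS, with the same conventions as before (J maps a set of feature indices to a score):

```python
def sbs(J, n_features, n_select):
    """Sequential backward selection: shrink from the full set, always
    dropping the feature whose removal hurts the objective J least."""
    Y = set(range(n_features))
    while len(Y) > n_select:
        # step 2: the "worst" feature is the one whose removal
        # leaves the highest-scoring remaining subset
        worst = max(Y, key=lambda x: J(Y - {x}))
        Y.discard(worst)  # step 3: update the subset
    return Y
```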
Plus-L minus-R selection (LRS)
• A generalization of SFS and SBS
  – If L > R, LRS starts from the empty set and repeatedly adds L features and removes R features
  – If L < R, LRS starts from the full set and repeatedly removes R features followed by L additions
• Notes
  – LRS attempts to compensate for the weaknesses of SFS and SBS with some backtracking capabilities
  – Its main limitation is the lack of a theory to help predict the optimal values of L and R

1. If L > R, start with the empty set Y_0 = ∅; else start with the full set Y_0 = X and go to step 3
2. Repeat L times:
   x+ = argmax_{x ∉ Y_k} J(Y_k + x)
   Y_{k+1} = Y_k + x+; k = k + 1
3. Repeat R times:
   x− = argmax_{x ∈ Y_k} J(Y_k − x)
   Y_{k+1} = Y_k − x−; k = k + 1
4. Go to 2
[Figure: subset lattice; LRS alternates additions and removals between the empty and full sets]
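A sketch of LRS for the L > R (start-from-the-empty-set) case. One simplification relative to the listing above: this version stops as soon as the target subset size is reached, even in the middle of a phase:

```python
def lrs(J, n_features, n_select, L, R):
    """Plus-L minus-R selection for L > R: alternate L greedy
    additions with R greedy removals until the target size is hit."""
    assert L > R, "this sketch covers only the start-from-empty case"
    Y = set()
    while True:
        for _ in range(L):  # plus-L phase
            best = max((x for x in range(n_features) if x not in Y),
                       key=lambda x: J(Y | {x}))
            Y.add(best)
            if len(Y) == n_select:
                return Y
        for _ in range(R):  # minus-R phase
            worst = max(Y, key=lambda x: J(Y - {x}))
            Y.discard(worst)
```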
Bidirectional Search (BDS)
• BDS is a parallel implementation of SFS and SBS
  – SFS is performed from the empty set
  – SBS is performed from the full set
  – To guarantee that SFS and SBS converge to the same solution
    • Features already selected by SFS are not removed by SBS
    • Features already removed by SBS are not selected by SFS

[Figure: subset lattice; SFS and SBS search from opposite ends]

1. Start SFS with Y_F0 = ∅
2. Start SBS with Y_B0 = X
3. Select the best feature:
   x+ = argmax_{x ∉ Y_Fk, x ∈ Y_Bk} J(Y_Fk + x)
   Y_F(k+1) = Y_Fk + x+
4. Remove the worst feature:
   x− = argmax_{x ∈ Y_Bk, x ∉ Y_F(k+1)} J(Y_Bk − x)
   Y_B(k+1) = Y_Bk − x−; k = k + 1
5. Go to 3
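A sketch of BDS with the two protection rules made explicit (the stopping rule, returning the forward subset once it reaches the target size, is our own simplification):

```python
def bds(J, n_features, n_select):
    """Bidirectional search: SFS and SBS run in lockstep; SFS may only
    add features SBS has not dropped, and SBS may only drop features
    SFS has not selected."""
    YF, YB = set(), set(range(n_features))
    while len(YF) < n_select:
        # forward step: candidates are features still alive in YB
        cand = [x for x in YB if x not in YF]
        best = max(cand, key=lambda x: J(YF | {x}))
        YF.add(best)
        # backward step: never drop a feature already selected by SFS
        cand = [x for x in YB if x not in YF]
        if cand and len(YB) > n_select:
            worst = max(cand, key=lambda x: J(YB - {x}))
            YB.discard(worst)
    return YF
```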
Sequential floating selection (SFFS and SFBS)
• An extension to LRS with flexible backtracking capabilities
  – Rather than fixing the values of L and R, these floating methods allow those values to be determined from the data
  – The dimensionality of the subset during the search can be thought of as "floating" up and down
• There are two floating methods
  – Sequential floating forward selection (SFFS) starts from the empty set
    • After each forward step, SFFS performs backward steps as long as the objective function increases
  – Sequential floating backward selection (SFBS) starts from the full set
    • After each backward step, SFBS performs forward steps as long as the objective function increases
• SFFS Algorithm (SFBS is analogous)

[Figure: subset lattice; the floating search moves up and down between the empty and full sets]

1. Y_0 = ∅
2. Select the best feature:
   x+ = argmax_{x ∉ Y_k} J(Y_k + x)
   Y_k = Y_k + x+; k = k + 1
3. Select the worst feature*:
   x− = argmax_{x ∈ Y_k} J(Y_k − x)
4. If J(Y_k − x−) > J(Y_k), then
   Y_{k+1} = Y_k − x−; k = k + 1
   go to step 3
   else go to step 2

*Notice that you'll need to do book-keeping to avoid infinite loops
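A sketch of SFFS in which the starred book-keeping is a set of already-visited subsets: a backward step is taken only if it both improves J and does not revisit a subset the search has already settled on (the visited-set scheme and the size-2 backtracking threshold are our own implementation choices):

```python
def sffs(J, n_features, n_select):
    """Sequential floating forward selection: after every forward
    step, keep removing features while removal improves J; the
    visited set prevents infinite add/remove cycles."""
    Y, visited = set(), set()
    while len(Y) < n_select:
        # forward step, as in plain SFS
        best = max((x for x in range(n_features) if x not in Y),
                   key=lambda x: J(Y | {x}))
        Y.add(best)
        # conditional backward steps
        while len(Y) > 2:
            worst = max(Y, key=lambda x: J(Y - {x}))
            if (J(Y - {worst}) > J(Y)
                    and frozenset(Y - {worst}) not in visited):
                Y.discard(worst)
            else:
                break
        visited.add(frozenset(Y))
    return Y
```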