Feature Selection Stability Analysis for Classification Using Microarray Data
Panagiotis Moulos
Université Libre de Bruxelles, DEA Bioinformatique
Outline
- Introduction: Motivation; Stability Measure Approach; The bias/variance tradeoff; Contributions
- Materials and Methods: Stability Metrics; Example (Hamming Distance); Experimental Analysis
- Results: Visualizing Instability; Stability Results; Accuracy Results; Remarks; Feature Aggregation
- Discussion: General Remarks; Future Work
Motivation
Microarrays are invaluable tools for cancer studies at the molecular level: prognosis, early diagnosis.
Microarray data analysis: supervised/unsupervised learning for tumor classification, together with feature selection techniques for important gene identification, yields a cancer genetic signature used for prognosis and (early) diagnosis.
However, these signatures are sensitive to perturbations: a small perturbation (e.g. removing 1 sample) may lead to a completely different signature.
Illustrative example: a dataset with 5 samples (S1, …, S5), expression values for 5 genes (g1, …, g5), and a binary class label (S1: 0, S2: 1, S3: 0, S4: 1, S5: 0).
Feature Selection and Classification on this dataset produce the full gene ranking list (2,5,3,4,1) and the signature (2,5,3).
BUT on a perturbed dataset they may produce the full gene ranking list (1,3,4,5,2) and the signature (1,3,4).
What is the similarity between (2,5,3) and (1,3,4)?
Stability Measure Approach
The problem of similarity between two gene lists can be approached mathematically through the theory of permutations.
Given a set Gn = (g1, g2, …, gn) of objects, a permutation π is a bijective function from Gn to Gn.
In the frame of microarray data:
- The n genes (features) involved are labeled with a unique number between 1, …, n.
- Every gene ranking list (full ranking list) is exactly a permutation π on the set {1, …, n}, where the image π(i) of the i-th gene is its rank inside π.
- If we are interested only in the top N ranked genes of Gn, we define as π* the partial ranking list of Gn, which contains the first N elements of π.
Stability Measure Approach (Example)
A full ranking list: G5 = (1,2,3,4,5)
A permutation: π = (3,2,5,4,1), where π(1) = 3, π(2) = 2, π(3) = 5, π(4) = 4, π(5) = 1
A partial ranking list with the top N = 3 ranked genes: π* = (3,2,5), where π*(1) = 3, π*(2) = 2, π*(3) = 5
How can we summarize variability between full ranking lists π and σ, and between partial ranking lists π* and σ*?
Several metrics have been proposed in the statistical literature (e.g. Critchlow, 1985).
Gene labeling example:
Gene Name    Gene Identification
31307_at     1
31457_at     2
32088_at     3
40821_at     4
40831_at     5
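The ranking-similarity metrics used later (Spearman's Footrule and Kendall's Tau) can be made concrete with a short Python sketch. Here π is the example permutation from this slide, while σ and the helper names are illustrative assumptions, not from the slides:

```python
from itertools import combinations

# pi is the example permutation from the slide; sigma is an
# illustrative second ranking (an assumption, not from the slides).
# pi[i] is the rank assigned to gene i+1.
pi    = [3, 2, 5, 4, 1]
sigma = [1, 3, 4, 5, 2]

def footrule(p, s):
    """Spearman's footrule: sum of absolute rank differences."""
    return sum(abs(a - b) for a, b in zip(p, s))

def kendall_tau(p, s):
    """Kendall's tau distance: number of gene pairs that the two
    rankings order in opposite ways."""
    return sum(
        1
        for i, j in combinations(range(len(p)), 2)
        if (p[i] - p[j]) * (s[i] - s[j]) < 0
    )

print(footrule(pi, sigma))     # 6
print(kendall_tau(pi, sigma))  # 3

# Partial ranking list: the first N = 3 elements of pi, as defined above.
pi_star = pi[:3]
print(pi_star)  # [3, 2, 5]
```

Both distances are 0 for identical rankings and grow as the two lists disagree, which is why they serve as instability measures.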
The bias/variance tradeoff
A central issue in choosing a model for a given problem is selecting the level of structural complexity (number of variables/parameters, etc.) that best suits the data it must accommodate:
- Too many parameters → inclusion of noise → overfitting → variance contribution to the misfit
- Too few parameters → not enough flexibility → bias contribution to the misfit
Deciding on the correct amount of flexibility in a model is therefore a tradeoff between these two sources of misfit. This is called the bias/variance tradeoff.
Contributions
1. Experimental study of signature stability in gene expression datasets, by resampling (bootstrap, jackknife) the datasets for different ranking/feature selection methods
2. Study of several forms of feature selection stability using statistical similarity measures
3. Classification performance assessment for each feature selection and classification method
4. Study of the possible correlation between feature selection stability and classification accuracy for all feature selection and classification methods
5. Proposal of a feature aggregation procedure to obtain more stable probabilistic gene signatures
Stability Metrics: Measures of Feature Selection Stability
Stability of selection: for a given dataset, the stability of appearance of certain features after resampling the original dataset. Metrics: Hamming Distance, Inconsistency.
Stability of ranking: for a given dataset, the stability of both the appearance and the ranking order of certain features after resampling the original dataset. Metrics: Spearman's Footrule, Kendall's Tau.
For m resampled gene lists, each selecting N of the n genes, the normalized Hamming distance is

    H_norm = (2 / (m(m - 1))) * Σ_{i=1}^{m-1} Σ_{j=i+1}^{m} H_ij / (2N),    0 ≤ H_norm ≤ 1

with

    H_ij = Σ_{k=1}^{n} | l_ik - l_jk |,    l_ik = 1 if gene k is selected in resampling i, 0 otherwise.
Example (Hamming Distance)
Example of a stability metric: calculation of the Hamming Distance for m = 5, n = 10, N = 5.
Selected top-5 gene lists from five resamplings of the original dataset:
Resampling 1: (7, 4, 9, 3, 2)
Resampling 2: (9, 7, 4, 8, 3)
Resampling 3: (4, 6, 2, 1, 8)
Resampling 4: (5, 7, 1, 8, 3)
Resampling 5: (1, 5, 8, 3, 2)
Binary selection vectors over the n = 10 genes:
0101001110
0111001100
0010101011
0011010101
0010010111
Pairwise distances include, e.g., H_12 = 2, H_34 = 6, H_45 = 2; the ten pairwise distances sum to 48, so

    H_norm = (2 / (5 · 4)) · (48 / (2 · 5)) = 0.48
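The calculation above can be reproduced directly from the binary selection vectors; this is a minimal sketch (the function and variable names are mine, not from the slides):

```python
from itertools import combinations

# Binary selection vectors from the slide: m = 5 resamplings,
# n = 10 genes, N = 5 genes selected in each run.
vectors = [
    "0101001110",
    "0111001100",
    "0010101011",
    "0011010101",
    "0010010111",
]

def hamming(a, b):
    """Number of positions where two selection vectors differ."""
    return sum(x != y for x, y in zip(a, b))

m, N = len(vectors), 5
pair_dists = [hamming(a, b) for a, b in combinations(vectors, 2)]

# Two size-N signatures differ in at most 2N positions, so dividing
# by 2N maps each pairwise distance into [0, 1]; averaging over the
# m(m-1)/2 pairs gives the overall instability H_norm.
h_norm = sum(d / (2 * N) for d in pair_dists) / (m * (m - 1) / 2)
print(round(h_norm, 2))  # 0.48
```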
Experimental Analysis
Datasets: Breast Cancer (HBC, Tamoxifen), Leukemia (MLL, Golub), Lymphoma
Classification algorithms: k-NN (k = 5), Support Vector Machines
Feature Selection algorithms:
- Filters: Gram–Schmidt orthogonalization; k-NN and SVM correlation-based filters (gene ranking according to the misclassification error of a one-gene trained classifier)
- Wrapper: Sequential Forward Selection wrapper
Feature aggregation (main personal contribution):
- Gather together all different signatures
- Remove duplicates
- Exclude features with low selection frequency according to a threshold
Resampling strategies:
- Bootstrap (on each step, resample the patients of a dataset with replacement)
- Jackknife (on each step, remove 1–5% of the samples)
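The resampling protocol can be sketched as follows. This is a hedged sketch, not the thesis code: `rank_genes` is a hypothetical stand-in for any of the filters or the wrapper named above, and all parameter values are illustrative:

```python
import random

def rank_genes(sample_ids, n_genes=10):
    # Placeholder ranker: a deterministic function of the resample,
    # standing in for Gram-Schmidt, a correlation filter, or the
    # SFS wrapper. Returns a full ranking list (best gene first).
    rng = random.Random(sum(sample_ids))
    order = list(range(n_genes))
    rng.shuffle(order)
    return order

def bootstrap_signatures(samples, m=5, N=3, seed=0):
    """Draw m bootstrap resamples (sampling patients with
    replacement) and keep the top-N genes of each ranking."""
    rng = random.Random(seed)
    sigs = []
    for _ in range(m):
        resample = [rng.choice(samples) for _ in samples]
        sigs.append(rank_genes(resample)[:N])
    return sigs

def jackknife_signatures(samples, m=5, N=3, frac=0.05, seed=0):
    """Draw m jackknife resamples, each removing ~frac of the
    samples, and keep the top-N genes of each ranking."""
    rng = random.Random(seed)
    k = max(1, int(len(samples) * frac))
    sigs = []
    for _ in range(m):
        kept = rng.sample(samples, len(samples) - k)
        sigs.append(rank_genes(kept)[:N])
    return sigs

samples = list(range(30))
print(bootstrap_signatures(samples))
print(jackknife_signatures(samples))
```

The stability metrics of the previous slides are then computed over the m signatures each strategy returns.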
Visualizing Instability
[Figure: visualization of gene signature instability across resamplings.]
Stability Results (1)
Stability of Selection (Bootstrap): [Charts "Overall Stabilities - Bootstrap - Filtering" and "Overall Stabilities - Bootstrap - Wrapping": instability (0–1) per dataset (HBC, Lymphoma, Golub, Tamoxifen, MLL) under the Hamming Distance and Inconsistency metrics, for Gram–Schmidt, Correlation kNN, Correlation SVM and the Random Case; wrapping combinations: Gram–Schmidt > kNN, Gram–Schmidt > SVM, Correlation kNN > kNN, Correlation SVM > SVM.]
Stability of Ranking (Bootstrap): [Chart "Overall Stabilities - Bootstrap - Filtering": instability per dataset under Spearman's Footrule and Kendall's Tau, for the same feature selection methods.]
Stability Results (2)
Stability of Selection (Jackknife): [Charts "Overall Stabilities - Jackknife - Filtering" and "Overall Stabilities - Jackknife - Wrapping": instability (0–1) per dataset (HBC, Lymphoma, Golub, Tamoxifen, MLL) under the Hamming Distance and Inconsistency metrics, for Gram–Schmidt, Correlation kNN, Correlation SVM and the Random Case; wrapping combinations: Gram–Schmidt > kNN, Gram–Schmidt > SVM, Correlation kNN > kNN, Correlation SVM > SVM.]
Stability of Ranking (Jackknife): [Chart "Overall Stabilities - Jackknife - Filtering": instability per dataset under Spearman's Footrule and Kendall's Tau, for the same feature selection methods.]
Accuracy Results
[Chart "Mean Accuracies - No Filtering or Wrapping": accuracy percentage per dataset (HBC, Lymphoma, Golub, Tamoxifen, MLL) for kNN and SVM, under Bootstrap and Jackknife.]
[Charts "Overall Accuracies - Bootstrap" and "Overall Accuracies - Jackknife": accuracy percentage per dataset for Gram–Schmidt > kNN, Gram–Schmidt > SVM, Correlation kNN > kNN, Correlation SVM > SVM, filter vs wrapper.]
Remarks
Stability:
- Stability is inversely proportional to the size of the perturbation
- Gram–Schmidt orthogonalization outperforms the classifier-based correlations
- Filters are more stable than the wrapper
- Correlation between stability of selection and stability of ranking
Accuracy:
- Accuracy is proportional to the size of the perturbation
- Gram–Schmidt orthogonalization is outperformed by the classifier-based correlations
- Filters outperform the wrapper
- Performance is improved after the application of feature selection techniques
Feature Aggregation
- A class permutation test shows no overfitting
- A t-test between mean accuracies before and after aggregation reveals an improvement in the performance of the wrapper, but not of the filters
[Charts "Overall Accuracies after Aggregation - Filter" and "Overall Accuracies after Aggregation - Wrapper": accuracy per dataset (HBC, Lymphoma, Golub, Tamoxifen, MLL) for Gram–Schmidt > kNN, Gram–Schmidt > SVM, Correlation kNN > kNN, Correlation SVM > SVM.]
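The aggregation procedure from the Methods slide (gather all signatures, remove duplicates, exclude low-frequency features) can be sketched as follows; the input signatures reuse the lists from the Hamming-distance example, and the 0.5 threshold is an illustrative assumption:

```python
from collections import Counter

# Signatures found across resamplings (from the Hamming-distance
# example slide; any list of gene-id lists would do).
signatures = [
    [7, 4, 9, 3, 2],
    [9, 7, 4, 8, 3],
    [4, 6, 2, 1, 8],
    [5, 7, 1, 8, 3],
    [1, 5, 8, 3, 2],
]

def aggregate(signatures, threshold=0.5):
    """Return {gene: selection frequency} for genes selected in at
    least `threshold` of the signatures; the surviving frequencies
    form the probabilistic signature."""
    m = len(signatures)
    counts = Counter(g for sig in signatures for g in set(sig))
    return {g: c / m for g, c in counts.items() if c / m >= threshold}

print(aggregate(signatures))
```

The result is a distribution over genes rather than a single fixed list, which is what makes the aggregated signature probabilistic.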
General Remarks
- Similarity metrics: their use depends on what kind of stability we study
- Filters are more stable and accurate than wrappers: although wrappers return few variables, their selection procedure can be highly variant
- One would expect that high stability leads to high accuracy. However, this is not always the case. Why? The best compromise between bias and variance depends on many parameters (feature selection algorithm, top N ranked genes, etc.)
Aggregation:
- Filters: lower variance in feature selection → aggregation does not improve accuracy
- Wrapper: higher variance in feature selection → aggregation improves accuracy (by adjusting variance to achieve a better compromise between bias and variance)
In summary:
- High stability → lower variance in feature selection → (bias/variance tradeoff) less flexible model, higher bias → aggregation does not improve accuracy (e.g. filters)
- Low stability → higher variance in feature selection → (bias/variance tradeoff) more flexible model, lower bias → aggregation improves accuracy (e.g. wrapper)
Conclusions – Future Work
Conclusions:
- We have shown that genetic signatures are sensitive to perturbations
- Stability analysis using similarity metrics is necessary in order to evaluate signature sensitivity
- The aggregation procedure creates a distribution of selected genes which can be used as a more stable probabilistic genetic signature for cancer microarray studies
- It is better to use a more stable probabilistic signature consisting of more genes than a perturbation-sensitive signature consisting of fewer genes
Future work:
- Study gene ranking using Markov Chains (MC): the selection of a gene during the selection process could depend on the previously selected gene (1st-order MC) or on several previously selected genes (higher-order MC)
- Comparison of stability between Forward Selection and Backward Elimination wrappers
- Further research on the relation between stability and accuracy: use of more algorithms, feature ranking based on the stability/accuracy ratio
- Study the effect on genetic signatures of updating classification models with new data
- Biological interpretation of the selected genes in probabilistic signatures
Acknowledgements
Many thanks to:
- Gianluca Bontempi (Machine Learning Group, ULB)
- Christos Sotiriou (Microarray Unit, IJB)
- Benjamin Haibe-Kains (PhD student, MLG ULB, IJB)
- Mrs Yiota Poirazi and the Computational Biology Group, FORTH (for this opportunity)