Feature Selection Stability Analysis for Classification Using Microarray Data
Panagiotis Moulos
Université Libre de Bruxelles, DEA Bioinformatique
Outline
- Introduction: Motivation; Stability Measure Approach; The bias/variance tradeoff; Contributions
- Materials and Methods: Stability Metrics; Example (Hamming Distance); Experimental Analysis
- Results: Visualizing Instability; Stability Results; Accuracy Results; Remarks; Feature Aggregation
- Discussion: General Remarks; Future Work
Motivation
Microarrays are invaluable tools for cancer studies at the molecular level: prognosis, early diagnosis.
Microarray data analysis: supervised/unsupervised learning for tumor classification, together with feature selection techniques for important gene identification, yields a cancer genetic signature used for prognosis and (early) diagnosis.
However, these signatures are sensitive to perturbations: a small perturbation (e.g. removing 1 sample) may lead to a completely different signature.
Illustrative example: a dataset with 5 samples (S1, …, S5), expression values for 5 genes (g1, …, g5), and a binary class label (S1: 0, S2: 1, S3: 0, S4: 1, S5: 0).
Feature Selection and Classification on this dataset produce the full gene ranking list (2,5,3,4,1) and the signature (2,5,3).
BUT on a perturbed dataset they may produce the full gene ranking list (1,3,4,5,2) and the signature (1,3,4).
What is the similarity between (2,5,3) and (1,3,4)?
Stability Measure Approach
The problem of similarity between two gene lists can be approached mathematically through the theory of permutations.
Given a set Gn = (g1, g2, …, gn) of objects, a permutation π is a bijective function from Gn to Gn.
In the frame of microarray data:
- The n genes (features) involved are labeled with a unique number between 1, …, n.
- Every gene ranking list (full ranking list) is exactly a permutation π on the set {1, …, n}, where the image π(i) of the i-th gene is its rank inside π.
- If we are interested only in the top N ranked genes of Gn, we define as π* the partial ranking list of Gn, which contains the first N elements of π.
Stability Measure Approach (Example)
A full ranking list: G5 = (1,2,3,4,5)
A permutation: π = (3,2,5,4,1), where π(1) = 3, π(2) = 2, π(3) = 5, π(4) = 4, π(5) = 1
A partial ranking list with the top N = 3 ranked genes: π* = (3,2,5), where π*(1) = 3, π*(2) = 2, π*(3) = 5
How can we summarize variability between full ranking lists π and σ, and between partial ranking lists π* and σ*?
Several metrics have been proposed in the statistical literature (e.g. Critchlow, 1985).
Gene labeling example:
Gene Name    Gene Identification
31307_at     1
31457_at     2
32088_at     3
40821_at     4
40831_at     5
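The ranking-similarity metrics used later (Spearman's Footrule and Kendall's Tau) can be made concrete with a short Python sketch. Here π is the example permutation from this slide, while σ and the helper names are illustrative assumptions, not from the slides:

```python
from itertools import combinations

# pi is the example permutation from the slide; sigma is an
# illustrative second ranking (an assumption, not from the slides).
# pi[i] is the rank assigned to gene i+1.
pi    = [3, 2, 5, 4, 1]
sigma = [1, 3, 4, 5, 2]

def footrule(p, s):
    """Spearman's footrule: sum of absolute rank differences."""
    return sum(abs(a - b) for a, b in zip(p, s))

def kendall_tau(p, s):
    """Kendall's tau distance: number of gene pairs that the two
    rankings order in opposite ways."""
    return sum(
        1
        for i, j in combinations(range(len(p)), 2)
        if (p[i] - p[j]) * (s[i] - s[j]) < 0
    )

print(footrule(pi, sigma))     # 6
print(kendall_tau(pi, sigma))  # 3

# Partial ranking list: the first N = 3 elements of pi, as defined above.
pi_star = pi[:3]
print(pi_star)  # [3, 2, 5]
```

Both distances are 0 for identical rankings and grow as the two lists disagree, which is why they serve as instability measures.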
The bias/variance tradeoff
A central issue in choosing a model for a given problem is selecting the level of structural complexity (number of variables/parameters, etc.) that best suits the data it must accommodate:
- Too many parameters → inclusion of noise → overfitting → variance contribution to the misfit
- Too few parameters → not enough flexibility → bias contribution to the misfit
Deciding on the correct amount of flexibility in a model is therefore a tradeoff between these two sources of misfit. This is called the bias/variance tradeoff.
Contributions
1. Experimental study of signature stability in gene expression datasets, by resampling (bootstrap, jackknife) the datasets for different ranking/feature selection methods
2. Study of several forms of feature selection stability using statistical similarity measures
3. Classification performance assessment for each feature selection and classification method
4. Study of the possible correlation between feature selection stability and classification accuracy for all feature selection and classification methods
5. Proposal of a feature aggregation procedure to obtain more stable probabilistic gene signatures
Stability Metrics: Measures of Feature Selection Stability
Stability of selection: for a given dataset, the stability of appearance of certain features after resampling the original dataset. Metrics: Hamming Distance, Inconsistency.
Stability of ranking: for a given dataset, the stability of both the appearance and the ranking order of certain features after resampling the original dataset. Metrics: Spearman's Footrule, Kendall's Tau.
For m resampled gene lists, each selecting N of the n genes, the normalized Hamming distance is

    H_norm = (2 / (m(m - 1))) * Σ_{i=1}^{m-1} Σ_{j=i+1}^{m} H_ij / (2N),    0 ≤ H_norm ≤ 1

with

    H_ij = Σ_{k=1}^{n} | l_ik - l_jk |,    l_ik = 1 if gene k is selected in resampling i, 0 otherwise.
Example (Hamming Distance)
Example of a stability metric: calculation of the Hamming Distance for m = 5, n = 10, N = 5.
Selected top-5 gene lists from five resamplings of the original dataset:
Resampling 1: (7, 4, 9, 3, 2)
Resampling 2: (9, 7, 4, 8, 3)
Resampling 3: (4, 6, 2, 1, 8)
Resampling 4: (5, 7, 1, 8, 3)
Resampling 5: (1, 5, 8, 3, 2)
Binary selection vectors over the n = 10 genes:
0101001110
0111001100
0010101011
0011010101
0010010111
Pairwise distances include, e.g., H_12 = 2, H_34 = 6, H_45 = 2; the ten pairwise distances sum to 48, so

    H_norm = (2 / (5 · 4)) · (48 / (2 · 5)) = 0.48
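The calculation above can be reproduced directly from the binary selection vectors; this is a minimal sketch (the function and variable names are mine, not from the slides):

```python
from itertools import combinations

# Binary selection vectors from the slide: m = 5 resamplings,
# n = 10 genes, N = 5 genes selected in each run.
vectors = [
    "0101001110",
    "0111001100",
    "0010101011",
    "0011010101",
    "0010010111",
]

def hamming(a, b):
    """Number of positions where two selection vectors differ."""
    return sum(x != y for x, y in zip(a, b))

m, N = len(vectors), 5
pair_dists = [hamming(a, b) for a, b in combinations(vectors, 2)]

# Two size-N signatures differ in at most 2N positions, so dividing
# by 2N maps each pairwise distance into [0, 1]; averaging over the
# m(m-1)/2 pairs gives the overall instability H_norm.
h_norm = sum(d / (2 * N) for d in pair_dists) / (m * (m - 1) / 2)
print(round(h_norm, 2))  # 0.48
```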
Experimental Analysis
Datasets: Breast Cancer (HBC, Tamoxifen), Leukemia (MLL, Golub), Lymphoma
Classification algorithms: k-NN (k = 5), Support Vector Machines
Feature Selection algorithms:
- Filters: Gram–Schmidt orthogonalization; k-NN and SVM correlation-based filters (gene ranking according to the misclassification error of a one-gene trained classifier)
- Wrapper: Sequential Forward Selection wrapper
Feature aggregation (main personal contribution):
- Gather together all different signatures
- Remove duplicates
- Exclude features with low selection frequency according to a threshold
Resampling strategies:
- Bootstrap (on each step, resample the patients of a dataset with replacement)
- Jackknife (on each step, remove 1–5% of the samples)
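The resampling protocol can be sketched as follows. This is a hedged sketch, not the thesis code: `rank_genes` is a hypothetical stand-in for any of the filters or the wrapper named above, and all parameter values are illustrative:

```python
import random

def rank_genes(sample_ids, n_genes=10):
    # Placeholder ranker: a deterministic function of the resample,
    # standing in for Gram-Schmidt, a correlation filter, or the
    # SFS wrapper. Returns a full ranking list (best gene first).
    rng = random.Random(sum(sample_ids))
    order = list(range(n_genes))
    rng.shuffle(order)
    return order

def bootstrap_signatures(samples, m=5, N=3, seed=0):
    """Draw m bootstrap resamples (sampling patients with
    replacement) and keep the top-N genes of each ranking."""
    rng = random.Random(seed)
    sigs = []
    for _ in range(m):
        resample = [rng.choice(samples) for _ in samples]
        sigs.append(rank_genes(resample)[:N])
    return sigs

def jackknife_signatures(samples, m=5, N=3, frac=0.05, seed=0):
    """Draw m jackknife resamples, each removing ~frac of the
    samples, and keep the top-N genes of each ranking."""
    rng = random.Random(seed)
    k = max(1, int(len(samples) * frac))
    sigs = []
    for _ in range(m):
        kept = rng.sample(samples, len(samples) - k)
        sigs.append(rank_genes(kept)[:N])
    return sigs

samples = list(range(30))
print(bootstrap_signatures(samples))
print(jackknife_signatures(samples))
```

The stability metrics of the previous slides are then computed over the m signatures each strategy returns.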
Visualizing Instability
[Figure: visualization of gene signature instability across resamplings.]
Stability Results (1)
Stability of Selection (Bootstrap): [Charts "Overall Stabilities - Bootstrap - Filtering" and "Overall Stabilities - Bootstrap - Wrapping": instability (0–1) per dataset (HBC, Lymphoma, Golub, Tamoxifen, MLL) under the Hamming Distance and Inconsistency metrics, for Gram–Schmidt, Correlation kNN, Correlation SVM and the Random Case; wrapping combinations: Gram–Schmidt > kNN, Gram–Schmidt > SVM, Correlation kNN > kNN, Correlation SVM > SVM.]
Stability of Ranking (Bootstrap): [Chart "Overall Stabilities - Bootstrap - Filtering": instability per dataset under Spearman's Footrule and Kendall's Tau, for the same feature selection methods.]
Stability Results (2)
Stability of Selection (Jackknife): [Charts "Overall Stabilities - Jackknife - Filtering" and "Overall Stabilities - Jackknife - Wrapping": instability (0–1) per dataset (HBC, Lymphoma, Golub, Tamoxifen, MLL) under the Hamming Distance and Inconsistency metrics, for Gram–Schmidt, Correlation kNN, Correlation SVM and the Random Case; wrapping combinations: Gram–Schmidt > kNN, Gram–Schmidt > SVM, Correlation kNN > kNN, Correlation SVM > SVM.]
Stability of Ranking (Jackknife): [Chart "Overall Stabilities - Jackknife - Filtering": instability per dataset under Spearman's Footrule and Kendall's Tau, for the same feature selection methods.]
Accuracy Results
[Chart "Mean Accuracies - No Filtering or Wrapping": accuracy percentage per dataset (HBC, Lymphoma, Golub, Tamoxifen, MLL) for kNN and SVM, under Bootstrap and Jackknife.]
[Charts "Overall Accuracies - Bootstrap" and "Overall Accuracies - Jackknife": accuracy percentage per dataset for Gram–Schmidt > kNN, Gram–Schmidt > SVM, Correlation kNN > kNN, Correlation SVM > SVM, filter vs wrapper.]
Remarks
Stability:
- Stability is inversely proportional to the size of the perturbation
- Gram–Schmidt orthogonalization outperforms the classifier-based correlations
- Filters are more stable than the wrapper
- Correlation between stability of selection and stability of ranking
Accuracy:
- Accuracy is proportional to the size of the perturbation
- Gram–Schmidt orthogonalization is outperformed by the classifier-based correlations
- Filters outperform the wrapper
- Performance is improved after the application of feature selection techniques
Feature Aggregation
- A class permutation test shows no overfitting
- A t-test between mean accuracies before and after aggregation reveals an improvement in the performance of the wrapper, but not of the filters
[Charts "Overall Accuracies after Aggregation - Filter" and "Overall Accuracies after Aggregation - Wrapper": accuracy per dataset (HBC, Lymphoma, Golub, Tamoxifen, MLL) for Gram–Schmidt > kNN, Gram–Schmidt > SVM, Correlation kNN > kNN, Correlation SVM > SVM.]
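The aggregation procedure from the Methods slide (gather all signatures, remove duplicates, exclude low-frequency features) can be sketched as follows; the input signatures reuse the lists from the Hamming-distance example, and the 0.5 threshold is an illustrative assumption:

```python
from collections import Counter

# Signatures found across resamplings (from the Hamming-distance
# example slide; any list of gene-id lists would do).
signatures = [
    [7, 4, 9, 3, 2],
    [9, 7, 4, 8, 3],
    [4, 6, 2, 1, 8],
    [5, 7, 1, 8, 3],
    [1, 5, 8, 3, 2],
]

def aggregate(signatures, threshold=0.5):
    """Return {gene: selection frequency} for genes selected in at
    least `threshold` of the signatures; the surviving frequencies
    form the probabilistic signature."""
    m = len(signatures)
    counts = Counter(g for sig in signatures for g in set(sig))
    return {g: c / m for g, c in counts.items() if c / m >= threshold}

print(aggregate(signatures))
```

The result is a distribution over genes rather than a single fixed list, which is what makes the aggregated signature probabilistic.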
General Remarks
- Similarity metrics: their use depends on what kind of stability we study
- Filters are more stable and accurate than wrappers: although wrappers return few variables, their selection procedure can be highly variant
- One would expect that high stability leads to high accuracy. However, this is not always the case. Why? The best compromise between bias and variance depends on many parameters (feature selection algorithm, top N ranked genes, etc.)
Aggregation:
- Filters: lower variance in feature selection → aggregation does not improve accuracy
- Wrapper: higher variance in feature selection → aggregation improves accuracy (by adjusting variance to achieve a better compromise between bias and variance)
In summary:
- High stability → lower variance in feature selection → (bias/variance tradeoff) less flexible model, higher bias → aggregation does not improve accuracy (e.g. filters)
- Low stability → higher variance in feature selection → (bias/variance tradeoff) more flexible model, lower bias → aggregation improves accuracy (e.g. wrapper)
Conclusions – Future Work
Conclusions:
- We have shown that genetic signatures are sensitive to perturbations
- Stability analysis using similarity metrics is necessary in order to evaluate signature sensitivity
- The aggregation procedure creates a distribution of selected genes which can be used as a more stable probabilistic genetic signature for cancer microarray studies
- It is better to use a more stable probabilistic signature consisting of more genes than a perturbation-sensitive signature consisting of fewer genes
Future work:
- Study gene ranking using Markov Chains (MC): the selection of a gene during the selection process could depend on the previously selected gene (1st-order MC) or on several previously selected genes (higher-order MC)
- Comparison of stability between Forward Selection and Backward Elimination wrappers
- Further research on the relation between stability and accuracy: use of more algorithms, feature ranking based on the stability/accuracy ratio
- Study the effect on genetic signatures of updating classification models with new data
- Biological interpretation of the selected genes in probabilistic signatures
Acknowledgements
Many thanks to:
- Gianluca Bontempi (Machine Learning Group, ULB)
- Christos Sotiriou (Microarray Unit, IJB)
- Benjamin Haibe-Kains (PhD student, MLG ULB, IJB)
- Mrs Yiota Poirazi and the Computational Biology Group, FORTH (for this opportunity)