Lecture 8: Feature Selection
Bioinformatics Data Analysis and Tools
Centre for Integrative Bioinformatics VU
Elena Marchiori ([email protected])


Page 1: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Centre for Integrative Bioinformatics VU

Lecture 8

Feature Selection

Bioinformatics Data Analysis and Tools

Elena Marchiori ([email protected])

Page 2: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Why select features

• Select a subset of “relevant” input variables
• Advantages:
  – it is cheaper to measure fewer variables
  – the resulting classifier is simpler and potentially faster
  – prediction accuracy may improve by discarding irrelevant variables
  – identifying relevant variables gives more insight into the nature of the corresponding classification problem (biomarker detection)

Page 3: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Why select features?

[Figure: correlation plots of the 3-class leukemia data — no feature selection vs. top-100 features selected by variance; correlation scale from −1 to +1]

Page 4: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Approaches

• Wrapper
  – feature selection takes into account the contribution to the performance of a given type of classifier
• Filter
  – feature selection is based on an evaluation criterion for quantifying how well feature (subsets) discriminate the two classes
• Embedded
  – feature selection is part of the training procedure of a classifier (e.g. decision trees)

Page 5: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Embedded methods

• Attempt to jointly or simultaneously train both a classifier and a feature subset

• Often optimize an objective function that jointly rewards accuracy of classification and penalizes use of more features.

• Intuitively appealing

Example: tree-building algorithms

Adapted from J. Fridlyand

Page 6: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Approaches to Feature Selection

Filter approach:
  input features → feature selection by distance metric score → feature set → train model → model

Wrapper approach:
  input features → feature selection search → feature set → train model → model
  (the importance of the features given by the model is fed back to the search)

Adapted from Shin and Jasso

Page 7: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Filter methods

Rᵖ → feature selection → Rˢ (s << p) → classifier design

• Features are scored independently and the top s are used by the classifier
• Score: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc.

Easy to interpret. Can provide some insight into the disease markers.

Adapted from J. Fridlyand
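The filter scheme above fits in a few lines. A minimal numpy sketch, using the t-statistic as the score and a synthetic stand-in for a two-class expression matrix (the data, sizes, and the choice of score are illustrative assumptions):

```python
# Filter feature selection: score each gene independently with a
# two-sample t-statistic and keep the top-s genes.
import numpy as np

def t_scores(X, y):
    """Per-feature absolute Welch t-statistic between the two classes."""
    A, B = X[y == 0], X[y == 1]
    m1, m2 = A.mean(axis=0), B.mean(axis=0)
    v1, v2 = A.var(axis=0, ddof=1), B.var(axis=0, ddof=1)
    return np.abs(m1 - m2) / np.sqrt(v1 / len(A) + v2 / len(B))

def select_top_s(X, y, s):
    """Indices of the s highest-scoring features."""
    return np.argsort(t_scores(X, y))[::-1][:s]

rng = np.random.default_rng(0)
X = rng.normal(size=(38, 200))        # 38 samples, 200 synthetic "genes"
y = np.repeat([0, 1], [27, 11])
X[y == 1, :5] += 3.0                  # make the first 5 genes informative
top = select_top_s(X, y, 5)           # recovers the informative genes
```

Any of the other scores listed above (correlation, mutual information, F-statistic, ...) would slot into `t_scores` unchanged.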

Page 8: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Problems with filter method

• Redundancy in the selected features: features are considered independently, not measured on the basis of whether they contribute new information

• Interactions among features generally cannot be explicitly incorporated (some filter methods are smarter than others)

• The classifier has no say in which features should be used: some scores may be more appropriate in conjunction with some classifiers than with others.

Adapted from J. Fridlyand

Page 9: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Dimension reduction: a variant on a filter method

• Rather than retaining a subset of s features, perform dimension reduction by projecting the features onto s principal components of variation (e.g. PCA)

• The problem is that we are no longer dealing with one feature at a time, but with a linear (or possibly more complicated) combination of all features. This may be good enough for a black box, but how does one build a diagnostic chip on a “supergene”? (even though we don’t want to confuse the two tasks)

• These methods tend not to work better than simple filter methods.

Adapted from J. Fridlyand
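For concreteness, the projection step can be sketched with SVD-based PCA in plain numpy (the data here are a synthetic stand-in); note that each resulting "supergene" column mixes all original features:

```python
# Dimension reduction instead of selection: project the p features onto
# the top-s principal components of variation.
import numpy as np

def pca_project(X, s):
    Xc = X - X.mean(axis=0)                       # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:s].T                          # coordinates on first s PCs

rng = np.random.default_rng(1)
X = rng.normal(size=(38, 100))                    # 38 samples, 100 features
Z = pca_project(X, 5)                             # 38 samples, 5 "supergenes"
```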

Page 10: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Wrapper methods

Rᵖ → feature selection → Rˢ (s << p) → classifier design

• Iterative approach: many feature subsets are scored based on classification performance, and the best is used.
• Selection of subsets: forward selection, backward selection, forward-backward selection, tree harvesting, etc.

Adapted from J. Fridlyand
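Greedy forward selection, the first strategy listed above, can be sketched as follows. A nearest-class-mean classifier scored by leave-one-out accuracy stands in for the unspecified classifier and performance measure (both are assumptions of this sketch):

```python
# Wrapper feature selection: greedy forward selection. At each step, add
# the feature whose inclusion gives the best LOO accuracy of the classifier.
import numpy as np

def loo_accuracy(X, y):
    """LOO accuracy of a nearest-class-mean classifier on columns of X."""
    hits = 0
    for i in range(len(y)):
        m = np.ones(len(y), bool); m[i] = False   # leave sample i out
        mu0 = X[m & (y == 0)].mean(axis=0)
        mu1 = X[m & (y == 1)].mean(axis=0)
        d0 = np.linalg.norm(X[i] - mu0)
        d1 = np.linalg.norm(X[i] - mu1)
        hits += (d1 < d0) == (y[i] == 1)
    return hits / len(y)

def forward_select(X, y, s):
    chosen = []
    for _ in range(s):
        rest = [j for j in range(X.shape[1]) if j not in chosen]
        best = max(rest, key=lambda j: loo_accuracy(X[:, chosen + [j]], y))
        chosen.append(best)
    return chosen

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 20))
y = np.repeat([0, 1], 15)
X[y == 1, 3] += 3.0                   # only feature 3 carries the signal
picked = forward_select(X, y, 2)      # feature 3 is picked first
```

Backward selection is the mirror image: start from all features and greedily remove the one whose removal hurts accuracy least.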

Page 11: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Problems with wrapper methods

• Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated

• No exhaustive search is possible (2ᵖ subsets to consider): generally greedy algorithms only.

• Easy to overfit.

Adapted from J. Fridlyand

Page 12: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Example: Microarray Analysis

“Labeled” cases (38 bone marrow samples: 27 ALL, 11 AML; each contains 7129 gene expression values)
  → train model (using neural networks, support vector machines, Bayesian nets, etc.)
  → model + key genes
  → apply the model to 34 new unlabeled bone marrow samples → AML/ALL prediction

Page 13: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Microarray data challenges to machine learning algorithms:

• Few samples for analysis (38 labeled)

• Extremely high-dimensional data (7129 gene expression values per sample)

• Noisy data

• Complex underlying mechanisms, not fully understood

Page 14: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Some genes are more useful than others for building classification models

Example: genes 36569_at and 36495_at are useful

Page 15: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Example: genes 36569_at and 36495_at are useful

[Scatter plot: the expression of these two genes separates the AML from the ALL samples]

Some genes are more useful than others for building classification models

Page 16: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Example: genes 37176_at and 36563_at are not useful

Some genes are more useful than others for building classification models

Page 17: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Importance of Feature (Gene) Selection

• The majority of genes are not directly related to leukemia

• Having a large number of features enhances the model’s flexibility, but makes it prone to overfitting

• Noise and the small number of training samples make this even more likely

• Some types of models, like kNN, do not scale well with many features

Page 18: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

With 7129 genes, how do we choose the best?

• Distance metrics to capture class separation
• Rank genes according to distance metric score
• Choose the top n ranked genes

[Genes sorted from HIGH score to LOW score]

Page 19: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Distance Metrics

• Tamayo’s relative class separation (signal-to-noise):

  |x̄₁ − x̄₂| / (s₁ + s₂)

• t-test statistic:

  (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

• Bhattacharyya distance:

  (1/4) · (x̄₁ − x̄₂)² / (s₁² + s₂²) + (1/2) · log[ (s₁² + s₂²) / (2 s₁ s₂) ]

where x̄ᵢ is the mean (vector) of class i, sᵢ its standard deviation, and nᵢ its number of samples.
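The three scores, written out for a single gene (a minimal numpy sketch; `x1` and `x2` hold the gene’s expression values in class 1 and class 2):

```python
# Per-gene class-separation scores: signal-to-noise, t-statistic, and
# Bhattacharyya distance between two one-dimensional Gaussians.
import numpy as np

def scores(x1, x2):
    m1, m2 = x1.mean(), x2.mean()
    s1, s2 = x1.std(ddof=1), x2.std(ddof=1)
    tamayo = abs(m1 - m2) / (s1 + s2)                          # signal-to-noise
    t = abs(m1 - m2) / np.sqrt(s1**2/len(x1) + s2**2/len(x2))  # t-statistic
    bhatt = 0.25 * (m1 - m2)**2 / (s1**2 + s2**2) \
          + 0.5 * np.log((s1**2 + s2**2) / (2*s1*s2))          # Bhattacharyya
    return tamayo, t, bhatt

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, 50)          # class 1 values of the gene
b = rng.normal(2.0, 1.0, 50)          # class 2 values, shifted mean
tam, t, bh = scores(a, b)             # all three scores are positive here
```

All three scores are zero when the two classes have identical distributions, and grow as the class means separate.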

Page 20: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

SVM-RFE: wrapper

• Recursive Feature Elimination:
  – train a linear SVM → linear decision function
  – use the absolute value of the variable weights to rank the variables
  – remove the half of the variables with the lower ranks
  – repeat the above steps (train, rank, remove) on the data restricted to the variables not removed

• Output: subset of variables
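The recursive loop can be sketched in numpy. As an assumption of this sketch, ordinary least-squares weights stand in for the linear-SVM weights (the real method retrains an SVM at every iteration); the halving-until-n_keep structure is the point being shown:

```python
# SVM-RFE-style recursive feature elimination: rank variables by |w_i|,
# drop the lower-ranked half, retrain on the survivors, repeat.
import numpy as np

def rfe(X, y, n_keep):
    """Return the original indices of the surviving features."""
    alive = np.arange(X.shape[1])
    yy = np.where(y == 1, 1.0, -1.0)              # ±1 targets
    while len(alive) > n_keep:
        # Stand-in for the linear SVM: least-squares linear weights.
        w, *_ = np.linalg.lstsq(X[:, alive], yy, rcond=None)
        order = np.argsort(np.abs(w))             # lowest |w| first
        keep = max(len(alive) - len(alive)//2, n_keep)
        survivors = order[::-1][:keep]            # highest |w| survive
        alive = alive[np.sort(survivors)]
    return alive

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 16))
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # only features 0, 1 matter
kept = rfe(X, y, 2)                               # recovers features 0 and 1
```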

Page 21: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

SVM-RFE

• Linear binary classifier decision function:

  f(x₁, …, x_N) = Σᵢ₌₁..N wᵢ xᵢ + b,  score of variable xᵢ = |wᵢ|

• Recursive Feature Elimination (SVM-RFE), at each iteration:
  1) eliminate the threshold% of variables with the lowest scores
  2) recompute the scores of the remaining variables

Page 22: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

SVM-RFE: I. Guyon et al., Machine Learning, 46:389–422, 2002

Page 23: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

RELIEF

• Idea: relevant variables make nearest examples of the same class closer, and make nearest examples of opposite classes farther apart.

1. weights = zero
2. For all examples in the training set:
   – find the nearest example from the same class (hit) and from the opposite class (miss)
   – update the weight of each variable by adding abs(example − miss) − abs(example − hit)

RELIEF: K. Kira and L. Rendell, 10th Int. Conf. on AI, 129–134, 1992

Page 24: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

RELIEF Algorithm

RELIEF assigns weights to variables based on how well they separate samples from their nearest neighbors (nnb) from the same and from the opposite class.

RELIEF
%input:  X (two classes)
%output: W (weights assigned to the variables)
nr_var  = total number of variables;
weights = zero vector of size nr_var;
for all x in X do
    hit(x)  = nnb of x from the same class;
    miss(x) = nnb of x from the opposite class;
    weights += abs(x - miss(x)) - abs(x - hit(x));
end;
nr_ex = number of examples in X;
return W = weights / nr_ex

Note: the variables have to be normalized first (e.g., divide each variable by its (max − min) value).
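A runnable numpy version of the pseudocode above (assumptions of this sketch: two classes coded 0/1, L1 distance for the nearest-neighbor search, and features already normalized as the note requires):

```python
# RELIEF: weight each variable by how much it separates each sample from
# its nearest miss versus its nearest hit, averaged over all samples.
import numpy as np

def relief(X, y):
    n, p = X.shape
    w = np.zeros(p)
    for i in range(n):
        d = np.abs(X - X[i]).sum(axis=1)          # L1 distance to everyone
        d[i] = np.inf                             # exclude the sample itself
        same = np.where(y == y[i], d, np.inf)
        opp = np.where(y != y[i], d, np.inf)
        hit = X[np.argmin(same)]                  # nearest same-class example
        miss = X[np.argmin(opp)]                  # nearest opposite-class example
        w += np.abs(X[i] - miss) - np.abs(X[i] - hit)
    return w / n

rng = np.random.default_rng(5)
X = rng.uniform(size=(60, 6))                     # features already in [0, 1]
y = (X[:, 0] > 0.5).astype(int)                   # only feature 0 is relevant
w = relief(X, y)                                  # feature 0 gets the top weight
```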

Page 25: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

EXAMPLE

What are the weights of s1, s2, s3 and s4 assigned by RELIEF?

Page 26: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Classification: CV error

• Training error
  – empirical error
• Error on an independent test set
  – test error
• Cross-validation (CV) error
  – leave-one-out (LOO)
  – n-fold CV

[Diagram: the N samples are split; in each fold, 1/n of the samples is used for testing and (n−1)/n for training; the errors are counted and summarized as the CV error rate]
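Leave-one-out CV in a few lines: hold out one sample, train on the rest, and count the errors. A 1-nearest-neighbor classifier stands in for the model (an assumption of this sketch), on synthetic well-separated data:

```python
# Leave-one-out cross-validation with a 1-NN classifier.
import numpy as np

def loocv_error(X, y):
    errors = 0
    for i in range(len(y)):
        mask = np.ones(len(y), bool); mask[i] = False  # leave sample i out
        d = np.linalg.norm(X[mask] - X[i], axis=1)
        pred = y[mask][d.argmin()]                     # 1-NN prediction
        errors += pred != y[i]
    return errors / len(y)

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 2))
y = np.repeat([0, 1], 20)
X[y == 1] += 4.0                          # two well-separated classes
err = loocv_error(X, y)                   # near-zero LOOCV error
```

n-fold CV is the same loop with blocks of N/n samples held out instead of single samples.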

Page 27: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Two schemes of cross validation

CV1: for each LOO split of the N samples, train and test both the feature selector and the classifier; count the errors.

CV2: perform feature selection once on all N samples; then, for each LOO split, train and test only the classifier; count the errors.
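The difference is easy to demonstrate on pure-noise data, where the true error is 50%. Selecting the top features on all samples first (CV2) leaks label information into the test folds; re-selecting inside each LOO fold (CV1) does not. A sketch using a t-statistic filter and a 1-NN classifier (both assumptions of this demo):

```python
# CV1 vs CV2 on data with NO real class signal: CV2 looks optimistic.
import numpy as np

def t_abs(X, y):
    A, B = X[y == 0], X[y == 1]
    return np.abs(A.mean(0) - B.mean(0)) / np.sqrt(
        A.var(0, ddof=1)/len(A) + B.var(0, ddof=1)/len(B))

def nn_loo_error(X, y, select_inside):
    errs = 0
    for i in range(len(y)):
        m = np.ones(len(y), bool); m[i] = False
        # CV1: pick the top-10 features from the training fold only.
        cols = (np.argsort(t_abs(X[m], y[m]))[-10:] if select_inside
                else np.arange(X.shape[1]))
        d = np.linalg.norm(X[m][:, cols] - X[i, cols], axis=1)
        errs += y[m][d.argmin()] != y[i]
    return errs / len(y)

rng = np.random.default_rng(8)
X = rng.normal(size=(30, 500))            # pure noise, many features
y = np.repeat([0, 1], 15)

# CV2: select the top 10 features once, using every sample (labels leak).
cv2_cols = np.argsort(t_abs(X, y))[-10:]
cv2_err = nn_loo_error(X[:, cv2_cols], y, select_inside=False)
# CV1: re-select the top 10 inside every LOO fold.
cv1_err = nn_loo_error(X, y, select_inside=True)
# cv2_err comes out well below cv1_err, even though there is no signal.
```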

Page 28: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Difference between CV1 and CV2

• CV1: gene selection within LOOCV
• CV2: gene selection before LOOCV
• CV2 can yield an optimistic estimate of the true classification error

• CV2 was used in the paper by Golub et al.:
  – 0 training errors
  – 2 CV errors (5.26%)
  – 5 test errors (14.7%)
  – CV error different from test error!

Page 29: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Significance of classification results

• Permutation test:
  – permute the class labels of the samples
  – compute the LOOCV error on the data with permuted labels
  – repeat this process a large number of times
  – compare with the LOOCV error on the original data:

• p-value = (# times LOOCV error on permuted data <= LOOCV error on original data) / (total # of permutations considered)
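The steps above can be sketched directly. A 1-NN classifier stands in for the model, and small sample and permutation counts keep the demo fast (all assumptions of this sketch):

```python
# Permutation test for the significance of a LOOCV error rate.
import numpy as np

def loocv_error(X, y):
    errors = 0
    for i in range(len(y)):
        m = np.ones(len(y), bool); m[i] = False
        d = np.linalg.norm(X[m] - X[i], axis=1)
        errors += y[m][d.argmin()] != y[i]        # 1-NN prediction error
    return errors / len(y)

def permutation_pvalue(X, y, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    base = loocv_error(X, y)
    # Count permutations that do at least as well as the original labels.
    hits = sum(loocv_error(X, rng.permutation(y)) <= base
               for _ in range(n_perm))
    return hits / n_perm

rng = np.random.default_rng(7)
X = rng.normal(size=(20, 2))
y = np.repeat([0, 1], 10)
X[y == 1] += 3.0                          # real class structure
p = permutation_pvalue(X, y, n_perm=100)  # small p-value: result significant
```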

Page 30: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Application: biomarker detection with mass spectrometric data of mixed quality

• MALDI-TOF data
• Samples of mixed quality due to different storage times
• Controlled molecule spiking used to generate two classes

E. Marchiori et al., IEEE CIBCB, 385–391, 2005

Page 31: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Profiles of one spiked sample

Page 32: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Comparison of ML algorithms

• Feature selection + classification:
  1. RFE+SVM
  2. RFE+kNN
  3. RELIEF+SVM
  4. RELIEF+kNN

Page 33: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

LOOCV results

• The misclassified samples are of bad quality (longer storage time)

• The selected features do not always correspond to m/z of spiked molecules

Page 34: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

LOOCV results

• The variables selected by RELIEF correspond to the spiked peptides
• RFE is less robust than RELIEF over the LOOCV runs and also selects “irrelevant” variables

RELIEF-based feature selection yields results that are more interpretable than those of RFE

Page 35: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

BUT...

• RFE+SVM yields higher LOOCV accuracy than RELIEF+SVM

• RFE+kNN yields higher accuracy than RELIEF+kNN (perfect LOOCV classification for RFE+1NN)

RFE-based feature selection yields better predictive performance than RELIEF

Page 36: Lecture 8 Feature Selection Bioinformatics Data Analysis and Tools

Conclusion

• Better predictive performance does not necessarily correspond to stability and interpretability of the results

• Open issues:
  – (ML/BIO) An ad-hoc measure of relevance for potential biomarkers identified by feature selection algorithms (use of domain knowledge)?
  – (ML) Is stability of feature selection algorithms more important than predictive accuracy?