Classification / Prediction

Upload: yelena

Posted on 23-Jan-2016



TRANSCRIPT

Page 1: Classification / Prediction

Classification / Prediction

Page 2: Classification / Prediction

Classification

“Supervised learning”: use a “training set” of examples to create a model that can predict, given an unknown sample, which of two or more classes that sample belongs to.

[Figure: an unknown sample, marked “?”, waiting to be assigned to a class]

Page 3: Classification / Prediction

What we’ll cover

• How to build a classifier.

• How to evaluate a classifier.

• Using GenePattern to classify expression data.

Page 4: Classification / Prediction

What Is a Classifier?

• A predictive rule that uses a set of inputs (genes) to predict the values of the output (phenotype).

• Known examples (training data) are used to build the predictive rule.

• Goal:

– Achieve high generalization (predictive) power.

– Avoid over-fitting.
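To make “predictive rule” concrete, here is a minimal sketch of a one-gene threshold classifier. The gene name, threshold, and class labels are invented for illustration; real classifiers combine many genes.

```python
# A minimal predictive rule: classify a sample by thresholding the
# expression of a single marker gene. Gene name and threshold are
# hypothetical, chosen only to illustrate the idea.
def predict_phenotype(expression, threshold=2.0):
    """Predict 'tumor' if the marker gene is over-expressed, else 'normal'."""
    return "tumor" if expression["MARKER_GENE"] >= threshold else "normal"

sample = {"MARKER_GENE": 3.1}
print(predict_phenotype(sample))  # -> tumor
```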

Page 5: Classification / Prediction

Classification: computational methodology

[Flowchart] Data + Known Classes → Assess Gene-Class Correlation / Feature (Marker) Selection → Build Classifier → Test Classifier by Cross-Validation → Evaluate Classifier on Independent Test Set

Page 6: Classification / Prediction

Classification: computational methodology

[Flowchart] Expression Data + Known Classes → Assess Gene-Class Correlation / Feature Selection → Build Classifier → Test Classifier by Cross-Validation → Evaluate Classifier on Independent Test Set

Classifier methods: Regression Trees (CART), Weighted Voting, k-Nearest Neighbors, Support Vector Machines

Page 7: Classification / Prediction

Classifiers

Important issues:
– Few cases, many variables (genes).
– Redundancy: many highly correlated genes.
– Noise: measurements are very imprecise.
– Feature selection: reducing p (the number of genes) is a necessity.
– Avoid over-fitting.

Page 8: Classification / Prediction

K-nn Classifier

Example: K=5, 2 genes, 2 classes

[Figure: training samples projected in gene space (gene 1 vs. gene 2); two classes, orange and black]

Page 9: Classification / Prediction

K-nn Classifier

Example: K=5, 2 genes, 2 classes

[Figure: an unknown sample, marked “?”, projected into the same gene space]

Page 10: Classification / Prediction

K-nn Classifier

Example: K=5, 2 genes, 2 classes

[Figure: the unknown sample “consults” its 5 closest neighbors: 3 black, 2 orange, so the majority vote predicts black]

Distance measures:
• Euclidean distance
• 1 - Pearson correlation
• …
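The k-NN procedure above can be sketched in a few lines of Python. The toy expression values are invented, but the logic (Euclidean distance, majority vote among the K=5 nearest training samples) follows the slides:

```python
import math
from collections import Counter

def knn_predict(train, query, k=5):
    """Classify `query` by majority vote among its k nearest training
    samples under Euclidean distance. `train` is a list of
    (feature_vector, class_label) pairs."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data mimicking the slide: 2 genes, 2 classes, K=5.
train = [
    ((1.0, 1.2), "black"), ((0.8, 1.0), "black"), ((1.1, 0.9), "black"),
    ((3.0, 3.2), "orange"), ((3.1, 2.8), "orange"), ((2.9, 3.0), "orange"),
]
print(knn_predict(train, (1.0, 1.0), k=5))  # -> black
```

With only six training points, the 5-neighbor vote for a query near the black cluster is 3 black to 2 orange, exactly the situation drawn on the slide.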

Page 11: Classification / Prediction

Support Vector Machine (SVM)

Noble, Nat Biotech 2006

Page 12: Classification / Prediction

Weighted Voting

Mixture-of-Experts approach:
– Each gene casts a vote for one of the possible classes.
– The vote is weighted by a score assessing the reliability of the expert (in this case, the gene).
– The class receiving the highest vote is the predicted class.
– The vote can be used as a proxy for the probability of class membership (prediction strength).

Slonim et al., RECOMB 2000

[Figure: a new sample’s expression profile (g1, g2, …, gi, …, gn) compared against the Class 1 and Class 2 centroids]
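A minimal sketch of this voting scheme, assuming the signal-to-noise weight used in this line of work (a_i = (mean1 - mean2) / (sd1 + sd2)) and a midpoint decision boundary between the class centroids; the toy data are invented:

```python
from statistics import mean, stdev

def weighted_vote(class1, class2, sample):
    """Weighted-voting sketch: each gene i casts the vote
    a_i * (x_i - b_i), where a_i is a signal-to-noise weight and b_i
    the midpoint of the two class means; the sign of the summed vote
    decides the class."""
    total = 0.0
    for i in range(len(sample)):
        v1 = [s[i] for s in class1]
        v2 = [s[i] for s in class2]
        a = (mean(v1) - mean(v2)) / (stdev(v1) + stdev(v2))  # gene reliability
        b = (mean(v1) + mean(v2)) / 2                        # decision boundary
        total += a * (sample[i] - b)                         # weighted vote
    return "class 1" if total > 0 else "class 2"

# Toy data: gene 1 is high in class 1, gene 2 is high in class 2.
class1 = [(5.0, 1.0), (5.2, 1.1), (4.8, 0.9)]
class2 = [(1.0, 4.0), (1.2, 4.2), (0.8, 3.8)]
print(weighted_vote(class1, class2, (4.9, 1.0)))  # -> class 1
```

The magnitude of `total` relative to the competing votes is what the slide calls prediction strength.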

Page 13: Classification / Prediction

Class Prediction: computational methodology

[Flowchart] Expression Data + Known Classes → Assess Gene-Class Correlation / Feature Selection → Build Predictor → Test Predictor by Cross-Validation → Evaluate Predictor on Independent Test Set

Predictor methods: Regression Trees (CART), Weighted Voting, k-Nearest Neighbors, Support Vector Machines

Page 14: Classification / Prediction

Testing the Classifier

• Evaluation on an independent test set:
– Build the classifier on the train set.
– Assess prediction performance on the test set.
• Maximize generalization / avoid overfitting.
• Performance measure:

error rate = (# of cases incorrectly classified) / (total # of cases)

Page 15: Classification / Prediction

Testing the Classifier

• Evaluation on an independent test set:
– What if we don’t have an independent test set?
• Cross-Validation (XV):
– Split the dataset into n folds (e.g., 10 folds of 10 cases each).
– For each fold (e.g., for each group of 10 cases):
• train (i.e., build the model) on the other n-1 folds (e.g., on 90 cases),
• test (i.e., predict) on the left-out fold (e.g., on the remaining 10 cases).
– Combine the test results.
– With small sample sizes, leave-one-out XV is frequently used.
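The fold loop above can be sketched as follows, in pure Python; the nearest-class-mean model plugged in at the end is an invented stand-in for whichever classifier is being evaluated:

```python
from statistics import mean

def cross_validate(samples, labels, train_fn, predict_fn, n_folds=10):
    """n-fold cross-validation: hold out each fold in turn, train on the
    remaining n-1 folds, predict the held-out fold, and pool the results
    into one error rate. (Assumes len(samples) is a multiple of n_folds
    for simplicity.)"""
    errors = 0
    fold_size = len(samples) // n_folds
    for f in range(n_folds):
        lo, hi = f * fold_size, (f + 1) * fold_size
        model = train_fn(samples[:lo] + samples[hi:], labels[:lo] + labels[hi:])
        errors += sum(
            predict_fn(model, x) != y
            for x, y in zip(samples[lo:hi], labels[lo:hi])
        )
    return errors / (fold_size * n_folds)

# Toy usage with a 1-D nearest-class-mean classifier.
def train_fn(xs, ys):
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}

def predict_fn(means, x):
    return min(means, key=lambda c: abs(x - means[c]))

xs = [0.1, 0.2, 0.3, 0.4, 0.5, 5.1, 5.2, 5.3, 5.4, 5.5]
ys = ["A"] * 5 + ["B"] * 5
print(cross_validate(xs, ys, train_fn, predict_fn, n_folds=5))  # -> 0.0
```

Setting `n_folds = len(samples)` turns this into the leave-one-out variant mentioned on the slide.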

Page 16: Classification / Prediction

Testing the Classifier

[Figure: leave-one-out cross-validation learning curves for the ALL vs. MLL vs. AML classification]

Page 17: Classification / Prediction

Testing the Classifier

• Error rate estimate:
– Evaluation on an independent test set gives the best error estimate.
– Cross-validation is needed when the sample size is small, and for model selection.

Page 18: Classification / Prediction

Classification Cookbook

• Start by splitting the data into a train and a test set (stratified). “Forget” about the test set until the very end.
• Explore different feature selection methods and different classifiers on the train set by XV.
• Once the “best” classifier and the best classifier parameters have been selected (based on XV):
– Build a classifier with those parameters on the entire train set.
– Apply the classifier to the test set.
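The first cookbook step, a stratified split, can be sketched as follows. GenePattern provides a module for this on real datasets; this pure-Python version just illustrates why stratification preserves class proportions:

```python
import random

def stratified_split(labels, test_frac=0.3, seed=0):
    """Stratified train/test split: hold out the same fraction of each
    class so that class proportions are preserved in both sets.
    Returns (train_indices, test_indices)."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for c in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == c]
        rng.shuffle(idx)                     # randomize within the class
        n_test = round(len(idx) * test_frac)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

# 10 samples of class A, 20 of class B; 30% of each class is held out.
labels = ["A"] * 10 + ["B"] * 20
train_idx, test_idx = stratified_split(labels)
held_out = [labels[i] for i in test_idx]
print(held_out.count("A"), held_out.count("B"))  # -> 3 6
```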

Page 19: Classification / Prediction

Classification: GenePattern methods

• Split data into train and test set:
– SplitDatasetTrainTest
• Explore different feature selection methods and different classifiers on the train set by CV:
– CARTXValidation
– KNNXValidation
– WeightedVotingXValidation
• Once the “best” classifier and best classifier parameters have been selected (based on CV):
– CART
– KNN
– WeightedVoting
• Examine results:
– PredictionResultsViewer
– FeatureSummaryViewer

Page 20: Classification / Prediction

Classification Example