![Page 1: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/1.jpg)
Topics in the Development and Validation of Gene Expression
Profiling Based Predictive Classifiers
Richard Simon, D.Sc.
Chief, Biometric Research Branch
National Cancer Institute
Linus.nci.nih.gov/brb
![Page 2: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/2.jpg)
BRB Website
http://linus.nci.nih.gov/brb
• Powerpoint presentations and audio files
• Reprints & Technical Reports
• BRB-ArrayTools software
• BRB-ArrayTools Data Archive
• Sample Size Planning for Targeted Clinical Trials
![Page 3: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/3.jpg)
Simplified Description of Microarray Assay
• Extract mRNA from cells of interest
– Each mRNA molecule was transcribed from a single gene and it has a linear structure complementary to that gene
• Convert mRNA to cDNA introducing a fluorescently labeled dye to each molecule
• Distribute the cDNA sample to a solid surface containing “probes” of DNA representing all “genes”; the probes are in known locations on the surface
• Let the molecules from the sample hybridize with the probes for the corresponding genes
• Remove excess sample and illuminate surface with laser with frequency corresponding to the dye
• Measure intensity of fluorescence over each probe
![Page 4: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/4.jpg)
Resulting Data
• Intensity over a probe is approximately proportional to abundance of mRNA molecules in the sample for the gene corresponding to the probe
• 40,000 variables measured for each case
– Excessive hype
– Excessive skepticism
– Some familiar statistical paradigms don’t work well
![Page 5: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/5.jpg)
Good Microarray Studies Have Clear Objectives
• Class Comparison (Gene Finding)
– Find genes whose expression differs among predetermined classes, e.g. tissue or experimental condition
• Class Prediction
– Prediction of predetermined class (e.g. treatment outcome) using information from gene expression profile
– Survival risk-group prediction
• Class Discovery
– Discover clusters of specimens having similar expression profiles
![Page 6: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/6.jpg)
Class Comparison and Class Prediction
• Not clustering problems
• Supervised methods
![Page 7: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/7.jpg)
Class Prediction ≠ Class Comparison
• A set of genes is not a predictive model
• Emphasis in class comparison is often on understanding biological mechanisms
– More difficult than accurate prediction and usually requires a different experiment
• Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy
![Page 8: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/8.jpg)
Components of Class Prediction
• Feature (gene) selection
– Which genes will be included in the model
• Select model type
– E.g. Diagonal linear discriminant analysis, Nearest-Neighbor, …
• Fitting parameters (regression coefficients) for model
– Selecting value of tuning parameters
![Page 9: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/9.jpg)
Feature Selection
• Genes that are differentially expressed among the classes at a significance level (e.g. 0.01)
– The level is a tuning parameter
– Number of false discoveries is not of direct relevance for prediction
• For prediction it is usually more serious to exclude an informative variable than to include some noise variables
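As a rough illustration of univariate gene selection, the sketch below computes a pooled-variance t statistic for each gene and keeps the genes whose absolute value exceeds a cutoff. The function names and the `t_cutoff` parameter are assumptions made for this example (a cutoff on |t| stands in for the significance level), and the code assumes every gene has non-zero within-class variance.

```python
import math


def t_statistic(a, b):
    """Two-sample pooled-variance t statistic for two lists of numbers."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss = sum((v - mean_a) ** 2 for v in a) + sum((v - mean_b) ** 2 for v in b)
    pooled_var = ss / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def select_genes(X, y, t_cutoff):
    """Return indices of genes whose |t| between classes 0 and 1 exceeds t_cutoff.

    X is a list of samples, each a list of expression values (one per gene);
    y is a list of 0/1 class labels. t_cutoff plays the role of the
    significance level and is treated as a tuning parameter.
    """
    selected = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        if abs(t_statistic(a, b)) > t_cutoff:
            selected.append(g)
    return selected
```

A looser cutoff admits more noise genes but is less likely to drop an informative one, which is the trade-off the slide describes.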
![Page 10: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/10.jpg)
![Page 11: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/11.jpg)
Optimal significance level cutoffs for gene selection. 50 differentially expressed genes out of 22,000 genes on the microarrays

| 2δ/σ | n=10 | n=30 | n=50 |
| --- | --- | --- | --- |
| 1 | 0.167 | 0.003 | 0.00068 |
| 1.25 | 0.085 | 0.0011 | 0.00035 |
| 1.5 | 0.045 | 0.00063 | 0.00016 |
| 1.75 | 0.026 | 0.00036 | 0.00006 |
| 2 | 0.015 | 0.0002 | 0.00002 |
![Page 12: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/12.jpg)
Complex Gene Selection
• Small subset of genes which together give most accurate predictions
– Genetic algorithms
• Little evidence that complex feature selection is useful in microarray problems
![Page 13: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/13.jpg)
Linear Classifiers for Two Classes
$$l(x) = \sum_{i \in F} w_i x_i$$

• x = vector of log ratios or log signals
• F = features (genes) included in model
• w_i = weight for i'th feature
• decision boundary: l(x) > or < d
![Page 14: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/14.jpg)
Linear Classifiers for Two Classes
• Fisher linear discriminant analysis
• Diagonal linear discriminant analysis (DLDA)
– Ignores correlations among genes
• Compound covariate predictor
• Golub’s weighted voting method
• Support vector machines with inner product kernel
• Perceptrons
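Several of the methods listed above differ mainly in how the weights of the linear score are chosen. The following is a minimal sketch, assuming the usual textbook definitions: DLDA weights each gene by the class mean difference divided by the pooled variance, and the compound covariate predictor weights each gene by its t statistic. The names `fit_linear` and `predict` are hypothetical, and the threshold d is taken as the midpoint between the values of the score at the two class mean profiles.

```python
import math


def class_stats(X, y, genes):
    """Per-gene class means, pooled within-class variance and class sizes."""
    stats = {}
    for g in genes:
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
        ss = sum((v - mean_a) ** 2 for v in a) + sum((v - mean_b) ** 2 for v in b)
        var = ss / (len(a) + len(b) - 2)
        stats[g] = (mean_a, mean_b, var, len(a), len(b))
    return stats


def fit_linear(X, y, genes, method="dlda"):
    """Return (weights, d) for the score l(x) = sum of w_i * x_i over the genes.

    method="dlda": w_i = (class mean difference) / pooled variance.
    method="ccp":  w_i = t statistic (compound covariate predictor).
    """
    weights, l0, l1 = {}, 0.0, 0.0
    for g, (m0, m1, var, n0, n1) in class_stats(X, y, genes).items():
        if method == "dlda":
            w = (m1 - m0) / var
        else:
            w = (m1 - m0) / math.sqrt(var * (1 / n0 + 1 / n1))
        weights[g] = w
        l0 += w * m0   # score at the class 0 mean profile
        l1 += w * m1   # score at the class 1 mean profile
    return weights, (l0 + l1) / 2


def predict(x, weights, d):
    """Classify as class 1 when l(x) > d, otherwise class 0."""
    score = sum(w * x[g] for g, w in weights.items())
    return 1 if score > d else 0
```

Both variants ignore correlations among genes, so no covariance matrix has to be estimated or inverted.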
![Page 15: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/15.jpg)
When p>>n
• It is always possible to find a set of features and a weight vector for which the classification error on the training set is zero.
• There is generally not sufficient information in p>>n training sets to effectively use more complex methods
![Page 16: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/16.jpg)
Myth
• Complex classification algorithms such as neural networks perform better than simpler methods for class prediction.
![Page 17: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/17.jpg)
• Comparative studies have shown that simpler methods work as well or better for microarray problems because they avoid overfitting the data.
![Page 18: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/18.jpg)
Other Simple Methods
• Nearest neighbor classification
• Nearest k-neighbors
• Nearest centroid classification
• Shrunken centroid classification
![Page 19: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/19.jpg)
Evaluating a Classifier
• Most statistical methods were not developed for p>>n prediction problems
• Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data
• Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy
• Testing whether analysis of independent data results in selection of the same set of genes is not an appropriate test of predictive accuracy of a classifier
![Page 20: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/20.jpg)
![Page 21: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/21.jpg)
![Page 22: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/22.jpg)
Internal Validation of a Classifier
• Re-substitution estimate
– Develop classifier on dataset, test predictions on same data
– Very biased for p>>n
• Split-sample validation
• Cross-validation
![Page 23: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/23.jpg)
Split-Sample Evaluation
• Training-set
– Used to select features, select model type, determine parameters and cut-off thresholds
• Test-set
– Withheld until a single model is fully specified using the training-set.
– Fully specified model is applied to the expression profiles in the test-set to predict class labels.
– Number of errors is counted
![Page 24: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/24.jpg)
Leave-one-out Cross Validation
• Omit sample 1
– Develop multivariate classifier from scratch on training set with sample 1 omitted
– Predict class for sample 1 and record whether prediction is correct
![Page 25: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/25.jpg)
Leave-one-out Cross Validation
• Repeat analysis for training sets with each single sample omitted one at a time
• e = number of misclassifications determined by cross-validation
• Subdivide e for estimation of sensitivity and specificity
![Page 26: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/26.jpg)
• With proper cross-validation, the model must be developed from scratch for each leave-one-out training set. This means that feature selection must be repeated for each leave-one-out training set.
– Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the analysis of DNA microarray data. Journal of the National Cancer Institute 95:14-18, 2003.
• The cross-validated estimate of misclassification error is an estimate of the prediction error for the model obtained by applying the specified algorithm to the full dataset
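The requirement that everything be redone for each leave-one-out training set can be sketched as a loop that only ever calls a single model-building function. This is a minimal illustration, not a reference implementation: `loocv_errors`, `develop` and `develop_nearest_centroid` are hypothetical names, and the gene ranking by difference in class means is a stand-in for whatever selection rule is actually used.

```python
def loocv_errors(X, y, develop):
    """Count leave-one-out misclassifications.

    develop(X_train, y_train) must carry out the whole model-building
    procedure (gene selection, fitting, thresholds) and return a function
    that maps one expression profile to a predicted label.
    """
    errors = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        classifier = develop(X_train, y_train)  # rebuilt from scratch each time
        if classifier(X[i]) != y[i]:
            errors += 1
    return errors


def develop_nearest_centroid(X_train, y_train, n_genes=2):
    """Example develop() step: keep the n_genes with the largest absolute
    difference in class means, then classify by nearest centroid on them."""
    def centroid(label):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    c0, c1 = centroid(0), centroid(1)
    genes = sorted(range(len(c0)), key=lambda g: -abs(c0[g] - c1[g]))[:n_genes]

    def classify(x):
        d0 = sum((x[g] - c0[g]) ** 2 for g in genes)
        d1 = sum((x[g] - c1[g]) ** 2 for g in genes)
        return 0 if d0 <= d1 else 1

    return classify
```

Because the held-out specimen never reaches `develop`, it cannot influence which genes are selected.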
![Page 27: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/27.jpg)
Prediction on Simulated Null Data
Generation of Gene Expression Profiles
• 14 specimens ($P_i$ is the expression profile for specimen i)
• Log-ratio measurements on 6000 genes
• $P_i \sim \mathrm{MVN}(0, I_{6000})$
• Can we distinguish between the first 7 specimens (Class 1) and the last 7 (Class 2)?
Prediction Method
• Compound covariate prediction
• Compound covariate built from the log-ratios of the 10 most differentially expressed genes.
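The null experiment described above can be approximated in a few lines. The sketch below is an assumption-laden reimplementation, not the original simulation code: it draws independent standard normal log-ratios, ranks genes by pooled-variance t statistic, and uses a compound covariate predictor with the midpoint of the class mean scores as threshold. It reports error counts for resubstitution, for cross-validation performed after gene selection, and for cross-validation in which the specimen is removed prior to gene selection.

```python
import math
import random


def t_stat(col, y):
    """Pooled-variance t statistic (class 1 minus class 0) for one gene."""
    a = [v for v, label in zip(col, y) if label == 0]
    b = [v for v, label in zip(col, y) if label == 1]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss = sum((v - mean_a) ** 2 for v in a) + sum((v - mean_b) ** 2 for v in b)
    pooled = ss / (len(a) + len(b) - 2)
    return (mean_b - mean_a) / math.sqrt(pooled * (1 / len(a) + 1 / len(b)))


def top_genes(X, y, k):
    """Indices of the k genes with the largest |t|."""
    t = [t_stat(col, y) for col in zip(*X)]
    return sorted(range(len(t)), key=lambda g: -abs(t[g]))[:k]


def fit_ccp(X, y, genes):
    """Compound covariate predictor on the given genes; returns classify(x)."""
    w = {g: t_stat([row[g] for row in X], y) for g in genes}

    def score(x):
        return sum(w[g] * x[g] for g in genes)

    s0 = [score(x) for x, label in zip(X, y) if label == 0]
    s1 = [score(x) for x, label in zip(X, y) if label == 1]
    cut = (sum(s0) / len(s0) + sum(s1) / len(s1)) / 2
    return lambda x: 1 if score(x) > cut else 0


def simulate(n_genes=6000, k=10, seed=0):
    """Return error counts: (resubstitution, CV after selection, CV prior to selection)."""
    rng = random.Random(seed)
    y = [0] * 7 + [1] * 7
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in y]  # null data

    genes_all = top_genes(X, y, k)
    clf = fit_ccp(X, y, genes_all)
    resub = sum(clf(x) != label for x, label in zip(X, y))

    after = prior = 0
    for i in range(len(y)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        # Improper: genes were chosen using all specimens, including specimen i.
        after += fit_ccp(X_tr, y_tr, genes_all)(X[i]) != y[i]
        # Proper: specimen i is removed before genes are selected.
        prior += fit_ccp(X_tr, y_tr, top_genes(X_tr, y_tr, k))(X[i]) != y[i]
    return resub, after, prior
```

Calling `simulate()` for a range of seeds gives an empirical distribution of each error count. Since the classes are indistinguishable by construction, an honest estimate should hover around 7 errors out of 14.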
![Page 28: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/28.jpg)
[Figure: proportion of simulated data sets (vertical axis) plotted against number of misclassifications (horizontal axis, 0 to 20) for three procedures: no cross-validation (resubstitution method), cross-validation after gene selection, and cross-validation prior to gene selection.]
![Page 29: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/29.jpg)
![Page 30: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/30.jpg)
Major Flaws Found in 40 Studies Published in 2004
• Inadequate control of multiple comparisons in gene finding
– 9/23 studies had unclear or inadequate methods to deal with false positives
– 10,000 genes x .05 significance level = 500 false positives
• Misleading report of prediction accuracy
– 12/28 reports based on incomplete cross-validation
• Misleading use of cluster analysis
– 13/28 studies invalidly claimed that expression clusters based on differentially expressed genes could help distinguish clinical outcomes
• 50% of studies contained one or more major flaws
![Page 31: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/31.jpg)
Myth
• Split sample validation is superior to LOOCV or 10-fold CV for estimating prediction error
![Page 32: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/32.jpg)
![Page 33: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/33.jpg)
Comparison of Internal Validation Methods
Molinaro, Pfeiffer & Simon
• For small sample sizes, LOOCV is much less biased than split-sample validation
• For small sample sizes, LOOCV is preferable to 10-fold, 5-fold cross-validation or repeated k-fold versions
• For moderate sample sizes, 10-fold is preferable to LOOCV
• Some claims for bootstrap resampling for estimating prediction error are not valid for p>>n problems
![Page 34: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/34.jpg)
![Page 35: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/35.jpg)
Simulated Data
40 cases, 10 genes selected from 5000

| Method | Estimate | Std Deviation |
| --- | --- | --- |
| True | .078 | |
| Resubstitution | .007 | .016 |
| LOOCV | .092 | .115 |
| 10-fold CV | .118 | .120 |
| 5-fold CV | .161 | .127 |
| Split sample 1-1 | .345 | .185 |
| Split sample 2-1 | .205 | .184 |
| .632+ bootstrap | .274 | .084 |
![Page 36: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/36.jpg)
Simulated Data
40 cases

| Method | Estimate | Std Deviation |
| --- | --- | --- |
| True | .078 | |
| 10-fold | .118 | .120 |
| Repeated 10-fold | .116 | .109 |
| 5-fold | .161 | .127 |
| Repeated 5-fold | .159 | .114 |
| Split 1-1 | .345 | .185 |
| Repeated split 1-1 | .371 | .065 |
![Page 37: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/37.jpg)
DLBCL Data

| Method | Bias | Std Deviation | MSE |
| --- | --- | --- | --- |
| LOOCV | -.019 | .072 | .008 |
| 10-fold CV | -.007 | .063 | .006 |
| 5-fold CV | .004 | .07 | .007 |
| Split 1-1 | .037 | .117 | .018 |
| Split 2-1 | .001 | .119 | .017 |
| .632+ bootstrap | -.006 | .049 | .004 |
![Page 38: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/38.jpg)
![Page 39: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/39.jpg)
• Ordinary bootstrap
– Training and test sets overlap
• Bootstrap cross-validation (Fu, Carroll, Wang)
– Perform LOOCV on bootstrap samples
– Training and test sets overlap
• Leave-one-out bootstrap
– Predict for cases not in bootstrap sample
– Training sets are too small
• Out-of-bag bootstrap (Breiman)
– Predict for case i based on majority rule of predictions for bootstrap samples not containing case i
• .632+ bootstrap
– w*LOOBS+(1-w)RSB
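The weighted combination w*LOOBS+(1-w)RSB can be written out as a small function. This sketch assumes the weight defined by Efron and Tibshirani for the .632+ estimator, w = 0.632 / (1 - 0.368 R), where R is the relative overfitting rate; the function and argument names are hypothetical, and the no-information error rate is supplied by the caller rather than estimated here.

```python
def dot632_plus(resub_err, loo_boot_err, no_info_err):
    """Combine resubstitution (RSB) and leave-one-out bootstrap (LOOBS) errors.

    resub_err    : error rate on the data used to build the classifier
    loo_boot_err : leave-one-out bootstrap error rate
    no_info_err  : error rate expected if features and labels were unrelated
    """
    # Cap the bootstrap error at the no-information rate.
    loo = min(loo_boot_err, no_info_err)
    # Relative overfitting rate R, kept within [0, 1].
    if loo > resub_err and no_info_err > resub_err:
        r = (loo - resub_err) / (no_info_err - resub_err)
    else:
        r = 0.0
    w = 0.632 / (1.0 - 0.368 * r)
    return w * loo + (1.0 - w) * resub_err
```

With no overfitting (R = 0) the weight is 0.632, and as R approaches 1 the estimate moves entirely onto the leave-one-out bootstrap error.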
![Page 40: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/40.jpg)
![Page 41: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/41.jpg)
![Page 42: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/42.jpg)
![Page 43: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/43.jpg)
Permutation Distribution of Cross-validated Misclassification Rate of a Multivariate Classifier
• Randomly permute class labels and repeat the entire cross-validation
• Re-do for all (or 1000) random permutations of class labels
• Permutation p value is fraction of random permutations that gave as few misclassifications as e in the real data
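The permutation procedure above can be sketched as a wrapper around any function that performs the complete cross-validation. The names `permutation_p_value`, `cv_errors` and `loocv_nearest_centroid` are assumptions for this illustration; the nearest-centroid rule on all genes is only a placeholder for a real classifier with its own feature selection inside each fold.

```python
import random


def permutation_p_value(X, y, cv_errors, n_perm=1000, seed=0):
    """Permutation p-value for a cross-validated misclassification count.

    cv_errors(X, y) must run the entire cross-validation and return the
    number of misclassifications. Returns (observed errors, p-value).
    """
    rng = random.Random(seed)
    observed = cv_errors(X, y)
    as_few = 0
    for _ in range(n_perm):
        permuted = y[:]
        rng.shuffle(permuted)               # break any link between profile and class
        if cv_errors(X, permuted) <= observed:
            as_few += 1
    return observed, as_few / n_perm


def loocv_nearest_centroid(X, y):
    """Example cv_errors(): LOOCV of a nearest-centroid rule using all genes."""
    errors = 0
    for i in range(len(X)):
        train = [(x, l) for j, (x, l) in enumerate(zip(X, y)) if j != i]
        dist = {}
        for label in set(l for _, l in train):
            rows = [x for x, l in train if l == label]
            centroid = [sum(c) / len(rows) for c in zip(*rows)]
            dist[label] = sum((a - b) ** 2 for a, b in zip(X[i], centroid))
        if min(dist, key=dist.get) != y[i]:
            errors += 1
    return errors
```

A call such as `permutation_p_value(X, y, loocv_nearest_centroid, n_perm=1000)` repeats the whole cross-validation for every permutation, so the cost is roughly `n_perm` times that of a single run.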
![Page 44: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/44.jpg)
![Page 45: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/45.jpg)
![Page 46: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/46.jpg)
Does an Expression Profile Classifier Predict More Accurately Than Standard
Prognostic Variables?
• Not an issue of which variables are significant after adjusting for which others or which are independent predictors– Predictive accuracy, not significance
• The two classifiers can be compared by ROC analysis as functions of the threshold for classification
• The predictiveness of the expression profile classifier can be evaluated within levels of the classifier based on standard prognostic variables
![Page 47: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/47.jpg)
Does an Expression Profile Classifier Predict More Accurately Than Standard
Prognostic Variables?
• Some publications fit logistic model to standard covariates and the cross-validated predictions of expression profile classifiers
• This is valid only with split-sample analysis because the cross-validated predictions are not independent
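$$\operatorname{logit}(p_i) = \alpha + \beta\, y(x_i \mid -i) + \gamma z_i$$
where $y(x_i \mid -i)$ is the cross-validated prediction for case i from the classifier developed without case i, and $z_i$ denotes the standard covariates for case i.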
![Page 48: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/48.jpg)
Survival Risk Group Prediction
• For analyzing right censored data to develop predictive classifiers it is not necessary to make the data binary
• Can do cross-validation to predict high or low risk group for each case
• Compute Kaplan-Meier curves of predicted risk groups
• Permutation significance of log-rank statistic
• Implemented in BRB-ArrayTools
• BRB-ArrayTools also provides for comparing the risk group classifier based on expression profiles to one based on standard covariates and one based on a combination of both types of variables
![Page 49: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/49.jpg)
Myth
• Huge sample sizes are needed to develop effective predictive classifiers
![Page 50: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/50.jpg)
Sample Size Planning References
• K Dobbin, R Simon. Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics 6:27-38, 2005
• K Dobbin, R Simon. Sample size planning for developing classifiers using high dimensional DNA microarray data. Biostatistics (2007)
![Page 51: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/51.jpg)
Sample Size Planning for Classifier Development
• The expected value (over training sets) of the probability of correct classification PCC(n) should be within γ of the maximum achievable PCC(∞)
![Page 52: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/52.jpg)
Probability Model
• Two classes
• Log expression or log ratio MVN in each class with common covariance matrix
• m differentially expressed genes
• p-m noise genes
• Expression of differentially expressed genes is independent of expression of noise genes
• All differentially expressed genes have same inter-class mean difference 2δ
• Common variance for differentially expressed genes and for noise genes
![Page 53: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/53.jpg)
Classifier
• Feature selection based on univariate t-tests for differential expression at significance level α
• Simple linear classifier with equal weights (except for sign) for all selected genes. Power for selecting each of the informative genes that are differentially expressed by mean difference 2δ is 1-β(n)
![Page 54: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/54.jpg)
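• For 2 classes of equal prevalence, let $\lambda_1$ denote the largest eigenvalue of the covariance matrix of informative genes. Then
$$PCC(\infty) \geq \Phi\left(\frac{\delta\sqrt{m}}{\sqrt{\lambda_1}}\right)$$
where $\Phi$ is the standard normal distribution function, m is the number of informative genes and 2δ is their inter-class mean difference.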
![Page 55: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/55.jpg)
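$$PCC(n) \geq \Phi\left(\frac{\delta\sqrt{m}}{\sqrt{\lambda_1}} \cdot \frac{\sqrt{m}\,(1-\beta)}{\sqrt{m(1-\beta) + (p-m)\alpha}}\right)$$
where α is the significance level used for gene selection and 1-β is the power for selecting each informative gene, so that m(1-β) and (p-m)α are the expected numbers of informative and noise genes selected.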
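To make the sample-size reasoning concrete, here is a rough numerical sketch under simplifying assumptions that go beyond the slides: genes are taken to be independent with common variance (so the largest eigenvalue equals σ²), the power of each per-gene test is approximated with a normal rather than a t distribution, the significance level α is held fixed instead of being optimized for each n, and the random numbers of selected genes are replaced by their expected values. All function names are hypothetical.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()


def pcc_infinity(delta_over_sigma, m):
    """Best achievable PCC with all m informative genes and no noise genes."""
    return _NORMAL.cdf(delta_over_sigma * math.sqrt(m))


def pcc_n(n, delta_over_sigma, m, p, alpha):
    """Approximate PCC(n) for an equal-weight classifier built from genes
    selected at two-sided level alpha using n samples (n/2 per class)."""
    z = _NORMAL.inv_cdf(1 - alpha / 2)
    # Normal approximation to the power for a mean difference of 2*delta.
    power = _NORMAL.cdf(delta_over_sigma * math.sqrt(n) - z)
    informative = m * power        # expected informative genes selected
    noise = (p - m) * alpha        # expected noise genes selected
    return _NORMAL.cdf(delta_over_sigma * informative / math.sqrt(informative + noise))


def smallest_n(delta_over_sigma, m, p, alpha, gamma, n_max=500):
    """Smallest even n with PCC(infinity) - PCC(n) <= gamma, or None if not found."""
    target = pcc_infinity(delta_over_sigma, m) - gamma
    for n in range(4, n_max + 1, 2):
        if pcc_n(n, delta_over_sigma, m, p, alpha) >= target:
            return n
    return None
```

For example, `smallest_n(0.5, 50, 22000, 0.001, 0.05)` searches for the sample size at which the approximate PCC comes within 0.05 of its ceiling when 2δ/σ = 1. The values it returns depend on the approximations listed above and should not be read as reproducing the published tables.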
![Page 56: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/56.jpg)
[Figure: sample size (vertical axis, 40 to 100) plotted against 2 delta/sigma (horizontal axis, 1.0 to 2.0), with one curve for gamma=0.05 and one for gamma=0.10.]
Sample size as a function of effect size (log-base 2 fold-change between classes divided by standard deviation). Two different tolerances shown, γ = 0.05 and γ = 0.10. Each class is equally represented in the population. 22000 genes on an array.
![Page 57: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/57.jpg)
![Page 58: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/58.jpg)
b) PCC(60) as a function of the proportion in the under-represented class. Parameter settings same as a), with 10 differentially expressed genes among 22,000 total genes. If the proportion in the under-represented class is small (e.g., <20%), then the PCC(60) can decline significantly.
[Figure: PCC(60) (vertical axis, 0.75 to 0.85) plotted against proportion in under-represented class (horizontal axis, 0.1 to 0.5).]
![Page 59: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/59.jpg)
![Page 60: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/60.jpg)
![Page 61: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/61.jpg)
Acknowledgements
• Kevin Dobbin
• Alain Dupuy
• Wenyu Jiang
• Annette Molinaro
• Ruth Pfeiffer
• Michael Radmacher
• Joanna Shih
• Yingdong Zhao
• BRB-ArrayTools Development Team
![Page 62: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/62.jpg)
BRB-ArrayTools
• Contains analysis tools that I have selected as valid and useful
• Analysis wizard and multiple help screens for biomedical scientists
• Imports data from all platforms and major databases
• Automated import of data from NCBI Gene Expression Omnibus
![Page 63: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/63.jpg)
Predictive Classifiers in BRB-ArrayTools
• Classifiers
– Diagonal linear discriminant
– Compound covariate
– Bayesian compound covariate
– Support vector machine with inner product kernel
– K-nearest neighbor
– Nearest centroid
– Shrunken centroid (PAM)
– Random forest
– Tree of binary classifiers for k-classes
• Survival risk-group
– Supervised pc’s
• Feature selection options
– Univariate t/F statistic
– Hierarchical variance option
– Restricted by fold effect
– Univariate classification power
– Recursive feature elimination
– Top-scoring pairs
• Validation methods
– Split-sample
– LOOCV
– Repeated k-fold CV
– .632+ bootstrap
![Page 64: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/64.jpg)
Selected Features of BRB-ArrayTools
• Multivariate permutation tests for class comparison to control number and proportion of false discoveries with specified confidence level
– Permits blocking by another variable, pairing of data, averaging of technical replicates
• SAM
– Fortran implementation 7X faster than R versions
• Extensive annotation for identified genes
– Internal annotation of NetAffx, Source, Gene Ontology, Pathway information
– Links to annotations in genomic databases
• Find genes correlated with quantitative factor while controlling number or proportion of false discoveries
• Find genes correlated with censored survival while controlling number or proportion of false discoveries
• Analysis of variance
![Page 65: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/65.jpg)
Selected Features of BRB-ArrayTools
• Gene set enrichment analysis
– Gene Ontology groups, signaling pathways, transcription factor targets, micro-RNA putative targets
– Automatic data download from Broad Institute
– KS & LS test statistics for null hypothesis that gene set is not enriched
– Hotelling’s and Goeman’s Global test of null hypothesis that no genes in set are differentially expressed
– Goeman’s Global test for survival data
• Class prediction
– Multiple classifiers
– Complete LOOCV, k-fold CV, repeated k-fold, .632 bootstrap
– Permutation significance of cross-validated error rate
![Page 66: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/66.jpg)
Selected Features of BRB-ArrayTools
• Survival risk-group prediction
– Supervised principal components with and without clinical covariates
– Cross-validated Kaplan Meier Curves
– Permutation test of cross-validated KM curves
• Clustering tools for class discovery with reproducibility statistics on clusters
– Internal access to Eisen’s Cluster and Treeview
• Visualization tools including rotating 3D principal components plot exportable to Powerpoint with rotation controls
• Extensible via R plug-in feature
• Tutorials and datasets
![Page 67: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/67.jpg)
BRB-ArrayTools
• Extensive built-in gene annotation and linkage to gene annotation websites
• Publicly available for non-commercial use
– http://linus.nci.nih.gov/brb
![Page 68: Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch](https://reader035.vdocuments.mx/reader035/viewer/2022062320/56649d8c5503460f94a73053/html5/thumbnails/68.jpg)
BRB-ArrayTools
December 2006
• 6635 Registered users
• 1938 Distinct institutions
• 68 Countries
• 311 Citations