
Upload: tobias-stafford

Post on 24-Dec-2015

Page 1: Clustering and Classification In Gene Expression Data Carlo Colantuoni ccolantu@jhsph.edu Slide Acknowledgements: Elizabeth Garrett-Mayer, Rafael Irizarry,

Clustering and Classification In Gene Expression Data

Carlo Colantuoni

ccolantu@jhsph.edu

Slide Acknowledgements:

Elizabeth Garrett-Mayer, Rafael Irizarry, Giovanni Parmigiani, David Madigan, Kevin Coombs, Richard Simon, Ingo Ruczinski.

Classification based in part on Chapter 10 of Hand, Mannila, & Smyth and Chapter 7 of Han and Kamber

Data from Garber et al., PNAS (98), 2001.

Clustering

• Clustering is an exploratory tool to see who's running with whom: Genes and Samples.

• “Unsupervised”

• NOT for classification of samples.

• NOT for identification of differentially expressed genes.

Clustering

• Clustering organizes things that are close into groups.

• What does it mean for two genes to be close?

• What does it mean for two samples to be close?

• Once we know this, how do we define groups?

• Hierarchical and K-Means Clustering

Distance

• We need a mathematical definition of distance between two points

• What are points?

• If each gene is a point, what is the mathematical definition of a point?

Points

• Gene1= (E11, E12, …, E1N)’

• Gene2= (E21, E22, …, E2N)’

• Sample1= (E11, E21, …, EG1)’

• Sample2= (E12, E22, …, EG2)’

• Egi = expression of gene g in sample i

[Figure: DATA MATRIX, rows = genes 1, …, G; columns = samples 1, …, N]

Most Famous Distance

• Euclidean distance

– Example distance between gene 1 and 2:

– $d(\mathrm{Gene}_1, \mathrm{Gene}_2) = \sqrt{\sum_{i=1}^{N} (E_{1i} - E_{2i})^2}$

• When N is 2, this is distance as we know it:

[Figure: the distance between Baltimore and DC]

When N is 20,000 you have to think abstractly
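
As a minimal sketch of the Euclidean distance between two gene profiles (the function name and the toy expression values are illustrative assumptions, not taken from the slides), the square root of the summed squared differences can be computed as:

```python
import math

def euclidean_distance(profile_a, profile_b):
    """Square root of the summed squared differences across the N samples."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same number of samples")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))

# Hypothetical expression values for two genes measured in N = 2 samples.
gene1 = [0.0, 0.0]
gene2 = [3.0, 4.0]
print(euclidean_distance(gene1, gene2))  # 5.0
```

The same function works unchanged when each profile has 20,000 entries; only the geometric intuition is lost.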

Correlation can also be used to compute distance

• Pearson Correlation

• Spearman Correlation

• Uncentered Correlation

• Absolute Value of Correlation

The difference between centered and uncentered correlation is that, if two vectors X and Y have identical shape but are offset relative to each other by a fixed value, they will have a standard Pearson correlation (centered correlation) of 1 but will not have an uncentered correlation of 1.
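
A small numerical check of this statement, written as a sketch: the function names and the example vectors are assumptions, and the uncentered correlation is taken here to be the cosine of the angle between the raw vectors (that is, the Pearson formula without subtracting the means).

```python
import math

def uncentered(x, y):
    """Uncentered correlation: cosine of the angle between the raw vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    """Centered correlation: subtract each vector's mean, then compare."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return uncentered([v - mean_x for v in x], [v - mean_y for v in y])

x = [1.0, 2.0, 3.0, 4.0]
y = [v + 10.0 for v in x]   # identical shape, offset by a fixed value

r_centered = pearson(x, y)
r_uncentered = uncentered(x, y)
print(round(r_centered, 6), round(r_uncentered, 6))
```

Either correlation r can be turned into a distance in several ways; 1 - r is one common choice, and 1 - |r| corresponds to the absolute-value variant listed on the previous slide.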

The similarity/distance matrices

[Figure: the G × N DATA MATRIX (genes 1, …, G by samples 1, …, N) and the G × G GENE SIMILARITY MATRIX derived from it]

The similarity/distance matrices

[Figure: the G × N DATA MATRIX (genes 1, …, G by samples 1, …, N) and the N × N SAMPLE SIMILARITY MATRIX derived from it]
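
To make the shapes concrete, here is a sketch that builds both matrices from one small data matrix. The values are invented, Euclidean distance is an assumed choice, and the helper names are hypothetical. Rows of the data matrix are genes, so the gene matrix is G × G; transposing first gives the N × N sample matrix.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(points):
    """Symmetric matrix of pairwise distances between the given points."""
    return [[euclidean(p, q) for q in points] for p in points]

# Toy data matrix (assumed values): G = 3 genes (rows) by N = 2 samples (columns).
data = [
    [1.0, 2.0],
    [1.5, 1.8],
    [8.0, 9.0],
]

gene_dist = distance_matrix(data)                    # G x G
samples = [list(column) for column in zip(*data)]    # transpose: one row per sample
sample_dist = distance_matrix(samples)               # N x N

print(len(gene_dist), len(gene_dist[0]))      # 3 3
print(len(sample_dist), len(sample_dist[0]))  # 2 2
```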

Gene and Sample Selection

• Do you want all genes included?

• What to do about replicates from the same individual/tumor?

• Genes that contribute noise will affect your results.

• Including all genes: the whole dendrogram can’t be seen at the same time.

• Perhaps screen the genes?
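
The slides do not say how to screen genes. One simple possibility, shown here purely as an assumed illustration, is to keep only the genes whose expression varies most across samples. The function name, the variance criterion, and the data are all hypothetical.

```python
from statistics import pvariance

def screen_genes(data, n_keep):
    """Return the indices of the n_keep genes (rows) with the largest variance."""
    ranked = sorted(range(len(data)), key=lambda g: pvariance(data[g]), reverse=True)
    return sorted(ranked[:n_keep])

data = [
    [5.0, 5.1, 4.9, 5.0],   # nearly flat gene: mostly noise
    [1.0, 9.0, 1.0, 9.0],   # strongly varying gene
    [2.0, 4.0, 6.0, 8.0],   # moderately varying gene
]
kept = screen_genes(data, n_keep=2)
print(kept)  # [1, 2]
```

Because this filter never looks at class labels, it does not by itself leak label information into a later classifier.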

Two commonly seen clustering approaches in gene expression data analysis

• Hierarchical clustering
– Dendrogram (red-green picture)
– Allows us to cluster both genes and samples in one picture and see whole dataset “organized”

• K-means/K-medoids
– Partitioning method
– Requires user to define K = # of clusters a priori
– No picture to (over)interpret

Hierarchical Clustering

• The most overused statistical method in gene expression analysis

• Gives us pretty red-green picture with patterns

• But, pretty picture tends to be pretty unstable.

• Many different ways to perform hierarchical clustering

• Tend to be sensitive to small changes in the data

• Provided with clusters of every size: where to “cut” the dendrogram is user-determined

Choose clustering direction

• Agglomerative clustering (bottom-up)
– Starts with each gene in its own cluster
– Joins the two most similar clusters
– Then, joins next two most similar clusters
– Continues until all genes are in one cluster

• Divisive clustering (top-down)
– Starts with all genes in one cluster
– Chooses the split so that genes within each of the two clusters are most similar (maximizing the “distance” between clusters)
– Finds the next split in the same manner
– Continues until all genes are in single-gene clusters

Choose linkage method (if bottom-up)

• Single Linkage: join clusters whose distance between closest genes is smallest (tends to give elongated, elliptical clusters)

• Complete Linkage: join clusters whose distance between furthest genes is smallest (tends to give compact, spherical clusters)

• Average Linkage: join clusters whose average distance is the smallest.
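
The three linkage rules and the bottom-up merging loop can be sketched in a few lines. This is a deliberately naive version for illustration (it recomputes every cluster distance at each step); the function names, Euclidean distance, and the toy points are assumptions.

```python
import math
from itertools import combinations

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cluster_distance(points, cluster_a, cluster_b, linkage):
    """Distance between two clusters (lists of point indices) under a linkage rule."""
    pair_distances = [euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b]
    if linkage == "single":      # closest pair of members
        return min(pair_distances)
    if linkage == "complete":    # furthest pair of members
        return max(pair_distances)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, n_clusters, linkage="average"):
    """Start with singletons and merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(points, clusters[pair[0]], clusters[pair[1]], linkage),
        )
        clusters[a] = clusters[a] + clusters[b]   # a < b, so deleting b keeps a valid
        del clusters[b]
    return [sorted(c) for c in clusters]

genes = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]]
print(agglomerate(genes, n_clusters=3, linkage="single"))  # [[0, 1], [2, 3], [4]]
```

Stopping at `n_clusters` plays the role of cutting the dendrogram at a chosen height; running the loop down to one cluster and recording each merge would give the full tree.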

Dendrogram Creation + Interpretation

[Three figure-only slides with this title; images not preserved in the transcript]

Cluster Assignment

Simulated Data with 4 clusters: 1-10, 11-20, 21-30, 31-40

[Figure: two panels, “450 relevant genes + 450 ‘noise’ genes” and “450 relevant genes”]

K-means and K-medoids

• Partitioning Method

• Don’t get pretty picture

• MUST choose number of clusters K a priori

• More of a “black box” because output is most commonly looked at purely as assignments

• Each object (gene or sample) gets assigned to a cluster

• Begin with initial partition

• Iterate so that objects within clusters are most similar

K-means (continued)

• Euclidean distance most often used

• Tends to find spherical clusters.

• Can be hard to choose or figure out K.

• Solution is not unique: the clustering can depend on the initial partition

• No pretty figure to (over)interpret

K-means Algorithm

1. Choose K centroids at random
2. Make initial partition of objects into K clusters by assigning objects to closest centroid
3. Calculate the centroid (mean) of each of the K clusters.
4. For object i:
a. Calculate its distance to each of the centroids.
b. Allocate object i to the cluster with the closest centroid.
c. If the object was reallocated, recalculate centroids based on the new clusters.
5. Repeat step 4 for each object i = 1, …, N.
6. Repeat steps 4 and 5 until no reallocations occur.
7. Assess cluster structure for fit and stability
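
A compact sketch of the assign-then-recompute loop follows. Note the simplification: this is the common batch variant, which recomputes centroids once after all objects have been reassigned, rather than after every single reallocation as the slide describes. The function names, the optional `initial_centroids` argument, and the data are assumptions for illustration.

```python
import math
import random

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(members) for coords in zip(*members)]

def kmeans(points, k, initial_centroids=None, seed=0, max_iter=100):
    if initial_centroids is None:
        # Random start: pick k of the objects as starting centroids.
        initial_centroids = random.Random(seed).sample(points, k)
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assign every object to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_labels == labels:       # no reallocations occurred: stop
            break
        labels = new_labels
        # Recompute each centroid as the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                # an emptied cluster keeps its old centroid
                centroids[c] = centroid(members)
    return labels, centroids

points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]
labels, centroids = kmeans(points, k=2, initial_centroids=[[0.0, 0.0], [10.0, 10.0]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Calling `kmeans` with different `seed` values and no `initial_centroids` is a quick way to see whether the solution depends on the starting partition.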

K-means

• We start with some data

• Interpretation:
– We are showing expression for two samples for 14 genes
– We are showing expression for two genes for 14 samples

• This is with 2 genes.

Iteration = 0

K-means

• Choose K centroids

• These are starting values that the user picks.

• There are some data driven ways to do it

Iteration = 0

K-means

• Make first partition by finding the closest centroid for each point

• This is where distance is used

Iteration = 1

K-means

• Now re-compute the centroids by taking the middle of each cluster

Iteration = 2

K-means

• Repeat until the centroids stop moving or until you get tired of waiting

Iteration = 3

K-means Limitations

• Final results depend on starting values

• How do we choose K? There are methods but not much theory saying what is best.

• Where are the pretty pictures?

Assessing cluster fit and stability

• Most often ignored.

• Cluster structure is treated as reliable and precise

• Can be VERY sensitive to noise and to outliers

• Homogeneity and Separation

• Cluster Silhouettes: how similar each gene is to the genes in its own cluster compared with genes in other clusters (Rousseeuw, Journal of Computational and Applied Mathematics, 1987)

Silhouettes

• Silhouette of gene i is defined as:

• ai = average distance of gene i to the other genes in the same cluster

• bi = average distance of gene i to genes in its nearest neighbor cluster

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$
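
The silhouette s_i = (b_i - a_i) / max(a_i, b_i) can be computed directly from these definitions. The sketch below assumes Euclidean distance, at least two clusters, and the usual convention that a gene alone in its cluster gets s_i = 0; the function names and data are illustrative.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def silhouette(points, labels, i):
    """s_i = (b_i - a_i) / max(a_i, b_i); assumes at least two clusters."""
    own = labels[i]
    same = [j for j, lab in enumerate(labels) if lab == own and j != i]
    if not same:
        return 0.0   # singleton cluster: s_i is set to 0 by convention
    a_i = sum(euclidean(points[i], points[j]) for j in same) / len(same)
    b_i = math.inf
    for other in set(labels) - {own}:
        members = [j for j, lab in enumerate(labels) if lab == other]
        average = sum(euclidean(points[i], points[j]) for j in members) / len(members)
        b_i = min(b_i, average)   # keep the nearest neighboring cluster
    return (b_i - a_i) / max(a_i, b_i)

genes = [[0.0], [1.0], [10.0], [11.0]]
print(silhouette(genes, [0, 0, 1, 1], 0))   # well-clustered gene: close to 1
print(silhouette(genes, [0, 1, 0, 1], 0))   # poorly clustered gene: negative
```

Values near 1 indicate a gene that sits comfortably in its cluster, values near 0 a gene on the border, and negative values a gene that is closer to a neighboring cluster.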

WADP: Weighted Average Discrepant Pairs

• Add perturbations to original data

• Count the pairs of samples that cluster together in the original clustering but not in the perturbed one

• Repeat for every cutoff (i.e. for each k)

• Do iteratively

• Estimate for each k the proportion of discrepant pairs.
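
The core counting step can be sketched as follows. This covers only the comparison of one original clustering with one perturbed clustering; the perturbation itself, the repetition over many perturbations and cutoffs, and any weighting used in the full WADP score are not shown. The function name and the label vectors are hypothetical.

```python
from itertools import combinations

def discrepant_pair_fraction(original_labels, perturbed_labels):
    """Fraction of pairs co-clustered originally that are split after perturbation."""
    together = [
        (i, j)
        for i, j in combinations(range(len(original_labels)), 2)
        if original_labels[i] == original_labels[j]
    ]
    if not together:
        return 0.0
    split = sum(1 for i, j in together if perturbed_labels[i] != perturbed_labels[j])
    return split / len(together)

original = [0, 0, 0, 1, 1, 1]
perturbed = [0, 0, 1, 1, 1, 1]   # hypothetical labels after adding noise and re-clustering
print(discrepant_pair_fraction(original, perturbed))  # 2 of the 6 original pairs are split
```

Because only co-membership of pairs is compared, the result does not depend on how the cluster labels happen to be numbered in the two runs.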

Classification

• Diagnostic tests are good examples of classifiers
– A patient has a given disease or not
– The classifier is a machine that accepts some clinical parameters as input, and spits out a prediction for the patient
   • D
   • Not-D

• Classes must be mutually exclusive and exhaustive

Components of Class Prediction

• Select features (genes)
– Which genes will be included in the model

• Select type of classifier
– E.g. (D)LDA, SVM, k-Nearest-Neighbor, …

• Fit parameters for model (train the classifier)

• Quantify predictive accuracy: Cross-Validation

Feature Selection

• Goal is to identify a small subset of genes which together give accurate predictions.

• Methods will vary depending on nature of classification problem
– Choose genes with significant t-statistics to distinguish between two simple classes e.g.
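
As a sketch of t-statistic-based selection, the code below ranks genes by the absolute value of a two-sample t-statistic. The slides do not specify which form of the t-test is meant; the unequal-variance (Welch) form is assumed here, and only a ranking is produced, with no p-values or significance threshold. Names and data are illustrative.

```python
import math
from statistics import mean, variance

def t_statistic(group_a, group_b):
    """Welch two-sample t-statistic for one gene's expression in two classes."""
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

def rank_genes(data, labels, class_a, class_b):
    """Return gene indices (rows of data) sorted by decreasing |t| between the classes."""
    scores = []
    for g, row in enumerate(data):
        a = [x for x, lab in zip(row, labels) if lab == class_a]
        b = [x for x, lab in zip(row, labels) if lab == class_b]
        scores.append((abs(t_statistic(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)]

labels = ["D", "D", "D", "Not-D", "Not-D", "Not-D"]
data = [
    [5.0, 5.2, 4.8, 5.1, 4.9, 5.0],   # gene 0: no class difference
    [9.0, 9.5, 8.5, 2.0, 2.5, 1.5],   # gene 1: large class difference
]
print(rank_genes(data, labels, "D", "Not-D"))  # [1, 0]
```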

Classifier Selection

• In microarray classification, the number of features is (almost) always much greater than the number of samples.

• Overfitting is a distinct risk, and increases with more complicated methods.

How microarrays differ from the rest of the world

• Complex classification algorithms such as neural networks that perform better elsewhere don’t do as well as simpler methods for expression data.

• Comparative studies have shown that simpler methods work as well or better for microarray problems because the number of candidate predictors exceeds the number of samples by orders of magnitude.

(Dudoit, Fridlyand and Speed JASA 2001)

Statistical Methods Appropriate for Class Comparison may not be Appropriate for Class Prediction

• Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy.

• Demonstrating goodness of fit of a model to the data used to develop it is not a demonstration of predictive accuracy.

• Most statistical methods were not developed for p>>n prediction problems

Linear discriminant analysis

• If there are K classes, simply draw lines (planes) to divide the space of expression profiles into K regions, one for each class.

• If profile X falls in region k, predict class k.

Nearest Neighbor Classification

• To classify a new observation X, measure the distance d(X,Xi) between X and every sample Xi in training set

• Assign to X the class label of its “nearest neighbor” in the training set.
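
The rule above fits in a few lines. In this sketch the distance d is assumed to be Euclidean, and the function name, training profiles, and labels are invented for illustration.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_neighbor_predict(train_X, train_y, x):
    """Return the class label of the training sample closest to x."""
    distances = [euclidean(x, xi) for xi in train_X]
    nearest = min(range(len(train_X)), key=distances.__getitem__)
    return train_y[nearest]

train_X = [[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [7.9, 8.3]]
train_y = ["D", "D", "Not-D", "Not-D"]
print(nearest_neighbor_predict(train_X, train_y, [7.5, 7.0]))  # Not-D
```

The k-Nearest-Neighbor classifier mentioned earlier generalizes this by taking a majority vote among the k closest training samples instead of using only the single closest one.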

Random Forests

Build several “random” decision trees and have them vote to determine final classification

Evaluating a classifier

• Want to estimate the error rate when classifier is used to predict class of a new observation

• The ideal approach is to get a set of new observations, with known class label and see how frequently the classifier makes the correct prediction.

• Performance on the training set is a poor measure: it will underestimate the error rate.

• Cross validation methods are used to get less biased estimates of error using only the training data.

Split-Sample Evaluation

• Training-set
– Used to select features, select model type, determine parameters and cut-off thresholds

• Test-set
– Withheld until a single model is fully specified using the training-set.
– Fully specified model is applied to the expression profiles in the test-set to predict class labels.
– Number of errors is counted

V-fold cross validation

• Divide data into V groups.

• Hold one group back, train the classifier on other V-1 groups, and use it to predict the last one.

• Rotate through all V groups, holding each back.

• Error estimate is total error rate on all V test groups.

Leave-one-out Cross Validation

• Hold one data point back, train the classifier on other n-1 data points, and use it to predict the last one.

• Rotate through all n points, holding each back.

• Error estimate is total error rate on all n test values.
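
Leave-one-out is V-fold cross validation with V = n, so one helper can sketch both schemes. The 1-nearest-neighbor classifier, the unshuffled fold assignment, the function names, and the data below are all assumptions made to keep the example small and deterministic.

```python
def v_fold_groups(n, v):
    """Split indices 0..n-1 into v test groups (no shuffling, for a deterministic sketch)."""
    return [list(range(fold, n, v)) for fold in range(v)]

def fit_1nn(train_X, train_y):
    return train_X, train_y   # 1-NN simply memorizes the training set

def predict_1nn(model, x):
    train_X, train_y = model
    nearest = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[nearest]

def cross_validated_error(X, y, v):
    """Total error rate over all v held-out groups."""
    errors = 0
    for test_idx in v_fold_groups(len(X), v):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        model = fit_1nn(train_X, train_y)   # trained without the held-out group
        errors += sum(predict_1nn(model, X[i]) != y[i] for i in test_idx)
    return errors / len(X)

X = [[1.0], [1.1], [0.9], [5.0], [5.1], [4.9], [3.2]]
y = [0, 0, 0, 1, 1, 1, 0]   # the last sample lies closer to class 1 than to its own class
print(cross_validated_error(X, y, v=len(X)))  # leave-one-out
print(cross_validated_error(X, y, v=3))       # 3-fold
```

In practice the samples would usually be shuffled before being split into groups; that step is omitted here so the output is reproducible.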

Non-cross-validated Prediction

1. Prediction rule is built using full data set.
2. Rule is applied to each specimen for class prediction.

Cross-validated Prediction (Leave-one-out method)

1. Full data set is divided into training and test sets (test set contains 1 specimen).
2. Prediction rule is built from scratch using the training set.
3. Rule is applied to the specimen in the test set for class prediction.
4. Process is repeated until each specimen has appeared once in the test set.

[Figure: matrices of specimens by log-expression ratios, showing the full data set and its division into training set and test set]

Which to use depends mostly on sample size

• If the sample is large enough, split into test and train groups.

• If sample is barely adequate for either testing or training, use leave one out

• In between consider V-fold. This method can give more accurate estimates than leave one out, but reduces the size of training set.

Beware

• Cross-validation of a model cannot occur after selecting the genes to be used in the model: gene selection must be repeated inside every fold, using only that fold's training data

Incomplete (incorrect) Cross-Validation

• Publications are using all the data to select genes and then cross-validating only the parameter estimation component of model development
– Highly biased
– Many published complex methods which make strong claims based on incorrect cross-validation.

• Frequently seen in complex feature set selection algorithms

• Some software encourages inappropriate cross-validation
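
For contrast, a complete cross-validation redoes gene selection from scratch inside every fold, so the held-out specimen never influences which genes are chosen. The sketch below assumes a very simple selection rule (largest difference in class means), a 1-nearest-neighbor classifier, and made-up data with samples as rows; all names are hypothetical.

```python
def select_genes(train_X, train_y, n_genes):
    """Rank genes by absolute difference in class means, using training samples only."""
    classes = sorted(set(train_y))

    def class_mean(g, c):
        values = [x[g] for x, lab in zip(train_X, train_y) if lab == c]
        return sum(values) / len(values)

    ranked = sorted(
        range(len(train_X[0])),
        key=lambda g: abs(class_mean(g, classes[0]) - class_mean(g, classes[1])),
        reverse=True,
    )
    return ranked[:n_genes]

def predict_1nn(train_X, train_y, x, genes):
    """1-nearest-neighbor prediction restricted to the selected genes."""
    nearest = min(
        range(len(train_X)),
        key=lambda i: sum((train_X[i][g] - x[g]) ** 2 for g in genes),
    )
    return train_y[nearest]

def loocv_error(X, y, n_genes):
    """Leave-one-out error with gene selection repeated inside every fold."""
    errors = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        genes = select_genes(train_X, train_y, n_genes)   # held-out sample is never seen
        errors += predict_1nn(train_X, train_y, X[i], genes) != y[i]
    return errors / len(X)

# Rows are samples, columns are genes; only gene 0 separates the classes.
X = [[1.0, 5.0, 3.0], [1.2, 4.0, 3.1], [0.9, 6.0, 2.9],
     [4.0, 5.5, 3.0], [4.1, 4.5, 3.2], [3.9, 5.0, 2.8]]
y = ["D", "D", "D", "Not-D", "Not-D", "Not-D"]
print(loocv_error(X, y, n_genes=1))  # 0.0
```

The incorrect version would call `select_genes(X, y, n_genes)` once, outside the loop, and reuse that gene list in every fold.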

Gene-Expression Profiles in Hereditary Breast Cancer

• Breast tumors studied:
– 7 BRCA1+ tumors
– 8 BRCA2+ tumors
– 7 sporadic tumors

• Log-ratio measurements of 3226 genes for each tumor after initial data filtering

cDNA Microarrays
Parallel Gene Expression Analysis

RESEARCH QUESTION
Can we distinguish BRCA1+ from BRCA1– cancers and BRCA2+ from BRCA2– cancers based solely on their gene expression profiles?

BRCA1

g       # of significant genes   # of misclassified samples (m)   % of random permutations with m or fewer misclassifications
10^-2   182                      3                                0.4
10^-3   53                       2                                1.0
10^-4   9                        1                                0.2

BRCA2

g       # of significant genes   m = # of misclassified elements (misclassified samples)   % of random permutations with m or fewer misclassifications
10^-2   212                      4 (s11900, s14486, s14572, s14324)                        0.8
10^-3   49                       3 (s11900, s14486, s14324)                                2.2
10^-4   11                       4 (s11900, s14486, s14616, s14324)                        6.6
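
The last column of these tables is a permutation result: class labels are shuffled at random, the whole cross-validated classification is rerun, and the share of shuffles with m or fewer misclassifications is recorded. The sketch below shows that logic on invented one-dimensional data with a leave-one-out 1-nearest-neighbor rule. It is not the analysis behind the tables; the function names, the classifier, and the data are assumptions.

```python
import random

def permutation_fraction(X, y, count_errors, n_perm=200, seed=0):
    """Fraction of label permutations giving no more errors than the true labels."""
    rng = random.Random(seed)
    observed = count_errors(X, y)
    at_least_as_good = 0
    for _ in range(n_perm):
        shuffled = list(y)
        rng.shuffle(shuffled)   # break any real link between profile and class
        if count_errors(X, shuffled) <= observed:
            at_least_as_good += 1
    return at_least_as_good / n_perm

def loo_1nn_errors(X, y):
    """Leave-one-out misclassification count for a 1-nearest-neighbor rule."""
    errors = 0
    for i, x in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        nearest = min(others, key=lambda j: sum((a - b) ** 2 for a, b in zip(X[j], x)))
        errors += y[nearest] != y[i]
    return errors

X = [[1.0], [1.1], [0.9], [1.2], [5.0], [5.1], [4.9], [5.2]]
y = ["BRCA1+"] * 4 + ["BRCA1-"] * 4
m = loo_1nn_errors(X, y)
fraction = permutation_fraction(X, y, loo_1nn_errors)
print(m, fraction)
```

A small fraction suggests that the observed error count would be unlikely if expression profiles carried no information about class. In a full analysis, any gene selection step would have to be repeated inside `count_errors` for every permutation.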