
Bio277 Lab 2: Clustering and Classification of Microarray Data

Jess Mar

Department of Biostatistics

Quackenbush Lab DFCI

[email protected]


Machine Learning

Machine learning algorithms predict new classes based on patterns discerned from existing data.

Classification algorithms are a form of supervised learning.

Clustering algorithms are a form of unsupervised learning.

Goal: derive a rule (classifier) that assigns a new object (e.g. patient microarray profile) to a pre-specified group (e.g. aggressive vs non-aggressive prostate cancer).
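As a minimal sketch of what such a rule can look like, the R code below builds a nearest-centroid classifier on simulated data. The data, the group names, and the helper `nearest_centroid` are hypothetical illustrations and are not part of the lab material.

```r
# Simulated training data: 20 samples (rows) x 5 genes (columns),
# split into two pre-specified groups.
set.seed(1)
train <- rbind(matrix(rnorm(50, mean = 0), nrow = 10),
               matrix(rnorm(50, mean = 3), nrow = 10))
groups <- factor(rep(c("nonaggressive", "aggressive"), each = 10))

# The learned "rule": one mean expression profile (centroid) per group.
# Result is a genes x groups matrix with the group names as column names.
centroids <- sapply(levels(groups), function(g) {
  colMeans(train[groups == g, , drop = FALSE])
})

# Hypothetical helper: assign a new profile to the group whose centroid
# is closest in squared Euclidean distance.
nearest_centroid <- function(x, centroids) {
  d <- colSums((centroids - x)^2)
  names(which.min(d))
}

new_profile <- rep(3, 5)   # a new sample resembling the second group
nearest_centroid(new_profile, centroids)
```

The labels in `groups` are what make this supervised: the rule is fitted to known classes and then applied to an unlabeled profile.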


The Golub Data

Golub et al. published gene expression microarray data in a 1999 Science paper entitled "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring".

The primary focus of their paper was to demonstrate a class discovery procedure that could assign tumors to either acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML).

Bioconductor has this (pre-processed) data packaged up in golubEsets.

> library(golubEsets)

> library(help=golubEsets)


Some Clustering Algorithms for Array Data

[Figure: the gene expression data matrix, with entries x_{1,1}, x_{2,1}, ..., x_{N_G,N_E} for N_G genes and N_E experiments; axes labelled "Genes" and "Experiments or Microarray Slides".]

Hierarchical Methods:

Single, Average, Complete Linkage plus other variations.

Partitioning Methods:

Self-Organising Maps (Kohonen)

K-Means Clustering

Gene shaving (Hastie, Tibshirani et al.)

Model based clustering

Plaid models (Lazzeroni & Owen)
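To make the partitioning idea concrete, here is a small sketch of K-means using base R's `kmeans` on simulated data. The matrix `expr`, its dimensions, and the choice of k = 3 are assumptions made for illustration only.

```r
set.seed(42)
# Simulated expression matrix: 60 genes (rows) x 6 slides (columns),
# built from three groups of 20 genes with different mean levels.
expr <- rbind(matrix(rnorm(120, mean = -4), ncol = 6),
              matrix(rnorm(120, mean =  0), ncol = 6),
              matrix(rnorm(120, mean =  4), ncol = 6))
rownames(expr) <- paste0("gene", seq_len(nrow(expr)))

# Partition the genes into k = 3 clusters. nstart = 25 reruns the
# algorithm from 25 random starts and keeps the best solution.
km <- kmeans(expr, centers = 3, nstart = 25)

table(km$cluster)   # cluster sizes
km$centers          # mean profile of each cluster
```

Note that k has to be supplied up front; with real array data it is usually unknown and several values are typically compared.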


Cluster Analysis

Hierarchical Methods:

(Agglomerative, Divisive) + (Single, Average, Complete) Linkage…

Model-based Methods:

Mixed models. Plaid models. Mixture models…

A clustering problem is generally much harder than a classification problem because neither the class labels nor the number of classes is known in advance.

Clustering genes on the basis of experiments or across a time series.

Elucidate unknown gene function.

Clustering slides on the basis of genes.

Discover subclasses in tissue samples.


Hierarchical Clustering

[Figure: dendrogram. Agglomerative clustering runs from n genes in n clusters to n genes in 1 cluster; divisive clustering runs in the opposite direction.]

We join (or break) nodes based on the notion of maximum (or minimum) ‘similarity’.

Euclidean distance

(Pearson) correlation

Source: J-Express Manual
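The two similarity measures can behave quite differently. The following sketch, on simulated data with hypothetical gene names, compares Euclidean distance with a correlation-based distance (1 minus the Pearson correlation) for two genes that share the same expression pattern at different scales.

```r
set.seed(7)
# Simulated data: 4 genes (rows) x 8 slides (columns).
expr <- matrix(rnorm(32), nrow = 4,
               dimnames = list(paste0("gene", 1:4), NULL))
# gene2 has exactly the same shape as gene1, but a different scale and offset.
expr["gene2", ] <- 2 * expr["gene1", ] + 5

# dist() works on rows, so genes are compared with each other.
d_euclid <- dist(expr, method = "euclidean")

# cor() works on columns, hence the transpose; 1 - r turns similarity
# into a dissimilarity that is 0 for perfectly correlated profiles.
d_cor <- as.dist(1 - cor(t(expr)))

as.matrix(d_euclid)["gene1", "gene2"]   # large: the levels differ
as.matrix(d_cor)["gene1", "gene2"]      # essentially 0: the shapes agree
```

Correlation-based distance is often preferred when the pattern across slides matters more than the absolute expression level.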


Different Ways to Determine Distances Between Clusters

Single linkage

Complete linkage

Average linkage
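A small worked example may help here. The sketch below uses five made-up points on a line, so the between-cluster distance under each linkage can be checked by hand, and then passes the same distances to base R's `hclust`. The point names and values are purely illustrative.

```r
# Five points on a line; names are kept as labels by dist().
x <- c(a = 0, b = 1, c = 3, d = 7, e = 8)
d <- dist(x)

# All pairwise distances between cluster {a, b} and cluster {d, e}.
pair <- as.matrix(d)[c("a", "b"), c("d", "e")]

min(pair)    # single linkage: closest pair (b to d)   -> 6
max(pair)    # complete linkage: farthest pair (a to e) -> 8
mean(pair)   # average linkage: mean over all four pairs -> 7

# The same three rules applied during agglomerative clustering.
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")

# Height of the final merge differs by linkage.
max(hc_single$height)
max(hc_complete$height)
```

Single linkage tends to chain clusters together, while complete linkage favours compact clusters; average linkage sits in between.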


Implementing Hierarchical Clustering

Agglomerative hierarchical clustering with the function agnes:

> library(cluster)   # provides agnes()

> colnames(eset.filt) <- classLabels

> plot(agnes(dist(t(eset.filt), method="euclidean")))


Principal Component Analysis

Dimension-reduction tool, closely related to classical multi-dimensional scaling. See GC's lectures for a more in-depth treatment.

In our Golub data set, PCA will take the filtered data (558 genes x 72 samples) and map each sample vector (ALL or AML) from 558 dimensions to 2 dimensions.

> pca.samples <- princomp(eset.filt)

> plot(pca.samples)
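One way to obtain a two-dimensional coordinate for each sample is sketched below on simulated data. It swaps in `prcomp` on the transposed matrix, because `princomp` requires at least as many observations as variables and so cannot take 72 samples as rows with 558 genes as columns. The matrix `expr`, its size, and the class shift are assumptions for illustration and are not the lab's `eset.filt`.

```r
set.seed(3)
# Simulated stand-in for a filtered matrix: 100 genes (rows) x 12 samples (columns).
labels <- rep(c("ALL", "AML"), each = 6)
expr <- matrix(rnorm(100 * 12), nrow = 100)
# Shift the first 30 genes in the AML samples to mimic class-specific expression.
expr[1:30, labels == "AML"] <- expr[1:30, labels == "AML"] + 3
colnames(expr) <- labels

# prcomp() treats rows as observations, so samples must be rows: transpose.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# First two principal component scores: each sample as a point in 2 dimensions.
scores <- pca$x[, 1:2]

plot(scores, col = ifelse(labels == "ALL", "blue", "red"), pch = 19,
     xlab = "PC1", ylab = "PC2")
```

In this simulated case the two classes fall on opposite sides of the first component; with real data the separation is usually less clean.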


Principal Components


Classification Example: Support Vector Machine

For this example we will use data from Golub et al.

• 47 patients with ALL, 25 patients with AML

• 7129 genes from an Affymetrix Hu6800 array, but we'll take a subset for this example.

> library(MLInterfaces) ; library(golubEsets)

> library(e1071)

> data(golubMerge)

To fit the support vector machine:

> model <- svm(classLabels[1:40]~., data=t(eset.train))


Visualizing the SVM

What predictions were made for the test set?

> predLabels <- predict(model, t(eset.test))

> predLabels

ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL AML AML AML AML AML AML AML AML AML ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL AML AML AML AML AML AML AML AML AML AML AML AML AML
Levels: ALL AML

How do these stack up to the true classification?

> trueLabels <- classLabels[41:72]

> table(predLabels, trueLabels)

          trueLabels
predLabels ALL AML
       ALL  21   0
       AML   0  11
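A confusion table like this can be summarised numerically. The sketch below re-enters the four counts shown above by hand (it does not rerun the SVM) and computes overall accuracy and per-class recall; the object names are hypothetical.

```r
# Counts copied from the confusion table: rows are predictions, columns are truth.
conf <- matrix(c(21,  0,
                  0, 11),
               nrow = 2, byrow = TRUE,
               dimnames = list(predLabels = c("ALL", "AML"),
                               trueLabels = c("ALL", "AML")))

# Overall accuracy: correctly classified samples over all test samples.
accuracy <- sum(diag(conf)) / sum(conf)

# Recall per class: correct predictions divided by the true class total.
recall <- diag(conf) / colSums(conf)

accuracy
recall
```

All 32 test samples fall on the diagonal here, so both summaries equal 1. With a test set this small, that figure should be read as an optimistic estimate rather than a guarantee.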


More Materials, More Labs?

Lecture Topics Covered Since Last Lab:

Hypothesis Testing of Differentially Expressed Genes

Gene Set Enrichment

Clustering

Classification

Support Vector Machines

Tutorial: Bioconductor Tour