
Page 1: Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops

www.bioinformatics.ca

Page 2: Canadian Bioinformatics Workshops

Module #: Title of Module

Page 3: Canadian Bioinformatics Workshops

Lecture 7: ML & Data Visualization & Microarrays

MBP1010

Dr. Paul C. Boutros
Winter 2015

DEPARTMENT OF MEDICAL BIOPHYSICS

This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others


Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)

Page 4: Canadian Bioinformatics Workshops


Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Machine-Learning
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Sequence Analysis
• Final Exam (written)

Page 5: Canadian Bioinformatics Workshops


House Rules
• Cell phones to silent

• No side conversations

• Hands up for questions

Page 6: Canadian Bioinformatics Workshops


Topics For This Week
• Machine-learning 101 (Briefly)

• Data visualization 101

• Attendance

• Microarrays 101

Page 7: Canadian Bioinformatics Workshops


# Load the cell-cycle data: first 50 genes, columns 3:19 (17 time points)
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])

# Euclidean distances between the gene expression profiles
D.cho <- dist(cho.data, method = "euclidean")

# Agglomerative hierarchical clustering with single linkage
hc.single <- hclust(D.cho, method = "single", members = NULL)

Example: cell cycle data
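A quick sanity check on the object just created (an added snippet, not on the slide; the dimensions follow from the subsetting above):

dim(cho.data)                  # 50 genes (rows) x 17 time points (columns)
summary(as.vector(cho.data))   # range of the log expression values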

Page 8: Canadian Bioinformatics Workshops


plot(hc.single)

Single linkage

Example: cell cycle data

Page 9: Canadian Bioinformatics Workshops


Careful with the interpretation of dendrograms: they suggest a proximity between elements that does not necessarily correlate with the actual distance between those elements! Compare, e.g., genes #1 and #47.

Example: cell cycle data
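One way to quantify this distortion (an added check, not on the slide): cophenetic() returns, for each pair of genes, the dendrogram height at which they are first joined, and correlating that with the original distances shows how faithfully the tree preserves them.

# Correlation between the original distances and the dendrogram's
# implied (cophenetic) distances; values well below 1 signal distortion
cor(D.cho, cophenetic(hc.single))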

Page 10: Canadian Bioinformatics Workshops


Single linkage, k=2

rect.hclust(hc.single,k=2)

Example: cell cycle data

Page 11: Canadian Bioinformatics Workshops


Single linkage, k=3

rect.hclust(hc.single,k=3)

Example: cell cycle data

Page 12: Canadian Bioinformatics Workshops


Single linkage, k=4

rect.hclust(hc.single,k=4)

Example: cell cycle data

Page 13: Canadian Bioinformatics Workshops


Single linkage, k=5

rect.hclust(hc.single,k=5)

Example: cell cycle data

Page 14: Canadian Bioinformatics Workshops


Single linkage, k=25

rect.hclust(hc.single,k=25)

Example: cell cycle data

Page 15: Canadian Bioinformatics Workshops



# Cut the dendrogram into k = 4 clusters
class.single <- cutree(hc.single, k = 4)

# Plot the expression profiles of each cluster in a 2x2 grid
par(mfrow = c(2, 2))
matplot(t(cho.data[class.single == 1, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l",
        xlab = "time", ylab = "log expression value")
# drop = FALSE keeps a one-row matrix when a cluster has a single member
matplot(t(cho.data[class.single == 3, , drop = FALSE]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 4, ]), type = "l",
        xlab = "time", ylab = "log expression value")

Properties of cluster members, single linkage, k=4

Example: cell cycle data
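A quick look at the cluster sizes (an added check, not on the slide) shows why the third panel needs the drop = FALSE guard: single linkage tends to peel off tiny clusters, and subsetting a one-member cluster would otherwise collapse to a vector.

table(class.single)   # number of genes in each of the four clusters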

Page 16: Canadian Bioinformatics Workshops


Single linkage, k=4: expression profiles of the four clusters (2x2 panel plot, panels 1-4)

Example: cell cycle data

Page 17: Canadian Bioinformatics Workshops


Complete linkage, k=4 vs. single linkage, k=4: side-by-side 2x2 panel plots of cluster expression profiles (panels 1-4)

Example: cell cycle data

Page 18: Canadian Bioinformatics Workshops


Hierarchical clustering analyzed

Advantages:
• Small clusters can be nested inside larger ones
• No need to specify the number of groups ahead of time
• Flexible linkage methods

Disadvantages:
• Clusters might not be naturally represented by a hierarchical structure
• It is necessary to 'cut' the dendrogram to produce the clusters
• Bottom-up clustering can result in poor structure at the top of the tree: early joins cannot be 'undone'

Page 19: Canadian Bioinformatics Workshops


Partitioning methods

• Anatomy of a partitioning-based method
  • data matrix
  • distance function
  • number of groups
• Output
  • group assignment of every object
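These three inputs and one output map directly onto R's kmeans() interface; a minimal sketch, assuming cho.data from the earlier slides (the distance function is implicitly Euclidean):

km <- kmeans(cho.data, centers = 4)   # data matrix + number of groups
head(km$cluster)                      # output: a group assignment for every object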

Page 20: Canadian Bioinformatics Workshops


Partitioning based methods

• Choose K groups
• Initialise group centers
  • a.k.a. centroids or medoids
• Assign each object to the nearest center according to the distance metric
• Reassign (or recompute) the centers
• Repeat the last two steps until the assignment stabilises

Page 21: Canadian Bioinformatics Workshops


K-means vs. K-medoids

K-means:
• Centroids are the 'mean' of the clusters
• Centroids need to be recomputed at every iteration
• Initialisation is difficult, as the notion of a centroid may be unclear before beginning
• R implementation: kmeans()

K-medoids:
• Centroids (medoids) are actual objects that minimize the total within-cluster distance
• The centroid can be determined by a quick lookup in the distance matrix
• Initialisation is simply K randomly selected objects
• R implementation: pam() (cluster package)
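For comparison, a minimal K-medoids sketch (an added illustration, assuming the cluster package is installed); pam() can work directly from the precomputed distance matrix D.cho:

library(cluster)
pam.cho <- pam(D.cho, k = 4)   # partition around 4 medoids
table(pam.cho$clustering)      # cluster sizes
pam.cho$id.med                 # indices of the genes chosen as medoids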

Page 22: Canadian Bioinformatics Workshops


Partitioning based methods

Advantages:
• The number of groups is well defined
• A clear, deterministic assignment of every object to a group
• Simple algorithms for inference

Disadvantages:
• You have to choose the number of groups
• Sometimes objects do not fit well into any cluster
• Can converge on locally optimal solutions and often require multiple restarts with random initialisations
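The local-optimum problem is easy to demonstrate (an added illustration, using cho.data from earlier): kmeans() takes an nstart argument, and under the same seed the best of 25 random starts is never worse, and usually better, than a single start.

set.seed(1)
kmeans(cho.data, centers = 4, nstart = 1)$tot.withinss
set.seed(1)
kmeans(cho.data, centers = 4, nstart = 25)$tot.withinss   # best of 25 random starts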

Page 23: Canadian Bioinformatics Workshops


Given N items and K clusters, the goal is to minimize the total within-cluster sum of squares

\[ W(c, \mu) = \sum_{k=1}^{K} \sum_{i \,:\, c_i = k} \lVert x_i - \mu_k \rVert^2 \]

over the possible assignments $c_1, \dots, c_N$ and centroids $\mu_1, \dots, \mu_K$, where $\mu_k$ represents the location of the $k$-th cluster.

K-means
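kmeans() reports exactly this objective (an added mapping, assuming cho.data from the earlier slides): withinss holds the inner sum for each cluster and tot.withinss is the minimized total.

set.seed(100)
km <- kmeans(cho.data, centers = 4)
km$withinss        # one within-cluster sum of squares per cluster
km$tot.withinss    # the objective above: sum(km$withinss)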

Page 24: Canadian Bioinformatics Workshops


1. Divide the data into K clusters; initialise the centroids with the mean of each cluster

2. Assign each item to the cluster with the closest centroid

3. When all objects have been assigned, recalculate the centroids (means)

4. Repeat steps 2-3 until the centroids no longer move

K-means
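As a concrete illustration of these four steps, a from-scratch sketch (not the lecture's code; it assumes Euclidean distance and does not guard against empty clusters):

lloyd <- function(x, k, max.iter = 100) {
    # Step 1: initialise centroids as k randomly chosen data points
    centroids <- x[sample(nrow(x), k), , drop = FALSE]
    assignment <- rep(0L, nrow(x))
    for (iter in seq_len(max.iter)) {
        # Step 2: assign every item to its closest centroid
        d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
        new.assignment <- max.col(-d)              # row-wise argmin of the distances
        # Step 4: stop once the assignment (hence the centroids) is stable
        if (all(new.assignment == assignment)) break
        assignment <- new.assignment
        # Step 3: recalculate each centroid as the mean of its cluster
        for (j in seq_len(k)) {
            centroids[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
        }
    }
    list(cluster = assignment, centers = centroids)
}

# Should broadly agree with kmeans(cho.data, 4)$cluster, up to label permutation
set.seed(100)
lloyd(cho.data, k = 4)$cluster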

Page 25: Canadian Bioinformatics Workshops


# Simulate two Gaussian clouds: 50 points around (0, 0) and 50 around (1, 1)
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Run K-means (K = 5, random initial centers) for 1, 2, then 3 iterations
# to watch the centroids converge
set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = 1)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)

set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = 2)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)

set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = 3)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)

K-means


Page 27: Canadian Bioinformatics Workshops


K-means, k=4


set.seed(100)
km.cho <- kmeans(cho.data, 4)

par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l",
        xlab = "time", ylab = "log expression value")

K-means

Page 28: Canadian Bioinformatics Workshops


K-means, k=4 vs. single linkage, k=4: side-by-side 2x2 panel plots of cluster expression profiles (panels 1-4)

K-means

Page 29: Canadian Bioinformatics Workshops


K-means and hierarchical clustering are simple, fast and useful techniques.

Beware of the memory requirements of hierarchical clustering: the full pairwise distance matrix grows as O(N²).

Both are a bit "ad hoc":
How many clusters?
Which distance metric?
What makes a good clustering?

Summary

Page 30: Canadian Bioinformatics Workshops


Meta-Analysis
• Combining the results of multiple studies that address related hypotheses
• Often used to merge data from different microarray platforms
• Very challenging: it is unclear what the best approaches are, or how they should be adapted to the peculiarities of microarray data

Page 31: Canadian Bioinformatics Workshops


Why Do Meta-Analysis?
• Can identify publication biases
• Appropriately weights diverse studies
  • sample size
  • experimental reliability
  • similarity of the study-specific hypotheses to the overall one
• Increases statistical power
• Reduces information
  • a single meta-analysis vs. five large studies
• Provides clearer guidance

Page 32: Canadian Bioinformatics Workshops


Challenges of Meta-Analysis
• No control for bias
  • What happens if most studies are poorly designed?
• File-drawer problem
  • Publication bias can be detected, but not explicitly controlled for
• How homogeneous is the data?
  • Can it be fairly grouped?
  • Simpson's Paradox

Page 33: Canadian Bioinformatics Workshops


Simpson’s Paradox

Group-wise correlations can be inverted when the groups are merged: a trend that holds within every group can reverse in the pooled data. A cautionary note for all meta-analyses!
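A small simulation makes the effect concrete (hypothetical data, not from the lecture): both groups show a strong positive x-y trend, but offsetting their means in opposite directions makes the pooled correlation negative.

set.seed(1)
x1 <- rnorm(50);      y1 <- x1 + rnorm(50, sd = 0.5)         # group 1: positive trend
x2 <- rnorm(50) + 4;  y2 <- (x2 - 8) + rnorm(50, sd = 0.5)   # group 2: positive trend, shifted down
cor(x1, y1)                  # strongly positive within group 1
cor(x2, y2)                  # strongly positive within group 2
cor(c(x1, x2), c(y1, y2))    # negative once the groups are pooled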

Page 34: Canadian Bioinformatics Workshops


Topics For This Week
• Machine-learning 101 (Focus: Unsupervised)

• Data visualization 101

Page 35: Canadian Bioinformatics Workshops


Topics For This Week
• Machine-learning 101 (Briefly)

• Data visualization 101

• Attendance

• Microarrays 101