MMG991 Session 7 - Michigan State University


MMG991 Session 7

• Non-hierarchical cluster analysis
  – Review fundamental concepts
  – As implemented in S-Plus
    • Microarray data
    • Other applications
  – Open discussion on implementation
• Selection of projects

Cluster analysis revisited

• Hierarchical methods
  – Goals
  – Agglomerative
  – Divisive
  – Unsupervised
  – Output
• Partitioning methods
  – Goals
  – k-means
  – pam, clara and fanny
  – Supervised (the number of groups is specified in advance)
• Selecting the number of groups
  – cutree()
  – Output

Cutting the tree

• cutree()
  – Returns a vector of group numbers for the objects clustered
  – Input: tree (output of hclust())
  – Height of cut (h) or number of groups (k)
• Visualizing the cuts
  – Currently no default plotting routine
  – So, what can we do?
    • Table of groupings
    • “Decorate” the tree
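As a rough illustration of what cutting a tree into k groups involves, here is a sketch in Python (not S-Plus). It assumes the tree is supplied as an hclust-style merge list in which a negative entry -i means observation i and a positive entry j means the cluster formed at step j. The names cut_tree and group_table are hypothetical, and the small merge list is made up for the example.

```python
from collections import Counter

def cut_tree(merge, k):
    """Return a group number (1..k) for each observation after cutting into k groups."""
    n = len(merge) + 1                 # n observations give n - 1 merge steps
    parent = list(range(n))            # union-find forest over observations

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    rep = []                           # one representative observation per merge step
    for step, (a, b) in enumerate(merge):
        ia = -a - 1 if a < 0 else rep[a - 1]
        ib = -b - 1 if b < 0 else rep[b - 1]
        rep.append(ia)
        if step < n - k:               # only the first n - k merges are applied
            parent[find(ib)] = find(ia)

    labels, groups = {}, []
    for i in range(n):                 # number groups in order of first appearance
        root = find(i)
        labels.setdefault(root, len(labels) + 1)
        groups.append(labels[root])
    return groups

def group_table(groups):
    """Table of groupings: group number -> number of objects."""
    return dict(sorted(Counter(groups).items()))

# Made-up tree on five observations
merge = [(-1, -2), (-4, -5), (1, -3), (2, 3)]
groups = cut_tree(merge, k=3)
print(groups, group_table(groups))
```

The table of groupings is then just a count per group number, which is the first of the two visualization options listed above.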

Setting up the example

# read the data; column 1 holds the row identifiers, tagged here with ".1a"
set1a <- read.table("set1a.txt", header=T, sep="\t")
set1a[,1] <- paste(as.character(set1a[,1]), "1a", sep=".")
row.names(set1a) <- set1a[,1]
# sort the rows by name and drop the identifier column
set1a <- set1a[sort(dimnames(set1a)[[1]]), -1]

# standardize each row: subtract the row mean, divide by the row standard deviation
set1a.norm <- (set1a - apply(set1a,1,mean)) / apply(set1a,1,stdev)
for(i in 1:ncol(set1a.norm))
    dimnames(set1a.norm)[[2]][i] <- paste("exp-", as.character(i), sep="")

# graphics window with two stacked panels
graphsheet()
par(mfcol=c(2,1))
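The row-wise standardization in the setup code can be sketched in Python as follows. This is an illustration only: normalize_rows is a hypothetical name, and it assumes the S-Plus stdev() call corresponds to the sample standard deviation (n - 1 in the denominator), which is what statistics.stdev computes.

```python
import statistics

def normalize_rows(matrix):
    """Standardize each row to mean 0 and sample standard deviation 1."""
    out = []
    for row in matrix:
        m = statistics.mean(row)
        s = statistics.stdev(row)      # sample standard deviation (n - 1)
        out.append([(v - m) / s for v in row])
    return out

# Two made-up "genes" measured in three experiments
demo = normalize_rows([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
print(demo)
```

After this step, rows with the same shape but different magnitudes become identical, so the clustering groups rows by expression pattern rather than by absolute level.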

Before

[Figure: two stacked panels sharing an x-axis of 0 to 300; upper panel y-axis 0 to 10, lower panel y-axis 0 to 50]

[Figure: two stacked panels; the upper panel is a tree with leaves numbered 1 to 50 and a height axis up to 25; the lower panel has an x-axis of 0 to 300 and a y-axis of 0 to 50]

Drawing the figures

set1a.clust <- hclust(dist(set1a.norm, met="euc"), met="aver")
set1a.clust <- clorder(set1a.clust, apply(set1a.norm, 1, mean))
set1a.plclust <- plclust(set1a.clust, labels=FALSE)
set1a.cutree <- cutree(set1a.clust, k=14)
temp <- cbind(set1a.plclust$x, set1a.plclust$y, col=as.vector(set1a.cutree))
for(i in 1:14)
    points(temp[temp[,3]==i,1], temp[temp[,3]==i,2], col=i, pch=16)

set1a.norm <- set1a.norm[set1a.clust$order,]
image(list(x=1:dim(set1a.norm)[1], y=1:dim(set1a.norm)[2], z=as.matrix(set1a.norm)))
image.legend(as.matrix(set1a.norm), x=nrow(set1a.norm)*1.066, y=ncol(set1a.norm)*1.05,
    size=c(.1, 2.55), hor=F, cex=0.66, mgp=c(0,0.25,0))
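The hclust() call above uses Euclidean distance with average linkage. A naive Python sketch of that agglomerative procedure is given below to show the logic. It is not the S-Plus implementation: average_linkage is a hypothetical name, cluster ids are numbered 0..n-1 for observations and n, n+1, ... for merged clusters (an assumed convention), and the points are made up.

```python
import math

def average_linkage(points):
    """Naive agglomerative clustering; returns a list of (id_a, id_b, distance) merges."""
    clusters = {i: [i] for i in range(len(points))}   # cluster id -> member indices
    next_id = len(points)
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                # average linkage: mean of all pairwise distances between the two clusters
                total = sum(math.dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)   # merge the closest pair
        merges.append((a, b, d))
        next_id += 1
    return merges

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
merges = average_linkage(pts)
for m in merges:
    print(m)
```

The merge distances play the role of the heights on the dendrogram; cutting at a given height undoes every merge above it.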

Second dimension

set1a.tclust <- hclust(dist(t(set1a.norm), met="euc"), met="aver")
set1a.tclust <- clorder(set1a.tclust, apply(t(set1a.norm), 1, mean))
set1a.tplclust <- plclust(set1a.tclust, labels=FALSE)
set1a.tcutree <- cutree(set1a.tclust, k=6)
temp <- cbind(set1a.tplclust$x, set1a.tplclust$y, col=as.vector(set1a.tcutree))
for(i in 1:6)
    points(temp[temp[,3]==i,1], temp[temp[,3]==i,2], col=i, pch=16)

set1a.norm <- set1a.norm[,set1a.tclust$order]
image(list(x=1:dim(set1a.norm)[1], y=1:dim(set1a.norm)[2], z=as.matrix(set1a.norm)))
image.legend(as.matrix(set1a.norm), x=nrow(set1a.norm)*1.066, y=ncol(set1a.norm)*1.05,
    size=c(.1, 2.55), hor=F, cex=0.66, mgp=c(0,0.25,0))

Gene clusters identified

[Figure: tree with a height axis of 0 to 10, above an image plot with x-axis 0 to 300, y-axis 0 to 50 and a legend from -3 to 3]

Experiment clusters identified

[Figure: tree with a height axis of 5 to 25, above an image plot with x-axis 0 to 300, y-axis 0 to 50 and a legend from -3 to 3]

k-means

• Objective
  – Partition observations into groups so as to minimize the within-group sum of squared distances (withinss)
  – Centroids
  – Requires a defined number of groups
  – Determining optimum number of groups
  – No graphical output
  – The classic example
    • Ruspini’s data set

kmeans(ruspini,4)

Centers:
            x        y
[1,] 98.17647 114.8824
[2,] 20.15000  64.9500
[3,] 43.91304 146.0435
[4,] 68.93333  19.4000

Clustering vector:
 [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[40] 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

Within cluster sum of squares:
[1] 4558.235 3689.500 3176.783 1456.533

Cluster sizes:
[1] 17 20 23 15

Available arguments:
[1] "cluster" "centers" "withinss" "size"
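To make the objective concrete, here is a minimal Python sketch of k-means that returns the same four components listed above (cluster, centers, withinss, size). It uses the plain Lloyd iteration (assign to nearest centroid, recompute centroids), which is an assumption for illustration and not necessarily the algorithm behind the S-Plus kmeans(). The function name, the iters and seed parameters, and the six test points are made up, and cluster numbers here start at 0.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd-style k-means on a list of coordinate tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # start from k distinct observations
    cluster = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid for every observation
        cluster = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # update step: centroid = coordinate-wise mean of the members
        new_centers = []
        for c in range(k):
            members = [p for p, g in zip(points, cluster) if g == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's old center
        if new_centers == centers:
            break                               # converged: centroids did not move
        centers = new_centers
    withinss = [sum(math.dist(p, centers[c]) ** 2
                    for p, g in zip(points, cluster) if g == c) for c in range(k)]
    size = [cluster.count(c) for c in range(k)]
    return {"cluster": cluster, "centers": centers, "withinss": withinss, "size": size}

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
res = kmeans(pts, 2)
print(res["size"], res["withinss"])
```

Because the result depends on the starting centroids, repeated runs with different seeds can give different partitions on less clearly separated data.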

Ruspini dataset

[Figure: scatter plot of ruspini$y (0 to 150) against ruspini$x (0 to 120)]

Ruspini dataset, k=4

[Figure: scatter plot of ruspini[, 2] (0 to 150) against ruspini[, 1] (0 to 120)]

Ruspini dataset, k=5

[Figure: scatter plot of ruspini[, 2] (0 to 150) against ruspini[, 1] (0 to 120)]

So, how many clusters are there?

• Hartigan’s recommendation
  – If
    (sum(k$withinss)/sum(kplus1$withinss) - 1) * (nrow(x) - k - 1) > 10
    then the addition of a group is justified
  – Here k$withinss and kplus1$withinss come from the kmeans fits with k and k+1 groups, and the k in nrow(x) - k - 1 is the number of groups
• Setting up a test…

kscore.ruspini <- as.list(2:21)
for(i in 2:20) {
    k <- kmeans(ruspini, i)
    kscore.ruspini[[i]] <- k$withinss
}
for(i in 2:19) {
    print((sum(kscore.ruspini[[i]])/sum(kscore.ruspini[[i+1]]) - 1) * (nrow(ruspini) - i - 1))
}

Test results for the Ruspini data. The loop runs over k = 2 to 19 and the bracketed labels run from 3 to 20, so each label is the number of groups after the addition (k+1):

[3] 53.74084
[4] 210.9672
[5] 8.920013
[6] 21.9182
[7] 13.646
[8] 10.26787
[9] 12.0679
[10] 6.488705
[11] 13.35045
[12] 9.935521
[13] 4.754963
[14] 7.834783
[15] 6.378573
[16] 3.886348
[17] 2.072307
[18] 4.197096
[19] 5.176718
[20] 4.354051
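Hartigan's rule reduces to a one-line calculation once the total within-group sums of squares are known. The Python sketch below shows that arithmetic; hartigan_index and choose_k are hypothetical names, the threshold of 10 comes from the rule as stated, and the sums of squares in the example are invented, not taken from the Ruspini run.

```python
def hartigan_index(wss_k, wss_k1, n, k):
    """(WSS_k / WSS_{k+1} - 1) * (n - k - 1) for n observations."""
    return (wss_k / wss_k1 - 1.0) * (n - k - 1)

def choose_k(wss_by_k, n, threshold=10.0):
    """Smallest k for which adding one more group is not justified."""
    ks = sorted(wss_by_k)
    for k in ks:
        if k + 1 not in wss_by_k:
            break                       # no fit with k + 1 groups to compare against
        if hartigan_index(wss_by_k[k], wss_by_k[k + 1], n, k) <= threshold:
            return k
    return ks[-1]

# Invented total within-group sums of squares for k = 2..5, n = 75 observations
wss = {2: 1000.0, 3: 400.0, 4: 100.0, 5: 95.0}
print(hartigan_index(wss[3], wss[4], 75, 3))   # index for going from 3 to 4 groups
print(choose_k(wss, 75))
```

Stopping at the first k that fails the rule is a design choice made here; since k-means fits depend on their starting points, the index can rise above 10 again at larger k, as the Ruspini output shows.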

k-means with array data

set1a.norm <- (set1a - apply(set1a,1,mean)) / apply(set1a,1,stdev)
set1a.norm <- set1a.norm[sort(dimnames(set1a.norm)[[1]]),]
set1a.kmeans <- kmeans(set1a.norm, 14)
gene.order <- cbind(dimnames(set1a)[[1]], set1a.kmeans$cluster)
gene.order <- gene.order[order(gene.order[,2]),1]
set1a.kmeans <- kmeans(t(set1a.norm), 6)
exp.order <- cbind(dimnames(set1a)[[2]], set1a.kmeans$cluster)
exp.order <- exp.order[order(exp.order[,2]),1]

# to visualize the output of the two analyses
temp <- set1a.norm[gene.order, exp.order]
image(list(x=1:dim(temp)[1], y=1:dim(temp)[2], z=as.matrix(temp)))
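The reordering step above (sort rows by gene cluster, sort columns by experiment cluster, then draw the image) can be sketched in Python as below. The names order_by_cluster and reorder are hypothetical, the matrix and labels are made up, and cluster labels are sorted numerically here.

```python
def order_by_cluster(labels):
    """Indices sorted by cluster label; ties keep their original order."""
    return sorted(range(len(labels)), key=lambda i: labels[i])

def reorder(matrix, row_labels, col_labels):
    """Rearrange a matrix so members of the same cluster are adjacent."""
    rows = order_by_cluster(row_labels)
    cols = order_by_cluster(col_labels)
    return [[matrix[r][c] for c in cols] for r in rows]

m = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
rearranged = reorder(m, row_labels=[2, 1, 2], col_labels=[1, 2, 1])
print(rearranged)
```

Only the grouping matters for the image; the order of the clusters themselves is arbitrary, since k-means cluster numbers carry no meaning.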

Kmeans, genes=14, exp=6

[Figure: image plot with x-axis 0 to 300, y-axis 0 to 50 and a legend from -3 to 3]

Optimum number of clusters

        Genes       Experiments
[3]     52.11781    82.84932
[4]     61.86213    111.1703
[5]     76.23372    52.30111
[6]     57.22758    46.1943
[7]     134.5601    46.47232
[8]     115.662     49.47469
[9]     111.2952    23.58832
[10]    89.475      39.69845
[11]    182.0612    28.4224
[12]    187.1214    28.27107
[13]    153.94      43.54086
[14]    316.5357    22.23373
[15]    8.525307    35.81718
[16]    4.961622    23.46791
[17]    6.483269    29.98056
[18]    6.932642    24.97635
[19]    3.641691    31.67716
[20]    3.420847    30.31866

Optimized k-means clustering

[Figure: image plot with x-axis 0 to 300, y-axis 0 to 50 and a legend from -3 to 3]

Six steps of cluster analysis

• Obtaining the data matrix
  – Test data set
• Standardizing the data matrix
  – Normalization
• Computing the resemblance matrix
  – Similarity
  – Dissimilarity
  – Distance
  – Other measures
• Clustering the data
• Rearranging the data matrix
• Goodness of fit
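The third step, computing the resemblance matrix, is sketched below in Python for two common distance measures. This is only an illustration: resemblance_matrix, euclidean and manhattan are hypothetical names and the two rows are made up.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences between two rows."""
    return sum(abs(x - y) for x, y in zip(a, b))

def resemblance_matrix(rows, measure=euclidean):
    """Full symmetric matrix of pairwise dissimilarities between rows."""
    n = len(rows)
    return [[measure(rows[i], rows[j]) for j in range(n)] for i in range(n)]

rows = [(0, 0), (3, 4)]
print(resemblance_matrix(rows))
print(resemblance_matrix(rows, measure=manhattan))
```

The matrix has one entry per pair of objects, which is why methods that need it in full have memory requirements that grow quadratically with the number of objects.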

Partitioning around medoids (pam())

• Similar to k-means
  – Utilizes medoids rather than centroids
  – More robust
    • Minimizes the sum of dissimilarities rather than the sum of squared Euclidean distances
  – Provides graphical output to evaluate clustering
    • Silhouette plots
      – Denote number of clusters, cluster width and quality
      – Ranked in decreasing order
      – Overall average silhouette width
        » Heuristics
  – pam(x, k, diss=F, metric="euclidean", stand=F, save.x=T, save.diss=T)
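The silhouette width behind these plots can be computed directly from the distances. For object i, with a the average distance to the other members of its own cluster and b the smallest average distance to the members of any other cluster, the width is (b - a) / max(a, b). The Python sketch below follows that standard definition; silhouette_widths is a hypothetical name, singleton clusters are given width 0 by convention, and the four points are made up.

```python
import math

def silhouette_widths(points, labels):
    """Silhouette width s(i) = (b - a) / max(a, b) for every observation."""
    widths = []
    for i, p in enumerate(points):
        by_cluster = {}                           # cluster label -> distances from p
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            widths.append(0.0)                    # singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)                   # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())   # nearest other cluster
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return widths

pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
w = silhouette_widths(pts, [0, 0, 1, 1])
avg_width = sum(w) / len(w)
print(round(avg_width, 4))
```

Widths near 1 indicate objects well inside their cluster, widths near 0 objects between clusters, and negative widths objects that are probably in the wrong cluster.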

pam() with array data

set1a.pam <- pam(set1a.norm, 14)
set1a.tpam <- pam(t(set1a.norm), 4)
gene.order <- cbind(dimnames(set1a.norm)[[1]], set1a.pam$clustering)
gene.order <- gene.order[order(gene.order[,2]),1]
exp.order <- cbind(dimnames(set1a.norm)[[2]], set1a.tpam$clustering)
exp.order <- exp.order[order(exp.order[,2]),1]
temp <- set1a.norm[gene.order, exp.order]
image(list(x=1:dim(temp)[1], y=1:dim(temp)[2], z=as.matrix(temp)))
image.legend(as.matrix(temp), x=nrow(temp)*1.075, y=ncol(temp)*1.05,
    size=c(.125, 6.1), hor=F, cex=0.66, tck=-0.01, mgp=c(0,0.5,0))

Silhouette plot, grouped by gene, k=14

[Figure: silhouette plot; x-axis: silhouette width, 0.0 to 1.0]

Average silhouette width : 0.83

Silhouette plot, grouped by expt, k=4

[Figure: silhouette plot; x-axis: silhouette width, -0.2 to 1.0]

Average silhouette width : 0.29

DNA array data by pam()

[Figure: image plot with x-axis 0 to 300, y-axis 0 to 50 and a legend from -3 to 3]

Clustering large applications

• Clara
  – Optimized version of pam() for large data sets
  – Limitations of k-means and pam()
    • Memory requirements are quadratic
  – Algorithm works with subsets
    • Divides each subset into k clusters
    • Remaining objects assigned to clusters
    • Subsequent iterations forced to contain currently best medoids
  – clara(x, k, metric="euclidean", stand=F, samples=5, sampsize=40 + 2 * k, save.x=T, save.diss=T)
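The subset strategy listed above can be sketched in Python as follows. This is a toy version under stated assumptions: the medoids of each sample are found by exhaustive search rather than by pam(), so it is only practical for small k and small samples. The defaults samples=5 and sampsize=40 + 2 * k mirror the call signature shown, while the function names, the seed parameter and the six points are made up.

```python
import math
import random
from itertools import combinations

def total_cost(points, medoids):
    """Sum over all objects of the distance to the nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def clara(points, k, samples=5, sampsize=None, seed=0):
    rng = random.Random(seed)
    n = len(points)
    if sampsize is None:
        sampsize = min(n, 40 + 2 * k)
    best_medoids, best_cost = None, math.inf
    for _ in range(samples):
        sample = set(rng.sample(range(n), sampsize))
        if best_medoids is not None:
            sample.update(best_medoids)        # later samples must contain the best medoids
        sample = sorted(sample)

        def sample_cost(meds):
            return sum(min(math.dist(points[i], points[m]) for m in meds) for i in sample)

        # stand-in for pam() on the sample: try every set of k medoids (toy sizes only)
        medoids = min(combinations(sample, k), key=sample_cost)
        cost = total_cost(points, medoids)     # judged on ALL objects, not just the sample
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    # remaining objects are assigned to the nearest of the best medoids
    clustering = [min(range(k), key=lambda c: math.dist(p, points[best_medoids[c]]))
                  for p in points]
    return best_medoids, clustering, best_cost

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, clustering, cost = clara(pts, k=2, sampsize=4)
print(medoids, clustering, cost)
```

Only distances within a sample, plus one pass over all objects per sample, are ever computed, which avoids holding the full dissimilarity matrix in memory.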

Silhouette plots

[Figure: two silhouette plots; x-axes: silhouette width, 0.0 to 1.0 and -0.2 to 1.0]

Average silhouette width : 0.84
Average silhouette width : 0.31

DNA array data by clara()

[Figure: image plot with x-axis 0 to 300, y-axis 0 to 50 and a legend from -3 to 3]

Silhouette plots

[Figure: two silhouette plots; x-axes: silhouette width, 0.0 to 1.0]

Average silhouette width : 0.77
Average silhouette width : 0.22

DNA array data by fanny()

[Figure: image plot with x-axis 0 to 300, y-axis 0 to 50 and a legend from -3 to 3]

Summing up

• Cluster analysis provides a means of organizing the data based on common features
• Different algorithms may arrive at different solutions
• Homework for next week
  – Comparing the output of hierarchical and partition methods
    • Use Eisen’s test data
      – Which genes consistently group together?
      – Which experiments consistently group together?
  – Projects
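One possible way to approach the "consistently group together" questions is to look for pairs of objects that share a group in every clustering being compared. The Python sketch below shows the idea; consistent_pairs is a hypothetical helper, and the gene names and group vectors are invented rather than taken from Eisen's data.

```python
from itertools import combinations

def consistent_pairs(names, *labelings):
    """Pairs of objects placed in the same group by every labeling."""
    pairs = []
    for i, j in combinations(range(len(names)), 2):
        if all(lab[i] == lab[j] for lab in labelings):
            pairs.append((names[i], names[j]))
    return pairs

names = ["g1", "g2", "g3", "g4"]
from_tree = [1, 1, 2, 2]        # e.g. group numbers from cutting a tree
from_partition = [1, 1, 1, 2]   # e.g. group numbers from a partitioning method
together = consistent_pairs(names, from_tree, from_partition)
print(together)
```

Comparing pairs rather than group numbers sidesteps the fact that the numbering of groups is arbitrary and differs between methods.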