MMG991 Session 7 - Michigan State University
MMG991 Session 7
• Non-hierarchical cluster analysis
  – Review fundamental concepts
  – As implemented in S-Plus
    • Microarray data
    • Other applications
  – Open discussion on implementation
• Selection of projects
Cluster analysis revisited
• Hierarchical methods
  – Goals
  – Agglomerative
  – Divisive
  – Unsupervised
  – Output
• Partitioning methods
  – Goals
  – k-means
  – pam, clara and fanny
  – Supervised
• Selecting the number of groups
  – cutree()
  – Output
Cutting the tree
• cutree()
  – Returns a vector of group numbers for the objects clustered
  – Input: tree (output of hclust())
  – Height of cut (h) or number of groups (k)
• Visualizing the cuts
  – Currently no default plotting routine
  – So, what can we do?
    • Table of groupings
    • “Decorate” the tree
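The "table of groupings" idea can be sketched in R, a close relative of S-Plus. This is only an illustration under stated assumptions: the data are made up (three tight groups of five objects), and the names toy, groups.k and groups.h do not come from the slides.

```r
# Toy data, made up for illustration: three tight groups of five objects each
set.seed(1)
toy <- rbind(matrix(rnorm(10, mean = 0,  sd = 0.1), ncol = 2),
             matrix(rnorm(10, mean = 5,  sd = 0.1), ncol = 2),
             matrix(rnorm(10, mean = 10, sd = 0.1), ncol = 2))
rownames(toy) <- paste("obj", 1:15, sep = "")

toy.clust <- hclust(dist(toy, method = "euclidean"), method = "average")

# Cut the tree by number of groups (k) or by height of cut (h)
groups.k <- cutree(toy.clust, k = 3)
groups.h <- cutree(toy.clust, h = 2)

# Table of groupings: number of objects per group
group.table <- table(groups.k)
print(group.table)

# Cross-tabulate the two cuts to check that they agree
print(table(by.k = groups.k, by.h = groups.h))
```

Cutting by k is convenient when the number of groups is known in advance; cutting by h is closer to reading the tree by eye.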
Setting up the example
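# read the tab-delimited data and build unique row names from the first column
set1a<-read.table("set1a.txt", header=T, sep="\t")
set1a[,1]<-paste(as.character(set1a[,1]),"1a",sep=".")
row.names(set1a)<-set1a[,1]
# sort rows by name and drop the name column
set1a<-set1a[sort(dimnames(set1a)[[1]]),-1]
# normalize each row: subtract the row mean, divide by the row standard deviation
set1a.norm<-(set1a-apply(set1a,1,mean))/apply(set1a,1,stdev)
# rename the columns exp-1, exp-2, ...
for(i in 1:ncol(set1a.norm))
    dimnames(set1a.norm)[[2]][i]<-paste("exp-", as.character(i), sep="")
# open a graphics window with two stacked panels
graphsheet()
par(mfcol=c(2,1))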
Before
[Figure: two stacked panels, tree plot above heat map; heat map x axis 0 to 300, y axis 0 to 50]
Drawing the figures
# cluster the genes (rows) and reorder the tree by row mean
set1a.clust<-hclust(dist(set1a.norm, met="euc"), met="aver")
set1a.clust<-clorder(set1a.clust, apply(set1a.norm, 1, mean))
# plot the tree, then mark each leaf with the color of its group
set1a.plclust<-plclust(set1a.clust, labels=FALSE)
set1a.cutree<-cutree(set1a.clust, k=14)
temp<-cbind(set1a.plclust$x, set1a.plclust$y, col=as.vector(set1a.cutree))
for(i in 1:14)
    points(temp[temp[,3]==i,1], temp[temp[,3]==i,2], col=i, pch=16)
# reorder the rows to match the tree and draw the heat map with its legend
set1a.norm<-set1a.norm[set1a.clust$order,]
image(list(x=1:dim(set1a.norm)[1], y=1:dim(set1a.norm)[2], z=as.matrix(set1a.norm)))
image.legend(as.matrix(set1a.norm), x=nrow(set1a.norm)*1.066, y=ncol(set1a.norm)*1.05,
    size=c(.1, 2.55), hor=F, cex=0.66, mgp=c(0,0.25,0))
Second dimension
# cluster the experiments (columns) by transposing the matrix
set1a.tclust<-hclust(dist(t(set1a.norm), met="euc"), met="aver")
set1a.tclust<-clorder(set1a.tclust, apply(t(set1a.norm), 1, mean))
# plot the tree, then mark each leaf with the color of its group
set1a.tplclust<-plclust(set1a.tclust, labels=FALSE)
set1a.tcutree<-cutree(set1a.tclust, k=6)
temp<-cbind(set1a.tplclust$x, set1a.tplclust$y, col=as.vector(set1a.tcutree))
for(i in 1:6)
    points(temp[temp[,3]==i,1], temp[temp[,3]==i,2], col=i, pch=16)
# reorder the columns to match the tree and draw the heat map with its legend
set1a.norm<-set1a.norm[,set1a.tclust$order]
image(list(x=1:dim(set1a.norm)[1], y=1:dim(set1a.norm)[2], z=as.matrix(set1a.norm)))
image.legend(as.matrix(set1a.norm), x=nrow(set1a.norm)*1.066, y=ncol(set1a.norm)*1.05,
    size=c(.1, 2.55), hor=F, cex=0.66, mgp=c(0,0.25,0))
Gene clusters identified
[Figure: tree plot above heat map; heat map x axis 0 to 300, y axis 0 to 50, color legend -3 to 3]
Experiment clusters identified
[Figure: tree plot above heat map; heat map x axis 0 to 300, y axis 0 to 50, color legend -3 to 3]
k-means
• Objective
  – Partition observations into groups so as to minimize the within-group sum of squared distances (withinss)
  – Centroids
  – Requires a defined number of groups
  – Determining optimum number of groups
  – No graphical output
  – The classic example
    • Ruspini’s data set
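The within-group sum of squares that k-means minimizes can be recomputed by hand from the fitted groups. A minimal sketch in R, assuming made-up data with two well-separated groups; note that R's kmeans() takes the number of groups as centers=, and nstart=10 is an added choice to make the fit stable.

```r
# Toy data, made up for illustration: two groups of 20 points in two dimensions
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))
fit <- kmeans(x, centers = 2, nstart = 10)

# Recompute withinss: squared distances of each member to its group centroid
withinss.manual <- sapply(1:2, function(g) {
  members  <- x[fit$cluster == g, , drop = FALSE]
  centroid <- colMeans(members)
  sum(sweep(members, 2, centroid)^2)
})
print(cbind(kmeans = fit$withinss, manual = withinss.manual))
```

The two columns should agree, which shows that withinss is nothing more than the summed squared distance of each object to the centroid of its own group.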
kmeans(ruspini,4)
Centers:
            x        y
[1,] 98.17647 114.8824
[2,] 20.15000  64.9500
[3,] 43.91304 146.0435
[4,] 68.93333  19.4000

Clustering vector:
 [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[40] 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

Within cluster sum of squares:
[1] 4558.235 3689.500 3176.783 1456.533

Cluster sizes:
[1] 17 20 23 15

Available arguments:
[1] "cluster"  "centers"  "withinss" "size"
Ruspini dataset
[Figure: scatter plot; x axis: ruspini$x (0 to 120), y axis: ruspini$y (0 to 150)]
Ruspini dataset, k=4
[Figure: scatter plot; x axis: ruspini[, 1] (0 to 120), y axis: ruspini[, 2] (0 to 150)]
Ruspini dataset, k=5
[Figure: scatter plot; x axis: ruspini[, 1] (0 to 120), y axis: ruspini[, 2] (0 to 150)]
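So, how many clusters are there?
• Hartigan’s recommendation
  – If (sum(k$withinss)/sum(kplus1$withinss)-1)*(nrow(x)-k-1) > 10,
    where k$withinss and kplus1$withinss come from the kmeans() fits with k and k+1 groups,
  – then the addition of a group is justified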
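• Setting up a test…
# store the within-group sums of squares for 2 to 20 groups
kscore.ruspini<-as.list(2:21)
for(i in 2:20){
    k<-kmeans(ruspini, i)
    kscore.ruspini[[i]]<-k$withinss
}
# Hartigan's statistic for going from i to i+1 groups
for(i in 2:19){
    print((sum(kscore.ruspini[[i]])/sum(kscore.ruspini[[i+1]])-1)*(nrow(ruspini)-i-1))
}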
[3] 53.74084
[4] 210.9672
[5] 8.920013
[6] 21.9182
[7] 13.646
[8] 10.26787
[9] 12.0679
[10] 6.488705
[11] 13.35045
[12] 9.935521
[13] 4.754963
[14] 7.834783
[15] 6.378573
[16] 3.886348
[17] 2.072307
[18] 4.197096
[19] 5.176718
[20] 4.354051
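Reading the labels above as the larger group count, the statistic first falls below 10 at [5], which would suggest four groups for the Ruspini data. The same test can be wrapped in a small function. This is a sketch in R rather than S-Plus, on made-up data; the name hartigan.stat and the nstart argument are assumptions, not part of the slides.

```r
# Hartigan's statistic for going from k to k+1 groups.
# nstart is an added assumption to make kmeans() less sensitive to its random start.
hartigan.stat <- function(x, k, nstart = 25) {
  fit.k     <- kmeans(x, centers = k,     nstart = nstart)
  fit.kplus <- kmeans(x, centers = k + 1, nstart = nstart)
  (sum(fit.k$withinss) / sum(fit.kplus$withinss) - 1) * (nrow(x) - k - 1)
}

# Toy data, made up for illustration: three groups of 20 points
set.seed(7)
toy <- rbind(matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 5,  sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2))

stats <- sapply(2:4, function(k) hartigan.stat(toy, k))
names(stats) <- paste(2:4, "->", 3:5)
print(stats)
```

A value above 10 for "2 -> 3" says that the third group is worth adding; values for later steps are compared against the same threshold.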
k-means with array data
set1a.norm<-(set1a - apply(set1a,1,mean))/apply(set1a,1,stdev)
set1a.norm<-set1a.norm[sort(dimnames(set1a.norm)[[1]]),]
# cluster the genes (rows) into 14 groups and order them by group
set1a.kmeans<-kmeans(set1a.norm, 14)
gene.order<-cbind(dimnames(set1a)[[1]], set1a.kmeans$cluster)
gene.order<-gene.order[order(gene.order[,2]),1]
# cluster the experiments (columns) into 6 groups and order them by group
set1a.kmeans<-kmeans(t(set1a.norm), 6)
exp.order<-cbind(dimnames(set1a)[[2]], set1a.kmeans$cluster)
exp.order<-exp.order[order(exp.order[,2]),1]
# to visualize the output of the two analyses
temp<-set1a.norm[gene.order, exp.order]
image(list(x=1:dim(temp)[1], y=1:dim(temp)[2], z=as.matrix(temp)))
Kmeans, genes=14, exp=6
[Figure: heat map; x axis 0 to 300, y axis 0 to 50, color legend -3 to 3]
Optimum number of clusters
        Genes       Experiments
[3]     52.11781    82.84932
[4]     61.86213    111.1703
[5]     76.23372    52.30111
[6]     57.22758    46.1943
[7]     134.5601    46.47232
[8]     115.662     49.47469
[9]     111.2952    23.58832
[10]    89.475      39.69845
[11]    182.0612    28.4224
[12]    187.1214    28.27107
[13]    153.94      43.54086
[14]    316.5357    22.23373
[15]    8.525307    35.81718
[16]    4.961622    23.46791
[17]    6.483269    29.98056
[18]    6.932642    24.97635
[19]    3.641691    31.67716
[20]    3.420847    30.31866
Optimized k-means clustering
[Figure: heat map; x axis 0 to 300, y axis 0 to 50, color legend -3 to 3]
Six steps of cluster analysis
• Obtaining the data matrix
  – Test data set
• Standardizing the data matrix
  – Normalization
• Computing the resemblance matrix
  – Similarity
  – Dissimilarity
  – Distance
  – Other measures
• Clustering the data
• Rearranging the data matrix
• Goodness of fit
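The six steps can be walked through end to end on a small simulated matrix. This sketch uses R rather than S-Plus; the data, the object names, and the choice of the cophenetic correlation as the goodness-of-fit measure are all assumptions made for illustration.

```r
set.seed(3)

# 1. Obtain the data matrix (simulated: 12 "genes" by 6 "experiments")
dat <- matrix(rnorm(72), nrow = 12,
              dimnames = list(paste("gene", 1:12, sep = ""),
                              paste("exp-", 1:6, sep = "")))
pattern <- c(-3, -3, -3, 3, 3, 3)
dat[1:6, ]  <- sweep(dat[1:6, ],  2, pattern, "+")   # genes 1-6 follow the pattern
dat[7:12, ] <- sweep(dat[7:12, ], 2, pattern, "-")   # genes 7-12 follow its mirror image

# 2. Standardize the data matrix: center and scale each row
dat.norm <- (dat - apply(dat, 1, mean)) / apply(dat, 1, sd)

# 3. Compute the resemblance matrix (Euclidean distance between rows)
d <- dist(dat.norm, method = "euclidean")

# 4. Cluster the data
cl <- hclust(d, method = "average")

# 5. Rearrange the data matrix to follow the tree order
dat.sorted <- dat.norm[cl$order, ]

# 6. Goodness of fit: cophenetic correlation (one possible measure)
coph <- cor(as.vector(d), as.vector(cophenetic(cl)))
print(round(coph, 3))
```

A cophenetic correlation close to 1 means the tree heights reproduce the original distances well; other goodness-of-fit measures could be substituted in step 6.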
Partitioning around medoids (pam())
• Similar to k-means
  – Utilizes medoids rather than centroids
  – More robust
    • Minimizes the sum of dissimilarities rather than the sum of squared Euclidean distances
  – Provides graphical output to evaluate clustering
    • Silhouette plots
      – Denote number of clusters, cluster width and quality
      – Ranked in decreasing order
      – Overall average silhouette width
        » Heuristics
  – pam(x, k, diss=F, metric="euclidean", stand=F, save.x=T, save.diss=T)
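The silhouette width behind these plots can be computed directly from a dissimilarity matrix and a grouping. The sketch below is plain R and does not call pam(); the function name silhouette.widths and the six-point data set are made up for illustration. For object i, a is the mean dissimilarity to the other members of its own group, b is the smallest mean dissimilarity to any other group, and the width is (b - a) / max(a, b).

```r
# Silhouette width of every object, given a dist object (or matrix) and a grouping.
# Assumes at least two groups; objects in singleton groups get width 0.
silhouette.widths <- function(d, clustering) {
  d <- as.matrix(d)
  n <- nrow(d)
  groups <- sort(unique(clustering))
  s <- numeric(n)
  for (i in 1:n) {
    own <- clustering == clustering[i]
    own[i] <- FALSE                       # exclude the object itself
    if (!any(own)) { s[i] <- 0; next }
    a <- mean(d[i, own])
    others <- groups[groups != clustering[i]]
    b <- min(sapply(others, function(g) mean(d[i, clustering == g])))
    s[i] <- (b - a) / max(a, b)
  }
  s
}

# Six points on a line, forming two obvious groups
x <- c(0, 0.1, 0.2, 10, 10.1, 10.2)
good <- silhouette.widths(dist(x), c(1, 1, 1, 2, 2, 2))
bad  <- silhouette.widths(dist(x), c(1, 2, 1, 2, 1, 2))
print(round(c(good = mean(good), bad = mean(bad)), 3))
```

The average width is what the plots report as "Average silhouette width"; a grouping that matches the structure of the data scores close to 1, and a poor grouping scores much lower.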
pam() with array data
# partition the genes into 14 groups and the experiments into 4 groups
set1a.pam<-pam(set1a.norm,14)
set1a.tpam<-pam(t(set1a.norm),4)
# order the genes and experiments by group
gene.order<-cbind(dimnames(set1a.norm)[[1]], set1a.pam$clustering)
gene.order<-gene.order[order(gene.order[,2]),1]
exp.order<-cbind(dimnames(set1a.norm)[[2]], set1a.tpam$clustering)
exp.order<-exp.order[order(exp.order[,2]),1]
# draw the reordered heat map with its legend
temp<-set1a.norm[gene.order, exp.order]
image(list(x=1:dim(temp)[1], y=1:dim(temp)[2], z=as.matrix(temp)))
image.legend(as.matrix(temp), x=nrow(temp)*1.075, y=ncol(temp)*1.05,
    size=c(.125, 6.1), hor=F, cex=0.66, tck=-0.01, mgp=c(0,0.5,0))
Silhouette plot, grouped by gene, k=14
[Figure: silhouette plot; x axis: silhouette width, 0.0 to 1.0]
Average silhouette width: 0.83
Silhouette plot, grouped by expt, k=4
[Figure: silhouette plot; x axis: silhouette width, -0.2 to 1.0]
Average silhouette width: 0.29
DNA array data by pam()
[Figure: heat map; x axis 0 to 300, y axis 0 to 50, color legend -3 to 3]
Clustering large applications
• Clara
  – Optimized version of pam()
  – Limitations of k-means and pam()
    • Memory requirements are quadratic
  – Algorithm works with subsets
    • Divides data into k clusters
    • Remaining objects assigned to clusters
    • Subsequent iterations forced to contain currently best medoids
  – clara(x, k, metric="euclidean", stand=F, samples=5, sampsize=40 + 2 * k, save.x=T, save.diss=T)
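The subset idea can be sketched in a few lines of R. This is not the library's clara(): the function clara.sketch is hypothetical, and inside each subset it finds medoids by exhaustive search rather than by pam(), which is only practical for small subsets and small k. It does follow the three bullets above: cluster a subset, assign the remaining objects to the nearest medoid, and force later subsets to contain the best medoids so far.

```r
# Simplified CLARA-style sketch (hypothetical helper, not cluster's clara()).
clara.sketch <- function(x, k = 2, samples = 5, sampsize = 10) {
  x <- as.matrix(x)
  n <- nrow(x)
  best <- list(cost = Inf, medoids = integer(0), clustering = NULL)
  for (s in 1:samples) {
    # Draw a subset; later subsets are forced to contain the best medoids so far
    idx <- union(best$medoids, sample(n, sampsize))
    d.sub <- as.matrix(dist(x[idx, , drop = FALSE]))   # only sampsize^2 memory
    # Exhaustive search over all sets of k medoids within the subset
    cand <- combn(length(idx), k)
    cost.sub <- apply(cand, 2, function(m) sum(apply(d.sub[, m, drop = FALSE], 1, min)))
    medoids <- idx[cand[, which.min(cost.sub)]]
    # Assign every object, not just the subset, to its nearest medoid
    d.all <- sapply(medoids, function(m) sqrt(rowSums(sweep(x, 2, x[m, ])^2)))
    cost <- sum(apply(d.all, 1, min))
    if (cost < best$cost)
      best <- list(cost = cost, medoids = medoids,
                   clustering = apply(d.all, 1, which.min))
  }
  best
}

# Toy data, made up for illustration: two groups of 50 points
set.seed(11)
toy <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(100, mean = 6, sd = 0.5), ncol = 2))
fit <- clara.sketch(toy, k = 2, samples = 5, sampsize = 10)
print(fit$medoids)
print(table(fit$clustering))
```

The full n by n dissimilarity matrix is never built, which is the point of working with subsets when memory requirements would otherwise be quadratic.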
Silhouette plots
[Figure: silhouette plot; x axis: silhouette width, 0.0 to 1.0]
Average silhouette width: 0.84
[Figure: silhouette plot; x axis: silhouette width, -0.2 to 1.0]
Average silhouette width: 0.31
DNA array data by clara()
[Figure: heat map; x axis 0 to 300, y axis 0 to 50, color legend -3 to 3]
Silhouette plots
[Figure: silhouette plot; x axis: silhouette width, 0.0 to 1.0]
Average silhouette width: 0.77
[Figure: silhouette plot; x axis: silhouette width, 0.0 to 1.0]
Average silhouette width: 0.22
DNA array data by fanny()
[Figure: heat map; x axis 0 to 300, y axis 0 to 50, color legend -3 to 3]
Summing up
• Cluster analysis provides a means of organizing the data based on common features
• Different algorithms may arrive at different solutions
• Homework for next week
  – Comparing the output of hierarchical and partitioning methods
    • Use Eisen’s test data
      – Which genes consistently group together?
      – Which experiments consistently group together?
  – Projects