Flat clustering approaches
Colin Dewey
BMI/CS 576, [email protected]
Fall 2015
Flat clustering
• Cluster objects/genes/samples into K clusters
• In the following, we will consider clustering genes based on their expression profiles
• K: number of clusters, a user-defined argument
• Two example algorithms
– K-means
– Gaussian mixture model-based clustering
Notation for K-means clustering
• K: number of clusters
• Nk: number of elements in cluster k
• xi: p-dimensional expression profile for the ith gene
• X = {x1, …, xN}: the collection of N gene expression profiles to cluster
• mk: center of the kth cluster
• C(i): cluster assignment (1 to K) of the ith gene
K-means clustering
• Hard-clustering algorithm
• Dissimilarity measure is the Euclidean distance
• Minimizes within-cluster scatter, defined as

W(C) = Σ_{k=1..K} Σ_{i: C(i)=k} ||xi − mk||²

where mk is the center of cluster k and the inner sum runs over all genes assigned to cluster k
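In code, the scatter is a direct double sum over clusters and their members; a minimal Python sketch (the toy profiles, assignments, and centers are made-up illustration values):

```python
def within_cluster_scatter(X, assignments, centers):
    """W(C): sum of squared Euclidean distances of each profile to its cluster center."""
    total = 0.0
    for x, k in zip(X, assignments):
        total += sum((xj - mj) ** 2 for xj, mj in zip(x, centers[k]))
    return total

# Toy 2-D profiles with a made-up assignment to K=2 clusters
X = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
assignments = [0, 0, 1, 1]
centers = [(0.5, 0.0), (10.5, 10.0)]
print(within_cluster_scatter(X, assignments, centers))  # → 1.0
```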
K-means clustering
• consider an example in which our vectors have 2 dimensions
[Figure: 2-D plot of expression profiles (+) and cluster centers]
K-means clustering
• each iteration involves two steps– assignment of profiles to clusters– re-computation of the cluster centers (means)
[Figure: one iteration, showing assignment of profiles to clusters followed by re-computation of the cluster centers]
K-means algorithm
• Input: K, the number of clusters, and a set X = {x1, …, xN} of data points, where each xi is a p-dimensional vector
• Initialize
– Select initial cluster means m1, …, mK
• Repeat until convergence
– Assign each xi to cluster C(i) such that

C(i) = argmin_k ||xi − mk||²

– Re-estimate the mean of each cluster based on its new members
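The loop above can be sketched in Python; a minimal illustration of Lloyd's algorithm (the toy data and initial centers are made-up values):

```python
def kmeans(X, centers, max_iters=100):
    """Alternate assignment and mean re-estimation until assignments stop changing."""
    assignments = None
    for _ in range(max_iters):
        # Assignment step: C(i) = argmin_k ||x_i - m_k||^2
        new_assignments = [
            min(range(len(centers)),
                key=lambda k: sum((xj - mj) ** 2 for xj, mj in zip(x, centers[k])))
            for x in X
        ]
        if new_assignments == assignments:
            break  # stopping criterion: assignments did not change
        assignments = new_assignments
        # Re-estimation step: m_k = mean of the profiles assigned to cluster k
        for k in range(len(centers)):
            members = [x for x, c in zip(X, assignments) if c == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centers

X = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
assignments, centers = kmeans(X, centers=[[0.0, 0.0], [5.0, 5.0]])
print(assignments)  # → [0, 0, 1, 1]
```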
K-means: updating the mean
• To compute the mean of the kth cluster:

mk = (1/Nk) Σ_{i: C(i)=k} xi

where Nk is the number of genes in cluster k and the sum runs over all genes in cluster k
K-means stopping criteria
• Assignments of objects to clusters don’t change
• Fix the max number of iterations
• Optimization criterion changes by a small value
K-means Clustering Example
[Figure: four instances x1, x2, x3, x4 plotted on features f1 and f2, with two initial cluster centers]
Given the four instances and two clusters initialized as shown.
K-means Clustering Example (Continued)
assignments remain the same, so the procedure has converged
Gaussian mixture model based clustering
• K-means is hard clustering
– At each iteration, a data point is assigned to one and only one cluster
• We can do soft clustering based on Gaussian mixture models
– Each cluster is represented by a distribution (in our case a Gaussian)
– We assume the data is generated by a mixture of the Gaussians
Gaussian distribution
• Gaussian distribution:

N(x | μ, σ) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²))

μ: mean
σ: standard deviation
Representation of Clusters
• in the EM approach, we’ll represent each cluster using a p-dimensional multivariate Gaussian

N(x | μ, Σ) = (1 / ((2π)^(p/2) |Σ|^(1/2))) exp(−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ))

where
μ is the mean of the Gaussian
Σ is the covariance matrix
[Figure: a Gaussian density in a 2-D space]
Cluster generation
• We model our data points with a generative process
• Each point is generated by:
– Choosing a cluster k (from 1, 2, …, K) by sampling from a probability distribution over the clusters
• Prob(cluster k) = Pk
– Sampling a point from cluster k’s Gaussian distribution
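This two-step generative process can be sketched directly; a short Python illustration, where the mixture weights and means are made-up values:

```python
import random

def sample_point(cluster_probs, means, sigma):
    """Generate one point: pick cluster k with probability P_k, then sample
    from that cluster's Gaussian N(mean_k, sigma^2)."""
    k = random.choices(range(len(cluster_probs)), weights=cluster_probs)[0]
    return k, random.gauss(means[k], sigma)

random.seed(0)
# Made-up mixture: two clusters with priors 0.3/0.7 and means -2 and 3
points = [sample_point([0.3, 0.7], [-2.0, 3.0], sigma=1.0) for _ in range(5)]
```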
EM Clustering
• the EM algorithm will try to set the parameters of the Gaussians, θ, to maximize the log likelihood of the data, X:

log P(X | θ) = Σ_{i=1..N} log Σ_{k=1..K} Pk N(xi | μk, Σ)
EM Clustering
• the parameters of the model, θ, include the means, the covariance matrices, and sometimes prior weights for each Gaussian
• here, we’ll assume that the covariance matrix is fixed; we’ll focus just on setting the means and the prior weights
EM Clustering: Hidden Variables
• on each iteration of K-means clustering, we had to assign each instance to a cluster
• in the EM approach, we’ll use hidden variables to represent this idea
• for each instance xi we have a set of hidden variables Zi1, …, ZiK
• we can think of Zij as being 1 if xi is a member of cluster j and 0 otherwise
EM Clustering: the E-step
• recall that Zij is a hidden variable which is 1 if Gaussian j generated xi and 0 otherwise
• in the E-step, we compute E[Zij], the expected value of this hidden variable given the current parameters:

E[Zij] = Pj N(xi | μj) / Σ_k Pk N(xi | μk)
[Figure: a point softly assigned to two clusters, with expected values 0.3 and 0.7]
EM Clustering: the M-step
• given the expected values E[Zij], we re-estimate the means of the Gaussians and the cluster probabilities:

μj = Σ_i E[Zij] xi / Σ_i E[Zij]        Pj = (1/N) Σ_i E[Zij]

• can also re-estimate the covariance matrix if we’re varying it
EM Clustering – Overall algorithm
• Initialize parameters (e.g., means)
• Loop until convergence
– E-step: compute the expected values E[Zij] given the current parameters
– M-step: update the parameters using the E[Zij] values
• Means
• Cluster probabilities
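The full loop can be sketched in Python for the one-dimensional, fixed-variance case, using the example data and starting parameters from the slides that follow:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_step(X, means, priors, sigma):
    """One EM iteration for a 1-D mixture with fixed, shared variance."""
    # E-step: E[Z_ij] = P_j N(x_i | mu_j) / sum_k P_k N(x_i | mu_k)
    E = []
    for x in X:
        weights = [p * gaussian_pdf(x, m, sigma) for p, m in zip(priors, means)]
        total = sum(weights)
        E.append([w / total for w in weights])
    # M-step: re-estimate means and cluster probabilities from the E[Z_ij]
    n = len(X)
    new_means = [sum(E[i][j] * X[i] for i in range(n)) / sum(E[i][j] for i in range(n))
                 for j in range(len(means))]
    new_priors = [sum(E[i][j] for i in range(n)) / n for j in range(len(means))]
    return new_means, new_priors

# Example data from the slides: variance 4, so sigma = 2
X = [-4.0, -3.0, -1.0, 3.0, 5.0]
means, priors = [0.0, 2.0], [0.5, 0.5]
for _ in range(50):
    means, priors = em_step(X, means, priors, sigma=2.0)
```

After convergence, the first mean should sit among the negative points and the second among the positive ones.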
EM Clustering Example
Consider a one-dimensional clustering problem in which the data given are:
x1 = -4, x2 = -3, x3 = -1, x4 = 3, x5 = 5
The initial mean of the first Gaussian is 0 and the initial mean of the second is 2. The Gaussians both have variance = 4; their density function is:

f(x) = (1 / (2√(2π))) exp(−(x − μ)² / 8)

where μ denotes the mean (center) of the Gaussian.
Initially, we set P1 = P2 = 0.5
EM Clustering Example: E Step
EM Clustering Example: M-step
EM Clustering Example
• here we’ve shown just one step of the EM procedure
• we would continue the E- and M-steps until convergence
Comparing K-means and GMMs
• K-means
– Hard clustering
– Optimizes within-cluster scatter
– Requires estimation of means
• GMMs
– Soft clustering
– Optimizes likelihood of the data
– Requires estimation of means (and possibly covariances) and mixture probabilities
Choosing the number of clusters
• Picking the number of clusters by optimizing the clustering objective alone will result in K = N (the number of data points)
• Two common options:
– Pick K based on a penalized clustering objective
– Pick K based on cross-validation
Picking K based on cross-validation
[Figure: cross-validation for choosing K]
• Split the data set into (e.g.) 3 training/test splits
• Cluster each training set, then compute the objective on the corresponding test data
• Average the test objectives across the splits
• Run the method on all the data once K has been determined
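A sketch of this procedure, using a tiny 1-D K-means as the clustering method and held-out scatter as the test objective; the data set, fold count, and candidate K range are made-up illustration values:

```python
import random

def nearest_center_scatter(points, centers):
    """Test objective: squared distance of each held-out point to its nearest center."""
    return sum(min((x - m) ** 2 for m in centers) for x in points)

def fit_centers(points, K, iters=20):
    """Tiny 1-D K-means used as the clustering method in this sketch."""
    centers = random.sample(points, K)
    for _ in range(iters):
        groups = [[] for _ in range(K)]
        for x in points:
            groups[min(range(K), key=lambda k: (x - centers[k]) ** 2)].append(x)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def cv_score(points, K, n_folds=3):
    """Average test-set objective over folds; pick the K with the lowest score."""
    random.shuffle(points)
    folds = [points[i::n_folds] for i in range(n_folds)]
    scores = []
    for f in range(n_folds):
        train = [x for i, fold in enumerate(folds) if i != f for x in fold]
        centers = fit_centers(train, K)
        scores.append(nearest_center_scatter(folds[f], centers))
    return sum(scores) / n_folds

random.seed(1)
# Made-up data: three well-separated 1-D clusters
data = [random.gauss(m, 0.5) for m in (-5, 0, 5) for _ in range(20)]
best_K = min(range(1, 6), key=lambda K: cv_score(data[:], K))
```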
Evaluation of clusters
• Internal validation
– How well does the clustering optimize intra-cluster similarity and inter-cluster dissimilarity?
• External validation
– Do genes in the same cluster have similar function?
Internal validation
• One measure for assessing cluster quality is the Silhouette index (SI):

SI = (1/N) Σ_{i=1..N} (b(xi) − a(xi)) / max(a(xi), b(xi))

• The more positive the SI, the better the clusters
K: number of clusters
Cj: set representing the jth cluster
a(xi): average distance of xi to other instances in the same cluster
b(xi): average distance of xi to instances in the next closest cluster
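The SI can be computed directly from pairwise distances; a minimal Python sketch (the toy points and assignments are made-up values):

```python
def silhouette_index(X, assignments, K):
    """Mean over all points of (b(x_i) - a(x_i)) / max(a(x_i), b(x_i))."""
    def dist(p, q):
        return sum((pj - qj) ** 2 for pj, qj in zip(p, q)) ** 0.5

    def avg_dist(x, members):
        return sum(dist(x, y) for y in members) / len(members)

    total = 0.0
    for i, x in enumerate(X):
        # a: average distance to the other members of x's own cluster
        own = [y for j, y in enumerate(X) if assignments[j] == assignments[i] and j != i]
        a = avg_dist(x, own) if own else 0.0
        # b: average distance to the members of the next closest cluster
        b = min(avg_dist(x, [y for j, y in enumerate(X) if assignments[j] == k])
                for k in range(K) if k != assignments[i])
        total += (b - a) / max(a, b)
    return total / len(X)

X = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
print(silhouette_index(X, [0, 0, 1, 1], K=2))  # close to 1: tight, well-separated clusters
```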
External validation
• Are genes in the same cluster associated with similar function?
• Gene Ontology (GO) is a controlled vocabulary of terms used to annotate genes
• One can use GO terms to study whether the genes in a cluster are associated with the same GO term more than expected by chance
• One can also see if genes in a cluster are associated with similar transcription factor binding sites
The Gene Ontology
• A controlled vocabulary of more than 30K concepts describing molecular functions, biological processes, and cellular components
Gene Ontology terms for yeast genes
Yeast gene	Gene Ontology process
YDR420W	1,3-beta-D-glucan_biosynthetic_process
YGR032W	1,3-beta-D-glucan_biosynthetic_process
YLR342W	1,3-beta-D-glucan_biosynthetic_process
YMR306W	1,3-beta-D-glucan_biosynthetic_process
YDR420W	1,3-beta-D-glucan_metabolic_process
YGR032W	1,3-beta-D-glucan_metabolic_process
YLR342W	1,3-beta-D-glucan_metabolic_process
YMR306W	1,3-beta-D-glucan_metabolic_process
YDL049C	1,6-beta-glucan_biosynthetic_process
YFR042W	1,6-beta-glucan_biosynthetic_process
YGR143W	1,6-beta-glucan_biosynthetic_process
YKL035W	1,6-beta-glucan_biosynthetic_process
YOR336W	1,6-beta-glucan_biosynthetic_process
YPR159W	1,6-beta-glucan_biosynthetic_process
YDL049C	1,6-beta-glucan_metabolic_process
YFR042W	1,6-beta-glucan_metabolic_process
YGR143W	1,6-beta-glucan_metabolic_process
YJL174W	1,6-beta-glucan_metabolic_process
YKL035W	1,6-beta-glucan_metabolic_process
YOR336W	1,6-beta-glucan_metabolic_process
YPR159W	1,6-beta-glucan_metabolic_process
YDR111C	1-aminocyclopropane-1-carboxylate_biosynthetic_process
…
Assessing significance of association
• Assume we have a term t with Nt genes annotated with it in the genome
• For a cluster c with Nc genes, let Hct be the number of those genes annotated with term t
• We ask whether c’s members are significantly associated with genes annotated with term t
• Specifically, we ask how likely we are to see Hct of the Nc genes annotated with t by random chance
• A common way to do this is with the hypergeometric distribution
Hypergeometric distribution
• A probability distribution over discrete random variables that describes the probability of k successes in n draws, without replacement, from a finite population of size N containing exactly K successes:

P(X = k) = C(K, k) C(N − K, n − k) / C(N, n)
Hypergeometric test example
• Consider the example of a crate of 100 wine bottles
• The crate has 20 bottles of red wine and 80 bottles of white wine
• Suppose you were blindfolded and allowed to select 10 bottles
• What is the probability that 5 of these 10 bottles are red wine?
• Number of successes: K = 20
• Total number of objects: N = 100
• Number of draws: n = 10
• Number of successes in our draws: k = 5
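This probability can be computed directly from binomial coefficients; a short Python check of the wine-crate numbers:

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): choose k of the K successes and n - k of the N - K failures."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Wine-crate example: N=100 bottles, K=20 red, n=10 draws, k=5 red drawn
p = hypergeom_pmf(5, N=100, K=20, n=10)
print(round(p, 4))  # → 0.0215
```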
Hypergeometric distribution to assess significance of cluster term association
• Nc: number of genes in cluster c (draws)
• Nt: number of genes in the genome background with term t (successes)
• Hct: number of genes in cluster c with term t (successes among the draws)
• N: total genes in the background
• The probability of observing at least Hct of Nc genes with term t by random chance is

P = Σ_{i = Hct .. min(Nc, Nt)} C(Nt, i) C(N − Nt, Nc − i) / C(N, Nc)
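The tail sum can be sketched in Python; the genome, cluster, and term counts below are hypothetical illustration values:

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """Hypergeometric P(X = k) for n draws from N objects with K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def enrichment_pvalue(H_ct, N_c, N_t, N):
    """P(at least H_ct of the N_c cluster genes carry term t): sum the
    hypergeometric upper tail from H_ct to min(N_c, N_t)."""
    return sum(hypergeom_pmf(i, N, N_t, N_c) for i in range(H_ct, min(N_c, N_t) + 1))

# Hypothetical numbers: a 6000-gene genome, 40 genes annotated with term t,
# and a 50-gene cluster of which 8 carry t
p = enrichment_pvalue(H_ct=8, N_c=50, N_t=40, N=6000)
```

A small p suggests the cluster is enriched for term t well beyond chance (the expected overlap here is only 50 × 40 / 6000 ≈ 0.33 genes).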
Other commonly used annotation categories
• KEGG pathway
• BioCarta
• MSigDB (Molecular Signatures Database)
• REACTOME
• Binding sites of transcription factors
Using Gene Ontology to assess the quality of a cluster
[Figure: heatmap of genes × conditions annotated with GO terms, and transcription factor binding sites for HAP4 and MSN2/4]
Summary of flat clustering
• Two methods
– Gaussian mixture models (soft clustering)
– K-means (hard clustering)
• Both need users to specify K
• K can be picked based on a penalized objective or cross-validation
• Clusters can be evaluated based on internal and external validation