Flat clustering approaches
Colin Dewey
BMI/CS 576, [email protected]
Fall 2015
Flat clustering
• Cluster objects/genes/samples into K clusters
• In the following, we will consider clustering genes based on their expression profiles
• K: number of clusters, a user-defined argument
• Two example algorithms
– K-means
– Gaussian mixture model-based clustering
Notation for K-means clustering
• K: number of clusters
• Nk: number of elements in cluster k
• xi: p-dimensional expression profile for the ith gene
• X = {x1, …, xN}: the collection of N gene expression profiles to cluster
• mk: center of the kth cluster
• C(i): cluster assignment (1 to K) of the ith gene
K-means clustering
• Hard-clustering algorithm
• Dissimilarity measure is the Euclidean distance
• Minimizes within-cluster scatter, defined as

W(C) = Σ_{k=1..K} Σ_{i: C(i)=k} ||xi − mk||²

where mk is the center of cluster k and the inner sum runs over all genes assigned to cluster k
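In code, the scatter is a direct double sum over clusters and their members; a minimal Python sketch (the toy profiles, assignments, and centers are made-up illustration values):

```python
def within_cluster_scatter(X, assignments, centers):
    """W(C): sum of squared Euclidean distances of each profile to its cluster center."""
    total = 0.0
    for x, k in zip(X, assignments):
        total += sum((xj - mj) ** 2 for xj, mj in zip(x, centers[k]))
    return total

# Toy 2-D profiles with a made-up assignment to K=2 clusters
X = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
assignments = [0, 0, 1, 1]
centers = [(0.5, 0.0), (10.5, 10.0)]
print(within_cluster_scatter(X, assignments, centers))  # → 1.0
```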
K-means clustering
• consider an example in which our vectors have 2 dimensions
[Figure: 2-D plot of expression profiles (+) and cluster centers]
K-means clustering
• each iteration involves two steps– assignment of profiles to clusters– re-computation of the cluster centers (means)
[Figure: one iteration, showing assignment of profiles to clusters followed by re-computation of the cluster centers]
K-means algorithm
• Input: K, the number of clusters, and a set X = {x1, …, xN} of data points, where each xi is a p-dimensional vector
• Initialize
– Select initial cluster means m1, …, mK
• Repeat until convergence
– Assign each xi to cluster C(i) such that

C(i) = argmin_k ||xi − mk||²

– Re-estimate the mean of each cluster based on its new members
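The loop above can be sketched in Python; a minimal illustration of Lloyd's algorithm (the toy data and initial centers are made-up values):

```python
def kmeans(X, centers, max_iters=100):
    """Alternate assignment and mean re-estimation until assignments stop changing."""
    assignments = None
    for _ in range(max_iters):
        # Assignment step: C(i) = argmin_k ||x_i - m_k||^2
        new_assignments = [
            min(range(len(centers)),
                key=lambda k: sum((xj - mj) ** 2 for xj, mj in zip(x, centers[k])))
            for x in X
        ]
        if new_assignments == assignments:
            break  # stopping criterion: assignments did not change
        assignments = new_assignments
        # Re-estimation step: m_k = mean of the profiles assigned to cluster k
        for k in range(len(centers)):
            members = [x for x, c in zip(X, assignments) if c == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centers

X = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
assignments, centers = kmeans(X, centers=[[0.0, 0.0], [5.0, 5.0]])
print(assignments)  # → [0, 0, 1, 1]
```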
K-means: updating the mean
• To compute the mean of the kth cluster:

mk = (1/Nk) Σ_{i: C(i)=k} xi

where Nk is the number of genes in cluster k and the sum runs over all genes in cluster k
K-means stopping criteria
• Assignments of objects to clusters don’t change
• Fix the max number of iterations
• Optimization criterion changes by a small value
K-means Clustering Example
[Figure: four instances x1, x2, x3, x4 plotted on features f1 and f2, with two initial cluster centers]
Given the four instances and two clusters initialized as shown.
K-means Clustering Example (Continued)
assignments remain the same, so the procedure has converged
Gaussian mixture model based clustering
• K-means is hard clustering
– At each iteration, a data point is assigned to one and only one cluster
• We can do soft clustering based on Gaussian mixture models
– Each cluster is represented by a distribution (in our case a Gaussian)
– We assume the data is generated by a mixture of the Gaussians
Gaussian distribution
• Gaussian distribution:

N(x | μ, σ) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²))

μ: mean
σ: standard deviation
Representation of Clusters
• in the EM approach, we’ll represent each cluster using a p-dimensional multivariate Gaussian

N(x | μ, Σ) = (1 / ((2π)^(p/2) |Σ|^(1/2))) exp(−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ))

where
μ is the mean of the Gaussian
Σ is the covariance matrix
[Figure: a Gaussian density in a 2-D space]
Cluster generation
• We model our data points with a generative process
• Each point is generated by:
– Choosing a cluster k (from 1, 2, …, K) by sampling from a probability distribution over the clusters
• Prob(cluster k) = Pk
– Sampling a point from cluster k’s Gaussian distribution
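This two-step generative process can be sketched directly; a short Python illustration, where the mixture weights and means are made-up values:

```python
import random

def sample_point(cluster_probs, means, sigma):
    """Generate one point: pick cluster k with probability P_k, then sample
    from that cluster's Gaussian N(mean_k, sigma^2)."""
    k = random.choices(range(len(cluster_probs)), weights=cluster_probs)[0]
    return k, random.gauss(means[k], sigma)

random.seed(0)
# Made-up mixture: two clusters with priors 0.3/0.7 and means -2 and 3
points = [sample_point([0.3, 0.7], [-2.0, 3.0], sigma=1.0) for _ in range(5)]
```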
EM Clustering
• the EM algorithm will try to set the parameters of the Gaussians, θ, to maximize the log likelihood of the data, X:

log P(X | θ) = Σ_{i=1..N} log Σ_{k=1..K} Pk N(xi | μk, Σ)
EM Clustering
• the parameters of the model, θ, include the means, the covariance matrices, and sometimes prior weights for each Gaussian
• here, we’ll assume that the covariance matrix is fixed; we’ll focus just on setting the means and the prior weights
EM Clustering: Hidden Variables
• on each iteration of K-means clustering, we had to assign each instance to a cluster
• in the EM approach, we’ll use hidden variables to represent this idea
• for each instance xi we have a set of hidden variables Zi1, …, ZiK
• we can think of Zij as being 1 if xi is a member of cluster j and 0 otherwise
EM Clustering: the E-step
• recall that Zij is a hidden variable which is 1 if Gaussian j generated xi and 0 otherwise
• in the E-step, we compute E[Zij], the expected value of this hidden variable given the current parameters:

E[Zij] = Pj N(xi | μj) / Σ_k Pk N(xi | μk)
[Figure: a point softly assigned to two clusters, with expected values 0.3 and 0.7]
EM Clustering: the M-step
• given the expected values E[Zij], we re-estimate the means of the Gaussians and the cluster probabilities:

μj = Σ_i E[Zij] xi / Σ_i E[Zij]        Pj = (1/N) Σ_i E[Zij]

• can also re-estimate the covariance matrix if we’re varying it
EM Clustering – Overall algorithm
• Initialize parameters (e.g., means)
• Loop until convergence
– E-step: compute the expected values E[Zij] given the current parameters
– M-step: update the parameters using the E[Zij] values
• Means
• Cluster probabilities
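The full loop can be sketched in Python for the one-dimensional, fixed-variance case, using the example data and starting parameters from the slides that follow:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_step(X, means, priors, sigma):
    """One EM iteration for a 1-D mixture with fixed, shared variance."""
    # E-step: E[Z_ij] = P_j N(x_i | mu_j) / sum_k P_k N(x_i | mu_k)
    E = []
    for x in X:
        weights = [p * gaussian_pdf(x, m, sigma) for p, m in zip(priors, means)]
        total = sum(weights)
        E.append([w / total for w in weights])
    # M-step: re-estimate means and cluster probabilities from the E[Z_ij]
    n = len(X)
    new_means = [sum(E[i][j] * X[i] for i in range(n)) / sum(E[i][j] for i in range(n))
                 for j in range(len(means))]
    new_priors = [sum(E[i][j] for i in range(n)) / n for j in range(len(means))]
    return new_means, new_priors

# Example data from the slides: variance 4, so sigma = 2
X = [-4.0, -3.0, -1.0, 3.0, 5.0]
means, priors = [0.0, 2.0], [0.5, 0.5]
for _ in range(50):
    means, priors = em_step(X, means, priors, sigma=2.0)
```

After convergence, the first mean should sit among the negative points and the second among the positive ones.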
EM Clustering Example
Consider a one-dimensional clustering problem in which the data given are:
x1 = -4, x2 = -3, x3 = -1, x4 = 3, x5 = 5
The initial mean of the first Gaussian is 0 and the initial mean of the second is 2. The Gaussians both have variance = 4; their density function is:

f(x) = (1 / (2√(2π))) exp(−(x − μ)² / 8)

where μ denotes the mean (center) of the Gaussian.
Initially, we set P1 = P2 = 0.5
EM Clustering Example: E Step
EM Clustering Example: M-step
EM Clustering Example
• here we’ve shown just one step of the EM procedure
• we would continue the E- and M-steps until convergence
Comparing K-means and GMMs
• K-means
– Hard clustering
– Optimizes within-cluster scatter
– Requires estimation of means
• GMMs
– Soft clustering
– Optimizes likelihood of the data
– Requires estimation of means (and possibly covariances) and mixture probabilities
Choosing the number of clusters
• Picking the number of clusters by optimizing the clustering objective alone will result in K = N (the number of data points)
• Two common options:
– Pick K based on a penalized clustering objective
– Pick K based on cross-validation
Picking K based on cross-validation
[Figure: cross-validation for choosing K]
• Split the data set into (e.g.) 3 training/test splits
• Cluster each training set, then compute the objective on the corresponding test data
• Average the test objectives across the splits
• Run the method on all the data once K has been determined
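A sketch of this procedure, using a tiny 1-D K-means as the clustering method and held-out scatter as the test objective; the data set, fold count, and candidate K range are made-up illustration values:

```python
import random

def nearest_center_scatter(points, centers):
    """Test objective: squared distance of each held-out point to its nearest center."""
    return sum(min((x - m) ** 2 for m in centers) for x in points)

def fit_centers(points, K, iters=20):
    """Tiny 1-D K-means used as the clustering method in this sketch."""
    centers = random.sample(points, K)
    for _ in range(iters):
        groups = [[] for _ in range(K)]
        for x in points:
            groups[min(range(K), key=lambda k: (x - centers[k]) ** 2)].append(x)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def cv_score(points, K, n_folds=3):
    """Average test-set objective over folds; pick the K with the lowest score."""
    random.shuffle(points)
    folds = [points[i::n_folds] for i in range(n_folds)]
    scores = []
    for f in range(n_folds):
        train = [x for i, fold in enumerate(folds) if i != f for x in fold]
        centers = fit_centers(train, K)
        scores.append(nearest_center_scatter(folds[f], centers))
    return sum(scores) / n_folds

random.seed(1)
# Made-up data: three well-separated 1-D clusters
data = [random.gauss(m, 0.5) for m in (-5, 0, 5) for _ in range(20)]
best_K = min(range(1, 6), key=lambda K: cv_score(data[:], K))
```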
Evaluation of clusters
• Internal validation
– How well does the clustering optimize intra-cluster similarity and inter-cluster dissimilarity?
• External validation
– Do genes in the same cluster have similar function?
Internal validation
• One measure for assessing cluster quality is the Silhouette index (SI):

SI = (1/N) Σ_{i=1..N} (b(xi) − a(xi)) / max(a(xi), b(xi))

• The more positive the SI, the better the clusters
K: number of clusters
Cj: set representing the jth cluster
a(xi): average distance of xi to other instances in the same cluster
b(xi): average distance of xi to instances in the next closest cluster
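The SI can be computed directly from pairwise distances; a minimal Python sketch (the toy points and assignments are made-up values):

```python
def silhouette_index(X, assignments, K):
    """Mean over all points of (b(x_i) - a(x_i)) / max(a(x_i), b(x_i))."""
    def dist(p, q):
        return sum((pj - qj) ** 2 for pj, qj in zip(p, q)) ** 0.5

    def avg_dist(x, members):
        return sum(dist(x, y) for y in members) / len(members)

    total = 0.0
    for i, x in enumerate(X):
        # a: average distance to the other members of x's own cluster
        own = [y for j, y in enumerate(X) if assignments[j] == assignments[i] and j != i]
        a = avg_dist(x, own) if own else 0.0
        # b: average distance to the members of the next closest cluster
        b = min(avg_dist(x, [y for j, y in enumerate(X) if assignments[j] == k])
                for k in range(K) if k != assignments[i])
        total += (b - a) / max(a, b)
    return total / len(X)

X = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
print(silhouette_index(X, [0, 0, 1, 1], K=2))  # close to 1: tight, well-separated clusters
```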
External validation
• Are genes in the same cluster associated with similar function?
• Gene Ontology (GO) is a controlled vocabulary of terms used to annotate genes
• One can use GO terms to study whether the genes in a cluster are associated with the same GO term more than expected by chance
• One can also see if genes in a cluster are associated with similar transcription factor binding sites
The Gene Ontology
• A controlled vocabulary of more than 30K concepts describing molecular functions, biological processes, and cellular components
Gene Ontology terms for yeast genes
Yeast gene	Gene Ontology process
YDR420W	1,3-beta-D-glucan_biosynthetic_process
YGR032W	1,3-beta-D-glucan_biosynthetic_process
YLR342W	1,3-beta-D-glucan_biosynthetic_process
YMR306W	1,3-beta-D-glucan_biosynthetic_process
YDR420W	1,3-beta-D-glucan_metabolic_process
YGR032W	1,3-beta-D-glucan_metabolic_process
YLR342W	1,3-beta-D-glucan_metabolic_process
YMR306W	1,3-beta-D-glucan_metabolic_process
YDL049C	1,6-beta-glucan_biosynthetic_process
YFR042W	1,6-beta-glucan_biosynthetic_process
YGR143W	1,6-beta-glucan_biosynthetic_process
YKL035W	1,6-beta-glucan_biosynthetic_process
YOR336W	1,6-beta-glucan_biosynthetic_process
YPR159W	1,6-beta-glucan_biosynthetic_process
YDL049C	1,6-beta-glucan_metabolic_process
YFR042W	1,6-beta-glucan_metabolic_process
YGR143W	1,6-beta-glucan_metabolic_process
YJL174W	1,6-beta-glucan_metabolic_process
YKL035W	1,6-beta-glucan_metabolic_process
YOR336W	1,6-beta-glucan_metabolic_process
YPR159W	1,6-beta-glucan_metabolic_process
YDR111C	1-aminocyclopropane-1-carboxylate_biosynthetic_process
…
Assessing significance of association
• Assume we have a term t with Nt genes annotated with it in the genome
• For a cluster c with Nc genes, let Hct be the number of those genes annotated with term t
• We ask whether c’s members are significantly associated with genes annotated with term t
• Specifically, we ask how likely we are to see Hct of the Nc genes annotated with t by random chance
• A common way to do this is with the hypergeometric distribution
Hypergeometric distribution
• A probability distribution over discrete random variables that describes the probability of k successes in n draws, without replacement, from a finite population of size N containing exactly K successes:

P(X = k) = C(K, k) C(N − K, n − k) / C(N, n)
Hypergeometric test example
• Consider the example of a crate of 100 wine bottles
• The crate has 20 bottles of red wine and 80 bottles of white wine
• Suppose you were blindfolded and allowed to select 10 bottles
• What is the probability that 5 of these 10 bottles are red wine?
• Number of successes: K = 20
• Total number of objects: N = 100
• Number of draws: n = 10
• Number of successes in our draws: k = 5
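This probability can be computed directly from binomial coefficients; a short Python check of the wine-crate numbers:

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): choose k of the K successes and n - k of the N - K failures."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Wine-crate example: N=100 bottles, K=20 red, n=10 draws, k=5 red drawn
p = hypergeom_pmf(5, N=100, K=20, n=10)
print(round(p, 4))  # → 0.0215
```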
Hypergeometric distribution to assess significance of cluster term association
• Nc: number of genes in cluster c (draws)
• Nt: number of genes in the genome background with term t (successes)
• Hct: number of genes in cluster c with term t (successes among the draws)
• N: total genes in the background
• The probability of observing at least Hct of Nc genes with term t by random chance is

P = Σ_{i = Hct .. min(Nc, Nt)} C(Nt, i) C(N − Nt, Nc − i) / C(N, Nc)
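The tail sum can be sketched in Python; the genome, cluster, and term counts below are hypothetical illustration values:

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """Hypergeometric P(X = k) for n draws from N objects with K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def enrichment_pvalue(H_ct, N_c, N_t, N):
    """P(at least H_ct of the N_c cluster genes carry term t): sum the
    hypergeometric upper tail from H_ct to min(N_c, N_t)."""
    return sum(hypergeom_pmf(i, N, N_t, N_c) for i in range(H_ct, min(N_c, N_t) + 1))

# Hypothetical numbers: a 6000-gene genome, 40 genes annotated with term t,
# and a 50-gene cluster of which 8 carry t
p = enrichment_pvalue(H_ct=8, N_c=50, N_t=40, N=6000)
```

A small p suggests the cluster is enriched for term t well beyond chance (the expected overlap here is only 50 × 40 / 6000 ≈ 0.33 genes).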
Other commonly used annotation categories
• KEGG pathway
• BioCarta
• MSigDB (Molecular Signatures Database)
• REACTOME
• Binding sites of transcription factors
Using Gene Ontology to assess the quality of a cluster
[Figure: heatmap of genes × conditions annotated with GO terms, and transcription factor binding sites for HAP4 and MSN2/4]
Summary of flat clustering
• Two methods
– Gaussian mixture models (soft clustering)
– K-means (hard clustering)
• Both need users to specify K
• K can be picked based on a penalized objective or cross-validation
• Clusters can be evaluated based on internal and external validation