Anindya Bhattacharya and Rajat K. De, Bioinformatics, 2008
Introduction
Divisive Correlation Clustering Algorithm
Results
Conclusions
Correlation Clustering
Correlation clustering was proposed by Bansal et al. in Machine Learning, 2004.
It is based on the notion of graph partitioning.
How to construct the graph?
Nodes: genes.
Edges: correlations between the genes.
Two types of edges: positive edges and negative edges.
For example:
X and Y have a positive correlation coefficient: positive edge.
X and Y have a negative correlation coefficient: negative edge.
[Figure: graph construction over example genes A to H, followed by graph partitioning into Cluster 1 and Cluster 2.]
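The graph construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the expression values are made up, the helper name `signed_edges` is hypothetical, and a coefficient of exactly zero is arbitrarily treated as a negative edge.

```python
import numpy as np

def signed_edges(expr, names):
    """Label every gene pair '+' or '-' by the sign of its Pearson correlation."""
    corr = np.corrcoef(expr)  # rows are genes, columns are samples
    edges = {}
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            # Assumption: a coefficient of exactly 0 is counted as negative.
            edges[(names[a], names[b])] = "+" if corr[a, b] > 0 else "-"
    return edges

# Hypothetical expression values: 3 genes measured over 4 samples.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],  # gene A
    [2.0, 4.1, 5.9, 8.0],  # gene B rises together with A
    [4.0, 3.1, 2.0, 0.9],  # gene C falls as A rises
])
edges = signed_edges(expr, ["A", "B", "C"])
print(edges)
```

Here A and B end up joined by a positive edge, while C is joined to both by negative edges.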
How to measure the quality of clusters?
The number of agreements: the number of genes that are in the correct clusters.
The number of disagreements: the number of genes that are wrongly clustered.
For example:
[Figure: five genes A, B, C, D and E partitioned into Cluster 1 and Cluster 2.]
The measure of agreements is the sum of:
(1) the number of positive edges within the same cluster
(2) the number of negative edges between different clusters
In this example: 4 + 4 = 8.
The measure of disagreements is the sum of:
(1) the number of negative edges within the same cluster
(2) the number of positive edges between different clusters
In this example: 0 + 2 = 2.
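As a rough illustration of the counting rule, the sketch below tallies agreements and disagreements for a small signed graph. The edge signs and cluster memberships are hypothetical: the figure's actual edges are not recoverable from the transcript, so they were chosen only so that the totals match the 4 + 4 = 8 and 0 + 2 = 2 quoted on the slide. Every pair not listed as positive is assumed to be a negative edge.

```python
from itertools import combinations

def count_agreements(cluster_of, positive_edges):
    """Return (agreements, disagreements) over all gene pairs of a complete signed graph."""
    agreements = disagreements = 0
    for a, b in combinations(sorted(cluster_of), 2):
        positive = frozenset((a, b)) in positive_edges
        same_cluster = cluster_of[a] == cluster_of[b]
        # Agreement: positive edge inside a cluster, or negative edge across clusters.
        if positive == same_cluster:
            agreements += 1
        else:
            disagreements += 1
    return agreements, disagreements

# Hypothetical partition and edge signs (not taken from the original figure).
cluster_of = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}
positive_edges = {
    frozenset(pair)
    for pair in [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E"), ("C", "D"), ("B", "E")]
}
agreements, disagreements = count_agreements(cluster_of, positive_edges)
print(agreements, disagreements)  # prints: 8 2
```

With five genes there are ten pairs in total, so agreements and disagreements always sum to ten here.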
Goal: minimization of disagreements or, equivalently, maximization of agreements.
However, this problem is NP-complete, as proved by Bansal et al., 2004.
Another problem is that the magnitudes of the correlation coefficients are not taken into account.
Pearson correlation coefficient
Terms and measurements used in DCCA
Divisive Correlation Clustering Algorithm
Consider a set of $n$ genes, $X = \{x_1, x_2, \ldots, x_n\}$, for each of which $m$ expression values are given.
The Pearson correlation coefficient between two genes $x_i$ and $x_j$ is defined as:

$$\mathrm{Corr}(x_i, x_j) = \frac{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)^2}\,\sqrt{\sum_{l=1}^{m} (x_{jl} - \bar{x}_j)^2}}$$

$x_{il}$: the $l$th sample value of gene $x_i$.
$\bar{x}_i$: the mean value of gene $x_i$ over the $m$ samples.
$\mathrm{Corr}(x_i, x_j) > 0$: $x_i$ and $x_j$ are positively correlated, with the degree of correlation given by its magnitude.
$\mathrm{Corr}(x_i, x_j) < 0$: $x_i$ and $x_j$ are negatively correlated, with value $|\mathrm{Corr}(x_i, x_j)|$.
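A direct transcription of the Pearson correlation coefficient into Python may help make the definition concrete. The function name `pearson` and the sample values are hypothetical; the result is cross-checked against NumPy's `np.corrcoef`. A gene with constant expression would give a zero denominator, which this sketch does not handle.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two expression profiles of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of the first gene
    dy = y - y.mean()  # deviations from the mean of the second gene
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

# Hypothetical profiles over m = 4 samples.
x_i = [1.0, 2.0, 3.0, 4.0]
x_j = [2.0, 1.0, 4.0, 3.0]
r = pearson(x_i, x_j)
print(round(r, 6))                    # prints: 0.6
print(np.corrcoef(x_i, x_j)[0, 1])    # NumPy's value for comparison
```

The positive value means the two hypothetical genes are positively correlated, with 0.6 as the degree of correlation.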
We define some terms and measurements used in DCCA:
Attraction
Repulsion
Attraction/Repulsion value
Average correlation value
Attraction: there is an attraction between $x_i$ and $x_j$ if the correlation coefficient $\mathrm{Corr}(x_i, x_j) > 0$.
Repulsion: there is a repulsion between $x_i$ and $x_j$ if $\mathrm{Corr}(x_i, x_j) < 0$.
Attraction/Repulsion value: the magnitude of $\mathrm{Corr}(x_i, x_j)$ is the strength of the attraction or repulsion.
The $n$ genes will be grouped into $K$ disjoint clusters $C_1, C_2, \ldots, C_K$.
Average correlation value: the average correlation value for a gene $x_i$ with respect to cluster $C_p$ is defined as:

$$AVGC_{ip} = \frac{1}{n_p} \sum_{x_j \in C_p} \mathrm{Corr}(x_i, x_j)$$

$n_p$: the number of data points in $C_p$.
$AVGC_{ip}$ is the average correlation of gene $x_i$ with the other genes inside cluster $C_p$.
The average correlation value reflects the degree of inclusion of $x_i$ in cluster $C_p$.
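The average correlation value can be illustrated with a small sketch. It assumes the value is the sum of the correlations between a gene and every member of a cluster, divided by the number of members; whether a gene's correlation with itself is counted is not recoverable from the slide, and this sketch counts it. The correlation matrix, the cluster names and the helper names are all hypothetical.

```python
import numpy as np

def average_correlation(corr, i, members):
    """Mean correlation between gene i and the genes listed in `members`."""
    members = list(members)
    return float(corr[i, members].sum() / len(members))

def best_cluster(corr, i, clusters):
    """Return the cluster name with the highest average correlation for gene i."""
    scores = {name: average_correlation(corr, i, m) for name, m in clusters.items()}
    return max(scores, key=scores.get), scores

# Hypothetical symmetric correlation matrix for 4 genes.
corr = np.array([
    [ 1.0,  0.8, -0.6, -0.4],
    [ 0.8,  1.0, -0.5, -0.7],
    [-0.6, -0.5,  1.0,  0.9],
    [-0.4, -0.7,  0.9,  1.0],
])
clusters = {"C1": [0, 1], "C2": [2, 3]}
name, scores = best_cluster(corr, 0, clusters)
print(name, scores)
```

For gene 0 the averages are 0.9 for C1 and -0.5 for C2, so under this measure gene 0 belongs more strongly to C1.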
Divisive Correlation Clustering Algorithm
[Figure: DCCA takes the expression values of n genes x_1, ..., x_n over m samples as input and produces K disjoint clusters C_1, C_2, ..., C_K.]
Step 1:
Step 2: for each iteration, do:
Step 2-i:
Step 2-ii:
Step 2-iii: Which cluster has the greatest repulsion value? Cluster C.
[Figure: the current clusters C_1, C_2, ..., C_p.]
Step 2-iv:
[Figure: cluster C, containing x_i, x_j and the remaining genes x_k, is split into two clusters C_p and C_q, with x_i and x_j seeding the two new clusters.]
Step 2-v:
[Figure: each gene x_k is compared against the clusters C_1, C_2, ..., C_K; a copy of x_k is placed in the cluster with the highest average correlation value, giving the new clusters C^NEW.]
Step 2-vi:
[Figure: the clusters C_1, C_2, ..., C_K are compared with the new clusters C^NEW. Any change?]
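The fragments of Steps 2-iii to 2-vi can be combined into a rough divisive loop. This is a sketch under several assumptions, not the authors' exact algorithm: the contents of Steps 1, 2-i and 2-ii are not recoverable from the slides, so the sketch assumes all genes start in one cluster, takes "most repulsion" to mean the most negative pairwise correlation inside any cluster, and splits that cluster around the two genes of that pair. The function name `dcca_sketch` and the data are hypothetical.

```python
import numpy as np

def dcca_sketch(expr, max_iter=100):
    """Divisive clustering loop loosely following Steps 2-iii to 2-vi."""
    corr = np.corrcoef(expr)                 # rows are genes, columns are samples
    labels = np.zeros(corr.shape[0], dtype=int)  # assumption: start with one cluster
    for _ in range(max_iter):
        # 2-iii (assumed reading): find the most negative correlation inside any cluster.
        worst, pair, target = 0.0, None, None
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            sub = corr[np.ix_(idx, idx)]
            a, b = np.unravel_index(np.argmin(sub), sub.shape)
            if sub[a, b] < worst:
                worst, pair, target = sub[a, b], (idx[a], idx[b]), c
        if pair is None:                     # no repulsion left inside any cluster
            break
        # 2-iv (assumed reading): split the cluster around x_i and x_j.
        i, j = pair
        idx = np.flatnonzero(labels == target)
        closer_to_j = corr[idx, j] > corr[idx, i]
        labels[idx[closer_to_j]] = labels.max() + 1
        # 2-v and 2-vi: move every gene to the cluster with the highest
        # average correlation, repeating until no assignment changes.
        for _ in range(max_iter):
            ids = np.unique(labels)
            avg = np.stack([corr[:, labels == c].mean(axis=1) for c in ids], axis=1)
            new_labels = ids[np.argmax(avg, axis=1)]
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
    return labels

# Hypothetical data: genes 0 and 1 rise across samples, genes 2 and 3 fall.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 11.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [9.0, 8.0, 5.0, 4.0, 1.0],
])
labels = dcca_sketch(expr)
print(labels)
```

On this toy input the rising genes and the falling genes end up in separate clusters, and the loop stops because no negative correlation remains inside either cluster. The number of clusters is never passed in; it emerges from the splitting.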
Performance comparison:
A synthetic dataset, ADS
Nine gene expression datasets
A synthetic dataset ADS:
[Figure: the ADS dataset, which contains three groups.]
Experimental results:
[Figure: clustering result annotated "Clustering correctly."]
Experimental results:
[Figure: clustering results annotated "Undesired clusters."]
Five yeast datasets: Yeast ATP, Yeast PHO, Yeast AFR, Yeast AFRt, Yeast Cho et al.
Four mammalian datasets: GDS958 Wild type, GDS958 Knocked out, GDS1423, GDS2745.
Performance comparison: the z-score is calculated by observing the relation between a clustering result and the functional annotation of the genes in the clusters.
Mutual information between the clustering result $C$ and the attributes $A_1, \ldots, A_{N_A}$:

$$MI = \sum_{i=1}^{N_A} \left[ H(C) + H(A_i) - H(C, A_i) \right]$$

$H(C, A_i)$: the entropies for each cluster-attribute pair.
$H(C)$: the entropy of the clustering result, independent of attributes.
$H(A_i)$: the entropies for each of the $N_A$ attributes, independent of clusters.
The z-score is defined as:

$$z = \frac{MI_{real} - MI_{random}}{s_{random}}$$

$MI_{real}$: the computed MI for the clustered data, using the attribute database.
$MI_{random}$: obtained by computing MI for a clustering in which genes are randomly assigned to clusters of uniform size, and repeating until a distribution of values is obtained; $MI_{random}$ is the mean of these MI values.
$s_{random}$: the standard deviation of these MI values.
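The z-score computation can be sketched as follows. This is an illustration only and not the ClusterJudge implementation: it assumes mutual information is computed per attribute as H(C) + H(A) - H(C, A) and summed over attributes, uses natural logarithms, and fixes the number of random trials instead of repeating until the distribution stabilizes. The function names and the toy annotation are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (natural log) of a sequence of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def mutual_information(clusters, attribute):
    """MI between cluster labels and one attribute: H(C) + H(A) - H(C, A)."""
    joint = [f"{c}|{a}" for c, a in zip(clusters, attribute)]
    return entropy(clusters) + entropy(attribute) - entropy(joint)

def total_mi(clusters, attributes):
    """Sum of the per-attribute mutual information values."""
    return sum(mutual_information(clusters, a) for a in attributes)

def z_score(clusters, attributes, trials=200, seed=0):
    """z = (MI_real - mean(MI_random)) / std(MI_random)."""
    rng = np.random.default_rng(seed)
    clusters = np.asarray(clusters)
    k = len(np.unique(clusters))
    mi_real = total_mi(clusters, attributes)
    # Random baseline: genes shuffled into k clusters of (near) uniform size.
    uniform = np.arange(len(clusters)) % k
    mi_random = np.array(
        [total_mi(rng.permutation(uniform), attributes) for _ in range(trials)]
    )
    return float((mi_real - mi_random.mean()) / mi_random.std())

# Hypothetical case: 40 genes, 2 clusters, one attribute that matches the clusters exactly.
clusters = np.array([0] * 20 + [1] * 20)
attributes = [clusters.copy()]
z = z_score(clusters, attributes)
print(z)
```

Because the toy attribute coincides with the clustering, the real MI sits far above the random baseline and the z-score is large and positive.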
A higher value of z indicates that genes are better clustered by function, i.e., a more biologically relevant clustering result.
Gibbons' ClusterJudge tool is used to calculate the z-score for the five yeast datasets.
Experimental results:
[Figures: experimental results, five slides.]
Pros:
DCCA is able to obtain clustering solutions with high biological significance from gene-expression datasets.
DCCA detects clusters of genes with similar variation patterns in their expression profiles, without taking the expected number of clusters as an input.
Cons:
The computational cost of repairing any misplacement occurring in the clustering step is high.
DCCA will not work if the dataset contains fewer than 3 samples, because the correlation value will then be either +1 or -1.
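The two-sample limitation is easy to check numerically. In the sketch below, with made-up expression values, every gene has only two samples, so each pairwise Pearson coefficient collapses to +1 or -1 and the magnitude carries no information.

```python
import numpy as np

# Hypothetical expression values: 4 genes, only 2 samples each.
expr = np.array([
    [1.0, 3.0],   # rises from sample 1 to sample 2
    [5.0, 2.0],   # falls
    [0.5, 0.7],   # rises slightly
    [9.0, -4.0],  # falls sharply
])
corr = np.corrcoef(expr)
print(np.round(corr, 6))
```

Genes that move in the same direction get a coefficient of +1 and genes that move in opposite directions get -1, regardless of how large the changes are.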