Framework for Sequence Cluster Merging (Also showing importance of domain knowledge)
Arvind Gopu
Masters student, Computer Science & Bioinformatics
Indiana University, Bloomington
http://biokdd.informatics.indiana.edu/~agopu
Email: [email protected]
Introduction
- Sequence clustering is a very important research topic.
  - Bottom-up approach: merge elements recursively up to a certain specificity.
  - Top-down approach: split elements until the desired specificity is achieved.
  - Two important issues: selectivity and sensitivity.
- The sequence clustering problem is unique: unlike most clustering problems, there are no “observable” attributes. Examples:
  - Supermarket: soda, fruit juice, frozen foods, clothing, etc.
  - Demographic: height, race, etc.
  - Sequence clustering: just a bunch of amino acid characters (with accompanying well-studied sequence comparison/alignment programs).
Introduction …
- Getting back to sequence clustering…
- Fragmentation problem: well known in sequence clustering algorithms.
  - Example: BAG (Sun Kim) reaches 99% accuracy (selective), but at the cost of ~40-50% fragmentation (over-sensitive).
- Solution? Bottom-up merging of the fragmented clusters.
Need for framework
- The suggested bottom-up approach is possible using various sub-methods.
- Framework:
  - Does common and unique tasks seamlessly.
  - Lets new sub-methods be inserted easily, with very little hassle.
  - Implemented primarily in Perl, with supporting C programs and Unix shell scripts.
Framework Schematic
[Schematic. Box labels: Merge Suggestions from Clustering Algorithm; Prepare Sequence Data; Generate Combined Profile for Two Fragment Clusters; Test Merge’bility; Post-process New Clustering Result; Enhanced Clustering Result; Test Scaffold]
Framework – Profile Generation
[Framework schematic repeated, with the “Generate Combined Profile for Two Fragment Clusters” box highlighted]
Profile Generation – MSA
[Diagram. Labels: C1, C2, MSA (C1), MSA (C2), MSA (C1, C2), Combined Profile]
MSA = Multiple Sequence Alignment
Profile Generation – MSA
- Common first step: MSA profile generation for the two fragment clusters C1 and C2 (ClustalW), giving MSA (C1) and MSA (C2).
  - Most expensive step in the framework.
- Common second step: combined profile generation (ClustalW): Prof_Align [MSA (C1), MSA (C2)].
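The two steps above could be driven from a script roughly as follows. This is only a sketch: it builds the ClustalW command lines without running them, the option names (-INFILE, -ALIGN, -OUTFILE, -PROFILE1, -PROFILE2, -PROFILE) are taken from ClustalW's command-line help as I recall it and should be checked against the installed version, and the executable name and all file names are hypothetical.

```python
from typing import List


def msa_command(fasta_path: str, aln_path: str, exe: str = "clustalw") -> List[str]:
    """Command line to align all sequences of one fragment cluster (first step)."""
    return [exe, f"-INFILE={fasta_path}", "-ALIGN", f"-OUTFILE={aln_path}"]


def profile_align_command(aln1: str, aln2: str, out_path: str,
                          exe: str = "clustalw") -> List[str]:
    """Command line to merge two existing alignments by profile alignment (second step)."""
    return [exe, f"-PROFILE1={aln1}", f"-PROFILE2={aln2}", "-PROFILE",
            f"-OUTFILE={out_path}"]


if __name__ == "__main__":
    # Hypothetical file names for two fragment clusters C1 and C2.
    steps = [
        msa_command("c1.fasta", "c1.aln"),
        msa_command("c2.fasta", "c2.aln"),
        profile_align_command("c1.aln", "c2.aln", "c1_c2.aln"),
    ]
    for cmd in steps:
        print(" ".join(cmd))
```

Each list can be handed to subprocess.run(cmd, check=True) once the option names have been confirmed for the local ClustalW build.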
Profile Generation – MSA explained..
- All of the implemented techniques depend on MSA profiles.
- MSA profile: an alignment of more than two sequences simultaneously.
Image from http://bioinformatics.weizmann.ac.il/~pietro/Making_and_using_protein_MA/
Profile Generation – MSA explained..
Image from http://www.mscs.mu.edu/~cstruble/class/mscs230/fall2002/notes/3
Framework – Merge’bility Test
[Framework schematic repeated, with the “Test Merge’bility” box highlighted]
Model Comparison based Merge Test
- Statistics/machine learning technique based method: uses relative entropy and statistical measures w.r.t. the Runs test.
- Drawbacks:
  - Almost impossible to nail down threshold values for the z-score or any other statistical measure.
  - Extremely dependent on the sample sizes being equal: does not work well when the two fragment sizes vary.
Model Comparison based Merge Test
- Each column in an MSA profile is a probabilistic model (details of construction are beyond the scope of this talk).
- Compute the similarity between corresponding columns in the two fragments: Kullback-Leibler distance.
  - Need to consider gaps while matching up columns: a challenging task.
  - Also need to screen for random “good” distances: taken care of by using a random model in the distance computation.
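As a rough illustration of the column comparison, here is a minimal sketch of a Kullback-Leibler distance between two alignment columns. It is not the construction used in the talk (which is left out of scope above): it assumes each column is summarized by residue frequencies with a uniform pseudocount, ignores gaps entirely, uses a symmetrized KL value, and leaves out the random-model screening. The function names and the pseudocount are hypothetical.

```python
import math
from collections import Counter
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_distribution(column: str, pseudocount: float = 1.0) -> Dict[str, float]:
    """Smoothed residue distribution of one alignment column (gaps are skipped)."""
    counts = Counter(c for c in column.upper() if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) in nats; the pseudocounts guarantee q[a] > 0."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in AMINO_ACIDS)


def symmetric_kl(col1: str, col2: str) -> float:
    """Symmetrized KL distance between two columns (assumption: simple average)."""
    p, q = column_distribution(col1), column_distribution(col2)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


if __name__ == "__main__":
    print(symmetric_kl("AAAA", "AAAV"))  # similar columns: small distance
    print(symmetric_kl("AAAA", "WWWW"))  # dissimilar columns: larger distance
```

Small values mean the two columns look alike; how such values are mapped to “good”, “bad” and “don’t care” symbols is not specified here.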
Model Comparison based Merge Test
- Using the column-wise comparison distance scores, compute a “distance vector”.
  - Symbolic representation for “good”, “bad” and “don’t care” distances (detail abstracted).
- Do a standard statistical test, the Runs test, to check how random the distance vector is…
  - Nice pattern: y | y | y | n | n | y | y | y | n | n | y | y | y
  - Random pattern: y | n | y | y | n | n | n | y | n | y | n | n | y | y
Model Comparison based Merge Test
4) Do Runs test
Model Comparison based Merge Test
- Compute the mean, the standard deviation and subsequently the z-score.
- Use a threshold to separate “good” and “bad” merges.
- Drawbacks again…
  - The threshold will be sample specific; it is hard to have one threshold for the entire dataset (illustrated in the test results).
  - The failure rate is high if the sample sizes are unequal.
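The mean, standard deviation and z-score mentioned above can be sketched with the standard Wald-Wolfowitz runs test. This is an assumption about which variant was used; the sketch also assumes that “don’t care” symbols are simply dropped before counting, which the slides do not specify.

```python
import math
from typing import Iterable, Tuple


def runs_test(symbols: Iterable[str]) -> Tuple[int, float]:
    """Return (number of runs, z-score) for a sequence of 'y'/'n' symbols.

    Any other symbol (e.g. a "don't care" marker) is ignored (assumption).
    """
    seq = [s for s in symbols if s in ("y", "n")]
    n1, n2 = seq.count("y"), seq.count("n")
    if n1 == 0 or n2 == 0:
        raise ValueError("need at least one 'y' and one 'n'")
    n = n1 + n2
    # A new run starts wherever two neighbouring symbols differ.
    runs = 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1))
    if var <= 0:
        raise ValueError("sequence too short for a z-score")
    return runs, (runs - mean) / math.sqrt(var)


if __name__ == "__main__":
    nice = "y y y n n y y y n n y y y".split()
    rand = "y n y y n n n y n y n n y y".split()
    print(runs_test(nice))
    print(runs_test(rand))
```

For the two example patterns from the earlier slide this gives 5 runs (z about -1.06) for the nice pattern and 9 runs (z about +0.56) for the random one. Where to place the cut-off between them is exactly the threshold problem listed under the drawbacks.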
Phylogenetic Tree based Merge Test
Merge’bility Test – Techniques …
- Phylogenetic tree based methods:
  - Evolutionary distance based method
    - Drawback: too strict, so many false negatives are possible; also hard to nail down a threshold.
  - Evolutionary Least Common Ancestor (LCA) based method
    - Improved performance on both of the previously mentioned issues.
Phylogenetic Tree Evolutionary Distance based Merge Test
Phylogenetic Tree Distance based method
- ClustalW (or other tree generation tools) provides an NJ (neighbor-joining) tree of an MSA profile.
- Sequence-length-normalized distance from the root for each sequence: 0 < distance < 1.
- Define some threshold for the distance that separates intra-cluster from inter-cluster distances.
Phylogenetic Tree Distance based method
- Distance between sequences from…
  - Two clusters will be closer to:
    - ‘1’ if the two clusters are not merge’ble: call these “bad distances”.
    - ‘0’ if the two clusters are actually part of the same super cluster.
  - The same cluster will obviously be closer to ‘0’: these constitute “good distances”; don’t care in our case.
- Count the number of “bad distances”: gives a good idea of how good a merge is.
Phylogenetic Tree Distance based method
- Good enough? Not yet: the “bad distance” count needs to be normalized.
- Why? The number of edges between vertices of the same/different clusters is proportional to the size of the clusters!
Phylogenetic Tree Distance based method
- Once the number of “bad distances” was normalized, this method churned out decent results.
  - Normalizing factor? Contentious.. What is a good normalizer?
- The method is too strict for unequally sized clusters.
  - Most merges are rejected, leading to an appreciable number of false negatives.
  - Inherent nature of MSA programs and unequally sized profiles (cluster sizes).
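To make the counting and normalization concrete, here is a minimal sketch under stated assumptions. It assumes pairwise tree distances scaled into (0, 1) are already available, uses a hypothetical threshold of 0.5, and normalizes by the number of cross-cluster pairs |C1| x |C2|. That normalizer is only one possible choice; the slides call the choice contentious and do not name the one actually used.

```python
from itertools import product
from typing import Dict, FrozenSet, Sequence


def bad_distance_fraction(dist: Dict[FrozenSet[str], float],
                          cluster_a: Sequence[str],
                          cluster_b: Sequence[str],
                          threshold: float = 0.5) -> float:
    """Fraction of cross-cluster sequence pairs whose distance exceeds the threshold.

    dist maps an unordered pair of sequence names to a distance in (0, 1).
    Dividing by |A| * |B| is an assumed normalizer, not the talk's.
    """
    pairs = list(product(cluster_a, cluster_b))
    if not pairs:
        raise ValueError("both clusters must be non-empty")
    bad = sum(1 for a, b in pairs if dist[frozenset((a, b))] > threshold)
    return bad / len(pairs)


if __name__ == "__main__":
    # Hypothetical distances between two small fragment clusters.
    dist = {
        frozenset(("a1", "b1")): 0.9,
        frozenset(("a1", "b2")): 0.8,
        frozenset(("a2", "b1")): 0.2,
        frozenset(("a2", "b2")): 0.7,
    }
    print(bad_distance_fraction(dist, ["a1", "a2"], ["b1", "b2"]))
```

A value near 1 would argue against the merge and a value near 0 for it; the cut-off between the two is again a threshold that has to be chosen.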
Phylogenetic Tree LCA Coverage based Merge Test
Phy.Tree LCA coverage based method
- ClustalW, Phylip (or other tree generation tools) provide a rooted phylogenetic tree for an MSA profile.
- Looking at the tree, one can easily make out whether a pair of clusters should be merged or not.
  - How? Parse the tree into a usual tree data structure and look for the common ancestor of the sequences of each cluster.
- Example…
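The parsing step could look roughly like this, assuming the tree tool writes its tree in Newick format (the usual output of ClustalW and Phylip). The sketch turns a Newick string into nested tuples of leaf names and discards branch lengths and internal labels; it does not handle quoted names or comments, and the function name is hypothetical.

```python
def parse_newick(text: str):
    """Parse a Newick string into nested tuples; leaves are sequence names."""
    s = "".join(text.split()).rstrip(";")  # drop whitespace and the final ';'
    pos = 0

    def label() -> str:
        # Read a name with optional ':length' and keep only the name part.
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos].split(":")[0]

    def node():
        nonlocal pos
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume '('
            children = [node()]
            while pos < len(s) and s[pos] == ",":
                pos += 1  # consume ','
                children.append(node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError("unbalanced parentheses in Newick string")
            pos += 1  # consume ')'
            label()   # skip internal label / branch length
            return tuple(children)
        return label()

    tree = node()
    if pos != len(s):
        raise ValueError("unexpected trailing text in Newick string")
    return tree


if __name__ == "__main__":
    print(parse_newick("((a:0.1,b:0.2):0.05,c:0.3);"))
```

With the tree in this nested form, the ancestor of a set of sequences is just the smallest sub-tuple that contains all of them.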
Phy.Tree LCA coverage based method
- Good merge: sequences of the two clusters (shaded blue and red) are from the same super cluster.
Phy.Tree LCA coverage based method
- Bad merge: sequences of the two clusters (shaded blue and red) are from different super clusters.
Phy.Tree LCA coverage based method
- Same LCA for both clusters? Good merge!
- If not … bad merge? Not quite.
  - It is possible that the LCAs are different but still cover sequences from either cluster to a considerable extent.
  - Better to use the coverage of the LCAs instead.
- Example…
Phy.Tree LCA coverage based method
- Why LCA coverage?
  - The second cluster has three sequences, but its LCA covers four more sequences from the other cluster.
Phy.Tree LCA coverage based method
- Coverage test:
  - For clusters Ci and Ck, choose the smaller cluster, say Ci, i.e. | Ci | < | Ck |.
  - Define Cov (LCA[Ci]) as the number of sequences that the LCA of Ci covers.
  - If Cov (LCA[Ci]) > number of sequences in Ci … where | Ci | < | Ck |,
    i.e. { Cov (LCA[Ci]) / | Ci | } > 1,
    or {Cross Coverage (LCA[Ci])} > 0
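The coverage test can be sketched on a tree held as nested tuples of sequence names. This is a sketch, not the framework's Perl code: it assumes the tree contains only the sequences of the two clusters, so cross coverage equals Cov (LCA[Ci]) minus | Ci |, it picks the first cluster on a size tie, and it reads a cross coverage above 0 as a sign of a good merge, which the slides imply but do not state outright. All names are hypothetical.

```python
from typing import List, Sequence


def leaves(tree) -> List[str]:
    """All sequence names under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return [tree]
    out: List[str] = []
    for child in tree:
        out.extend(leaves(child))
    return out


def lca_subtree(tree, targets: Sequence[str]):
    """Smallest subtree containing every target sequence (their LCA)."""
    wanted = set(targets)
    if not wanted <= set(leaves(tree)):
        raise ValueError("some target sequences are not in the tree")
    node = tree
    while not isinstance(node, str):
        # Descend while a single child still holds all targets.
        holding = [c for c in node if wanted <= set(leaves(c))]
        if not holding:
            break
        node = holding[0]
    return node


def cross_coverage(tree, cluster_i: Sequence[str], cluster_k: Sequence[str]) -> int:
    """Cov(LCA[smaller cluster]) minus the size of the smaller cluster."""
    small = cluster_i if len(cluster_i) <= len(cluster_k) else cluster_k
    return len(leaves(lca_subtree(tree, small))) - len(small)


if __name__ == "__main__":
    a, b = ["a1", "a2", "a3"], ["b1", "b2", "b3", "b4"]
    interleaved = ((("a1", "b1"), ("a2", "b2")), ("a3", ("b3", "b4")))
    separated = ((("a1", "a2"), "a3"), (("b1", "b2"), ("b3", "b4")))
    print(cross_coverage(interleaved, a, b))  # LCA of the smaller cluster is the root
    print(cross_coverage(separated, a, b))    # LCA covers only the smaller cluster
```

In the interleaved toy tree the three-sequence cluster's LCA also covers four sequences of the other cluster, mirroring the example on the previous slide; in the separated tree the cross coverage is 0.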
Phy.Tree LCA coverage based method
- Advantages:
  - Sample size difference does not play a big role.
  - Demarcating between “good” and “bad” merges is much simpler and more straightforward.
  - Shown to work really well on a variety of data sizes and difficulty levels: test results…
- Possible weakness:
  - Bound to fail for extremely small fragments (say 2 sequences each): hard not to have a common LCA!
Test Results – 4 datasets (from COG database)
Test Results – Data set 1
DATA: COG {0001, 0005} (Real Size: 35, 30)
Columns: fragment cluster sizes n (F1) and n (F2); expected outcome (Good / Bad); observed outcome for each merge’bility test method.

n (F1)  n (F2)  Expected  Model Comparison (0.0001)  Phy.tree Distance  Phy.tree LCA coverage
10      10      Good      Good                       Good               Good
10      10      Bad       Bad                        Bad                Bad
10      5       Good      Good                       Good               Good
10      5       Bad       Bad                        Bad                Bad
10      3       Good      Good                       Good               Good
10      3       Bad       Bad                        Bad                Bad
4       2       Good      Good                       Good               Good
4       2       Bad       Bad                        Bad                Bad
3       3       Good      Good                       Good               Good
3       3       Bad       Bad                        Bad                Bad
Test Results – Data set 2
DATA: COG {0142, 0183} (Real Size: 74, 116)
Columns: fragment cluster sizes n (F1) and n (F2); expected outcome (Good / Bad); observed outcome for each merge’bility test method.

n (F1)  n (F2)  Expected  Model Comparison (0.001)  Phy.tree Distance  Phy.tree LCA coverage
10      10      Good      Good                      Good               Good
10      10      Bad       Bad                       Bad                Bad
10      5       Good      Good                      Bad                Good
10      5       Bad       Bad                       Bad                Bad
10      3       Good      Good                      Bad                Good
10      3       Bad       Good                      Bad                Bad
4       2       Good      Good                      Bad                Good
4       2       Bad       Bad                       Bad                Bad
3       3       Good      Good                      Bad                Bad
3       3       Bad       Bad                       Bad                Bad
Test Results – Data set 3
DATA: COG {0380, 0383} (Real Size: 15, 13)
Columns: fragment cluster sizes n (F1) and n (F2); expected outcome (Good / Bad); observed outcome for each merge’bility test method.

n (F1)  n (F2)  Expected  Model Comparison (0.001 / 0.0005)  Phy.tree Distance  Phy.tree LCA coverage
10      10      Good      Good / Bad                         Good               Good
10      10      Bad       Good / Bad                         Bad                Bad
10      5       Good      Good / Bad                         Bad                Good
10      5       Bad       Good / Bad                         Bad                Bad
10      3       Good      Good / Bad                         Bad                Good
10      3       Bad       Good / Bad                         Bad                Bad
4       2       Good      Good / Good                        Good               Bad
4       2       Bad       Bad / Good                         Bad                Bad
3       3       Good      Good / Bad                         Good               Good
3       3       Bad       Bad / Bad                          Bad                Good
Test Results – Data set 4
DATA: COG {0160, 0161} (Real Size: 79, 49)
Columns: fragment cluster sizes n (F1) and n (F2); expected outcome (Good / Bad); observed outcome for each merge’bility test method.

n (F1)  n (F2)  Expected  Model Comparison (0.0001)  Phy.tree Distance  Phy.tree LCA coverage
10      10      Good      Bad                        Good               Good
10      10      Bad       Good                       Good               Bad
10      5       Good      Bad                        Good               Good
10      5       Bad       Good                       Good               Bad
10      3       Good      Good                       Good               Good
10      3       Bad       Good                       Bad                Bad
4       2       Good      Good                       Good               Good
4       2       Bad       Good                       Good               Bad
3       3       Good      Good                       Good               Good
3       3       Bad       Good                       Good               Bad
2       2       Good      Good                       Good               Good
2       2       Bad       Good                       Good               Good
Acknowledgements!
A big thank you to:
- Prof. Sun Kim, advisor
- My parents, brother, grandparents!
- All my colleagues and friends: JH, Zhiping, Scott Martin, SR, Raj, Anshul, Pat Hayes and everyone else!
- Folks at CS & Informatics: CS Systems staff, Lucy, Linda, Wendy, Cheryl, Errissa, Bob!
- Profs. Marty Siegel and Gary Wiggins – GPC.
- RATS folks!
- Did I forget someone?! Sorry if I did…