ZORRO: A Masking Program for Incorporating Alignment Accuracy in Phylogenetic Inference
Sourav Chatterji, Martin Wu
Probabilistic Masking using pair-HMMs
• Probabilistic formulation of the alignment problem.
• Can answer additional questions
– Alignment reliability
– Sub-optimal alignments
Durbin et al., Cambridge University Press (1998)
Probabilistic Masking
• What is the probability residues xi and yj are homologous?
• Answer: the posterior probability that residues xi and yj are homologous, given both sequences
• Can be calculated efficiently for all pairs (and gaps) in quadratic time.
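$$\Pr[x_i \diamond y_j \mid x, y] = \frac{\Pr[x, y, x_i \diamond y_j]}{\Pr[x, y]}$$
where $x_i \diamond y_j$ denotes that residues $x_i$ and $y_j$ are aligned (homologous).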
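The posterior above can be computed with the forward and backward recursions of a pair-HMM, as described in Durbin et al. The following is a minimal sketch under stated assumptions: a toy three-state model (match, gap in y, gap in x) over a DNA alphabet. The function name `pair_hmm_posteriors` and the values of `delta`, `eps`, `tau` and `p_same` are illustrative choices, not ZORRO's actual parameters.

```python
def pair_hmm_posteriors(x, y, delta=0.2, eps=0.3, tau=0.1,
                        p_same=0.2, alphabet="ACGT"):
    """Return (post, total), with post[i][j] = Pr[x[i] aligned to y[j] | x, y].

    Toy pair-HMM with states M (match), X (gap in y), Y (gap in x).
    All parameter values are illustrative assumptions.
    """
    n, m, k = len(x), len(y), len(alphabet)
    q = 1.0 / k                                  # emission in gap states
    p_diff = (1.0 - k * p_same) / (k * (k - 1))  # mismatch emission in M

    def p(a, b):
        return p_same if a == b else p_diff

    t_mm = 1.0 - 2.0 * delta - tau  # M -> M
    t_gm = 1.0 - eps - tau          # X -> M and Y -> M

    def table():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward pass: entry [i][j] covers the prefixes x[:i] and y[:j].
    fM, fX, fY = table(), table(), table()
    fM[0][0] = 1.0  # the start state behaves like M
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = p(x[i - 1], y[j - 1]) * (
                    t_mm * fM[i - 1][j - 1]
                    + t_gm * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = q * (delta * fM[i - 1][j] + eps * fX[i - 1][j])
            if j > 0:
                fY[i][j] = q * (delta * fM[i][j - 1] + eps * fY[i][j - 1])
    total = tau * (fM[n][m] + fX[n][m] + fY[n][m])  # Pr[x, y]

    # Backward pass: entry [i][j] covers the suffixes x[i:] and y[j:].
    bM, bX, bY = table(), table(), table()
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                bM[i][j] = bX[i][j] = bY[i][j] = tau
                continue
            match = p(x[i], y[j]) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            gap_x = q * bX[i + 1][j] if i < n else 0.0
            gap_y = q * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = t_mm * match + delta * (gap_x + gap_y)
            bX[i][j] = t_gm * match + eps * gap_x
            bY[i][j] = t_gm * match + eps * gap_y

    # Posterior match probability for every residue pair (0-based indices).
    post = [[fM[i][j] * bM[i][j] / total for j in range(1, m + 1)]
            for i in range(1, n + 1)]
    return post, total


post, total = pair_hmm_posteriors("ACGT", "ACGGT")
```

Both passes fill a table of size (n+1) by (m+1) with constant work per cell, which is the quadratic running time mentioned above. Long sequences would need log-space or scaled arithmetic to avoid underflow, which this sketch omits.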
An Ideal Weighting Scheme
• Accounts for correlations between pairs
– e.g. A-C and A-D
• Accounts for the distance between the sequences in a pair
– e.g. C-D
The Zorro Weighting Scheme
[Figure: tree with each edge labeled by its Ne value (4, 3, 3, 3, 3)]
• Calculate Ne, the number of sequence pairs whose connecting path in the tree contains edge e.
• Normalize the edge weight by Ne.
• Weight of a pair is the sum of the normalized weights of the edges on the path.
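The weighting steps can be sketched as follows. This is an illustrative sketch, not ZORRO's code: it assumes the "edge weight" is the branch length, represents the tree as a dictionary of undirected edges, and uses the hypothetical names `pair_weights`, `u` and `v` (internal nodes) for the example.

```python
from itertools import combinations


def pair_weights(edges, leaves):
    """Return (n_e, weights) for an unrooted tree.

    edges:  {(node_a, node_b): branch_length}  (assumed edge weights)
    leaves: list of leaf names (the sequences)
    n_e[e] counts the leaf pairs whose path uses edge e; the weight of a
    pair is the sum of branch_length / n_e over the edges on its path.
    """
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)

    def path(src, dst):
        # Depth-first search; the path between two nodes of a tree is unique.
        stack = [(src, [src])]
        while stack:
            node, trail = stack.pop()
            if node == dst:
                return trail
            for nxt in adj[node]:
                if nxt not in trail:
                    stack.append((nxt, trail + [nxt]))
        raise ValueError("nodes are not connected")

    def key(a, b):
        # Edges are undirected, so look up whichever orientation is stored.
        return (a, b) if (a, b) in edges else (b, a)

    n_e = {e: 0 for e in edges}
    pair_paths = {}
    for s, t in combinations(leaves, 2):
        nodes = path(s, t)
        used = [key(a, b) for a, b in zip(nodes, nodes[1:])]
        pair_paths[(s, t)] = used
        for e in used:
            n_e[e] += 1

    weights = {pair: sum(edges[e] / n_e[e] for e in used)
               for pair, used in pair_paths.items()}
    return n_e, weights


# Four-leaf example tree ((A,B),(C,D)) with unit branch lengths.
tree = {("A", "u"): 1.0, ("B", "u"): 1.0, ("u", "v"): 1.0,
        ("C", "v"): 1.0, ("D", "v"): 1.0}
n_e, weights = pair_weights(tree, ["A", "B", "C", "D"])
```

On this four-leaf tree the internal edge is shared by 4 pairs and each leaf edge by 3. Because every edge's length is split evenly among the pairs that use it, the pair weights sum to the total tree length.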
Scoring Multiple Alignment Columns
• Calculate the “posterior probability matrix” and weights wij for every pair of sequences.
• Weighted “sum of pairs” score for column r :
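$$\frac{\sum_{i,j} w_{ij}\,\Pr[r_i \diamond r_j]}{\sum_{i,j} w_{ij}}$$
where $r_i$ is the residue of sequence $i$ in column $r$, and $\Pr[r_i \diamond r_j]$ is the posterior probability that $r_i$ and $r_j$ are aligned.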
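As a small illustration of the weighted sum-of-pairs score, the sketch below takes the pairwise posteriors and weights for one column as dictionaries keyed by sequence pair. The function name `column_score` and the input layout are assumptions made for this example.

```python
def column_score(posteriors, weights):
    """Weighted sum-of-pairs confidence score for one alignment column.

    posteriors: {(i, j): Pr[r_i aligned to r_j]} for the column
    weights:    {(i, j): w_ij}, with the same keys
    """
    numerator = sum(weights[pair] * posteriors[pair] for pair in weights)
    denominator = sum(weights.values())
    return numerator / denominator


# Three sequences: pair (0, 1) is confidently aligned, the others less so.
score = column_score(
    posteriors={(0, 1): 1.0, (0, 2): 0.5, (1, 2): 0.0},
    weights={(0, 1): 1.0, (0, 2): 2.0, (1, 2): 1.0},
)
```

Since the weights are non-negative and each posterior lies in [0, 1], the score also lies in [0, 1] and can be read as a per-column confidence.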
Some Notes
• Improve running time
– Sample a subset of pairs
– Performance is almost the same
• Using confidence scores
– Cutoff-based scheme (we use 0.5)
– Weighted sampling of columns according to confidence scores
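The two ways of using the confidence scores can be sketched as below. The function names are hypothetical, keeping columns with a score greater than or equal to the cutoff is an assumption (the slides give only the value 0.5), and sampling with replacement is likewise an assumption about how weighted sampling would be done.

```python
import random


def mask_by_cutoff(scores, cutoff=0.5):
    """Return the indices of columns kept under a cutoff-based scheme."""
    return [i for i, s in enumerate(scores) if s >= cutoff]


def sample_columns(scores, k, seed=0):
    """Draw k column indices with probability proportional to their scores.

    Sampling is with replacement (an assumption for this sketch).
    """
    rng = random.Random(seed)
    return rng.choices(range(len(scores)), weights=scores, k=k)


scores = [0.9, 0.2, 0.5, 0.0]
kept = mask_by_cutoff(scores)          # columns passed on to tree inference
resampled = sample_columns(scores, k=10)
```

The cutoff scheme gives a hard mask, while weighted sampling lets low-confidence columns still contribute occasionally instead of being discarded outright.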
Testing
• Benchmark: the BAliBASE 3.0 database of reference alignments
Testing
• Realign sequences using MSA programs such as ClustalW.
• Sensitivity: of all correctly aligned columns, the fraction marked as good (kept)
• Specificity: of all incorrectly aligned columns, the fraction marked as bad (masked out)
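The two measures can be computed per column as in the sketch below. The function name and the boolean-list inputs are assumptions for illustration: `correct[c]` says whether column c agrees with the reference alignment, and `kept[c]` says whether the masking program marked it as good.

```python
def sensitivity_specificity(correct, kept):
    """Return (sensitivity, specificity) of a column mask.

    correct: list of bools, True if the column is correctly aligned
    kept:    list of bools, True if the mask marks the column as good
    """
    n_correct = sum(correct)
    n_incorrect = len(correct) - n_correct
    good_kept = sum(c and k for c, k in zip(correct, kept))
    bad_masked = sum((not c) and (not k) for c, k in zip(correct, kept))
    sensitivity = good_kept / n_correct if n_correct else float("nan")
    specificity = bad_masked / n_incorrect if n_incorrect else float("nan")
    return sensitivity, specificity


sens, spec = sensitivity_specificity(
    correct=[True, True, True, False, False],
    kept=[True, True, False, False, True],
)
```

In this toy input two of the three correct columns are kept and one of the two incorrect columns is masked out.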
Performance
Gblocks
ZORRO
Sensitivity Specificity
96.3% 95.1%
54.4% 94.7%
Effect on Phylogenetic Inference
• Gblocks data set
– Protein sequences obtained by simulating evolution on known trees
– Diversity in the data set
  • Topology (symmetric/asymmetric)
  • Evolutionary rates
  • Alignment lengths (not tested yet)
Effect on Phylogenetic Inference
Protocol        Symmetric tree inference accuracy    Asymmetric tree inference accuracy
No masking      95.17%                               91.95%
Gblocks         84.14%                               86.44%
Prob. masking   93.56%                               93.33%
ClustalW alignments, PhyML tree
Effect on Phylogenetic Inference
Protocol        Symmetric tree inference accuracy    Asymmetric tree inference accuracy
                All        High support              All        High support
No masking      94.25%     69.23%                    91.95%     57.44%
Gblocks         89.2%      57.44%                    90.80%     51.88%
Prob. masking   94.02%     68.21%                    93.79%     62.05%
MAFFT alignments, PhyML tree
Effect on Phylogenetic Inference
ClustalW alignments, PhyML tree
Protocol        Symmetric tree inference accuracy    Asymmetric tree inference accuracy
                All        High support              All        High support
No masking      95.17%     62.05%                    91.95%     55.38%
Gblocks         84.14%     41.03%                    86.44%     37.95%
Prob. masking   93.56%     72.31%                    93.33%     63.59%
Effect on Phylogenetic Inference
MUSCLE alignments, PhyML tree
Protocol        Symmetric tree inference accuracy    Asymmetric tree inference accuracy
                All        High support              All        High support
No masking      94.71%     71.28%                    93.10%     61.03%
Gblocks         89.43%     57.95%                    90.11%     50.26%
Prob. masking   93.56%     70.77%                    95.17%     64.62%
Conclusions/Future Work
• Technical issues
– What if a few sequences are "bad"/non-homologous?
– Incorporate reliability in the likelihood equation and Bayesian methods.
  • With Dr. Darling in July
• Testing
– "Real" data sets?