fast identification and statistical evaluation of segmental homologies in comparative maps peter...
Post on 21-Dec-2015
213 views
TRANSCRIPT
Fast identification and statistical evaluation of segmental
homologies in comparative maps
Peter Calabrese1, Sugata Chakravarty2 and Todd Vision3
1Department of Mathematics, University of Southern California; Departments of 2Operations Research and 3Biology
University of North Carolina at Chapel Hill
comparative maps Spaghetti Diagram Crop Circle
Livingstone et al 1999 Genetics 152:1183Gale & Devos 1998 PNAS 95:1972
some terms
• Feature: a gene or some other marker
• Segment: a string or substring of features
• Homology: descent from a common ancestor
• Block: a pair of segments that are putatively homologous. These are what we seek!
local genome alignment
• Consider each chromosome to be a string of features
• Assign common letters to homologous features • Identify segments sharing multiple pairs of
common letters• Differences from DNA/protein alignment
– high frequency of gaps relative to matches– inversions may occur within the alignment
homology matrix
gene 1 2 3 4 5 6 7 8 1 - 0 0 0 1 0 0 0 2 0 - 0 0 0 1 0 0 3 0 0 - 0 0 0 1 0 4 0 0 0 - 0 0 0 1 5 1 0 0 0 - 0 0 0 6 0 1 0 0 0 - 0 0 7 0 0 1 0 0 0 - 0 8 0 0 0 1 0 0 0 -
1 2 3 4
5 6 7 8
Bancroft (2001) TIG 17, 89 after Ku et al (2000) PNAS 97, 9121
Non-homologous features may be abundant within homologous segments
going beyond eyeballing
• LineUp – Hampson et al (2003)– Designed for genetic maps with error
• ADHoRe – Van der Poele (2002)– Designed for unambiguous marker order data
• Both provide automatic detection of blocks• For statistics, both employ permutation tests
– Computationally intensive– p-values are approximate
FISH:Fast Identification of Segmental Homology
• Block identification– Dynamic programming provides speed and
optimality guarantee– Can be generalized to multiple alignments
• Statistical assessment– Null model of duplication and transposition– Closed-form equation for calculating p-
values (i.e. no permutation testing)
from homology matrix to graph
• nodes ()– represent dots in the homology
matrix
• edges ()– connect nodes with nearest
neighbors– are unidirectional– have an associated distance– must be shorter than some
threshold
from homology matrix to graph
• nodes ()– represent dots in the homology matrix
• edges ()– connect nodes with nearest neighbors– are unidirectional– have an associated distance– must be shorter than some threshold
• paths ()– traverse shortest available edges– can be efficiently computed– can be considered candidate blocks
null model
• Within a genome: homologies are due to the duplication of individual features followed by insertion into a (uniformly) random position
• Between genomes: homologies are due to the above process plus the transposition of randomly chosen features into randomly chosen positions.
computing neighborhood size
• h = # nodes / # cells in matrix• n = # cells in neighborhood• Prob(neighborhood has 1 node)
p = 1 – (1-h)n
• Threshold distance for p=T under Manhattan distance (x+y)dT = 0.5 + sqrt[(log(1-p)/log(1-h)+0.25]
• T is analogous to a gap parameter– small T: few false positive edges, short blocks– large T: more false positive edges, longer blocks
block statistics
• Chen-Stein Theorem: number of blocks with k nodes is approximately Poisson
• Expected number of blocks = cpu
• Conservative matrix-wide p-value
Prob(X k) < 1 – e –cpu
where c is the # of cells in the matrix and pu = h(nh)k-1
identifying blocks
• Let edge from i to j have weight wij =1
• Initialize: score of block terminating at i Si = 0
• Recursion for block scores Sj = max(Si + wij) i such that j Ti
• Dynamic programming can be used to find all maximally extended blocks
simulation experiment
k obs stderr upbound lowbound2 45.8 0.06 47.6 40.13 2.28 0.02 2.39 1.784 0.113 0.003 0.120 0.0795 0.006 0.001 0.006 0.0046 0.0003 0.0002 0.0003 0.0002
How often are blocks of size k observed under the null model compared with
expectation?
FISH v.1.0
• http://www.bio.unc.edu/faculty/vision/lab/FISH– source code – compiled executables– documentation– sample data
• Adjustable parameters (e.g. T)• Reports statistics on blocks• Is fast and memory-efficient
applications
• Automated pairwise alignment of genome maps as part of Phytome project
• Prediction of gene content in regions of unsequenced genomes
• Studies of genome evolution, especially duplication and gene order rearrangement
future work
• Biologically motivated neighborhood geometries (ADHoRe)
• Non-discrete marker positions (LineUp)– Genetic versus physical maps– Map uncertainty
• Robustness to deviation from null model (permutation tests)
• Extension to homologies among 3 or more segments