WABI 2005
Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombination Event
Yun S. Song, Yufeng Wu and Dan Gusfield
University of California, Davis
Haplotyping Problem
• Diploid organisms have two (not identical) copies of each chromosome.
• A single copy is a haplotype, a vector of Single Nucleotide Polymorphisms (SNPs).
• SNP: a site where two types of nucleotides occur frequently, encoded as 0 or 1.
• The mixed description of the two copies is a genotype, a vector over 0, 1, 2:
– If both haplotypes are 0, the genotype is 0
– If both haplotypes are 1, the genotype is 1
– If one is 0 and the other is 1, the genotype is 2
Haplotypes and Genotypes
Sites:        1 2 3 4 5 6 7 8 9
Haplotype 1:  0 1 1 1 0 0 1 1 0
Haplotype 2:  1 1 0 1 0 0 1 0 0
Genotype:     2 1 2 1 0 0 1 2 0
Each individual has two haplotypes; merging them gives the genotype for the individual.
• Haplotype Inference (HI) Problem: given a set of n genotypes, infer n haplotype pairs that form the given genotypes
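The merge rule and the ambiguity behind the HI problem can be illustrated with a minimal Python sketch. The function names `merge` and `haplotype_pairs` are hypothetical, chosen here for illustration; the example data is the nine-site individual from the slide above.

```python
from itertools import product

def merge(h1, h2):
    """Genotype of two haplotypes: 0 or 1 where they agree, 2 where they differ."""
    return [a if a == b else 2 for a, b in zip(h1, h2)]

def haplotype_pairs(genotype):
    """Yield every unordered haplotype pair that merges into `genotype`."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        # No heterozygous site: both haplotypes equal the genotype itself.
        yield tuple(genotype), tuple(genotype)
        return
    # Fix the first heterozygous site to 0 in h1 so each unordered pair appears once.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        yield tuple(h1), tuple(h2)

h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
g = merge(h1, h2)
pairs = list(haplotype_pairs(g))
```

A genotype with k heterozygous sites (value 2) has 2^(k-1) candidate pairs, which is why a genetic model is needed to choose among them.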
Perfect Phylogeny Haplotyping (PPH)
• Finding the original haplotypes in nature is hopeless without a genetic model to guide the choice of solution.
• Gusfield (2002) introduced the PPH problem.
• PPH is to find HI solutions that fit into a perfect phylogeny.
• Nice results for PPH, including a linear-time algorithm.
The Perfect Phylogeny Model for Haplotypes
[Figure: perfect phylogeny tree over sites 1 to 5. Ancestral sequence 00000 at the root; site mutations on edges; extant sequences at the leaves.]
The tree derives the set M: 10100, 10000, 01011, 01010, 00010
Assume at most 1 mutation at each site.
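Whether a set of binary sequences fits this model can be checked pairwise. The sketch below assumes the all-zero ancestral sequence shown on the slide and uses the classical condition that no two sites may show all three of the combinations 01, 10 and 11. The function name `has_perfect_phylogeny` is hypothetical, and the second matrix `bad` is a small contrasting example.

```python
from itertools import combinations

def has_perfect_phylogeny(rows):
    """Rooted test, assuming an all-zero ancestor.

    Returns False if some pair of sites contains all of 01, 10 and 11.
    """
    n_sites = len(rows[0])
    for i, j in combinations(range(n_sites), 2):
        seen = {(r[i], r[j]) for r in rows}
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True

def to_rows(sequences):
    """Convert strings such as '10100' into lists of 0/1 integers."""
    return [[int(c) for c in s] for s in sequences]

M = to_rows(["10100", "10000", "01011", "01010", "00010"])
bad = to_rows(["000", "010", "101", "110"])

fits = has_perfect_phylogeny(M)        # the set M derived by the tree above
conflict = has_perfect_phylogeny(bad)  # sites 1 and 2 show 01, 10 and 11
```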
Imperfect Phylogeny Haplotyping (IPPH): Extending PPH
• Often, real biological data does not have PPH solutions.
• Eskin et al. (2003) found that deleting a small part of the data may lead to a PPH solution (heuristic).
• Our approach: IPPH with an explicit genetic model, allowing a small amount of
– Homoplasy, i.e. back or recurrent mutation
– Recombination
• Goal: extend the usage of PPH
– Real data: may be a small perturbation from PPH
– Haplotype block: low recombination or homoplasy
Back/Recurrent Mutation for Haplotypes
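Data: 000, 010, 101, 110
[Figure: tree rooted at 000 with mutations at sites 1, 2, 3 on its edges and node labels 010 and 100; site 1 appears on two edges.]
More than one mutation at a site.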
Recombinations: Single Crossover
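• Recombination is one of the principal genetic forces shaping genetic variation.
• Two equal-length sequences generate a third sequence of the same length: a prefix of one followed by a suffix of the other.
Parents: 110001111111001 and 000110000001111
Recombinant: 11000 (prefix) + 0000001111 (suffix), with the breakpoint between them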
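As a minimal sketch, a single crossover is a prefix of one parent joined to a suffix of the other at the breakpoint. The function name `recombine` and the parameter `cut` are hypothetical; the sequences are the ones from the slide.

```python
def recombine(prefix_parent, suffix_parent, cut):
    """Single crossover: first `cut` sites from one parent, the rest from the other."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    return prefix_parent[:cut] + suffix_parent[cut:]

child = recombine("110001111111001", "000110000001111", 5)
```

Here `child` consists of the prefix 11000 followed by the suffix 0000001111, and has the same length as both parents.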
IPPH (Imperfect Phylogeny Haplotyping) Problems
• Small deviation from PPH
• H-1 IPPH problem
– Find a tree that allows exactly one site to mutate twice
– The remaining sites can mutate at most once
– Derive haplotypes for the given genotypes
• R-1 IPPH problem
– Find a network that has exactly one recombination event
– Each site mutates at most once
– Derive haplotypes for the given genotypes
Number of Minimum Recombinations for Haplotypes
Frequency of the minimum number of recombinations (Rmin) for small rho (scaled recombination rate); 20 sequences, 30 sites, 500 simulations.
Rmin   Rho=1   Rho=3   Rho=5
0      60.8%   23.6%    8.4%
1      31.8%   35.2%   27.6%
2       6.8%   24.8%   27.8%
3              11.6%   21.6%
4               3.8%    9.0%
5               0.8%    3.6%
6               0.2%    1.4%
Haplotyping with One Homoplasy
Genotype
   s1 s2 s3
a  0  2  0
b  1  2  2
Haplotype
    s1 s2 s3
a1  0  0  0
a2  0  1  0
b1  1  0  1
b2  1  1  0
[Figure: 1 Homoplasy Tree, rooted at 000, with leaves a1, a2, b1, b2 and node labels 010 and 100; more than one mutation at site 1.]
Algorithm for H1-IPPH
• For each site s in the input genotype data M
– Test whether M-{s} has PPH solutions
– If not, move to the next site
– Otherwise, check whether 1 homoplasy at site s can lead to HI solutions
– If yes, stop and report the result
• Assume only one PPH solution for M-{s}
• But how to find solutions with 1 homoplasy at s efficiently?
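The outer loop can be sketched in Python for tiny inputs. This is only an illustration of the first step (testing M-{s} for a PPH solution): it uses exponential brute force rather than the linear-time PPH algorithm, it assumes an all-zero ancestral sequence, and it does not perform the homoplasy check at s. All function names are hypothetical.

```python
from itertools import combinations, product

def resolutions(genotype):
    """All ordered ways to split one genotype into two haplotypes."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b
        yield tuple(h1), tuple(h2)

def is_perfect(haplotypes):
    """Rooted perfect phylogeny test (all-zero ancestor assumed)."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True

def has_pph_solution(genotypes):
    """Brute force: try every combination of resolutions (exponential time)."""
    for choice in product(*(list(resolutions(g)) for g in genotypes)):
        haplotypes = [h for pair in choice for h in pair]
        if is_perfect(haplotypes):
            return True
    return False

def candidate_sites(genotypes):
    """0-based sites s such that the data with s removed has a PPH solution."""
    n_sites = len(genotypes[0])
    found = []
    for s in range(n_sites):
        reduced = [g[:s] + g[s + 1:] for g in genotypes]
        if has_pph_solution(reduced):
            found.append(s)
    return found

# Genotypes a and b from the "Haplotyping with One Homoplasy" example.
M = [(0, 2, 0), (1, 2, 2)]
full_ok = has_pph_solution(M)
candidates = candidate_sites(M)
```

On this example the full data has no PPH solution, while removing the first or the second site leaves data that does; each such site would then go on to the homoplasy check.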
[Figure: a PPH solution Mh-{i3} for M-{i3}, with haplotype pairs r2, r2’ and s2, s2’, shown beside the haplotype column h{i3} for site i3.]
Assume Mh-{i3} is fixed. Haplotypes for the same genotype must pair up. Two ways to pair.
Combine Mh-{i3} with h{i3}
• 4 ways to try pairing i3.
• Exponential number in general, even for one PPH solution.
• Need a polynomial-time method to avoid trying all the pairings.
[Figure: Mh-{i3} combined with h{i3}, giving candidate results Mh1 and Mh2.]
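The growth in the number of pairings can be seen by enumerating them directly. In the sketch below, the haplotype pairs on the other sites are taken as fixed, and each individual that is heterozygous at the removed site contributes two ways to attach its 0 and 1. The function name `attach_site` and the example haplotype strings are hypothetical, and the new column is appended as the last character purely for simplicity.

```python
from itertools import product

def attach_site(fixed_pairs, site_values):
    """Enumerate every way to extend fixed haplotype pairs with one more site.

    fixed_pairs: list of (h, h2) strings, one pair per individual.
    site_values: genotype value (0, 1 or 2) of each individual at the new site.
    """
    options = []
    for (h, h2), g in zip(fixed_pairs, site_values):
        if g == 2:
            # Heterozygous: either haplotype of the pair may carry the 1.
            options.append([(h + "0", h2 + "1"), (h + "1", h2 + "0")])
        else:
            options.append([(h + str(g), h2 + str(g))])
    return [list(choice) for choice in product(*options)]

fixed = [("00", "01"), ("10", "11")]   # hypothetical pairs for two individuals
ways = attach_site(fixed, [2, 2])
```

With two heterozygous individuals there are 4 combinations; with k of them there are 2^k, which is what a polynomial-time method has to avoid enumerating.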
1 Homoplasy: from T to Tr, Ts
[Figure: tree T with a recurrent mutation at site s on two edges, above leaf sets L1 and L2; the other leaves are O1 and O2.]
Site s induces a split Ts: L1, L2 | O1, O2.
Deleting s from T induces the tree Tr.
From Tr, Ts to T
Find two subtrees Ts1, Ts2 in Tr such that Ts1, Ts2 correspond to one side of Ts.
1. Pick one side of the partition from Ts.
2. Pick the leaves of Tr corresponding to the chosen partition side.
3. Check whether the selected leaves fit into two subtrees.
[Figure: tree Tr and split Ts (L | O) combined into tree T, with s on two edges above leaf sets L1 and L - L1; the other leaves are O1 and O2.]
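Step 3 can be sketched for a fixed rooted tree. This is a simplified illustration, not the graph-coloring method of the talk: it assumes Tr is given as a dictionary from each node to its list of children, treats a subtree as a node together with everything below it, and does not handle refinement of non-binary nodes. All names and the example tree are hypothetical.

```python
def leaf_sets(tree, root):
    """Map every node to the frozenset of leaves below it."""
    below = {}

    def visit(node):
        children = tree.get(node, [])
        if not children:
            below[node] = frozenset([node])
        else:
            below[node] = frozenset().union(*(visit(c) for c in children))
        return below[node]

    visit(root)
    return below

def covering_subtrees(tree, root, chosen):
    """Roots of the maximal subtrees whose leaves all lie in `chosen`."""
    chosen = frozenset(chosen)
    below = leaf_sets(tree, root)
    roots = []

    def visit(node):
        if below[node] <= chosen:
            roots.append(node)  # the whole subtree is on the chosen side
            return
        for child in tree.get(node, []):
            visit(child)

    visit(root)
    return roots

def fits_two_subtrees(tree, root, chosen):
    """True if the chosen leaves are exactly the leaves of two disjoint subtrees."""
    return len(covering_subtrees(tree, root, chosen)) == 2

# Hypothetical tree Tr: root -> u, v; u -> a, b; v -> c, w; w -> d, e.
Tr = {"root": ["u", "v"], "u": ["a", "b"], "v": ["c", "w"], "w": ["d", "e"]}
two = fits_two_subtrees(Tr, "root", {"a", "b", "d", "e"})
three = fits_two_subtrees(Tr, "root", {"a", "c", "d"})
```

In the first call the chosen leaves are covered by the subtrees at u and w, so placing s on the two edges above them would explain the split; in the second call three subtrees are needed, so one recurrent mutation is not enough for this tree.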
Algorithms and Results
• Efficient graph-coloring-based method to select the two subtrees (skipped)
• Implemented in C++
• Simulations on data generated with the program ms
• Compared to PHASE (a haplotyping program)
– Accuracy: comparable
– Speed: at least 10x faster
– 100x100 data: about 3 seconds
• Can identify the homoplasy site with high accuracy: >95% in simulation
1-SPR operation
SPR: subtree-prune-regraft operation
The 1-recombination condition is equivalent to distance-SPR(TL, TR) = 1
Algorithm for R1-IPPH
• The brute-force 1-SPR idea leads to exponential time when TL or TR is not binary.
• Trickier than H1-IPPH, but with care, R1-IPPH can be solved in polynomial time. (not in paper)
Conclusions
• Contributions
– Assuming a bounded number of PPH solutions:
1. Polynomial-time algorithm for the H1-IPPH problem
2. Polynomial-time algorithm for the R1-IPPH problem
3. Possible extension to more than 1 homoplasy event
• Open problems
– Haplotyping with more than 1 recombination efficiently
– Remove the assumption that the number of PPH solutions for M-{s} is bounded