triplet and quartet distances between trees of arbitrary degree
DESCRIPTION
Triplet and Quartet Distances Between Trees of Arbitrary Degree. ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana, USA , 8 January 2013. Rooted. Evolutionary Tree. Time. Unrooted Evolutionary Tree. Dominant modern approach to study evolution is from DNA analysis. - PowerPoint PPT PresentationTRANSCRIPT
Triplet and Quartet DistancesBetween Trees of Arbitrary Degree
Gerth Stølting BrodalAarhus University
Rolf FagerbergUniversity of Southern Denmark
Thomas Mailund, Christian N. S. Pedersen, Andreas SandAarhus University, Bioinformatics Research Center
ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana, USA, 8 January 2013
Evolutionary Tree
Bonobo Chimpanzee Human Neanderthal Gorilla Orangutan
Tim
e
Rooted
Unrooted Evolutionary Tree
Dominant modern approach to study evolution is from DNA analysis
Constructing Evolutionary Trees – Binary or Arbitrary Degrees ?
Sequence dataDistance matrix
1 2 3 ··· n123···
n
Neighbor JoiningSaitou, Nei 1987
[ O(n3) Saitou, Nei 1987 ]
Refined Buneman TreesMoulton, Steel 1999
[ O(n3) Brodal et al. 2003 ]
Buneman TreesBuneman 1971
[ O(n3) Berry, Bryan 1999 ]
123···
n
.... ....
Binary trees(despite no evidence
in distance data)
Arbitrary degrees(strong support for all edges ; few branches)
Arbitrary degree (compromise ; good
support for all edges)
Data Analysis vs Expert Trees – Binary vs Arbitrary Degrees ?
Linguistic expert classification (Aryon Rodrigues)
Neighbor Joining on linguistic data
Cultural Phylogenetics of the Tupi Language Family in Lowland South America. R. S. Walker, S. Wichmann, T. Mailund, C. J. Atkisson. PLoS One. 7(4), 2012.
Evolutionary Tree Comparison
?
split1357|2468
1
4
32
5
67
T2
81
6
3
2
5
47
T1
8
Common Only T1 Only T2
1357|2468 35|124678 57|123468
13567|248 48|123567
Robinson-Foulds distance = # non-common splits = 2 + 1 = 3
[Day 1985] O(n) time algorithm using 2 x DFS + radix sort
D. F. Robinson and L. R. Foulds. Comparison of weighted labeled trees. In Combinatorialmathematics, VI, Lecture Notes in Mathematics, pages 119–126. Springer, 1979.
8
8
Robinson-Foulds Distance (unrooted trees)
?
T1
Common Only T1 Only T2
(none) 12567|3481257|3468157|2346857|123468
125678|3412578|3461578|2346578|1234678|123456
1
6
2
5
4
7T2
3
1
6
2
5
4
7
3
RF-dist(T1 , T2) = 4 + 5 = 9RF-dist(T1\{8} , T2\{8}) = 0
Robinson-Foulds very sensitive to outliers
D. F. Robinson and L. R. Foulds. Comparison of weighted labeled trees. In Combinatorialmathematics, VI, Lecture Notes in Mathematics, pages 119–126. Springer, 1979.
resolved : ij|kl
Quartet Distance (unrooted trees)
Consider all quartets, i.e. topologies of subsets of 4 leaves {i,j,k,l}
Quartet T1 T2
{1,2,3,4} 14|23 14|23{1,2,3,5} 13|25 15|23{1,2,4,5} 14|25 1245{1,3,4,5} 14|35 1345{2,3,4,5} 25|34 23|45
i
j
k
l
i
j
k
l
unresolved : ijkl(only non-binary trees)
Quartet-dist(T1 , T2) = - # common quartets = 5 - 1 = 4
1
3
2 5
24
3 1
5
T1 T2
4
G. Estabrook, F. McMorris, and C. Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology, 34:193-200, 1985.
Triplet Distance (rooted trees)
Consider all triplets, i.e. topologies of subsets of 3 leaves {i,j,k}
Triplet T1 T2{1,2,3} 2|13 2|13{1,2,4} 1|24 4|12{1,2,5} 1|25 5|12{1,3,4} 4|13 4|13{1,3,5} 5|13 5|13{1,4,5} 1|45 1|45{2,3,4} 3|24 4|23{2,3,5} 3|25 5|23{2,4,5} 5|24 2|45{3,4,5} 3|45 3|45
resolved : k|iji j
k i j k
unresolved : ijk(only non-binary trees)
Triplet-dist(T1 , T2) = - # common triplets = 10 - 5 = 5
1
2
53
T1
44
1
52
T2
3
D. E. Critchlow, D. K. Pearl, C. L. Qian: The triples distance for rooted bifurcating phylogenetic trees. Systematic Biology, 45(3):323-334, 1996.
RootedTriplet distance
UnrootedQuartet distance
BinaryO(n2)
O(nlog n)CPQ 1996
[SODA 2013]
O(n3)O(n2)
O(nlog2 n)O(nlog n)
D 1985
BTKL 2000
BFP 2001
BFP 2003
Degrees d
O(n2)O(nlog n)
BDF 2011
[SODA 2013]
O(d 9nlog n)
O(n2.688)O(dnlog
n)
SPMBF 2007
NKMP 2011
[SODA 2013]
Computational Results
12
534 1
3
2 5
4
12
534 6
712
3 110
6 7 1311
58
9
Distance Computation T2
Resolved Unresolved
T1
ResolvedA : Agree
C
B : Disagree
Unresolved D E
i jk
i j k i j k
i j k
i jk
j ki
i kj
i kj
Triplet-dist(T1 , T2) = A – E = B + C + D
A + B + C + D + E = D + E and C + E unresolved in one tree
Sufficient to compute A and E or A and B
i jk
j ki
O(n) triplets & quartets
O(n·log n) triplets
O(n·log n) triplets & quartets
O(d·n·log n) quartets
T2
Resolved Unresolved
T1
ResolvedA : Agree
C
B : Disagree
Unresolved D E
i jk
i j k i j k
i j k
i jk
j ki
i kj
i kj
i jk
j ki
Parameterized Triplet & Quartet DistancesB + α·(C + D) , 0 α 1
BDF 2011 O(n2) for triplet, NKMP 2011 O(n2.688) for quartet[SODA 13] O(n·log n) and O(d·n·log n), respectively
Counting Unresolved Triplets in One Tree
n1 n2 n3 ··· nd
v∑v
∑i < j < k
n i ·n j·n k
∑v ( ∑
i < j <k <ln i· n j· n k · n l+(n−∑l n l ) ∑i < j < k n i ·n j·n k )
Computable in O(n) time using DFS + dynamic programming
n1 n2 n3 ··· nd
v
Triplet anchored at v
Quartet anchored at v
Quartets (root tree arbitrary)
Counting Agreeing Triplets(Basic Idea)
v
1 i j d
∑v T1
∑wT 2
∑c
∑1≤i ≤d
(n i❑c
2 ) (nw−nc−n i❑w+n i
❑c )
T1
j
T2
c
i i
∑1≤ i≤d
n i❑w
w0
Efficient ComputationLimit recolorings in T1 (and T2) to O(n·log n)
v
1 1 1
0v
1 2 d
0v
1
0v
0
v
1
0
v
1
0...
Count T2 contribution
(precondition)
Recolor Recolor Recurse
Recolor & recurse
T1
Reduce recoloring cost in T2 to O(n·log2 n)
1
2 3
4
56
7 891
24
79
8563
Reduce recoloring cost in T2 from O(n·log2 n) to O(n·log n)
T2
arbitrary
height
degree
H(T2)
binary
height
O(log n)
Contract T2 and reconstruct H(T2) during recursion
Counting Agreeing Triplets (II)
v
1 i j d
T10
node in H(T2) =component
composition in T2
∑1≤ i ≤d
(n i❑C1
2 )(n∗❑C2−n i❑C2 )∑
1≤ i ≤d(n∗❑C1−n i
❑C1 ) n( ii )❑ C2∑1≤ i ≤d
n i❑C1 ·n i↑∗
❑ C2+ +
Contribution to agreeing triplets at node in H(T2)
i
i j
ii
j
j
ii
C2
C1
From O(n·log2 n) to O(n·log n)Update O(1) counters for all
colors through node
∑2≤ i ≤d
log (¿T2∨¿n i )= ∑2≤i ≤d
n i ∙ log nv
n iColored path lengths
v
1 i j d
T1 0
nv
ni
w H( T2)
Compressed version of T2 of size O(nv)
Total cost for updating counters
∑leaf l∈ T1
∑ancestor a( j)not heavy child
log na ( j +1)
na ( j )= n ∙ log n
a(1)a(2)
l=a(0)
a(3)
a(4) a(5)
T1
Counting Quartets...
Bottleneck in computing disagreeing resolved-resolved quartets
v
1 i j d
T10
i j i j
∑1≤ i<d
∑i < j ≤ d
n ( ij )G1 ·n ( ij )G2G1 G2
T2
double-sum factor d time
Root T1 and T2 arbitrary Keep up to 15+38d different counters per node in H(T2)...
Distance Computation T2
Resolved Unresolved
T1
ResolvedA : Agree
C
B : Disagree
Unresolved D E
i jk
i j k i j k
i j k
i jk
j ki
i kj
i kj
Triplet-dist(T1 , T2) = A – E = B + C + D
A + B + C + D + E = D + E and C + E unresolved in one tree
Sufficient to compute A and E or A and B
i jk
j ki
O(n) triplets & quartets
O(n·log n) triplets
O(n·log n) triplets & quartets
O(d·n·log n) quartets
RootedTriplet distance
UnrootedQuartet distance
BinaryO(n2)
O(nlog n)CPQ 1996
[SODA 2013]
O(n3)O(n2)
O(nlog2 n)O(nlog n)
D 1985
BTKL 2000
BFP 2001
BFP 2003
Degrees d
O(n2)O(nlog n)
BDF 2011
[SODA 2013]
O(d 9nlog n)
O(n2.688)O(dnlog
n)
SPMBF 2007
NKMP 2011
[SODA 2013]
Summary
d = maximal degree of any node in T1 and T2
12
534 1
3
2 5
4
12
534 6
712
3 110
6 7 1311
58
9
O(n·log n) ?
o(n·log n) ? The End