triplet and quartet distances between trees of arbitrary degree gerth stølting brodal aarhus...

20
Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas Mailund, Christian N. S. Pedersen, Andreas Sand Aarhus University, Bioinformatics Research Center ETH Zürich, Switzerland, 22 November 2012 (paper to be presented at SODA’13)

Upload: eliza-wiseman

Post on 29-Mar-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

Triplet and Quartet DistancesBetween Trees of Arbitrary Degree

Gerth Stølting BrodalAarhus University

Rolf FagerbergUniversity of Southern Denmark

Thomas Mailund, Christian N. S. Pedersen, Andreas SandAarhus University, Bioinformatics Research Center

ETH Zürich, Switzerland, 22 November 2012

(paper to be presented at SODA’13)

Page 2: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

Evolutionary Tree

Bonobo Chimpanzee Human Neanderthal Gorilla Orangutan

Tim

e

Rooted

Page 3: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

Unrooted Evolutionary Tree

Dominant modern approach to study evolution is from DNA analysis

Page 4: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

Constructing Evolutionary Trees – Binary or Arbitrary Degrees ?

Sequence dataDistance matrix

1 2 3 ··· n123···

n

Neighbor JoiningSaitou, Nei 1987

[ O(n3) Saitou, Nei 1987 ]

Refined Buneman TreesMoulton, Steel 1999

[ O(n3) Brodal et al. 2003 ]

Buneman TreesBuneman 1971

[ O(n3) Berry, Bryan 1999 ]

123···

n

.... ....

Binary trees(despite no evidence

in distance data)

Arbitrary degrees(strong support for all edges ; few branches)

Arbitrary degree (compromise ; good

support for all edges)

Page 5: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

Data Analysis vs Expert Trees – Binary vs Arbitrary Degrees ?

Linguistic expert classification (Aryon Rodrigues)

Neighbor Joining on linguistic data

Cultural Phylogenetics of the Tupi Language Family in Lowland South America. R. S. Walker, S. Wichmann, T. Mailund, C. J. Atkisson. PLoS One. 7(4), 2012.

Page 6: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

Evolutionary Tree Comparison

?

split1357|2468

1

4

32

5

67

T2

81

6

3

2

5

47

T1

8

Common Only T1 Only T2

1357|2468 35|124678 57|123468

13567|248 48|123567

Robinson-Foulds distance = # non-common splits = 2 + 1 = 3

[Day 1985] O(n) time algorithm using 2 x DFS + radix sort

D. F. Robinson and L. R. Foulds. Comparison of weighted labeled trees. In Combinatorialmathematics, VI, Lecture Notes in Mathematics, pages 119–126. Springer, 1979.

Page 7: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

8

8

Robinson-Foulds Distance (unrooted trees)

?

T1

Common Only T1 Only T2

(none) 12567|3481257|3468157|2346857|123468

125678|3412578|3461578|2346578|1234678|123456

1

6

2

5

4

7T2

3

1

6

2

5

4

7

3

RF-dist(T1 , T2) = 4 + 5 = 9RF-dist(T1\{8} , T2\{8}) = 0

Robinson-Foulds very sensitive to outliers

D. F. Robinson and L. R. Foulds. Comparison of weighted labeled trees. In Combinatorialmathematics, VI, Lecture Notes in Mathematics, pages 119–126. Springer, 1979.

Page 8: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

resolved : ij|kl

Quartet Distance (unrooted trees)

Consider all quartets, i.e. topologies of subsets of 4 leaves {i,j,k,l}

Quartet T1 T2

{1,2,3,4} 14|23 14|23{1,2,3,5} 13|25 15|23{1,2,4,5} 14|25 1245{1,3,4,5} 14|35 1345{2,3,4,5} 25|34 23|45

i

j

k

l

i

j

k

l

unresolved : ijkl(only non-binary trees)

Quartet-dist(T1 , T2) = - # common quartets = 5 - 1 = 4

1

3

2 5

24

3 1

5

T1 T2

4

G. Estabrook, F. McMorris, and C. Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology, 34:193-200, 1985.

Page 9: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

Triplet Distance (rooted trees)

Consider all triplets, i.e. topologies of subsets of 3 leaves {i,j,k}

Triplet T1 T2

{1,2,3} 2|13 2|13{1,2,4} 1|24 4|12{1,2,5} 1|25 5|12{1,3,4} 4|13 4|13{1,3,5} 5|13 5|13{1,4,5} 1|45 1|45{2,3,4} 3|24 4|23{2,3,5} 3|25 5|23{2,4,5} 5|24 2|45{3,4,5} 3|45 3|45

resolved : k|iji j

k i j k

unresolved : ijk(only non-binary trees)

Triplet-dist(T1 , T2) = - # common triplets = 10 - 5 = 5

1

2

53

T1

44

1

52

T2

3

D. E. Critchlow, D. K. Pearl, C. L. Qian: The triples distance for rooted bifurcating phylogenetic trees. Systematic Biology, 45(3):323-334, 1996.

Page 10: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

RootedTriplet distance

UnrootedQuartet distance

BinaryO(n2)

O(nlog n)CPQ 1996

[SODA 2013]

O(n3)O(n2)

O(nlog2 n)O(nlog n)

D 1985

BTKL 2000

BFP 2001

BFP 2003

Degrees d O(n2)

O(nlog n)BDF 2011

[SODA 2013]

O(d 9nlog n)

O(n2.688)O(dnlog n)

SPMBF 2007

NKMP 2011

[SODA 2013]

Computational Results

12

534 1

3

2 5

4

12

534 6

712

3 110

6 7 1311

58

9

Page 11: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

Distance Computation T2

Resolved Unresolved

T1

ResolvedA : Agree

C

B : Disagree

Unresolved D E

i jk

i j k i j k

i j k

i jk

j ki

i kj

i kj

Triplet-dist(T1 , T2) = A – E = B + C + D

A + B + C + D + E = D + E and C + E unresolved in one tree

Sufficient to compute A and E or A and B

i jk

j ki

O(n) triplets & quartets

O(n·log n) triplets

O(n·log n) triplets & quartets

O(d·n·log n) quartets

Page 12: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

T2

Resolved Unresolved

T1

ResolvedA : Agree

C

B : Disagree

Unresolved D E

i jk

i j k i j k

i j k

i jk

j ki

i kj

i kj

i jk

j ki

Parameterized Triplet & Quartet DistancesB + α·(C + D) , 0 α 1

BDF 2011 O(n2) for triplet, NKMP 2011 O(n2.688) for quartet[SODA 13] O(n·log n) and O(d·n·log n), respectively

Page 13: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

Counting Unresolved Triplets in One Tree

n1 n2 n3 ··· nd

v∑v

∑i < j < k

n i ·n j·n k

∑v ( ∑

i < j <k <l

n i· n j· n k · n l+(n−∑l n l ) ∑i < j < k n i ·n j·n k )

Computable in O(n) time using DFS + dynamic programming

n1 n2 n3 ··· nd

v

Triplet anchored at v

Quartet anchored at v

Quartets (root tree arbitrary)

Page 14: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

Counting Agreeing Triplets(Basic Idea)

v

1 i j d

∑v T1

∑wT 2

∑c

∑1≤i ≤d

(n i❑c

2 ) (nw−nc−n i❑w+n i

❑c )

T1

j

T2

c

i i

∑1≤ i≤d

n i❑w

w0

Page 15: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

Efficient ComputationLimit recolorings in T1 (and T2) to O(n·log n)

v

1 1 1

0v

1 2 d

0v

1

0v

0

v

1

0

v

1

0...

Count T2 contribution

(precondition)

Recolor Recolor Recurse

Recolor & recurse

T1

Reduce recoloring cost in T2 to O(n·log2 n)

1

2 3

4

56

7 891

24

79

8563

Reduce recoloring cost in T2 from O(n·log2 n) to O(n·log n)

T2

arbitrary

height

degree

H(T2)

binary

height

O(log n)

Contract T2 and reconstruct H(T2) during recursion

Page 16: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

Counting Agreeing Triplets (II)

v

1 i j d

T10

node in H(T2) =component

composition in T2

∑1≤ i≤d

(n i❑C1

2 )(n∗❑C2−n i❑C2 )∑

1≤ i≤d(n∗❑C1−n i

❑C1 ) n( ii )❑ C2∑1≤ i≤d

n i❑C1 ·n i↑∗

❑ C2+ +

Contribution to agreeing triplets at node in H(T2)

i

i j

ii

j

j

ii

C2

C1

Page 17: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

From O(n·log2 n) to O(n·log n)Update O(1) counters for all

colors through node

∑2≤ i≤d

log (¿ T2∨¿n i )= ∑2≤i ≤d

n i ∙ lognv

n iColored path lengths

v

1 i j d

T10

nv

ni

w H(T2)

Compressed version of T2 of size O(nv)

Total cost for updating counters

∑leaf  l∈ T1

∑ancestor   a( j)not   heavy   child

logna ( j +1)

na ( j ) =  n ∙ log na(1)

a(2)

l=a(0)

a(3)

a(4) a(5)

T1

Page 18: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

Counting Quartets...

Bottleneck in computing disagreeing resolved-resolved quartets

v

1 i j d

T10

i j i j

∑1≤ i<d  

∑i < j ≤ d

n ( ij )G1 ·n ( ij )G2G1 G2

T2

double-sum factor d time

Root T1 and T2 arbitrary Keep up to 15+38d different counters per node in H(T2)...

Page 19: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

Distance Computation T2

Resolved Unresolved

T1

ResolvedA : Agree

C

B : Disagree

Unresolved D E

i jk

i j k i j k

i j k

i jk

j ki

i kj

i kj

Triplet-dist(T1 , T2) = A – E = B + C + D

A + B + C + D + E = D + E and C + E unresolved in one tree

Sufficient to compute A and E or A and B

i jk

j ki

O(n) triplets & quartets

O(n·log n) triplets

O(n·log n) triplets & quartets

O(d·n·log n) quartets

Page 20: Triplet and Quartet Distances Between Trees of Arbitrary Degree Gerth Stølting Brodal Aarhus University Rolf Fagerberg University of Southern Denmark Thomas

RootedTriplet distance

UnrootedQuartet distance

BinaryO(n2)

O(nlog n)CPQ 1996

[SODA 2013]

O(n3)O(n2)

O(nlog2 n)O(nlog n)

D 1985

BTKL 2000

BFP 2001

BFP 2003

Degrees dO(n2)

O(nlog n)BDF 2011

[SODA 2013]

O(d 9nlog n)

O(n2.688)O(dnlog n)

SPMBF 2007

NKMP 2011

[SODA 2013]

Summary

d = maximal degree of any node in T1 and T2

12

534 1

3

2 5

4

12

534 6

712

3 110

6 7 1311

58

9

O(n·log n) ?

o(n·log n) ? The End