gene tree discordance and multi-species coalescent models

Gene tree discordance and multi-species coalescent models Noah Rosenberg December 21, 2007 James Degnan Randa Tao David Bryant Mike DeGiorgio



Page 1: Gene tree discordance and  multi-species coalescent models

Gene tree discordance and multi-species coalescent models

Noah Rosenberg

December 21, 2007

James Degnan, Randa Tao, David Bryant, Mike DeGiorgio

Page 2: Gene tree discordance and  multi-species coalescent models

Gene trees and species trees

Different genes may produce different inferences about species relationships

Page 3: Gene tree discordance and  multi-species coalescent models

Coalescent model for evolution within species, conditional on the species tree

Hudson (1983, Evolution)
Tajima (1983, Genetics)
Nei (1987, Molecular Evolutionary Genetics book)
Pamilo & Nei (1988, Molecular Biology and Evolution)
Takahata (1989, Genetics)
Wu (1991, Genetics)
Hudson (1992, Genetics)
Maddison (1997, Systematic Biology)

[Figure: species tree with branch lengths T2 and T3]

Page 4: Gene tree discordance and  multi-species coalescent models

Assumptions of the multispecies coalescent model conditional on a species tree

1. Coalescences occur within species, with the same rate for each lineage pair.

2. The rate of coalescence is proportional to the number of pairs of lineages.

3. When species splits are encountered, lineages from all groups descended from the split are allowed to coalesce.

[Figure: species tree with branch lengths T2 and T3]

Page 5: Gene tree discordance and  multi-species coalescent models

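The probability that i lineages have j ancestors at T coalescent time units (T = t/N) in the past is

$$g_{ij}(T) = \sum_{k=j}^{i} e^{-k(k-1)T/2} \, \frac{(2k-1)(-1)^{k-j}}{j!\,(k-j)!} \, \frac{j_{(k-1)}\, i_{[k]}}{i_{(k)}}$$

where $a_{[k]} = a(a-1)\cdots(a-k+1)$ and $a_{(k)} = a(a+1)\cdots(a+k-1)$, with $a_{[0]} = a_{(0)} = 1$.

Takahata and Nei (1985, Genetics)
Tavare (1984, Theoretical Population Biology)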
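The probability $g_{ij}(T)$ can be evaluated directly from the rising and falling factorials defined on this slide. The following is a minimal sketch, assuming the standard form of the formula from Tavare (1984); the function names `rising`, `falling`, and `g` are illustrative and not from the talk.

```python
from math import exp, factorial

def rising(a, k):
    """Rising factorial a_(k) = a (a+1) ... (a+k-1); the empty product is 1."""
    out = 1
    for m in range(k):
        out *= a + m
    return out

def falling(a, k):
    """Falling factorial a_[k] = a (a-1) ... (a-k+1); the empty product is 1."""
    out = 1
    for m in range(k):
        out *= a - m
    return out

def g(i, j, T):
    """Probability that i lineages have j ancestors T coalescent units ago.

    Assumes the standard formula
    sum_{k=j}^{i} exp(-k(k-1)T/2) (2k-1)(-1)^(k-j) j_(k-1) i_[k] / (j! (k-j)! i_(k)).
    """
    if not 1 <= j <= i:
        return 0.0
    total = 0.0
    for k in range(j, i + 1):
        sign = -1 if (k - j) % 2 else 1
        coeff = (2 * k - 1) * sign * rising(j, k - 1) * falling(i, k)
        coeff /= factorial(j) * factorial(k - j) * rising(i, k)
        total += exp(-k * (k - 1) * T / 2) * coeff
    return total
```

For two lineages this reduces to `g(2, 1, T) = 1 - exp(-T)` and `g(2, 2, T) = exp(-T)`, the two quantities that appear in the 3-taxon concordance calculation.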

Page 6: Gene tree discordance and  multi-species coalescent models

Probability of a concordant gene tree topology

[Figure: species tree on A, B, C with internal branch length T; panels show a concordant gene tree and a discordant gene tree]

For 3 taxa, the probability of concordance is a sum of two terms:

1. The probability that the gene tree is determined in the 2-species phase, or $1 - e^{-T}$

2. 1/3 of the probability that the gene tree is determined in the ancestral phase, or $(1/3)e^{-T}$

Probability of concordance equals $1 - (2/3)e^{-T}$

Hudson (1983, Evolution)
Tajima (1983, Genetics)
Nei (1987, Molecular Evolutionary Genetics)
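The two terms above can be checked numerically. The sketch below assumes the species tree ((AB)C) with internal branch length T in coalescent units; the function name `three_taxon_probabilities` is illustrative only.

```python
from math import exp

def three_taxon_probabilities(T):
    """Gene tree topology probabilities for species tree ((AB)C).

    T is the internal branch length in coalescent units.
    """
    coalesce_in_branch = 1 - exp(-T)   # A and B coalesce in the 2-species phase
    reach_root = exp(-T)               # all three lineages enter the ancestral phase
    return {
        "((AB)C)": coalesce_in_branch + reach_root / 3,  # = 1 - (2/3) e^{-T}
        "((AC)B)": reach_root / 3,
        "((BC)A)": reach_root / 3,
    }
```

Because $1 - (2/3)e^{-T} \geq (1/3)e^{-T}$ for every $T \geq 0$, the matching topology is never less probable than a discordant one when there are only three taxa.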

Page 7: Gene tree discordance and  multi-species coalescent models

Probability of the matching gene tree ((AB)C)

Probability of a particular discordant gene tree ((BC)A)

Page 8: Gene tree discordance and  multi-species coalescent models

It would be desirable to have a general computation of the probability that a particular species tree topology with branch lengths gives rise to a particular gene tree topology

Page 9: Gene tree discordance and  multi-species coalescent models

Gene tree probabilities under the multispecies coalescent model

Consider a species tree S (topology and branch lengths)

Consider a gene tree G (topology only)

A coalescent history gives the list of species tree branches on which gene tree coalescences occur.

[Figure: two trees on taxa A, B, C]

JH Degnan & LA Salter, Evolution 59: 24-37 (2005)

Page 10: Gene tree discordance and  multi-species coalescent models

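The list of coalescent histories for an example with five taxa

[Figure: species tree with leaf order A B C D E and branches numbered 1-4; gene tree (((AC)B)(DE)) with leaf order A C B D E]

[Table: for each coalescent history, the species tree branch on which each gene tree coalescence, (A,C), ((AC),B), (D,E), (((AC)B),(DE)), occurs, together with the probability of the history]

$g_{ij}(T)$ is the probability that i lineages coalesce to j lineages during time T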

Page 11: Gene tree discordance and  multi-species coalescent models

Computing the probabilities of gene trees

• What are the properties of the number of coalescent histories?

Using the probabilities of gene trees

• Is it possible for the most likely gene tree to disagree with the species tree?

• How do species tree inference algorithms behave when applied to multiple gene trees?

Page 12: Gene tree discordance and  multi-species coalescent models

The number of coalescent histories

Page 13: Gene tree discordance and  multi-species coalescent models

The number of coalescent histories for the matching gene tree

[Figure: species tree on taxa A, B, C, D, E, F with numbered branches]

$A_{S,m}$ is the number of coalescent histories for the matching gene tree when we subdivide the species tree root into m pieces

Page 14: Gene tree discordance and  multi-species coalescent models

The number of coalescent histories for trees with at most 5 taxa

Page 15: Gene tree discordance and  multi-species coalescent models

Number of coalescent histories for special shapes with n taxa

Catalan number $C_{n-1}$ (Degnan 2005)

1, 2, 5, 14, 42, 132, 429, 1430, …

Number of taxa in left subtree is l

-, -, -, 13, 42, 138, 462, 1573…
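The counts on this slide can be reproduced with a short recursion on $A_{S,m}$. The sketch below assumes the recursion $A_{S,m} = \sum_{k=1}^{m} A_{S_L,k+1} A_{S_R,k+1}$, with $A = 1$ for a single leaf, where $S_L$ and $S_R$ are the two subtrees below the root of S. That recursion is not shown in the transcript and is an assumption here, as are the nested-tuple tree encoding and the names `histories` and `add_leaves`.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def histories(tree, m=1):
    """Number of coalescent histories for the matching gene tree, A_{S,m}.

    `tree` is a leaf label (str) or a pair (left, right) of subtrees;
    m is the number of pieces into which the branch above the root is divided.
    """
    if not isinstance(tree, tuple):
        return 1
    left, right = tree
    return sum(histories(left, k + 1) * histories(right, k + 1)
               for k in range(1, m + 1))

def add_leaves(base, extra):
    """Attach `extra` leaves one at a time above `base` (caterpillar-like shape)."""
    tree = base
    for i in range(extra):
        tree = (tree, f"x{i}")
    return tree

cherry = ("a", "b")
symmetric4 = (("a", "b"), ("c", "d"))

# Caterpillars with n = 2..9 taxa.
caterpillar_counts = [histories(add_leaves(cherry, e)) for e in range(0, 8)]
# Caterpillar-like trees on a symmetric 4-taxon subtree, n = 5..9 taxa.
symmetric4_counts = [histories(add_leaves(symmetric4, e)) for e in range(1, 6)]
```

Under this assumption, `caterpillar_counts` gives the Catalan sequence listed on the slide, and `symmetric4_counts` gives 13, 42, 138, 462, 1573, so the second sequence appears to correspond to caterpillar-like shapes built on the symmetric 4-taxon subtree.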

Page 16: Gene tree discordance and  multi-species coalescent models

The number of coalescent histories for up to 11 taxa

Page 17: Gene tree discordance and  multi-species coalescent models

Ratio of the largest and smallest number of coalescent histories for n taxa

>

Page 18: Gene tree discordance and  multi-species coalescent models

Which types of shapes have the most coalescent histories?

The number of coalescent histories for trees with 8 taxa

Most

Least

Page 19: Gene tree discordance and  multi-species coalescent models

Caterpillar-like shapes with n taxa, based on 4- and 5-taxon subtrees

$C_{n-1}$

~(5/4) $C_{n-1}$ = 1.25 $C_{n-1}$

~(23/16) $C_{n-1}$ = 1.4375 $C_{n-1}$

Page 20: Gene tree discordance and  multi-species coalescent models

Largest values for caterpillar-like shapes based on 7- and 8-taxon subtrees

~(1381/256) $C_{n-1}$ = 5.39453125 $C_{n-1}$

~(189/64) $C_{n-1}$ = 2.953125 $C_{n-1}$

Page 21: Gene tree discordance and  multi-species coalescent models

Can a non-matching gene tree have more coalescent histories?

Caterpillar species tree

1430 coalescent histories

1441 coalescent histories

Page 22: Gene tree discordance and  multi-species coalescent models

Computing the probabilities of gene trees

• What are the properties of the number of coalescent histories?

Using the probabilities of gene trees

• Is it possible for the most likely gene tree to disagree with the species tree?

• How do species tree inference algorithms behave when applied to multiple gene trees?

Page 23: Gene tree discordance and  multi-species coalescent models

For n > 3 taxa, can species trees be discordant with the gene trees they are most likely to produce?

Page 24: Gene tree discordance and  multi-species coalescent models

The labeled history for a gene tree is its sequence of coalescence events.

The two labeled histories below produce the same labeled topology ((AB)(CD))

[Figure: two labeled histories on taxa A, B, C, D]

Randomly joining pairs of lineages leads to a uniform distribution over the set of possible labeled histories.

The number of labeled histories possible for four taxa is 18.
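The uniform distribution over labeled histories can be checked by brute-force enumeration. The sketch below joins pairs of lineages in every possible order and tallies the resulting labeled topologies; the nested-`frozenset` tree encoding and the function name `labeled_histories` are illustrative choices, not from the talk.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def labeled_histories(lineages):
    """Yield the final tree for every possible sequence of pairwise joins."""
    if len(lineages) == 1:
        yield lineages[0]
        return
    for a, b in combinations(range(len(lineages)), 2):
        merged = frozenset({lineages[a], lineages[b]})
        rest = [x for idx, x in enumerate(lineages) if idx not in (a, b)]
        yield from labeled_histories(rest + [merged])

# Each sequence of joins has probability (1/6)(1/3)(1) under random joining,
# so every labeled history is equally likely.
counts = Counter(labeled_histories(["A", "B", "C", "D"]))
total = sum(counts.values())
probability = {tree: Fraction(c, total) for tree, c in counts.items()}

AB_CD = frozenset({frozenset({"A", "B"}), frozenset({"C", "D"})})
ABC_D = frozenset({frozenset({frozenset({"A", "B"}), "C"}), "D"})
```

Here `total` is the number of labeled histories, `probability[AB_CD]` is 1/9 because two labeled histories share that topology, and `probability[ABC_D]` is 1/18.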

Page 25: Gene tree discordance and  multi-species coalescent models

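If the branch lengths of the species tree are sufficiently short, coalescences will occur more anciently than the species tree root.

[Figure: species tree on A, B, C, D with branch lengths T2 and T3, and three labeled histories on A, B, C, D. The two labeled histories with topology ((AB)(CD)) have combined probability 1/9; the single labeled history with topology (((AB)C)D) has probability 1/18.]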

Page 26: Gene tree discordance and  multi-species coalescent models

Gene tree frequency distribution

Species tree: (((AB)C)D)

Gene tree      Probability
((AB)(CD))     0.132
((AC)(BD))     0.094
((AD)(BC))     0.094
(((AB)C)D)     0.125   (matching gene tree)
(((AB)D)C)     0.100
(((AC)B)D)     0.070
(((AC)D)B)     0.062
(((AD)B)C)     0.032
(((AD)C)B)     0.032
(((BC)A)D)     0.070
(((BC)D)A)     0.062
(((BD)A)C)     0.032
(((BD)C)A)     0.032
(((CD)A)B)     0.032
(((CD)B)A)     0.032

Page 27: Gene tree discordance and  multi-species coalescent models

[Figure: two plots with axes T2 (units of N generations) and T3. First panel: species tree is (((AB)C)D), most likely gene tree is not (((AB)C)D). Second panel: species tree is (((AB)C)D) but most likely gene tree is ((AB)(CD)).]

A species tree topology produces anomalous gene trees if branch lengths can be chosen so that the most likely gene tree topology differs from the species tree topology.
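The definition can be explored by simulation. The following is a Monte Carlo sketch, not the analytic computation used in the talk: it simulates gene trees on the species tree (((AB)C)D) under the multispecies coalescent and tallies topologies. The parameter names `t_ab` (length of the branch ancestral to A and B) and `t_abc` (length of the branch ancestral to A, B, and C), the function names, and the default number of loci are all assumptions made for illustration.

```python
import random
from collections import Counter

def coalesce(lineages, duration, rng):
    """Run the coalescent for `duration` coalescent units (None = to completion)."""
    lineages = list(lineages)
    elapsed = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # Total coalescence rate is the number of lineage pairs.
        elapsed += rng.expovariate(k * (k - 1) / 2)
        if duration is not None and elapsed > duration:
            break
        a, b = rng.sample(range(k), 2)
        merged = frozenset({lineages[a], lineages[b]})
        lineages = [x for idx, x in enumerate(lineages) if idx not in (a, b)]
        lineages.append(merged)
    return lineages

def simulate_gene_tree(t_ab, t_abc, rng):
    """One gene tree topology on species tree (((AB)C)D), one lineage per species."""
    pop = coalesce(["A", "B"], t_ab, rng)      # branch ancestral to A and B
    pop = coalesce(pop + ["C"], t_abc, rng)    # branch ancestral to A, B, C
    return coalesce(pop + ["D"], None, rng)[0]  # root population

def gene_tree_frequencies(t_ab, t_abc, n_loci=20000, seed=1):
    rng = random.Random(seed)
    counts = Counter(simulate_gene_tree(t_ab, t_abc, rng) for _ in range(n_loci))
    return {tree: c / n_loci for tree, c in counts.items()}

MATCHING = frozenset({frozenset({frozenset({"A", "B"}), "C"}), "D"})
SYMMETRIC = frozenset({frozenset({"A", "B"}), frozenset({"C", "D"})})
```

With both internal branches short, for example `gene_tree_frequencies(0.05, 0.05)`, the symmetric topology ((AB)(CD)) should be sampled more often than the matching topology; with long branches such as `gene_tree_frequencies(2.0, 2.0)`, the matching topology should dominate.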

Page 28: Gene tree discordance and  multi-species coalescent models

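Does the 4-taxon symmetric species tree topology produce anomalous gene trees?

[Figure: species tree on A, B, C, D with branch lengths T2 and T3, and three labeled histories on A, B, C, D. The two labeled histories with the symmetric topology have combined probability 1/9; a single labeled history with an asymmetric topology has probability 1/18.]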

Page 29: Gene tree discordance and  multi-species coalescent models

• 3 species – no anomalous gene trees.

• 4 species – asymmetric but not symmetric species trees have AGTs.

• 5 or more species?

Probability of the concordant gene tree

Probability of a particular discordant gene tree

Page 30: Gene tree discordance and  multi-species coalescent models

With 5 or more species, any species tree topology produces at least one anomalous gene tree.

A labeled topology for n taxa is n-maximally probable if its probability under random branching is greater than or equal to that of any other labeled topology with n taxa.

[Figure: example labeled topologies with 4, 5, and 6 taxa]

Proof:

For n > 4, suppose a species tree topology is not n-maximally probable.

If its branches are short enough, it produces AGTs that are n-maximally probable.

Page 31: Gene tree discordance and  multi-species coalescent models

With 5 or more species, any species tree topology produces at least one anomalous gene tree.

Proof (continued):

Suppose a species tree topology is n-maximally probable.

For n > 8 an inductive argument reduces the problem to the case of n=5, 6, 7, or 8.

For n=5, 6, 7, or 8 taxa it remains to show that the n-maximally probable species tree topologies produce AGTs.

Page 32: Gene tree discordance and  multi-species coalescent models

With 5 or more species, any species tree topology produces at least one anomalous gene tree.

Proof (continued):

For n=5 the n-maximally probable species tree topology produces AGTs.

Page 33: Gene tree discordance and  multi-species coalescent models

With 5 or more species, any species tree topology produces at least one anomalous gene tree.

Proof (continued):

For n=5, 6, 7, or 8 the n-maximally probable species tree topologies produce AGTs.

Page 34: Gene tree discordance and  multi-species coalescent models

With 5 or more species, any species tree topology produces at least one anomalous gene tree.

Proof (continued):

For n > 8 one of the two most basal subtrees has between 5 and n-1 taxa inclusive.


Choose branch lengths to produce an AGT for that subtree, and make them long for the other subtree.

An inductive argument for n > 8 reduces the problem to the case of n=5, 6, 7, or 8.

Page 35: Gene tree discordance and  multi-species coalescent models

With 5 or more species, any species tree topology produces at least one anomalous gene tree.

Proof (summary):

If the species tree topology is not n-maximally probable, it has maximally probable AGTs.

By example, n-maximally probable species tree topologies produce AGTs for n=5, 6, 7, or 8.

For n > 8, induction reduces the problem to the case of n=5, 6, 7, or 8.

This completes the proof.

Page 36: Gene tree discordance and  multi-species coalescent models

Some properties of anomalous gene trees

Page 37: Gene tree discordance and  multi-species coalescent models

Anomalous gene trees can have the same unlabeled shape as the species tree

[Figure: species tree with leaf order A B C D E; gene tree with leaf order D E C A B]

Page 38: Gene tree discordance and  multi-species coalescent models

There exist mutually anomalous sets of tree topologies (“wicked forests”).

Page 39: Gene tree discordance and  multi-species coalescent models

AGTs can occur if some but not all species tree branches are short

[Figure: species tree with branch lengths T2, T3, and T4]

Page 40: Gene tree discordance and  multi-species coalescent models

Does the severity of AGTs increase with more taxa?

[Figure: plot with axes T2 (units of N generations) and T3]

Maximal value for shared branch length that still produces AGTs: 0.1568

Page 41: Gene tree discordance and  multi-species coalescent models

Does the severity of AGTs increase with more taxa?

Page 42: Gene tree discordance and  multi-species coalescent models

Number of AGTs for the 4-taxon asymmetric species tree

Page 43: Gene tree discordance and  multi-species coalescent models

Number of AGTs for 5-taxon species trees

Page 44: Gene tree discordance and  multi-species coalescent models

Does the number of AGTs increase with more taxa?

Page 45: Gene tree discordance and  multi-species coalescent models

What implications do gene tree probabilities have for phylogenetic inference algorithms?

Page 46: Gene tree discordance and  multi-species coalescent models

• Most commonly observed gene tree topology

Statistically inconsistent in estimating the species tree

[Figure: species tree on A, B, C, D with branch lengths T2 and T3; plot with axes T2 (units of N generations) and T3; species tree and estimated species tree on A, B, C, D shown side by side]

Page 47: Gene tree discordance and  multi-species coalescent models

• Estimated gene tree of concatenated sequence

Statistically inconsistent in estimating the species tree

Page 48: Gene tree discordance and  multi-species coalescent models

• Maximum likelihood based on the frequency distribution of gene tree topologies

Statistically consistent even when anomalous gene trees exist

Gene tree frequency distribution

Species tree: (((AB)C)D)

Gene tree      Probability
((AB)(CD))     0.132   (anomalous gene tree)
((AC)(BD))     0.094
((AD)(BC))     0.094
(((AB)C)D)     0.125   (matching gene tree)
(((AB)D)C)     0.100
(((AC)B)D)     0.070
(((AC)D)B)     0.062
(((AD)B)C)     0.032
(((AD)C)B)     0.032
(((BC)A)D)     0.070
(((BC)D)A)     0.062
(((BD)A)C)     0.032
(((BD)C)A)     0.032
(((CD)A)B)     0.032
(((CD)B)A)     0.032

Page 49: Gene tree discordance and  multi-species coalescent models

• Consensus among gene tree topologies

- Majority rule consensus

- Greedy consensus

- Rooted triple consensus (R*)
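The counting step behind a rooted triple consensus can be sketched as follows. This example only extracts rooted triples from gene trees and picks the most frequent resolution of each 3-taxon set; it does not build the final R* tree. The nested-tuple tree encoding, the function names, and the tie-breaking rule (first resolution encountered wins) are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def leaves(tree):
    """Leaf labels of a binary tree given as a str leaf or a (left, right) pair."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaves(left) | leaves(right)

def rooted_triples(tree):
    """Set of (frozenset({a, b}), c) pairs, each meaning the triple ab|c."""
    if isinstance(tree, str):
        return set()
    left, right = tree
    out = rooted_triples(left) | rooted_triples(right)
    # Two leaves on one side of this node and one on the other give ab|c.
    for inside, outside in ((left, right), (right, left)):
        for a, b in combinations(sorted(leaves(inside)), 2):
            for c in leaves(outside):
                out.add((frozenset({a, b}), c))
    return out

def most_common_triples(gene_trees):
    """For each 3-taxon set, the resolution seen in the most gene trees."""
    counts = Counter(t for g in gene_trees for t in rooted_triples(g))
    best = {}
    for (pair, c), n in counts.items():
        key = pair | {c}
        if key not in best or n > best[key][1]:
            best[key] = ((pair, c), n)
    return {resolution for resolution, _ in best.values()}

# Hypothetical sample: the symmetric topology is the single most common gene tree.
sample = (
    [((("A", "B"), "C"), "D")] * 3
    + [(("A", "B"), ("C", "D"))] * 4
    + [((("A", "C"), "B"), "D")] * 2
)
```

In this made-up sample the most frequent gene tree is ((AB)(CD)), yet `most_common_triples(sample)` equals the triple set of (((AB)C)D), which illustrates how triple-based summaries can differ from picking the most common topology.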

Page 50: Gene tree discordance and  multi-species coalescent models

• Tree obtained by agglomeration using minimum pairwise coalescence times across a large number of loci (“Glass tree”)
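A minimal sketch of this construction follows, assuming the agglomeration is single-linkage clustering on the minimum pairwise coalescence time across loci; the transcript does not specify the linkage rule. The input format (one dictionary of pairwise coalescence times per locus), the function name `glass_tree`, and the example times are all hypothetical.

```python
def glass_tree(taxa, loci):
    """Cluster taxa by minimum pairwise coalescence time across loci.

    `loci` is a list of dicts mapping frozenset({x, y}) to the coalescence
    time of x and y at that locus. Returns a nested frozenset topology.
    """
    # Minimum coalescence time across loci for each pair of taxa.
    min_time = {}
    for locus in loci:
        for pair, t in locus.items():
            min_time[pair] = min(t, min_time.get(pair, float("inf")))

    # Map each cluster's leaf set to the subtree built so far.
    clusters = {frozenset([x]): x for x in taxa}

    def distance(c1, c2):
        # Single linkage: smallest minimum time between members (an assumption).
        return min(min_time[frozenset({x, y})] for x in c1 for y in c2)

    while len(clusters) > 1:
        keys = list(clusters)
        c1, c2 = min(
            ((a, b) for i, a in enumerate(keys) for b in keys[i + 1:]),
            key=lambda pair: distance(*pair),
        )
        clusters[c1 | c2] = frozenset({clusters.pop(c1), clusters.pop(c2)})
    return next(iter(clusters.values()))

def pair(x, y):
    return frozenset({x, y})

# Hypothetical times: locus 1 has gene tree ((AB)C), locus 2 has ((BC)A).
example_loci = [
    {pair("A", "B"): 0.7, pair("A", "C"): 1.9, pair("B", "C"): 1.9},
    {pair("A", "B"): 2.4, pair("A", "C"): 2.4, pair("B", "C"): 1.6},
]
example_tree = glass_tree(["A", "B", "C"], example_loci)
```

In this example the second locus is discordant, but the minimum across loci still joins A and B first, so `example_tree` is the topology ((AB)C).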

Page 51: Gene tree discordance and  multi-species coalescent models

Summary

There exist algorithms for computing gene tree probabilities on species trees

The number of coalescent histories increases quickly - algorithmic improvements in gene tree probability computations are likely possible

A species tree can disagree with the gene tree that it is most likely to produce

This severe discordance only gets worse with more taxa

HOWEVER, some algorithms can infer the correct species tree even when gene tree discordance is extreme

Page 52: Gene tree discordance and  multi-species coalescent models

Acknowledgments

David Bryant
Mike DeGiorgio
James Degnan
Randa Tao

National Science Foundation DEB-0716904