csci6904 genomics and biological computing phylogenetics

CSCI6904

Genomics and Biological Computing

Phylogenetics

Phylogeny

A non-biological example

Howe, CJ., Barbrook, AC., Spencer, M., Robinson, P., Bordalejo, B. and Mooney, LR., 2001, Manuscript Evolution, Trends in Genetics, 17(3), 147-152

Phylogeny

An analogy that works well…

Genome                   Manuscript
DNA polymerase           Scribes
Mutations                Transcription errors/alterations
DNA sequences            Extant manuscripts
Deletion/Insertion
Rearrangements
Lateral gene transfer    Using part of a second template to enhance a copy of the manuscript
Selective pressure       Politics, aesthetics, linguistics
Gene history (tree)      Manuscript history (tree)

Phylogeny

An analogy that works well…

Thanks to Gutenberg and his invention of the printing press, the rate at which manuscripts evolve has decreased by many orders of magnitude (to nearly 0, actually).

Raw data

The encoding of the data has to be done in a slightly different manner as it is preferable to treat words as characters. Consequently, the alphabet is of an un-manageable size.

Phylogeny

The data is collected from extant manuscripts:

Phylogeny

… and aligned so all characters are homologous:

Phylogeny

What can be discovered:

Which manuscript is the closest to the original draft?

Are all known manuscripts found in, say, Belgium descendants of a single copy of the manuscript?

What would be the most likely text in the (long lost) original version?

What can be discovered:

Whatever happened to the first chapter?

In the case to the left, there is evidence from “phylogenetic” analysis that the first half of the prologue of the manuscript El was taken from a different source than for the rest of the text.

In genomics, if a gene gets misplaced in a tree, it may indicate that the gene was acquired by transfer rather than heredity.

There is evidence that the transfer of a single gene transformed the ancestor of Yersinia pestis from a relatively benign bacterium into the agent of the “Black Death”.

“The study, published in the April 26 issue of Science, shows that an enzyme called phospholipase D (PLD), previously known as Yersinia murine toxin, allows Y pestis to survive in the midgut of the rat flea. By acquiring the gene that encodes PLD, "the bacterium gradually changed from a germ that causes a mild human stomach illness acquired via contaminated food or water to the flea-borne agent of the 'Black Death,' which in the 14th century killed one-fourth of Europe's population," the NIH said in a news release.”

Hinnebusch BJ, Rudolph AE, Cherepanov P, et al. Role of Yersinia murine toxin in survival of Yersinia pestis in the midgut of the flea vector. Science 2002;296(5568):733-5

Phylogeny

Strategies

Discrete character approaches

Parsimonious criterion

Model likelihood criterion

Hypothesis likelihood criterion

Distance-based clustering

Least-square

Neighbor-Joining / UPGMA (Implicit topology)

Minimum Evolution

UPGMA’s shortcoming

“Molecular Clock” assumption can be rejected in most cases.

In this example, un-equally evolving sequences are clustering according to their rate of evolution rather than according to the history of the genes.

Neighbor-Joining algorithm

Guaranteed to recover the true tree if the distance matrix is an exact reflection of the tree.

How realistic is it to assume that these distances are behaving as such?

Triangle inequality

Rarely respected. Especially if any of D(A,B), D(B,C) are large.

The reason: Saturation.

Distance metrics between sequences

$D_{AC} \le D_{AB} + D_{BC}$

What is saturation?

Time:        1 2 3 4 5 6 7
Lineage 1:   A - - - - - P
Lineage 2:   A F A H K H P

In both cases, if only time steps 1 and 7 are known, the most likely distance will be the same.

Saturation is theoretically expected

Maximum likelihood distances

The following describes the evaluation of distances using the maximum likelihood criterion.

This is the best method to evaluate distances between biological sequences.

A G

C T

Jukes-Cantor Model

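For nucleotides, there are a limited number of possible substitutions.

Matrix with 1 expected substitution per 100 sites. Each row must sum to 1, so each off-diagonal entry is 0.01/3, or about 0.0033:

      A        G        T        C
A   0.99
G   0.0033   0.99
T   0.0033   0.0033   0.99
C   0.0033   0.0033   0.0033   0.99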

Jukes-Cantor Model

For nucleotides there are a limited number of substitutions

Given two (short) sequences

C C A T

C C G T

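With $P^{(1)}$ the one-step substitution matrix given above and character frequencies $\pi = (\pi_A, \pi_G, \pi_C, \pi_T)$, the likelihood that these two sequences are related is then:

$$L(1) = \pi_C P^{(1)}_{C \to C} \cdot \pi_C P^{(1)}_{C \to C} \cdot \pi_A P^{(1)}_{A \to G} \cdot \pi_T P^{(1)}_{T \to T}$$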

Jukes-Cantor Model




What if the distance implied by P1 is not realistic/representative?


Extrapolation of probability matrices.

As we have seen for the PAM matrix a few weeks back.

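We can obtain a $p_{ij}$ for any multiple of PAM1 by matrix multiplication. Each entry of $M^2$ is the product of row $i$ and column $j$ of $M$:

$$M^{2}_{i,j} = \begin{pmatrix} M_{i,1} & M_{i,2} & M_{i,3} & M_{i,4} \end{pmatrix} \begin{pmatrix} M_{1,j} \\ M_{2,j} \\ M_{3,j} \\ M_{4,j} \end{pmatrix} = \sum_{k=1}^{4} M_{i,k} M_{k,j}$$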


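There will then be a different probability associated with each possible distance:

$$L(1) = \pi_C P^{(1)}_{C \to C} \cdot \pi_C P^{(1)}_{C \to C} \cdot \pi_A P^{(1)}_{A \to G} \cdot \pi_T P^{(1)}_{T \to T}$$

$$L(2) = \pi_C P^{(2)}_{C \to C} \cdot \pi_C P^{(2)}_{C \to C} \cdot \pi_A P^{(2)}_{A \to G} \cdot \pi_T P^{(2)}_{T \to T}$$

$$L(l) = \pi_C P^{(l)}_{C \to C} \cdot \pi_C P^{(l)}_{C \to C} \cdot \pi_A P^{(l)}_{A \to G} \cdot \pi_T P^{(l)}_{T \to T}$$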


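The probability is a function of the distance between two sequences.

There is thus a value of the distance ($l$) that maximizes the probability of observing two related sequences. In other words, there is a branch length that maximizes the likelihood that two sequences are related.

For branch length $l$ over $k$ sites, with $x_i$ and $x'_i$ the characters observed at site $i$ in the two sequences:

$$L(l) = \pi_C P^{(l)}_{C \to C} \cdot \pi_C P^{(l)}_{C \to C} \cdot \pi_A P^{(l)}_{A \to G} \cdot \pi_T P^{(l)}_{T \to T} \cdots \pi_{x_k} P^{(l)}_{x_k \to x'_k} = \prod_{i=1}^{k} \pi_{x_i} P^{(l)}_{x_i \to x'_i}$$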
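The search for the distance that maximizes L(l) can be sketched numerically. The block below is a minimal illustration, not the course's implementation: it assumes the Jukes-Cantor model with equal frequencies (1/4 each), uses the closed-form Jukes-Cantor transition probabilities rather than repeated matrix multiplication, and the names `jc_prob`, `log_likelihood` and `ml_distance` are hypothetical. The golden-section search is one convenient optimizer among many.

```python
import math

def jc_prob(a, b, l):
    """Jukes-Cantor probability of character b given a, after l expected substitutions per site."""
    e = math.exp(-4.0 * l / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def log_likelihood(seq1, seq2, l):
    """ln L(l): sum over sites of ln(pi_x * P^l(x -> x')), assuming pi_x = 1/4."""
    return sum(math.log(0.25 * jc_prob(a, b, l)) for a, b in zip(seq1, seq2))

def ml_distance(seq1, seq2, lo=1e-6, hi=10.0, iters=200):
    """Golden-section search for the l that maximizes ln L(l) on [lo, hi]."""
    phi = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        m1 = hi - phi * (hi - lo)
        m2 = lo + phi * (hi - lo)
        if log_likelihood(seq1, seq2, m1) < log_likelihood(seq1, seq2, m2):
            lo = m1  # the maximum lies to the right of m1
        else:
            hi = m2  # the maximum lies to the left of m2
    return (lo + hi) / 2

seq_a, seq_b = "CCAT", "CCGT"
l_hat = ml_distance(seq_a, seq_b)

# Closed-form Jukes-Cantor distance from the proportion p of differing sites.
p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
closed_form = -0.75 * math.log(1 - 4 * p / 3)
```

For the two short sequences of the slides, the numerical optimum and the closed-form Jukes-Cantor distance should agree, which is a useful sanity check on the likelihood function.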

Arbitrary P matrices from

By analogy with scalars:

$5^4 = e^{4 \log(5)}$

$P^t = e^{t \log(P)}$

$Q = \log(P)$

$P^t = e^{Qt}$

Q is the log(P) matrix for an arbitrary unit of distance.
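A small sketch of how a P matrix for an arbitrary distance can be obtained from Q. It assumes a Jukes-Cantor rate matrix scaled to one expected substitution per unit of t, and it evaluates the matrix exponential with a truncated Taylor series, which is adequate for short distances but is not how production software does it. The helper names `mat_mul` and `mat_exp` are hypothetical.

```python
import math

def mat_mul(a, b):
    """Plain matrix product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_exp(q, t, terms=40):
    """Truncated Taylor series: e^{Qt} = sum over m of (Qt)^m / m!."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # current term, starts at the identity
    for m in range(1, terms):
        term = [[x / m for x in row] for row in mat_mul(term, qt)]
        for i in range(n):
            for j in range(n):
                result[i][j] += term[i][j]
    return result

# Assumed Jukes-Cantor rate matrix: rows sum to 0, total rate of change is 1.
jc_q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]

p_001 = mat_exp(jc_q, 0.01)          # 1 expected substitution per 100 sites
p_002 = mat_exp(jc_q, 0.02)          # twice that distance
p_squared = mat_mul(p_001, p_001)    # same matrix obtained by multiplication
```

Doubling the distance in the exponent and squaring the one-step matrix give the same result, which ties the exponential form back to the matrix multiplication used for PAM matrices.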

In practice, the model can be custom built for an input dataset:

$Q = R\,\Pi$

$\Pi$: vector of frequencies for each character (can be estimated from the input dataset).

$R$: a matrix of relative rates of substitution, either derived from a large amount of empirical data (PAM, JTT) or optimized (WAG).


Now, imagine that two sequences are unrelated.

• The real branch length ($t$) is equal to $+\infty$.
• The branch length estimate will necessarily converge to a smaller value, because some sites are identical by coincidence.

Even random sequences are going to have “matches”, although the likelihood distance should tend to large values in this case.

Saturation should be compensated for in ML distances.

However, because of:

• Non-homogeneous frequencies
• Rate heterogeneity
• Change in the P matrix over time
• Non-independence of characters in a sequence

long distances are still somewhat contentious to evaluate.

Time reversibility is also assumed

Time:        1 2 3 4 5 6 7
Lineage 1:   A A A P P P P
Lineage 2:   A A A H K H P

Without time reversibility assumed, it would be impossible to measure a distance between two sequences without involving an undefined bifurcation.

Time reversibility is also assumed

In practice, this means that the entries in our substitution matrices have to be symmetrical, such that:

$$P(a \mid b, l) = P(b \mid a, l)$$

This is also practical from a bioinformatics perspective, since it cuts the number of parameters in the model in half.

Another distance-based method that intuitively makes sense

Least Square method

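$$U = \sum_{i}^{n} \sum_{j}^{n} w_{ij} \left( D_{ij} - d_{ij} \right)^2$$

$D_{ij}$: distance matrix entry.

$d_{ij}$: sum of all $t$ along the path from $i$ to $j$.

$w_{ij}$: weight, for example:

$$w_{ij} = 1, \quad \frac{1}{D_{ij}}, \quad \frac{1}{D_{ij}^2}, \quad \ldots$$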

One last distance-based method that we would intuitively use

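Once abstracted: we are looking for an acyclic, binary graph with $n$ terminal vertices that conforms best to a set of $n^2$ constraints.

[Figure: tree with terminal nodes i, o, f, j, x; terminal branches $t_1$ to $t_5$; internal branches $t_{12}$, $t_{45}$, $t_{345}$]

$$U = \sum_{i}^{n} \sum_{j}^{n} w_{ij} \left( D_{ij} - d_{ij} \right)^2$$

For the path from terminal node i to terminal node j:

$$d_{ij} = t_1 + t_{12} + t_{345} + t_{45} + t_4$$

There is a danger of “time traveling”, with some $t_k < 0$.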


Once abstracted: although there are $n$ terminal nodes, there will be $2n-1$ total nodes in the tree/graph (rooted tree).

[Figure: the same tree with terminal nodes i, o, f, j, x, each branch marked as in the path or not in the path between terminal nodes i and j]

$$U = \sum_{i}^{n} \sum_{j}^{n} w_{ij} \left( D_{ij} - \sum_{k} x_{ij,k}\, t_k \right)^2$$

$$x_{ij,k} = \begin{cases} 1 & \text{if branch } k \text{ is in the path from } i \text{ to } j \\ 0 & \text{if it is not in the path} \end{cases}$$


There is a straightforward solution to this linear algebra problem.
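One way to see the linear algebra is to write the path indicators x as a matrix with one row per pair of sequences and one column per branch, then solve the normal equations. The sketch below is an assumption-laden illustration: it uses unit weights (w = 1), a fixed four-taxon unrooted topology ((A,B),(C,D)) with five branches, made-up distances, and a small hand-written Gaussian elimination. The names `solve` and `least_squares_branches` are hypothetical.

```python
def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system a.x = b."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def least_squares_branches(x_rows, d):
    """Minimize sum_ij (D_ij - sum_k x_ij,k t_k)^2 by solving (X^T X) t = X^T D."""
    k = len(x_rows[0])
    xtx = [[sum(row[i] * row[j] for row in x_rows) for j in range(k)] for i in range(k)]
    xtd = [sum(row[i] * dij for row, dij in zip(x_rows, d)) for i in range(k)]
    return solve(xtx, xtd)

# Columns: t_A, t_B, t_C, t_D, t_internal. Rows: pairs AB, AC, AD, BC, BD, CD.
x_rows = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]
# Illustrative distances generated from branch lengths 0.1, 0.2, 0.3, 0.4, 0.5.
distances = [0.3, 0.9, 1.0, 1.0, 1.1, 0.7]
branches = least_squares_branches(x_rows, distances)
```

With distances that fit the topology exactly, the recovered branch lengths match the generating ones. With noisy distances or a wrong topology, nothing in this unconstrained solution prevents a negative branch length.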



Minimum Evolution

Can be used as a selection criterion between least-squares tree topologies.

This is done by selecting, amongst a collection of suitable topologies, the one that minimizes:

$$\sum_{k \,\in\, \text{all edges}} t_k$$

Tree space

Unlike UPGMA and NJ, the problem with the previous method is that you have to provide a topology prior to the calculation.

Phylogeny

Strategies

Discrete character approaches

Parsimonious criterion

Model likelihood criterion

Bayesian statistics

Distance-based clustering

Least-square

Neighbor-Joining / UPGMA (Implicit topology)

Minimum Evolution

Phylogeny

Discrete-character signal versus distance

Distance : Use the characters and a function to evaluate distance metrics. These are used to determine the length of the branch/edges between nodes/vertices. These internal nodes/edges are simply there to maximally reconcile the distance data into a binary tree.

Character : Use discrete characters implicitly or explicitly to define the state of each node.

Parsimony

Intuitive method that can be run manually

Assumes that everything observed in the data is connected by the most straightforward relationships.

Parsimony

Algorithm: postorder tree traversal, from the terminal nodes toward the “center”.

At each node:

1. Create the intersection of the sets of observations in the immediate descendant nodes.

2. If the intersection is empty, create a set that is the union of the two descendants' sets and add one to the count of changes recorded.

Parsimony

Algorithm

The most parsimonious tree will be the topology which will minimize the number of changes to explain the data over all sites (columns).
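The two steps of the algorithm can be sketched as a short recursive function. This is an illustration under simple assumptions: the tree is a nested tuple of leaf names, every change has the same cost, and the alignment is a dictionary of equal-length strings. The function names and the example sequences are hypothetical.

```python
def fitch(tree, states):
    """Postorder pass at one site. Returns (set of candidate states, number of changes)."""
    if isinstance(tree, str):  # terminal node: the observed character
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:  # step 1: non-empty intersection, no change recorded
        return common, left_cost + right_cost
    # step 2: empty intersection, take the union and record one change
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Total number of changes over all sites (columns) for one topology."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )

alignment = {"s1": "AC", "s2": "CC", "s3": "CC", "s4": "CC", "s5": "GC"}
tree = (("s1", "s2"), ("s3", ("s4", "s5")))
score = parsimony_score(tree, alignment)
```

Comparing `parsimony_score` across candidate topologies and keeping the smallest value is the selection step described above.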

Statistics

Consistency index:

$$CI = \frac{C_{min}}{C}$$

Retention index:

$$RI = \frac{C_{max} - C}{C_{max} - C_{min}}$$

Parsimony

Side-effects

The reconstruction is assuming that the most parsimonious explanation is the correct one.

It also assumes that all changes have a similar “cost”.

Therefore, the parsimony method does not seem to be designed to deal with saturation.

Maximum likelihood criterion

Abstraction: we have a collection of items (sequences). We know that all the instances in the collection are stochastically derived from a unique parent in the hierarchy. We also have a model for this stochastic process, represented as a Markov process.

We are thus looking for a tree (topology+distances) that will maximize the likelihood of the data, given the Markov process.

Jukes-Cantor Model






There will be an optimal distance between two sequences.

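$$L(1) = \pi_C P^{(1)}_{C \to C} \cdot \pi_C P^{(1)}_{C \to C} \cdot \pi_A P^{(1)}_{A \to G} \cdot \pi_T P^{(1)}_{T \to T}$$

$$L(2) = \pi_C P^{(2)}_{C \to C} \cdot \pi_C P^{(2)}_{C \to C} \cdot \pi_A P^{(2)}_{A \to G} \cdot \pi_T P^{(2)}_{T \to T}$$

$$L(l) = \pi_C P^{(l)}_{C \to C} \cdot \pi_C P^{(l)}_{C \to C} \cdot \pi_A P^{(l)}_{A \to G} \cdot \pi_T P^{(l)}_{T \to T}$$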

Distance to an internal node

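There will be an optimal distance between two sequences:

$$L(t) = \pi_C P^{(t)}_{C \to C} \cdot \pi_C P^{(t)}_{C \to C} \cdot \pi_A P^{(t)}_{A \to G} \cdot \pi_T P^{(t)}_{T \to T}$$

If the sequence of only one of the nodes is known, the other end could be any possible character:

$$L(t) = \sum_x \pi_C P^{(t)}_{C \to x} \cdot \sum_x \pi_C P^{(t)}_{C \to x} \cdot \sum_x \pi_A P^{(t)}_{A \to x} \cdot \sum_x \pi_G P^{(t)}_{G \to x}$$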

Model based phylogeny

It is possible to compute likelihood of internal nodes by summing over all possibilities.

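$$L = \sum_x \sum_y \sum_z \sum_w P(x)\, P(y|x,t_6)\, P(A|y,t_1)\, P(C|y,t_2)\, P(z|x,t_8)\, P(C|z,t_3)\, P(w|z,t_7)\, P(C|w,t_4)\, P(G|w,t_5)$$

[Figure: rooted tree ((A,C),(C,(C,G))) with root x and internal nodes y, z, w. Branches t1 and t2 lead from y to A and C; t3 from z to C; t4 and t5 from w to C and G; t6 from x to y; t7 from z to w; t8 from x to z]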


The structure of the equation, once the summations are pushed as far right as possible, is the same as the structure of the tree:

$$L = \sum_x P(x) \left( \sum_y P(y|x,t_6)\, P(A|y,t_1)\, P(C|y,t_2) \right) \left( \sum_z P(z|x,t_8)\, P(C|z,t_3) \left( \sum_w P(w|z,t_7)\, P(C|w,t_4)\, P(G|w,t_5) \right) \right)$$

$$((A, C), (C, (C, G)))$$


The calculation at one node thus depends on the conditional likelihood of each possible character $s$ in the child nodes. For node $w$ at site $i$:

$$L^{(i)}_w(s) = P(C|s,t_4)\, P(G|s,t_5)$$

$$L^{(i)}_w = \left( L^{(i)}_w(s_1), L^{(i)}_w(s_2), \ldots, L^{(i)}_w(s_n) \right)$$


For terminal nodes (1 for the observed character, 0 for all others):

$$L^{(i)} = (0, 1, 0, 0, 0, 0, 0, 0, \ldots, 0)$$

For internal nodes:

$$L^{(i)}_y(s) = \prod_{child} \sum_{a} P(a|s,t_{child})\, L^{(i)}_{child}(a)$$

This is done for each site $i$.

The log(L) are stored rather than L.

For the innermost node (the root $x$):

$$L^{(i)} = \sum_{a} \pi_a\, L^{(i)}_x(a)$$

For a tree:

$$L = \prod_{i} L^{(i)}$$
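The recursion over terminal, internal and innermost nodes can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the Jukes-Cantor model with equal frequencies, represents an internal node as a list of (child, branch length) pairs, and uses arbitrary branch lengths for the five-leaf tree of the slides. All names are hypothetical.

```python
import math

ALPHABET = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability P(b | a, t)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def conditional(node, site):
    """Vector of conditional likelihoods L_node(s), one entry per character s."""
    if isinstance(node, str):  # terminal node: 1 for the observed character
        return [1.0 if s == site[node] else 0.0 for s in ALPHABET]
    vec = [1.0] * len(ALPHABET)
    for child, t in node:  # product over children of a sum over child states
        child_vec = conditional(child, site)
        for i, s in enumerate(ALPHABET):
            vec[i] *= sum(jc_prob(s, a, t) * child_vec[j] for j, a in enumerate(ALPHABET))
    return vec

def site_likelihood(root, site):
    """Innermost node: weight each root state by its frequency (1/4 here)."""
    return sum(0.25 * value for value in conditional(root, site))

def log_likelihood(root, alignment):
    """ln L for the tree: sum of the log site likelihoods."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        math.log(site_likelihood(root, {k: v[i] for k, v in alignment.items()}))
        for i in range(n_sites)
    )

# Tree ((s1,s2),(s3,(s4,s5))) with illustrative branch lengths t1..t8.
w = [("s4", 0.4), ("s5", 0.5)]
z = [("s3", 0.3), (w, 0.7)]
y = [("s1", 0.1), ("s2", 0.2)]
x = [(y, 0.6), (z, 0.8)]
site = {"s1": "A", "s2": "C", "s3": "C", "s4": "C", "s5": "G"}
root_likelihood = site_likelihood(x, site)
```

Storing one vector per node is what keeps the cost linear in the number of nodes, instead of summing over every combination of internal states.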

Tree Space

… and an epic complexity in search space

$$T(n) = 3 \times 5 \times 7 \times 9 \times 11 \times \cdots \times (2n-3)$$

$$T(n) = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$

Where n is of interest in the 20-100 range.
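To get a feel for the size of the search space, the product form of T(n) can be evaluated directly and checked against the closed form for rooted binary trees. The function names below are hypothetical.

```python
import math

def rooted_tree_count(n):
    """T(n) = 3 * 5 * ... * (2n - 3), for n >= 2 terminal nodes."""
    count = 1
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count

def rooted_tree_count_closed(n):
    """Closed form: (2n - 3)! / (2^(n - 2) * (n - 2)!)."""
    return math.factorial(2 * n - 3) // (2 ** (n - 2) * math.factorial(n - 2))

counts = {n: rooted_tree_count(n) for n in (3, 4, 5, 10, 20)}
```

Already at n = 20 the count has more than twenty digits, which is why exhaustive evaluation of all topologies is not an option in the range of interest.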

Tree data structure

What if we don’t constrain the binary connectivity?

Multifurcation is not necessary because any multifurcated tree is a case of a binary tree with at least one internal branch length set to zero.

Introduce one sequence at a time, on the branch that gives the best likelihood.

Exploring the topology space

Incremental method – Stepwise addition

$$T(n) = 3 \times 5 \times 7 \times 9 \times 11 \times \cdots \times (2n-3)$$

This method of building a topology is intrinsically greedy.


Local rearrangement – Nearest Neighbor Interchange

For a tree with four lineages, there are only three possible topologies, one of which is already computed.


Greedy Methods – Nearest Neighbor Interchange

Sensitive to the initial tree, may not recover from some types of error early in the optimization.


Global Methods

Methods that have the potential to sample the entire breadth of the search space in only a few consecutive iterations.

Subtree Pruning Regrafting (SPR)

Tree Bisection and Reconnection (TBR)


Subtree Pruning Regrafting

Search space per step is narrowed down to:

$O(n^2)$


Tree bisection and reconnection

This algorithm randomizes the site of reconnection in both subtrees.

Search space is narrowed down to:

$O(n^3)$

Meaning of all this

Trees

Biologists love them. As we have seen, they attempt to go beyond the logical clustering of data items. Instead, they are used to reconstruct the process under which the data was generated.

For example, it is because of phylogenetic trees that we know that we (modern eukaryotes) originate from the symbiosis between a bacterium (an alpha-proteobacterium, the ancestor of mitochondria) and an elusive ancestral cell.

The likelihood of a tree is a reflection of the goodness of fit of the alignment to a tree and a model of substitution.

How good is a tree?

Problem: not only do the clusters matter, but each individual internal node contains usable information. How can we ascertain that a node is any good?

There is no reference set that can be used and for which we know the true answer.

Cause of error:

Model misspecification, mixed history within a gene, sampling error, …

Significance of difference in likelihood values.

If the likelihood evaluation depends on a single parameter:

$$2\left( \ln \hat{L}_1 - \ln L_0 \right) \sim \chi^2_1$$

This relationship is however impractical, since the likelihood calculation rarely depends on a single parameter.

An (imperfect) example of this would be if we were interested in evaluating the certainty on a single branch length at a time.

Real trees have a lot more parameters: 2n-3 branch lengths (for an unrooted tree), the rate distribution shape parameter, etc.

Re-sampling using the bootstrap

Getting around the sampling error

Assumption : All the significant signal is present in the data, but the signal’s blend is affected by the size of the sample.

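Given a dataset of $n$ sequences that are $k$ characters long:

$$\begin{pmatrix} a_{11} & \ldots & a_{1i} & \ldots & a_{1k} \\ \ldots & & \ldots & & \ldots \\ a_{n1} & & \ldots & & a_{nk} \end{pmatrix}$$

Create a new dataset by randomly and uniformly choosing site indices “i” (with replacement) until the resampled dataset has a size of $n \times k$.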

Re-sampling using the bootstrap

In practice, this is used to generate a large number of replicates and count the frequency of observing a given internal node.

A high “bootstrap value” means that the node is stable to sampling error.

It does not mean that a given internal node is “real”.
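The resampling step and the counting step can be sketched as below. The tree-building step is deliberately left as a callback: `infer_clades` is a hypothetical function, supplied by the user, that takes an alignment and returns the set of clades (internal nodes) of the tree inferred from it. The alignment and the function names are illustrative only.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample k site indices uniformly, with replacement, and rebuild every sequence."""
    k = len(next(iter(alignment.values())))
    columns = [rng.randrange(k) for _ in range(k)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}

def bootstrap_support(alignment, infer_clades, clade, replicates=100, seed=0):
    """Fraction of replicates whose inferred tree contains the given clade.

    infer_clades is an assumed user-supplied callable: alignment -> set of frozensets.
    """
    rng = random.Random(seed)
    hits = sum(
        clade in infer_clades(bootstrap_replicate(alignment, rng))
        for _ in range(replicates)
    )
    return hits / replicates

alignment = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "AGGTTC"}
replicate = bootstrap_replicate(alignment, random.Random(42))
```

Because whole columns are resampled, each replicate keeps the characters of a site aligned across sequences; only the mix of sites changes.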

Re-sampling using the Jackknife

Getting around the sampling error in the sequence axis

Principle : Randomly delete a small fraction of the data

The term “Jackknife” is also used in cases where trees are reconstructed by randomly deleting whole sequences from the dataset.

Using simulation

This is known as the parametric bootstrap

Principle : Create a distribution of likelihood values generated from simulated datasets

The test is done by evaluating the probability that the “real” likelihood value is part of the distribution of tree likelihood from simulated datasets.

Using simulation

Simulating a multiple sequence alignment from a tree is the reverse problem of inferring a tree from an alignment.

[Figure: rooted tree with root x, internal nodes y, z, w and branch lengths t1 to t8, where the sequences at the terminal nodes are unknown (“?”)]

x = {…}, random, drawn from the character frequencies.

On a per-site basis, the probability vector of each site in the node y can be calculated with:

$$P_{x \to y} = e^{Q t_6}$$

$$P_{y_i} = \left( P_{x_i \to A}, \ldots, P_{x_i \to Y} \right)$$
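A simulation along a tree can be sketched by drawing a root sequence and then, for every branch, drawing each child character from the row of the P matrix that corresponds to the parent character. The sketch below assumes nucleotides under the Jukes-Cantor model (so the root is drawn uniformly and the rows of P have a closed form) and arbitrary branch lengths; the names `jc_row`, `evolve` and `simulate` are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"

def jc_row(a, t):
    """Row of the Jukes-Cantor P matrix: probability of each character given parent a."""
    e = math.exp(-4.0 * t / 3.0)
    return [0.25 + 0.75 * e if b == a else 0.25 - 0.25 * e for b in ALPHABET]

def evolve(seq, t, rng):
    """Draw a descendant sequence, site by site, along a branch of length t."""
    return "".join(rng.choices(ALPHABET, weights=jc_row(a, t))[0] for a in seq)

def simulate(node, node_seq, rng, out):
    """node is a leaf name (str) or a list of (child, branch_length) pairs."""
    if isinstance(node, str):
        out[node] = node_seq  # terminal node: record the simulated sequence
        return
    for child, t in node:
        simulate(child, evolve(node_seq, t, rng), rng, out)

rng = random.Random(7)
root_seq = "".join(rng.choice(ALPHABET) for _ in range(20))  # x: random root sequence
tree = [
    ([("s1", 0.1), ("s2", 0.2)], 0.6),
    ([("s3", 0.3), ([("s4", 0.4), ("s5", 0.5)], 0.7)], 0.8),
]
leaves = {}
simulate(tree, root_seq, rng, leaves)
```

Repeating this many times with the tree and model under test yields the simulated datasets whose likelihoods form the reference distribution.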

Using simulation


This test can be used to evaluate whether the data can be simulated from a given combination of tree and model.

If the test tree is wrong, the distribution obtained from the simulated datasets should not include the “real” data.

If the test tree is the true tree and the model is relevant to what really happened during the evolution of the gene, the likelihood of the “real” data and that of the simulated series should not be statistically different.


The test is expected to be conservative because the simulated datasets are generated and recovered using the same parameters, while the “real” data comes from a true process.

Time consideration

Bootstrapping requires building distributions

This usually means that the long calculation has to be re-run on resampled datasets about 100-1000 times over, all just to harvest a few numbers.

Paired-site tests

Can be used to compare two trees

There are a number of techniques that compare two trees on the basis of their site likelihoods.

Winning site test, z test, t-test, Wilcoxon signed rank test, …

These tests are more appropriate for estimating error bars in the topology dimension of a solution.

Paired-site tests

In our research group, we are using such tests in our optimization strategy:

$$\delta_i = \ln L^{(i)}_{ref} - \ln L^{(i)}_{test}$$

When the differences are positive, ref is better.

It is possible to rapidly eliminate statistically worse trees, without re-sampling, and to treat the solution not as a data point but rather as an area in topology space.
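Two of the paired-site statistics named earlier, the winning-sites count and a z statistic on the summed differences, can be sketched from two lists of site log-likelihoods. This is an illustration only, not the research group's procedure: the input values are invented, and the function name `paired_site_stats` is hypothetical.

```python
import math

def paired_site_stats(ref_site_lnl, test_site_lnl):
    """Return (sum of differences, z statistic, number of sites won by ref)."""
    deltas = [r - t for r, t in zip(ref_site_lnl, test_site_lnl)]
    n = len(deltas)
    mean = sum(deltas) / n
    var = sum((d - mean) ** 2 for d in deltas) / (n - 1)  # sample variance
    z = sum(deltas) / math.sqrt(n * var)  # total difference over its standard error
    wins = sum(d > 0 for d in deltas)
    return sum(deltas), z, wins

# Invented site log-likelihoods for a reference tree and a test tree.
ref_lnl = [-1.0, -2.0, -1.5, -3.0]
test_lnl = [-1.2, -2.1, -1.4, -3.5]
total, z, wins = paired_site_stats(ref_lnl, test_lnl)
```

A test tree whose z statistic is large in favor of the reference can be discarded without any resampling, since the site likelihoods are already available from the likelihood calculation.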

Our research is showing that:

Sums of likelihoods are sensitive to the variance of the poorly modeled sites.

Single-thread searches are not very robust to local minima.

Meaning of all this

Site likelihood

Bioinformaticians love them. They provide information that is not contained in individual sequences (i.e., information that cannot be found no matter how hard one scans a single genome). Further, they contain information on properties that may be impossible to physically observe.

Site likelihoods are a reflection of the goodness of fit of one position in a protein given a solution optimized with all the available data.

Phylogeny allows us to assemble sequences into an informative, time-dependent structure

For the next few slides we will look at how phylogenetic information can be used to detect new signal in sequence information.

Site-wise rate of evolution.

Rate dependent functional shifts.

Rate independent functional shifts.

This framework offers a new source of information for pattern detection and recognition.

Estimating rates amongst sites

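The basic calculation assumed a constant rate. Variable rates can be approximated on a site-per-site basis:

[Figure: rooted tree ((A,C),(C,(C,G))) with internal nodes x, y, z, w and branch lengths t1 to t8]

$$L^{(i)} = \int_0^{\infty} f(r)\, L^{(i)}(r)\, dr$$

$$L^{(i)} \approx \sum_{k=1}^{j} w_k\, L^{(i)}(r_k)$$

where $f(r)$ is the density of the rate distribution, approximated by $j$ discrete rate categories $r_k$ with weights $w_k$. This will be true as long as the mean rate is 1:

$$\sum_k w_k r_k = 1$$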
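The discrete approximation can be sketched on the smallest possible case, a single pair of sequences under Jukes-Cantor, where the site likelihood at rate r is simply the pairwise likelihood evaluated at distance r·t. The three rate categories, their equal weights and all function names are assumptions made for illustration. The posterior mean rate shown at the end is one common way of turning the weighted terms into a per-site rate estimate.

```python
import math

def jc_prob(a, b, t):
    """Jukes-Cantor probability P(b | a, t)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def site_likelihood(a, b, t, rate):
    """L(r) for one site of a sequence pair: the branch length is scaled by the rate."""
    return 0.25 * jc_prob(a, b, t * rate)

def mixed_site_likelihood(a, b, t, rates, weights):
    """L = sum_k w_k L(r_k), valid when the mean rate sum_k w_k r_k is 1."""
    assert abs(sum(w * r for r, w in zip(rates, weights)) - 1.0) < 1e-9
    return sum(w * site_likelihood(a, b, t, r) for r, w in zip(rates, weights))

def posterior_mean_rate(a, b, t, rates, weights):
    """Rate estimate for one site: categories weighted by their share of the likelihood."""
    terms = [w * site_likelihood(a, b, t, r) for r, w in zip(rates, weights)]
    return sum(r * term for r, term in zip(rates, terms)) / sum(terms)

rates = (0.2, 1.0, 1.8)            # assumed rate categories
weights = (1 / 3, 1 / 3, 1 / 3)    # assumed equal weights, mean rate = 1
slow_site = posterior_mean_rate("A", "A", 0.5, rates, weights)  # conserved site
fast_site = posterior_mean_rate("A", "G", 0.5, rates, weights)  # substituted site
```

A conserved site pulls the estimate below the mean rate of 1 and a substituted site pushes it above, which is the per-site signal exploited in the following slides.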

Extracting information from rates estimates

Sequence alignments were first used to identify which positions were “conserved”.

The rationale was that if the same character was conserved across all sequences, it was constrained and played an important role.

We can refer to this as “eyeball bioinformatics”.

This method of predicting function is very sensitive to the source dataset:

Sampling homogeneity

Character similarity

Extracting information from rates estimates

Case 1 Sampling homogeneity

Two alignments of protein sequences for gene X:

Alignment 1: 35 spider sequences.

Alignment 2: 5 spider, 5 mammal, 5 bacterial, 5 fungal, 5 nematode, 5 primate and 5 rice plant sequences.

The conclusion will necessarily be that there are more conserved sites in the spider-only data, while in fact some of the same sequences are present in the second dataset.

Fast Slow

Maximum-Likelihood Site-Rates are Biologically Relevant

Rhodopsin-like G-protein receptors

Pfam (dataset 1Tml_7) 69 taxa

Maximum-Likelihood Site-Rates are Biologically Relevant

Tubulin

34 taxa 33 taxa

The constraints imposed by co-evolution far outweigh the structural constraints.

Fast Slow

What can be done with rate of evolution

Predict functionally important regions in proteins

Example : We know that the product of gene G binds a drug, but the mechanism is unknown. Using site rate estimates, a patch of slowly evolving sites is detected at the surface of the protein’s structure. This is potentially a good place to investigate further.

Why bother with computational methods?

Time to gather data:

Sequence << Structure << Biological activity

Computational methods are best used trying to fill the gap between genomic data and the real world.

What can be done with rate of evolution

The technique gains power when used in a comparative strategy.

Often, an uncharacterized gene will have a related protein that is already well known. It is possible to compare the two datasets of sequences in a 3D context to predict the presence or absence of a function.

The computational techniques to do this are usually based on site likelihood methods.

Comparing rates as scalar values has no statistical meaning. There are many different schemes; many exploit a variant of the site likelihood ratio statistic for two aligned datasets of sequences a and b:

$$LR^{(i)} = \frac{L^{(i)}_{r_a}\, L^{(i)}_{r_b}}{L^{(i)}_{r_{ab}}}$$

Inferring Function in Homologs of eF1

Evolutionary Patterns in Elongation Factors and paralogs

Datasets: eF1, 34 taxa; (a)eF1, 33 taxa; HBS1, 10 taxa; eRF3, 20 taxa.

Pairwise comparison using bivariate site rate estimation

[Figure: pairwise site-rate comparisons among eRF3, eF1, (a)eF1 and HBS1]


eRF3: Eukaryotic Release Factor

Loss of eF1–analog interface

Loss of constraint in 1’ loop

Interacts with eRF1 (a tRNA mimic)

[Figure: differently evolving sites mapped on eF1; legend: slow in eRF3, slow in eukaryotes]

HBS1: Unknown Function

Loss of eF1–analog interface

Most likely no tRNA binding

[Figure: differently evolving sites mapped on eF1; legend: slow in HBS1, slow in eukaryotes]


Phylogenetics and bioinformatics

Phylogenetics existed long before bioinformatics

It comes from mathematics, statistics and genetics circles.

Phylogenetics is very relevant to bioinformatics

It captures a dimension of the data that is not visible from a collection of sequences.

Phylogenetics is becoming an increasingly central theme in sequence analyses.
