CSCI6904
Genomics and Biological Computing
Phylogenetics
Phylogeny
A non-biological example
Howe, CJ., Barbrook, AC., Spencer, M., Robinson, P., Bordalejo, B. and Mooney, LR., 2001, Manuscript Evolution, Trends in Genetics, 17(3), 147-152
Phylogeny
An analogy that works well…
Genome | Manuscript
DNA polymerase | Scribes
Mutations | Transcription errors/alterations
DNA sequences | Extant manuscripts
Deletion/insertion | (same)
Rearrangements | (same)
Lateral gene transfer | Using part of a second template to enhance a copy of the manuscript
Selective pressure | Politics, aesthetics, linguistics
Gene history (tree) | Manuscript history (tree)
Phylogeny
An analogy that works well…
Thanks to Gutenberg and his invention of the printing press, the rate at which manuscripts evolve has decreased by many orders of magnitude (to nearly 0, actually).
Raw data
The data has to be encoded in a slightly different manner, as it is preferable to treat words as characters. Consequently, the alphabet is of an unmanageable size.
Phylogeny
The data is collected from extant manuscripts:
Phylogeny
… and aligned so all characters are homologous:
Phylogeny
What can be discovered:
Which manuscript is the closest to the original draft?
Are all known manuscripts found in, say, Belgium descendants of a single copy of the manuscript?
What would be the most likely text in the (long lost) original version?
What can be discovered:
Whatever happened to the first chapter?
In the case to the left, there is evidence from “phylogenetic” analysis that the first half of the prologue of the manuscript El was taken from a different source than for the rest of the text.
In genomics, if a gene gets misplaced in a tree, it may indicate that the gene was acquired by transfer rather than heredity.
There is evidence that the acquisition of a single gene helped transform a relatively benign bacterium into Yersinia pestis, the agent of the “Black Death”.
“The study, published in the April 26 issue of Science, shows that an enzyme called phospholipase D (PLD), previously known as Yersinia murine toxin, allows Y pestis to survive in the midgut of the rat flea. By acquiring the gene that encodes PLD, "the bacterium gradually changed from a germ that causes a mild human stomach illness acquired via contaminated food or water to the flea-borne agent of the 'Black Death,' which in the 14th century killed one-fourth of Europe's population," the NIH said in a news release.”
Hinnebusch BJ, Rudolph AE, Cherepanov P, et al. Role of Yersinia murine toxin in survival of Yersinia pestis in the midgut of the flea vector. Science 2002;296(5568):733-5
Phylogeny
Strategies
Discrete character approaches
Parsimonious criterion
Model likelihood criterion
Hypothesis likelihood criterion
Distance-based clustering
Least-square
Neighbor-Joining / UPGMA (Implicit topology)
Minimum Evolution
UPGMA’s shortcoming
The “molecular clock” assumption can be rejected in most cases.
In this example, unequally evolving sequences cluster according to their rate of evolution rather than according to the history of the genes.
Neighbor-Joining algorithm
Guaranteed to recover the true tree if the distance matrix is an exact reflection of the tree.
How realistic is it to assume that these distances are behaving as such?
Distance metrics between sequences
Triangle inequality
$D_{AC} \le D_{AB} + D_{BC}$
Rarely respected, especially if either D(A,B) or D(B,C) is large.
The reason: saturation.
What is saturation?
Time    1 2 3 4 5 6 7
Case A  A - - - - - P
Case B  A F A H K H P
In both cases, if only time steps 1 and 7 are known, the most likely distance will be the same.
Saturation is theoretically expected
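Why saturation is expected can be illustrated with a short calculation. The sketch below assumes the Jukes-Cantor model (all substitutions equally likely); the function name `expected_p_distance` is introduced here for illustration only. It gives the expected fraction of differing sites for a true number of substitutions per site, and that fraction levels off at 0.75.

```python
import math


def expected_p_distance(d):
    """Expected fraction of differing sites after d substitutions per site
    (assumption: Jukes-Cantor model, four equally frequent nucleotides)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


# The observed difference grows more and more slowly than the true distance.
for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"true d = {d:4.1f}   expected observed difference = {expected_p_distance(d):.3f}")
```

Beyond a few substitutions per site, very different true distances give nearly the same observed difference, which is why long distances are hard to estimate.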
Maximum likelihood distances
The following describes the evaluation of distances using the maximum likelihood criterion.
This is the best method to evaluate distances between biological sequences.
[Figure: the four nucleotides A, G, C and T]
Jukes-Cantor Model
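For nucleotides, there are a limited number of substitutions
Matrix with 1 expected substitution per 100 sites (lower triangle shown; each row of the full matrix sums to 1, so each off-diagonal entry is 0.01/3 ≈ 0.0033):
      A       G       T       C
A   0.99
G   0.0033  0.99
T   0.0033  0.0033  0.99
C   0.0033  0.0033  0.0033  0.99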
Jukes-Cantor Model
For nucleotides there are a limited number of substitutions
Given two (short) sequences
C C A T
C C G T
P1 =
      A       G       T       C
A   0.99
G   0.0033  0.99
T   0.0033  0.0033  0.99
C   0.0033  0.0033  0.0033  0.99
$\pi = (\pi_A, \pi_G, \pi_C, \pi_T)$
The likelihood that these two sequences are related is then:
$L(1) = \pi_c P_{c \to c} \cdot \pi_c P_{c \to c} \cdot \pi_a P_{a \to g} \cdot \pi_t P_{t \to t}$
Jukes-Cantor Model
What if the distance implied by P1 is not realistic/representative?
Extrapolation of probability matrices.
As we have seen for the PAM matrix a few weeks back, we can obtain a $p_{ij}$ for any multiple of PAM1 by doing a matrix multiplication:
$M^{2}_{i,j} = \sum_{k=1}^{4} M_{i,k}\, M_{k,j}$
Extrapolation of probability matrices.
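There will then be a different probability associated with each possible distance:
$L(1) = \pi_c P^{1}_{c \to c} \cdot \pi_c P^{1}_{c \to c} \cdot \pi_a P^{1}_{a \to g} \cdot \pi_t P^{1}_{t \to t}$
$L(2) = \pi_c P^{2}_{c \to c} \cdot \pi_c P^{2}_{c \to c} \cdot \pi_a P^{2}_{a \to g} \cdot \pi_t P^{2}_{t \to t}$
$L(l) = \pi_c P^{l}_{c \to c} \cdot \pi_c P^{l}_{c \to c} \cdot \pi_a P^{l}_{a \to g} \cdot \pi_t P^{l}_{t \to t}$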
Extrapolation of probability matrices.
The likelihood is a function of the distance between two sequences.
There is thus a value of the distance (l) that maximizes the probability of observing two related sequences.
In other words, there is a value of t that maximizes the likelihood that two sequences are related.
For branch length l over k sites:
$L(l) = \pi_c P^{l}_{c \to c} \cdot \pi_c P^{l}_{c \to c} \cdot \pi_a P^{l}_{a \to g} \cdot \pi_t P^{l}_{t \to t} \cdots \pi_x P^{l}_{x \to x'}$
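The search for the distance that maximizes L can be sketched as follows. This is a minimal illustration, assuming the Jukes-Cantor model with equal frequencies (π = 1/4) and using its closed-form transition probabilities rather than repeated matrix multiplication; the function names are hypothetical. The distance is expressed in expected substitutions per site.

```python
import math


def jc_prob(x, y, d):
    """P(x -> y) after d expected substitutions per site (Jukes-Cantor)."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def log_likelihood(seq_a, seq_b, d):
    """ln L(d) = sum over sites of ln(pi_x * P(x -> y | d)), with pi_x = 1/4."""
    return sum(math.log(0.25 * jc_prob(x, y, d)) for x, y in zip(seq_a, seq_b))


def ml_distance(seq_a, seq_b, lo=1e-6, hi=10.0, tol=1e-9):
    """Golden-section search for the d in [lo, hi] that maximizes ln L(d)."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        left = b - ratio * (b - a)
        right = a + ratio * (b - a)
        if log_likelihood(seq_a, seq_b, left) > log_likelihood(seq_a, seq_b, right):
            b = right  # the maximum lies in [a, right]
        else:
            a = left   # the maximum lies in [left, b]
    return (a + b) / 2.0


d_hat = ml_distance("CCAT", "CCGT")
# Under Jukes-Cantor the maximum has a closed form, useful as a check.
p = 0.25  # one differing site out of four
print(d_hat, -0.75 * math.log(1.0 - 4.0 * p / 3.0))
```

With richer models there is no closed form, so a numerical search of this kind is what is used in practice.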
Arbitrary P matrices from
$5^{4} = e^{4 \log(5)}$
$P^{t} = e^{t \log(P)}$
$Q = \log(P)$
$P^{t} = e^{Qt}$
Q is the log(P) matrix for an arbitrary unit of distance.
In practice, the model can be custom-built for an input dataset:
$Q = R\,\Pi$
$\Pi$: vector of frequencies for each character (can be estimated from the input dataset)
$R$: a matrix of relative rates of substitution (from a large amount of empirical data (PAM, JTT) or optimized (WAG))
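Assembling Q from these two ingredients can be sketched as below. The convention q_ij = R_ij · π_j for i ≠ j, the scaling to one expected substitution per unit of distance, and all numeric values are assumptions made for illustration; they are not taken from PAM, JTT or WAG.

```python
def build_rate_matrix(R, pi):
    """Return Q with q_ij = R_ij * pi_j for i != j and rows summing to 0,
    scaled so that the expected rate -sum_i pi_i * q_ii equals 1."""
    n = len(pi)
    Q = [[R[i][j] * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        Q[i][i] = -sum(Q[i])  # the diagonal makes each row sum to zero
    rate = -sum(pi[i] * Q[i][i] for i in range(n))
    return [[q / rate for q in row] for row in Q]


# Hypothetical frequencies and symmetric relative rates (order A, G, C, T).
pi = [0.3, 0.2, 0.2, 0.3]
R = [[0.0, 2.0, 1.0, 1.0],
     [2.0, 0.0, 1.0, 1.0],
     [1.0, 1.0, 0.0, 2.0],
     [1.0, 1.0, 2.0, 0.0]]
Q = build_rate_matrix(R, pi)
for row in Q:
    print([round(q, 3) for q in row])
```

Because R is symmetric, the resulting Q satisfies π_i q_ij = π_j q_ji, which is the time-reversibility condition.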
Extrapolation of probability matrices.
Now, imagine that two sequences are unrelated.
– The real branch length (t) is equal to +∞.
– The BL estimate will converge to a necessarily smaller value, because some sites are identical by coincidence.
Even random sequences are going to have “matches”
The likelihood distance should nevertheless tend toward large values in this case.
Even random sequences are going to have “matches”
Saturation should be compensated for in ML distances.
However, because of:
• Non-homogeneous frequencies
• Rate heterogeneity
• Change in the P matrix over time
• Non-independence of characters in a sequence
long distances are still a bit contentious to evaluate.
Time reversibility is also assumed
Time    1 2 3 4 5 6 7
Case A  A A A P P P P
Case B  A A A H K H P
Without time reversibility assumed, it would be impossible to measure a distance between two sequences without involving an undefined bifurcation.
Time reversibility is also assumed
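In practice, this means that the entries in our matrices of substitution have to be symmetrical such that:
$P(a \mid b, l) = P(b \mid a, l)$
(With unequal character frequencies, the general condition is $\pi_b\, P(a \mid b, l) = \pi_a\, P(b \mid a, l)$, and it is the relative-rate matrix that is symmetrical.)
This is also practical from a bioinformatics perspective, since it cuts the number of parameters in the model in half.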
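Another distance-based method that intuitively makes sense
Least Square method
$U = \sum_{i}^{n} \sum_{j}^{n} w_{ij} \left( D_{ij} - d_{ij} \right)^{2}$
$D_{ij}$: D matrix entry
$d_{ij}$: sum of all t along the path from i to j
$w_{ij}$: weight, $w_{ij} = 1,\ \frac{1}{D_{ij}},\ \frac{1}{D_{ij}^{2}},\ \ldots$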
One last distance-based method that we would intuitively use
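Once abstracted: we are looking for an acyclic, binary graph with n terminal vertices that conforms the best to a set of $n^{2}$ constraints.
[Figure: rooted tree with terminal nodes i, o, f, j, x; terminal branches t1 to t5; internal branches t12, t45 and t345]
$U = \sum_{i}^{n} \sum_{j}^{n} w_{ij} \left( D_{ij} - d_{ij} \right)^{2}$
$d_{ij} = t_1 + t_{12} + t_{345} + t_{45} + t_4$
There is a danger of time traveling, with some $t_k < 0$.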
One last distance-based method that we would intuitively use
Once abstracted: although there are n terminal nodes, there will be 2n-1 total nodes in the tree/graph (rooted tree).
[Figure: the same tree with terminal nodes i, o, f, j, x; each branch is marked as in the path or not in the path between two terminal nodes]
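$U = \sum_{i}^{n} \sum_{j}^{n} w_{ij} \left( D_{ij} - \sum_{k} x_{ij,k}\, t_k \right)^{2}$
$x_{ij,k} = 1$ if branch k is in the path from i to j, and $x_{ij,k} = 0$ if it is not in the path.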
One last distance-based method that we would intuitively use
There is a straightforward solution to this linear algebra problem: for a fixed topology, U is quadratic in the branch lengths $t_k$, so the best $t_k$ are the solution of a linear least-squares system.
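The linear algebra can be sketched for a fixed topology as follows. This is an illustrative implementation that solves the normal equations with plain Gauss-Jordan elimination; the function names, the 4-taxon tree and the distances are hypothetical, and each unordered pair (i, j) is counted once.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def least_squares_branches(paths, D, n_branches, weight=lambda d: 1.0):
    """Minimize U = sum_ij w_ij (D_ij - sum_k x_ij,k t_k)^2.

    paths[(i, j)] lists the branches k on the path from i to j (x_ij,k = 1).
    """
    A = [[0.0] * n_branches for _ in range(n_branches)]
    b = [0.0] * n_branches
    for pair, branches in paths.items():
        w = weight(D[pair])
        for k in branches:
            b[k] += w * D[pair]
            for m in branches:
                A[k][m] += w
    return solve_linear(A, b)


# Hypothetical unrooted tree ((A,B),(C,D)): branches 0..3 are the terminal
# branches of A, B, C, D and branch 4 is the internal branch.
paths = {("A", "B"): [0, 1], ("A", "C"): [0, 4, 2], ("A", "D"): [0, 4, 3],
         ("B", "C"): [1, 4, 2], ("B", "D"): [1, 4, 3], ("C", "D"): [2, 3]}
# Distances that fit this tree exactly with t = [1, 2, 3, 4, 5].
D = {("A", "B"): 3.0, ("A", "C"): 9.0, ("A", "D"): 10.0,
     ("B", "C"): 10.0, ("B", "D"): 11.0, ("C", "D"): 7.0}
t = least_squares_branches(paths, D, n_branches=5)
print([round(x, 6) for x in t])
```

Nothing in this solution prevents a negative $t_k$ when the distances do not fit the topology; passing `weight=lambda d: 1.0 / d ** 2` gives the $1/D_{ij}^{2}$ weighting.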
One last distance-based method that we would intuitively use
Minimum Evolution
Can be used as a selection criterion between least-squares tree topologies.
This is done by selecting, from a collection of suitable topologies, the one that minimizes:
$\sum_{k \,\in\, \text{all edges}} t_k$
Tree space
Unlike UPGMA and NJ, the problem with the previous method is that you have to provide a topology prior to the calculation…
Phylogeny
Strategies
Discrete character approaches
Parsimonious criterion
Model likelihood criterion
Bayesian statistics
Distance-based clustering
Least-square
Neighbor-Joining / UPGMA (Implicit topology)
Minimum Evolution
Phylogeny
Discrete-character signal versus distance
Distance : Use the characters and a function to evaluate distances between sequences. These distances determine the lengths of the branches/edges between nodes/vertices. The internal nodes/edges are simply there to reconcile the distance data into a binary tree as well as possible.
Character : Use discrete characters, implicitly or explicitly, to define the state of each node.
Parsimony
Intuitive method that can be run manually
Assumes that everything observed in the data is connected by the most straightforward relationships.
Parsimony
Algorithm
Postorder tree traversal: from the terminal nodes toward the “center”.
At each node:
1. Create the intersection of the sets of observations in the immediate descendant nodes.
2. If the intersection is empty, create a set that is the union of the two descendants' sets, and add one to the count of changes recorded.
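The two steps above can be sketched for one site as follows. The tree encoding (nested tuples of leaf names) and the function names are choices made for this illustration; the example site reuses the A, C, C, C, G pattern of the five-taxon tree from later slides.

```python
def fitch(tree, states):
    """Return (state_set, changes) for one site.

    tree: a leaf name (str) or a (left, right) tuple of subtrees.
    states: dict mapping each leaf name to its observed character.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # Step 1: a non-empty intersection costs nothing.
        return common, left_changes + right_changes
    # Step 2: empty intersection, take the union and count one change.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the number of changes over all sites (columns)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


tree = (("s1", "s2"), ("s3", ("s4", "s5")))
site = {"s1": "A", "s2": "C", "s3": "C", "s4": "C", "s5": "G"}
print(fitch(tree, site))
```

Comparing `parsimony_score` across candidate topologies is then what selects the most parsimonious tree.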
Parsimony
Algorithm
The most parsimonious tree is the topology that minimizes the number of changes needed to explain the data over all sites (columns).
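Statistics
Consistency: $CI = \dfrac{C_{\min}}{C}$
Retention: $RI = \dfrac{C_{\max} - C}{C_{\max} - C_{\min}}$
where C is the number of changes on the tree, and $C_{\min}$ and $C_{\max}$ are the smallest and largest numbers of changes possible for the data.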
Parsimony
Side-effects
The reconstruction assumes that the most parsimonious explanation is the correct one.
It also assumes that all changes have a similar “cost”.
Therefore, the parsimony method does not seem to be designed to deal with saturation.
Maximum likelihood criterion
Abstraction
We have a collection of items (sequences). We know that all the instances in the collection are stochastically derived from a unique parent in the hierarchy. We also have a model for this stochastic process, represented as a Markov process.
We are thus looking for a tree (topology+distances) that will maximize the likelihood of the data, given the Markov process.
Extrapolation of probability matrices.
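There will be an optimal distance between two sequences.
$L(1) = \pi_c P^{1}_{c \to c} \cdot \pi_c P^{1}_{c \to c} \cdot \pi_a P^{1}_{a \to g} \cdot \pi_t P^{1}_{t \to t}$
$L(2) = \pi_c P^{2}_{c \to c} \cdot \pi_c P^{2}_{c \to c} \cdot \pi_a P^{2}_{a \to g} \cdot \pi_t P^{2}_{t \to t}$
$L(l) = \pi_c P^{l}_{c \to c} \cdot \pi_c P^{l}_{c \to c} \cdot \pi_a P^{l}_{a \to g} \cdot \pi_t P^{l}_{t \to t}$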
Distance to an internal node
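There will be an optimal distance between two sequences:
$L(t) = \pi_c P^{t}_{c \to c} \cdot \pi_c P^{t}_{c \to c} \cdot \pi_a P^{t}_{a \to g} \cdot \pi_t P^{t}_{t \to t}$
If the sequence of only one of the nodes is known, the other end could be any possible character:
$L(t) = \sum_{x} \pi_c P^{t}_{c \to x} \cdot \sum_{x} \pi_c P^{t}_{c \to x} \cdot \sum_{x} \pi_a P^{t}_{a \to x} \cdot \sum_{x} \pi_t P^{t}_{t \to x}$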
Model based phylogeny
It is possible to compute the likelihood of internal nodes by summing over all possibilities.
$L = \sum_{x} \sum_{y} \sum_{z} \sum_{w} P(x)\, P(y \mid x, t_6)\, P(A \mid y, t_1)\, P(C \mid y, t_2)\, P(z \mid x, t_8)\, P(C \mid z, t_3)\, P(w \mid z, t_7)\, P(C \mid w, t_4)\, P(G \mid w, t_5)$
[Figure: rooted tree with terminal nodes A, C, C, C, G; root x; internal nodes y, z, w; branch lengths t1 to t8]
Model based phylogeny
The structure of the equation, once the summations are pushed as far right as possible, is the same as the structure of the tree.
$L = \sum_{x} P(x) \left[ \sum_{y} P(y \mid x, t_6)\, P(A \mid y, t_1)\, P(C \mid y, t_2) \right] \left[ \sum_{z} P(z \mid x, t_8)\, P(C \mid z, t_3) \left[ \sum_{w} P(w \mid z, t_7)\, P(C \mid w, t_4)\, P(G \mid w, t_5) \right] \right]$
((A, C), (C, (C, G)))
[Figure: rooted tree with terminal nodes A, C, C, C, G; root x; internal nodes y, z, w; branch lengths t1 to t8]
Model based phylogeny
The calculation at one node thus depends on the conditional likelihood of each possible character s in the children nodes.
$L_w^{(i)}(s) = P(C \mid s, t_4)\, P(G \mid s, t_5)$
$L_w^{(i)} = \{ L_w^{(i)}(s_1), L_w^{(i)}(s_2), \ldots, L_w^{(i)}(s_n) \}$
Model based phylogeny
For terminal nodes:
$L^{(i)} = \{0, 1, 0, 0, 0, 0, 0, 0, \ldots, 0\}$
For internal nodes:
$L_y^{(i)}(s) = \prod_{child} \sum_{a} P(a \mid s, t_{child})\, L_{child}^{(i)}(a)$
This is done for each site i.
The log(L) values are stored rather than L.
For terminal nodes:
$L^{(i)} = \{0, 1, 0, 0, 0, 0, 0, 0, \ldots, 0\}$
For internal nodes:
$L_y^{(i)}(s) = \prod_{child} \sum_{a} P(a \mid s, t_{child})\, L_{child}^{(i)}(a)$
For innermost nodes:
$L^{(i)} = \sum_{a} \pi_a\, L_x^{(i)}(a)$
For a tree:
$L = \prod_{i} L^{(i)}$
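The recursion over terminal, internal and innermost nodes can be sketched as follows for one site. The sketch assumes the Jukes-Cantor model with π = 1/4 and a hypothetical branch length of 0.1 on every branch; the tree encoding (lists of `(child, branch_length)` pairs), the leaf names and the function names are choices made for this illustration.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t):
    """P(b | a, t) under Jukes-Cantor, t in expected substitutions per site."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def conditional(node, site):
    """Return {s: L_node(s)}, the likelihood of the data below node given
    state s at node. A node is a leaf name or a list of (child, t) pairs."""
    if isinstance(node, str):
        # Terminal node: 1 for the observed character, 0 for the others.
        return {s: 1.0 if s == site[node] else 0.0 for s in BASES}
    result = {s: 1.0 for s in BASES}
    for child, t in node:
        below = conditional(child, site)
        for s in BASES:
            result[s] *= sum(jc_prob(s, a, t) * below[a] for a in BASES)
    return result


def site_likelihood(root, site):
    """Innermost node: weight each root state by its frequency (1/4)."""
    return sum(0.25 * value for value in conditional(root, site).values())


t = 0.1  # hypothetical value used for every branch t1..t8
w = [("leaf4", t), ("leaf5", t)]
z = [("leaf3", t), (w, t)]
y = [("leaf1", t), ("leaf2", t)]
root = [(y, t), (z, t)]
site = {"leaf1": "A", "leaf2": "C", "leaf3": "C", "leaf4": "C", "leaf5": "G"}
print(site_likelihood(root, site))
```

For a whole alignment, the per-site values would be combined as a sum of logarithms, in line with storing log(L) rather than L.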
Tree Space
… and an epic complexity in search space
$T(n) = 3 \times 5 \times 7 \times 9 \times 11 \times \cdots \times (2n-3)$
$T(n) = \dfrac{(2n-3)!}{2^{\,n-2}\,(n-2)!}$
where the n of interest is in the 20-100 range.
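The size of the search space is easy to check numerically. The sketch below (function names are illustrative) counts rooted binary topologies with the product formula and compares it with the closed form $(2n-3)!/(2^{n-2}(n-2)!)$.

```python
from math import factorial


def rooted_tree_count(n):
    """T(n) = 3 * 5 * 7 * ... * (2n - 3) for n labelled terminal nodes."""
    count = 1
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count


def rooted_tree_count_closed(n):
    """Closed form of the same product."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (5, 10, 20, 50):
    print(n, rooted_tree_count(n))
```

Already at n = 20 the count is far too large for exhaustive evaluation, which motivates the heuristic searches that follow.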
Tree data structure
What if we don’t constrain the binary connectivity?
Multifurcation is not necessary because any multifurcated tree is a case of a binary tree with at least one internal branch length set to zero.
Exploring the topology space
Incremental method – Stepwise addition
Introduce one sequence at a time, on the branch which gives the best likelihood.
$T(n) = 3 \times 5 \times 7 \times 9 \times 11 \times \cdots \times (2n-3)$
This method of building a topology is intrinsically greedy.
Exploring the topology space
Local rearrangement – Nearest Neighbor Interchange
For a tree with four lineages, there are only three possible topologies, one of which is already computed.
Exploring the topology space
Greedy Methods – Nearest Neighbor Interchange
Sensitive to the initial tree, may not recover from some types of error early in the optimization.
Exploring the topology space
Global Methods
Methods that have the potential to sample the entire breadth of the search space in only a few consecutive iterations.
Subtree Pruning Regrafting (SPR)
Tree Bisection and Reconnection (TBR)
Exploring the topology space
Subtree Pruning Regrafting
Search space per step is narrowed down to:
$O(n^{2})$
Exploring the topology space
Tree bisection and reconnection
This algorithm randomizes the site of reconnection in both subtrees.
Search space is narrowed down to:
$O(n^{3})$
Meaning of all this
Trees
Biologists love them. As we have seen, they attempt to go beyond the logical clustering of data items. Instead, they are used to reconstruct the process under which the data was generated.
For example: It is because of phylogenetic trees that we know that we (modern eukaryotes) originate from the symbiosis between a bacterium (the alpha-proteobacterial ancestor of mitochondria) and an elusive ancestral cell.
The likelihood of a tree is a reflection of the goodness of fit of the alignment to a tree and a model of substitution.
How good is a tree?
Problem: Not only do the clusters matter, but each individual internal node contains usable information. How can we ascertain that a node is any good?
There is no reference set for which we know the true answer.
Cause of error:
Model misspecification, mixed history within a gene, sampling error, …
Significance of difference in likelihood values.
If the likelihood evaluation depends on a single parameter:
$2\left(\ln \hat{L} - \ln L_0\right) \sim \chi^{2}_{1}$
This relationship is however impractical, since the likelihood calculation rarely depends on a single parameter.
An (imperfect) example would be evaluating the certainty on a single branch length at a time.
Real trees have many more parameters: 2n-2 branches in a rooted binary tree (2n-3 if unrooted), the rate distribution shape parameter, etc.
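For the single-parameter case, the test can be sketched with the standard library only. The function name and the log-likelihood values are hypothetical; the p-value uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math


def lrt_p_value(lnL_hat, lnL_0):
    """p-value of 2 (ln L_hat - ln L_0) against chi-square with 1 d.f."""
    stat = 2.0 * (lnL_hat - lnL_0)
    if stat <= 0.0:
        return 1.0
    # P(chi2_1 > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2))
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods: optimized branch length vs. a fixed value.
print(lrt_p_value(-1234.5, -1237.0))
```

With more free parameters the degrees of freedom increase and this one-line formula no longer applies.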
Re-sampling using the bootstrap
Getting around the sampling error
Assumption : All the significant signal is present in the data, but the signal’s blend is affected by the size of the sample.
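Given a dataset of n sequences that are k characters long:
Create a new dataset by randomly and uniformly choosing site indices “i” (columns, with replacement) until the resampled dataset has a size of n × k.
$\begin{pmatrix} a_{11} & \cdots & a_{1i} & \cdots & a_{1k} \\ \vdots & & \vdots & & \vdots \\ a_{n1} & \cdots & a_{ni} & \cdots & a_{nk} \end{pmatrix}$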
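The resampling step can be sketched as below. The function name and the toy alignment are hypothetical; the important point is that whole columns are drawn, so every sequence receives the same site indices.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample columns with replacement.

    alignment: list of n aligned sequences (strings of equal length k).
    Returns a new list of n strings of length k.
    """
    k = len(alignment[0])
    columns = [rng.randrange(k) for _ in range(k)]
    return ["".join(seq[i] for i in columns) for seq in alignment]


rng = random.Random(42)  # fixed seed so the illustration is repeatable
alignment = ["CCATGA", "CCGTGA", "CTGTAA", "CTGTAC"]
replicate = bootstrap_alignment(alignment, rng)
print(replicate)
```

Each replicate is then analyzed exactly like the original dataset, and the trees obtained are compared node by node.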
Re-sampling using the bootstrap
In practice, this is used to generate a large number of replicates and count the frequency of observing a given internal node.
A high “bootstrap value” means that the node is stable to sampling error.
It does not mean that a given internal node is “real”.
Re-sampling using the Jackknife
Getting around the sampling error in the sequence axis
Principle : Randomly delete a small fraction of the data
The term “Jackknife” is also used in cases where trees are reconstructed by randomly deleting whole sequences from the dataset.
Using simulation
This is known as the parametric bootstrap
Principle : Create a distribution of likelihood values generated from simulated datasets
The test is done by evaluating the probability that the “real” likelihood value is part of the distribution of tree likelihoods obtained from simulated datasets.
Using simulation
Simulating a multiple sequence alignment from a tree is the reverse problem of inferring a tree from an alignment.
[Figure: rooted tree with root x, internal nodes y, z, w, branch lengths t1 to t8, and unknown (?) sequences at the five terminal nodes]
x = {…}, random, drawn from the character frequencies ($\pi$)
On a per-site basis, the probability vector of each site in the node y can be calculated with:
$P = e^{Q t_6}$
$y_i = \{P_{x_i \to A}, \ldots, P_{x_i \to Y}\}$
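A minimal simulation along a tree might look like this. It assumes the Jukes-Cantor model on nucleotides (so the root is drawn uniformly and each row of $e^{Qt}$ has a closed form) and a hypothetical branch length of 0.1 everywhere; the tree encoding and all names are choices made for this illustration.

```python
import math
import random

BASES = "ACGT"


def jc_row(state, t):
    """Row of P = exp(Qt) for the given parent state under Jukes-Cantor."""
    decay = math.exp(-4.0 * t / 3.0)
    return [0.25 + 0.75 * decay if b == state else 0.25 - 0.25 * decay for b in BASES]


def simulate_site(node, state, rng, out):
    """Propagate a state from node to its descendants, recording the leaves."""
    if isinstance(node, str):
        out[node] = state
        return
    for child, t in node:
        child_state = rng.choices(BASES, weights=jc_row(state, t))[0]
        simulate_site(child, child_state, rng, out)


def simulate_alignment(root, n_sites, rng):
    """Return {leaf name: simulated sequence} with n_sites independent sites."""
    alignment = {}
    for _ in range(n_sites):
        site = {}
        simulate_site(root, rng.choice(BASES), rng, site)  # root drawn from pi
        for name, state in site.items():
            alignment[name] = alignment.get(name, "") + state
    return alignment


t = 0.1  # hypothetical value used for every branch t1..t8
w = [("leaf4", t), ("leaf5", t)]
z = [("leaf3", t), (w, t)]
y = [("leaf1", t), ("leaf2", t)]
root = [(y, t), (z, t)]
for name, seq in simulate_alignment(root, 30, random.Random(7)).items():
    print(name, seq)
```

Repeating this many times and recomputing the likelihood of each simulated alignment yields the distribution against which the “real” likelihood is compared.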
Using simulation
This is known as the parametric bootstrap
This test can be used to evaluate whether the data can be simulated from a given combination of tree and model.
If the test tree is wrong, the simulated dataset should not include the “real” data.
If the test tree is the true tree and the model is relevant to what really happened during the evolution of the gene, the likelihood of the “real” data and those of the simulated series should not be statistically different.
Using simulation
This is known as the parametric bootstrap
The test is expected to be conservative because the simulated datasets are generated and analyzed using the same parameters, while the “real” data comes from a true process.
Time consideration
Bootstrapping requires building distributions
This usually means that the long calculation has to be re-run on resampled datasets about 100-1000 times over. All this, just to harvest a few numbers.
Paired-site tests
Can be used to compare two trees
There are a number of techniques that compare two trees on the basis of their site likelihoods.
Winning site test, z test, t-test, Wilcoxon signed rank test, …
These tests are more appropriate to estimate error bars in the topology dimension of a solution.
Paired-site tests
In our research group, we are using such tests in our optimization strategy
$\delta_i = \ln L^{(i)}_{ref} - \ln L^{(i)}_{test}$
$\delta_i > 0$: ref is better
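Two of the simplest paired-site statistics can be sketched from the per-site differences. This is a generic illustration (winning sites count and a z statistic with a normal approximation), not the procedure used by the research group; the function name and the site log-likelihood values are hypothetical.

```python
import math


def paired_site_test(lnL_ref, lnL_test):
    """Return (z, two-sided p, number of sites where ref is better)."""
    deltas = [r - t for r, t in zip(lnL_ref, lnL_test)]
    n = len(deltas)
    mean = sum(deltas) / n
    variance = sum((d - mean) ** 2 for d in deltas) / (n - 1)
    z = mean / math.sqrt(variance / n)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # normal approximation
    wins = sum(d > 0 for d in deltas)
    return z, p, wins


# Hypothetical site log-likelihoods for a reference and a test topology.
lnL_ref = [-1.0, -2.0, -1.5, -3.0]
lnL_test = [-1.2, -2.1, -1.4, -3.4]
print(paired_site_test(lnL_ref, lnL_test))
```

Because only site likelihoods that are already computed are needed, no re-sampling or re-optimization is required to discard clearly worse trees.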
It is possible to eliminate statistically worse trees rapidly, without re-sampling, and to treat the solution not as a data point but rather as an area in topology space.
Our research shows that:
Sums of likelihoods are sensitive to the variance of the poorly modeled sites.
Single-thread searches are not very robust to local minima.
Meaning of all this
Site likelihood
Bioinformaticians love them. They provide information that is not contained in individual sequences (i.e., it cannot be found no matter how hard one scans a single genome). Further, they contain information on properties that may be impossible to physically observe.
Site likelihoods are a reflection of the goodness of fit of one position in a protein given a solution optimized with all the available data.
Phylogeny allows sequences to be assembled into an informative, time-dependent structure
For the next few slides we will look at how phylogenetic information can be used to detect new signal in sequence information.
Site-wise rate of evolution.
Rate-dependent functional shifts.
Rate-independent functional shifts.
This framework offers a new source of information for pattern detection and recognition.
Estimating rates amongst sites
The basic calculation assumed a constant rate. Variable rates can be approximated on a site-per-site basis:
$L^{(i)} = \int_{0}^{\infty} \rho(r)\, L^{(i)}(r)\, dr$
$L^{(i)} \approx \sum_{k=1}^{j} w_k\, L^{(i)}(r_k)$
where $\rho(r)$ is the distribution of rates, and $w_k$ and $r_k$ are the weight and rate of category k.
This will be true as long as the mean rate is 1:
$\sum_{k} w_k\, r_k = 1$
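The discrete approximation can be sketched with a toy site likelihood. Here the "tree" is reduced to a single pair of nucleotides under Jukes-Cantor, and the rate categories, weights and distance are hypothetical; the posterior mean rate shown is one common way to turn the weighted sum into a per-site rate estimate, and it is an assumption of this sketch.

```python
import math


def pair_site_likelihood(same, t):
    """Jukes-Cantor likelihood of one aligned pair of nucleotides at distance t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 * (0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay)


def mixture_likelihood(site_lik, rates, weights):
    """L(i) approximated by sum_k w_k * L(i)(r_k)."""
    return sum(w * site_lik(r) for r, w in zip(rates, weights))


def posterior_mean_rate(site_lik, rates, weights):
    """Rate estimate for a site: each r_k weighted by w_k * L(i)(r_k)."""
    total = mixture_likelihood(site_lik, rates, weights)
    return sum(r * w * site_lik(r) for r, w in zip(rates, weights)) / total


rates = [0.2, 1.0, 1.8]      # hypothetical categories, mean rate = 1
weights = [1.0 / 3.0] * 3
t = 0.5                      # hypothetical distance between the two sequences
conserved = posterior_mean_rate(lambda r: pair_site_likelihood(True, r * t), rates, weights)
variable = posterior_mean_rate(lambda r: pair_site_likelihood(False, r * t), rates, weights)
print(conserved, variable)
```

A site where the two sequences agree is pulled toward the slow categories, and a site where they differ toward the fast ones; on a real tree `site_lik` would be the full site likelihood with all branch lengths multiplied by r.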
Extracting information from rates estimates
Sequence alignments were first used to identify which positions were “conserved”.
The rationale was that if the same character was conserved across all sequences, it was constrained and played an important role.
We can refer to this as “eyeball bioinformatics”.
This method of predicting function is very sensitive to the source dataset.
Sampling homogeneity
Character similarity
Extracting information from rates estimates
Case 1 Sampling homogeneity
2 alignments for protein sequences of gene X:
Alignment 1: 35 spider sequences
Alignment 2: 5 spider, 5 mammal, 5 bacterial, 5 fungal, 5 nematode, 5 primate and 5 rice plant sequences
The conclusion will necessarily be that there are more conserved sites in the spider data, while in fact some of the same sequences are present in the second dataset.
[Color scale: fast to slow]
Maximum-Likelihood Site-Rates are Biologically Relevant
Rhodopsin-like G-protein receptors
Pfam (dataset 1Tml_7) 69 taxa
Maximum-Likelihood Site-Rates are Biologically Relevant
Tubulin
34 taxa 33 taxa
The constraints imposed by co-evolution far outweigh the structural constraints.
[Color scale: fast to slow]
What can be done with rate of evolution
Predict functionally important regions in proteins
Example : We know that the product of gene G binds a drug, but the mechanism is unknown. Using site rate estimates, a patch of slowly evolving sites is detected at the surface of the protein’s structure. This is potentially a good place to investigate further.
Why bother with computational methods?
Time to gather data:
Sequence << Structure << Biological activity
Computational methods are best used trying to fill the gap between genomic data and the real world.
What can be done with rate of evolution
The technique gains power if used in a comparative strategy
Often, an uncharacterized gene will have a related protein that is already well known. It is possible to compare the two datasets of sequences in a 3D context to predict the presence or absence of function.
The computational techniques to do this are usually based on site likelihood methods
Comparing the scalar rate values directly has no statistical meaning.
There are many different schemes; many exploit a variant of the site likelihood ratio statistic for two aligned datasets of sequences a and b:
$\Lambda^{(i)}_{ab} = \dfrac{L^{(i)}_{r_a}\, L^{(i)}_{r_b}}{L^{(i)}_{r_{ab}}}$
Inferring Function in Homologs of eF1
Evolutionary Patterns in Elongation Factors and paralogs
eF1: 34 taxa
(a)eF1: 33 taxa
HBS1: 10 taxa
eRF3: 20 taxa
Pairwise comparison using bivariate site rate estimation
[Figure: pairwise comparisons among eRF3, eF1, (a)eF1 and HBS1]
Inferring Function in Homologs of eF1
eRF3: Eukaryotic Release Factor
Loss of eF1–analog interface
Loss of constraint in 1’ loop
Interacts with eRF1 (a tRNA mimic)
[Figure: two eF1 structures showing differently evolving sites: slow in eRF3, slow in eukaryotes]
Inferring Function in Homologs of eF1
HBS1: Unknown Function
Loss of eF1–analog interface
Most likely no tRNA binding
[Figure: two eF1 structures showing differently evolving sites: slow in HBS1, slow in eukaryotes]
Phylogenetics and bioinformatics
Phylogenetics existed long before bioinformatics
It comes from mathematics, statistics and genetics circles.
Phylogenetics is very relevant to bioinformatics
It captures a dimension of the data that is not visible from a collection of sequences.
Phylogenetics is becoming an increasingly central theme in sequence analyses.