Paulino Gomez-Puertas. Bioinformatics.
Molecular phylogenies
INTRODUCTION TO BIOINFORMATICS
2012
Richard Owen
Owen’s definition of homology
• Homologue: the same organ under every variety of form and function (true or essential correspondence: homology)
• Analogy: superficial or misleading similarity
Richard Owen, 1843
Charles Darwin
Darwin and homology
• “The natural system is based upon descent with modification ... the characters that naturalists consider as showing true affinity (i.e. homologies) are those which have been inherited from a common parent, and, in so far as all true classification is genealogical; that community of descent is the common bond that naturalists have been seeking”
Charles Darwin, Origin of species 1859 p. 413
Homology is...
• Homology: similarity that is the result of inheritance from a common ancestor
• The identification and analysis of homologies is central to phylogenetics (the study of the evolutionary history of genes and species)
• Similarity and homology are not the same thing, although the two terms are often wrongly used interchangeably
hypothesis:
SIMILARITY implies HOMOLOGY
HIGHER SIMILARITY implies CLOSER HOMOLOGY
Clustering methods:
- UPGMA (Unweighted Pair Group Method with Arithmetic mean)
- Neighbour Joining (N. Saitou & M. Nei)
  (corrects for unequal rates of evolution in different branches of the tree)
Cladistic methods (patterns of ancestry):
- Maximum parsimony
- Maximum likelihood (assigns quantitative probabilities to mutational events, rather than merely counting them)
A (simplistic) UPGMA example:

         legs/fins  eggs/placenta  gills (branchiae)/lungs  warm/cold-blooded
cat      L          P              L                        W
whale    F          P              L                        W
lizard   L          E              L                        C
trout    F          E              B                        C

Distance matrix (number of characters in which two animals differ):
     C  W  L  T
C    0  1  2  4
W       0  3  3
L          0  2
T             0
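The distance matrix above can be reproduced by counting, for each pair of animals, the characters in which they differ. The following is a minimal Python sketch under that assumption; the names `traits` and `hamming` are illustrative and do not come from the slides.

```python
from itertools import combinations

# One-letter character codes copied from the table:
# limbs, reproduction, breathing, body temperature.
traits = {
    "cat":    "LPLW",
    "whale":  "FPLW",
    "lizard": "LELC",
    "trout":  "FEBC",
}

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

# Pairwise distances, keyed by an unordered pair of animal names.
distances = {
    frozenset(pair): hamming(traits[pair[0]], traits[pair[1]])
    for pair in combinations(traits, 2)
}

for pair, d in distances.items():
    print(sorted(pair), d)
```

Keying the dictionary by `frozenset` makes the lookup independent of the order in which the two names are given.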
UPGMA
Smallest nonzero distance in the distance matrix: d(C, W) = 1, so cat and whale are joined first, each on a branch of length 0.5.
Reduced distance matrix:
         (C/W)  L                 T
(C/W)    0      1/2(2+3) = 2.5    1/2(4+3) = 3.5
L               0                 2
T                                 0
UPGMA
Smallest nonzero distance in the reduced distance matrix: d(L, T) = 2, so lizard and trout are joined next, each on a branch of length 1.
Reduced distance matrix:
         (C/W)  (L/T)
(C/W)    0      1/2(2.5+3.5) = 3
(L/T)           0
UPGMA
The only remaining distance is d((C/W), (L/T)) = 3, so the two clusters are joined at a height of 3/2 = 1.5. The branch leading to (C/W) has length 1.5 - 0.5 = 1 and the branch leading to (L/T) has length 1.5 - 1 = 0.5.
Final tree, with branch lengths:
((cat:0.5, whale:0.5):1, (lizard:1, trout:1):0.5)
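The three UPGMA steps can be sketched in a few lines of Python. This is an illustrative implementation, not code from the course: the function name `upgma`, the cluster labels and the Newick-style output string are assumptions made for the example. New distances are averaged with weights equal to cluster sizes, which for this data gives the same values as the simple halves used in the worked example.

```python
def upgma(dist):
    """Cluster by UPGMA; dist maps frozenset({a, b}) -> distance.

    Returns a Newick-style string with branch lengths.
    """
    d = dict(dist)
    # cluster label -> (tree text, node height, number of leaves)
    info = {}
    for pair in dist:
        for leaf in pair:
            info[leaf] = (leaf, 0.0, 1)
    while len(info) > 1:
        pair = min(d, key=d.get)           # closest pair of clusters
        a, b = sorted(pair)
        height = d[pair] / 2               # new node sits halfway between them
        (ta, ha, na), (tb, hb, nb) = info.pop(a), info.pop(b)
        new = f"({a}/{b})"
        text = f"({ta}:{height - ha:g},{tb}:{height - hb:g})"
        # Distance from the new cluster to each remaining cluster:
        # average of the old distances, weighted by cluster size.
        for c in info:
            d[frozenset((new, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        # Drop every entry that still mentions one of the merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        info[new] = (text, height, na + nb)
    (text, _, _), = info.values()
    return text + ";"

dist = {
    frozenset(("cat", "whale")): 1,
    frozenset(("cat", "lizard")): 2,
    frozenset(("cat", "trout")): 4,
    frozenset(("whale", "lizard")): 3,
    frozenset(("whale", "trout")): 3,
    frozenset(("lizard", "trout")): 2,
}
print(upgma(dist))
```

Ties between equally close pairs are resolved by dictionary order in this sketch; the example data has no ties.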
UPGMA using protein/DNA multiple sequence alignments:

Protein alignment:
cat      K E D D
whale    K E R D
lizard   E E D R
trout    E D R R

DNA alignment:
cat      A T C C
whale    A T G C
lizard   T T C G
trout    T C G G

Distance matrix (number of differing positions, identical for both alignments):
     C  W  L  T
C    0  1  2  4
W       0  3  3
L          0  2
T             0

Resulting tree, with branch lengths:
((cat:0.5, whale:0.5):1, (lizard:1, trout:1):0.5)
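Cladistic methods (patterns of ancestry):
- Maximum parsimony
- Maximum likelihood
Maximum likelihood: assigns quantitative probabilities to mutational events, rather than merely counting them.

Maximum parsimony example (ATCG, ATGG, TCCA, TTCA)

Tree 1 (four mutations):
  root ATCA
    ATCA -> ATCG (A->G), ancestor of leaves ATCG (no change) and ATGG (C->G)
    ATCA -> TTCA (A->T), ancestor of leaves TTCA (no change) and TCCA (T->C)

Tree 2 (eight mutations):
  root ATCG
    ATCG -> ATCA (G->A), ancestor of leaves ATCG (A->G) and TCCA (A->T, T->C)
    ATCG -> TTCG (A->T), ancestor of leaves TTCA (G->A) and ATGG (T->A, C->G)

Maximum parsimony prefers the tree that requires the fewest mutations, here tree 1.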
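Counting mutations on a given grouping of the four sequences can be automated. The sketch below uses Fitch's small-parsimony algorithm, a standard technique that the slides do not name, to find the minimum number of changes a fixed tree shape requires. The nested-tuple tree encoding and the function names are assumptions made for the example. Because Fitch's method chooses the best ancestral sequences itself, its count for a grouping can be lower than the count obtained with one particular set of ancestors drawn on a figure.

```python
def fitch(tree, i):
    """Return (possible ancestral bases, changes) for alignment column i.

    A tree is either a leaf (the sequence itself, as a string)
    or a pair (left subtree, right subtree).
    """
    if isinstance(tree, str):
        return {tree[i]}, 0
    left, n_left = fitch(tree[0], i)
    right, n_right = fitch(tree[1], i)
    common = left & right
    if common:                      # children can agree: no extra change
        return common, n_left + n_right
    return left | right, n_left + n_right + 1   # disagreement costs one change

def parsimony_score(tree, length):
    """Minimum number of changes over all columns of the alignment."""
    return sum(fitch(tree, i)[1] for i in range(length))

# Grouping used by the first tree: (ATCG, ATGG) and (TTCA, TCCA).
tree_1 = (("ATCG", "ATGG"), ("TTCA", "TCCA"))
# Alternative grouping: (ATCG, TCCA) and (TTCA, ATGG).
tree_2 = (("ATCG", "TCCA"), ("TTCA", "ATGG"))

print(parsimony_score(tree_1, 4))
print(parsimony_score(tree_2, 4))
```

The first grouping needs four changes, matching the count given for the first tree; the alternative grouping needs more, so parsimony prefers the first.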
Bootstrapping
• Characters are resampled with replacement to create many bootstrap replicate data sets
• Each bootstrap replicate data set is analysed (e.g. with parsimony, distance, ML)
• Agreement among the resulting trees is summarized with a majority-rule consensus tree
• Frequency of occurrence of groups, bootstrap proportions (BPs), is a measure of support for those groups
• Additional information is given in partition tables
Bootstrapping

Original data matrix:
         Characters
Taxa     1 2 3 4 5 6 7 8
A        R R Y Y Y Y Y Y
B        R R Y Y Y Y Y Y
C        Y Y Y Y Y R R R
D        Y Y R R R R R R
Outgp    R R R R R R R R

Resampled data matrix:
         Characters
Taxa     1 2 2 5 5 6 6 8
A        R R R Y Y Y Y Y
B        R R R Y Y Y Y Y
C        Y Y Y Y Y R R R
D        Y Y Y R R R R R
Outgp    R R R R R R R R

Randomly resample characters from the original data with replacement to build many bootstrap replicate data sets of the same size as the original, then analyse each replicate data set.
Summarise the results of multiple analyses with a majority-rule consensus tree.
Bootstrap proportions (BPs) are the frequencies with which groups are encountered in analyses of replicate data sets.
[Figure: trees for taxa A, B, C, D and the outgroup built from the original and the resampled matrix, and a majority-rule consensus tree with bootstrap proportions of 96% and 66%.]
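The resampling step can be sketched in Python as follows. The helper names `resample` and `bootstrap_columns` are illustrative, and the seed is arbitrary; only the data matrix and the resampled column list (1 2 2 5 5 6 6 8, written zero-based in the code) come from the slide. Tree building and the consensus step are not included.

```python
import random

# Original data matrix: one string of character states per taxon.
matrix = {
    "A":     "RRYYYYYY",
    "B":     "RRYYYYYY",
    "C":     "YYYYYRRR",
    "D":     "YYRRRRRR",
    "Outgp": "RRRRRRRR",
}

def resample(matrix, columns):
    """Build a replicate matrix from the given (zero-based) column indices."""
    return {taxon: "".join(row[c] for c in columns)
            for taxon, row in matrix.items()}

def bootstrap_columns(n_chars, rng):
    """Draw n_chars column indices with replacement (sorted for readability)."""
    return sorted(rng.randrange(n_chars) for _ in range(n_chars))

# The draw shown on the slide: characters 1 2 2 5 5 6 6 8.
slide_columns = [0, 1, 1, 4, 4, 5, 5, 7]
print(resample(matrix, slide_columns))

# A few random replicates; each has the same size as the original.
rng = random.Random(2012)
for _ in range(3):
    cols = bootstrap_columns(8, rng)
    print([c + 1 for c in cols], resample(matrix, cols))
```

Each replicate would then be analysed with the chosen tree-building method, and the frequency of each group across replicates gives its bootstrap proportion.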
Phylogenetic systematics
• Uses tree diagrams to portray relationships based upon recency of common ancestry
• There are two types of trees commonly displayed in publications:
  – Cladograms
  – Phylograms
Cladograms and phylograms
[Figure: the same tree of Bacterium 1-3 and Eukaryote 1-4 drawn once as a cladogram and once as a phylogram.]
Cladograms show branching order; branch lengths are meaningless.
Phylograms show branching order and branch lengths.
Rooting trees using an outgroup
[Figure: an unrooted tree of three archaea and four eukaryotes, and the same tree rooted by a bacterial outgroup, with the root and two monophyletic groups marked.]
Groups on trees
A monophyletic group (a clade) contains species derived from a unique common ancestor with respect to the rest of the tree
A polyphyletic group is not a group at all! (e.g. if we put all things with wings in a single group)
A paraphyletic group is one which includes only some of the descendants of a common ancestor (e.g. a group comprising animals without humans would be paraphyletic)
Baldauf (2003). Phylogeny for the faint of heart: a tutorial. Trends in Genetics 19:345-351.
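On a rooted tree, a group is monophyletic exactly when it equals the complete set of leaves below some node. The sketch below checks this for a small hypothetical tree; the taxa, the nested-tuple encoding and the function names are assumptions made for the example, not material from the slides.

```python
def leaves(tree):
    """Set of leaf names below a node (a leaf is a string, a node a tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clades(tree):
    """All leaf sets that descend from a single node of a rooted tree."""
    found = {leaves(tree)}
    if not isinstance(tree, str):
        for child in tree:
            found |= clades(child)
    return found

def is_monophyletic(tree, group):
    """True if the group is exactly the leaf set of one node."""
    return frozenset(group) in clades(tree)

# Hypothetical rooted tree, for illustration only.
tree = ((("human", "chimp"), "mouse"), "chicken")

print(is_monophyletic(tree, {"human", "chimp"}))             # a clade
print(is_monophyletic(tree, {"chimp", "mouse", "chicken"}))  # all but human
```

The second group leaves out one descendant of the common ancestor (human), so it is paraphyletic and the check fails.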
Is there a molecular clock?
• The idea of a molecular clock was initially
suggested by Zuckerkandl and Pauling in
1962
• They noted that rates of amino acid
replacements in animal haemoglobins were
roughly proportional to time - as judged
against the fossil record
Introducing time in trees:
the molecular clock
The molecular clock for alpha-globin: each point represents the number of substitutions separating each animal from humans.
[Plot: number of substitutions (0 to 100) against time to common ancestor (0 to 500 million years), with points for cow, platypus, chicken, carp and shark.]
Rates of amino acid replacement in different proteins

Protein           Rate (mean replacements per site per 10^9 years)
Fibrinopeptides   8.3
Insulin C         2.4
Ribonuclease      2.1
Haemoglobins      1.0
Cytochrome C      0.3
Histone H4        0.01
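To get a feel for these rates, they can be converted into an expected number of replacements over a given time. The sketch below assumes a strictly constant rate along a single lineage and ignores multiple replacements at the same site; the function names are illustrative, and only the rates come from the table.

```python
# Mean replacements per site per 10^9 years, copied from the table.
RATES = {
    "Fibrinopeptides": 8.3,
    "Insulin C": 2.4,
    "Ribonuclease": 2.1,
    "Haemoglobins": 1.0,
    "Cytochrome C": 0.3,
    "Histone H4": 0.01,
}

def expected_replacements(protein, million_years):
    """Expected replacements per site after the given time (constant rate)."""
    return RATES[protein] * million_years / 1000.0

def years_for_one_percent(protein):
    """Years needed to accumulate 0.01 replacements per site."""
    return 0.01 / RATES[protein] * 1e9

for name in RATES:
    print(name, expected_replacements(name, 100), years_for_one_percent(name))
```

Under these assumptions the fastest protein in the table changes 830 times faster than the slowest, which is why different proteins suit different time scales.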
Small subunit ribosomal RNA
18S or 16S rRNA
There is no universal molecular clock
• The initial proposal saw the clock as a Poisson process with a constant rate
• Now known to be more complex: differences in rates occur for:
  • different sites in a molecule
  • different genes
  • different regions of genomes
  • different genomes in the same cell
  • different taxonomic groups for the same gene
• There is no universal molecular clock affecting all genes
• There might be ‘local’ clocks but they need to be carefully tested and calibrated
Chaperonin 60 Protein Maximum Likelihood Tree (PROTML, Roger et al. 1998,
PNAS 95: 229)
Longest branches
Rate heterogeneity is a common problem in phylogenetic analyses
• Differences in rates occur between:
  • different sites in a molecule (e.g. at different codon positions)
  • different genes on genomes
  • different regions of genomes
  • different genomes in the same cell
  • different taxonomic groups for the same gene
• We need to consider these issues when we make trees, otherwise we can get the wrong tree
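Multiple changes at a single site: hidden changes
[Figure: at one site, one sequence undergoes a single change (C -> A) while the other undergoes three successive changes (C -> G -> T -> A); the number of changes along each lineage is marked.]
Seq 1  AGCGAG
Seq 2  GCGGAC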
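Because repeated changes at one site are hidden, the observed proportion of differing sites underestimates the number of changes that actually occurred. One common way to correct for this, not named on the slide, is the Jukes-Cantor model, which assumes that all substitutions are equally likely. The sketch below applies it to the two sequences above; the function names are illustrative.

```python
import math

def p_distance(a, b):
    """Observed proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jukes_cantor(p):
    """Estimated substitutions per site under the Jukes-Cantor model.

    Only defined for p < 0.75, where the sequences are not yet saturated.
    """
    return -0.75 * math.log(1 - 4 * p / 3)

seq1 = "AGCGAG"
seq2 = "GCGGAC"

p = p_distance(seq1, seq2)   # four of the six sites differ
d = jukes_cantor(p)
print(p, d)
```

The corrected distance is always at least as large as the observed proportion, and the gap widens as the sequences diverge.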
Convergence can also mislead our methods:
• Thermophilic convergence or biased codon usage patterns may obscure phylogenetic signal
% Guanine + Cytosine in 16S rRNA genes from mesophiles and thermophiles

                            %GC all sites   %GC variable sites
Thermophiles:
  Thermotoga maritima       62              72
  Thermus thermophilus      64              72
  Aquifex pyrophilus        65              73
Mesophiles:
  Deinococcus radiodurans   55              52
  Bacillus subtilis         55              50
Gene trees and species trees
We often assume that gene trees give us species trees
[Figure: a gene tree for genes a, b and c next to a species tree for species A, B and D.]
Gene trees and species trees: why might they differ?
• Gene duplication
• Horizontal gene transfer between species
• Gene analysis can produce trees that conflict with accepted ideas of species relationships based upon external data
??
Mitochondrial (mt) genomes of Sauropsida (reptiles+birds).
A. Llanes. Univ. de La Habana
Thanks to:
Hernán Dopazo, CSAT - Príncipe Felipe, Valencia
Federico Abascal, Centro Nacional de Biotecnología, Madrid
Rafael Zardoya, Museo Nacional de Ciencias Naturales, Madrid
Alejandro Llanes, Universidad de La Habana, Cuba
Questions…