Paulino Gomez-Puertas. Bioinformatics: molecular phylogenies. Introduction to Bioinformatics, 2012


Page 1 (source: bioweb.cbm.uam.es/Courses/MasterVirol2014/Trees/Trees_intro.pdf)

Paulino Gomez-Puertas. Bioinformatics.

Molecular phylogenies

Introduction to Bioinformatics

2012

Page 2

Richard Owen

Page 3

Owen's definition of homology

• Homologue: the same organ under every variety of form and function (true or essential correspondence - homology)

• Analogy: superficial or misleading similarity

Richard Owen, 1843

Page 4

Charles Darwin

Page 5

Darwin and homology

• “The natural system is based upon descent with modification ... the characters that naturalists consider as showing true affinity (i.e. homologies) are those which have been inherited from a common parent, and, in so far as all true classification is genealogical; that community of descent is the common bond that naturalists have been seeking”

Charles Darwin, Origin of Species, 1859, p. 413

Page 6

Homology is...

• Homology: similarity that is the result of inheritance from a common ancestor

• The identification and analysis of homologies is central to phylogenetics (the study of the evolutionary history of genes and species)

• Similarity and homology are not the same thing, although the two terms are often, and wrongly, used interchangeably

Page 7

hypothesis:

SIMILARITY implies HOMOLOGY

HIGHER SIMILARITY implies CLOSER HOMOLOGY

Page 8

Clustering methods:

- UPGMA (Unweighted Pair Group Method with Arithmetic mean)

- Neighbour Joining (N. Saitou & M. Nei)

(corrects for unequal rates of evolution in different branches of the tree)

Cladistic methods: (patterns of ancestry)

- Maximum parsimony

- Maximum likelihood (assigns quantitative probabilities to mutational events, rather than merely counting them).

Page 9

Clustering methods:

- UPGMA (Unweighted Pair Group Method with Arithmetic mean)

- Neighbour Joining (N. Saitou & M. Nei)

(corrects for unequal rates of evolution in different branches of the tree)

A (simplistic) UPGMA example:

         legs/fins   eggs/placenta   gills/lungs   warm/cold-blooded
cat      L           P               L             W
whale    F           P               L             W
lizard   L           E               L             C
trout    F           E               B             C

(B = branchiae, i.e. gills)

distance matrix (number of characters that differ between two animals):

     C  W  L  T
C    0  1  2  4
W       0  3  3
L          0  2
T             0
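The distance matrix above can be reproduced by counting, for each pair of animals, the characters that differ. The following is a minimal Python sketch under that assumption; the names `characters`, `mismatches` and `distance_matrix` are illustrative and do not come from the slides.

```python
from itertools import combinations

# Character table from the slide, one letter per character:
# legs/fins, eggs/placenta, gills/lungs, warm/cold-blooded.
characters = {
    "cat":    "LPLW",
    "whale":  "FPLW",
    "lizard": "LELC",
    "trout":  "FEBC",
}

def mismatches(a, b):
    """Count the positions at which two equal-length character strings differ."""
    if len(a) != len(b):
        raise ValueError("character strings must have the same length")
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(table):
    """Return {(taxon1, taxon2): distance} for every unordered pair of taxa."""
    return {
        (s, t): mismatches(table[s], table[t])
        for s, t in combinations(table, 2)
    }

distances = distance_matrix(characters)
for pair, d in distances.items():
    print(pair, d)
```

Only the upper triangle is stored, mirroring the way the slide writes the matrix.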

Page 10

UPGMA

distance matrix:

     C  W  L  T
C    0  1  2  4
W       0  3  3
L          0  2
T             0

smallest nonzero distance: d(C, W) = 1, so cat and whale are joined first, each on a branch of length 0.5

reduced distance matrix:

          (C / W)   L                 T
(C / W)   0         1/2(2+3) = 2.5    1/2(4+3) = 3.5
L                   0                 2
T                                     0

Page 11

UPGMA

reduced distance matrix:

          (C / W)   L     T
(C / W)   0         2.5   3.5
L                   0     2
T                         0

smallest nonzero distance: d(L, T) = 2, so lizard and trout are joined next, each on a branch of length 1

reduced distance matrix:

          (C / W)   (L / T)
(C / W)   0         1/2(2.5+3.5) = 3
(L / T)             0

[Figure: partial tree with cat and whale joined, branch lengths 0.5 each.]

Page 12

UPGMA

original distance matrix:

     C  W  L  T
C    0  1  2  4
W       0  3  3
L          0  2
T             0

reduced distance matrix:

          (C / W)   (L / T)
(C / W)   0         3
(L / T)             0

smallest nonzero distance: d((C / W), (L / T)) = 3, so the two clusters are joined last, at a depth of 3/2 = 1.5 from every tip

[Figure: final UPGMA tree. cat and whale sit on branches of length 0.5 each; lizard and trout sit on branches of length 1 each; the internal branches to the root have length 1 on the cat/whale side and 0.5 on the lizard/trout side.]
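The worked example can be followed step by step in code. Below is a minimal Python sketch of UPGMA, assuming trees are represented as nested tuples and distances are stored per unordered pair; the function name `upgma` and the variables `taxa` and `slide_distances` are illustrative, not taken from the slides. New distances are averaged over all tips of the merged clusters, which for this example gives the same numbers as the simple halving shown on the slides.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels.
    dist:  dict mapping a pair (a, b) to the distance between taxa a and b.
    Returns (tree, merges): tree is a nested tuple of labels, merges is a
    list of (left, right, node_height) in the order the joins were made.
    """
    clusters = {name: (1, 0.0) for name in names}  # cluster -> (tips, height)
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    while len(clusters) > 1:
        # Join the two active clusters separated by the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        size_a, size_b = clusters[a][0], clusters[b][0]
        new = (a, b)
        # Distance from the new cluster to every other one: average over tips.
        for c in clusters:
            if c != a and c != b:
                d[frozenset((new, c))] = (
                    size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
                ) / (size_a + size_b)
        merges.append((a, b, height))
        del clusters[a], clusters[b]
        clusters[new] = (size_a + size_b, height)
    (tree,) = clusters
    return tree, merges

taxa = ["cat", "whale", "lizard", "trout"]
slide_distances = {
    ("cat", "whale"): 1, ("cat", "lizard"): 2, ("cat", "trout"): 4,
    ("whale", "lizard"): 3, ("whale", "trout"): 3, ("lizard", "trout"): 2,
}
tree, merges = upgma(taxa, slide_distances)
print(tree)
for left, right, height in merges:
    print(left, "+", right, "joined at height", height)
```

The length of an internal branch is the height of the parent node minus the height of the child node, which is how the final tree gets 1.5 - 0.5 = 1 above cat/whale and 1.5 - 1 = 0.5 above lizard/trout.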

Page 13

UPGMA using protein/DNA multiple sequence alignments:

protein alignment:

cat      K E D D
whale    K E R D
lizard   E E D R
trout    E D R R

DNA alignment:

cat      A T C C
whale    A T G C
lizard   T T C G
trout    T C G G

distance matrix (number of mismatched positions; both alignments give the same values):

     C  W  L  T
C    0  1  2  4
W       0  3  3
L          0  2
T             0

[Figure: the resulting UPGMA tree, built in two steps. cat and whale on branches of 0.5 each; lizard and trout on branches of 1 each; internal branches of 1 (cat/whale side) and 0.5 (lizard/trout side).]
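As a quick check, the mismatch counts for the two alignments can be computed and compared. This is a small Python sketch; the names `protein`, `dna` and `pairwise_differences` are illustrative assumptions, and a plain mismatch count is used as the distance, as in the character example.

```python
from itertools import combinations

# Alignments copied from the slide, one column per character.
protein = {"cat": "KEDD", "whale": "KERD", "lizard": "EEDR", "trout": "EDRR"}
dna = {"cat": "ATCC", "whale": "ATGC", "lizard": "TTCG", "trout": "TCGG"}

def pairwise_differences(alignment):
    """Count mismatched columns for every unordered pair of aligned sequences."""
    return {
        (s, t): sum(x != y for x, y in zip(alignment[s], alignment[t]))
        for s, t in combinations(alignment, 2)
    }

protein_d = pairwise_differences(protein)
dna_d = pairwise_differences(dna)
print(dna_d)
print("same matrix for both alignments:", protein_d == dna_d)
```

Real analyses usually divide by the alignment length and correct for multiple substitutions; the raw count is kept here only to match the slide.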

Page 14

Cladistic methods: (patterns of ancestry)

- Maximum parsimony

- Maximum likelihood

Maximum likelihood: assigns quantitative probabilities to mutational events, rather than merely counting them.

Maximum parsimony example (ATCG, ATGG, TCCA, TTCA)

[Figure: first tree, four mutations. Root ATCA; internal nodes ATCG and TTCA; leaves ATCG and ATGG under ATCG, TTCA and TCCA under TTCA. Substitutions marked: A->G, A->T, C->G, T->C.]

[Figure: second tree, captioned "seven mutations". Root ATCG; internal nodes ATCA and TTCG; leaves in the order ATCG, TCCA, TTCA, ATGG. Substitutions marked: A->T, T->C, T->A, C->G, G->A, A->T, G->A, A->G.]
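Counting mutations on a given tree can be automated. The sketch below uses Fitch's algorithm, a standard method that the slide does not name, to find the minimum number of changes for a fixed grouping of the four sequences. The nested-tuple tree encoding and the names `fitch_site`, `parsimony_length`, `grouping_1` and `grouping_2` are assumptions made for illustration, and `grouping_2` is a guess at the second tree based on the slide's leaf order.

```python
def fitch_site(tree, site):
    """Return (possible ancestral states, minimum changes) for one column.

    tree: a leaf sequence (str) or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {tree[site]}, 0
    left_states, left_cost = fitch_site(tree[0], site)
    right_states, right_cost = fitch_site(tree[1], site)
    common = left_states & right_states
    if common:
        # The children can agree on a state: no extra change at this node.
        return common, left_cost + right_cost
    # The children disagree: one change is needed on a branch below this node.
    return left_states | right_states, left_cost + right_cost + 1

def parsimony_length(tree, n_sites):
    """Minimum total number of changes over all columns for this topology."""
    return sum(fitch_site(tree, i)[1] for i in range(n_sites))

grouping_1 = (("ATCG", "ATGG"), ("TCCA", "TTCA"))
grouping_2 = (("ATCG", "TCCA"), ("TTCA", "ATGG"))
print(parsimony_length(grouping_1, 4))
print(parsimony_length(grouping_2, 4))
```

For the first grouping this reproduces the four mutations on the slide. For the second grouping the algorithm reports the minimum over all possible ancestral sequences, so it can come out lower than the count drawn on the slide, which uses one particular choice of ancestors. Either way the first grouping needs fewer changes and is the one maximum parsimony prefers.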

Page 15

Bootstrapping

• Characters are resampled with replacement to create many bootstrap replicate data sets

• Each bootstrap replicate data set is analysed (e.g. with parsimony, distance, ML)

• Agreement among the resulting trees is summarized with a majority-rule consensus tree

• Frequency of occurrence of groups, bootstrap proportions (BPs), is a measure of support for those groups

• Additional information is given in partition tables

Page 16

Bootstrapping

Original data matrix

Characters   1 2 3 4 5 6 7 8
A            R R Y Y Y Y Y Y
B            R R Y Y Y Y Y Y
C            Y Y Y Y Y R R R
D            Y Y R R R R R R
Outgp        R R R R R R R R

Resampled data matrix

Characters   1 2 2 5 5 6 6 8
A            R R R Y Y Y Y Y
B            R R R Y Y Y Y Y
C            Y Y Y Y Y R R R
D            Y Y Y R R R R R
Outgp        R R R R R R R R

[Figure: trees of A, B, C, D and the outgroup for the original and the resampled matrix, with the supporting characters (1 to 8, and 1 2 2 5 5 6 6 8) mapped onto the branches.]

Randomly resample characters from the original data with replacement to build many bootstrap replicate data sets of the same size as the original - analyse each replicate data set

Summarise the results of multiple analyses with a majority-rule consensus tree

Bootstrap proportions (BPs) are the frequencies with which groups are encountered in analyses of replicate data sets

[Figure: majority-rule consensus tree of A, B, C, D and the outgroup, with bootstrap proportions of 96% and 66% on its internal branches.]
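The resampling step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `take_columns`, `bootstrap_replicate` and `bootstrap_proportion` are hypothetical helper names, and `a_b_grouped` is a deliberately crude stand-in for a real tree search (it says that A and B group together when some sampled character is shared by A and B and by no other taxon).

```python
import random

# Original data matrix from the slide (characters 1 to 8).
matrix = {
    "A":     "RRYYYYYY",
    "B":     "RRYYYYYY",
    "C":     "YYYYYRRR",
    "D":     "YYRRRRRR",
    "Outgp": "RRRRRRRR",
}

def take_columns(data, columns):
    """Build a data set from the given 0-based character indices."""
    return {taxon: "".join(row[i] for i in columns) for taxon, row in data.items()}

def bootstrap_replicate(data, rng):
    """Resample characters with replacement, keeping the original size."""
    n_chars = len(next(iter(data.values())))
    columns = sorted(rng.randrange(n_chars) for _ in range(n_chars))
    return columns, take_columns(data, columns)

def bootstrap_proportion(data, group_found, n_replicates, seed=0):
    """Fraction of replicates in which group_found(replicate) is true."""
    rng = random.Random(seed)
    hits = sum(
        bool(group_found(bootstrap_replicate(data, rng)[1]))
        for _ in range(n_replicates)
    )
    return hits / n_replicates

def a_b_grouped(replicate):
    """Hypothetical stand-in for a tree analysis (not a real method)."""
    others = [t for t in replicate if t not in ("A", "B")]
    return any(
        replicate["A"][i] == replicate["B"][i]
        and all(replicate[t][i] != replicate["A"][i] for t in others)
        for i in range(len(replicate["A"]))
    )

# The replicate shown on the slide samples characters 1 2 2 5 5 6 6 8.
slide_replicate = take_columns(matrix, [0, 1, 1, 4, 4, 5, 5, 7])
print(slide_replicate)
print(bootstrap_proportion(matrix, a_b_grouped, n_replicates=1000))
```

In practice `group_found` would run a parsimony, distance or ML analysis on each replicate and check whether the group appears in the resulting tree.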

Page 17

Phylogenetic systematics

• Uses tree diagrams to portray relationships based upon recency of common ancestry

• There are two types of trees commonly displayed in publications:

  – Cladograms

  – Phylograms

Page 18

Cladograms and phylograms

[Figure: the same tree of Bacterium 1-3 and Eukaryote 1-4 drawn twice, once as a cladogram and once as a phylogram.]

Cladograms show branching order - branch lengths are meaningless

Phylograms show branch order and branch lengths

Page 19

Rooting trees using an outgroup

[Figure: an unrooted tree of three archaea and four eukaryotes, and the same tree rooted by a bacterial outgroup. In the rooted tree the root is marked and two clades are labelled "Monophyletic group".]

Page 20

Groups on trees

Baldauf (2003). Phylogeny for the faint of heart: a tutorial. Trends in Genetics 19:345-351.

A monophyletic group (a clade) contains species derived from a unique common ancestor with respect to the rest of the tree

A polyphyletic group is not a group at all! (e.g. if we put all things with wings in a single group)

A paraphyletic group is one which includes only some of the descendants of a common ancestor (e.g. a group comprising animals without humans would be paraphyletic)
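The definition of a monophyletic group translates directly into a test on a rooted tree: a set of tips is monophyletic when some node has exactly that set below it. The Python sketch below assumes trees are written as nested tuples; the tree and the taxon names A to E are hypothetical and serve only as an illustration.

```python
def tips(tree):
    """Return the set of tip labels under a tree given as nested tuples."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(tips(child) for child in tree))

def is_monophyletic(tree, group):
    """True if some node of the rooted tree has exactly `group` as its tips."""
    group = set(group)
    if tips(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)

# Hypothetical rooted tree: ((A, B), C) on one side, (D, E) on the other.
example_tree = ((("A", "B"), "C"), ("D", "E"))

print(is_monophyletic(example_tree, {"A", "B"}))       # a clade
print(is_monophyletic(example_tree, {"A", "B", "C"}))  # a clade
print(is_monophyletic(example_tree, {"A", "C"}))       # leaves out B
print(is_monophyletic(example_tree, {"B", "D"}))       # tips from separate clades
```

A group that fails this test is either paraphyletic (it leaves out some descendants of its common ancestor, as `{"A", "C"}` does here) or polyphyletic; telling the two apart needs more than this check.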

Page 21

Is there a molecular clock?

• The idea of a molecular clock was initially suggested by Zuckerkandl and Pauling in 1962

• They noted that rates of amino acid replacements in animal haemoglobins were roughly proportional to time - as judged against the fossil record

Page 22

Introducing time in trees: the molecular clock

Page 23

The molecular clock for alpha-globin: each point represents the number of substitutions separating each animal from humans

[Figure: scatter plot. x-axis: time to common ancestor (millions of years), 0 to 500; y-axis: number of substitutions, 0 to 100. Points labelled cow, platypus, chicken, carp, shark.]

Page 24

Rates of amino acid replacement in different proteins

Protein            Rate (mean replacements per site per 10^9 years)
Fibrinopeptides    8.3
Insulin C          2.4
Ribonuclease       2.1
Haemoglobins       1.0
Cytochrome C       0.3
Histone H4         0.01
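To get a feel for these numbers, the rates can be converted into the expected number of replacements per site over a chosen time span, assuming a constant rate. This is a minimal Python sketch; the function name `expected_replacements` and the 100-million-year span are illustrative assumptions, and the result applies to whatever path (single lineage or pair of lineages) the tabulated rate was measured on.

```python
# Rates from the table: mean replacements per site per 10^9 years.
rates = {
    "Fibrinopeptides": 8.3,
    "Insulin C": 2.4,
    "Ribonuclease": 2.1,
    "Haemoglobins": 1.0,
    "Cytochrome C": 0.3,
    "Histone H4": 0.01,
}

def expected_replacements(rate_per_site_per_gyr, million_years):
    """Expected replacements per site over the given time at a constant rate."""
    return rate_per_site_per_gyr * million_years / 1000.0

for protein, rate in rates.items():
    print(f"{protein:16s} {expected_replacements(rate, 100):.4f}")
```

The spread between fibrinopeptides and histone H4 is close to three orders of magnitude, which is why different proteins suit different time scales.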

Page 25

Small subunit ribosomal RNA

18S or 16S rRNA

Page 26

There is no universal molecular clock

• The initial proposal saw the clock as a Poisson process with a constant rate

• Now known to be more complex - differences in rates occur for:

  • different sites in a molecule

  • different genes

  • different regions of genomes

  • different genomes in the same cell

  • different taxonomic groups for the same gene

• There is no universal molecular clock affecting all genes

• There might be ‘local’ clocks but they need to be carefully tested and calibrated

Page 27

Chaperonin 60 Protein Maximum Likelihood Tree (PROTML, Roger et al. 1998, PNAS 95: 229)

[Figure: the tree, with the longest branches labelled.]

Page 28

Rate heterogeneity is a common problem in phylogenetic analyses

• Differences in rates occur between:

  • different sites in a molecule (e.g. at different codon positions)

  • different genes on genomes

  • different regions of genomes

  • different genomes in the same cell

  • different taxonomic groups for the same gene

• We need to consider these issues when we make trees - otherwise we can get the wrong tree

Page 29

Multiple changes at a single site - hidden changes

[Figure: number of changes at a single site in two sequences. One path goes C -> A (1 change); the other goes C -> G -> T -> A (3 changes).]

Seq 1   AGCGAG
Seq 2   GCGGAC
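Because repeated changes at one site are invisible in a pairwise comparison, observed differences underestimate the true number of substitutions. One common way to compensate, not mentioned on the slide and used here only as an illustration, is the Jukes-Cantor correction d = -(3/4) ln(1 - (4/3) p), where p is the observed proportion of differing sites. The function names below are assumptions.

```python
import math

def p_distance(seq1, seq2):
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)

def jukes_cantor(p):
    """Estimated substitutions per site under the Jukes-Cantor model."""
    if p >= 0.75:
        raise ValueError("correction is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

p = p_distance("AGCGAG", "GCGGAC")
d = jukes_cantor(p)
print(f"observed differences per site: {p:.3f}")
print(f"corrected distance per site:   {d:.3f}")
```

The corrected distance is always larger than the observed proportion, and the gap widens as the sequences become more divergent.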

Page 30

Convergence can also mislead our methods:

• Thermophilic convergence or biased codon usage patterns may obscure phylogenetic signal

Page 31

% Guanine + Cytosine in 16S rRNA genes from mesophiles and thermophiles

                          %GC all sites   %GC variable sites
Thermophiles:
Thermotoga maritima       62              72
Thermus thermophilus      64              72
Aquifex pyrophilus        65              73
Mesophiles:
Deinococcus radiodurans   55              52
Bacillus subtilis         55              50
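The percentages in the table are simple base counts. A minimal Python sketch of such a count follows; the function name `gc_percent` is illustrative, and the choice to ignore gaps and ambiguity codes is an assumption rather than the method used for the table.

```python
def gc_percent(sequence):
    """Percentage of G and C among the unambiguous bases of a sequence."""
    counted = [base for base in sequence.upper() if base in "ACGTU"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * sum(base in "GC" for base in counted) / len(counted)

print(gc_percent("GGCC"))
print(gc_percent("AUGC"))
```

To obtain the "variable sites" column, the same function would be applied only to alignment columns that differ among the species compared.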

Page 32

Gene trees and species trees

We often assume that gene trees give us species trees

[Figure: a gene tree with tips a, b, c drawn next to a species tree with tips A, B, D.]

Page 33

Gene trees and species trees - why might they differ?

• Gene duplication

• Horizontal gene transfer between species

• Gene analysis can produce trees that conflict with accepted ideas of species relationships based upon external data

Page 34

??

Mitochondrial (mt) genomes of Sauropsida (reptiles+birds).

A. Llanes. Univ. de La Habana

Page 35

Thanks to:

Hernán Dopazo, CSAT - Príncipe Felipe, Valencia

Federico Abascal, Centro Nacional de Biotecnología, Madrid

Rafael Zardoya, Museo Nacional de Ciencias Naturales, Madrid

Alejandro Llanes, Universidad de La Habana, Cuba

Page 36

Questions…