
Post on 27-May-2017


Page 1: Tekaia Evol Trends Proteomes

Exploring Evolutionary Trends in Proteomes

[Figure labels: Eukaryotes; Hyperthermophiles; Psychrophiles; Prokaryotes mesophiles; Thermophiles]

Fredj Tekaia Edouard Yeramian

Institut Pasteur [email protected]

Page 2: Tekaia Evol Trends Proteomes

Tree of life

[Figure: tree of life, with the numbers 433, 36 and 46 as labels]

http://www.genomesonline.org/

Complete genomes: 2434 projects
• 520 published (01-03-07)
• 1086 Bacteria
• 59 Archaea
• 696 eukaryotes
• 73 metagenomes

• 3 phylogenetic domains;
• Lifestyles: mesophiles; (hyper)thermophiles; psychrophiles; extreme conditions,...

Page 3: Tekaia Evol Trends Proteomes

• In the post-genomic era, multidimensional data resulting from large-scale genome comparisons are available.

• Multivariate analysis methods are particularly helpful for the discovery of evolutionary trends associated with such data.

• The analyses are data-driven and exploratory, as opposed to model-driven methods.

Page 4: Tekaia Evol Trends Proteomes

Methodology

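[Diagram: data matrix T = (k_ij), k_ij > 0, rows i = 1..n and columns j = 1..p, plus supplementary ("sup") rows and columns → Correspondence Analysis → factorial axes F1 ... Fp]

Coordinate of a supplementary element i_s on a factorial axis:

$F(i_s) = \lambda^{-1/2} \sum_{j=1}^{p} f^{i_s}_j \, G(j)$

where $f^{i_s}_j$ is the profile of i_s, G(j) the coordinate of column j and λ the eigenvalue of the axis.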
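As a rough illustration of the method on this slide, the sketch below computes the first factorial axis of a correspondence analysis and projects a supplementary row onto it with the transition formula above. It is a minimal pure-Python sketch, not the authors' implementation: the function names, the power-iteration solver and the small count table are all assumptions made for the example.

```python
import math

def correspondence_axis(table, iterations=500):
    """First factorial axis of a correspondence analysis (power iteration)."""
    total = sum(map(sum, table))
    n_rows, n_cols = len(table), len(table[0])
    r = [sum(row) / total for row in table]                      # row masses
    c = [sum(table[i][j] for i in range(n_rows)) / total
         for j in range(n_cols)]                                 # column masses
    # Standardized residuals S_ij = (p_ij - r_i c_j) / sqrt(r_i c_j)
    S = [[(table[i][j] / total - r[i] * c[j]) / math.sqrt(r[i] * c[j])
          for j in range(n_cols)] for i in range(n_rows)]
    v = [1.0 / (j + 1) for j in range(n_cols)]                   # arbitrary start vector
    for _ in range(iterations):
        u = [sum(S[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(S[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(S[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    lam = sum(x * x for x in u)                                  # eigenvalue of the axis
    F = [u[i] / math.sqrt(r[i]) for i in range(n_rows)]          # row coordinates
    G = [math.sqrt(lam) * v[j] / math.sqrt(c[j]) for j in range(n_cols)]  # column coordinates
    return lam, F, G

def project_supplementary(row, lam, G):
    """F(i_s) = lam^(-1/2) * sum_j f_j * G(j), with f the profile of the row."""
    total = sum(row)
    return sum((k / total) * g for k, g in zip(row, G)) / math.sqrt(lam)

# Hypothetical count table (4 rows, 3 columns) and one supplementary row.
table = [[20, 5, 2], [15, 8, 4], [3, 9, 14], [2, 6, 18]]
lam, F, G = correspondence_axis(table)
extra = project_supplementary([10, 10, 10], lam, G)
```

A supplementary row does not influence the axes; it is only positioned on them. Projecting an active row this way returns its own coordinate, which is a convenient sanity check.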

Page 5: Tekaia Evol Trends Proteomes

Methodology

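[Diagram: data matrix T = (k_ij), k_ij > 0, with supplementary ("sup") elements → Correspondence Analysis → factorial axes F1 ... Fp → Classification]

Properties of the factorial space used for the classification: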

• orthogonal system;

• use of euclidean distance;
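Because the factorial axes form an orthogonal system, ordinary Euclidean distances between factor coordinates can feed a classification step. The sketch below shows one possible way to do this; the slides do not say which clustering algorithm was used, so single-linkage agglomeration, the function names and the coordinates are assumptions for illustration only.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [{name} for name in points]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(euclidean(points[p], points[q])
                                 for p in pair[0] for q in pair[1]),
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters

# Hypothetical coordinates of four species on the first two factorial axes.
coords = {"sp1": (0.0, 0.1), "sp2": (0.1, 0.0), "sp3": (2.0, 2.1), "sp4": (2.1, 1.9)}
groups = single_linkage(coords, 2)
```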

Page 6: Tekaia Evol Trends Proteomes

1. Evolution of Proteomes: Signatures and Trends in Amino Acid Compositions

2. Genome Trees from Whole Proteome Comparisons

Page 7: Tekaia Evol Trends Proteomes

Evolution of Proteomes: Signatures and Trends in Amino Acid Compositions

[Figure labels: Eukaryotes; Hyperthermophiles; Psychrophiles; Prokaryotes mesophiles; Thermophiles]

Page 8: Tekaia Evol Trends Proteomes

Mining the wealth of information contained in complete genomes, to decipher genomic characteristics linked to the adaptive evolution of organisms in extreme conditions such as high or low temperatures, has long been a matter of interest:

• Kreil DP, Ouzounis CA (2001). Identification of thermophilic species by the amino acid compositions deduced from their genomes. NAR, 29: 1608-15.
• Tekaia F, Yeramian E, Dujon B (2002). Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297: 51-60.
• Suhre K, Claverie JM (2003). Genomic correlates of hyperthermostability, an update. J. Biol. Chem., 278: 17198-202.
• Hickey DA, Singer GA (2004). Genomic and proteomic adaptations to growth at high temperature. Genome Biol., 5: 117.
• Brocchieri L (2004). Environmental signatures in proteome properties. Proc Natl Acad Sci U S A., 101: 8257-8.
• Cavicchioli R (2006). Cold-adapted archaea. Nat. Rev. Microbiology, 4: 331-3.
• Lobry JR, Necsulea A (2006). Synonymous codon usage and its potential link with optimal growth temperature in prokaryotes. Gene, 385: 128-36.
• Zeldovich KB, Berezovsky IN, Shakhnovich EI (2007). Protein and DNA Sequence Determinants of Thermophilic Adaptation. PLoS Comput Biol. 3: e5.

Page 9: Tekaia Evol Trends Proteomes

The significant number of available completely sequenced genomes with different lifestyles offers an unprecedented opportunity to explore species evolution.

• Which universal properties can be deduced from amino acid compositions of proteomes?

• Are there specific properties associated with lifestyles and with phylogeny?

• What are the underlying evolutionary trends?

Among simple analyses:

amino acid composition of proteomes.
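The amino acid composition of a proteome is simply the percentage of each of the 20 amino acids over all its proteins. A minimal sketch, assuming the proteome is available as a list of one-letter protein sequences (the function name and the toy sequences are hypothetical):

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # same column order as the data table

def composition(proteins):
    """Percentage of each amino acid over all sequences of a proteome."""
    counts = Counter()
    for seq in proteins:
        # Ignore ambiguous or non-standard symbols such as X or *.
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}

# Toy "proteome" of two short sequences.
comp = composition(["MKTAYIAKQR", "MSLLNDXA"])
```

One such row per species gives the species-by-amino-acid table analysed in the following slides.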

Page 10: Tekaia Evol Trends Proteomes

Outline

• Methodology;

• Species considered and data analysed;

• Species and amino acids distributions;

• Amino acids distribution and comparison with theoretical and experimental model chronologies of amino acids recruitment into the genetic code;

• Example: application to predicting candidate thermostable proteins in Aspergillus fumigatus.

Page 11: Tekaia Evol Trends Proteomes

Methodology

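[Diagram: data matrix T = (k_ij), k_ij > 0, rows i = 1..n and columns j = 1..p, plus supplementary ("sup") rows and columns → Correspondence Analysis → factorial axes F1 ... Fp]

Coordinate of a supplementary element i_s on a factorial axis:

$F(i_s) = \lambda^{-1/2} \sum_{j=1}^{p} f^{i_s}_j \, G(j)$

where $f^{i_s}_j$ is the profile of i_s, G(j) the coordinate of column j and λ the eigenvalue of the axis.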

Page 12: Tekaia Evol Trends Proteomes

Previous work showed:

[Figure labels: GC%; Growth t°; Hyperthermophiles; Mesophiles; Thermophiles]

Tekaia, F., Yeramian, E. and Dujon, B. 2002. Gene 297: 51-60. 54 species.

Page 13: Tekaia Evol Trends Proteomes

Amino Acid composition of 208 proteomes (OGT: optimal growth temperature), including:

• 20 hyperthermophiles (HTH) (OGT >60°C up to 120°C),
• 7 thermophiles (TH) (OGT >50°C up to 60°C),
• 8 psychrophiles (PSYC) (OGT: -10°C, up to 15°C),
• 173 mesophiles (BMES) including 53 eukaryotes (EUK)

Sources: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj + specific sites

Data table: 222 rows (208 + 14 sup) vs 23 columns (20 aa + pol, char, hyd)

Page 14: Tekaia Evol Trends Proteomes

Amino Acid composition

Correspondence Analysis was used to explore relationships between species and amino acids.

[Table layout: 208 species rows, then 13 group rows; excerpt below. Values are percentages; PC = pol - char.]

org A R N D C Q E G H I L K M F P S T W Y V char pol hyd PC
sc 5.5 4.4 6.1 5.8 1.3 3.9 6.6 5.0 2.1 6.6 9.6 7.4 2.1 4.5 4.3 9.0 5.8 1.0 3.3 5.6 26.3 34.4 39.1 8.1
sp 6.3 4.8 5.2 5.3 1.5 3.8 6.5 5.0 2.3 6.1 9.8 6.4 2.1 4.6 4.7 9.4 5.6 1.1 3.4 6.0 25.3 33.9 40.7 8.6
ncu 8.7 6.2 3.7 5.6 1.1 4.3 6.5 7.2 2.5 4.4 8.4 5.1 2.2 3.4 6.5 8.3 6.1 1.4 2.6 6.0 25.8 33.3 40.8 7.5
ca 4.9 3.7 6.7 5.7 1.2 4.4 6.2 5.0 2.1 7.1 9.3 7.2 1.9 4.5 4.5 9.3 6.2 1.0 3.5 5.5 25.0 36.2 38.7 11.2
mgr 9.4 6.6 3.5 5.7 1.3 4.1 5.9 7.4 2.3 4.4 8.5 4.8 2.2 3.5 6.3 8.0 5.9 1.5 2.5 6.2 25.3 32.7 42.0 7.4
fg 8.2 5.8 3.9 5.9 1.3 4.0 6.2 6.7 2.4 5.1 8.7 5.1 2.3 3.8 5.9 8.1 6.1 1.5 2.8 6.1 25.4 32.9 41.6 7.5
an 8.6 6.2 3.7 5.6 1.2 4.0 6.2 6.8 2.4 5.0 9.2 4.6 2.0 3.7 6.0 8.4 6.0 1.5 2.9 6.1 24.9 32.9 42.0 8.0
ecun 5.0 6.7 3.9 5.5 2.0 2.3 8.1 6.5 1.9 6.7 9.5 7.1 3.0 4.8 3.4 8.0 4.1 0.8 3.6 7.0 29.3 30.4 40.2 1.1
...
HTH 7.4 5.8 3.5 4.7 0.8 2.0 8.3 7.4 1.6 7.4 10.6 7.0 2.2 4.2 4.5 5.2 4.4 1.1 3.9 8.0 27.4 27.0 45.4 -0.4
TH 9.0 6.3 3.6 5.3 0.8 3.1 6.4 7.5 1.9 7.0 9.9 4.7 2.6 4.0 4.7 6.1 5.1 1.2 3.6 7.4 24.6 29.7 45.6 5.2
PSYC 8.4 4.6 4.3 5.7 1.1 4.0 6.3 6.9 2.2 7.2 9.9 5.5 2.7 4.1 3.9 6.5 5.8 1.1 3.2 6.9 24.2 31.8 44.0 7.6
BMES 8.6 5.1 4.4 5.4 1.0 3.8 6.3 7.0 2.1 6.9 10.2 5.8 2.3 4.3 4.1 6.2 5.4 1.1 3.2 6.9 24.6 30.9 44.4 6.3
EUK 6.9 5.4 4.9 5.4 1.7 4.2 6.6 6.0 2.4 5.6 9.3 6.1 2.2 4.0 5.2 8.4 5.6 1.2 3.1 6.0 25.9 33.8 40.2 7.9
SPEC 7.6 6.1 4.8 5.1 1.8 4.0 6.3 6.1 2.5 4.9 8.8 5.7 2.2 3.6 5.7 8.8 5.8 1.2 2.9 6.0 25.8 34.2 39.9 8.4
A 6.7 5.4 4.8 5.4 1.2 2.6 7.8 6.3 1.8 7.3 9.6 6.9 2.3 4.1 4.0 6.7 5.0 1.1 3.9 7.1 27.2 30.5 42.2 3.3
B 9.4 5.8 4.1 5.4 1.0 4.1 6.0 7.3 2.1 5.6 10.1 5.0 2.2 3.9 4.7 6.6 5.5 1.4 3.0 6.8 24.3 31.5 44.0 7.2
E 6.9 5.7 4.4 5.3 2.0 4.6 6.6 6.0 2.6 4.8 9.1 5.8 2.2 3.8 5.8 8.7 5.7 1.2 2.9 5.9 26.0 34.2 39.7 8.2
EA 6.8 5.7 4.5 5.5 1.8 4.1 6.8 5.8 2.4 5.7 9.6 6.5 2.3 4.0 4.6 7.6 5.5 1.1 3.2 6.5 26.8 32.4 40.7 5.6
EB 7.4 5.5 4.3 5.5 1.5 4.0 6.3 6.7 2.5 5.4 9.5 5.4 2.2 4.1 5.4 7.7 5.6 1.3 3.1 6.5 25.1 33.0 41.8 7.9
AB 8.6 5.3 3.9 5.0 1.0 3.3 6.3 7.1 1.9 7.0 10.7 5.5 2.4 4.5 4.2 6.2 5.1 1.3 3.3 7.3 24.0 29.8 46.0 5.8
EAB 8.1 5.4 4.0 5.4 1.3 3.8 6.6 7.0 2.2 6.1 9.9 5.7 2.4 4.1 4.6 6.9 5.5 1.2 3.0 7.0 25.2 31.4 43.3 6.2
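The char, pol, hyd and PC columns can be derived from the 20 amino acid percentages. In the sketch below the class memberships are an assumption inferred from the table itself (they reproduce the sc row to within rounding), not a definition stated on the slides; the function name is hypothetical.

```python
# Class memberships inferred from the table rows (assumption).
CLASSES = {
    "char": "DEKRH",     # charged
    "pol": "NQSTYCG",    # polar
    "hyd": "AILMFPWV",   # hydrophobic
}

def class_percentages(comp):
    """Sum amino acid percentages per class; PC is pol minus char."""
    out = {name: round(sum(comp[aa] for aa in members), 1)
           for name, members in CLASSES.items()}
    out["PC"] = round(out["pol"] - out["char"], 1)
    return out

# The sc row of the table, in the table's column order.
SC = dict(zip("ARNDCQEGHILKMFPSTWYV",
              [5.5, 4.4, 6.1, 5.8, 1.3, 3.9, 6.6, 5.0, 2.1, 6.6,
               9.6, 7.4, 2.1, 4.5, 4.3, 9.0, 5.8, 1.0, 3.3, 5.6]))
sc_classes = class_percentages(SC)
```

Small differences from the published columns are expected, since the table values were rounded to one decimal before being summed here.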

Page 15: Tekaia Evol Trends Proteomes

Species specific comparisons

[Diagram: a new proteome (NP) is compared with each proteome P1 ... Pn, in both directions, using blastp, pam250, SEG filter]

Files produced for each proteome Pi:
• NP against Pi: bestnppi, allnppi, segmatchnppi
• Pi against NP: bestpinp, allpinp, segmatchpinp

File contents:
• bestnppi: np1 size pij e-value1 HS/IS/NS
• allnppi: np1 size pij e-value1 HS/IS/NS
           np1 size pik e-value HS/IS/NS

The expected number of HSPs with score at least S is given by: $E = K m n \, e^{-\lambda S}$.

m and n are the sequence and database lengths; K and λ are statistical parameters of the scoring system.
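The E-value formula can be evaluated directly. In this sketch the default values of K and lam are illustrative placeholders only; in practice they depend on the scoring matrix and gap penalties, and the function name is hypothetical.

```python
import math

def expected_hsps(score, m, n, K=0.041, lam=0.267):
    """E = K * m * n * exp(-lam * S): expected number of HSPs scoring at least S.

    K and lam are placeholder values, not those of any specific search.
    """
    return K * m * n * math.exp(-lam * score)

# Hypothetical query of 350 residues against a database of 2,000,000 residues.
e_low = expected_hsps(40, 350, 2_000_000)
e_high = expected_hsps(80, 350, 2_000_000)
```

The expectation falls exponentially with the score and grows linearly with the size of the search space m × n.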

• Paralogs • Orthologs

Page 16: Tekaia Evol Trends Proteomes

[Figure labels: Eukaryotes; Hyperthermophiles; Psychrophiles; Prokaryotes mesophiles; Thermophiles; Encephalitozoon cuniculi; Thermosynechococcus elongatus]

Page 17: Tekaia Evol Trends Proteomes

[Figure: species map with the trends GC% and growth t°. Labelled species, with genome GC% where given:]

• Mycoplasma mycoides: 23%
• Nocardia farcinica: 70%
• Streptomyces coelicolor: 72%
• Tetrahymena thermophila (Protists)
• Saccharomyces
• Entamoeba histolytica (Protists)
• Cryptosporidium hominis
• Leishmania major: 60%
• Cyanidioschyzon merolae
• Aspergillus fumigatus: 50%
• Homo sapiens
• Methanococcus jannaschii: 31%
• Pyrococcus abyssi: 44%
• Methanopyrus kandleri: 61%
• Thermus thermophilus: 69%
• Colwellia psychrerythraea
• Pseudoalteromonas haloplanktis
• Encephalitozoon cuniculi
• A. nidulans
• A. oryzae
• C. neoformans
• Mus musculus
• Rat
• Candida glabrata

Page 18: Tekaia Evol Trends Proteomes

Statistical characterization of the observed groups:

Mean amino acid compositions of the 3 groups were compared using:

-One-way analysis of variance;

-Newman-Keuls multiple comparison test to detect significant differences at the probability level of p<0.001.
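For reference, the F statistic of a one-way analysis of variance can be computed as below for one amino acid across groups. This is a minimal sketch with a hypothetical function name and toy values; it does not perform the Newman-Keuls post hoc test, and converting F to a p-value would need the F distribution from a statistics library.

```python
def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Toy percentages of one amino acid in three groups of species.
f_stat = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
```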

Page 19: Tekaia Evol Trends Proteomes

[Figure: bar chart, scale 0 to 11, one group of bars per amino acid, in the order V (Val), Y (Tyr), E (Glu), G (Gly), I (Ile), L (Leu), A (Ala), H (His), S (Ser), Q (Gln), T (Thr), C (Cys), D (Asp), P (Pro), N (Asn), R (Arg), M (Met), K (Lys), F (Phe), W (Trp); asterisks mark significant differences]

Mean aa composition in (hyper)thermophiles, prokaryotic mesophiles-psychrophiles and eukaryotes (*: sig. different at p<0.001)

Page 20: Tekaia Evol Trends Proteomes

[Figure: bar chart, scale 0 to 50, one group of bars per property: hyd, pol, pol-char, char; asterisks mark significant differences]

AA physico-chemical properties in (hyper)thermophiles, prokaryotic mesophiles-psychrophiles and eukaryotes (*: sig. different at p<0.001)

Page 21: Tekaia Evol Trends Proteomes

Amino acid signatures (p<0.001)

Groups compared: HTH-TH, BMES-PSYC, EUK

• Listed for all three groups: V (Val), H (His), S (Ser), pol, pol-char.
• Listed for only some of the groups: Y (Tyr), E (Glu), Q (Gln), T (Thr), D (Asp), G (Gly), I (Ile), L (Leu), C (Cys), hyd.

• R (Arg), M (Met), F (Phe), K (Lys), N (Asn) and W (Trp) show no significant difference (at p<0.001).

Page 22: Tekaia Evol Trends Proteomes

Species evolutionary trends

Page 23: Tekaia Evol Trends Proteomes

[Figure: map with trends T1 and T2; labels: GC%; growth t°; [moderate_temperature]-[low_GC]; [high_temperature]-[high_GC]; groups AB, A, EAB, B, SPEC, E, EB, EA; Ancient; Recent]

Page 24: Tekaia Evol Trends Proteomes

Comparison with model chronologies of amino acids recruitment into the genetic code

• Comparison of amino acid distribution with recent models of:
  • Jordan et al. Nature 433: 633-638 (2005)
  • Trifonov, J. Biomol. Struct. & Dyn. 22: 1-11 (2004)

• and with ancient amino acids:
  • Miller’s experiments: Science 117, 528-529. (1953)
  • Analysis of Murchison meteorite (1983)

Page 25: Tekaia Evol Trends Proteomes

Model of Jordan et al. 2005: A universal trend of amino acid gain and loss in protein evolution. Nature 433: 633-8.

• Jordan et al. analysed 15 sets of three-way alignments of orthologous proteins encoded by triplets of closely related genomes from 15 taxa representing all three domains of life (Bacteria, Archaea and Eukaryota), and used phylogenies to polarize amino acid substitutions.

• All amino acids with declining frequencies are thought to be among the first incorporated into the genetic code;

• conversely, all amino acids with increasing frequencies, except Ser, were probably recruited late.

Page 26: Tekaia Evol Trends Proteomes

Following observed frequencies, Jordan et al. (2005) subdivided amino acids into what they called:

• 4 strong “losers”: Pro, Ala, Glu and Gly (decline in at least 13 taxa/15), “thought to be among the first incorporated into the genetic code”, i.e. most ancient aa.

• 1 “weak loser”: Lys (lost in 10 taxa/15).

• 5 strong “gainers”: Cys, Met, His, Ser and Phe (accrue in 14/15 taxa), “were probably recruited late”, i.e. most recent aa.

• 4 “weak gainers”: Asn, Thr, Ile (accrue in 11 taxa/15) and Val (accrues slowly in all taxa).

• In contrast, the remaining six amino acids (Arg, Asp, Gln, Leu, Trp and Tyr) evolve more erratically.

Page 27: Tekaia Evol Trends Proteomes

A universal trend of aa gain and loss in protein evolution. Jordan et al., Nature 433, 633 (2005).

[Figure: map with trends T1 and T2; labels GC% and growth t°; amino acids highlighted by class:]

• “strong losers” in T1 (most ancient aa): Pro, Ala, Glu, Gly
• “weak gainers”: Asn, Thr, Ile, Val
• “strong gainers” in T2 (recruited late to the genetic code): His, Ser, Cys, Phe, Met

Page 28: Tekaia Evol Trends Proteomes

Model of Trifonov, E.N. 2004. The triplet code from first principles. J. Biomol. Struct. & Dyn. 22: 1-11.

• A consensus chronology of amino acids is built on the basis of 60 different criteria, each offering a certain temporal order.

• The chronology results in the consensus order:

G1 (Gly), A2 (Ala), D3 (Asp), V4 (Val), P5 (Pro), S6 (Ser), E7 (Glu), (L8 (Leu), T8 (Thr)), R10 (Arg), (I11 (Ile), Q11 (Gln), N11 (Asn)), H14 (His), K15 (Lys), C16 (Cys), F17 (Phe), Y18 (Tyr), M19 (Met), W20 (Trp).
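The two model chronologies can be cross-checked with a few lines of code: amino acids that Jordan et al. call strong losers (thought to be ancient) should have low ranks in Trifonov's consensus order, and strong gainers high ranks. The sketch below only encodes the ranks and classes quoted on these slides; the variable and function names are hypothetical.

```python
# Trifonov's consensus ranks, as listed on the slide (ties share a rank).
TRIFONOV_RANK = {
    "Gly": 1, "Ala": 2, "Asp": 3, "Val": 4, "Pro": 5, "Ser": 6, "Glu": 7,
    "Leu": 8, "Thr": 8, "Arg": 10, "Ile": 11, "Gln": 11, "Asn": 11,
    "His": 14, "Lys": 15, "Cys": 16, "Phe": 17, "Tyr": 18, "Met": 19, "Trp": 20,
}

# Classes of Jordan et al. (2005), as listed on the slides.
STRONG_LOSERS = ["Pro", "Ala", "Glu", "Gly"]
STRONG_GAINERS = ["Cys", "Met", "His", "Ser", "Phe"]

def mean_rank(amino_acids):
    """Average Trifonov rank of a set of amino acids."""
    return sum(TRIFONOV_RANK[aa] for aa in amino_acids) / len(amino_acids)

losers_rank = mean_rank(STRONG_LOSERS)     # low rank: early in the chronology
gainers_rank = mean_rank(STRONG_GAINERS)   # high rank: late in the chronology
```

Ser is the one strong gainer with an early Trifonov rank, in line with the exception noted by Jordan et al.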

Page 29: Tekaia Evol Trends Proteomes

Trifonov, E.N. (2004). The triplet code from first principles. J. Biomol. Struct. & Dyn. 22: 1-11.

[Figure: map with trends T1 and T2; labels GC% and growth t°; amino acids labelled with their rank in the consensus chronology: Gly1, Ala2, Asp3, Val4, Pro5, Ser6, Glu7, Leu8, Thr8, Arg10, Asn11, Gln11, Ile11, His14, Lys15, Cys16, Phe17, Tyr18, Met19, Trp20]

Page 30: Tekaia Evol Trends Proteomes

Comparison with ancient amino acids

Page 31: Tekaia Evol Trends Proteomes

Miller/Urey Experiment: 1953

• By the 1950s, scientists were in hot pursuit of the origin of life. The scientific community was examining what kind of environment would be needed to allow life to begin.

• In 1953, Miller took molecules which were believed to represent the major components of the early Earth's atmosphere and put them into a closed system.

• Miller's experiment showed that organic compounds such as amino acids, which are essential to cellular life, could be made easily under the conditions that scientists believed to be present on the early earth.

Page 32: Tekaia Evol Trends Proteomes

Miller, S.L. (1953). Production of aa under possible primitive earth conditions. Science 117, 528-529.

[Figure: map with trends T1 and T2; labels GC% and growth t°; amino acids marked with "+" signs: Gly, Ala, Val, Asp, Glu, Pro, Ser, Leu, Thr, Ile]

Page 33: Tekaia Evol Trends Proteomes

Murchison meteorite 09-28-1969

The Murchison meteorite fall occurred on September 28, 1969 over Murchison, Australia. Over 100 kilograms of this meteorite have been found. This meteorite is of possible cometary origin due to its high water content of 12%.

An abundance of amino acids found within this meteorite has led to intense study by researchers as to its origins. More than 92 different amino acids have been identified within the Murchison meteorite to date. Nineteen of these are found on Earth. The remaining amino acids have no apparent terrestrial source.

Page 34: Tekaia Evol Trends Proteomes

Cronin, J.R. and Pizzarello, S. (1983). Amino acids in meteorites. Adv Space Res. 3: 5-18. Murchison meteorite 28-09-1969

[Figure: map with trends T1 and T2; labels GC% and growth t°; amino acids marked with "+" signs: Leu, Asp, Val, Gly, Ala, Glu, Pro, Ile]

Page 35: Tekaia Evol Trends Proteomes

Conclusions:

• Simple description of amino acid compositions of proteomes (free from a priori model) revealed fundamental evolutionary properties:
  • segregation of eukaryotes;
  • segregation of hyperthermophiles;
  • non discrimination of psychrophiles.

• Amino acid signatures for hyperthermophiles and for eukaryotes.

Page 36: Tekaia Evol Trends Proteomes

Conclusions...:

• Correspondence Analysis helped to reveal these properties.

• Amino acids distribution is consistent with suggested model chronologies of their recruitment into the genetic code;

Page 37: Tekaia Evol Trends Proteomes

General Conclusion

• Amino acids are significant markers for species evolution.

Page 38: Tekaia Evol Trends Proteomes

Genome Trees from Whole Proteome Comparisons

Page 39: Tekaia Evol Trends Proteomes

Outline

• Species tree construction and difficulties;

• Post genome era species tree construction;

• Conservation profiles;

• Genome tree construction based on conservation profiles;

• Conclusions;

• References.

Page 40: Tekaia Evol Trends Proteomes

Species tree - Tree Of Life

The « Tree Of Life » has long served as a useful tool for describing the history and relationships of organisms over evolutionary time. One species is represented as a branching point, or node, on the tree, and the branches represent paths of descent from a parental node.

• 16/18s rRNA tree (Woese 1990): Woese and others have used rRNA comparisons to construct a “Tree Of Life” showing the evolutionary relationships of a wide variety of organisms.

Page 41: Tekaia Evol Trends Proteomes

The three-domain proposal based on the ribosomal RNA tree. Woese et al. PNAS. 87:4576-4579. (1990)

The two-empire proposal, separating eukaryotes from prokaryotes and eubacteria from archaebacteria. Mayr, E. PNAS 95:9720-23. (1998).

The three-domain proposal, with continuous lateral gene transfer among domains. Doolittle. Science 284:2124-8. (1999)

The ring of life, incorporating lateral gene transfer but preserving the prokaryote eukaryote divide. Rivera & Lake JA. Nature 431: 152-5. (2004)

Martin & Embley. Nature 431:134-7. (2004)

Page 42: Tekaia Evol Trends Proteomes

The 1.2-Megabase Genome Sequence of Mimivirus. Raoult et al. Science, 306: 1344-1350. (2004)

Genomic Databases and the Tree of Life. Keith A. Crandall and Jennifer E. Buhay. Science, 306: 1144-1145. (2004)

Prospects for Building the Tree of Life from Large Sequence Databases. Driskell et al. Science, 306: 1172-1174. (2004)

Page 43: Tekaia Evol Trends Proteomes

Pennisi, E. (1998). Genome data shake tree of life.Science 280:672-4.

New genome sequences are mystifying evolutionary biologists by revealing unexpected connections between microbes thought to have diverged hundreds of millions of years ago.

The article suggests constructing species trees from their whole gene content.

Page 44: Tekaia Evol Trends Proteomes

Genome phylogeny based on gene content. Snel, Bork, Huynen (1999). Nature Genetics 21, 108-110.

[Figure: genome tree with the three domains labelled E, A and B]

Page 45: Tekaia Evol Trends Proteomes

Tekaia, Lazcano & Dujon (1999). Genome Research 9: 550-7.

[Figure: genome tree with the three domains labelled E, A and B]

Page 46: Tekaia Evol Trends Proteomes

Tree of life

[Figure: tree of life, with the numbers 433, 36 and 46 as labels]

http://www.genomesonline.org/

Complete genomes: 2434 projects
• 520 published (01-03-07)
• 1086 Bacteria
• 59 Archaea
• 696 eukaryotes
• 73 metagenomes

The abundance of genome data is raising expectations of accurately depicting the evolutionary history of all genomes.

Idea: construct a species tree from many genes instead of only one gene.

Page 47: Tekaia Evol Trends Proteomes

Gene tree - Species tree

[Figure, from Genomes, 2nd edition (T.A. Brown, 2002): a species tree and a gene tree for species A, B and C along a time axis, with duplication and speciation events marked]

Page 48: Tekaia Evol Trends Proteomes

Problems with species tree construction

• Main difficulties in species tree construction include extensive incongruence between alternative phylogenies generated from single-gene data sets:
  - genes don't evolve at the same rate nor in the same way;
  - the evolutionary history inferred from one gene may be different from what another gene appears to show.

Page 49: Tekaia Evol Trends Proteomes

Alternative solutions: integrative methods

• “supertree”: the supertree approach estimates phylogenies for subsets of genes with good overlap, then combines these subtree estimates into a supertree (Bininda-Emonds et al. 2002).

• Depends on the ability to distinguish between orthologs and paralogs;

• Supertree approaches are controversial, in part because the methodology results in a degree of disconnection between the underlying genetic data and the final tree produced.

Page 50: Tekaia Evol Trends Proteomes

• “phylogenomic tree” (based on concatenation of a gene sample common to the considered species S1, ..., Sn);

• genes don't evolve at the same rate nor in the same way;

• a limited number of genes are shared among all species;

The tree of one percent (2006). Dagan and Martin. Genome Biology, 7:118.

Page 51: Tekaia Evol Trends Proteomes

More generally these methods suffer difficulties related to the phylogenetic tree construction:

• global sequence alignment (quality, gaps,...);

• different evolutionary histories of genes;

• substitution saturation;...

and

• more seriously from gene sampling difficulties.

Page 52: Tekaia Evol Trends Proteomes

Gene tree - Species tree: The gene sampling problem

[Figure, adapted from Linder, Moret, Nakhleh, Warnow: true species tree for species A, B and C, with a red and a blue gene copy]

Red is lost in C; Blue is lost in A and B.

gene tree ≠ species tree

Page 53: Tekaia Evol Trends Proteomes

Gene tree - Species tree: The gene sampling problem

[Figure: species A, B and C]

All red orthologs have been lost in the 3 species.

Luckily, sampling gives the blue orthologs: the true species tree is reconstructed.

Page 54: Tekaia Evol Trends Proteomes

Gene tree - Species tree: The gene sampling problem

[Figure: species A, B and C, each carrying both gene copies (A A, B B, C C)]

All versions of the gene are in the 3 species.

Gene trees are the same as the species tree.

Page 55: Tekaia Evol Trends Proteomes

A genome tree is another way to construct a species tree.

• The concept of a genome tree is based on overall gene content similarity (it considers more than single-gene information).

Page 56: Tekaia Evol Trends Proteomes

Methodology

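[Diagram: data matrix T = (k_ij), k_ij > 0, with supplementary ("sup") elements → Correspondence Analysis → factorial axes F1 ... Fp → Classification]

Properties of the factorial space used for the classification: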

• orthogonal system;

• use of euclidean distance;

Page 57: Tekaia Evol Trends Proteomes

Systematic Analysis of Completely Sequenced Organisms

• In silico species specific comparisons (Tekaia & Dujon. J. Mol. Evol. 1999)

(27 eucaryal, 19 archaeal and 33 bacterial species: 541880 proteins)

[Diagram: each proteome is compared with Proteome1 ... Proteomen using blastp, pam250, SEG filter]

• 99 species

(B: 33; A: 19; E:27)

• total of 541880 proteins

Page 58: Tekaia Evol Trends Proteomes

Systematic Analysis of Completely Sequenced Organisms

• In silico species specific comparisons (27 eucaryal, 19 archaeal and 33 bacterial species: 541880 proteins)

• Degree of ancestral duplication and of ancestral conservation between pairs of species;

• Families of paralogs (Partition-MCL);

• Families of orthologs (Partition-MCL);

• Distribution of orthologous families according to the three domains of life;

• Determination of the protein dictionary (orthologs);

• Determination of protein conservation profiles;

Page 59: Tekaia Evol Trends Proteomes

Genome trees: data matrices

T = {Tij ; i=1,n; j=1,n}, where n is the number of surveyed species.

Tij is the overall similarity score between species j and i.

• Ancestral duplication and ancestral conservation:

$T_{ij} = w_{ij} = \frac{\text{number of proteins in } j \text{ conserved in } i}{\text{size}(j)}, \quad i = 1, \dots, n;\ j = 1, \dots, n$

n = 99 species and T corresponds to 541880 total proteins
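A minimal sketch of how such a matrix could be assembled, assuming the pairwise comparisons have already been reduced to sets of conserved proteins. The input layout, the function name and the toy values are assumptions; the result is expressed in percent, as in the table on the next slide.

```python
def conservation_matrix(sizes, conserved):
    """w[i][j] = percentage of the proteins of species j conserved in species i.

    sizes:     {species: number of proteins in its proteome}
    conserved: {(j, i): set of proteins of j with a match in i}
               (hypothetical layout; the diagonal (j, j) holds duplicated proteins)
    """
    species = list(sizes)
    return {i: {j: 100.0 * len(conserved.get((j, i), ())) / sizes[j]
                for j in species}
            for i in species}

# Toy example with two species X and Y.
sizes = {"X": 4, "Y": 2}
conserved = {
    ("X", "Y"): {"x1", "x2"},   # proteins of X conserved in Y
    ("Y", "X"): {"y1"},         # proteins of Y conserved in X
    ("X", "X"): {"x1"},         # proteins of X with a paralog in X
}
w = conservation_matrix(sizes, conserved)
```

The matrix is generally not symmetric, because each column is normalised by the size of its own proteome.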

Page 60: Tekaia Evol Trends Proteomes

Ancestral duplication and ancestral conservation: Wij (in %)

org SC SP CE DM AG CA ATH HS MUS FR PF ECUN
SC 40.5 63.9 17.5 27.1 22.3 65.9 23.4 22.9 27.3 18.0 22.5 35.8
SP 58.4 37.4 18.8 29.3 26.3 54.3 25.0 25.0 29.6 20.0 24.6 38.4
CE 38.1 46.6 65.2 51.9 50.6 35.5 27.5 44.6 54.4 42.4 24.8 34.8
DM 40.5 50.2 39.2 65.8 69.9 37.5 29.5 50.3 62.7 47.9 26.5 36.3
AG 40.9 50.2 39.8 73.1 59.5 38.0 30.6 50.2 60.3 48.7 26.5 36.0
CA 71.8 65.5 18.4 27.7 25.7 35.8 24.3 23.2 27.8 18.5 22.3 35.7
ATH 40.3 47.8 21.7 31.5 30.3 37.0 83.6 25.6 29.7 21.9 26.2 33.4
HS 43.0 53.3 40.0 61.3 54.5 39.7 32.1 66.7 90.8 68.8 28.2 37.7
MUS 41.7 52.5 39.5 62.1 54.7 39.1 31.5 76.8 77.8 67.7 27.6 37.2
FR 42.0 52.6 40.0 60.7 59.9 39.5 32.7 68.7 81.8 63.4 27.6 37.4
PF 25.9 31.2 13.1 19.3 15.9 22.2 16.3 17.2 21.0 13.2 28.3 28.9
ECUN 19.5 23.4 8.9 13.1 10.8 16.2 11.4 12.0 15.2 9.0 13.6 26.1
MJ 11.5 13.3 4.9 6.7 6.0 10.2 6.0 4.8 5.6 3.7 8.7 15.4
MTH 13.6 16.2 4.6 7.4 7.6 11.2 8.0 5.1 6.1 4.0 8.3 15.2
AF 14.4 16.5 5.9 8.2 8.7 11.8 8.7 5.6 6.6 4.5 8.6 15.4
PH 16.3 18.7 5.0 7.1 9.2 11.1 9.7 5.2 6.0 4.1 7.9 15.3
PA 14.3 15.2 5.4 7.5 7.3 11.9 7.4 5.5 6.4 4.3 8.3 15.9
APEM 15.5 20.1 4.8 7.3 10.6 10.3 9.4 5.2 5.9 3.9 7.2 14.9
TA 15.2 17.5 5.9 8.3 8.3 12.7 8.2 5.3 6.3 4.2 8.6 14.8
TV 15.4 17.8 6.2 8.3 8.7 13.3 8.3 5.6 6.8 4.4 8.7 15.0
H 14.8 17.7 5.8 8.3 9.8 12.0 10.2 5.5 6.6 4.5 8.0 13.9
SSP2 16.7 19.4 7.1 9.1 9.4 14.2 9.5 6.2 7.4 4.9 9.5 15.9
PFU 17.0 22.8 6.5 9.3 11.1 13.3 12.3 7.0 8.0 5.6 9.1 17.1
STO 18.6 23.1 6.8 8.6 11.4 13.7 11.1 5.9 7.1 4.5 9.1 15.7
PYAE 15.6 19.5 5.3 8.2 9.9 11.8 9.5 5.8 6.9 4.5 8.1 15.0
MA 16.0 18.9 7.1 10.8 12.5 14.7 9.7 7.4 8.7 6.4 9.8 17.0
MK 13.0 14.6 4.0 6.2 6.1 10.7 6.9 4.6 5.4 3.5 7.3 14.1
MMA 14.8 17.4 6.4 9.2 9.5 13.5 8.1 6.6 7.9 5.3 9.7 15.8
HI 13.0 14.3 4.8 7.3 8.5 11.1 8.7 4.4 5.4 4.0 8.2 8.7
...
tnsp 74.4 79.2 49.7 76.4 81.0 72.6 58.8 78.7 93.7 72.8 42.3 48.1

Page 61: Tekaia Evol Trends Proteomes

conservation tree

• species are clustered into 3 phylogenetic domains;
• bacterial species cluster with archaeal species;
• similar species cluster together;
• “whole genome” species clustering tree;
• very low resolution of deep clustering;

Page 62: Tekaia Evol Trends Proteomes

Genome trees: data matrices

T = {Tij ; i=1,n; j=1,n; n is the number of surveyed species}

Tij is the overall similarity score between species j and i.

• Shared orthologous genes

{sij = (shared orthologs between i and j) }

T = {Tij = sij/size(j); i=1,n; j=1,n }
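As a small worked example of this normalisation, the sketch below divides shared ortholog counts by proteome sizes. The count of 2532 shared orthologs between SC and SP comes from the table on a later slide; the proteome sizes are hypothetical round numbers chosen only for illustration, and the function name is an assumption.

```python
def normalise_shared(shared, sizes):
    """T[i][j] = s_ij / size(j): shared orthologs relative to the size of j."""
    return {i: {j: shared[i][j] / sizes[j] for j in shared[i]} for i in shared}

# s_ij is symmetric; the sizes below are illustrative, not the real proteome sizes.
shared = {"SC": {"SC": 0, "SP": 2532}, "SP": {"SC": 2532, "SP": 0}}
sizes = {"SC": 6000, "SP": 5000}
T = normalise_shared(shared, sizes)
```

Although s_ij is symmetric, T_ij is not: the same number of shared orthologs represents a larger fraction of the smaller proteome.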

Page 63: Tekaia Evol Trends Proteomes

Note on: Homologs - Paralogs - Orthologs

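Homologs: A1, B1, A2, B2

Paralogs: A1 vs B1 and A2 vs B2

Orthologs: A1 vs A2 and B1 vs B2

[Diagram, along a time axis: an ancestral gene A duplicates into A and B in the ancestor; after speciation, Species-1 (S1) carries A1 and B1 and Species-2 (S2) carries A2 and B2; sequence analysis compares the resulting genes]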

Page 64: Tekaia Evol Trends Proteomes

Shared orthologous genes: sij

org SC SP CE DM AG CA ATH HS MUS FR PF ECUN
SC 0 2532 1533 1660 1671 3371 1582 1789 1733 1731 890 600
SP 2532 0 1753 1917 1907 2588 1754 2060 2032 2024 1008 645
CE 1533 1753 0 3910 3869 1611 1902 4036 3994 4047 1015 580
DM 1660 1917 3910 0 7018 1728 2094 5057 5147 5035 1106 616
AG 1671 1907 3869 7018 0 1738 2160 5016 5013 5059 1085 617
CA 3371 2588 1611 1728 1738 0 1590 1850 1824 1827 873 595
ATH 1582 1754 1902 2094 2160 1590 0 2404 2406 2399 1067 539
HS 1789 2060 4036 5057 5016 1850 2404 0 14053 10286 1185 638
MUS 1733 2032 3994 5147 5013 1824 2406 14053 0 10304 1169 632
FR 1731 2024 4047 5035 5059 1827 2399 10286 10304 0 1146 626
PF 890 1008 1015 1106 1085 873 1067 1185 1169 1146 0 453
ECUN 600 645 580 616 617 595 539 638 632 626 453 0
MJ 238 233 214 216 242 230 279 223 216 217 169 142
MTH 254 247 237 247 278 245 306 251 248 249 171 141
AF 261 255 254 260 303 248 310 260 263 265 182 151
PH 251 245 250 259 297 237 281 273 258 271 187 155
PA 267 261 255 268 311 256 312 276 273 278 189 156
APEM 212 233 228 228 251 215 242 248 237 230 165 136
TA 264 260 252 254 279 261 298 268 264 261 182 141
TV 263 255 256 249 276 258 296 260 258 270 184 138
H 255 264 258 249 284 248 318 271 267 272 173 140
SSP2 302 317 293 292 326 300 360 310 309 311 200 155
PFU 264 284 256 275 324 286 316 292 274 280 195 150
STO 281 291 273 263 313 278 329 293 282 298 196 143
PYAE 245 258 236 249 285 238 278 258 246 256 170 143
MA 303 316 298 293 368 301 369 329 326 326 200 161
MK 210 214 195 204 216 211 244 205 202 195 160 125
MMA 289 298 276 280 338 280 349 305 299 297 194 160
HI 268 273 231 243 388 268 382 259 259 267 181 86

Page 65: Tekaia Evol Trends Proteomes

orthologs tree

• 3 phylogenetic domains;
• bacterial species cluster with archaeal species;
• similar species cluster together;
• better resolution of deep species clustering.

Page 66: Tekaia Evol Trends Proteomes

• Large scale comparative analysis of predicted proteomes revealed significant evolutionary processes:

[Diagram: Ancestor → species genome. Evolutionary processes include: Phylogeny*, duplication, genesis, Expansion*, HGT, Exchange*, loss, Deletion*, selection*]

Expansion, Exchange and Deletion are noise. They should be eliminated or at least reduced.

Page 67: Tekaia Evol Trends Proteomes

To overcome some of these limitations, we consider:

Genome tree construction from “Protein Conservation Profiles” and an attempt to reduce noisy evolutionary processes

Page 68: Tekaia Evol Trends Proteomes

Conservation profiles

• 99 species (B: 33; A: 19; E:27); 541880 proteins

• A “conservation profile” is an n-component binary vector describing a protein conservation pattern across n species. Components are 0 and 1, following absence or presence of homologs.

Example: p 0111111000111111111000110110111101001111101111

Main interesting properties of conservation profiles:

• A conservation profile is the trace of protein evolutionary histories jointly captured in a set of n species (multidimensional feature);

• Conservation profiles are signatures of evolutionary relationships;
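Building a conservation profile is a one-line operation once the species with a detected homolog are known. A minimal sketch, with a hypothetical function name and a toy species list:

```python
def conservation_profile(hit_species, species_order):
    """Binary string with '1' where the protein has a homolog in that species."""
    return "".join("1" if s in hit_species else "0" for s in species_order)

# Toy example: 4 species; the protein has homologs in species B and D.
profile = conservation_profile({"B", "D"}, ["A", "B", "C", "D"])
```

The order of the species must be fixed once for all proteins, so that position k always refers to the same species.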

Page 69: Tekaia Evol Trends Proteomes

Protein conservation profiles

Table: 541880 proteins x 99 species

Columns: species S1 ... Sn, grouped by domain (E, A, B). Rows: proteins G1,1 ... Gnp,n.

G1,1 100000000000000000000000000000000000000000000000
G2,1 111111111111111111111111111111111111111111111111
G3,1 111111110011111111111111011101110101111111101111
...
Gn1,1 100001110001000000000000000000000000000000000000
G1,2 010000000000000000010100000000000111000011100011
G2,2 010000000000000000010100000000000111000011100011
...
Gn2,2 111111110011111111111111011101110101111111101111
...
G1,n 011110100000000000000000001000000000000000000001
G2,n 111111110011111111100011011101110101111111101111
G3,n 111111110011111111100011011101110101111111101111
...
Gnp,n 100110000000000000000000000000000000000000000001

• Different conservation profiles represent different evolutionary histories

Page 70: Tekaia Evol Trends Proteomes

Distinct conservation profiles

• original total proteins (99 species): 541880
• non-specific proteins, i.e. conservation profiles (82%): 442460
• distinct conservation profiles (42%): 184130

111111110011111111111111011101110101111111101111
100110000000000000000000000000000000000000000001
100000000000000000000000000000000000000000000000
111111111111111111111111111111111111111111111111
010000000000000000010100000000000111000011100011
................................................

• This set is indicative of the various observed evolutionary histories.

• Effect of the duplication process is reduced (one representative from each set of identical conservation profiles)
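The two reduction steps can be sketched as follows. The rule used here for "species-specific" (a profile with a single 1, i.e. a protein found only in its own species) is an assumption made for the example, as are the function name and the toy profiles.

```python
def distinct_profiles(profiles):
    """Drop species-specific profiles, then keep one copy of each pattern."""
    # Assumption: a profile with a single '1' belongs to a species-specific protein.
    non_specific = [p for p in profiles if p.count("1") > 1]
    return non_specific, sorted(set(non_specific))

# Toy profiles over 4 species.
profiles = ["1000", "1101", "1101", "0110", "0100"]
non_specific, distinct = distinct_profiles(profiles)
```

Keeping one representative per pattern means that a family expanded by duplication within a genome contributes a single profile, which is how the duplication effect is reduced.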

Page 71: Tekaia Evol Trends Proteomes

[Figure: histogram. x axis: conservation weights (sum of "1": presence), c01 to c99; y axis: fractions (*10000) of distinct conservation profiles, scale 0 to 250]

Presence in the 184130 distinct conservation profiles: Mean=32.2; SD=23.3; min=1; Max=99.

Page 72: Tekaia Evol Trends Proteomes

Genome tree construction: data matrices

• Jaccard similarity scores between species, computed from the 184130 distinct conservation profiles (d.c.prof):

$s_{ij} = \frac{N_{11}}{N_{11} + N_{01} + N_{10}}$

N11, N01 and N10 are respectively the total occurrences of (1,1), (0,1) and (1,0) between i and j.

$T = \{T_{ij} = s_{ij};\ i = 1, \dots, n;\ j = 1, \dots, n\}$

[Figure: list of distinct conservation profiles (various evolutionary histories), with the two species columns i and j highlighted]
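The Jaccard score between two species only needs the two corresponding columns of the profile table. A minimal sketch, assuming profiles are stored as strings of 0 and 1 (the function name and the toy profiles are hypothetical):

```python
def jaccard(profiles, i, j):
    """s_ij = N11 / (N11 + N01 + N10) over columns i and j of the profiles."""
    n11 = n10 = n01 = 0
    for p in profiles:
        a, b = p[i], p[j]
        if a == "1" and b == "1":
            n11 += 1
        elif a == "1":
            n10 += 1
        elif b == "1":
            n01 += 1
    denominator = n11 + n01 + n10
    return n11 / denominator if denominator else 0.0

# Toy profiles over 3 species.
profiles = ["110", "101", "011", "111", "100"]
s01 = jaccard(profiles, 0, 1)
```

Joint absences (0,0) are ignored, so two species are not counted as similar merely because both lack a protein.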

Page 73: Tekaia Evol Trends Proteomes

Tekaia F, Yeramian E. (2005). PLoS Comput Biol.1(7):e75

profiles tree

Page 74: Tekaia Evol Trends Proteomes

Conclusions: Methodology

• Species classification is not an easy task!

• Methods that take into account whole genome information are still needed;

• Correspondence analysis method might be helpful in revealing evolutionary trends embedded in the multidimensional relationships as obtained from large scale genome comparisons;

• Species tree construction should take into account the whole information included in the genomes;

Page 75: Tekaia Evol Trends Proteomes

Conclusions...

• Conservation profiles represent the most conserved and meaningful evolutionary signals jointly captured in a set of species;
• Thus they should correspond to the most accurate type of markers for species classification;
• In principle, the profiles tree derived from distinct conservation profiles should considerably minimize genome acquisition effects and should reflect less noisy phylogenetic signals;
• The profiles tree presents evidence of conservation of stable phylogenetic relationships and reveals unconventional species clustering;
• The profiles tree corresponds to the classification of the evolutionary scenarios.

Page 76: Tekaia Evol Trends Proteomes

References:

• Tekaia, F. and Dujon, B. (1999). Pervasiveness of gene conservation and persistence of duplicates in cellular genomes. Journal of Molecular Evolution, 49: 591-600.
• Tekaia, F., Lazcano, A. and Dujon, B. (1999). Genome tree as revealed from whole proteome comparisons. Genome Res. 9: 550-7.
• Tekaia, F., Yeramian, E. and Dujon, B. (2002). Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene 297: 51-60.
• Tekaia, F. and Yeramian, E. (2005). Genome Trees from Conservation Profiles. PLoS Comput Biol. 1(7): e75.
• Tekaia, F. and Latgé, J.P. (2005). Aspergillus fumigatus: saprophyte or pathogen? Curr Opin Microbiol. 8: 385-92. Review.
• Tekaia, F. and Yeramian, E. (2006). Evolution of Proteomes: Fundamental signatures and global trends in amino acid composition. BMC Genomics. 7: 307.
• Systematic analysis of completely sequenced organisms: http://www.pasteur.fr/~tekaia/sacso.html

Page 77: Tekaia Evol Trends Proteomes

References:

• Bininda-Emonds ORP (2005). Supertree Construction in the Genomic Age. Methods in Enzymology 395: 745-757.
• Bininda-Emonds ORP, Gittleman JL, Steel MA (2002). The (super)Tree Of Life: Procedures, Problems, and Prospects. Annual Review of Ecology and Systematics 33: 265-289.
• Crandall KA, Buhay JE (2004). Science 306: 1144-1145.
• Dagan T, Martin W (2006). The tree of one percent. Genome Biology 7: 118.
• Delsuc F, Brinkmann H, Philippe H (2005). Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 6: 361-75. Review.
• Doolittle (1999). Science 284: 2124-8.
• Driskell et al. (2004). Science 306: 1172-1174.
• http://www.genomesonline.org/gold.cgi (list of genome projects)
• Linder, Moret, Nakhleh and Warnow: http://compbio.unm.edu/networks1.ppt
• Martin & Embley (2004). Nature 431: 134-7.
• MCL: a cluster algorithm for graphs: http://micans.org/mcl/
• Pennisi E (1998). Genome data shake tree of life. Science 280: 672-4.
• Raoult et al. (2004). Science 306: 1344-1350.
• Rivera & Lake JA (2004). Nature 431: 152-5.
• Snel B, Bork P, Huynen MA (1999). Genome phylogeny based on gene content. Nature Genetics 21: 108-110.
• Snel B, Huynen MA, Dutilh BE (2005). Genome trees and the nature of genome evolution. Annu Rev Microbiol. 59: 191-209. Review.
• Woese et al. (1990). PNAS 87: 4576-4579.