
Structure and Function of Proteins (BICL3215)

Fred R. Opperdoes

de Duve Institute and Laboratory of Biochemistry, Université catholique de Louvain, Brussels, Belgium


Typical Research Project

• A gene is cloned and sequenced.

• Is this gene new or already described by others ?

• Are there ORFs ?

• If yes, which of the 6 ORFs codes for a protein ?

• What are the properties of the protein ?

• Is the protein present in other organisms as well ?

• If yes, how did the protein evolve with time ?

• What is the protein’s 3-D structure ?

(Tutorial Bioinformatics: DBCM3002)


Gene new or already described by others ?

BLAST: Basic Local Alignment Search Tool

From NCBI: http://blast.ncbi.nlm.nih.gov/Blast.cgi



Gene new, or already described by others ?

• Carry out a BLASTN search against a nucleic acid database to identify only highly similar or identical sequences

• Carry out a TBLASTX search against a nucleic acid database to identify any possible homologous sequence in the database


TBLASTX

TBLASTX carries out a BLAST search of the query, translated in all 6 reading frames, against each nucleotide sequence in the database after its translation in all 6 possible reading frames
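The six-frame translation that TBLASTX applies to both the query and the database entries can be sketched in a few lines of Python. This is only an illustration of the idea, not of how BLAST itself is implemented: the function names are made up for this example, and rendering codons that contain anything other than A, C, G or T as X is an assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the translations of frames +1, +2, +3, -1, -2 and -3."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six strings would then be compared at the protein level, which is what makes the search sensitive to distant relationships.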


Are there ORFs ?

• An ORF (open reading frame) is a continuous sequence of nucleotide triplets uninterrupted by stop codons (TAA, TAG, TGA).

• A protein-encoding gene starts with an initiation codon (ATG) and ends with a stop codon.

• Use ORF finder at NCBI, the Translate tool at the Expasy proteomic server or Transeq tool of EMBOSS on a UNIX platform.


If yes, which of the 6 ORFs codes for a protein?

• Functional proteins have a minimal length (100 AAs for an enzyme, or 300 nucleotides), so take the longest ORF you can find.

• Blast the translated ORF against a protein database to identify a possible function
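As a rough illustration of the two steps above (scan all six frames, then keep the longest ORF), here is a minimal Python sketch. It is not the algorithm of NCBI ORF finder or of EMBOSS; the function names, the ATG-to-stop definition of an ORF and the default threshold of 100 codons (taken from the rule of thumb above) are choices made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=100):
    """List ORFs running from an ATG to the next in-frame stop codon.

    Each hit is (strand, frame, start, n_codons). 'start' is a 0-based
    position on the strand that was scanned (for '-' this is the reverse
    complement), and n_codons excludes the stop codon.
    Hits are sorted with the longest ORF first.
    """
    orfs = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3    # stop closes the ORF
                    if n_codons >= min_codons:
                        orfs.append((strand, frame + 1, start, n_codons))
                    start = None
    return sorted(orfs, key=lambda orf: orf[3], reverse=True)


# Toy sequence: one short ORF (ATG + 5 codons + TAA) in frame +3.
toy = "CC" + "ATG" + "GCC" * 5 + "TAA" + "G"
print(find_orfs(toy, min_codons=3))
```

With a real gene sequence the first element of the returned list would be the candidate to translate and BLAST against a protein database.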


What are the properties of the protein ?

Do a BLASTP search against the SwissProt protein database

Do a search against a profile or sequence motif database such as PROSITE

Search for transmembrane domains (Kyte and Doolittle)

Search for signal peptides and targeting sequences (PSORT, TARGETP, etc.)


Is the protein present in other organisms as well ?

Do an initial BLASTP search against the SwissProt database to find all guaranteed homologous proteins

Do a BLASTP search against the Uniprot database with cutoff value E=10e-20 to find homologues in all other organisms (not guaranteed!)


How did the protein evolve with time ?

•What is its relationship to other homologous proteins ?

•Are these orthologous or paralogous sequences ?

•Have there been gene duplications ?

•Can you identify events of horizontal gene transfer ?

•To answer these questions create a phylogenetic tree !


What is the protein’s 3-D structure ?

• 3-D structure is essential to understand the functioning of the protein and the importance of the individual amino-acid residues (active site, epitopes, ligand binding site, etc.)

• Do a BLASTP search of your sequence against the PDB protein structure database.

• Import the entry for your protein or that of a homologue and visualize its structure using a program like RASMOL

Demo Exercise

UNIX platform using the EMBOSS suite and the DNA sequence of T. brucei triose-phosphate isomerase

mrsget embl 'id:X03921' fasta 0 > tpis_trybb.nuc

transeq tpis_trybb.nuc -frame all -filter

copy translated orf to test : cat > test; paste; ctrl-D

blast test (blastp and swissprot database)

patmatmotifs test -filter

pepwindow tpis_trybb

blast tpis_trybb (blastp and uniprot database)

quicktree test sw 30

blast test (blastp and PDB database)

mrsget pdb 'id:1ag1' entry 0 > 1ag1.pdb

rasmol 1ag1.pdb

DNA or protein to create a phylogenetic tree ?

Many scientists think that:

• A nucleotide sequence contains all the information for the synthesis of a protein. Therefore, it is also the nucleotide sequence that should be used for phylogenetic inference.

• This is not necessarily true.

As a rule of thumb:

• DNA sequences are more suited for phylogenies with closely related taxa

• Protein sequences are more suited for phylogenies with more distantly related taxa


[Figure: phylogenetic tree of triosephosphate isomerase (TPIS) protein sequences, scale bar 0.1, with taxa grouped as Animalia, Planta, Protists, Fungi, Eubacteria and Archaebacteria]

The very first “tree of life” was based on protein sequences

The amino acid sequence of cytochrome c was the first to be used for the construction of a tree of life comprising both the prokaryotic and eukaryotic kingdoms (McLaughlin and Dayhoff, 1973)


Analysis based on cytochrome c (with distantly related taxa)

Human           ----------------MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQA
Drosophila      ------------MGVPAGDVEKGKKLFVQRCAQCHTVEAGGKHKVGPNLHGLIGRKTGQA
aspergillus     ---------GKDASFAPGDSAKGAKLFQTRCAQCHTVEAGGPHKVGPNLHGLFGRKTGQS
Saccharomyces   ----------MPAPYEKGSSKKGATLFKTRCLQCHTTEKGGANKVGPNLHGVFGRHSGQA
Chlamydomonas   --------MSTFAEAPAGDLARGEKIFKTKCAQCHVAEKGGGHKQGPNLGGLFGRVSGTA
Crithidia       ------MPPKARAPLPPGDAARGEKLFKGRAAQCHTANQGGANGVGPNLYGLVGRHSGTI
Rhodospirillum  ASPEAYVEYRKQALKASGDHMKALSAIVKGQLPLNAEAAKHAEAIAAIMESLPAAFPEGT

Human           PGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATN
Drosophila      AGFAYTDANKAKGITWNEDTLFEYLENPKKYIPGTKMIFAGLKKPNERGDLIAYLKSATK
aspergillus     EGYAYTDANKQAGVTWDENTLFSYLENPKKFIPGTKMAFGGLKKGKERNDLITYLKESTA
Saccharomyces   EGYSYTEANKKAGVLWDEQHMSDYLENPKKYIPGTKMAFAGLKKAKDRNDLVTYLKEATS
Chlamydomonas   AGFAYSKANKEAAVTWGESTLYEYLLNPKKYMPGNKMVFAGLKKPEERADLIAYLKQATA
Crithidia       EGYAYSKANAESGVVWTPDVLDVYLENPKKFMPGTKMSFAGMKKPQERADVIAYLETLKG
Rhodospirillum  AGIAKTEAK---AVVWSKADEFKADAVKSADAAKALAQAATAGDTAQMGKALAALGGTCK

Human           E-------
Drosophila      --------
aspergillus     --------
Saccharomyces   --------
Chlamydomonas   --------
Crithidia       --------
Rhodospirillum  GCHETFRE

Human           0.000 0.231 0.337 0.394 0.356 0.490 0.902
Drosophila      0.231 0.000 0.287 0.407 0.352 0.454 0.829
aspergillus     0.337 0.287 0.000 0.318 0.405 0.450 0.852
Saccharomyces   0.394 0.407 0.318 0.000 0.455 0.482 0.879
Chlamydomonas   0.356 0.352 0.405 0.455 0.000 0.420 0.872
Crithidia       0.490 0.454 0.450 0.482 0.420 0.000 0.874
Rhodospirillum  0.902 0.829 0.852 0.879 0.872 0.874 0.000

17

Analysis based on cytochrome c (with closely related taxa)

Barbary_ape     MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGITW
Rhesus_macaque  MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGITW
Gorilla         MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW
Human           MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW
Orangutan       MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW
Chimpanzee      MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW
Spider_monkey   MGDVFKGKRIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQASGFTYTEANKNKGIIW
Rabbit          MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITW
                **** ***:**: **:**************************** *::** ******* *

Barbary_ape     GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
Rhesus_macaque  GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
Gorilla         GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
Human           GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
Orangutan       GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
Chimpanzee      GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
Spider_monkey   GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
Rabbit          GEDTLMEYLENPKKYIPGTKMIFAGIKKKDERADLIAYLKKATNE
                ***********************.*****:***************

Barbary_ape     0.000 0.000 0.010 0.010 0.010 0.010 0.067 0.076
Rhesus_macaque  0.000 0.000 0.010 0.010 0.010 0.010 0.067 0.076
Gorilla         0.010 0.010 0.000 0.000 0.000 0.000 0.057 0.086
Human           0.010 0.010 0.000 0.000 0.000 0.000 0.057 0.086
Orangutan       0.010 0.010 0.000 0.000 0.000 0.000 0.057 0.086
Chimpanzee      0.010 0.010 0.000 0.000 0.000 0.000 0.057 0.086
Spider_monkey   0.067 0.067 0.057 0.057 0.057 0.057 0.000 0.105
Rabbit          0.076 0.076 0.086 0.086 0.086 0.086 0.105 0.000
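The entries of such a matrix are uncorrected distances: the fraction of aligned positions at which two sequences differ. The Python sketch below recomputes the Human/Rabbit entry from the cytochrome c alignment above. The function name is invented for this example, and skipping columns that contain a gap is an assumption about how the distances were obtained.

```python
def p_distance(seq_a, seq_b):
    """Fraction of differing residues over the gap-free aligned columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not columns:
        raise ValueError("no gap-free columns to compare")
    return sum(a != b for a, b in columns) / len(columns)


# Cytochrome c, both alignment blocks joined (105 residues each).
human = (
    "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW"
    "GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE"
)
rabbit = (
    "MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITW"
    "GEDTLMEYLENPKKYIPGTKMIFAGIKKKDERADLIAYLKKATNE"
)

# 9 differences in 105 positions
print(round(p_distance(human, rabbit), 3))  # -> 0.086
```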



DNA evolves more rapidly than protein. Why ?


The universal genetic code

First              Second position              Third
position     U(T)    C      A      G            position

U(T)         Phe     Ser    Tyr    Cys          U(T)
             Phe     Ser    Tyr    Cys          C
             Leu     Ser    STOP   STOP         A
             Leu     Ser    STOP   Trp          G

C            Leu     Pro    His    Arg          U(T)
             Leu     Pro    His    Arg          C
             Leu     Pro    Gln    Arg          A
             Leu     Pro    Gln    Arg          G

A            Ile     Thr    Asn    Ser          U(T)
             Ile     Thr    Asn    Ser          C
             Ile     Thr    Lys    Arg          A
             Met     Thr    Lys    Arg          G

G            Val     Ala    Asp    Gly          U(T)
             Val     Ala    Asp    Gly          C
             Val     Ala    Glu    Gly          A
             Val     Ala    Glu    Gly          G

The genetic code is degenerate

• 64 different possible triplet codes encode 20 amino acids. Thus one amino acid may be encoded by 1 to 6 different triplet codes. Three of the 64 codes are termination or “stop” codons.

• The different codons are used with unequal frequency. This is referred to as "codon usage".

• Codon usage varies between types of proteins (housekeeping or not) and between species.

• Amino-acid codons are degenerate mainly in the third (wobble) position.
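The "1 to 6 codons per amino acid" statement can be checked directly from the universal code table. The short Python sketch below builds the table from the usual 64-letter string (codons enumerated with T, C, A, G at each position) and counts the codons per amino acid; the variable names are chosen for this example only.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons per amino acid ('*' stands for the stop codons).
degeneracy = Counter(CODON_TABLE.values())

for aa, n in sorted(degeneracy.items(), key=lambda item: (-item[1], item[0])):
    print(aa, n)
```

Leu, Ser and Arg come out with six codons each, Met and Trp with a single codon, and the stop signal with three.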

Protein versus DNA sequences

Different species use different codons

Homo sapiens [gbmam]: 1 CDS's (389 codons)

----------------------------------------------------------------------------

fields: [triplet] [frequency: per thousand] ([number])

----------------------------------------------------------------------------

UUU 20.6( 8) UCU 5.1( 2) UAU 7.7( 3) UGU 7.7( 3)

UUC 12.9( 5) UCC 20.6( 8) UAC 30.8( 12) UGC 0.0( 0)

UUA 10.3( 4) UCA 18.0( 7) UAA 0.0( 0) UGA 0.0( 0)

UUG 10.3( 4) UCG 0.0( 0) UAG 2.6( 1) UGG 15.4( 6)

Saccharomyces cerevisiae [gbpln]: 9295 CDS's (4586264 codons)

----------------------------------------------------------------------------

fields: [triplet] [frequency: per thousand] ([number])

----------------------------------------------------------------------------

UUU 25.9(118900) UCU 23.6(108308) UAU 18.7( 85651) UGU 8.0( 36624)

UUC 18.3( 83880) UCC 14.3( 65421) UAC 14.7( 67599) UGC 4.6( 21255)

UUA 26.3(120698) UCA 18.7( 85618) UAA 1.0( 4476) UGA 0.6( 2742)

UUG 27.2(124967) UCG 8.5( 39137) UAG 0.4( 2058) UGG 10.4( 47694)

Mutations at the third nucleotide of each codon (wobble position) in the DNA normally do not lead to a change of amino acid in the protein.

Yeasts, protozoa, and animals have different codon preferences. This would result in differences in DNA sequence related to codon bias and not to evolution.
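The "frequency per thousand" and "number" fields of the codon usage tables above are simple counts. As a sketch, assuming a coding sequence that starts in frame, they could be computed as follows (the function name is invented for this example; T is written as U to match the tables).

```python
from collections import Counter


def codon_usage(cds):
    """Return {codon: (frequency per thousand, count)} for an in-frame CDS."""
    cds = cds.upper().replace("T", "U")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: (1000.0 * n / total, n) for codon, n in counts.items()}


# Toy CDS with four codons: AUG AAA AAA UAA
for codon, (per_thousand, n) in sorted(codon_usage("ATGAAAAAATAA").items()):
    print(f"{codon} {per_thousand:5.1f}({n})")
```

The same arithmetic reproduces the human table above: 8 occurrences of UUU among 389 codons give 1000 x 8 / 389 = 20.6 per thousand.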

Protein versus DNA sequences (2)

Differences between the “Universal” and Mitochondrial Genetic Codes

Codon    Universal code    Mitochondrial code
UGA      Stop              Trp
AGA      Arg               Stop
AGG      Arg               Stop (or Lys*)
AUA      Ile               Met

Modified from: Li and Graur, 1991, Fundamentals of Molecular Evolution, Sinauer Publ.

* Only in arthropod mitochondria (Abascal et al., PLoS Biol 4, e127 (2006))

• Some protozoa (e.g. ciliates) use the codons UAA and UAG to encode glutamine, rather than STOP

• The inclusion of unique codons in a subset of the DNA sequences will tend to make that subset appear more divergent than it really is


• In all major groups organisms with AT rich and GC rich DNA can be found.

• High GC content of DNA in prokaryotes seems to be associated with an aerobic life style (Naya et al., 2002)


GC content of DNA in aerobic and anaerobic prokaryotes

[Figure: GC content of DNA in anaerobic and aerobic prokaryotes. From Naya et al., J. Mol. Evol. 55 (2002) 260-264]

Protein versus DNA sequences: Conclusion

• Codogenic DNA sequences may accommodate numerous mutations that are not expressed in the corresponding protein

• Therefore the rate of evolution of such DNA sequences is faster than that of protein sequences

• As a consequence protein sequences rather than DNA sequences are preferred for the analysis of the evolutionary relationships between more distant taxa


When comparing sequences that have diverged for possibly a billion years or more,

• it is very likely that the wobble base at the third position of each codon will have become completely randomized.

• Some researchers exclude all wobble bases from a codogenic sequence. The result is that one actually looks at amino acid sequences.

• So why not take the corresponding protein sequence directly?

Distant codogenic DNA sequences

The use of protein sequences in phylogeny requires knowledge of the properties of the individual amino acids as well as their single letter codes

Translation of DNA into Protein sequences

Alanine A Leucine L

Arginine R Lysine K

Asparagine N Methionine M

Aspartic acid D Phenylalanine F

Cysteine C Proline P

Glutamic acid E Serine S

Glutamine Q Threonine T

Glycine G Tryptophan W

Histidine H Tyrosine Y

Isoleucine I Valine V

In addition, B may be used for Asx (Aspartate or Asparagine) and Z for Glx (Glutamate or Glutamine); X denotes an unknown residue. J, O and U are not used.

Amino acids and their single letter codes

Advantages of the translation of DNA into protein (1)

• DNA is composed of only four kinds of unit: A, G, C and T

• If gaps are not allowed, on average 25% of residues in two randomly chosen aligned sequences would be identical

• If gaps are allowed, as much as 50% of residues in two randomly chosen aligned sequences can be identical. Such a situation may obscure any genuine relationship that may exist, especially when comparing distantly related or rapidly evolving gene sequences

• It is easier to translate a gene sequence into its corresponding protein than to remove the third wobble base from each of the codons in the gene

• Translations of open reading frames into their corresponding peptide sequences are available in most databases (GenPept and Uniprot databases)
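The background identity expected by chance, without gaps, can be illustrated with a small simulation. The Python sketch below assumes that all letters are equally frequent, which real sequences are not, so it gives 1/4 for DNA and 1/20 for protein; the function names and the fixed random seed are choices made for this example.

```python
import random


def identity(seq_a, seq_b):
    """Fraction of identical positions in two equally long sequences."""
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def random_identity(alphabet, length=10000, seed=0):
    """Ungapped identity of two random sequences drawn from 'alphabet'."""
    rng = random.Random(seed)
    seq_a = rng.choices(alphabet, k=length)
    seq_b = rng.choices(alphabet, k=length)
    return identity(seq_a, seq_b)


dna_identity = random_identity("ACGT")
protein_identity = random_identity("ACDEFGHIKLMNPQRSTVWY")

print(f"random DNA:     {dna_identity:.1%} identical")
print(f"random protein: {protein_identity:.1%} identical")
```

With a 20-letter alphabet the chance level drops about fivefold, which is the gain in signal to noise that translation provides.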

Alignment of two random DNA sequences

Without indels: 19% identity

Indels allowed: 56% identity

• Translation of DNA into 21 different types of codon (20 amino acids and a terminator) allows the information to sharpen up considerably. Wrong frame information is set aside as well

• Third-base degeneracies are consolidated

• After insertion of gaps to align two random protein sequences it can be expected that they are between 10-20% identical

• As a result of the translation procedure the protein sequences with their 20 amino acids are much easier to align than the corresponding DNA sequences with only 4 nucleotides

• Conclusion: The signal to noise ratio is greatly improved when using protein sequences over DNA sequences!

Advantages of the translation (2)

Alignment of two random protein sequences

Without indels: 7% identity

Indels allowed: 22% identity

• If you still want to align distantly related gene sequences, it is advised to prepare first a protein alignment and then use this alignment for the alignment of the corresponding gene sequences and the precise placement of indels* in the aligned sequences (use EMBOSS’ tranalign).

*Indel = region with insertion or deletion

Advantages of the translation (3)


The various BLAST algorithms


BLAST: Basic Local Alignment Search Tool

From NCBI: http://blast.ncbi.nlm.nih.gov/Blast.cgi


TBLASTX

• The blast algorithm TBLASTX allows the use of translated nucleic acid sequence information to search for distant relationships between genes

• It converts a nucleotide query sequence into protein sequences in all six reading frames and then compares these with a nucleotide database that has been translated in all six reading frames.

http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html#protein_databases

NCBI BLASTN output

NCBI TBLASTX output

Nature of Sequence Divergence in Proteins

• The observed sequence difference of two diverging sequences takes the course of a negative exponential. This is the result of the fact that each position is subject to reverse changes ("back mutations") and multiple hits

• Thus the observed percentage of difference between the protein sequences is not proportional to the actual evolutionary difference between two homologous sequences

• The evolutionary distance between two proteins is expressed in PAM units. PAM (Dayhoff and Eck, 1968) stands for "accepted point mutation"

Relation between % distance and PAM distance

[Figure: observed distance (%) plotted against PAM value (0 to 400), with the twilight zone marked]

Relation between % distance and PAM distance

PAM value    Distance (%)
80           50
100          60
200          75
250          85    Twilight zone
300          92

(From Doolittle, 1987, Of URFs and ORFs, University Science Books)

As the evolutionary distance increases, the probability of super-imposed mutations becomes greater resulting in a lower observed percent difference.

The Kimura correction for multiple substitutions

• $K = -\ln\left(1 - D - D^{2}/5\right)$

where D is the observed distance and K is the corrected distance.

• This formula gives the mean number of estimated substitutions per site. K can be greater than 1, i.e. more than one substitution per site, on average. For example, if you observe 0.8 differences per site (80% difference; 20% identity), then the above formula predicts that there have been about 2.6 substitutions per site over the course of evolution since the two sequences diverged.

• K can also be expressed in PAM units by multiplying it by 100 (mean number of substitutions per 100 residues).

Kimura, M. The neutral Theory of Molecular Evolution, Camb.Univ.Press, 1983, page 75
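A minimal Python sketch of the correction is given below. The function name is invented for this example, and raising an error when the logarithm is undefined is a choice made here; alignment programs may handle very large distances differently.

```python
import math


def kimura_distance(d):
    """Kimura-corrected distance K = -ln(1 - D - D^2/5) for observed D."""
    argument = 1.0 - d - d * d / 5.0
    if argument <= 0.0:
        raise ValueError("observed distance too large for the formula")
    return -math.log(argument)


for observed in (0.228, 0.327, 0.8):
    corrected = kimura_distance(observed)
    print(f"D = {observed:.3f}  K = {corrected:.3f}  ({100 * corrected:.0f} PAM)")
```

The argument of the logarithm reaches zero at D of about 0.854, so the formula cannot be applied to observed distances above roughly 85%.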

Proteins evolve at highly different rates

Protein                          Rate of change        Theoretical lookback time
                                 (PAMs / 10^8 yrs)     (x 10^6 yrs)
Pseudogenes                      400                   45
Fibrinopeptides                  90                    200
Lactalbumins                     27                    670
Lysozymes                        24                    850
Ribonucleases                    21                    850
Haemoglobins                     12                    1500
Acid proteases                   8                     2300
Cytochrome c                     4                     5000
Glyceraldehyde-P dehydrogenase   2                     9000
Glutamate dehydrogenase          1                     18000

PAM = number of Accepted Point Mutations per 100 amino acids. Useful lookback time = 360 PAMs

Some Important Dates in History

Event Time before present (10^9 years)

Origin of the Universe 13.4 ± 4

Formation of the Solar System 4.6

First Self-replicating System 3.5 ± 0.5

Prokaryotic-Eukaryotic Divergence 2.0 ± 0.5

Plant-Animal Divergence ~1.0

Invertebrate-Vertebrate Divergence 0.5

Mammalian Radiation Beginning ~ 0.1

From Doolittle, Of URFs and ORFs, 1987

[Figure: phylogenetic tree, scale bar 0.1, with taxa grouped as Animalia, Planta, Protists, Fungi, Eubacteria and Archaebacteria]

Phylogenetic tree from triosephosphate isomerase protein sequences

Protein sequences lend themselves extremely well to the creation of distant phylogenies, provided that:

• pairwise uncorrected distances remain under 85%

• the sequences are sufficiently long.

In the case of introns :

• A study of the evolution of a protein using its DNA sequence should only include coding sequences

• This requires that all the introns are edited out of every DNA sequence. This may be cumbersome and time consuming

• An easier approach would be the direct translation of the cDNA sequence into its corresponding protein sequence

When is protein preferred over DNA ? (1)

Typical structure of a eukaryotic gene

[Figure: gene drawn from 5' to 3' with two flanking regions, Exon 1, Intron I, Exon 2, Intron II and Exon 3; labelled features: TATA box, transcription initiation, initiation codon, stop codon, AATAA, poly(A) addition site; the messenger RNA is shown as well]

In the case of multigene families :

• Organisms may contain many highly similar genes, while only one peptide sequence can be identified (e.g. histones, tubulins and GAPDH in humans).

• Using these DNA sequences, it would be difficult to decide which genes are expressed and which are not and thus which genes to include in the analysis.

• Moreover, if all the genes that are expressed encode the same protein, then DNA differences are not significant


Protein is the unit of selection :

• For protein-encoding genes, the object on which natural selection acts is the protein itself.

• The underlying DNA sequence reflects this process in combination with species-specific pressures on DNA sequence (like the need for aerophiles to have DNA that is richer in GC).

• If function demands that a protein maintains a specific sequence, there still is sufficient room for the DNA sequence to change.


In the case of RNA editing :

• The DNA sequence does not always translate into amino acid sequence.

• In post-transcriptional editing non-coded amino acids are added or coded amino acids are removed.

• This could lead to major differences (sometimes more than 50%) between the original DNA sequence and the mature mRNA sequence after editing


Pan-editing of mitochondrial RNA in Kinetoplastida

UCCuAuuA*AuUUUUUGuUA**UAuAGuuuuuuAA*UGUUGuuuGGuGuA*uuuuuuuAuUG*UGuuuAGuuuuGuuuuGuuGuuGuuuGuuuG****GUGuGuuAuuG**UUUUGAGAuuGuuG

Note that the mature mRNA would not be able to hybridise with the gene present in the kinetoplast DNA and thus cannot be detected as such.

Some good advice

• It is recommended to prepare the phylogenetic trees both ways (DNA and Protein) and see how they compare

• For a group of species that are relatively close in time and closely related (like viral proteins or vertebrate enzymes), DNA-based analysis is probably a good way to go, since you avoid problems of codon bias and randomization of wobble bases. But check the protein anyway


Orthologous and paralogous genes

• Be aware of the problems of multigene families (for instance coding for isoenzymes)

• Be careful when you decide to exclude or include such sequences (you may compare paralogous rather than orthologous sequences)

[Figure: gene tree with the gene duplication and speciation events marked]

rat LDH L and mouse LDH L are orthologous

rat LDH L and mouse LDH M are paralogous

What is required

• A protein sequence

• A set of homologous sequences

• A good multiple sequence alignment

• Several programs to create a phylogenetic tree


Collections of protein sequences

• Enzyme database (Expasy)

• Protein structure database (PDB)

• SwissProt (Expasy)

• UniProt (Expasy, EMBL)

• GenPept (GenBank)


What is required

• A DNA or protein sequence

• A set of homologous sequences

• A good multiple sequence alignment

• Several programs to create a phylogenetic tree

Pair-wise alignment of two protein sequences according to the ‘Dot-Matrix’ method

[Figure: four dot-matrix plots, panels A to D. Sequence pairs shown: CDEGLDPGSERK against CDEGLDPGSERK; CDEPLDPGSQRK against CDEGLDPGSERK; CDELDPGSQRK against CDEGLDPGSERK; CDEDGLSQLK against CDEGLDPLSERK]
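A dot matrix is easy to produce: place one sequence along each axis and put a dot wherever the two residues are identical. The Python sketch below does this for two of the example peptides; the function names and the text rendering are choices made for this illustration, and no window or threshold filtering is applied.

```python
def dot_matrix(seq_x, seq_y):
    """Return rows (one per residue of seq_y) of True/False identity hits."""
    return [[a == b for a in seq_x] for b in seq_y]


def render(seq_x, seq_y):
    """Draw the dot matrix as text, seq_x across the top and seq_y down the side."""
    lines = ["  " + " ".join(seq_x)]
    for residue, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(residue + " " + " ".join("•" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("CDEPLDPGSQRK", "CDEGLDPGSERK"))
```

Identical stretches show up as runs of dots along a diagonal; a substitution leaves a hole in the diagonal, and an insertion or deletion shifts the run onto a neighbouring diagonal.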

Dot-Matrix plots

[Figure: two dot-matrix plots, one of two homologous sequences with 81% identity and one of two homologous sequences with 50% identity]

Alignment parameters in ClustalX

PAM 350 matrix as used in Clustal

C 12,

S 0, 2,

T -2, 1, 3,

P -3, 1, 0, 6,

A -2, 1, 1, 1, 2,

G -3, 1, 0,-1, 1, 5,

N -4, 1, 0,-1, 0, 0, 2,

D -5, 0, 0,-1, 0, 1, 2, 4,

E -5, 0, 0,-1, 0, 0, 1, 3, 4,

Q -5,-1,-1, 0, 0,-1, 1, 2, 2, 4,

H -3,-1,-1, 0,-1,-2, 2, 1, 1, 3, 6,

R -4, 0,-1, 0,-2,-3, 0,-1,-1, 1, 2, 6,

K -5, 0, 0,-1,-1,-2, 1, 0, 0, 1, 0, 3, 5,

M -5,-2,-1,-2,-1,-3,-2,-3,-2,-1,-2, 0, 0, 6,

I -2,-1, 0,-2,-1,-3,-2,-2,-2,-2,-2,-2,-2, 2, 5,

L -6,-3,-2,-3,-2,-4,-3,-4,-3,-2,-2,-3,-3, 4, 2, 6,

V -2,-1, 0,-1, 0,-1,-2,-2,-2,-2,-2,-2,-2, 2, 4, 2, 4,

F -4,-3,-3,-5,-4,-5,-4,-6,-5,-5,-2,-4,-5, 0, 1, 2,-1, 9,

Y 0,-3,-3,-5,-3,-5,-2,-4,-4,-4, 0,-4,-4,-2,-1,-1,-2, 7,10,

W -8,-2,-5,-6,-6,-7,-4,-7,-7,-5,-3, 2,-3,-4,-5,-2,-6, 0, 0,17,

C S T P A G N D E Q H R K M I L V F Y W

ClustalX distance matrix

Non-corrected

Human 0.000 0.228 0.327 0.386 0.347 0.475 0.901

Drosophila 0.228 0.000 0.248 0.376 0.327 0.426 0.822

aspergillus 0.327 0.248 0.000 0.287 0.376 0.406 0.871

Saccharomyces 0.386 0.376 0.287 0.000 0.436 0.455 0.881

Chlamydomonas 0.347 0.327 0.376 0.436 0.000 0.386 0.871

Crithidia 0.475 0.426 0.406 0.455 0.386 0.000 0.871

Rhodospirillum 0.901 0.822 0.871 0.881 0.871 0.871 0.000

Corrected

Human 0.000 0.272 0.428 0.538 0.463 0.735 5.170

Drosophila 0.272 0.000 0.301 0.518 0.428 0.620 2.760

aspergillus 0.428 0.301 0.000 0.362 0.518 0.578 3.860

Saccharomyces 0.538 0.518 0.362 0.000 0.642 0.687 4.220

Chlamydomonas 0.463 0.428 0.518 0.642 0.000 0.538 3.860

Crithidia 0.735 0.620 0.578 0.687 0.538 0.000 3.860

Rhodospirillum 5.170 2.760 3.860 4.220 3.860 3.860 0.000

Matrices often used for the alignment of proteins

• PAM 250 (Dayhoff et al., 1978)

• BLOSUM30 (Henikoff-Henikoff, 1992)

• JTT (Jones et al., 1992)

• mtREV24 (Adachi-Hasegawa, 1996)

• GONNET 250 matrix (Gonnet et al., 1992)

• For the creation of a phylogenetic tree a good alignment of protein sequences is of vital importance

• Only homologous residues should be aligned with each other

• Doubtful regions should not be included in the alignment

• Aligned sequences should have similar lengths

Alignment of two protein sequences (1)

• Alignment requires the user to make assumptions regarding relative costs of substitution versus insertions and deletions (indels).

• If substitution cost >> gap penalty: there will be many short gaps and no phylogenetic information.

• In general: search for maximum similarity and minimize the number of insertions and deletions.

• Exclude regions that cannot be aligned unambiguously!
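The trade-off between substitution cost and gap penalty can be seen with a small global alignment. The sketch below is a plain Needleman-Wunsch alignment with identity scoring and a linear gap penalty, written for this illustration; it is not the procedure used by Clustal, and the default scores are arbitrary assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,     # gap in b
                score[i][j - 1] + gap,     # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# Gaps more costly than substitutions: one gap, one mismatch.
print(*needleman_wunsch("CDEGLDPGSERK", "CDELDPGSQRK"), sep="\n")
# Substitutions much more costly than gaps: the mismatch is replaced by gaps.
print(*needleman_wunsch("CDEGLDPGSERK", "CDELDPGSQRK", mismatch=-5, gap=-1), sep="\n")
```

With the second set of scores the E/Q substitution disappears from the alignment and is replaced by two extra gap columns, which is the behaviour the bullet on substitution cost versus gap penalty warns against.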


Multiple alignment of protein sequences

• For the construction of reliable phylogenetic trees the quality of a multiple alignment is of the utmost importance

• There are many programs available for the multiple alignment of proteins.
– A good program in the public domain is: ClustalW or ClustalX
– Available on the web for free and for any platform (PC, Mac, Unix/Linux)

• They quickly align sequence pairs and roughly determine the degrees of identity between each pair

• Then the sequences are aligned more precisely in a progressive way starting with the two closest sequences

• Editing of alignments: Jalview or Seaview

Most programs work best when the sequences have similar length.

Some rules of thumb for the manual alignment of proteins (1)

• An automatically produced multiple alignment often needs manual adjustment to improve the quality of the alignment.

• Such improvement can be obtained by using all the knowledge that is available about a protein.

• If a structure is available you should use the detailed information about secondary structure for the alignment.

• The rules for mutation of amino acids are dependent on their physicochemical properties.

• Surface residues (DRENK) are preferably mutated to residues of similar properties. Since they are not, or less, involved in protein folding they mutate rather easily.

• Hydrophobic residues (FAMILYVW) are preferentially replaced by other hydrophobic ones. These residues are mainly internal and determine the folding of the protein. They thus mutate rather slowly.


• The residues CHQST are indifferent and may be replaced with any other type of residue

• The residues (DRENKCHQST), when conserved throughout the alignment are very likely residues that are involved in the active site. So the multiple alignment should be adjusted accordingly

• Periodicity of charged residues may provide information as to the presence of elements of secondary structure such as α-helices and β-strands


[Figure: α-helix and β-strand]

• Indels (insertions/deletions) are never found in elements of secondary structure but only in loops.

• Pro and Gly interfere with secondary structure elements and thus have a preference for loops

• Hydrophobicity (or hydropathy) profiles according to Kyte and Doolittle of two homologous proteins are in general strikingly similar


Proline interferes with α-helix and β-sheet formation

From Deber and Therien, 2002

Alignment of malate dehydrogenase sequences

lcl|CHR34_tmp.0150  ----MKPST--LSRFKVTVLGASGAIGQPLALALVQNKRVSEL-----ALYDIVQPR---
lcl|CHR34_tmp.0140  ----MRRSQ--GCFFRVAVLGAAGGIGQPLSLLLKNNKYVKEL-----KLYDVKGGP---
lcl|CHR34_tmp.0130  MGLLFRRSLTALKKGKVVLFGCSNAVGQPLSLLLKMNPHVEELVCCNTAADDDVPGS---
lcl|CHR28_tmp.0050  -----------MSAVKVAVTGAAGQIGYALVPLIARGALLGPTTPVELRLLDIEPALKAL

lcl|CHR34_tmp.0150  -GVAVDLSHFPRKVKVTGYPTKWIHK--ALDGADLVLMSAGMPRRPGMT-HDDLFNTNAL
lcl|CHR34_tmp.0140  -GVAADLSHICAPAKVTGYTKDELSR--AVENADVVVIPAGIPRKPGMT-RDDLFNTNAS
lcl|CHR34_tmp.0130  -GIAADLSHIDTLPKVH-YATDEGQWPALLRDAQLILVCFGSSFDLLREDRDIALKAAAP
lcl|CHR28_tmp.0050  AGVEAELEDCAFPLLDKVVVTADPRV--AFDGVAIAIMCGAFPRKAGME-RKDLLEMNAR

lcl|CHR34_tmp.0150  TVNELSAAVARYAPKSV-LAIISNPLNSMVPVAAETLQRAGVYDPRKLFGIISLNMMRAR
lcl|CHR34_tmp.0140  IVRDLAIAVGTHAPKAI-VGIITNPVNSTVPVAAEALKKVGVYDPARLFGVTTLDVVRAR
lcl|CHR34_tmp.0130  TMRRVMAAVASSDTTGN-VAVVSSPVNALTPFCAELLKASGKFDPRKLFGVTTLDVIRTR
lcl|CHR28_tmp.0050  IFKEQGEAIAAVAASDCRVVVVGNPANTNALILLKSAQ--GKLNPRHVTAMTRLDHNRAL

lcl|CHR34_tmp.0150  KMLGDFTGQDPEMLDVPVIGGHSGQTIVPLFSHS--GVELRQEQVEYLTHRVR-------
lcl|CHR34_tmp.0140  TFVAEALGASPYDVDVPVIGGHSGETIVPLLSG---FPSLSEEQVRQLTHRIQ-------
lcl|CHR34_tmp.0130  KLVAGTLHMNPYDVNVPVVGGCGGVTACPLIAQT--GLRIPLDDIVRISGEVQSYGVLFE
lcl|CHR28_tmp.0050  SLLARKAGVPVSQVRNVIIWGNHSSTQVPDTDSAVIGTTPAREAIKDDALDDD-----FV

lcl|CHR34_tmp.0150  --VGGD-EVVKAKEGRGSSSLSMAFAAAEWADGVLRAMDGEKTLLQCSFVESPLFADKCR
lcl|CHR34_tmp.0140  --FGGD-EVVKAKDGAGSATLSMAFAGNEWTTAVLRALSGEKGVVVCTYVQS-TVEPSCA
lcl|CHR34_tmp.0130  AAVGADSHDALSTEVAPPVALGLAYAACDFSTSLLKALRGDVGIVECALVES-TMRSETP
lcl|CHR28_tmp.0050  QVVRGRGAEIIQLRGLSSAMSAAKAAVDHVHDWIHGTPEGVYVSMGVYSDENPYGVPSGL

lcl|CHR34_tmp.0150  FFGSTVEVCKEGIERVLPLPPLNEYEEEQLDRCLPDLEKN-IRKGLAFVAENAATSTPST
lcl|CHR34_tmp.0140  FFSSPVLLGNSGVEKIYPVPMLNAYEEKLMAKCLEGLQSN-ITKGIAFSNK---------
lcl|CHR34_tmp.0130  FFSSRVELGREGVQRVFPMGALTSYEHELIETAVPELMRD-VQAGIEAATQF--------
lcl|CHR28_tmp.0050  IFSFP-CTCHAGEWTVVSGKLNGDLGKQRLASTIAELQEERAQAGL--------------

Hydrophobicity profiles

• Profiles according to Kyte and Doolittle of homologous proteins are in general strikingly similar and may provide a tool in the alignment of two or more proteins.

• The two phosphoglycerate kinase sequences below share 50% identical residues.

Trypanosoma congolense PGK Euglena gracilis PGK
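A hydropathy profile of the kind described above can be sketched as a sliding-window average over the Kyte and Doolittle scale. This is a minimal sketch, not the program used to draw the slide: the function name and the window of 9 residues are assumptions made for illustration, and the input is expected to be an ungapped protein sequence in one-letter code.

```python
# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Mean hydropathy of every window of `window` residues (window size is an assumed default)."""
    values = [KD[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical usage on a short fragment taken from the alignment above (gaps removed).
profile = hydropathy_profile("GVAVDLSHFPRKVKVTGYPTKWIHK")
print(len(profile))  # one value per window position
```

Plotting the two profiles of a pair of homologous proteins on the same axis is then enough to compare them by eye.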

What is required

• A DNA or protein sequence
• A set of homologous sequences
• A good multiple sequence alignment
• Several programs to create a phylogenetic tree

Tree construction methods (1)

• Character-based methods: – maximum parsimony – maximum likelihood

• Non-character-based methods: – distance matrix methods

Tree construction methods (2)

• Distance matrix methods
– Cluster analysis (UPGMA, WPGMA, etc.)
– Fitch & Margoliash (1967)
– Transformed distance methods (e.g. Li, 1981)
– Neighbor-joining (Saitou & Nei, 1987)
– ... many more

• Parsimony methods
– Maximum parsimony (Protpars, PAUP)

• Other methods
– Maximum likelihood (ProtML, PhyML, TreePuzzle)
– SplitsTree, MrBayes
– ... many more

Text available from: [email protected]

Text and slides: http://www.deduveinstitute.be/~opperd/BICL3215/

Website: http://www.deduveinstitute.be/~opperd/private/proteins.html

http://www.deduveinstitute.be/~opperd/Capetown/


Pair-wise alignment of two protein sequences according to the ‘Dot-Matrix’ method

[Figure: four dot-matrix plots (panels A to D), each comparing one pair of sequences: CDEGLDPGSERK against CDEGLDPGSERK; CDEPLDPGSQRK against CDEGLDPGSERK; CDELDPGSQRK against CDEGLDPGSERK; CDEDGLSQLK against CDEGLDPLSERK.]

• Alignment requires the user to make assumptions regarding relative costs of substitution versus insertions and deletions (indels).

• If substitution cost >> gap penalty: there will be many short gaps and no phylogenetic information.

• In general: search for maximum identity and minimize the number of insertions and deletions.

• Exclude regions that cannot be aligned unambiguously!

• Visual alignment is possible using the "dot-matrix method"
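The dot-matrix method can be sketched in a few lines: a mark is placed wherever the residue of one sequence equals the residue of the other, so that a run of identities shows up as a diagonal and an indel shifts the diagonal. The function name and the use of `*` and `.` as symbols are choices made for this illustration only.

```python
def dot_matrix(seq_a, seq_b):
    """One text row per residue of seq_b, one column per residue of seq_a."""
    return [
        "".join("*" if a == b else "." for a in seq_a)
        for b in seq_b
    ]


# Two of the sequences from the dot-matrix slide: one residue is missing
# in the second sequence, which shifts the diagonal by one position.
for row in dot_matrix("CDEGLDPGSERK", "CDELDPGSQRK"):
    print(row)
```

For two identical sequences the main diagonal is fully marked; any additional marks off the diagonal come from residues that occur more than once.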


Identity matrix as used in Clustal

C 10,

S 0, 10,

T 0, 0, 10,

P 0, 0, 0, 10,

A 0, 0, 0, 0, 10,

G 0, 0, 0, 0, 0, 10,

N 0, 0, 0, 0, 0, 0, 10,

D 0, 0, 0, 0, 0, 0, 0, 10,

E 0, 0, 0, 0, 0, 0, 0, 0, 10,

Q 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,

H 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,

R 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,

K 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,

M 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,

I 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,

L 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,

V 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,

F 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,

Y 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,

W 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,


Distance matrix with mutation costs for amino acids

A S G L K V T P E D N I Q R F Y C H M W Z B X

Ala = A 0 1 1 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2

Ser = S 1 0 1 1 2 2 1 1 2 2 1 1 2 1 1 1 1 2 2 1 2 2 2

Gly = G 1 1 0 2 2 1 2 2 1 1 2 2 2 1 2 2 1 2 2 1 2 2 2

Leu = L 2 1 2 0 2 1 2 1 2 2 2 1 1 1 1 2 2 1 1 1 2 2 2

Lys = K 2 2 2 2 0 2 1 2 1 2 1 1 1 1 2 2 2 2 1 2 1 2 2

Val = V 1 2 1 1 2 0 2 2 1 1 2 1 2 2 1 2 2 2 1 2 2 2 2

Thr = T 1 1 2 2 1 2 0 1 2 2 1 1 2 1 2 2 2 2 1 2 2 2 2

Pro = P 1 1 2 1 2 2 1 0 2 2 2 2 1 1 2 2 2 1 2 2 2 2 2

Glu = E 1 2 1 2 1 1 2 2 0 1 2 2 1 2 2 2 2 2 2 2 1 2 2

Asp = D 1 2 1 2 2 1 2 2 1 0 1 2 2 2 2 1 2 1 2 2 2 1 2

Asn = N 2 1 2 2 1 2 1 2 2 1 0 1 2 2 2 1 2 1 2 2 2 1 2

Ile = I 2 1 2 1 1 1 1 2 2 2 1 0 2 1 1 2 2 2 1 2 2 2 2

Gln = Q 2 2 2 1 1 2 2 1 1 2 2 2 0 1 2 2 2 1 2 2 1 2 2

Arg = R 2 1 1 1 1 2 1 1 2 2 2 1 1 0 2 2 1 1 1 1 2 2 2

Phe = F 2 1 2 1 2 1 2 2 2 2 2 1 2 2 0 1 1 2 2 2 2 2 2

Tyr = Y 2 1 2 2 2 2 2 2 2 1 1 2 2 2 1 0 1 1 3 2 2 1 2

Cys = C 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 0 2 2 1 2 2 2

His = H 2 2 2 1 2 2 2 1 2 1 1 2 1 1 2 1 2 0 2 2 2 1 2

Met = M 2 2 2 1 1 1 1 2 2 2 2 1 2 1 2 3 2 2 0 2 2 2 2

Trp = W 2 1 1 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 0 2 2 2

Glx = Z 2 2 2 2 1 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2

Asx = B 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 2 2 1 2

??? = X 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

The distance table is generated by calculating the minimum number of base mutations required to convert an amino acid in row i to an amino acid in column j. Note Met->Tyr is the only change that requires all 3 codon positions to change.

Hydrophobicity matrix

R K D E B Z S N Q G X T H A C M P V L I Y F W

Arg = R 10 10 9 9 8 8 6 6 6 5 5 5 5 5 4 3 3 3 3 3 2 1 0

Lys = K 10 10 9 9 8 8 6 6 6 5 5 5 5 5 4 3 3 3 3 3 2 1 0

Asp = D 9 9 10 10 8 8 7 6 6 6 5 5 5 5 5 4 4 4 3 3 3 2 1

Glu = E 9 9 10 10 8 8 7 6 6 6 5 5 5 5 5 4 4 4 3 3 3 2 1

Asx = B 8 8 8 8 10 10 8 8 8 8 7 7 7 7 6 6 6 5 5 5 4 4 3

Glx = Z 8 8 8 8 10 10 8 8 8 8 7 7 7 7 6 6 6 5 5 5 4 4 3

Ser = S 6 6 7 7 8 8 10 10 10 10 9 9 9 9 8 8 7 7 7 7 6 6 4

Asn = N 6 6 6 6 8 8 10 10 10 10 9 9 9 9 8 8 8 7 7 7 6 6 4

Gln = Q 6 6 6 6 8 8 10 10 10 10 9 9 9 9 8 8 8 7 7 7 6 6 4

Gly = G 5 5 6 6 8 8 10 10 10 10 9 9 9 9 8 8 8 8 7 7 6 6 5

??? = X 5 5 5 5 7 7 9 9 9 9 10 10 10 10 9 9 8 8 8 8 7 7 5

Thr = T 5 5 5 5 7 7 9 9 9 9 10 10 10 10 9 9 8 8 8 8 7 7 5

His = H 5 5 5 5 7 7 9 9 9 9 10 10 10 10 9 9 9 8 8 8 7 7 5

Ala = A 5 5 5 5 7 7 9 9 9 9 10 10 10 10 9 9 9 8 8 8 7 7 5

Cys = C 4 4 5 5 6 6 8 8 8 8 9 9 9 9 10 10 9 9 9 9 8 8 5

Met = M 3 3 4 4 6 6 8 8 8 8 9 9 9 9 10 10 10 10 9 9 8 8 7

Pro = P 3 3 4 4 6 6 7 8 8 8 8 8 9 9 9 10 10 10 9 9 9 8 7

Val = V 3 3 4 4 5 5 7 7 7 8 8 8 8 8 9 10 10 10 10 10 9 8 7

Leu = L 3 3 3 3 5 5 7 7 7 7 8 8 8 8 9 9 9 10 10 10 9 9 8

Ile = I 3 3 3 3 5 5 7 7 7 7 8 8 8 8 9 9 9 10 10 10 9 9 8

Tyr = Y 2 2 3 3 4 4 6 6 6 6 7 7 7 7 8 8 9 9 9 9 10 10 8

Phe = F 1 1 2 2 4 4 6 6 6 6 7 7 7 7 8 8 8 8 9 9 10 10 9

Trp = W 0 0 1 1 3 3 4 4 4 5 5 5 5 5 6 7 7 7 8 8 8 9 10

Hydrophobicity scoring matrix constructed from hydrophilicity data (M.Levitt, J. Mol. Biol. 104, 59 [1976]), derived by George et al. 1990, Mutation Data Matrix and Its Uses, Methods in Enzymology 183, 333.

1 PAM evolutionary distance

Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val

A R N D C Q E G H I L K M F P S T W Y V

Ala A 9867 2 9 10 3 8 17 21 2 6 4 2 6 2 22 35 32 0 2 18

Arg R 1 9913 1 0 1 10 0 0 10 3 1 19 4 1 4 6 1 8 0 1

Asn N 4 1 9822 36 0 4 6 6 21 3 1 13 0 1 2 20 9 1 4 1

Asp D 6 0 42 9859 0 6 53 6 4 1 0 3 0 0 1 5 3 0 0 1

Cys C 1 1 0 0 9973 0 0 0 1 1 0 0 0 0 1 5 1 0 3 2

Gln Q 3 9 4 5 0 9876 27 1 23 1 3 6 4 0 6 2 2 0 0 1

Glu E 10 0 7 56 0 35 9865 4 2 3 1 4 1 0 3 4 2 0 1 2

Gly G 21 1 12 11 1 3 7 9935 1 0 1 2 1 1 3 21 3 0 0 5

His H 1 8 18 3 1 20 1 0 9912 0 1 1 0 2 3 1 1 1 4 1

Ile I 2 2 3 1 2 1 2 0 0 9872 9 2 12 7 0 1 7 0 1 33

Leu L 3 1 3 0 0 6 1 1 4 22 9947 2 45 13 3 1 3 4 2 15

Lys K 2 37 25 6 0 12 7 2 2 4 1 9926 20 0 3 8 11 0 1 1

Met M 1 1 0 0 0 2 0 0 0 5 8 4 9874 1 0 1 2 0 0 4

Phe F 1 1 1 0 0 0 0 1 2 8 6 0 4 9946 0 2 1 3 28 0

Pro P 13 5 2 1 1 8 3 2 5 1 2 2 1 1 9926 12 4 0 0 2

Ser S 28 11 34 7 11 4 6 16 2 2 1 7 4 3 17 9840 38 5 2 2

Thr T 22 2 13 4 1 3 2 2 1 11 2 8 6 1 5 32 9871 0 2 9

Trp W 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9976 1 0

Tyr Y 1 0 3 0 3 0 1 0 4 1 1 0 0 21 0 1 1 2 9945 1

Val V 13 2 1 1 3 2 2 3 3 57 11 1 17 1 3 2 10 0 2 9901

[top row shows original amino acid; left column shows replacement amino acid]

Mutation probability matrix for the evolutionary distance of 1 PAM (i.e., one Accepted Point Mutation per 100 amino acids). An element of this matrix, [Mij], gives the probability that the amino acid in column j will be replaced by the amino acid in row i after a given evolutionary interval, in this case 1 PAM. Thus, there is a 0.56% probability that Asp will be replaced by Glu. To simplify the appearance, the elements are shown multiplied by 10,000. (Adapted from Figure 82. Atlas of Protein Sequence and Structure, Suppl 3, 1978, M.O. Dayhoff, ed. National Biomedical Research Foundation, 1979.)

PAM 1 mutation matrix

PAM 100 matrix as used in Clustal

C 14,

S -1, 6,

T -5, 2, 7,

P -6, 1, -1, 10,

A -5, 2, 2, 1, 6,

G -8, 1, -3, -3, 1, 8,

N -8, 2, 0, -3, -1, -1, 7,

D -11, -1, -2, -4, -1, -1, 4, 8,

E -11, -2, -3, -3, 0, -2, 1, 5, 8,

Q -11, -3, -3, -1, -2, -5, -1, 1, 4, 9,

H -6, -4, -5, -2, -5, -7, 2, -1, -2, 4, 11,

R -6, -1, -4, -2, -5, -8, -3, -6, -5, 1, 1, 10,

K -11, -2, -1, -4, -4, -5, 1, -2, -2, -1, -3, 3, 8,

M -11, -4, -2, -6, -3, -8, -5, -8, -6, -2, -7, -2, 1, 13,

I -5, -4, -1, -6, -3, -7, -4, -6, -5, -5, -7, -4, -4, 2, 9,

L -12, -7, -5, -5, -5, -8, -6, -9, -7, -3, -5, -7, -6, 4, 2, 9,

V -4, -4, -1, -4, 0, -4, -5, -6, -5, -5, -6, -6, -6, 1, 5, 1, 8,

F -10, -5, -6, -9, -7, -8, -6,-11,-11,-10, -4, -7,-11, -2, 0, 0, -5, 12,

Y -2, -6, -6,-11, -6,-11, -3, -9, -7, -9, -1,-10,-10, -8, -4, -5, -6, 6, 13,

W -13, -4,-10,-11,-11,-13, -8,-13,-14,-11, -7, 1, -9,-11,-12, -7,-14, -2, -2, 19,


PAM 250 matrix as used in Clustal

C 12,

S 0, 2,

T -2, 1, 3,

P -3, 1, 0, 6,

A -2, 1, 1, 1, 2,

G -3, 1, 0,-1, 1, 5,

N -4, 1, 0,-1, 0, 0, 2,

D -5, 0, 0,-1, 0, 1, 2, 4,

E -5, 0, 0,-1, 0, 0, 1, 3, 4,

Q -5,-1,-1, 0, 0,-1, 1, 2, 2, 4,

H -3,-1,-1, 0,-1,-2, 2, 1, 1, 3, 6,

R -4, 0,-1, 0,-2,-3, 0,-1,-1, 1, 2, 6,

K -5, 0, 0,-1,-1,-2, 1, 0, 0, 1, 0, 3, 5,

M -5,-2,-1,-2,-1,-3,-2,-3,-2,-1,-2, 0, 0, 6,

I -2,-1, 0,-2,-1,-3,-2,-2,-2,-2,-2,-2,-2, 2, 5,

L -6,-3,-2,-3,-2,-4,-3,-4,-3,-2,-2,-3,-3, 4, 2, 6,

V -2,-1, 0,-1, 0,-1,-2,-2,-2,-2,-2,-2,-2, 2, 4, 2, 4,

F -4,-3,-3,-5,-4,-5,-4,-6,-5,-5,-2,-4,-5, 0, 1, 2,-1, 9,

Y 0,-3,-3,-5,-3,-5,-2,-4,-4,-4, 0,-4,-4,-2,-1,-1,-2, 7,10,

W -8,-2,-5,-6,-6,-7,-4,-7,-7,-5,-3, 2,-3,-4,-5,-2,-6, 0, 0,17,


Matrices often used for the alignment of proteins

• PAM 250 (Dayhoff et al., 1978)
• BLOSUM62 (Henikoff-Henikoff, 1992)
• JTT (Jones et al., 1992)
• mtREV24 (Adachi-Hasegawa, 1996)
• GONNET matrix (Gonnet et al., 1992)

Multiple alignment of protein sequences

• For the construction of reliable phylogenetic trees the quality of a multiple alignment is of the utmost importance

• There are many programs available for the multiple alignment of proteins. – A good program in the public domain is: ClustalW – A similar program is Pileup of the GCG package

• They quickly align sequence pairs and roughly determine the degrees of identity between each pair

• Then the sequences are aligned more precisely in a progressive way starting with the two closest sequences

Most programs work best when the sequences have similar length.
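The first step of a progressive alignment, ranking sequence pairs by their degree of identity, can be illustrated as follows. This is only a sketch under stated assumptions: the sequences are taken to be already aligned (equal length, `-` for gaps), identity is counted over columns where neither sequence has a gap, and the function names are hypothetical. Real programs differ in how they treat gaps and in how they obtain the quick pairwise alignments.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percentage of identical residues over columns without a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def closest_pair(aligned):
    """Names of the two sequences with the highest pairwise identity."""
    return max(
        combinations(aligned, 2),
        key=lambda pair: percent_identity(aligned[pair[0]], aligned[pair[1]]),
    )


# Toy example with made-up sequences.
toy = {"s1": "ACDEFG", "s2": "ACDEFH", "s3": "TTTEFG"}
print(closest_pair(toy))
```

The pair returned here would be the one aligned first; the remaining sequences are then added in order of decreasing similarity.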


• An automatically produced multiple alignment often needs manual adjustment to improve the quality of the alignment.

• Such improvement can be obtained by using all the knowledge that is available about a protein.

• If a structure is available you should use the detailed information about secondary structure for the alignment.

Tree construction methods (1)

• Distance matrix methods

– Cluster analysis (UPGMA, WPGMA, etc)

– Fitch & Margoliash (1967)

– Transformed distance methods (eg. Li, 1981)

– Neighbor-joining (Saitou & Nei, 1987)

– ...many more

• Parsimony methods

– Maximum parsimony

• Other methods

– Maximum likelihood (Felsenstein, 1981)

– ... many more

Tree construction methods (2)

• Character-based methods: – maximum parsimony – maximum likelihood

• Non-character-based methods: – distance matrix methods

Phylogeny (2)

• Distance matrix methods (in the public domain)
– Least squares method (Fitch and Margoliash)
  - Fitch, Kitsch of the Phylip package (Joe Felsenstein, Univ. Washington)
– Neighbor-joining method
  - Neighbor of the Phylip package (Joe Felsenstein, Univ. Washington)
  - Clustal, or Distnj in the Protml package (Adachi and Hasegawa, Univ. Tokyo)
  - Darwin (Gaston Gonnet, ETH, Zurich, via mailserver or WWW)

• Protein maximum likelihood (in the public domain)
– Protml (Adachi and Hasegawa, Univ. Tokyo) (very CPU intensive)
– TreePuzzle (Strimmer and von Haeseler, 1997)

• Protein maximum parsimony (in the public domain)
– Protpars (Joe Felsenstein, Univ. Washington)
– Paup (David Swofford, latest version will be commercial)

Some useful information about phylogenetic trees

[Figure: rooted phylogenetic tree with nodes A to I, labelled with the OTUs, the root, the external nodes and the internal nodes.]

A-E are external nodes (extant); F-I are internal (ancestral) nodes.

OTUs are operational taxonomic units. They can be species or sequences, and they can be extant (existing) or extinct (ancestral).

Topology: order of the nodes on the tree

Distance Matrix Methods

• UPGMA (Unweighted Pair Group Method with Arithmetic Mean) uses real (uncorrected) distance values and a sequential clustering algorithm. (Should only be used with closely related OTUs, or when there is constancy of evolutionary rate.)

• Transformed distance methods. Corrections may be introduced to obtain trees with true evolutionary distances (PAM values, Kimura), or corrections are carried out with reference to an outgroup (Farris, 1971; Klotz et al., 1979). Should be used when evolutionarily distant organisms are included in the dataset.

• Neighbors relation methods
– FITCH (Fitch, 1981)
– Neighbor-Joining method (Saitou and Nei, 1987)

Should all be used with corrected (see above) distance matrices.

Distance matrix

Uncorrected for Multiple Substitutions

       1      2      3      4      5
    0.00   0.63   0.63  22.88  18.50   AC007866_13  1
           0.00   0.63  22.57  18.50   AC007866_17  2
                  0.00  22.88  17.87   AC007866_15  3
                         0.00   5.64   AC007866_9   4
                                0.00   AC007866_11  5

Using the Kimura correction method
Gap weighting is 0.000000

       1      2      3      4      5
    0.00   0.63   0.63  27.35  21.29   AC007866_13  1
           0.00   0.63  26.90  21.29   AC007866_17  2
                  0.00  27.35  20.47   AC007866_15  3
                         0.00   5.88   AC007866_9   4
                                0.00   AC007866_11  5

Distance matrix as produced by the EMBOSS program distmat
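The relation between the two matrices can be checked with a short calculation. The sketch below assumes that the correction applied is Kimura's empirical formula for proteins, D = -ln(1 - p - 0.2 p^2), with p the uncorrected fraction of differing sites; the slides do not spell the formula out, but it reproduces the tabulated values to within rounding. The function name is hypothetical.

```python
import math


def kimura_protein_distance(percent_difference):
    """Kimura-corrected distance (per 100 sites) from an uncorrected percentage difference.

    Assumes the empirical protein formula D = -ln(1 - p - 0.2 * p**2).
    """
    p = percent_difference / 100.0
    return -100.0 * math.log(1.0 - p - 0.2 * p * p)


# Uncorrected values taken from the first matrix above.
for observed in (0.63, 5.64, 17.87, 22.57, 22.88):
    print(observed, round(kimura_protein_distance(observed), 2))
```

The correction is negligible for nearly identical sequences and grows quickly with the observed difference, which is why it matters most for distantly related OTUs.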

UPGMA

• UPGMA (Unweighted Pair Group Method with Arithmetic Mean) uses real (uncorrected) distance values and a sequential clustering algorithm. (Should only be used with closely related OTUs, or when there is constancy of evolutionary rate.)

Tree construction (UPGMA)

First cycle

     A  B  C  D  E
B    2
C    4  4
D    6  6  6
E    6  6  6  4
F    8  8  8  8  8

Cluster the pair of OTUs with the smallest distance, being A and B. The branching point is positioned at a distance of 2 / 2 = 1 substitution.

Following the first clustering A and B are considered as a single composite OTU(A,B) and we now calculate the new distance matrix as follows:

dist(A,B),C = (distAC + distBC) / 2 = 4

dist(A,B),D = (distAD + distBD) / 2 = 6

dist(A,B),E = (distAE + distBE) / 2 = 6

dist(A,B),F = (distAF + distBF) / 2 = 8

In other words the distance between a simple OTU and a composite OTU is the average of the distances between the simple OTU and the constituent simple OTUs of the composite OTU. Then a new distance matrix is recalculated using the newly calculated distances and the whole cycle is repeated:

Tree construction (UPGMA)

Second cycle

      A,B  C  D  E
C      4
D      6   6
E      6   6  4
F      8   8  8  8

Third cycle

      A,B  C  D,E
C      4
D,E    6   6
F      8   8  8

Fourth cycle

      AB,C  D,E
D,E    6
F      8    8

Fifth cycle

      ABC,DE
F      8

The final step consists of clustering the last OTU, F, with the composite OTU.
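The five cycles can be reproduced with a compact implementation. This is a sketch rather than the code of any particular package: function and variable names are invented for the example, distances to a composite OTU are averaged over all its constituent OTUs, and ties (such as the two distances of 4 in the second cycle) are broken by whichever pair is encountered first, so the order of the second and third merges may differ from the slides while the branching heights stay the same.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the list of (cluster label, branching height) in merge order.

    `dist` maps frozenset({a, b}) to the distance between OTUs a and b.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of OTUs in it
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        # Pick the pair of clusters with the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        new = f"({a},{b})"
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance to the composite OTU: average over its constituent OTUs.
        for c in sizes:
            d[frozenset((new, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[new] = na + nb
        merges.append((new, height))
    return merges


# Distance matrix of the first cycle (lower triangle, columns A to E).
rows = {"B": [2], "C": [4, 4], "D": [6, 6, 6], "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
dist = {frozenset((r, c)): v for r, vals in rows.items() for c, v in zip("ABCDE", vals)}

merges = upgma("ABCDEF", dist)
for label, height in merges:
    print(label, height)
```

Each height is half the distance between the two clusters being joined, exactly as in the first cycle where A and B are joined at 2 / 2 = 1.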

Pitfalls of UPGMA

• The UPGMA clustering method is very sensitive to unequal evolutionary rates.
• Clustering works only if the data are ultrametric.
• Ultrametric distances are defined by the satisfaction of the 'three-point condition'.

The three-point condition

• For any three taxa A, B and C: dist AC <= max (dist AB, dist BC), or
• in words: the two greatest distances are equal.

• UPGMA assumes that the evolutionary rate is the same for all branches.

• If the assumption of rate constancy among lineages does not hold, UPGMA may give an erroneous topology.
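Whether a distance matrix satisfies the three-point condition can be tested directly before UPGMA is applied. The sketch below checks, for every triplet of OTUs, that the two greatest distances are equal; the function name and the tolerance parameter are assumptions made for this example.

```python
from itertools import combinations


def is_ultrametric(names, dist, tol=1e-9):
    """True if the two largest distances are equal for every triplet of OTUs."""
    for a, b, c in combinations(names, 3):
        d = sorted(
            [dist[frozenset((a, b))], dist[frozenset((a, c))], dist[frozenset((b, c))]]
        )
        if abs(d[2] - d[1]) > tol:
            return False
    return True


# The matrix used in the UPGMA example (lower triangle, columns A to E).
rows = {"B": [2], "C": [4, 4], "D": [6, 6, 6], "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
example = {frozenset((r, c)): v for r, vals in rows.items() for c, v in zip("ABCDE", vals)}
print(is_ultrametric("ABCDEF", example))

# Lengthening a single distance (here A-C, as if one lineage evolved faster)
# breaks the condition for the triplet A, B, C.
distorted = dict(example)
distorted[frozenset(("A", "C"))] = 5
print(is_ultrametric("ABCDEF", distorted))
```

Real sequence data rarely pass this test exactly, which is one reason for preferring methods that do not assume a constant rate.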

Non-ultrametric tree

Unequal rates of mutation lead to wrong trees

• UPGMA tree construction based on the data of the left tree would result in the erroneous tree at the right

UPGMA (conclusion)

• UPGMA uses real (uncorrected) distance values and a sequential clustering algorithm.
• This method of tree construction is very sensitive to differences in branch length or unequal rates of evolution.
• It should only be used with closely related OTUs, or when there is constancy of evolutionary rate.
• The method is often used in combination with isoenzyme or restriction site data or with morphological criteria.

Maximum Parsimony Methods

• Use sequence information rather than distance information
• For every possible tree, calculate the minimum number of substitutions required at each informative site, and select the tree with the smallest total

Maximum Parsimony analysis (2)

• Parsimony implies that simpler hypotheses are preferable to more complicated ones.

• Maximum parsimony is a character-based method that infers a phylogenetic tree by minimizing the total number of evolutionary steps required to explain a given set of data, or in other words by minimizing the total tree length.

• The steps may be base or amino-acid substitutions for sequence data, or gain and loss events for restriction site data.

• Maximum parsimony, when applied to protein sequence data, either considers each site of the sequence as a multistate unordered character with 20 possible states (the amino acids) (Eck and Dayhoff, 1966), or may take into account the genetic code and the number of mutations, 1, 2 or 3, that is required to explain an observed amino-acid substitution. The latter method is implemented in the PROTPARS program (Felsenstein, 1993).

• The maximum parsimony method searches all possible tree topologies for the optimal (minimal) tree. However, the number of unrooted trees that have to be analysed rapidly increases with the number of OTUs.


• The number of rooted trees (Nr) for n OTUs is given by: $N_r = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$

• The number of unrooted trees (Nu) for n OTUs is given by: $N_u = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$

Number of OTUs   unrooted trees   rooted trees
 2               1                1
 3               1                3
 4               3                15
 5               15               105
 6               105              945
 7               945              10,395
 8               10,395           135,135
 9               135,135          2,027,025
10               2,027,025        34,459,425
15               7.91E12          2.13E14
20               2.22E20          8.20E21

This rapid increase in number of trees to be analysed may make it impossible to apply the method to very large datasets. In that case the parsimony method may become very time consuming, even on very fast computers.
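The growth in the number of trees is easy to verify numerically. The sketch below uses the fact that both formulas reduce to a product of odd numbers, (2n-5)!! for unrooted and (2n-3)!! for rooted trees; the function names are chosen for this example, and the value 1 for the unrooted tree of two OTUs follows from the empty product.

```python
import math


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n OTUs: 1 * 3 * 5 * ... * (2n - 5)."""
    return math.prod(range(1, 2 * n - 4, 2))


def rooted_trees(n):
    """Number of rooted bifurcating trees for n OTUs: 1 * 3 * 5 * ... * (2n - 3)."""
    return math.prod(range(1, 2 * n - 2, 2))


for n in (4, 7, 10, 15, 20):
    print(n, unrooted_trees(n), rooted_trees(n))
```

Note that the number of rooted trees for n OTUs equals the number of unrooted trees for n + 1 OTUs, because a root can be placed on any branch of an unrooted tree.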

maximum parsimony method for 4 nucleic-acid sequences

                 Site
Sequence   1 2 3 4 5 6 7 8 9
   1       A A G A G T G C A
   2       A G C C G T G C G
   3       A G A T A T C C A
   4       A G A G A T C C G

• For four OTUs there are three possible unrooted trees. The trees are then analysed by searching for the ancestral sequences and by counting the number of mutations required to explain the respective trees:

                                                 Number of mutations

(1) AAGAGTGCA                 AGATATCCA (3)
             \4             2/
              \      4      /
          AGCCGTGCG --- AGAGATCCG                Tree I: 11
              /             \
             /0             0\
(2) AGCCGTGCG                 AGAGATCCG (4)

(1) AAGAGTGCA                 AGCCGTGCG (2)
             \1             3/
              \      5      /
          AGGAGTGCA --- AGAGGTCCG                Tree II: 14
              /             \
             /4             1\
(3) AGATATCCA                 AGAGATCCG (4)

(1) AAGAGTGCA                 AGCCGTGCG (2)
             \1             3/
              \      5      /
          AGAAGTGCA --- AGATGTCCG                Tree III: 16
              /             \
             /5             2\
(4) AGAGATCCG                 AGATATCCA (3)

Tree I has the topology with the least number of mutations and thus is the most parsimonious tree.

Ancestral trees are calculated

This analysis includes both informative and non-informative sites in the sequence.

When only informative sites are included, far fewer sites have to be analysed, which in the case of large datasets means a considerable gain in CPU time.

Informative sites

A site is informative only when there are at least two different kinds of nucleotides at the site, each of which is represented in at least two of the sequences under study.

                 Site
Sequence   1 2 3 4 5 6 7 8 9
   1       A A G A G T G C A
   2       A G C C G T G C G
   3       A G A T A T C C A
   4       A G A G A T C C G
                   *   *   *

Informative sites are indicated by an asterisk (*)

1 GGA
2 GGG
3 ACA
4 ACG
  ***

                                 Number of mutations

(1) GGA           ACA (3)
       \1       1/
        \   2   /
      GGG --- ACG                Tree I: 4
        /       \
       /0       0\
(2) GGG           ACG (4)

(1) GGA           GGG (2)
       \1       1/
        \   1   /
      GCA --- GCG                Tree II: 5
        /       \
       /1       1\
(3) ACA           ACG (4)

(1) GGA           GGG (2)
       \2       1/
        \   0   /
      GCG --- GCG                Tree III: 6
        /       \
       /1       2\
(4) ACG           ACA (3)

To infer a maximum parsimony tree, for each possible tree we calculate the minimum number of substitutions at each informative site. In the above example, for sites 5, 7, and 9, tree I requires in total 4 changes, tree II requires 5 changes, and tree III requires 6 changes. In the final step, we sum the number of changes over all the informative sites for each tree and choose the tree associated with the smallest number of substitutions. In our case, tree I is chosen because it requires the smallest number of changes (4) at the informative sites.
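The selection of informative sites and the counting of changes per tree can be sketched in code. The counting step below uses Fitch's set-intersection algorithm, which is one standard way of obtaining the minimum number of substitutions and is named here as an assumption, since the slides do not state which algorithm was used. Trees are written as nested pairs of sequence names, and all function names are invented for the example.

```python
from collections import Counter


def informative_sites(seqs):
    """1-based positions with at least two states that each occur in at least two sequences."""
    sites = []
    for j, column in enumerate(zip(*seqs.values()), start=1):
        counts = Counter(column)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(j)
    return sites


def fitch_length(tree, seqs):
    """Minimum number of substitutions needed on `tree` (nested pairs of names)."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for j in range(n_sites):
        def states(node):
            nonlocal total
            if isinstance(node, str):          # a tip: its observed nucleotide
                return {seqs[node][j]}
            left, right = (states(child) for child in node)
            if left & right:                   # a shared state: no change needed here
                return left & right
            total += 1                         # no shared state: one substitution
            return left | right
        states(tree)
    return total


seqs = {"1": "AAGAGTGCA", "2": "AGCCGTGCG", "3": "AGATATCCA", "4": "AGAGATCCG"}
sites = informative_sites(seqs)
reduced = {name: "".join(s[j - 1] for j in sites) for name, s in seqs.items()}

trees = {
    "Tree I": (("1", "2"), ("3", "4")),
    "Tree II": (("1", "3"), ("2", "4")),
    "Tree III": (("1", "4"), ("2", "3")),
}
lengths = {name: fitch_length(tree, reduced) for name, tree in trees.items()}
print(sites, lengths)
```

The unrooted trees are rooted on their internal branch purely for bookkeeping; the number of changes found this way does not depend on where the root is placed.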

Informative sites only

How to find the best tree?

• Maximum parsimony searches for the optimal (minimal) tree. In this process more than one minimal tree may be found. To guarantee that the best possible tree is found, an exhaustive evaluation of all possible tree topologies has to be carried out. However, this becomes impossible when there are more than 12 OTUs in a dataset.

• Branch and Bound is a variation on maximum parsimony that guarantees finding the minimal tree without having to evaluate all possible trees. This way a larger number of taxa can be evaluated, but the method is still limited.

• Heuristic search is a method with step-wise addition and rearrangement (branch swapping) of OTUs. Here it is not guaranteed that the best tree is found.

• Since, in view of the size of the dataset, it is often not possible to carry out an exhaustive or other search for the best tree, it is advised to change the order of the taxa in the dataset and to repeat the analysis, or to instruct the program to do this for you by providing a so-called jumble factor.

Consensus tree

• Since the Maximum Parsimony method may result in more than one equally parsimonious tree, a consensus tree should be created. For the creation of a consensus tree see bootstrapping.

Parsimony and branch lengths

(1) G         A (3)
     \1     0/
      \  1  /
       C---A
      /     \
     /0     1\
(2) C         T (4)

(1) G         A (3)
     \0     1/
      \  1  /
       G---T
      /     \
     /1     0\
(2) C         T (4)

(1) G         A (3)
     \1     1/
      \  1  /
       C---A
      /     \
     /0     0\
(2) C         A (4)

3 possible trees for 4 OTUs, all describe the same final state by assuming a total of 3 steps.

Each final state is arrived at via a different route.

Each of the three trees is equally valid, but the number of steps along the individual branches (or the length of each branch) is not determined.

For this reason branch lengths are not given in parsimony, but only the total number of steps for a tree.

Some final notes on maximum parsimony

• Maximum Parsimony (positive points):
– is based on shared and derived characters. It therefore is a cladistic rather than a phenetic method
– does not reduce sequence information to a single number
– tries to provide information on the ancestral sequences
– evaluates different trees

• Maximum Parsimony (negative points):
– does not assume an evolutionary model
– is slow in comparison with distance methods
– does not use all the sequence information (only informative sites are used)
– does not correct for multiple mutations (does not imply a model of evolution)
– does not provide information on the branch lengths
– is notorious for its sensitivity to codon bias

Reconstruction of ancestral sequences (1)

• Ancestral sequences of bacterial elongation factors were predicted using MP and a collection of extant sequences

• Ancestral elongation factors were overexpressed and purified

• Thermal stabilities of the resurrected proteins were determined and plotted against geological time

• Results show that ancient life has adapted to changes of temperature throughout evolutionary history

Gaucher et al., 2008

How to root an unrooted tree?

• The majority of methods yield unrooted trees.

• To root a tree one should add an outgroup to the dataset. An outgroup is an OTU for which external information (e.g. paleontological information) is available that indicates that the outgroup branched off before all other taxa.

• Do not choose an outgroup that is very distantly related to your taxa. This may result in serious topological errors.

• Do not choose an outgroup that is too closely related to the taxa in question either. In this case it may not be a true outgroup.

• The use of more than one outgroup generally improves the estimate of tree topology.

• In the absence of a good outgroup the root may be positioned by assuming approximately equal evolutionary rates over all the branches. In this way the root is put at the midpoint of the longest pathway between two OTUs.

Maximum likelihood

• Maximum likelihood evaluates a hypothesis about evolutionary history in terms of the probability that the proposed model and the hypothesized history would give rise to the observed data set. A history with a higher probability of reaching the observed state is preferred to a history with a lower probability. The method searches for the tree with the highest probability or likelihood.

• The following programs are available from the web:
– DNAML (DNA data only. By Joe Felsenstein in the Phylip package)
– FastDNAML (DNA data only. A faster algorithm applied by Gary Olsen to Joe Felsenstein's DNAML program)
– ProtML (DNA and protein. By Adachi and Hasegawa, 1992)
– TreePuzzle (DNA and protein. By Strimmer and von Haeseler, 1995). This program applies a heuristic method and is much faster than PROTML, but does not guarantee to find the best tree.

Advantages and disadvantages of the maximum likelihood method

• There are some supposed advantages of maximum likelihood methods over other methods.

– It is the estimation method least affected by sampling error

– It is robust to many violations of the assumptions in the evolutionary model

– With very short sequences it tends to outperform alternative methods such as parsimony or distance methods

– The method is statistically well founded

– It evaluates different tree topologies

– It uses all the sequence information

• There are also some supposed disadvantages

– Maximum likelihood is very CPU intensive and thus extremely slow

– The result is dependent on the model of evolution used

Explication of the method

Maximum likelihood evaluates the probability that the chosen evolutionary model will have generated the observed sequences. Phylogenies are then inferred by finding those trees that yield the highest likelihood. Assume that we have the aligned nucleotide sequences for four taxa:

      1         j            N
(1)   A G G C U C C A A .... A
(2)   A G G U U C G A A .... A
(3)   A G C C C A G A A .... A
(4)   A U U U C G G A A .... C

and we want to evaluate the likelihood of the unrooted tree represented by the nucleotides of site j in the sequence and shown below:

(1)       (2)
   \     /
    \   /
    -----
    /   \
   /     \
(3)       (4)

What is the probability that this tree would have generated the data presented in the sequence under the chosen model?

The models are time-reversible, therefore the likelihood of the tree is independent of the position of the root. Thus it is convenient to root the tree at an arbitrary internal node.

C   C       A   G
 \ /        |  /
  \/        | /
   A        |/
    \      /
     \    /
      \  /
       A

$L(j) = \sum_{x_5}\sum_{x_6} \mathrm{Prob}(\mathrm{C}, \mathrm{C}, \mathrm{A}, \mathrm{G}, x_5, x_6)$, where $x_5$ and $x_6$ are the nucleotides occupying the internal nodes (5) and (6) of the rooted tree.

Assume that nucleotide sites evolve independently (the Markovian model of evolution). Then we can calculate the likelihood for each site separately and combine these to the total likelihood.

For the likelihood for site j, we have to consider all the possible scenarios by which the nucleotides present at the tips of the tree could have evolved. So the likelihood for a particular site is the summation of the probabilities of every possible reconstruction of ancestral states, given some model of base substitution. In this specific case that means all possible nucleotides A, G, C and T occupying nodes (5) and (6), or 4 x 4 = 16 possibilities.

In the case of protein sequences each site may occupy 20 states (those of the 20 amino acids) and thus 400 possibilities have to be considered. Since any one of these scenarios could have led to the amino-acid configuration at the tips of the tree, we must calculate the probability of each and sum them to obtain the total probability for each site j.

Likelihood for one site

Likelihood for the full tree

The likelihood for the full tree then is the product of the likelihood at each site.

$L = L(1) \times L(2) \times \cdots \times L(N) = \prod_{j=1}^{N} L(j)$

Since the individual likelihoods are extremely small numbers it is convenient to sum the log likelihoods at each site and report the likelihood of the entire tree as the log likelihood.

$\ln L = \ln L(1) + \ln L(2) + \cdots + \ln L(N) = \sum_{j=1}^{N} \ln L(j)$
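The per-site summation over the 16 ancestral reconstructions, and the log likelihood of the whole alignment, can be made concrete with a small calculation. Everything model-specific below is an assumption made for illustration and is not taken from the slides: the Jukes-Cantor substitution model, a single branch length t = 0.1 on every branch, equal base frequencies at the root, and a tree in which node (5) joins sequences 1 and 2 while the root (6) joins node (5), sequence 3 and sequence 4. Only the first nine sites of the example alignment are used.

```python
import math
from itertools import product

ALPHABET = "ACGU"


def jc_prob(x, y, t):
    """Jukes-Cantor probability of observing y after a branch of length t starting from x."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def site_likelihood(tips, t=0.1):
    """L(j): sum over the 16 states of nodes (5) and (6) for tips (s1, s2, s3, s4)."""
    s1, s2, s3, s4 = tips
    total = 0.0
    for x5, x6 in product(ALPHABET, repeat=2):
        total += (
            0.25                      # assumed equal base frequencies at the root (6)
            * jc_prob(x6, x5, t)      # root (6) -> internal node (5)
            * jc_prob(x5, s1, t)      # node (5) -> sequence 1
            * jc_prob(x5, s2, t)      # node (5) -> sequence 2
            * jc_prob(x6, s3, t)      # root (6) -> sequence 3
            * jc_prob(x6, s4, t)      # root (6) -> sequence 4
        )
    return total


def log_likelihood(seqs, t=0.1):
    """ln L: sum of ln L(j) over all sites of the alignment."""
    return sum(math.log(site_likelihood(column, t)) for column in zip(*seqs))


alignment = ["AGGCUCCAA", "AGGUUCGAA", "AGCCCAGAA", "AUUUCGGAA"]
print(site_likelihood(("C", "C", "A", "G")))
print(log_likelihood(alignment))
```

Repeating the calculation for the other two topologies, and optimising the branch lengths for each, is what a maximum likelihood program does on a much larger scale.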

The model of evolution

The PROTML program in the MOLPHY package (Adachi and Hasegawa, 1992), as well as the TreePUZZLE program by Strimmer and von Haeseler (1995), have implemented an instantaneous rate matrix derived from the Dayhoff empirical substitution matrix. This has been called the Dayhoff model.

Recently a model called the JTT model of evolution, based upon the updated empirical substitution matrix of Jones et al. (1992), has been developed and implemented in these programs.

The maximum likelihood tree

• The above procedure is then repeated for all possible topologies (or for all possible trees).

• The tree with the highest probability is the maximum likelihood tree.

Bootstrapping

• Bootstrapping is a way of testing the reliability of the dataset. It is the creation of pseudoreplicate datasets by resampling. Bootstrapping allows you to assess whether the distribution of characters has been influenced by stochastic effects. In phylogenetic analyses nonparametric bootstrapping is the most commonly used method. The pseudoreplicate datasets are generated by randomly sampling the original character matrix to create new matrices of the same size as the original. The frequency with which a given branch is found is recorded as the bootstrap proportion. These proportions can be used as a measure of the reliability (within limitations) of individual branches in the optimal tree.

• Thus bootstrap analysis:

– is a statistical method for obtaining an estimate of error

– is used to evaluate the reliability of a tree

– is used to examine how often a particular cluster in a tree appears when nucleotides or amino acids are resampled

• NB: If the entire dataset is compatible and has not been biased by stochastic effects, all bootstrap trees should in principle have the same topology!

The practice of bootstrapping and the construction of a consensus tree

Take a dataset consisting of in total n sequences with m sites each (see below). A number of resampled datasets of the same size (n x m) as the original dataset is produced. Each site is sampled at random with replacement, and no more sites are sampled than there were original sites, so that some sites are drawn several times and others not at all. In order to be statistically significant the number of resampled datasets should be high, and equal to or higher than the number of individual sites present in the dataset.

Our example dataset consists of in total 4 sequences with 10 sites each (see below). When three new datasets are prepared by random sampling of sites, the following three sample sets of data can be obtained (original dataset on the left, resampled dataset on the right, followed by the distance matrix of the resampled dataset):

Sample 1

    0 1 2 0 3 0 1 2 0 1   (<- number of times each site is sampled)
    ___________________
A   A G G C U C C A A A        A   G G G U U U C A A A
B   A G G U U C G A A A        B   G G G U U U G A A A
C   A G C C C C G A A A        C   G C C C C C G A A A
D   A U U U C C G A A C        D   U U U C C C G A A C

    A  B  C
B   1
C   6  5
D   8  7  4

Sample 2

    1 0 0 0 2 2 2 0 0 3
    ___________________
A   A G G C U C C A A A        A   A U U C C C C A A A
B   A G G U U C G A A A        B   A U U C C G G A A A
C   A G C C C C G A A A        C   A C C C C G G A A A
D   A U U U C C G A A C        D   A C C C C G G C C C

    A  B  C
B   2
C   4  2
D   7  5  3

Sample 3

    1 0 0 0 2 2 2 0 0 3
    ___________________
A   A G G C U C C A A A        A   A U U C C C C A A A
B   A G G U U C G A A A        B   A U U C C G G A A A
C   A G C C C C G A A A        C   A C C C C G G A A A
D   A U U U C C G A A C        D   A C C C C G G C C C

    A  B  C
B   1
C   3  2
D   6  3  4
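The resampling step can be sketched as follows. Sites are drawn at random with replacement, which yields a count per site of the kind shown above each sample, and every sequence is rebuilt from those counts. The function names are invented for this illustration, and the distance is simply the number of differing sites; tree building and the consensus step are left out.

```python
import random
from collections import Counter


def bootstrap_weights(m, rng):
    """Number of times each of the m sites is drawn when m sites are sampled with replacement."""
    picks = Counter(rng.randrange(m) for _ in range(m))
    return [picks[j] for j in range(m)]


def resample(seqs, weights):
    """Rebuild every sequence, repeating site j as many times as weights[j] says."""
    return {
        name: "".join(ch * w for ch, w in zip(seq, weights))
        for name, seq in seqs.items()
    }


def differences(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


original = {
    "A": "AGGCUCCAAA",
    "B": "AGGUUCGAAA",
    "C": "AGCCCCGAAA",
    "D": "AUUUCCGAAC",
}

# Reproduce Sample 1 from its site counts.
sample1 = resample(original, [0, 1, 2, 0, 3, 0, 1, 2, 0, 1])
print(sample1)
print(differences(sample1["A"], sample1["B"]))

# Draw a fresh pseudoreplicate; the seed is fixed only to make the run repeatable.
rng = random.Random(1)
weights = bootstrap_weights(10, rng)
print(weights, resample(original, weights))
```

In a real analysis this is repeated hundreds of times, a tree is built from each pseudoreplicate, and the trees are summarised in a consensus tree.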

Consensus treeA large number of datasets (between hundred and thousand, depending on computer power) and the same number of different trees are so generated. In this specific case taxa A and B form a cluster in all three trees, while C clusters with D in only one tree. There exist specialised programs, such as the program Consense in the Phylip package of Joe Felsenstein, that are able to analyse all the resulting trees and prepare the most likely tree or consensus tree from those data.

The resulting consensus tree for our small dataset is shown below. The number of times each branch point or node occurred (the so-called bootstrap proportion) is indicated at each node.

Result

A    A G G C U C C A A A
B    A G G U U C G A A A
C    A G C C C C G A A A
D    A U U U C C G A A C

Pairwise differences between the original sequences:

     A   B   C
B    2
C    3   3
D    6   4   4
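How often each cluster occurs across the replicate trees can be counted with a few lines of Python. This is only a sketch of the counting idea, not the algorithm used by Consense. The three trees, written as nested tuples, are hypothetical: they were chosen only to agree with the statement that A and B cluster in all three trees and C with D in one of them.

```python
from collections import Counter

def clusters(tree):
    """Return (leaves, set of clusters) for a tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_found = clusters(child)
        leaves |= child_leaves
        found |= child_found
    found.add(leaves)
    return leaves, found

# Hypothetical replicate trees (an assumption made for this example).
trees = [
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "C"), "D"),
]

counts = Counter()
for tree in trees:
    all_leaves, found = clusters(tree)
    found.discard(all_leaves)  # the cluster of all taxa carries no information
    counts.update(found)

# Bootstrap proportion: fraction of replicate trees containing the cluster.
proportions = {tuple(sorted(c)): n / len(trees) for c, n in counts.items()}
for cluster, p in sorted(proportions.items()):
    print(cluster, round(p, 2))
```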

Again some good advice (1)

• Tree topologies may strongly depend on the following:
– DNA or protein used in the analysis
– Distance or parsimony methods applied
– The number of OTUs included in the alignment
– The order of the OTUs in the alignment
– The selection of a good outgroup

• None of the methods can guarantee the one tree with the correct topology

• To get an idea of the reliability of the topology of the resulting tree, one should do one or all of the following:
– Apply more than one method (distance, parsimony) to the dataset.
– Vary the parameters used by the different programs, such as seed value and jumble factor for the order of OTU addition.
– Add or remove one or more OTUs and see how this influences tree topology.
– Try to include an outgroup that may serve as a root for your tree.
– Apply bootstrap or jackknife analyses to your dataset and prepare a consensus tree of 100 - 1000 replicas (depending on the size of the dataset and on computer power).

• Only when widely different methods provide similar or identical tree topologies, and such topologies are supported by good bootstrap values (> 95%), can the trees be considered reliable

Again some good advice (2)

Limitations of the various methods

• Distance approaches (UPGMA, corrected distances and neighbor-joining) do not use the original (sequence) data, but derived distance information. Some information is said to be lost

• Character-state approaches (Maximum Parsimony) are said to be more powerful than distance methods because they use the raw data. However, maximum parsimony uses only the informative sites, which are usually a small fraction of the data. So when the number of informative sites is not large, this method is often less efficient than distance methods (Saitou and Nei, 1986). Maximum parsimony is notorious for its sensitivity to codon bias

• None of the methods is reliable when OTUs with highly unequal evolutionary separation are included in the dataset
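To illustrate how few sites maximum parsimony may actually use, the sketch below counts parsimony-informative sites. It takes the usual definition that a site is informative when at least two different states each occur in at least two sequences. The function name is an assumption, and the four sequences are the small example dataset from the bootstrapping section.

```python
from collections import Counter

def informative_sites(sequences):
    """Return 0-based indices of parsimony-informative columns."""
    sites = []
    for i, column in enumerate(zip(*sequences)):
        counts = Counter(column)
        # Informative: at least two states, each seen in >= 2 sequences.
        shared = [state for state, n in counts.items() if n >= 2]
        if len(shared) >= 2:
            sites.append(i)
    return sites

sequences = ["AGGCUCCAAA", "AGGUUCGAAA", "AGCCCCGAAA", "AUUUCCGAAC"]
sites = informative_sites(sequences)
print(sites, f"{len(sites)} of {len(sequences[0])} sites")
```

For this dataset only two of the ten sites qualify, whereas a distance method counts the differences at all variable sites.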

Some terms used in molecular evolution

• Indel: position in a sequence alignment where one of the sequences has acquired an insertion or extension or has undergone a deletion

• Identity: percentage of identical residues in pairwise aligned sequences. Normally deletions or insertions are not taken into consideration, since it is not possible to tell how many events gave rise to such an indel

• Homology: two sequences are homologous or have homology when they have evolved from a common ancestral sequence. The same holds for the aligned residues in a sequence alignment. Homologous residues are derived from a common ancestral residue

• Similarity and homology should not be expressed as a percentage. Two sequences can be similar, and have a certain percentage of identity, but cannot have a certain percentage of similarity. The same holds for homology.
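The definition of identity given above, with indel positions left out of the count, can be sketched as follows. The function name and the gap symbol "-" are assumptions made for the example, and the two aligned strings are only an illustration.

```python
def percent_identity(s, t, gap="-"):
    """Percentage of identical residues over columns without a gap."""
    if len(s) != len(t):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns in which neither sequence has an indel.
    pairs = [(a, b) for a, b in zip(s, t) if a != gap and b != gap]
    if not pairs:
        return 0.0
    identical = sum(a == b for a, b in pairs)
    return 100.0 * identical / len(pairs)

# 7 columns without a gap, 6 of them identical.
print(round(percent_identity("PHYLOGENY", "PHOLO--NY"), 1))
```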

Some PAM rates

Protein                        PAMs per 100 million years
IG kappa chain C region        37
Lactalbumin                    27
Epidermal growth factor        26
Haptoglobin alpha chain        20
Serum albumin                  19
Phospholipase A                19
Hemoglobin alpha chain         12
Animal lysozyme                9.8
Myoglobin                      8.9
Amyloid AA                     8.7
Acid proteases                 8.4
Myelin basic protein           7.4
Cytochrome b                   4.5
Lactate dehydrogenase          3.4
Adenylate kinase               3.2
Triosephosphate isomerase      2.8
Cytochrome c                   2.2
Plant ferredoxin               1.9
Glutamate dehydrogenase        0.9
Histone H4                     0.1

(Adapted from Table 1. Atlas of Protein Sequence and Structure, Suppl 3, 1978, M.O. Dayhoff, ed. National Biomedical Research Foundation, 1979.)
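If a constant rate is assumed, the figures in the table translate directly into an expected number of accepted point mutations over a given time span. The short sketch below does this arithmetic for a few of the tabulated proteins. The function names and the chosen time span of 50 million years are illustrative assumptions, and the constant-rate assumption is a simplification.

```python
# PAMs per 100 million years, copied from the table above.
PAM_RATES = {
    "IG kappa chain C region": 37,
    "Hemoglobin alpha chain": 12,
    "Cytochrome c": 2.2,
    "Histone H4": 0.1,
}

def expected_pams(rate, million_years):
    """PAMs accumulated over the given time, assuming a constant rate."""
    return rate * million_years / 100.0

def million_years_per_pam(rate):
    """Time needed, at a constant rate, to accumulate 1 PAM."""
    return 100.0 / rate

for protein, rate in PAM_RATES.items():
    print(f"{protein}: {expected_pams(rate, 50):.2f} PAMs in 50 My, "
          f"{million_years_per_pam(rate):.1f} My per PAM")
```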

The three letter amino acid code

A Ala    I Ile    S Ser
B Asx    K Lys    T Thr
C Cys    L Leu    V Val
D Asp    M Met    W Trp
E Glu    N Asn    X Xxx
F Phe    P Pro    Y Tyr
G Gly    Q Gln    Z Glx
H His    R Arg

Consider four hypothetical sequences:

PHYLOGENY, PHOLOGENY, PHLOGENY, PHOLONY

Alignment can be done in various ways:

PHYLOGENY        PHY-LOGENY
PHOLOGENY   or   PH-OLOGENY
PH-LOGENY        PH--LOGENY
PHOLO--NY        PH-OLO--NY
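The two alternatives can be compared with a small sketch that checks each alignment against the four input sequences and counts gaps and mismatching columns. The function names are assumptions made for the example; no scoring scheme from any alignment program is implied.

```python
sequences = ["PHYLOGENY", "PHOLOGENY", "PHLOGENY", "PHOLONY"]

alignment_1 = ["PHYLOGENY", "PHOLOGENY", "PH-LOGENY", "PHOLO--NY"]
alignment_2 = ["PHY-LOGENY", "PH-OLOGENY", "PH--LOGENY", "PH-OLO--NY"]

def is_valid(alignment, sequences, gap="-"):
    """True if all rows have equal length and removing gaps gives the input."""
    same_length = len({len(row) for row in alignment}) == 1
    ungapped = [row.replace(gap, "") for row in alignment]
    return same_length and ungapped == sequences

def gap_count(alignment, gap="-"):
    """Total number of gap characters in the alignment."""
    return sum(row.count(gap) for row in alignment)

def variable_columns(alignment, gap="-"):
    """Number of columns with more than one residue type, gaps ignored."""
    return sum(len(set(col) - {gap}) > 1 for col in zip(*alignment))

for name, aln in (("first", alignment_1), ("second", alignment_2)):
    print(name, is_valid(aln, sequences), gap_count(aln), variable_columns(aln))
```

The first alignment uses fewer gaps but contains a column with a mismatch (Y against O); the second removes the mismatch at the cost of more gaps. Which one is preferred depends on how gaps are penalised relative to substitutions.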


Tree construction using distance-matrix methods

phylogenetic tree constructed from 6 aligned sequences

A MOLECULAR--EVOLUTION

B MOLEKULARE-EVOLUTIEN

C MOLECULAIREEVOLUTIEN

D MO-ECALIAREEFOLUTIE-

E MO-ESALIARE-GOLUTIU-

F NO-ASELIAKE-HODATAU-

[Figure: phylogenetic tree of sequences A to F with branch lengths]
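One distance-matrix method named earlier in this section is UPGMA. The sketch below is a minimal version of it, applied to the six aligned sequences above. Several choices are assumptions made for the example: distances are plain counts of mismatches, columns with a gap in either sequence are skipped, ties are broken by order of appearance, and the tree is returned as nested tuples without branch lengths. The result need not match the tree in the figure, whose method is not stated.

```python
from itertools import combinations

def differences(s, t, gap="-"):
    """Count mismatches, skipping columns where either sequence has a gap."""
    return sum(a != b for a, b in zip(s, t) if gap not in (a, b))

def upgma(names, dist):
    """Cluster taxa by UPGMA and return the tree as nested tuples.

    names: list of taxon labels
    dist:  dict mapping frozenset({x, y}) to the distance between x and y
    """
    size = {name: 1 for name in names}  # cluster -> number of taxa in it
    d = dict(dist)
    while len(size) > 1:
        # Join the two clusters that are closest to each other.
        x, y = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (x, y)
        for z in size:
            if z not in (x, y):
                # Average distance, weighted by cluster size.
                d[frozenset((merged, z))] = (
                    size[x] * d[frozenset((x, z))]
                    + size[y] * d[frozenset((y, z))]
                ) / (size[x] + size[y])
        size[merged] = size[x] + size[y]
        del size[x], size[y]
    return next(iter(size))

alignment = {
    "A": "MOLECULAR--EVOLUTION",
    "B": "MOLEKULARE-EVOLUTIEN",
    "C": "MOLECULAIREEVOLUTIEN",
    "D": "MO-ECALIAREEFOLUTIE-",
    "E": "MO-ESALIARE-GOLUTIU-",
    "F": "NO-ASELIAKE-HODATAU-",
}

dist = {frozenset((x, y)): differences(alignment[x], alignment[y])
        for x, y in combinations(alignment, 2)}
tree = upgma(list(alignment), dist)
print(tree)
```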
