sequence alignment and phylogenetic analysis. evolution

Sequence Alignment and Phylogenetic Analysis

Upload: kelley-manning

Post on 13-Jan-2016

TRANSCRIPT

Page 1: Sequence Alignment and Phylogenetic Analysis. Evolution

Sequence Alignment and Phylogenetic Analysis

Page 2: Sequence Alignment and Phylogenetic Analysis. Evolution

Evolution

Page 3: Sequence Alignment and Phylogenetic Analysis. Evolution

Sequence Alignment

Definition: Given two strings x = x1x2...xM and y = y1y2...yN, an alignment is an assignment of gaps to positions 0,…,M in x and 0,…,N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

Unaligned:

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

Aligned:

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Page 4: Sequence Alignment and Phylogenetic Analysis. Evolution

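Example

Sequences: x = HEAGAWGHEE (columns), y = PAWHEAE (rows)

Dynamic-programming matrix after filling the first two rows:

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16
W  -24
H  -32
E  -40
A  -48
E  -56

Substitution scores:

     A    E    G    H    W
A    5   -1    0   -2   -3
E   -1    6   -3    0   -3
H   -2    0   -2   10   -3
P   -1   -1   -2   -2   -4
W   -3   -3   -3   -3   15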

Page 5: Sequence Alignment and Phylogenetic Analysis. Evolution

Completed dynamic-programming matrix:

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

Substitution scores:

     A    E    G    H    W
A    5   -1    0   -2   -3
E   -1    6   -3    0   -3
H   -2    0   -2   10   -3
P   -1   -1   -2   -2   -4
W   -3   -3   -3   -3   15
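
The matrix on this slide looks like the output of the standard global-alignment (Needleman-Wunsch) recurrence. The sketch below is one way to reproduce it, under two assumptions that the slide does not state explicitly: a linear gap penalty of -8 (inferred from the first row 0, -8, -16, ...) and the substitution scores copied from the small table above. The function and variable names are illustrative only.

```python
# Substitution scores copied from the slide: rows are letters of y, columns letters of x.
COLS = "AEGHW"
SCORES = {
    "A": [5, -1, 0, -2, -3],
    "E": [-1, 6, -3, 0, -3],
    "H": [-2, 0, -2, 10, -3],
    "P": [-1, -1, -2, -2, -4],
    "W": [-3, -3, -3, -3, 15],
}
GAP = -8  # assumed linear gap penalty, inferred from the first row of the matrix


def score(a, b):
    """Substitution score for row letter a and column letter b."""
    return SCORES[a][COLS.index(b)]


def global_alignment_matrix(x, y):
    """Fill the matrix F, where F[i][j] is the best score of y[:i] against x[:j]."""
    F = [[0] * (len(x) + 1) for _ in range(len(y) + 1)]
    for j in range(1, len(x) + 1):
        F[0][j] = j * GAP
    for i in range(1, len(y) + 1):
        F[i][0] = i * GAP
        for j in range(1, len(x) + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(y[i - 1], x[j - 1]),  # align two letters
                F[i - 1][j] + GAP,  # letter of y against a gap
                F[i][j - 1] + GAP,  # letter of x against a gap
            )
    return F


F = global_alignment_matrix("HEAGAWGHEE", "PAWHEAE")
print(F[-1][-1])  # score of the best global alignment (bottom-right cell)
```

The bottom-right cell holds the score of the best global alignment; recovering the alignment itself would need an additional traceback step, which is omitted here.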

Page 6: Sequence Alignment and Phylogenetic Analysis. Evolution

The Blosum50 Scoring Matrix

Page 7: Sequence Alignment and Phylogenetic Analysis. Evolution

Multiple Alignment

Page 8: Sequence Alignment and Phylogenetic Analysis. Evolution

Example

Page 9: Sequence Alignment and Phylogenetic Analysis. Evolution

ClustalW

• Popular multiple alignment tool today
• ‘W’ stands for ‘weighted’ (different parts of the alignment are weighted differently)
• Three-step process:
  1.) Construct pairwise alignments
  2.) Build a guide tree
  3.) Progressive alignment guided by the tree

Page 10: Sequence Alignment and Phylogenetic Analysis. Evolution

Step 1: Pairwise Alignment

Page 11: Sequence Alignment and Phylogenetic Analysis. Evolution
Page 12: Sequence Alignment and Phylogenetic Analysis. Evolution

Step 3: Progressive Alignment
• Start by aligning the two most similar sequences
• Following the guide tree, add in the next sequences, aligning to the existing alignment
• Insert gaps as necessary

Page 13: Sequence Alignment and Phylogenetic Analysis. Evolution
Page 14: Sequence Alignment and Phylogenetic Analysis. Evolution

Some Guidelines for Choosing the Right Sequences

Page 15: Sequence Alignment and Phylogenetic Analysis. Evolution

Gathering Sequences with BLAST

• The most convenient way to select your sequences is to use a BLAST server
• Some BLAST servers are integrated with multiple-alignment methods:
  • www.expasy.ch (protein only)
  • srs.ebi.ac.uk (DNA/protein)
  • npsa-pbil.ibcp.fr

Page 16: Sequence Alignment and Phylogenetic Analysis. Evolution

Selecting a Method

• Many alternative methods exist for MSAs
  • Most of them use the progressive algorithm
  • They are all approximate methods
  • None is guaranteed to deliver the best alignments
• All existing methods have pros and cons
  • ClustalW is the most popular (21,000 citations)
  • T-Coffee and ProbCons are more accurate but slower
  • MUSCLE is very fast, ideal for very large datasets

Page 17: Sequence Alignment and Phylogenetic Analysis. Evolution

ClustalW

• www.ebi.ac.uk/clustalw
• pir.georgetown.edu/pirwww/search/multialn.shtml
• www.ddbj.nig.ac.jp/search/clustalw-e.html

Page 18: Sequence Alignment and Phylogenetic Analysis. Evolution
Page 19: Sequence Alignment and Phylogenetic Analysis. Evolution

Tcoffee

• TCOFFEE: www.tcoffee.org
• CORE: evaluate MSA
• MCOFFEE: run many and combine
• EXPRESSO: with structural information

Page 20: Sequence Alignment and Phylogenetic Analysis. Evolution
Page 21: Sequence Alignment and Phylogenetic Analysis. Evolution

Running Many Methods at Once
• MCOFFEE is a meta-method
  • It runs all the individual MSA methods
  • It gathers all the produced MSAs
  • It combines the MSAs into a single MSA

• MCOFFEE is more accurate than any individual method

• Its color output lets you estimate the reliability of your MSA

• MCOFFEE is available on www.tcoffee.org

Page 22: Sequence Alignment and Phylogenetic Analysis. Evolution

Editing and Publishing Alignments

Page 23: Sequence Alignment and Phylogenetic Analysis. Evolution

Alignments and Formats

• Many alternative formats exist for MSAs

• One format does not always have a clear advantage over another

• Changing formats is possible
  • Annotation information can sometimes be lost in a format change
• Not all formats contain the same information
  • The annotation may change
  • Reformatting may cause the loss of annotation information

Page 24: Sequence Alignment and Phylogenetic Analysis. Evolution

The Most Common Sequence Formats

Page 25: Sequence Alignment and Phylogenetic Analysis. Evolution

Interleaved and Non-interleaved

The MSF Format: Interleaved

The FASTA Format: Non-interleaved

Page 26: Sequence Alignment and Phylogenetic Analysis. Evolution

Choosing Your Format
• When choosing a format, ask yourself four questions:
  • Is it supported by the programs I need to use?
  • Can my collaborators use it?
  • Can it support all of my annotation?
  • Is it easy to read and manipulate?

Page 27: Sequence Alignment and Phylogenetic Analysis. Evolution

Converting Formats
• Don’t re-compute your MSA if it is not in the right format
• Convert your file using one of the online conversion tools
• The 3 most popular reformatting utilities:
  • Fmtseq: the most complete
  • READSEQ: very popular and robust
  • SeqCheck: can clean FASTA sequences

Page 28: Sequence Alignment and Phylogenetic Analysis. Evolution

An Alignment

CLUSTAL 2.1 multiple sequence alignment

sp|P02620|PRVB_MERME  ---------------------------------------------AFAGI 5
sp|P02622|PRVB_GADCA  ---------------------------------------------AFKGI 5
sp|P02619|PRVB_ESOLU  ---------------------------------------------SFAGL 5
sp|Q91482|PRVB1_SALSA --------------------------------------------MACAHL 6
sp|P43305|PRVU_CHICK  --------------------------------------------MSLTDI 6
sp|P20472|PRVA_HUMAN  --------------------------------------------MSMTDL 6
sp|P80079|PRVA_FELCA  --------------------------------------------MSMTDL 6
sp|P02627|PRVA_RANES  ---------------------------------------------PMTDL 5
sp|P02626|PRVA_AMPME  ---------------------------------------------SMTDV 5
sp|P02586|TNNC2_RABIT MTDQQAEARSYLSEEMIAEFKAAFDMFDADGGGDISVKELGTVMRMLGQT 50

sp|P02620|PRVB_MERME  LADADITAALAACKAEGS--FKHGEFFTKIG------LKGKSAADIKKVF 47
sp|P02622|PRVB_GADCA  LSNADIKAAEAACFKEGS--FDEDGFYAKVG------LDAFSADELKKLF 47
sp|P02619|PRVB_ESOLU  -KDADVAAALAACSAADS--FKHKEFFAKVG------LASKSLDDVKKAF 46
sp|Q91482|PRVB1_SALSA CKEADIKTALEACKAADT--FSFKTFFHTIG------FASKSADDVKKAF 48
sp|P43305|PRVU_CHICK  LSPSDIAAALRDCQAPDS--FSPKKFFQISG------MSKKSSSQLKEIF 48
sp|P20472|PRVA_HUMAN  LNAEDIKKAVGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVF 48
sp|P80079|PRVA_FELCA  LGAEDIKKAVEAFTAVDS--FDYKKFFQMVG------LKKKSPDDIKKVF 48
sp|P02627|PRVA_RANES  LAAGDISKAVSAFAAPES--FNHKKFFELCG------LKSKSKEIMQKVF 47
sp|P02626|PRVA_AMPME  IPEADINKAIHAFKAGEA--FDFKKFVHLLG------LNKRSPADVTKAF 47
sp|P02586|TNNC2_RABIT PTKEELDAIIEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECF 100

sp|P02620|PRVB_MERME  GIIDQDKSDFVEEDELKLFLQNFSAGARALTDAETATFLKAGDSDGDGKI 97
sp|P02622|PRVB_GADCA  KIADEDKEGFIEEDELKLFLIAFAADLRALTDAETKAFLKAGDSDGDGKI 97
sp|P02619|PRVB_ESOLU  YVIDQDKSGFIEEDELKLFLQNFSPSARALTDAETKAFLADGDKDGDGMI 96
sp|Q91482|PRVB1_SALSA KVIDQDASGFIEVEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMI 98
sp|P43305|PRVU_CHICK  RILDNDQSGFIEEDELKYFLQRFECGARVLTASETKTFLAAADHDGDGKI 98
sp|P20472|PRVA_HUMAN  HMLDKDKSGFIEEDELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKI 98
sp|P80079|PRVA_FELCA  HILDKDKSGFIEEDELGFILKGFYPDARDLSVKETKMLMAAGDKDGDGKI 98
sp|P02627|PRVA_RANES  HVLDQDQSGFIEKEELCLILKGFTPEGRSLSDKETTALLAAGDKDGDGKI 97
sp|P02626|PRVA_AMPME  HILDKDRSGYIEEEELQLILKGFSKEGRELTDKETKDLLIKGDKDGDGKI 97
sp|P02586|TNNC2_RABIT RIFDRNADGYIDAEELAEIFR---ASGEHVTDEEIESLMKDGDKNNDGRI 147

sp|P02620|PRVB_MERME  GVEEFAAMV-----KG 108
sp|P02622|PRVB_GADCA  GVDEFGALVDKWGAKG 113
sp|P02619|PRVB_ESOLU  GVDEFAAMI-----KA 107
sp|Q91482|PRVB1_SALSA GIDEFAVLV-----KQ 109
sp|P43305|PRVU_CHICK  GAEEFQEMV-----QS 109
sp|P20472|PRVA_HUMAN  GVDEFSTLVA----ES 110
sp|P80079|PRVA_FELCA  DVDEFFSLVA----KS 110
sp|P02627|PRVA_RANES  GVDEFVTLVS----ES 109
sp|P02626|PRVA_AMPME  GVDEFTSLVA----ES 109
sp|P02586|TNNC2_RABIT DFDEFLKMMEG---VQ 160

Page 29: Sequence Alignment and Phylogenetic Analysis. Evolution

READSEQ

• http://www.ebi.ac.uk/cgi-bin/readseq.cgi

Page 30: Sequence Alignment and Phylogenetic Analysis. Evolution
Page 31: Sequence Alignment and Phylogenetic Analysis. Evolution

Different Formats (PHYLIP)

Page 32: Sequence Alignment and Phylogenetic Analysis. Evolution

PHYLIP (no gap)

Page 33: Sequence Alignment and Phylogenetic Analysis. Evolution

Different Formats (MSF)

Page 34: Sequence Alignment and Phylogenetic Analysis. Evolution

Converting Formats Can Be Dangerous

• Format conversion can result in data loss

• After converting your file, you must make sure your data is still intact

• The following slide shows the most common losses that occur during conversion
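
One way to check that a file survived conversion is to parse it back and compare it with the input. The sketch below is illustrative only: it converts a tiny FASTA alignment to a strict sequential PHYLIP layout, assuming the classic 10-character limit on PHYLIP names, and reports what changed. The helper names are hypothetical, and the two sequences are the last alignment block of PRVA_HUMAN and PRVA_FELCA from the CLUSTAL example.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        else:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def to_phylip(records, name_width=10):
    """Write sequential PHYLIP text; names are cut or padded to name_width."""
    lines = [f" {len(records)} {len(records[0][1])}"]
    for name, seq in records:
        lines.append(name[:name_width].ljust(name_width) + seq)
    return "\n".join(lines)


def parse_phylip(text, name_width=10):
    """Read back the sequential PHYLIP text produced by to_phylip."""
    rows = text.splitlines()[1:]
    return [(row[:name_width].strip(), row[name_width:]) for row in rows]


fasta = """>sp|P20472|PRVA_HUMAN
GVDEFSTLVA----ES
>sp|P80079|PRVA_FELCA
DVDEFFSLVA----KS
"""
before = parse_fasta(fasta)
after = parse_phylip(to_phylip(before))

# The residues and gaps should be identical; the names may not be.
sequences_intact = [s for _, s in before] == [s for _, s in after]
changed_names = [(b[0], a[0]) for b, a in zip(before, after) if b[0] != a[0]]
print(sequences_intact)
print(changed_names)
```

In this example the sequences survive, but both names are truncated to `sp|P20472|` and `sp|P80079|`, so the species mnemonic is lost. Real converters may handle names differently, so treat this as a pattern for checking, not as a description of any particular tool.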

Page 35: Sequence Alignment and Phylogenetic Analysis. Evolution

Potential Information Loss When Converting MSAs

Page 36: Sequence Alignment and Phylogenetic Analysis. Evolution

Editing your MSA
• If your MSA looks bad . . .
  • Don’t torture the online server
  • Edit the MSA yourself locally
• Never, ever, ever (ever) use a standard word processor
• Always use a dedicated MSA editor
• The most popular online tool is Jalview
  • You can get it at www.jalview.org

Page 37: Sequence Alignment and Phylogenetic Analysis. Evolution

With Jalview You Can . . .
• Modify your MSA
• Remove some of the redundant sequences
• Insert/remove gaps
• Shift portions of the MSA
• Modify the alignment of a sub-group of sequences
• Recompute some portions of your alignment

Page 38: Sequence Alignment and Phylogenetic Analysis. Evolution
Page 39: Sequence Alignment and Phylogenetic Analysis. Evolution
Page 40: Sequence Alignment and Phylogenetic Analysis. Evolution
Page 41: Sequence Alignment and Phylogenetic Analysis. Evolution

Click a sequence to select

Page 42: Sequence Alignment and Phylogenetic Analysis. Evolution

Drag to select columns

Page 43: Sequence Alignment and Phylogenetic Analysis. Evolution

Some Special Features of Jalview
• Computation of a consensus sequence
• Computation of a phylogenetic tree
• Removal of redundancy
• Applying any color scheme to your MSA

Page 44: Sequence Alignment and Phylogenetic Analysis. Evolution

Preparing Your MSA for Publication

• MSAs in publications usually come with shaded colors
• You can improve your MSAs using online tools like Boxshade
• Boxshade will shade your MSA according to its degree of conservation

Page 45: Sequence Alignment and Phylogenetic Analysis. Evolution

MSA => LOGO Graph
• A LOGO graph summarizes an MSA
• Tall letters indicate highly conserved positions
• Short letters indicate poorly conserved positions
• LOGO graphs are ideal for identifying conserved patterns
• weblogo.berkeley.edu/
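
The letter heights in a LOGO graph are usually derived from the information content of each column. As a rough sketch, assuming DNA (alphabet size 4), no gaps, and no small-sample correction, the total height of a column can be approximated as log2(4) minus the Shannon entropy of the column. The toy alignment and function name below are hypothetical.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content in bits: log2(alphabet_size) minus Shannon entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy


alignment = ["ACGT", "ACGA", "ACTC", "AGAG"]  # hypothetical toy DNA alignment
bits = [column_information(col) for col in zip(*alignment)]
print([round(b, 2) for b in bits])  # [2.0, 1.19, 0.5, 0.0]
```

The fully conserved first column reaches the 2-bit maximum (tall letters), while the last column, where all four nucleotides are equally frequent, carries no information (no visible letters).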

Page 46: Sequence Alignment and Phylogenetic Analysis. Evolution

Going Farther
• Your imagination is the limit when it comes to making MSAs nice-looking and informative
• Four very popular and easy-to-install MSA editors:
  • CINEMA
  • Seaview
  • Belvu
  • Kalignview
• Boxshade is the simplest shading tool
• If you need heavier capabilities, try Espript
  • Available at espript.ibpc.fr

Page 47: Sequence Alignment and Phylogenetic Analysis. Evolution

Molecular Evolution and Phylogenetic Reconstruction

Page 48: Sequence Alignment and Phylogenetic Analysis. Evolution

Early Evolutionary Studies
• From Darwin’s time until the early 1960s, anatomical features were the dominant criteria used to derive evolutionary relationships between species
• The evolutionary relationships derived from these relatively subjective observations were often inconclusive, and some were later proved incorrect

Page 49: Sequence Alignment and Phylogenetic Analysis. Evolution

Evolution and DNA Analysis: the Giant Panda Riddle

• For roughly 100 years scientists were unable to figure out which family the giant panda belongs to

• Giant pandas look like bears but have features that are unusual for bears and typical for raccoons, e.g., they do not hibernate

• In 1985, Steven O’Brien and colleagues solved the giant panda classification problem using DNA sequences and algorithms

Page 50: Sequence Alignment and Phylogenetic Analysis. Evolution

Evolutionary Tree of Bears and Raccoons

Page 51: Sequence Alignment and Phylogenetic Analysis. Evolution

Evolutionary Trees: DNA-based Approach

• 50 years ago: Emile Zuckerkandl and Linus Pauling brought reconstructing evolutionary relationships with DNA into the spotlight

• In the first few years after Zuckerkandl and Pauling proposed using DNA for evolutionary studies, the possibility of reconstructing evolutionary trees by DNA analysis was hotly debated

• It is now a dominant approach to studying evolution.

Page 52: Sequence Alignment and Phylogenetic Analysis. Evolution

Who are closer?

Page 53: Sequence Alignment and Phylogenetic Analysis. Evolution

Human-Chimpanzee Split?

Page 54: Sequence Alignment and Phylogenetic Analysis. Evolution

Chimpanzee-Gorilla Split?

Page 55: Sequence Alignment and Phylogenetic Analysis. Evolution

Three-way Split?

Page 56: Sequence Alignment and Phylogenetic Analysis. Evolution

Out of Africa Hypothesis

• Around the time the giant panda riddle was solved, a DNA-based reconstruction of the human evolutionary tree led to the Out of Africa Hypothesis that claims our most ancient ancestor lived in Africa roughly 200,000 years ago

Page 57: Sequence Alignment and Phylogenetic Analysis. Evolution

Human Evolutionary Tree (cont’d)

http://www.mun.ca/biology/scarr/Out_of_Africa2.htm

Page 58: Sequence Alignment and Phylogenetic Analysis. Evolution

The Origin of Humans: ”Out of Africa” vs Multiregional Hypothesis

Out of Africa:
• Humans evolved in Africa ~150,000 years ago
• Humans migrated out of Africa, replacing other humanoids around the globe
• There is no direct descent from Neanderthals

Multiregional:
• Humans evolved in the last two million years as a single species, with modern traits appearing independently in different areas
• Humans migrated out of Africa, mixing with other humanoids on the way
• There is genetic continuity from Neanderthals to humans

Page 59: Sequence Alignment and Phylogenetic Analysis. Evolution

mtDNA analysis supports “Out of Africa” Hypothesis

• African origin of humans inferred from:
  • African population was the most diverse (sub-populations had more time to diverge)
  • The evolutionary tree separated one group of Africans from a group containing all five populations
• Tree was rooted on the branch between the groups of greatest difference

Page 60: Sequence Alignment and Phylogenetic Analysis. Evolution

Evolutionary Tree of Humans: (microsatellites)

• Neighbor joining tree for 14 human populations genotyped with 30 microsatellite loci.

Page 61: Sequence Alignment and Phylogenetic Analysis. Evolution

Human Migration Out of Africa

http://www.becominghuman.org

1. Yorubans
2. Western Pygmies
3. Eastern Pygmies
4. Hadza
5. !Kung

Page 62: Sequence Alignment and Phylogenetic Analysis. Evolution

Two Neanderthal Discoveries
• Feldhofer, Germany
• Mezmaiskaya, Caucasus
• Distance: 2500 km

Page 63: Sequence Alignment and Phylogenetic Analysis. Evolution

Two Neanderthal Discoveries

•Is there a connection between Neanderthals and today’s Europeans?

•If humans did not evolve from Neanderthals, whom did we evolve from?

Page 64: Sequence Alignment and Phylogenetic Analysis. Evolution

Multiregional Hypothesis?

• May predict some genetic continuity from the Neanderthals through to the Cro-Magnons up to today’s Europeans

• Can explain the occurrence of varying regional characteristics

Page 65: Sequence Alignment and Phylogenetic Analysis. Evolution

Sequencing Neanderthal’s mtDNA

• mtDNA from Neanderthal bone is used because it is up to 1,000x more abundant than nuclear DNA
• DNA decays over time, and only a small amount of ancient DNA can be recovered (upper limit: 100,000 years)
• PCR of mtDNA is difficult (fragments are too short, and modern human DNA may be mixed in)

Page 66: Sequence Alignment and Phylogenetic Analysis. Evolution

Neanderthals vs Humans: surprisingly large divergence

• AMH (anatomically modern humans) vs Neanderthal:
  • 22 substitutions and 6 indels in a 357 bp region
• AMH vs AMH:
  • only 8 substitutions

Page 67: Sequence Alignment and Phylogenetic Analysis. Evolution

Evolutionary Trees
How are these trees built from DNA sequences?
• leaves represent existing species
• internal vertices represent ancestors
• root represents the oldest evolutionary ancestor

Page 68: Sequence Alignment and Phylogenetic Analysis. Evolution

Reading Your Tree
• There’s a lot of vocabulary in a tree
• Nodes correspond to common ancestors
• The root is the oldest ancestor
  • Often artificial
  • Only meaningful with a good outgroup
• Trees can be un-rooted
• Branch lengths are only meaningful when the tree is scaled
  • Cladograms are usually unscaled
  • Phenograms are usually scaled

Page 69: Sequence Alignment and Phylogenetic Analysis. Evolution

Rooted and Unrooted Trees
In the unrooted tree the position of the root (“oldest ancestor”) is unknown. Otherwise, they are like rooted trees

Page 70: Sequence Alignment and Phylogenetic Analysis. Evolution

Type of Trees (Cladogram)

Page 71: Sequence Alignment and Phylogenetic Analysis. Evolution

Type of Trees (Phylogram)

Page 72: Sequence Alignment and Phylogenetic Analysis. Evolution

3 Ways to Use Your Tree
• Finding the closest relative of your organism
  • Usually done with a tree based on the ribosomal RNA
• Discovering the function of a gene
  • Finding the orthologues of your gene
• Finding the origin of your gene
  • Finding whether your gene comes from another species

Page 73: Sequence Alignment and Phylogenetic Analysis. Evolution

Evolutionary Rate

• Normal mutation rate: about 1 in 10^8 nucleotides
• Normal polymorphic variance: approximately 1 in every 1000 nucleotides
• This is the background against which evolutionary changes are analyzed.

Page 74: Sequence Alignment and Phylogenetic Analysis. Evolution

Orthology and Paralogy
• Orthologous genes
  • Separated by speciation
  • Often have the same function
• Paralogous genes
  • Separated by duplications
  • Can have different functions
• In the graph:
  • A is paralogous with B
  • A1 is orthologous with A2

Orthology (direct, "vertical" homology) and paralogy (collateral, "parallel" homology)

Page 75: Sequence Alignment and Phylogenetic Analysis. Evolution

Working on the Right Data

• Garbage in, garbage out

• The quality of your tree depends on the quality of the data

• Your first task is to assemble a very accurate MSA

Page 76: Sequence Alignment and Phylogenetic Analysis. Evolution

DNA or Proteins
• Most phylogenetic methods work on protein and DNA sequences
• If possible, always compute a multiple-sequence alignment on the protein sequences
  • Translate the sequences if the DNA is coding
  • Align the sequences
  • Thread the DNA sequences back onto the protein MSA with coot.embl.de/pal2nal
• If your DNA sequences are coding and have more than 70% identity . . .
  • Compute the tree on the DNA multiple-sequence alignment
• If your DNA sequences are coding and have less than 70% identity . . .
  • Compute the tree on the protein multiple-sequence alignment

Page 77: Sequence Alignment and Phylogenetic Analysis. Evolution

Which Sequences?
• Orthologous sequences
  • Produce a species tree
  • Show how the considered species have diverged
• Paralogous sequences
  • Produce a gene tree
  • Show the evolution of a protein family

Page 78: Sequence Alignment and Phylogenetic Analysis. Evolution

Establishing Orthology
• Establishing orthology is very complicated
• It is common practice to establish orthology using the best reciprocal BLAST hit
  • A is a gene of Genome X
  • B is a gene of Genome Y
  • BLAST (Gene A against Genome Y) = B
  • BLAST (Gene B against Genome X) = A
• A is B’s best friend and B is A’s best friend…
• Phylogeny purists dislike this method
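
The reciprocal-best-hit rule can be sketched in a few lines once the BLAST results are reduced to a table of scores. The example below is a minimal illustration, not a BLAST wrapper: the gene names A2 and B2 and all bit scores are made up, and the two dictionaries stand in for the parsed output of the two searches.

```python
def best_hit(scores, query):
    """Return the subject with the highest score for this query."""
    hits = scores[query]
    return max(hits, key=hits.get)


def reciprocal_best_hits(x_vs_y, y_vs_x):
    """Pairs (a, b) where b is a's best hit in Y and a is b's best hit in X."""
    pairs = []
    for a in x_vs_y:
        b = best_hit(x_vs_y, a)
        if b in y_vs_x and best_hit(y_vs_x, b) == a:
            pairs.append((a, b))
    return pairs


# Hypothetical bit scores: genes of Genome X searched against Genome Y, and back.
x_vs_y = {"A": {"B": 250.0, "B2": 90.0}, "A2": {"B": 120.0, "B2": 60.0}}
y_vs_x = {"B": {"A": 245.0, "A2": 118.0}, "B2": {"A": 88.0, "A2": 61.0}}
print(reciprocal_best_hits(x_vs_y, y_vs_x))  # [('A', 'B')]
```

Here A2 also has B as its best hit, but B's best hit is A, so only (A, B) is reported as a putative orthologous pair.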

Page 79: Sequence Alignment and Phylogenetic Analysis. Evolution

Creating the Perfect Dataset

Page 80: Sequence Alignment and Phylogenetic Analysis. Evolution

Building the Right MSA
• Your MSA should have as few gaps as possible. Most of the time, you should remove columns with gaps.
• Some variability but not too much!
• Some conservation but not too much!

Page 81: Sequence Alignment and Phylogenetic Analysis. Evolution

Building the Right Tree
• There are three types of tree-reconstruction methods
  • Distance-based methods
  • Statistical methods
  • Parsimony methods
• Statistical methods are the most accurate
  • Maximum likelihood
  • Bayesian methods
• Statistical methods take more time
  • Limited to small datasets

Page 82: Sequence Alignment and Phylogenetic Analysis. Evolution

Distance-based method

• Compute a distance matrix
• Try to fit the matrix to a tree
• Fast but may not be very accurate

Page 83: Sequence Alignment and Phylogenetic Analysis. Evolution

Distances in Trees
• Edges may have weights reflecting:
  • Number of mutations on the evolutionary path from one species to another
  • Time estimate for the evolution of one species into another
• In a tree T, we often compute dij(T), the length of the path between leaves i and j

dij(T) – tree distance between i and j

Page 84: Sequence Alignment and Phylogenetic Analysis. Evolution

Distance in Trees: an Example

d1,4 = 12 + 13 + 14 + 17 + 12 = 68

Page 85: Sequence Alignment and Phylogenetic Analysis. Evolution

Distance Matrix

• Given n species, we can compute the n x n distance matrix Dij

• Dij may be defined as the edit distance between a gene in species i and species j, where the gene of interest is sequenced for all n species.

Dij – edit distance between i and j

Page 86: Sequence Alignment and Phylogenetic Analysis. Evolution

Edit Distance vs. Tree Distance

• Given n species, we can compute the n x n distance matrix Dij

• Dij may be defined as the edit distance between a gene in species i and species j, where the gene of interest is sequenced for all n species.

Dij – edit distance between i and j

• Note the difference with dij(T) – tree distance between i and j

Page 87: Sequence Alignment and Phylogenetic Analysis. Evolution

Compute a Distance Matrix
Evolutionary distance: number of substitutions per 100 amino acids (for proteins) or nucleotides (for DNA)

The two sequences compared directly:

A C T G T A G G A A T C G C
A A T G A A A G A A T C G C

3 observed changes

With the intermediate sequence between them:

A C T G T A G G A A T C G C
A C T G C A G G A A T A G C
A A T G A A A G A A T C G C

6 actual changes
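
The two counts on this slide can be checked by counting mismatched positions. The short sketch below uses the three sequences from the slide with the spaces removed; the function and variable names are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


start = "ACTGTAGGAATCGC"
middle = "ACTGCAGGAATAGC"  # intermediate sequence from the slide
end = "AATGAAAGAATCGC"

observed = hamming(start, end)
actual = hamming(start, middle) + hamming(middle, end)
print(observed, actual)  # 3 6
```

Two positions change twice along the path and one of them changes back to its original letter, which is why comparing only the end points undercounts the substitutions.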

Page 88: Sequence Alignment and Phylogenetic Analysis. Evolution

Edit Distance vs Tree Distance

d1,4 = 12 + 13 + 14 + 17 + 12 = 68

D1,4 may be smaller than 68, as some changes may not be observed

Page 89: Sequence Alignment and Phylogenetic Analysis. Evolution

Fitting Distance Matrix

• Given n species, we can compute the n x n distance matrix Dij

• Evolution of these genes is described by a tree that we don’t know.

• We need an algorithm to construct a tree that best fits the distance matrix Dij

Page 90: Sequence Alignment and Phylogenetic Analysis. Evolution

Fitting Distance Matrix

• Fitting means Dij = dij(T), where Dij is the edit distance between species (known) and dij(T) is the length of the path in an (unknown) tree T

Page 91: Sequence Alignment and Phylogenetic Analysis. Evolution

Reconstructing a 3 Leaved Tree
• Tree reconstruction for any 3x3 matrix is straightforward
• We have 3 leaves i, j, k and a center vertex c

Observe:

dic + djc = Dij

dic + dkc = Dik

djc + dkc = Djk

Page 92: Sequence Alignment and Phylogenetic Analysis. Evolution

Reconstructing a 3 Leaved Tree (cont’d)

Add the first two equations:

  dic + djc = Dij
+ dic + dkc = Dik

2dic + djc + dkc = Dij + Dik

Substitute djc + dkc = Djk:

2dic + Djk = Dij + Dik

dic = (Dij + Dik – Djk)/2

Similarly,

djc = (Dij + Djk – Dik)/2

dkc = (Dki + Dkj – Dij)/2
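
As a quick numerical check of these three formulas, the sketch below computes the edge lengths for a made-up 3x3 distance matrix (Dij = 7, Dik = 9, Djk = 10). The function name and the numbers are illustrative only.

```python
def three_leaf_edges(d_ij, d_ik, d_jk):
    """Edge lengths from the center vertex c to leaves i, j and k."""
    d_ic = (d_ij + d_ik - d_jk) / 2
    d_jc = (d_ij + d_jk - d_ik) / 2
    d_kc = (d_ik + d_jk - d_ij) / 2
    return d_ic, d_jc, d_kc


d_ic, d_jc, d_kc = three_leaf_edges(d_ij=7, d_ik=9, d_jk=10)
print(d_ic, d_jc, d_kc)  # 3.0 4.0 6.0
```

Adding the edges back in pairs recovers the input distances: 3 + 4 = 7, 3 + 6 = 9 and 4 + 6 = 10.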

Page 93: Sequence Alignment and Phylogenetic Analysis. Evolution

Trees with > 3 Leaves

• An unrooted binary tree with n leaves has 2n-3 edges

• This means fitting a given tree to a distance matrix D requires solving a system of “n choose 2” equations with 2n-3 variables

• This is not always possible to solve for n > 3

Page 94: Sequence Alignment and Phylogenetic Analysis. Evolution

Additive Distance Matrices

Matrix D is ADDITIVE if there exists a tree T with dij(T) = Dij

NON-ADDITIVE otherwise

Page 95: Sequence Alignment and Phylogenetic Analysis. Evolution

Distance Based Phylogeny Problem

• Goal: Reconstruct an evolutionary tree from a distance matrix

• Input: n x n distance matrix Dij

• Output: weighted tree T with n leaves fitting D

• If D is additive, this problem has a solution and there is a simple algorithm to solve it

Page 96: Sequence Alignment and Phylogenetic Analysis. Evolution

Using Neighboring Leaves to Construct the Tree

• Find neighboring leaves i and j with parent k
• Remove the rows and columns of i and j
• Add a new row and column corresponding to k, where the distance from k to any other leaf m can be computed as:

Dkm = (Dim + Djm – Dij)/2

Compress i and j into k, then iterate the algorithm for the rest of the tree
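
The compression step can be sketched as a function on a nested dictionary of distances. The example is hypothetical: leaves a, b, c, d hang off a tree with edge lengths a-u = 11, b-u = 2, u-v = 4, c-v = 6, d-v = 2, so a and b are neighbors with parent u. None of these names or numbers come from the slides.

```python
def make_matrix(pairs):
    """Build a symmetric nested-dict distance matrix from {(p, q): distance}."""
    D = {}
    for (p, q), d in pairs.items():
        D.setdefault(p, {p: 0.0})[q] = d
        D.setdefault(q, {q: 0.0})[p] = d
    return D


def merge_neighbors(D, i, j, k):
    """Return a new matrix in which neighbors i and j are replaced by parent k."""
    others = [m for m in D if m not in (i, j)]
    new = {m: {n: D[m][n] for n in others} for m in others}
    new[k] = {k: 0.0}
    for m in others:
        d_km = (D[i][m] + D[j][m] - D[i][j]) / 2  # Dkm = (Dim + Djm - Dij)/2
        new[k][m] = d_km
        new[m][k] = d_km
    return new


D = make_matrix({
    ("a", "b"): 13, ("a", "c"): 21, ("a", "d"): 17,
    ("b", "c"): 12, ("b", "d"): 8, ("c", "d"): 8,
})
merged = merge_neighbors(D, "a", "b", "u")
print(merged["u"]["c"], merged["u"]["d"])  # 10.0 6.0
```

The results match the path lengths in the assumed tree: u to c is 4 + 6 = 10 and u to d is 4 + 2 = 6.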

Page 98: Sequence Alignment and Phylogenetic Analysis. Evolution

Finding Neighboring Leaves
• To find neighboring leaves we simply select a pair of closest leaves.

WRONG

Page 99: Sequence Alignment and Phylogenetic Analysis. Evolution

Finding Neighboring Leaves
• Closest leaves aren’t necessarily neighbors
• i and j are neighbors, but (dij = 13) > (djk = 12)
• Finding a pair of neighboring leaves is a nontrivial problem!

Page 100: Sequence Alignment and Phylogenetic Analysis. Evolution

Neighbor Joining Algorithm

• In 1987 Naruya Saitou and Masatoshi Nei developed a neighbor joining algorithm for phylogenetic tree reconstruction

• Finds a pair of leaves that are close to each other but far from other leaves: implicitly finds a pair of neighboring leaves

• Advantages: works well for additive matrices and also for non-additive ones, and it does not rely on the flawed molecular clock assumption
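
The pair-selection idea can be illustrated with the criterion commonly used in descriptions of neighbor joining, Q(i, j) = (n - 2) Dij minus the sum of row i minus the sum of row j; the slides do not spell this formula out, so treat it as an assumption here. The four-leaf matrix below is hypothetical. It comes from a tree with edges i-u = 11, j-u = 2, u-v = 4, k-v = 6, l-v = 2, chosen so that dij = 13 and djk = 12 as on the earlier slide.

```python
from itertools import combinations


def closest_pair(D):
    """Pair of leaves with the smallest raw distance (the naive choice)."""
    return min(combinations(D, 2), key=lambda p: D[p[0]][p[1]])


def nj_pair(D):
    """Pair of leaves minimizing the neighbor-joining criterion Q."""
    n = len(D)
    total = {i: sum(D[i][m] for m in D if m != i) for i in D}

    def q(pair):
        i, j = pair
        return (n - 2) * D[i][j] - total[i] - total[j]

    return min(combinations(D, 2), key=q)


D = {
    "i": {"j": 13, "k": 21, "l": 17},
    "j": {"i": 13, "k": 12, "l": 8},
    "k": {"i": 21, "j": 12, "l": 8},
    "l": {"i": 17, "j": 8, "k": 8},
}
print(closest_pair(D))  # ('j', 'l'), which are not neighbors in the assumed tree
print(nj_pair(D))       # ('i', 'j'), a true pair of neighbors
```

In this matrix the pair (k, l) ties with (i, j) on Q, and both are genuine neighbor pairs; `min` simply returns the first one it encounters. A full implementation would go on to merge the chosen pair, compute branch lengths, and repeat.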