Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Post on 21-Dec-2015


Page 1: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Introduction to Bioinformatics

Molecular Phylogeny

Lesson 5

Page 2: Introduction to Bioinformatics Molecular Phylogeny Lesson 5


Theory of Evolution: Life is monophyletic

• All organisms on Earth had a common ancestor.

• Any two organisms share a common ancestor in their past.

[Figure labels: Ancestor; Descendant 1, Descendant 2]

Page 3: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Theory of Evolution:

• Speciation events lead to the creation of different species (two species).

• Speciation is caused by physical separation into groups in which different genetic variants become dominant.

[Figure labels: Ancestor; Descendant 1, Descendant 2]

Page 4: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

[Figure label: Ancestor]

Page 5: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

[Figure label: Ancestor]

Page 6: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

[Figure label: Ancestor]

Page 7: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

[Figure labels: extinct; extant 1, extant 2]

The genetic distance between any two extant organisms is computable.

Page 8: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

The differences between 1 and 2 are the result of changes on the lineage leading to descendant 1 + those on the lineage leading to descendant 2.

[Figure labels: ancestor; descendant 1, descendant 2]

Page 9: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Thus, the species in any set are related: this relationship is called phylogeny.

The relationships can be represented by a phylogenetic tree (or dendrogram).

Page 10: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

[Figure labels: 5 MYA, 120 MYA, 1,500 MYA]

MYA = Million Years Ago

Page 11: Introduction to Bioinformatics Molecular Phylogeny Lesson 5


Phylogenetic Tree Terminology

• Graph composed of nodes & branches

• Each branch connects two adjacent nodes

[Figure labels: A, B, C, D, E, F, R]

Page 12: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Phylogenetic Tree Terminology

• Nodes represent the taxonomic units

• Taxonomic units = species/genes/individuals

• Branch = relations among the taxonomic units (descent & ancestry)

• Branching pattern = Topology

• Branch lengths correspond to number of substitutions. Longer branch means more substitutions.

Page 13: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Phylogenetic Tree Terminology

[Figure: tree with leaves A, B, C, D, E, annotated with the terms below]

internal node - hypothetical most recent common ancestors

leaf (terminal node) - current day species or gene “taxa”

Branches

Root

Page 14: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

OTUs & HTUs

• OTUs = Operational Taxonomic Units

– leaves of the tree

• HTUs = Hypothetical Taxonomic Units

– internal nodes of the tree

Page 15: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Trees

[Figure: four drawings of a tree on Human, Chimp and Gorilla that differ only in the left-to-right order of the leaves, joined by "=" signs]

Page 16: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Same thing

[Figure: two drawings of a tree on s1, s2, s3, s4, s5, joined by an "=" sign]

Page 17: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Newick format

[Figure: tree with leaves A, B, C, D, E]

((A,B),(C,(D,E)));
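To make the bracket notation concrete, here is a minimal sketch of a reader for Newick strings of the simple kind shown above (leaf labels only, no branch lengths or internal-node names). The function names `parse_newick` and `leaves` and the nested-tuple representation are illustrative assumptions, not part of the lesson.

```python
def parse_newick(text):
    """Parse a simple Newick string (labels only) into nested tuples."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # skip ")"
            return tuple(children)
        # Otherwise read a leaf label up to the next delimiter.
        start = pos
        while text[pos] not in ",();":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected text at position {pos}")
    return tree


def leaves(tree):
    """Return the leaf labels of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


tree = parse_newick("((A,B),(C,(D,E)));")
print(tree)          # (('A', 'B'), ('C', ('D', 'E')))
print(leaves(tree))  # ['A', 'B', 'C', 'D', 'E']
```

Each pair of parentheses becomes one internal node, so the nesting of the tuples mirrors the branching pattern of the tree.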

Page 18: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Rooted vs. unrooted trees

[Figure: a rooted tree and an unrooted tree on taxa 1, 2 and 3]

Page 19: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Gorilla gorilla (gorilla)

Homo sapiens (human)

Pan troglodytes (chimpanzee)

Gallus gallus (chicken)

Page 20: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

3 possible UNROOTED trees:

[Figure: the three unrooted trees on Human, Chimp, Gorilla and Chicken: Human + Chimp vs. Chicken + Gorilla; Human + Gorilla vs. Chimp + Chicken; Human + Chicken vs. Chimp + Gorilla. One of them is marked "the best tree".]

Page 21: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Rooting based on a priori knowledge:

[Figure: the unrooted tree on Human, Chimp, Chicken and Gorilla, and the rooted tree derived from it]

Page 22: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Ingroup / Outgroup:

[Figure: rooted tree on Chicken, Gorilla, Human and Chimp, with the labels INGROUP and OUTGROUP]

Page 23: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Monophyletic groups (clades):

A group is monophyletic (a clade) if it has a common ancestor and all the descendants of this ancestor are in the group.

Page 24: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Monophyletic groups

[Figure: rooted tree on Chicken, Gorilla, Human and Chimp]

Gorilla + Human + Chimp are monophyletic.
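The definition of a clade can be checked mechanically on a rooted tree. The sketch below assumes the rooted topology (Chicken,(Gorilla,(Human,Chimp))), written as nested Python tuples; both the representation and the function names are illustrative assumptions rather than part of the lesson.

```python
def leaf_set(tree):
    """Return the set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Yield the leaf set below every node of a rooted tree."""
    yield leaf_set(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, group):
    """A group is monophyletic if it equals the leaf set of some node."""
    return frozenset(group) in set(clades(tree))


# Assumed rooted topology, with Chicken as the outgroup.
rooted = ("Chicken", ("Gorilla", ("Human", "Chimp")))

print(is_monophyletic(rooted, {"Gorilla", "Human", "Chimp"}))  # True
print(is_monophyletic(rooted, {"Human", "Gorilla"}))           # False
```

Human + Gorilla fails the test because their most recent common ancestor also has Chimp as a descendant.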

Page 25: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Non-monophyletic groups

[Figure: rooted tree on Drosophila, Zebra-fish, Whale and Chimp]

Zebra-fish + Whale are not monophyletic:

Adaptation to water occurred more than once during evolution, independently… (or was lost in the lineage leading to chimp).

Page 26: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Monophyletic groups:

[Figure: unrooted tree on Human, Chimp, Gorilla, Rat and Chicken]

When an unrooted tree is given, you cannot know which groups are monophyletic. You can only say which are not.

For example, Chicken + Rat might be monophyletic if the root were between Chicken + Rat and the rest. In fact, the real root of the tree is between Chicken and the rest, hence Chicken and Rat are not monophyletic. But Human and Gorilla are not monophyletic no matter where the root is.

Page 27: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

What data can be used?

(1) Molecular data (DNA, RNA, proteins)

(2) Morphological data (living or fossilized organisms)

Page 28: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Advantages of molecular data:

• Heritable entities

• Characters’ description is unambiguous

• Molecular data are amenable to quantitative treatment

• Can assess evolutionary relationships among distantly related organisms (ribosomal RNA)

• More abundant data (bacteria, algae)

Page 29: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

What can we learn from a phylogenetic tree?

Determining the closest relatives of the organism that you are interested in.

Page 30: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Example 1: Which species are closest to Human?

[Figure: two trees on Human, Chimpanzee, Gorilla and Orangutan, one for each analysis below]

Pre-molecular analysis: the great apes (chimpanzee, gorilla & orangutan) are separate from the human.

Molecular analysis: the chimpanzee is more closely related to the human than the gorilla is.

Page 31: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Example 2: Guilty Sequence - scientists map a murder weapon

“In 1998, a Louisiana doctor was convicted of attempting to murder his ex-girlfriend, a nurse. The murder weapon was a syringe of HIV-infected blood drawn from a patient under the doctor's care.”

Page 32: Introduction to Bioinformatics Molecular Phylogeny Lesson 5


History of the virus:

©2002 National Academy of Sciences, U.S.A.

Metzker, Michael L. et al. (2002) Proc. Natl. Acad. Sci. USA 99, 14292-14297

Phylogenetic analysis of the RT region. The smaller set of boxed sequences represents the sequences from the victim, and the larger set of boxed sequences represents the patient plus victim sequences. LA denote viral sequences from control HIV-1 infected individuals.

Page 33: Introduction to Bioinformatics Molecular Phylogeny Lesson 5


Species trees and Gene trees

• Species trees - representing the evolutionary relationships among species (the speciation process).

• Gene trees – Different genes may have different evolutionary history.

Page 34: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

What is Homology?

Before Darwin, homology was defined morphologically.

Similarity between properties in various species.

Example:

• Bats and butterflies fly, but the structures are different.

• Bats fly and whales swim, yet the bones in a bat's wing and a whale's flipper are strikingly alike.

Conclusions:

1. Bat wings and butterfly wings are not homologous.

2. Bat wings and whale flippers are homologous.

Page 35: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Homology Interpretation: from Darwin to 21st Century

• Darwin (1859): Homology is a result of descent with modifications from a common ancestor.

• Modern genetics: Homology is determined by genes.

• Two sequences are homologous if they are similar and share a common ancestor (similarity by itself is not enough).

• Large enough similarities typically imply homology.

Page 36: Introduction to Bioinformatics Molecular Phylogeny Lesson 5


Homolog

• A gene related to a second gene by descent from a common ancestral DNA sequence.

Page 37: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Orthologs

Homologous sequences are orthologous if they were separated by a speciation event:

If a gene exists in a species, and that species diverges into two species, then the copies of this gene in the resulting species are orthologous.

Page 38: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Orthologs

• Orthologs typically retain the same or a similar function in the course of evolution.

• Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes.

Page 39: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Orthologs

[Figure labels: speciation; ancestor; descendant 2]

Page 40: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Paralogs

Homologous sequences are paralogous if they were separated by a gene duplication event:

If a gene in an organism is duplicated, then the two copies are paralogous.

Page 41: Introduction to Bioinformatics Molecular Phylogeny Lesson 5


Paralogs

• Orthologs will typically have the same or similar function.

• This is not always true for paralogs: because the original selective pressure no longer acts on one copy of the duplicated gene, that copy is free to mutate and acquire new functions.

Page 42: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Paralogs

[Figure label: Duplication]

Page 43: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Orthologs & Paralogs

[Figure: gene tree with a Duplication event followed by a Speciation event into Species a and Species b; pairs of gene copies are marked as Paralogs or Orthologs]

Page 44: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

How many rooted trees

TR = “TREE ROOTED”

N=2, TR(2) = 1

N=3, TR(3) = 3

N=4, TR(4) = 15

[Figure: every rooted tree drawn for taxa a, b (N=2), for a, b, c (N=3) and for a, b, c, d (N=4)]

Page 45: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Number of possible trees:

Number of taxa   Number of rooted trees   Number of unrooted trees
2                1                        1
3                3                        1
4                15                       3
5                105                      15
6                945                      105
7                10,395                   945
8                135,135                  10,395
9                2,027,025                135,135
10               34,459,425               2,027,025
11               654,729,075              34,459,425
12               13,749,310,575           654,729,075

Page 46: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Number of possible trees

$N_{\text{rooted}} = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$

$N_{\text{unrooted}} = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
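The two formulas are easy to evaluate directly. The following is a small sketch with illustrative function names; the unrooted formula is undefined for n = 2, so the code returns 1 in that case by convention (an assumption made here so that two taxa count as one tree).

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted binary trees on n taxa: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted binary trees on n taxa: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    if n == 2:
        return 1  # convention: the formula itself is undefined for n = 2
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 4, 5, 6):
    print(n, rooted_trees(n), unrooted_trees(n))
# 3 3 1
# 4 15 3
# 5 105 15
# 6 945 105
```

Note that the number of unrooted trees on n taxa equals the number of rooted trees on n - 1 taxa, because a root can be placed on any branch of an unrooted tree.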

Page 47: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Evolution is an historical process.

Only one historical narrative is true.

From 8,200,794,532,637,891,559,375 possibilities for 20 taxa, 1 possibility is true and 8,200,794,532,637,891,559,374 are false.

Truth is one, falsehoods are many.

Page 48: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

How do we know which of the 8,200,794,532,637,891,559,375 trees is true?

We don’t, we infer by using decision criteria.

Page 49: Introduction to Bioinformatics Molecular Phylogeny Lesson 5


Methods

Page 50: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Approach 1 - Distance methods

• Two steps:

– Compute the distance between every pair of sequences from the multiple sequence alignment (MSA).

– Find the tree that agrees most with the distance table.

Approach 2 - Character state methods

• Input: multiple sequence alignment

• Algorithms:

– Maximum parsimony (MP)

– Maximum likelihood (ML)

Page 51: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Step 1: Distance estimation

There are different methods to compute the distance between any two sequences. For example, one can take into account different probabilities between transitions and transversions…

OTU   A    B    C
B     8
C     7    9
D     12   14   11
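As an illustration of how a single entry of such a distance table might be estimated from two aligned DNA sequences, the sketch below computes the simple proportion of differing sites (p-distance) and the Kimura two-parameter (K2P) distance, one common model that treats transitions and transversions separately. The lesson does not say which method it uses, so the choice of K2P, the function names and the example sequences are assumptions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_fractions(seq1, seq2):
    """Return (P, Q): the fractions of transition and transversion sites."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous symbols
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return transitions / compared, transversions / compared


def p_distance(seq1, seq2):
    """Proportion of compared sites that differ."""
    p, q = substitution_fractions(seq1, seq2)
    return p + q


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance: -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q))."""
    p, q = substitution_fractions(seq1, seq2)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Hypothetical aligned sequences: one transition (A<->G) in four sites.
print(p_distance("AAAA", "AAAG"))    # 0.25
print(k2p_distance("AAAA", "AAAG"))
```

The corrected distance is larger than the raw proportion of differences because it accounts for multiple substitutions at the same site.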

Page 52: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Step 2: From a distance table to a tree

• Algorithms:

– UPGMA

– Neighbor Joining (NJ)
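UPGMA is not described further in this lesson, so the following is only a minimal sketch of the standard algorithm: repeatedly merge the two closest clusters and replace their distances to every other cluster by a size-weighted average. It returns the topology only (no branch lengths) and uses the four-OTU distances from the Step 1 table; the function name and the data layout are illustrative assumptions.

```python
from itertools import combinations


def upgma(names, dist):
    """Return a rooted Newick topology built by UPGMA.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Pick the pair of clusters with the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = f"({a},{b})"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance to the merged cluster is the size-weighted average.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    (root,) = sizes
    return root + ";"


# Distances from the four-OTU table of Step 1.
table = {("A", "B"): 8, ("A", "C"): 7, ("B", "C"): 9,
         ("A", "D"): 12, ("B", "D"): 14, ("C", "D"): 11}
dist = {frozenset(pair): value for pair, value in table.items()}

print(upgma(["A", "B", "C", "D"], dist))  # (D,(B,(A,C)));
```

Unlike NJ, UPGMA produces a rooted tree and implicitly assumes that all lineages evolve at the same rate.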

Page 53: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Neighbor Joining (NJ)

• Reconstructs an unrooted tree

• Calculates branch lengths

• Based on star decomposition

• In each stage, the two nearest nodes of the tree are chosen and defined as neighbors in our tree. This is done recursively until all of the nodes are paired together.

Page 54: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

What are neighbours?

Neighbours are defined as a pair of OTUs that have one internal node connecting them.

Neighbors, we are …

[Figure: unrooted tree on A, B, C and D]

A and B are neighbours, C and D are neighbours, but A and C are not neighbours.

Page 55: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Which pair is closest?

Neighbors, we are …

$r_i = \frac{\sum_{k} d_{ik}}{N-2}$ (average distance of node i from all other nodes)

$M_{ij} = d_{ij} - (r_i + r_j)$ (distance of i, j relative to the rest)
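The selection step can be sketched directly from these two formulas: compute r for every node, then M for every pair, and join the pair with the smallest M. The code below applies this to the five-OTU distances used in this lesson's worked example; the function name and the dictionary layout are illustrative assumptions.

```python
from itertools import combinations


def nj_pair(names, dist):
    """Return the pair with the smallest M_ij, plus the r and M values.

    dist maps frozenset({i, j}) to the distance d_ij.
    """
    n = len(names)
    # r_i: sum of distances from i to all other nodes, divided by N - 2.
    r = {
        i: sum(dist[frozenset((i, k))] for k in names if k != i) / (n - 2)
        for i in names
    }
    # M_ij: distance of i and j relative to the rest.
    m = {
        (i, j): dist[frozenset((i, j))] - (r[i] + r[j])
        for i, j in combinations(names, 2)
    }
    best = min(m, key=m.get)
    return best, r, m


# Five-OTU distances from the worked example.
table = {("A", "B"): 8, ("A", "C"): 7, ("B", "C"): 9,
         ("A", "D"): 12, ("B", "D"): 1, ("C", "D"): 3,
         ("A", "E"): 11, ("B", "E"): 10, ("C", "E"): 2, ("D", "E"): 6}
dist = {frozenset(pair): value for pair, value in table.items()}

best, r, m = nj_pair(["A", "B", "C", "D", "E"], dist)
print(best)  # ('B', 'D')
```

Here B and D happen to have both the smallest raw distance and the smallest M value; in general the two criteria can disagree, which is why NJ uses M rather than the raw distance.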

Page 56: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

OTU   A    B    C    D
B     8
C     7    9
D     12   1    3
E     11   10   2    6

[Figure: star tree with leaves A, B, C, D and E; B and D are joined, leaving a star with A, C, E and (B,D)]

OTU     A    (B,D)   C
(B,D)   10
C       7    6
E       11   8       2

Page 57: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

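[Figure: star tree with A, C, E and (B,D)]

OTU     A    (B,D)   C
(B,D)   10
C       7    6
E       11   8       2

[Figure: C and E are joined, giving a tree with A, (B,D) and (C,E), which equals the tree with leaves A, B, D, C and E]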

Page 58: Introduction to Bioinformatics Molecular Phylogeny Lesson 5

Advantages and disadvantages of NJ

• Advantages

– is fast and thus suited for large datasets and for bootstrap analysis

– permits lineages with largely different branch lengths

– permits correction for multiple substitutions

• Disadvantages

– sequence information is reduced

– gives only one possible tree

– strongly dependent on the model of evolution used.