introduction to bioinformatics molecular phylogeny lesson 5
Post on 21-Dec-2015
Introduction to Bioinformatics
Molecular Phylogeny
Lesson 5
2
Theory of Evolution: Life is monophyletic
• All organisms on Earth had a common ancestor.
• Any two organisms share a common ancestor in their past.
Ancestor
Descendant 1 Descendant 2
3
Theory of Evolution:
• Speciation events lead to the creation of different species (one ancestral species gives rise to two species).
• Speciation is caused by physical separation into groups in which different genetic variants become dominant.
Ancestor
Descendant 1 Descendant 2
7
extinct
extant 1 extant 2
The genetic distance between any two extant organisms is computable.
8
The differences between 1 and 2 are the result of changes on the lineage leading to descendant 1 + those on the lineage leading to descendant 2.
descendant 1 descendant 2
ancestor
9
Thus, the species in any set are related to one another: this relationship is their phylogeny.
The relationships can be represented by a phylogenetic tree (or dendrogram).
10
5 MYA
120 MYA
1,500 MYA
MYA = Million Years Ago
11
Phylogenetic Tree Terminology
• Graph composed of nodes & branches
• Each branch connects two adjacent nodes
A B C D
E
F
R
12
Phylogenetic Tree Terminology
• Nodes represent the taxonomic units
• Taxonomic units = species/genes/individuals
• Branch = relations among the taxonomic units (descent & ancestry)
• Branching pattern = Topology
• Branch lengths correspond to the number of substitutions. A longer branch means more substitutions.
13
Phylogenetic Tree Terminology
A B C D E
internal node - hypothetical most recent common ancestors
leaf (terminal node) - current day species or gene “taxa”
Branches
Root
14
OTUs & HTUs
• OTUs = Operational Taxonomic Units – leaves of the tree
• HTUs = Hypothetical Taxonomic Units – internal nodes of the tree
15
Trees
[Figure: the same tree of Human, Chimp and Gorilla drawn four ways, with the leaves in a different left-to-right order each time; the drawings are equivalent.]
16
Same thing
[Figure: two equivalent drawings of one tree with leaves s1, s2, s3, s4 and s5.]
17
Newick format
A
B
C
D
E
((A,B),(C,(D,E)));
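The Newick string above can be read back into a tree structure. The following is a minimal sketch, assuming a simplified Newick dialect with only names, commas and parentheses (no branch lengths, quoted names or comments); the function names `parse_newick` and `leaves` are illustrative and do not come from the lesson.

```python
def parse_newick(text):
    """Parse a simplified Newick string into nested tuples of leaf names.

    Assumption: only names, commas and parentheses are supported;
    branch lengths, quoted names and comments are not handled.
    """
    s = "".join(text.split())  # drop all whitespace
    if not s.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    s = s[:-1]
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else ""

    def parse_node():
        nonlocal pos
        if peek() == "(":
            pos += 1  # consume '('
            children = [parse_node()]
            while peek() == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if peek() != ")":
                raise ValueError("missing ')' at position %d" % pos)
            pos += 1  # consume ')'
            return tuple(children)
        # Otherwise this is a leaf: read its name up to the next delimiter.
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        if start == pos:
            raise ValueError("empty leaf name at position %d" % pos)
        return s[start:pos]

    tree = parse_node()
    if pos != len(s):
        raise ValueError("unexpected text after the tree")
    return tree


def leaves(tree):
    """Return the leaf names of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


tree = parse_newick("((A,B),(C,(D,E)));")
print(tree)
print(leaves(tree))
```

Each pair of parentheses becomes one internal node (a tuple), and each name becomes one leaf.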
18
Rooted vs. unrooted trees
[Figure: a rooted tree and an unrooted tree on taxa 1, 2 and 3.]
19
Gorilla gorilla (gorilla)
Homo sapiens (human)
Pan troglodytes (chimpanzee)
Gallus gallus (chicken)
20
3 possible UNROOTED trees:
Human
Chimp
Chicken
Gorilla
Human
Gorilla
Chimp
Chicken
Human
Chicken
Chimp
Gorilla
the best tree
21
Rooting based on a priori knowledge:
Human
Chimp
Chicken
Gorilla
Human Chimp Chicken Gorilla
22
Ingroup / Outgroup:
Human Chimp Chicken Gorilla
INGROUP: Human, Chimp, Gorilla
OUTGROUP: Chicken
23
Monophyletic groups (clades):
A group is monophyletic (a clade) if it has a common ancestor and all the descendants of this ancestor are in the group.
24
Monophyletic groups
Human Chimp Chicken Gorilla
Gorilla + Human + Chimp are monophyletic.
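The definition of a clade can be checked mechanically on a rooted tree. The sketch below assumes the tree is written as nested tuples and assumes the rooted topology (Chicken, (Gorilla, (Human, Chimp))); both the representation and the helper names are illustrative choices, not part of the lesson.

```python
def leaf_set(tree):
    """Return the set of leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Yield the leaf set below every node (internal nodes and leaves)."""
    yield leaf_set(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, group):
    """A group is monophyletic if it is exactly the leaf set of some node."""
    return frozenset(group) in set(clades(tree))


# Assumed rooted topology, with Chicken as the outgroup.
rooted = ("Chicken", ("Gorilla", ("Human", "Chimp")))

print(is_monophyletic(rooted, {"Gorilla", "Human", "Chimp"}))
print(is_monophyletic(rooted, {"Human", "Gorilla"}))
```

A group passes the check only when one node of the tree has exactly that group as its descendants, which is the definition given above.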
25
Non-monophyletic groups
Whale Chimp Drosophila Zebra-fish
Zebra-fish + Whale are not monophyletic:
Adaptation to water occurred more than once during evolution, independently… (or was lost in the lineage leading to chimp).
26
Monophyletic groups:
Human
Chimp
Chicken
Gorilla
Rat
When an unrooted tree is given, you cannot know which groups are monophyletic. You can only say which are not.
For example, Chicken + Rat might be monophyletic if the root were between Chicken + Rat and the rest. In fact, the real root of the tree is between Chicken and the rest, hence Chicken and Rat are not monophyletic. But Human and Gorilla are not monophyletic no matter where the root is…
27
What data can be used?
(1) Molecular data (DNA, RNA, proteins)
(2) Morphological data (living or fossilized organisms)
28
Advantages of molecular data:
• Heritable entities
• Characters’ description is unambiguous
• Molecular data are amenable to quantitative treatment
• Can assess evolutionary relationships among distantly related organisms (ribosomal RNA)
• More abundant data (bacteria, algae)
29
What can we learn from a phylogenetic tree?
We can determine the closest relatives of the organism we are interested in.
30
Example 1: Which species are closest to Human?
Human
Chimpanzee
Gorilla
Orangutan
Gorilla
Chimpanzee
Orangutan
Human
Molecular analysis: the chimpanzee is more closely related to the human than the gorilla is.
Pre-molecular analysis: the great apes (chimpanzee, gorilla & orangutan) are separate from the human.
31
Example 2: Guilty Sequence - scientists map a murder weapon
“In 1998, a Louisiana doctor was convicted of attempting to murder his ex-girlfriend, a nurse. The murder weapon was a syringe of HIV-infected blood drawn from a patient under the doctor's care.”
32
History of the virus:
©2002 National Academy of Sciences, U.S.A.
Metzker, Michael L. et al. (2002) Proc. Natl. Acad. Sci. USA 99, 14292-14297
Phylogenetic analysis of the RT region. The smaller set of boxed sequences represents the sequences from the victim, and the larger set of boxed sequences represents the patient plus victim sequences. LA denote viral sequences from control HIV-1 infected individuals.
33
Species trees and Gene trees
• Species trees - representing the evolutionary relationships among species (the speciation process).
• Gene trees – Different genes may have different evolutionary history.
34
What is Homology?
Before Darwin, homology was defined morphologically: similarity between properties in various species.
Example:
• Bats and butterflies fly, but the structures are different.
• Bats fly and whales swim, yet the bones in a bat's wing and a whale's flipper are strikingly alike.
Conclusions:
1. Bat wings and butterfly wings are not homologous.
2. Bat wings and whale flippers are homologous.
35
Homology Interpretation: from Darwin to 21st Century
• Darwin (1859): Homology is a result of descent with modifications from a common ancestor.
• Modern genetics: Homology is determined by genes.
• Two sequences are homologous if they are similar and share a common ancestor (similarity by itself is not enough).
• Large enough similarities typically imply homology.
36
Homolog
• A gene related to a second gene by descent from a common ancestral DNA sequence.
37
Orthologs
Homologous sequences are orthologous if they were separated by a speciation event:
If a gene exists in a species, and that species diverges into two species, then the copies of this gene in the resulting species are orthologous.
38
Orthologs
• Orthologs typically retain the same or a similar function in the course of evolution.
• Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes.
40
Paralogs
Homologous sequences are paralogous if they were separated by a gene duplication event:
If a gene in an organism is duplicated, then the two copies are paralogous.
41
Paralogs
• Orthologs will typically have the same or similar function.
• This is not always true for paralogs: because the original selective pressure no longer acts on one copy of the duplicated gene, that copy is free to mutate and acquire new functions.
42
Paralogs
Duplication
43
Orthologs & Paralogs
Duplication
Speciation
Species a Species b
Paralogs
Orthologs
Orthologs
44
How many rooted trees
TR = “TREE ROOTED”
N=2, TR(2) = 1
N=3, TR(3) = 3
N=4, TR(4) = 15
[Figure: all rooted trees drawn for taxa a, b (N=2), taxa a, b, c (N=3) and taxa a, b, c, d (N=4).]
45
Number of possible trees:
Number of taxa   Number of rooted trees   Number of unrooted trees
2                1                        1
3                3                        1
4                15                       3
5                105                      15
6                945                      105
7                10,395                   945
8                135,135                  10,395
9                2,027,025                135,135
10               34,459,425               2,027,025
11               654,729,075              34,459,425
12               13,749,310,575           654,729,075
46
Number of possible trees
$N_{\text{rooted}} = \dfrac{(2n-3)!}{2^{n-2}\,(n-2)!}$
$N_{\text{unrooted}} = \dfrac{(2n-5)!}{2^{n-3}\,(n-3)!}$
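The two counting formulas are easy to evaluate with exact integer arithmetic. This is a minimal sketch; the function names are illustrative, and the value for two taxa in the unrooted case is set to 1 by hand because the formula is only defined for three or more taxa.

```python
from math import factorial


def num_rooted_trees(n):
    """Number of rooted bifurcating trees for n taxa: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def num_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n taxa: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    if n == 2:
        return 1  # special case: a single branch joining the two taxa
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in range(2, 13):
    print(n, num_rooted_trees(n), num_unrooted_trees(n))

print(num_rooted_trees(20))
```

The number of unrooted trees for n taxa equals the number of rooted trees for n-1 taxa, which is why the two columns of counts are shifted by one row relative to each other.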
47
Evolution is an historical process.
Only one historical narrative is true.
Of the 8,200,794,532,637,891,559,375 possible trees for 20 taxa, 1 is true and 8,200,794,532,637,891,559,374 are false.
Truth is one, falsehoods are many.
48
How do we know which of the 8,200,794,532,637,891,559,375 trees is true?
We don’t; we infer it by using decision criteria.
49
Methods
50
Approach 1 - Distance methods
• Two steps:
– Compute the distance between every pair of sequences in the MSA.
– Find the tree that agrees most with the distance table.
Approach 2 - Character state methods
• Input: multiple sequence alignment
• Algorithms:
– Maximum parsimony (MP)
– Maximum likelihood (ML)
51
Step 1: Distance estimation
There are different methods to compute the distance between any two sequences. For example, one can assign different probabilities to transitions and transversions…
OTU   A    B    C
B     8
C     7    9
D    12   14   11
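One way to make Step 1 concrete is to compare two aligned DNA sequences site by site. The sketch below computes the simple proportion of differing sites (the p-distance) and, as one example of a method that treats transitions and transversions differently, the Kimura two-parameter distance. The choice of that model, the function names and the example sequences are assumptions made for illustration; the lesson does not name a specific model.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_fractions(seq1, seq2):
    """Return (P, Q): fractions of compared sites that differ by a
    transition (P) or by a transversion (Q) in two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous characters
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    return transitions / compared, transversions / compared


def p_distance(seq1, seq2):
    """Proportion of compared sites at which the sequences differ."""
    p, q = transition_transversion_fractions(seq1, seq2)
    return p + q


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance (assumed model, not from the lesson):
    d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    p, q = transition_transversion_fractions(seq1, seq2)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Hypothetical aligned sequences differing by one transition (C <-> T).
s1 = "ACGTACGTAC"
s2 = "ACGTACGTAT"
print(p_distance(s1, s2), k2p_distance(s1, s2))
```

The corrected distance is slightly larger than the p-distance because it allows for substitutions that hit the same site more than once.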
52
Step 2: From a distance table to a tree
• Algorithms:
– UPGMA
– Neighbor Joining (NJ)
53
Neighbor Joining (NJ)
• Reconstructs an unrooted tree
• Calculates branch lengths
• Based on star decomposition
• In each stage, the two nearest nodes of the tree are chosen and defined as neighbors in our tree. This is done recursively until all of the nodes are paired together.
54
Neighbors, we are …
What are neighbours?
Neighbours are defined as a pair of OTUs that have one internal node connecting them.
[Figure: unrooted tree with leaves A, B, C and D.]
A and B are neighbours, C and D are neighbours, but A and C are not neighbours.
55
Neighbors, we are …
Which pair is closest?
$r_i = \sum_{k} d_{ik} / (N-2)$ : average distance from all nodes
$M_{ij} = d_{ij} - (r_i + r_j)$ : distance of i, j relative to the rest
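The two formulas can be applied directly to a distance table: compute r for every OTU, then M for every pair, and join the pair with the smallest M. The sketch below does this for a five-OTU table; the distances are read from the garbled table in this transcript as best it can be reconstructed, so they should be treated as an assumption, and the function name is illustrative.

```python
from itertools import combinations


def nj_closest_pair(names, dist):
    """Return (best_pair, r, m) using r_i = sum_k d_ik / (N-2) and
    M_ij = d_ij - (r_i + r_j); the pair with the smallest M is joined."""
    n = len(names)
    if n < 3:
        raise ValueError("need at least 3 OTUs")

    def d(i, j):
        return dist[frozenset((i, j))]

    r = {i: sum(d(i, k) for k in names if k != i) / (n - 2) for i in names}
    m = {(i, j): d(i, j) - (r[i] + r[j]) for i, j in combinations(names, 2)}
    best_pair = min(m, key=m.get)
    return best_pair, r, m


# Assumed distances, reconstructed from the five-OTU table in the transcript.
names = ["A", "B", "C", "D", "E"]
pairs = {
    ("A", "B"): 8, ("A", "C"): 7, ("B", "C"): 9,
    ("A", "D"): 12, ("B", "D"): 1, ("C", "D"): 3,
    ("A", "E"): 11, ("B", "E"): 10, ("C", "E"): 2, ("D", "E"): 6,
}
dist = {frozenset(p): v for p, v in pairs.items()}

best_pair, r, m = nj_closest_pair(names, dist)
print(best_pair)
for pair in sorted(m, key=m.get):
    print(pair, round(m[pair], 3))
```

Subtracting r_i + r_j penalises pairs that are merely close to everything, so the selected pair is not always the one with the smallest raw distance, although for these numbers the two criteria agree.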
56
OTU   A    B    C    D
B     8
C     7    9
D    12    1    3
E    11   10    2    6
[Figure: star tree with leaves A, B, C, D and E; B and D are joined into the node (B,D), leaving A, C, (B,D) and E.]
OTU     A   (B,D)   C
(B,D)  10
C       7     6
E      11     8     2
57
OTU     A   (B,D)   C
(B,D)  10
C       7     6
E      11     8     2
[Figure: C and E are joined into the node (C,E); the final tree connects A, (B,D) and (C,E), with leaves A, B, D, C and E.]
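After a pair is joined, the distance table has to be reduced by one row and column. The reduced table in this example appears consistent with simple averaging: the distance from the new node (B,D) to any other OTU k is (d(B,k) + d(D,k)) / 2. The sketch below implements that averaging rule under this assumption, with distances read from the transcript as best they can be reconstructed; note that the standard neighbor-joining update is (d(i,k) + d(j,k) - d(i,j)) / 2, which would give slightly different values.

```python
def join_by_averaging(names, dist, i, j):
    """Merge OTUs i and j into one node and return (new_names, new_dist).

    Assumption: distances to the merged node are the plain average of
    the distances to i and to j, as the reduced table suggests.
    """
    merged = "(%s,%s)" % (i, j)
    rest = [k for k in names if k not in (i, j)]
    new_names = [merged if k == i else k for k in names if k != j]
    new_dist = {}
    # Distances between untouched OTUs are copied unchanged.
    for idx, a in enumerate(rest):
        for b in rest[idx + 1:]:
            new_dist[frozenset((a, b))] = dist[frozenset((a, b))]
    # Distances to the merged node are averaged.
    for k in rest:
        new_dist[frozenset((merged, k))] = (
            dist[frozenset((i, k))] + dist[frozenset((j, k))]
        ) / 2
    return new_names, new_dist


# Assumed distances, reconstructed from the five-OTU table in the transcript.
names = ["A", "B", "C", "D", "E"]
pairs = {
    ("A", "B"): 8, ("A", "C"): 7, ("B", "C"): 9,
    ("A", "D"): 12, ("B", "D"): 1, ("C", "D"): 3,
    ("A", "E"): 11, ("B", "E"): 10, ("C", "E"): 2, ("D", "E"): 6,
}
dist = {frozenset(p): v for p, v in pairs.items()}

new_names, new_dist = join_by_averaging(names, dist, "B", "D")
print(new_names)
for k in ("A", "C", "E"):
    print("(B,D)", k, new_dist[frozenset(("(B,D)", k))])
```

Repeating the select-and-reduce cycle on the smaller table joins C and E next, since their distance of 2 is the smallest remaining entry.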
58
Advantages and disadvantages of NJ
• Advantages
– is fast and thus suited for large datasets and for bootstrap analysis
– permits lineages with largely different branch lengths
– permits correction for multiple substitutions
• Disadvantages
– sequence information is reduced
– gives only one possible tree
– strongly dependent on the model of evolution used