Multiple Global Alignment and Phylogenetic Tree
![Page 1: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/1.jpg)
Michael Schroeder, BioTechnological Center, TU Dresden, biotec.tu-dresden.de
Multiple Global Alignment and Phylogenetic tree
![Page 2: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/2.jpg)
By Michael Schroeder, Biotec 2
Outline
Multiple sequence alignment (MSA)
– Motivation
– The sum of pairs method (SP)
Phylogenetic tree
– Clustering
– Neighbour joining
ClustalW
![Page 3: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/3.jpg)
What is a Multiple Sequence Alignment
MSA is the alignment of more than two sequences
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
    *               *
An example of an MSA (the asterisks mark fully conserved columns)
![Page 4: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/4.jpg)
Dynamic Programming in 3D
QUESTION: Which alignment would be generated for DQLF, DNVQ, QGL?
![Page 5: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/5.jpg)
Dynamic Programming in 3D
D--Q-LF
DNVQ---
---QGL-
![Page 6: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/6.jpg)
How many cases do we need to consider?
In standard dynamic programming we considered 3 cases, namely match/mismatch, insert, and delete
For three sequences s1, s2, s3 there are 7 possibilities:
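$$\begin{array}{ccccccc}
s^1_i & - & s^1_i & s^1_i & - & - & s^1_i \\
s^2_j & s^2_j & - & s^2_j & - & s^2_j & - \\
s^3_k & s^3_k & s^3_k & - & s^3_k & - & -
\end{array}$$
For m sequences there are $2^m - 1$ possibilities
QUESTION: Why is it “2”?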
![Page 7: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/7.jpg)
Complexity
For m sequences each of length n the matrix has $n^m$ cells and for each we must check $2^m - 1$ possibilities: That’s prohibitive!
Solution: Use pruning techniques (cut-offs) and heuristics to guide the search for the best solution
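To get a feeling for these numbers, the product of cells and cases can be computed directly. This is only a rough sketch: the function name `msa_dp_work` is made up for illustration, and the count follows the slide's approximation of $n^m$ cells.

```python
def msa_dp_work(n: int, m: int) -> int:
    """Approximate number of case checks: n^m cells, 2^m - 1 cases per cell."""
    cells = n ** m
    cases_per_cell = 2 ** m - 1
    return cells * cases_per_cell


# Work grows exponentially with the number of sequences m.
for m in (2, 3, 5, 10):
    print(m, msa_dp_work(100, m))
```

Already for a handful of sequences of length 100 the count is far beyond what can be enumerated, which is why pruning is needed.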
![Page 8: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/8.jpg)
A little excursion to Romania:
A* Search
Further reading: Russell/Norvig, Artificial Intelligence, Chapter 4. Prentice-Hall
![Page 9: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/9.jpg)
Problem: Find the shortest path from Arad to Bucharest
[Figure: road map of Romania with the road distances between neighbouring cities; the route Arad, Sibiu, Rimnicu, Pitesti, Bucharest (140, 80, 97, 101) is highlighted]
Optimal route is (140+80+97+101) = 418 miles
![Page 10: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/10.jpg)
Straight Line Distances to Bucharest
Town SLD
Arad 366
Bucharest 0
Craiova 160
Dobreta 242
Eforie 161
Fagaras 178
Giurgiu 77
Hirsova 151
Iasi 226
Lugoj 244
Mehadia 241
Neamt 234
Oradea 380
Pitesti 98
Rimnicu 193
Sibiu 253
Timisoara 329
Urziceni 80
Vaslui 199
Zerind 374
![Page 11: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/11.jpg)
Greedy search
[Figure: road map of Romania next to the table of straight line distances (SLD) to Bucharest]
Go to the neighboring city v that minimizes the distance Fv to the goal
![Page 12: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/12.jpg)
Greedy search
QUESTION: Any problems? Why is it called “greedy” search?
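The greedy rule can be tried out on a small part of the map. The following is a minimal sketch, not the lecture's code: the names `EDGES`, `ROADS` and `greedy` are illustrative, the straight line distances come from the SLD table, and the road lengths are the ones implied by the S values in the A* slides of this deck. Only the roads that appear in that search tree are included.

```python
# Straight line distances to Bucharest (from the SLD table).
SLD = {"Arad": 366, "Bucharest": 0, "Craiova": 160, "Fagaras": 178,
       "Oradea": 380, "Pitesti": 98, "Rimnicu": 193, "Sibiu": 253,
       "Timisoara": 329, "Zerind": 374}

# Assumed subset of the road map (undirected edges with road lengths).
EDGES = [("Arad", "Zerind", 75), ("Arad", "Sibiu", 140),
         ("Arad", "Timisoara", 118), ("Sibiu", "Oradea", 151),
         ("Sibiu", "Fagaras", 99), ("Sibiu", "Rimnicu", 80),
         ("Rimnicu", "Pitesti", 97), ("Rimnicu", "Craiova", 146),
         ("Pitesti", "Bucharest", 101), ("Fagaras", "Bucharest", 211)]

ROADS = {}
for a, b, d in EDGES:
    ROADS.setdefault(a, []).append((b, d))
    ROADS.setdefault(b, []).append((a, d))


def greedy(start, goal):
    """Always move to the unvisited neighbour with the smallest SLD."""
    path, cost, city = [start], 0, start
    while city != goal:
        options = [(SLD[n], n, d) for n, d in ROADS[city] if n not in path]
        if not options:
            raise ValueError("dead end at " + city)
        _, city, dist = min(options)
        path.append(city)
        cost += dist
    return path, cost


print(greedy("Arad", "Bucharest"))
```

On this subset the greedy route runs through Fagaras and costs 450 miles, which is longer than the 418-mile route through Rimnicu and Pitesti.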
![Page 13: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/13.jpg)
Problems of greedy search
Not optimal: Greedy search goes from Arad to Bucharest via Fagaras, but the optimum is via Rimnicu
Problem: The greedy algorithm does not take the distance already covered into account
A*: Pursue the best node first, with a scoring function of the distance so far plus an under-estimate of the distance to the goal (e.g. straight line distance)
– v is a node
– Sv: best score to go from start to node v
– Fv: estimate for going from v to goal
– Tv = Sv + Fv: total score
Organize the nodes to be visited in a list sorted by total score (TODO list in the next slides)
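The TODO list sorted by total score maps naturally onto a priority queue. The sketch below is one possible rendering under stated assumptions: `EDGES`, `ROADS` and `a_star` are illustrative names, the road lengths are the ones implied by the S values in the following slides, and only that part of the map is modelled. Unlike the slides, this version does not queue a second, more expensive entry for a city that already has a cheaper one.

```python
import heapq

# Straight line distances to Bucharest (F values, from the SLD table).
SLD = {"Arad": 366, "Bucharest": 0, "Craiova": 160, "Fagaras": 178,
       "Oradea": 380, "Pitesti": 98, "Rimnicu": 193, "Sibiu": 253,
       "Timisoara": 329, "Zerind": 374}

# Assumed subset of the road map (undirected edges with road lengths).
EDGES = [("Arad", "Zerind", 75), ("Arad", "Sibiu", 140),
         ("Arad", "Timisoara", 118), ("Sibiu", "Oradea", 151),
         ("Sibiu", "Fagaras", 99), ("Sibiu", "Rimnicu", 80),
         ("Rimnicu", "Pitesti", 97), ("Rimnicu", "Craiova", 146),
         ("Pitesti", "Bucharest", 101), ("Fagaras", "Bucharest", 211)]

ROADS = {}
for a, b, d in EDGES:
    ROADS.setdefault(a, []).append((b, d))
    ROADS.setdefault(b, []).append((a, d))


def a_star(start, goal):
    """Expand the node with the lowest T = S + F until the goal is popped."""
    best_s = {start: 0}                         # best known S per city
    todo = [(SLD[start], 0, start, [start])]    # entries: (T, S, city, path)
    while todo:
        t, s, city, path = heapq.heappop(todo)
        if city == goal:
            return path, s                      # goal AND lowest T: done
        if s > best_s.get(city, float("inf")):
            continue                            # stale, more expensive entry
        for nxt, dist in ROADS[city]:
            s_next = s + dist
            if s_next < best_s.get(nxt, float("inf")):
                best_s[nxt] = s_next
                heapq.heappush(todo, (s_next + SLD[nxt], s_next, nxt, path + [nxt]))
    return None


print(a_star("Arad", "Bucharest"))
```

The search only stops when the goal is popped from the queue, not when it is first seen, which is the point made in the walkthrough on the following slides.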
![Page 14: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/14.jpg)
A* search of the Romanian map featured in the previous slides. Note: Nodes are labelled with Tv = Sv + Fv. However, we will use the abbreviations T, S and F to keep the notation simple.
[Figure: search tree rooted at Arad, labelled T = 0 + 366 = 366]
![Page 15: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/15.jpg)
We begin with the initial state of Arad. The cost of reaching Arad from Arad (or S value) is 0 miles. The straight line distance from Arad to Bucharest (or F value) is 366 miles. This gives us a total value of ( T = S + F ) 366 miles. Expand the initial state of Arad.
DONE = []
TODO = [Arad/366]
![Page 16: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/16.jpg)
[Figure: search tree after expanding Arad. Zerind: T = 75 + 374 = 449, Sibiu: T = 140 + 253 = 393, Timisoara: T = 118 + 329 = 447]
Once Arad is expanded we look for the node with the lowest cost. Sibiu has the lowest value for T. (The cost to reach Sibiu from Arad is 140 miles, and the straight line distance from Sibiu to the goal state is 253 miles. This gives a total of 393 miles.)
DONE = [Arad]
TODO = [Sibiu/393, Timisoara/447, Zerind/449]
![Page 17: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/17.jpg)
We now expand Sibiu (that is, we expand the node with the lowest value of T).
[Figure: search tree after expanding Sibiu. Fagaras: T = 239 + 178 = 417, Oradea: T = 291 + 380 = 671, Rimnicu: T = 220 + 193 = 413]
DONE = [Arad, Sibiu]
TODO = [Rimnicu/413, Fagaras/417, Timisoara/447, Zerind/449, Oradea/671]
![Page 18: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/18.jpg)
We now expand Rimnicu (that is, we expand the node with the lowest value of T).
DONE = [Arad, Sibiu]
TODO = [Rimnicu/413, Fagaras/417, Timisoara/447, Zerind/449, Oradea/671]
![Page 19: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/19.jpg)
[Figure: search tree after expanding Rimnicu. Pitesti: T = 317 + 98 = 415, Craiova: T = 366 + 160 = 526]
Once Rimnicu is expanded we look for the node with the lowest cost. As you can see, Pitesti has the lowest value for T. (The cost to reach Pitesti from Arad is 317 miles, and the straight line distance from Pitesti to the goal state is 98 miles. This gives a total of 415 miles.)
DONE = [Arad, Sibiu, Rimnicu]
TODO = [Pitesti/415, Fagaras/417, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]
![Page 20: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/20.jpg)
We now expand Pitesti (that is, we expand the node with the lowest value of T).
[Figure: search tree after expanding Pitesti. Bucharest: T = 418 + 0 = 418]
DONE = [Arad, Sibiu, Rimnicu, Pitesti]
TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]
![Page 21: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/21.jpg)
In actual fact, the algorithm will not really recognise that we have found Bucharest. It just keeps expanding the lowest cost nodes (based on T) until it finds a goal state AND it has the lowest value of T. So, we must now move to Fagaras and expand it.
DONE = [Arad, Sibiu, Rimnicu, Pitesti]
TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]
![Page 22: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/22.jpg)
We have just expanded a node (Pitesti) that revealed Bucharest, but it has a cost of 418. If there is any other lower cost node (and in this case there is one cheaper node, Fagaras, with a cost of 417) then we need to expand it in case it leads to a better solution to Bucharest than the 418 solution we have already found.
DONE = [Arad, Sibiu, Rimnicu, Pitesti]
TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]
![Page 23: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/23.jpg)
We now expand Fagaras (that is, we expand the node with the lowest value of T).
[Figure: search tree after expanding Fagaras. Bucharest(2): T = 450 + 0 = 450]
DONE = [Arad, Sibiu, Rimnicu, Pitesti]
TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]
![Page 24: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/24.jpg)
Once Fagaras is expanded we look for the lowest cost node. As you can see, we now have two Bucharest nodes. One of these nodes ( Arad – Sibiu – Rimnicu – Pitesti – Bucharest ) has a T value of 418. The other node ( Arad – Sibiu – Fagaras – Bucharest(2) ) has a T value of 450. We therefore move to the first Bucharest node and expand it.
DONE = [Arad, Sibiu, Rimnicu, Pitesti, Fagaras]
TODO = [Bucharest/418, Timisoara/447, Zerind/449, Bucharest/450, Craiova/526, Oradea/671]
![Page 25: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/25.jpg)
We have now arrived at Bucharest. As this is the lowest cost node AND the goal state we can terminate the search. If you look back over the slides you will see that the solution returned by the A* search ( Arad – Sibiu – Rimnicu – Pitesti – Bucharest ) is in fact the optimal solution.
DONE = [Arad, Sibiu, Rimnicu, Pitesti, Fagaras]
TODO = [Bucharest/418, Timisoara/447, Zerind/449, Bucharest/450, Craiova/526, Oradea/671]
![Page 26: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/26.jpg)
Additional optimization
Let’s assume we have an (over-)estimate K for the best solution, i.e. the optimal solution will be better than K
Do not consider any node with total score Tv worse than K
If Tv > K then remove v from TODO list
![Page 27: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/27.jpg)
Additional optimization
Assume K = 430, then we can remove the nodes Zerind, Oradea, Timisoara and Craiova
QUESTION: What if K is equal to the optimum? What if K is poorly chosen? What if the rule is “If Tv >= K then remove v”? Problem?
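The cut-off rule is a simple filter on the TODO list. As a small sketch, take the TODO list from the walkthrough just after Rimnicu was expanded and assume K = 430; the function name `prune` is made up for illustration.

```python
def prune(todo, K):
    """Drop every node whose total score T exceeds the over-estimate K."""
    return [(city, t) for city, t in todo if t <= K]


todo = [("Pitesti", 415), ("Fagaras", 417), ("Timisoara", 447),
        ("Zerind", 449), ("Craiova", 526), ("Oradea", 671)]

print(prune(todo, 430))
```

Only Pitesti and Fagaras survive, so the four nodes named on the slide never need to be expanded. Changing the comparison to `t < K` is a quick way to explore the slide's question about the rule “If Tv >= K then remove v”.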
![Page 28: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/28.jpg)
F must be under-estimate
For the algorithm to work, F must be an under-estimate
Example: The straight line distance is never longer than the road distance
QUESTION: What happens if F is not an under-estimate?
![Page 29: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/29.jpg)
F must be under-estimate
QUESTION: What happens if F is not an under-estimate?
Then it cannot be guaranteed that the optimal solution is found. E.g. what if F_Rimnicu = 10,000 in the example?
Then T_Rimnicu = 220 + 10,000 = 10,220 > K = 450, so Rimnicu would be removed, and the optimal solution would not be found
![Page 30: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/30.jpg)
From Romania to Dresden
So, what does that mean for multiple sequence alignment?
QUESTIONS:
– What does a node (city) correspond to?
– What does an edge between nodes correspond to?
– What does the cost between two nodes correspond to?
– How could we define S?
– How could we define F?
– How could we define K?
![Page 31: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/31.jpg)
The Sum of Pairs Method
As in the pairwise case, not all MSAs are equally good. We need a scoring method to determine when one MSA is better than another
The Sum of Pairs (SP) method: For each column in the alignment, sum up the score of each pair of residues.
– M: an MSA of the sequences (s1, s2, ..., sm)
– s’i is the projection of si, i.e. the sequence si with gaps
– S(s’i, s’j): the score of the projections
The final score is
$$SP(M) = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} S(s'_i, s'_j)$$
![Page 32: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/32.jpg)
An Example of Using the SP Method
Example
s1 = AVP     s’1: A-VP-
s2 = AVT     s’2: A-V-T
s3 = PSVPT   s’3: PSVPT
Scores:
– Match = 1
– Mismatch, insertion, deletion = -1
– S(-, -) = 0 to prevent the double counting of gaps.
QUESTION: What is the score of the alignment?
![Page 33: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/33.jpg)
An Example of Using the SP Method
For the example alignment s’1: A-VP-, s’2: A-V-T, s’3: PSVPT the SP score is
S(s’1,s’2) + S(s’1,s’3) + S(s’2,s’3)
= 0 + (-1) + (-1)
= -2
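The calculation can be checked mechanically. This is a minimal sketch of the SP score using the scoring scheme of the example (match = 1, everything else = -1, gap against gap = 0); the function names `pair_score` and `sp_score` are illustrative and not part of the lecture.

```python
from itertools import combinations


def pair_score(a, b, match=1, other=-1):
    """Score one pair of symbols in a column; a gap against a gap counts 0."""
    if a == "-" and b == "-":
        return 0
    return match if a == b else other


def sp_score(rows, match=1, other=-1):
    """Sum of pairs: add the projection score of every pair of rows."""
    total = 0
    for x, y in combinations(rows, 2):
        total += sum(pair_score(a, b, match, other) for a, b in zip(x, y))
    return total


print(sp_score(["A-VP-", "A-V-T", "PSVPT"]))
```

The same function can score any other alignment by passing its rows and, if needed, different `match` and `other` values.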
![Page 34: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/34.jpg)
1 MSA vs. n SA
What is the difference between making one multiple sequence alignment and making many pairwise sequence comparisons?
The score S(s’i,s’j) for the alignment s’i,s’j in a multiple sequence alignment is at most the score S(si,sj) for aligning si,sj directly
S(s’i,s’j) <= S(si,sj)
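The inequality can be observed on the SP example (s1 = AVP, s2 = AVT, s3 = PSVPT with match = 1, everything else = -1). The sketch below compares the score of each projection with the optimal pairwise score from standard global dynamic programming; `global_score` and `projection_score` are illustrative names, and a linear gap cost equal to the mismatch score is assumed, as in the example.

```python
def global_score(s, t, match=1, other=-1):
    """Optimal global pairwise alignment score (two-row dynamic programming)."""
    prev = [j * other for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * other]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else other)
            cur.append(max(diag, prev[j] + other, cur[j - 1] + other))
        prev = cur
    return prev[-1]


def projection_score(x, y, match=1, other=-1):
    """Score of two rows taken from an MSA; gap-gap columns count 0."""
    score = 0
    for a, b in zip(x, y):
        if a == "-" and b == "-":
            continue
        score += match if a == b else other
    return score


print(projection_score("A-VP-", "A-V-T"), global_score("AVP", "AVT"))
```

For s1 and s2 the projection scores 0 while the direct alignment scores 1, because the MSA forces these two sequences into columns that suit the third sequence.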
![Page 35: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/35.jpg)
Pruning the search space
Computing all cells in the dynamic programming solution is expensive, therefore we want to avoid computing as many cells as possible
Can we rule out any cells?
Let us assume that we already know an alignment of score K
Let v = (i1, i2, …, im) be a cell of the DP matrix for which we want to determine whether we need to consider it (and its neighbours) or not
Let Sv be the score of the best path from the start cell to cell v
Let Fv be an upper bound for the highest-scoring alignment from v to the end of the DP matrix, i.e. any path from v to the end scores at most Fv
Then we know the following: If Sv + Fv < K, then v cannot lie on the path of the best alignment
[Figure: DP matrix with a path of score Sv from the start cell to v and a remaining path from v to the end cell that scores at most Fv]
![Page 36: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/36.jpg)
Dynamic Pruning with Forward Recursion
D(v,w) is the score to be added when moving from v to its forward (east, southeast, south) neighbor w.
I.e. the overall score Sv+D(v,w) is sent to w.
The value of Sw is the maximum of all values sent to w from its backward (west, north, northwest) neighbor cells.
From cell v values are sent to all its neighbor cells
[Figure: for two sequences, cell v sends $S_v - g$ along the two gap moves and $S_v + R(s^1_i, s^2_j)$ along the diagonal move]
![Page 37: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/37.jpg)
One more thing: A queue
We need a data structure before we list the algorithm
A queue is a list of elements with two special operators
– Push: to add an element at the end of the queue
– Pop: to remove an element from the front of the queue
![Page 38: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/38.jpg)
Algorithm: Forward-recursion with pruning
proc F(v, hN): a procedure which finds an upper bound of the score of the alignment from a cell v to the end-cell hN
begin
    v = h0; P(v) = 0; push(v,Q)          (push start cell on queue)
    while Q is not empty do
        pop(v,Q); S(v) = P(v)            (v has got all values from its neighbours)
        if S(v) + F(v, hN) >= K then
            for all forward neighbours w of v do
                if w doesn’t belong to Q then
                    push(w,Q); P(w) = S(v) + D(v,w)
                else
                    P(w) = max( P(w), S(v)+D(v,w) )
            end for
    end while
end
const
    h0: the start cell of the DP matrix (H0,0…0)
    hN: the end cell of the DP matrix (Hn1,n2…nm)
    K: a lower bound for the score of the whole alignment
var
    u, v, w: denote cells
    S(u): the best score of an alignment from h0 to u
    P(u): the score of the best alignment from h0 to u found so far
    D(u, v): the score for extending the alignment from cell u to cell v
    Q: a queue of the cells u for which a value for P(u) is found but u is not visited yet
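One way to turn this pseudocode into runnable code is sketched below. It is an illustration under several assumptions rather than the lecture's implementation: the names are made up; D(v,w) is taken to be the sum-of-pairs score of the new column; F is taken to be the sum of optimal pairwise scores of the remaining suffixes; cells count the residues consumed so far, so the end cell of DQLF, DNVQ, QGL is (4,4,3) here; and the queue is replaced by a heap ordered by cell index, which guarantees that a cell is popped only after all of its predecessors.

```python
import heapq
from functools import lru_cache
from itertools import combinations, product

MATCH, OTHER = 3, -1   # match = 3; mismatch, insertion, deletion = -1


def pair_score(a, b):
    if a == "-" and b == "-":
        return 0
    return MATCH if a == b else OTHER


@lru_cache(maxsize=None)
def pairwise(s, t):
    """Optimal global pairwise score, used for the upper bound F."""
    prev = [j * OTHER for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * OTHER]
        for j in range(1, len(t) + 1):
            cur.append(max(prev[j - 1] + pair_score(s[i - 1], t[j - 1]),
                           prev[j] + OTHER, cur[j - 1] + OTHER))
        prev = cur
    return prev[-1]


def upper_bound(seqs, v):
    """F(v, hN): sum of optimal pairwise scores of the remaining suffixes."""
    return sum(pairwise(seqs[k][v[k]:], seqs[l][v[l]:])
               for k, l in combinations(range(len(seqs)), 2))


def msa_score(seqs, K):
    """Forward recursion with pruning; returns S(hN)."""
    start = tuple(0 for _ in seqs)
    end = tuple(len(s) for s in seqs)
    P, S, heap = {start: 0}, {}, [start]
    while heap:
        v = heapq.heappop(heap)        # smallest cell: all predecessors done
        S[v] = P[v]
        if S[v] + upper_bound(seqs, v) < K:
            continue                   # pruned: v cannot lie on the best path
        for step in product((0, 1), repeat=len(seqs)):
            w = tuple(i + d for i, d in zip(v, step))
            if not any(step) or any(i > n for i, n in zip(w, end)):
                continue
            col = [s[i] if d else "-" for s, i, d in zip(seqs, v, step)]
            cand = S[v] + sum(pair_score(a, b) for a, b in combinations(col, 2))
            if w not in P:
                P[w] = cand
                heapq.heappush(heap, w)
            else:
                P[w] = max(P[w], cand)
    return S.get(end)


seqs = ("DQLF", "DNVQ", "QGL")
print(msa_score(seqs, K=-2))
```

With the lower bound K = -2 used in the worked example of this deck, the sketch returns the best sum-of-pairs score for the three sequences while skipping every cell that fails the test S(v) + F(v, hN) >= K.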
![Page 39: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/39.jpg)
Finding upper limits for scores
For any multiple sequence alignment M of sequences {s1, s2, …, sm} we know that the score S(M) of the multiple sequence alignment is at most the sum of the pairwise comparisons of the sequences {s1, s2, …, sm}
$$S(M) \le \sum_{k=1}^{m-1} \sum_{l=k+1}^{m} S(s_k, s_l)$$
The procedure F should find an upper bound for the alignment of the subsequences $s^1_{i_1+1 \ldots n_1}, s^2_{i_2+1 \ldots n_2}, \ldots, s^m_{i_m+1 \ldots n_m}$. This can be done as follows:
$$F = \sum_{k=1}^{m-1} \sum_{l=k+1}^{m} S\left(s^k_{i_k+1 \ldots n_k},\; s^l_{i_l+1 \ldots n_l}\right) \qquad (4.6)$$
![Page 40: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/40.jpg)
Questions
QUESTION: What is the score of the multiple sequence alignment when the algorithm is done?
QUESTION: How can we get the alignment from the algorithm?
![Page 41: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/41.jpg)
Answers
The score for the multiple sequence alignment is S(hN)
How can we get an alignment from the algorithm?
We need another variable Dir to store the direction from which we were coming
Let’s assume we are at node v and its neighbour w is not pruned
– If w is new in queue then Dir(w) = {v}
– If w is already in queue and S(v)+D(v,w) > P(w) then P(w) = S(v)+D(v,w) and Dir(w) = {v}
– If w is already in queue and S(v)+D(v,w) = P(w) then add v to Dir(w)
![Page 42: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/42.jpg)
Algorithm: Forward-recursion with pruning
proc F(v, hN): a procedure which finds an upper bound of the score of the alignment from a cell v to the end-cell hN
begin
    v = h0; P(v) = 0; push(v,Q)          (push start cell on queue)
    while Q is not empty do
        pop(v,Q); S(v) = P(v)            (v has got all values from its neighbours)
        if S(v) + F(v, hN) >= K then
            for all forward neighbours w of v do
                if w doesn’t belong to Q then
                    push(w,Q); P(w) = S(v) + D(v,w); Dir(w) = {v}
                else if S(v)+D(v,w) > P(w) then
                    P(w) = S(v)+D(v,w); Dir(w) = {v}
                else if S(v)+D(v,w) = P(w) then
                    add v to Dir(w)
            end for
    end while
end
const
    h0: the start cell of the DP matrix (H0,0…0)
    hN: the end cell of the DP matrix (Hn1,n2…nm)
    K: a lower bound for the score of the whole alignment
var
    u, v, w: denote cells
    S(u): the best score of an alignment from h0 to u
    P(u): the score of the best alignment from h0 to u found so far
    D(u, v): the score for extending the alignment from cell u to cell v
    Q: a queue of the cells u for which a value for P(u) is found but u is not visited yet
    Dir(w): stores the nodes v from which the best scores were obtained
![Page 43: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/43.jpg)
Printing the alignment: printMSA(hN,0)
printMSA is a recursive function, which takes a node v and a position k in the alignment to be generated as input
B is a matrix, which contains the alignment
printMSA(v,k):
    If v = h0 then print B
    Else
        Let i1,…,im be the indices of v
        For all u in Dir(v) do
            Let i’1,…,i’m be the indices of u
            For j from 1 to m do
                If ij = i’j then Bk,j = “-” Else Bk,j = sequence j at position ij
            printMSA(u,k+1)
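The core of printMSA is the index comparison: if an index does not change between two consecutive cells, the corresponding sequence receives a gap, otherwise it receives the residue that was consumed. The sketch below shows just this step for a single path that is assumed to be given from the start cell to the end cell (printMSA itself walks backwards through Dir and can branch). The name `path_to_alignment` and the convention that a cell counts consumed residues are assumptions for illustration.

```python
def path_to_alignment(seqs, path):
    """Turn a path of DP cells (start cell first) into the alignment rows."""
    rows = ["" for _ in seqs]
    for u, v in zip(path, path[1:]):
        for j, seq in enumerate(seqs):
            # Index unchanged: gap. Index advanced: the residue just consumed.
            rows[j] += "-" if u[j] == v[j] else seq[u[j]]
    return rows


seqs = ("DQLF", "DNVQ", "QGL")
path = [(0, 0, 0), (1, 1, 0), (1, 2, 0), (1, 3, 0),
        (2, 4, 1), (2, 4, 2), (3, 4, 3), (4, 4, 3)]

for row in path_to_alignment(seqs, path):
    print(row)
```

For this hand-written path the rows come out as the alignment shown on the "Dynamic Programming in 3D" slide.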
![Page 44: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/44.jpg)
Questions
QUESTION: Why is Dir a set and not a single node?
QUESTION: Does printMSA print one multiple sequence alignment or all possible ones?
![Page 45: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/45.jpg)
Example
Let’s align DQLF, DNVQ, QGL with match = 3 and insertion, deletion, mismatch = -1
[Figure: three-dimensional DP matrix with start cell <0,0,0> and end cell <3,3,2>]
![Page 46: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/46.jpg)
Example
We need a lower bound for the overall result. Let’s assume we already have the following alignment
What is K, the sum of pairs for this alignment?
DQ-LF
DNVQ-
-QGL-
![Page 47: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/47.jpg)
Example
We need a lower bound for the overall result. For the alignment below, the sum of pairs gives
K = S(s’1,s’2) + S(s’2,s’3) + S(s’1,s’3) = -1 - 4 + 3 = -2
DQ-LF
DNVQ-
-QGL-
![Page 48: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/48.jpg)
Example
Upper bound for the score from <0,0,0> to <3,3,2> (match = 3 and insertion, deletion, mismatch = -1)
F( <0,0,0>, <3,3,2> ) = +2 +3 -2 = +3
D--QLF DQ-LF DNVQ--
DNVQ-- -QGL- ---QGL
+2 +3 -2
![Page 49: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/49.jpg)
Example
Q: <0,0,0>
P( <0,0,0> ) = 0
S( <0,0,0> ) = 0
![Page 50: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/50.jpg)
Example
S( <0,0,0> ) + F( <0,0,0>, <3,3,2> ) = 0 + 3 >= -2
Q: <0,0,1>, <0,1,0>, <0,1,1>, … , <1,1,1>
P( <0,0,1> ) = 0 + -2   --Q
P( <0,1,0> ) = 0 + -2   -D-
P( <0,1,1> ) = 0 + -3   -DQ
P( <1,0,0> ) = 0 + -2   D--
P( <1,0,1> ) = 0 + -3   D-Q
P( <1,1,0> ) = 0 + 1    DD-
P( <1,1,1> ) = 0 + 1    DDQ
![Page 51: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/51.jpg)
Example
v = <0,0,1>, Q: <0,1,0>, <0,1,1>, … , <1,1,1>
S( <0,0,1> ) = P( <0,0,1> ) = -2
![Page 52: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/52.jpg)
Example
Upper bound for the score from <0,0,1> to <3,3,2> (match = 3 and insertion, deletion, mismatch = -1)
F( <0,0,1>, <3,3,2> ) = +2 +0 -4 = -2
D--QLF DQLF DNVQ
DNVQ-- -GL- GL--
+2 +0 -4
![Page 53: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/53.jpg)
Example
v = <0,0,1>
S( <0,0,1> ) = -2
S( <0,0,1> ) + F( <0,0,1>, <3,3,2> ) = -2 - 2 = -4, which is not >= K = -2
Q: <0,1,0>, <0,1,1>, … , <1,1,1>
v = <0,0,1> is not pursued further, as the pruning rule determines that it cannot be part of the best alignment
![Page 54: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/54.jpg)
From MSA to phylogenetic trees

AR-L
ARTL
ARSI
ARSL
AWTL
AWT-

[Figure: the six aligned sequences are grouped step by step (steps 1 to 3) into a tree; the first split separates AR-L, ARTL, ARSI, ARSL from AWTL, AWT-]
![Page 55: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/55.jpg)
Phylogenetic tree
Introduction
Definition
Tree construction method
– Clustering (UPGMA)
– Neighbour Joining
![Page 56: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/56.jpg)
Darwin: “On the Origin of Species”
Find the evolutionary history of species existing today and how they are related.
![Page 57: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/57.jpg)
Unrooted and Rooted Trees
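[Figure: rooted trees over the leaves A, B, C (leaf orders A B C, A C B, B C A) and the unrooted tree over A, B, C with a possible root position marked]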
![Page 58: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/58.jpg)
Unrooted and Rooted Trees
[Figure: all the topologies for four original sequences A, B, C, D: (a) unrooted and (b) rooted]
![Page 59: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/59.jpg)
How many different trees are there?
The number of unrooted topologies for m≥3 original sequences is

$$T_{unroot}(m) = \frac{(2m-5)!}{2^{m-3}\,(m-3)!} \qquad (4.7)$$

The number of rooted topologies for m≥2 original sequences is

$$T_{root}(m) = \frac{(2m-3)!}{2^{m-2}\,(m-2)!} \qquad (4.8)$$

Example: For m=10 there are 2,027,025 unrooted trees and 34,459,425 rooted trees
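The two counting formulas are easy to evaluate directly. The following is a minimal sketch; the function names are illustrative.

```python
from math import factorial

def unrooted_topologies(m):
    """Number of unrooted tree topologies for m >= 3 original sequences."""
    if m < 3:
        raise ValueError("defined for m >= 3")
    return factorial(2 * m - 5) // (2 ** (m - 3) * factorial(m - 3))

def rooted_topologies(m):
    """Number of rooted tree topologies for m >= 2 original sequences."""
    if m < 2:
        raise ValueError("defined for m >= 2")
    return factorial(2 * m - 3) // (2 ** (m - 2) * factorial(m - 2))

for m in (3, 4, 10):
    print(m, unrooted_topologies(m), rooted_topologies(m))
```

The rapid growth of both counts is the reason why exhaustive enumeration of trees is not practical beyond a small number of sequences.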
![Page 60: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/60.jpg)
Distances between Nodes
Degree of sequence similarity should be reflected in the distances between nodes
Additive tree: The distance between any two nodes is the sum of the distances over the edges connecting the nodes
![Page 61: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/61.jpg)
Additive Trees

A tree is additive if and only if the distance between any two nodes is the sum of the distances over the edges connecting the nodes

(a) An additive tree constructed from the sequences with the distances in (b). r shows where a root is placed.

[Figure (a): additive tree over the leaves A to F with the root position r]

(b)
     B   C   D   E   F
A   27  24  22  31  30
B       11  21  12  11
C           18  15  14
D               25  24
E                    5
![Page 62: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/62.jpg)
Additive Trees
If the distances between nodes satisfy the condition below, then an additive tree can be constructed: every four nodes can be labelled i, j, k, l such that

$$D_{i,j} + D_{k,l} = D_{i,k} + D_{j,l} \ge D_{i,l} + D_{j,k}$$

This means that there are often distance matrices for which we cannot compute an additive tree

[Figure: tree with the four leaves i, l, j, k]
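The condition can be tested mechanically: for every four nodes, the two largest of the three pairwise sums must be equal. The sketch below applies this check to the distance table (b) of the additive tree example; the helper names and the dictionary layout are assumptions made for illustration.

```python
from itertools import combinations

def from_upper(names, rows):
    """Build {frozenset({a, b}): distance} from upper-triangular rows."""
    dist = {}
    for i, row in enumerate(rows):
        for offset, value in enumerate(row):
            dist[frozenset((names[i], names[i + 1 + offset]))] = value
    return dist

def is_additive(dist, tol=1e-9):
    """Four-point check: the two largest sums agree for every quadruple."""
    names = sorted({x for pair in dist for x in pair})
    d = lambda a, b: dist[frozenset((a, b))]
    for a, b, c, e in combinations(names, 4):
        sums = sorted([d(a, b) + d(c, e), d(a, c) + d(b, e), d(a, e) + d(b, c)])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

table_b = from_upper("ABCDEF", [[27, 24, 22, 31, 30],
                                [11, 21, 12, 11],
                                [18, 15, 14],
                                [25, 24],
                                [5]])
print(is_additive(table_b))
```

A matrix that fails the check for even one quadruple cannot be represented exactly by any additive tree.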
![Page 63: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/63.jpg)
Distance-based Approach

Single Alignment

atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata
|||||||||||||||  |||| ||||||||   |||| |||||||||||||||
atgctctggccacggatcttgtggatccca---tgatatgtgcacctgcgata

Score: 46 matches, 4 mismatches, 1 gap, 3 gap extensions, e.g. Score = 46x1 - 4x1 - 1x2 - 3x1 = 37

Approach:
Define distance between two sequences, e.g. percentage of mismatches in their alignment
Construct tree, which groups sequences with minimal distances iteratively together
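As a sketch of the first step of this approach, the code below computes the fraction of mismatches for every pair of rows of a multiple alignment. It uses the six short sequences from the "From MSA to phylogenetic trees" slide; skipping columns that contain a gap is an assumption made here, as are the labels s1 to s6 and the function names.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of mismatching columns, ignoring columns with a gap (assumption)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no gap-free columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(msa):
    """Pairwise distances for an alignment given as {name: aligned sequence}."""
    return {frozenset((p, q)): p_distance(msa[p], msa[q])
            for p, q in combinations(msa, 2)}

# Hypothetical labels for the six aligned sequences.
msa = {"s1": "AR-L", "s2": "ARTL", "s3": "ARSI",
       "s4": "ARSL", "s5": "AWTL", "s6": "AWT-"}

for pair, dist in distance_matrix(msa).items():
    print(sorted(pair), dist)
```

The resulting matrix is the input expected by the clustering and neighbour joining methods that follow.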
![Page 64: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/64.jpg)
Distance based

Alignment → Distance Matrix → Tree

[Figure: tree with the leaves 1 to 7]
![Page 65: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/65.jpg)
Hierarchical Clustering (Single linkage)
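Original distances:

      1   2   3   4   5
1     0   2   6  10   9
2         0   5   9   8
3             0   4   5
4                 0   3
5                     0

After joining 1 and 2:

        (1,2)   3   4   5
(1,2)     0     5   9   8
3               0   4   5
4                   0   3
5                       0

After joining 4 and 5:

        (1,2)   3   (4,5)
(1,2)     0     5     8
3               0     4
(4,5)                 0

After joining 3 and (4,5):

            (1,2)   (3,(4,5))
(1,2)         0         5
(3,(4,5))               0

[Figure: resulting dendrogram over the leaves 1 to 5, with a height axis from 0 to 5]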
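The sequence of tables above can be reproduced with a short sketch of single-linkage clustering. The data layout (a dictionary keyed by unordered pairs) and the function names are assumptions made for illustration.

```python
from itertools import combinations

def from_upper(names, rows):
    """Build {frozenset({a, b}): distance} from upper-triangular rows."""
    dist = {}
    for i, row in enumerate(rows):
        for offset, value in enumerate(row):
            dist[frozenset((names[i], names[i + 1 + offset]))] = value
    return dist

def single_linkage(names, dist):
    """Repeatedly join the two closest clusters; returns (u, v, distance) per join."""
    d = dict(dist)
    clusters = list(names)
    merges = []
    while len(clusters) > 1:
        u, v = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        w = f"({u},{v})"
        merges.append((u, v, d[frozenset((u, v))]))
        for x in clusters:
            if x not in (u, v):
                # Single linkage: distance to the nearer of the two children.
                d[frozenset((x, w))] = min(d[frozenset((x, u))], d[frozenset((x, v))])
        clusters = [x for x in clusters if x not in (u, v)] + [w]
    return merges

dist = from_upper("12345", [[2, 6, 10, 9], [5, 9, 8], [4, 5], [3]])
for u, v, height in single_linkage("12345", dist):
    print(u, v, height)
```

The joins happen at distances 2, 3, 4 and 5, which are the heights at which the branches of the dendrogram meet.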
![Page 66: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/66.jpg)
Hierarchical clustering

const m   number of original sequences
var   U   a set of current trees, initially one tree for each original sequence
      D   the distance between the trees in U
begin
  U = the set of one tree (each of one node) for each original sequence
  while |U| > 1 do
    (u,v) = the roots of two trees in U with the least distance in D
    Make a new tree with root w and with u and v as children
    Calculate the length of the edges (v, w) and (u, w)
    for each root x of the trees in U - {u, v} do
      D(x, w) = calculate the distance between x and the new node (w)
    end
    U = (U - {u,v}) ∪ {w}      update U
  end
end
![Page 67: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/67.jpg)
Hierarchical Clustering
How to define distance between clusters?
Distance to the new cluster w = (u,v)

Single linkage: D(x,w) = min { D(x,u), D(x,v) }
Example: Distance (A,B) to C is 1

Complete linkage: D(x,w) = max { D(x,u), D(x,v) }
Example: Distance (A,B) to C is 2

Average linkage (also called WPGMA, weighted pair group method with arithmetic mean):
D(x,w) = ( D(x,u) + D(x,v) ) / 2
Example: Distance (A,B) to C is 1.5

More general (also called UPGMA, unweighted pair group method using arithmetic mean):
D(x,w) = ( mu D(x,u) + mv D(x,v) ) / (mu + mv)
mu is the number of leaf nodes (original sequences) in the subtree u

Question: Are dendrograms always the same, independent of the method?

Question: What’s the difference between UPGMA and WPGMA?

Note: “weighted” and “unweighted” refer to the original sequences. u and v may contain different numbers of sequences; UPGMA scales by mu and mv so that every original sequence counts equally, whereas the plain average of WPGMA gives the sequences in the smaller subtree more weight.
![Page 68: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/68.jpg)
Hierarchical Clustering
     A   B   C
A    0   1   2
B        0   1
C            0

[Figure: dendrograms over A, B, C for single linkage and for complete linkage]

Question: Are dendrograms always the same, independent of the method?

Question: What’s the difference between WPGMA and UPGMA?

Average linkage: D(x,w) = ( D(x,u) + D(x,v) ) / 2
Example: Distance (A,B) to C is 1.5

More general: D(x,w) = ( mu D(x,u) + mv D(x,v) ) / (mu + mv)
mu is the number of leaf nodes in the subtree u

Consider that subtree D contains 100 leaf nodes (mD = 100) and E only 1 (mE = 1)

     D   E   F
D    0  10   2
E        0  10
F            0

Average linkage: D( (D,E), F ) = (2+10)/2 = 6
Weighted average: D( (D,E), F ) = (100*2 + 1*10)/(100+1) = 2.08
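The four update rules differ only in how D(x,u) and D(x,v) are combined. A minimal sketch with illustrative function names, using the two small examples above as input:

```python
def single(dxu, dxv):
    """Single linkage: distance to the nearer child."""
    return min(dxu, dxv)

def complete(dxu, dxv):
    """Complete linkage: distance to the farther child."""
    return max(dxu, dxv)

def wpgma(dxu, dxv):
    """Plain average of the two child distances."""
    return (dxu + dxv) / 2

def upgma(dxu, dxv, mu, mv):
    """Average weighted by the number of leaves mu, mv below u and v."""
    return (mu * dxu + mv * dxv) / (mu + mv)

# Cluster (A,B) against C, with D(C,A) = 2 and D(C,B) = 1.
print(single(2, 1), complete(2, 1), wpgma(2, 1))

# Cluster (D,E) against F, with D(F,D) = 2, D(F,E) = 10, mD = 100, mE = 1.
print(wpgma(2, 10), round(upgma(2, 10, 100, 1), 2))
```

When u and v contain the same number of leaves the last two rules coincide, so the choice between them only matters for unbalanced subtrees.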
![Page 69: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/69.jpg)
UPGMA-example
(a)
     B   C   D   E
A    3   7   8  10
B        6   8   7
C            4   5
D                6

(b)
         C     D     E
(A,B)   6.5    8    8.5
C              4     5
D                    6

(c)
        (C,D)    E
(A,B)   7.25    8.5
(C,D)           5.5

(d)
        ((C,D),E)
(A,B)     7.67
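Tables (a) to (d) can be recomputed with a short UPGMA sketch in which each cluster is weighted by its number of leaves. The data layout and names are assumptions made for illustration; the cluster labels produced by the code may list the children in a different order than the slide.

```python
from itertools import combinations

def from_upper(names, rows):
    """Build {frozenset({a, b}): distance} from upper-triangular rows."""
    dist = {}
    for i, row in enumerate(rows):
        for offset, value in enumerate(row):
            dist[frozenset((names[i], names[i + 1 + offset]))] = value
    return dist

def upgma(names, dist):
    """Return (new cluster label, joining distance) for every merge."""
    d = dict(dist)
    size = {n: 1 for n in names}          # number of leaves per cluster
    clusters = list(names)
    merges = []
    while len(clusters) > 1:
        u, v = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        w = f"({u},{v})"
        merges.append((w, d[frozenset((u, v))]))
        for x in clusters:
            if x not in (u, v):
                d[frozenset((x, w))] = (
                    size[u] * d[frozenset((x, u))] + size[v] * d[frozenset((x, v))]
                ) / (size[u] + size[v])
        size[w] = size[u] + size[v]
        clusters = [x for x in clusters if x not in (u, v)] + [w]
    return merges

dist = from_upper("ABCDE", [[3, 7, 8, 10], [6, 8, 7], [4, 5], [6]])
for label, height in upgma("ABCDE", dist):
    print(label, round(height, 2))
```

The joining distances come out as 3, 4, 5.5 and 7.67, matching the smallest entries of the four tables.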
![Page 70: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/70.jpg)
Constructing the Edges of the Tree
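Let’s assume we want to join the subtrees u and v under the new root w

Then the edge from v to w has to have the following length

$$L_{v,w} = 0.5\, D_{u,v} - L_{v,y_v}$$

where yv is a leaf below v, so Lv,yv is the height of v (0 if v is itself a leaf)

Example: Joining C and D:
L C,(C,D) = 0.5x4 – 0 = 2

Joining (C,D) and E:
L (C,D),((C,D),E) = 0.5x5.5 – 2 = 0.75

[Figure: subtrees u and v joined under the new root w; yv is a leaf below v, and the edge lengths Lv,w and Lv,yv are marked]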
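The edge-length rule is a one-line computation once the height of v is known. A minimal sketch, with an illustrative function name, applied to the UPGMA example:

```python
def edge_length(d_uv, height_v):
    """Length of the edge from v up to the new root w.

    d_uv is the distance at which u and v are joined; height_v is the
    distance from v down to any leaf below it (0 if v is a leaf).
    """
    return 0.5 * d_uv - height_v

print(edge_length(4, 0))      # C up to (C,D)
print(edge_length(5.5, 2))    # (C,D) up to ((C,D),E)
print(edge_length(5.5, 0))    # E up to ((C,D),E)
```

Because every leaf below w ends up at the same height 0.5 Du,v, the resulting tree places all leaves at equal distance from the root.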
![Page 71: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/71.jpg)
UPGMA-Tree
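[Figure: UPGMA tree with the leaves A, B, C, D, E and the internal nodes (A,B), (C,D), ((C,D),E). Edge lengths: A and B to (A,B) 1.5 each; C and D to (C,D) 2 each; (C,D) to ((C,D),E) 0.75; E to ((C,D),E) 2.75; (A,B) to the root 2.33; ((C,D),E) to the root 1.08]

Distances in tree

     B     C     D     E
A    3   7.66  7.66  7.66
B        7.66  7.66  7.66
C               4    5.5
D                    5.5

Original distances

     B   C   D   E
A    3   7   8  10
B        6   8   7
C            4   5
D                6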
![Page 72: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/72.jpg)
Neighbour Joining (NJ)
Does not assume a constant molecular clock
Starts with a star tree where all nodes are linked to a central node:

[Figure: star tree with the central node x and the leaves A, B, C, D, E, F]
![Page 73: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/73.jpg)
Neighbour Joining (NJ)
Each pair of nodes is evaluated for being clustered together

For each pair the sum of all branch lengths in the resulting tree is calculated

The pair giving the lowest sum is chosen; from then on the pair is treated as one node

This is repeated until all nodes are joined
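The steps above can be sketched as follows. The slides give no formula for the sum of branch lengths, so the sketch assumes the commonly used Q-criterion, Q(i,j) = (n-2) d(i,j) - r(i) - r(j) with r(i) the sum of all distances from i, which selects the same pair as minimising the total branch length. The data layout and function names are illustrative; the input is the distance table of the additive tree example.

```python
from itertools import combinations

def full_matrix(names, upper_rows):
    """Expand upper-triangular rows into a symmetric dict of dicts (no diagonal)."""
    D = {a: {} for a in names}
    for i, row in enumerate(upper_rows):
        for offset, value in enumerate(row):
            a, b = names[i], names[i + 1 + offset]
            D[a][b] = D[b][a] = value
    return D

def neighbour_joining(D):
    """Return the joins as (i, j, edge length of i, edge length of j)."""
    D = {a: dict(row) for a, row in D.items()}   # work on a copy
    joins = []
    while len(D) > 2:
        n = len(D)
        r = {a: sum(D[a].values()) for a in D}
        i, j = min(combinations(D, 2),
                   key=lambda p: (n - 2) * D[p[0]][p[1]] - r[p[0]] - r[p[1]])
        d_ij = D[i][j]
        len_i = 0.5 * d_ij + (r[i] - r[j]) / (2 * (n - 2))
        joins.append((i, j, len_i, d_ij - len_i))
        w = f"({i},{j})"
        new_row = {k: 0.5 * (D[i][k] + D[j][k] - d_ij) for k in D if k not in (i, j)}
        del D[i], D[j]
        for k, value in new_row.items():
            del D[k][i], D[k][j]
            D[k][w] = value
        D[w] = new_row
    return joins

D = full_matrix("ABCDEF", [[27, 24, 22, 31, 30],
                           [11, 21, 12, 11],
                           [18, 15, 14],
                           [25, 24],
                           [5]])
for join in neighbour_joining(D):
    print(join)
```

For this matrix the first join pairs A and D with edge lengths 14 and 8, and the second pairs E and F with edge lengths 3 and 2. Note that A and D are not the closest pair in the table, which illustrates how the criterion takes the rest of the tree into account.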
![Page 74: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/74.jpg)
Neighbour Joining (NJ)

[Figure: panels (a) to (d) showing trees over the leaves A to F with the internal nodes x and Y at successive stages of joining]
![Page 75: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/75.jpg)
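Rooting an Unrooted Tree

Choose the mid-point of the longest path between two nodes and introduce the new root node there

[Figure: unrooted tree over the leaves A to F with the internal nodes x and Y, and the rooted tree obtained from it; mid-point = root]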
![Page 76: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/76.jpg)
Rooting an Unrooted Tree
Alternative: Use an outgroup, which has a large distance to all other nodes

Example: Let’s assume D is the outgroup, then the root is added to the edge from D

[Figure: unrooted tree over A, B, C, D and the corresponding rooted tree; D = outgroup, so the root goes on the edge leading to D]
![Page 77: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/77.jpg)
NJ vs Hierarchical clustering
In Neighbour Joining the pair of nodes is chosen that gives the lowest sum of branch lengths in the resulting tree.
In Hierarchical clustering the pair of closest nodes is chosen, without taking into account the rest of the tree.
Hierarchical clustering does not allow for rate variation among branches.
![Page 78: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/78.jpg)
Assessing Quality: Bootstrapping

Given a tree obtained from one of the methods above
Generate Multiple Alignment
For a number of iterations
– Generate new sequences by selecting columns (possibly the same column more than once) from the multiple alignment
– Generate tree for the new sequences
– Compare this new tree with the given tree
– For each cluster in the given tree which also appears in the new tree, the bootstrap value is increased
Bootstrap-Value = Percentage of trees containing the same cluster
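The resampling loop can be sketched as below. Tree construction is left as a user-supplied function: `build_clusters` is a hypothetical callback that takes an alignment (a list of equally long strings) and returns the set of clusters of its tree, each cluster a frozenset of row indices. All names are illustrative.

```python
import random

def bootstrap_alignment(msa, rng):
    """Draw as many columns as the alignment has, with replacement."""
    length = len(msa[0])
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in cols) for seq in msa]

def bootstrap_support(msa, build_clusters, n_iter=100, seed=0):
    """Percentage of resampled trees that contain each cluster of the given tree."""
    rng = random.Random(seed)
    reference = build_clusters(msa)
    counts = {cluster: 0 for cluster in reference}
    for _ in range(n_iter):
        found = build_clusters(bootstrap_alignment(msa, rng))
        for cluster in reference:
            if cluster in found:
                counts[cluster] += 1
    return {cluster: 100.0 * k / n_iter for cluster, k in counts.items()}

msa = ["AR-L", "ARTL", "ARSI", "ARSL", "AWTL", "AWT-"]
print(bootstrap_alignment(msa, random.Random(1)))
```

Columns are resampled jointly for all rows, so every resampled alignment keeps the rows aligned with each other; only the weight of each column changes.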
![Page 79: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/79.jpg)
From Phylogenetic Trees to MSA
Use a phylogenetic tree to guide the construction of the multiple sequence alignment
![Page 80: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/80.jpg)
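From Phylogenetic Trees to MSA

[Figure: dendrogram over the leaves 1 to 5 with a height axis from 0 to 5, and the MSA built from it with the sequences in the order 1, 2, 4, 5, 3]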
![Page 81: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/81.jpg)
Progressive Alignment

Algorithm: Progressive alignment of the sequences {s1, s2, …, sm}
var C   current set of alignments
begin
  C = { };
  for i = 1 to m do C = C ∪ {{ si }} end      one alignment of each sequence
  for i = 1 to m-1 do
    choose two alignments Ap, Aq from C;
    C = C - { Ap, Aq };
    Ar = align( Ap, Aq );
    C = C ∪ { Ar }
  end
  C now contains the (single) final alignment
end
![Page 82: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/82.jpg)
Aligning two subset alignments
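Two subset alignments Ap, Aq with the sequences {sp1 … spm} and {sq1 … sqn}

Complete alignment method for aligning pairs of subset alignments

The SP score for aligning column r of Ap with column t of Aq will be

$$S_{r,t} = \frac{1}{nm} \sum_{j \in \{p_1 \ldots p_m\}} \; \sum_{k \in \{q_1 \ldots q_n\}} w_j\, w_k\, R\left(s'^{\,j}_{r},\, s'^{\,k}_{t}\right)$$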
![Page 83: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/83.jpg)
Clustering

The progressive alignment should be guided by a true phylogenetic tree

Methods
– Average linkage
– Maximum (single) linkage
– Minimum (complete) linkage
![Page 84: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/84.jpg)
Clustering example

Three alignments: A1 = {s1, s2}, A2 = {s3, s4} and A3 = {s5}, with pairwise scores:

     s2  s3  s4  s5
s1    -   7   5   3
s2        6   4   8
s3            -   7
s4                6

Average linkage
S(A1,A2) = (7+5+6+4)/4 = 5.5
S(A1,A3) = 5.5
S(A2,A3) = 6.5   best

Maximum linkage
S(A1,A2) = max (7,5,6,4) = 7
S(A1,A3) = 8   best
S(A2,A3) = 7

Minimum linkage
S(A1,A2) = min (7,5,6,4) = 4
S(A1,A3) = 3
S(A2,A3) = 6   best
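The three linkage scores in this example can be recomputed with a few lines. This is a minimal sketch; the dictionary layout and the function name are illustrative. Note that the values are similarity scores, so the best pair is the one with the highest value.

```python
from itertools import combinations

# Pairwise scores between sequences of different alignments.
scores = {
    frozenset(("s1", "s3")): 7, frozenset(("s1", "s4")): 5, frozenset(("s1", "s5")): 3,
    frozenset(("s2", "s3")): 6, frozenset(("s2", "s4")): 4, frozenset(("s2", "s5")): 8,
    frozenset(("s3", "s5")): 7, frozenset(("s4", "s5")): 6,
}

alignments = {"A1": ["s1", "s2"], "A2": ["s3", "s4"], "A3": ["s5"]}

def linkage_scores(a, b):
    """Average, maximum and minimum score over all cross pairs of a and b."""
    values = [scores[frozenset((x, y))] for x in a for y in b]
    return {"average": sum(values) / len(values),
            "maximum": max(values),
            "minimum": min(values)}

for p, q in combinations(alignments, 2):
    print(p, q, linkage_scores(alignments[p], alignments[q]))
```

Each of the three methods picks a different pair as best here, which shows that the choice of linkage changes the order in which alignments are merged.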
![Page 85: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/85.jpg)
Linear clustering

Algorithm: Basic linear clustering for aligning the sequences {s1, s2, …, sn}
var U   the set of sequences not aligned
    A   the current alignment
begin
  U = {s1, s2, …, sn};
  choose two sequences (the most similar) (s, t) from U;
  A = Align(s, t);
  U = U – {s, t};
  for i = 1 to n-2 do
    choose a sequence s from U;
    U = U – {s};
    A = Align(A, s)
  end
end
![Page 86: Multiple Global Alignment and Phylogenetic tree](https://reader036.vdocuments.mx/reader036/viewer/2022062315/56815821550346895dc58795/html5/thumbnails/86.jpg)
The CLUSTALW Algorithm
CLUSTALW: one of the most popular programs for global multiple sequence alignment
1. Calculate the (static) pairwise similarity scores for the sequences
2. Construct a guide tree by use of the pairwise scores (NJ method)
3. Calculate sequence weights, using the guide tree
4. Perform a progressive alignment, guided by the tree