We do not have to understand the language to identify patterns: “klaatu barada nikto”
![Page 1: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/1.jpg)
DNA, RNA and protein are an alien language ... We try to cryptographically attack this language ... we
want to decipher both its meaning and its history …
![Page 2: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/2.jpg)
We do not have to understand the language to identify patterns:
“klaatu barada nikto”
Fortunately, the genetic code is alphabetic … and therefore amenable to string comparison and
pattern recognition
![Page 3: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/3.jpg)
Pairwise Sequence Alignment
![Page 4: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/4.jpg)
Pairwise Sequence Alignment
• Principles of pairwise sequence comparison
  • global / local alignments
  • scoring systems
  • gap penalties
• Methods of pairwise sequence alignment
  • window-based methods
  • dynamic programming approaches
![Page 5: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/5.jpg)
Pairwise Sequence Alignment: How to?
Sequence 1: T A C A T T A C G T A C
Sequence 2: A T T C A C A T A
![Page 6: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/6.jpg)
Dotplot:
[Figure: dotplot with Sequence 1 (T A C A T T A C G T A C) on the horizontal axis and Sequence 2 (A T T C A C A T A) on the vertical axis]
A dotplot gives an overview of all possible alignments
![Page 7: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/7.jpg)
Dotplot:
[Figure: dotplot with Sequence 1 (T A C A T T A C G T A C) on the horizontal axis and Sequence 2 (A T T C A C A T A) on the vertical axis]
One possible alignment:
T A C A T T A C G T A C
A T A C A C T T A
In a dotplot each diagonal corresponds to a possible (ungapped) alignment
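The dotplot idea can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the slides: the function names `dotplot` and `diagonal_matches` are made up here, and the two sequences are the ones shown on the slide.

```python
def dotplot(seq1, seq2):
    """Return a grid with True wherever seq2[row] equals seq1[col]."""
    return [[a == b for a in seq1] for b in seq2]


def diagonal_matches(seq1, seq2, offset):
    """Count matches on one diagonal, i.e. one ungapped alignment.

    seq2[r] is paired with seq1[r + offset]; pairs that fall outside
    seq1 are ignored.
    """
    return sum(
        1
        for r in range(len(seq2))
        if 0 <= r + offset < len(seq1) and seq1[r + offset] == seq2[r]
    )


def render(seq1, seq2):
    """Draw the dotplot as text: '*' for a match, '.' otherwise."""
    lines = ["  " + " ".join(seq1)]
    for b, row in zip(seq2, dotplot(seq1, seq2)):
        lines.append(b + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("TACATTACGTAC", "ATTCACATA"))
```

Each value of `offset` selects one diagonal, so scanning all offsets and keeping the one with the most matches gives the best ungapped alignment under a simple match-count score.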
![Page 8: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/8.jpg)
Pairwise Sequence Alignment
• Principles of pairwise sequence comparison
  • global / local alignments
  • scoring systems
  • gap penalties
• Methods of pairwise sequence alignment
  • window-based methods
  • dynamic programming approaches
![Page 9: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/9.jpg)
Window-based Approaches
• Word Size
• Window / Stringency
![Page 10: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/10.jpg)
Word Size Algorithm
Word Size = 3
Sequences compared: T A C G G T A T G and A C A G T A T C
[Figure: a word of three residues is slid along both sequences; a dot is placed in the dotplot wherever the two words are identical]
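A minimal sketch of the word-size method, assuming that a dot is placed only where `word_size` consecutive residues match exactly. The function name `word_hits` is hypothetical; the sequences are the ones on the slide.

```python
def word_hits(seq1, seq2, word_size=3):
    """Return (i, j) start positions where seq1 and seq2 share an identical word."""
    hits = []
    for j in range(len(seq2) - word_size + 1):
        for i in range(len(seq1) - word_size + 1):
            # Exact comparison: a single mismatch rejects the whole word.
            if seq1[i:i + word_size] == seq2[j:j + word_size]:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # The shared words are GTA and TAT.
    print(word_hits("TACGGTATG", "ACAGTATC", word_size=3))
```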
![Page 11: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/11.jpg)
Window / Stringency
Window = 5 / Stringency = 4
Sequences compared: T A C G G T A T G and T C A G T A T C
[Figure: a window of five residues is slid along both sequences; a dot is placed in the dotplot wherever at least four of the five residues match]
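The window/stringency variant can be sketched the same way. This is an illustration under the assumption that a dot is placed when at least `stringency` of the `window` paired residues match; the function name `window_hits` is hypothetical.

```python
def window_hits(seq1, seq2, window=5, stringency=4):
    """Return (i, j) window starts with at least `stringency` matching positions."""
    hits = []
    for j in range(len(seq2) - window + 1):
        for i in range(len(seq1) - window + 1):
            # Count position-wise matches inside the two windows.
            matches = sum(
                a == b
                for a, b in zip(seq1[i:i + window], seq2[j:j + window])
            )
            if matches >= stringency:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    print(window_hits("TACGGTATG", "TCAGTATC", window=5, stringency=4))
```

Setting `stringency` equal to `window` reduces this to exact word matching, which is why the window/stringency method tolerates mismatches that the word-size method rejects.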
![Page 12: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/12.jpg)
Considerations
• The window/stringency method is more sensitive than the word-size method (mismatches within the window are permitted).
• The smaller the window, the larger the weight of chance (non-specific) matches.
• With large windows, the sensitivity for short sequences is reduced.
• Insertions/deletions are not treated explicitly.
![Page 13: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/13.jpg)
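Insertions / Deletions in a Dotplot
[Figure: dotplot of Sequence 1 (T A C T G T T C A T, horizontal axis) against Sequence 2 (T A C T G T C A T, vertical axis); the insertion shifts the match onto a neighbouring diagonal]
T A C T G - T C A T
| | | | |   | | | |
T A C T G T T C A T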
![Page 14: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/14.jpg)
Dotplot (Window = 130 / Stringency = 9)
[Figure: dotplot of two hemoglobin chains, one per axis]
Output of the programs Compare and DotPlot
![Page 15: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/15.jpg)
Dotplot (Window = 18 / Stringency = 10)
[Figure: dotplot of two hemoglobin chains, one per axis]
Output of the programs Compare and DotPlot
![Page 16: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/16.jpg)
Pairwise Sequence Alignment
• Principles of pairwise sequence comparison
  • global / local alignments
  • scoring systems
  • gap penalties
• Methods of pairwise sequence alignment
  • window-based approaches
  • dynamic programming approaches
    • Needleman and Wunsch
    • Smith and Waterman
![Page 17: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/17.jpg)
Dynamic Programming
An automatic procedure that finds the best alignment, the one with an optimal score under the chosen parameters.
Recursive solutions: we solve smaller problems first and use those solutions to solve larger problems. Intermediate solutions are stored in a table (matrix).
![Page 18: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/18.jpg)
Basic principles of dynamic programming
- Initialization of alignment matrix: the scoring model
- Stepwise calculation of score values
(creation of an alignment path matrix)
- Backtracking (evaluation of the optimal path)
![Page 19: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/19.jpg)
Initialization of Matrix (BLOSUM 50): a substitution score matrix (similarity scores, not a distance metric)
     H    E    A    G    A    W    G    H    E    E
P   -2   -1   -1   -2   -1   -4   -2   -2   -1   -1
A   -2   -1    5    0    5   -3    0   -2   -1   -1
W   -3   -3   -3   -3   -3   15   -3   -3   -3   -3
H   10    0   -2   -2   -2   -3   -2   10    0    0
E    0    6   -1   -3   -1   -3   -3    0    6    6
A   -2   -1    5    0    5   -3    0   -2   -1   -1
E    0    6   -1   -3   -1   -3   -3    0    6    6
![Page 20: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/20.jpg)
Needleman and Wunsch (global alignment)
Sequence 1: H E A G A W G H E E
Sequence 2: P A W H E A E
Scoring parameters: BLOSUM50 matrix
Gap penalty: Linear gap penalty of 8
![Page 21: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/21.jpg)
Creation of an alignment path matrix
Idea: Build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences
• Construct matrix F indexed by i and j (one index for each sequence)
• F(i,j) is the score of the best alignment between the initial segment $x_{1 \ldots i}$ of x up to $x_i$ and the initial segment $y_{1 \ldots j}$ of y up to $y_j$
• Build F(i,j) recursively beginning with F(0,0) = 0
Optimal global alignment:
HEAGAWGHE-E
--P-AW-HEAE
![Page 22: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/22.jpg)
Creation of an alignment path matrix
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1
Optimal global alignment:
HEAGAWGHE-E
--P-AW-HEAE
![Page 23: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/23.jpg)
Creation of an alignment path matrix
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
[Diagram: F(i,j) is reached from F(i-1,j-1) by adding s(x_i, y_j), and from F(i-1,j) or F(i,j-1) by adding -d]
Optimal global alignment:
HEAGAWGHE-E
--P-AW-HEAE
![Page 24: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/24.jpg)
Creation of an alignment path matrix
• If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j)
• Three possibilities:
  • $x_i$ and $y_j$ are aligned, $F(i,j) = F(i-1,j-1) + s(x_i, y_j)$
  • $x_i$ is aligned to a gap, $F(i,j) = F(i-1,j) - d$
  • $y_j$ is aligned to a gap, $F(i,j) = F(i,j-1) - d$
• The best score up to (i,j) will be the largest of the three options
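The three-way recurrence can be sketched in Python as follows. This is an illustration under stated assumptions, not the slides' own code: the names `X`, `Y`, `D`, `SCORE_ROWS`, `s` and `fill_matrix` are made up here, and the scores are just the BLOSUM50 excerpt shown for this example, not the full matrix. `F[i][j]` uses i for sequence x and j for sequence y, as in the formulas.

```python
X = "HEAGAWGHEE"  # sequence x, indexed by i
Y = "PAWHEAE"     # sequence y, indexed by j
D = 8             # linear gap penalty d

# BLOSUM50 excerpt: one row per residue of Y, one column per position of X.
SCORE_ROWS = {
    "P": [-2, -1, -1, -2, -1, -4, -2, -2, -1, -1],
    "A": [-2, -1, 5, 0, 5, -3, 0, -2, -1, -1],
    "W": [-3, -3, -3, -3, -3, 15, -3, -3, -3, -3],
    "H": [10, 0, -2, -2, -2, -3, -2, 10, 0, 0],
    "E": [0, 6, -1, -3, -1, -3, -3, 0, 6, 6],
}


def s(a, b):
    """Score for aligning residue a (from X) with residue b (from Y)."""
    return SCORE_ROWS[b][X.index(a)]


def fill_matrix(x, y, d):
    """Fill the global alignment matrix F with a linear gap penalty d."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        F[i][0] = -i * d  # x[0:i] aligned to gaps only
    for j in range(1, len(y) + 1):
        F[0][j] = -j * d  # y[0:j] aligned to gaps only
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # x_i aligned to y_j
                F[i - 1][j] - d,                          # x_i aligned to a gap
                F[i][j - 1] - d,                          # y_j aligned to a gap
            )
    return F


if __name__ == "__main__":
    F = fill_matrix(X, Y, D)
    print(F[len(X)][len(Y)])  # score of the best global alignment
```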
![Page 25: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/25.jpg)
Creation of an alignment path matrix
Boundary conditions
$$F(i,0) = -i\,d \qquad F(0,j) = -j\,d$$
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8
A  -16
W  -24
H  -32
E  -40
A  -48
E  -56
![Page 26: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/26.jpg)
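Stepwise calculation of score values
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9
A  -16  -10   -3
W  -24
H  -32
E  -40
A  -48
E  -56
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
Scores used: s(H,P) = -2, s(E,P) = -1, s(H,A) = -2, s(E,A) = -1
$$F(1,1) = \max \begin{cases} F(0,0) + s(x_1, y_1) = 0 - 2 = -2 \\ F(0,1) - d = -8 - 8 = -16 \\ F(1,0) - d = -8 - 8 = -16 \end{cases} = -2$$
$$F(2,1) = \max \begin{cases} F(1,0) + s(x_2, y_1) = -8 - 1 = -9 \\ F(1,1) - d = -2 - 8 = -10 \\ F(2,0) - d = -16 - 8 = -24 \end{cases} = -9$$
$$F(1,2) = \max \begin{cases} F(0,1) + s(x_1, y_2) = -8 - 2 = -10 \\ F(0,2) - d = -16 - 8 = -24 \\ F(1,1) - d = -2 - 8 = -10 \end{cases} = -10$$
$$F(2,2) = \max \begin{cases} F(1,1) + s(x_2, y_2) = -2 - 1 = -3 \\ F(1,2) - d = -10 - 8 = -18 \\ F(2,1) - d = -9 - 8 = -17 \end{cases} = -3$$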
![Page 27: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/27.jpg)
Backtracking
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1
Optimal path, from F(0,0) to the final cell: 0, -8, -16, -17, -25, -20, -5, -13, -3, 3, -5, 1
Optimal global alignment:
HEAGAWGHE-E
--P-AW-HEAE
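Backtracking can be sketched by filling the matrix and then walking back from the last cell, at each step asking which of the three options produced the stored value. This is a minimal sketch under assumptions: the names are hypothetical, the scores are only the BLOSUM50 excerpt used in this example, and ties are broken in favour of the diagonal step, which is a choice made here rather than something the slides prescribe.

```python
X = "HEAGAWGHEE"
Y = "PAWHEAE"
D = 8  # linear gap penalty

# BLOSUM50 excerpt: one row per residue of Y, one column per position of X.
SCORE_ROWS = {
    "P": [-2, -1, -1, -2, -1, -4, -2, -2, -1, -1],
    "A": [-2, -1, 5, 0, 5, -3, 0, -2, -1, -1],
    "W": [-3, -3, -3, -3, -3, 15, -3, -3, -3, -3],
    "H": [10, 0, -2, -2, -2, -3, -2, 10, 0, 0],
    "E": [0, 6, -1, -3, -1, -3, -3, 0, 6, 6],
}


def s(a, b):
    """Score for aligning residue a (from X) with residue b (from Y)."""
    return SCORE_ROWS[b][X.index(a)]


def needleman_wunsch(x, y, d):
    """Return (score, aligned x, aligned y) for a global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Backtracking: start at the last cell and undo one step at a time.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, a, b = needleman_wunsch(X, Y, D)
    print(score)
    print(a)
    print(b)
```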
![Page 28: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/28.jpg)
Smith and Waterman (local alignment)
Two differences:
1. The recurrence gets a fourth option, 0:
$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
2. An alignment can now end anywhere in the matrix
Example:
Sequence 1: H E A G A W G H E E
Sequence 2: P A W H E A E
Scoring parameters: Log-odds ratios
Gap penalty: Linear gap penalty of 8
![Page 29: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/29.jpg)
Smith Waterman alignment
          H    E    A    G    A    W    G    H    E    E
     0    0    0    0    0    0    0    0    0    0    0
P    0    0    0    0    0    0    0    0    0    0    0
A    0    0    0    5    0    5    0    0    0    0    0
W    0    0    0    0    2    0   20   12    4    0    0
H    0   10    2    0    0    0   12   18   22   14    6
E    0    2   16    8    0    0    4   10   18   28   20
A    0    0    8   21   13    5    0    4   10   20   27
E    0    0    6   13   18   12    4    0    4   16   26
Optimal path: 0, 5, 20, 12, 22, 28
Optimal local alignment:
AWGHE
AW-HE
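The local variant can be sketched by adding the 0 option to the recurrence, remembering the highest cell, and backtracking from there until a 0 is reached. As before this is only an illustration: the names are hypothetical, the scores are the BLOSUM50 excerpt for this example, and ties in the backtracking favour the diagonal step.

```python
X = "HEAGAWGHEE"
Y = "PAWHEAE"
D = 8  # linear gap penalty

# BLOSUM50 excerpt: one row per residue of Y, one column per position of X.
SCORE_ROWS = {
    "P": [-2, -1, -1, -2, -1, -4, -2, -2, -1, -1],
    "A": [-2, -1, 5, 0, 5, -3, 0, -2, -1, -1],
    "W": [-3, -3, -3, -3, -3, 15, -3, -3, -3, -3],
    "H": [10, 0, -2, -2, -2, -3, -2, 10, 0, 0],
    "E": [0, 6, -1, -3, -1, -3, -3, 0, 6, 6],
}


def s(a, b):
    """Score for aligning residue a (from X) with residue b (from Y)."""
    return SCORE_ROWS[b][X.index(a)]


def smith_waterman(x, y, d):
    """Return (score, aligned x segment, aligned y segment)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,  # difference 1: start a new alignment here
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Difference 2: backtrack from the best cell, stop at the first 0.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman(X, Y, D))
```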
![Page 30: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/30.jpg)
Extended Smith & Waterman
To get multiple local alignments:
• delete regions around best path
• repeat backtracking
![Page 31: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/31.jpg)
Extended Smith & Waterman
Matrix after deleting the region around the best path (deleted cells are shown as "."):
          H    E    A    G    A    W    G    H    E    E
     0    0    0    0    0    0    0    0    0    0    0
P    0    0    0    0    .    0    0    0    0    0    0
A    0    0    0    5    0    .    0    0    0    0    0
W    0    0    0    0    2    0    .    .    .    0    0
H    0   10    2    0    0    0    .    .    .    .    .
E    0    2   16    8    0    0    .    .    .    .    .
A    0    0    8   21   13    5    0    .    .    .    .
E    0    0    6   13   18   12    4    0    .    .    .
![Page 32: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/32.jpg)
Extended Smith & Waterman
          H    E    A    G    A    W    G    H    E    E
     0    0    0    0    0    0    0    0    0    0    0
P    0    0    0    0    .    0    0    0    0    0    0
A    0    0    0    5    0    .    0    0    0    0    0
W    0    0    0    0    2    0    .    .    .    0    0
H    0   10    2    0    0    0    .    .    .    .    .
E    0    2   16    8    0    0    .    .    .    .    .
A    0    0    8   21   13    5    0    .    .    .    .
E    0    0    6   13   18   12    4    0    .    .    .
Path of the second best alignment: 0, 10, 16, 21
Second best local alignment:
HEA
HEA
![Page 33: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/33.jpg)
Further Extensions of Dynamic Programming
• Overlap matches
• Alignment with affine gap scores
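For comparison with the linear gap penalty used in the examples, where a gap of length g costs g times d, an affine gap score is commonly written with a gap-open penalty d and a smaller gap-extension penalty e. The symbols $\gamma$, g and e are notation assumed here for illustration; they do not appear on the slides.

```latex
\gamma_{\text{linear}}(g) = -g\,d
\qquad
\gamma_{\text{affine}}(g) = -d - (g-1)\,e , \quad e < d
```

With d = 8 and, as a hypothetical value, e = 2, a gap of length 3 costs 24 under the linear score but only 12 under the affine score, so one long gap is penalised less than several short ones.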
![Page 34: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/34.jpg)
Pairwise Sequence Alignment
• Pairwise sequence comparison
  • global / local alignments
  • parameters
  • scoring systems
  • insertions / deletions
• Methods of pairwise sequence alignment
  • dotplot
  • windows-based methods
  • dynamic programming
  • algorithm complexity
![Page 35: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/35.jpg)
End.of.pa.irwise..sequence | | | | | align.ment.cours.e
![Page 36: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/36.jpg)
Multiple Alignment
Progressive Alignment: step 1
Methods of Pairwise Comparison
Programs perform global alignments:
• Needleman & Wunsch: (Pileup, Tree, Clustal)
• Word Size Method: (Clustal)
• X. Huang (MAlign) (modified N-W)
![Page 37: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/37.jpg)
Multiple Alignment
Progressive Alignment: step 2
Construction of a Guide Tree
Similarity Matrix: displays scores of all sequence pairs.
[Table: 5 × 5 similarity matrix with sequences 1 to 5 on both axes]
The similarity matrix is transformed into a distance matrix . . . . .
![Page 38: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/38.jpg)
Multiple Alignment
Progressive Alignment: step 2
Construction of a Guide Tree
[Figure: the distance matrix is converted into a guide tree with leaves 1 to 5]
Neighbour-Joining Method or
UPGMA (unweighted pair group method using arithmetic averages)
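A minimal UPGMA sketch shows how a distance matrix can be turned into a guide tree. The function name `upgma`, the sequence labels and the distances in the usage example are all hypothetical; the slides do not give the actual matrix values. The tree is returned as nested tuples, and the programs named above may differ in detail.

```python
from itertools import combinations


def upgma(labels, dist):
    """Build a guide tree as nested tuples.

    labels: list of sequence names
    dist:   dict mapping a pair (a, b) of names to their distance
    """
    size = {name: 1 for name in labels}  # cluster -> number of sequences in it
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        # Distance to the new cluster is the size-weighted average.
        for c in size:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))


if __name__ == "__main__":
    names = ["s1", "s2", "s3", "s4", "s5"]
    # Hypothetical distances, chosen only to make the example readable.
    distances = {
        ("s1", "s2"): 2, ("s1", "s3"): 6, ("s1", "s4"): 10, ("s1", "s5"): 10,
        ("s2", "s3"): 6, ("s2", "s4"): 10, ("s2", "s5"): 10,
        ("s3", "s4"): 10, ("s3", "s5"): 10, ("s4", "s5"): 4,
    }
    print(upgma(names, distances))
```

In progressive alignment the innermost pairs of the tree are aligned first, and the resulting groups are then merged in the order in which the tree joins them.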
![Page 39: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/39.jpg)
Multiple Alignment
Progressive Alignment: step 3
[Figure: guide tree with leaves 1 to 5; the first two alignment steps along the tree are marked 1 and 2]
![Page 40: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/40.jpg)
Multiple Alignment
Progressive Alignment: step 3
Columns - once aligned - are never changed
Pairwise alignment:
G T C C G - C A G G
T T - C G C C - G G
Aligned to the next sequence:
T T A C T T C C A G G
G T C C G - - C A G G
T T - C G C - C - G G
![Page 41: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/41.jpg)
Multiple Alignment
Progressive Alignment: step 3
Columns - once aligned - are never changed
Pairwise alignment:
G T C C G - C A G G
T T - C G C C - G G
Aligned to the next sequence:
T T A C T T C C A G G
G T C C G - - C A G G
T T - C G C - C - G G
. . . . and new gaps are inserted.
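The rule that aligned columns are never changed can be sketched as a single operation: when a group of already aligned sequences needs a gap, the same gap column is inserted into every row of the group. The function name `insert_gap_column` is hypothetical, and the position of the new gap is taken from the slide's example rather than computed.

```python
def insert_gap_column(alignment, position):
    """Insert one gap column into every row of an existing alignment."""
    return [row[:position] + "-" + row[position:] for row in alignment]


if __name__ == "__main__":
    pair = ["GTCCG-CAGG", "TT-CGCC-GG"]
    # A new gap column at index 6 lets the pair line up with TTACTTCCAGG.
    for row in ["TTACTTCCAGG"] + insert_gap_column(pair, 6):
        print(row)
```

Because the column is added to all rows at once, the residues that were paired in the earlier pairwise alignment stay paired.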
![Page 42: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/42.jpg)
Multiple Alignment
Progressive Alignment: step 3
Columns - once aligned - are never changed
Pairwise alignment:
A T C T - - C A A T
C T G T C C C T A G
Aligned to the existing alignment:
T T A C T T C C A G G
G T C C G - - C A G G
T T - C G C - C - G G
A T C - T - - C A A T
C T G - T C C C T A G
![Page 43: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/43.jpg)
Sub-sequence alignments
![Page 44: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/44.jpg)
A K-means like clustering problem
![Page 45: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/45.jpg)
Clustering resulting model
![Page 46: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/46.jpg)
Clustering predictions
![Page 47: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/47.jpg)
Assignments
• Describe a pairwise alignment with a different gap penalty scheme.
• Provide an example and perform a multiple global alignment. Describe the recipe.
• Provide an example and perform a multiple alignment of subsequences. Describe the recipe.
• Order of the algorithms (polynomial, exponential, NP)
![Page 48: We do not have to understand the languaje to identify patterns: “ klaatu barada nikto”](https://reader035.vdocuments.mx/reader035/viewer/2022062409/56815181550346895dbfb991/html5/thumbnails/48.jpg)
Algorithmic Complexity
How does an algorithm's performance in CPU time and required memory storage scale with the size of the problem?
Needleman & Wunsch
• Storing (n+1) × (m+1) numbers
• Each number costs a constant number of calculations to compute (three sums and a max)
• Algorithm takes O(nm) memory and O(nm) time
• Since n and m are usually comparable: $O(n^2)$
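The counts behind these bullets can be written out directly. This is a small sketch with a hypothetical function name, `nw_cost`; it counts stored numbers and the operations in the recurrence, not real CPU time.

```python
def nw_cost(n, m):
    """Count the work of Needleman & Wunsch for sequence lengths n and m."""
    return {
        "stored_numbers": (n + 1) * (m + 1),  # full matrix incl. boundary row/column
        "additions": 3 * n * m,               # three sums per inner cell
        "max_operations": n * m,              # one max per inner cell
    }


if __name__ == "__main__":
    # Lengths of the two example sequences, HEAGAWGHEE and PAWHEAE.
    print(nw_cost(10, 7))
```

Doubling both lengths roughly quadruples every count, which is what quadratic scaling means in practice.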