Pair-wise Sequence Alignment
•What happened to the sequences of similar genes?
– random mutation
– deletion, insertion
Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG
            EVI   ++ P++ ++DV+SY
Seq. 2: 451 EVI---EHKPYNHKADVFSYA
•Homology vs. similarity
•What is pair-wise sequence alignment?
•Why pair-wise alignment?
Some concepts
•Optimal alignment
•Global alignment
•Gaps
•Local alignment
•Gap penalty
•Substitution matrix
Dotplot
•What dotplot shows
•What dotplot does not show
•A simplified representation
Sequence Alignment
•Dynamic programming
– a method for some optimization problems
– determine a scoring scheme
– best solution based on a scoring scheme
•Total number of possible alignments for length n ~ $2^{2n} / \sqrt{\pi n}$
•Needleman-Wunsch - global
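
To get a feel for why enumerating every alignment is hopeless, the sketch below compares the central binomial coefficient C(2n, n), one common way of counting the alignments of two length-n sequences, with its Stirling approximation $2^{2n} / \sqrt{\pi n}$. The function names are illustrative and the choice of counting formula is an assumption, not something stated on the slide.

```python
import math

def exact_alignment_count(n):
    """Central binomial coefficient C(2n, n)."""
    return math.comb(2 * n, n)

def approx_alignment_count(n):
    """Stirling approximation 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)

for n in (5, 10, 20, 50):
    print(n, exact_alignment_count(n), f"{approx_alignment_count(n):.3e}")
```

Even at n = 50 the count is far beyond what can be enumerated, which is the motivation for dynamic programming.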
Questions
•How does it work?
•How to come up with a DP approach to an exponential problem?
•How to implement a DP approach?
Dynamic Programming Algorithm
•Break a problem into subproblems
•Solve each subproblem separately

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i,j-1) + g \\ F(i-1,j) + g \end{cases}$$

$s(x_i, y_j)$ : substitution score for aligning $x_i$ with $y_j$
$g$ : gap penalty
$F(i,j)$ : the max score for aligning the 1st $i$ symbols of sequence 1 with the 1st $j$ symbols of sequence 2
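
The recurrence can be turned into a table-filling routine almost line by line. The following is a minimal sketch, not a reference implementation: the function name `nw_table`, the simple match/mismatch scoring in place of a substitution matrix, and the sample inputs are assumptions made for illustration.

```python
def nw_table(x, y, match=1, mismatch=0, gap=-1):
    """Fill the global-alignment score table F for sequences x and y."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x_i with y_j
                          F[i][j - 1] + gap,     # gap in x
                          F[i - 1][j] + gap)     # gap in y
    return F

F = nw_table("ACTCG", "ACAGTAG")
for row in F:
    print(" ".join(f"{v:3d}" for v in row))
```

The bottom-right cell `F[m][n]` holds the optimal global score; filling the table takes time proportional to the product of the two sequence lengths.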
Example
•Initialization
•Matrix filling (scoring)
•Trace back
Sequences: ACTCG, ACAGTAG
Match: 1
Mismatch: 0
Gap: -1
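| | j=0 | j=1 (A) | j=2 (C) | j=3 (A) | j=4 (G) | j=5 (T) | j=6 (A) | j=7 (G) |
|---|---|---|---|---|---|---|---|---|
| i=0 | 0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 |
| i=1 (A) | -1 | 1 | 0 | -1 | -2 | -3 | -4 | -5 |
| i=2 (C) | -2 | 0 | 2 | 1 | 0 | -1 | -2 | -3 |
| i=3 (T) | -3 | -1 | 1 | 2 | 1 | 1 | 0 | -1 |
| i=4 (C) | -4 | -2 | 0 | 1 | 2 | 1 | 1 | 0 |
| i=5 (G) | -5 | -3 | -1 | 0 | 2 | 2 | 1 | 2 |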
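
Once the table is filled, the trace back walks from the bottom-right cell to (0,0), at each step choosing a neighbour that explains the current score. The sketch below is one possible way to do this for the example (match 1, mismatch 0, gap -1); the function name and the tie-breaking order (diagonal first, then up, then left) are assumptions, and a different order may return a different but equally optimal alignment.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    def s(a, b):
        return match if a == b else mismatch

    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i][j - 1] + gap,
                          F[i - 1][j] + gap)

    # Trace back from (m, n) to (0, 0).
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))

score, ax, ay = needleman_wunsch("ACTCG", "ACAGTAG")
print(score)
print(ax)
print(ay)
```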
Local Alignment: Smith-Waterman
•Biological significance

$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i, y_j) \\ F(i,j-1) + g \\ F(i-1,j) + g \end{cases}$$

•$O(n^2)$ time
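Local Alignment

| | | A | A | C | C | T | A | T | A | G | C | T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| C | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 1 |
| G | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| A | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| T | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 1 |
| A | 0 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 3 | 2 | 1 | 0 |
| T | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 3 | 2 | 2 | 1 | 2 |
| A | 0 | 1 | 1 | 0 | 0 | 0 | 2 | 2 | 4 | 3 | 2 | 1 |

AACCTATAGCT
    ||||
GCGATATA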
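
A compact sketch of Smith-Waterman for this example follows. The slide does not state the scoring scheme, so match +1, mismatch -1 and gap -1 are assumptions here; with them the best score is 4 and the trace back recovers the TATA/TATA region shown above. The function name is illustrative.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-1):
    """Return (best_score, aligned_x, aligned_y) for one best local alignment."""
    def s(a, b):
        return match if a == b else mismatch

    m, n = len(x), len(y)
    H = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          H[i][j - 1] + gap,
                          H[i - 1][j] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the largest cell until a zero is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

print(smith_waterman("GCGATATA", "AACCTATAGCT"))
```

The only differences from the global version are the extra 0 in the maximum, the zero initialization, and where the trace back starts and stops.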
Issues in alignment
•Different ways to fill the table
•Multiple optimal alignments
•$s(x_i, y_j)$ – from substitution matrix
•Gap penalty:
linear: $w(k) = gk$
affine: $w(k) = \begin{cases} h + gk, & k \ge 1 \\ 0, & k = 0 \end{cases}$
Gap models
•New gap vs. gap extension
•A gap of length k vs. k gaps of length 1
•1 insertion / deletion event vs. k events
•Gap penalty:
linear: $w(k) = gk$
affine: $w(k) = \begin{cases} h + gk, & k \ge 1 \\ 0, & k = 0 \end{cases}$
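
The difference between the two models is easiest to see numerically. In the sketch below the values h = -3 and g = -1 are hypothetical, chosen only to illustrate that under the affine model one gap of length k is penalized less than k separate gaps of length 1, while the linear model cannot tell them apart.

```python
def linear_gap(k, g=-1):
    """Linear penalty w(k) = g * k."""
    return g * k

def affine_gap(k, h=-3, g=-1):
    """Affine penalty w(k) = h + g * k for k >= 1, and 0 for k = 0."""
    return 0 if k == 0 else h + g * k

k = 4
print("linear: one gap of length 4 =", linear_gap(k),
      "| four gaps of length 1 =", k * linear_gap(1))
print("affine: one gap of length 4 =", affine_gap(k),
      "| four gaps of length 1 =", k * affine_gap(1))
```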
Affine Gap Penalty
•Aligning 1st i symbols of x with 1st j symbols of y
•What is wrong with the F(i,j) formula if an affine gap penalty (AGP) is used?
•Three matrices
$M(i,j)$ : best score when $x_i$ is aligned with $y_j$
$I_x(i,j)$ : best score when $x_i$ is aligned with a gap
$I_y(i,j)$ : best score when $y_j$ is aligned with a gap
DP for global alignment for AGP
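$$M(i,j) = \max \begin{cases} M(i-1,j-1) + s(x_i, y_j) \\ I_x(i-1,j-1) + s(x_i, y_j) \\ I_y(i-1,j-1) + s(x_i, y_j) \end{cases}$$

$$I_x(i,j) = \max \begin{cases} M(i-1,j) + h + g \\ I_y(i-1,j) + h + g \\ I_x(i-1,j) + g \end{cases}$$

$$I_y(i,j) = \max \begin{cases} M(i,j-1) + h + g \\ I_x(i,j-1) + h + g \\ I_y(i,j-1) + g \end{cases}$$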
DP for global alignment using AGP
•Initialization
$M(0,0) = 0$
$I_x(i,0) = h + gi$
$I_y(0,j) = h + gj$
all other cases: $-\infty$
•Start at the largest element in the three matrices: $M(m,n)$, $I_x(m,n)$, $I_y(m,n)$
•Traceback to (0,0)
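
The three-matrix scheme can be sketched as follows. This computes only the optimal score, not the trace back; the function name, the match/mismatch scoring, and the values h = -2, g = -1 (both negative, so that opening a gap costs h + g and extending it costs g) are assumptions for illustration.

```python
NEG_INF = float("-inf")

def affine_global_score(x, y, match=1, mismatch=-1, h=-2, g=-1):
    """Optimal global alignment score with affine gap penalty w(k) = h + g*k."""
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        Ix[i][0] = h + g * i          # x_1..x_i aligned against one long gap
    for j in range(1, n + 1):
        Iy[0][j] = h + g * j          # y_1..y_j aligned against one long gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            Ix[i][j] = max(M[i - 1][j] + h + g,    # open a new gap
                           Iy[i - 1][j] + h + g,   # open after a gap in the other sequence
                           Ix[i - 1][j] + g)       # extend the current gap
            Iy[i][j] = max(M[i][j - 1] + h + g,
                           Ix[i][j - 1] + h + g,
                           Iy[i][j - 1] + g)
    return max(M[m][n], Ix[m][n], Iy[m][n])

print(affine_global_score("ACGT", "AT"))         # one gap of length 2
print(affine_global_score("ACGT", "AT", h=0))    # h = 0 reduces to the linear model
```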
DP for local alignment for AGP
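$$M(i,j) = \max \begin{cases} M(i-1,j-1) + s(x_i, y_j) \\ I_x(i-1,j-1) + s(x_i, y_j) \\ I_y(i-1,j-1) + s(x_i, y_j) \\ 0 \end{cases}$$

$$I_x(i,j) = \max \begin{cases} M(i-1,j) + h + g \\ I_y(i-1,j) + h + g \quad \text{(ignored)} \\ I_x(i-1,j) + g \end{cases}$$

$$I_y(i,j) = \max \begin{cases} M(i,j-1) + h + g \\ I_x(i,j-1) + h + g \quad \text{(ignored)} \\ I_y(i,j-1) + g \end{cases}$$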
DP for Local Alignment for AGP
•Initialization
$M(0,0) = 0$
$I_x(i,0) = 0$
$I_y(0,j) = 0$
all other cases: $-\infty$
•Start at the largest $M(i,j)$, $I_x(i,j)$, $I_y(i,j)$
•Traceback till $M(i,j) = 0$
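
A score-only sketch of the local variant is given below. It follows the initialization above and drops the cross terms marked as ignored; the function name, the match/mismatch scoring, and the parameter values (match 2, mismatch -1, h = -2, g = -1) are assumptions chosen for illustration.

```python
NEG_INF = float("-inf")

def affine_local_score(x, y, match=2, mismatch=-1, h=-2, g=-1):
    """Best local alignment score with affine gap penalty w(k) = h + g*k."""
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        Ix[i][0] = 0
    for j in range(1, n + 1):
        Iy[0][j] = 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # The 0 lets a new local alignment start at any cell.
            M[i][j] = max(0, max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s)
            Ix[i][j] = max(M[i - 1][j] + h + g, Ix[i - 1][j] + g)
            Iy[i][j] = max(M[i][j - 1] + h + g, Iy[i][j - 1] + g)
            best = max(best, M[i][j], Ix[i][j], Iy[i][j])
    return best

print(affine_local_score("GGACGTCC", "TTACGTAA"))    # shared ACGT, no gap needed
print(affine_local_score("AAAATTTT", "AAAACCTTTT"))  # best alignment spans a gap of length 2
```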
Database searching methods
•Need more efficient methods
•Dynamic programming - $O(n^2 L)$, L: size of database
•Why is DP slow?
•Idea: regions that are similar are likely to share short identical subsequences
•Quick search for the regions, then check carefully locally
FASTA related methods
•Word, word size (2,6), sensitivity vs. speed
•Which words in the query also occur in the target?
•Pre-computed table that stores locations of words – “hashing”
•Heuristic approximation
1. Quick initial “guess” – common subsequences
•An example
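
The word-lookup step can be sketched with an ordinary dictionary playing the role of the pre-computed table. This is only an illustration of the idea, not FASTA's actual data structure; the function names and the word size k = 2 are assumptions.

```python
from collections import defaultdict

def build_word_index(seq, k):
    """Map every length-k word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index

def shared_words(query, target, k=2):
    """Return (query_pos, target_pos) pairs where the two sequences share a word."""
    index = build_word_index(target, k)
    hits = []
    for qpos in range(len(query) - k + 1):
        for tpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, tpos))
    return hits

print(dict(build_word_index("ACAC", 2)))
print(shared_words("CAC", "ACAC", k=2))
```

A larger k gives fewer chance hits and a faster search, at the price of missing weaker similarities, which is the sensitivity vs. speed trade-off.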
FASTA related methods
2. Find the region with high population of common words
•Process diagonals, rescore, join regions, using gaps
3. Local alignment (DP) in the region identified
•Use Smith-Waterman method in a band, 32 aa wide around the best score
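
Step 2 relies on the fact that word hits from the same ungapped region fall on the same diagonal (target position minus query position). The sketch below only counts hits per diagonal and reports the densest one; the rescoring and joining of regions are omitted, and the function name and k = 2 are assumptions.

```python
from collections import Counter

def best_diagonal(query, target, k=2):
    """Return (diagonal, hit_count) for the diagonal with most shared k-words, or None."""
    positions = {}
    for t in range(len(target) - k + 1):
        positions.setdefault(target[t:t + k], []).append(t)
    diagonals = Counter()
    for q in range(len(query) - k + 1):
        for t in positions.get(query[q:q + k], ()):
            diagonals[t - q] += 1          # same offset means same diagonal
    return diagonals.most_common(1)[0] if diagonals else None

print(best_diagonal("GATTACA", "CCGATTACACC"))
```

A banded Smith-Waterman would then be restricted to cells near the reported diagonal instead of the full table.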
Limitation of FASTA
•Speed vs. sensitivity
•Can miss biologically significant similarity
– some proteins do not share identical a.a.
– different codons can encode the same protein
•Initial step: identical words
BLAST
•Previous 2 kinds of approaches
– theoretically sound
– search for common subsequences
1. Word list
•Incorporate similarity measurement for words
– PAM120, e.g. ACDE
•Scan for word occurrences
– hash table
– finite state machine
(Stephen F. Altschul et al, Nucleic Acids Research 1997, 25(17) 3389-3402)
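
The word-list step can be illustrated by enumerating every word that scores at least a threshold T against a query word. The sketch below is a toy: the reduced alphabet, the scores (identical 2, similar 1, otherwise -1) and the similar pairs are hypothetical stand-ins and are not PAM120 values; the brute-force enumeration is also only practical for short words.

```python
from itertools import product

# Hypothetical toy scoring, NOT PAM120: a reduced alphabet and a few "similar" pairs.
ALPHABET = "ACDEILV"
SIMILAR_PAIRS = {frozenset("DE"), frozenset("IL"), frozenset("IV"), frozenset("LV")}

def score(a, b):
    """Toy substitution score for residues a and b."""
    if a == b:
        return 2
    return 1 if frozenset((a, b)) in SIMILAR_PAIRS else -1

def neighborhood(word, threshold, alphabet=ALPHABET):
    """All words of the same length scoring >= threshold against word."""
    result = []
    for cand in product(alphabet, repeat=len(word)):
        if sum(score(a, b) for a, b in zip(word, cand)) >= threshold:
            result.append("".join(cand))
    return result

print(neighborhood("ACDE", threshold=7))
```

Unlike the identical-word lookup in FASTA, similar but non-identical words can seed a hit this way.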
BLAST
2. Extend words to HSP (locally optimal pairs)
•Find additional words within threshold
•Merge within distance A
3. Select significant HSPs, use DP in banded region
Mini Presentations
1. Previous BLAST
2. Major concepts in BLAST
3. Statistical issue
4. Gapped local alignment – Gapped
5. Position-specific scoring matrix (PSSM) – overall idea, architecture, multiple-alignment construction
6. PSSM – target frequency estimation, application to BLAST
(Stephen F. Altschul et al, Nucleic Acids Research 1997, 25(17) 3389-3402)
Multiple Sequence Alignment•Motivation
•What is MSA?
•How do we extend knowledge of pair-wise alignment?
•An example: AGAC, AC, AG
Pair-wise alignments:
AGAC    AGAC    AC
--AC    AG--    AG
Some possibilities:
AG--
--AC
AGAC
•Fix pair-wise alignment and then add?
•Evaluate all the possible alignments of N sequences?
Scoring MSA
•Sum of pairs (SP) scoring methods
Given an alignment of N sequences, each of which has length L, in the L x N alignment:
pair-wise sum for each column, then sum all columns
•Example: c(match) = 1, c(mismatch) = -1, c(gap) = -2, c(gap, gap) = 0
AQPILLLV
ALR-LL--
AK-ILLL-
CPPVLILV
SP4 = SP(I,-,I,V) = -2 + 1 - 1 - 2 - 2 - 1 = -7
SP = SP1 + SP2 + … + SP8
•SP tends to overweight a single mutation
SP(A,A,A,C) = 0, SP(A,A,A,A) = 6
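
The SP score is short to write down in code. The sketch below uses the costs from the example (match 1, mismatch -1, gap -2, gap-gap 0); the function names are illustrative.

```python
from itertools import combinations

def pair_cost(a, b, match=1, mismatch=-1, gap=-2):
    """Cost of one pair of symbols in a column; two gaps cost 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_column(column):
    """Sum of pair costs over all pairs of symbols in one column."""
    return sum(pair_cost(a, b) for a, b in combinations(column, 2))

def sp_score(alignment):
    """Sum of column scores over an alignment given as equal-length rows."""
    return sum(sp_column(col) for col in zip(*alignment))

alignment = ["AQPILLLV", "ALR-LL--", "AK-ILLL-", "CPPVLILV"]
print(sp_column("I-IV"), sp_column("AAAC"), sp_column("AAAA"))
print(sp_score(alignment))
```

Each column costs N(N-1)/2 pair evaluations, which is where the N squared factor in the N-dimensional DP running time comes from.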
Extension of DP for N sequences
•Extend F(i,j) for N dimensions
•DP of N dimensions using SP
Time: in the order of $L^N (2^N - 1) N^2 \sim O((2L)^N N^2)$
STAR method
•DP provides the optimal solution but is costly
•Heuristic methods – STAR, CLUSTALW, …
•Progressive alignment
•STAR
– pair-wise alignment
– build similarity matrix
– find a “star” sequence
– use the “star” to align the other sequences
– once a gap, always a gap
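
Choosing the star can be sketched as picking the sequence whose summed pair-wise score against all others is largest. This covers only the selection step, not the merging of alignments; the selection criterion, the function names, the scoring values (match 1, mismatch -1, gap -1) and the sample sequences are assumptions for illustration.

```python
def nw_score(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment score using two rows of the DP table."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i, a in enumerate(x, 1):
        cur = [i * gap]
        for j, b in enumerate(y, 1):
            s = match if a == b else mismatch
            cur.append(max(prev[j - 1] + s, cur[j - 1] + gap, prev[j] + gap))
        prev = cur
    return prev[-1]

def star_center(seqs):
    """Index of the sequence with the largest total pair-wise score."""
    n = len(seqs)
    totals = [sum(nw_score(seqs[i], seqs[j]) for j in range(n) if j != i)
              for i in range(n)]
    return max(range(n), key=totals.__getitem__)

seqs = ["ACGT", "ACGA", "TCGT"]
print(star_center(seqs), seqs[star_center(seqs)])
```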
STAR method
•Example
CLUSTAL family
•What are the disadvantages of the STAR method?
•Build similarity tree – “clustering”
•Alignment starts at most similar sequences
1. Pair-wise alignment --> distance matrix
•Fast approximate approach or DP
CLUSTALW
2. Construct similarity tree, “the guide tree”
•UPGMA (un-weighted pair-group method using arithmetic average)
3. Progressive alignment
•Start with most similar sequences
•Align group with group using pair-wise alignment
•e.g.
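
A guide tree can be built from a distance matrix with UPGMA in a few lines. The sketch below is a plain, unoptimized version that returns the tree as nested tuples without branch lengths; the function name, the tree representation and the small distance matrix are assumptions for illustration and do not reflect CLUSTALW's internals.

```python
def upgma(names, matrix):
    """Build a UPGMA tree (nested tuples) from a symmetric distance matrix."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (subtree, leaf count)
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), _ = min(dist.items(), key=lambda kv: kv[1])
        (ta, na), (tb, nb) = clusters.pop(a), clusters.pop(b)
        new_dist = {}
        for c in clusters:
            dac = dist[(min(a, c), max(a, c))]
            dbc = dist[(min(b, c), max(b, c))]
            # Size-weighted mean equals the average over all leaf pairs.
            new_dist[(c, next_id)] = (na * dac + nb * dbc) / (na + nb)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
        clusters[next_id] = ((ta, tb), na + nb)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree

names = ["A", "B", "C", "D"]
matrix = [[0, 2, 6, 10],
          [2, 0, 6, 10],
          [6, 6, 0, 10],
          [10, 10, 10, 0]]
print(upgma(names, matrix))
```

Progressive alignment would then follow the tree from the leaves upward, aligning the most similar sequences first and then group with group.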