intro to alignment algorithms: global and...
TRANSCRIPT
Intro to Alignment Algorithms:
Global and Local
Algorithmic Functions of Computational Biology
Professor Istrail
Sequence ComparisonBiomolecular sequences □ DNA sequences (string over 4 letter alphabet {A, C,
G, T}) □ RNA sequences (string over 4 letter alphabet
{ACGU}) □ Protein sequences (string over 20 letter alphabet
{Amino Acids})
Sequence similarity helps in the discovery of genes, and the prediction of structure and function of proteins.
Algorithmic Functions of Computational Biology
Professor Istrail
The Basic Similarity Analysis Algorithm
Global Similarity
• Scoring Schemes
• Edit Graphs
• Alignment = Path in the Edit Graph
• The Principle of Optimality
• The Dynamic Programming Algorithm
• The Traceback
Algorithmic Functions of Computational Biology –
Professor Istrail
The Sequence Alignment ProblemInput. : two sequences over the same alphabet and a scoring scheme Output: an alignment of the two sequences of maximum score
Example: □ GCGCATTTGAGCGA □ TGCGTTAGGGTGACCA A possible alignment: - GCGCATTTGAGCGA - - TGCG - - TTAGGGTGACC
match mismatch
indel
Algorithmic Functions of Computational Biology –
Professor Istrail
CSCI2820 - Class 4
Mismatch, Deletion, Insertion
5
mismatch
deletion (in
template)insertion
(in template)
TCAGGGGGCTATTAGTCCTCCGATAA
TCAGGGGGCTATTAGTCC-CCGATAATCAGGGGG-CTATTAGTCCCCCCGATAA
m
n
yyyYxxxX
...21
...21
=
=Consider two sequences
Over the alphabet
},,, TGCA{=Σ
ji yx , belong to Σ
Algorithmic Functions of Computational Biology –
Professor Istrail
Scoring Schemes
δUnit-score A C G T
ACGT
11
11
-
-
0000
000
0 00
00
000
0
00000
Algorithmic Functions of Computational Biology –
Professor Istrail
Alignment
ACG | | | AGG
δScore = (A,A) (C,G) (G,G)+ +
= 1 + 0 + 1 = 2
Unit-cost
A | A
A is aligned with A
C | G
C is aligned with G
G is aligned with GG | G
Algorithmic Functions of Computational Biology –
Professor Istrail
δ δ
Gaps
ACATGGAAT ACAGGAAAT
ACAT GG - AAT ACA - GG AAAT
OPTIMAL ALIGNMENTS
SCORE 7 8
AAAGGG GGGAAA
SCORE 0 3
- - - AAAGGG GGGAAA - - -
“-” is the gap symbol
Algorithmic Functions of Computational Biology –
Professor Istrail
δ
δδ (x,y) = the score for aligning x with y
(-,y) = the score for aligning - with y
(x,-) = the score for aligning x with -
Algorithmic Functions of Computational Biology –
Professor Istrail
A-CG - G ATCGTG
Alignment
Score
δ δδ δ δδ(A,A) + (G,G) +(C,C) +(-,T) + (-,T ) + (G,G)
THE SUM OF THE SCORES OF THE PAIRWISE ALIGNED SYMBOLS
Algorithmic Functions of Computational Biology –
Professor Istrail
Margaret Dayhoff & PAM Similarity Matrices
ARTEMIS Summer 2008
Professor Istrail
Dr. Margaret Oakley Dayhoff The Mother & Father of Bioinformatics
ARTEMIS Summer 2008
Professor Istrail
Scoring SchemeDayhoff PAM scoring matrices
...δ
PTIPLSRLFDNAMLRAHRLHQ SAIENQRLFNIAVSRVQHLHL
Partial alignment for Monkey and Trout somatotropin proteins
- A R N D C Q E G H I L K M F P S T W Y V
-8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 3 -3 0 0 -3 -1 0 1 -3 -1 -3 -2 -2 -4 1 1 1 -7 -4 0A
R N D
64
Algorithmic Functions of Computational Biology –
Professor Istrail
Scoring Functions
Scoring function = a sum of a terms each for a pair of aligned residues, and for each gap
The meaning = log of the relative likelihood that the sequences are related, compared to being unrelated
Identities and conservative substitutions are Positive terms
Non-conservative substitutions are Negative terms
Mutations= Substitutions, Insertions, Deletions
Algorithmic Functions of Computational Biology –
Professor Istrail
CSCI2820 - Class 4
Global alignment problem
• Input: Sequences X and Y of length m and n respectively and a similarity matrix
• Output: An optimal global alignment of X and Y – Global alignments require all bases in both
sequences are aligned
16
CSCI2820 - Class 4
Local alignment problem
• Input: Sequences X and Y of length m and n respectively and a similarity matrix
• Output: An optimal local alignment of X and Y – Local alignments do not require using all
bases in either sequence in the alignment
• Applicable when looking for subsequences of similarity
17
The Edit Graph
Suppose that we want to align AGT with AT
We are going to construct a graph where alignments between the two sequences correspond to paths between the begin and and end nodes of the graph.
This is the Edit Graph
Algorithmic Functions of Computational Biology –
Professor Istrail
0 1 2 30
1
2
AGT has length 3 AT has length 2
The Edit graph has (3+1)*(2+1) nodes
The sequence AGT
The sequence AT
Algorithmic Functions of Computational Biology –
Professor Istrail
0 1 2 30
1
2
A G T
A
T
AGT indexes the columns, and AT indexes the rows of this “table”
Begin
End
Algorithmic Functions of Computational Biology –
Professor Istrail
0 1 2 30
1
2
A G T
A
T
Begin
EndThe Graph is directed. The nodes (i,j) will hold values.
Algorithmic Functions of Computational Biology –
Professor Istrail
0 10
1
2
A G
A
T
T0 1 2 30
1
2
A G
A
T
Begin
End
Algorithmic Functions of Computational Biology
Professor Istrail
T
A
T
0 1 2 30
1
2
A GA -
- A
A A
- A
- A-
A
- A
G -
A -
A -
G -
G -
T -
T -
T -
- T
- T
- T
- T
A T
G T
T T
G A
T A
Directed edges get as labels pairs of aligned letters.
Begin
End
Algorithmic Functions of Computational Biology –
Professor Istrail
Alignment = Path in the Edit GraphT
A
T
0 1 2 30
1
2
A GA -
- A
A A
- A
- A-
A
- A
G -
A -
A -
G -
G -
T -
T -
T -
- T
- T
- T
- T
A T
G T
T T
G A
T A
AGT A-T
Begin
End
Every path from Begin to End corresponds to an alignment
Every alignment corresponds to a path between Begin and End
Algorithmic Functions of Computational Biology –
Professor Istrail
The Principle of Optimality
The optimal answer to a problem is expressed in terms of optimal answer for its sub-problems
Algorithmic Functions of Computational Biology –
Professor Istrail
Dynamic Programming
Part 1: Compute first the optimal alignment score
Part 2: Construct optimal alignment
We are looking for the optimal alignment = maximal score path in the Edit Graph from the Begin vertex to the End vertex
Given: Two sequences X and Y Find: An optimal alignment of X with Y
Algorithmic Functions of Computational Biology –
Professor Istrail
The DP Matrix S(i,j)0 1 2 3
0
1
2
A G T
A
TS(2,1)
S(1,0)
Algorithmic Functions of Computational Biology –
Professor Istrail
The DP MatrixMatrix S =[S(i,j)]
S(i,j) = The score of the maximal cost path from the Begin Vertex and the vertex (i,j)
(i,j)
(i,j-1)
(i-1,j)
(i-1,j-1) The optimal path to (i,j) must pass through one of
the vertices
(i-1,j-1)
(i-1,j)
(i,j-1)
Opt
imal
Pat
h to
(i,j
)Algorithmic Functions of Computational Biology –
Professor Istrail
Opt path
(i,j)
(i,j-1)
(i-1,j)
(i-1,j-1)
Optimal path to (i-1,j) + (- , yj)
- xi
yj -
S(i-1,j) + δ
δ
(- , yj)
Algorithmic Functions of Computational Biology –
Professor Istrail
Optimal path
(i,j)
(i-1,j)
(i,j-1)
(i-1,j-1)
Optimal path to (i-1,j-1) + (xi,yj)δ
δS(i-1,j-1) + (xi , yj)
Algorithmic Functions of Computational Biology –
Professor Istrail
Optimal path
(i,j)
(i,j-1)
(I-1,j)
(i-1,j-1)
Optimal path to (i,j-1) + (xi,-)δ
δS(i,j-1) + (xi, -)
Algorithmic Functions of Computational Biology –
Professor Istrail
The Basic ALGORITHM
S(i,j) =
S(i-1, j-1) + (xi, yj)
S(i-1, j) + (xi, -)
S(i, j-1) + (-, yj)
MAX
δ
δ
δ
Algorithmic Functions of Computational Biology –
Professor Istrail
T
A
T
0 1 2 30
1
2
A GA -
- A
A A
- A
- A-
A
- A
G -
A -
A -
G -
G -
T -
T -
T -
- T
- T
- T
- T
A T
G T
T T
G A
T A
0
0
0
0
1
1
0
1
1
0
1
2AGT A - TOptimal Alignment
Optimal Alignment and TracbackAlgorithmic Functions of Computational Biology –
Professor Istrail
S(i,j) =
S(i-1, j-1) + (xi, yj),
S(i-1, j) + (xi, -),
S(i, j-1) + (-, yj)
MAX
δ
δ
δ
0, We add this
The Basic ALGORITHM: Local SimilarityAlgorithmic Functions of Computational Biology –
Professor Istrail
CSCI2820 - Class 4
Protein global alignment
35
X = hlsek Y = nlsak
• X and Y represent a protein subsequence from the BRCA2 (early onset) protein in human and chimpanzee
• Global alignments are used when the two sequences being compared represent a similar biological sequence
CSCI2820 - Class 4
Margaret Dayhoff’s PAM 100 similarity matrix (partial)
36
A N E H L K S *
A 4 -1 0 -3 -3 -3 1 -9
N -1 5 1 2 -4 1 1 -9
E 0 1 5 -1 -5 -1 -1 -9
H -3 2 -1 7 -3 -2 -2 -9
L -3 -4 -5 -3 6 -4 -4 -9
K -3 1 -1 -2 -4 5 -1 -9
S 1 1 -1 -2 -4 -1 4 -9
* -9 -9 -9 -9 -9 -9 -9 1
CSCI2820 - Class 4 37
h l s e k
n
l
s
a
k
XY
-9 2 -7 -16 -25 -34
-9 -18 -27 -36 -45
-27 -16 -1 12 3 -6
-36 -25 -10 3 12 3
-18 -7 8 -1 -10 -19
0
-45 -34 -19 -6 3 17