lecture 1, 31/10/2001 - weizmann institute of science · • the needleman-wunsch algorithm for...
1
Lecture 1, 31/10/2001:
• Introduction to sequence alignment
• The Needleman-Wunsch algorithm for global sequence alignment: description and properties
2
Computational sequence-analysis
The major goal of computational sequence analysis is to predict the function and structure of genes and proteins from their sequence.
This is made possible since organisms evolve by mutation, duplication and selection of their genes.
Thus, sequence similarity often indicates functional and structural similarity.
3
Sequence alignment
5’ ATCAGAGTC 3’
5’ TTCAGTC 3’
ATC ≠ CTA
AG ≠ GA
etc.
4
Sequence alignment
We wish to identify what regions are most similar to each other in the two sequences. Sequences are shifted one by the other and gaps introduced, to cover all possible alignments. The shifts and gaps provide the steps by which one sequence can be converted into the other.
ATCAGAGTC
TTCA--GTC
 +++^^+++
5
Sequence alignment: dot-plot
  A T C A G A G T C
T   •           •
T   •           •
C     •           •
A •     •   •
G         •   •
T   •           •
C     •           •
ATCAGAGTC
TTCA--GTC
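The dot-plot above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `dot_plot` and the text layout are assumptions made for illustration, not part of the lecture. A dot is placed wherever the row residue and the column residue are identical.

```python
def dot_plot(col_seq, row_seq, dot="•", blank=" "):
    """Return the dot-plot as text lines, one dot per identical residue pair."""
    lines = ["  " + " ".join(col_seq)]          # header: column sequence
    for r in row_seq:
        cells = [dot if r == c else blank for c in col_seq]
        lines.append(r + " " + " ".join(cells))
    return lines


plot = dot_plot("ATCAGAGTC", "TTCAGTC")
print("\n".join(plot))
```

Runs of dots along a diagonal correspond to stretches that can be aligned without gaps; a jump from one diagonal to another corresponds to a gap.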
6
Sequence alignment: scoring
Substitution matrix - the similarity value between each pair of residues
   A  C  G  T
A  2  0  0  0
C  0  2  0  0
G  0  0  2  0
T  0  0  0  2
Gap penalty - the cost of introducing gaps
Gap penalty: -2
ATCAGAGTC
TTCA--GTC
•+++^^+++  : 0+2+2+2-2-2+2+2+2 = 8
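The column-by-column sum above can be checked with a short Python sketch. The function name `score_alignment` and its parameter names are assumptions for illustration; the default values (match 2, mismatch 0, gap -2) are the ones on the slide.

```python
def score_alignment(top, bottom, match=2, mismatch=0, gap=-2):
    """Sum the column scores of an alignment given as two gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap          # every gapped column pays the gap penalty
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("ATCAGAGTC", "TTCA--GTC"))  # 8, as on the slide
```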
7
Sequence alignment: Needleman-Wunsch global alignment
   A  T  C  A  G  A  G  T  C
T  0  2  0  0  0  0  0  2  0
T  0  2  0  0  0  0  0  2  0
C  0  0  2  0  0  0  0  0  2
A  2  0  0  2  0  2  0  0  0
G  0  0  0  0  2  0  2  0  0
T  0  2  0  0  0  0  0  2  0
C  0  0  2  0  0  0  0  0  2
   A  C  G  T
A  2  0  0  0
C  0  2  0  0
G  0  0  2  0
T  0  0  0  2
Gap penalty: -2
Initialization
Position 3,2 :
[T2T1]  ATC
        -TT
[C3T1]  ATC-
        --TT
[T2T2]  ATC
        TT-
[ab]
[a-]
[-b]
8
Sequence alignment: Needleman-Wunsch global alignment
   A  C  G  T
A  2  0  0  0
C  0  2  0  0
G  0  0  2  0
T  0  0  0  2
Gap penalty: -2
Initialization
        A   T   C   A   G   A   G   T   C
    0  -2  -4  -6  -8 -10 -12 -14 -16 -18
T  -2   0   2   0   0   0   0   0   2   0
T  -4   0   2   0   0   0   0   0   2   0
C  -6   0   0   2   0   0   0   0   0   2
A  -8   2   0   0   2   0   2   0   0   0
G -10   0   0   0   0   2   0   2   0   0
T -12   0   2   0   0   0   0   0   2   0
C -14   0   0   2   0   0   0   0   0   2
Directionality of score calculation
[ab]
[a-]
[-b]
10
Sequence alignment: Needleman-Wunsch global alignment
   A  C  G  T
A  2  0  0  0
C  0  2  0  0
G  0  0  2  0
T  0  0  0  2
Gap penalty: -2
        A   T   C   A   G   A   G   T   C
    0  -2  -4  -6  -8 -10 -12 -14 -16 -18
T  -2   0   0  -2  -4  -6  -8 -10 -12 -14
T  -4  -2   2   0  -2  -4  -6  -8  -8 -10
C  -6  -4   0   4   2   0  -2  -4  -6  -6
A  -8  -4  -2   2   6   4   2   0  -2  -4
G -10  -6  -4   0   4   8   6   4   2   0
T -12  -8  -4  -2   2   6   8   6   6   4
C -14 -10  -6  -2   0   4   6   8   6   8
11
σ[ab] : score of aligning a pair of residues a and b
σ[a-] : score of aligning residue a with a gap (gap penalty: -q)
S : score matrix
S(i,j) : optimal score of aligning residues 1 to i of one sequence with residues 1 to j of the other sequence
Sequence alignment: Needleman-Wunsch algorithm
12
Sequence alignment: Needleman-Wunsch algorithm
S(0,0) ⇐ 0
for j ⇐ 1 to N do
    S(0,j) ⇐ S(0,j-1) + σ[-b_j]
for i ⇐ 1 to M do
{   S(i,0) ⇐ S(i-1,0) + σ[a_i-]
    for j ⇐ 1 to N do
        S(i,j) ⇐ max (S(i-1, j-1) + σ[a_i b_j],
                      S(i-1, j) + σ[a_i-],
                      S(i, j-1) + σ[-b_j])
}
Pearson & Miller, Meth Enz 210:575, ‘92
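The pseudocode above translates almost line for line into Python. The following is a minimal sketch under stated assumptions: the names `sigma` and `nw_matrix` are illustrative, and the scores (match 2, mismatch 0, gap -2) are the ones used in the lecture example rather than anything fixed by the algorithm.

```python
def sigma(x, y, match=2, mismatch=0, gap=-2):
    """sigma[xy]: substitution score, or the gap penalty if either symbol is '-'."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def nw_matrix(a, b):
    """Fill S so that S[i][j] is the optimal score of a[:i] aligned with b[:j]."""
    M, N = len(a), len(b)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for j in range(1, N + 1):                      # first row: leading gaps
        S[0][j] = S[0][j - 1] + sigma("-", b[j - 1])
    for i in range(1, M + 1):
        S[i][0] = S[i - 1][0] + sigma(a[i - 1], "-")   # first column
        for j in range(1, N + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + sigma(a[i - 1], b[j - 1]),  # [ab]
                S[i - 1][j] + sigma(a[i - 1], "-"),           # [a-]
                S[i][j - 1] + sigma("-", b[j - 1]),           # [-b]
            )
    return S


# Rows follow TTCAGTC and columns follow ATCAGAGTC, as drawn on the slides.
S = nw_matrix("TTCAGTC", "ATCAGAGTC")
print(S[-1][-1])  # optimal global score: 8
```

Printing `S` row by row gives the filled matrix shown on the slides, with the optimal score in the bottom-right cell.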
13
Sequence alignment: Needleman-Wunsch global alignment
The optimal score is found first; more steps are needed to find the corresponding alignment(s).
This is a time-saving property in database searches and other applications.
Only a single pass through the alignment matrix is needed.
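When only the score is wanted, the single pass can be written so that just two rows of the matrix are held at a time. Keeping two rows is an assumption added here for illustration (the lecture only states that one pass suffices), and the name `nw_score` is hypothetical.

```python
def nw_score(a, b, match=2, mismatch=0, gap=-2):
    """Optimal global alignment score, computed in one pass with two rows."""
    prev = [j * gap for j in range(len(b) + 1)]    # row 0 of the matrix
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)            # first column of row i
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s,         # [ab]
                          prev[j] + gap,           # [a-]
                          curr[j - 1] + gap)       # [-b]
        prev = curr
    return prev[-1]


print(nw_score("ATCAGAGTC", "TTCAGTC"))  # 8
```

Because earlier rows are discarded, this version cannot perform a traceback; it trades the alignment itself for lower memory use.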
15
Sequence alignment: Needleman-Wunsch global alignment
the traceback
   A  C  G  T
A  2  0  0  0
C  0  2  0  0
G  0  0  2  0
T  0  0  0  2
Gap penalty: -2
        A   T   C   A   G   A   G   T   C
    0  -2  -4  -6  -8 -10 -12 -14 -16 -18
T  -2   0   0  -2  -4  -6  -8 -10 -12 -14
T  -4  -2   2   0  -2  -4  -6  -8  -8 -10
C  -6  -4   0   4   2   0  -2  -4  -6  -6
A  -8  -4  -2   2   6   4   2   0  -2  -4
G -10  -6  -4   0   4   8   6   4   2   0
T -12  -8  -4  -2   2   6   8   6   6   4
C -14 -10  -6  -2   0   4   6   8   6   8
16
Sequence alignment: Needleman-Wunsch global alignment
the traceback
ATCAGAGTC
TTCAG--TC
•++++^^++  : 0+2+2+2+2-2-2+2+2 = 8
17
Sequence alignment: Needleman-Wunsch global alignment
the traceback
ATCAGAGTC
TTC--AGTC
•++^^++++  : 0+2+2-2-2+2+2+2+2 = 8
18
Sequence alignment: Needleman-Wunsch global alignment
the traceback
ATCAGAGTC
TTCAG--TC  : 8

ATCAGAGTC
TTC--AGTC  : 8

ATCAGAGTC
TTCA--GTC  : 8
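The traceback can be sketched in Python as a walk from the bottom-right cell back to the origin, following every predecessor that could have produced the current score. This is an illustrative sketch, not the lecture's own code: the name `nw_all_alignments` is an assumption, and enumerating all ties recursively is one of several possible designs.

```python
def nw_all_alignments(a, b, match=2, mismatch=0, gap=-2):
    """Return (optimal score, list of all optimal global alignments)."""
    M, N = len(a), len(b)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for j in range(1, N + 1):
        S[0][j] = j * gap
    for i in range(1, M + 1):
        S[i][0] = i * gap
        for j in range(1, N + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s, S[i - 1][j] + gap, S[i][j - 1] + gap)

    results = []

    def walk(i, j, top, bottom):
        """Extend the partial alignment backwards from cell (i, j)."""
        if i == 0 and j == 0:
            results.append((top, bottom))
            return
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + s:                  # came via [ab]
                walk(i - 1, j - 1, a[i - 1] + top, b[j - 1] + bottom)
        if i > 0 and S[i][j] == S[i - 1][j] + gap:              # came via [a-]
            walk(i - 1, j, a[i - 1] + top, "-" + bottom)
        if j > 0 and S[i][j] == S[i][j - 1] + gap:              # came via [-b]
            walk(i, j - 1, "-" + top, b[j - 1] + bottom)

    walk(M, N, "", "")
    return S[M][N], results


score, alignments = nw_all_alignments("ATCAGAGTC", "TTCAGTC")
for top, bottom in alignments:
    print(top, bottom, score)
```

For the lecture's sequences this yields the three alignments listed above, each scoring 8. The number of optimal alignments can grow very quickly for longer sequences, so practical tools usually report only one.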
19
Sequence alignment: Needleman-Wunsch global alignment
The algorithm calculates the score(s) of optimal global sequence alignments, penalizes end gaps, and penalizes each residue in a gap equally.

ATCAGAGTC   has a lower score than   CAGAGTC
--TTCAGTC                            TTCAGTC

ATCACAGTC   has the same score as    ATCACAGTC
T-C--AGTC                            T---CAGTC

ATCACAGTC   has a lower score than   ACACAGTC
T---CAGTC                            T--CAGTC
20
Sequence alignment: Needleman-Wunsch global alignment
In order to score gaps with a penalty q that is independent of the gap length, i.e.

ACACAGTC   ATCACAGTC   AGCTTTCACAGTC
T--CAGTC   T---CAGTC   T-------CAGTC     all have the same score,

the algorithm we presented is modified to extend alignments in more than the three ways we considered.
21
[ab]
[a-]
[-b]
A T C A G A G T C
T 0 2 0 0 0 0 0 2 0
T 0 2 0 0 0 0 0 2 0
C 0 0 2 0 0 0 0 0 2
A 2 0 0 2 0 2 0 0 0
G 0 0 0 0 2 0 2 0 0
T 0 2 0 0 0 0 0 2 0
C 0 0 2 0 0 0 0 0 2
Sequence alignment: Needleman-Wunsch global alignment
[ab]
[a-]
[-b]
22
Sequence alignment: Needleman-Wunsch algorithm
S(0,0) ⇐ 0
for j ⇐ 1 to N do
    S(0,j) ⇐ -q
for i ⇐ 1 to M do
{   S(i,0) ⇐ -q
    for j ⇐ 1 to N do
        S(i,j) ⇐ max (S(i-1, j-1) + σ[a_i b_j],
                      max {S(0, j)...S(i-1, j)} - q,
                      max {S(i, 0)...S(i, j-1)} - q)
}
Pearson & Miller, Meth Enz 210:575, ‘92
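A direct Python rendering of the constant-gap pseudocode is sketched below. The name `nw_constant_gap` and the default scores (match 2, mismatch 0, q = 2) are assumptions for illustration. The inner `max(...)` scans over a whole column or row, so this straightforward version does more work per cell than the three-way recurrence.

```python
def nw_constant_gap(a, b, match=2, mismatch=0, q=2):
    """Optimal global score when every gap costs q regardless of its length."""
    M, N = len(a), len(b)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for j in range(1, N + 1):
        S[0][j] = -q                      # one leading gap of any length
    for i in range(1, M + 1):
        S[i][0] = -q
        for j in range(1, N + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            diag = S[i - 1][j - 1] + s
            # a gap of any length ending at (i, j), entered from the same column
            from_column = max(S[k][j] for k in range(i)) - q
            # a gap of any length ending at (i, j), entered from the same row
            from_row = max(S[i][k] for k in range(j)) - q
            S[i][j] = max(diag, from_column, from_row)
    return S[M][N]


for a in ("ACACAGTC", "ATCACAGTC", "AGCTTTCACAGTC"):
    print(a, nw_constant_gap(a, "TCAGTC"))  # 8 for all three
```

With these parameters the three sequences from the previous slide all reach the same optimal score against TCAGTC, since the length of the gap no longer matters.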
23
Sequence alignment: Needleman-Wunsch global alignment
caveats
Every algorithm is limited by the model it is built upon.
For example, the NW dynamic programming algorithm guarantees us optimal global alignments with the parameters we supply (substitution matrix, gap penalty and gap scoring).
However:
• Different parameters can give different alignments.
• The correct alignment might not be the optimal one.
• The correct alignment might correspond only to part of the global alignment.
24
Source: Pearson WR & Miller W, "Dynamic programming algorithms for biological sequence comparison." Methods in Enzymology, 210:575-601 (1992).
Assignment: Calculate NW alignments with a constant gap penalty, observing the effect of different gap penalties and match/mismatch scores. In all cases use substitution matrices that have only two types of scores: a value for an exact match and a lower value for mismatches. Try the nucleotide sequences used in class and the following amino acid sequences: “ACDGSMF” & “AMDFR”.
More details, sources and things to do for next class