pairwise alignment - dtu systems biology - department for systems
TRANSCRIPT
Pairwise Alignment
Anders Gorm PedersenMolecular Evolution Group
Center for Biological Sequence Analysis
Sequences are related
• Darwin: all organisms are related through descent with modification• => Sequences are related through descent with modification• => Similar molecules have similar functions in different organisms
Phylogenetic tree based on ribosomal RNA: three domains of life
Why compare sequences?
• Determination of evolutionary relationships
• Prediction of protein function and structure (database searches).
Protein 1: binds oxygen
Sequence similarity
Protein 2: binds oxygen ?
Dotplots: visual sequence comparison
1. Place two sequences along axes of plot
2. Place dot at grid points where two sequences have identical residues
3. Diagonals correspond to conserved regions
Pairwise alignments
43.2% identity; Global alignment score: 374
10 20 30 40 50 alpha V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA : :.: .:. : : :::: .. : :.::: :... .: :. .: : ::: :. beta VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP 10 20 30 40 50
60 70 80 90 100 110 alpha QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL .::.::::: :.....::.:.. .....::.:: ::.::: ::.::.. :. .:: :.beta KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF 60 70 80 90 100 110
120 130 140 alpha PAEFTPAVHASLDKFLASVSTVLTSKYR :::: :.:. .: .:.:...:. ::.beta GKEFTPPVQAAYQKVVAGVANALAHKYH 120 130 140
Pairwise alignment
Percent identity is not a good measure of alignment quality
100.000% identity in 3 aa overlap
SPA ::: SPA
Pairwise alignments: alignment score
43.2% identity; Global alignment score: 374
10 20 30 40 50 alpha V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA : :.: .:. : : :::: .. : :.::: :... .: :. .: : ::: :. beta VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP 10 20 30 40 50
60 70 80 90 100 110 alpha QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL .::.::::: :.....::.:.. .....::.:: ::.::: ::.::.. :. .:: :.beta KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF 60 70 80 90 100 110
120 130 140 alpha PAEFTPAVHASLDKFLASVSTVLTSKYR :::: :.:. .: .:.:...:. ::.beta GKEFTPPVQAAYQKVVAGVANALAHKYH 120 130 140
Alignment scores: match vs. mismatch
Simple scoring scheme (too simple in fact…):
Matching amino acids: 5Mismatch: 0
Scoring example:
K A W S A D V : : : : : K D W S A E V5+0+5+5+5+0+5 = 25
Pairwise alignments: conservative substitutions
43.2% identity; Global alignment score: 374
10 20 30 40 50 alpha V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA : :.: .:. : : :::: .. : :.::: :... .: :. .: : ::: :. beta VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP 10 20 30 40 50
60 70 80 90 100 110 alpha QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL .::.::::: :.....::.:.. .....::.:: ::.::: ::.::.. :. .:: :.beta KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF 60 70 80 90 100 110
120 130 140 alpha PAEFTPAVHASLDKFLASVSTVLTSKYR :::: :.:. .: .:.:...:. ::.beta GKEFTPPVQAAYQKVVAGVANALAHKYH 120 130 140
Amino acid properties
Serine (S) and Threonine (T) have similar physicochemical properties
Aspartic acid (D) and Glutamic acid (E) have similar properties
Substitution of S/T or E/D occurs relatively often during evolution
=>
Substitution of S/T or E/D should result in scores that are only moderately lower than identities
=>
Protein substitution matrices
A 5R -2 7N -1 -1 7D -2 -2 2 8C -1 -4 -2 -4 13Q -1 1 0 0 -3 7E -1 0 0 2 -3 2 6G 0 -3 0 -1 -3 -2 -3 8H -2 0 1 -1 -3 1 0 -2 10I -1 -4 -3 -4 -2 -3 -4 -4 -4 5L -2 -3 -4 -4 -2 -2 -3 -4 -3 2 5K -1 3 0 -1 -3 2 1 -2 0 -3 -3 6M -1 -2 -2 -4 -2 0 -2 -3 -1 2 3 -2 7F -3 -3 -4 -5 -2 -4 -3 -4 -1 0 1 -4 0 8P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10S 1 -1 1 0 -1 0 -1 0 -1 -3 -3 0 -2 -3 -1 5T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 2 5W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1 1 -4 -4 -3 15Y -2 -1 -2 -3 -3 -1 -2 -3 2 -1 -1 -2 0 4 -3 -2 -2 2 8V 0 -3 -3 -4 -1 -3 -3 -4 -4 4 1 -3 1 -1 -3 -2 0 -3 -1 5 A R N D C Q E G H I L K M F P S T W Y V
BLOSUM50 matrix:
• Positive scores on diagonal (identities)
• Similar residues get higher (positive) scores
• Dissimilar residues get smaller (negative) scores
Pairwise alignments: insertions/deletions
43.2% identity; Global alignment score: 374
10 20 30 40 50 alpha V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA : :.: .:. : : :::: .. : :.::: :... .: :. .: : ::: :. beta VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP 10 20 30 40 50
60 70 80 90 100 110 alpha QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL .::.::::: :.....::.:.. .....::.:: ::.::: ::.::.. :. .:: :.beta KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF 60 70 80 90 100 110
120 130 140 alpha PAEFTPAVHASLDKFLASVSTVLTSKYR :::: :.:. .: .:.:...:. ::.beta GKEFTPPVQAAYQKVVAGVANALAHKYH 120 130 140
Alignment scores: insertions/deletions
K L A A S V I L S D A L K L A A - - - - S D A L
-10 + 3 x (-1)=-13
Affine gap penalties:
Multiple insertions/deletions may be one evolutionary event =>
Separate penalties for gap opening and gap elongation
Handout
Compute 4 alignment scores: two different alignments using two different alignment matrices (and the same gap penalty system)
Score 1: Alignment 1 + BLOSUM-50 matrix + gaps
Score 2: Alignment 1 + BLOSUM-Trp matrix + gaps
Score 3: Alignment 2 + BLOSUM-50 matrix + gaps
Score 4: Alignment 2 + BLOSUM-Trp matrix + gaps
Note: fake matrix constructed for pedagogic purposes.
Protein substitution matrices
A 5R -2 7N -1 -1 7D -2 -2 2 8C -1 -4 -2 -4 13Q -1 1 0 0 -3 7E -1 0 0 2 -3 2 6G 0 -3 0 -1 -3 -2 -3 8H -2 0 1 -1 -3 1 0 -2 10I -1 -4 -3 -4 -2 -3 -4 -4 -4 5L -2 -3 -4 -4 -2 -2 -3 -4 -3 2 5K -1 3 0 -1 -3 2 1 -2 0 -3 -3 6M -1 -2 -2 -4 -2 0 -2 -3 -1 2 3 -2 7F -3 -3 -4 -5 -2 -4 -3 -4 -1 0 1 -4 0 8P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10S 1 -1 1 0 -1 0 -1 0 -1 -3 -3 0 -2 -3 -1 5T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 2 5W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1 1 -4 -4 -3 15Y -2 -1 -2 -3 -3 -1 -2 -3 2 -1 -1 -2 0 4 -3 -2 -2 2 8V 0 -3 -3 -4 -1 -3 -3 -4 -4 4 1 -3 1 -1 -3 -2 0 -3 -1 5 A R N D C Q E G H I L K M F P S T W Y V
BLOSUM50 matrix:
• Positive scores on diagonal (identities)
• Similar residues get higher (positive) scores
• Dissimilar residues get smaller (negative) scores
Protein substitution matrices: different types
• Identity matrix(match vs. mismatch)
• Chemical properties matrix(use knowledge of physicochemical properties to design matrix)
• Empirical matrices(based on observed pair-frequencies in hand-made alignments)§ PAM series§ BLOSUM series§ Gonnet
S(a, b) =1
�ln
✓pobs
pexp
◆=
1
�ln
✓pab
pa ⇥ pb
◆
Estimation of the BLOSUM 62 matrix
• BLOSUM matrices are computed based on gap-free alignments in the so-called BLOCKS database. BLOSUM 62 is computed by comparing sequences that are less than 62% identical. BLOSUM 80 is computed from sequences less than 80% identical, etc.
• All pairs of sequences in a block are compared, and the observed pair frequencies (pab) are noted. For instance: pWW = 0.0065, pAL = 0.0044, etc.
• Expected pair frequencies are computed from single amino acid frequencies. For instance: pA x pL= 0.074 x 0.099 = 0.0073
• For each amino acid pair the substitution scores are essentially computed as follows (here λ is a scaling factor, used to obtain integer scores):
ID FIBRONECTIN_2; BLOCKCOG9_CANFA GNSAGEPCVFPFIFLGKQYSTCTREGRGDGHLWCATTCOG9_RABIT GNADGAPCHFPFTFEGRSYTACTTDGRSDGMAWCSTTFA12_HUMAN LTVTGEPCHFPFQYHRQLYHKCTHKGRPGPQPWCATTHGFA_HUMAN LTEDGRPCRFPFRYGGRMLHACTSEGSAHRKWCATTHMANR_HUMAN GNANGATCAFPFKFENKWYADCTSAGRSDGWLWCGTTMPRI_MOUSE ETDDGEPCVFPFIYKGKSYDECVLEGRAKLWCSKTANPB1_PIG AITSDDKCVFPFIYKGNLYFDCTLHDSTYYWCSVTTYSFP1_BOVIN ELPEDEECVFPFVYRNRKHFDCTVHGSLFPWCSLDADSFP3_BOVIN AETKDNKCVFPFIYGNKKYFDCTLHGSLFLWCSLDADSFP4_BOVIN AVFEGPACAFPFTYKGKKYYMCTRKNSVLLWCSLDTESP1_HORSE AATDYAKCAFPFVYRGQTYDRCTTDGSLFRISWCSVTCOG2_CHICK GNSEGAPCVFPFIFLGNKYDSCTSAGRNDGKLWCASTCOG2_HUMAN GNSEGAPCVFPFTFLGNKYESCTSAGRSDGKMWCATTCOG2_MOUSE GNSEGAPCVFPFTFLGNKYESCTSAGRNDGKVWCATTCOG2_RABIT GNSEGAPCVFPFTFLGNKYESCTSAGRSDGKMWCATSCOG2_RAT GNSEGAPCVFPFTFLGNKYESCTSAGRNDGKVWCATTCOG9_BOVIN GNADGKPCVFPFTFQGRTYSACTSDGRSDGYRWCATTCOG9_HUMAN GNADGKPCQFPFIFQGQSYSACTTDGRSDGYRWCATTCOG9_MOUSE GNGEGKPCVFPFIFEGRSYSACTTKGRSDGYRWCATTCOG9_RAT GNGDGKPCVFPFIFEGHSYSACTTKGRSDGYRWCATTFINC_BOVIN GNSNGALCHFPFLYNNHNYTDCTSEGRRDNMKWCGTTFINC_HUMAN GNSNGALCHFPFLYNNHNYTDCTSEGRRDNMKWCGTTFINC_RAT GNSNGALCHFPFLYSNRNYSDCTSEGRRDNMKWCGTTMPRI_BOVIN ETEDGEPCVFPFVFNGKSYEECVVESRARLWCATTANMPRI_HUMAN ETDDGVPCVFPFIFNGKSYEECIIESRAKLWCSTTADPA2R_BOVIN GNAHGTPCMFPFQYNQQWHHECTREGREDNLLWCATTPA2R_RABIT GNAHGTPCMFPFQYNHQWHHECTREGRQDDSLWCATT
pA = 0.074, pL = 0.099 ) pA ⇥ pL = 0.0073
pAL = 0.044
SAL =1
�ln
✓pAL
pA ⇥ pL
◆
=1
0.347ln
✓0.0044
0.0073
◆
=�0.51
0.347
= �1.46
⇡ �1
Estimation of the BLOSUM 62 matrix
• Example: A + L:
Pairwise alignment
Optimal alignment:
alignment having the highest possible score given a substitution matrix and a set of gap penalties
So:
best alignment can be found by exhaustively searching all possible alignments, scoring each of them and choosing the one with the highest score?
The problem: How many possible alignments are there?
ACG! AC-G! --ACG! -A-CGACG! ACG-! AC-G-! A-CG-
-ACG! AC-G! --ACG! ...ACG-! A-CG! A-CG-
-ACG! AC-G! --ACGAC-G! -ACG! AC--G
-ACG! ACG-! --ACGA-CG! AC-G! A-C-G
A-CG! ACG-! --ACGACG-! A-CG! A--CG
A-CG! ACG-! -A-CGAC-G! -ACG! ACG--
A-CG! --ACG! -A-CG-ACG! ACG--! AC-G-
How many possible alignments are there?
Length of sequences: n1 = n2 Number of possible alignments
2 13
3 63
4 321
5 1683
10 8,097,453
20 2.61 x 1014
100 2.05 x 1075
300 1.53 x 10228
Pairwise alignment: the problem
The number of possible pairwise alignments increases explosively with the length of the sequences:
Two protein sequences of length 300 amino acids can be aligned in approximately 10228 different ways
Time needed to test all possibilities much larger than the entire lifetime of the universe.
N =n2X
k=0
✓n1 + k
n2
◆✓n2
k
◆
Dynamic programming: computation of scores
T C G C A
T
C
C
A
x
Any given point in matrix can only be reached from three possible previous positions (you cannot “align backwards”).
=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
Dynamic programming: computation of scores
x
Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”).
=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
score(x,y) = max score(x,y-1) - gap-penalty
T C G C A
T
C
C
A
Dynamic programming: computation of scores
x
Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”).
=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
score(x,y) = max score(x,y-1) - gap-penalty
score(x-1,y-1) + substitution-score(x,y)
T C G C A
T
C
C
A
Dynamic programming: computation of scores
x
Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”).
=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
score(x,y) = max score(x,y-1) - gap-penalty
score(x-1,y-1) + substitution-score(x,y)
score(x-1,y) - gap-penalty
T C G C A
T
C
C
A
Dynamic programming: computation of scores
x
Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”).
=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
Each new score is found by choosing the maximum of three possibilities. For each square in matrix: keep track of where best score came from.
Fill in scores one row at a time, starting in upper left corner of matrix, ending in lower right corner.
score(x,y) = max score(x,y-1) - gap-penalty
score(x-1,y-1) + substitution-score(x,y)
score(x-1,y) - gap-penalty
T C G C A
T
C
C
A
Global versus local alignments
Global alignment: align full length of both sequences. (The “Needleman-Wunsch” algorithm).
Local alignment: find best partial alignment of two sequences (the “Smith-Waterman” algorithm).
Global alignment
Seq 1
Seq 2
Local alignment
Local alignment overview
• The recursive formula is changed by adding a fourth possibility: zero. This means local alignment scores are never negative.
• Trace-back is started at the highest value rather than in lower right corner
• Trace-back is stopped as soon as a zero is encountered
score(x,y) = max
score(x,y-1) - gap-penalty
score(x-1,y-1) + substitution-score(x,y)
score(x-1,y) - gap-penalty
0
Substitution matrices and sequence similarity
• Substitution matrices come as series of matrices calculated for different degrees of sequence similarity (different evolutionary distances).
• ”Hard” matrices are designed for similar sequences– Hard matrices a designated by high numbers in the BLOSUM
series (e.g., BLOSUM80)– Hard matrices yield short, highly conserved alignments
• ”Soft” matrices are designed for less similar sequences– Soft matrices have low BLOSUM values (45)– Soft matrices yield longer, less well conserved alignments
Alignments: things to keep in mind
“Optimal alignment” means “having the highest possible score, given substitution matrix and set of gap penalties”.
This is NOT necessarily the biologically most meaningful alignment.
Specifically, the underlying assumptions are often wrong: substitutions are not equally frequent at all positions, affine gap penalties do not model insertion/deletion well, etc.
Pairwise alignment programs always produce an alignment - even when it does not make sense to align sequences.