page 1 march 2003 pairwise sequence alignments volker flegel
TRANSCRIPT
march 2003 Page 1
Pairwise sequence alignments
Volker Flegel
march 2003 Page 2
Goal
Sequence comparison through pairwise alignments
• Goal of pairwise comparison is to find conserved regions (if any) between two sequences
• Extrapolate information about our sequence using the known characteristics of the other sequence
THIO_EMENI  GFVVVDCFATWCGPCKAIAPTVEKFAQTY
            G ++VD +A WCGPCK IAP +++ A  Y
???         GAILVDFWAEWCGPCKMIAPILDEIADEY
[Figure: the SwissProt annotation of THIO_EMENI is extrapolated to the unknown sequence (???)]
march 2003 Page 3
Do alignments make sense?
Evolution of sequences
• Sequences evolve through mutation and selection
  Selective pressure is different for each residue position in a protein (e.g. conservation of active site, structure, charge, etc.)
• Modular nature of proteins
  Nature keeps re-using domains
• Alignments try to tell the evolutionary story of the proteins
Relationships
[Figure: diagram relating Same Sequence, Same 3D Fold, Same Origin and Same Function]
march 2003 Page 4
Example: An alignment
• Two similar regions of the Drosophila melanogaster Slit and Notch proteins
                  970       980       990      1000      1010      1020
SLIT_DROME FSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFC
           ..:.:  :.  :.: ...:.: ..  : :.. : ::.. . :.: ::..:. :.  :. :
NOTC_DROME YKCECPRGFYDAHCLSDVDECASN-PCVNEGRCEDGINEFICHCPPGYTGKRCELDIDEC
                 740       750        760       770       780       790
march 2003 Page 5
Example: A diagonal plot
• Comparing the tissue-type and urokinase type plasminogen activators
[Dotplot: Tissue-Type Plasminogen Activator (horizontal axis) against Urokinase-Type Plasminogen Activator (vertical axis)]
URL: www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
march 2003 Page 6
Relationships to other techniques
Sequence analysis tools depending on pairwise comparison
• Multiple alignments
• Profile and HMM making (used to search for protein families and domains)
• 3D protein structure prediction
• Phylogenetic analysis
• Construction of certain substitution matrices
• Similarity searches in a database
march 2003 Page 7
Some definitions
Identity
Proportion of pairs of identical residues between two aligned sequences.
Generally expressed as a percentage.
This value strongly depends on how the two sequences are aligned.
Similarity
Proportion of pairs of similar residues between two aligned sequences.
Whether two residues are similar is determined by a substitution matrix.
This value also depends strongly on how the two sequences are aligned, as well as on the substitution matrix used.
Homology
Two sequences are homologous if and only if they have a common ancestor.
There is no such thing as a level of homology! (It's either yes or no.)
• Homologous sequences do not necessarily serve the same function...
• ... Nor are they always highly similar: structure may be conserved while sequence is not.
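As a rough sketch of how the identity value could be computed from an aligned pair: the function name is illustrative, gaps are assumed to be written as "-", and the percentage is taken over columns without a gap. Other conventions divide by the full alignment length or by the shorter sequence, which is one reason reported identities differ between tools.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residue pairs among the columns without a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where both sequences carry a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


# The two thioredoxin fragments shown on the "Goal" slide (ungapped).
identity = percent_identity("GFVVVDCFATWCGPCKAIAPTVEKFAQTY",
                            "GAILVDFWAEWCGPCKMIAPILDEIADEY")
print(f"{identity:.1f}% identity")
```

Similarity could be computed the same way by counting pairs whose substitution-matrix score is positive instead of pairs that are identical.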
march 2003 Page 8
More definitions
Consider a set S (say, globins) and a test t that tries to detect members of S
(for example, through a pairwise comparison with another globin).
True positive A protein is a true positive if it belongs to S and is detected by t.
True negative A protein is a true negative if it does not belong to S and is not detected by t.
False positive A protein is a false positive if it does not belong to S and is (incorrectly) detected by t.
False negative A protein is a false negative if it belongs to S and is not detected by t (but should be).
march 2003 Page 9
Definition example
The set of all globins and a test to identify them
[Figure: the set "Globins" and the set "Matches" reported by the test, with regions labelled true positives, true negatives, false positives and false negatives]
march 2003 Page 10
Even more definitions
Sensitivity
Ability of a method to detect positives, irrespective of how many false positives are reported.
Selectivity
Ability of a method to reject negatives, irrespective of how many positives are wrongly rejected (false negatives).
[Figure: the true/false positive/negative diagram in two settings: greater sensitivity with less selectivity, and less sensitivity with greater selectivity]
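One way to put numbers on these two notions is sketched below. The formulas are an assumption, not taken from the slides: sensitivity is read as TP / (TP + FN), and selectivity as TN / (TN + FP), which some texts call specificity. The counts are hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of the true members of S that the test detects."""
    return tp / (tp + fn)


def selectivity(tn: int, fp: int) -> float:
    """Fraction of the non-members of S that the test correctly rejects."""
    return tn / (tn + fp)


# Hypothetical outcome of a globin search over 1000 proteins.
tp, fn, tn, fp = 90, 10, 880, 20
sens = sensitivity(tp, fn)
sel = selectivity(tn, fp)
print(f"sensitivity = {sens:.3f}, selectivity = {sel:.3f}")
```

Loosening the detection threshold typically moves proteins from FN to TP (raising sensitivity) and from TN to FP (lowering selectivity), which is the trade-off pictured on the slide.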
march 2003 Page 11
Pairwise sequence alignments
Concept of sequence alignment
• Pairwise Alignment:
  Explicit mapping between the residues of 2 sequences
  – Tolerant to errors (mismatches, insertions / deletions or indels)
  – Evaluation of the alignment in a biological context (significance)
Seq A  GARFIELDTHELASTFA-TCAT
       |||||||||||    || ||||
Seq B  GARFIELDTHEVERYFASTCAT
[Figure labels: errors / mismatches (LAST aligned with VERY); insertion / deletion (the gap opposite S)]
march 2003 Page 12
Alignments
Number of alignments
• There are many ways to align two sequences
• Consider the sequence fragments below: a simple alignment shows some conserved portions
 CGATGCAGACGTCA
        ||||||||
CGATGCAAGACGTCA
but also:
CGATGCAGACGTCA
|||||||
CGATGCAAGACGTCA
• Number of possible alignments for 2 sequences of length 1000 residues: more than 10^600 gapped alignments
  (Avogadro's number is about 10^24; the estimated number of atoms in the universe is about 10^80)
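The order of magnitude can be checked with a short calculation. The sketch below assumes one common way of counting, in which an alignment is an interleaving of the two sequences, giving C(n + m, n) alignments for lengths n and m; whether the slide's figure was derived this way is an assumption.

```python
import math


def count_alignments(n: int, m: int) -> int:
    """Number of interleavings of two sequences of lengths n and m: C(n + m, n)."""
    return math.comb(n + m, n)


count = count_alignments(1000, 1000)
digits = len(str(count))
print(f"C(2000, 1000) has {digits} decimal digits")
```

Counting schemes that also distinguish the order of adjacent insertions and deletions give even larger numbers, so exhaustive enumeration is out of reach either way.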
march 2003 Page 13
Alignment evaluation
What is a good alignment?
• We need a way to evaluate the biological meaning of a given alignment
• Intuitively we "know" that the following alignment:
CGAGGCACAACGTCA
||| |||  ||||||
CGATGCAAGACGTCA
is better than:
ATTGGACAGCAATCAGG
|     ||  |     |
ACGATGCAAGACGTCAG
• We can express this notion more rigorously, by using a scoring system.
march 2003 Page 14
Scoring system
Simple alignment scores
• A simple way (but not the best) to score an alignment is to count 1 for each match and 0 for each mismatch.
CGAGGCACAACGTCA
||| |||  ||||||
CGATGCAAGACGTCA
Score: 12
ATTGGACAGCAATCAGG
|     ||  |     |
ACGATGCAAGACGTCAG
Score: 5
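The two scores on this slide can be reproduced with a few lines. This is a minimal sketch of the match/mismatch counting rule; the function name is illustrative and gaps are not handled.

```python
def identity_score(row_a: str, row_b: str, match: int = 1, mismatch: int = 0) -> int:
    """Score an ungapped alignment column by column."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(row_a, row_b))


good = identity_score("CGAGGCACAACGTCA", "CGATGCAAGACGTCA")
poor = identity_score("ATTGGACAGCAATCAGG", "ACGATGCAAGACGTCAG")
print(good, poor)
```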
march 2003 Page 15
Introducing biological information
Importance of the scoring system
discrimination of significant biological alignments
• Based on physico-chemical properties of amino acids
  Hydrophobicity, acid / base, sterical properties, ...
  Scoring system scales are arbitrary
• Based on biological sequence information
  Substitutions observed in structural or evolutionary alignments of well studied protein families
  Scoring systems have a probabilistic foundation
Substitution matrices
• In proteins some mismatches are more acceptable than others
• Substitution matrices give a score for each substitution of one amino acid by another
march 2003 Page 16
Substitution matrices (log-odds matrices)
Example matrix
PAM250
From: A. D. Baxevanis, "Bioinformatics"
(Leu, Ile): 2
(Leu, Cys): -6
...
• Positive score: the amino acids are similar; mutations from one into the other occur more often than expected by chance during evolution
• Negative score: the amino acids are dissimilar; the mutation from one into the other occurs less often than expected by chance during evolution
\[ \text{score} = \log \frac{\text{observed}}{\text{expected by chance}} \]
• For a set of well known proteins:
  • Align the sequences
  • Count the mutations at each position
  • For each substitution set the score to the log-odds ratio
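The log-odds idea can be illustrated with a small calculation. The sketch below is a simplification under stated assumptions: the expected frequency of a pair is taken as the product of the two background frequencies, the logarithm is base 2, and all frequencies are hypothetical. Published matrices additionally scale and round the values.

```python
import math


def log_odds(observed_pair_freq: float, freq_a: float, freq_b: float,
             base: float = 2.0) -> float:
    """log(observed / expected by chance) for one pair of residues."""
    expected = freq_a * freq_b  # chance of seeing the pair if residues were independent
    return math.log(observed_pair_freq / expected, base)


# Hypothetical frequencies, for illustration only.
favoured = log_odds(0.02, 0.1, 0.1)      # seen twice as often as expected
disfavoured = log_odds(0.005, 0.1, 0.1)  # seen half as often as expected
print(f"{favoured:+.2f} {disfavoured:+.2f}")
```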
march 2003 Page 17
Matrix choice
Different kinds of matrices
• PAM series (M. Dayhoff, 1968, 1972, 1978)
  Based on 1572 observed mutations in 71 families of closely related proteins
  Old standard matrix: PAM250
• BLOSUM series
  Based on alignments in the BLOCKS database
  Standard matrix: BLOSUM62
Limitations
• Substitution matrices do not take into account long range interactions between residues.
• They assume that identical residues are equal (a residue at the active site has other evolutionary constraints than the same residue outside of the active site)
• They assume evolution rate to be constant.
march 2003 Page 18
Alignment score
Amino acid substitution matrices
• Example: PAM250
• Most used: BLOSUM62
Raw score of an alignment
TPEA
¦| |
APGA
Score = 1 + 6 + 0 + 2 = 9
• It is possible that a good long alignment gets a better raw score than a very good short alignment.
  We need a normalised score to compare alignments! (p-value, e-value)
Gaps
Insertions or deletions
• Proteins often contain regions where residues have been inserted or deleted during evolution
• There are constraints on where these insertions and deletions can happen (between structural or functional elements like: alpha helices, active site, etc.)
Gaps in alignments
GCATGCATGCAACTGCAT
|||||||||
GCATGCATGGGCAACTGCAT
can be improved by inserting a gap
GCATGCATG--CAACTGCAT
|||||||||  |||||||||
GCATGCATGGGCAACTGCAT
march 2003 Page 20
Gap opening and extension penalties
Costs of gaps in alignments
• We want to simulate as closely as possible the evolutionary mechanisms involved in gap occurrence.
Example
• Two alignments with an identical number of gap positions but very different gap distribution. We may prefer one large gap to several small ones (e.g. poorly conserved loops between well-conserved helices)
CGATGCAGCAGCAGCATCG
||||||      |||||||
CGATGC------AGCATCG
CGATGCAGCAGCAGCATCG
|| || |||| ||  || |
CG-TG-AGCA-CA--AT-G
Gap opening penalty
• Counted each time a gap is opened in an alignment (some programs include the first extension into this penalty)
Gap extension penalty
• Counted for each extension of a gap in an alignment
march 2003 Page 21
Gap opening and extension penalties
Example
• With a match score of 1 and a mismatch score of 0
• With an opening penalty of 10 and extension penalty of 1, we have the following scores:
CGATGCAGCAGCAGCATCG
||||||      |||||||
CGATGC------AGCATCG
13 x 1 - 10 - 6 x 1 = -3    (13 matches, 1 gap opening, 6 gap extensions)
CGATGCAGCAGCAGCATCG
|| || |||| ||  || |
CG-TG-AGCA-CA--AT-G
13 x 1 - 5 x 10 - 6 x 1 = -43    (13 matches, 5 gap openings, 6 gap extensions)
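The two totals can be reproduced with a small scoring function. This sketch follows the convention used in the example, where the opening penalty is charged once per gap and the extension penalty for every gap position, including the first; as noted earlier, some programs fold the first extension into the opening penalty. The function name is illustrative.

```python
def gapped_score(row_a: str, row_b: str, match: int = 1, mismatch: int = 0,
                 gap_open: int = 10, gap_extend: int = 1) -> int:
    """Score a gapped alignment with separate opening and extension penalties."""
    score = 0
    gap_in = None  # which row currently holds a gap: "a", "b" or None
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            if side != gap_in:
                score -= gap_open   # a new gap starts here
            score -= gap_extend     # charged for every gap position
            gap_in = side
        else:
            score += match if a == b else mismatch
            gap_in = None
    return score


one_gap = gapped_score("CGATGCAGCAGCAGCATCG", "CGATGC------AGCATCG")
many_gaps = gapped_score("CGATGCAGCAGCAGCATCG", "CG-TG-AGCA-CA--AT-G")
print(one_gap, many_gaps)
```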
march 2003 Page 22
Statistical evaluation of results
Alignments are evaluated according to their score
• Raw score
  It's the sum of the amino acid substitution scores and gap penalties (gap opening and gap extension)
  Depends on the scoring system (substitution matrix, etc.)
  Different alignments should not be compared based only on the raw score
• Normalised score
  Is independent of the scoring system
  Enables us to compare different alignments
  Units: expressed in bits
march 2003 Page 23
Statistical evaluation of results
Statistics derived from the scores
• p-value
  Probability that an alignment with this score occurs by chance in a database of this size
  The closer the p-value is towards 0, the better the alignment
• e-value
  Number of matches with this score one can expect to find by chance in a database of this size
  The closer the e-value is towards 0, the better the alignment
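The two statistics are closely related. The sketch below assumes that the number of chance hits follows a Poisson distribution, as in BLAST-style statistics; this relation is not stated on the slides. Under that assumption, the probability of at least one chance hit is p = 1 - exp(-E).

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (0.001, 0.01, 1.0, 10.0):
    print(f"E = {e:<6} p = {p_value_from_e_value(e):.5f}")
```

For small values the two numbers are almost equal, which is why either one can be used to rank strong hits.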
march 2003 Page 24
Diagonal plots or Dotplot
Concept of a Dotplot
• Produces a graphical representation of similarity regions.
• The horizontal and vertical dimensions correspond to the compared sequences.
• A region of similarity stands out as a diagonal.
[Dotplot: Tissue-Type Plasminogen Activator (horizontal axis) against Urokinase-Type Plasminogen Activator (vertical axis)]
march 2003 Page 25
Dotplot construction
Simple example
• A dot is placed at each position where two residues match.
  The colour of the dot can be chosen according to the substitution value in the substitution matrix
[Dot matrix: THEFATCAT (horizontal) against THEFASTCAT (vertical)]
THEFA-TCAT
||||| ||||
THEFASTCAT
Note
• This method produces dotplots with too much noise to be useful
  The noise can be reduced by calculating a score using a window of residues
  The score is compared to a threshold or stringency
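A text-mode version of the simple dotplot can be sketched as follows. It marks identical residues only, rather than colouring by substitution score; the function name and the "*" / "." symbols are illustrative choices.

```python
def dotplot(seq_x: str, seq_y: str) -> list[str]:
    """One text row per residue of seq_y; '*' marks a residue identical to seq_x."""
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


rows = dotplot("THEFATCAT", "THEFASTCAT")
print("  THEFATCAT")
for residue, row in zip("THEFASTCAT", rows):
    print(residue, row)
```

The main diagonal is visible, shifted by one row after the S, but so are the stray dots produced by the repeated T and A residues: this is the noise mentioned in the note.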
march 2003 Page 26
Dotplot construction
Window example
• Each window of the first sequence is aligned (without gaps) to each window of the 2nd sequence
• A colour is set into a rectangular array according to the score of the aligned windows
[Dot matrix: THEFATCAT (horizontal) against THEFASTCAT (vertical)]
THE
|||
THE    Score: 23
THE
HEF    Score: -5
CAT
THE    Score: -4
HEF
THE    Score: -5
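The windowing step can be sketched in a few lines. To stay self-contained, the sketch scores each window with 1 per identical pair and 0 otherwise, which is an assumption; the scores on the slide come from a substitution matrix and are therefore different numbers. Function names and the threshold are illustrative.

```python
def window_scores(seq_x: str, seq_y: str, window: int = 3,
                  match: int = 1, mismatch: int = 0) -> list[list[int]]:
    """scores[j][i]: ungapped score of seq_y[j:j+window] against seq_x[i:i+window]."""
    n_x = len(seq_x) - window + 1
    n_y = len(seq_y) - window + 1
    scores = []
    for j in range(n_y):
        row = []
        for i in range(n_x):
            row.append(sum(match if seq_x[i + k] == seq_y[j + k] else mismatch
                           for k in range(window)))
        scores.append(row)
    return scores


def threshold_plot(scores: list[list[int]], threshold: int) -> list[str]:
    """Keep only the windows whose score reaches the threshold (stringency)."""
    return ["".join("*" if s >= threshold else "." for s in row) for row in scores]


scores = window_scores("THEFATCAT", "THEFASTCAT")
plot = threshold_plot(scores, threshold=3)
print("\n".join(plot))
```

With a window of 3 and a threshold of 3, only the two diagonal segments on either side of the inserted S survive, and the isolated dots disappear.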
march 2003 Page 27
Dotplot limitations
It's a visual aid. The human eye can rapidly identify similar regions in sequences.
It's a good way to explore sequence organisation. It does not provide an alignment.
[Dotplot: Tissue-Type Plasminogen Activator (horizontal axis) against Urokinase-Type Plasminogen Activator (vertical axis)]
march 2003 Page 28
Relationship between alignment and dotplot
• An alignment can be seen as a path through the dotplot diagram.
Creating an alignment
Seq B  A-CA-CA
       | || |
Seq A  ACCAAC-
Seq B  ACA--CA
       |
Seq A  A-CCAAC
march 2003 Page 29
Finding an alignment
Alignment algorithms
• An alignment program tries to find the best alignment between two sequences given the scoring system.
• This can be seen as trying to find a path through the dotplot diagram including all (or the most visible) diagonals.
Alignment types
• Global: alignment between the complete sequence A and the complete sequence B
• Local: alignment between a sub-sequence of A and a sub-sequence of B
Computer implementation (Algorithms)
• Dynamic programming
• Global: Needleman-Wunsch
• Local: Smith-Waterman
march 2003 Page 30
Global alignment (Needleman-Wunsch)
Example
Global alignments are very sensitive to gap penalties
Global alignments do not take into account the modular nature of proteins
[Dotplot of Tissue-Type Plasminogen Activator (horizontal) against Urokinase-Type Plasminogen Activator (vertical), labelled "Global alignment:"]
march 2003 Page 31
Local alignment (Smith-Waterman)
Example
Local alignments are more sensitive to the modular nature of proteins
They can be used to search databases
[Dotplot of Tissue-Type Plasminogen Activator (horizontal) against Urokinase-Type Plasminogen Activator (vertical), labelled "Local alignments:"]
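A minimal sketch of the Smith-Waterman idea is given below, restricted to computing the best local score. The scoring parameters (match 2, mismatch -1, linear gap penalty 2) and the test sequences are illustrative assumptions; real programs use a substitution matrix and separate opening and extension penalties, and also trace back the alignment itself.

```python
def smith_waterman_score(seq_a: str, seq_b: str, match: int = 2,
                         mismatch: int = -1, gap: int = 2) -> int:
    """Best local alignment score with a linear gap penalty."""
    cols = len(seq_b) + 1
    previous = [0] * cols
    best = 0
    for a in seq_a:
        current = [0] * cols
        for j in range(1, cols):
            s = match if a == seq_b[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            current[j] = max(0,
                             previous[j - 1] + s,
                             previous[j] - gap,
                             current[j - 1] - gap)
            best = max(best, current[j])
        previous = current
    return best


# A shared 7-residue segment embedded in unrelated flanking residues.
shared = smith_waterman_score("XXXGATTACAYYY", "ZZGATTACAWW")
unrelated = smith_waterman_score("AAAA", "CCCC")
print(shared, unrelated)
```

The unrelated flanks do not lower the score of the shared segment, which is why local alignment suits proteins that share only one domain.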
march 2003 Page 32
Optimal alignment extension
How to optimally extend an optimal alignment
• An optimal alignment up to positions i and j can be extended in 3 ways.
• Keeping the best of the 3 guarantees an extended optimal alignment.
Seq A  a1 a2 a3 ... ai-1 ai
Seq B  b1 b2 b3 ... bj-1 bj
• We have the optimal alignment extended from i and j by one residue.
Seq A  a1 a2 a3 ... ai-1 ai  ai+1
Seq B  b1 b2 b3 ... bj-1 bj  bj+1
Score = Score(i,j) + Subst(ai+1, bj+1)
Seq A  a1 a2 a3 ... ai-1 ai  ai+1
Seq B  b1 b2 b3 ... bj-1 bj  -
Score = Score(i,j) - gap
Seq A  a1 a2 a3 ... ai-1 ai  -
Seq B  b1 b2 b3 ... bj-1 bj  bj+1
Score = Score(i,j) - gap
march 2003 Page 33
Exact algorithms
Simple example (Needleman-Wunsch)
• Scoring system:
  Match score: 2
  Mismatch score: -1
  Gap penalty: -2
Note
• We have to keep track of the origin of the score for each element in the matrix.
  This allows the alignment to be built by traceback when the matrix has been completely filled out.
• Computation time is proportional to the product of the sequence lengths (n x m).
Initialisation:
          G    A    T    T    A
     0   -2   -4   -6   -8  -10
G   -2
A   -4
A   -6
T   -8
T  -10
C  -12

Filling in:
          G    A    T    T    A
     0   -2   -4   -6   -8  -10
G   -2    2    0   -2   -4   -6
A   -4    0    4
A   -6
T   -8
T  -10
C  -12
Cell (A, A): max(2 + 2, 0 - 2, 0 - 2) = 4

Completed matrix:
          G    A    T    T    A
     0   -2   -4   -6   -8  -10
G   -2    2    0   -2   -4   -6
A   -4    0    4    2    0   -2
A   -6   -2    2    3    1    2
T   -8   -4    0    4    5    3
T  -10   -6   -2    2    6    4
C  -12   -8   -4    0    4    5
\[ F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases} \]
F(i,j): score at position i, j
s(xi,yj): match or mismatch score (or substitution matrix value) for residues xi and yj
d: gap penalty (positive value)
GA-TTA
|| ||
GAATTC
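The worked example can be reproduced with a compact implementation. This is a sketch under the slide's scoring system (match 2, mismatch -1, d = 2), with GAATTC on the rows and GATTA on the columns; the function name and the tie-breaking order in the traceback (diagonal first, then up, then left) are choices made here, not taken from the slides.

```python
def needleman_wunsch(seq_x: str, seq_y: str, match: int = 2,
                     mismatch: int = -1, d: int = 2):
    """Global alignment: return (matrix F, aligned seq_x, aligned seq_y)."""
    n, m = len(seq_x), len(seq_y)
    # F[i][j]: best score for aligning seq_x[:i] with seq_y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq_x[i - 1] == seq_y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)

    # Traceback from the bottom-right corner to the origin.
    out_x, out_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        diagonal = False
        if i > 0 and j > 0:
            s = match if seq_x[i - 1] == seq_y[j - 1] else mismatch
            diagonal = F[i][j] == F[i - 1][j - 1] + s
        if diagonal:
            out_x.append(seq_x[i - 1])
            out_y.append(seq_y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            out_x.append(seq_x[i - 1])
            out_y.append("-")
            i -= 1
        else:
            out_x.append("-")
            out_y.append(seq_y[j - 1])
            j -= 1
    return F, "".join(reversed(out_x)), "".join(reversed(out_y))


F, row_x, row_y = needleman_wunsch("GAATTC", "GATTA")
print(F[-1][-1])
print(row_y)
print(row_x)
```

Because the two A residues in GAATTC are interchangeable, the gap can sit opposite either of them. The tie-breaking used here places it one column earlier than on the slide; both alignments have the same optimal score of 5.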
march 2003 Page 34
Algorithms for pairwise alignments
Web resources
• LALIGN - pairwise sequence alignment: www.ch.embnet.org/software/LALIGN_form.html
• PRSS - alignment score evaluation: www.ch.embnet.org/software/PRSS_form.html
Concluding remarks
• Substitution matrices and gap penalties introduce biological information into the alignment algorithms.
• The fact that two sequences can be aligned does not mean that they share a common biological history. The relevance of the alignment must be assessed with a statistical score.
• There are many ways to align two sequences.
  Do not blindly trust your alignment to be the only truth. Gapped regions especially may be quite variable.
• Sequences sharing less than 20% similarity are difficult to align:
  You enter the Twilight Zone (Doolittle, 1986)
  Alignments may appear plausible to the eye but are no longer statistically significant.
  Other methods are needed to explore these sequences (e.g. profiles)