Sara C. Madeira
Universidade da Beira Interior
(Thanks to Ana Teresa Freitas, IST for useful resources on this subject)
Bioinformática
Sara C. Madeira 1
Sequence Alignment: Pairwise Sequence Alignment
16/03/2009 & 23/03/2009 27/04/2009
Outline
• Introduction
• What is pairwise sequence alignment?
• Scoring Sequence Alignments
• Pairwise Sequence Alignment Algorithms
  • Needleman-Wunsch Algorithm (global alignment)
  • Smith-Waterman Algorithm (local alignment)
  • Extension to repeated matches (multiple local alignments)
• Heuristic Pairwise Sequence Alignment Algorithms
  • BLAST
  • FASTA
Introduction
• Advances in molecular biology allow increasingly rapid sequencing of genomes, which has led to exponential growth in GenBank.
• François Jacob (1977) [Evolution and tinkering, Science 196:1161-1166]
“Nature is a tinkerer and not an inventor”
• Eric Wieschaus (1995) [Associated Press, 9 October 1995]
“We didn't know it at the time, but we found out everything in life is so similar, that the same genes that work in flies are the ones that work in humans.”
• New sequences are adapted from pre-existing sequences rather
than invented de novo.
• Sequence similarity is an indicator of homology.
• Sequence similarity has several other uses:
  • Database queries
  • Comparative genomics
  • ...
What is Pairwise Sequence Alignment?
• Pairwise sequence alignment addresses the problem of deciding whether a pair of sequences is evolutionarily related.
• Two biological sequences are similar ⇔ Two strings are similar
• Sequences accumulate:
  • Insertions
  • Deletions
  • Substitutions
Distance Between DNA Sequences
• Hamming distance is not typically used to compare DNA or protein sequences: it is defined only for strings of equal length and cannot account for insertions or deletions.
• Levenshtein (edit) distance allows one to compare strings of different lengths.
Definition: The edit distance between two strings is defined as the minimum number of edit operations (insertions, deletions and substitutions) needed to transform the first string into the second. Matches are not counted.
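The definition of edit distance can be illustrated with the standard dynamic-programming recurrence. The following is a minimal sketch, not taken from the slides; the function name `edit_distance` is chosen here for illustration, and every edit operation is assumed to cost 1.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions needed to turn `a` into `b` (unit cost each)."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution_cost = 0 if ca == cb else 1  # matches are not counted
            curr.append(min(
                prev[j] + 1,                      # delete ca
                curr[j - 1] + 1,                  # insert cb
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))  # 3
    print(edit_distance("WEAGAWGHEE", "PAWHEAE"))
```

Only two rows of the table are kept at a time, which is enough to obtain the distance but not the sequence of edits themselves.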
String Alignment
• The concept of an alignment is crucial.
• Global Alignment
Definition: A (global) alignment of two strings S1 and S2 is obtained by first inserting chosen spaces (or dashes), either into or at the ends of S1 and S2, and then placing the two resulting strings one above the other so that every character or space (dash) in either string is opposite a unique character or a unique space (dash) in the other string.
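As a small illustration of this definition, the sketch below checks whether two dashed rows form a global alignment of two given strings. The helper name `is_global_alignment` is hypothetical, and rejecting columns where both rows hold a dash is a common convention assumed here rather than something the definition states.

```python
def is_global_alignment(row1: str, row2: str, s1: str, s2: str,
                        gap: str = "-") -> bool:
    """Check that (row1, row2) is a global alignment of s1 and s2."""
    # Both rows must have the same length so that every position has a partner.
    if len(row1) != len(row2):
        return False
    # Removing the inserted dashes must give back the original strings.
    if row1.replace(gap, "") != s1 or row2.replace(gap, "") != s2:
        return False
    # Assumed convention: a column never consists of two dashes.
    return all(not (a == gap and b == gap) for a, b in zip(row1, row2))


if __name__ == "__main__":
    s1, s2 = "WEAGAWGHEE", "PAWHEAE"
    print(is_global_alignment("WEAGAWGHE-E", "P-A--W-HEAE", s1, s2))  # True
    print(is_global_alignment("WEAGAWGHE-E", "--P-AW-HEAE", s1, s2))  # True
    print(is_global_alignment("WEAGAWGHEE", "PAWHEAE", s1, s2))       # False
```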
Gaps
• Gaps help create alignments that better conform to underlying
biological models.
• Mechanisms that make long insertions or deletions in DNA include:
  • unequal crossing-over in meiosis; DNA slippage during replication; insertion of transposable elements into a DNA string; insertion of DNA by retroviruses; etc.
Definition: A gap is any maximal, consecutive run of spaces (or
dashes) in a single string of a given alignment.
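To make the distinction between a space and a gap concrete, the sketch below lists the gaps (maximal runs of dashes) in one row of an alignment. The function name `find_gaps` is an illustrative choice, not part of the lecture material.

```python
import re


def find_gaps(aligned_row: str, gap_char: str = "-") -> list[tuple[int, int]]:
    """Return (start, length) for every maximal run of gap characters."""
    pattern = re.escape(gap_char) + "+"  # one or more consecutive dashes
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(pattern, aligned_row)]


if __name__ == "__main__":
    row = "P-A--W-HEAE"
    gaps = find_gaps(row)
    print(gaps)                                # [(1, 1), (3, 2), (6, 1)]
    print(len(gaps), "gaps,", row.count("-"), "spaces")
```

For this row there are 3 gaps but 4 spaces, because the run `--` counts as a single gap.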
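Example
S1 = WEAGAWGHEE
S2 = PAWHEAE
• More than one possible alignment!
• Which one is better?
• Is it a true or a spurious alignment?
Alignment 1:
WEAGAWGHE-E
P-A--W-HEAE
Alignment 2:
WEAGAWGHE-E
--P-AW-HEAE
Each column is a match (e.g. A over A), a mismatch (e.g. W over P) or part of a gap (a run of dashes).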
How to Score an Alignment?
• Find the best alignment between two strings under some scoring scheme.
• Use a scoring model that quantifies evolutionary preferences.
  • Substitution matrices (matches and mismatches): sets of values quantifying the likelihood of one residue being substituted by another in an alignment.
  • Gap penalty: the cost of initiating a gap.
  • Gap extension penalty: the cost of extending a gap.
The Scoring Model
• The total score is a sum of terms for each aligned pair of residues, plus terms for each gap.
• Identities and conservative substitutions are expected to occur more often in real alignments than by chance, so they contribute positive score terms.
• Non-conservative changes are expected to occur less often in real alignments than by chance, so they contribute negative score terms.
• The score assigned to an alignment is computed using this function:
$$S = \sum_i s(s_1(i), s_2(i)) + G(g)$$
where
s(s1(i),s2(i)) is the score for each aligned pair of residues (given by a scoring matrix)
and
G(g) are the gap penalties (given a priori).
• Scores s(.,.) and gap penalties G(g) can be computed using different models (scoring matrices, probabilistic models, ...)!
Example
• Gap penalty: -8
• Gap extension penalty: -8
Alignment scores:
       W    H    G    E    A
W     15   -3   -3   -3   -3
P     -4   -2   -2   -1   -1
H     -3   10   -2    0   -2
E     -3    0   -3    6   -1
A     -3   -2    0   -1    5
WEAGAWGHE-E
--P-AW-HEAE
(-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1
Example
Exercise: What is the score of the following alignment, using the same alignment scores as in the previous example?
• Gap penalty: -8
• Gap extension penalty: -8
WEAGAWGHE-E
P-A--W-HEAE
Solution:
(-4) + (-8) + 5 + (-8) + (-8) + 15 + (-8) + 10 + 6 + (-8) + 6 = -2
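The two worked scores can be checked with a short script. This is a minimal sketch under the slides' assumptions: the 5x5 table of alignment scores is copied from the example (rows W, P, H, E, A for the lower sequence; columns W, H, G, E, A for the upper sequence), and since gap opening and gap extension both cost -8, every dash is charged -8. The names `SCORES`, `GAP` and `score_alignment` are illustrative.

```python
# Alignment scores from the example: SCORES[lower_residue][upper_residue].
SCORES = {
    "W": {"W": 15, "H": -3, "G": -3, "E": -3, "A": -3},
    "P": {"W": -4, "H": -2, "G": -2, "E": -1, "A": -1},
    "H": {"W": -3, "H": 10, "G": -2, "E": 0, "A": -2},
    "E": {"W": -3, "H": 0, "G": -3, "E": 6, "A": -1},
    "A": {"W": -3, "H": -2, "G": 0, "E": -1, "A": 5},
}
GAP = -8  # gap penalty and gap extension penalty are both -8 in the example


def score_alignment(upper: str, lower: str) -> int:
    """Sum the substitution scores and gap penalties column by column."""
    if len(upper) != len(lower):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for u, l in zip(upper, lower):
        if u == "-" or l == "-":
            total += GAP           # a dash in either row costs one gap term
        else:
            total += SCORES[l][u]  # aligned pair of residues
    return total


if __name__ == "__main__":
    print(score_alignment("WEAGAWGHE-E", "--P-AW-HEAE"))  # 1
    print(score_alignment("WEAGAWGHE-E", "P-A--W-HEAE"))  # -2
```

With different opening and extension penalties, the loop would have to track whether the previous column was already part of the same gap.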
Scoring Matrices
• Families of matrices listing the likelihood of one residue changing into another during evolution.
• Amino acid substitution matrices
  • PAM (Point Accepted Mutation)
  • BLOSUM (BLOcks of Amino Acid SUbstitution Matrix)
• DNA substitution matrices
  • DNA sequences are less conserved than protein sequences.
  • It is less effective to compare coding regions at the nucleotide level.
DNA Substitution Matrices
• Scoring matrices for nucleotide sequences are relatively simple.
• A positive value or a high score is given for a match and a negative value/low positive score is given for a mismatch.
• This assignment is based on the assumption that the frequencies of mutation are equal for all bases.
• However, this assumption may not be realistic!
• Observations show that transitions (substitutions between two purines, A<->G, or between two pyrimidines, C<->T) occur more frequently than transversions (substitutions between a purine and a pyrimidine, e.g. A<->C or T<->G).
• Therefore, a more sophisticated statistical model with different probability values for the two types of mutation is needed!
• Several nucleotide substitution models (Example: Kimura model)
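A nucleotide scoring function that treats the two mutation types differently can be sketched as follows. It uses the standard definitions (a transition stays within the purines A, G or within the pyrimidines C, T; a transversion crosses between the two classes). The numeric scores are arbitrary placeholders chosen for illustration, not values from the Kimura model or from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x: str, y: str) -> str:
    """Classify a pair of bases as 'match', 'transition' or 'transversion'."""
    if x == y:
        return "match"
    pair = {x, y}
    # Both bases in the same chemical class -> transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def dna_score(x: str, y: str, match: int = 1,
              transition: int = -1, transversion: int = -2) -> int:
    """Score a pair of bases; transversions are penalised more heavily
    than transitions (placeholder values, assumed for illustration)."""
    return {"match": match,
            "transition": transition,
            "transversion": transversion}[substitution_type(x, y)]


if __name__ == "__main__":
    for a, b in [("A", "A"), ("A", "G"), ("C", "T"), ("A", "C"), ("T", "G")]:
        print(a, b, substitution_type(a, b), dna_score(a, b))
```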
Amino acid substitution matrices
PAM Matrices (Dayhoff, 1978)
• Encode and summarize expected evolutionary change at the
amino acid level.
• Each matrix is designed to be used to compare pairs of
sequences that are a specific number of PAM units diverged.
• 1 PAM unit corresponds to an average of 1 accepted point mutation per 100 residues.
• After 100 PAMs of evolution, not every residue will have changed:
  • Some residues may have mutated several times.
  • Some residues may have returned to their original state.
  • Some residues may not have changed at all.
• PAM matrices were derived by first constructing hypothetical phylogenetic trees relating the sequences in 71 families, where each pair of sequences differed by no more than 15% of their residues.
• For each amino acid pair, Ai and Aj, count the number of times that Ai aligns opposite Aj, and divide that number by the total number of pairs in all the aligned data.
PAM Matrices
• Let F(i,j) denote the resulting frequency.
• Let F(i) and F(j) be the frequencies with which amino acids Ai and Aj appear in the sequences.
• The (i,j) entry for the ideal PAMn matrix is:
$$\log \frac{F(i,j)}{F(i)\,F(j)}$$
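The log-odds entry log(F(i,j) / (F(i) F(j))) can be computed from aligned data as in the sketch below. It is an illustration only: the toy alignment is invented, pairs are counted in both orders so that the result is symmetric, the natural logarithm is used, and no scaling or rounding is applied (published matrices are scaled and rounded to integers). The function name `log_odds` is hypothetical.

```python
import math
from collections import Counter


def log_odds(aligned_pairs):
    """Return {(a, b): log(F(a,b) / (F(a) * F(b)))} for observed pairs.

    aligned_pairs: iterable of (row1, row2) strings of equal length;
    columns containing a dash are skipped.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    for row1, row2 in aligned_pairs:
        for a, b in zip(row1, row2):
            if a == "-" or b == "-":
                continue
            # Count both orders so that F(a, b) == F(b, a).
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
            residue_counts[a] += 1
            residue_counts[b] += 1
    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    scores = {}
    for (a, b), n in pair_counts.items():
        f_ab = n / total_pairs
        f_a = residue_counts[a] / total_residues
        f_b = residue_counts[b] / total_residues
        scores[(a, b)] = math.log(f_ab / (f_a * f_b))
    return scores


if __name__ == "__main__":
    toy = [("AACC", "AACA")]  # invented data, for illustration only
    for pair, value in sorted(log_odds(toy).items()):
        print(pair, round(value, 3))
```

In the toy data the identical pairs come out positive and the A/C pair negative, matching the idea that pairs seen more often than chance get positive scores.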
Amino acid substitution matrices
Observed difference (%)    Evolutionary distance (PAM)
80                         250
70                         159
60                         120
50                         80
40                         56
30                         38
20                         23
10                         11
1                          1
Most widely used PAM matrix: PAM250
Amino acid substitution matrices
BLOSUM Matrices (Henikoff, 1992)
• Substitution matrices derived using probabilistic models.
• Matrices derived from a much larger dataset: the protein families BLOCKS database.
• Sequences are clustered whenever their percentage of identical residues exceeds some level L%; the resulting matrix is named after that level.
• BLOSUM50 (L = 50) and BLOSUM62 (L = 62) are widely used.
• BLOSUM observes significantly more replacements than PAM, even for infrequent pairs.
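The clustering step can be sketched as follows for ungapped, equal-length block segments. This is an illustration under stated assumptions rather than the BLOSUM procedure as published: a sequence joins a cluster when its identity to any member reaches the level L (single linkage, using >= as the comparison), and the names `percent_identity` and `cluster_by_identity` as well as the toy sequences are invented here.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length segments agree."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_by_identity(seqs, level: float):
    """Group sequence indices; a sequence joins (and merges) every cluster
    containing a member it matches at >= `level` percent identity."""
    clusters = []
    for i, seq in enumerate(seqs):
        linked, rest = [], []
        for c in clusters:
            hit = any(percent_identity(seq, seqs[j]) >= level for j in c)
            (linked if hit else rest).append(c)
        # Merge the new sequence with all clusters it is linked to.
        merged = sorted([i] + [j for c in linked for j in c])
        clusters = rest + [merged]
    return clusters


if __name__ == "__main__":
    toy = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAAACC", "CCCCCCCCCC"]
    print(cluster_by_identity(toy, 90))  # [[0, 1, 2], [3]]
    print(cluster_by_identity(toy, 95))  # [[0], [1], [2], [3]]
```

At level 90 the first and third toy sequences share only 80% identity but still end up together, because each is 90% identical to the second.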
BLOSUM50
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    5  -2  -1  -2  -1  -1  -1   0  -2  -1  -2  -1  -1  -3  -1   1   0  -3  -2   0
R   -2   7  -1  -2  -4   1   0  -3   0  -4  -3   3  -2  -3  -3  -1  -1  -3  -1  -3
N   -1  -1   7   2  -2   0   0   0   1  -3  -4   0  -2  -4  -2   1   0  -4  -2  -3
D   -2  -2   2   8  -4   0   2  -1  -1  -4  -4  -1  -4  -5  -1   0  -1  -5  -3  -4
C   -1  -4  -2  -4  13  -3  -3  -3  -3  -2  -2  -3  -2  -2  -4  -1  -1  -5  -3  -1
Q   -1   1   0   0  -3   7   2  -2   1  -3  -2   2   0  -4  -1   0  -1  -1  -1  -3
E   -1   0   0   2  -3   2   6  -3   0  -4  -3   1  -2  -3  -1  -1  -1  -3  -2  -3
G    0  -3   0  -1  -3  -2  -3   8  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
H   -2   0   1  -1  -3   1   0  -2  10  -4  -3   0  -1  -1  -2  -1  -2  -3   2  -4
I   -1  -4  -3  -4  -2  -3  -4  -4  -4   5   2  -3   2   0  -3  -3  -1  -3  -1   4
L   -2  -3  -4  -4  -2  -2  -3  -4  -3   2   5  -3   3   1  -4  -3  -1  -2  -1   1
K   -1   3   0  -1  -3   2   1  -2   0  -3  -3   6  -2  -4  -1   0  -1  -3  -2  -3
M   -1  -2  -2  -4  -2   0  -2  -3  -1   2   3  -2   7   0  -3  -2  -1  -1   0   1
F   -3  -3  -4  -5  -2  -4  -3  -4  -1   0   1  -4   0   8  -4  -3  -2   1   4  -1
P   -1  -3  -2  -1  -4  -1  -1  -2  -2  -3  -4  -1  -3  -4  10  -1  -1  -4  -3  -3
S    1  -1   1   0  -1   0  -1   0  -1  -3  -3   0  -2  -3  -1   5   2  -4  -2  -2
T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   2   5  -3  -2   0
W   -3  -3  -4  -5  -5  -1  -3  -3  -3  -3  -2  -3  -1   1  -4  -4  -3  15   2  -3
Y   -2  -1  -2  -3  -3  -1  -2  -3   2  -1  -1  -2   0   4  -3  -2  -2   2   8  -1
V    0  -3  -3  -4  -1  -3  -3  -4  -4   4   1  -3   1  -1  -3  -2   0  -3  -1   5
Amino acid substitution matrices
PAM Matrices vs BLOSUM Matrices
• The PAM model is designed to track the evolutionary origin of proteins.
• The BLOSUM model is designed to find conserved domains of proteins.
Rules of Thumb
• Lower PAMs and higher BLOSUMs find short local alignments of highly similar sequences.
• Higher PAMs and lower BLOSUMs find longer, weaker local alignments.