CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka
Lecture 2:
Pairwise Sequence Alignment
Eric C. Rouchka, D.Sc.
http://kbrin.kwing.louisville.edu/~rouchka/CECS694/
Sequence Alignment
• Question: Are two sequences related?
• Compare the two sequences, see if they are similar
• Example: pear and tear
• Similar words, different meanings
Biological Sequences
• Similar biological sequences tend to be related
• Information:
– Functional
– Structural
– Evolutionary
Biological sequences
• Common mistake: sequence similarity is not homology!
• Homologous sequences: derived from a common ancestor
Relation of sequences
• Homologs: similar sequences derived from a common ancestor sequence.
• Orthologs: similar sequences in 2 different organisms that have arisen due to a speciation event. Functionality is usually retained.
Relation of sequences
• Paralogs: Similar sequences within a single organism that have arisen due to a gene duplication event.
• Xenologs: similar sequences that have arisen out of horizontal transfer events (symbiosis, viruses, etc.)
Relation of sequences
Image Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html
Edit Distance
• Sequence similarity: function of edit distance between two sequences
P E A R
  | | |
T E A R
Hamming Distance
• Number of positions at which two words of equal length differ
• Calculated by summing the number of mismatches
• Hamming Distance between PEAR and TEAR is 1
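A minimal sketch of the calculation described on this slide. The function name `hamming_distance` is chosen here for illustration and is not from the lecture.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("PEAR", "TEAR"))  # 1
```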
Gapped Alignments
• Biological sequences
– Different lengths
– Regions of insertions and deletions
• Notion of gaps (denoted by '-')
A L I G N M E N T
  | | |   | | | |
- L I G A M E N T
Possible Residue Alignments
• Match
• Mismatch (substitution or mutation)
• Insertion/Deletion (INDELS – gaps)
Alignments
• Which alignment is best?
Alignment 1:
A - C - G G - A C T
|   |   |       | |
A T C G G A T - C T
Alignment 2:
A T C G G A T C T
|   | | |     | |
A - C G G - A C T
Alignment Scoring Scheme
• Possible scoring scheme:
match: +2
mismatch: -1
indel: -2
• Alignment 1: 5 * 2 – 1(1) – 4(2) = 10 – 1 – 8 = 1
• Alignment 2: 6 * 2 – 1(1) – 2 (2) = 12 – 1 – 4 = 7
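The two totals above can be checked with a short sketch. It assumes the alignments are written as two equal-length strings with '-' marking a gap; the name `score_alignment` is illustrative only.

```python
MATCH, MISMATCH, INDEL = 2, -1, -2  # scoring scheme from the slide


def score_alignment(top: str, bottom: str) -> int:
    """Sum the column scores of a gapped pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += INDEL      # insertion/deletion column
        elif x == y:
            total += MATCH
        else:
            total += MISMATCH
    return total


alignment_1 = ("A-C-GG-ACT", "ATCGGAT-CT")
alignment_2 = ("ATCGGATCT", "A-CGG-ACT")
print(score_alignment(*alignment_1))  # 1
print(score_alignment(*alignment_2))  # 7
```

Under this scheme Alignment 2 is preferred because it reaches the same matches with fewer indels.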
Alignment Methods
• Visual
• Brute Force
• Dynamic Programming
• Word-Based (k tuple)
Visual Alignments (Dot Plots)
• Matrix
– Rows: characters in one sequence
– Columns: characters in second sequence
• Filling
– Loop through each row; if the characters in row, col match, fill in the cell
– Continue until all cells have been examined
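The filling procedure can be sketched as follows. This is an illustrative text rendering, not one of the dot plot programs named in the lecture; `dot_plot` and `render` are hypothetical helper names.

```python
def dot_plot(rows_seq: str, cols_seq: str) -> list[list[bool]]:
    """Cell (i, j) is True when rows_seq[i] equals cols_seq[j]."""
    return [[r == c for c in cols_seq] for r in rows_seq]


def render(matrix: list[list[bool]], rows_seq: str, cols_seq: str) -> str:
    """Draw the matrix as text: '*' for a filled cell, '.' otherwise."""
    lines = ["  " + " ".join(cols_seq)]
    for ch, row in zip(rows_seq, matrix):
        lines.append(ch + " " + " ".join("*" if cell else "." for cell in row))
    return "\n".join(lines)


plot = dot_plot("PEAR", "TEAR")
print(render(plot, "PEAR", "TEAR"))
```

For PEAR against TEAR the filled cells form a short diagonal for the shared E, A, R.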
Example Dot Plot
Noise in Dot Plots
• Nucleic Acids (DNA, RNA)
– 1 out of 4 bases matches at random
• Stringency
– Window size is considered
– Percentage of bases matching in the window is set as threshold
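One way to apply a window and stringency is sketched below, assuming the window runs along the diagonal starting at each cell and the threshold is given as a minimum match count. Real programs differ in how they anchor the window; `windowed_dot_plot` is a hypothetical name.

```python
def windowed_dot_plot(a: str, b: str, window: int, min_matches: int) -> set:
    """Keep a dot at (i, j) only if the diagonal window starting there
    contains at least min_matches identical positions."""
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


seq = "ACCTGAGCTCACCTGAGTTA"
raw = windowed_dot_plot(seq, seq, window=1, min_matches=1)       # every single-base match
filtered = windowed_dot_plot(seq, seq, window=4, min_matches=3)  # 3 of 4 must match
print(len(raw), len(filtered))
```

With the stricter setting the main diagonal and the repeated ACCTGAG segment survive, while isolated single-base matches such as (0, 15) are dropped.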
Reduction of Dot Plot Noise
Self alignment of ACCTGAGCTCACCTGAGTTA
Information Inside Dot Plots
• Regions of similarity: diagonals
• Insertions/deletions
– Can determine intron/exon structure
• Repeats and inverted repeats
– Inverted repeats = reverse complement
– Used to determine folding of RNA molecules
Insertions/Deletions
Repeats/Inverted Repeats
Comparing Genome Assemblies
Chromosome Y self comparison
Available Dot Plot Programs
• Vector NTI software package (under AlignX)
• GCG software package:
– Compare http://www.hku.hk/bruhk/gcgdoc/compare.html
– DotPlot+ http://www.hku.hk/bruhk/gcgdoc/dotplot.html
• Dotlet (Java applet) http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
• http://bioweb.pasteur.fr/cgi-bin/seqanal/dottup.pl
• Dotter (http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html)
• EMBOSS DotMatcher, DotPath, DotUp
• DNA Strider
• PipMaker
Dot Plot References
Gibbs, A. J. & McIntyre, G. A. (1970). The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16, 1-11.
Staden, R. (1982). An interactive graphics program for comparing and aligning nucleic acid and amino acid sequences. Nucleic Acids Res. 10 (9), 2951-2961.
Determining Optimal Alignment
• Two sequences: X and Y
– |X| = m; |Y| = n
– Allowing gaps, |X| = |Y| = m+n
• Brute Force
• Dynamic Programming
Brute Force
• Determine all possible subsequences for X and Y
– $2^{m+n}$ subsequences for X, $2^{m+n}$ for Y
• Alignment comparisons
– $2^{m+n} \cdot 2^{m+n} = 2^{2(m+n)} = 4^{m+n}$ comparisons
• Quickly becomes impractical
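To give a feel for the growth, the sketch below evaluates the brute-force count $4^{m+n}$ from this slide next to the $(M+1)(N+1)$ cells of the dynamic programming matrix used later in the lecture. The function name is illustrative.

```python
def brute_force_comparisons(m: int, n: int) -> int:
    """Number of comparisons in the brute-force estimate, 4**(m + n)."""
    return 4 ** (m + n)


for m, n in [(5, 5), (11, 7), (100, 100)]:
    cells = (m + 1) * (n + 1)  # size of the DP matrix for the same input
    print(f"m={m}, n={n}: brute force {brute_force_comparisons(m, n):.3e}, DP cells {cells}")
```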
Dynamic Programming
• Used in Computer Science
• Solve optimization problems by dividing the problem into overlapping subproblems whose solutions are stored and reused
Dynamic Programming
• Sequence alignment has optimal substructure property
– Subproblem: alignment of prefixes of two sequences
– Each subproblem is computed once and stored in a matrix
Dynamic Programming
• Optimal score: built upon optimal alignment computed to that point
• Aligns two sequences beginning at ends, attempting to align all possible pairs of characters
Dynamic Programming
• Scoring scheme for matches, mismatches, gaps
• Highest set of scores defines optimal alignment between sequences
• Match score: DNA – exact match; Amino Acids – mutation probabilities
Dynamic Programming
• Guaranteed to provide optimal alignment given:
– Two sequences
– Scoring scheme
Steps in Dynamic Programming
• Initialization
• Matrix Fill (scoring)
• Traceback (alignment)
DP Example
Sequence #1: GAATTCAGTTA; M = 11
Sequence #2: GGATCGA; N = 7
$s(a_i, b_j) = +5$ if $a_i = b_j$ (match score)
$s(a_i, b_j) = -3$ if $a_i \neq b_j$ (mismatch score)
$w = -4$ (gap penalty)
View of the DP Matrix
• M+1 rows, N+1 columns
Global Alignment (Needleman-Wunsch)
• Attempts to align all residues of two sequences
• INITIALIZATION: first row and first column set
• $S_{i,0} = w \cdot i$
• $S_{0,j} = w \cdot j$
Initialized Matrix (Needleman-Wunsch)
Matrix Fill (Global Alignment)
$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s(a_i, b_j) & \text{(match/mismatch in the diagonal)} \\ S_{i,j-1} + w & \text{(gap in sequence \#1)} \\ S_{i-1,j} + w & \text{(gap in sequence \#2)} \end{cases}$$
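A minimal sketch of initialization and matrix fill with the example scores (+5, -3, -4). It assumes rows index sequence #1 and columns index sequence #2; the slide figures may lay the matrix out the other way, which does not change the final score. `nw_score_matrix` is an illustrative name.

```python
def nw_score_matrix(a: str, b: str, match: int = 5, mismatch: int = -3, gap: int = -4):
    """Fill the global-alignment score matrix S, with S[i][j] scoring a[:i] vs b[:j]."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):      # first column: i leading gaps
        S[i][0] = gap * i
    for j in range(1, n + 1):      # first row: j leading gaps
        S[0][j] = gap * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s,   # diagonal
                          S[i][j - 1] + gap,     # gap in sequence #1
                          S[i - 1][j] + gap)     # gap in sequence #2
    return S


S = nw_score_matrix("GAATTCAGTTA", "GGATCGA")
print(S[-1][-1])  # 11
```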
Matrix Fill (Global Alignment)
• $S_{1,1} = \max[S_{0,0} + 5,\ S_{1,0} - 4,\ S_{0,1} - 4] = \max[5, -8, -8] = 5$
Matrix Fill (Global Alignment)
• $S_{1,2} = \max[S_{0,1} - 3,\ S_{1,1} - 4,\ S_{0,2} - 4] = \max[-4 - 3,\ 5 - 4,\ -8 - 4] = \max[-7, 1, -12] = 1$
Matrix Fill (Global Alignment)
Filled Matrix (Global Alignment)
Trace Back (Global Alignment)
• Maximum global alignment score = 11 (value in the lower right-hand cell).
• Traceback begins in position $S_{M,N}$, i.e. the position where both sequences are globally aligned.
• At each cell, we look to see where we move next according to the pointers.
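The traceback can be sketched without storing explicit pointers by re-checking which neighbouring cell produced each value. This is one possible implementation under the example scoring scheme, with rows indexing sequence #1; when several moves tie it prefers the diagonal, so it returns one optimal alignment out of possibly many.

```python
def needleman_wunsch(a: str, b: str, match: int = 5, mismatch: int = -3, gap: int = -4):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = gap * i
    for j in range(1, n + 1):
        S[0][j] = gap * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s, S[i][j - 1] + gap, S[i - 1][j] + gap)

    # Traceback from the lower right-hand cell back to S[0][0].
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        diagonal = False
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = S[i][j] == S[i - 1][j - 1] + s
        if diagonal:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1])       # residue of a against a gap
            bottom.append("-")
            i -= 1
        else:
            top.append("-")            # residue of b against a gap
            bottom.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, aligned_a, aligned_b = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
print(score)
print(aligned_a)
print(aligned_b)
```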
Trace Back (Global Alignment)
Global Trace Back
G A A T T C A G T T A
|   |   | |   |     |
G G A - T C - G - - A
Checking Alignment Score
G A A T T C A G T T A
|   |   | |   |     |
G G A - T C - G - - A
+ - + - + + - + - - +
5 3 5 4 5 5 4 5 4 4 5
5 - 3 + 5 - 4 + 5 + 5 - 4 + 5 - 4 - 4 + 5 = 11
Local Alignment
• Smith-Waterman: obtain highest scoring local match between two sequences
• Requires 2 modifications:
– Negative scores for mismatches
– When a value in the score matrix becomes negative, reset it to zero (beginning of a new alignment)
Local Alignment Initialization
• Values in row 0 and column 0 set to 0.
Matrix Fill (Local Alignment)
$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s(a_i, b_j) & \text{(match/mismatch in the diagonal)} \\ S_{i,j-1} + w & \text{(gap in sequence \#1)} \\ S_{i-1,j} + w & \text{(gap in sequence \#2)} \\ 0 \end{cases}$$
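A minimal sketch of the local fill with the same example scores. Row 0 and column 0 stay at zero and every cell is floored at zero. As before, rows are assumed to index sequence #1, and `sw_score_matrix` is an illustrative name.

```python
def sw_score_matrix(a: str, b: str, match: int = 5, mismatch: int = -3, gap: int = -4):
    """Fill the local-alignment score matrix; no cell goes below zero."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s,
                          S[i][j - 1] + gap,
                          S[i - 1][j] + gap,
                          0)
    return S


S_local = sw_score_matrix("GAATTCAGTTA", "GGATCGA")
best = max(max(row) for row in S_local)
best_cells = [(i, j) for i, row in enumerate(S_local) for j, v in enumerate(row) if v == best]
print(best, best_cells)
```

For the example sequences the highest value is 14 and it occurs in two cells.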
Matrix Fill (Local Alignment)
• $S_{1,1} = \max[S_{0,0} + 5,\ S_{1,0} - 4,\ S_{0,1} - 4,\ 0] = \max[5, -4, -4, 0] = 5$
Matrix Fill (Local Alignment)
• $S_{1,2} = \max[S_{0,1} - 3,\ S_{1,1} - 4,\ S_{0,2} - 4,\ 0] = \max[0 - 3,\ 5 - 4,\ 0 - 4,\ 0] = \max[-3, 1, -4, 0] = 1$
Matrix Fill (Local Alignment)
• $S_{1,3} = \max[S_{0,2} - 3,\ S_{1,2} - 4,\ S_{0,3} - 4,\ 0] = \max[0 - 3,\ 1 - 4,\ 0 - 4,\ 0] = \max[-3, -3, -4, 0] = 0$
Filled Matrix (Local Alignment)
Trace Back (Local Alignment)
• The maximum local alignment score for the two sequences is 14
• It is found by locating the highest value in the score matrix
• 14 is found in two separate cells, indicating multiple alignments producing the maximal alignment score
Trace Back (Local Alignment)
• Traceback begins in the position with the highest value.
• At each cell, we look to see where we move next according to the pointers
• When a cell is reached where there is not a pointer to a previous cell, we have reached the beginning of the alignment
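A sketch of the local traceback, again recomputing the moves instead of storing pointers. It starts at the first cell (in row-major order) that holds the maximum and stops when it reaches a zero. Tie-breaking prefers the diagonal, then a gap in sequence #2, so it reports only one of the maximal alignments. Rows are assumed to index sequence #1.

```python
def smith_waterman(a: str, b: str, match: int = 5, mismatch: int = -3, gap: int = -4):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s, S[i][j - 1] + gap, S[i - 1][j] + gap, 0)
            if S[i][j] > best:               # keep the first cell holding the maximum
                best, best_pos = S[i][j], (i, j)

    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:   # a zero marks the start of the alignment
        s = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + s:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


local_score, local_a, local_b = smith_waterman("GAATTCAGTTA", "GGATCGA")
print(local_score)
print(local_a)
print(local_b)
```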
Maximum Local Alignment
G A A T T C - A
|   | |   |   |
G G A T - C G A
+ - + + - + - +
5 3 5 5 4 5 4 5

G A A T T C - A
|   |   | |   |
G G A - T C G A
+ - + - + + - +
5 3 5 4 5 5 4 5
Scoring Matrices
• Match/mismatch score
– Not bad for similar sequences
– Does not show distantly related sequences
• Likelihood matrix
– Scores residues dependent upon the likelihood that the substitution is found in nature
– More applicable for amino acid sequences
Percent Accepted Mutation (PAM or Dayhoff) Matrices
• Studied by Margaret Dayhoff
• Amino acid substitutions
– Alignment of common protein sequences
– 1572 amino acid substitutions
– 71 groups of protein, 85% similar
• "Accepted" mutations: do not negatively affect a protein's fitness
Percent Accepted Mutation (PAM or Dayhoff) Matrices
• Similar sequences organized into phylogenetic trees
• Number of amino acid changes counted
• Relative mutabilities evaluated
• 20 x 20 amino acid substitution matrix calculated
Percent Accepted Mutation (PAM or Dayhoff) Matrices
• PAM 1: 1 accepted mutation event per 100 amino acids; PAM 250: 250 mutation events per 100 …
• PAM 1 matrix can be multiplied by itself N times to give transition matrices for sequences that have undergone N mutations
• PAM 250: 20% similar; PAM 120: 40%; PAM 80: 50%; PAM 60: 60%
PAM1 matrix
Normalized probabilities multiplied by 10000
   Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A 9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
R    1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
N    4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
D    6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
C    1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Q    3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
E   10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
G   21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
H    1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
I    2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
L    3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
K    2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
M    1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
F    1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
P   13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
S   28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
T   22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
W    0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Y    1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
V   13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901
Log Odds Matrices
• PAM matrices converted to log-odds matrix
– Calculate odds ratio for each substitution
  • Taking scores in previous matrix
  • Divide by frequency of amino acid
– Convert ratio to log10 and multiply by 10
– Take average of log odds ratio for converting A to B and converting B to A
– Result: symmetric matrix
– EXAMPLE: Mount pp. 80-81
PAM250 Log odds matrix
Blocks Amino Acid Substitution Matrices (BLOSUM)
• Larger set of sequences considered
• Sequences organized into signature blocks
• Consensus sequence formed
– 60% identical: BLOSUM 60
– 80% identical: BLOSUM 80
Nucleic Acid Scoring Matrices
• Two mutation models:
– Uniform mutation rates (Jukes-Cantor)
– Two separate mutation rates (Kimura)
  • Transitions
  • Transversions
DNA Mutations
Purines: A, G
Pyrimidines: C, T
Transitions: A↔G; C↔T
Transversions: A↔C, A↔T, C↔G, G↔T
PAM1 DNA odds matrices
A. Model of uniform mutation rates among nucleotides.
   A        G        T        C
A  0.99
G  0.00333  0.99
T  0.00333  0.00333  0.99
C  0.00333  0.00333  0.00333  0.99
B. Model of 3-fold higher transitions than transversions.
   A      G      T      C
A  0.99
G  0.006  0.99
T  0.002  0.002  0.99
C  0.002  0.002  0.006  0.99
PAM1 DNA log-odds matrices
A. Model of uniform mutation rates among nucleotides.
    A   G   T   C
A   2
G  -6   2
T  -6  -6   2
C  -6  -6  -6   2
B. Model of 3-fold higher transitions than transversions.
    A   G   T   C
A   2
G  -5   2
T  -7  -7   2
C  -7  -7  -5   2
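The integer entries above are consistent with taking log base 2 of each PAM1 DNA probability divided by a uniform background frequency of 0.25, then rounding. The slide does not state the base or the background, so both are assumptions in this sketch.

```python
import math

BACKGROUND = 0.25  # assumed uniform base frequency (not stated on the slide)


def log_odds_bits(probability: float) -> int:
    """Rounded log2 odds of a substitution probability against the background."""
    return round(math.log2(probability / BACKGROUND))


uniform = {"no change": 0.99, "any change": 0.00333}
biased = {"no change": 0.99, "transition": 0.006, "transversion": 0.002}

for name, model in [("uniform", uniform), ("3-fold transitions", biased)]:
    print(name, {event: log_odds_bits(p) for event, p in model.items()})
```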
Linear vs. Affine Gaps
• Gaps have been modeled as linear
• More likely: a contiguous block of residues inserted or deleted
– 1 gap of length k rather than k gaps of length 1
• Scoring scheme should penalize new gaps more
Affine Gap Penalty
• $w_x = g + r(x - 1)$
• $w_x$: total gap penalty; $g$: gap open penalty; $r$: gap extend penalty; $x$: gap length
• Gap penalty chosen relative to score matrix
– Gaps not excluded
– Gaps not over included
– Typical values: $g = -12$; $r = -4$
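A small sketch of the penalty formula using the typical values from the slide. It shows why one long gap costs less than several short ones; the function name is illustrative.

```python
def affine_gap_penalty(length: int, gap_open: int = -12, gap_extend: int = -4) -> int:
    """Total penalty w_x = g + r * (x - 1) for a single gap of the given length."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (length - 1)


one_gap_of_three = affine_gap_penalty(3)        # -12 + 2 * -4 = -20
three_gaps_of_one = 3 * affine_gap_penalty(1)   # 3 * -12 = -36
print(one_gap_of_three, three_gaps_of_one)
```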
Affine Gap Penalty and Dynamic Programming
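$$M_{i,j} = \max \begin{cases} D_{i-1,j-1} + \mathrm{subst}(A_i, B_j) \\ M_{i-1,j-1} + \mathrm{subst}(A_i, B_j) \\ I_{i-1,j-1} + \mathrm{subst}(A_i, B_j) \end{cases}$$
$$D_{i,j} = \max \begin{cases} D_{i,j-1} - \mathrm{extend} \\ M_{i,j-1} - \mathrm{open} \end{cases}$$
$$I_{i,j} = \max \begin{cases} M_{i-1,j} - \mathrm{open} \\ I_{i-1,j} - \mathrm{extend} \end{cases}$$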
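The three recurrences can be sketched as a score-only global alignment. Here `gap_open` and `gap_extend` are positive magnitudes that are subtracted, matching the "- open" and "- extend" terms above. The boundary initialization (a single leading gap along the first row or column) and the match/mismatch scores are assumptions of this sketch, since the slide does not give them.

```python
NEG_INF = float("-inf")


def affine_global_score(a: str, b: str, match: int = 5, mismatch: int = -3,
                        gap_open: int = 12, gap_extend: int = 4):
    """Best global alignment score under an affine gap penalty."""
    m, n = len(a), len(b)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends in an aligned pair
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends in a gap moving along j
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends in a gap moving along i
    M[0][0] = 0
    for j in range(1, n + 1):   # assumed boundary: one leading gap of length j
        D[0][j] = -gap_open - gap_extend * (j - 1)
    for i in range(1, m + 1):   # assumed boundary: one leading gap of length i
        I[i][0] = -gap_open - gap_extend * (i - 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(D[i - 1][j - 1], M[i - 1][j - 1], I[i - 1][j - 1]) + s
            D[i][j] = max(D[i][j - 1] - gap_extend, M[i][j - 1] - gap_open)
            I[i][j] = max(M[i - 1][j] - gap_open, I[i - 1][j] - gap_extend)
    return max(M[m][n], D[m][n], I[m][n])


# Two matches plus one gap of length 2: 10 - (12 + 4) = -6
print(affine_global_score("AAAA", "AA"))
```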
Drawbacks to DP Approaches
• Compute intensive
• Memory Intensive
• $O(n^2)$ space, between $O(n^2)$ and $O(n^3)$ time
Alternative DP approaches
• Linear space algorithms (Myers-Miller)
• Bounded Dynamic Programming
• Ewan Birney's Dynamite Package
– Automatic generation of DP code
Significance of Alignment
• Determine probability of alignment occurring at random
– Sequence 1: length m
– Sequence 2: length n
• Random sequences:
– Alignment scores follow the Gumbel Extreme Value Distribution
Gumbel Extreme Value Distribution
• http://roso.epfl.ch/mbi/papers/discretechoice/node11.html
Probability of Alignment Score
• Expected number of alignments with score at least S (E-value):
$E = K m n \, e^{-\lambda S}$
– m, n: lengths of sequences
– K, λ: statistical parameters dependent upon scoring system and background residue frequencies
Converting to Bit Scores
A raw score can be normalized to a bit score using the formula:
• The E-value corresponding to a given bit score can then be calculated as:
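The two formulas on this slide were images and did not survive in the transcript. The standard Karlin-Altschul forms are shown here as an assumption about what the slide contained; they are consistent with $E = K m n \, e^{-\lambda S}$, since $m n \, 2^{-S'} = m n \, K e^{-\lambda S}$.

```latex
S' = \frac{\lambda S - \ln K}{\ln 2}
\qquad\qquad
E = m n \, 2^{-S'}
```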
P-Value
• P-Value: probability of obtaining a given score at random
$P = 1 - e^{-E}$
which is approximately $E$ when $E$ is small
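A short numerical sketch of the E-value and P-value relations. The function names are illustrative, and `math.expm1` is used only to keep precision when E is tiny.

```python
import math


def e_value(score: float, m: int, n: int, K: float, lam: float) -> float:
    """Expected number of chance alignments scoring at least `score`: K*m*n*exp(-lam*S)."""
    return K * m * n * math.exp(-lam * score)


def p_value(e: float) -> float:
    """P = 1 - exp(-E), computed with expm1 for accuracy at small E."""
    return -math.expm1(-e)


for e in (0.001, 0.1, 5.0):
    print(f"E = {e}: P = {p_value(e):.6f}")
```

For small E the two numbers nearly coincide, while for large E the P-value saturates near 1.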
Significance of Ungapped Alignments
• PAM matrices are $10 \log_{10} x$
• Converting to $\log_2 x$ gives bits of information
• Converting to $\log_e x$ gives nats of information
Quick Calculation
• If bit scoring system is used, significance cutoff is:
$\log_2(mn)$
Example (p110)
• 2 Sequences, each 250 amino acids long
• Significance: $\log_2(250 \cdot 250) \approx 16$ bits
Example (p110)
• Using PAM250, the following alignment is found:
• F W L E V E G N S M T A P T G
• F W L D V Q G D S M T A P A G
Example (p110)
• Using PAM250 (p82), the score is calculated:
• F W L E V E G N S M T A P T G
• F W L D V Q G D S M T A P A G
• S = 9 + 17 + 6 + 3 + 4 + 2 + 5 + 2 + 2 + 6 + 3 + 2 + 6 + 1 + 5 = 73
Significance Example
• S is in $10 \log_{10} x$ units; convert to a bit score:
• $S = 10 \log_{10} x$
• $S/10 = \log_{10} x$
• $(S/10) \cdot \log_2 10 = \log_{10} x \cdot \log_2 10$
• $(S/10) \cdot \log_2 10 = \log_2 x$
• $S/3 \approx \log_2 x$ (since $\log_2 10 \approx 3.32$)
• $S' \approx S/3$
Significance Example
• S’ = 1/3S = 1/3 * 73 = 24.333 bits
• Significance cutoff = 16 bits
• Therefore, this alignment is significant
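The comparison can be reproduced with the exact conversion factor instead of the 1/3 approximation. The function names below are illustrative.

```python
import math


def pam_units_to_bits(score: float) -> float:
    """Convert a score in 10*log10 units to bits: (S / 10) * log2(10)."""
    return score / 10 * math.log2(10)


def significance_cutoff_bits(m: int, n: int) -> float:
    """Quick significance cutoff log2(m * n) for sequences of lengths m and n."""
    return math.log2(m * n)


bits = pam_units_to_bits(73)
cutoff = significance_cutoff_bits(250, 250)
print(f"{bits:.2f} bits vs cutoff {cutoff:.2f} bits, significant: {bits > cutoff}")
```

The exact factor gives about 24.25 bits, close to the 24.333 obtained with S/3, and the conclusion is unchanged.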
Estimation of E and P
• For PAM250, $K = 0.09$; $\lambda = 0.229$
• Using equations 30 and 31:
• $S' = \lambda S - \ln(Kmn) = 0.229 \cdot 73 - \ln(0.09 \cdot 250 \cdot 250)$
• $S' = 16.72 - 8.63 = 8.09$ bits
• $P(S' \geq 8.09) = 1 - e^{-e^{-8.09}} = 3.1 \times 10^{-4}$
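The arithmetic on this slide can be checked directly. The sketch uses the K and λ values given for PAM250 and the normalized score $S' = \lambda S - \ln(Kmn)$; variable names are illustrative.

```python
import math

K, LAM = 0.09, 0.229      # PAM250 parameters from the slide
S, m, n = 73, 250, 250    # raw score and sequence lengths from the example

s_prime = LAM * S - math.log(K * m * n)   # normalized score
p = -math.expm1(-math.exp(-s_prime))      # P = 1 - exp(-exp(-S'))

print(f"S' = {s_prime:.2f}")
print(f"P  = {p:.2e}")
```

Unrounded, S' comes out near 8.08 (the slide's 8.09 comes from rounded intermediate values) and P is about 3.1e-4.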
Significance of Gapped Alignments
• Gapped alignments use same statistics
• λ and K cannot be easily estimated
• Empirical estimations and gap scores determined by looking at random alignments
Bayesian Statistics
• Built upon conditional probabilities
• Used to derive joint probability of multiple events or conditions
Bayesian Statistics
• P(B|A): Probability of condition B given condition A is true
• P(B): Probability of condition B, regardless of condition A
• P(A, B): Joint probability of A and B occurring simultaneously
Bayesian Statistics
• A has two substates: A1, A2
• B has two substates: B1, B2
• P(B1) = 0.3 is known
• P(B2) = 1.0 – 0.3 = 0.7
• These are marginal probabilities
Joint Probabilities
• Bayes Theorem:
– P(A1,B1) = P(B1)P(A1|B1)
– P(A1,B1) = P(A1)P(B1|A1)
Bayesian Example
• Given:
– P(A1|B1) = 0.8; P(A2|B2) = 0.7
• Then
– P(A2|B1) = 1.0 - 0.8 = 0.2; P(A1|B2) = 1.0 - 0.7 = 0.3
• AND
– P(A1,B1) = P(B1)P(A1|B1) = 0.3 * 0.8 = 0.24
– P(A2,B2) = P(B2)P(A2|B2) = 0.7 * 0.7 = 0.49
– …
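The full joint table for this example can be worked out in a few lines. The sketch below also derives the marginals of A and the posterior P(B|A) through Bayes' theorem; those extra values follow from the numbers given but are not stated on the slides.

```python
# Marginals of B and conditionals P(A|B) from the example
p_b = {"B1": 0.3, "B2": 0.7}
p_a_given_b = {("A1", "B1"): 0.8, ("A2", "B1"): 0.2,
               ("A1", "B2"): 0.3, ("A2", "B2"): 0.7}

# Joint probabilities: P(A, B) = P(B) * P(A | B)
joint = {(a, b): p_b[b] * p for (a, b), p in p_a_given_b.items()}

# Marginals of A: sum the joint over B
p_a = {a: sum(p for (x, _), p in joint.items() if x == a) for a in ("A1", "A2")}

# Posterior: P(B | A) = P(A, B) / P(A)
posterior = {(b, a): joint[(a, b)] / p_a[a] for (a, b) in joint}

for key, value in joint.items():
    print("P" + str(key), "=", round(value, 2))
print("P(B1|A1) =", round(posterior[("B1", "A1")], 4))
```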
Posterior Probabilities
• Calculation of joint probabilities results in posterior probabilities
– Not known initially
– Calculated using
  • Prior probabilities
  • Initial information
Applications of Bayesian Statistics
• Evolutionary distance between two sequences (Mount, pp 122-124)
• Sequence Alignment (Mount, 124-134)
• Significance of Alignments (Durbin, 36-38)
• Gibbs Sampling (Covered later)
Pairwise Sequence Alignment Programs
• Blast 2 Sequences
– NCBI
– Word-based sequence alignment
• LALIGN
– FASTA package
– Multiple local alignments
• needle
– Global Needleman/Wunsch alignment
• water
– Local Smith/Waterman alignment
Various Sequence Alignments
Wise2 -- Genomic to protein
Sim4 -- Aligns expressed DNA to genomic sequence
spidey -- aligns mRNAs to genomic sequence
est2genome -- aligns ESTs to genomic sequence