Amino Acid Scoring Matrices
Jason Davis
Overview
  Protein synthesis/evolution
  Computational sequence alignment
    Smith-Waterman Algorithm
    BLAST
  Amino Acid Scoring Matrices
    PAM – Point Accepted Mutations
    BLOSUM – BLOck SUbstitution Matrix
    mPAM
  Metric Conversions
Proteins
  3-dimensional structures
  Composed of amino acids chained together
    Can be represented as a 1-dimensional (linear) sequence
    20 different amino acids exist
    Usually 100-1500 amino acids long
  Have many different shapes and functions
    Function depends on both 3D shape and amino acid sequence
Protein Synthesis
  DNA: strand composed of 4 different bases
    A, T, C, G
  20 amino acids: 3 bases (one codon) needed to encode each amino acid
    Degenerate coding: 64 possible codons for 20 amino acids
  Figure labels: Signalling, Transcription/Translation, Protein
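
The counting argument behind the 3-base code and its degeneracy can be checked in a few lines. This is only an illustrative sketch: the names `BASES`, `codons` and `LEUCINE_CODONS` are made up for the example, and the leucine entries are taken from the standard genetic code written as DNA coding-strand codons.

```python
from itertools import product

BASES = "ATCG"

# Every possible codon: an ordered triple of bases.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]

# Two bases give only 16 codes, too few for 20 amino acids;
# three bases give 64, which is more than enough.
pairs_available = len(BASES) ** 2
codons_available = len(codons)

# Degeneracy: several codons can encode the same amino acid.
# Leucine, for example, has six codons in the standard genetic code.
LEUCINE_CODONS = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]

print(pairs_available, codons_available, len(LEUCINE_CODONS))
```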
Protein Evolution
  Protein ‘families’
    Set of homologous proteins
    Same function, different composition
    Similar structure
  Identifying families
    Pairwise sequence alignment
    Multiple sequence alignment
      NP-hard
    Other approaches
      Structural, experimental
Pairwise Sequence Alignment
  Input
    2 sequences p, q of lengths m, n
    20x20 Amino Acid Substitution Matrix
    Insertion (gap) cost
  Global Alignment
    Find the set of insertions such that the resulting alignment (length ≤ m+n) is optimal w.r.t. the amino acid substitution matrix
    Difficult, less useful
  Local Alignment
    Find a significant ‘hotspot’ in the alignment
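
As a rough sketch of the global case, the score of the best global alignment can be computed with the classic Needleman-Wunsch recurrence. The function name, the toy `match_mismatch` scorer (standing in for a real 20x20 substitution matrix) and the linear gap cost are assumptions made for this illustration only.

```python
def global_alignment_score(p, q, score, gap):
    """Best global alignment score of p and q (Needleman-Wunsch, linear gap cost)."""
    m, n = len(p), len(q)
    # best[i][j] = best score for aligning the prefixes p[:i] and q[:j]
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        best[i][0] = best[i - 1][0] - gap      # p[:i] aligned against gaps only
    for j in range(1, n + 1):
        best[0][j] = best[0][j - 1] - gap      # q[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score(p[i - 1], q[j - 1]),  # substitute/match
                best[i - 1][j] - gap,                            # gap in q
                best[i][j - 1] - gap,                            # gap in p
            )
    return best[m][n]


def match_mismatch(a, b):
    """Toy substitution score (hypothetical): +2 for identity, -1 otherwise."""
    return 2 if a == b else -1


print(global_alignment_score("ACD", "AD", match_mismatch, gap=2))
```

In this toy case the best alignment pairs A with A and D with D and pays one gap for C.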
Sequence Alignment Algorithms
  Dynamic Programming Approaches
    Global and Local variations
    Provably Optimal
    O(nm) space and time
      ‘Banded’ heuristics can reduce the state space
    Finite-state automaton (FSA) extensions allow different penalties for gap openings and gap extensions
  Heuristic Approaches
    BLAST, FASTA
    Sublinear time – look for statistical significance in small local alignments between sequences
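
A minimal sketch of the local dynamic-programming variant (Smith-Waterman) is shown here. It returns only the best score, which lets it keep two rows instead of the full O(nm) table; the function names, the toy scorer and the linear gap cost are assumptions for the example, not part of the slides.

```python
def local_alignment_score(p, q, score, gap):
    """Best local alignment score of p and q (Smith-Waterman, linear gap cost)."""
    n = len(q)
    prev = [0] * (n + 1)   # DP row for the previous residue of p
    best = 0
    for i in range(1, len(p) + 1):
        curr = [0] * (n + 1)
        for j in range(1, n + 1):
            curr[j] = max(
                0,                                          # start a new local alignment
                prev[j - 1] + score(p[i - 1], q[j - 1]),    # extend with a substitution
                prev[j] - gap,                              # gap in q
                curr[j - 1] - gap,                          # gap in p
            )
            best = max(best, curr[j])
        prev = curr
    return best


def match_mismatch(a, b):
    """Toy substitution score (hypothetical): +2 for identity, -1 otherwise."""
    return 2 if a == b else -1


print(local_alignment_score("WWACDYY", "KKACDLL", match_mismatch, gap=2))
```

The clamp at 0 is what makes the alignment local, and it only helps if unrelated residue pairs receive negative scores.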
Substitution Matrices - PAM
  Dayhoff, Schwartz, Orcutt (1978)
  Step 1: extrapolate mutation probabilities from 1 step in evolutionary time
    Pick a set of protein families (71)
    Restrict proteins in each family to sequences with similarity above a certain threshold (>85%)
    Build a phylogenetic tree for each family
    Extrapolate frequencies $A_{ab}$ that amino acids a, b evolved from the same amino acid
      $A_{ab}$ and $A_{ba}$ assumed to be the same
    Convert frequencies to probabilities
      $p(a \mid b) = B_{ab} = A_{ab} / \sum_c A_{ac}$
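
The last conversion is a row normalisation of the symmetric count matrix. The sketch below uses a made-up three-residue count table (`toy_counts`) purely for illustration; the real PAM counts cover all 20 amino acids.

```python
def conditional_probabilities(counts):
    """Row-normalise pair counts: B[a][b] = A[a][b] / sum_c A[a][c]."""
    return {
        a: {b: row[b] / sum(row.values()) for b in row}
        for a, row in counts.items()
    }


# Hypothetical symmetric counts A[a][b] for three residues.
toy_counts = {
    "A": {"A": 90, "S": 8, "T": 2},
    "S": {"A": 8, "S": 80, "T": 12},
    "T": {"A": 2, "S": 12, "T": 86},
}

B = conditional_probabilities(toy_counts)
print(B["A"])
```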
Substitution Matrices – PAM (2)
  Step 2 – Infer greater evolutionary times
    Dayhoff defined a PAM1 matrix to have 1% expected substitutions
      For each row, scale the off-diagonals and adjust the diagonal to keep the matrix row stochastic
    To infer larger evolutionary times, we can view the resulting matrix C as a 20-state Markov chain
      $C^n$ is the result of performing n steps in the Markov process
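
One way to sketch the scaling and the Markov-chain extrapolation in plain Python is given below. The 3x3 matrix `B`, the uniform `freqs`, and the helper names are assumptions for the example; the scaling uses a single global factor so that the frequency-weighted substitution rate equals 1%.

```python
def scale_to_pam1(B, freqs, target=0.01):
    """Scale off-diagonals of row-stochastic B so that `target` of positions change per step."""
    n = len(B)
    # Expected fraction of positions that change in one step of B.
    rate = sum(freqs[i] * (1.0 - B[i][i]) for i in range(n))
    k = target / rate
    C = [[k * B[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        C[i][i] = 1.0 - sum(C[i])   # adjust the diagonal so each row sums to 1
    return C


def mat_mul(X, Y):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(C, steps):
    """C raised to the power `steps` by repeated squaring (steps Markov steps)."""
    n = len(C)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = C
    while steps > 0:
        if steps & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        steps >>= 1
    return result


# Hypothetical 3-residue example.
B = [[0.90, 0.08, 0.02], [0.08, 0.80, 0.12], [0.02, 0.12, 0.86]]
freqs = [1 / 3, 1 / 3, 1 / 3]
pam1 = scale_to_pam1(B, freqs)
pam250 = mat_pow(pam1, 250)
print(pam1[0][0], pam250[0][0])
```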
Substitution Matrices – PAM (3)
  Create the odds ratio of
    1) the event that 2 amino acids i, j evolved from the same ancestor, x
       $f_i$ = observed frequency of amino acid i
       $p(i, j \text{ have same ancestor}) = \sum_x f_x \Pr\{x \to i\} \Pr\{x \to j\}$
       $= \sum_x f_x (C^N)_{ix} (C^N)_{jx}$
       $= \sum_x (C^N)_{ix} f_x (C^N)_{jx}$
       $= \sum_x (C^N)_{ix} f_j (C^N)_{xj}$   (using reversibility: $f_x (C^N)_{jx} = f_j (C^N)_{xj}$)
       $= f_j (C^{2N})_{ij}$
    2) the event that the 2 amino acids align at random
       $p(\text{independent alignment of } i, j) = f_i f_j$
  Final log odds ratio: $D_{ij} = \text{average}[\log((C^N)_{ij} / f_i), \log((C^N)_{ji} / f_j)]$
    The log allows for an additive model
    Final numbers are rounded to the nearest integer
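
The final log-odds step can be sketched as follows. The slide does not state the log base or any scale factor; the 10·log10 scaling used here is the conventional choice for PAM matrices and should be read as an assumption, as should the two-residue matrix `M` standing in for $C^N$.

```python
import math


def log_odds(M, freqs, i, j, scale=10.0):
    """Rounded average of log(M[i][j]/f_i) and log(M[j][i]/f_j), in scale*log10 units."""
    forward = math.log10(M[i][j] / freqs[i])
    backward = math.log10(M[j][i] / freqs[j])
    return round(scale * (forward + backward) / 2)


# Hypothetical two-residue stand-in for C^N, with uniform background frequencies.
M = [[0.8, 0.2], [0.2, 0.8]]
freqs = [0.5, 0.5]

print(log_odds(M, freqs, 0, 0), log_odds(M, freqs, 0, 1))
```

Conserved pairs that occur more often than chance get positive scores and rare substitutions get negative ones, so summing scores along an alignment corresponds to multiplying odds ratios.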
PAM250
  Different values on the diagonal correspond to mutability potential
BLOSUM
  Henikoff & Henikoff, 1992
  Uses aligned, ungapped blocks within protein families that have similarity greater than some level L%
    $q_a = \sum_b A_{ab} / \sum_{c,d} A_{cd}$
    $p_{ab} = A_{ab} / \sum_{c,d} A_{cd}$
    $S(a,b) = \log(p_{ab} / (q_a q_b))$
    Final entries are rounded
  BLOSUM62 (L=62), BLOSUM50 (L=50)
  More direct approach, usually yields better results
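
The three formulas translate almost directly into code. The sketch below assumes `counts` is the full symmetric matrix of aligned-pair counts with no zero entries, and it uses 2·log2 (half-bit) scaling, which is the usual BLOSUM convention but is not stated on the slide. The two-residue count table is invented for the example.

```python
import math


def blosum_scores(counts, scale=2.0):
    """Rounded log-odds scores S(a,b) = scale * log2(p_ab / (q_a * q_b))."""
    residues = list(counts)
    total = sum(counts[c][d] for c in residues for d in residues)
    # q_a: background frequency of residue a among all aligned pairs.
    q = {a: sum(counts[a][b] for b in residues) / total for a in residues}
    scores = {}
    for a in residues:
        for b in residues:
            p_ab = counts[a][b] / total          # observed pair frequency
            scores[a, b] = round(scale * math.log2(p_ab / (q[a] * q[b])))
    return scores


# Hypothetical symmetric pair counts for two residues.
toy_counts = {"A": {"A": 40, "S": 10}, "S": {"A": 10, "S": 40}}
scores = blosum_scores(toy_counts)
print(scores)
```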
Log-Odds Similarity Matrix Properties
  Negative numbers needed for Smith-Waterman local alignment algorithm
  Nice probabilistic interpretation
    Amino acid substitutions assumed independent
  Attempts to metricize these matrices
    Taylor, Jones 93: used various algebraic manipulations to arrive at a metric matrix with minimal distortion
      $D_{ij} = a - S_{ij}$
      Larger values of a yielded better metrics at the cost of high dimensionality
    Constant Shift Embedding
    Linial et al. constructed a near metric over aligned segments of length 50
      $D(u,v) = S(u,u) + S(v,v) - 2 S(u,v)$
      $10^{-7}$ error rate
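
To see why a larger shift a helps, note that for $D_{ij} = a - S_{ij}$ the triangle inequality $D_{ik} \le D_{ij} + D_{jk}$ is equivalent to $a \ge S_{ij} + S_{jk} - S_{ik}$. The sketch below checks this on an invented 3x3 similarity matrix. Forcing the diagonal of D to zero is an assumption of this example and is not taken from the cited work.

```python
from itertools import permutations, product


def shifted_distance(S, a):
    """D[i][j] = a - S[i][j] off the diagonal; D[i][i] is set to 0 (assumption)."""
    n = len(S)
    return [[0 if i == j else a - S[i][j] for j in range(n)] for i in range(n)]


def satisfies_triangle(D):
    """True if D[i][k] <= D[i][j] + D[j][k] for every triple of indices."""
    n = len(D)
    return all(D[i][k] <= D[i][j] + D[j][k] for i, j, k in product(range(n), repeat=3))


def min_shift(S):
    """Smallest a for which the shifted off-diagonal distances obey the triangle inequality."""
    n = len(S)
    return max(S[i][j] + S[j][k] - S[i][k] for i, j, k in permutations(range(n), 3))


# Hypothetical symmetric similarity matrix for three residues.
S = [[4, 1, -2], [1, 5, 0], [-2, 0, 6]]
a = min_shift(S)
print(a, satisfies_triangle(shifted_distance(S, a)))
```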
mPAM
  Metric substitution model
  Measures the expected time per 250 mutations among 100 amino acids
    Same rate as PAM250
  Exponential distribution assumed: $f(t) = 1 - e^{-\lambda t}$
    Given pairwise substitution rates p(a,b)
    Solve for $\lambda$: $f(1) = 1 - e^{-\lambda} = p(a,b)$
    Expected time t of an event occurring in an exponential distribution is $1/\lambda$
    $\text{mPAM}(a,b) = \text{round}(1/\lambda)$
  Two values needed to be adjusted to form a metric
    Rounding error?
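
Solving $1 - e^{-\lambda} = p(a,b)$ gives $\lambda = -\ln(1 - p(a,b))$, so each entry follows from one logarithm. A small sketch, with hypothetical function names and made-up substitution probabilities:

```python
import math


def rate_from_probability(p):
    """Solve 1 - exp(-lam) = p for the exponential rate lam (requires 0 < p < 1)."""
    return -math.log(1.0 - p)


def mpam_entry(p):
    """Rounded expected waiting time 1/lam for a substitution with unit-time probability p."""
    return round(1.0 / rate_from_probability(p))


for p in (0.1, 0.5):
    print(p, mpam_entry(p))
```

Rarer substitutions (small p) give longer expected waiting times and therefore larger distances.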
mPAM (2)
  Sellers' Theorem:
    If a pairwise alignment is found using a metric, the resulting alignment scores also form a metric
  Optimized for BLAST-like lookup
    Smaller alignments
  Difficult to compare with other similarity matrices
    Dynamic programming algorithms rely on negative values in the similarity matrix
    Probabilistic interpretation: larger positive alignments are statistically significant
mPAM Disadvantages
  d(x,x) = 0
    This does not capture the relative mutability among different amino acids
    PAM/BLOSUM capture this with different positive values along the diagonal
  Do amino acids substitute according to an exponential distribution?
  Amino acid substitution may be inherently non-metric
  Comparison to BLOSUM?