amino acid scoring matrices jason davis. overview protein synthesis/evolution protein...

Amino Acid Scoring Matrices

Jason Davis

Overview

Protein synthesis/evolution
Computational sequence alignment
  Smith-Waterman Algorithm
  BLAST
Amino Acid Scoring Matrices
  PAM - Point Accepted Mutations
  BLOSUM - BLOck SUbstitution Matrix
  mPAM
Metric Conversions

Proteins

3-dimensional structures
Composed of amino acids chained together
Can be represented as a linear (one-dimensional) sequence
20 different standard amino acids exist
Usually 100-1500 amino acids long
Have many different shapes and functions
Function depends on both 3D shape and amino acid sequence

Protein Synthesis

DNA: strand composed of 4 different bases: A, T, C, G
20 amino acids: 3 bases (one codon) needed to encode each amino acid
Degenerate coding: 4^3 = 64 codons for 20 amino acids, so several codons encode the same amino acid
Signalling -> Transcription/Translation -> Protein

Protein Evolution

Protein ‘families’
  Set of homologous proteins
  Same function, different composition
  Similar structure
Identifying families
  Pairwise sequence alignment
  Multiple sequence alignment
    NP-hard
  Other approaches
    Structural, experimental

Pairwise Sequence Alignment

Input
  2 sequences p, q of lengths m, n
  20x20 amino acid substitution matrix
  Insertion (gap) cost
Global alignment
  Find the set of insertions such that the resulting alignment (length at most m+n) is optimal w.r.t. the amino acid substitution matrix
  Difficult, less useful
Local alignment
  Find a significant ‘hotspot’ in the alignment

Sequence Alignment Algorithms

Dynamic programming approaches
  Global and local variations
  Provably optimal
  O(nm) space and time
  ‘Banded’ heuristics can reduce the state space
  FSA (finite-state automaton) extensions allow varying penalties for gap openings and gap extensions
Heuristic approaches
  BLAST, FASTA
  Sublinear time: look for statistical significance in small local alignments between sequences
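The dynamic programming approach for local alignment (Smith-Waterman) can be sketched as follows. This is a minimal illustration, not the slides' own code: it assumes a linear gap cost (no separate opening/extension penalties), returns only the best score rather than the alignment itself, and uses a hypothetical `toy_score` function in place of a real 20x20 substitution matrix.

```python
def smith_waterman(p, q, score, gap=-4):
    """Return the best local alignment score between sequences p and q.

    score(a, b) is the substitution score for residues a and b;
    gap is a (negative) linear gap cost. Both are assumptions of this sketch.
    """
    m, n = len(p), len(q)
    # H[i][j] = best score of a local alignment ending at p[i-1] and q[j-1]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i][j] = max(
                0,                                            # start a new local alignment
                H[i - 1][j - 1] + score(p[i - 1], q[j - 1]),  # match or substitution
                H[i - 1][j] + gap,                            # gap in q
                H[i][j - 1] + gap,                            # gap in p
            )
            best = max(best, H[i][j])
    return best


def toy_score(a, b):
    # Hypothetical scoring: +2 for identity, -1 for any substitution.
    return 2 if a == b else -1
```

The table has (m+1) x (n+1) cells and each cell is filled in constant time, which is where the O(nm) space and time bound comes from. The `0` option only ever helps if some substitution scores are negative; otherwise the best local alignment would always extend to the full sequences.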

Substitution Matrices - PAM

Dayhoff, Schwartz, Orcutt (1978)
Step 1: extrapolate mutation probabilities from 1 step in evolutionary time
  Pick a set of protein families (71)
  Restrict proteins in each family to sequences with similarity above a certain threshold (>85%)
  Build a phylogenetic tree for each family
  Extrapolate frequencies A_{ab} that amino acids a, b evolved from the same amino acid
    A_{ab} and A_{ba} assumed to be the same
  Convert frequencies to probabilities
    p(a|b) = B_{ab} = A_{ab} / ∑_c A_{ac}
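The frequency-to-probability conversion amounts to dividing each row of A by its row sum. A minimal sketch, assuming a toy 3-letter alphabet and made-up symmetric counts (the real matrix is 20x20 and the numbers below are hypothetical):

```python
def row_normalize(A):
    """Turn a matrix of counts A[a][c] into probabilities B[a][c] = A[a][c] / sum_c A[a][c]."""
    B = []
    for row in A:
        total = sum(row)
        B.append([x / total for x in row])
    return B


# Hypothetical symmetric counts (A[a][b] == A[b][a]) for a 3-letter alphabet.
A = [
    [90, 6, 4],
    [6, 80, 14],
    [4, 14, 82],
]
B = row_normalize(A)
```

Each row of `B` sums to 1, so `B` is row stochastic even though the counts `A` are symmetric and `B` generally is not.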

Substitution Matrices - PAM (2)

Step 2: infer greater evolutionary times
  Dayhoff defined a PAM1 matrix to have 1% expected substitutions
    For each row, scale off-diagonals and adjust diagonals to keep the matrix row stochastic
  To infer larger evolutionary times, we can view the formed matrix C as a 20-state Markov chain
    C^n is the result of performing n steps in the Markov process
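The scaling and the n-step extrapolation can be sketched as below. Assumptions of this sketch, none of which come from the slides: a toy 3-state matrix `B`, uniform hypothetical frequencies `f`, and a single global scale factor applied to all off-diagonal entries (a simplification; the slide only says the off-diagonals are scaled row by row).

```python
def matmul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matpow(C, n):
    """Return C^n by repeated multiplication (n steps of the Markov chain)."""
    size = len(C)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, C)
    return result


def scale_to_pam1(B, f, rate=0.01):
    """Scale off-diagonals so the expected substitution rate is `rate`,
    then reset each diagonal so every row still sums to 1."""
    expected = sum(f[a] * (1.0 - B[a][a]) for a in range(len(B)))
    s = rate / expected
    C = []
    for a, row in enumerate(B):
        new_row = [s * x for x in row]
        new_row[a] = 1.0 - sum(s * x for b, x in enumerate(row) if b != a)
        C.append(new_row)
    return C


# Hypothetical row-stochastic matrix and uniform residue frequencies.
B = [[0.90, 0.06, 0.04],
     [0.06, 0.80, 0.14],
     [0.04, 0.14, 0.82]]
f = [1 / 3, 1 / 3, 1 / 3]

C = scale_to_pam1(B, f)   # one PAM step: 1% expected substitutions
C250 = matpow(C, 250)     # 250 steps of the Markov process
```

Because `C` is row stochastic, every power of it is row stochastic as well, so `C250` can still be read as a table of substitution probabilities.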

Substitution Matrices - PAM (3)

Create odds ratio of
1) the event that 2 amino acids i, j evolved from the same ancestor, x
  f_i = observed frequency of amino acid i
  p(i, j have same ancestor) = ∑_x f_x Pr{x→i} Pr{x→j}
    = ∑_x f_x (C^N)_{ix} (C^N)_{jx}
    = ∑_x (C^N)_{ix} f_x (C^N)_{jx}
    = ∑_x (C^N)_{ix} f_j (C^N)_{xj}    (using reversibility: f_x (C^N)_{jx} = f_j (C^N)_{xj})
    = f_j (C^{2N})_{ij}
2) the event that the 2 amino acids align at random
  p(independent alignment of i, j) = f_i * f_j
Final log odds ratio:
  D_{ij} = average[log((C^N)_{ij} / f_i), log((C^N)_{ji} / f_j)]
  The log allows for an additive model
  Final numbers are rounded to nearest integer
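The final log-odds formula can be turned into a small function. This is a sketch under assumptions: the slide gives neither a log base nor a scale factor, so base 10 and `scale=10` are assumed here, and the 2x2 matrix and frequencies are hypothetical toy values.

```python
import math


def log_odds(CN, f, scale=10.0):
    """D[i][j] = round(scale * average(log10(CN[i][j]/f[i]), log10(CN[j][i]/f[j]))).

    CN is the N-step matrix C^N, f the observed residue frequencies.
    The base-10 log and the scale factor are assumptions of this sketch.
    """
    n = len(CN)
    D = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            a = math.log10(CN[i][j] / f[i])
            b = math.log10(CN[j][i] / f[j])
            D[i][j] = round(scale * (a + b) / 2.0)
    return D


# Hypothetical 2-state example.
CN = [[0.8, 0.2],
      [0.2, 0.8]]
f = [0.5, 0.5]
D = log_odds(CN, f)
```

Averaging the two directions makes `D` symmetric by construction, and because the entries are logs, the score of an alignment is the sum of its per-position scores.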

PAM250

Different values on the diagonal correspond to mutability potential

BLOSUM

Henikoff & Henikoff, 1992
Uses aligned, ungapped blocks within protein families; sequences in a block that are more than L% identical are clustered and counted as one
  q_a = ∑_b A_{ab} / ∑_{c,d} A_{cd}
  p_{ab} = A_{ab} / ∑_{c,d} A_{cd}
  S(a,b) = log(p_{ab} / (q_a q_b))
  Final entries are rounded
Blosum62 (L=62), Blosum50 (L=50)
More direct approach, usually yields better results
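The BLOSUM score computation from pair counts can be sketched as follows. Assumptions of this sketch: `A` is a symmetric matrix of aligned-pair counts with all entries positive, and scores are reported in half-bit units (`scale=2` with a base-2 log), which the slide does not state. The 2x2 counts are hypothetical.

```python
import math


def blosum_scores(A, scale=2.0):
    """S[a][b] = round(scale * log2(p_ab / (q_a * q_b))) from pair counts A."""
    n = len(A)
    total = sum(sum(row) for row in A)
    # p[a][b]: probability of observing the aligned pair (a, b)
    p = [[A[a][b] / total for b in range(n)] for a in range(n)]
    # q[a]: marginal probability of residue a
    q = [sum(p[a]) for a in range(n)]
    return [[round(scale * math.log2(p[a][b] / (q[a] * q[b]))) for b in range(n)]
            for a in range(n)]


# Hypothetical symmetric pair counts for a 2-letter alphabet.
A = [[40, 10],
     [10, 40]]
S = blosum_scores(A)
```

Pairs that occur more often than chance (`p_ab > q_a * q_b`) get positive scores and rarer pairs get negative ones, which is what makes the matrix usable for local alignment.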

Log-Odds Similarity Matrix Properties

Negative numbers needed for Smith-Waterman local alignment algorithm
Nice probabilistic interpretation
  Amino acid substitutions assumed independent
Attempts to metricize these matrices
  Taylor, Jones 93: used various algebraic manipulations to arrive at a metric matrix with minimal distortion
    D_{ij} = a - S_{ij}
    Larger values of a yielded better metrics at the cost of high dimensionality

Constant Shift Embedding

Linial et al. constructed a near metric over aligned segments of length 50
  D(u,v) = S(u,u) + S(v,v) - 2*S(u,v)
  10^{-7} error rate
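The distance D(u,v) = S(u,u) + S(v,v) - 2*S(u,v) and a check for triangle-inequality violations can be sketched as below. Note the assumption: the slide applies the formula to aligned segments of length 50, whereas this illustration applies it to a small hypothetical similarity matrix indexed by item number.

```python
def similarity_to_distance(S):
    """D[u][v] = S[u][u] + S[v][v] - 2 * S[u][v]; the diagonal is 0 by construction."""
    n = len(S)
    return [[S[u][u] + S[v][v] - 2 * S[u][v] for v in range(n)] for u in range(n)]


def triangle_violations(D):
    """List all triples (u, v, w) with D[u][w] > D[u][v] + D[v][w]."""
    n = len(D)
    return [(u, v, w)
            for u in range(n) for v in range(n) for w in range(n)
            if D[u][w] > D[u][v] + D[v][w]]


# Hypothetical symmetric similarity scores for three items.
S = [[4, -1, -2],
     [-1, 5, 0],
     [-2, 0, 6]]
D = similarity_to_distance(S)
violations = triangle_violations(D)
```

An empty `violations` list means the toy `D` satisfies the triangle inequality; counting the violating triples over a large sample is one way to read an "error rate" for a near metric.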

mPAM

Metric substitution model
Measures the expected time per 250 mutations among 100 amino acids
  Same rate as PAM250
Exponential distribution assumed: f(t) = 1 - e^{-λt}
Given pairwise substitution rates p(a,b)
  Solve for λ: f(1) = 1 - e^{-λ} = p(a,b), i.e. λ = -ln(1 - p(a,b))
  Expected time t of an event occurring in an exponential distribution is 1/λ
  mPAM(a,b) = round(1/λ)
Two values needed to be adjusted to form a metric
  Rounding error?
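The mPAM entry for a single pair follows directly from the two formulas above. A minimal sketch, assuming the substitution probability p(a,b) lies strictly between 0 and 1; the probabilities used in the comments are hypothetical, not values from a real PAM250 matrix.

```python
import math


def mpam_entry(p):
    """Return round(1 / lambda), where lambda solves 1 - exp(-lambda) = p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    lam = -math.log(1.0 - p)   # rate of the assumed exponential distribution
    return round(1.0 / lam)    # expected waiting time, rounded to an integer


# Rarer substitutions give larger distances, e.g.
# mpam_entry(0.5) is 1, mpam_entry(0.1) is 9, mpam_entry(0.05) is 19.
```

The rounding in the last step is one plausible source of the small triangle-inequality violations that the slide says had to be adjusted by hand.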

mPAM (2)

Sellers' Theorem:
  If a pairwise alignment is found using a metric, the resulting alignment scores are also metrics
Optimized for BLAST-like lookup
  Smaller alignments
Difficult to compare with other similarity matrices
  Dynamic programming algorithms rely on negative values in the similarity matrix
  Probabilistic interpretation: larger positive alignment scores are statistically significant

mPAM Disadvantages

d(x,x) = 0
  This does not capture the relative mutability among different amino acids
  PAM/BLOSUM capture this with different positive values along the diagonal
Do amino acids substitute according to an exponential distribution?
Amino acid substitution may be inherently non-metric
Comparison to BLOSUM?
