chapter 3 computational molecular biology michael smith [email protected]
Chapter 3
Computational Molecular Biology
Michael Smith
[email protected] (cs.fit.edu)
Sequence Comparison
Sequence comparison is the most important operation in computational biology
Consists of finding which parts of the sequences are alike and which parts differ
Similarity and Alignment
Similarity: gives a measure of how similar sequences are
Alignment: a way of placing sequences one above the other in order to make clear the correspondence between similar characters or substrings
Sequence Comparison
Want the best alignment between two or more sequences
Global comparison: alignment involving entire sequences
Local comparison: alignment involving substrings
Semi-global comparison: aligning prefixes and suffixes of the sequences
All can be solved by dynamic programming
Global Comparison
Consider the following DNA sequences
GACGGATTAG
GATCGGAATAG
Are they similar?
After alignment, similarities are more obvious
GA-CGGATTAG
GATCGGAATAG
Alignment and Score
Alignment, more precise definition: insertion of spaces in arbitrary locations along the sequences so that they end up with the same size
No column can be entirely composed of spaces
Score: measure of similarity
Each column receives +1 for a match, -1 for a mismatch, or -2 for a space
Sum the column values to get the score
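The column scoring rule can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_score` and its parameter names are made up here, and the defaults are the +1, -1, -2 values from the slide.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, space=-2):
    """Score an alignment column by column ('-' marks a space)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same size")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be entirely composed of spaces")
        if a == "-" or b == "-":
            total += space      # one symbol facing a space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# The alignment from the Global Comparison slide:
# 9 matches, 1 mismatch, 1 space -> 9 - 1 - 2 = 6
print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))
```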
Dynamic Programming
Solving an instance of a problem by taking advantage of already computed solutions for smaller instances of the problem
Main algorithmic approach used in sequence alignment
Figures 3.1 and 3.2
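A minimal sketch of the table-filling idea, assuming the +1, -1, -2 scoring from the previous slide. It is not a transcription of Figure 3.2; the name `similarity` and the table name `a` are chosen here for illustration. Entry `a[i][j]` holds the best score for aligning the first i characters of s with the first j characters of t, and each entry is computed from three smaller, already solved instances.

```python
def similarity(s, t, match=1, mismatch=-1, space=-2):
    """Best global alignment score of s and t by dynamic programming."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    # A prefix aligned against the empty string costs one space per character.
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(
                a[i - 1][j] + space,      # s[i] against a space
                a[i - 1][j - 1] + p,      # s[i] against t[j]
                a[i][j - 1] + space,      # a space against t[j]
            )
    return a[m][n]


print(similarity("GACGGATTAG", "GATCGGAATAG"))
```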
Optimal Alignments
From Figure 3.1, start at (m,n) and follow arrows to (0,0)
Each arrow gives one column of the alignment
If the arrow is horizontal, it corresponds to a column with a space in s matched with t[j]
If the arrow is vertical, it corresponds to s[i] matched with a space in t
If the arrow is diagonal, s[i] is matched with t[j]
Optimal Alignments
Several optimal alignments may exist; which one is recovered depends on which arrow is given priority
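The walk from (m,n) back to (0,0) can be sketched as follows. Instead of storing arrows, this version rechecks which neighbor produced each entry. The function name `align`, and the priority order (vertical, then diagonal, then horizontal), are choices made for this sketch; a different order may return a different alignment with the same score.

```python
def align(s, t, match=1, mismatch=-1, space=-2):
    """Return one optimal global alignment of s and t as two gapped strings."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space, a[i - 1][j - 1] + p, a[i][j - 1] + space)

    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and a[i][j] == a[i - 1][j] + space:
            # vertical arrow: s[i] matched with a space in t
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        elif i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + (
                match if s[i - 1] == t[j - 1] else mismatch):
            # diagonal arrow: s[i] matched with t[j]
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        else:
            # horizontal arrow: a space in s matched with t[j]
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    # Columns were collected from the end, so reverse them.
    return "".join(reversed(top)), "".join(reversed(bottom))


top, bottom = align("GACGGATTAG", "GATCGGAATAG")
print(top)
print(bottom)
```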
Local Comparison
A local alignment between s and t is an alignment between a substring of s and a substring of t
Goal: find the highest-scoring local alignment between two sequences
Variation of the basic algorithm (Figure 3.2)
Each entry (i,j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j] (page 55)
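A sketch of the local variation, again assuming the +1, -1, -2 scores. The only changes from the global table are that the first row and column are zero and every entry is bounded below by zero (the empty alignment); the answer is the largest entry anywhere in the table. The name `local_similarity` is illustrative.

```python
def local_similarity(s, t, match=1, mismatch=-1, space=-2):
    """Highest score of an alignment between a substring of s and a substring of t."""
    m, n = len(s), len(t)
    # Row 0 and column 0 stay at zero: empty suffixes align with score 0.
    a = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(
                0,                        # start a fresh alignment here
                a[i - 1][j] + space,
                a[i - 1][j - 1] + p,
                a[i][j - 1] + space,
            )
            best = max(best, a[i][j])
    return best


# The two strings share only the substring "ACG".
print(local_similarity("TTTACGTTT", "GGACGGG"))
```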
Semi-Global Comparison
Score alignments ignoring some of the end spaces in the sequences
End spaces are those that appear before the first or after the last character in a sequence
For example,
CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------
If we aligned the sequences in the usual way, charging for end spaces, then
CAGCACTTGGATTCTCGG
CAGC-----G-T----GG
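With the +1, -1, -2 scores, the first alignment above scores -19 when end spaces are charged and 3 when they are ignored, while the second scores -12. One way to ignore end spaces in the dynamic programming table is sketched below. It assumes that only the spaces before and after t are free (spaces at the ends of s are still charged); the function name `semiglobal_similarity` is made up for this sketch.

```python
def semiglobal_similarity(s, t, match=1, mismatch=-1, space=-2):
    """Best score aligning all of t inside s, with end spaces in t not charged."""
    m, n = len(s), len(t)
    # Column 0 stays at zero: leading characters of s facing spaces in t are free.
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        a[0][j] = j * space               # spaces in s are still charged
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space, a[i - 1][j - 1] + p, a[i][j - 1] + space)
    # Trailing spaces in t are free: take the best entry in the last column.
    return max(a[i][n] for i in range(m + 1))


print(semiglobal_similarity("CAGCACTTGGATTCTCGG", "CAGCGTGG"))
```

Which end spaces are free is a modeling choice; freeing other combinations changes only the initialization and where the final maximum is taken.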
Extensions to Basic Algorithm
Basic algorithm has O(mn) time complexity and uses O(mn) space
Possible to improve space complexity from quadratic to linear at the expense of roughly doubling processing time
Can be accomplished by using a divide-and-conquer strategy
Divide the problem into small subproblems and later combine the solutions to obtain a solution for the whole problem
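The sketch below shows only the first ingredient of the linear-space approach: the score alone can be computed while keeping a single previous row of the table. Recovering the alignment itself in linear space needs the divide-and-conquer step described above, which is not shown here. The name `similarity_linear_space` is illustrative, and the +1, -1, -2 scores are assumed.

```python
def similarity_linear_space(s, t, match=1, mismatch=-1, space=-2):
    """Global alignment score using O(min(m, n)) space instead of O(mn)."""
    if len(t) > len(s):
        s, t = t, s                       # scoring is symmetric; keep rows short
    n = len(t)
    prev = [j * space for j in range(n + 1)]      # row 0 of the table
    for i in range(1, len(s) + 1):
        cur = [i * space] + [0] * n               # column 0 of row i
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j] + space, prev[j - 1] + p, cur[j - 1] + space)
        prev = cur                                # row i-1 is no longer needed
    return prev[n]


print(similarity_linear_space("GACGGATTAG", "GATCGGAATAG"))
```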
Gap Penalty Functions
A gap is a run of consecutive spaces
When mutations occur, a single block of consecutive spaces (one gap) is more likely than a series of isolated spaces
The previously discussed scoring method is not appropriate in this case
Gap Penalty Functions
For example,
A------ATTCCTTCCTTCC
AAAGAGAATTCCTTCCTTCC
Scoring is done at a block level, not a column level
A ------ ATTCCTTCCTTCC
A AAGAGA ATTCCTTCCTTCC
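Block-level scoring can be sketched as below. The slides do not say which gap penalty function is used, so this example assumes an affine penalty w(k) = gap_open + gap_extend * k for a gap of k spaces; the values 2 and 1, and the name `block_score`, are hypothetical.

```python
from itertools import groupby


def block_score(top, bottom, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Score an alignment block by block; a gap of k spaces costs gap_open + gap_extend * k."""

    def kind(column):
        a, b = column
        if a == "-":
            return "gap in top"
        if b == "-":
            return "gap in bottom"
        return "pair"

    total = 0
    # groupby splits the columns into maximal runs of the same kind (the blocks).
    for block_kind, block in groupby(zip(top, bottom), key=kind):
        columns = list(block)
        if block_kind == "pair":
            total += sum(match if a == b else mismatch for a, b in columns)
        else:
            total -= gap_open + gap_extend * len(columns)
    return total


# 14 matches and one gap of 6 spaces: 14 - (2 + 6) = 6
print(block_score("A------ATTCCTTCCTTCC", "AAAGAGAATTCCTTCCTTCC"))
```

Under this penalty six isolated spaces would cost 6 * (2 + 1) = 18 instead of 8, which is the behavior the slide asks for.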
Multiple Sequences
Multiple sequence alignment is a generalization of the two-sequence case
A multiple alignment of s1, s2, ..., sk is obtained by inserting spaces in the sequences in such a way as to make them all the same size
No column is made entirely of spaces
Figure 3.10
Scoring Multiple Sequences
Need a function that takes amino acid sequences as input and returns a score
The function must have two properties
The result must be independent of the order of the arguments. For example, if a column has I,V,- the same score should be produced if the order is -,V,I
Should reward the presence of many equal residues and penalize unequal residues and spaces
Sum-of-Pairs (SP)
Sum-of-Pairs (SP) satisfies both properties
Sum of pairwise scores of all pairs of symbols in a column
SP-score(I,-,I,V) = p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V)
where p(a,b) is the pairwise score of a and b
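A small sketch of the SP computation. The slides leave p(a,b) unspecified, so this example assumes the earlier +1, -1, -2 values and additionally sets p(-,-) = 0; the function names are illustrative.

```python
from itertools import combinations


def p(a, b, match=1, mismatch=-1, space=-2):
    """Pairwise score of two symbols (assumed values; '-' is a space)."""
    if a == "-" and b == "-":
        return 0                  # assumption: two spaces score nothing
    if a == "-" or b == "-":
        return space
    return match if a == b else mismatch


def sp_column(column):
    """Sum of p over all unordered pairs of symbols in one column."""
    return sum(p(a, b) for a, b in combinations(column, 2))


def sp_alignment(rows):
    """SP score of a whole multiple alignment given as equal-length strings."""
    return sum(sp_column(col) for col in zip(*rows))


# p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V) = -2 + 1 - 1 - 2 - 2 - 1
print(sp_column("I-IV"))
```

Because every unordered pair is counted exactly once, reordering the symbols in a column does not change the result.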
Algorithm Paradigm
Dynamic programming is used again
Basic algorithm can be used, but there will be problems
In the two-sequence case, complexity is O(n^2)
For the k-sequence case, complexity is O(n^k)
Can take a really long time if k is large
Algorithm Paradigm
Must reduce the number of cells to compute
Apply a heuristic to reduce the number of computed cells
Star Alignments
Building a multiple alignment based on pairwise alignments between a fixed sequence and all others
Fixed sequence is the center of the star
Star Alignments
Example
a = ATTGCCATT
b = ATGGCCATT
c = ATCCAATTTT
d = ATCTTCTT
e = ACTGACC
Select a as the center of the star
Star Alignments
Align
a with b
a with c
a with d
a with e
Star Alignments
a with b
ATTGCCATT
ATGGCCATT
a with c
ATTGCCATT--
ATC-CAATTTT
a with d
ATTGCCATT
ATCTTC-TT
a with e
ATTGCCATT
ACTGACC--
Star Alignments
Combine results
ATTGCCATT--
ATGGCCATT--
ATC-CAATTTT
ATCTTC-TT--
ACTGACC----
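The align-then-combine procedure can be sketched as follows. This is a simplified illustration, not the chapter's exact algorithm: the helper `align` is a plain global aligner with the +1, -1, -2 scores, `star_alignment` is a made-up name, and the center is passed in rather than chosen automatically. Spaces that a pairwise alignment puts in the center are propagated to every row already combined. Because ties in the pairwise step can be broken differently, the rows printed may differ from the slide while still being a valid multiple alignment.

```python
def align(s, t, match=1, mismatch=-1, space=-2):
    """One optimal global alignment of s and t, returned as two gapped strings."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p, a[i - 1][j] + space, a[i][j - 1] + space)
    top, bottom, i, j = [], [], m, n
    while i > 0 or j > 0:
        p = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + space:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


def star_alignment(seqs, center=0):
    """Combine pairwise alignments against seqs[center] into one multiple alignment."""
    rows, order = [seqs[center]], [center]     # rows[0] is always the gapped center
    for k, seq in enumerate(seqs):
        if k == center:
            continue
        c_new, other = align(seqs[center], seq)
        c_old = rows[0]
        merged, new_row = [[] for _ in rows], []
        i = j = 0
        while i < len(c_old) or j < len(c_new):
            if i < len(c_old) and j < len(c_new) and c_old[i] == c_new[j]:
                # Same center symbol (or a shared space): keep both columns together.
                for r, row in enumerate(rows):
                    merged[r].append(row[i])
                new_row.append(other[j]); i += 1; j += 1
            elif i < len(c_old) and c_old[i] == "-":
                # Space already in the combined center: the new row gets a space.
                for r, row in enumerate(rows):
                    merged[r].append(row[i])
                new_row.append("-"); i += 1
            else:
                # New space in the center: every existing row gets a space.
                for r in range(len(rows)):
                    merged[r].append("-")
                new_row.append(other[j]); j += 1
        rows = ["".join(cols) for cols in merged] + ["".join(new_row)]
        order.append(k)
    result = [""] * len(seqs)
    for row, k in zip(rows, order):
        result[k] = row                        # restore the input order
    return result


for row in star_alignment(["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ACTGACC"]):
    print(row)
```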
Database Search
Databases exist for searching and comparing protein and DNA sequences
The methods described work, but may take too long and be impractical for searching large databases
Novel and faster methods have been developed
PAM Matrix
When scoring protein sequences, the simple +1, -1, -2 scheme may not be sufficient
Amino acids have properties that influence the likelihood that they will be substituted in an evolutionary scenario
PAM Matrix
Point Accepted Mutations
A 1-PAM matrix is suitable for comparing sequences that are 1 unit of evolution apart
A 250-PAM matrix is suitable for comparing sequences that are 250 units of evolution apart
PAM Matrix
Markovian in nature
Need the probability of occurrence for each amino acid
Probability transition matrix
Score matrix
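How the pieces on this slide might fit together is sketched below on a toy two-letter alphabet. All numbers are invented for illustration. The sketch assumes the usual Markov-chain reading: the k-PAM transition matrix is the 1-PAM transition matrix raised to the k-th power, and a score entry is a log-odds value, taken here as 10 * log10(M^k[a][b] / freq[b]). The exact scaling in real PAM tables may differ.

```python
import math


def mat_mul(x, y):
    """Product of two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(m, k):
    """k-th power of a square matrix by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result


def score_matrix(m1, freqs, k):
    """Log-odds scores for sequences k units of evolution apart (assumed formula)."""
    mk = mat_pow(m1, k)
    n = len(m1)
    return [[10 * math.log10(mk[a][b] / freqs[b]) for b in range(n)] for a in range(n)]


# Hypothetical 1-unit transition matrix and symbol frequencies for a 2-letter alphabet.
M1 = [[0.99, 0.01],
      [0.01, 0.99]]
FREQS = [0.5, 0.5]

for row in score_matrix(M1, FREQS, 250):
    print([round(x, 3) for x in row])
```

A positive entry means the pair is seen more often than chance at that evolutionary distance, a negative entry less often.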
BLAST
Most frequently used programs to search sequence databases
Acronym for Basic Local Alignment Search Tool
Returns a list of high-scoring segment pairs between the query sequence and sequences in the database
http://www.ncbi.nlm.nih.gov
FAST
Another family of programs for sequence database search
http://www.rcsb.org/pdb/index.html
BLAST and FAST use PAM matrices