Chapter 3: Computational Molecular Biology
Michael Smith, msmith@cs.fit.edu

Post on 04-Jan-2016


Page 1: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Chapter 3

Computational Molecular Biology

Michael Smith
msmith@cs.fit.edu

Page 2: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Sequence Comparison

Sequence comparison is the most important operation in computational biology

Consists of finding which parts of the sequences are alike and which parts differ

Page 3: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Similarity and Alignment

Similarity
  Gives a measure of how similar sequences are

Alignment
  A way of placing sequences one above the other in order to make clear the correspondence between similar characters or substrings

Page 4: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Sequence Comparison

Want the best alignment between two or more sequences

Global Comparison
  Alignment involving entire sequences

Local Comparison
  Alignment involving substrings

Semi-Global Comparison
  Aligning prefixes and suffixes of the sequences

All can be solved by Dynamic Programming

Page 5: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Global Comparison

Consider the following DNA sequences

GACGGATTAG
GATCGGAATAG

Are they similar?

After alignment, similarities are more obvious

GA-CGGATTAG
GATCGGAATAG

Page 6: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Alignment and Score

Alignment, more precise definition
  Insertion of spaces in arbitrary locations along the sequences so that they end up with the same size
  No column can be entirely composed of spaces

Score
  Measure of similarity
  Each column receives +1 for a match, -1 for a mismatch, or -2 for a space
  Sum the column values to get the score
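The column scoring rule above can be sketched in a few lines of Python. The function name `score_alignment` is made up for this illustration; the +1, -1, -2 values are the ones given on the slide, and `-` stands for a space.

```python
def score_alignment(row_s, row_t, match=1, mismatch=-1, space=-2):
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_s, row_t):
        if x == "-" and y == "-":
            raise ValueError("a column cannot consist only of spaces")
        if x == "-" or y == "-":
            total += space       # one character aligned with a space
        elif x == y:
            total += match       # identical characters
        else:
            total += mismatch    # different characters
    return total


# The alignment from the Global Comparison slide:
# 9 matches, 1 mismatch, 1 space -> 9 - 1 - 2 = 6
print(score_alignment("GA-CGGATTAG", "GATCGGAATAG"))
```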

Page 7: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Dynamic Programming

Solving an instance of a problem by taking advantage of already computed solutions for smaller instances of the problem

Main algorithmic approach used in sequence alignment

Figures 3.1 and 3.2
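Figures 3.1 and 3.2 are not reproduced in this transcript, so the following is only a minimal sketch of the standard dynamic programming table for global comparison, assuming the +1, -1, -2 scoring from the previous slide. The function name `similarity_table` is hypothetical. Entry (i, j) holds the best score for aligning the first i characters of s with the first j characters of t, and each entry is built from three smaller, already computed entries.

```python
def similarity_table(s, t, match=1, mismatch=-1, space=-2):
    """Fill the (len(s)+1) x (len(t)+1) table of prefix-alignment scores."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space          # prefix of s against the empty string
    for j in range(1, n + 1):
        a[0][j] = j * space          # empty string against prefix of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(
                a[i - 1][j] + space,     # s[i] aligned with a space
                a[i - 1][j - 1] + p,     # s[i] aligned with t[j]
                a[i][j - 1] + space,     # space aligned with t[j]
            )
    return a


table = similarity_table("GACGGATTAG", "GATCGGAATAG")
print(table[-1][-1])  # score of the best global alignment
```

The bottom-right entry is the similarity of the two full sequences; for the example pair from the Global Comparison slide it is 6, the score of the alignment shown there.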

Page 8: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Optimal Alignments

From Figure 3.1, start at (m,n) and follow arrows to (0,0)

Each arrow gives one column of the alignment

If the arrow is horizontal, it corresponds to a column with a space in s matched with t[j]

If the arrow is vertical, it corresponds to s[i] matched with a space in t

If the arrow is diagonal, s[i] is matched with t[j]

Page 9: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Optimal Alignments

Many alignments are possible, depending on which arrow is given priority
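The arrow-following procedure can be sketched as a traceback over the score table. This is an illustration under assumptions, not the figure's exact algorithm: the function name `align` and the `priority` parameter are invented here, with "up" standing for a vertical arrow, "left" for a horizontal arrow, and "diag" for a diagonal arrow.

```python
def align(s, t, match=1, mismatch=-1, space=-2,
          priority=("diag", "up", "left")):
    """Return one optimal global alignment as two rows of equal length."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space,
                          a[i - 1][j - 1] + p,
                          a[i][j - 1] + space)

    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        # Collect every arrow that explains the value stored at (i, j).
        arrows = set()
        if i > 0 and j > 0:
            p = match if s[i - 1] == t[j - 1] else mismatch
            if a[i][j] == a[i - 1][j - 1] + p:
                arrows.add("diag")
        if i > 0 and a[i][j] == a[i - 1][j] + space:
            arrows.add("up")
        if j > 0 and a[i][j] == a[i][j - 1] + space:
            arrows.add("left")
        move = next(d for d in priority if d in arrows)
        if move == "diag":       # s[i] matched with t[j]
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif move == "up":       # s[i] matched with a space in t
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:                    # space in s matched with t[j]
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return "".join(reversed(row_s)), "".join(reversed(row_t))


for order in (("diag", "up", "left"), ("left", "up", "diag")):
    top, bottom = align("GACGGATTAG", "GATCGGAATAG", priority=order)
    print(top)
    print(bottom)
    print()
```

Changing `priority` changes which of the equally good alignments is returned; the score stays the same.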

Page 10: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Local Comparison

A local alignment between s and t is an alignment between a substring of s and a substring of t

Goal: find the highest scoring local alignment between two sequences

Variation of the basic algorithm (Figure 3.2)
  Each entry (i, j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j] (page 55)
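A minimal sketch of this variation, assuming the same +1, -1, -2 scoring as before (the function name `local_score` is hypothetical). The only changes from the global table are that the first row and column are zero and that every entry may fall back to 0, which corresponds to aligning two empty suffixes.

```python
def local_score(s, t, match=1, mismatch=-1, space=-2):
    """Highest score of an alignment between a substring of s and one of t."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 and column 0 stay 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(
                0,                       # empty suffixes: start a new alignment
                a[i - 1][j] + space,
                a[i - 1][j - 1] + p,
                a[i][j - 1] + space,
            )
            best = max(best, a[i][j])    # the answer may sit anywhere in the table
    return best


print(local_score("TTACGTT", "GGACGGG"))  # shared substring ACG
```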

Page 11: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Semi-Global Comparison

Score alignments ignoring some of the end spaces in the sequences

End spaces are those that appear before the first or after the last character in a sequence

For example,

CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------

If we aligned the sequences in the usual way, then

CAGCACTTGGATTCTCGG
CAGC-----G-T----GG
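One way to sketch semi-global comparison is to make all end spaces free, in both sequences. That choice is an assumption of this example (the slide says only that "some" end spaces are ignored), and the function name `semiglobal_score` is hypothetical. Leading end spaces become free by leaving the first row and column at 0; trailing end spaces become free by taking the best entry in the last row or last column instead of the bottom-right corner.

```python
def semiglobal_score(s, t, match=1, mismatch=-1, space=-2):
    """Best alignment score when end spaces in s and t are not charged."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]   # free leading end spaces
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space,
                          a[i - 1][j - 1] + p,
                          a[i][j - 1] + space)
    # Free trailing end spaces: stop anywhere on the last row or last column.
    return max(max(a[m]), max(row[n] for row in a))


print(semiglobal_score("CAGCACTTGGATTCTCGG", "CAGCGTGG"))
```

With end spaces ignored, the first alignment on the slide scores 6 matches, 1 mismatch, and 1 internal space, that is 6 - 1 - 2 = 3, so the function returns at least 3 for this pair. The second alignment, scored in the usual way, gets 8 - 20 = -12.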

Page 12: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Extensions to Basic Algorithm

Basic algorithm has O(mn) time complexity and uses O(mn) space

Possible to improve space complexity from quadratic to linear at the expense of roughly doubling processing time

Can be accomplished by using a Divide and Conquer strategy
  Divide the problem into small subproblems and later combine the solutions to obtain a solution for the whole problem
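The building block of the linear-space method is that the score alone needs only the previous row of the table. The sketch below shows just that building block, not the full divide-and-conquer recovery of the alignment; the function name `similarity_linear_space` is hypothetical and the scoring is the +1, -1, -2 scheme used earlier.

```python
def similarity_linear_space(s, t, match=1, mismatch=-1, space=-2):
    """Global similarity score using O(min(len(s), len(t))) memory."""
    if len(t) > len(s):
        s, t = t, s                  # scoring is symmetric; keep rows short
    prev = [j * space for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * space] + [0] * len(t)
        for j in range(1, len(t) + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j] + space,      # s[i] with a space
                         prev[j - 1] + p,      # s[i] with t[j]
                         cur[j - 1] + space)   # space with t[j]
        prev = cur                   # older rows are no longer needed
    return prev[-1]


print(similarity_linear_space("GACGGATTAG", "GATCGGAATAG"))
```

The divide-and-conquer strategy applies this kind of pass to both halves of the problem to find where an optimal alignment crosses the middle, then recurses on the two resulting subproblems.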

Page 13: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Gap Penalty Functions

A gap is a run of one or more consecutive spaces

When mutations occur, a single block of consecutive spaces is more likely than a series of isolated spaces

The previously discussed scoring method is not appropriate in this case

Page 14: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Gap Penalty Functions

For example,

A------ATTCCTTCCTTCC
AAAGAGAATTCCTTCCTTCC

Scoring is done at a block level, not a column level

A ------ ATTCCTTCCTTCC
A AAGAGA ATTCCTTCCTTCC
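Block-level scoring can be illustrated with an affine gap penalty, where a gap of k spaces costs h + g*k. The slides do not give a specific penalty function, so the affine form and the values h = 2 and g = 1 below are assumptions chosen for the example, as is the function name `score_with_gaps`.

```python
def score_with_gaps(row_s, row_t, match=1, mismatch=-1, h=2, g=1):
    """Score an alignment where each gap of k spaces costs h + g * k."""
    total = 0
    open_gap = None                  # which row the current gap is in, if any
    for x, y in zip(row_s, row_t):
        if x == "-" or y == "-":
            side = "s" if x == "-" else "t"
            if side != open_gap:
                total -= h           # charged once per gap (block of spaces)
            total -= g               # charged for every space in the gap
            open_gap = side
        else:
            total += match if x == y else mismatch
            open_gap = None
    return total


# Slide example: 14 matches and a single gap of 6 spaces -> 14 - (2 + 6) = 6
print(score_with_gaps("A------ATTCCTTCCTTCC", "AAAGAGAATTCCTTCCTTCC"))
# Two isolated spaces cost more than one gap of two spaces:
print(score_with_gaps("A-C-G", "ATCTG"), score_with_gaps("A--CG", "ATTCG"))
```

Under the column-level scheme with -2 per space the slide example would score 14 - 12 = 2, and isolated spaces would cost the same as a block.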

Page 15: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Multiple Sequences

Multiple sequence alignment is a generalization of the two-sequence case

A multiple alignment of s1, s2, ..., sk is obtained by inserting spaces in the sequences in such a way as to make them all the same size

No column is made entirely of spaces

Figure 3.10

Page 16: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Scoring Multiple Sequences

Need a function that takes amino acid sequences as input and returns a score

The function must have two properties
  It must be independent of the order of its arguments. For example, if a column has I,V,- the same score should be produced if the order is -,V,I
  It should reward the presence of many equal residues and penalize unequal residues and spaces

Page 17: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Sum-of-Pairs (SP)

Sum-of-Pairs (SP) satisfies the properties

Sum of pairwise scores of all pairs of symbols in a column

SP-score(I,-,I,V) = p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V)

where p(a,b) is the pairwise score of a and b
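The SP score of a column can be sketched directly from this definition. The pairwise function `p` below reuses the +1, -1, -2 scheme from the two-sequence case and sets p(-,-) = 0; both choices are assumptions for illustration, since for proteins p would normally come from a substitution matrix.

```python
from itertools import combinations


def p(a, b):
    """Illustrative pairwise score: +1 match, -1 mismatch, -2 space, 0 for two spaces."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def sp_score(column):
    """Sum of pairwise scores over all unordered pairs of symbols in a column."""
    return sum(p(a, b) for a, b in combinations(column, 2))


# p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V) = -2 + 1 - 1 - 2 - 2 - 1
print(sp_score(["I", "-", "I", "V"]))
# Reordering the column does not change the score:
print(sp_score(["I", "V", "-"]) == sp_score(["-", "V", "I"]))
```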

Page 18: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Algorithm Paradigm

Dynamic programming is used again

Basic algorithm can be used, but there will be problems

In the two-sequence case, complexity is O(n^2)

For the k-sequence case, complexity is O(n^k)

Can take a really long time if k is large

Page 19: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Algorithm Paradigm

Must reduce the number of cells to compute

Apply a heuristic to reduce the number of computed cells

Page 20: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Star Alignments

Building a multiple alignment based on pairwise alignments between a fixed sequence and all others

Fixed sequence is the center of the star

Page 21: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Star Alignments

Example

a = ATTGCCATT
b = ATGGCCATT
c = ATCCAATTTT
d = ATCTTCTT
e = ACTGACC

Select a as the center of the star

Page 22: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Star Alignments

Align
  a with b
  a with c
  a with d
  a with e

Page 23: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Star Alignments

ATTGCCATT
ATGGCCATT

ATTGCCATT--
ATC-CAATTTT

ATTGCCATT
ATCTTC-TT

ATTGCCATT
ACTGACC--

Page 24: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Star Alignments

Combine results

ATTGCCATT--
ATGGCCATT--
ATC-CAATTTT
ATCTTC-TT--
ACTGACC----
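The combining step can be sketched as follows. The pairwise alignments are taken as given (they are the ones from the previous slide); the function name `merge_star` is hypothetical. The idea is to record, for every position of the center sequence, how many spaces each pairwise alignment inserted before it, keep the largest count, and pad every other row up to that count.

```python
def merge_star(center, pairs):
    """Merge pairwise alignments (center_row, other_row) into one multiple alignment."""
    n = len(center)

    def gap_profile(center_row):
        # prof[i] = spaces placed before center[i]; prof[n] = trailing spaces
        prof = [0] * (n + 1)
        i = 0
        for ch in center_row:
            if ch == "-":
                prof[i] += 1
            else:
                i += 1
        return prof

    profiles = [gap_profile(c) for c, _ in pairs]
    widest = [max(prof[i] for prof in profiles) for i in range(n + 1)]

    def expand(other_row, prof):
        out, k = [], 0                   # k walks over the pairwise columns
        for i in range(n + 1):
            out.append(other_row[k:k + prof[i]])       # columns this pair already has
            out.append("-" * (widest[i] - prof[i]))    # padding for the other pairs
            k += prof[i]
            if i < n:
                out.append(other_row[k])               # column holding center[i]
                k += 1
        return "".join(out)

    center_line = "".join("-" * widest[i] + (center[i] if i < n else "")
                          for i in range(n + 1))
    return [center_line] + [expand(o, prof)
                            for (_, o), prof in zip(pairs, profiles)]


pairs = [
    ("ATTGCCATT", "ATGGCCATT"),
    ("ATTGCCATT--", "ATC-CAATTTT"),
    ("ATTGCCATT", "ATCTTC-TT"),
    ("ATTGCCATT", "ACTGACC--"),
]
for row in merge_star("ATTGCCATT", pairs):
    print(row)
```

For the slide's input this reproduces the five combined rows shown above.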

Page 25: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

Database Search

Databases exist for searching and comparing protein and DNA sequences

The methods described work, but may take too long and be impractical for searching large databases

Novel and faster methods have been developed

Page 26: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

PAM Matrix

When scoring protein sequences, the +1, -1, -2 scheme may not be sufficient

Amino acids have properties that influence the likelihood that they will be substituted in an evolutionary scenario

Page 27: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

PAM Matrix

Point Accepted Mutations

A 1-PAM matrix is suitable for comparing sequences that are 1 unit of evolution apart

A 250-PAM matrix is suitable for comparing sequences that are 250 units of evolution apart

Page 28: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

PAM Matrix

Markovian in nature

Need the probability of occurrence for each amino acid

Probability transition matrix

Score matrix
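The relationship between these pieces can be sketched on a toy example. Everything numeric below is invented for illustration: a two-letter alphabet instead of 20 amino acids, made-up occurrence probabilities `freq`, and a made-up 1-PAM transition matrix `M1`. The sketch assumes the usual construction in which the k-PAM transition matrix is the k-th power of the 1-PAM matrix and the score is a log-odds value, 10 * log10(M^k[a][b] / freq[b]).

```python
import math


def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    size = len(A)
    return [[sum(A[i][x] * B[x][j] for x in range(size)) for j in range(size)]
            for i in range(size)]


def pam_transition(M1, k):
    """k-PAM transition matrix: k steps of the 1-PAM Markov chain."""
    result = M1
    for _ in range(k - 1):
        result = mat_mul(result, M1)
    return result


def pam_scores(Mk, freq):
    """Log-odds score matrix derived from transition probabilities."""
    size = len(Mk)
    return [[10 * math.log10(Mk[a][b] / freq[b]) for b in range(size)]
            for a in range(size)]


freq = [0.6, 0.4]                    # toy occurrence probabilities
M1 = [[0.990, 0.010],                # toy 1-PAM matrix; each row sums to 1
      [0.015, 0.985]]

M250 = pam_transition(M1, 250)
for row in pam_scores(M250, freq):
    print([round(x, 3) for x in row])
```

In this toy setup, keeping the same letter scores above zero and substituting scores below zero, and the score matrix comes out symmetric because the toy numbers were chosen so that freq[a] * M1[a][b] equals freq[b] * M1[b][a].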

Page 29: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

BLAST

Among the most frequently used programs for searching sequence databases

Acronym for Basic Local Alignment Search Tool

Returns a list of high-scoring segment pairs between the query sequence and sequences in the database

http://www.ncbi.nlm.nih.gov

Page 30: Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

FAST

Another family of programs for sequence database search

http://www.rcsb.org/pdb/index.html

BLAST and FAST use PAM matrices