sequence alignment by dynamic programming
DESCRIPTION
Bios 560R, Lecture 2. Genome sequencing (very brief). O verview of HGP Shotgun sequencing Next generation sequencing. Sequence Alignment by Dynamic Programming. Overview The difficulty The scoring system Dynamic programming for pairwise alignment. - PowerPoint PPT PresentationTRANSCRIPT
Sequence Alignment by Dynamic Programming
Bios 560R, Lecture 2
OverviewThe difficultyThe scoring systemDynamic programming for pairwise alignment
Genome sequencing (very brief)Overview of HGPShotgun sequencingNext generation sequencing
1977. Allan Maxam and Walter Gilbert at Harvard University and Frederick Sanger at the U.K. Medical Research Council (MRC) independently develop methods for sequencing DNA.1988. NIH establishes the Office of Human Genome Research.1995. Patrick Brown of Stanford and colleagues publish first paper using a printed glass microarray of complementary DNA (cDNA) probes
Researchers at Whitehead and Généthon (led by Lander and Thomas Hudson at Whitehead) publish a physical map of the human genome containing 15,000 markers.1996. NIH funds six groups to attempt large-scale sequencing of the human genome.
Affymetrix makes DNA chips commercially available. An international consortium publicly releases the complete genome
sequence of the yeast1998. a new company named Celera.
NIH announces a new project to find SNPs2001. The HGP consortium publishes its working draft in Nature (15
February), Celera publishes its draft in Science (16 February).2006. Sequence of all chromosomes finalized.
Human genome project (HGP) – some milestones
Science 9 February 2001: Vol. 291. no. 5507, p. 1195
Shotgun sequencing.
Shotgun: fires a number of small spherical pellets.
The genome DNA (~ 3 billion base pairs, in 22 autosomes and 2 sex chromosomes) is too long for sequencing directly.
The DNA is randomly broken is to small pieces. The pieces are sequenced.The short sequences are assembled into long sequences based on the
overlapping of fragments.
Genome sequencing strategies
http://www.shotgun-insight.com/expressFibreVsPlastic/proFibre5_25yds80percent.jpg
Genome
Reads
Contigs(islands)
The Lander-Waterman model. Genomics. 1988 Apr;2(3):231-9.
Important result about coverage (direct quotes from the paper):
Genome sequencing strategies
“a greater risk of long-range misassembly”
“additional preliminary work”
Proc Natl Acad Sci U S A. 2002 March 19; 99(6): 3712–3716. HGP Celera
Genome sequencing strategies
Whole genome using next generation sequencing ?
Next generation sequencing generates much shorter reads at higher throughput.
nature biotechnology 26(10):1135. Sanger method Next Generation
Whole genome using next generation sequencing ?
Nature Reviews Drug Discovery 12, 358–369 (2013)
Application of next-generation sequencing (NGS) in various ways:
Between protein and gene sequences- The gene sequence (DNA/RNA) are made of 4 characters
(nucleotides)- The protein sequence is made of 20 characters (amino acids)- The DNA codons:
- When discussingalignment, we considerboth DNA sequenceand protein sequence.
Sequence alignment• Alignment answers the question of whether two or more
sequences are (evolutionarily) related.
• Example: A new gene is found in human. We wish to study its properties. To get a hint, we try to find its corresponding part in mouse. Among the tens of thousands of genes in mouse, which is the one that’s most related to this human gene? (BLAST.)
• Sequence segments more important to biological functions are more conserved. Align proteins sharing a function to find peptide sequences more important to the function.Align DNA sequences upstream to (functionally or evolutionarily) related genes to find segments that bind transcription factors.
http://pevsnerlab.kennedykrieger.org/wiley/bestfit2.gif
Sequence alignment
• Alignment serves as the basis for determining evolutionary relationships.
• Example:
http://www.computational-genomics.net/case_studies/sabertooth_demo_37.png
Sequence alignment
In the design and/or data interpretation of sequence-based high-throughput technologies, sequence comparison is necessary:
- To find microarray probes that do not have similar sequence with other genes.
- To match sequences in high-throughput sequencing data to genome.
- To find motifs based on ChIP-chip/ChIP-seq data.……
Sequence alignment
Difficulty in alignmentThe huge search space (the number of possible alignments) !
Sequence 1: GSAQVKSequence 2: GNPKVKGSAQVK----- GSAQVK---------GNPKVK ----GNPKVK
GSAQ--VK -----GSAQVK-G-NPKVK GNPKVK-----
………………… How to pick the best??
The scoring system• A quasi-statistical model log-likelihood ratio:
ii
ii
ii
ii
PyPx
yPx
PyPx
yPx
RyxP
MyxPloglog
)|,(
)|,(log
x: sequence segment 1, y: sequence segment 2i: aligned positionPxiyi: Prob(xi and yi are in aligned position)Pxi: Prob(xi occurs)Pyi: prob(yi occurs)
M: this aligned modelR: x, y are independent sequences
The scoring systemNote: this kind of “model” is for illustration purposes only!
There are many possible models: M1, M2, ……, Mn
According to the Bayes Rule,
i
jj
iii MyxP
MPMyxP
MPMyxPyxMP |,
)(|,
)(|,,|
This is MLE also since we are using an uninformative prior.
The scoring system• Given the above, the best alignment model is the one that maximizes
ii
ii
PyPx
yPxlog
Notice that there are a limited number of possible xi, yi pairs. For amino acids, there are 20 choose 2 pairs, plus any amino acid against Gap (all equal values : the “gap penalty”).
How do we get these values ????
The scoring system• These values are obtained from some high-confidence alignments:
GAPVKFC GAPKKFC
ii
ii
PyPx
yPxlog
The log likelihood ratio of the pair occurring as an aligned pair as opposed to an unaligned pair.
It can be obtained in multiple ways. Generally, such a matrix describes the rate that one character (amino acid or nucleotide) is replaced by another one during evolution.
The scoring systemPAM (Point Accepted Mutation) matrix is calculated from closely related sequences. Finding aligned positions is simple as there are only a few positions where the two sequences are different. The number following “PAM” denotes evolutionary distance. PAM1 matrix considers only direct mutations.
Taking this matrix to N-th power with proper normalization yields PAM-N matrix.
The scoring system
http://www.brc.dcs.gla.ac.uk/~drg/courses/bioinformatics_mscIT/slides/slides3/img033.gif
BLOSUM (BLOcks of Amino Acid SUbstitution Matrix).
It comes from blocks of aligned sequences with less similarity than those used by PAM.
The log-odds ratio are calculated directly from the observed alignment. There is no extrapolation as PAM.
The scoring system
The scoring system
http://en.wikipedia.org/wiki/Image:BLOSUM62.gif
Dynamic programming• Now, for every pair of amino acids, we have a score that can be considered the log likelihood ratio. So what? Do we still need to perform endless computations to compare all models?
• The answer is of course NO.• Different models share sub-models. We should not re-calculate the score
• For non-optimal models, we do not need to know their exact score.
GSAQVK G-SA-QVK GSA-QVKGNPKVK GN-PK-VK G-NPKVK
“Two sledgehammers of the algorithms craft, dynamic programming and linear programming”
Dynamic programming:
Breaking the overall optimization problem into overlapping smaller problems;
Solve each sub-problem once, and reuse the results, thus reducing the computing cost (dramatically);
Often working backward.
Dynamic programming
A simple example:
Find the shortest path from S to E in the directed acyclic graph below.
Take node D as an example. The way to get to D is through B or C. So,
Algorithms, Dasgupta et al.
Dynamic programming
Compare:(1) Exhaustive approachSABE: 1+6+2SCABE: 2+4+6+2SCDE: 2+3+1SCABDE: 2+4+6+1+1
(2) DP approachDist(A)=min(1, 2+4)=1Dist(C)=2Dist(B)=min(dist(A)+6)=1+6=7Dist(D)=min(dist(B)+1, dist(C)+3)=min(7+1, 2+3)=5Dist(E)=min(dist(B)+2, dist(D)+1)=min(7+2, 5+1)=6
11 additions.Complexity grows exponentially with the size of graph
6 additions.Complexity grows linearly with the size of graph
Dynamic programming
Another example: A game of picking up matches.There are 30 matches on the table. You start by picking up 1~3 matches, then your opponent picks up 1~3 matches. This goes on until the last match is picked up. The person picking up the last is the loser.
Last step: you want to leave the last match to your opponent
One step back: this is what you want to leave to your opponent
The first step: this is what you want to leave to your opponent 29x
1, 5, 9, 13, 17, 21, 25, 29
Two steps back: this is what you want to leave to your opponent
Operations Research: mathematical programming, Winston WL
Dynamic programming
Dynamic programmingTwo critical components of dynamic programming sequence alignment:
(1)The assumption that the score of an aligned pair of amino acids is independent of other parts of the alignment.
(2)Due to the additive nature of the likelihood, any sub-alignment of the optimal overall alignment must also be optimal. Thus the search for the global optimum becomes a sequential search for local optima.
Dynamic programmingAny sub-model of the best model must be optimal.Example. if this is the best model,
GS-AQVKG-NPKVK
Then any part of it, for example the middle part
-AQNPK
Must be better than any other possible sub-model. Otherwise, say if it is not as good as
A-QNPK
Then, GSA-QVKG-NPKVK
Is better than the “best” model.
Dynamic programming is used when the overall solution can be achieved by recursively solving sub-problems.
It saves computing time by re-using results of submodels.
GSAQVK G-SA-QVK GSA-QVKGNPKVK GN-PK-VK G-NPKVK
In contrast, the “brute-force” method that simply lists all possible alignments and re-calculate the scores for every model solves the same sub-problem over and over again.
Dynamic programming
Needleman-Wunsch algorithm for global alignment
Mi,j = MAXIMUM[
Mi-1, j-1 + Si,j (match/mismatch in the diagonal), Mi,j-1 + w (gap in sequence #1), Mi-1,j + w (gap in sequence #2)
]
Create a matrix of dimension (m+1)x(n+1), where m and n are the lengths of the two sequences. Let M(0,0)=0Iteratively calculate other values in the matrix, while recording which way (align, gap-x, gap-y) led to the maximum for each cell.
Needleman-Wunsch algorithm
http://www.matfys.kvl.dk/databehandling/f2004/eksempler/word/needleman-wunsch.png
Smith-Waterman algorithm
The previous algorithm looks for the best OVERALL alignment between two sequences. A more common task is to find best SUB-SEQUENCE alignments between two sequences.
Mi,j = MAXIMUM[
0,Mi-1, j-1 + Si,j (match/mismatch in the diagonal), Mi,j-1 + w (gap in sequence #1), Mi-1,j + w (gap in sequence #2)
]
CommentsIn biological sequences, a gap is often longer than 1. For example, the insertion of a new functioning domain. A long gap may be penalized too much by the previous methods.
- assign two different penalties:gap initiation penalty.gap extension penalty.
Now at every position, 3 values, instead of 1, are kept.
Significance of an alignment?
- Bayesian approach model comparison.- significance derived from extreme value
distribution.
Difficulty in higher-dimensionsWhen we have two sequences to align, we can use dynamic programming with o(n^2) calculations. Can we use the same idea when we have more sequences?It is hard. For example, when we have 3 sequences, instead of operating in a square, we will be operating in a cube. The number of calculations explodes with more sequences.
http://www.bioinf2.leeds.ac.uk/b/MA_sect.gif