design of optimal multiple spaced seeds for homology search
DESCRIPTION
Design of Optimal Multiple Spaced Seeds for Homology Search. Jinbo Xu School of Computer Science, University of Waterloo Joint work with D. Brown, M. Li and B. Ma. Overview. Seed-based homology search Optimal multiple spaced seeds LP based randomized algorithm Experimental results - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/1.jpg)
Design of Optimal Multiple Spaced Seeds for Homology Search
Jinbo XuSchool of Computer Science, University
of WaterlooJoint work with D. Brown, M. Li and B.
Ma
![Page 2: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/2.jpg)
Overview
Seed-based homology search Optimal multiple spaced seeds LP based randomized algorithm Experimental results Future work
![Page 3: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/3.jpg)
Homology Search
Exhaustive search algorithm e.g. Smith-Waterman algorithm 100% sensitivity infeasible if the database is large
Suffix tree Seed-based algorithm, e.g. BLAST,
PatternHunter
Given: database of DNA sequences, query sequence QTask: extract all homologous sequences of Q from the database.
![Page 4: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/4.jpg)
Region and Seed
S1: AGCTTGCCGTAAACCGS2: ACGTAGCACTGAGCTGRegion model: 1001011001010101
seed: 10010010111: a required match0: “don’t care”seed length M: length of the stringseed weight W: the number of 1 in the seed
![Page 5: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/5.jpg)
Seed-based Hit
ACGCGTGGGAAACC
CAATGTGGGCAATT11011011
00001111101100
seed
region
Given a seed, a query sequence hits another sequence ifand only if the seed hits a region model of both sequences.
Query:
A seed S hits a region R at position i if and only ifR[i+j]=1 for every position j where s[j]=1
![Page 6: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/6.jpg)
Single Seed Based Algorithm
Query: GGAAGCTTGCCGTATGCCATAGS1: CCAGGCTAGCCATAGGCCTTCT
Seed:101110111011011101
Length=18, weight=13
Query: GGAAGCTTGCCGTATGCCATAGS2: CCAGGCATGCAGTAGGCCTTCT
S1 hit, but S2 missed.
![Page 7: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/7.jpg)
Multiple Seeds Based Algorithm
Query: GGAAGCTTGCCGTATGCCATAGS1: CCAGGCTAGCCATAGGCCTTCT
seed1:101110111011011101
Length=18, weight=13
Query: GGAAGCTTGCCGTATGCCATAGS2: CCAGGCATGCAGTAGGCCTTCT
seed2:101101110111011101
Both S1 and S2 are hit
![Page 8: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/8.jpg)
Optimal Multiple Seeds (OMS) Problem
Given: random region R under certain distribution, two integers M and W, and an integer k.Find: set of k seeds with weight W and length no more than M to maximize the hit probability of R.
![Page 9: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/9.jpg)
Mandala (J. Buhler et al.) Hill Climbing, good for small k, no result
reported for k>4 Greedy + Monte Carlo sampling
Greedy Algorithm (M. Li and B. Ma et al.) Given i seeds (i=1,2,…,k-1), search for the
next seed by maximizing the incremental sensitivity
Vector Seeds (B. Brejova et al.)
Related Work
![Page 10: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/10.jpg)
Seed Specific OMS problem: Given a random region R, a set of m seeds , and an integer k, find a set of k seeds out of , to maximize the hit probability of R.
Seed-Region Specific OMS problem: Given a set of m seeds , an integer k and a set of
regions , find a set of k seeds, to maximize the hits of .
Variants of OMS
msss ,...,, 21
msss ,...,, 21
msss ,...,, 21
NRRR ,...,, 21
NRRR ,...,, 21
![Page 11: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/11.jpg)
Main Results:1. Approximation ratio by a greedy algorithm
(D.S. Hochbaum)2. Same approximation ratio by linear programming based
randomized algorithm 3. is tight unless P=NP (U. Feige)
Given a ground set H and its subsets and an integer k, Find k sets out of to cover H as much as possible.
Maximum Coverage (MC) problem
632.01 1 e
mHHH ,...,, 21
11 e
mHHH ,...,, 21
![Page 12: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/12.jpg)
OMS vs. MC ProblemOMS
Region Sampling
Seed Specific OMS
Seed-Region Specific OMS=MC Problem
Seed Enumeration
iS seedby hit regions ofset the:iH
![Page 13: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/13.jpg)
Region Model
3-bits
000 001 010 011 100 101 110 111
p .1426
.0573
.1236
.0660
.0710
.0364
.2335
.2696
1. PH: length 64 and each bit independently set to 1 with probability 0.7 (B. Ma et al.)
2. M3: length 64 and each bit independently set to 1 with probability 0.8 if i%3=1 or 2, 0.5 otherwise (B. Brejova et al.)
3. M8: length 63 and each codon satisfy a certain distribution (B. Brejova et al.)
4. HMM: average length 90, two adjacent codons are not independent (B. Brejova et al.)
![Page 14: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/14.jpg)
Observations
1. PH model: the hit probability of any seed with weight 11 and length 18 is at least 0.30
2. M3 model: the hit probability of any seed with weight 11 and length 18 is at least 0.27
3. HMM model: the hit probability of any seed with weight 11 and length 18 is at least 0.70
![Page 15: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/15.jpg)
Variant of MC Problem
possible. asmuch as cover to
,...,, ofout sets find elements, ||least at contains
whichofeach ,,...,, subsets its and set ground aGiven
21
21
H
HHHkH
HHHH
m
m
Can we have a better approximation ratio?
![Page 16: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/16.jpg)
If the sensitivity of each seed is at least and the optimal linear solution is , then the LP based randomized algorithm guarantees to generate a solution with approximation ratio at least
for the seed-region specific OMS problem.
Better Approximation Ratio
)1)(1()1( 1
)1()1( *
*
*
*
ee
lklkk
lklk
*l
![Page 17: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/17.jpg)
Theoretical Results
![Page 18: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/18.jpg)
Practical Approximation Ratio
)11( 02146.0
)(
)(
)(**
l
Ap
AP
Ap
the optimal seed set for the random region R the best seed set found by the LP based algorithm
*AA
with probability 0.99
If 5000 regions are sampled, then we have
![Page 19: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/19.jpg)
Practical Approximation Ratio (W=10)
![Page 20: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/20.jpg)
Practical Approximation Ratio (W=11)
![Page 21: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/21.jpg)
Test Data
All-against-all comparison between mouse EST sequences and human EST sequences by Smith-Waterman algorithm
3346700 pairs found with local alignment score no less than 16
score
16-20
20-30
30-40
40-50
50-60
60-70
70-80
80-90
>90
ratio 93 6.3 0.26 0.06 0.06 0.02 0.01 0.02 0.28
label
1 2 3 4 5 6 7 8 9
![Page 22: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/22.jpg)
Performance of PH Seeds
![Page 23: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/23.jpg)
Performance of HMM Seeds
![Page 24: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/24.jpg)
4 HMM Seeds vs. 1 HMM Seed
![Page 25: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/25.jpg)
Greedy vs. LP
![Page 26: Design of Optimal Multiple Spaced Seeds for Homology Search](https://reader036.vdocuments.mx/reader036/viewer/2022062422/56813e3a550346895da81eb3/html5/thumbnails/26.jpg)
LP-based algorithm gives a mathematical foundation
LP-based algorithm is also good in practice Time complexity is exponential to . Is there
an approximation algorithm without enumerating seeds?
Better approximation ratio by Greedy algorithm?
Summary and Future Work
WM 2