
Novel Computational Techniques for Mapping Next-Generation Sequencing Reads

Thesis Proposal
May 26, 2017

Hongyi Xin

Computer Science Department
School of Computer Science
Carnegie Mellon University

Thesis Committee
Prof. Carl Kingsford (Chair)
Prof. Jian Ma
Prof. Phil Gibbons
Prof. Iman Hajirasouliha
Dr. Bill Bolosky


Contents

1 Problem and Thesis Statement

2 Related Work
  2.1 BWA-MEM
  2.2 Bowtie2
  2.3 SNAP
  2.4 Hobbes2
  2.5 GEM

3 Subproblem 1: Intelligent Seed Extraction
  3.1 Cheap K-mer Selection (CKS)
  3.2 Optimal Seed Solver (OSS)
    3.2.1 The core dynamic programming algorithm of OSS
    3.2.2 Optimizations of OSS
  3.3 The Leaping Toad Problem
    3.3.1 The full algorithm of OSS
  3.4 Results

4 Subproblem 2: Efficient Filtration of Incorrect Seed Locations
  4.1 Adjacency Filtering
  4.2 Shifted Hamming Distance
    4.2.1 Speculative Removal of Short-Matches (SRS)
    4.2.2 The full algorithm of SHD
  4.3 Results

5 Subproblem 3: High Performance Extension
  5.1 The Leaping Toad Problem
  5.2 LEAP: The general solution of the Leaping Toad Problem
    5.2.1 The full algorithm of LEAP
  5.3 Results

6 Subproblem 4: Accurate and Efficient Read Cloud Mapping
  6.1 Modeling 10X Genomics Read Clouds
  6.2 Mapping 10X Read Clouds with Magneto
  6.3 Mapping split reads

7 Goal and Timeline for Graduation


1 Problem and Thesis Statement

The emergence of massively parallel sequencing technologies, commonly known as high-throughput sequencing platforms or next-generation sequencing (NGS) platforms, during the past decade triggered a revolution in the field of genomics. These platforms generate billions of short DNA fragments, called reads, of a target donor genome. The first step in analyzing these reads is to map them to a reference genome. This process is commonly known as read mapping.

Formally, read mapping is essentially a fuzzy text search problem: given a short read and a long reference,find locations in the reference genome where the read and reference are highly similar.

Read mapping is computationally expensive. There are four main challenges: 1) In a typical clinical setup, the mapper maps billions of short reads for each target donor genome. 2) The reference genome is long and contains a sizable portion of repetitive patterns. 3) Reads are short, so if some come from repetitive regions, it is challenging to find their exact mappings. 4) Individuals from the same species carry slight genetic variations in their genomes (single nucleotide variation frequency is estimated to be below 1 per 1000 nucleotides). Furthermore, sequencers can also induce errors at low frequencies (a 0.5% chance of altering a base pair). As a result, reads are often not 100% identical to the reference genome, and the mapper has to find mappings despite minor differences.

The key strategy used in modern mappers is the seed-and-extend heuristic. It exploits the observation that, since variations between the donor genome and the reference are infrequent, most reads should contain large fractions that are free of genetic variations and sequencer errors. Therefore, instead of scanning through the entire reference and computing similarity at every location, seed-and-extend based mappers assume that parts of the read, called seeds, are error free, and only check for possible mappings at reference locations where seeds appear.

Seed-and-extend mappers usually draw multiple seeds from a read. They map reads in three steps: seeding, filtering and extension. The seeding step determines how many seeds should be sampled and where to sample them in a read. The filtering step efficiently filters out obviously incorrect seed locations, which contain too many errors. The extension step computes an alignment as well as a similarity (or dissimilarity) score between the read and the reference at each seed location. The more seeds a mapper uses and the more seed locations it examines, the more extensions it has to compute, and hence the greater its runtime. On the flip side, the more seeds a mapper extracts, the more tolerant it is to errors (including genetic variations and sequencer errors), and hence the higher its sensitivity (the ability to find the correct mapping).
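As a minimal illustration of these three steps, consider the following toy sketch (not any specific mapper's implementation; index is assumed to be a precomputed map from seed strings to their reference positions, and the filter is a simple substitution-only check standing in for the real filters of Section 4):

def map_read(read, reference, index, e):
    # Toy end-to-end seed-and-extend pipeline.
    k = len(read) // (e + 1)                  # e+1 non-overlapping seeds
    candidates = set()
    # 1) Seeding: turn seed hits into candidate read start positions.
    for i in range(e + 1):
        seed = read[i * k:(i + 1) * k]
        for pos in index.get(seed, []):
            candidates.add(pos - i * k)
    mappings = []
    for loc in candidates:
        if loc < 0:
            continue
        window = reference[loc:loc + len(read)]
        # 2) Filtering: cheap, substitution-only check (illustrative only).
        if sum(a != b for a, b in zip(read, window)) > 2 * e:
            continue
        # 3) Extension: full alignment score at surviving locations.
        dist = edit_distance(read, window)
        if dist <= e:
            mappings.append((loc, dist))
    return mappings

def edit_distance(a, b):
    # Textbook O(|a||b|) dynamic programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]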

Mappers often cannot obtain high speed and high sensitivity simultaneously. For high speed, a mapper has to reduce the number of sampled seeds, which reduces sensitivity and increases error proneness, and vice versa.


We explore and evaluate novel computational techniques that can improve the speed, sensitivity and accuracy of NGS mappers.

Specifically, we subdivide the mapping problem into four subproblems and aim at developing computational techniques to solve each of them: intelligent seed extraction, efficient filtration of incorrect seed locations, high performance extension, and accurate and efficient read cloud mapping.

Below we briefly describe our goal in solving each subproblem:

• Intelligent seed extraction focuses on finding low frequency seeds (seeds that appear less frequently in the reference genome) while maintaining a high seed count for high sensitivity.

• Efficient filtration of incorrect seed locations targets rejecting obviously incorrect mappings before they reach the costly extension stage. We propose two simple and efficient mechanisms that reject error-abundant sequence pairs quickly.

• High performance extension aims at speeding up the extension process while accurately measuring the genetic similarity between the read and the reference.

• Accurate and efficient read cloud mapping addresses the computational challenges of mapping NGS read clouds, which are groups of NGS reads that are sequenced in batches from nearby locations in the donor genome.

2 Related Work

Many mappers have been proposed to solve the NGS mapping problem. This section introduces the most commonly used mappers in the field and their approaches to the proposed subproblems. Readers familiar with modern popular read mappers can skip this section and refer back to it when necessary.

2.1 BWA-MEM

BWA-MEM [7] uses non-overlapping maximal exact matches (MEMs) in a read as seeds. MEMs are exact matches between two strings that cannot be extended in either direction, towards the beginning or end of the two strings, without allowing a mismatch. The key idea is that the longer a seed is, the less frequent it is and the less likely it is to contain an error. Unless the read itself is from a genomic repeat, the probability of a MEM containing an error equals (8·L·|Genome|)/4^L, where L is the length of the read. Therefore, when BWA-MEM finds a long MEM, it is highly likely that it has found a high quality seed. As a failsafe mechanism, in case the correct mapping does not contain any MEMs, BWA-MEM also uses a re-seeding mechanism. When a MEM is longer than a threshold (28 base pairs by default), BWA-MEM discards the original seed and finds a new MEM that covers the center of the original seed with a higher seed frequency. BWA uses the Burrows-Wheeler Transform and FM-index [6] to index the reference genome as well as to search for MEMs.


To filter incorrect mappings, BWA-MEM uses an idea called chaining. The basic idea is to greedily chain seeds that have nearby locations, while dynamically discarding short chains in favor of longer chains, as sketched below.
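The following sketch illustrates the chaining idea (a simplified illustration, not BWA-MEM's implementation; the gap and diagonal tolerances are arbitrary example values):

def chain_hits(hits, max_gap=100, max_diag_diff=5):
    # A hit is (read_pos, ref_pos, length). Hits whose diagonal
    # (ref_pos - read_pos) is close to a chain's last hit and that lie
    # within max_gap of it are appended to that chain.
    chains = []
    for read_pos, ref_pos, length in sorted(hits, key=lambda h: h[1]):
        for chain in chains:
            last_read, last_ref, _ = chain[-1]
            if (abs((ref_pos - read_pos) - (last_ref - last_read)) <= max_diag_diff
                    and 0 < ref_pos - last_ref <= max_gap):
                chain.append((read_pos, ref_pos, length))
                break
        else:
            chains.append([(read_pos, ref_pos, length)])
    # Rank chains by total seed length; short, dominated chains can then
    # be discarded in favor of longer ones.
    chains.sort(key=lambda c: sum(h[2] for h in c), reverse=True)
    return chains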

BWA-MEM also uses a greedy extension mechanism. It prioritizes the extension of seed locations from the longest seed chain, as it has the highest likelihood of being the best mapping. Given the best mapping score of a read, BWA-MEM sets a similarity threshold by cutting a slack from the currently known best mapping score. If a potential mapping exceeds the threshold in the middle of an extension, the extension process is terminated, as the potential mapping is guaranteed to have a low similarity score.

2.2 Bowtie2

Bowtie2 [9] samples fixed-length, long seeds at fixed locations in the read: the beginning, the middle and the end. It then checks the frequency of each seed and prioritizes seed locations of the less frequent seeds, by randomly selecting seed locations with the probability of selecting locations from each seed being inversely proportional to the seed's frequency. The idea is that ultra-frequent seeds are more likely to produce incorrect mappings, so their locations are deprioritized.

Bowtie2 does not filter seed locations. Instead, it self-terminates the mapping of a read when the read's mapping score stops improving as more seed locations are extended. The key idea is that, since less frequent seeds have a high probability of containing the correct mapping location, it is highly likely that Bowtie2 reaches the best mapping of the read quickly. After obtaining the best mapping, Bowtie2 stops quickly without wasting too much computation. In the rare case where the best mapping lies in a more frequent seed, then, based on the observation that incorrect seed locations generally give very low similarity scores, the best mapping score of the read keeps improving, which prevents Bowtie2 from terminating the mapping. In a nutshell, Bowtie2 tries to find the best mapping fast and terminates soon after obtaining it, while it keeps running when it cannot get to the best mapping quickly. A sketch of such a stopping rule follows.
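The following sketch captures the flavor of such a self-terminating search (an illustration under our own assumptions, not Bowtie2's exact policy; the patience parameter and the extend interface are hypothetical):

def extend_with_self_termination(locations_by_priority, extend, patience=8):
    # Extend candidate locations in priority order (least frequent seeds
    # first) and stop once the best score has stopped improving for
    # `patience` consecutive extensions.
    best = None
    stale = 0
    for loc in locations_by_priority:
        score = extend(loc)
        if best is None or score > best[1]:
            best, stale = (loc, score), 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best  # (location, score) of the best mapping found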

Bowtie2 uses a standard Needleman-Wunsch alignment algorithm [14].

2.3 SNAP

SNAP [13] samples overlapping fixed-length long seeds from a read. It discards seeds with high frequencies to avoid iterating through genomic repeats. SNAP ranks all seed locations based on how many seeds have corresponding locations in their own lists. It prioritizes extension of locations that appear in the most seeds, as such locations have the highest probability of being the best mapping. It then uses the best mapping score to filter the remaining seed locations.

Similar to BWA-MEM, SNAP utilizes a slack to set a similarity threshold and terminates an extension when the similarity score drops below the threshold. However, SNAP also filters reads using a customized q-gram filter. When the number of seeds supporting a location drops below the q-gram threshold (which is computed from the similarity threshold), SNAP rejects the location without extension.

SNAP uses a SIMD implementation of the Landau-Vishkin algorithm [8] for extension.


2.4 Hobbes2

Hobbes2 [2] uses fixed-length, non-overlapping seeds. Given an error tolerance threshold of e, it uses a dynamic programming method, called Optimal Prefix Selection (OPS), to find the least frequent set of e+1 fixed-length seeds.

Hobbes2 uses a simple letter count filter. It states that, for a good mapping, the differences between the counts of A, C, T and G in the read and the reference should not be greater than e, as sketched below.
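A minimal sketch of this filter, under one plausible reading of the rule (equal-length, substitution-only reasoning: each substitution moves one base count down and another up):

def letter_count_filter(read, ref_window, e):
    # Sum of positive per-letter count differences cannot exceed e for a
    # mapping with at most e substitutions.
    excess = 0
    for base in "ACGT":
        d = read.count(base) - ref_window.count(base)
        if d > 0:
            excess += d
    return excess <= e  # False = reject the pair before extension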

Hobbes2 uses the standard Smith-Waterman alignment algorithm [15] for extension.

2.5 GEM

GEM [12] uses non-overlapping, variable-length seeds with a frequency cap. GEM keeps extending the length of a seed until the seed's frequency drops below a user defined threshold T_freq. If the seeds become too long and there are not enough non-overlapping seeds, GEM falls back to fixed-length seeds.

GEM claims to use a dynamic programming method for extension.

3 Subproblem 1: Intelligent Seed Extraction

To ensure mapping sensitivity with an error tolerance of up to e errors (including both single nucleotide variations and sequencer errors), according to the pigeonhole principle, the mapper has to sample at least e+1 non-overlapping seeds. This way, at least one of the seeds is error-free, assuming the read contains at most e errors.

Using fixed-length, fixed-position seeds limits the performance of a mapper (measured in execution time), due to frequently selecting seeds with high frequencies.

Seeds have two important properties: (i) the frequency of a seed is monotonically non-increasing with greater seed lengths, and (ii) frequencies of different seeds typically differ (sometimes significantly). Although ultra-frequent seeds (seeds that appear more than 10^4 times) are few, they are ubiquitous in the genome. As a result, for a randomly selected read, there is a high chance that the read contains one or more such frequent seeds. We call this phenomenon the frequent seed phenomenon.

Therefore, it is important to develop a better seed selection mechanism that minimizes the total seed frequency while maintaining a high seed count. In this section, we propose two intelligent seed selection mechanisms: Cheap K-mer Selection (CKS) and Optimal Seed Solver (OSS).

3.1 Cheap K-mer Selection (CKS)

Cheap K-mer Selection (CKS) is developed for mappers that use fixed-length k-mers as seeds. CKS divides the read into as many non-overlapping k-mers as possible, sorts all k-mers by frequency, and selects the e+1 least frequent k-mers as seeds, as sketched below. The key observation behind CKS is that most reads contain only one or two k-mers with high frequencies; a simple sort avoids selecting these high-frequency k-mers as seeds.
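A minimal sketch of CKS (seed_freq is assumed to be a lookup function returning a k-mer's frequency in the reference; the name is ours, not part of any mapper's API):

def cheap_kmer_selection(read, k, e, seed_freq):
    # Cut the read into non-overlapping k-mers, sort them by reference
    # frequency, and keep the e+1 least frequent k-mers as seeds.
    kmers = [(read[i:i + k], i) for i in range(0, len(read) - k + 1, k)]
    kmers.sort(key=lambda km: seed_freq(km[0]))
    return kmers[:e + 1]  # (seed, offset-in-read) pairs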

Through our experiments, we observe that CKS on average reduces the total seed frequency by 95.4% compared to a naïve random seed selection mechanism.


3.2 Optimal Seed Solver (OSS)

Instead of using k-mers, the Optimal Seed Solver (OSS) finds the optimal set of variable-length seeds from a read, such that the overall frequency of the seeds is minimized. In other words, OSS studies how to partition a read into e+1 fragments as seeds with minimum total seed frequency.

A major challenge in deriving the optimal set of seeds is the large search space. If we allow a seed to be selected from an arbitrary location in the read with an arbitrary length, then from a read of length L there are L(L+1)/2 possibilities for extracting a single seed. When there are multiple seeds, the search space grows exponentially, since the position and length of each newly selected seed depend on the positions and lengths of all previously selected seeds. For x seeds, there can be as many as O(L^(2x)/x!) seed selection schemes.

Our dynamic programming algorithm, OSS, finds the optimal set of x seeds of a read in O(x·L) operations on average and in O(x·L²) operations in the worst case.

3.2.1 The core dynamic programming algorithm of OSS

The core recurrence relation of OSS states that, given the optimal (x−1)-seed solutions of all prefixes of the read, we can find the optimal x-seed solution of any prefix of the read in O(L_p) time, where L_p is the length of the prefix.

Assume we already know the optimal m-seed solutions of all prefixes of the read. Then for any prefix, to compute the optimal (m+1)-seed solution, we simply divide the prefix into two segments, optimally drawing m seeds from the first segment (which is also a prefix of the read) and one seed from the second segment. Since the optimal m-seed solution of the first segment is already known, we only need to find the optimal 1-seed solution of the second segment, which is done through a simple lookup (assuming a lookup takes O(1) time). We iterate through all possible divisions, and the final optimal (m+1)-seed solution of the prefix is simply the division that yields the minimum total seed frequency. Because the divider can be moved at most L_p times, we can find the optimal solution in O(L_p) time.

To conclude, to compute the optimal set of x seeds from a read R, OSS computes and stores the optimal solutions of prefixes with fewer seeds through x iterations. In each iteration, OSS computes the optimal solutions of all prefixes for a specific number of seeds. In the m-th iteration (m ≤ x), OSS computes the optimal m-seed solutions of all prefixes of R by re-using the optimal solutions computed in the previous, (m−1)-th, iteration. For each prefix, OSS performs a series of divisions and finds the division that provides the minimum total frequency of m seeds. For each division, OSS computes the optimal m-seed frequency by summing up the optimal (m−1)-seed frequency of the first part and the 1-seed frequency of the second part; both frequencies can be obtained from previous iterations. Overall, OSS starts from one seed and iterates up to x seeds. Finally, OSS computes the optimal x-seed solution of R by finding the optimal division of R, reusing results from the (x−1)-th iteration. A direct sketch of this recurrence follows.
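The unoptimized recurrence can be sketched directly as follows (our illustration: optimal_freq(i, j) is assumed to return the optimal 1-seed frequency of read[i:j], and s_min is an illustrative minimum seed length; the optimizations of Section 3.2.2 are omitted):

def optimal_seed_solver(read, x, optimal_freq, s_min=10):
    # opt[m][l] = optimal m-seed frequency of the length-l prefix;
    # overall cost is O(x * L^2) without the optimizations.
    L = len(read)
    INF = float("inf")
    opt = [[INF] * (L + 1) for _ in range(x + 1)]
    for l in range(s_min, L + 1):          # iteration 1: 1-seed solutions
        opt[1][l] = optimal_freq(0, l)
    for m in range(2, x + 1):              # iterations 2..x
        for l in range(m * s_min, L + 1):
            # Try every divider: m-1 seeds left of it, one seed right.
            opt[m][l] = min(
                opt[m - 1][div] + optimal_freq(div, l)
                for div in range((m - 1) * s_min, l - s_min + 1)
            )
    return opt[x][L]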


3.2.2 Optimizations of OSS

With the proposed dynamic programming algorithm, OSS can find the optimal x seeds of an L-bp read in O(x·L²) operations: in each iteration, OSS examines O(L) prefixes, and for each prefix it inspects O(L_p) divisions. In total, there are O(L²) divisions to verify per iteration.

To speed up OSS and reduce the average complexity of each iteration, we propose four optimizations: optimal divider cascading, early divider termination, divider sprinting and optimal solution forwarding. With all four optimizations, we empirically reduce the average complexity of an iteration to O(L).

Optimal divider cascading builds upon the following fact: between two prefixes of different lengths in the same iteration (i.e., under the same number of seeds), the first optimal divider of the shorter prefix (the optimal divider closest to the beginning of the read, if multiple optimal divisions have the same total frequency) must be at the same position or closer to the beginning of the read than the first optimal divider of the longer prefix. An example of optimal divider cascading is depicted in Figure 1.

Figure 1: An example of optimal divider cascading.

3.3 The Leaping Toad Problem

In each iteration, we start with the longest prefix of the read, which is the read itself. We examine all divisions of the read and find its first optimal divider. Then we move to the prefix of length L−1, for which we only need to check dividers at the same or an earlier position than the first optimal divider of the length-L prefix. After processing the length-(L−1) prefix, we move to the length-(L−2) prefix, whose search space is further reduced to positions at the same or an earlier position than the first optimal divider of the length-(L−1) prefix. This procedure repeats until the shortest prefix is processed. Then we move on to the next total seed number.

Similarly, early divider termination also reduces the number of dividers we examine for each prefix. The key idea of early divider termination is that when, between two consecutive dividers, the frequency increase of the first segment is greater than the total frequency of the second segment, then all remaining dividers further backwards are guaranteed to be suboptimal and can be excluded from the search space.


Divider sprinting reveals that, when finding the optimal divider of a prefix, we do not need to check for early termination at every move. Between two consecutive divider positions, when the frequency of the first segment remains unchanged, we need to check neither the frequency of the second segment nor early termination: the latter divider always provides a smaller or equal overall seed frequency.

Finally, optimal solution forwarding states that optimal solutions can also be inherited between prefixes. Between two consecutive prefixes in the same iteration, if, upon adopting the optimal divider of the longer prefix (as suggested by optimal divider cascading), the shorter prefix immediately obtains the same optimal total seed frequency as the longer prefix, then optimal solution forwarding guarantees that the optimal divider of the longer prefix is also the optimal divider of the shorter prefix.

With all four optimizations combined, we observe that the average number of division verifications per prefix drops from 5.4 to 0.95 (measured while mapping ERR240726 to human genome v37 under an error threshold of 5), providing a 5.68x theoretical speedup over OSS without any optimizations.

3.3.1 The full algorithm of OSS

Algorithm 1 provides the pseudo-code of optimalSeedSolver, which contains the core algorithm of OSS and the optimal divider cascading optimization; Algorithm 2 provides the pseudo-code of firstOptDivider, which contains the early divider termination, divider sprinting and optimal solution forwarding optimizations.

3.4 Results

We benchmarked both CKS and OSS against three other related studies: the Adaptive Seed Selector (ASS) [12], Optimal Prefix Selection (OPS) [2] and spaced seeds [11]. We compared the effectiveness of the above seeding mechanisms by measuring the average seed frequency; a smaller average seed frequency is better. Figure 3 presents the results. We also provide a qualitative comparison of the average-case complexity and the number of seed lookups required by each seeding mechanism, shown in Figure 2.

             OSS      ASS    CKS             OPS     Spaced seeds   naïve
Complexity   O(x·L)   O(x)   O(x·log(L/k))   O(x·L)  O(x)           O(x)
Seed lookups O(L²)    O(x)   O(L/k)          O(L)    O(x)           O(x)

Figure 2: An average-case complexity and seed frequency lookup comparison.


Algorithm 1: optimalSeedSolver
Input: the read, R
Output: the optimal x-seed frequency of R, opt_freq, and the first x-seed optimal divider of R, opt_div
Global data structure: the 2-D data array opt_data[][]
Functions:
  firstOptDivider: computes the first optimal divider of a prefix
  optimalFreq: retrieves the optimal 1-seed frequency of a substring
Pseudocode:
// The first iteration is special:
// it calculates the 1-seed solutions
for l = L downto Smin do
    prefix = R[1...l]
    opt_data[1][l].freq = optimalFreq(prefix)
// From iteration 2 to x-1 (from 2 to x-1 seeds)
for iter = 2 to x−1 do
    // Initialize the previous optimal divider
    // with the maximum value
    prev_div = L − Smin + 1
    for l = L downto iter×Smin do
        prefix = R[1...l]
        // Find the optimal divider
        div = firstOptDivider(prefix, iter, prev_div)
        // Get the frequencies of the two parts
        1st_part = R[1...div−1]
        2nd_part = R[div...l]
        1st_freq = opt_data[iter−1][div−1].freq
        2nd_freq = optimalFreq(2nd_part)
        // Update the data in the element
        opt_data[iter][l].div = div
        opt_data[iter][l].freq = 1st_freq + 2nd_freq
        // Optimal divider cascading:
        // update the previous divider
        prev_div = div
// Find the optimal x-seed frequency
prev_div = L − Smin + 1
// Find the optimal divider of the read
opt_div = firstOptDivider(R, x, prev_div)
// Get the frequencies of the two parts
1st_part = R[1...opt_div−1]
2nd_part = R[opt_div...L]
1st_freq = opt_data[x−1][opt_div−1].freq
2nd_freq = optimalFreq(2nd_part)
// The final x-seed frequency
opt_freq = 1st_freq + 2nd_freq
return opt_freq, opt_div


Algorithm 2: firstOptDivider
Input: the prefix; the iteration count, iter; the previous prefix divider, prev_div
Output: the first optimal divider of the prefix, first_div
Global data structures: the 2-D data array opt_data[][]; the optimal 2nd-part frequency of the previous prefix, opt_2nd_freq
Functions:
  optimalFreq: retrieves the optimal 1-seed frequency of a substring
Pseudocode:
// Optimal solution forwarding
2nd_part = prefix[prev_div...end]
2nd_freq = optimalFreq(2nd_part)
// If true, forward and return
if opt_2nd_freq = 2nd_freq then
    return prev_div
// Initialize data
first_div = prev_div
min_freq = MAX_INT
prev_1st_freq = MAX_INT
prev_2nd_freq = MAX_INT
// Move the divider backward until termination
for div = prev_div downto (iter−1)×Smin do
    // Get the frequencies of the two parts
    1st_part = prefix[1...div−1]
    2nd_part = prefix[div...end]
    1st_freq = opt_data[iter−1][div−1].freq
    // The 1st-part frequency of the next move
    next_1st_freq = opt_data[iter−1][div−2].freq
    // Divider sprinting:
    // skip if no change to the 1st-part frequency
    if next_1st_freq = 1st_freq then
        continue
    2nd_freq = optimalFreq(2nd_part)
    // Early divider termination:
    // terminate when the frequency difference
    // of the 1st part is too large
    if (1st_freq − prev_1st_freq) > prev_2nd_freq then
        break
    freq = 1st_freq + 2nd_freq
    // Update the optimal divider for a new minimum
    if freq ≤ min_freq then
        min_freq = freq
        first_div = div
        opt_2nd_freq = 2nd_freq
    prev_1st_freq = 1st_freq
    prev_2nd_freq = 2nd_freq
return first_div


Figure 3: CKS and OSS results.

From both results, we can draw a number of conclusions. First, CKS has low overhead: in total, CKS needs only ⌊L/k⌋ lookups of seed frequencies, followed by a sort of the ⌊L/k⌋ frequencies, where k is the length of each seed (k-mer). Although fast, CKS provides only limited seed frequency reduction, as it has a very limited pool of seeds to select from. This limitation is especially evident under large seed numbers, where CKS selects evidently more frequent seeds than OPS and ASS. Second, greedy seed selectors, including CKS, ASS and OPS, all provide drastic improvements over the naïve random or consecutive seed selectors.


Last but not least, there is still potential for improvement in greedy selection algorithms: even the best performing greedy seeding algorithm, OPS, still produces seeds that are at least 2x more frequent than optimal.

4 Subproblem 2: Efficient Filtration of Incorrect Seed Locations

Besides reducing seed frequency, efficiently filtering out incorrect mappings also increases the performance of a mapper.

As shown in the previous section, even with the best performing greedy seeding mechanism, the average seed frequency is still at least 2x greater than the optimal. This means that more than 50% of the seed locations returned by a greedy seeding mechanism will turn out to be incorrect mappings. In fact, even with optimal seeding, only a small fraction of the seed locations pass extension and are reported as correct mappings.

However, whether a seed location will turn out to be a good mapping is not known at the seeding stage. Therefore, incorrect mappings are only rejected after extension, a computationally costly process that measures the similarity between the read and the reference at the seed location.

Computation spent on extending incorrect mappings is wasteful, since such a mapping is invalid and never reaches the final report. Therefore, it is crucial to avoid wasting computational resources on extending incorrect mappings and, instead, to reject them using cheaper heuristics.

We propose two efficient filtration mechanisms, Adjacency Filtering and Shifted Hamming Distance, which filter out incorrect mappings without the costly extension process. We show that both mechanisms are effective in finding and rejecting seed locations that lead to incorrect mappings. We also prove that both filters maintain 100% mapper sensitivity, as they never reject any correct mappings.

4.1 Adjacency Filtering

Adjacency Filtering (AF) is a simple idea that extends the pigeonhole principle. It states that, given a set of non-overlapping seeds in a read as well as a potential mapping location loc, if there exist m seeds that do not have corresponding locations around loc, then there are at least m errors in the mapping at loc. If m is greater than the maximum allowed error threshold, then loc can be rejected without further investigation.

In a sense, AF is a weaker form of q-gram filtering that uses only non-overlapping seeds. However, we observe that AF is a good estimate of q-gram similarity and is often sufficient to detect incorrect mappings: the ratio of error-inflicted overlapping q-grams is often close to the ratio of error-inflicted non-overlapping k-mers. A single error destroys all surrounding q-grams, so in the worst case it destroys q of the roughly L q-grams in the string, a fraction of q/L. In AF, each error destroys a single k-mer out of a total of L/k k-mers, a fraction of k/L.

As with q-gram filters, AF only returns a lower bound on the number of errors in a potential mapping; therefore it never incorrectly rejects a correct mapping and hence has a zero false negative rate.


The implementation of AF is quite simple. Since seed locations are sorted, checking for the existence of adjacent locations of a seed is implemented as a simple binary search. Overall, filtering a mapping takes a total of ⌊L/k⌋ binary searches, as sketched below.
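A minimal sketch of AF (illustrative only; the per-seed hit lists, the shifting convention and the adjacency window are our assumptions):

import bisect

def adjacency_filter(seed_hit_lists, loc, e, window=2):
    # seed_hit_lists[i] holds the sorted reference locations of the i-th
    # non-overlapping seed, pre-shifted by the seed's offset so that a
    # correct mapping at `loc` places every seed near `loc`.
    missing = 0
    for hits in seed_hit_lists:
        j = bisect.bisect_left(hits, loc - window)  # one binary search per seed
        if j == len(hits) or hits[j] > loc + window:
            missing += 1         # this seed contributes at least one error
            if missing > e:
                return False     # reject loc without extension
    return True                  # loc survives; proceed to extension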

4.2 Shifted Hamming Distance

Shifted Hamming Distance (SHD), on the other hand, filters incorrect mappings through a pseudo-alignment method, using bit-parallel and SIMD instructions. Unlike AF, which filters incorrect mappings without any knowledge of the reference, SHD retrieves the reference fragment and estimates the number of errors in the mapping by performing an approximate alignment.

SHD is built upon two key observations:

1. If two strings differ by e errors, then all non-erroneous characters of the strings can be aligned within at most e shifts.

2. If two strings differ by e errors, then they share at most e+1 identical segments (pigeonhole principle).

Based on the above observations, SHD aligns all identical segments by shifting the read or the reference string by up to e base pairs, as Figure 4 shows. The aligned identical segments can be identified by XORing the shifted strings: each identical segment appears as a streak of 0s in the resulting Hamming masks.

Figure 4: Identifying identical segments in SHD (0: base-pair match; 1: base-pair mismatch).

14

Page 15: Novel Computational Techniques for Mapping Next ...hxin/proposal.pdfNovel Computational Techniques for Mapping Next-Generation Sequencing Reads Thesis Proposal May 26, 2017 Hongyi

The identical segments can be joined together through a global AND operation over all Hamming masks. In the final bit-vector, all streaks of 0s are preserved, since a 0 in any Hamming mask propagates through the AND operation. Therefore, the number of 1s in the final bit-vector is guaranteed to be smaller than or equal to the total number of errors in the alignment.

To summarize, given two strings and a small error threshold e, SHD filters incorrect alignments through the following steps: 1) SHD right-shifts first the read and then the reference by 0 to e base pairs. 2) After each shift, SHD computes a Hamming mask by XORing the shifted read/reference against the original reference/read. 3) SHD merges all Hamming masks into a final bit-vector through a global AND. 4) SHD counts the number of 1s in the final bit-vector; if there are more than e 1s, SHD rejects the mapping. A bit-parallel sketch follows.
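The following is a bit-parallel sketch of these steps, using Python integers in place of SIMD registers (equal-length strings are assumed, and SRS, introduced next, is omitted):

def shd_pass(read, ref, e):
    # Bit i of a Hamming mask is 1 when position i of the (shifted) read
    # mismatches the reference.
    n = len(read)

    def hamming_mask(shift):
        mask = 0
        for i in range(n):
            j = i - shift  # read index aligned against reference position i
            if not (0 <= j < n) or read[j] != ref[i]:
                mask |= 1 << i
        return mask

    final_bv = hamming_mask(0)
    for s in range(1, e + 1):
        final_bv &= hamming_mask(s)   # read shifted one direction
        final_bv &= hamming_mask(-s)  # read shifted the other direction
    # Pass only if at most e 1s survive the global AND.
    return bin(final_bv).count("1") <= e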

4.2.1 Speculative Removal of Short-Matches (SRS)

A potential problem of the vanilla SHD algorithm is that any 0 in the Hamming masks gets propagated to the final bit-vector, even 0s that are not from identical segments but are simply single base-pair matches due to random chance. Since the DNA alphabet size is only four, the chance of a random single-base-pair match is 25%. As a result, unrelated string pairs that contain many more errors than e often get past vanilla SHD. Figure 5 shows an example of two drastically different strings passing vanilla SHD as highly similar strings.

Figure 5: An example of an incorrect mapping passing vanilla SHD due to spurious 0s.

15

Page 16: Novel Computational Techniques for Mapping Next ...hxin/proposal.pdfNovel Computational Techniques for Mapping Next-Generation Sequencing Reads Thesis Proposal May 26, 2017 Hongyi

To solve this problem, we propose Speculative Removal of Short-Matches (SRS), which speculatively removes 0s that do not belong to identical segments. We call 0s that are not part of an identical segment spurious 0s.

We observe that spurious 0s usually form short streaks. This is because the probability of an identical segment arising by random chance decreases exponentially with its length: the random probability of an identical segment of length L is 0.25^L. As a result, most spurious 0s form only short streaks.

SRS assumes that all streaks of 0s in the Hamming masks shorter than a threshold, T_streak, are spurious and removes them by flipping them into 1s. The final bit-vector therefore only contains streaks of 0s of length at least T_streak. To maintain correctness and not mistakenly filter out correct mappings, when counting errors in the final bit-vector, SRS counts each streak of 1s of length L1 as 1 + (L1 − 1)/T_streak errors (see the sketch below). In practice, we find that T_streak = 3 is sufficient for up to e = 5. SRS can be implemented using Intel's SIMD shuffle instruction (pshufb).
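A sketch of SRS-style amendment and error counting, on a final bit-vector represented as a '0'/'1' string (our illustration, not the SIMD implementation; integer division is assumed in the streak formula):

def split_runs(bits):
    # Split a bit string into maximal runs of equal characters.
    runs, start = [], 0
    for i in range(1, len(bits) + 1):
        if i == len(bits) or bits[i] != bits[start]:
            runs.append(bits[start:i])
            start = i
    return runs

def srs_count(final_bv, t_streak=3):
    # Flip 0-streaks shorter than t_streak into 1s, then count each
    # surviving 1-streak of length L1 as 1 + (L1 - 1) // t_streak errors.
    amended = "".join(
        run if run[0] == "1" or len(run) >= t_streak else "1" * len(run)
        for run in split_runs(final_bv)
    )
    return sum(1 + (len(run) - 1) // t_streak
               for run in split_runs(amended) if run[0] == "1")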

4.2.2 The full algorithm of SHD

The pseudo-code of SHD is shown in Algorithm 3.

4.3 Results

We benchmark both the runtime and the filtering effectiveness of AF and SHD. Figure 6b shows the false positive rates, measured as the fraction of incorrect mappings that pass the filter, of AF and SHD under different error thresholds. Figure 6a shows the runtime comparison of AF and SHD against a SIMD implementation of the Smith-Waterman algorithm (Swps) and a SIMD implementation of Gene Myers' bit-vector algorithm (SeqAn).

From both figures, we can draw the following conclusions. Both AF and SHD are effective at filtering out incorrect mappings, though the effectiveness of both mechanisms decreases as the error threshold increases. Between the two, AF is faster but also produces higher false positive rates, whereas SHD, while slower than AF yet faster than the other SIMD implementations, provides higher filtering accuracy.

5 Subproblem 3: High Performance Extension

The core component of the extension step, which compares the read and the reference DNA strings, is the approximate string matching (ASM) problem. In the context of DNA alignment, we try to find the set of operations (also called edits) that converts the read string into the reference string. Each operation has an associated cost, and the overall goal is to find the set of operations with minimum cost. If the cost is below a threshold, T_e, we declare a match between the read and the reference, and the edited base pairs are the errors in the read.

Depending on the context in which ASM is used, there are many different penalty schemes, which assign different costs to different operations.


Algorithm 3: SHD
Inputs: Read[0]...Read[s−1], Ref[0]...Ref[s−1] (bit-vectors of the read and the reference), e (error threshold)
Output: Pass (True if the string pair passes SHD)
Functions (see Supplementary Materials):
  ComputeHammingMask: computes a Hamming mask
  SRS_amend: amends short streaks of '0's into '1's
  SRS_count: counts the number of errors in the final bit-vector
Pseudocode:
HMask = ComputeHammingMask(Read, Ref)
Final_BV = SRS_amend(HMask)
for i = 1 to e do
    // Left shift the read
    for j = 0 to s−1 do
        ShiftedRead[j] = Read[j] << i
    HMask = ComputeHammingMask(ShiftedRead, Ref)
    SRS_HMask = SRS_amend(HMask)
    Final_BV = Final_BV & SRS_HMask
    // Right shift the read
    for j = 0 to s−1 do
        ShiftedRead[j] = Read[j] >> i
    HMask = ComputeHammingMask(ShiftedRead, Ref)
    SRS_HMask = SRS_amend(HMask)
    Final_BV = Final_BV & SRS_HMask
errorNum = SRS_count(Final_BV)
if errorNum ≤ e then
    Pass = True
else
    Pass = False
return Pass

In the realm of DNA alignment, two commonly used penalty schemes are edit distance and affine gap penalty. Edit distance assigns unit penalties to insertions, deletions and substitutions of individual base pairs. Affine gap penalty, on the other hand, penalizes insertions and deletions according to their length: for a consecutive insertion or deletion of L base pairs, affine gap penalty assigns a total penalty of A + B×L, where A is the gap opening penalty and B is the gap extension penalty. Affine gap penalty is often considered a better model of genetic variation: during DNA replication, often an entire DNA fragment is inserted or deleted altogether, rather than individual base pairs being inserted or deleted at interspersed positions. Affine gap penalty reflects this observation by assigning smaller penalties to consecutive indels (insertions/deletions) than to interspersed indels.
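As a concrete illustration (the penalty values are chosen arbitrarily for this example), take A = 4 and B = 1. A single consecutive 3-bp gap costs

A + B × L = 4 + 1 × 3 = 7,

whereas the same three base pairs deleted at interspersed positions open three separate gaps and cost

3 × (A + B × 1) = 3 × 5 = 15.

The affine scheme therefore favors the single consecutive indel, matching the biological observation above.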


(a) Runtime of AF, SHD, SeqAn and Swps under error thresholds e = 0 to 5 (y-axis: execution time in seconds; x-axis: benchmarks).

(b) False positive rates (%) of AF and SHD under error thresholds e = 0 to 5, on benchmarks ERR240726_1 through ERR240730_2.

Among the wide variety of algorithms developed to solve the approximate string matching problem and its variations ([3], [16]), the Landau-Vishkin algorithm (LV) shows great potential. Instead of calculating the canonical L×L dynamic-programming table, LV iteratively finds the longest matching substrings under an increasing number of edits. Compared to other canonical algorithms, LV computes fewer entries of the dynamic-programming table and performs less computation per entry as well.

LV improves upon the banded edit-distance algorithm. It uses the fact that, for a sequence of matches, the edit distance is conserved along a diagonal of the dynamic programming table, so it can simply traverse along the diagonal to the position of the next error. The length of the traversal is LCE(i, j) (longest common extension), the length of the longest common prefix of s_{i..m} and r_{j..n}.

The recurrence function of LV is shown below:

LV(d,e) = max of:
    LV(d,e−1) + 1 + LCE(LV(d,e−1)+2, LV(d,e−1)+d+2)        (substitution)
    LV(d−1,e−1) + LCE(LV(d−1,e−1)+1, LV(d−1,e−1)+d+1)      (insertion)
    LV(d+1,e−1) + 1 + LCE(LV(d+1,e−1)+2, LV(d+1,e−1)+d+2)  (deletion)
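A minimal sketch of this scheme for plain edit distance follows (our illustration; L[d] stores the furthest read index reached on diagonal d = j − i with the current number of edits):

def landau_vishkin(s, t, max_e):
    # Returns the smallest edit distance between s and t if it is at most
    # max_e, and None otherwise.
    NEG = -1                      # marks a diagonal not yet reachable
    off = max_e                   # diagonal d is stored at index d + off

    def extend(i, d):
        # Free traversal along diagonal d while characters match (LCE).
        while i < len(s) and 0 <= i + d < len(t) and s[i] == t[i + d]:
            i += 1
        return i

    L = [NEG] * (2 * max_e + 1)
    L[off] = extend(0, 0)
    if len(s) == len(t) and L[off] == len(s):
        return 0
    for e in range(1, max_e + 1):
        new = [NEG] * (2 * max_e + 1)
        for d in range(-e, e + 1):
            cand = NEG
            if L[d + off] != NEG:                          # substitution
                cand = max(cand, L[d + off] + 1)
            if d - 1 >= -max_e and L[d - 1 + off] != NEG:  # insertion
                cand = max(cand, L[d - 1 + off])
            if d + 1 <= max_e and L[d + 1 + off] != NEG:   # deletion
                cand = max(cand, L[d + 1 + off] + 1)
            if cand != NEG:
                new[d + off] = extend(min(cand, len(s)), d)
        L = new
        d_goal = len(t) - len(s)
        if abs(d_goal) <= e and L[d_goal + off] == len(s):
            return e
    return None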


However, LV was proposed for edit distance only, and it had not been shown that LV supports affine gap penalties.

Here we propose LEAP, an extension of the Landau-Vishkin algorithm. We show that the same principles of LV can be applied to solve ASM with affine gap penalties. We further provide a de Bruijn sequence based bit-vector optimization of LEAP. Overall, LEAP is up to 7.4x faster than the state-of-the-art bit-vector Levenshtein distance implementation and up to 32x faster than the state-of-the-art affine-gap-penalty parallel Needleman-Wunsch implementation.

5.1 The Leaping Toad Problem

To prove that the chief principles of LV can be applied to solve ASM with affine gap penalties, we first convert the ASM problem into a graph traversal problem on a special directed acyclic graph; we call the resulting problem the Leaping Toad Problem (LTP). The graph contains vertices and directed edges, and edges carry weights. The goal is to find a path from a source vertex to a destination vertex such that the sum of all edge weights along the path is minimum.

We formally define the directed acyclic graph of LTP, which we call the swimming pool, as follows:

• There is a convex swimming pool that encircles vertices taken from a 2-dimensional vertex grid, where vertices are aligned in rows and columns. Vertices in the swimming pool are organized into disjoint lanes, which are rows of the vertex grid. Inside a lane, each vertex is connected to the next vertex on its right by a directed edge, with itself as the source and the vertex on the right as the destination. We call these edges forward edges. Forward edges only exist between vertices inside the swimming pool and do not exist for vertices outside of it.

• A vertex may also have edges pointing to vertices in other lanes. We call these edges leap edges. In LTP, for a vertex and a separate lane, there can be at most one leap edge pointing to at most one vertex in that lane; in other words, there cannot be multiple edges pointing to the same lane from the same vertex. We also require all vertices in the same lane to share the same types of leap edges: the same directions and lengths. When visualized, the leap edges between two lanes form an array of parallel arrows. Note that some leap edges may have their source and/or destination vertices outside the swimming pool; we call these edges out edges. Outside of the swimming pool enclosure, out edges continue to exist, connecting vertices between different lanes.

• Every edge carries a non-negative integer weight. Leap edges sharing the same origin and destination lanes have the same, positive weight. Forward edges have zero or positive weights. We call forward edges with positive weights hurdles, and traveling across a hurdle hurdle crossing. Hurdles may have different costs.

In the swimming pool, we appoint a number of lanes as origin lanes and a number of lanes as destination lanes. The two sets may overlap.


The general goal of LTP is to find a path in the directed graph, with minimum sum of edge weights, that starts at the first vertex (the leftmost in-pool vertex) of an origin lane and either travels to the last vertex of a destination lane or travels out of the swimming pool while exiting onto a destination lane.

For simplicity, we call edge weights energy costs; we call traveling along a forward edge without a hurdle a stride (or a swim); and we call traveling along a leap edge a leap.

Figure 6 shows a typical swimming pool setup and highlights an optimal path through it.

Figure 6: An example swimming pool setup for LTP.

The dynamic programming formulation of the ASM problem can be converted into LTP, with indels converted to leap edges and mismatches converted to hurdles. Figure 7 shows an example conversion.

Figure 7: An example of converting the ASM problem into LTP.

5.2 LEAP: The general solution of the Leaping Toad Problem

To show that the core recurrence function of LV can also be applied to solve LTP, we prove the following theorem:

Theorem 1 Among all optimal paths of the Leaping Toad problem, there must exist one path in which the toad either never leaps or only leaps right before hurdles.

With Theorem 1, we transform the general Leaping Toad problem of finding an optimal path with minimum cost into a sub-problem of finding an optimal path that only leaps before hurdles.

LEAP solves this sub-problem through an optimized dynamic-programming method that can be viewed as an extension of the Landau-Vishkin algorithm. LEAP can be summarized in four steps:


1. LEAP iterates through all intermediate energy costs from 0 to E, and for each energy cost it iterates through all lanes.

2. For an intermediate energy cost e and a lane l, LEAP finds the furthest vertex v in l that is reachable at precisely energy cost e from either a leap or a hurdle crossing.

3. LEAP extends the segment at v (if permitted) until the segment hits a hurdle.

4. LEAP repeats steps 2) and 3) until either a lane has reached the destination vertex or all intermediate energy levels have been exhausted. The path that leads to the destination vertex is reported as the result.

To summarize, LEAP uses the core recurrence function shown below:

start[l][e] = min over l′ ∈ lanes of ( end[l′][e − P(l′, l)] + F(l′, l) )

end[l][e] = start[l][e] + VtH(l, start[l][e])

where P(l′, l) returns the penalty of leaping from lane l′ to lane l; F(l′, l) returns the number of columns the toad moves forward when it leaps from lane l′ to l; and VtH(l, start_column) (short for Vertices to Hurdle) returns the number of vertices until the next hurdle from start_column in lane l. When l = l′, P(l, l′) is simply the energy cost of the next hurdle and F(l, l′) = 1.

5.2.1 The full algorithm of LEAP

We further provide a de Bruijn sequence based bit-vector optimization of LEAP. We notice that step 3 of LEAP searches for the next hurdle in a lane, which is equivalent to searching for the next 1 bit after a particular 0 bit in the shifted Hamming mask (from SHD). This can be further converted into finding the position of the most significant 1 bit after left-shifting the bit-vector to bring the current 0 bit to the front. Finally, finding the position of the most significant 1 bit can be implemented with the de Bruijn sequence technique in a bit-parallel fashion [10], as sketched below.
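For illustration, the classic 32-bit form of this trick locates a set bit with one multiply and a table lookup (this sketch finds the least significant 1 bit; the most-significant variant used for LEAP-BV additionally smears the top bit downward before the same multiply-and-lookup):

# 0x077CB531 is a de Bruijn sequence B(2,5) packed into 32 bits.
DEBRUIJN = 0x077CB531
TABLE = [0] * 32
for i in range(32):
    TABLE[((DEBRUIJN << i) & 0xFFFFFFFF) >> 27] = i

def lowest_set_bit_index(x):
    # x is a nonzero 32-bit integer; returns the index of its lowest 1 bit.
    isolated = x & (-x & 0xFFFFFFFF)  # keep only the lowest set bit
    return TABLE[((isolated * DEBRUIJN) & 0xFFFFFFFF) >> 27]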

The pseudo code of LEAP is shown in Algorithm 4.

5.3 Results

We implemented LEAP for both banded Levenshtein distance and banded affine gap penalties. For each scoring scheme, we compare LEAP against three state-of-the-art approximate string matching implementations: an in-house vanilla Landau-Vishkin implementation (LV), an implementation of Gene Myers' bit-vector algorithm from SeqAn (SeqAn) ([5]), and a SIMD implementation of the banded global Needleman-Wunsch algorithm (NW-SIMD) ([4]). Additionally, to benchmark the benefit of the de Bruijn sequence based bit-vector optimization, we implemented two versions of LEAP: one with (LEAP-BV) and one without (LEAP) the bit-vector optimization.

The results are shown in Table 1. We benchmarked read and reference pairs generated by Bowtie2 on all six implementations, using six read files from the 1000 Genomes Project ([1]): ERR240726_1, ERR240726_2, ERR240727_1, ERR240727_2, ERR240728_1 and ERR240728_2.


Algorithm 4: LEAP
Input: E, destination_lanes, origin_lanes
Output: pass; final_lane; final_energy
Initialization: end[l][e] = start[l][e] = −MAX_INT, ∀(l,e); final_energy = MAX_INT
Functions:
  VtH(l, pos): computes the number of vertices until the next hurdle from column pos in lane l
Pseudocode:
// Initialization
for l in [−k...+k] do
    if l in origin_lanes then
        start[l][0] = origin_lanes[l]
        length = VtH(l, start[l][0])
        end[l][0] = start[l][0] + length
// Iterate through all energy levels
for e = 1 to E do
    // Find the furthest starting position
    // after a leap or a hurdle crossing
    for l in [−k...+k] do
        for l′ in [−k...+k] do
            e′ = e − P(l′, l)
            if e′ ≥ 0 ∧ end[l′][e′] ≠ −MAX_INT then
                candidate_start = end[l′][e′] + F(l′, l)
                if candidate_start > start[l][e] then
                    start[l][e] = candidate_start
        // Find how long the toad can travel
        // without running into a hurdle
        length = VtH(l, start[l][e])
        end[l][e] = start[l][e] + length
        if end[l][e] ≥ destination_lanes[l] then
            if e < final_energy then
                final_lane = l
                final_energy = e
pass = (final_energy < E)
return pass, final_lane, final_energy

From the results, we observe that LEAP-BV is the fastest in both the Levenshtein distance and the affine gap setups. For Levenshtein distance, compared to SeqAn, LEAP-BV achieves up to a 7.4x speedup under E = 1 and a 1.6x speedup under E = 5. For affine gap, compared to NW-SIMD, LEAP-BV achieves even greater performance, with up to a 32x speedup under E = 1 and a 2.3x speedup under E = 5.


              e   LEAP-BV   LEAP    LV     NW-SIMD   SeqAn
Levenshtein   1   0.84      1.28    1.25   38.81     6.21
              2   1.48      2.07    2.03   38.91     6.66
              3   2.39      3.14    3.11   38.61     6.28
              4   3.35      4.50    4.45   38.38     6.59
              5   4.60      6.01    6.04   38.46     6.88
Affine Gap    1   1.18      1.72    N/A    38.75     N/A
              2   3.10      4.01    N/A    38.15     N/A
              3   6.39      7.77    N/A    38.64     N/A
              4   10.91     13.31   N/A    38.66     N/A
              5   16.91     20.74   N/A    38.51     N/A

Table 1: Runtimes for a suite of approximate string matching implementations, normalized to seconds per 10 million read/reference pairs. String pairs in this benchmark are generated by Bowtie2 with default parameters. While LEAP uses a simple for loop to find the next hurdle, LEAP-BV (LEAP-Bit-Vector) uses the de Bruijn sequence based bit-vector algorithm to locate the next hurdle.

Notice that even though both vanilla LV and SeqAn are reasonably fast under Levenshtein distance settings, neither supports affine gap penalties, due to their tight coupling with Levenshtein distance scores.

6 Subproblem 4: Accurate and Efficient Read Cloud Mapping

Recently, advances in DNA library (the material prior to sequencing) preparation technologies offer new opportunities in NGS read mapping. Technologies such as Moleculo and 10X Genomics introduced a new concept called the read cloud. The core methodology of read cloud sequencing is read barcoding, where groups of short reads that are sequenced from nearby locations in the donor genome are tagged with a unique barcode. Reads sharing the same barcode must therefore map to local clusters.

Read cloud technology can potentially improve the mapping accuracy of NGS reads. The human genome is highly repetitive: it contains a large quantity of genomic repeats such as minisatellites and DNA transposons. In fact, it is estimated that around 21.1% of the human genome is made up of Long Interspersed Nuclear Elements (LINEs) and another 11% is made up of Short Interspersed Nuclear Elements (SINEs). Short reads generated from a LINE or SINE may have many mappings with high similarity scores, and pinpointing the single correct mapping with high confidence is theoretically impossible. We call reads that can map to many locations with high similarity scores multi-mapping reads.

Read clouds provide new opportunities for finding the exact mapping of multi-mapping reads. Since the average LINE is 7,000 base pairs long and the average SINE is 500 base pairs long, read clouds that are much larger (e.g., 50K base pairs with 10X Genomics) surpass the boundaries of a LINE. Therefore, alongside the multi-mapping reads from a LINE, a read cloud should also contain non-repetitive reads from outside the LINE. These non-repetitive reads can serve as anchors that guide the mapping of the multi-mapping reads.

However, new opportunities also bring new computational challenges. Specifically, the number of distinct read clouds in each barcode, as well as which reads belong to which read cloud, is hidden. Therefore, we must create new algorithms that not only map reads to the reference genome, but also assign reads to hypothetical long molecules and assess the quality of both the read mapping and the read cloud assignment. To achieve this, we need to build new probability models of the read cloud sequencing process.

In this proposal, we focus on mapping read clouds generated by 10X Genomics sequencing technology.

6.1 Modeling 10X Genomics Read Clouds

There are two key strategies in mapping 10X Genomics (or simply 10X) read clouds: 1) reads sharing the same barcode should be mapped into compact clusters; and 2) reads in a compact cluster should be assigned to a single read cloud. These two mapping strategies mirror the read cloud sequencing process.

The NGS read cloud generation and barcoding process can be summarized as follows: first, the donor genome is divided into tens of millions of long fragments (50K base pairs average length for 10X); then a small number of long fragments (10 on average for 10X) are randomly selected and assigned to a unique barcode; finally, NGS reads are randomly synthesized from the long fragments following a binomial distribution.
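
A toy simulation may help make this generative process concrete. Only the 50K-bp average fragment length and the roughly 10 fragments per barcode come from the description above; the genome size, fragment-length spread, and per-fragment read count below are illustrative assumptions (a fixed read count stands in for the binomial sampling).

import random

GENOME_LEN = 3_000_000_000           # approximate human genome length
FRAG_MEAN, FRAG_SD = 50_000, 5_000   # long fragment length model
FRAGS_PER_BARCODE = 10
READ_LEN = 100
READS_PER_FRAG = 50                  # stand-in for binomial read sampling

def simulate_barcode():
    """Read positions tagged with one barcode: ~10 compact clusters."""
    reads = []
    for _ in range(FRAGS_PER_BARCODE):
        frag_len = max(READ_LEN, int(random.gauss(FRAG_MEAN, FRAG_SD)))
        frag_start = random.randrange(GENOME_LEN - frag_len)
        for _ in range(READS_PER_FRAG):
            # Each read starts uniformly within its fragment.
            offset = random.randrange(frag_len - READ_LEN + 1)
            reads.append(frag_start + offset)
    return sorted(reads)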

From this sequencing process, we can draw two key conclusions: 1) the probability that the reads in a single cluster were synthesized from two separate DNA fragments is drastically lower than the probability that they were synthesized from a single fragment, because fragments are assigned to barcodes at random, so the probability of two fragments landing in the same barcode is drastically smaller than that of a single fragment explaining the cluster; and 2) the probability of observing a compact cluster is much higher than the probability of observing a long, sparse cluster, because reads are synthesized from long fragments through a Bernoulli process and fragment lengths follow a normal distribution, which makes long, sparse clusters very unlikely.
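
To make conclusion 1 concrete, here is a back-of-envelope check; the fragment total below is an assumed value consistent with the "tens of millions" stated above.

n_fragments = 60_000_000   # assumed total long fragments from the donor genome
frags_per_barcode = 10

# For a given fragment, a specific second fragment must occupy one of the
# remaining 9 slots in the same barcode out of all other fragments.
p_share_barcode = (frags_per_barcode - 1) / (n_fragments - 1)
print(f"P(two specific fragments share a barcode) ~ {p_share_barcode:.1e}")
# ~1.5e-07, so explaining one compact cluster with two co-barcoded fragments
# is drastically less likely than explaining it with a single fragment.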

Therefore, we prioritize mapping a multi-mapping read to regions close to non-repetitive reads; otherwise, the multi-mapping read would have to form a separate, sparse read cloud just for itself.

6.2 Mapping 10X Read Clouds with Magneto

We propose a new mapping strategy, Magneto, which efficiently maps multi-mapping reads. Magneto first maps unambiguously mapped reads, which we call anchors. Then it attempts to map the multi-mapping reads, which we call magnets, around the anchors. We call the process of mapping magnets around anchors attaching magnets to anchors.

To avoid wasting computation on finding all possible mappings of a magnet, Magneto prioritizes extending the seed locations of magnets that are close to anchors. Magneto evaluates the remaining seed locations only if no high-quality mappings are found near anchors.
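
A minimal sketch of this prioritization follows, assuming anchors is the sorted list of mapped positions of a barcode's unambiguous reads and WINDOW bounds how far a magnet may sit from an anchor; the names and window value are illustrative, not Magneto's actual parameters.

import bisect

WINDOW = 50_000  # on the order of one 10X long fragment

def near_anchor(pos, anchors, window=WINDOW):
    """True if pos lies within `window` of the nearest anchor."""
    i = bisect.bisect_left(anchors, pos)
    neighbors = [anchors[j] for j in (i - 1, i) if 0 <= j < len(anchors)]
    return any(abs(a - pos) <= window for a in neighbors)

def order_seed_locations(seed_locations, anchors):
    """Near-anchor candidates are extended first; the rest are kept in
    reserve and evaluated only if no high-quality mapping is found."""
    near = [p for p in seed_locations if near_anchor(p, anchors)]
    far = [p for p in seed_locations if not near_anchor(p, anchors)]
    return near, far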

Magneto also tracks magnet-only clusters, in case a fragment contains only genomic repeats. However, our experiments show that magnet-only clusters are extremely rare, since most DNA fragments exceed the length of a LINE.


6.3 Mapping Split Reads

The same idea behind Magneto can also be applied to mapping split reads. Split reads are reads that span structural variations such as chromosome fusions, translocations, and inversions, as well as insertions and deletions. Due to these structural variations, a split read is split into two halves that often map to distant locations in the reference genome. Moreover, because of genomic repeats, as well as the short length of the remainder after the split, split reads suffer from high multi-mapping uncertainty.

We hypothesize that when there is a structural variation event, reads synthesized from a single fragment will be artificially split into two distant clusters, with a split read connecting the boundaries of the two clusters. Therefore, to map a split read, we only need to search for potential mappings around the boundaries of existing clusters, which drastically reduces the search space.
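
A minimal sketch of this boundary-restricted search follows; the helper name and the window size are illustrative assumptions, not the proposed implementation.

BOUNDARY_WINDOW = 1_000

def boundary_candidates(candidates, clusters, window=BOUNDARY_WINDOW):
    """Keep only candidate positions for the split half of a read that land
    near an existing cluster boundary; `clusters` holds (start, end) spans."""
    boundaries = [b for span in clusters for b in span]  # starts and ends
    return [p for p in candidates
            if any(abs(p - b) <= window for b in boundaries)]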

7 Goal and Timeline for Graduation

My goal is to graduate in the fall semester of 2017. To this end, Table 2 lists my tentative timeline for graduation.

Duration       Description

Jun-Jul 2017   Finish the implementation of split-read mapping with SNAP-X. Finish the
               implementation of the peripheral functions of SNAP-X.

Sep-Oct 2017   Benchmark SNAP-X and finish the write-up of the manuscript.

Nov-Dec 2017   Wrap up remaining work and start writing the thesis.

Dec 2017       Defend.

Table 2: Target goals for the proposal.

References

[1] 1000 Genomes Project Consortium, "A map of human genome variation from population-scale sequencing," Nature, vol. 467, pp. 1061-1073, 2010.

[2] A. Ahmadi, A. Behm, N. Honnalli, C. Li, L. Weng, and X. Xie, "Hobbes: optimized gram-based methods for efficient read alignment," Nucleic Acids Research, vol. 40, p. e41, 2011.

[3] R. Cole and R. Hariharan, "Approximate string matching: A simpler faster algorithm," SIAM Journal on Computing, vol. 31, no. 6, pp. 1761-1782, 2002.

[4] J. Daily, "Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments," BMC Bioinformatics, vol. 17, p. 81, 2016. Available: http://dx.doi.org/10.1186/s12859-016-0930-z

[5] A. Döring, D. Weese, T. Rausch, and K. Reinert, "SeqAn: an efficient, generic C++ library for sequence analysis," BMC Bioinformatics, vol. 9, p. 11, 2008. Available: http://dx.doi.org/10.1186/1471-2105-9-11

[6] P. Ferragina and G. Manzini, "Opportunistic data structures with applications," in Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS '00). Washington, DC, USA: IEEE Computer Society, 2000, pp. 390-398. Available: http://dl.acm.org/citation.cfm?id=795666.796543

[7] H. Li, "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM," arXiv preprint, 2013.

[8] G. M. Landau and U. Vishkin, "Fast parallel and serial approximate string matching," Journal of Algorithms, vol. 10, no. 2, pp. 157-169, 1989.

[9] B. Langmead and S. L. Salzberg, "Fast gapped-read alignment with Bowtie 2," Nature Methods, vol. 9, pp. 357-359, 2012.

[10] C. E. Leiserson, H. Prokop, and K. H. Randall, "Using de Bruijn sequences to index a 1 in a computer word," available from http://supertech.csail.mit.edu/papers.html, 1998.

[11] B. Ma, J. Tromp, and M. Li, "PatternHunter: faster and more sensitive homology search," Bioinformatics, vol. 18, pp. 440-445, 2002.

[12] S. Marco-Sola, M. Sammeth, R. Guigó, and P. Ribeca, "The GEM mapper: fast, accurate and versatile alignment by filtration," Nature Methods, vol. 9, no. 12, pp. 1185-1188, 2012. Available: http://dx.doi.org/10.1038/nmeth.2221

[13] M. Zaharia, W. J. Bolosky, K. Curtis, A. Fox, D. Patterson, S. Shenker, I. Stoica, R. M. Karp, and T. Sittler, "Faster and more accurate sequence alignment with SNAP," arXiv preprint, 2011.

[14] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, vol. 48, pp. 443-453, 1970.

[15] T. F. Smith and M. S. Waterman, "Identification of common molecular subsequences," Journal of Molecular Biology, vol. 147, pp. 195-197, 1981.

[16] E. Ukkonen, "Algorithms for approximate string matching," Information and Control, vol. 64, no. 1-3, pp. 100-118, 1985.
