FastEtch: A Fast Sketch-based Assembler for Genomes

Priyanka Ghosh, Ananth Kalyanaraman

Abstract: De novo genome assembly describes the process of reconstructing an unknown genome from a large collection of short (or long) reads sequenced from the genome. A single run of a Next-Generation Sequencing (NGS) technology can produce billions of short reads, making genome assembly computationally demanding (both in terms of memory and time). One of the major computational steps in modern day short read assemblers involves the construction and use of a string data structure called the de Bruijn graph. In fact, a majority of short read assemblers build the complete de Bruijn graph for the set of input reads, and subsequently traverse and prune low-quality edges, in order to generate genomic “contigs”, the output of assembly. These steps of graph construction and traversal contribute to well over 90% of the runtime and memory. In this paper, we present a fast algorithm, FastEtch, that uses sketching to build an approximate version of the de Bruijn graph for the purpose of generating an assembly. The algorithm uses the Count-Min sketch, which is a probabilistic data structure for streaming data sets. The result is an approximate de Bruijn graph that stores information pertaining only to a selected subset of nodes that are most likely to contribute to the contig generation step. In addition, edges are not stored; instead, the fraction of edges that contributes to our contig generation is detected on-the-fly. This approximate approach is intended to significantly improve performance (both execution time and memory footprint) whilst possibly compromising on the output assembly quality. We present two main versions of the assembler: one that generates an assembly where each contig represents a contiguous genomic region from one strand of the DNA, and another that generates an assembly where the contigs can straddle either of the two strands of the DNA. For further scalability, we have implemented a multi-threaded parallel code. Experimental results using our algorithm conducted on E. coli, Yeast, C. elegans and Human (Chr2 and Chr2+3) genomes show that our method yields one of the best time-memory-quality tradeoffs, when compared against many state-of-the-art genome assemblers.

Index Terms: Genome assembly, de Bruijn Graph, Count-Min sketch, Approximation methods.


1 INTRODUCTION

De novo genome assembly is the problem of assembling an unknown genome from a set of DNA “reads” sequenced from it. Despite advances in assembly algorithms over the last two decades, development of de novo genome assemblers continues to be an active research topic. Over the years, multiple Next Generation Sequencing (NGS) technologies have emerged (e.g., Illumina, 454), which have increased in throughput, improved in accuracy and decreased in cost, leading to a wide adoption of these technologies among practitioners. Most of these technologies generate “short reads” that are of length in the hundreds of bases, with errors less than 1%, but at rates that can quickly overwhelm the computational capacity to assemble them.

There are numerous short read assemblers that have been developed over the last decade (e.g., [1], [2], [3], [4], [5], [6]). Nearly all of them use a string data structure called the de Bruijn graph to compute their assembly, and follow a 3-step approach: First, the de Bruijn graph is constructed using all the substrings of length k (called k-mers) in the input reads as “vertices”, and all the (k+1)-mers that exist in the input reads as “edges” (see Fig. 1 for an example). The next step involves error correction, which locates and prunes paths induced by erroneous k-mers that are likely to be manifestations of sequencing errors. The last step is to generate a set of assembled contigs by traversing paths in the pruned and corrected graph. Of the three steps, the first step of de Bruijn graph construction is typically the most memory- and time-intensive, potentially taking hours to even days for assembling moderate sized eukaryotic genomes, and requiring tens to hundreds of gigabytes of memory.

• P. Ghosh is with the School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, 99164. E-mail: [email protected].

• A. Kalyanaraman is with the School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, 99164. E-mail: [email protected].

Manuscript received February 10, 2017; revised June 8, 2017.

To mitigate the memory requirements, a handful of recently developed assemblers [7], [8] use a probabilistic data structure (Bloom filter [9]) during their de Bruijn graph construction. The filter enables a succinct representation for finding and locating all k-mers that exist, at the potential expense of reporting false positives. However, the Bloom filter is suited only for membership queries. In the case of genome assembly, in addition to membership, the frequency of k-mers also becomes important. Therefore, the Bloom filter-based methods follow an indirect approach of first indexing the k-mers using the filter and later pruning those k-mers (vertices) that occur less frequently from the de Bruijn graph. Furthermore, it is essential to estimate the size of the filter a priori in order for it to perform optimally [10].

Contributions: In this paper, we present a new approach to compute genome assembly using a probabilistic data structure specifically designed for counting, viz. the Count-Min sketch (CM sketch) [11]. The CM sketch data structure was originally designed for streaming applications and operates in sub-linear space. We propose a new genome assembly algorithm that uses the CM sketch to build a low memory, approximate version of the de Bruijn graph, and uses that graph to generate an assembly. More specifically, our approximate de Bruijn graph stores, with a high probability, only those vertices corresponding to frequent k-mers. Edges are not stored; instead, they are detected on-the-fly as the output contigs are generated. By avoiding the construction of a full blown de Bruijn graph and by selectively enumerating edges (and hence, paths) on-the-fly, our approach trades off assembly accuracy in favor of performance (both memory and run-time). Toward further scalability, our approach also parallelizes efficiently on multi-core computers.

[Figure 1: de Bruijn graph with vertices agc, tgc, gca, cac, act, cta, ctg]

Fig. 1: Example of a de Bruijn graph for an input containing the reads {agcactg, tgcacta}, and for k = 3.

A fast and space-efficient assembler could not only make the assembly of very large genomes feasible with short turnaround times, but could also pave the way for the assembly process to happen in real time (i.e., where the time for assembly is significantly less than the time to generate the input reads).

Based on the above idea, we present two main variants of our assembly method: one that generates an assembly in which each contig represents a contiguous genomic region from one of the two strands of the DNA; and another that generates an assembly in which a contig is allowed to switch between the two strands of the DNA.

Experimental results testing all variants of our algorithm, using E. coli, yeast, C. elegans and Human (Chr2 and Chr2+3) genomes, show that our method is able to produce an assembly with quality comparable to or better than most other state-of-the-art assemblers, while running with a significantly reduced memory and time footprint.

Henceforth, we call our algorithm FastEtch, referring to its fast ability to “etch” out a genome using sketching. To the best of our knowledge, FastEtch is the first assembler to use sketching for genome assembly. A preliminary version of this paper appeared in [12].

2 RELATED WORK

De novo genome assembly is a widely researched topic with a number of assembly tools and strategies developed over the last two decades. Short read assemblers correspond to that subset which target reads generated from NGS technologies (e.g., Illumina, 454 pyrosequencing, SOLiD). Broadly speaking, modern day short read assemblers can be grouped under three categories [13]: i) Overlap-Layout-Consensus (OLC) methods rely on constructing a pairwise alignment based overlap graph; ii) de Bruijn Graph-based methods (henceforth, abbreviated as “DBG” assemblers) build a de Bruijn graph out of the k-mers occurring in reads, and then perform error correction and Euler path traversals to generate an assembly;¹ and iii) String Graph approaches represent a variation of the OLC approach, by focusing on suffix-prefix overlaps between reads.

¹ For this reason, this class of assemblers is also referred to as Eulerian assemblers.

Of these approaches, the DBG approach has emerged to be one of the most widely used paradigms. Originally introduced by Pevzner et al. [14], it has been widely adopted in a number of short read assemblers including (but not limited to): Velvet [5], ALLPATHS [1], SOAPdenovo [3], ABySS [4] and SPAdes [6]. These methods vary in the manner in which they process the de Bruijn graph, and identify and prune potentially erroneous paths during contig assembly. However, all of these assemblers follow the same algorithmic template, which can be summarized by the following steps:

1) (Graph Construction) Given a set of input reads and a constant k > 0, construct the de Bruijn graph;

2) (Error Correction) Identify and prune paths in the de Bruijn graph that are likely to be an artifact of sequencing errors and inconsistencies; and

3) (Contig Generation) Enumerate paths in the pruned graph in order to generate assembled sequences, i.e., the contigs.

Implementing an assembler using the DBG approach has its advantages. Constructing a de Bruijn graph can be achieved in linear time (in comparison, building an overlap-based graph in the OLC assemblers takes quadratic time, since reads need to be compared against one another). In addition, the traversal step treats the problem of contig generation as an Euler tour (which is solvable in polynomial time), as opposed to the intractable Hamiltonian path problem that OLC approaches face. Despite this performance advantage, assembling large, complex genomes still remains a compute- and memory-intensive task, requiring hours to days of computation, and tens to hundreds of gigabytes of memory.

In an attempt to reduce the memory footprint of assembly, Chapman et al. proposed the Meraculous algorithm [7]. The algorithm uses the Bloom filter [9], which is a probabilistic data structure for answering membership queries, to generate its version of the de Bruijn graph. Once the graph is generated, simple chain paths (between branching nodes) are enumerated and the first batch of unique contigs are generated from these paths; consequently, these contigs are called “UU-contigs”. Subsequently, the read set is loaded iteratively to incrementally map the reads against the UU-contigs, and in the process, the contigs are elongated on either end. Note that the latter step of mapping reads to the partially formed contigs requires that the read set be loaded potentially multiple times, until no further contig extensions are possible.

Minia [15] is another assembler that uses the Bloom filter to construct the de Bruijn graph, by storing all the distinct k-mers (vertices) and subsequently discarding the ones considered potentially erroneous (appearing fewer times than a threshold). The tool also stores an additional structure to remove what the authors call “critical false positives”. Moreover, in order to further reduce memory footprint, the Minia algorithm uses secondary storage. All k-mers generated from the read set are partitioned and stored on the disk. Low-abundance k-mers are filtered, by separately loading each partition (one at a time) into memory and into a temporary hash table. In the absence of adequate disk space, we observed Minia to encounter segmentation faults.

There have been other efforts to generate memory-efficient, compressed representations of the string structure used for assembly, viz. bitmaps [16], and FM-index (based on the Burrows-Wheeler transform) [17], [18], [19]. Another class of assemblers employs techniques such as Digital Normalization [20] prior to generating the assembly.

There also exists a class of assemblers that utilize a likelihood-based approach toward assembly assessment; these include CGAL [21] and GAML [22]. SLICEMBLER [23], on the other hand, proposes a new technique to perform de novo assembly using ultra-deep sequencing data.

In an effort to improve scalability, several efforts have developed parallel genome assemblers. Notable examples include YAGA [24], Ray [25], PASHA [26], Spaler [27], SWAP [28] and HipMer [29], which were originally written for assembling on distributed memory machines (i.e., commodity clusters). While constructing and traversing the de Bruijn graph in a distributed memory setting introduces multiple challenges related to interprocess communication and data locality, the aggregate memory in such a machine could be used to scale up the operation to very large inputs.

The past few years have seen the growth of several additional short read de Bruijn graph based assemblers adopting a succinct representation. A notable example is the metagenomics assembler MEGAHIT [30], which forgoes any pre-processing stages and utilizes succinct de Bruijn graphs (SdBG [31]) along with ideas from IDBA-UD [32] in the assembly process. The approach in [33] implements the de Bruijn graph using the FM-index and frequency-based minimizers. Another more recent approach utilizes a colored de Bruijn graph in order to minimize the space consumption, as seen in algorithms such as Rainbowfish [34] and VARI [35]. A related effort, SplitMEM [36], exploits the relationship between compressed colored de Bruijn graphs and suffix trees.

The algorithm presented in this paper, FastEtch, is fundamentally different from all the above previous efforts: i) It is the first approach, to the best of our knowledge, to use the Count-Min Sketch (a sublinear space data structure) in the graph construction step; ii) It constructs an approximate de Bruijn graph, storing only a subset of vertices (k-mers), and detecting edges on-the-fly as contigs are generated, thereby generating savings on both vertex and edge space requirements, and consequently, the time to solution; and iii) Through a choice of parameters, it offers a way to control the quality-performance (time, memory) trade-off in the output.

3 METHODS

3.1 Notation and Terminology

Let $r$ denote a read (string) of length $\ell$, over the DNA alphabet $\{a, c, g, t\}$. We index the characters in each read from 1, and denote the $i$th character by $r[i]$. Let $r(i, l)$ denote the substring of length $l$ starting at index $i$ in $r$, such that $i + l - 1 \le \ell$. Then, every $r(i, k)$ is a k-mer in $r$, for a fixed $k > 0$. We denote the input set of $n$ reads as $R = \{r_1, r_2, \ldots, r_n\}$, and their total length as $N = \sum_i |r_i|$. Let $K$ denote the set of all k-mers in $R$.

Given $R$ and $k$, the corresponding de Bruijn graph is a directed graph $G(V, E)$, such that the vertex set $V = K$, and there exists a directed edge $(u, v) \in E$ for every pair of k-mers $u$ and $v$ that are consecutive in any read $r \in R$ (as shown in Fig. 1). We note here that the above represents a minimalist definition of a de Bruijn graph. In practice, typically a number of other attributes are stored along each edge, e.g., the set of reads that cover each edge (or vertex), or the number of occurrences of every k-mer.

Fig. 2: CM sketch depicting the Update function, which updates the count for every k-mer in one cell in each row.
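The minimalist definition above can be illustrated with a short Python sketch. It is not the FastEtch implementation (which, as described later, avoids storing edges); the function names `kmers` and `build_de_bruijn` are hypothetical, and the input is the two-read example of Fig. 1.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every k-mer r(i, k) of a read, from left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Return (vertices, edges) for the minimalist de Bruijn graph.

    vertices is the set K of all k-mers; edges maps a k-mer u to the set of
    k-mers v that directly follow u in at least one read.
    """
    vertices = set()
    edges = defaultdict(set)
    for read in reads:
        previous = None
        for kmer in kmers(read, k):
            vertices.add(kmer)
            if previous is not None:
                edges[previous].add(kmer)
            previous = kmer
    return vertices, dict(edges)


# The example of Fig. 1: two reads, k = 3.
vertices, edges = build_de_bruijn(["agcactg", "tgcacta"], k=3)
print(sorted(vertices))
print({u: sorted(vs) for u, vs in sorted(edges.items())})
```

For this input the graph has seven vertices and six edges, with `act` being the only branching vertex (it is followed by both `cta` and `ctg`).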

Since DNA is a double stranded molecule, we use the following notation of reverse complementation. The reverse complement of an arbitrary DNA sequence $\alpha$ of length $\ell$, denoted by $\bar{\alpha}$, may be obtained by reversing $\alpha$ and complementing each base, i.e., $\bar{\alpha}[i] = \mathrm{complement}(\alpha[\ell - i + 1])$. The complementary base pairing rule is as follows: $\{a \leftrightarrow t, c \leftrightarrow g\}$.
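As a small illustration of the reverse complementation rule, the following Python sketch applies the base pairing above to a lowercase DNA string. The names `COMPLEMENT` and `reverse_complement` are chosen here for illustration only.

```python
# Complementary base pairing: a <-> t, c <-> g.
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def reverse_complement(seq):
    """Reverse the sequence and complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


print(reverse_complement("agcactg"))
```

Applying the function twice returns the original sequence, which is a convenient sanity check.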

3.2 Sketching for Genome Assembly

Introduced in 2003, the Count-Min sketch (CM sketch) [11] is a sublinear space data structure used for summarizing massive data streams in applications. It finds use in implementing point, range and inner product queries, quantile computations, and identification of heavy hitters [11].

As shown in Fig. 2, the CM sketch data structure is a 2-D matrix of depth $d$ and width $w$, that stores $d \times w$ counts. The sketch uses $d$ hash functions:

$$h_1, \ldots, h_d : \{1, \ldots, n\} \to \{1, \ldots, w\}$$

that are from a pairwise independent family. Two parameters, $\varepsilon$ and $\delta$, are used to determine the sizes of $w$ and $d$, where $\varepsilon$ denotes the error in answering a query and $1 - \delta$ is the probability with which that error bound holds: $w = \lceil e/\varepsilon \rceil$ and $d = \lceil \ln(1/\delta) \rceil$.

A CM sketch supports two basic functions: Update count and Estimate count. In the streaming model, every incoming update is in the form of the tuple $\langle i, c \rangle$, where $i$ represents an item and $c$ represents the count of $i$ for that update. Given an update $\langle i, c \rangle$, the Update count function is given by:

$$\mathrm{count}[j, h_j(i)] \leftarrow \mathrm{count}[j, h_j(i)] + c, \quad \forall j \in [1, d] \qquad (1)$$

The Estimate count function retrieves an estimated count for a particular item $i$ from the CM sketch, and is given by:

$$CM(i) \leftarrow \min_j \mathrm{count}[j, h_j(i)] \qquad (2)$$

Intuitively, if $a_i$ denotes the actual count for item $i$, then each entry $\mathrm{count}[j, h_j(i)]$ represents an overestimate for $a_i$ (due to potential collisions introduced by $h_j$). Therefore, returning the minimum over all the $d$ counts represents the lowest overestimate one can derive from the CM sketch.
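The Update count and Estimate count functions of Equations (1) and (2) can be sketched in Python as follows. This is a minimal illustration, not the FastEtch code: the 2-bit k-mer encoding, the choice of hash family $h_j(x) = ((a_j x + b_j) \bmod p) \bmod w$ with a Mersenne prime $p$, and all names (`encode`, `CountMinSketch`, `PRIME`) are assumptions made here, since the text only requires a pairwise independent family.

```python
import random

# Assumption: a Mersenne prime that exceeds any 2-bit encoded k-mer for k <= 44.
PRIME = (1 << 89) - 1
ENCODE = {"a": 0, "c": 1, "g": 2, "t": 3}


def encode(kmer):
    """Pack a k-mer into an integer, two bits per base (k is assumed fixed)."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


class CountMinSketch:
    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        self.count = [[0] * width for _ in range(depth)]
        # One (a, b) pair per row defines that row's hash function h_j.
        self.params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                       for _ in range(depth)]

    def _cells(self, item):
        """Yield (row j, column h_j(item)) for every row of the sketch."""
        x = encode(item)
        for j, (a, b) in enumerate(self.params):
            yield j, ((a * x + b) % PRIME) % self.width

    def update(self, item, c=1):
        """Equation (1): add c to one cell in each of the d rows."""
        for j, col in self._cells(item):
            self.count[j][col] += c

    def estimate(self, item):
        """Equation (2): the minimum over the d cells the item maps to."""
        return min(self.count[j][col] for j, col in self._cells(item))


sketch = CountMinSketch(width=64, depth=4)
for kmer in ["agc", "gca", "cac", "gca", "gca"]:
    sketch.update(kmer)
# The estimate is never below the true count of 3; collisions can only raise it.
print(sketch.estimate("gca"))
```

Because every update touches exactly one cell per row, each row always sums to the total number of updates, which is why the additive error in the guarantee scales with the total count.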

By setting $w = \lceil e/\varepsilon \rceil$ and $d = \lceil \ln(1/\delta) \rceil$, the CM sketch provides the following approximation guarantee [11]:

$$CM(i) \le a_i + \varepsilon \sum_{i'} a_{i'}, \quad \text{with probability at least } 1 - \delta \qquad (3)$$

Intuitively, $\varepsilon$ is the tolerance for error in the count, and $1 - \delta$ represents the probability of incurring an error within that tolerance.

As can be observed from this approximation bound, the quality of the estimate is expected to be better for the more frequent items in the set, i.e., those items whose $a_i$ represents a significant fraction of $\sum_{i'} a_{i'}$. This makes the CM sketch naturally suited for keeping track of frequent items in data streams.

3.2.1 Frequency of k-mers

In this paper, we describe a way to use the CM sketch for detecting frequent k-mers in reads and using them for constructing an approximate version of the de Bruijn graph, and subsequently, for generating the assembly. Our idea to detect and keep track of only frequently occurring k-mers is directly motivated by sequencing coverage C, which corresponds to the number of clonal copies of the target genome used in sequencing. Typically, de novo sequencing experiments use a high coverage to sequence the underlying genome (between 50× and 100×). Given the stochasticity of the random shotgun process, this implies that on average, each base (and hence, each k-mer originating at that base) in the genome is covered by roughly C reads. However, errors during sequencing, even if as low as 1%, alter this expectation, as erroneous positions in reads generate low frequency (“poor quality”) k-mers. In fact, the fraction of such low quality k-mers is only expected to grow as k is increased. The non-erroneous fraction (“high quality” k-mers), on the other hand, tends to have a frequency proportional to the coverage.

Consequently, if we are to plot the frequency distribution of k-mers, we expect to see a bimodal distribution, with roughly two peaks, one corresponding to the low quality k-mers and the other to the high quality k-mers. We conducted a preliminary study on the frequency distribution of k-mers generated from two different coverage experiments on the E. coli genome. Fig. 3 confirms the expectation.

In fact, this chart also provides an interesting insight into the relative concentrations of the low vs. high quality k-mers. Notably, well over 75% of all the distinct k-mers are of low quality, while only up to 25% are high quality k-mers. This simple observation lays the foundation for our use of the CM sketch for genome assembly: if the CM sketch data structure can be used to effectively identify only those high quality k-mers, then significant savings in space and time can be achieved.

Fig. 3 also suggests that there exists a clear separation between low and high quality k-mers in their frequency ranges, i.e., as per the bimodal expectation. We exploit these observations in the design of our algorithm (Section 3.3).
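A frequency distribution of the kind plotted in Fig. 3 can be computed with exact counting on small inputs. The Python sketch below is only illustrative: the function name `kmer_frequency_histogram` is hypothetical, the bins are assumed to be the power-of-two ranges $(2^{b-1}, 2^b]$ (with bin 0 holding k-mers seen exactly once), and the toy reads stand in for real sequencing data.

```python
from collections import Counter


def kmer_frequency_histogram(reads, k):
    """Map bin index b to the percentage of distinct k-mers whose exact
    count lies in (2^(b-1), 2^b]; bin 0 holds k-mers that occur once."""
    counts = Counter(read[i:i + k]
                     for read in reads
                     for i in range(len(read) - k + 1))
    # (c - 1).bit_length() equals ceil(log2(c)) for every integer c >= 1.
    bins = Counter((c - 1).bit_length() for c in counts.values())
    total = len(counts)
    return {b: 100.0 * n / total for b, n in sorted(bins.items())}


# Toy input: four error-free copies of a read plus one copy with a
# substitution in its last base, which creates one low-frequency k-mer.
reads = ["agcactg"] * 4 + ["agcactt"]
hist = kmer_frequency_histogram(reads, k=3)
print(hist)
```

In this toy input the erroneous k-mer `ctt` lands in the lowest bin while the k-mers shared by all reads land in the highest one, a miniature version of the two-peak separation discussed above.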

3.2.2 Precision of CM Sketch Counts

In this section, we evaluate the precision of CM sketch counts when applied to k-mers from sequenced reads. As described earlier, the size parameters w and d of the CM sketch play a significant role in controlling the quality of the count overestimate provided by the sketch. Also note that the summation ($\sum_{i'} a_{i'}$) in the approximation bound (3) will be $\Theta(N)$, if the CM sketch is used for k-mer counting.

[Figure 3: histogram. x-axis: frequency of k-mers, in bins $(2^0, 2^1]$ through $(2^6, 2^7]$; y-axis: percentage (%) of k-mers, 0 to 90; series: Coverage=50x, Coverage=80x]

Fig. 3: Histogram showing k-mer frequency distribution (expressed as a % of the total distinct k-mers), for two coverage experiments of the E. coli genome, with k=32. As can be observed, the distribution is bimodal, with a clear separation between the less frequent (“poor quality”) and more frequent (“high quality”) k-mers. Furthermore, there is a lateral shift in the peaks when the sequencing coverage is increased (from 50× to 80×).

Given the above, we first note that even for small values of d it is possible to achieve a very high probability $1 - \delta$ in bound (3). For example, d = 5 is sufficient to guarantee 0.99 probability (for $1 - \delta$).

The term w controls the quality of the estimate; more specifically, it controls the factor $\varepsilon$ that multiplies the summation ($\sum_{i'} a_{i'}$). This implies that a high precision can be achieved by targeting a low value of $\varepsilon$. But how small should $\varepsilon$ be so as to make the contribution from the summation practically negligible, when applied to k-mer counting? For instance, a value of w as high as $10^6$ will bring $\varepsilon$ down to the order of $10^{-6}$. Assuming N is $10^9$ (1 Gbp input), this still implies a non-negligible contribution on the order of $10^3$ for each estimate. Given that the sequencing coverage is typically only under 100×, this suggests that the value of w has to be increased proportional to N if one were to accurately recover the counts for all k-mers. However, such a high value of w is clearly undesirable from a memory consumption perspective, because as w tends to N, the sublinear space advantage of the CM sketch diminishes.
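The sizing arithmetic in the two preceding paragraphs can be checked with a few lines of Python. This is a worked example under the formulas $w = \lceil e/\varepsilon \rceil$ and $d = \lceil \ln(1/\delta) \rceil$ stated earlier; the helper names `cm_dimensions` and `additive_error_bound` are hypothetical.

```python
import math


def cm_dimensions(epsilon, delta):
    """Width and depth from w = ceil(e / epsilon) and d = ceil(ln(1 / delta))."""
    return math.ceil(math.e / epsilon), math.ceil(math.log(1.0 / delta))


def additive_error_bound(width, total_count):
    """Additive term epsilon * sum(a_i) of bound (3), with epsilon = e / width."""
    return (math.e / width) * total_count


w, d = cm_dimensions(epsilon=1e-6, delta=0.01)
print(w, d)                                   # depth 5 suffices for delta = 0.01
print(1 - math.exp(-5))                       # probability reached with d = 5
print(additive_error_bound(10**6, 10**9))     # additive term for w = 1e6, N = 1e9
```

With the constant $e$ included, a width of $10^6$ and $N = 10^9$ give an additive term of roughly $2.7 \times 10^3$, which agrees in order of magnitude with the $10^3$ figure quoted above and far exceeds a typical coverage of under 100×.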

This is where our strategy of targeting only frequent (high quality) k-mers has a direct impact on the practical value of the CM sketch toward k-mer counting. More specifically, since our goal is not to recover all k-mers but only those that are frequent, a smaller value of w, small enough to preserve the sublinear space argument, is sufficient in practice. Intuitively, this is because it is those highly frequent k-mers that are likely to contribute to form major sections of the output contigs. On the other hand, the less frequent k-mers are more likely to be erroneous k-mers, or even if legitimate, tend to appear at the ends of otherwise highly covered contigs.

To evaluate the potential of using a CM sketch-based approach, we performed an experiment with a 10× read set of the E. coli genome. For each distinct k-mer, we computed the difference between the estimated count (CM count) generated by the CM sketch and the k-mer's actual count. Fig. 4a shows the results of this comparison for varying values of w, and also for a smaller range of d. As shown, w has a more pronounced effect on the precision of the CM sketch compared to d. More importantly, the observations confirm that a small value of w (e.g., 128K) is sufficient to diminish the standard deviation in the overall difference between actual and estimated counts (for all k-mers). Note that compared to this value of w, the value of N is at least two orders of magnitude larger (≈ 50M for this set). We also noticed that an increase in the value of d does not significantly impact the output, and therefore we have restricted the value of d to 8 in all our experiments. Put together, these observations imply that a relatively small (compared to N), fixed size CM sketch is sufficient for implementing an effective count estimation procedure in practice.

[Figure 4a, “CM sketch: precision”. x-axis: width w, from $2^8$ to $2^{19}$; y-axis: standard deviation of |CM count - actual count|, 0 to 450; series: d=8, d=16, d=32, d=64, d=128]

[Figure 4b, “CM sketch: bucket contribution”. x-axis: percentage contribution of k-mers to a bucket, 10 to 100; y-axis: number of distinct k-mers, 0 to 5e+06; series: w=64k, w=128k, w=256k, w=512k, w=1024k]

Fig. 4: The effect of w and d on: a) the standard deviation of the difference between CM count and actual count for all k-mers; b) the percentage contribution of each k-mer to its respective hash bucket in the CM sketch.
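The precision measurement described above (the standard deviation of the difference between CM count and actual count over all distinct k-mers) can be reproduced in miniature. The Python sketch below is an illustration under stated assumptions, not the experiment from the paper: the hash family, the 2-bit encoding, the function name `cm_error_stddev`, and the synthetic random genome and reads are all choices made here.

```python
import random
import statistics
from collections import Counter

PRIME = (1 << 89) - 1          # assumption: Mersenne prime modulus for hashing
ENCODE = {"a": 0, "c": 1, "g": 2, "t": 3}


def encode(kmer):
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def cm_error_stddev(reads, k, width, depth, seed=0):
    """Standard deviation of (CM count - actual count) over distinct k-mers."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
              for _ in range(depth)]
    table = [[0] * width for _ in range(depth)]
    actual = Counter()

    def columns(kmer):
        x = encode(kmer)
        return [((a * x + b) % PRIME) % width for a, b in params]

    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            actual[kmer] += 1
            for j, col in enumerate(columns(kmer)):
                table[j][col] += 1

    errors = [min(table[j][col] for j, col in enumerate(columns(kmer))) - count
              for kmer, count in actual.items()]
    return statistics.pstdev(errors)


# Synthetic stand-in for a read set: 200 reads of length 100 from a random genome.
rng = random.Random(1)
genome = "".join(rng.choice("acgt") for _ in range(2000))
reads = [genome[s:s + 100] for s in (rng.randrange(1900) for _ in range(200))]
for width in (256, 1024, 4096):
    print(width, cm_error_stddev(reads, k=16, width=width, depth=8))
```

On inputs like this one, the reported deviation would be expected to shrink as the width grows, mirroring the trend of Fig. 4a, although the exact values depend on the seed and the synthetic data.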

We also performed an experiment to study the effect of varying w on the nature of collision within each hashed entry (“bucket”) of the CM sketch. Intuitively, the more skewed a k-mer's contribution to its overall bucket sum (CM count), the easier it may become to separate that high quality k-mer from the rest. On the other hand, a uniform contribution from multiple k-mers may reduce the efficacy in selecting the high quality k-mers. A third possible scenario, where a highly frequent k-mer is accompanied by a long list of infrequent k-mers in the same bucket, may also negatively impact the efficacy of selection.

To study this effect, we conducted experiments for varying values of w, and calculated the percentage contribution of all distinct k-mers within their corresponding bucket (see footnote 2). This is obtained by dividing the actual count of each k-mer by its corresponding CM count. Fig. 4b shows the results of this analysis, as a histogram of k-mers distributed by their contributions to their respective buckets. As can be observed, as the value of w is increased, a larger fraction of k-mers contribute dominantly to their respective buckets, as indicated by the larger area under the curve for >60% contribution. This is because of lower collision rates. While the efficacy of selection can be expected to increase for even larger values of w, increasing w implies increasing the space

2. Of the d buckets in which a k-mer is present, the bucket with the minimum count for that k-mer is selected.

Algorithm 1: Approximate de Bruijn Graph Construction Algorithm (Baseline)

Input: Input set of reads: R, Width: w, Depth: d, Threshold: τ
Output: Approximate de Bruijn graph (G)
for each r ∈ R in parallel do
    for each k-mer i ∈ r do
        for each j ← 1 to d do
            Update count(j, h_j(i)) ← count(j, h_j(i)) + 1
        Estimate CM(i) ← min_j count[j, h_j(i)]
        if CM(i) > τ then
            if i ∉ G then
                Insert i in G[h(i)].list
                Initialize G[h(i)].list.partialCount(i) ← 0
            G[h(i)].list.partialCount(i)++
return G

consumed by the CM sketch, i.e., a tradeoff between space and precision of CM count (filtering efficacy).
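The bucket-contribution measure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `bucket` and `contributions`, and the use of salted BLAKE2b as the per-row hash family, are assumptions made only for this example.

```python
import hashlib
from collections import Counter


def bucket(kmer: str, row: int, w: int) -> int:
    """Hypothetical row-specific hash mapping a k-mer to one of w buckets."""
    salt = row.to_bytes(8, "little")
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little") % w


def contributions(kmers, w: int, d: int) -> dict:
    """Return {k-mer: actual count / CM count} for every distinct k-mer."""
    actual = Counter(kmers)
    table = [[0] * w for _ in range(d)]
    # Fill the d x w sketch with the exact counts of each distinct k-mer.
    for kmer, count in actual.items():
        for row in range(d):
            table[row][bucket(kmer, row, w)] += count
    result = {}
    for kmer, count in actual.items():
        # The CM count is the minimum over the d buckets the k-mer maps to.
        cm_count = min(table[row][bucket(kmer, row, w)] for row in range(d))
        result[kmer] = count / cm_count
    return result
```

A contribution of 1.0 means the k-mer has its minimum-count bucket to itself; smaller values mean that other k-mers collide with it in every row.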

3.3 FastEtch: The Assembly Algorithm

Building on the foundations laid out in Section 3.2, we present our FastEtch algorithm for genome assembly (see Fig. 5). Our algorithm has two main steps:

i) (Graph Construction) Construct an “approximate de Bruijn graph” using the input reads and CM sketch;

ii) (Contig Generation) Generate contigs by performing edge detection and path enumeration of the approximate de Bruijn graph.

3.3.1 Approximate de Bruijn Graph Construction

Our graph construction algorithm uses two data structures: a d × w CM sketch, and an auxiliary data container to hold the output “approximate de Bruijn graph” (denoted by G), which will contain only a selected subset of k-mers that have been identified as “high quality” by our method. Ideally, these identified k-mers should correspond to all and only those frequent k-mers. For the purpose of the auxiliary k-mer data container, we use a simple hash table in our current implementation (see footnote 3). We use multi-threading for parallelism. Let p denote the number of threads.

3. Given that G is a container for k-mers, we also treat G as a set containing the k-mers for the purpose of exposition.


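[Figure 5 schematic: each k-mer from the input reads increments its counts (+1) in the CM sketch; frequent k-mers are stored in a hash table; contigs are produced from this hash table by testing the possible edge extensions (successors) of a k-mer, e.g., cca, ccc, ccg and cct for "acc".]

Fig. 5: FastEtch: A schematic illustration of our sketch-based algorithm for de novo genome assembly.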

Algorithm 1 presents our baseline algorithm to construct the approximate de Bruijn graph. Given the input set of n reads R, the reads are partitioned in a load-balanced manner (using dynamic scheduling) among the p threads. Each thread then enumerates all k-mers from its batch of reads by simply sliding a window of length k within each read. As and when a k-mer i is generated through this sliding window approach, the following operations are performed:

1) First, the k-mer’s count is incremented in all the corresponding d buckets of the CM sketch, using the Update count function.

2) Next, we check to see if the k-mer’s estimated count has exceeded a certain threshold τ. If it has, then that k-mer is (tentatively) deemed as “high quality” and is inserted into G (if it does not already exist). The first time a k-mer is inserted into G, we initialize a partial count for this k-mer and set it to 0. If the k-mer already exists in G, then we simply increment the partial count.

The rationale for maintaining a partial count for every distinct k-mer pushed into G is as follows. Ideally, the set of k-mers inserted into G should correspond to all and only those k-mers occurring above a certain frequency (determined as a function of the sequencing coverage C). However, due to collision of k-mers within each bucket of the CM sketch, it is possible that the estimated CM count for a k-mer, which is an overestimate, is significantly larger than its actual count. This could happen in particular within the CM sketch buckets where there is a skewed mixture of a handful of highly frequent k-mers alongside numerous low frequency k-mers. If it so happens that multiple instances of even one of those high frequency k-mers are inserted early on, then by virtue of collision, any low frequency k-mer that follows into the bucket is also likely to get selected for insertion into G. It is precisely for this reason that we deem any k-mer inserted into G as being tentatively high quality, and maintain a partial count for such k-mers in G. By initializing this partial count to 0 (and not τ or the CM count value), we are essentially making this partial count stored at G an underestimate of the actual count. By sheer probability of occurrence, the low frequency k-mers inserted into G are likely to remain close to this initialized value, whereas the high frequency k-mers (the real high quality ones) are likely to grow in their partial count values as more instances of that k-mer appear. This strategy of moving from an overestimate (CM count) to an underestimate (partial count) is critical to ensure that the k-mers that eventually get selected for contig generation (elaborated in Section 3.3.2) are indeed of high quality.
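The baseline construction can be sketched serially in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the class `CountMinSketch`, the function `build_graph`, the salted BLAKE2b hash family, and the use of a plain dictionary for G are all hypothetical, and the multi-threaded partitioning of reads is omitted. The increment follows Algorithm 1 as printed, i.e., it is applied on every occurrence once the estimate exceeds τ.

```python
import hashlib


class CountMinSketch:
    """A d x w table of counters with one hash function per row."""

    def __init__(self, w: int, d: int):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _bucket(self, kmer: str, row: int) -> int:
        # Hypothetical hash family: BLAKE2b salted with the row index.
        salt = row.to_bytes(8, "little")
        digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
        return int.from_bytes(digest, "little") % self.w

    def update(self, kmer: str) -> None:
        for row in range(self.d):
            self.table[row][self._bucket(kmer, row)] += 1

    def estimate(self, kmer: str) -> int:
        # Never smaller than the true count of the k-mer.
        return min(self.table[row][self._bucket(kmer, row)] for row in range(self.d))


def build_graph(reads, k: int, w: int, d: int, tau: int) -> dict:
    """Return G as {k-mer: partial count} for tentatively high quality k-mers."""
    sketch = CountMinSketch(w, d)
    graph = {}
    for read in reads:
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            sketch.update(kmer)
            if sketch.estimate(kmer) > tau:
                if kmer not in graph:
                    graph[kmer] = 0  # partial count starts at 0, not at tau
                graph[kmer] += 1
    return graph
```

Because counting in G only begins after the CM estimate crosses τ, each partial count is at most the k-mer's true count, which is the underestimate property the text relies on.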

Improving the Efficacy of Filtering: Note that in the above baseline approach, once a bucket’s CM count exceeds τ, all subsequent k-mers for which that bucket is the minimum count bucket will be inserted into the hash table G. But there is no guarantee that such k-mers are all indeed of high quality. In fact, too large a τ can negatively impact sensitivity (i.e., potentially missing out on a good fraction of high quality k-mer insertions). On the other hand, a low value of τ affects precision by letting too many low quality k-mers into G. Therefore, instead of solely relying on τ, we devise a “utility”-based iterative scheme to insert k-mers into G, with the aim of keeping the number of insertions into G as low as possible without losing sensitivity. We define a measure called the utility for each CM sketch bucket, which will be measured dynamically with every CM sketch update of that bucket. Intuitively, a CM sketch bucket b is said to be of high utility if it has a significant potential to contribute a high quality k-mer into G at any stage.

To measure utility, within each CM sketch bucket, we keep track of an integer for the number of insertions into G that have resulted from this bucket so far. We refer to this count as the bucket’s “insert count”, and it is initialized to 0. This is kept in addition to the CM count of that bucket. Given the above, we provide two different ways to determine the utility of a bucket:

Definition 1. A bucket is said to be of high utility if its CM count has exceeded τ and its insert count is less than a certain threshold γ.

Definition 2. A bucket is said to be of high utility if the ratio of its CM count to its insert count is above a certain threshold β.

The revised algorithm to insert a k-mer into G is given in Algorithm 2, using one of the two definitions of choice to compute a bucket’s utility. Consequently, we generate two variants of our FastEtch algorithm, viz. FastEtch-γ and FastEtch-β, that use the γ and β filters respectively.

Intuitively, as per Definition 1, the idea is to prevent a bucket from performing too many inserts into G as that


Algorithm 2: A Utility-based Scheme to Insert k-mers into G

Input: CM sketch bucket: b, k-mer: i
Let i ← k-mer being inserted into bucket b, such that b is the count-min bucket for i
Increment b.CMcount
if b.CMcount > τ then
    if b is of “high utility” then
        if i ∉ G then
            Insert i into G and initialize partial count
            Increment b.InsertCount
        else
            Increment partial count for i in G
    if reset criterion is met then
        Reset b.CMcount = b.InsertCount = 0

Algorithm 3: Contig Generation Algorithm

Input: Approximate de Bruijn Graph: G
Output: Contigs
for each k-mer i ∈ G in parallel do
    if i is a begin k-mer then
        Initialize contig c ← first (k − 1) characters of i
        repeat
            Concatenate the end character of k-mer i to contig c
            i ← succ(i)
        until i does not exist;
        Output contig c

is indicative of low quality k-mers. This is achieved by placing a bound on the number of inserts (γ). However, if we permanently shut off a bucket’s contribution after γ inserts are reached, then any new high quality k-mer that gets inserted after that point will also be inadvertently skipped. To reduce such potential loss in sensitivity, we reset the entire bucket by resetting both its CM count and insert count to 0 when a bucket is no longer of high utility. This process, conducted iteratively for a bucket, lasts until all k-mers in the input are exhausted. The reset strategy gives each bucket a revived chance to contribute k-mers into G eventually, each time after it is deemed low utility.

Definition 2 is similar in spirit except that the criterion for resetting a bucket’s counts is based on the ratio of its CM count to insert count. Intuitively, if a bucket holds too many low quality k-mers, this ratio will be close to 1. However, a higher value of the ratio presents better evidence for high quality k-mers originating from that bucket.
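The utility-based insertion of Algorithm 2 can be sketched for a single bucket as below. The names `Bucket`, `high_utility_gamma`, `high_utility_beta` and `insert_with_utility` are hypothetical. Two points are assumptions of this sketch rather than statements from the text: the reset criterion is taken to be "the bucket is no longer of high utility", and under Definition 2 a bucket with no inserts yet is treated as high utility because the ratio is undefined.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    cm_count: int = 0      # CM count of this bucket
    insert_count: int = 0  # insertions into G caused by this bucket so far


def high_utility_gamma(b: Bucket, tau: int, gamma: int) -> bool:
    """Definition 1: CM count above tau and fewer than gamma inserts."""
    return b.cm_count > tau and b.insert_count < gamma


def high_utility_beta(b: Bucket, beta: float) -> bool:
    """Definition 2: ratio of CM count to insert count above beta."""
    # Assumption: with zero inserts the ratio is undefined, so allow inserts.
    return b.insert_count == 0 or b.cm_count / b.insert_count > beta


def insert_with_utility(b: Bucket, kmer: str, graph: dict, tau: int, is_high_utility) -> None:
    """Process one k-mer whose count-min bucket is b (after Algorithm 2)."""
    b.cm_count += 1
    if b.cm_count > tau:
        if is_high_utility(b):
            if kmer not in graph:
                graph[kmer] = 0        # initialize partial count
                b.insert_count += 1
            else:
                graph[kmer] += 1
        # Assumed reset criterion: the bucket is no longer of high utility.
        if not is_high_utility(b):
            b.cm_count = 0
            b.insert_count = 0
```

With τ = 2 and γ = 2, the stream a, a, a, b, c through one bucket inserts a and b, then resets the bucket, so c starts again from a CM count of 1 and is not inserted.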

3.3.2 Contig Generation

We use the approximate de Bruijn graph (G) constructed as above to generate contigs. Algorithm 3 presents the steps involved in generating the contigs. The main steps are as follows. Given a k-mer i ∈ G, we define two functions, succ(i) and pred(i). Let S(i) denote the set of all possible k-mers that can be constructed from i through a single character extension in the forward direction:

$$S(i) = \{\, j \mid j(1, k-1) = i(2, k-1) \,\}$$

A valid successor of i is a k-mer j ∈ S(i) that also exists in G, with the maximum partial count that is greater than χ. More specifically, let p(j) denote the partial count for k-mer j ∈ G. Then,

$$\mathrm{succ}(i) = \arg\max_{j \in S(i) \cap G} \{\, p(j) \mid p(j) > \chi \,\} \tag{4}$$

We used χ = 1 in all our experiments.

Given that there are four such possible extensions of i, there could be at most four candidate successors of i in G. The partial count check is essential to ensure that low quality k-mers which occur sparsely, but which made their way into G by virtue of being associated alongside a high quality k-mer in the CM sketch, eventually get discarded during the contig generation step.

The notion of a valid predecessor and the pred(i) function are defined similarly, except that a predecessor needs to match its last k − 1 characters with the first k − 1 characters of i. These two functions provide the basis for contig generation. More specifically, we begin with any k-mer in G that does not have a predecessor. We refer to such k-mers as begin k-mers. From a begin k-mer we identify successors, one at a time, and elongate the contig until no further extension is possible. Consequently, this scheme constitutes our strategy to detect edges on-the-fly: not all edges, but only those that contribute to contig assembly.

We keep a flag to mark k-mers that have already been used for generating a contig, so that they do not get selected as part of another contig. Furthermore, if during successor and predecessor operations a given k-mer extension with the maximum partial count has already been claimed, then the next best choice is selected and returned in our heuristic. In our parallel implementation, we use atomic updates (as opposed to locks) to prevent multiple threads from updating the same k-mer entries concurrently.
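The contig generation procedure can be sketched serially as follows. The function names and the dictionary representation of G ({k-mer: partial count}) are assumptions of this example, and the atomic updates of the parallel implementation are replaced by a simple `used` set. Begin k-mers are taken to be those with no valid predecessor in G, as in the text.

```python
ALPHABET = "acgt"


def successors(kmer: str) -> list:
    """S(i): the four k-mers obtained by a one-character forward extension."""
    return [kmer[1:] + c for c in ALPHABET]


def predecessors(kmer: str) -> list:
    """The four k-mers whose last k-1 characters match the first k-1 of kmer."""
    return [c + kmer[:-1] for c in ALPHABET]


def best(candidates, graph: dict, chi: int, used: set):
    """Unclaimed candidate in G with the largest partial count above chi, or None."""
    valid = [j for j in candidates if graph.get(j, 0) > chi and j not in used]
    return max(valid, key=graph.get) if valid else None


def generate_contigs(graph: dict, chi: int = 1) -> list:
    used = set()
    contigs = []
    # Begin k-mers: no valid predecessor exists in G.
    begins = [i for i in graph if best(predecessors(i), graph, chi, set()) is None]
    for i in begins:
        if i in used:
            continue
        contig = i[:-1]  # first k-1 characters of the begin k-mer
        while i is not None:
            used.add(i)
            contig += i[-1]  # append the end character of the current k-mer
            i = best(successors(i), graph, chi, used)
        contigs.append(contig)
    return contigs
```

Skipping already claimed k-mers in `best` is what returns the "next best choice" and also guarantees that each walk terminates.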

3.3.3 Complexity Analysis

Graph construction: The time to generate k-mers from the reads and update the CM sketch is Θ(N). The insertion of the selected k-mers into the hash table for G depends on the collision rate at that hash table bucket. Our current implementation uses chaining, and if the k-mers are uniformly distributed across the hash table entries, then we can expect near constant time per hash table update as well. The graph construction step uses a combination of atomic updates and locks to update the CM sketch and G, and we expect the speedup of this step to be sublinear.

Contig generation: The time for this step is bounded by the total number of k-mers that were originally inserted into G, which in turn can be no greater than the number of distinct k-mers in the input (O(N)). In practice, though, we expect the size of G to be significantly smaller than N (as discussed in Section 3.2.1). Assuming perfect speedup from multi-threading using p threads, this implies an expected runtime complexity of Θ(N/p). Since the contig generation step is lock-free, we expect linear scaling in that step.

As for space complexity, the CM sketch is a sublinear data structure. The memory cost of our algorithm is dominated by the space to store G. Note that G stores only those distinct k-mers identified by the sketch as tentatively high quality (as described in Section 3.3.1). Also, our algorithm does not need to store the input set of reads. Consequently,


the expected space complexity is sublinear (o(N)), although in practice, constants associated with G cannot be ignored.

3.4 Bi-FastEtch: Bidirected de Bruijn graph extension

As DNA molecules are double-stranded, reads can be sequenced from either of these two strands. This leads to the possibility of overlapping regions between reads sequenced in opposite directions from the two strands. Therefore, during the contig generation process, there is a need to switch between the two strands so that contigs maximally cover the sequenced genomic regions.

However, the FastEtch algorithm that was described in Section 3.3 does not identify strand directionality, and therefore the contigs generated from this baseline version would cover contiguous regions only along a single strand. Furthermore, the input data may contain reads sequenced from both strands but covering the same genomic region. In such cases, the baseline algorithm could end up generating duplicate contigs (i.e., one corresponding to each strand). This leads to potential redundancy in the assembly output. Therefore we introduce a new variant of our FastEtch algorithm, which takes into account the double-stranded nature of the DNA. We call this new variant Bi-FastEtch.

The main idea of Bi-FastEtch is to construct a bidirected version of the de Bruijn graph, in order to capture the double-stranded nature of the DNA, and thereby facilitate switching between the DNA strands during contig generation.

Recall from Section 3.1 that the reverse complement of a sequence α is denoted by ᾱ.

Bidirected de Bruijn Graph Construction: We build our bidirected de Bruijn graph such that every node in the graph G represents two distinct k-mers: i and ī. Note that, since K is the set of all k-mers in the read set R, |G| ≥ |K|/2. In other words, the bidirected de Bruijn graph can be expected to have a smaller size than the regular (unidirected) version of the de Bruijn graph.

Storing a single node for both k-mer variants would allow the contig generation step to select k-mers from either strand. However, the selection criterion during contig extension relies on k-mer partial counts. Therefore, it is necessary for the counts populated in G to include the counts from both variants of a k-mer.

To achieve this, we also reflect the bidirected property into the CM sketch counts. More specifically, given a k-mer i, we keep a single CM count that would include the individual counts for both i and ī. Without loss of generality, we use the lexicographically smaller of the two k-mers for hashing its location in the CM sketch, i.e., given i, the CM count for i is updated as follows:

$$\mathrm{count}[j, h_j(\min(i, \bar{i}))] \leftarrow \mathrm{count}[j, h_j(\min(i, \bar{i}))] + 1, \quad \forall j \in [1, d]$$

We use the same minimum logic when k-mers are inserted into the bidirected de Bruijn graph G.
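The "minimum logic" amounts to mapping every k-mer to a canonical form before it is hashed or stored. A minimal sketch follows; the function names are hypothetical and lowercase or uppercase DNA strings over {a, c, g, t} are assumed.

```python
# Translation table for complementing DNA characters.
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(read: str, k: int) -> list:
    """Canonical form of every k-mer obtained by sliding a window of length k."""
    return [canonical(read[pos:pos + k]) for pos in range(len(read) - k + 1)]
```

A read and its reverse complement yield the same collection of canonical k-mers, so their counts accumulate in the same CM sketch buckets and the same node of G.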

Algorithm 4 illustrates the revised graph construction procedure for inserting k-mers in G. Note that a k-mer could be entered into the CM sketch in either or both of its forms. As a result, given a read set R and its k-mer set K, the CM count for any k-mer in K could be potentially inflated by at most a factor of 2 relative to the CM count

[Figure 6 panels, as recovered from the figure text:]

(a) Example of a double-stranded DNA sequence of length l=8, showing both the forward strand (5' tcactgac 3') and the reverse strand (3' agtgactg 5').

(b) Contig elongation direction: S → P. Begin k-mer: gtca. Complement and append the begin character of 'ctga'. Contig: gtcag.

(c) Contig elongation direction: P → P. Complement and append the begin character of 'actg'. Contig: gtcagt.

(d) Contig elongation direction: P → S. Append the end character of 'agtg'. Contig: gtcagtg.

(e) Contig elongation direction: S → S. Append the end character of 'gtga'. Final contig: gtcagtga.

(f) Example of contig enumeration for the sequence in Fig. 6a, over the k-mer pairs tcac/gtga, cact/agtg, actg/cagt, ctga/tcag and tgac/gtca.

Fig. 6: Example illustrating the process of contig elongation observed by Bi-FastEtch.

in our baseline implementation. In order to account for this possibility, we also need to adjust the value of τ.

Furthermore, there is also an alternative scenario where a k-mer is present in both its direct and reverse complemented forms, along the same strand but at different locations of the genome. This scenario would imply that our revised bidirected k-mer counting method could falsely combine the k-mer counts originating at different genomic locations. This could in turn lead to a false extension of a contig during the contig generation phase. In Section 4, we analyze the impact of this trade-off between extending a contig in the correct manner through capturing the bidirectionality property of the DNA, and introducing false extensions due to the approximate nature of our counting methodology.

Contig Generation: Algorithm 5 shows the steps for enumerating contigs from the bidirected de Bruijn graph. Any k-mer i ∈ G could have up to eight possible successors in G, i.e., four k-mers that extend i in the forward direction, and four k-mers that extend i in the reverse


Algorithm 4: Approximate Bidirected de Bruijn Graph Construction Algorithm (Baseline)

Input: Input set of reads: R, Width: w, Depth: d, Threshold: τ
Output: Approximate bidirected de Bruijn graph (G)
for each r ∈ R in parallel do
    for each k-mer i ∈ r do
        ī = ReverseComplement(i)
        i_min = min{i, ī}
        for each j ← 1 to d do
            Update count(j, h_j(i_min))
        Estimate CM(i_min)
        if CM(i_min) > τ then
            if i_min ∉ G then
                Insert i_min in G[h(i_min)].list
                Initialize G[h(i_min)].list.partialCount(i_min) ← 0
            G[h(i_min)].list.partialCount(i_min)++
return G

(complemented) direction. Given the above, the set of all possible extensions for i can be defined as follows:

$$S_f(i) = \{\, j \mid j(1, k-1) = i(2, k-1) \,\}$$

$$S_r(i) = \{\, j \mid j(2, k-1) = \overline{i(2, k-1)} \,\}$$

$$S(i) = S_f(i) \cup S_r(i)$$

We subsequently use a k-mer that has the largest partial count as the successor for extending the contig, as shown in Eqn. 4. If the successor selected is from S_f, then we append j[k] to the running contig; otherwise, we append the complement of j[1] to the running contig (as in Algorithm 5).

It is to be noted that the partial count for any k-mer in the bidirected G will be greater than or equal to its corresponding value in the baseline version of G. As a result, the value of χ should also be adapted to reflect the increased counts.

Fig. 6 presents an example of the contig enumeration process implemented in Bi-FastEtch. Since a valid traversal through the graph must respect both directions, we denote the forward direction as S (or successor) and the reverse direction as P (or predecessor). The two instances of a switch between the strands are denoted by {S → P, P → S}, and the two extensions along the same strand by {S → S, P → P}. We illustrate the process of obtaining the contig shown in Fig. 6a by showcasing the step-by-step enumeration procedure from Fig. 6b to 6e. Lastly, in Fig. 6f we observe the path followed by the Bi-FastEtch algorithm on the bidirected de Bruijn graph. Here, each node represents the forms of a single k-mer, with the label corresponding to the lexicographically smaller of the two forms.
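The walk of Fig. 6 can be reproduced with a small sketch that tracks the k-mer as it reads along the contig and looks up its canonical form in G. This orientation-tracking formulation is an assumption of the example, chosen because it yields the same case analysis (append the end character for an S node, complement and append the begin character for a P node); the function names are hypothetical, and the contig is seeded with the full begin k-mer as in the figure.

```python
_COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    return min(kmer, reverse_complement(kmer))


def extend(begin: str, graph: dict, chi: int = 1):
    """Greedy bidirected extension; graph maps canonical k-mers to partial counts."""
    current = begin              # oriented k-mer as it reads along the contig
    contig = begin               # seeded with the full begin k-mer, as in Fig. 6
    used = {canonical(begin)}
    steps = []
    while True:
        choice = None
        for c in "acgt":
            nxt = current[1:] + c
            node = canonical(nxt)
            count = graph.get(node, 0)
            if count > chi and node not in used:
                if choice is None or count > choice[0]:
                    choice = (count, nxt, node)
        if choice is None:
            return contig, steps
        _, nxt, node = choice
        # "S": the stored node reads forward, so its end character is appended.
        # "P": the stored node is the reverse complement, so the appended
        # character equals the complement of the node's begin character.
        steps.append(("S" if node == nxt else "P", node))
        contig += nxt[-1]
        used.add(node)
        current = nxt
```

With equal partial counts for all five nodes of Fig. 6f, the first step from gtca is ambiguous, because both ctga and gtga are valid candidates. The usage below gives ctga a larger count purely to select the path drawn in the figure, which also illustrates how aggregate counts can steer the greedy choice.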

With this bidirected approach, we expect the total number of entries in G to reduce, since k-mers originating from opposite strands now occupy a single node. This reduction can be expected to result in performance benefits (both in memory and time).

Impact on quality: While it is clear that introducing the bidirected variant into FastEtch is expected to generate performance savings (as explained above), its impact on quality can be both ways. More specifically, as an intended consequence, by exposing the flexibility to switch between

Algorithm 5: Contig Generation Algorithm for Bi-FastEtch

Input: Approximate bidirected de Bruijn Graph: G
Output: Contigs
for each k-mer i ∈ G in parallel do
    if i is a begin k-mer then
        Initialize contig c ← first k − 1 characters of i
        repeat
            Compute the set of all valid successors S(i)
            Determine succ(i) that is a k-mer in S(i) with maximum partial count (if exists)
            if succ(i) from forward direction then
                Concatenate the end character of succ(i) to contig c
            else
                Complement and Concatenate the begin character of succ(i) to contig c
            i ← succ(i)
        until i does not exist;
        Output contig c

TABLE 1: Input data sets used in our experiments

| Organism | Genome size (bp) | Coverage | # of reads | Avg length | Data size (GB) |
|---|---|---|---|---|---|
| E. coli | 4,703,541 | 80x | 3,713,280 | 100 | 0.4 |
| Yeast | 12,157,105 | 100x | 12,156,200 | 100 | 1.4 |
| C. elegans | 100,286,401 | 50x | 50,143,050 | 100 | 5.3 |
| Chr2 | 243,199,373 | 100x | 238,202,088 | 100 | 32 |
| Chr2 + Chr3 | 441,221,803 | 100x | 432,998,328 | 100 | 51 |

strands, we expect to give the contigs a better chance to grow.

However, since a single node (with an aggregate count) is used to record the counts of a k-mer in both its forms, it is also possible that the choice of a successor from a k-mer is different from the corresponding choice made in the baseline version. A wrong choice could potentially truncate the growth of a contig prematurely; this effect becomes more pronounced in the bidirected approach because of its 8-way freedom in selection. Consider as an example a k-mer i, whose forward extension could have taken a longer extension in the baseline method. It may so happen that the reverse complemented extension of i is abundant in the genome, and therefore the bidirected method makes a greedy choice to extend i along this k-mer, only to discover later that subsequent extensions are not possible.

In Section 4, we analyze the impact of the bidirected implementation on the quality of assembly.

4 RESULTS

4.1 Experimental Setup

We tested our assembler using five different read sets, namely E. coli (K12 MG1655), Yeast, C. elegans, Human chromosome 2 (Chr2), and Human chromosomes 2 and 3 (Chr23) (shown in Table 1). All read data sets were generated using the ART Illumina read simulator [37] for different coverage settings. Experiments were conducted on a single node of the NERSC Cori supercomputer (Cray XC40) with 128 GB of DDR4 memory, with each node equipped with two sockets, each socket populated with a 16-core, 2.3 GHz Intel Haswell processor.

We evaluated multiple variants of our FastEtch method: i) FastEtch refers to the baseline implementation of our


original algorithm (Section 3.3); ii) FastEtch-β and FastEtch-γ refer to the variants that use the β and γ filters to perform insertion in G (as described in Section 3.3.1); and iii) Bi-FastEtch refers to the baseline implementation of our bidirected version (Section 3.4).

For our comparative evaluation, we used other state-of-the-art de Bruijn graph based short-read assemblers, viz. Velvet [5], SOAPdenovo [3], ABySS [4], Minia [15], IDBA-UD [32] and SPAdes [6]. We used the QUAST [38] tool for quality evaluation, which reports the following metrics: N50 contig length (similar to a median contig length), % of genome covered, unaligned contig length (length of those contigs which have no alignment with the reference), and length of the largest region aligning with the genome.
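For readers unfamiliar with the N50 metric, the following sketch computes it from a list of contig lengths using the usual definition: the largest length L such that contigs of length at least L account for at least half of the total assembly length. The function name `n50` is hypothetical, and this is not QUAST's code.

```python
def n50(lengths) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly
```

For contig lengths 2 through 10 (total 54), the three longest contigs sum to 27, exactly half of the total, so the N50 is 8.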

4.2 Performance and Quality Evaluation

Table 2 presents the performance (runtime and memory) and quality statistics for FastEtch, for three inputs. We also report on internal de Bruijn graph statistics, viz. the number of k-mers inserted into G, and the number of k-mers that eventually contributed to contigs. For these experiments, k was fixed at 32, and the code was executed on 32 cores.

First we note that the number of k-mers that are inserted into G is one order of magnitude smaller than the input size (N). This shows the efficacy of the CM sketch as a first level of filter. Subsequently, we note that there is a further one order of magnitude reduction in the number of k-mers that eventually get selected for contributing to the contigs. This shows the high selectiveness of our second filter (χ = 1) in filtering out poor quality k-mers from G. Such poor quality k-mers represent false positives, as they were identified for insertion into G despite being infrequent. Section 4.3 presents results for further reducing these false positives.

Put together, these statistics show that FastEtch contributes to a 100-fold reduction in data complexity (from reads to contigs) for all three inputs.

Table 3 shows the corresponding set of results obtained for Bi-FastEtch. While a similar 100-fold reduction in data complexity is also in place for Bi-FastEtch, compared to FastEtch we can notice a reduction in the number of k-mers inserted into G. This can be attributed to the fact that Bi-FastEtch uses a single node to store a k-mer in either of its two forms (as opposed to two distinct entries in the case of FastEtch). The factor of reduction observed is not as high as 2× because the savings occur only when two reads are sequenced from the same region of the genome but from opposite strands.

As for quality, the results show that despite the approximation of the de Bruijn graph, the output quality of the contigs is maintained relatively high (for both FastEtch and Bi-FastEtch). The N50 contig lengths are in the tens of thousands for both E. coli and Yeast, while for C. elegans the N50 length is in the thousands. Note that for C. elegans we used a lower sequencing coverage (50×). The above trend is also reflected in the overall genome covered by the contigs (> 90% for E. coli and Yeast, and > 80% for C. elegans). The largest aligning regions in all three assemblies were substantially larger than the N50 contig length for both FastEtch and Bi-FastEtch, indicating the ability of the assembler to capture

TABLE 2: Performance evaluation of FastEtch across three organisms, with k=32, using 32 cores

| FastEtch | E. coli, cov=80x (τ=400) | Yeast, cov=100x (τ=800) | C. elegans, cov=50x (τ=1K) |
|---|---|---|---|
| Input size N (bp) | 3.71 × 10^8 | 1.21 × 10^9 | 5.04 × 10^9 |
| No. k-mers in G (bp) | 5.6 × 10^7 | 1.8 × 10^8 | 8.7 × 10^8 |
| No. k-mers in contigs (bp) | 4.6 × 10^6 | 1.1 × 10^7 | 8.5 × 10^7 |
| N50 (bp) | 14,112 | 17,496 | 5,221 |
| Coverage % | 98.01 | 94.40 | 85.09 |
| Largest alignment (bp) | 70,574 | 106,428 | 91,497 |
| Unaligned contig len. (bp) | 2,917 | 20,775 | 617,616 |
| Time (secs) | 48.50 | 171.90 | 600.28 |
| Memory (GB) | 2.45 | 7.42 | 30.90 |

TABLE 3: Performance evaluation of Bi-FastEtch across three organisms, with k=32, using 32 cores

| Bi-FastEtch | E. coli, cov=80x (τ=400) | Yeast, cov=100x (τ=1600) | C. elegans, cov=50x (τ=6K) |
|---|---|---|---|
| Input size N (bp) | 3.71 × 10^8 | 1.21 × 10^9 | 5.04 × 10^9 |
| No. k-mers in G (bp) | 5.0 × 10^7 | 1.5 × 10^8 | 7.1 × 10^8 |
| No. k-mers in contigs (bp) | 4.4 × 10^6 | 1.1 × 10^7 | 8.1 × 10^7 |
| N50 (bp) | 12,337 | 17,536 | 4,235 |
| Coverage % | 96.98 | 93.10 | 81.10 |
| Largest alignment (bp) | 62,061 | 99,574 | 107,037 |
| Unaligned contig len. (bp) | 2,000 | 4,702 | 396,476 |
| Time (secs) | 45.10 | 155.40 | 677.32 |
| Memory (GB) | 2.04 | 5.88 | 25.60 |

long contiguous regions along the genome (via one or more contigs).

The unaligned contig lengths are for those contigs which have no (or partial) alignment against the reference, indicating false assemblies probably generated owing to false extensions during contig generation. These unaligned fractions represent about 0.07%, 0.1% and 0.6% of the E. coli, Yeast and C. elegans genomes, respectively, for FastEtch; and approximately 0.04%, 0.03% and 0.3%, respectively, for Bi-FastEtch. The decrease in the unaligned fraction for Bi-FastEtch can be attributed to the fact that a contig extension at a read’s end is more likely to select the correct k-mer successor because of the option to switch between strands; in contrast, without such an option in the FastEtch implementation, the corresponding extension runs the risk of selecting a lower quality k-mer successor.

We observed that the N50 length of the assembly reduced from FastEtch to Bi-FastEtch. This can be attributed to the alternative extension scenario (described at the end of Section 3.4), whereby the reverse complemented version of a k-mer is also present along the same strand but in a different genomic region.

TABLE 4: Breakdown of execution time for different stages of assembly as seen for both baseline implementations of FastEtch and Bi-FastEtch for all the 3 inputs with varying coverages of 80x, 100x and 50x for E. coli, Yeast and C. elegans respectively.

| Method | Input | Reading input | Graph construction | Contig generation | Total time |
|---|---|---|---|---|---|
| FastEtch | E. coli | 4.38 | 33.48 | 10.71 | 48.57 |
| FastEtch | Yeast | 12.14 | 113.8 | 46.05 | 171.99 |
| FastEtch | C. elegans | 42.29 | 428.93 | 129.06 | 600.28 |
| Bi-FastEtch | E. coli | 4.74 | 26.92 | 13.91 | 45.57 |
| Bi-FastEtch | Yeast | 18.23 | 87.96 | 49.44 | 155.63 |
| Bi-FastEtch | C. elegans | 40.46 | 379.25 | 257.67 | 677.38 |


[Figure 7 plots: speedup (y-axis, 0 to 22) versus number of threads p (x-axis: 1, 2, 4, 8, 16, 32), with curves for Graph Construction, Contig Generation and Total Runtime.]

(a) Speedup of FastEtch

(b) Speedup of Bi-FastEtch

Fig. 7: The speedup of (a) FastEtch and (b) Bi-FastEtch on C. elegans (50x, k=32).

Tables 2 and 3 also show the raw performance (runtime and memory) numbers for both FastEtch implementations. We observe that among the two variants, although FastEtch is faster (12%), Bi-FastEtch is more memory-efficient (24.2%) than FastEtch (as noted for C. elegans). From these two tables, we can conclude that both FastEtch and Bi-FastEtch have their respective quality and performance advantages and limitations.

Table 4 shows the runtime breakdown by the major steps of the algorithm. As expected, the runtime is dominated by graph construction. In our experiments, we fixed the length of the hash table that stores G to roughly 18M buckets (for E. coli and Yeast) and 52M buckets for C. elegans, significantly less than N in anticipation of a large fraction of k-mers getting filtered out. However, based on the actual number of k-mers inserted into G, this implies an average collision rate of 3.11, 10, and 16.7 per bucket for the E. coli, Yeast and C. elegans datasets respectively (for FastEtch). The length was fixed to reduce memory footprint but at the expense of runtime performance (since, due to multi-threading, a bucket needs to be locked prior to inserting a k-mer). Consequently, the length of the hash table represents a trade-off between runtime and memory usage.
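As a check on the collision rates quoted above, the averages follow from dividing the number of k-mers inserted into G (Table 2) by the number of hash table buckets, taking the stated sizes of 18M and 52M buckets as exact for the purpose of illustration:

```latex
\frac{5.6 \times 10^{7}}{1.8 \times 10^{7}} \approx 3.11, \qquad
\frac{1.8 \times 10^{8}}{1.8 \times 10^{7}} = 10, \qquad
\frac{8.7 \times 10^{8}}{5.2 \times 10^{7}} \approx 16.7
```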

Compared to FastEtch, Bi-FastEtch runs faster in the graph construction phase (see Table 4). This is owing to the lower contention encountered when updating G, since fewer k-mers are inserted than in FastEtch. In the contig generation phase, however, FastEtch has a clear runtime advantage, whereas Bi-FastEtch proves to be more expensive; this can be attributed to the added cost of searching for a successor (one in possibly 8 options for Bi-FastEtch vs. one in 4 for FastEtch).
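The 4-versus-8 count can be illustrated as follows. This is one plausible reading, not FastEtch's actual lookup code: on a single strand, a k-mer has at most four successors (its (k−1)-suffix extended by A, C, G or T); when both strands are considered, each extension may be present either as itself or as its reverse complement, giving up to eight candidates. The function names are hypothetical.

```python
# Enumerate successor candidates of a k-mer in the single-strand and
# two-strand settings (illustrative only).

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def forward_successors(kmer: str) -> list[str]:
    """Four candidates: the (k-1)-suffix extended by each base."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT"]


def bidirectional_successors(kmer: str) -> list[str]:
    """Up to eight candidates: each extension and its reverse complement."""
    candidates = []
    for succ in forward_successors(kmer):
        candidates.append(succ)
        candidates.append(reverse_complement(succ))
    # A palindromic extension equals its own reverse complement; drop repeats.
    return list(dict.fromkeys(candidates))


if __name__ == "__main__":
    print(forward_successors("ACGTA"))
    print(bidirectional_successors("ACGTA"))
```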

Fig. 7 shows the speedup of our parallel implementation for a varying number of cores (threads). A positive speedup trend is visible for the graph construction step, even though it is affected by the collision rate, i.e., by the locks used to grant access to concurrently accessed buckets. As expected, the graph construction phase of Bi-FastEtch (Fig. 7b) scales better than that of FastEtch, owing to the reduced contention overhead when inserting new entries into G. FastEtch, however, scales better in the contig generation step (Fig. 7a) and achieves near-linear scaling because that step is lock-free, providing a speedup of 21× on 32 cores. The cost of the contig generation step for FastEtch will be significantly lower than that of Bi-FastEtch for larger datasets. The time (as well as memory) taken to update the CM sketch is negligible, and is therefore not reported. Fig. 8 compares the runtime performance of FastEtch and Bi-FastEtch for the Yeast and C. elegans datasets for up to 32 threads. As can be observed, the runtime improvement achieved by FastEtch is more significant for the larger input (C. elegans).

Fig. 8: Performance scaling of FastEtch vs. Bi-FastEtch on Yeast (100x) and C. elegans (50x) for k=32. The plot shows time in seconds against the number of threads (1, 2, 4, 8, 16, 32).

TABLE 5: Comparing the duplication factor of FastEtch vs. Bi-FastEtch for all the datasets (where FE denotes FastEtch).

| Dataset | Variant | Number of contigs | N50 | Duplication factor | Number of k-mers in G |
|---|---|---|---|---|---|
| E. coli | FE (τ=400) | 1,473 | 14,122 | 1.94 | 56,029,869 |
| E. coli | Bi-FE (τ=400) | 934 | 12,337 | 1 | 50,208,772 |
| Yeast | FE (τ=800) | 3,565 | 17,805 | 1.92 | 182,127,245 |
| Yeast | Bi-FE (τ=1600) | 2,133 | 17,536 | 1 | 154,341,561 |
| C. elegans | FE (τ=1k) | 58,304 | 5,221 | 1.69 | 886,744,228 |
| C. elegans | Bi-FE (τ=6k) | 38,493 | 4,280 | 1.01 | 712,570,949 |
| Chr2 | FE (τ=6k) | 142,994 | 2,678 | 1.48 | 1,086,339,454 |
| Chr2 | Bi-FE (τ=24k) | 103,457 | 2,210 | 1.01 | 810,708,459 |
| Chr2+3 | FE (τ=10k) | 258,676 | 2,620 | 1.48 | 1,965,038,032 |
| Chr2+3 | Bi-FE (τ=20k) | 190,154 | 2,146 | 1.02 | 1,527,148,071 |


TABLE 6: Parametric study of FastEtch for assemblies of E. coli (Cov=80x) across varying factors.

k-mer length k=32

| τ | Hash table length | # k-mers in G | Memory (GB) | Time (secs) | N50 (bp) | % Genome covered | Largest alignment (bp) | Unaligned length (bp) | % misassembled contigs |
|---|---|---|---|---|---|---|---|---|---|
| 200 | 18,874,368 | 58,978,326 | 2.432 | 51.80 | 14,890 | 98.12 | 83,923 | 1,978 | 6.16% |
| 300 | 18,874,368 | 57,507,466 | 2.399 | 48.20 | 14,740 | 98.13 | 83,923 | 1,998 | 6.63% |
| 400 | 18,874,368 | 56,029,869 | 2.351 | 46.60 | 14,693 | 98.14 | 83,923 | 3,077 | 6.11% |
| 600 | 18,874,368 | 53,078,477 | 2.211 | 44.20 | 14,665 | 98.06 | 71,068 | 2,004 | 6.19% |
| 800 | 18,874,368 | 50,127,274 | 2.195 | 43.00 | 14,111 | 97.90 | 66,238 | 2,452 | 5.53% |
| 1000 | 18,874,368 | 47,155,735 | 2.099 | 42.80 | 13,160 | 97.50 | 69,209 | 2,828 | 5.22% |
| 1200 | 18,874,368 | 44,179,222 | 1.984 | 41.10 | 10,748 | 97.40 | 62,171 | 3,812 | 5.48% |

k-mer length k=32

| τ | Hash table length | # k-mers in G | Memory (GB) | Time (secs) | N50 (bp) | % Genome covered | Largest alignment (bp) | Unaligned length (bp) | % misassembled contigs |
|---|---|---|---|---|---|---|---|---|---|
| 200 | 33,554,432 | 58,978,765 | 2.739 | 47.00 | 14,015 | 98.10 | 88,246 | 2,928 | 5.91% |
| 300 | 33,554,432 | 57,507,307 | 2.688 | 45.60 | 13,757 | 98.10 | 88,246 | 4,908 | 6.07% |
| 400 | 33,554,432 | 56,030,371 | 2.593 | 40.90 | 14,546 | 98.10 | 77,525 | 7,619 | 6.14% |
| 600 | 33,554,432 | 53,078,433 | 2.516 | 39.20 | 14,325 | 97.90 | 70,038 | 7,121 | 5.62% |
| 800 | 33,554,432 | 50,130,188 | 2.487 | 38.19 | 12,909 | 97.90 | 68,972 | 4,184 | 6.07% |
| 1000 | 33,554,432 | 47,156,623 | 2.379 | 37.60 | 12,366 | 97.80 | 68,972 | 4,015 | 5.39% |
| 1200 | 33,554,432 | 44,175,445 | 2.240 | 37.40 | 10,874 | 97.80 | 66,944 | 7,194 | 5.32% |

k-mer length k=44

| τ | Hash table length | # k-mers in G | Memory (GB) | Time (secs) | N50 (bp) | % Genome covered | Largest alignment (bp) | Unaligned length (bp) | % misassembled contigs |
|---|---|---|---|---|---|---|---|---|---|
| 200 | 18,874,368 | 63,092,411 | 2.650 | 42.18 | 23,574 | 98.35 | 115,446 | 389 | 5.89% |
| 300 | 18,874,368 | 61,137,725 | 2.582 | 41.50 | 22,202 | 98.28 | 115,446 | 593 | 5.41% |
| 400 | 18,874,368 | 59,191,723 | 2.496 | 40.40 | 18,249 | 98.10 | 93,430 | 599 | 5.27% |
| 600 | 18,874,368 | 55,292,056 | 2.352 | 39.90 | 10,942 | 98.01 | 92,690 | 869 | 2.99% |
| 800 | 18,874,368 | 51,380,781 | 2.248 | 39.50 | 5,855 | 97.57 | 29,958 | 770 | 2.13% |
| 1000 | 18,874,368 | 47,453,173 | 2.187 | 37.80 | 3,047 | 95.71 | 21,459 | 4,105 | 1.07% |
| 1200 | 18,874,368 | 43,519,699 | 2.061 | 36.30 | 1,671 | 90.61 | 13,655 | 4,633 | 0.80% |

TABLE 7: Parametric study of FastEtch vs. Bi-FastEtch for assemblies of Yeast (Cov=100x) across varying values of τ, given k=32. Results are obtained on 32 threads.

FastEtch

| τ | # k-mers in G | N50 (bp) | % Genome covered | Largest alignment (bp) | Unaligned length (bp) | Time (secs) | Memory (GB) |
|---|---|---|---|---|---|---|---|
| 400 | 187,682,890 | 17,521 | 94.4 | 109,829 | 20,725 | 176.3 | 7.552 |
| 600 | 184,901,135 | 17,709 | 94.3 | 106,428 | 18,310 | 165.7 | 7.420 |
| 800 | 182,127,245 | 17,805 | 94.4 | 106,428 | 19,524 | 174.7 | 7.420 |
| 1000 | 179,361,700 | 17,377 | 94.5 | 109,829 | 20,099 | 169.2 | 7.296 |
| 1200 | 176,584,446 | 17,369 | 94.4 | 106,423 | 18,685 | 170.1 | 7.054 |
| 1400 | 173,800,389 | 16,937 | 94.1 | 93,211 | 20,601 | 162.0 | 6.912 |

Bi-FastEtch

| τ | # k-mers in G | N50 (bp) | % Genome covered | Largest alignment (bp) | Unaligned length (bp) | Time (secs) | Memory (GB) |
|---|---|---|---|---|---|---|---|
| 800 | 165,133,955 | 16,627 | 93.2 | 101,186 | 10,161 | 158.7 | 6.272 |
| 1000 | 162,452,803 | 16,669 | 93.5 | 101,186 | 6,562 | 154.7 | 6.144 |
| 1300 | 158,405,028 | 16,692 | 93.5 | 95,315 | 5,667 | 151.5 | 6.016 |
| 1600 | 154,341,561 | 17,536 | 93.4 | 99,913 | 4,702 | 151.8 | 5.888 |
| 1800 | 151,633,774 | 15,834 | 93.2 | 88,269 | 7,902 | 146.8 | 5.760 |
| 2000 | 148,921,015 | 15,834 | 93.1 | 89,539 | 7,667 | 143.4 | 5.632 |

The overall memory consumption is dominated by the space needed to store G, which is implemented as a hash table in both our current implementations. Note that we store neither the input reads nor the edges of the approximate de Bruijn graph. We use a compact bit representation for each k-mer in G. Even though the number of k-mers stored in G is an order of magnitude smaller than N, the actual memory consumed by G, in bytes, is still 5-6× larger than N. This is mainly due to the auxiliary information needed per node in our current implementation.
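The bit layout of the compact k-mer representation is not spelled out here. A common choice, assumed in the sketch below, is two bits per nucleotide (A=00, C=01, G=10, T=11), which packs a k-mer of length 32 into exactly 64 bits. The helper names are hypothetical.

```python
# Two-bit packing of k-mers into integers (assumed encoding, for illustration).

ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def pack_kmer(kmer: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def unpack_kmer(value: int, k: int) -> str:
    """Recover the k-length DNA string from its packed integer form."""
    bases = []
    for _ in range(k):
        bases.append(DECODE[value & 3])  # lowest two bits hold the last base
        value >>= 2
    return "".join(reversed(bases))


if __name__ == "__main__":
    packed = pack_kmer("ACGT")
    print(packed, unpack_kmer(packed, 4))
```

With such a packing, a 32-mer occupies 8 bytes. By comparison, the roughly 2.35 GB reported in Table 6 for about 56 million k-mers works out to over 40 bytes per k-mer, consistent with the remark that per-node auxiliary information dominates the footprint.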

Table 5 shows the contig quality statistics of our implementation for all the input datasets. As observed, the duplication factor for Bi-FastEtch is close to one in every case (between 1 and 1.02), suggesting that a particular contig region is covered only once. This is consistent with the smaller number of contigs and the far fewer k-mers in G, as compared to FastEtch. As a result, Bi-FastEtch produces a smaller unaligned contig fraction and at times a smaller N50, mainly because k-mers sequenced from either direction in different genomic regions are reported only once.

4.3 Evaluation of Trade-offs

Parametric study: Table 6 depicts a detailed parametric analysis conducted using the E. coli dataset, by varying three factors: 1) k, 2) the CM sketch threshold (τ), and 3) the hash table length for G. We experimented with values of 32 and 44 for k, 18M and 33M for the hash table length, and varied τ from 200 to 1200. As expected, the number of k-mers, across all three sets of experiments, continues to diminish as τ increases, accompanied by a noticeable decrease in memory consumption. As a result, we also observe a reduction in the execution time. However, an excessive increase in τ lowers the output quality (in particular, N50), mostly because a higher value of τ makes the selection of k-mers for G more conservative. The effect of τ is more pronounced in experiments with larger k-mer lengths, as seen in the case of k=44. This is because a larger value of k produces more distinct k-mers, each with a lower frequency of occurrence. Therefore, lowering τ as k increases is recommended. Results for k=32, however, do not show a distinguishable difference in overall quality with changes in either the threshold or the hash table length, maintaining a steady coverage close to 98%. The length of the hash table does, however, impact performance, with a larger length leading to a reduction in runtime. Memory consumption, on the other hand, is still dominated by the number of k-mers inserted into G (the hash table) and is therefore affected by τ.

TABLE 8: Comparing the FastEtch-γ and FastEtch-β versions on the E. coli dataset (Cov=80x) with k=32, executed on 32 threads. Recall that N for this dataset = 3.71 × 10^8 bp.

FastEtch-γ, τ=1300

| γ | # k-mers in G | N50 | % Genome covered | Largest alignment | Unaligned length |
|---|---|---|---|---|---|
| 500 | 41,971,361 | 10,436 | 98.03 | 68,996 | 2,017 |
| 600 | 42,677,649 | 13,727 | 98.10 | 80,299 | 2,482 |
| 700 | 42,681,792 | 13,837 | 98.10 | 64,317 | 2,339 |
| 800 | 42,680,585 | 13,677 | 98.07 | 64,317 | 2,258 |
| 900 | 42,680,721 | 14,100 | 98.04 | 90,536 | 2,181 |
| 1000 | 42,682,062 | 14,303 | 97.90 | 80,238 | 2,187 |

FastEtch-β, τ=1200

| β | # k-mers in G | N50 | % Genome covered | Largest alignment | Unaligned length |
|---|---|---|---|---|---|
| 2 | 44,176,423 | 13,955 | 98.11 | 75,871 | 4,972 |
| 3 | 44,173,912 | 14,341 | 98.12 | 77,991 | 4,990 |
| 4 | 44,176,575 | 14,007 | 98.11 | 75,871 | 1,325 |
| 5 | 44,169,275 | 14,078 | 98.02 | 77,991 | 4,794 |
| 6 | 42,829,705 | 2,308 | 97.22 | 21,031 | 11,941 |
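To make the role of the CM sketch threshold τ in this parametric study concrete, the sketch below implements a small Count-Min sketch and admits a k-mer into a set standing in for G once its estimated count reaches τ. This is a simplified illustration under stated assumptions: the admission rule, the sketch dimensions (`width`, `depth`) and the hashing via `hashlib.blake2b` are choices made for this example and are not taken from the FastEtch implementation.

```python
import hashlib


class CountMinSketch:
    """Minimal Count-Min sketch: depth rows of width counters each."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item: str):
        # One salted hash per row selects that row's counter.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item: str) -> int:
        """Increment the item's counters and return its updated estimate."""
        counts = []
        for row, col in self._cells(item):
            self.table[row][col] += 1
            counts.append(self.table[row][col])
        return min(counts)  # never below the true count


def filter_kmers(reads, k, tau, width=1024, depth=4):
    """Return k-mers whose estimated count reaches tau (assumed rule)."""
    sketch = CountMinSketch(width, depth)
    graph_nodes = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if sketch.add(kmer) >= tau:
                graph_nodes.add(kmer)
    return graph_nodes


if __name__ == "__main__":
    reads = ["ACGTACGT"] * 5
    for tau in (5, 8, 100):
        print(tau, sorted(filter_kmers(reads, k=4, tau=tau)))
```

Raising `tau` can only shrink the admitted set, which mirrors the trend in Table 6: fewer k-mers in G and less memory, at the risk of discarding k-mers that the assembly needs.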

Table 7 presents a parametric study comparing the assemblies generated by FastEtch and Bi-FastEtch for the Yeast dataset at k=32, for varying values of τ. In these experiments, we used higher values of τ for the Bi-FastEtch runs than for FastEtch. This is required because the counts for both forms of each k-mer are stored in the same CM sketch entry. On average, G in Bi-FastEtch holds a little over 20M fewer k-mers than in FastEtch. This is reflected in the memory consumption and runtime performance, where we observe improvements of 18% and 11%, respectively.
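The need for a larger τ in Bi-FastEtch follows from counting a k-mer and its reverse complement under a single key. A minimal illustration is given below. It uses an exact `collections.Counter` in place of the CM sketch and assumes that the lexicographically smaller of the two forms serves as the shared key, a common convention that this section does not confirm.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Shared key for a k-mer and its reverse complement (assumed rule)."""
    return min(kmer, reverse_complement(kmer))


def count_kmers(reads, k: int, both_strands: bool) -> Counter:
    """Count k-mers per strand, or merged across strands if both_strands."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            counts[canonical(kmer) if both_strands else kmer] += 1
    return counts


if __name__ == "__main__":
    # The second read is the reverse complement of the first.
    reads = ["AAACC", "GGTTT"]
    print(count_kmers(reads, 3, both_strands=False))
    print(count_kmers(reads, 3, both_strands=True))
```

In this toy input, merging the strands halves the number of distinct keys and doubles each count, which is why a threshold tuned for single-strand counts admits more k-mers than intended when both forms share an entry.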

Comparing β vs. γ Heuristics: We compared the effect of varying the two filtering parameters, β and γ, for the FastEtch implementations. Table 8 shows the results of this parametric study for FastEtch-γ and FastEtch-β. These experiments were run with a higher value of τ, in order to expose the trade-off between better filtering efficacy of the CM sketch and output quality.

From Table 8, we see that the performance of the γ and β variants is mostly comparable, although a small cutoff for β (e.g., 2) is sufficient to match the quality that is achieved with larger values of γ (e.g., 1000). Recall that β reflects the rate at which insertions occur from within a CM sketch bucket into G. From a user perspective, this makes the choice of β easier, as less experimentation is needed to identify an optimal parameter setting.

Table 8 also shows the trade-off between reducing the number of k-mers stored in G and preserving the assembly quality (i.e., space vs. quality). In fact, we see that the number of k-mers in G can be reduced (relative to the baseline version) without altering the quality, up to certain values of β and γ. For instance, the output quality for a FastEtch setting of τ=400 on the E. coli dataset is comparable to the output quality obtained with any of the γ settings shown in Table 8, while the number of k-mers is reduced from 56 million to 42 million (a 25% reduction).

Although not shown in the table, we also experimented with the β version of Bi-FastEtch (referred to as Bi-FastEtch-β). Our experiments show that Bi-FastEtch-β is capable of achieving a further reduction in the number of k-mers in G. For example, on the E. coli dataset, it stored only 36 million k-mers in G, corresponding to an overall 35% reduction compared to FastEtch.
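The reductions quoted for the γ/β variants and for Bi-FastEtch-β follow directly from the rounded k-mer counts given in the text (56, 42 and 36 million). A short check, using only those figures:

```python
# Relative reduction in the number of k-mers stored in G.


def percent_reduction(before: float, after: float) -> float:
    """Return the reduction from before to after, as a percentage."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # FastEtch (tau=400) vs. the gamma settings of Table 8: 56M -> 42M
    print(percent_reduction(56e6, 42e6))  # 25.0
    # FastEtch vs. Bi-FastEtch-beta: 56M -> 36M, about 35.7
    print(round(percent_reduction(56e6, 36e6), 1))
```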

4.4 Comparative analysis

We compared FastEtch with six other state-of-the-art assemblers, viz. SOAPdenovo, Velvet, ABySS, Minia, IDBA-UD and SPAdes. All experiments were executed on a node with 32 cores. Memory consumption corresponds to peak memory usage; note, however, that Minia also uses secondary storage. For SOAPdenovo and Velvet, the graph construction and contig generation processes needed to be run separately. For comparison purposes, we use the E. coli genome, which was the smallest input for which all the programs ran successfully under at least one comparable parameter setting.

Table 9 shows the results of our comparison. We observed that all five variants of FastEtch consistently perform the best in terms of runtime, with Bi-FastEtch-β running the fastest of all assemblers tested and using the least memory among the FastEtch variants. More specifically, Bi-FastEtch-β ran approximately 38% faster than the second fastest assembler (Minia) across all the experiments for k=32. With respect to memory usage, FastEtch was third, with Minia and IDBA-UD running in less memory. With respect to the other assemblers, FastEtch consumes less memory, with savings ranging from 50% (over Velvet) to 78% (over SOAPdenovo).

In terms of assembly quality, FastEtch and its variants provide the best genome coverage, while the other assemblers yield higher N50 values for most inputs, albeit taking longer to complete. Note that ABySS, Velvet and SPAdes are full-blown assemblers with all the conventional steps, such as (exact) graph construction, error correction and contig generation. All assemblers barring Minia are multi-threaded parallel implementations. Also, ABySS is a parallel code that uses the Message Passing Interface (MPI) for parallelization. Minia performs comparably in quality to both ABySS and IDBA-UD despite using Bloom filters. The SPAdes assembler reports the largest N50 length of all the assemblers tested, although it is costly in terms of the time and memory it consumes. Our FastEtch assembler uses a more aggressive approximation strategy through its use of the CM sketch and a successor selection heuristic that obviates the need to generate or store the entire de Bruijn graph.
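Since N50 is the main contiguity measure used in this comparison, a short reference computation may help. The sketch assumes the usual definition, which this section does not restate: N50 is the largest length L such that contigs of length at least L account for at least half of the total assembly length.

```python
# N50 of an assembly, given the lengths of its contigs.


def n50(contig_lengths) -> int:
    """Return the N50 of a non-empty collection of contig lengths."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-negative lengths")


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

In the usage example, the total length is 54, and the three longest contigs (10, 9 and 8) together reach 27, so the N50 is 8.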

Table 10 shows the results of our analysis on larger inputs. It is to be noted that the largest input in this table (Chr2+3) is almost ten times the size of the smallest dataset in the table (C. elegans). We observe that the overall N50 lengths among almost all assemblers are comparable, except for SPAdes, which produces the highest N50 length for C. elegans. However, we were unable to execute SPAdes for the larger inputs (Chr2 and Chr2+3), since it exceeded the 128 GB memory limit on the node.

In terms of memory consumption for FastEtch, we observed that it dropped from 6× (for C. elegans) to 2× larger than N (for both the human chromosomal inputs). This reduction can be attributed to the following reason: as the input size of the genome grows, we encounter a crossover point where the number of possible distinct k-mers cannot exceed a certain limit.

TABLE 9: Comparative evaluation of assemblies generated by FastEtch and six other state-of-the-art assemblers. "NA" indicates that the assembler was unable to produce an output. We configured Bi-FastEtch to run with the β heuristic for k=32.

E. coli (cov=80x), k=32

| Assembler | Memory (GB) | Time (secs) | N50 (bp) | % Genome covered | Largest alignment (bp) | Unaligned length (bp) |
|---|---|---|---|---|---|---|
| FastEtch (τ=400) | 2.45 | 48.50 | 14,112 | 98.01 | 70,574 | 2,917 |
| FastEtch-γ (τ=1300, γ=900) | 2.13 | 46.63 | 13,557 | 98.59 | 72,322 | 1,495 |
| FastEtch-β (τ=1300, β=2) | 2.08 | 45.71 | 13,996 | 98.60 | 81,479 | 6,727 |
| Bi-FastEtch (τ=400) | 2.04 | 45.11 | 12,337 | 96.92 | 62,061 | 2,000 |
| Bi-FastEtch-β (τ=1300, β=2) | 1.92 | 38.60 | 12,832 | 96.88 | 70,938 | 300 |
| SOAPdenovo | 8.70 | 93.20 | 457 | 87.20 | 1,720 | 275,726 |
| ABySS | 3.69 | 188.70 | 22,904 | 97.70 | 127,978 | 0 |
| Velvet | 3.58 | 105.90 | 22,173 | 97.70 | 120,922 | 0 |
| Minia | 0.69 | 61.70 | 22,175 | 97.70 | 127,978 | 0 |
| IDBA-UD | 0.67 | 97.86 | 23,404 | 97.34 | 128,055 | 0 |
| SPAdes | 4.35 | 140.84 | 67,289 | 98.04 | 203,041 | 0 |

E. coli (cov=80x), k=44

| Assembler | Memory (GB) | Time (secs) | N50 (bp) | % Genome covered | Largest alignment (bp) | Unaligned length (bp) |
|---|---|---|---|---|---|---|
| FastEtch (τ=200) | 2.69 | 42.90 | 23,574 | 98.30 | 115,446 | 389 |
| FastEtch-γ (τ=500, γ=1000) | 2.53 | 44.40 | 24,453 | 98.19 | 97,149 | 3,361 |
| FastEtch-β (τ=500, β=2) | 2.51 | 43.50 | 25,499 | 98.20 | 99,689 | 1,130 |
| SOAPdenovo | 8.45 | 76.40 | 1,269 | 95.50 | 7,346 | 26,449 |
| ABySS | 4.10 | 237.60 | 46,085 | 98.10 | 166,040 | 0 |
| Velvet | NA | NA | NA | NA | NA | NA |
| Minia | 0.77 | 69.10 | 46,092 | 98.06 | 166,043 | 100 |
| IDBA-UD | 0.69 | 91.53 | 48,757 | 97.68 | 166,040 | 0 |
| SPAdes | 4.86 | 145.25 | 109,989 | 98.16 | 327,368 | 0 |

Yeast (cov=100x), k=32

| Assembler | Memory (GB) | Time (secs) | N50 (bp) | % Genome covered | Largest alignment (bp) | Unaligned length (bp) |
|---|---|---|---|---|---|---|
| FastEtch (τ=800) | 7.42 | 171.90 | 17,496 | 94.40 | 106,428 | 20,775 |
| FastEtch-γ (τ=1600, γ=1000) | 6.53 | 173.61 | 15,967 | 94.05 | 130,414 | 16,430 |
| FastEtch-β (τ=2400, β=2) | 6.66 | 175.40 | 17,279 | 94.30 | 91,499 | 14,320 |
| Bi-FastEtch (τ=1600) | 5.88 | 155.46 | 17,536 | 93.14 | 99,174 | 4,702 |
| Bi-FastEtch-β (τ=3200, β=2) | 5.32 | 143.76 | 16,866 | 93.15 | 139,795 | 7,790 |
| SOAPdenovo | 13.95 | 282.30 | 246 | 82.94 | 2,112 | 2,075,457 |
| ABySS | 10.65 | 707.40 | 22,371 | 94.20 | 107,976 | 0 |
| Velvet | 10.88 | 363.80 | 20,802 | 94.20 | 107,974 | 348 |
| Minia | 3.97 | 230.40 | 22,272 | 93.10 | 107,976 | 0 |
| IDBA-UD | 1.88 | 290.33 | 22,892 | 93.35 | 107,976 | 0 |
| SPAdes | 9.06 | 941.85 | 46,723 | 94.84 | 134,218 | 1,004 |

Yeast (cov=100x), k=44

| Assembler | Memory (GB) | Time (secs) | N50 (bp) | % Genome covered | Largest alignment (bp) | Unaligned length (bp) |
|---|---|---|---|---|---|---|
| FastEtch (τ=400) | 7.94 | 160.70 | 23,252 | 95.00 | 129,195 | 5,445 |
| FastEtch-γ (τ=800, γ=1000) | 7.30 | 165.00 | 21,529 | 94.90 | 140,023 | 4,577 |
| FastEtch-β (τ=1800, β=2) | 7.17 | 157.60 | 22,471 | 94.80 | 144,332 | 5,953 |
| SOAPdenovo | 12.54 | 216.50 | 1,902 | 94.30 | 10,952 | 63,413 |
| ABySS | 11.47 | 655.10 | 34,159 | 94.40 | 122,865 | 0 |
| Velvet | NA | NA | NA | NA | NA | NA |
| Minia | 3.46 | 242.20 | 34,328 | 93.70 | 122,865 | 0 |
| IDBA-UD | 1.78 | 266.67 | 34,672 | 94.08 | 122,865 | 0 |
| SPAdes | 8.99 | 945.81 | 61,654 | 95.26 | 149,817 | 438 |

With respect to time, both FastEtch and Bi-FastEtch perform significantly faster than all the other assemblers. It is to be noted that, unlike other assemblers such as Minia and IDBA-UD, the coverage for FastEtch can drastically improve with a higher sequencing depth. We were unable to execute ABySS on the larger genomes due to an unknown machine configuration issue.

Fig. 9 summarizes the quality-time-memory comparisons for all the methods tested, using a 3-D visual representation, for the C. elegans dataset (k=32). In this visual representation, the further left and the higher an assembler appears, the better the quality-time-memory trade-off it exhibits. The tools IDBA-UD and Minia offer the best trade-offs among all the assemblers tested. As can be observed, despite using an approximation approach, FastEtch is able to produce quality comparable to that of the best assemblers, while running the fastest and also taking less memory in general. Optimizing the code to generate more memory savings could improve the performance-quality trade-off achieved by FastEtch even further. For instance, our current implementation uses a hash table with chaining to store all the k-mers, which can be further optimized for space consumption.
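For context, a chained hash table with one lock per bucket, matching the earlier description of G (a bucket is locked before a k-mer is inserted), might look like the following minimal sketch. The class and method names are hypothetical, Python's built-in `hash` stands in for whatever hash function the implementation uses, and Python threads serve only to illustrate the locking pattern, not its performance.

```python
import threading


class ChainedKmerTable:
    """Fixed-length hash table with chaining and one lock per bucket."""

    def __init__(self, num_buckets: int) -> None:
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]
        self.locks = [threading.Lock() for _ in range(num_buckets)]

    def _index(self, kmer: str) -> int:
        return hash(kmer) % self.num_buckets

    def insert(self, kmer: str) -> bool:
        """Insert kmer if absent; return True if it was newly added."""
        idx = self._index(kmer)
        with self.locks[idx]:  # only this bucket is serialized
            chain = self.buckets[idx]
            if kmer in chain:
                return False
            chain.append(kmer)
            return True

    def __contains__(self, kmer: str) -> bool:
        idx = self._index(kmer)
        with self.locks[idx]:
            return kmer in self.buckets[idx]

    def __len__(self) -> int:
        return sum(len(chain) for chain in self.buckets)


if __name__ == "__main__":
    table = ChainedKmerTable(num_buckets=4)
    kmers = ["ACGT", "CGTA", "GTAC", "TACG", "AAAA"]
    workers = [
        threading.Thread(target=lambda: [table.insert(x) for x in kmers])
        for _ in range(8)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(len(table))  # 5 distinct k-mers regardless of thread interleaving
```

A shorter table means longer chains and more threads waiting on the same lock, which is the runtime-versus-memory trade-off described for the hash table length.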

5 CONCLUSIONS AND FUTURE WORK

In this paper we introduced FastEtch, a novel approach to assemble genomes using sketching. Our approach computes the assembly based on an approximate, space-efficient version of the de Bruijn graph. To the best of our knowledge, FastEtch is the first short-read assembler to use the sublinear streaming data structure, CM sketch. Results have demonstrated that, despite approximation, FastEtch can produce highly accurate assemblies quickly and with a reduced memory footprint, compared to most other assemblers. Our algorithm also exposes trade-offs between time, memory and quality that can be exploited by users.

Future research directions include (but are not limited to): i) exploring alternative, more compact strategies to implement the approximate de Bruijn graph; ii) use of error correction heuristics and incorporation of paired-end information to further improve quality; iii) theoretical and analytical studies to further improve the filtering efficacy of the CM sketch; iv) parallelization on distributed memory machines for scaling to much larger data sets; and v) extension to other assembly frameworks such as transcriptome assembly.

TABLE 10: Comparative evaluation of larger assemblies generated by FastEtch and five other assemblers for k=32.

C. elegans (cov=50x)

| Assembler | Time (mins) | Memory (GB) | N50 (bp) | % Genome covered | Largest alignment (bp) | Unaligned length (bp) |
|---|---|---|---|---|---|---|
| FastEtch (τ=1k) | 10.01 | 30.9 | 5,221 | 85.05 | 89,867 | 716,672 |
| Bi-FastEtch (τ=6k) | 12.21 | 26.2 | 4,280 | 81.2 | 91,182 | 374,336 |
| SOAPdenovo | 25.17 | 42.01 | 622 | 88.9 | 5,638 | 9,719,664 |
| Velvet | 24.56 | 47.1 | 5,008 | 90.6 | 75,227 | 3,563 |
| Minia | 37.23 | 7.4 | 5,925 | 86.08 | 108,539 | 264 |
| IDBA-UD | 23.48 | 8.9 | 6,200 | 86.3 | 108,539 | 0 |
| SPAdes | 365.48 | 45.5 | 13,553 | 91.2 | 158,548 | 2,987 |

Chr2 (cov=100x)

| Assembler | Time (mins) | Memory (GB) | N50 (bp) | % Genome covered | Largest alignment (bp) | Unaligned length (bp) |
|---|---|---|---|---|---|---|
| FastEtch (τ=6k) | 39.63 | 62.2 | 2,678 | 72.12 | 30,088 | 4,305,938 |
| Bi-FastEtch (τ=24k) | 42.11 | 57.4 | 2,197 | 68.12 | 26,199 | 1,972,550 |
| SOAPdenovo | 56.23 | 73.6 | 2,083 | 75.87 | 18,021 | 543 |
| Velvet | 67.27 | 95.4 | 2,781 | 76.92 | 35,073 | 0 |
| Minia | 78.43 | 10.7 | 2,829 | 77.54 | 35,075 | 835 |
| IDBA-UD | 58.34 | 25.2 | 2,909 | 78.15 | 35,478 | 0 |
| SPAdes | out of memory | | | | | |

Chr2+3 (cov=100x)

| Assembler | Time (mins) | Memory (GB) | N50 (bp) | % Genome covered | Largest alignment (bp) | Unaligned length (bp) |
|---|---|---|---|---|---|---|
| FastEtch (τ=10k) | 70.83 | 115.9 | 2,620 | 72.11 | 30,088 | 5,296,708 |
| Bi-FastEtch (τ=30k) | 74.55 | 99.6 | 2,146 | 68.11 | 25,035 | 2,681,386 |
| SOAPdenovo | 88.53 | 84.74 | 2,046 | 73.22 | 17,858 | 543 |
| Velvet | 130.46 | 121.33 | 2,697 | 76.92 | 31,155 | 0 |
| Minia | 186.22 | 17.25 | 2,751 | 76.57 | 31,158 | 835 |
| IDBA-UD | 116.25 | 43.98 | 2,837 | 77.28 | 31,158 | 0 |
| SPAdes | out of memory | | | | | |

Fig. 9: 3-D plot depicting the performance (time, memory, quality) of all assemblers tested, on the C. elegans dataset (Cov=50x) with k=32. The figure represents the data for the baseline variants of both FastEtch and Bi-FastEtch. The two panels plot N50 (bp) and coverage (%), respectively, against time (mins) and memory (GB) for FastEtch, Bi-FastEtch, SOAPdenovo, IDBA-UD, Velvet, Minia and SPAdes.

ACKNOWLEDGMENTS

This research was supported in part by U.S. DOE grant DE-SC-0006516. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. DOE under Contract No. DE-AC02-05CH11231.

REFERENCES

[1] J. Butler, I. MacCallum, M. Kleber et al., “ALLPATHS: de novoassembly of whole-genome shotgun microreads,” Genome Research,vol. 18, no. 5, pp. 810–820, 2008.

[2] M. J. Chaisson and P. A. Pevzner, “Short read fragment assemblyof bacterial genomes,” Genome Research, vol. 18, no. 2, pp. 324–330,2008.

[3] R. Li, H. Zhu, J. Ruan et al., “De novo assembly of human genomeswith massively parallel short read sequencing,” Genome Research,vol. 20, no. 2, pp. 265–272, 2010.

[4] J. T. Simpson, K. Wong, S. D. Jackman et al., “ABySS: a parallelassembler for short read sequence data,” Genome Research, vol. 19,no. 6, pp. 1117–1123, 2009.

[5] D. R. Zerbino and E. Birney, “Velvet: algorithms for de novo shortread assembly using de Bruijn graphs,” Genome Research, vol. 18,no. 5, pp. 821–829, 2008.

[6] A. Bankevich, S. Nurk, D. Antipov, A. A. Gurevich, M. Dvorkin,A. S. Kulikov, V. M. Lesin, S. I. Nikolenko, S. Pham, A. D.Prjibelski et al., “SPAdes: a new genome assembly algorithm andits applications to single-cell sequencing,” Journal of ComputationalBiology, vol. 19, no. 5, pp. 455–477, 2012.

[7] J. A. Chapman, I. Ho, S. Sunkara et al., “Meraculous: de novogenome assembly with short paired-end reads,” PloS ONE, vol. 6,no. 8, p. e23501, 2011.

[8] K. Salikhov, G. Sacomoto, and G. Kucherov, “Using cascadingbloom filters to improve the memory usage for de brujin graphs,”Algorithms for Molecular Biology, vol. 9, no. 1, p. 1, 2014.

[9] D. E. Knuth, The art of computer programming: sorting and searching.Pearson Education, 1998, vol. 3.

[10] E. Georganas, A. Buluc, J. Chapman et al., “Parallel de Bruijngraph construction and traversal for de novo genome assembly,”in High Performance Computing, Networking, Storage and Analysis,SC14, 2014, pp. 437–448.

[11] G. Cormode and S. Muthukrishnan, “An improved data streamsummary: the count-min sketch and its applications,” Journal ofAlgorithms, vol. 55, no. 1, pp. 58–75, 2005.

[12] P. Ghosh and A. Kalyanaraman, “A fast sketch-based assemblerfor genomes,” in Proceedings of the 7th ACM International Confer-ence on Bioinformatics, Computational Biology, and Health Informatics.ACM, 2016, pp. 241–250.

[13] N. Nagarajan and M. Pop, “Sequence assembly demystified,”Nature Reviews Genetics, vol. 14, no. 3, pp. 157–167, 2013.

Page 16: 1 FastEtch: A Fast Sketch-based Assembler for Genomesananth/papers/Ghosh_TCBB17_preprint.pdfIndex Terms—Genome assembly, de Bruijn Graph, Count-Min sketch, Approximation methods

16

[14] P. A. Pevzner, H. Tang, and M. S. Waterman, “An Eulerian pathapproach to DNA fragment assembly,” Proceedings of the NationalAcademy of Sciences, vol. 98, no. 17, pp. 9748–9753, 2001.

[15] R. Chikhi and G. Rizk, “Space-efficient and exact de Bruijn graphrepresentation based on a Bloom filter,” Algorithms for MolecularBiology, vol. 8, no. 1, p. 1, 2013.

[16] T. C. Conway and A. J. Bromage, “Succinct data structures forassembling large genomes,” Bioinformatics, vol. 27, no. 4, pp. 479–486, 2011.

[17] J. T. Simpson and R. Durbin, “Efficient construction of an assemblystring graph using the FM-index,” Bioinformatics, vol. 26, no. 12,pp. i367–i373, 2010.

[18] J. T. Simpson and R. Durbin, “Efficient de novo assembly of largegenomes using compressed data structures,” Genome Research,vol. 22, no. 3, pp. 549–556, 2012.

[19] J. M. Holt, J. R. Wang, C. D. Jones, and L. McMillan, “Improvedlong read correction for de novo assembly using an FM-index,”bioRxiv, p. 067272, 2016.

[20] C. T. Brown, A. Howe, Q. Zhang, A. B. Pyrkosz, and T. H. Brom,“A reference-free algorithm for computational normalization ofshotgun sequencing data,” arXiv preprint arXiv:1203.4802, 2012.

[21] A. Rahman and L. Pachter, “CGAL: computing genome assemblylikelihoods,” Genome Biology, vol. 14, no. 1, p. R8, 2013.

[22] V. Boza, B. Brejova, and T. Vinar, “GAML: genome assembly bymaximum likelihood,” Algorithms for Molecular Biology, vol. 10,no. 1, p. 18, 2015.

[23] H. Mirebrahim, T. J. Close, and S. Lonardi, “De novo meta-assembly of ultra-deep sequencing data,” Bioinformatics, vol. 31,no. 12, pp. i9–i16, 2015.

[24] B. G. Jackson, M. Regennitter, X. Yang, P. S. Schnable, andS. Aluru, “Parallel de novo assembly of large genomes fromhigh-throughput short reads,” in Parallel & Distributed Processing(IPDPS), 2010 IEEE International Symposium on. IEEE, 2010, pp.1–10.

[25] S. Boisvert, F. Laviolette, and J. Corbeil, “Ray: simultaneousassembly of reads from a mix of high-throughput sequencingtechnologies,” Journal of Computational Biology, vol. 17, no. 11, pp.1519–1533, 2010.

[26] Y. Liu, B. Schmidt, and D. L. Maskell, “Parallelized short readassembly of large genomes using de Bruijn graphs,” BMC bioinfor-matics, vol. 12, no. 1, p. 1, 2011.

[27] A. Abu-Doleh and U. V. Catalyurek, “Spaler: Spark and GraphXbased de novo genome assembler,” in Big Data (Big Data), 2015IEEE International Conference on. IEEE, 2015, pp. 1013–1018.

[28] J. Meng, B. Wang, Y. Wei, S. Feng, and P. Balaji, “SWAP-Assembler:scalable and efficient genome assembly towards thousands ofcores,” in BMC bioinformatics, vol. 15, no. Suppl 9. BioMed CentralLtd, 2014, p. S2.

[29] E. Georganas, A. Buluc, J. Chapman, S. Hofmeyr, C. Aluru,R. Egan, L. Oliker, D. Rokhsar, and K. Yelick, “HipMer: an extreme-scale de novo genome assembler,” in Proceedings of the InternationalConference for High Performance Computing, Networking, Storage andAnalysis. ACM, 2015, p. 14.

[30] D. Li, C.-M. Liu, R. Luo, K. Sadakane, and T.-W. Lam, “MEGAHIT:an ultra-fast single-node solution for large and complex metage-nomics assembly via succinct de Bruijn graph,” Bioinformatics, p.btv033, 2015.

[31] A. Bowe, T. Onodera, K. Sadakane, and T. Shibuya, “Succinct deBruijn graphs,” in International Workshop on Algorithms in Bioinfor-matics. Springer, 2012, pp. 225–235.

[32] Y. Peng, H. Leung, S.-M. Yiu, and F. Y. Chin, “IDBA-UD: a de novoassembler for single-cell and metagenomic sequencing data withhighly uneven depth,” Bioinformatics, vol. 28, no. 11, pp. 1420–1428, 2012.

[33] R. Chikhi, A. Limasset, S. Jackman, J. T. Simpson, andP. Medvedev, “On the representation of de Bruijn graphs,” in Inter-national Conference on Research in Computational Molecular Biology.Springer, 2014, pp. 35–55.

[34] F. Almodaresi, P. Pandey, and R. Patro, “Rainbowfish: A SuccinctColored de Bruijn Graph Representation,” bioRxiv, p. 138016, 2017.

[35] K. Belk, C. Boucher, A. Bowe, T. Gagie, P. Morley, M. D. Muggli,N. R. Noyes, S. J. Puglisi, and R. Raymond, “Succinct colored deBruijn graphs,” bioRxiv, p. 040071, 2016.

[36] S. Marcus, H. Lee, and M. C. Schatz, “SplitMEM: a graphical algo-rithm for pan-genome analysis with suffix skips,” Bioinformatics,vol. 30, no. 24, pp. 3476–3483, 2014.

[37] W. Huang, L. Li, J. R. Myers, and G. T. Marth, “ART: a next-generation sequencing read simulator,” Bioinformatics, vol. 28,no. 4, pp. 593–594, 2012.

[38] A. Gurevich, V. Saveliev, N. Vyahhi, and G. Tesler, “QUAST:quality assessment tool for genome assemblies,” Bioinformatics, p.btt086, 2013.

Priyanka Ghosh received her BS degree in Information Technology from the West Bengal University of Technology, Kolkata, India in 2007. She obtained her MS degree in Computer Science from University of Houston, Houston, USA in 2012. She is currently working to obtain her PhD degree in Computer Science at Washington State University, Pullman, USA. Her research interests lie in the field of Parallel Algorithms, High Performance Computing and Computational Biology.

Ananth Kalyanaraman received the bachelor's degree from Visvesvaraya National Institute of Technology, Nagpur, India, in 1998, and the MS and PhD degrees from Iowa State University, Ames, USA in 2002 and 2006, respectively. Currently, he is an associate professor and Boeing Centennial Chair in Computer Science, at the School of Electrical Engineering and Computer Science in Washington State University, Pullman, USA. His research focuses on developing parallel algorithms and software for data-intensive problems originating in the areas of computational biology and graph-theoretic applications. He is a recipient of a DOE Early Career Award, an Early Career Impact Award and two best paper awards. He serves on editorial boards of IEEE Transactions on Parallel and Distributed Systems and Journal of Parallel and Distributed Computing. Ananth is a member of ACM, IEEE, ISCB and SIAM.