whole genome assembly from 454 sequencing output via modified dna graph concept

7
Computational Biology and Chemistry 33 (2009) 224–230 Contents lists available at ScienceDirect Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem Brief Communication Whole genome assembly from 454 sequencing output via modified DNA graph concept Jacek Blazewicz a,c , Marcin Bryja a , Marek Figlerowicz c , Piotr Gawron a , Marta Kasprzak a,c , Edward Kirton b , Darren Platt b , Jakub Przybytek a , Aleksandra Swiercz a,c,, Lukasz Szajkowski b a Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland b Lawrence Livermore National Laboratory, Joint Genome Institute, 7000 East Avenue, Livermore, CA 94550, USA c Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland article info Article history: Received 25 November 2008 Received in revised form 29 March 2009 Accepted 23 April 2009 Keywords: DNA assembly 454 sequencing DNA graph abstract Recently, 454 Life Sciences Corporation proposed a new biochemical approach to DNA sequencing (the 454 sequencing). It is based on the pyrosequencing protocol. The 454 sequencing aims to give reliable output at a low cost and in a short time. The produced sequences are shorter than reads produced by classical methods. Our paper proposes a new DNA assembly algorithm which deals well with such data and outperforms other assembly algorithms used in practice. The constructed SR-ASM algorithm is a heuristic method based on a graph model, the graph being a modified DNA graph proposed for DNA sequencing by hybridization procedure. Other new features of the assembly algorithm are, among others, temporary compression of input sequences, and a new and fast multiple alignment heuristics taking advantage of the way the output data for the 454 sequencing are presented and coded. The usefulness of the algorithm has been proved in tests on raw data gen- erated during sequencing of the whole 1.84Mbp genome of Prochlorococcus marinus bacteria and also on a part of chromosome 15 of Homo sapiens. The source code of SR-ASM can be downloaded from http://bio.cs.put.poznan.pl/ in the section ‘Current research’ ‘DNA Assembly’. Among publicly available assemblers our algorithm appeared to generate the best results, especially in the number of produced contigs and in the lengths of the contigs with high similarity to the genome sequence. © 2009 Elsevier Ltd. All rights reserved. 1. Introduction The DNA sequence assembly process is one of the most impor- tant problems in computational biology, and is known for its high complexity. The huge amount of data makes the problem very diffi- cult to solve. The fact that data are erroneous and incomplete makes the problem even more difficult. Exact algorithms cannot be used in practice because of unacceptable computation time. Many teams worldwide try hard to provide heuristics producing satisfying sub- optimal outcomes (see for example Kececioglu and Myers, 1995; Idury and Waterman, 1995; Jiang and Li, 1996; Pevzner et al., 2001; Chaisson et al., 2004; Sundquist et al., 2007). The errors present in the input sequences to assembly prob- lem are generated during a wet experiment and their number and character depend on the applied sequencing method. Several bio- chemical approaches have been elaborated to determine nucleotide sequences of naturally occurring DNA molecules. For the long time Corresponding author. E-mail address: [email protected] (A. Swiercz). Sanger method (Maxam and Gilbert, 1977; Sanger et al., 1977) was commonly used. It is based on a standard primer extension reaction involving fluorescently labeled terminators of DNA synthe- sis. The DNA sequence is established during consecutive capillary electrophoresis of the reaction products. An additional method developed during the last decades was sequencing by hybridiza- tion (SBH) (Southern, 1988; Bains and Smith, 1988; Lysov et al., 1988; Drmanac et al., 1989; Guénoche, 1992; Blazewicz et al., 1999a, 2002; Zhang et al., 2003; Blazewicz et al., 2004, 2006). However, SBH never reached a highly advanced level allowing its wider appli- cation. Recently, DNA sequencing has been revolutionized by an introduction of novel massively parallel sequencing systems: 454 (Life Sciences) (Margulies et al., 2005), Solexa (Illumina) (Bennett, 2004) and SOLID (Applied Biosystems) (Fu et al., 2008). They are capable of generating a huge amount of sequencing data (from 100 million bases per run (454) to billion bases per run (Solexa and SOLID)) at less than 1% of the cost of capillary-based meth- ods. Each system employed different strategy to generate high quality data. 454 relies on pyrosequencing of clonally amplified DNA fragments fixed to beads placed in water-in-oil emulsion. It produces 200–300 nucleotide sequence reads. Solexa and SOLID 1476-9271/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2009.04.005

Upload: jacek-blazewicz

Post on 26-Jun-2016

216 views

Category:

Documents


2 download

TRANSCRIPT

B

WD

JEa

b

c

a

ARRA

KD4D

1

tcctiwoIC

lccs

1d

Computational Biology and Chemistry 33 (2009) 224–230

Contents lists available at ScienceDirect

Computational Biology and Chemistry

journa l homepage: www.e lsev ier .com/ locate /compbio lchem

rief Communication

hole genome assembly from 454 sequencing output via modifiedNA graph concept

acek Blazewicza,c, Marcin Bryjaa, Marek Figlerowiczc, Piotr Gawrona, Marta Kasprzak a,c,dward Kirtonb, Darren Plattb, Jakub Przybyteka, Aleksandra Swiercza,c,∗, Lukasz Szajkowskib

Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, PolandLawrence Livermore National Laboratory, Joint Genome Institute, 7000 East Avenue, Livermore, CA 94550, USAInstitute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland

r t i c l e i n f o

rticle history:eceived 25 November 2008eceived in revised form 29 March 2009ccepted 23 April 2009

eywords:NA assembly54 sequencingNA graph

a b s t r a c t

Recently, 454 Life Sciences Corporation proposed a new biochemical approach to DNA sequencing (the454 sequencing). It is based on the pyrosequencing protocol. The 454 sequencing aims to give reliableoutput at a low cost and in a short time. The produced sequences are shorter than reads produced byclassical methods. Our paper proposes a new DNA assembly algorithm which deals well with such dataand outperforms other assembly algorithms used in practice.

The constructed SR-ASM algorithm is a heuristic method based on a graph model, the graph being amodified DNA graph proposed for DNA sequencing by hybridization procedure. Other new features ofthe assembly algorithm are, among others, temporary compression of input sequences, and a new andfast multiple alignment heuristics taking advantage of the way the output data for the 454 sequencing

are presented and coded. The usefulness of the algorithm has been proved in tests on raw data gen-erated during sequencing of the whole 1.84 Mbp genome of Prochlorococcus marinus bacteria and alsoon a part of chromosome 15 of Homo sapiens. The source code of SR-ASM can be downloaded fromhttp://bio.cs.put.poznan.pl/ in the section ‘Current research’ → ‘DNA Assembly’.

Among publicly available assemblers our algorithm appeared to generate the best results, especiallyin the number of produced contigs and in the lengths of the contigs with high similarity to the genome

sequence.

. Introduction

The DNA sequence assembly process is one of the most impor-ant problems in computational biology, and is known for its highomplexity. The huge amount of data makes the problem very diffi-ult to solve. The fact that data are erroneous and incomplete makeshe problem even more difficult. Exact algorithms cannot be usedn practice because of unacceptable computation time. Many teams

orldwide try hard to provide heuristics producing satisfying sub-ptimal outcomes (see for example Kececioglu and Myers, 1995;dury and Waterman, 1995; Jiang and Li, 1996; Pevzner et al., 2001;haisson et al., 2004; Sundquist et al., 2007).

The errors present in the input sequences to assembly prob-

em are generated during a wet experiment and their number andharacter depend on the applied sequencing method. Several bio-hemical approaches have been elaborated to determine nucleotideequences of naturally occurring DNA molecules. For the long time

∗ Corresponding author.E-mail address: [email protected] (A. Swiercz).

476-9271/$ – see front matter © 2009 Elsevier Ltd. All rights reserved.oi:10.1016/j.compbiolchem.2009.04.005

© 2009 Elsevier Ltd. All rights reserved.

Sanger method (Maxam and Gilbert, 1977; Sanger et al., 1977)was commonly used. It is based on a standard primer extensionreaction involving fluorescently labeled terminators of DNA synthe-sis. The DNA sequence is established during consecutive capillaryelectrophoresis of the reaction products. An additional methoddeveloped during the last decades was sequencing by hybridiza-tion (SBH) (Southern, 1988; Bains and Smith, 1988; Lysov et al.,1988; Drmanac et al., 1989; Guénoche, 1992; Blazewicz et al., 1999a,2002; Zhang et al., 2003; Blazewicz et al., 2004, 2006). However,SBH never reached a highly advanced level allowing its wider appli-cation. Recently, DNA sequencing has been revolutionized by anintroduction of novel massively parallel sequencing systems: 454(Life Sciences) (Margulies et al., 2005), Solexa (Illumina) (Bennett,2004) and SOLID (Applied Biosystems) (Fu et al., 2008). They arecapable of generating a huge amount of sequencing data (from100 million bases per run (454) to billion bases per run (Solexa

and SOLID)) at less than 1% of the cost of capillary-based meth-ods. Each system employed different strategy to generate highquality data. 454 relies on pyrosequencing of clonally amplifiedDNA fragments fixed to beads placed in water-in-oil emulsion. Itproduces 200–300 nucleotide sequence reads. Solexa and SOLID

iology

sssalachBaohv4

bsratbfttel(psotipaw24r1

tSaabT

2

sotnIBbsoectu

La

J. Blazewicz et al. / Computational B

equencing approaches are built around a very large number ofhort sequence reads (ca. 30–50 nt). In both systems individualhort DNA molecules are attached to a solid surface and clonallymplified. During the next stage Solexa uses especially designedabeled nucleotides which can reversible terminate DNA synthesisnd SOLID is based on sequential ligation with dye-labeled oligonu-leotides. In all three systems the necessity for DNA cloning andigh G-C content are not as much of a problem as in Sanger method.ecause new sequencing approaches are massively parallel, theyllow for detecting mutations at a low sensitivity level. On thether hand, the length of sequence reads causes the problems whenighly repetitive regions are analyzed. In addition, new systems areery sensitive to contaminations. In this paper, we focus only on the54 sequencing approach.

The specificity of the 454 sequencing data influences the assem-ly algorithm that is used at the computational stage of theequencing process. In this paper, we propose a new assembly algo-ithm which deals well with these data and outperforms otherssembly algorithms known from the literature and used in prac-ice. The algorithm is a heuristics based on a graph model, the grapheing built from the set of input sequences. The algorithm success-ully utilizes some ideas coming from the procedures developed forhe computational phase of the SBH (Blazewicz et al., 1999a). In par-icular, we use here a modified concept of the DNA graph (Blazewiczt al., 1999b), as well as Prize Collecting Traveling Salesman Prob-em (PCTSP), the latter being used for finding a consensus sequenceBlazewicz et al., 1999a). Since the assembly process is more com-licated than SBH, we also use temporary compression of the inputequences, a new heuristic procedure for selecting promising pairs,perations repairing the lack of some arcs or excess of some ver-ices in the graph, and finally, the system of voting of sequencesn creating the consensus sequence. The computational tests wereerformed on the data coming from real biochemical experimentst the Joint Genome Institute (JGI), which sought to sequence thehole genome of Prochlorococcus marinus bacteria (Chen et al.,

006)(accession number in GenBank is NC 007335) with the new54 sequencing approach. Moreover, the performance of the algo-ithm was checked on a mammalian data—a part of chromosome5 of Homo sapiens (accession number in GenBank is EU606050).

The organization of the paper is as follows. Section 2 containshe description of the new assembly algorithm proposed here—theR-ASM algorithm. In Section 3 the computational results of thelgorithm are compared with the outcomes of five other assemblylgorithms used in practice and also with outcomes obtained at JGI,y a combination of two assembly programs and experts’ finishing.he paper ends with Section 4.

. Methods

In the classical approach to DNA assembly, algorithms mergeequences of lengths equal to a few hundreds of nucleotides inrder to obtain an output sequence (or a series of disjoint con-igs in case of lack of coverage in some places) up to a few millionucleotides long (see for example Kececioglu and Myers, 1995;

dury and Waterman, 1995; Jiang and Li, 1996; Pevzner et al., 2001;lazewicz and Kasprzak, 2008). The input sequences come fromoth strands of the DNA. They are usually found by gel electrophore-is, sequencing by hybridization, or by capillary electrophoresis inne of its variations. The input sequences contain errors of sev-ral types: insertions, deletions, and substitutions of nucleotides,himeras and contaminations. The average depth of coverage (i.e.

he average number of reads covering a position) of the solution issually 6–10.

The sequencing process in the novel approach invented by 454ife Sciences Corporation, named the 454 sequencing, is performedutomatically by a sequencer, which gives short DNA reads (usu-

and Chemistry 33 (2009) 224–230 225

ally 100–200 nucleotides long) with high average depth of coverage(even 30–40) (Margulies et al., 2005). The 454 sequencer protocolis based on measuring the intensities of light generated during thesequencing process. The sequences (reads) are presented as chainsof nucleotides and in addition, the sequencer produces rates of con-fidence for every nucleotide. Similar to the previous sequencingmethods, the produced sequences come from both DNA strands andinclude random insertions, deletions, substitutions of nucleotides,and contaminations. This time, however, the nature of the bio-chemical process excludes chimeras. Moreover, compared to othermethods, the sequences have fewer insertion, deletion and sub-stitution errors. The speed of generating the data gives a greatopportunity to speed up the overall genome assembly process.However, the short reads need to be treated by a special assemblyalgorithm. In the last few years several new methods dedicated toshort reads were implemented (e.g. Chaisson et al., 2004; Sundquistet al., 2007; Zerbino and Birney, 2008). In this paper we use twoof them, i.e. VELVET from EMBL-EBI (Zerbino and Birney, 2008)and NEWBLER assembler (Margulies et al., 2005), for comparisonwith our method. VELVET is one of the newest assembly packages,designed for both Solexa and 454 technologies. The authors decidednot to take into account the confidence rates of nucleotides, treat-ing this information as rather unimportant. It will be interesting tocompare this method with NEWBLER and our new algorithm, themore so as the latter methods exploit this information.

The motivation of our work was to invent a new assembly algo-rithm that performs well with data from the 454 sequencer togetherwith the confidence rates. To check its usefulness we had raw datacoming from real experiments using the 454 sequencer, undertakenat Lawrence Livermore National Laboratory (LLNL) operating jointlywith the Joint Genome Institute (JGI). The data covered the wholegenome of Prochlorococcus marinus bacteria, of length 1.84 Mbp. Wealso had results of the assembly process, obtained for this data byJGI, with the use of the programs Forge and Jazz together withexperts’ finishing (see Section 3). Moreover, additional tests on apart of chromosome 15 of Homo sapiens were done.

2.1. SR-ASM Algorithm

In the proposed SR-ASM algorithm (Short Reads ASseMbly) threephases can be distinguished. The first phase computes feasibleoverlaps for all input sequences (reads). The input parameters,among others, are the value of the minimum overlap between twosequences and the value of the error bound, the latter being the per-centage of the mismatches allowed in the overlap of two sequences.These values are set by the user of the algorithm. Feasible overlapmeans here the overlap satisfying both parameters. The feasibleoverlaps are computed also for the sequences being reverse com-plementary to the input ones, due to the assumption that the readscome from both strands of a DNA helix. In the second phase, agraph is constructed with the reads and their reverse complemen-tary counterparts as the vertices. Each read and its complementarycounterpart create a pair of bound vertices. Here, we use a modi-fied concept of the DNA graph invented in the context of the SBH(Blazewicz et al., 1999b). (The original DNA graphs (Blazewicz etal., 1999b) were defined as induced subgraphs of de Bruijn graphs(de Bruijn, 1946)) Two vertices being labeled by the correspondingreads are connected by a directed arc if there is a feasible over-lap between these reads. We look for a path which passes throughone of the vertices from every pair: either through the straight-forward input sequence or its reverse complementary counterpart.

In this procedure we use the idea of the Prize Collecting TravelingSalesman Problem (PCTSP), developed again in the context of theSBH approach (Blazewicz et al., 1999a). Usually it is not possibleto find a single path in the graph and several paths are returnedas solutions. At the end, in the third phase, a consensus sequence

2 iology

(h

oncwWlcfcwcbcitip

hosrsa

sreoIt(tbp“c

c4paltoqott

Fgfm

26 J. Blazewicz et al. / Computational B

sequences) is determined with a new and fast multiple alignmenteuristics.

In order to find overlaps for all n sequences (the first phasef the algorithm), we should compare all pairs of sequences. Theumber of such pairs is O(n2). The alignment of two sequences (aalculation of a possible overlap of two sequences) is performedith the Smith–Waterman algorithm (Smith and Waterman, 1981;aterman, 1995) and takes time O(k2), where k is the length of the

onger sequence. The complexity of this problem makes the exactalculation of the complete overlap matrix impossible in practiceor big instances (big n). For example, around 3 × 1011 sequenceomparisons are necessary to reconstruct a 2 Mbp bacterial genomeith reads of 100 bp and average depth of coverage equal to 15. The

omputation time is therefore extremely long, but can be shortenedy applying some strategies. For instance, the number of necessaryomparisons of every sequence with every other sequence can benitially decreased. Only the selected pairs are then compared by theime-consuming Smith–Waterman algorithm. On the other hand,t is possible to speed up the alignment procedure; for example, therocedure can work in subquadratic time (Crochemore et al., 2003).

In our approach we applied both strategies. We introduced aeuristic method which estimates the quality of the alignmentf two sequences, and selects the so-called promising pairs ofequences. We also achieve the acceleration of the alignment algo-ithm, by imposing thresholds of minimum overlap between twoequences and maximum error rate of the alignment as definedbove.

The new heuristic algorithm for selecting the promising pairs ofequences works as follows. The sequences contain relatively fewandom insertions, deletions and substitutions. The most frequentrrors in the 454 output data are the lack or the excess of nucleotidesf one type in a subsequence of the same consecutive nucleotides.f during the comparison of two sequences we omit the informa-ion about the number of consecutive nucleotides of the same typeremembering only one nucleotide of each type instead of a few)hen we can compare the sequences which contain less errors. Thus,efore the proper alignment, the sequences are temporarily com-ressed by deleting all but one repeated nucleotides (e.g. keepingA” instead of “AAA”). Next, we compare all pairs of sequences byhecking for common substrings.

Let us define a window as a fragment of a sequence of a given,onstant length (another parameter of the algorithm, usually set to–7 nucleotides). For every pair of compressed sequences we com-are their respective sets of windows. The leading sequence in thelignment (with the Smith–Waterman algorithm) is defined as theeftmost one. From the leading sequence we determine windows inhe initial part of the sequence (the initial part is another parameter

f the algorithm, and may vary between 20 and 40 nucleotides). Theuality of a window is evaluated according to the confidence ratesf the first nucleotides in the series of consecutive nucleotides ofhe same type present in a window. If any of the confidence rates ofhese nucleotides is below a given threshold, then such a window is

ig. 1. An example of reducing some unnecessary vertices. (a) Sequences s2, s3, s4, s5 araph which passes through vertices s1 → s2 → s3 → s4. Sequence s4 could not be connact s5 is a subsequence of s1). (b) All the subsequences (s2, s3, s4 and s5) are included in

oved to sequence s1 (additional connection between s1 and s6).

and Chemistry 33 (2009) 224–230

rejected. We denote the set of windows from the leading sequence(A) as WA. Next, we look for the windows from the next sequence(B) in the alignment, which are present in set WA. These windowsconstitute set WB. At the end, we check whether or not the order ofthe windows’ appearance is the same in both sequences. In orderto do this we define WAB as:

WAB = {s ∈ WB : posA(predB(s)) < posA(s)} (1)

where posA(s) is the position where window s starts in sequence A,and predB(s) is the feasible window preceding s in sequence B. Then,measure p, which reflects a fitness of two sequences A and B to beconsidered as overlapping ones in the algorithm, can be defined bythe equation

p = |WAB| + 1|WB| (2)

If p exceeds a certain threshold (given as a parameter for the algo-rithm) then the pair of sequences A and B are marked as promising.A shift of an overlap of two sequences is defined as the number ofnucleotides of the leading sequence not overlapped by the secondone. For all the promising pairs of sequences the optimal semiglobalalignment, represented by the shift between two sequences inthe alignment and by the score (similarity), is calculated by theSmith–Waterman algorithm (+1 for every match, −2 for every mis-match, and −1 for every gap).

Having the alignments, we can now construct the graph whichallows to find the path (or paths) (the second phase of the algo-rithm). The vertices of the graph correspond to the sequences andits directed arcs connect the vertices for which the alignment hasbeen calculated and for which the overlap is feasible. Now, someadditional operations have to be done within the graph to help findthe longest path in the graph. The procedure of selecting promis-ing pairs of sequences can result in omitting some important arcs.These operations allow to restore some arcs (if the neighborhoodsuggests a good connection which should be present there) or elim-inate some vertices (if one sequence is included in the anothersequence) (see Fig. 1).

Now we search for a path in the graph. Here, the algorithm usessome ideas derived from PCTSP, developed for the SBH approach(Blazewicz et al., 1999a). Starting from any vertex, we construct apath by joining the vertices with the smallest shift of the overlap ofthe corresponding sequences. If there are two candidate arcs withthe same shift values then the one with the highest similarity ofa candidate sequence and the previously added one is chosen. Thepath is constructed as long as any vertices can be added in bothdirections: either predecessors or successors of the path. Such apath is constructed for every vertex as the starting point. The longest

path is chosen, and all the vertices from the path (and their reversecomplementary counterparts) are deleted from the graph. If thereare still some vertices left in the graph, other paths are constructed,as long as the length of the new path is greater than the minimumpath length (another parameter of the algorithm).

re subsequences of sequence s1. Consider a path constructed in the modified DNAected with s5 due to weak overlap, but s1 could be possibly connected with s5 (into the longer sequence (s1), and all the connections between other sequences are

J. Blazewicz et al. / Computational Biology and Chemistry 33 (2009) 224–230 227

Table 1Results of the computational experiment for smaller instances (5 × 102–105 input sequences) of the genome of Prochlorococcus marinus for the PHRAP, TIGR-AS and CAP3assemblers.

PHRAP TIGR-AS CAP3

Input seq. No. of contigs Qual [%] Cov [%] Time [s] No. of contigs Qual [%] Cov [%] Time [s] No. of contigs Qual [%] Cov [%] Time [s]

98.70 22.34 99.50 41.42500 1 97.78 99.27 1.6 31 97.87 16.02 2.6 7 99.33 26.44 8.0

98.52 14.01 88.08 24.9096.93 90.43 98.56 12.78 91.55 27.26

1000 4 99.32 9.66 3.5 52 99.26 11.91 6.6 11 99.93 21.97 17.388.85 3.95 98.74 11.05 99.50 20.4899.52 52.40 99.06 12.95 99.91 23.86

2000 6 95.45 47.73 7.4 90 99.26 8.44 13.1 16 99.80 15.04 34.588.85 1.98 99.21 7.42 99.93 11.0197.78 99.27 99.00 5.68 99.91 9.67

5000 7 88.85 0.80 19.2 198 99.06 5.25 44.4 33 99.83 8.80 88.698.10 0.63 99.11 4.02 99.34 8.5999.60 47.98 99.00 2.78 99.67 6.55

10000 10 99.51 34.35 37.2 394 99.09 2.57 140.3 76 99.70 5.32 179.799.52 10.38 99.56 2.34 99.91 4.7399.60 24.20 99.00 1.40 99.76 3.39

20000 22 96.48 18.63 84.5 817 99.06 1.29 604.1 184 99.65 2.96 408.799.48 17.33 64.11 1.23 99.68 2.8699.59 11.62 99.73 1.21 99.68 1.36

50000 63 96.49 7.40 218.6 570 98.86 0.64 4767.2 493 99.75 1.34 1204.297.95 6.94 98.81 0.59 99.69 1.31

98.1 99.

99.

afpt(atoc

sotafiosraattsafs

pwd

3

twcs

73.17 6.7400000 135 98.69 6.48 487.7 622

99.60 5.82

After the completion of the above stage, we have a set of pathsnd a set of single sequences. The algorithm performs now theollowing steps as long as any changes can be made in the set ofaths, in order to connect the disjoint parts of the current solu-ion (both paths and single sequences). First, alignments for endslast vertices) of the currently existing paths and single sequencesre made both for their straightforward and reverse complemen-ary encoding. Then, the ends with the best fit are overlapped. Ifne of the joined elements appears to be a single read, its reverseomplementary counterpart is removed from further analysis.

The last phase of the algorithm involves composing the consensusequence (or a series of contigs) from the obtained path (or setf paths, respectively). The multiple alignment algorithm is thenrying to align optimally overlapping sequences (in order to createcontig). A leading sequence is chosen (at the beginning it is therst sequence in the path) together with these sequences whichverlap it. Next, for each sequence the alignment with the leadingequence is constructed. All the alignments are put together, withespect to the leading sequence, and if there are any differences atposition, the nucleotides at that position in the sequences vote

s follows. If a nucleotide of one type occurs at a given position inhe alignment in at least 50% of sequences, then it is accepted inhe consensus sequence, otherwise it is rejected. When the leadingequence finishes, it is replaced with the next one from the path,nd the above procedure repeats until all the sequences (vertices)rom the path are used. If there are more than one path the aboveteps are repeated for each path.

Paths composed of a fewer number of reads than the minimumath length are not considered. Our tests showed that due to thise lost only contigs of length not greater than 98 nucleotides whicho not affect the 100% coverage of the genome.

. Results and Discussion

The major part of our computational experiment consisted inests with the use of raw data produced at JGI in cooperationith LLNL. The data cover the whole genome of Prochloro-

occus marinus bacteria of length 1.84 Mbp. The output of theequencer contains over 300,000 reads, each approximately 100

31 0.37 99.57 1.0134 0.30 24024.5 887 99.78 0.87 2822.932 0.28 99.65 0.76

nucleotides long. The sequencer also provides rates of confidence ofnucleotides.

In the computational experiment, apart from the complete setof data as described above, also smaller instances were prepared inorder to do comprehensive comparison of various assemblers. Forthe complete set some methods were not able to perform correctlyas either they did not finish the computations in a given time or theirresults were poor. The smaller instances are subsets of the completeset and cover continuous fragments of the genome. Thus, they the-oretically should result in a small number of disjoint contigs. Themodel sequence of Prochlorococcus marinus is known and published(Chen et al., 2006)(accession number in GenBank is NC 007335), weused it only to evaluate the quality of produced contigs.

The tests were carried out on SUN Fire 6800 in PoznanSupercomputing and Networking Center. The assembly programswhich were used for the comparison are widely known andpublicly available: PHRAP (http://www.phrap.org/), CAP3 (Huangand Madan, 1999)(http://genome.cs.mtu.edu/), TIGR assembler(Sutton et al., 1995)(TIGR-AS, http://www.tigr.org/). From amongthe assemblers dedicated to short reads, which are publicly avail-able, we chose VELVET as one of the newest ones (Zerbino andBirney, 2008)(http://www.ebi.ac.uk/∼zerbino/velvet). The resultsobtained at JGI for the complete data set, with the use of programsForge (http://combiol.org/talk/ForgeG/) and Jazz (the assembler ofJGI) together with experts’ finishing, are also presented here. For thesame complete data we have had the results of NEWBLER assem-bler, being a 454 product attached to the sequencer, using flowsignals of the sequencer instead of nucleotide sequences. To com-plete the comparison, we added our previous assembly programASM (Blazewicz et al., 2004).

The criteria which have been used to compare the outcomes ofthe methods are: the number of contigs produced by the methods,the quality of the largest contigs, the coverage, N50 length, and thetime of computations. The quality, i.e. the similarity of a contig to

a fragment of the genome, was calculated by the Smith–Watermanalgorithm with the score +1 for a match, −2 for every mismatch,and −1 for a gap. The coverage is the percentage of the total lengthof the reconstructed sequence which is covered by the contig. N50length is the greatest possible value x such that 50% of the whole

228 J. Blazewicz et al. / Computational Biology and Chemistry 33 (2009) 224–230

Table 2Results of the computational experiment for smaller instances (5 × 102–105 input sequences) of the genome of Prochlorococcus marinus for the ASM, VELVET and SR-ASMassemblers.

ASM VELVET SR-ASM

Input seq. No. of contigs Qual [%] Cov [%] Time [s] No. of contigs Qual [%] Cov [%] Time [s] No. of contigs Qual [%] Cov [%] Time [s]

94.43 76.50 48.54 29.50500 8 84.26 25.73 7.7 11 49.53 15.76 0.1 1 99.86 100.00 2.0

97.78 5.34 49.41 14.9996.23 96.46 47.59 17.53

1000 6 80.38 5.05 31.8 30 47.95 15.39 2.4 1 99.78 100.00 5.097.92 2.11 49.42 9.8797.20 48.86 47.76 10.70

2000 18 90.15 26.86 128.0 65 47.80 7.69 3.9 1 99.75 100.00 10.097.53 23.94 52.34 5.4296.82 19.74 47.39 4.18

5000 28 97.20 19.79 794.6 179 47.59 3.64 6.8 1 99.78 100.00 31.097.08 17.92 48.72 3.5596.91 17.68 47.90 2.57

10000 74 96.82 9.66 3380.9 395 47.84 2.02 11.7 1 99.78 100.00 113.097.20 9.68 47.39 2.0197.42 10.72 99.79 1.71 99.69 43.65

20000 147 90.83 9.23 11783.3 752 52.39 1.12 22.5 10 99.78 20.67 746.097.04 8.31 47.03 1.05 99.76 14.8097.23 4.84 99.90 0.57 99.82 10.17

50000 360 97.10 4.78 81075.2 1959 99.95 0.53 53.7 31 99.63 10.10 5926.097.40 3.69 47.90 0.52 99.79 9.1397.36 4.91 99.96 0.39 99.82 10.42

100000 700 97.52 2.96 351418.7 3873 99.95 0.29 108.0 76 99.72 9.60 22987.097.30 2.79 99.90 0.29 99.66 6.94

F NEWw t set c

rn

cTcqt“c

attfib

gaa

ig. 2. Comparison of the quality of three largest contigs obtained by the SR-ASM,hole genome of Prochlorococcus marinus bacteria of length 1.86 Mbp. The input tes

econstructed sequence is covered by produced contigs of lengthot less than x.

In Tables 1 and 2 results for the smaller instances of Prochloro-occus marinus are presented. Table 1 contains the results of PHRAP,IGR-AS, and CAP3, Table 2 of ASM, VELVET, and SR-ASM. For theases where the algorithms resulted in more than one contig, theuality and coverage values are given for three longest contigs. Inhe tables, “input seq.” means the number of reads in instances,qual” stands for the quality, “cov” for the coverage, and “no. ofontigs” for the number of contigs produced by the assemblers.

Results of the tests performed for the whole bacteria’s genomere shown in Table 3. Not all compared algorithms are presentedhere—ASM and TIGR-AS had unacceptable computation time. Onhe other hand, results from both sources from JGI (JGI after experts’nishing and NEWBLER) are added. The results from Table 3 are

etter visualized in Figs. 2–5 .

As we can see from the tables, the computation time of ASMrows rapidly with the instance size. The algorithm could generatesolution for 100,000 reads at most. However, it surpasses CAP3

nd TIGR-AS in the length of produced contigs, with slightly worse

BLER, PHRAP and VELVET assemblers, and obtained at JGI. The results concern theontains over 3 × 105 DNA reads.

similarity to the model sequence. ASM, similarly as CAP3, TIGR-AS,and PHRAP, does not use the information about confidence ratesprovided by the 454 sequencer. Taking into account all the “older”algorithms not dedicated to the 454 data, PHRAP seems to be thebest one. PHRAP beats them in all the measures except the quality.

The small number of contigs obtained at JGI by the use of Jazzand Forge is due mainly to the finishing phase, where inappropri-ate contigs were rejected by experts. On the other hand, NEWBLERwas not supported by the experts and generated very good solu-tions concerning all the criteria. However, SR-ASM reached almostso good results in the number of contigs, surpassing NEWBLER inboth the quality and the coverage of the contigs. Solutions of PHRAPare quite acceptable except the number of contigs, but the behav-ior of VELVET was surprising. VELVET from EMBL-EBI is designedfor both Solexa and 454 technologies, executing the program we

set its parameters appropriately for the kind of input data we had(according to the personal communication with an author).

The second part of the computational experiment was done withthe use of “mammalian” data, being of different characteristic thanthe bacteria ones. The data are stored in NCBI Short Reads Archive

J. Blazewicz et al. / Computational Biology and Chemistry 33 (2009) 224–230 229

Fig. 3. Comparison of the coverage of three largest contigs obtained by the SR-ASM, NEWBLER, PHRAP and VELVET assemblers, and obtained at JGI. The results concern thewhole genome of Prochlorococcus marinus bacteria of length 1.86 Mbp. The input test set contains over 3 × 105 DNA reads.

Table 3Results of the computational experiment for the whole genome of Prochloroccocusmarinus bacteria (over 3 × 105 input sequences) for the PHRAP, NEWBLER, CAP3,VELVET and SR-ASM assemblers, and the results obtained at JGI.

No. ofcontigs

Quality[%]

Coverage[%]

N50 [bp] Time [s]

94.56 4.74PHRAP 5460 81.66 1.89 5241 2067

98.56 1.88JGI after experts’ finishing 96.85 4.60

150 96.94 3.97 15148 –97.00 3.9599.29 4.53

NEWBLER 149 99.38 3.91 22745 –99.45 3.8999.85 0.42

CAP3 7937 99.69 0.38 1517 1237699.91 0.3299.90 0.17

VELVET 13769 99.96 0.15 518 32899.90 0.14

S

(bmnwao

FVP3

Fig. 5. Comparison of the number of contigs obtained by the SR-ASM, NEWBLER,

99.71 5.66R-ASM 171 99.76 4.38 10157 289513

99.68 3.19

at http://www.ncbi.nlm.nih.gov/Traces/home/) and come from aiochemical experiment which was to cover a gap closure in chro-osome 15 of Homo sapiens. The final sequence of length 11717

ucleotides has accession number EU606050 in GenBank. Thereas around 14,000 input reads, each one of length between 58

nd 669 nucleotides. Because of significantly contaminated 3′ endsf the reads, we had to cut them from this side. Finally, we took

ig. 4. Comparison of the N50 value obtained for SR-ASM, NEWBLER, PHRAP andELVET assemblers, and obtained at JGI. The results concern the whole genome ofrochlorococcus marinus bacteria of length 1.86 Mbp. The input test set contains over× 105 DNA reads.

PHRAP and VELVET assemblers, and obtained at JGI. The results concern the wholegenome of Prochlorococcus marinus bacteria of length 1.86 Mbp. The input test setcontains over 3 × 105 DNA reads.

200 leftmost nucleotides from the reads; reads shorter than 100nucleotides were rejected.

The results of the second series of tests are presented in Table 4.This time we do not have NEWBLER’s results since the data arestored in the form of nucleotide sequences (NEWBLER works onlyon flow signals from 454 sequencer, we had no access to such exper-

iments). Also PHRAP is not taken into account, execution of it endedwith an error (‘fatal error: requested memory unavailable’; in fact,not more than 4 GB of memory can be allocated to the computa-tional process on the 32 bit processor). From among the remaining

Table 4Results of the computational experiment for the gap closure in chromosome 15of Homo sapiens for the CAP3, TIGR-AS, VELVET and SR-ASM assemblers. Dataare obtained from NCBI Short Reads Archive (at http://www.ncbi.nlm.nih.gov/Traces/home/). There are around 14,000 input reads.

No. of contigs Quality [%] Coverage [%] N50 [bp] Time [s]

98.72 89.04CAP3 330 95.52 13.89 10515 9582

46.70 4.4097.98 15.21

TIGR-AS 3086 98.59 9.99 910 9720297.54 9.7148.10 2.69

VELVET 554 50.28 1.54 132 2850.75 1.7198.56 100.00

SR-ASM 48 79.09 5.27 11863 281489.53 3.99

2 iology

faetr

4

bwesbtomo

oTccitt

cgoaw

wftis

A

t

o

R

B

BB

30 J. Blazewicz et al. / Computational B

our approaches, SR-ASM appeared to outperform the others. Ourlgorithm generated the whole resulting sequence of 100% of cov-rage and 98.56% of similarity to the model sequence. Concerningwo other criteria, the number of produced contigs and N50, theesults of SR-ASM are also the best.

. Conclusions

In this paper, a new SR-ASM algorithm, dedicated to DNA assem-ly based on short reads, has been proposed. Computational testsith the use of raw data generated by 454 sequencer at JGI, cov-

ring the whole 1.84 Mbp genome of Prochlorococcus marinus, havehown its performance to be very good. Our algorithm producedetter results, with respect to the coverage and to the similarityo the model sequence, than the NEWBLER’s results or the resultsbtained at JGI with the support of experts’ finishing. The perfor-ance has been further confirmed by additional tests on a fragment

f chromosome 15 of human genome.Summing up the results presented in Tables 1–4 and in Fig. 2–5,

ur algorithm outperformed all other approaches used in the tests.he strength of SR-ASM comes from the combination of the ideasoming from our SBH procedures and new propositions: temporaryompression of input sequences, a new method of selecting promis-ng pairs, operations adding some arcs or deleting some vertices inhe graph, and finally, the system of voting of sequences in creatinghe consensus sequence.

The computation time of SR-ASM is rather long, but time is notrucial for assembly algorithms. SR-ASM solved the whole bacteriaenome in 80 h, which is quite acceptable with respect to the timef obtaining these data in biochemical experiments. Currently were working on a parallel implementation of this algorithm, whichill significantly reduce the computational time.

SR-ASM is a new proposition, which can be used in practice in thehole genome assembly based on the 454 sequencing. It outper-

orms other assembly algorithms both dedicated and not dedicatedo short reads. Moreover, combined with experts’ finishing staget has a chance to produce better results than the approaches forhort-reads assembly currently used in laboratories.

cknowledgments

We are very grateful to Krystyna Ciesielska and Richard Law forheir helpful comments concerning the paper.

The research has been partially supported by the Polish Ministryf Science and Higher Education grant N N519 314635.

eferences

ains, W., Smith, G.C., 1988. A novel method for nucleic acid sequence determination.

J. Theor. Biol. 135, 303–307.

ennett, S., 2004. Solexa Ltd. Pharmacogenomics 5, 433–438.lazewicz, J., Figlerowicz, M., Formanowicz, P., Kasprzak, M., Nowierski, B., Styszyn-

ski, R., Szajkowski, L., Widera, P., Wiktorczyk, M., 2004. Assembling the SARS-CoVgenome—new method based on graph theoretical approach. Acta Biochim.Polonica 51, 983–993.

and Chemistry 33 (2009) 224–230

Blazewicz, J., Formanowicz, P., Guinand, F., Kasprzak, M., 2002. A heuristic managingerrors for DNA sequencing. Bioinformatics 18, 652–660.

Blazewicz, J., Formanowicz, P., Kasprzak, M., Markiewicz, W.T., Weglarz, J., 1999a.DNA sequencing with positive and negative errors. J. Comput. Biol. 6, 113–123.

Blazewicz, J., Formanowicz, P., Swiercz, A., Kasprzak, M., Markiewicz, W.T., 2004.Tabu search algorithm for DNA sequencing by hybridization with isothermiclibraries. Comput. Biol. Chem. 28, 11–19.

Blazewicz, J., Glover, F., Swiercz, A., Kasprzak, M., Markiewicz, W.T., Oguz, C., Rebholz-Schuhmann, D., 2006. Dealing with repetitions in sequencing by hybridization.Comput. Biol. Chem. 30, 313–320.

Blazewicz, J., Hertz, A., Kobler, D., de Werra, D., 1999b. On some properties of DNAgraphs. Discrete Appl. Math. 98, 1–19.

Blazewicz, J., Kasprzak, M., 2008. Graph reduction and its application to DNAsequence assembly. Bull. Polish Acad. Sci. Tech. Sci. 56, 65–70.

Chaisson, M., Pevzner, P., Tang, H., 2004. Fragment assembly with short reads. Bioin-formatics 20, 2067–2074.

Chen, F., Alessi, J., Kirton, E., Singan, V., Richardson, P., http://www.jgi.doe.gov/science/posters/chenPAG2006.pdf 2006. Comparison of 454 sequencing plat-form with traditional Sanger sequencing: a case study with de novo sequencingof Prochlorococcus marinus NATL2A genome. In: Plant and Animal GenomesConference.

Crochemore, M., Landau, G.M., Ziv-Ukelson, M., 2003. A subquadratic sequencealignment algorithm for unrestricted scoring matrices. SIAM J. Comput. 32,1654–1673.

de Bruijn, N.G., 1946. A combinatorial problem. Koninklijke Nederlandse Akademievan Wetenschappen 49, 758–764.

Drmanac, R., Labat, I., Brukner, I., Crkvenjakov, R., 1989. Sequencing of megabase plusDNA by hybridization: theory of the method. Genomics 4, 114–128.

Fu, Y., Peckham, H.E., McLaughlin, S.F., Rhodes, M.D., Malek, J.A., McKer-nan, K.J., Blanchard, A.P., 2008. SOLID sequencing and Z-Base encoding.In: The Biology of Genomes Meeting. Cold Spring Harbour Laboratory,http://www.appliedbiosystems.com.

Guénoche, A., 1992. Can we recover a sequence, just knowing all its subsequencesof given length? Comput. Appl. Biosci. 8, 569–574.

Huang, X., Madan, A., 1999. CAP3: a DNA sequence assembly program. Genome Res.9, 868–877.

Idury, R., Waterman, M., 1995. A new algorithm for DNA sequence assembly. J. Com-putat. Biol. 2, 291–306.

Jiang, T., Li, M., 1996. DNA sequencing and string learning. Math. Syst. Theor. 29,387–405.

Kececioglu, J.D., Myers, E.W., 1995. Combinatorial algorithms for DNA sequenceassembly. Algorithmica 13, 7–51.

Lysov, Y.P., Florentiev, V.L., Khorlin, A.A., Khrapko, K.R., Shik, V.V., Mirzabekov, A.D.,1988. Determination of the nucleotide sequence of DNA using hybridization witholigonucleotides. A new method. Doklady Akademii Nauk SSSR 303, 1508–1511.

Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., 2005. Genome sequenc-ing in microfabricated high-density picolitre reactors. Nature 437, 376–380.

Maxam, A.M., Gilbert, W., 1977. A new method for sequencing DNA. Proc. Natl. Acad.Sci. U.S.A. 74, 560–564.

Pevzner, P.A., Tang, H., Waterman, M.S., 2001. A new approach to fragment assemblyin DNA sequencing. In: Proc. of the Fifth Annual International Conference onComputational Molecular Biology RECOMB’01, 256–267, Montreal. ACM Press.

Sanger, F., Nicklen, S., Coulson, A.R., 1977. DNA sequencing with chain-terminatinginhibitors. Proc. Natl. Acad. Sci. U.S.A. 74, 5463–5467.

Smith, T.F., Waterman, M.S., 1981. Identification of common molecular subsequences.J. Mol. Biol. 147, 195–197.

Southern, E.M., Analyzing polynucleotide sequences. International Patent Applica-tion PCT/GB8900460, 1988.

Sundquist, A., Ronaghi, M., Tang, H., Pevzner, P., Batzoglou, S., 2007. Whole-genomesequencing and assembly with high-throughput, short-read technologies. PLoSONE 2, e484.

Sutton, G., White, O., Adams, M., Kerlavage, A., 1995. TIGR assembler: a new tool forassembling large shotgun sequencing projects. Genome Sci. Technol. 1, 9–19.

Waterman, M.S., 1995. Introduction to Computational Biology. Maps, Sequences andGenomes. Chapman & Hall, London.

Zerbino, D.R., Birney, E., 2008. Velvet: algorithms for de novo short reads assemblyusing de Bruijn graphs. Genome Res. 8, 821–829.

Zhang, J.H., Wu, L.Y., Zhang, X.S., 2003. Reconstruction of DNA sequencing byhybridization. Bioinformatics 19, 14–21.