tabu search algorithm for dna sequencing by hybridization with isothermic libraries

9
Computational Biology and Chemistry 28 (2004) 11–19 Tabu search algorithm for DNA sequencing by hybridization with isothermic libraries Jacek Bla ˙ zewicz a,b,1 , Piotr Formanowicz a,b,, Marta Kasprzak a,b , Wojciech T. Markiewicz b , Aleksandra ´ Swiercz a a Institute of Computing Science, Pozna´ n University of Technology, Piotrowo 3A, 60-965 Pozna´ n, Poland b Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Pozna´ n, Poland Received 14 July 2003; received in revised form 3 December 2003; accepted 3 December 2003 Abstract In this paper, a problem of isothermic DNA sequencing by hybridization (SBH) is considered. In isothermic SBH a new type of oligonu- cleotide libraries is used. The library consists of oligonucleotides of different lengths depending on an oligonucleotide content. It is assumed that every oligonucleotide in such a library has an equal melting temperature. Each nucleotide adds its increment to the oligonucleotide temperature and it is assumed that A and T add 2 C and C and G add 4 C. The hybridization experiment using isothermic libraries should provide data with a lower number of errors due to an expected similarity of melting temperatures. From the computational point of view the problem of isothermic DNA sequencing with errors is hard, similarly like its classical counterpart. Hence, there is a need for developing heuristic algorithms that construct good suboptimal solutions. The aim of the paper is to propose a heuristic algorithm based on tabu search approach. The algorithm solves the problem with both positive and negative errors. Results of an extensive computational experiment are presented, which prove the high quality of the proposed method. © 2003 Elsevier Ltd. All rights reserved. Keywords: Sequencing by hybridization; Isothermic libraries; Heuristics; Tabu search 1. Introduction One of the most important issue in molecular biology and genetics is reading DNA sequences. Despite the im- pressive achievements in sequencing whole genomes, there is still a need for an improvement of the existing sequenc- ing methods or for a development of the new ones. One of the sequencing method is sequencing by hybridization (SBH) proposed, among others, by Drmanac et al. (1989), Khrapko et al. (1989) and Southern et al. (1992). The method consists of two phases: the biochemical one and the computational one. In the biochemical stage, hybridiza- tion experiment is performed, in which an oligonucleotide library is compared with many copies of one strand of the examined DNA molecule. Usually, the library consists of Corresponding author. Tel.: +48-61-8528503 ext. 276; fax: +48-61-8771525. E-mail address: [email protected] (P. Formanowicz). 1 The paper was written while the first author was at Laboratoire Leibniz, Universite Joseph Fourier, whose help is greatly appreciated. all 4 l oligonucleotides of length l. The library is prepared on a DNA chip (Southern, 1988; Fodor et al., 1991; Pease et al., 1994). During the hybridization reaction fragments of the target DNA join to oligonucleotides from the library in their complementary places. After the reaction, one obtains the set of sequences being subfragments of the examined DNA molecule by reading a fluorescent or radioactive im- age of the chip. This set is named a spectrum. Most of the sequencing methods based on the hybridization reaction use sets of oligonucleotides of the same length. (An interesting exception are so called gapped chips (Pevzner and Lipshutz, 1994).) Next the computational stage of the SBH comes. It is known that in the case of an ideal hybridization experiment (no experimental errors involved), an original DNA sequence may be reconstructed in polynomial time (Pevzner, 1989). On the other hand, if there are errors in the hybridization experiment (positive—excess of l-mers or negative—lack of l-mers), the computational phase of the method becomes a strongly NP-hard problem (Gallant et al., 1980; Bla ˙ zewicz and Kasprzak, 2002). The computational phase of the SBH approach with some restricted models of 1476-9271/$ – see front matter © 2003 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2003.12.002

Upload: jacek-blazewicz

Post on 26-Jun-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

Computational Biology and Chemistry 28 (2004) 11–19

Tabu search algorithm for DNA sequencingby hybridization with isothermic libraries

Jacek Błazewicza,b,1, Piotr Formanowicza,b,∗, Marta Kasprzaka,b,Wojciech T. Markiewiczb, AleksandraSwiercza

a Institute of Computing Science, Pozna´n University of Technology, Piotrowo 3A, 60-965 Pozna´n, Polandb Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Pozna´n, Poland

Received 14 July 2003; received in revised form 3 December 2003; accepted 3 December 2003

Abstract

In this paper, a problem of isothermic DNA sequencing by hybridization (SBH) is considered. In isothermic SBH a new type of oligonu-cleotide libraries is used. The library consists of oligonucleotides of different lengths depending on an oligonucleotide content. It is assumedthat every oligonucleotide in such a library has an equal melting temperature. Each nucleotide adds its increment to the oligonucleotidetemperature and it is assumed that A and T add 2◦C and C and G add 4◦C. The hybridization experiment using isothermic libraries shouldprovide data with a lower number of errors due to an expected similarity of melting temperatures. From the computational point of viewthe problem of isothermic DNA sequencing with errors is hard, similarly like its classical counterpart. Hence, there is a need for developingheuristic algorithms that construct good suboptimal solutions. The aim of the paper is to propose a heuristic algorithm based on tabu searchapproach. The algorithm solves the problem with both positive and negative errors. Results of an extensive computational experiment arepresented, which prove the high quality of the proposed method.© 2003 Elsevier Ltd. All rights reserved.

Keywords:Sequencing by hybridization; Isothermic libraries; Heuristics; Tabu search

1. Introduction

One of the most important issue in molecular biologyand genetics is reading DNA sequences. Despite the im-pressive achievements in sequencing whole genomes, thereis still a need for an improvement of the existing sequenc-ing methods or for a development of the new ones. Oneof the sequencing method issequencing by hybridization(SBH) proposed, among others, byDrmanac et al. (1989),Khrapko et al. (1989)and Southern et al. (1992). Themethod consists of two phases: the biochemical one andthe computational one. In the biochemical stage, hybridiza-tion experiment is performed, in which an oligonucleotidelibrary is compared with many copies of one strand of theexamined DNA molecule. Usually, the library consists of

∗ Corresponding author. Tel.:+48-61-8528503 ext. 276;fax: +48-61-8771525.

E-mail address:[email protected] (P. Formanowicz).1 The paper was written while the first author was at Laboratoire

Leibniz, Universite Joseph Fourier, whose help is greatly appreciated.

all 4l oligonucleotides of lengthl. The library is preparedon aDNA chip (Southern, 1988; Fodor et al., 1991; Peaseet al., 1994). During the hybridization reaction fragments ofthe target DNA join to oligonucleotides from the library intheir complementary places. After the reaction, one obtainsthe set of sequences being subfragments of the examinedDNA molecule by reading a fluorescent or radioactive im-age of the chip. This set is named aspectrum. Most of thesequencing methods based on the hybridization reaction usesets of oligonucleotides of the same length. (An interestingexception are so called gapped chips(Pevzner and Lipshutz,1994).) Next the computational stage of the SBH comes.

It is known that in the case of an ideal hybridizationexperiment (no experimental errors involved), an originalDNA sequence may be reconstructed in polynomial time(Pevzner, 1989). On the other hand, if there are errors inthe hybridization experiment (positive—excess ofl-mers ornegative—lack of l-mers), the computational phase of themethod becomes a strongly NP-hard problem(Gallant et al.,1980; Błazewicz and Kasprzak, 2002). The computationalphase of the SBH approach with some restricted models of

1476-9271/$ – see front matter © 2003 Elsevier Ltd. All rights reserved.doi:10.1016/j.compbiolchem.2003.12.002

12 J. Błazewicz et al. / Computational Biology and Chemistry 28 (2004) 11–19

errors was addressed in a few papers(Bains and Smith, 1988;Khrapko et al., 1989; Drmanac et al., 1989; Pevzner, 1989;Bains, 1991; Guénoche, 1992; Lipshutz, 1993; Błazewiczet al., 1997; Fogel et al., 1998; Fogel and Chellapilla, 1999),and recently in(Phan and Skiena, 2001; Shamir and Tsur,2002; Halperin et al., 2002). The only exact method allow-ing for any type of error (positive and negative) and re-quiring no additional information about the spectrum waspresented in(Błazewicz et al., 1999). Although this methodis very general, an excessive number of errors in the spec-trum makes this approach rather slow. Hence, a challengingproblem is to design a new approach reducing a number oferrors in the biochemical phase of SBH. Such an approachhas been proposed in(Błazewicz et al., 2000a,b). It is basedon the usage of chip libraries consisting of oligonucleotidesof equal melting temperatures. The approach follows a rea-soning briefly described below. (More details can be foundin (Błazewicz et al., 2000b).)

It is known that duplexes ofC/G rich oligonucleotidesare more stable thanA/T rich ones. This phenomenon mayresult in numerous errors of the positive type. At the sametime a modification of hybridization conditions directedtowards diminishing the number of imperfect duplexesof C/G rich oligonucleotides may result in increasing anumber of missing perfect duplexes ofA/T rich oligonu-cleotides. This means that the modification decreasing anumber of positive experimental errors might increase anumber of negative ones. The duplex formation depends ona base composition but also on its length. In some models asimple equation was used to calculate melting temperaturesof oligonucleotide duplexes assuming 4◦C for C/G pairsand 2◦C for A/T pairs(Wallace et al., 1981). Thus, at leastformally, it is possible to compensate lower stability ofA/T

rich duplexes by increasing their length. It is known, that theabove description is not very accurate although it reflectsa general relative stability of different duplexes quite well.Therefore, to form more stable duplexes with hybridizedsequences, in(Błazewicz et al., 2000a,b)a new approach toSBH has been proposed. The key idea is to obtain a set ofoligonucleotides that differ in base composition and length,and are characterized by a predefined relation between basecomposition and a length of oligonucleotides. In a specificcase, if in a library an increment of C or G is twice of Aor T and the sum of increments for each oligonucleotide isconstant, then such a library is calledisothermic. It may beproved that two such libraries are sufficient to reconstructDNA sequences by the SBH method(Błazewicz et al.,2000b).

The computational phase of the reconstruction of the tar-get sequence is strongly NP-hard if there are errors in thedata(Błazewicz et al., 2000b)(similarly, like in the classicalcase). Hence, there is a need for developing efficient heuris-tics. The aim of this paper is to present such a method basedon tabu searchapproach. The presented algorithm has beentested in an extensive computational experiment which con-firmed a high quality of solutions as compared with those

obtained by other methods using standard oligonucleotidelibraries.

The organization of the paper is as follows. InSection 2,the problem is formulated and basic properties of isother-mic libraries are recalled. InSection 3, the algorithm is pre-sented, while inSection 4, the results of the computationalexperiment are shown. The paper ends with conclusions inSection 5.

2. Problem formulation

In this section, we recall basic properties of isothermiclibraries. More detailed analysis of their design and featurecan be found in(Błazewicz et al., 2000b).

Let us start with a formal definition of isothermicoligonucleotide libraries(Błazewicz et al., 2000a,b). In thisdefinition,wi denotes an increment a nucleotide of typei

(i ∈ {A,C,G, T }) adds to the melting temperature of theoligonucleotide containing it. On the other hand,xi denotesa number of appearances of a nucleotide of typei in theoligonucleotide.

Definition. An isothermic oligonucleotide libraryL of tem-peraturetL is a library of all oligonucleotides satisfying re-lationswAxA + wCxC + wGxG + wTxT = tL, wA = wT,wC = wG and 2wA = wC.

Without loss of generality we may assume thatwA =wT = 2 andwC = wG = 4. These increments correspondto temperatures which nucleotides bring into the stabilityof oligonucleotide duplexes. In what follows, a sum of in-crements of nucleotides forming an oligonucleotide will becalled the oligonucleotidetemperature.

Isothermic oligonucleotide library follows the experimen-tally established relationship between base composition andduplex stability described in Introduction. Oligonucleotidescontained in such a library should form duplexes with theircomplements in a more narrow range of experimental condi-tions (temperature, salt concentration, etc.) than that charac-teristic for an oligonucleotide library with oligomers of thesame length. Therefore, the hybridization experiments per-formed with isothermic libraries should result in a smallernumber of experimental errors. The use of such librariesshould substantially limit a number of these errors to be con-sidered in the computational phase of the SBH approach.

The basic question arising here is whether the isothermiclibrary defined above is really well suited for SBH? One caneasily realize that one such a library may be not enough. Forexample, it is not possible to cover DNA chains composedof only C and G nucleotides by oligomers from a library of atemperature not divisible by 4. Moreover, a library of a tem-perature divisible by 4 is not sufficient to cover sequenceswhere one nucleotide of A or T type is surrounded by only Gor C nucleotides. But, on the other hand, two such librarieswith temperatures differing by one increment of A or T, i.e.

J. Błazewicz et al. / Computational Biology and Chemistry 28 (2004) 11–19 13

by 2◦C, are sufficient to perform a proper hybridization ex-periment. More formally, the following theorem was proved(the proof can be found in(Błazewicz et al., 2000b)).

Theorem 1. It is always possible to cover any DNA se-quence by probes coming from two isothermic libraries oftemperatures differing by2◦C. Moreover, this coverage issuch that in the sequence two consecutive oligonucleotides(from the libraries) have starting points shifted by at mostone position.

The cardinalities of isothermic libraries are given by for-mulae(1) and(2) describing relations between temperatures(t) and cardinalities of the libraries (card), where formula(1) is concerned with temperaturet divisible by 4 and for-mula(2) is concerned with temperaturet non-divisible by 4(Błazewicz et al., 2000b).

card(t) =t/4∑i=0

[(14t + i

2i

)2(t/4)+i

](1)

card(t) =t/4∑i=0

[( ⌊14t⌋

+ i + 1

2i + 1

)2t/4+i+1

](2)

We see that these cardinalities are between the cardinali-ties of the standard libraries involving, respectively, only theshortest (t/4 or �t/4�) or only the longest (t/2) oligomers ofequal length.

Now, we will formulate isothermic sequencing problem inits most general form, i.e. with positive and negative errors.The formulation will be valid under a reasonable assump-tion that most of the data coming from the hybridization ex-periment are correct. As in the classical SBH, as a result ofa biochemical phase one gets a set of oligonucleotides thathybridized with the reconstructed DNA chain, i.e. spectrum(S). Now, the spectrum contains data from the hybridizationexperiment with two isothermic oligonucleotide librariesdiffering by one increment of A(T) nucleotide. Spectrummay contain two types of errors:positive, i.e. some oligonu-cleotides which are not a part of the DNA sequence appear,and negative, i.e. a lack of some oligonucleotides being apart of this sequence (including subsequent repetitions ofthe oligonucleotide). The most general problem is the onewhere both types of errors occur. This problem in a searchversion may be viewed as the one of finding, for a givenspectrumS, a sequence with the minimum number of pos-itive and negative errors. It is not hard to see, that in caseof the standard oligonucleotide library, where all oligonu-cleotides are of equal length, this formulation is equivalentto a maximization of a number ofl-mers from the spectrumused to build a solution (a reconstructed sequence).

Let us denote byα a number of oligonucleotides fromthe spectrum being a part of the constructed sequence. Onthe other hand, letβ denote a number of oligonucleotidesbeing members of the two used isothermic libraries (of tem-peratures respectively equal tot and t + 2), which can be

distinguished in this sequence (each oligonucleotide adds toβ a number of its occurrences in the sequence). We see thatnumbers ofnegativeandpositive errorsin the constructedsequence may be defined asβ−α and|S|−α, respectively.

It is assumed here that the only information provided bythe hybridization experiment are a spectrum and lengthn

of the DNA sequence which is looked for. Having all thediscussed notions in mind we can define our problem asfollows.

Isothermic DNA sequencing with negative and positiveerrors—search version.

Instance: set S (spectrum) of oligonucleotides, each ofthem of temperaturet or t + 2, lengthn of an original se-quence.

Answer: a sequence of lengthn with a minimum value ofβ + |S| − 2α.

The above problem may be proved to be strongly NP-hard(Błazewicz et al., 2000b), thus, the need arises to constructefficient algorithms constructing solutions (close to opti-mum) in a reasonable time. One of such method is presentedin the next section.

3. The method

The heuristic algorithm for isothermic DNA sequencingpresented in this section is based on tabu search metaheuris-tic being one of the most frequently used in combinatorialoptimization(Glover and Laguna, 1997). The tabu searchmethod is a kind of local search procedure. Every solutionx

from search spaceX has defined its neighborhoodN(x) ⊂X. N(x) is defined in such a way that every solutionx′ ∈N(x) is reachable fromx in one move, which is an elemen-tary operation in the method. Usually, the search spaceX

consists of only feasible solutions for a considered problem.Since the tabu search method is used to solve optimizationproblems, there must be defined some criterion functionf

evaluating every solutionx ∈ X. Clearly, the goal is to findan elementx∗ fromX with the maximal (in the case of max-imization problems) or minimal (in the case of minimizationproblems) value of the criterion function. At each iterationi of the method there is a current solutionxi ∈ X chosenby the algorithm. At this iteration, the method searches theneighborhoodN(xi) of solutionxi in order to find solutionxi+1 = maxx′

i∈N(xi) {f(x′i)} for maximization problems or

xi+1 = minx′i∈N(xi) {f(x′

i)} for minimization problems. So,one of the intrinsic property of the method is to approach op-tima, which usually are local optima (as in every local searchmethod). However, there are some mechanisms preventingthe method from getting stuck in local optimum. One ofthem istabu list, i.e. a list of moves recently performed bythe algorithm. None of the moves being on the list nor anyof their reverses can be performed unless it leads to the solu-tion better than the best one already found by the algorithm.Moreover, in some specific situations a series of random oralmost random moves may be performed which allows the

14 J. Błazewicz et al. / Computational Biology and Chemistry 28 (2004) 11–19

method to jump to another area of the search space. Thesemoves are usually performed when there is no improvementof solutions obtained in a series of consecutive moves.

The above description presents only a general frameworkof the tabu search method and many other components maybe added into it, where problem-specific ones may be espe-cially valuable from the method efficiency point of view.

Our tabu search algorithm for isothermic DNA sequencinghas the following structure. Spectrum consists of two datastructures: an ordered list of oligonucleotides representing acurrent solution, and an unordered set of remaining oligonu-cleotides, called atrash. At each stage of the computationsthe number of oligonucleotides being on the list cannot begreater than the one that would produce a DNA sequencecomposed of at mostn nucleotides (assuming that neigh-bors on the list are maximally overlapped). Moves obeyingthis constraint are calledfeasibleand only such moves areperformed by the method. The algorithm starts with the listconstructed by a greedy selection of oligonucleotides, over-lapping their neighbors.

In the method, three kinds of moves are initially defined:an insertion—a move transferring an oligonucleotide fromthe trash into the list, adeletion—a move transferring anoligonucleotide from the list to the trash, and ashift—a movetransferring an oligonucleotide from one position of the listinto another one.

Since moves performed on single oligonucleotides areusually not sufficient for generating good solutions, movesexecuted onclustershave been introduced. A cluster is agroup of neighboring oligonucleotides in the solution (i.e.on the list) linked together. An overlap of each pair of neigh-boring oligonucleotides in the cluster is equal top positions,wherep ≥ l(sr)−1, andl(sr) is a length of the right oligonu-cleotide in the pair. Since insertion, deletion and shift maychange a composition of a cluster, a list of clusters is up-dated after every move. Clusters exist only in the solution;they are broken into separate oligonucleotides once trans-ferred into the trash. Finally, the set of moves has been de-fined as follows: insertion of an oligonucleotide, deletion ofan oligonucleotide or a cluster, shift of an oligonucleotideor a cluster.

The following rules, limiting an application of the moves,have been introduced:

• a cluster may be shifted only if it does not break anothercluster,

• an oligonucleotide may be shifted only if it is not a partof a cluster, provided it does not break any cluster,

• an oligonucleotide may be deleted only if it is locatedoutside a cluster or at one of the ends of some cluster.

The algorithm tends to maximize a number of oligonu-cleotides in the solution. This maximization may be viewedas aglobal criterion functionfor the method, easier to im-plement in the context of tabu search and also leading to areconstruction of the desired sequence. On the other hand,the only function which is able to compare all kinds of moves

is a condensationdefined for each solution to be the ra-tio of the number of oligonucleotides in the solution (takenfrom the spectrum) to the number of nucleotides in the solu-tion (being a length of a currently constructed sequence). Ifthe moves were compared by the global criterion function,deletion and shifts would be used very rarely. The maxi-mization of the condensation causes a transformation of aninitial solution into a series of collections of well-matchedoligonucleotides. Note, that using the condensation as theonly criterion for a choice of a move, would lead after anumber of iterations to a creation of one cluster of length(in nucleotides) much less thann. To avoid this fact, twoapproaches are used which cause a gradual increase of thenumber of oligonucleotides in the solution:

(1) If a maximal value of the condensation is achieved bymore than one move, the algorithm selects a move re-sulting in a greater number of oligonucleotides in thesolution. Hence, the insertion is the most preferred movewith shifts, the deletion of an oligonucleotide and thedeletion of a cluster being next.

(2) After a given number of moves without any improve-ment of the global criterion function value, the methodexecutes a selected number ofextendingmoves. Ex-tending moves are feasible moves selected by usingfrequency-based memory instead of the condensationfunction. This memory is a structure that remembers thenumber of times each oligonucleotide from the spec-trum appears in solutions. (For example, an oligonu-cleotide contained in all solutions generated so far hasa frequency value equal to the number of iterations ofthe algorithm, while an oligonucleotide never used forconstructing solutions has a frequency value equal to 0.)There are two types of extending moves: the insertionand the deletion of an oligonucleotide. The most pre-ferred one is the insertion, and an oligonucleotide withthe lowest frequency value is chosen. If no insertion ispossible, an oligonucleotide with the highest frequencyvalue is deleted from the solution. After the executionof extending moves, the method returns to the normalscheme with the condensation as a criterion function.

Such a combination of condensing and extending the so-lution, guarantees that the number of oligonucleotides willincrease from a small value in the initial solution to a near-optimal value in the final one. The use of the frequency-based memory forces an inclusion of oligonucleotides thatare not well-matched into a solution. This effect wouldnever be obtained by using the condensation functiononly.

Inserted or shifted oligonucleotides are remembered onthe tabu list for a given number of iterations. The list ischecked if an attempt to shift or to delete an oligonucleotideis made. The shift or deletion is prevented if the oligonu-cleotide is found to be on the list. Among others, this proce-dure makes it impossible to reverse extending moves. It doesnot appear necessary to remember clusters shifted, since

J. Błazewicz et al. / Computational Biology and Chemistry 28 (2004) 11–19 15

clusters are often changed after a move. Even if the methodgets into a cycle, it will stay there only briefly because ofthe next series of extending moves. An element on the tabulist may be deleted or shifted together with the cluster con-taining it. The element may also be deleted if there is noother feasible move. In such a case, an element that hasbeen on the tabu list during the greatest number of itera-tions, is chosen. On the other hand, a deletion of a recentlyinserted oligonucleotide is not profitable from the viewpointof a maximization of the global criterion function. Thus, anaspiration criterionfor overriding the tabu status of an ele-ment is not used here.

To increase the quality of a solution, the algorithm doesnot stop after a given number of cycles consisting ofcon-densingand extending moves, but restarts the search pro-cess. This is done by introducingexternal cycles. A newstarting solution is generated on the basis of the frequency

values, choosing oligonucleotides having the lowest frequen-cies. After that most of algorithm variables are set to initialvalues, thus, the search process is independent of the previ-ous stages.

The first oligonucleotide of the target sequence can bean optional parameter for the method. The assumption con-cerningits knowledgeis well justified since thepolymerasechain reaction(PCR) is presently most often used to am-plify a DNA to be sequenced. The tabu search method doesnot need to know the first oligonucleotide, but the lack ofthis knowledge may cause the algorithm to execute moreiterations.

The method stops after a given number of the externalcycles and returns as its outcome the sequence that has beenfound to contain the greatest number of oligonucleotidesfrom the spectrum. A pseudocode of the described algorithmis given below.

16 J. Błazewicz et al. / Computational Biology and Chemistry 28 (2004) 11–19

4. Computational experiments

The presented algorithm has been tested in an extensivecomputational experiment performed on PC Pentium 4 ma-chine with 256KB RAM and Linux operating system. TenDNA sequences have been taken from GenBank as a ba-sis for the testing. Their accession numbers are given in theAppendix A.

From each of the sequences we created three sequencesof length 200, 400, and 600 nucleotides, respectively. Thesesequences were prefixes of the ones taken from GenBank.Next, a hybridization experiment has been simulated by cut-ting off (from these sequences) oligonucleotides of meltingtemperatures 26 and 28◦C, and 36 and 38◦C, respectively.In this way two spectra for every sequence were created,containing oligonucleotides of these temperatures.

The following idea has been used to introduceerrors intothe spectrum. The notation “200− 10%+ 10%” means thatfrom an ideal spectrum (without any errors) for a sequenceof length 200, 10% of randomly selected (according to auniform distribution) oligonucleotides (negative errors) havebeen deleted, and next 10% randomly generated oligonu-cleotides (positive errors) have been inserted. Moreover, inall the cases a starting oligonucleotide could not be deletedand oligonucleotides which have been introduced as posi-tive errors had to be different from those already existing inthe spectrum. Errors have been generated in a progressivemanner. The set “200− 20%+ 20%” has the same errors asthe set “200−10%+10%”, but in addition 10% of negativeerrors and 10% of positive errors have been inserted into it.

The algorithm requires seven parameters:

• a length of an original sequencen;• a number of condensing moves performed without im-

proving the value of the global criterion function (A);• a number of extending moves (B);• a number of cycles of the type “condensing moves

+ extending moves” (C);• a number of external cycles (D);• a length of the tabu list (E);• (optionally) the first oligonucleotide in an original se-

quence.

In the experiments with oligonucleotides of temperatures26 and 28◦C the following values of the parameters of thetabu method have been used:A = 7,B = 8,C = 600,D =4 andE = 30. In the case of libraries of temperatures 36and 38◦C the parameters were:A = 10,B = 7, C = 400,D = 4 andE = 30. These values were chosen after a seriesof preliminary tests.

It has been assumed that the first oligonucleotide of thereconstructed sequence is known. (As we already said, thisassumption can be easily omitted at the expense of the timeused by the algorithm.) The first (starting) solution has beengenerated by a simple heuristic algorithm(Błazewicz et al.,1999).

A validation of the sequences generated by the methodis done in two ways. Firstly, ausageof oligonucleotidesfrom the spectrum (i.e. a percentage of a number of correctoligonucleotides from the spectrum composing a constructedsequence) is applied. Secondly, a similarity to the originalsequence is checked using Needleman–Wunsch algorithm(Needleman and Wunsch, 1970). The algorithm comparestwo sequences: the one generated by the tabu search algo-rithm st and the original sequenceso. The similarity of thesequences is determined according to the following formula:σ = 100(δ − ψ)/(χ − ψ), whereδ is a scoring for the twosequences, being a sum of scores for all columns in an op-timal alignment (1 point for a match,−1 for mismatch orgap), and

ψ ={l(so)d + (l(st) − l(so))g if l(st) > l(so)

l(st)d + (l(so) − l(st))g otherwise

χ ={l(st)m if l(st) > l(so)

l(so)m otherwise

wherel(st) and l(so) are lengths of sequencest andso, re-spectively, andm = 1, d = −1, g = −1.

The presented algorithm has been compared with the re-sults of other competitive approaches for SBH with standardequal-length libraries. In fact, only few of the existing algo-rithms deal with both types of errors and non-constrained er-ror characteristic(Błazewicz et al., 1999, 2000c; Zhang et al.,2002). Below such a comparison has been made betweenthe presented algorithm and the one proposed inBłazewiczet al. (2000c). The latter is based on the formulation givenin (Błazewicz et al., 1999)reducing the SBH problem (withstandard libraries) to a variant of the Selective TravelingSalesman Problem.

Table 1contains the results of the computations done withthe spectra of temperatures 26 and 28◦C. Each entry in thetable is computed on the basis of the results obtained forsequencing the 10 DNA chains. For each chain and entrythe sequencing process has been executed 100 times (forthe same set of errors) to obtain a representative set of data.Hence, each entry is the average obtained from 1000 execu-tions. These instances have been generated in order to com-

Table 1Results for the spectra of temperatures 26 and 28◦C

Instances Usage(%)

Similarity(%)

Length(nucleotide)

Time (s)

200− 5%+ 5% 98.23 99.90 199.80 10.13200− 10%+ 10% 98.75 99.90 199.80 17.37200− 20%+ 20% 92.11 93.25 199.50 43.10

400− 5%+ 5% 96.65 92.58 399.40 57.74400− 10%+ 10% 97.84 81.82 399.20 94.21400− 20%+ 20% 90.66 73.76 399.30 219.67

600− 5%+ 5% 95.75 81.88 598.80 115.94600− 10%+ 10% 97.30 80.31 596.30 223.19600− 20%+ 20% 89.41 76.47 598.70 483.21

J. Błazewicz et al. / Computational Biology and Chemistry 28 (2004) 11–19 17

Fig. 1. Usage for the spectra of temperatures 26 and 28◦C.

Fig. 2. Similarity for the spectra of temperatures 26 and 28◦C.

pare results of our algorithm with the older one(Błazewiczet al., 2000c), basing on spectra of constant oligonucleotidelength. (The reason for this is that two libraries of tempera-tures 26 and 28◦C have a similar cardinality as one libraryof the length equal to 10 (assumed in the experiment fromBłazewicz et al. (2000c)).) Figs. 1 and 2present the val-ues of usage and similarity for the spectra of temperatures26 and 28◦C. The results fromBłazewicz et al. (2000c)arepresented inTable 2.

The computational experiments have been done on differ-ent machines, however, we can see that the algorithm fromBłazewicz et al. (2000c)consumes much more time than thenew one with the growth of an instance size. The qualityof its solutions is also much lower than in the case of thenew algorithm. Because the main scheme of both algorithmsis very similar, we could pose that the new isothermic ap-proach has an advantage over the traditional one from thecomputational point of view.

Table 2Results from Błazewicz et al. (2000c) for the spectra of constant oligonu-cleotide length equal 10

Instances Usage (%) Time (s)

200− 5%+ 5% 94.63 54.2200− 10%+ 10% 93.50 66.0200− 20%+ 20% 92.00 86.4

400− 5%+ 5% 87.92 790.2400− 10%+ 10% 89.64 830.7

Table 3Results for the spectra of temperatures 36 and 38◦C

Instances Usage (%) Similarity (%) Length (nucleotide) Time (s)

200− 5%+ 5% 99.80 99.71 199.80 9.51200− 10%+ 10% 99.02 98.90 199.80 15.69200− 20%+ 20% 99.66 99.75 199.50 29.94

400− 5%+ 5% 99.80 97.10 399.20 39.87400− 10%+ 10% 99.38 96.30 398.00 64.49400− 20%+ 20% 98.04 95.32 397.10 154.87

600− 5%+ 5% 99.10 94.50 597.90 89.87600− 10%+ 10% 98.87 93.87 597.60 147.84600− 20%+ 20% 98.85 93.21 597.50 375.13

The comparison of our results with the ones presentedin (Zhang et al., 2002)is possible only in part. Concerningrandom negative errors, they introduced only 5% of them,therefore only the instances 200− 5% + 5% and 400−5%+ 5% are comparable. Their results have been obtainedin a shorter computation time. The quality of the solutions,however, cannot be compared since they did not give anysuch measure.

The approach presented in the paper works better if thetemperatures assumed are higher. (Cardinalities of librariesare also bigger.) The results for the libraries of temperatures36 and 38◦C are given inTable 3.

The results produced by the algorithm (Table 3andFigs. 3and 4) are of very high similarity to the original sequences

Fig. 3. Usage for the spectra of temperatures 36 and 38◦C.

Fig. 4. Similarity for the spectra of temperatures 36 and 38◦C.

18 J. Błazewicz et al. / Computational Biology and Chemistry 28 (2004) 11–19

and have been generated in relatively short (knowing re-quirements of the tabu search) computation times. It shouldbe noted that the algorithm has a direct influence only onthe usage of oligonucleotides from the spectrum and on thelength of the solution (it maximizes the usage not exceed-ing the given length), and the values of the two parametersare very close to the optimal ones. The high values of thesimilarity to original sequences prove the correctness of thecriterion functions implemented in the algorithm.

5. Conclusions

In the paper a heuristic algorithm for a new variant ofDNA sequencing by hybridization has been proposed. Thevariant is based on libraries composed of oligonucleotidesof equal melting temperatures (instead of equal length likein the classical approach). The use of such libraries may leadto a reduction of experimental error rate. Another advantageof the new approach follows from the observation that the li-braries of the new type are composed of oligonucleotides ofvarious lengths. Moreover, the cardinality of such a libraryis between the cardinality of a classic library composed ofoligonucleotides of length corresponding to the shortest ele-ment from the new one and a library composed of elementsof length corresponding to the longest element. Since thenew library contains short and long oligonucleotides, thepossibility of wrong reconstruction caused by repetitions de-creases.

The computational phase of the new variant of the se-quencing method is a computationally hard problem if thereare errors in the input data (similarly to its classical counter-part). Hence, the need of developing good polynomial timeheuristics arises. In the paper, an algorithm based on tabusearch approach has been presented and its effectiveness hasbeen tested in an extensive computational experiment. Theresults obtained in the experiment confirmed a high qual-ity of the solutions provided by the algorithm, as comparedwith other existing algorithms for standard SBH.

Acknowledgements

The research has been partially supported by the KBNgrant.

Appendix A

Accession numbers of sequences coding human proteinsused in the computational experiments.

NM 005273, NM016183, NM080601, X51535,X56088, X03350, Y00264, Y00649, X05299, AF435957.

References

Bains, W., 1991. Hybridization methods for DNA sequencing. Genomics11, 294–301.

Bains, W., Smith, G.C., 1988. A novel method for nucleic acid sequencedetermination. J. Theor. Biol. 135, 303–307.

Błazewicz„ J., Kasprzak, M., 2002. Complexity of DNA sequencing byhybridization. Theor. Comput. Sci. 290, 1459–1473.

Błazewicz„ J., Kaczmarek, J., Kasprzak, M., Markiewicz, W.T., W¸eglarz,J., 1997. Sequential and parallel algorithms for DNA sequencing.CABIOS 13, 151–158.

Błazewicz„ J., Formanowicz, P., Kasprzak, M., Markiewicz, W.T.,Weglarz, J., 1999. DNA sequencing with positive and negative errors.J. Comp. Biol. 6, 113–123.

Błazewicz, J., Formanowicz, P., Kasprzak, M., Markiewicz, W.T., 2000a.Isothermic oligonucleotide libraries. In: Miyano, S., Shamir, R.,Takagi, T. (Eds.), Currents in Computational Molecular Biology,pp. 97–98.

Błazewicz, J., Formanowicz, P., Kasprzak, M., Markiewicz, W.T., 2000b.Sequencing by hybridization with isothermic oligonucleotide libraries.Research report RA-002/2000, Poznan Supercomputing and Network-ing Center, in press.

Błazewicz„ J., Formanowicz, P., Kasprzak, M., Markiewicz, W.T.,Weglarz, J., 2000c. Tabu search for DNA sequencing with false neg-atives and false positives. Eur. J. Oper. Res. 125, 257–265.

Drmanac, R., Labat, I., Brukner, I., Crkvenjakov, R., 1989. Sequencingof megabase plus DNA by hybridization: theory and the method.Genomics 4, 114–128.

Fodor, S.P.A., Read, J.L., Pirrung, M.C., Stryer, L., Lu, A., Solas, D.,1991. Light-directed, spatially addressable parallel chemical synthesis.Science 251, 767–773.

Fogel, G.B., Chellapilla, K., 1999. Simulated sequencing by hybridiza-tion using evolutionary programming. In: Proceedings of the IEEECongress in Evolutionary Computations (CEC’99), IEEE Service Cen-ter, Piscatawag, pp. 445–452.

Fogel, G.B., Chellapilla, K., Fogel, D.B., 1998. Reconstruction of DNAsequence information from a simulated DNA chip using evolutionaryprogramming. Lect. Notes Comput. Sci. 1447, 429–436.

Gallant, J., Maier, D., Storer, J.A., 1980. On finding minimal lengthsuperstrings. J. Comput. Syst. Sci. 20, 50–58.

Glover, F., Laguna, M., 1997. Tabu Search. Kluwer Academic Publishers,Boston.

Guénoche, A., 1992. Can we recover a sequence, just knowing all itssubsequences of given length? CABIOS 8, 569–574.

Halperin, E., Halperin, S., Hartman, T., Shamir, R., 2002. Handling longtargets and errors in sequencing by hybridization. In: Proceedings ofthe 6th RECOMB, pp. 176–185.

Khrapko, K.R., Lysov, Y.P., Khorlin, A.A., Shik, V.V., Florentiev, V.L.,Mirzabekov, A.D., 1989. An oligonucleotide approach to DNA se-quencing. FEBS Lett. 256, 118–122.

Lipshutz, R.J., 1993. Likelihood DNA sequencing by hybridization. J.Biomol. Struct. Dyn. 11, 637–653.

Needleman, S.B., Wunsch, C.D., 1970. A general method applicable tothe search for similarities in the amino acid sequence of two proteins.J. Mol. Biol. 48, 443–453.

Pease, A.C., Solas, D., Sullivan, E., Cronin, M., Holmes, C., Fodor, S.,1994. Light-generated oligonucleotide arrays for rapid DNA sequenceanalysis. Proc. Natl. Acad. Sci. U.S.A. 91, 5022–5026.

Pevzner, P.A., 1989. l-Tuple DNA sequencing: computer analysis. J.Biomol. Struct. Dyn. 7, 63–73.

Pevzner, P.A., Lipshutz, R.J., 1994. Towards DNA sequencing chips. In:Prıvara, I., Rovan, B., Ružicka, P. (Eds.), Mathematical Foundationsof Computer Science 1994. Springer-Verlag, Berlin, pp. 143–158.

Phan, V.T., Skiena, S.S., 2001. Dealing with errors in interactive sequenc-ing by hybridization. Bioinformatics 17, 862–870.

J. Błazewicz et al. / Computational Biology and Chemistry 28 (2004) 11–19 19

Shamir, R., Tsur, D., 2002. Large scale sequencing by hybridization. J.Comp. Biol. 9, 413–428.

Southern, E.M., 1988. UK Patent Applic. GB8 810400.Southern, E.M., Maskos, U., Elder, J.K., 1992. Analyzing and comparing

nucleic acid sequences by hybridization to arrays of oligonucleotides:evaluation using experimental models. Genomics 13, 1008–1017.

Wallace, R.B., Johnson, M.J., Hirose, T., Miyake, T., Kawashima, E.H.,Itakura, K., 1981. The use of synthetic oligonucleotides as hybridiza-tion probes. II. Hybridization of oligonucleotides of mixed sequenceto rabbit beta-globin DNA. Nucleic Acids Res. 9, 879–894.

Zhang, J., Wu, L., Zhang, X., 2002. Reconstruction of DNA sequencingby hybridization. Bioinformatics 19, 14–21.