
Università Ca’ Foscari – Venezia
Facoltà di Scienze Matematiche, Fisiche e Naturali

Master's Degree Programme in Computer Science (Corso di Laurea Specialistica in Informatica)

Degree Thesis

Candidate: Giulio Marcon

DNA Sequencing: the Computational Point of View

Supervisors: Merce Llabres Segura, Marta Simeoni

Co-supervisors: Nicola Cannata, Giorgio Valle

Academic Year 2003-2004

DNA Sequencing: the Computational Point of View

Author: Giulio Marcon

Mestre (Venice) - Italy, 15th July 2004

To Ariella and Giorgio, who, perpetuating life,

gave birth to me.

Abstract

We propose and implement a solution to a computational problem arising from DNA sequencing. The original biological problem (clone ordering from fingerprinting data obtained by complete digestion with four restriction enzymes and fluorescent labelling) is formalized, compared with similar problems studied previously, and tackled by adapting algorithms originally designed for restriction mapping. Additionally, an approach to the noisy version of the problem is proposed, without implementation.


Contents

1 Introduction
  1.1 Biological background
    1.1.1 From DNA to proteins
    1.1.2 Restriction enzymes
    1.1.3 Vectors
  1.2 DNA sequencing
    1.2.1 Shotgun sequencing
    1.2.2 Hierarchical sequencing
    1.2.3 Brief history of DNA sequencing

2 The problem
  2.1 Related problems
    2.1.1 Single Complete Digest
    2.1.2 Partial Digest
    2.1.3 Double Complete Digest
    2.1.4 Multiple Complete Digest
    2.1.5 Summing up
  2.2 Existing software
  2.3 The Washington school

3 Solving the error-free problem
  3.1 Definitions
  3.2 The algorithm
  3.3 Implementation and testing

4 A proposal for noisy data
  4.1 The proposed algorithm
  4.2 Fragments compatibility
    4.2.1 Sizes compatibility
    4.2.2 Colours compatibility
    4.2.3 Probability of compatibility
  4.3 How to verify the proposal

5 Conclusions and future work

A Source codes
  A.1 camlstomach
  A.2 camlass

Contents

Chapter one introduces the context of the problem: the biological background, the history of sequencing projects and their current status. Chapter two presents the problem, the related computational problems and the software currently in use. Chapter three presents our solution to the problem, describing the algorithms used and giving some notes on the implementation. Chapter four is a proposal on how to deal with the same problem when the input data contain errors.

The first chapter can be skipped by a reader who is already familiar with the context and the biological issues. Every other chapter depends on the previous ones: in particular, the last section of chapter two is necessary to understand which parts of chapter three are not our original work, and chapter three is necessary to understand chapter four.


Acknowledgments

The work presented here was carried out during a five-month stay, from February 2004 to June 2004, at the Balearic Islands University (UIB), Palma de Mallorca, Spain, with an Erasmus grant from Ca’ Foscari University, Venice, Italy.

Resources both from the Mathematics Department (DMI) of the UIB and from the Computer Science Department (DSI) of Ca’ Foscari have been extensively used to produce this work; additionally, in both places very kind personnel helped at the right moments.

At the DMI-UIB, Merce Llabres guided my research; the bioinformatics research group led by Francesc Rossello and formed by Ricardo Albertich, Jaume Casasnovas, Merce Llabres, Jose Miro and Jairo Rocha always offered a good working environment.

At the DSI, Marta Simeoni gave me all the support she could from more than one thousand kilometers away. Marcello Pelillo and Massimiliano Pavan gave me good references for max cut, clustering and maximum bipartite matching.

The original problem was provided by Giorgio Valle from the Genomic Research Group, CRIBI Biotechnology Centre, University of Padua (Italy); several meetings with him and Nicola Cannata in the second half of 2003 gave me a quick start into the problem, together with a review of the relevant biology.

Several people reviewed drafts or parts of this written work and gave useful suggestions: Ricardo Albertich, Nicola Cannata, Merce Llabres, Giulio Manzonetto, Jairo Rocha, Marta Simeoni, Petra Siwek.


Chapter 1

Introduction

Das Neue ist niemals ganz neu. Es geht ihm immer ein Traum voraus.
(Ernst Bloch)

Translation: The new is never completely new. A dream always precedes it.

A computer scientist could easily be attracted by the current challenges arising from biology. In 1953 Watson and Crick discovered the spatial structure of DNA, a long sequence over a four-letter alphabet that encodes life: a computer scientist could easily see it as a program written in an unknown language that contains in itself the code to build its own interpreter.

The following early quotation from Robbins [Rob92] is a computer metaphor well suited to attracting computer scientists to the subject:

Consider the 3.3 gigabytes of a human genome as equivalent to 3.3 gigabytes of files on the mass-storage device of some computer system of unknown design. Obtaining the sequence is equivalent to obtaining an image of the contents of that mass-storage device. Understanding the sequence is equivalent to reverse engineering that unknown computer system (both the hardware and the 3.3 gigabytes of software) all the way back to a full set of design and maintenance specifications.

[...]

Reverse engineering the sequence is complicated by the fact that the resulting image of the mass-storage device will not be a file-by-file copy, but rather a streaming dump of the bytes in the order they occupied on the device and the files are known to be fragmented. In addition, some of the device is known to contain erased files or other garbage. Once the garbage has been recognized and discarded and the fragmented files reassembled, the reverse engineering of the codes must be undertaken with only a partial, and sometimes incorrect understanding of the CPU on which the codes run. In fact, deducing the structure and function of the CPU is part of the project, since some of the 3.3 gigabytes are known to be the binary specifications for the computer-assisted-manufacturing process that fabricates the CPU. In addition, one must also consider that the huge database also contains code generated from the result of literally millions of maintenance revisions performed by the worst possible set of kludge-using, spaghetti-coding, opportunistic hackers who delight in clever tricks like writing self-modifying code and relying upon undocumented system quirks.

But beyond this long-term and optimistic view of cracking the code of life as a hacker would do with garbled computer code, interesting computational problems arise throughout biology. And, this being a relatively new field, many of these problems have yet to be studied and analysed in depth.

The recent conjunction of biology and computer science has been named bioinformatics: under this not so well chosen term falls everything that has to do with the use of a computer in biology.

Therefore it is necessary to distinguish at least two different fields within bioinformatics. On the one hand there is the field that is of less interest to us, the one related to systems, which deals with low-level data storage and treatment, for example the creation and maintenance of biological databases; this is what we will refer to when we say bioinformatics. On the other hand there is the field that deals with the formalization of problems (often with statistical models), their computational analysis and the development of algorithms, all aimed at analyzing biological data; this is what we will call computational biology.

This is in line with the following definitions by the U.S. National Institutes of Health [HDH+00]:

bioinformatics: research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

computational biology: the development and application of data analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

Moreover, it is also necessary to have an overview of biological advances. Some authors distinguish ages, each characterized by the class of problems that received particular attention: the first age is genomics, where problems were mainly related to DNA sequencing; the second is proteomics, where the problems are aimed at understanding protein functionality (proteins are the bricks on which life is built); the next, upcoming age is evolution, where the most detailed mechanisms of life should be exposed.

All of the problems presented in this thesis belong to genomics, a field that still offers challenges, as a lot of genomes have yet to be sequenced (at the moment only a few of them are available, and no eukaryote genome is complete, as they are just high quality drafts).

1.1 Biological background

We introduce here the required biological background, with simplifications where appropriate. We first give a short overview of the main process involved in building a living organism: DNA is a plan for the organism “in construction”; it gives instructions to produce chains of amino acids that fold into proteins, which are the bricks every organism is made of; DNA also provides instructions to maintain the organism after it has been built.

Figure 1.1: DNA is a complementary double helix over a four-letter alphabet.

1.1.1 From DNA to proteins

Deoxyribonucleic acid (DNA) is physically a double helix consisting of alternating phosphate and sugar groups, kept together by hydrogen bonds between pairs of organic bases. There are only four kinds of organic bases: adenine, thymine, guanine and cytosine. These bases must satisfy the complementarity condition: every adenine base in one chain of the helix has to correspond to a thymine base in the other chain (and vice versa), and every guanine to a cytosine base (and vice versa). The two chains have a direction, as the skeleton to which the organic bases are attached is asymmetric: the end with a free phosphate chemical group is called 5’, while the end with a free hydroxyl chemical group is called 3’; so, while one chain has the direction 5’-3’, the other has the direction 3’-5’.

The DNA molecules just described can be seen as long sequences over a four-letter alphabet, where the letters A, T, G, C stand for the corresponding bases and the usual reading direction is 5’-3’. Thus the complementarity condition can be summed up as A - T, G - C.
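As a minimal illustrative sketch (Python is used here purely for illustration, and the name `reverse_complement` is our own), the complementarity condition lets us compute one strand from the other. Since both strands are conventionally read 5’-3’, the opposite strand is the complement read backwards:

```python
# Complementarity as stated in the text: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, also written in the 5'-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(reverse_complement("AGGCTTC"))  # GAAGCCT
```

Applying the function twice returns the original strand, which reflects the symmetry of the complementarity relation.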

Knowing the mechanisms of transcription is not required in order to follow the next chapters; it is useful, however, to know the products of DNA. The first product of DNA is ribonucleic acid (RNA for short), very similar to DNA except that it is single stranded and that the thymine base is replaced by uracil (U).

Transcription of DNA into RNA is carried out by some proteins in the 5’ to 3’ direction: starting from some DNA (called template DNA), these proteins unfold and refold the double helix, using one of the strands as a template to build the corresponding RNA molecule while respecting complementarity. Every three bases in an RNA strand then encode one amino acid, apart from some non-coding initial and final regions (there are 20 amino acids, so $\lceil \log_4 20 \rceil = 3$ bases are required): chains of amino acids are built from the RNA instructions. These chains fold into proteins, macromolecules with various functionalities; the mechanism of protein folding is one of the main concerns of the proteomic age.
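The two computational facts in this paragraph can be checked with a short sketch. The function names are ours, and we assume for simplicity that the template strand is given in the 3’-5’ direction, so that the RNA is obtained base by base in the 5’-3’ direction with no reversal:

```python
# Pairing used during transcription: like DNA complementarity, but with
# uracil (U) in place of thymine in the RNA product.
RNA_COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def min_codon_length(n_amino_acids: int = 20, alphabet_size: int = 4) -> int:
    """Smallest k with alphabet_size ** k >= n_amino_acids."""
    k = 1
    while alphabet_size ** k < n_amino_acids:
        k += 1
    return k


def transcribe(template_3_to_5: str) -> str:
    """RNA (5'-3') built on a DNA template strand given in 3'-5' direction."""
    return "".join(RNA_COMPLEMENT[base] for base in template_3_to_5)


def codons(rna: str) -> list:
    """Split an RNA string into consecutive triplets, dropping leftovers."""
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]


print(min_codon_length())            # 3
print(codons(transcribe("TACGGC")))  # ['AUG', 'CCG']
```

Two bases would give only 16 combinations, fewer than the 20 amino acids, while three bases give 64.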

As DNA holds the instructions to build and maintain an organism, its study is fundamental for improving health conditions, for understanding diseases and deficiencies of genomic origin, as well as for understanding the evolution of species and for learning more about the human race. For this reason human DNA is the most interesting one, and therefore in October 1990 the American Department of Energy and the National Institutes of Health started the Human Genome Project, whose aim was to sequence the three billion base pairs of the human genome; more details on this and on the history of sequencing projects can be found in section 1.2.3.

Figure 1.2: (a) HaeIII: a cut by an enzyme producing blunt ends. (b) EcoRI: a cut by an enzyme producing sticky ends, with a 5’ overhang and a 3’ recessed end.

Figure 1.3: (a) EcoRI restriction site after the cut: a 3’ recessed end is produced, with a corresponding 5’ overhang. (b) A ddATP dideoxynucleotide can recognize the T on the 5’ overhang, thus attaching to the chain and terminating the elongation; as the ddATP carries a fluorescent tag with a particular colour, the cut can later be recognized.

1.1.2 Restriction enzymes

Restriction enzymes are particular proteins that can recognize a sequence (a pattern) in the DNA and cut this sequence at a determined point: the length of the sequence, the sequence itself, the type and the point of the cut all depend on the enzyme.

A well known public database, “The Restriction Enzyme Database” - http://rebase.neb.com/, provides information about the known restriction enzymes (at the time of writing, about 3,000) and can be used to find an enzyme that suits particular needs in cutting DNA.

The cut produced by a restriction enzyme can be of two types: with blunt ends, where both strands of DNA are cut at the same point (see figure 1.2.a for an example); or with sticky ends, where the helix is cut symmetrically at different points, leaving two overhanging complementary single-stranded DNAs at the point of the cut (see figure 1.2.b for an example).

A restriction enzyme is said to be an n-cutter if it recognizes sequences that are n bases long. In the usual notation the sequence is written in the 5’-3’ direction; in the 3’-5’ direction it would be reversed and complemented. For instance, in figure 1.2, the enzyme HaeIII recognizes the sequence GGCC and, as it cuts after the second base, this is usually written as GG|CC; the enzyme EcoRI instead recognizes the sequence GAATTC and cuts after the first base, written G|AATTC.

Note that the notation just described does not specify whether the cut will produce blunt ends or sticky ends. However, the enzymes used for the techniques explained further on usually recognize palindromic sequences (here palindromic means that the sequence is equal to its own reverse complement, like GGCC and GAATTC) and cut symmetrically on the other strand; for this reason we can assume that if the cut is in the middle, as for HaeIII, the enzyme produces blunt ends, otherwise it produces sticky ends, as in the case of EcoRI.
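The notation and the blunt/sticky rule just described can be sketched in a few lines. All function names below are hypothetical, and only cut positions on the strand written in the notation are reported:

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def parse_site(notation: str):
    """Split e.g. 'GG|CC' into the recognized sequence 'GGCC' and offset 2."""
    offset = notation.index("|")
    return notation.replace("|", ""), offset


def is_palindromic(sequence: str) -> bool:
    """True if the sequence equals its own reverse complement."""
    return sequence == reverse_complement(sequence)


def end_type(notation: str) -> str:
    """Apply the rule of the text: a cut in the middle gives blunt ends."""
    sequence, offset = parse_site(notation)
    if not is_palindromic(sequence):
        raise ValueError("the rule only applies to palindromic sites")
    return "blunt" if 2 * offset == len(sequence) else "sticky"


def cut_positions(dna: str, notation: str) -> list:
    """Positions (0-based, between bases) where the enzyme cuts this strand."""
    sequence, offset = parse_site(notation)
    positions = []
    start = dna.find(sequence)
    while start != -1:
        positions.append(start + offset)
        start = dna.find(sequence, start + 1)
    return positions


dna = "TTGAATTCAAGGCCTT"
print(end_type("GG|CC"), cut_positions(dna, "GG|CC"))      # blunt [12]
print(end_type("G|AATTC"), cut_positions(dna, "G|AATTC"))  # sticky [3]
```

An enzyme is an n-cutter in this representation when `len(sequence) == n`.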

DNA chains are normally synthesized from deoxynucleotides, each of which leaves an anchor point on the chain for the next nucleotide (a 3’ OH chemical group); the use of synthetic dideoxynucleotides that lack this anchor point makes it possible to terminate chain elongation: when a dideoxynucleotide is added to the chain, no other nucleotide can be added afterwards. These nucleotides have to respect the complementarity condition (a ddATP can only be added where there is the corresponding T on the opposite strand), but they can additionally carry a tag that fluoresces in a determined colour when hit by laser light (fluorescent dyes). A dideoxynucleotide can be used, for instance, to detect a cut produced by a restriction enzyme that leaves sticky ends (see figure 1.3).

In a sequencing project it is often useful to have an intermediate product, called a map: partial information about the final sequence in which particular points are annotated. A restriction map is a map with the locations of all the restriction sites of one or more enzymes; it is partial if not all the sites are present. A physical map is a map that positions some sequences; a complete physical map is the final sequence.

1.1.3 Vectors

To handle DNA fragments in the laboratory, special host cells called vectors are used; there are different kinds of host cells, each one with its own characteristics. Their main role is to provide a large supply of exact copies of the fragments (clones) by allowing their easy replication. A vector is selected depending mainly on the fragment size and on its stability in replicating the type of DNA in question. Sometimes, in large genome projects, several kinds of vectors are used at the same time.

Some of the vectors are (a full list of the 2,600 vectors used up to now can be found at http://seq.yeastgenome.org/vectordb/):

• plasmids: the fragment is stored in a plasmid, a circular extrachromosomal DNA element, typically in an Escherichia coli bacterial cell, as this bacterium has been widely studied and is well known; the fragment size should be about 20 Kbp;

• phages: the fragment is stored in a viral vector (phages are nucleic acids wrapped in protein coats); again the typical host cell is E. coli, and the typical vectors are the lambda phage, which can hold up to 25 Kbp, and the P1 phage (also called PAC vector), which can hold up to 100 Kbp; other phage-based vectors are the cosmids, plasmids that contain part of the lambda phage DNA and can carry up to 45 Kbp;

• artificial chromosomes: the fragment is stored in a small extra chromosome in a yeast cell (YAC), typically Saccharomyces cerevisiae, which can hold up to 1 Mbp, or in a bacterial cell (BAC), typically E. coli, which can hold up to 300 Kbp; BACs are usually preferred as they are more stable, though they hold smaller fragments.

The vector with the included fragment is called a clone. As the vector used in the main problem of this thesis is the bacterial one, we will often abuse the term BAC to refer to a generic clone, even though in our models we do not put any constraints on the features of the clone, and also to refer to the contained fragment itself, as we are not concerned with the vector.

A scaffold is a set of clones together with a positioning that can be relative (one clone is positioned with respect to others) or, less often, absolute (the exact base pair location at which a clone starts is known).

1.2 DNA sequencing

From the biological background explained above, the following points should be clear:

• DNA can be seen as a string over the four-letter alphabet A, C, T, G;

• the complementarity relation is A - T, G - C and it is symmetric;

• we can cut DNA at some particular sequence;

• a BAC is simply a host, and we can identify it with the DNA sequence it carries;

• a DNA map provides some information on the location of some particular sequences.

Figure 1.4: Example of a shotgun read: the fragment is sequenced only at its two ends (read, gap, read), leaving a gap in the middle (usually the fragment length is about 2,000 bases, and each read about 500 bases).

We then want to point out two things that can never be stressed enough. The first is that human beings share 99.9% of their DNA, and most of the remaining differences are Single Nucleotide Polymorphisms (SNPs), copying errors arising during replication. The second is that it is difficult to deal with entities so small in size (the diameter of the DNA helix is just two nanometers): proof of this is that today’s technologies allow us to sequence no more than some hundreds of bases from the end of a DNA fragment.

As human beings share most of the DNA code, it is sufficient to study, at a first stage, the DNA of just one single human being (actually the previously cited Human Genome Project examined five individuals); the hardness of sequencing (we can read only some hundreds of bases from one end of a fragment, out of the whole three-billion-base sequence) is the main motivation for research in the field and for this work.

1.2.1 Shotgun sequencing

Of all the techniques applied in DNA sequencing, shotgun sequencing is the only one feasible in large genome projects, as it lends itself well to automation and scaling.

In shotgun sequencing, the whole genome is replicated (the number of times it is replicated is called coverage) and broken into small pieces (about 2,000 bases long) that are then end-sequenced from both sides: as current technologies allow the automated end-sequencing of 500-600 bases, for each piece we have two reads separated by a gap (figure 1.4). As the initial DNA has been replicated, these fragments overlap, and this is used for the reconstruction. The following quotation from Pevzner [Pev00] gives an intuitive view of the process:

Imagine several copies of a book cut by scissors into 10 million small pieces. Each copy is cut in an individual way so that a piece from one copy may overlap a piece from another copy. Assuming that 1 million pieces are lost and the remaining 9 million are splashed with ink, try to recover the original text.

[...]

Computational biologists have to assemble the entire genome from these short fragments, a task not unlike assembling the book from millions of slips of paper. The problem is complicated by unavoidable experimental errors (ink splashes).

Shotgun sequencing can be applied directly to the whole genome (Whole Genome Shotgun or, for short, WGS): in this case the assembly process is computationally very demanding, as no positional information about the fragments is available. For this reason, the use of fragments of variable length is preferable, and the building of intermediate scaffolds is almost necessary.

Another strategy applies shotgun sequencing only to sequences of a more manageable size (usually 100-200 Kbp). This strategy, called Hierarchical Sequencing (HS), will be discussed in detail in subsection 1.2.2, but we present here the main steps: first, a library of clones covering random regions of the genome is created; then, clones are ordered and some of them are selected, so that they form a minimum tiling path covering the whole genome; finally, shotgun sequencing is applied to these clones to obtain the final sequence.

In general HS is preferable to WGS because of the following problems that the latter exhibits:

• when the genome is fragmented into short sequences, a lot of bases are left uncovered even with a high coverage (this is easily understood by assuming a Poisson distribution of the starting points of the sequences); this means that with WGS alone we often obtain partial sequences with gaps, whereas longer sequences cover the genome better, allowing the selection of a good minimal tiling path;

• genomes are not random sequences and contain a lot of repeated motifs (repeats); with WGS alone it would be very difficult to determine to which copy of a repeat a short fragment belongs.

As a disadvantage, HS requires the time-consuming initial mapping step.
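The Poisson argument mentioned in the first point can be made concrete with a small sketch. We assume, as a simplification of ours, that reads start uniformly at random and are short compared with the genome, so that the number of reads covering a given base is approximately Poisson with mean equal to the coverage c, and the probability that a base is not covered at all is e^(-c). The function names are hypothetical:

```python
import math


def coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering a base."""
    return n_reads * read_length / genome_size


def uncovered_fraction(c: float) -> float:
    """Poisson probability of zero reads on a base when the mean is c."""
    return math.exp(-c)


def expected_uncovered_bases(genome_size: int, c: float) -> float:
    """Expected number of bases that no read covers."""
    return genome_size * uncovered_fraction(c)


if __name__ == "__main__":
    for c in (1, 3, 7, 10):
        missing = expected_uncovered_bases(3_000_000_000, c)
        print(f"coverage {c}x: about {missing:,.0f} uncovered bases")
```

Under these assumptions a three billion base genome sequenced at 7× coverage still leaves a few million bases uncovered, scattered over many gaps.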

1.2.2 Hierarchical sequencing

We outline here the whole process of sequencing with the hierarchical strategy. The aim is to determine the sequence of some genome that we will call the target DNA.

The first step is to create a clone library: this is done by replicating the DNA a certain number n of times (usually at least 6-7); n is called the coverage and is usually denoted “coverage n×”. The replicas are then randomly cut into fragments of appropriate size (depending on the vector, see subsection 1.1.3), usually by stopping a partial digestion early and selecting sizes with electrophoresis on agarose gel.

After creating a clone library, the next aim is to order the clones with respect to their starting position in the DNA and to select a minimal tiling path among them. Often intermediate orderings of subsets of clones are produced and used for creating the final ordering: these ordered subsets of clones are called contigs. To order the clones, relations among them have to be inferred. One technique consists in using clone fingerprints.
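To illustrate what selecting a minimal tiling path means, here is a sketch under a strong simplifying assumption of ours: the start and end positions of every clone are already known. In practice these positions are exactly what the ordering step has to infer, so this is not the method used in the laboratory. With known positions the selection reduces to the classical greedy interval cover:

```python
def minimal_tiling_path(clones, target_length):
    """Pick a minimum set of clones covering positions [0, target_length).

    clones: list of (name, start, end) tuples, with end exclusive.
    Returns the names of the selected clones in left-to-right order.
    """
    ordered = sorted(clones, key=lambda clone: clone[1])
    path = []
    covered = 0  # everything before this position is already covered
    i = 0
    while covered < target_length:
        best = None
        # Among the clones starting inside the covered region,
        # keep the one that extends furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in coverage at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path


library = [("a", 0, 100), ("b", 20, 90), ("c", 80, 200),
           ("d", 150, 260), ("e", 190, 300)]
print(minimal_tiling_path(library, 300))  # ['a', 'c', 'e']
```

Clones b and d are redundant here: every position they cover is also covered by a selected clone.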

A clone fingerprint is something highly characteristic of the clone and can be produced in the following ways:

• by Sequence-Tagged Sites (STS): STS are short unique sequences that can be detected on a clone; a set of STS is established and the clone is tested against all of them; the fingerprint of the clone is the set of STS that it contains;

• by digestion: the clone is digested by one or more restriction enzymes and thus cut into several fragments; the fingerprint is the multiset of the measurements of the fragments.

When fingerprinting the clones by digestion, several techniques can be used, each one aiming at maximizing the characterization of the clone by the fingerprint; the main parameters are:

• the number of restriction enzymes (single digestion, double digestion, multiple digestion);

• the length of the digestion (single or multistage partial digestion, complete digestion);

• the type of the restriction enzymes used (4-cutters, 6-cutters, cut with blunt or sticky ends);

• the measurement of the fragments (size measurement, colour measurement).

A fingerprinting technique is reproducible if two fingerprintings of the same clone are equal modulo errors. Partial digestion is not reproducible, as the restriction sites actually cut by the enzymes depend on too many parameters and change every time.
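As an illustrative sketch of a fingerprint by digestion, the function below simulates an error-free complete digestion of a known sequence and returns the multiset of fragment sizes. The function name is ours, only size measurement is modelled, and sizes are counted on the strand written in the site notation, ignoring the overhangs of sticky ends:

```python
from collections import Counter


def complete_digest_fingerprint(clone: str, sites) -> Counter:
    """Multiset of fragment sizes after cutting at every occurrence of each site.

    sites: recognition sites in the 'G|AATTC' notation, one per enzyme.
    """
    cuts = set()
    for notation in sites:
        offset = notation.index("|")
        sequence = notation.replace("|", "")
        start = clone.find(sequence)
        while start != -1:
            cuts.add(start + offset)
            start = clone.find(sequence, start + 1)
    boundaries = [0] + sorted(cuts) + [len(clone)]
    sizes = [right - left for left, right in zip(boundaries, boundaries[1:])]
    return Counter(size for size in sizes if size > 0)


clone = "AAGAATTCTTTTGGCCAAAAGAATTCAA"
print(sorted(complete_digest_fingerprint(clone, ["G|AATTC"]).elements()))
# [3, 7, 18]
print(sorted(complete_digest_fingerprint(clone, ["G|AATTC", "GG|CC"]).elements()))
# [3, 7, 7, 11]
```

Because every site is cut, calling the function twice on the same clone gives the same multiset, which is the reproducibility property discussed above; a partial digestion would cut a varying subset of the sites.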

1.2.3 Brief history of DNA sequencing

In 1965, Robert W. Holley (1922-1993) concluded and published the first sequencing project: it was a simple 77 base pair tRNA sequence, but the work, done entirely by hand, required nine years in total, as suitable techniques for automation were not yet available (and it was a remarkable success for this reason too). This and the subsequent analysis of the sequence obtained earned him the 1968 Nobel prize in physiology/medicine (shared with Har Gobind Khorana and Marshall W. Nirenberg).

In 1977 Frederick Sanger (1918-) invented a technique, named after him, that definitively changed the history of sequencing and that is still widely used; it earned him the 1980 Nobel prize in chemistry (shared with Paul Berg and Walter Gilbert). The technique allows 500-1,000 consecutive bases to be read, and was promptly used to carry out the first two significant genomic projects: an Escherichia coli bacteriophage, PhiX174, in 1978, and the human mitochondrial DNA in 1981.

Year  Organism                       Size          Technique
1965  yeast alanine tRNA             77 bp         by hand
1978  PhiX174                        5,386 bp      WGS
1981  human mitochondrial DNA        16,569 bp     WGS
1982  bacteriophage λ                48,502 bp     WGS
1984  Epstein-Barr virus             172,282 bp    WGS
1995  Mycoplasma genitalium          580,073 bp    WGS
1995  Haemophilus influenzae         1,830,138 bp  WGS
1996  Saccharomyces cerevisiae (∗)   12 Mbp        HS
1997  Escherichia coli               4.6 Mbp       WGS
1998  Caenorhabditis elegans (∗)     97 Mbp        HS
2000  Drosophila melanogaster (∗)    120 Mbp       WGS
2000  Arabidopsis thaliana (∗)       125 Mbp       HS
2001  Homo sapiens (∗)               3 Gbp         HS

Table 1.1: Concluded genome projects with the size of the genome. For eukaryotic genomes, marked with (∗), the year refers to the publication of the first draft.

Several other minor projects were conducted until 1995, when the first cellular organism, the bacterium Haemophilus influenzae, was sequenced using WGS.

In 1996 followed the first published genome of a eukaryotic organism, Saccharomyces cerevisiae, and in 1997 the genome of the bacterium Escherichia coli, the favourite model for genetics, molecular biology and biotechnology.

Finally, the first genome of a multicellular organism, Caenorhabditis elegans, became available in 1998.

In October 1990, the American Department of Energy (Human Genome Program, DoE-HGP) and the National Institutes of Health (National Human Genome Research Institute, NIH-NHGRI) formally started the joint U.S. Human Genome Project, with an original 15-year plan whose goals were to identify all the genes in human DNA (which we now know to be approximately 30,000), determine the sequence of the 3 billion chemical base pairs that make up human DNA, store this information in databases, improve tools for data analysis, transfer related technologies to the private sector, and address the ethical, legal and social issues (ELSI) of the project. Two years later a private company, Celera Genomics, was started by Craig Venter, formerly of the NIH, competing with the public institution for the completion of the sequencing. The U.S. public institution quickly became an international consortium (the International Human Genome Sequencing Consortium), involving sixteen institutions and several other laboratories and researchers all around the world. In February 2001 the two competitors, public [LLB+01] and private [VAM+01], independently published their first drafts, obtained with HS and WGS sequencing respectively; in 2003 they again independently presented their finished sequences. The sequences differ and each one has its pros and cons; debate is still going on about the validity of the techniques and the final results for both of them [ISF+04].

As of June 2004, according to the Genomes OnLine Database [Kyr99], freely available through the website http://www.genomesonline.org/, 194 genomes have been completed and 931 projects are ongoing (both prokaryotic and eukaryotic).

Chapter 2

The problem

If there is a problem you can’t solve, then there is an easier problem you can’t solve. Solve that one first.
(Seen on #math on the EFnet IRC network, adapted from G. Polya, “How to Solve It”, 2nd ed., Princeton University Press, 1957)

The problem we deal with is the ordering of BAC clones fingerprinted by digestion with four restriction enzymes, with measurement of the sizes and colours of the fragments. By ordering we mean that we want the clones ordered by the position of their starting point in the target DNA. This technique will be used in sequencing projects at the Genomic Research Group, CRIBI Biotechnology Centre, University of Padua (Italy), and was proposed, with successful experimentation, by Luo et al. [LTY+03].

The process in the laboratory consists in the creation of libraries of BACs of about 100 Kbp with a 7× coverage. Each clone is then fully digested by four 6-cutter enzymes that leave different recessed 3’ ends (that is, one for each of the four bases). For an example set of such enzymes see the table in figure 2.1. The fragments resulting from the digestion are coloured (labelled) with four different fluorescent dyes in order to recognize which enzyme produced the cut. Finally, the size and colour of each fragment are measured: these measurements are the fingerprint of the clone.

Restriction endonuclease  Restriction site  Recessed 3’ end
EcoRI                     G|AATTC           A
BamHI                     G|GATCC           G
XbaI                      T|CTAGA           C
XhoI                      C|TCGAG           T

Figure 2.1: An example set of four 6-cutter enzymes: the recessed 3’ end is different for each enzyme, to allow the use of dideoxynucleotides with fluorescent dyes for colouring. This is the set proposed by Luo et al. [LTY+03], except for the exclusion of the blunt-end producing enzyme HaeIII.

Thus, given a set of BACs, the input and output of the problem are:

input: the fingerprints of the BACs;

output: the ordering of the BACs.

We give in this chapter references for similar problems, and then show in chapter 3 how we adapted them to solve our problem. The literature on biological problems related to restriction enzyme digestion is very rich, as these problems are very interesting from the computational point of view and were often studied abstractly in computational geometry before: the availability of concrete applications gave new verve to the research on them.

In the following sections we consider the problem from both the theoretical and the practical point of view, that is, both the literature on related computational problems from computer science and the software that would be used in biological laboratories.

2.1 Related problems

We present in this section four computational problems arising from restriction mapping that were a source of inspiration for our work: Single Complete Digest, Partial Digest, Double Complete Digest and Multiple Complete Digest.

2.1.1 Single Complete Digest - SCD

In single complete digest several clones are completely digested by a single restriction enzyme and the fingerprint is the multiset of fragment sizes. The aim is to produce the most compact map of the restriction sites consistent with fingerprint data.

This problem has been proved to be NP-Hard by formulating it as a constrained path cover problem on a multistage graph [JK97]. A 2-approximation algorithm for the problem is given in the same paper (for a short introduction, see Pevzner’s book [Pev00, pages 53-54]).

Although the problem is NP-Hard in this formulation, there is strong statistical evidence that the ordering of clones can often be estimated in polynomial time. Our problem is similar to this one in that we have a single fingerprint (as the digestion is with all the enzymes together), but it is much simpler, as we do not have to find a restriction map but just an ordering of clones.

2.1.2 Partial Digest - PD

In partial digest the clone is digested by several restriction enzymes together. The digestion is stopped and fragment sizes are measured at different times, obtaining data of partial digestion at several stages. The aim then is to obtain a restriction map of the clone.

The related problem (Partial Digest Problem) is a classical computational geometry problem (the so-called Turnpike Problem) whose complexity is unknown. The problem can be formalized as follows: let ∆X be the multiset of all pairwise distances between the points of a set X on a line

∆X = {|x1 − x2| : x1, x2 ∈ X};

the problem then consists in reconstructing the original set X given ∆X. Figure 2.2 presents an instance of the problem.

Figure 2.2: A simple instance of the Partial Digest Problem: the line on the bottom represents the set X = {0, 7, 9, 14, 23}, while all the numbers on the arcs are the interpoint distances, that is the set ∆X = {2, 5, 7, 7, 9, 9, 14, 14, 16, 23}. The problem is to reconstruct the set X from ∆X.

Note that it is not always possible to uniquely reconstruct a set X from ∆X. In particular, two different sets X and Y can have ∆X = ∆Y. Such sets are said to be homometric and their structure has been investigated by Rosenblatt and Seymour [RS82], who gave some characterization theorems. A very easy and intuitive way to construct homometric sets is this: given two multisets U and V, the sets U + V = {u + v : u ∈ U, v ∈ V} and U − V = {u − v : u ∈ U, v ∈ V} are homometric. For instance, U = {6, 7, 9} and V = {−6, 2, 6} give the homometric sets {0, 1, 3, 8, 9, 11, 12, 13, 15} and {0, 1, 3, 4, 5, 7, 12, 13, 15}. Two example homometric sets that cannot be built this way (other than trivially, with a one-element U or V) are {0, 1, 2, 5, 7, 9, 12} and {0, 1, 5, 7, 8, 10, 12}, represented in figure 2.3.
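The construction can be checked mechanically. The following Python sketch is only an illustration: the function names and the choice U = {6, 7, 9}, V = {−6, 2, 6} are ours. It compares the multisets of pairwise distances of two point sets and builds U + V and U − V.

```python
from collections import Counter
from itertools import combinations, product


def delta(points):
    """Multiset of pairwise distances between the given points."""
    return Counter(abs(a - b) for a, b in combinations(points, 2))


def homometric(xs, ys):
    """True if the two point sets have the same distance multiset."""
    return delta(xs) == delta(ys)


def sum_and_difference(u, v):
    """Return U + V and U - V as sorted lists (multisets)."""
    plus = sorted(a + b for a, b in product(u, v))
    minus = sorted(a - b for a, b in product(u, v))
    return plus, minus


# Illustrative choice of U and V (an assumption of this sketch).
plus, minus = sum_and_difference([6, 7, 9], [-6, 2, 6])
print(plus)                      # [0, 1, 3, 8, 9, 11, 12, 13, 15]
print(minus)                     # [0, 1, 3, 4, 5, 7, 12, 13, 15]
print(homometric(plus, minus))   # True
```

The same `homometric` check also accepts the seven-point pair {0, 1, 2, 5, 7, 9, 12} and {0, 1, 5, 7, 8, 10, 12}.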

Skiena proposed an algorithm with backtracking to solve the problem [SSL90] with expected running time O(n² log n): no instances were known to require more than polynomial time to be solved by this algorithm until Zhang cleverly constructed such a class [Zha94]. These instances can be solved in polynomial time by using the 0-1 quadratic definition of the problem and relaxation by semidefinite programming, as described later by Dakic in her PhD thesis. The thesis itself [Dak00] is also an excellent guide to the problem for the interested reader.

Pandurangan and Ramesh [PR02] solved the problem of the non-unique reconstruction by labelling the clone ends. The new problem is called Labelled Partial Digest and the authors propose a polynomial algorithm for the error-free case and also for the case with errors (within a certain error bound).

Figure 2.3: Two example homometric sets, drawn with their interpoint distances: (a) {0, 1, 2, 5, 7, 9, 12} (b) {0, 1, 5, 7, 8, 10, 12}

Figure 2.4: A simple instance of the Double Complete Digest Problem: (a) the separate digestion data are available, A = {3, 7, 13} and B = {5, 8, 10}, (b) as well as the digestion data of the two enzymes together, A + B = {2, 3, 5, 5, 8}. The problem is to produce a restriction map of the clone.

Cieliebak et al. [CEP02] studied the real problem arising in the laboratory, where we want to reconstruct the set X from just a subset E ⊆ ∆X of its pairwise distances, and determined it to be NP-Complete. Moreover, they studied the complexity of the problem where the data are noisy (that is, with errors) and assessed it to be NP-Hard.

2.1.3 Double Complete Digest - DCD

In double complete digest a clone is completely digested by two restriction enzymes, both separately and together. Three sets of digestion data are thus available: the fragment sizes obtained from the digestion by the first enzyme, the sizes from the digestion by the second one and the sizes from the digestion by the two enzymes together. Again, the aim is to produce a restriction map of the clone. Figure 2.4 represents an instance of the problem: the two lines represent the same fragment digested with the enzymes separately (fig. 2.4.a) and with the enzymes together (fig. 2.4.b).

The problem has been proved to be NP-Complete by Goldstein and Waterman [GW87]. Schmitt and Waterman [SW91] showed how one solution can generate another possible one by simple transformations: Pevzner used these to develop an algorithm that, given an instance, generates all the solution classes modulo these transformations [Pev95]; the algorithm, based on a quite complicated intuition, uses alternating Eulerian cycles in coloured graphs to find such classes. The interested reader can find an extensive treatment of this and all the previous advances on the problem in Pevzner’s book [Pev00].
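While finding a map is hard, checking a candidate solution is easy: given the two single-enzyme maps, the double digest follows by merging the cut sites. The Python sketch below illustrates this check; the function names are ours, and the fragment orders (7, 13, 3) and (5, 10, 8) are our reading of the map drawn in figure 2.4, so they should be taken as an assumption.

```python
from itertools import accumulate


def cut_sites(fragments):
    """Internal cut positions for fragments listed in map order."""
    return list(accumulate(fragments))[:-1]


def double_digest(map_a, map_b):
    """Fragment sizes, in map order, when both enzymes cut the clone."""
    if sum(map_a) != sum(map_b):
        raise ValueError("the two maps must describe the same clone length")
    length = sum(map_a)
    sites = sorted(set(cut_sites(map_a)) | set(cut_sites(map_b)))
    bounds = [0] + sites + [length]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# Candidate maps for A = {3, 7, 13} and B = {5, 8, 10} (assumed orders).
fragments = double_digest([7, 13, 3], [5, 10, 8])
print(fragments)          # [5, 2, 8, 5, 3]
print(sorted(fragments))  # [2, 3, 5, 5, 8], the double digest data A + B
```

A solver for the problem would search over the orderings of A and B until the sorted output matches the measured double digest.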

2.1.4 Multiple Complete Digest - MCD

In multiple complete digest a clone is completely digested by more than two enzymes separately (and not with all the enzymes together). The aim is again to produce a restriction map of the target DNA.

The problem has been extensively studied in the last ten years, mainly by people connected to the Washington Genome Center, where such digestion data were being produced for the HGP (we will refer to the group of authors of the following publications as the Washington school). A first work was conducted by Alizadeh and Karp [AKNW93], both at Berkeley at that time: they proposed a local search technique for the ordering of clones by multiple complete digest fingerprint data. Alizadeh then went as a visitor to Rutgers, The State University of New Jersey (where he still is), and Karp moved for about four years to Washington University, where the big deal was on.

Alizadeh’s PhD student Settergren presented his dissertation in 1998 at Rutgers [Set98], while Karp’s PhD student Fasulo presented his in 2000 at Washington University [Fas00]: both theses propose algorithms for building a restriction map of the target DNA from MCD data.

Other relevant references on this problem are Mumey’s PhD thesis [Mum97], where a simulated annealing approach is used to find a clone ordering; Fasulo et al. [FJK+99], where the joint work of the two PhD students on fragment identification was presented for the first time; Jiang and Karp [JK97], where the problem of producing a restriction map is definitely separated from the one of obtaining an ordering or interleaving; finally, the brilliant Fasulo et al. [FJKS98], where the span and inclusion relations were proved to be stronger than simple overlap relations.

In the attempt to produce a restriction map from multiple complete digest data it is necessary to pass through the step of ordering the clones as, in contrast to the two previously presented problems, we want here to give a restriction map of the target DNA and not of the clone. For this reason, even if the amount of data in the MCD problem is greater than what we have in ours, techniques applied to MCD proved to be the most adaptable to our problem. We will deal in detail with their relevant parts in section 2.3.

2.1.5 Summing up

Although all the presented problems are related to restriction mapping and not to clone ordering, they can be a source of inspiration for our problem. In particular, as we are digesting with all the enzymes together, the data are similar to the input data of SCD; but as we just want an ordering of the clones, and not the most compact restriction map, we have an easier problem. In fact, techniques for solving MCD (which is in general an easier problem than SCD, as clones are more characterized), applied to the case of one enzyme, turned out to be very effective. Refer to the table below to see the characteristics of the input data of each problem.

               digestion
               separated   together   complete   partial   enzymes
SCD                        •          •                    1
PD                         •                     •         n
DCD            •           •          •                    2
MCD            •                      •                    n
our problem                •          •                    4


2.2 Existing software

One of the first software packages for clone assembly from fingerprint data was Contig9, written in Fortran for VAX, which evolved into its more portable ANSI C UNIX version ContigC [SMS+88]: this software was created as a support for the Caenorhabditis elegans project at The Wellcome Trust Sanger Institute (Hinxton, Cambridge, UK); its main aim was to assist human-driven contig assembly through a set of automated analysis tools.

One of the tools, the program Image, is still in use for processing gel images from restriction digest fingerprinting experiments. It has been rewritten several times, and the latest version (3.10) is fully functional and can also be used to extract both fragment sizes and colours from the gels (the input data in our problem).

ContigC was maintained until 1995, when work on FPC, its replacement, started; the first public release came two years later, and was presented by Soderlund et al. [SLM97]. It was still more a working instrument for assembling contigs than a completely automated tool. After data acquisition (still by Image) the main process carried out by FPC is the following:

shared bands the number of shared bands between two clone fingerprints is computed by counting the number of compatible bands (fragment sizes); two bands are considered compatible if their difference is below a user-supplied tolerance (that by default is 7);

binning clones are clustered using the number of shared bands as distance measure; this process is called binning; the intention is then to order the clones inside each bin to form a contig;

overlap the overlap relations between clones in a bin are calculated by verifying that their overlap probability is below a user-supplied cutoff; the overlap probability of two clones ci and cj, called Sulston score and denoted SulstonScore(ci, cj), is given by the following equation (from ContigC):

\[ \mathrm{SulstonScore}(c_i, c_j) = \sum_{m=M}^{n_L} \binom{n_L}{m} (1 - p)^m \, p^{\,n_L - m}, \]

where M is the number of shared bands between ci and cj, p = (1 − b)^nH, nL and nH are the lowest and the highest number of bands of ci and cj, and b = 2t/gel length, with t the tolerance from the shared bands calculation and gel length the fixed length of the gel used to measure fragment sizes. Another equation is more commonly used to compute the overlap probability, the so-called “Equation 2” or Mott score, which is not formalized in any publication and which we will not report here, although the source code that computes it is freely available;

markers STS marker information on clones can be added (this is called a framework in FPC); the user is able to choose how the overlap relation is decided with these additional fingerprinting data (this is called CpM, Cutoff plus Marker, and is simply a criterion based on the number of shared markers and on the number of shared bands);

ordering clones in a bin are ordered through the CB algorithm, its name deriving from “consensus band” (a clone ordering in FPC terms); the CB algorithm iteratively builds an ordering of the clones (CB map), picking them from the bin in the order they are in; as the random order they have in the bin influences the produced CB map, ten runs are executed, shuffling the input and keeping the best CB map.
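To make the shared bands and overlap steps concrete, here is a small Python sketch of the two computations. It is only an illustration under our own assumptions: the greedy scan used to count compatible bands and all names and parameter values are ours, not taken from the FPC sources. The score is the binomial tail, that is, the probability of sharing at least M bands by chance.

```python
from math import comb


def shared_bands(bands_a, bands_b, tolerance=7):
    """Count compatible bands with a greedy scan of the two sorted lists.

    Two bands are compatible if their difference is below the tolerance.
    """
    a, b = sorted(bands_a), sorted(bands_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) < tolerance:
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


def sulston_score(n_low, n_high, shared, tolerance, gel_length):
    """Probability that at least `shared` bands match by chance."""
    b = 2 * tolerance / gel_length
    p = (1 - b) ** n_high
    return sum(comb(n_low, m) * (1 - p) ** m * p ** (n_low - m)
               for m in range(shared, n_low + 1))


# Hypothetical fingerprints and gel length, for illustration only.
clone_a = [100, 200, 300]
clone_b = [103, 250, 299]
m = shared_bands(clone_a, clone_b)
print(m)  # 2
print(sulston_score(len(clone_a), len(clone_b), m, 7, 3300))
```

A lower score means that the shared bands are less likely to be a coincidence, which is why an overlap is accepted when the score falls below the cutoff.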

Several improvements were made to the original programs: Soderlund et al. [SHDF00] presented version 4.7, where the three salient changes are: the possibility for the CB algorithm to mark a clone as questionable, if it hardly fits in the CB map; the possibility to incrementally build contigs as new fingerprint data for new clones become available; and the possibility to split or merge bins.

As the process of calculating all pairwise shared bands between clone fingerprints was very time consuming, a parallelized version of FPC was implemented and described by Marra et al. [NTK+02]: it strangely uses a custom client/server protocol instead of the classical Message Passing Interface, with a central architecture where the server sends and collects results of tasks of fixed size (six clones against all the others).

Other improvements described by Engeler et al. [EHNS03] allow adding simulated fingerprint data from already sequenced clones (through two stand-alone programs: FSD for simulating digestion data and BSS for STS marker data). But the biggest improvement described in the same paper is the pickMTP algorithm, which selects a minimum tiling path of clones for the target DNA (this feature requires BAC end-sequence data, BES data for short: that is, BACs need to be end-sequenced).

As fingerprinting technologies evolved, some new tools were developed to intervene in the FPC process, directly modifying the FPC database to take advantage of the already existing and familiar interface to interact with the data: their use, often combined with tricky techniques, allows FPC to work with peculiar fingerprinting data. We present these tools in the following paragraphs.

The program GenoProfiler [MMCQ+04] can be used to preprocess fingerprint data obtained by digestion with several enzymes, in order to use them directly in FPC. Luo et al. [LTY+03] used it for fingerprint data produced with four enzymes with fluorescent fingerprinting, like our case but for an additional fifth enzyme producing blunt ends; Ding et al. [DJC+99, DJC+01] used it with fingerprint data obtained from complete digestion with three enzymes and fluorescent labelling.

A recently developed algorithm named CORAL (Clone ORdering ALgorithm), by Flibotte et al. [FCF+04], is meant to be inserted just after the binning of FPC. The bands are not processed through Image ([SMS+88, SMDH89]) but through BandLeader [FKC+03]. Data are then binned with FPC and clones in each bin are ordered using the CORAL algorithm (implemented as a stand-alone program) instead of the CB algorithm. The output is still FPC compatible and therefore FPC can still be used for visualization and manual rearrangements. The CORAL algorithm is an optimization algorithm that aims at maximizing a function called fitness score. The fitness score evaluates the quality of a clone ordering and is substantially based on the product of the Sulston scores of subsequent clones in an ordering, formally defined by them as:

\[ F(C_1, \ldots, C_n) = -\sum_{i=1}^{n-1} \log \mathrm{SulstonScore}(C_i, C_{i+1}), \]

that is just:

\[ F(C_1, \ldots, C_n) = -\log \prod_{i=1}^{n-1} \mathrm{SulstonScore}(C_i, C_{i+1}). \]

In order to find a good clone ordering, CORAL starts by producing various initial solutions with a greedy approach and then combines them using a sort of genetic algorithm that stops when the fitness score does not improve any more.
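The fitness score itself is simple to evaluate once a pairwise score is available. The sketch below is a hedged illustration, not CORAL code: the `fitness` function, the toy clone names and the toy score values are all our own assumptions, and the pairwise score is passed in as a function returning a value in (0, 1].

```python
from math import log


def fitness(ordering, score):
    """Negative sum of log scores over consecutive clones in the ordering."""
    return -sum(log(score(a, b)) for a, b in zip(ordering, ordering[1:]))


# Hypothetical pairwise scores: small values mean a likely overlap.
toy_scores = {
    frozenset(("A", "B")): 1e-6,
    frozenset(("B", "C")): 1e-4,
    frozenset(("A", "C")): 1e-1,
}


def toy_score(a, b):
    return toy_scores[frozenset((a, b))]


good = fitness(["A", "B", "C"], toy_score)
bad = fitness(["A", "C", "B"], toy_score)
print(good > bad)  # True: the ordering with likelier overlaps scores higher
```

Summing logarithms rather than multiplying the scores avoids numerical underflow when many very small probabilities are combined.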

The authors of CORAL report comparisons against CB with both in silico and real data, showing the better behaviour of their algorithm. Unfortunately, in contrast to the availability information in the original paper, the software is still in “internal testing and licensing stage” [Var04, personal communication].

Since August 2003, some changes have been made to the main FPC software to allow the handling of what its authors call High Information Content Fingerprinting (HICF), that is, fingerprinting that includes fragment colour information (and not only size). Actually, this is achieved through a workaround (a constant for each colour is added to the sizes) and does not take full advantage of the fingerprint information, as FPC is not aware of the added information and all the rest of the process is left unchanged. A tutorial on how to use FPC for HICF has been available since May 2004 on the http://www.genome.arizona.edu/software/fpc/ website. This technique has been used in the previously cited works by Ding et al. [DJC+99, DJC+01].
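The workaround can be pictured with a few lines of Python. This is a sketch under our own assumptions: colours are numbered 0 to 3 and the per-colour constant is a multiple of a hypothetical offset chosen larger than any fragment size, so that bands of different colours can never fall within the tolerance of each other.

```python
COLOUR_OFFSET = 10_000  # hypothetical: must exceed the largest fragment size


def encode_band(size, colour):
    """Map a (size, colour) measurement to a single size-like value."""
    if not 0 <= size < COLOUR_OFFSET:
        raise ValueError("size out of range for this offset")
    return size + colour * COLOUR_OFFSET


def decode_band(value):
    """Recover (size, colour) from an encoded band."""
    colour, size = divmod(value, COLOUR_OFFSET)
    return size, colour


print(encode_band(350, 2))  # 20350
print(decode_band(20350))   # (350, 2)
```

Since the downstream process only sees the encoded values, it still treats each band as a plain size, which is why the colour information is not fully exploited.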

In 2000, a private company branched the FPC sources to add another equation to the system for detecting overlaps: this equation was specifically studied for HICF, fully taking into account fragment size and colour data. The results with the underlying statistical model are still not published [RCH04, in preparation]. The branch is called multiFPC and its sources (without commercial support) are available on the Discovery Biosciences corporate website (http://www.discoverybio.com/).

Although these approaches are aimed at solving problems very similar to ours, we think that none is satisfactory. We believe that a radically new approach is needed: the several heuristics implemented in all these programs seem more like patches to fix a bad initial idea than real complete approaches.

An interesting work, even if the software is not available, is the one by States et al. [SNB01], where a generic statistical model is built to handle all the types of fingerprint data known up to now. The implementation is quite slow (one hour to order 60 clones) as the algorithm uses simulated annealing with Metropolis sampling. Thus, although the statistical model is quite smart and provides a good framework, the implementation is not satisfactory.

Curiously enough, part of the ideas behind RMAP [Fas00], a software for restriction mapping from the Washington school, seems instead to fit our idea of a neat process for clone ordering with a good, though simple, background framework.

2.3 The Washington school

We present in this section excerpts from the works of what we call the Washington school, that is, substantially what was produced in 1998-2000 by Alizadeh, Fasulo, Jiang, Karp, Settergren and Sharma, and that we consider useful for our problem.

The common procedure to order clones presented in the literature is to use overlap relations, estimating whether two clones overlap by checking if the number of their shared bands is above a threshold. In a first work, Fasulo et al. [FJKS98] introduced span and inclusion relations for ordering clones, and proved them to be much more stable than overlap relations.

From span and inclusion relations all the work of ordering clones comes almost naturally. We outline our interpretation of the whole process described by Fasulo [Fas00]. We recall that the aim is ordering the set of clones by starting position on the target DNA.


Input Let C = {C1, . . . , Cn} be a set of n clones; the input is then the set of fingerprints obtained by separate digestion with D enzymes:

\[ F^1 = \{F^1_1, \ldots, F^1_n\}, \; \ldots, \; F^D = \{F^D_1, \ldots, F^D_n\}, \]

where each fingerprint F^j_i is the multiset of fragment sizes obtained from digestion of the i-th clone by the j-th enzyme. The underlying data structure used to represent a fingerprint is a list in ascending order, where the total order relation is the one on the naturals.

Computing the coincidence The coincidence between two clones Ci and Cj is given by the sum of the coincidences of the fingerprints for each enzyme:

\[ \mathrm{Co}(C_i, C_j) = \sum_{h=1}^{D} \mathrm{FragmentMatch}(F^h_i, F^h_j); \]

the coincidence between fingerprints is calculated by producing a maximum matching of compatible fragments. As fragments are supposed to be prone to a measurement error proportional to their size, this gives rise to the following simple algorithm (adapted from Fasulo [Fas00, figure 2.2]), provided that the fragments in the fingerprints are ordered.


FragmentMatch(F^k_i, F^k_j)
    x ← head[F^k_i]
    y ← head[F^k_j]
    match ← 0
    while x ≠ Nil ∧ y ≠ Nil
        do if x ≅ y                    ▷ compatibility check
              then match ← match + 1
                   x ← next[x]
                   y ← next[y]
           elseif x < y
              then x ← next[x]
           else y ← next[y]
    return match

The algorithm simply scans the two ordered fragment lists, incrementing a counter every time it meets two compatible fragments.
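As an illustration, the scan can be written in Python as follows. This is a sketch, not Fasulo's implementation: the compatibility check (sizes differing by at most a fixed fraction of the larger one) and the 1% default are our assumptions, chosen only to reflect an error proportional to the fragment size.

```python
def compatible(x, y, rel_error=0.01):
    """Assumed check: sizes differ by at most rel_error of the larger one."""
    return abs(x - y) <= rel_error * max(x, y)


def fragment_match(f_i, f_j, rel_error=0.01):
    """Count matched fragments in two ascending lists of fragment sizes."""
    i = j = match = 0
    while i < len(f_i) and j < len(f_j):
        if compatible(f_i[i], f_j[j], rel_error):
            match += 1
            i += 1
            j += 1
        elif f_i[i] < f_j[j]:
            i += 1
        else:
            j += 1
    return match


def coincidence(clone_i, clone_j, rel_error=0.01):
    """Sum of the matches over the D enzymes.

    Each clone is a list of D ascending lists, one per enzyme.
    """
    return sum(fragment_match(a, b, rel_error)
               for a, b in zip(clone_i, clone_j))


print(fragment_match([1000, 2000, 3000], [1005, 2500, 2990]))  # 2
```

Each list is scanned once, so a single call takes time linear in the total number of fragments of the two fingerprints.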

Inclusion relation The inclusion relation checks that all the fragments of a clone are matched in the other clone, and is estimated in this way:

\[ C_i \le_{In} C_j \iff \mathrm{Co}(C_i, C_j) = \sum_{h=1}^{D} \left| F^h_i \right|. \]

Maximal clones With the inclusion relation, a set of maximal clones is selected (also known in the literature as canonical clones), that is, a set of clones where none is included in the others; first, one representative is taken for each class of the following equivalence relation, which groups clones with the same fingerprint:

Ci ≡Id Cj iff Ci ≤In Cj ∧ Cj ≤In Ci;

then the maximal clones are defined as follows:

Cmax = {Ci | ∄Cj ≠ Ci . Ci ≤In Cj, with Ci, Cj ∈ C/≡Id};


Span relation The span relation checks if a given maximal clone spans two other maximal clones, and is estimated as:

\[ \mathrm{Sp}(C_i, C_j, C_k) \iff \mathrm{Co}(C_k, C_i \cup C_j) = \sum_{h=1}^{D} \left| F^h_k \right|, \]

where Ci, Cj and Ck are maximal clones and ∪ means that we produce a new clone fingerprint where the fragments are the multiset union of the fragments of the two clones for each enzyme domain.

Neighbours For each clone a set of neighbours is calculated from the span relation; the neighbours of a clone Ck are all clones Ci and Cj such that Sp(Ci, Cj, Ck); additionally, the neighbours set is partitioned into two sets, such that the neighbours in each set lie on different sides of the clone Ck in the ordering; this is done through the algorithm Neighbours, which produces a max cut on the neighbour graph of Ck, where the vertices are C \ {Ck} and there is an edge between Ci and Cj if Sp(Ci, Cj, Ck). The max cut is approximated with the algorithm MaxCut, adapted from the partition algorithm by Fasulo [Fas00, figure 2.5].

Neighbours(C, Ck)
    ▷ Creates the undirected neighbour graph G = (V, E):
    V ← ∅
    E ← ∅
    for each Ci ∈ C
        do for each Cj ∈ C
              do if Sp(Ci, Cj, Ck)
                    then V ← V ∪ {Ci, Cj}
                         E ← E ∪ {(Ci, Cj), (Cj, Ci)}
    ▷ Partitions the neighbours in two sets such that they lie on different sides of Ck:
    return MaxCut(G)


MaxCut(G)
    ▷ Randomly assign sides to vertices:
    for each v ∈ V[G]
        do side[v] ← L or R randomly
    while the cut between L and R is growing
        do for each v ∈ V[G]
              do n ← neighboursSameSide(v)
                 n′ ← neighboursOppositeSide(v)
                 if n > n′
                    then newside[v] ← Not side[v]
                 elseif n = n′
                    then newside[v] ← L or R randomly
                 else newside[v] ← side[v]
           for each v ∈ V[G]
              do side[v] ← newside[v]
    return side
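For illustration, the following Python sketch implements a closely related local search. It is a variant, not the MaxCut procedure above: vertices are flipped one at a time instead of all together, and ties are left untouched, which guarantees termination and a cut containing at least half of the edges. Names and the example graph are ours.

```python
import random


def cut_size(edges, side):
    """Number of edges whose endpoints lie on different sides."""
    return sum(1 for u, v in edges if side[u] != side[v])


def local_max_cut(vertices, edges, seed=None):
    """Single-flip local search for a large cut of an undirected graph."""
    rng = random.Random(seed)
    adjacent = {v: set() for v in vertices}
    for u, v in edges:
        if u != v:  # self-loops can never be cut, ignore them
            adjacent[u].add(v)
            adjacent[v].add(u)
    side = {v: rng.choice("LR") for v in vertices}
    improved = True
    while improved:
        improved = False
        for v in vertices:
            same = sum(1 for w in adjacent[v] if side[w] == side[v])
            # Flip only when strictly more neighbours are on the same side:
            # every flip then strictly increases the cut.
            if 2 * same > len(adjacent[v]):
                side[v] = "L" if side[v] == "R" else "R"
                improved = True
    return side


# Hypothetical neighbour graph, for illustration only.
vertices = list("abcdef")
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"),
         ("e", "f"), ("a", "f"), ("a", "d")]
side = local_max_cut(vertices, edges, seed=1)
print(cut_size(edges, side), "of", len(edges), "edges are cut")
```

When the search stops, no vertex has more neighbours on its own side than on the opposite one, which is the local optimality condition behind the half-of-the-edges guarantee.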

Contigs from mutual nearest neighbours Before producing the final ordering of clones, we use this intermediate step to order subsets of clones. We recall that an ordered subset of clones is called a contig. In order to obtain contigs, we use this procedure: we first assign, for each clone Ck, the following triple of values to each neighbour Ci of Ck:

• the number of neighbours Cj of Ci in the same partition such that Sp(Ci, Ck, Cj);

• the number of neighbours of Ci in the opposite partition;

• the coincidence between Ci and Ck.

The two sets of neighbours are then ordered by using the lexicographic ordering on the triple just defined; the two top neighbours on each of the two lists are called the nearest neighbours of the clone Ck; if two clones are mutual nearest neighbours, then they are probably subsequent in a clone ordering; a strongly ordered component is a list of clones where each one is mutual nearest neighbour with the following one; such strongly ordered components can be considered as contigs.

Clone ordering The final clone ordering is obtained by ordering the contigs produced in the previous step. In particular, contigs have to be merged and directed, and this is done by creating a special graph (component ordering graph) where:

• there are two vertices for each contig, representing the two ends, connected by an edge of low weight;

• there is an edge between all ends of contigs, weighted depending on a scheme justified by empirical data [Fas00, figure 2.6].

On this graph, Held-Karp iteration is used to produce a minimum spanning tree with a long body and short branches, and the long body is the contig ordering.

In the next chapter, we describe how we solve our problem starting from this skeleton process, and what we have fixed, improved and adapted.


Chapter 3

Solving the error-free problem

Inside every large problem is a small problem struggling to get out. (Hoare’s Law of Large Problems)

We solve in this chapter the error-free problem. We first formalize the problem input, adapting Fasulo’s proposal [Fas00], presented in the previous chapter, to the case of a single enzyme domain (D = 1): this is because we do not have separate digestion data. We then present the algorithm that we devised and we give some details of the implementation.

3.1 Definitions

We denote with N0 the set of naturals without the 0, that is N0 = N \ {0}.

Definition 3.1 (fragment measurement) A fragment measurement is a pair

m = (s, c), s ∈ N0, c ∈ N0,

where s represents the size measurement and c represents the colour measurement.


Figure 3.1: Example representation of relative placement of clones for (a) the inclusion relation, Fi ≤In Fj, and (b) the span relation, Sp(Fi, Fj, Fk).

Definition 3.2 (clone fingerprint) A clone fingerprint is a multiset of fragment measurements

F = {(s1, c1), . . . , (sk, ck)}.

We identify each clone in the set C = {C1, ..., Cn} with its fingerprint, since the latter is the only input to our algorithm. So, we will refer to clones just by their fingerprint and we will write the same set of clones as F = {F1, ..., Fn}.

The fragments on the ends of each clone do not characterize it, as one of their extremities is not determined by digestion by a restriction enzyme but by the process that generated the clone itself. For this reason, we assume that the fingerprint of a clone does not contain the measurements of the end fragments. This is a reasonable assumption, since end fragments can be easily detected by standard laboratory techniques (like radioactive labelling [PR02]) and removed from the fingerprint.

Now we define the coincidence function.

Definition 3.3 (coincidence) The coincidence between two clones Fi and Fj is the number of equal fragment measurements they share and is given by:

Co(Fi, Fj) = |Fi ∩ Fj| ,

where ∩ is multiset intersection.

The inclusion relation is defined in order to formalize maximal clones.


Definition 3.4 (inclusion) Let Fi and Fj be clones. We say that Fi is included in Fj or, equivalently, that Fj includes Fi if all the fragment measurements of Fi are also in Fj, and we write:

Fi ≤In Fj iff Co(Fi, Fj) = |Fi| .

Figure 3.1.a shows an example of the inclusion relation. Note that the inclusion relation is a partial order on the set of clones, as the following properties hold:

reflexivity : for each clone Fi, Co(Fi, Fi) = |Fi ∩ Fi| = |Fi|;

antisymmetry : if Fi ≤In Fj and Fj ≤In Fi, then Co(Fi, Fj) = |Fj ∩ Fi| = |Fi| = |Fj| and this is true if and only if Fi and Fj are the same multiset;

transitivity : if Fi ≤In Fj and Fj ≤In Fk, then we know that |Fi ∩ Fj| = |Fi| and that |Fj ∩ Fk| = |Fj|; from the properties of multiset intersection this is possible if and only if Fi ⊆ Fj and Fj ⊆ Fk; but then Fi ⊆ Fk and Co(Fi, Fk) = |Fi ∩ Fk| = |Fi|.

If two clones are included one in the other, they are identical for our purposes and one of the two has to be removed: we introduce for this purpose the identity relation.

Definition 3.5 (identity) Let Fi and Fj be clones. We say that Fi and Fj are identical if they have the same fingerprint and we write:

Fi ≡Id Fj iff Fi ≤In Fj ∧ Fj ≤In Fi.

Note that the identity relation is an equivalence relation, as the following properties hold:

reflexivity : as for each clone Fi, Fi ≤In Fi (reflexivity of ≤In), it follows trivially that Fi ≡Id Fi;

symmetry : if Fi ≡Id Fj, then Fi ≤In Fj ∧ Fj ≤In Fi; but then, switching them, we obtain that Fj ≡Id Fi;

transitivity : if Fi ≡Id Fj and Fj ≡Id Fk, then we can say that Fi ≤In Fk, as Fi ≤In Fj and Fj ≤In Fk, and that Fk ≤In Fi, as Fk ≤In Fj and Fj ≤In Fi; then, Fi ≡Id Fk.

Now we can define the maximal clones.

Definition 3.6 (maximal clones) The set of maximal clones Fmax is a maximal set of clones such that none is included in the others, that is:

Fmax = {Fi | ∄Fj ≠ Fi . Fi ≤In Fj, with Fi, Fj ∈ F/≡Id}.

The span relation is finally defined in order to formalize a clone ordering.

Definition 3.7 (span) Let Fi, Fj and Fk be maximal clones. We say that Fk spans Fi and Fj if each fragment measurement of Fk is either in Fi or in Fj, and we write:

Sp(Fi, Fj, Fk) iff Co(Fk, Fi ∪ Fj) = |Fk| .

Figure 3.1.b shows an example of this relation: note that we do not know if Fi (or equivalently Fj) is on the left or on the right of Fk, but we know for sure that Fi and Fj are on different sides.

The expected output is an ordering of a set of maximal clones Fmax = {F1, . . . , Fm}, that is, a sequence of clones such that for every triple (Fi, Fj, Fk) of subsequent clones in the sequence, Sp(Fi, Fk, Fj) holds. Formally:

Definition 3.8 (clone ordering) A clone ordering of a set of maximal clones Fmax = {F1, . . . , Fm} is a sequence ⟨Fπ(1), . . . , Fπ(m)⟩ given by a permutation π ∈ Sm¹ of indexes such that Sp(Fπ(i−1), Fπ(i+1), Fπ(i)) holds for each i = 2, . . . , m − 1.

¹ Sm: the symmetric group of degree m, that is, the group of all permutations on m symbols. It is a permutation group of order m! that contains as subgroups (up to isomorphism) every group of order m.
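The definitions of this section translate almost literally into operations on multisets. The Python sketch below is an illustration under our own assumptions: fingerprints are stored as collections.Counter objects of (size, colour) pairs, ∪ is taken as the standard multiset union (maximum of the multiplicities), and the function names and the toy fragments are ours.

```python
from collections import Counter


def size(f):
    """Number of fragment measurements in a fingerprint (with repeats)."""
    return sum(f.values())


def coincidence(f_i, f_j):
    """Co(Fi, Fj): size of the multiset intersection."""
    return size(f_i & f_j)


def included(f_i, f_j):
    """Fi <=In Fj."""
    return coincidence(f_i, f_j) == size(f_i)


def spans(f_i, f_j, f_k):
    """Sp(Fi, Fj, Fk): Fk spans Fi and Fj."""
    return coincidence(f_k, f_i | f_j) == size(f_k)


def is_clone_ordering(sequence):
    """Check the span condition on every triple of subsequent clones."""
    return all(spans(sequence[i - 1], sequence[i + 1], sequence[i])
               for i in range(1, len(sequence) - 1))


# Toy target DNA made of six fragments, each a (size, colour) pair.
f1, f2, f3 = (1200, 1), (800, 2), (1500, 3)
f4, f5, f6 = (640, 1), (2100, 4), (930, 2)
a, b = Counter([f1, f2, f3]), Counter([f2, f3, f4])
c, d = Counter([f3, f4, f5]), Counter([f4, f5, f6])

print(is_clone_ordering([a, b, c, d]))  # True
print(is_clone_ordering([a, c, b, d]))  # False
```

In the second sequence the clone c would have to be spanned by a and b, but its fragment f5 appears in neither of them.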


3.2 The algorithm

The process of producing the clone ordering from the fingerprints is similar to the one presented in section 2.3: we first get the maximal clones, then compute the neighbours and, finally, from the nearest neighbours we directly build the final ordering (with exact data we do not need to first produce contigs, as the nearest neighbours information is strong enough).

The underlying data structure for representing fingerprints is a list in ascending lexicographic order.

The building blocks of the main algorithm are the functions for calculating the coincidence and for estimating the span and inclusion relations. The Coincidence function scans the two ordered input fragment lists Fi and Fj to look for compatible fragments: as they are ordered, we can just use two iterators x and y and move them forward, keeping the invariant that no fragment before x (or, respectively, y) could be compatible with y (or, respectively, x), and incrementing the match counter every time that the fragment x is compatible with the fragment y.

Coincidence(Fi, Fj)
    x ← head[Fi]
    y ← head[Fj]
    co ← 0
    while x ≠ Nil ∧ y ≠ Nil
        do if size[x] = size[y] ∧ colour[x] = colour[y]
              then co ← co + 1
                   x ← next[x]
                   y ← next[y]
           elseif x < y
              then x ← next[x]
           else y ← next[y]
    return co

The Identical and Inclusion algorithms for calculating the respective relations are very simple, as they are a direct implementation of their definitions.


Identical(Fi, Fj)
    return |Fi| = |Fj| ∧ Coincidence(Fi, Fj) = |Fi|

Inclusion(Fi, Fj)
    return Coincidence(Fi, Fj) = |Fi|

As computing the span relation through the multiset union of the fingerprints would be computationally expensive, we created an auxiliary function, FastSpanIndex(Fi, Fj, Fk), that computes Co(Fi ∪ Fj, Fk) faster (even if the asymptotic complexity is the same): in this way we obtain an efficient Span algorithm that computes the respective relation. Note that the input clones Fi, Fj, Fk are assumed to be maximal.

FastSpanIndex(Fi, Fj, Fk)

    x ← head[Fi]
    y ← head[Fj]
    z ← head[Fk]
    co ← 0
    while x ≠ Nil ∧ y ≠ Nil ∧ z ≠ Nil
        do if x > z ∧ y > z
              then z ← next[z]
           elseif x = z ∨ y = z
              then co ← co + 1
                   z ← next[z]
           elseif x < z
              then x ← next[x]
           elseif y < z
              then y ← next[y]
    if x ≠ Nil
        then return co + Coincidence(x, z)
    if y ≠ Nil
        then return co + Coincidence(y, z)
    return co


Span(Fi, Fj, Fk)

    return i ≠ j ∧ j ≠ k ∧ i ≠ k ∧ FastSpanIndex(Fi, Fj, Fk) = |Fk|

As most of the ordering time is spent computing relations (note that, on the whole, the span relation requires O(n³) time, where n is the number of maximal clones), we also created ad hoc, considerably faster procedures for the exact case. The idea behind these procedures is that they stop computing the coincidence as soon as it can no longer reach the required value. With this improvement, a full ordering of 460 clones of about 100 Kbp from a 4.6 Mbp genome takes less than one minute on a normal desktop machine.
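The early-stopping idea can be illustrated as follows. This is only a sketch of the principle, since the ad hoc procedures themselves are not listed here: the function name `coincidence_reaches` and the `(size, colour)` tuple representation are assumptions. The scan gives up as soon as the matches found so far plus the best case for the remaining fragments fall below the target.

```python
from typing import List, Tuple

Fragment = Tuple[int, str]  # (size, colour): a hypothetical representation


def coincidence_reaches(fi: List[Fragment], fj: List[Fragment], target: int) -> bool:
    """Tell whether the coincidence of two sorted fingerprints is >= target,
    stopping as soon as the target has become unreachable."""
    x, y, co = 0, 0, 0
    while x < len(fi) and y < len(fj):
        # Upper bound: every remaining fragment of the shorter tail matches.
        if co + min(len(fi) - x, len(fj) - y) < target:
            return False
        if fi[x] == fj[y]:
            co += 1
            x += 1
            y += 1
        elif fi[x] < fj[y]:
            x += 1
        else:
            y += 1
    return co >= target


fi = [(1, "A"), (2, "C"), (3, "G")]
fj = [(1, "A"), (3, "G"), (4, "T")]
print(coincidence_reaches(fi, fj, 3))  # False: (2, "C") has no partner
print(coincidence_reaches(fi, fj, 2))  # True
```

With `target = len(fi)` this behaves like an inclusion test that aborts at the first fragment of `fi` left unmatched.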

From the inclusion relation it is straightforward to compute the set of maximal clones in O(n²) time, first removing the identical clones and then selecting only those that are not included in any other clone.
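A compact sketch of this selection is given below. It assumes fingerprints are lists of `(size, colour)` tuples stored in a dictionary keyed by clone name; for brevity the coincidence is computed as the size of a multiset intersection with `collections.Counter`, which for exact data gives the same count as the two-iterator scan. All names are illustrative.

```python
from collections import Counter
from typing import Dict, List, Tuple

Fragment = Tuple[int, str]  # (size, colour): a hypothetical representation


def coincidence(fi: List[Fragment], fj: List[Fragment]) -> int:
    """Size of the multiset intersection of two exact fingerprints."""
    return sum((Counter(fi) & Counter(fj)).values())


def maximal_clones(fingerprints: Dict[str, List[Fragment]]) -> List[str]:
    """Drop identical clones (keeping the first of each class), then keep
    only the clones that are not included in any other clone."""
    representative: Dict[tuple, str] = {}
    for name, fp in fingerprints.items():
        representative.setdefault(tuple(sorted(fp)), name)
    unique = {name: list(key) for key, name in representative.items()}
    return [name for name, fp in unique.items()
            if not any(other != name and coincidence(fp, ofp) == len(fp)
                       for other, ofp in unique.items())]


library = {
    "a": [(1, "A"), (2, "C")],
    "b": [(1, "A"), (2, "C"), (3, "G")],
    "c": [(1, "A"), (2, "C"), (3, "G")],   # identical to b
    "d": [(5, "T")],
}
print(sorted(maximal_clones(library)))  # ['b', 'd']
```

The pairwise loop performs O(n²) inclusion tests, in line with the bound stated above.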

To compute an ordering of the maximal clones, we use the same reasoning as Fasulo [Fas00], but we derive a slightly different conclusion that, on simulated data, always produced perfect nearest neighbours.

For each clone Fk we want to find a neighbourhood that we will partition in two sets, such that they lie on different sides of the clone (that is, in an ordering, all clones in one set are before Fk and all the clones in the other set are after Fk). As the span relation gives us information about close clones that lie on opposite sides, we use it to find neighbours and, moreover, to partition the neighbourhood obtained.

For each clone Fk ∈ Fmax we build the undirected neighbour graph where the vertices are all the other clones and there is an edge between two clones if they span Fk, that is:

G = (Fmax \ {Fk}, {(Fi, Fj) | Sp(Fi, Fj, Fk)}).

The neighbour graph is computed by the algorithm Neighbours already presented in section 2.3.

As the span relation is an estimate, the graph can contain some additional edges (called false positives) that can prevent it from being bipartite. This is the reason why we use max cut to produce the partition: otherwise a few simple walks on the graph would be sufficient.


Figure 3.2: Example partitioning of a neighbourhood of Fk by a MaxCut on the neighbour graph: (a) the neighbour graph with the cut; vertices are the clones F1, ..., F9, edges are span relations with respect to the clone Fk; the edges not crossing the cut are false positives, that is, span relations wrongly estimated; (b) an example positioning of the clones F1, ..., F4 and F5, ..., F9 around Fk that can produce such a graph.


As MaxCut is NP-hard (no efficient exact algorithm is known; in practice it is solved exactly only by enumeration), we use a randomized procedure, similar to the one defined by Fasulo [Fas00], that works well with our graphs. The original procedure randomly assigns sides to vertices and then improves the cut by calculating, for each vertex, the number n of neighbours on the same side and the number n′ of neighbours on the opposite side: each vertex is then reassigned a side based on these two values, thus relying entirely on the previous partition. In algorithm MaxCut’ we have added the following two modifications, which greatly improve the quality of the cut with little added computational effort:

• awareness of the partition being created: we compute the numbers n, n′ of neighbours during the reassignment, thus being aware of the changes occurring in the partitioning process;

• multiple runs: we run the randomized algorithm ten times (or fewer, if a perfect bipartition is found), changing the initial configuration every time and choosing in the end the best cut obtained.

RandomMaxCut(G)

    ▷ Randomly assign sides to vertices:
    for each v ∈ V[G]
        do side[v] ← L or R randomly
    while the cut between L and R is growing
        do for each v ∈ V[G]
              do n ← neighboursSameSide(v)
                 n′ ← neighboursOppositeSide(v)
                 if n > n′
                    then side[v] ← Not side[v]
                 elseif n = n′
                    then side[v] ← L or R randomly


MaxCut’(G)

    fp ← ∞
    for i ← 1 to 10
        do RandomMaxCut(G)
           cfp ← CountFalsePositives(G)
           if cfp < fp
              then if cfp = 0
                      then return
                   else Save sides assignment
                        fp ← cfp
    Restore sides assignment
    return
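The two procedures can be sketched in executable form as follows. This is an illustration under stated assumptions rather than the original OCaml code: the graph is an adjacency dictionary of sets without self-loops, the names `cut_size`, `random_max_cut` and `max_cut_prime` are hypothetical, and a false positive is counted as an edge that does not cross the cut.

```python
import random
from typing import Dict, Hashable, Optional, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]  # undirected adjacency sets, no self-loops
Sides = Dict[Hashable, str]


def cut_size(graph: Graph, side: Sides) -> int:
    """Number of edges whose endpoints lie on opposite sides."""
    return sum(side[u] != side[v] for u in graph for v in graph[u]) // 2


def random_max_cut(graph: Graph, rng: random.Random) -> Sides:
    """Random start, then sweeps that move vertices while the cut keeps growing."""
    side = {v: rng.choice("LR") for v in graph}
    size = cut_size(graph, side)
    while True:
        for v in graph:
            # Counts use the sides already updated during this sweep.
            same = sum(side[u] == side[v] for u in graph[v])
            opposite = len(graph[v]) - same
            if same > opposite:
                side[v] = "R" if side[v] == "L" else "L"
            elif same == opposite:
                side[v] = rng.choice("LR")
        new_size = cut_size(graph, side)
        if new_size <= size:          # the cut stopped growing
            return side
        size = new_size


def max_cut_prime(graph: Graph, runs: int = 10,
                  seed: Optional[int] = None) -> Tuple[Sides, int]:
    """Best of several randomized runs; stops early on a perfect bipartition."""
    rng = random.Random(seed)
    n_edges = sum(len(ns) for ns in graph.values()) // 2
    best_side: Sides = {}
    best_fp = n_edges + 1
    for _ in range(runs):
        side = random_max_cut(graph, rng)
        fp = n_edges - cut_size(graph, side)   # edges not crossing the cut
        if fp < best_fp:
            best_side, best_fp = side, fp
            if fp == 0:
                break
    return best_side, best_fp


edge = {"a": {"b"}, "b": {"a"}}
sides, false_positives = max_cut_prime(edge, seed=0)
print(false_positives)  # 0
```

Because sides are updated in place during a sweep, a flip made when `same > opposite` strictly enlarges the cut and a tie-break leaves it unchanged, so the cut never shrinks within a run.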

We now want to sort the two opposite neighbourhoods by proximity to the clone Fk. Consider figure 3.2(b): while approaching clone Fk by moving from clone F1 to clone F4, the following quantities change for each Fi (i = 1, ..., 4):

• the number of clones Fj on the same side as Fi such that Sp(Fi, Fk, Fj) holds decreases, while the number of Fj on the same side such that Sp(Fj, Fk, Fi) holds increases;

• the number of neighbours of Fi on the opposite side increases;

• the coincidence between Fi and Fk increases.

The two neighbourhoods can then be ordered as requested by sorting in decreasing lexicographic order the triple of values assigned to each clone (for the first value, we use the number of Fj such that Sp(Fj, Fk, Fi) holds). Note that the first element of the triple used by Fasulo [Fas00, page 22], which we reported on page 33, is wrong and is actually the opposite of what we use (it should be Sp(Cj, Ck, Ci) to be correct).

The two top neighbours, one for each neighbourhood of Fk ordered this way, are now the nearest neighbours of Fk. We then consider a graph where the vertices are the maximal clones and there is an edge between two clones if they are mutual nearest neighbours:

G = (Fmax, {(Fi, Fj) | Fi and Fj are mutual nearest neighbours}).

It is easy to prove that the connected components of this graph are non-empty sets of clones such that one of these two conditions holds:

• the cardinality is one, or

• exactly two clones have only one nearest neighbour in the set, while all the other clones have both nearest neighbours in the set.

Additionally, note that every vertex has degree at most two, so the connected components are just chains: in fact they are contigs, an intermediate step of the ordering (in Fasulo [Fas00] these are called strongly ordered components). To find connected components, a simple double depth-first search visit can be used (as suggested by Deaver ² in Cormen et al. [CSRL01]; original algorithm due to Tarjan [Tar72]), but our algorithm works on the directed graph

G′ = (Fmax, {(Fi, Fj) | Fj is a nearest neighbour of Fi})

and exploits its structure (maximum out-degree two) to produce the same sets of clones.
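One possible way to exploit this structure is sketched below. It is not the algorithm of the OCaml implementation, whose details are not given here: the input format (a dictionary mapping each clone to the set of its at most two nearest neighbours) and the function name `contigs` are assumptions. Mutual edges are extracted first, then each chain is walked starting from its end points.

```python
from typing import Dict, Hashable, List, Set


def contigs(nearest: Dict[Hashable, Set[Hashable]]) -> List[List[Hashable]]:
    """Chains of mutual nearest neighbours, each listed from one end to the other."""
    # Keep only mutual relations: n is nearest to c and c is nearest to n.
    mutual = {c: {n for n in ns if c in nearest.get(n, ())}
              for c, ns in nearest.items()}
    seen: Set[Hashable] = set()
    chains: List[List[Hashable]] = []
    # Visit chain end points (degree 0 or 1) before inner vertices (degree 2).
    for start in sorted(mutual, key=lambda c: len(mutual[c])):
        if start in seen:
            continue
        chain, prev, cur = [], None, start
        while cur is not None and cur not in seen:
            seen.add(cur)
            chain.append(cur)
            nxt = [n for n in mutual[cur] if n != prev and n not in seen]
            prev, cur = cur, (nxt[0] if nxt else None)
        chains.append(chain)
    return chains


nearest = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": {"A"},        # not mutual: A does not list E
}
print(contigs(nearest))  # [['E'], ['A', 'B', 'C', 'D']]
```

The clone `E` ends up in a contig of cardinality one because its relation with `A` is not mutual.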

Actually, in all the simulated tests the procedure produced just one contig, and thus the complete ordering; hence it was not necessary to devise an algorithm for ordering and orienting contigs. It is trivial to prove that the chain of the only contig produced must satisfy definition 3.8 and is indeed a clone ordering.

3.3 Implementation and testing

The implementation has been written in OCaml, and the source code has been compiled with the MinGW port of OCaml 3.07 (September 2003), with a slightly modified version of the native-code compiler to fix an incompatibility with the MinGW linker.

The implementation consists of the following program:

² This is a joke in Cormen et al. [CSRL01]: the exercise is on strongly connected components, and Michael Deaver used his connections to excess in the Reagan administration.


camlass: OCaml clone assembler

Input (command line parameters): the name of the file (in FASTA-like format) containing the fingerprints of the clones.

Output: on standard output, the number of assembled contigs; in a separate file, the nearest neighbours graph with clustered contigs in dotty format. From this file it is straightforward to output the ordering of the clones and to produce a graphical representation of the ordering.

To test the implementation, the following program has been used to produce in silico digestion data (sizes and colours) of clones:

camlstomach: OCaml simulated digestion

Input (command line parameters): the name of the file containing the clones in FASTA format and a set of restriction enzymes. These can be specified by name (a set of some common enzymes is recognized), for example EcoRI, or by giving the recognition sequence with the cleavage site, for example G|AATTC.

Output (standard output): the fingerprints of the clones, that is, the measurements (sizes and colours) of the fragments obtained by simulating a digestion of the clones, in a FASTA-like format.
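To make the `G|AATTC` notation concrete, the sketch below simulates a complete digestion of a linear sequence by a single enzyme and returns the fragment sizes. It is a simplified illustration, not camlstomach: the function name `digest` is hypothetical, only exact matches on the given strand are searched, and the colour labelling of the fragment ends is left out.

```python
from typing import List


def digest(sequence: str, site: str) -> List[int]:
    """Fragment sizes from a complete digestion of a linear sequence.

    `site` is a recognition sequence with the cleavage point marked by '|',
    for example 'G|AATTC' (assumed input format)."""
    offset = site.index("|")            # cut position inside the site
    pattern = site.replace("|", "")
    cuts = []
    pos = sequence.find(pattern)
    while pos != -1:
        cuts.append(pos + offset)
        pos = sequence.find(pattern, pos + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


print(digest("AAGAATTCCCCGAATTCTT", "G|AATTC"))  # [3, 9, 7]
```

The site occurs at positions 2 and 11, so the cuts fall at 3 and 12 and the three fragment sizes add up to the sequence length.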

Finally, the test clone libraries have been produced in silico with the following program, provided by Giorgio Valle (University of Padua), that uniformly distributes the starting positions of the clones on the DNA:

baclib: in silico BAC library creation

Input (command line parameters): the name of the file with the target DNA sequence in FASTA format, the number of clones to be created, the average length of each clone, the maximum allowed deviation from the average. FASTA is a standard format for representing lists of sequences with names.


Output (standard output): as many clones as requested in FASTA format, with starting positions uniformly distributed on the target DNA and lengths in the allowed range.
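The behaviour described for baclib can be approximated in a few lines. This is a hedged sketch and not the program by Valle: the function name `make_library`, the uniform choice of the length within the allowed range, and the clone naming scheme `insert<k>@<start>` (which records the starting position for later verification) are all assumptions.

```python
import random
from typing import List, Optional, Tuple


def make_library(target: str, n_clones: int, avg_len: int, max_dev: int,
                 seed: Optional[int] = None) -> List[Tuple[str, str]]:
    """Return (name, sequence) pairs with uniformly distributed start positions."""
    rng = random.Random(seed)
    clones = []
    for k in range(1, n_clones + 1):
        length = rng.randint(avg_len - max_dev, avg_len + max_dev)
        length = min(length, len(target))          # a clone cannot exceed the target
        start = rng.randint(0, len(target) - length)
        clones.append((f"insert{k}@{start}", target[start:start + length]))
    return clones


target = "ACGT" * 250                               # toy 1000 bp target
library = make_library(target, n_clones=20, avg_len=100, max_dev=20, seed=1)
print(len(library), min(len(s) for _, s in library) >= 80)
```

With the thesis parameters the call would correspond to 460 clones of 100,000 ± 20,000 bases on the full target sequence.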

Several tests were conducted on a full sequence, creating a clone library with coverage 10× (460 clones with an average size of 100 Kbp from the full 4.6 Mbp E. coli sequence; this coverage has been suggested by Valle) and digesting with three different sets of restriction enzymes (one taken from Luo et al. [LTY+03], the other two proposed by Valle): all tests required less than one minute (about 30 seconds on an Athlon XP 2800 with 256 MB of RAM), discarded about half of the clones as non-maximal, and produced just one contig from nearest neighbours, which was the correct ordering.
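The coverage figure follows directly from the library parameters. As a worked check, writing N for the number of clones, \bar{\ell} for the average clone length and G for the genome length (symbols introduced here only for illustration):

```latex
\[
  c \;=\; \frac{N\,\bar{\ell}}{G}
    \;=\; \frac{460 \times 100\ \text{Kbp}}{4.6\ \text{Mbp}}
    \;=\; \frac{46\,000\ \text{Kbp}}{4\,600\ \text{Kbp}}
    \;=\; 10 .
\]
```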

These are the three sets of restriction enzymes used:

    overhang A   overhang C   overhang G   overhang T   source
    EcoRI        XbaI         BamHI        XhoI         Luo et al. [LTY+03]
    MfeI         MluI         BclI         Ppu0I        Valle (CRIBI)
    EcoRI        NcoI         BamHI        SalI         Valle (CRIBI)

Each single test has been conducted this way:

BAC library creation: the program baclib was used to create a BAC library of the 4.6 Mbp E. coli sequence, creating BACs of length 100,000 ± 20,000 (command line baclib.exe ecoli.fasta 460 100000 20000);

fingerprinting: the output BAC library was fingerprinted by digesting the clones with one of the sets of enzymes presented above through camlstomach; the original starting positions of the clones in the DNA were kept in the name of the clone for later verification (command line camlstomach.opt.exe bacs.fasta EcoRI BamHI XbaI XhoI);

ordering: the output fingerprints were used by camlass to order the clones (command line, for example, camlass.opt.exe digestedbacs.fasta);


verification: the ordering was automatically verified to be correct using the starting position information of the clones contained in their names; additionally, graphical output was produced to visualize the ordering.
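The verification step can be sketched as a monotonicity check on the starting positions. The naming convention `insert<k>@<start>` used below is an assumption made for this example (the text only says that the starting position is kept in the clone name), and the check accepts both directions since an ordering can be read from either end.

```python
from typing import List


def start_positions(names: List[str]) -> List[int]:
    """Extract the start position from names like 'insert12@34567' (assumed format)."""
    return [int(name.rsplit("@", 1)[1]) for name in names]


def ordering_is_correct(ordered_names: List[str]) -> bool:
    """An ordering of maximal clones is correct if the original start
    positions are monotone, in either direction."""
    starts = start_positions(ordered_names)
    return starts == sorted(starts) or starts == sorted(starts, reverse=True)


good = ["insert3@0", "insert1@500", "insert2@900"]
bad = ["insert1@500", "insert3@0", "insert2@900"]
print(ordering_is_correct(good), ordering_is_correct(bad))  # True False
```

For maximal clones no clone contains another, so sorting by start position and sorting by end position give the same order.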

The input data produced this way reflects what is produced in the laboratory, except for the absence of measurement errors: a 10× coverage ensures that most of the time all bases are covered by clones, while the enzyme sets are reasonable (moreover, the enzyme set from Luo et al. [LTY+03] is very cheap).

Some tests were also conducted with lower coverage, obtaining the expected result that only some contigs are ordered rather than all the clones. In general, with a 5× coverage, about ten contigs were produced. Figure 3.3 shows an example of such an ordering: the vertices in the graph are the clones, the arrows are the nearest neighbours, and the boxes are the contigs (note that we are only interested in contigs of a certain length).


Figure 3.3: Example of a failed ordering because of low coverage (5×): vertices are clones (labelled insert1, insert2, ...), arrows are nearest neighbours and dashed boxes are contigs.


Chapter 4

A proposal for noisy data

One idea that is stolen from human practice, an awfully good place to get ideas, is that maybe the reason we have the illusion that large numbers of things are computable is that we only notice the ones we have computed. So what we may have is the illusion that most problems may be solved. (Herbert A. Simon)

In the previous chapter we solved the problem with exact data: this is a simplification of the original problem arising in the laboratory, where data is subject to error (noisy data). In this chapter we therefore want to extend the results obtained for the exact data case to cope with noisy data. The presented extension is a proposal, as it has not yet been fully implemented and tested for adequacy.

It is reasonable to adapt the framework proposed for exact data to the noisy case. However, while with exact data the statistical evidence gives us enough confidence in the simple estimates of the relations described above, with noisy data there is a big shift towards uncertainty, as relations cannot be estimated with the same precision.

Moreover, the following steps (neighbours, nearest neighbours, strongly ordered components) should take into account the possibility that estimated relations are wrong (false positives) or that some real relations are missing (false negatives).


We present first the main algorithm and then our preliminary error model that is used to compute the coincidence index, on which the whole algorithm is based.

4.1 The proposed algorithm

We propose here, step by step, a process for noisy data, stressing the fact that it has not been fully implemented and thus could prove ineffective, although preliminary versions are promising.

Clustering. Our intuition from preliminary tests and from what we read in Fasulo [Fas00] is that the last step with exact data (the strongly ordered components) would produce several contigs with noisy data; we thus propose to cluster clones at the beginning to produce a contig for each cluster, and then to order and direct the contigs in a last step. For clustering, the following could be taken as the measure of closeness between two clones:

\[ d(F_i, F_j) = \frac{2\,\mathrm{Co}(F_i, F_j)}{|F_i| + |F_j|}, \]

that is, the coincidence index normalized to the average of the lengths of the two clones. Preliminary tests have been conducted using an algorithm recently proposed by Pavan and Pelillo [PP03].

Coincidence. If the error is not proportional to the fragment size, as in Fasulo [Fas00], we propose to use a maximum bipartite matching on a graph where the two partitions are the fragments of the two clones and there is an edge between two fragments only if they are compatible. The size of the matching is the coincidence index. To decide whether two fragments are compatible, given an error model, we can use a threshold on the probability, derived from the model, that they are measurements of the same fragment. Another possibility is to compute a maximum weighted bipartite matching, putting the probability of compatibility as a weight on the edge: the matching can then be thresholded to obtain the coincidence index. In both cases, we can find the matching using the Ford-Fulkerson algorithm [CSRL01], or an optimization approach that finds a maximum independent set on the respective line graph with replicator dynamics, as suggested by Sperotto [Spe04]. We show in the next section how to compute, given two measurements, the probability that they are from the same fragment (actually, from structurally equal fragments).

Relations. Preliminary experimentation showed that the requirements of the relations should be relaxed: instead of checking for equality, we should test whether the ratio is above a given threshold, derived from the error model. So, the relations should be:

\[ F_i \le_{In} F_j \iff \frac{\mathrm{Co}(F_i, F_j)}{|F_i|} > \Theta_{In} \]

and

\[ Sp(F_i, F_j, F_k) \iff \frac{\mathrm{Co}(F_k, F_i \cup F_j)}{|F_k|} > \Theta_{Sp}. \]

We still do not have any proposal for computing the appropriate Θ_In and Θ_Sp.

Maximal clones. Both when selecting classes of clones that seem identical and when selecting maximal clones, the measurements of the matched fragments can be averaged to strengthen the reliability of the data.

Nearest neighbours. Computing neighbours and ordering them to obtain nearest neighbours carries over directly from the exact case, as the combination of max cut and ordering is resilient enough. For max cut, instead of using the randomized algorithm previously described, an optimization approach can be used, as suggested by Hager and Kryluk [HK99]. Moreover, with this approach, it would take little effort to use the probability of the span relation as edge weights on the complete graph, instead of just having edges where the span relation holds. The probability can be obtained using the ratio:

\[ P(Sp(F_i, F_j, F_k)) = \frac{\mathrm{Co}(F_k, F_i \cup F_j)}{|F_k|}. \]

We speculate that this technique partitions the two neighbourhoods better.

Strongly ordered components. At this point we should obtain just one contig, with few discarded clones, from the mutual nearest neighbours information.

Ordering and directing clones. Contigs can be ordered and directed by simply calculating relations between their near-end clones and joining contigs that have strong connections between their near-end clones. Fasulo’s approach [Fas00] of Held-Karp iteration can be used, but the weighting scheme should be justified by the error model.
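The coincidence step proposed above, a maximum bipartite matching over compatible fragments, can be sketched with the classical augmenting-path method, which is the idea underlying Ford-Fulkerson on a unit-capacity bipartite network. Everything in this example is an assumption made for illustration: the function names, the `(size, colour)` representation, and in particular the compatibility test, which uses a proportional tolerance of 5% as a placeholder for a real error model.

```python
from typing import Callable, List, Set, Tuple

Fragment = Tuple[float, str]  # (measured size, measured colour): hypothetical


def noisy_coincidence(fi: List[Fragment], fj: List[Fragment],
                      compatible: Callable[[Fragment, Fragment], bool]) -> int:
    """Size of a maximum matching between compatible fragments of two clones."""
    match_of_j = [-1] * len(fj)          # index in fi matched to each fragment of fj

    def try_augment(i: int, visited: Set[int]) -> bool:
        for j in range(len(fj)):
            if j not in visited and compatible(fi[i], fj[j]):
                visited.add(j)
                # Use j if it is free, or if its current partner can move elsewhere.
                if match_of_j[j] == -1 or try_augment(match_of_j[j], visited):
                    match_of_j[j] = i
                    return True
        return False

    return sum(try_augment(i, set()) for i in range(len(fi)))


def compatible_5_percent(a: Fragment, b: Fragment) -> bool:
    """Placeholder test: same colour and sizes within a 5% proportional error."""
    return a[1] == b[1] and abs(a[0] - b[0]) < 0.05 * (a[0] + b[0])


fi = [(100, "A"), (110, "A")]
fj = [(105, "A"), (95, "A")]
print(noisy_coincidence(fi, fj, compatible_5_percent))  # 2
```

In this example a greedy assignment would pair 100 with 105 and leave 110 unmatched; the augmenting step re-pairs 100 with 95 so that both fragments are matched. The recursion depth grows with the fingerprint length, so an iterative version may be preferable for large fingerprints.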

4.2 Fragments compatibility

To compute the coincidence index as we proposed, it is necessary to know the probability that two measurements correspond to measurements of the same fragment. Actually, as we cannot distinguish between structurally equal fragments (that is, fragments with the same colour and size), we can compute only the probability that two measurements correspond to measurements of structurally equal fragments.

In order to do this, we need an underlying error model but, as real fingerprinting data from the laboratory were not yet available at the time of writing, we could not infer it ourselves: so we considered error models for size measurement proposed by other authors and a simple intuitive error model for colour measurement based on comments by Valle [Val04, personal communication]. From such error models we derived formulas that can be used for calculating the required probability.

We then define here the probability of compatibility between two fragments given their measurements with error.

A fragment is a pair of a size and a colour

\[ f_i = (s_i, c_i), \]

but we suppose to observe an erroneous read of the same fragment

\[ \hat{f}_i = (\hat{s}_i, \hat{c}_i), \]

where the hat marks a measured value. We say that two measured fragments $\hat{f}_i$ and $\hat{f}_j$ are compatible, and we write

\[ \hat{f}_i \approx \hat{f}_j, \]

if they can be measurements with error of structurally equal fragments $f_i = f_j$ (this is because we cannot distinguish between structurally equal fragments).

Definition 4.1 (probability of compatibility) The probability of compatibility of two fragments is the product of the probabilities of compatibility of the two elements of the pair, as we assume that their errors are distributed independently:

\[ P(\hat{f}_i \approx \hat{f}_j) = P((\hat{s}_i, \hat{c}_i) \approx (\hat{s}_j, \hat{c}_j)) = P(\hat{s}_i \approx \hat{s}_j)\, P(\hat{c}_i \approx \hat{c}_j). \]

Thus, we first define the error models for size and colour, then obtain their respective probabilities of compatibility and finally the probability of compatibility of two fragments.

4.2.1 Sizes compatibility

There are two kinds of errors in reading the size: if the fragment is too small or too big, we cannot even detect it; this kind of error does not concern us because it is systematic, and thus the fragment will not be detected in any clone. All the other fragments are subject to a size measurement error that is usually derived from empirical data.

We consider and present here an error model inspired by Fasulo [Fas00], where sizes are subject to a proportional error (this is inferred from data of the Human Genome Project at Washington Genome Center).

We first give the definitions needed to present the error model.


Definition 4.2 (originator interval) Given a measurement $\hat{s}_i$, we say that the size of the fragment that produced the measurement ranges in the interval $\varepsilon(\hat{s}_i) = [L(\hat{s}_i), R(\hat{s}_i)]$, where $L(\hat{s}_i)$ is the smallest possible fragment that can have produced such a measurement and, similarly, $R(\hat{s}_i)$ is the biggest possible fragment that can have produced the measurement.

Definition 4.3 (sizes compatibility) Given two measurements $\hat{s}_i$, $\hat{s}_j$, we say that they are compatible if $\varepsilon(\hat{s}_i) \cap \varepsilon(\hat{s}_j) \neq \emptyset$, and we write $\hat{s}_i \approx \hat{s}_j$.

In this model the error is proportional to the size of the fragment itself, and thus the measurement $\hat{s}_i$ of a fragment of real size $s_i$ satisfies

\[ s_i - e(s_i) \le \hat{s}_i \le s_i + e(s_i), \]

where $e(\cdot)$ is the function that models the proportional error, which we assume differentiable and with positive derivative always less than one ($0 < e'(s) < 1$; this is a reasonable assumption as in most cases $e(s) = \alpha s$ with $0 < \alpha < 1$). The same inequality tells us, given a measurement $\hat{s}_i$, in which range the real size $s_i$ of the fragment lies: it is the set of sizes $s_i$ such that

\[ s_i - e(s_i) \le \hat{s}_i \le s_i + e(s_i). \]

Consider then the two sides of the inequality as functions of the real size,

\[ e_-(s_i) = s_i - e(s_i), \qquad e_+(s_i) = s_i + e(s_i), \]

shown in figure 4.1: for a given measurement $\hat{s}_i$, the minimum and the maximum fragment sizes that could have generated the measurement can be obtained through the inverses of these functions. By the differentiability constraints on $e(\cdot)$, the two functions are monotonically increasing, and so their inverses are well defined; we can then define the error function in terms of $\hat{s}_i$ as half of the difference between the maximum and the minimum sizes that could have generated the reading, as this is the maximum error allowed:

\[ \hat{e}(\hat{s}_i) = \frac{e_-^{-1}(\hat{s}_i) - e_+^{-1}(\hat{s}_i)}{2}. \]
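As a worked instance of this construction, consider the linear case $e(s) = \alpha s$ mentioned above, writing $\hat{s}$ for the measured size and $\hat{e}$ for the error function expressed in terms of the measurement (this notation, and the numeric values $\alpha = 0.05$ and $\hat{s} = 1000$, are chosen here only for illustration):

```latex
\begin{align*}
  e_-(s) &= (1-\alpha)\,s, & e_+(s) &= (1+\alpha)\,s, \\
  e_-^{-1}(\hat{s}) &= \frac{\hat{s}}{1-\alpha}, & e_+^{-1}(\hat{s}) &= \frac{\hat{s}}{1+\alpha}, \\
  \hat{e}(\hat{s}) &= \frac{1}{2}\left(\frac{\hat{s}}{1-\alpha} - \frac{\hat{s}}{1+\alpha}\right)
                    = \frac{\alpha\,\hat{s}}{1-\alpha^{2}}. &&
\end{align*}
% Numeric check with \alpha = 0.05 and \hat{s} = 1000:
%   smallest originator  e_+^{-1}(1000) = 1000 / 1.05 \approx 952.4
%   biggest originator   e_-^{-1}(1000) = 1000 / 0.95 \approx 1052.6
%   \hat{e}(1000) = 50 / 0.9975 \approx 50.1
```

For small $\alpha$ the result is close to $\alpha \hat{s}$, so the error expressed in terms of the measurement is almost the same proportional error.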


Figure 4.1: Plot of the typical $e_-(s) = s - e(s)$ and $e_+(s) = s + e(s)$ functions: given a measurement $\hat{s}_i$, taking the preimage under the given functions, we can find the smallest $L(\hat{s}_i)$ and the biggest $R(\hat{s}_i)$ fragment size that could have produced the measurement.

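Finally, we have the range defined as a function of the measurement $\hat{s}_i$, where $\hat{e}$ is the error function expressed in terms of the measurement:

\[ \hat{s}_i - \hat{e}(\hat{s}_i) \le s_i \le \hat{s}_i + \hat{e}(\hat{s}_i). \]

From the definition of size compatibility and from the fact that in our model the originator interval is

\[ \varepsilon(\hat{s}_i) = [\hat{s}_i - \hat{e}(\hat{s}_i),\; \hat{s}_i + \hat{e}(\hat{s}_i)], \]

we can say that two measurements are compatible if

\[ \hat{s}_i \approx \hat{s}_j \iff
   \begin{cases}
     \hat{s}_i + \hat{e}(\hat{s}_i) > \hat{s}_j - \hat{e}(\hat{s}_j), & \text{if } \hat{s}_i < \hat{s}_j \\
     \hat{s}_j + \hat{e}(\hat{s}_j) > \hat{s}_i - \hat{e}(\hat{s}_i), & \text{if } \hat{s}_j < \hat{s}_i.
   \end{cases} \]

This can be easily summed up as

\[ \hat{s}_i \approx \hat{s}_j \iff |\hat{s}_i - \hat{s}_j| < \hat{e}(\hat{s}_i) + \hat{e}(\hat{s}_j); \]

this last formula leads to the intuitive definition that we were looking for.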


Definition 4.4 (sizes probability of compatibility, with proportional error) Given two measurements $\hat{s}_i$, $\hat{s}_j$, we define the probability of incompatibility between them as the ratio between their difference and the maximum error allowed in both reads

\[ P(\hat{s}_i \not\approx \hat{s}_j) = \frac{|\hat{s}_i - \hat{s}_j|}{\hat{e}(\hat{s}_i) + \hat{e}(\hat{s}_j)}, \]

and then the probability of compatibility is

\[ P(\hat{s}_i \approx \hat{s}_j) = 1 - \frac{|\hat{s}_i - \hat{s}_j|}{\hat{e}(\hat{s}_i) + \hat{e}(\hat{s}_j)}. \]
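Definition 4.4 is easy to evaluate numerically. The sketch below is illustrative only: the function name, the default error function (5% of the measured size) and the clamping at zero for incompatible measurements, where the formula would become negative, are assumptions that are not part of the definition.

```python
from typing import Callable


def size_compatibility_probability(
        si: float, sj: float,
        error: Callable[[float], float] = lambda s: 0.05 * s) -> float:
    """Probability that two measured sizes are compatible (Definition 4.4).

    `error` maps a measured size to its maximum error; the 5% default is a
    placeholder. Values below zero (incompatible sizes) are clamped to 0."""
    max_error = error(si) + error(sj)
    return max(0.0, 1.0 - abs(si - sj) / max_error)


print(size_compatibility_probability(1000, 1000))  # 1.0
print(size_compatibility_probability(1000, 2000))  # 0.0
```

For example, measurements of 1000 and 1100 give 1 − 100/105, about 0.048: the two readings are barely compatible under a 5% error.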

4.2.2 Colours compatibility

We assume first to have D colours, namely $\{d_1, \ldots, d_D\}$, with $D > 1$. To model the error on the reading of the colour, we assume that an error occurs once every p reads, and that an error always produces a bad reading (that is, it never yields the correct colour again, but a different one). Then we define the reading $\hat{c}_i$ of a colour $c_i$ as a random variable:

\[ \hat{c}_i =
   \begin{cases}
     c_i & \text{with probability } \dfrac{p-1}{p} \\[1ex]
     d_j, \text{ where } d_j \neq c_i, & \text{with probability } \dfrac{1}{p}
   \end{cases} \]

where $\hat{c}_i$ is the colour read for the i-th fragment (and thus i varies on the set of fragments) while $d_j$ is the j-th colour (and thus j varies on the set of different colours).

We are interested in determining the following two probabilities:

\[ P(c_i = c_j \mid \hat{c}_i = \hat{c}_j) \qquad P(c_i = c_j \mid \hat{c}_i \neq \hat{c}_j) \]

To do this, we first see how the probability is distributed over all the different cases. Given two fragments with colours $c_i$ and $c_j$, we have the following error distribution, assuming that subsequent colours and subsequent measurements are independent:


• zero errors with probability $P(\hat{c}_i = c_i \wedge \hat{c}_j = c_j) = P(\hat{c}_i = c_i)\,P(\hat{c}_j = c_j)$:

\[ P_0 = \left(\frac{p-1}{p}\right)^2; \]

note that this is equal to $P(\hat{c}_i = c_i \wedge \hat{c}_j = c_j \mid c_i = c_j)$ and to $P(\hat{c}_i = c_i \wedge \hat{c}_j = c_j \mid c_i \neq c_j)$; to understand why, think of it as the probability that, when tossing coins, they show the same face they had before the tossing if they change face with probability $\frac{1}{p}$; similar reasoning applies to the following two probabilities;

• one error with probability $P((\hat{c}_i \neq c_i \wedge \hat{c}_j = c_j) \vee (\hat{c}_i = c_i \wedge \hat{c}_j \neq c_j)) = P(\hat{c}_i \neq c_i)\,P(\hat{c}_j = c_j) + P(\hat{c}_i = c_i)\,P(\hat{c}_j \neq c_j)$:

\[ P_1 = 2\,\frac{p-1}{p^2}; \]

• two errors with probability $P(\hat{c}_i \neq c_i \wedge \hat{c}_j \neq c_j) = P(\hat{c}_i \neq c_i)\,P(\hat{c}_j \neq c_j)$:

\[ P_2 = \frac{1}{p^2}. \]

In the following paragraphs we use diagrams as examples on the D = 4 colours case to help in understanding the combinatorial issues: the dots on the left are the possible colours for the first fragment, and similarly the dots on the right for the second one; a black dot is the actual colour of the fragment; dotted arrows are all the possible choices; arrows without crosses are the choices in which we are interested; a non-dotted arrow is an example choice (made in the two errors case). An example:

[Diagram: D = 4 colours, first fragment with actual colour $c_i = d_3$, second fragment with actual colour $c_j = d_2$, example choice $\hat{c}_i = d_1$.]

In this example the first fragment has colour $d_3$ and its reading can make 2 choices out of the 3 possible ($\frac{D-2}{D-1}$), while the second fragment has colour $d_2$ and its reading can make 1 choice out of the 3 possible ($\frac{1}{D-1}$), supposing that the choice made on the first fragment produced $\hat{c}_i = d_1$.

We can now define the following probabilities:

• $P(\hat{c}_i = \hat{c}_j \mid c_i = c_j)$: we must have no errors, or two errors both producing the same colour (probability that two random tosses on D − 1 colours coincide)

\[ P(\hat{c}_i = \hat{c}_j \mid c_i = c_j) = P_0 + P_2\,\frac{1}{D-1}; \]

• $P(\hat{c}_i \neq \hat{c}_j \mid c_i = c_j)$: we must have one error, or two errors producing different results (probability that two random tosses on D − 1 colours are different)

\[ P(\hat{c}_i \neq \hat{c}_j \mid c_i = c_j) = P_1 + P_2\,\frac{D-2}{D-1}; \]

• $P(\hat{c}_i = \hat{c}_j \mid c_i \neq c_j)$: we must have one error producing the colour of the other fragment (one choice on D − 1 colours), or two errors with the same result (the first error has D − 2 choices on D − 1 colours, the second one choice on D − 1 colours)

\[ P(\hat{c}_i = \hat{c}_j \mid c_i \neq c_j) = P_1\,\frac{1}{D-1} + P_2\,\frac{D-2}{(D-1)^2}; \]

• $P(\hat{c}_i \neq \hat{c}_j \mid c_i \neq c_j)$: we must have no errors, or one error that produces different colours (D − 2 choices on D − 1 remaining colours), or two errors that still produce different colours (if the error on $c_i$ produces $c_j$, we can choose any of the remaining colours, otherwise we can choose only D − 2 colours out of D − 1)

\[ P(\hat{c}_i \neq \hat{c}_j \mid c_i \neq c_j) = P_0 + P_1\,\frac{D-2}{D-1} + P_2\left(\frac{(D-2)^2}{(D-1)^2} + \frac{1}{D-1}\right). \]

[Diagrams: for each case, the admissible choices in the D = 4 colours example.]

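We can finally calculate (assuming that colours are equally distributed):

\begin{align*}
P(c_i = c_j \mid \hat{c}_i = \hat{c}_j)
  &= \frac{P(c_i = c_j)\,P(\hat{c}_i = \hat{c}_j \mid c_i = c_j)}{P(\hat{c}_i = \hat{c}_j)} \\
  &= \frac{P(c_i = c_j)\,P(\hat{c}_i = \hat{c}_j \mid c_i = c_j)}
          {P(c_i = c_j)\,P(\hat{c}_i = \hat{c}_j \mid c_i = c_j) + P(c_i \neq c_j)\,P(\hat{c}_i = \hat{c}_j \mid c_i \neq c_j)} \\
  &= \frac{D\,(p-1)^2 - p\,(p-2)}{(D-1)\,p^2}
\end{align*}

and:

\begin{align*}
P(c_i = c_j \mid \hat{c}_i \neq \hat{c}_j)
  &= \frac{P(c_i = c_j)\,P(\hat{c}_i \neq \hat{c}_j \mid c_i = c_j)}{P(\hat{c}_i \neq \hat{c}_j)} \\
  &= \frac{P(c_i = c_j)\,P(\hat{c}_i \neq \hat{c}_j \mid c_i = c_j)}
          {P(c_i = c_j)\,P(\hat{c}_i \neq \hat{c}_j \mid c_i = c_j) + P(c_i \neq c_j)\,P(\hat{c}_i \neq \hat{c}_j \mid c_i \neq c_j)} \\
  &= \frac{D\,(2p-1) - 2p}{(D-1)^2\,p^2}
\end{align*}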
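The two closed forms can be cross-checked by evaluating the Bayes formulas with exact rational arithmetic. The sketch below is only a verification aid: the function name `colour_posteriors` is hypothetical, and the prior P(c_i = c_j) = 1/D encodes the assumption stated above that colours are equally distributed.

```python
from fractions import Fraction
from typing import Tuple


def colour_posteriors(D: int, p: int) -> Tuple[Fraction, Fraction]:
    """Return (P(c_i = c_j | equal reads), P(c_i = c_j | different reads))."""
    p0 = Fraction(p - 1, p) ** 2            # zero errors
    p1 = Fraction(2 * (p - 1), p * p)       # exactly one error
    p2 = Fraction(1, p * p)                 # two errors

    eq_given_eq = p0 + p2 * Fraction(1, D - 1)
    ne_given_eq = p1 + p2 * Fraction(D - 2, D - 1)
    eq_given_ne = p1 * Fraction(1, D - 1) + p2 * Fraction(D - 2, (D - 1) ** 2)
    ne_given_ne = (p0 + p1 * Fraction(D - 2, D - 1)
                   + p2 * (Fraction((D - 2) ** 2, (D - 1) ** 2) + Fraction(1, D - 1)))

    prior_eq = Fraction(1, D)               # colours equally distributed
    prior_ne = 1 - prior_eq
    post_equal_reads = (prior_eq * eq_given_eq
                        / (prior_eq * eq_given_eq + prior_ne * eq_given_ne))
    post_different_reads = (prior_eq * ne_given_eq
                            / (prior_eq * ne_given_eq + prior_ne * ne_given_ne))
    return post_equal_reads, post_different_reads


D, p = 4, 100
equal_reads, different_reads = colour_posteriors(D, p)
print(equal_reads == Fraction(D * (p - 1) ** 2 - p * (p - 2), (D - 1) * p * p))  # True
print(different_reads == Fraction(D * (2 * p - 1) - 2 * p, (D - 1) ** 2 * p * p))  # True
```

With D = 4 colours and one error every p = 100 reads, equal readings indicate equal colours with probability about 0.98, while different readings leave a probability of about 0.007.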


4.2.3 Probability of compatibility

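Finally, the probability of compatibility (definition 4.1) with proportional error on size is:

\[ P(\hat{f}_i \approx \hat{f}_j) = P((\hat{s}_i, \hat{c}_i) \approx (\hat{s}_j, \hat{c}_j)) = P(\hat{s}_i \approx \hat{s}_j)\,P(\hat{c}_i \approx \hat{c}_j) =
   \begin{cases}
     \left(1 - \dfrac{|\hat{s}_i - \hat{s}_j|}{\hat{e}(\hat{s}_i) + \hat{e}(\hat{s}_j)}\right)
     \left(\dfrac{D\,(p-1)^2 - p\,(p-2)}{(D-1)\,p^2}\right) & \text{if } \hat{c}_i = \hat{c}_j \\[2ex]
     \left(1 - \dfrac{|\hat{s}_i - \hat{s}_j|}{\hat{e}(\hat{s}_i) + \hat{e}(\hat{s}_j)}\right)
     \left(\dfrac{D\,(2p-1) - 2p}{(D-1)^2\,p^2}\right) & \text{if } \hat{c}_i \neq \hat{c}_j.
   \end{cases} \]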

4.3 How to verify the proposal

The proposal presented in this chapter should be completed and verified. In particular, we suggest the following further steps:

• implement the algorithm in a highly modular way, so that all the different proposals for each point can be tested;

• test the algorithm on in silico exact data and check if the ordering is still correct;

• test the algorithm on in silico noisy data and check if the ordering is still correct; the noisy data should be produced by simulating the error that the algorithm expects;

• adapt the parameters of the algorithm (like Θ_In and Θ_Sp) and derive a model to justify them;

• finally, when the algorithm is fine-tuned, test it on real data.


Chapter 5

Conclusions and future work

The problem dealt with in this thesis is the ordering of clones from fingerprinting data obtained by complete digestion with four restriction enzymes and fluorescent labelling. It is a problem arising from DNA sequencing techniques currently used in biological laboratories, for which there is still no satisfactory automatic solution.

The literature on the problem has been studied, ranging from publications in computer science journals to publications in biological journals, and we think that our approach takes into account both the best advances in similar algorithms and the real needs of a biological laboratory: evidence of this is the fact that for our clone ordering problem we adapted parts of algorithms originally devised for restriction mapping.

Although the actual results presented in this work are only for a simpler problem than the one arising from DNA sequencing, that is, with exact data instead of noisy data, we always succeeded in producing correct orderings, something that FPC, the most used software in biological laboratories, is not always able to do [FCF+04, p. 1265 and 1269]. Moreover, we are confident that our proposed algorithm for the problem with noisy data can work faster (both in computer time and in human time saved) and qualitatively better than any existing software.

Apart from implementing and verifying the correctness of what has been proposed in chapter 4 for the problem with noisy data, we think that further research should be conducted in order to thoroughly investigate the problem and to produce a fully functional program for automated ordering, and in particular:

FPC format support. FPC is in use in most laboratories, and biologists are familiar with its interface, which has been improved over the last ten years. For this reason, developing a new interface would be a waste of energy, except for simply re-engineering the code (a switch to wxWidgets would, for example, improve platform portability). Moreover, it would be useful to be able to switch easily to our program in a project where FPC is in use: for these reasons, adding FPC format support (loading and saving) would improve the efficacy.

Lander-Waterman extension. The Lander and Waterman model [LW88] is the reference model for fingerprinting projects: unfortunately, it is not applicable to projects that do not make use of the overlap relation. An extension of this model to the span relation would, for example, make it possible to obtain results on the number of expected contigs given the coverage.

Minimal tiling path. Once the clones have been ordered, it is cost-effective for the sequencing project to select a minimum set of clones that covers the target DNA: this is called the minimal tiling path (MTP). The current approach is to sequence the clone ends and use these data (BES data) to produce the MTP. We think that by adapting the work on fragment identification by the Washington school, simply using it to place the clones relative to one another, we can avoid the use of BES data and still produce an equally effective MTP.
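The last two items can be made more concrete with a small sketch. The first function evaluates the standard, overlap-based Lander-Waterman estimate of the expected number of contigs, N·exp(−cσ), with coverage c = LN/G and σ = 1 − T/L; it is not the span-based extension proposed above, which still has to be worked out. The second function is a plain greedy interval cover, assuming that the clones have already been placed (left end and length known); it does not implement the placement without BES data. All names (`expected_contigs`, `placed`, `minimal_tiling_path`) are hypothetical.

```ocaml
(* Overlap-based Lander-Waterman estimate of the number of contigs.
   genome = G, clone_len = L, n_clones = N, min_overlap = T. *)
let expected_contigs ~genome ~clone_len ~n_clones ~min_overlap =
  let c = clone_len *. float_of_int n_clones /. genome in
  let sigma = 1.0 -. min_overlap /. clone_len in
  float_of_int n_clones *. exp (-. c *. sigma)

(* A clone already placed on the target DNA (assumed input). *)
type placed = { name : string; left : int; length : int }

let right c = c.left + c.length

(* Greedy cover of [0, target): among the clones that start at or before
   the covered end, repeatedly take the one reaching furthest right.
   Returns None when a gap makes a complete cover impossible. *)
let minimal_tiling_path target clones =
  let rec extend covered acc =
    if covered >= target then Some (List.rev acc)
    else
      let best =
        List.fold_left
          (fun best c ->
            if c.left <= covered && right c > covered then
              match best with
              | Some b when right b >= right c -> best
              | _ -> Some c
            else best)
          None clones
      in
      match best with
      | None -> None
      | Some c -> extend (right c) (c :: acc)
  in
  extend 0 []
```

For interval covering with known positions this greedy choice yields a cover of minimum size; whether the same holds when only relative placements are available is part of the proposed future work.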


Greetings

This five-month work concludes my last two years of studies: in these two years of life I met so many people I'm grateful to that I think they could not fit in here. And, even if they could, it is difficult for me to sum up why I feel so grateful to them; for many of them it was simply because they were there at the right time; others just put me in the right mood to go on. Anyway, I want to give it a try.

Ah, another thing I learned is that greetings have to be written in a language that can be understood by the person you address, so this is in Italian: prima di tutto voglio ringraziare mamma e papà, senza il cui supporto tutto questo non sarebbe stato possibile: è per questo motivo che dedico loro la mia tesi (first of all I want to thank mum and dad, without whose support all this would not have been possible: this is why I dedicate my thesis to them).

In these two years I spent most of my working time in the computer science department in Mestre where, after all, I felt very comfortable; I want to thank most of the professors, all the mates, all the staff, all my students, as teaching was one of the most exciting things I ever did in my life, my parents in the true computer science (Alvise and Diego) and my children (Anna, Giorgia, Andrea).

Each member of my working team, i.SenSE, has evolved into a full professional, and our careers are following different routes, although we still keep in contact and we still are the best, each one in his own way. Thank you Mattia, Giop, Marty, Zeff and Pan for constantly giving challenges and support.

Most of my non-working life has been spent with friends, i fioi, and without them life would be terribly boring; in particular I want to thank the closest ones: DVD, Luca, Domenico, Marty, Pan, Checco, Erica, Irene, Laura.

In 2002-2003 I spent all Sunday mornings and many evenings with my band; YBCos will always be in my heart, with all past, current and future members; I'm sure we left something important to our children; thank you: Checco, Laura, Marty, Ciccio, Anna, Lau, Tony, Claudia, Monica, Silvia.

Most of the financial support has been provided by my parents; some from Altevie Technologies, where I spent most of the summer of 2002 and almost all of the summer of 2003 in front of a monitor in the heat (thank you Danilo, Cesar, Marco, Simone, Enrico for suffering with me); some other support came from my university, through Campus One and Erasmus grants; some from debtors from previous works (better late than never); and I think I should thank the region where I live for being full of working opportunities.

I also want to thank the really nice group of Erasmus students that helped me to have an enjoyable stay in that little piece of land named Mallorca, in particular: Petra, Antonello, Jolanda, Andrea, Alessandro, F&F. My Spanish mum, Merce, has been too kind to me and she was an incredible support and source of life.

The final rush to the Laurea, which started on Friday 11th of June 2003, was possible thanks to la mamma, Giop, Zeff and Marta Simeoni. Thank you also to all the people involved in the decision, it was worth it: Marta Simeoni, Merce, Petra, Marty, mamma e papà.

And, finally, a special thank you goes to Petra, who taught me that problems are just new challenges.

Bibliography

[AKNW93] Farid Alizadeh, Richard M. Karp, L. A. Newberg, and Deborah K. Weisser. Physical Mapping of Chromosomes: a Combinatorial Problem in Molecular Biology. In Symposium on Discrete Algorithms, pages 371–381, 1993.

[BL76] K.S. Booth and G.S. Lueker. Testing for the consecutive ones property, interval graphs and graph planarity testing using PQ-tree algorithms. Journal of Computer and System Sciences, 13:335–379, 1976.

[CEP02] M. Cieliebak, S. Eidenbenz, and P. Penna. Noisy Data Make the Partial Digest Problem NP-hard. Technical Report 381, ETH Zurich, Department of Computer Science, 2002.

[CSRL01] Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, and Charles E. Leiserson. Introduction to Algorithms. McGraw-Hill Higher Education, 2001.

[Dak00] Tamara Dakic. On the Turnpike problem. PhD thesis, Simon Fraser University, 2000.

[DJC+99] Y Ding, M D Johnson, R Colayco, Y J Chen, J Melnyk, H Schmitt, and H Shizuya. Contig assembly of bacterial artificial chromosome clones through multiplexed fluorescence-labeled fingerprinting. Genomics, 56(3):237–246, Mar 1999.

[DJC+01] Y. Ding, M. D. Johnson, W. Q. Chen, D. Wong, Y-J. Chen, S. C. Benson, J. Y. Lam, Y-M. Kim, and H. Shizuya. Genomics, 74(2):142–154, 2001.

[EHNS03] F. Engler, J. Hatfield, W. Nelson, and C. Soderlund. Locating sequence on FPC maps and selecting a minimal tiling path. Genome Research, 13:2152–2163, 2003.

[Fas00] Daniel Fasulo. Algorithms for DNA Restriction Mapping. PhD thesis, University of Washington, 2000.

[FBS+03] Christopher D. Fjell, Ian Bosdet, Jacqueline E. Schein, Steven J.M. Jones, and Marco A. Marra. Internet Contig Explorer (iCE) - A Tool for Visualizing Clone Fingerprint Maps. Genome Research, 13(6a):1244–1249, 2003.

[FCF+04] Stephane Flibotte, Readman Chiu, Chris Fjell, Martin Krzywinski, Jacqueline E. Schein, Heesun Shin, and Marco A. Marra. Automated ordering of fingerprinted clones. Bioinformatics, 20(8):1264–1271, 2004.

[FHDM02] Daniel Fasulo, Aaron Halpern, Ian Dew, and Clark Mobarry. Efficiently detecting polymorphisms during the fragment assembly process. Bioinformatics, 18 Suppl 1:S294–302, 2002.

[FJK+99] Daniel P. Fasulo, Tao Jiang, Richard M. Karp, Reuben Settergren, and Edward C. Thayer. An Algorithmic Approach to Multiple Complete Digest Mapping. Journal of Computational Biology, 6(2):187–208, 1999.

[FJKS98] Daniel P. Fasulo, Tao Jiang, Richard M. Karp, and Nitin Sharma. Constructing maps using the span and inclusion relations. In RECOMB, pages 64–73, 1998.

[FKC+03] Daniel R. Fuhrmann, Martin I. Krzywinski, Readman Chiu, Parvaneh Saeedi, Jacqueline E. Schein, Ian E. Bosdet, Asif Chinwalla, LaDeana W. Hillier, Robert H. Waterston, John D. McPherson, Steven J.M. Jones, and Marco A. Marra. Software for Automated Analysis of DNA Fingerprinting Gels. Genome Res., 13(5):940–953, 2003.

[GHB97] Simon G. Gregory, Gareth R. Howell, and David R. Bentley. Genome Mapping by Fluorescent Fingerprinting. Genome Research, 7(12):1162–1168, 1997.

[GHW+96] Will Gillett, Liz Hanks, Gane Ka-Shu Wong, Regina Lim, Jun Yu, and Maynard V. Olson. Assembly of High-Resolution Restriction Maps Based on Multiple Complete Digests of a Redundant Set of Overlapping Clones. Genomics, 33:389–408, May 1996.

[GW87] Larry Goldstein and Michael S Waterman. Mapping DNA by stochastic relaxation. Advances in Applied Mathematics, 8(2):194–207, 1987.

[HBG01] Eric Harley, Anthony Bonner, and Nathan Goodman. Uniform integration of genome mapping data using intersection graphs. Bioinformatics, 17(6):487–494, 2001.

[HDH+00] Michael Huerta, Gregory Downing, Florence Haseltine, Belinda Seto, and Yuan Liu. NIH working definition of bioinformatics and computational biology, July 2000.

[HK99] William W. Hager and Yaroslav Krylyuk. Graph Partitioning and Continuous Quadratic Programming. SIAM Journal on Discrete Mathematics, 12(4):500–523, 1999.

[ISF+04] Sorin Istrail, Granger G Sutton, Liliana Florea, Aaron L Halpern, Clark M Mobarry, Ross Lippert, Brian Walenz, Hagit Shatkay, Ian Dew, Jason R Miller, Michael J Flanigan, Nathan J Edwards, Randall Bolanos, Daniel Fasulo, Bjarni V Halldorsson, Sridhar Hannenhalli, Russell Turner, Shibu Yooseph, Fu Lu, Deborah R Nusskern, Bixiong Chris Shue, Xiangqun Holly Zheng, Fei Zhong, Arthur L Delcher, Daniel H Huson, Saul A Kravitz, Laurent Mouchard, Knut Reinert, Karin A Remington, Andrew G Clark, Michael S Waterman, Evan E Eichler, Mark D Adams, Michael W Hunkapiller, Eugene W Myers, and J Craig Venter. Whole-genome shotgun assembly and comparison of human genome assemblies. Proceedings of the National Academy of Sciences, 101(7):1916–1921, Feb 2004. Evaluation Studies.

[JK97] Tao Jiang and Richard M. Karp. Mapping clones with a given ordering or interleaving. In Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms, pages 400–409. Society for Industrial and Applied Mathematics, 1997.

[Kyr99] NC Kyrpides. Genomes OnLine Database (GOLD 1.0): a monitor of complete and ongoing genome projects world-wide. Bioinformatics, 15(9):773–774, 1999.

[Les03] Arthur M. Lesk. Introduction to bioinformatics. Oxford University Press, 2003.

[LLB+01] E S Lander, L M Linton, B Birren, C Nusbaum, M C Zody, J Baldwin, K Devon, K Dewar, M Doyle, W FitzHugh, R Funke, D Gage, K Harris, A Heaford, J Howland, L Kann, J Lehoczky, R LeVine, P McEwan, K McKernan, J Meldrim, J P Mesirov, C Miranda, W Morris, J Naylor, C Raymond, M Rosetti, R Santos, A Sheridan, C Sougnez, N Stange-Thomann, N Stojanovic, A Subramanian, D Wyman, J Rogers, J Sulston, R Ainscough, S Beck, D Bentley, J Burton, C Clee, N Carter, A Coulson, R Deadman, P Deloukas, A Dunham, I Dunham, R Durbin, L French, D Grafham, S Gregory, T Hubbard, S Humphray, A Hunt, M Jones, C Lloyd, A McMurray, L Matthews, S Mercer, S Milne, J C Mullikin, A Mungall, R Plumb, M Ross, R Shownkeen, S Sims, R H Waterston, R K Wilson, L W Hillier, J D McPherson, M A Marra, E R Mardis, L A Fulton, A T Chinwalla, K H Pepin, W R Gish, S L Chissoe, M C Wendl, K D Delehaunty, T L Miner, A Delehaunty, J B Kramer, L L Cook, R S Fulton, D L Johnson, P J Minx, S W Clifton, T Hawkins, E Branscomb, P Predki, P. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, Feb 2001.

[LTY+03] Ming-Cheng Luo, Carolyn Thomas, Frank M. You, Joseph Hsiao, Shu Ouyang, C. Robin Buell, Marc Malandro, Patrick E. McGuire, Olin D. Anderson, and Jan Dvorak. High-throughput fingerprinting of bacterial artificial chromosomes using the snapshot labeling kit and sizing of restriction fragments by capillary electrophoresis. Genomics, 82(3):378–389, 2003.

[LW88] E S Lander and M S Waterman. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2(3):231–239, Apr 1988.

[MAM+02] Richard J Mural, Mark D Adams, Eugene W Myers, Hamilton O Smith, George L Gabor Miklos, Ron Wides, Aaron Halpern, Peter W Li, Granger G Sutton, Joe Nadeau, Steven L Salzberg, Robert A Holt, Chinnappa D Kodira, Fu Lu, Lin Chen, Zuoming Deng, Carlos C Evangelista, Weiniu Gan, Thomas J Heiman, Jiayin Li, Zhenya Li, Gennady V Merkulov, Natalia V Milshina, Ashwinikumar K Naik, Rong Qi, Bixiong Chris Shue, Aihui Wang, Jian Wang, Xin Wang, Xianghe Yan, Jane Ye, Shibu Yooseph, Qi Zhao, Liansheng Zheng, Shiaoping C Zhu, Kendra Biddick, Randall Bolanos, Arthur L Delcher, Ian M Dew, Daniel Fasulo, Michael J Flanigan, Daniel H Huson, Saul A Kravitz, Jason R Miller, Clark M Mobarry, Knut Reinert, Karin A Remington, Qing Zhang, Xiangqun H Zheng, Deborah R Nusskern, Zhongwu Lai, Yiding Lei, Wenyan Zhong, Alison Yao, Ping Guan, Rui-Ru Ji, Zhiping Gu, Zhen-Yuan Wang, Fei Zhong, Chunlin Xiao, Chia-Chien Chiang, Mark Yandell, Jennifer R Wortman, Peter G Amanatides, Suzanne L Hladun, Eric C Pratts, Jeffer. A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science, 296(5573):1661–1671, May 2002.

[MKD+97] Marco A. Marra, Tamara A. Kucaba, Nicole L. Dietrich, Eric D. Green, Buddy Brownstein, Richard K. Wilson, Ken M. McDonald, LaDeana W. Hillier, John D. McPherson, and Robert H. Waterston. High Throughput Fingerprint Analysis of Large-Insert Clones. Genome Research, 7(11):1072–1084, 1997.

[MMCQ+04] You F. M., Luo M.-C, Gu Y. Q., Lazo G. R., Thomas C., Deal K, McGuire P. E., Dvorak J, and Anderson O. D. GenoProfiler: A software package for processing capillary fingerprinting data, 2004.

[MSD+00] E W Myers, G G Sutton, A L Delcher, I M Dew, D P Fasulo, M J Flanigan, S A Kravitz, C M Mobarry, K H Reinert, K A Remington, E L Anson, R A Bolanos, H H Chou, C M Jordan, A L Halpern, S Lonardi, E M Beasley, R C Brandon, L Chen, P J Dunn, Z Lai, Y Liang, D R Nusskern, M Zhan, Q Zhang, X Zheng, G M Rubin, M D Adams, and J C Venter. A whole-genome assembly of Drosophila. Science, 287(5461):2196–2204, Mar 2000.

[Mum97] Brendan Marshall Mumey. Some Computational Problems from Genomic Mapping. PhD thesis, University of Washington, 1997.

[NTK+02] S R Ness, W Terpstra, M Krzywinski, M A Marra, and S J M Jones. Assembly of fingerprint contigs: parallelized FPC. Bioinformatics, 18(3):484–485, Mar 2002.

[OMW+01] Kazutoyo Osoegawa, Aaron G. Mammoser, Chenyan Wu, Eirik Frengen, Changjiang Zeng, Joseph J. Catanese, and Pieter J. de Jong. A Bacterial Artificial Chromosome Library for Sequencing the Complete Human Genome. Genome Research, 11(3):483–496, 2001.

[Pev95] P. A. Pevzner. DNA physical mapping and alternating Eulerian cycles in colored graphs. Algorithmica, 13(1/2):77–105, 1995.

[Pev00] Pavel A. Pevzner. Computational Molecular Biology. Bradford Book, 2000.

[PP03] Massimiliano Pavan and Marcello Pelillo. Generalizing the Motzkin-Straus Theorem to Edge-Weighted Graphs, with Applications to Image Segmentation. In Energy Minimization Methods in Computer Vision and Pattern Recognition, 4th International Workshop, Lecture Notes in Computer Science, pages 485–500. Springer, 2003.

[PR02] G. Pandurangan and H. Ramesh. The Restriction Mapping Problem Revisited. Journal of Computer and System Sciences (special issue on Computational Biology), 65:526–544, 2002.

[RCH04] M. Robinson, W.-Q. Chen, and T. Hunkapiller. multiFPC: The use of automated sequencer multicolor data in large-scale clone mapping. Manuscript in preparation, 2004.

[Rob92] R.J. Robbins. Challenges in the human genome project. IEEE Engineering in Medicine and Biology Magazine, 11(1):25–34, Mar 1992.

[RS82] Joseph Rosenblatt and Paul D. Seymour. The structure of homometric sets. SIAM Journal on Algebraic and Discrete Methods, 3(3):343–350, 1982.

[RS98] Eric C. Rouchka and David J. States. Sequence Assembly Validation by Multiple Restriction Digest Fragment Coverage Analysis. In Proceedings of Intelligent Systems for Molecular Biology (ISMB), pages 140–147. American Association for Artificial Intelligence, 1998.

[Set98] Reuben Settergren. Theory And Algorithms For Physical Mapping Of DNA. PhD thesis, Rutgers, The State University of New Jersey, 1998.

[SHDF00] C. Soderlund, S. Humphray, A. Dunham, and L. French. Contigs built with fingerprints, markers, and FPC V4.7. Genome Research, 10(11):1772–1787, 2000.

[SLM97] C. Soderlund, I. Longden, and R. Mott. FPC: A system for building contigs from restriction fingerprinted clones. Computer Applications in the Biosciences, 13(5):523–535, 1997.

[SMDH89] J Sulston, F Mallett, R Durbin, and T Horsnell. Image analysis of restriction enzyme fingerprint autoradiograms. Comput. Appl. Biosci., 5(2):101–106, 1989.

[SMS+88] J Sulston, F Mallett, R Staden, R Durbin, T Horsnell, and A Coulson. Software for genome mapping by fingerprinting techniques. Computer Applications in the Biosciences, 4(1):125–132, Mar 1988.

[SNB01] David J. States, Volker Nowotny, and Thomas W. Blackwell. Probabilistic approaches to the use of higher order clone relationships in physical map assembly. Bioinformatics, 17(90001):S262–S269, 2001.

[Spe04] Anna Sperotto. Maximum Bipartite Matching con replicator dynamics (Personal communication), March 2004.

[SSL90] S. Skiena, W. Smith, and P. Lemke. Reconstructing sets from interpoint distances. In Sixth ACM Symposium on Computational Geometry, pages 332–339, 1990.

[SW91] William Schmitt and Michael S. Waterman. Multiple solutions of DNA restriction mapping problems. Advances in Applied Mathematics, 12(4):412–427, 1991.

[Tam03] Martti T. Tammi. The principles of shotgun sequencing and automated fragment assembly. (course notes), 2003.

[Tar72] R. E. Tarjan. Depth first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146–160, June 1972.

[Val04] Giorgio Valle. Colour measurement errors (Personal communication), May 2004.

[VAM+01] J C Venter, M D Adams, E W Myers, P W Li, R J Mural, G G Sutton, H O Smith, M Yandell, C A Evans, R A Holt, J D Gocayne, P Amanatides, R M Ballew, D H Huson, J R Wortman, Q Zhang, C D Kodira, X H Zheng, L Chen, M Skupski, G Subramanian, P D Thomas, J Zhang, G L Gabor Miklos, C Nelson, S Broder, A G Clark, J Nadeau, V A McKusick, N Zinder, A J Levine, R J Roberts, M Simon, C Slayman, M Hunkapiller, R Bolanos, A Delcher, I Dew, D Fasulo, M Flanigan, L Florea, A Halpern, S Hannenhalli, S Kravitz, S Levy, C Mobarry, K Reinert, K Remington, J Abu-Threideh, E Beasley, K Biddick, V Bonazzi, R Brandon, M Cargill, I Chandramouliswaran, R Charlab, K Chaturvedi, Z Deng, V Di Francesco, P Dunn, K Eilbeck, C Evangelista, A E Gabrielian, W Gan, W Ge, F Gong, Z Gu, P Guan, T J Heiman, M E Higgins, R R Ji, Z Ke, K A Ketchum, Z Lai, Y Lei, Z Li, J Li, Y Liang, X Lin, F Lu, G V Merkulov, N Milshina, H M Moore, A K Naik, V A Narayan, B Neelam, D Nusskern, D B Rusch, S Salzberg, W Shao, B Shue, J Sun, Z W. The sequence of the human genome. Science, 291(5507):1304–1351, Feb 2001.

[Var04] Dmitry Varabei. CORAL (Personal communication), May 2004.

[VHCAP03] Giorgio Valle, Manuela Helmer-Citterich, Marcella Attimonelli, and Graziano Pesole. Introduzione alla Bioinformatica. Zanichelli, 2003.

[WYTO97] Gane K.-S. Wong, Jun Yu, Edward C. Thayer, and Maynard V. Olson. Multiple-complete-digest restriction fragment mapping: Generating sequence-ready maps for large-scale DNA sequencing. Proceedings of the National Academy of Sciences, 94(10):5225–5230, 1997.

[Zha94] Z Zhang. An exponential example for a partial digest mapping algorithm. Journal of Computational Biology, 1(3):235–239, Fall 1994.

Index

artificial chromosomes, see vectors

BAC, see vectors
Bac End Sequences, 26, 66
BandLeader, 27
BES, see Bac End Sequences
BSS, 26

clone, see vectors
clone fingerprinting, 12
    by digestion, 12
    by Sequence-Tagged Sites, 12
Clone ORdering ALgorithm, 27
coincidence
    for exact data, 39
    for noisy data, 52
colouring, 7
compatibility
    of fragments, probability of, 55
    of sizes, 56
    of sizes, probability of
        proportional error, 57
contig, 12
Contig9, 25
ContigC, 25
CORAL, 27
coverage, 10, 12

DCD, see Double Complete Digest
deoxynucleotides, 7
deoxyribonucleic acid, 4
dideoxynucleotides, 7
DNA, see deoxyribonucleic acid
DNA map, see mapping
Double Complete Digest, 22

enzyme, see restriction enzyme

fingerprint, see clone fingerprinting
fluorescent dyes, 7
fluorescent labelling, 7
FPC, 25, 66
FSD, 26

Genomes OnLine Database, 15
GenoProfiler, 27
GOLD, see Genomes OnLine Database

HGP, see Human Genome Project
HICF, see High Information Content Fingerprinting
Hierarchical Sequencing, 11
High Information Content Fingerprinting, 28
homometric sets, 20
host cells, 8
HS, see Hierarchical Sequencing
Human Genome Project, 5, 15

identical, see relations
Image, 25
inclusion, see relations

Lander-Waterman model, 66

mapping, 8
MCD, see Multiple Complete Digest
Mott score, 25
multiFPC, 28
Multiple Complete Digest, 23

National Institute of Health, 3, 5, 15
NIH, see National Institute of Health

Partial Digest, 19
PD, see Partial Digest
PDP, see Partial Digest
phages, see vectors
physical map, see mapping
plasmids, see vectors
proteins, 5

REBASE, see restriction enzyme database
relations
    identical
        for exact data, 39
        for noisy data, 53
    inclusion
        for exact data, 39
        for noisy data, 53
    span
        for exact data, 40
        for noisy data, 53
restriction enzyme, 7
    blunt ends, 7
    database, 7
    n-cutter, 7
    sticky ends, 7
restriction map, see mapping
ribonucleic acid, 5
RMAP, 29
RNA, see ribonucleic acid

Sanger sequencing, 13
scaffold, 9
SCD, see Single Complete Digest
Sequence-Tagged Sites, see clone fingerprinting
shotgun sequencing, 10
Single Complete Digest, 19
span, see relations
STS, see clone fingerprinting
Sulston score, 25

target DNA, 12

vectors, 8

WGS, see Whole Genome Shotgun
Whole Genome Shotgun, 11

YAC, see vectors

Appendix A

Source codes

A.1 camlstomach

(*********************************************************************
 * args.ml: parameters handling
 * Giulio Marcon <[email protected]>
 *********************************************************************)

module Args = struct

  let sources = ref [ ]
  let help = ref false
  let infile = ref ""
  let enzymes = ref [ ]
  let error = ref ( fun x -> x )

  let specl = [
    ( "-error", Arg.Int ( fun p -> error := EnzymeDigest.simulate_error p ),
      " add error with perturbation level p (default = no)" ) ;
    ( "-help", Arg.Set help,
      " display this help message") ;
  ]

  let usage = "Camlstomach: multiple digestion of a sequence by restriction enzymes.\n"
    ^ "Usage: " ^ Sys.argv.(0) ^ " [options] sequencefile enzyme_1 .. enzyme_n \n"
    ^ "Options:\n"
    ^ " sequencefile file containing the sequence\n"
    ^ " enzyme_i sequence recognized by enzyme i (ex: G/AATTC)"

  let get_par () =
    let anon s = sources := !sources @ [s] in
    ( Arg.parse specl anon usage ;
      if ( ( List.length !sources ) > 0 )
      then ( infile := List.hd !sources ; enzymes := List.tl !sources )
      else ( Arg.usage specl usage ; exit 0 ) ;
      if ( !help )
      then ( Arg.usage specl usage; exit 0 )
      else ( !infile, !enzymes, !error ) )

end

let get_parameters () =
  Args.get_par ()

Figure A.1: camlstomach modules dependency graph (nodes: Main, Args, Enzymes, Io, Utils, EnzymeDigest, Math).

(*********************************************************************
 * enzymeDigest.ml: camlstomach digestion
 * Giulio Marcon <[email protected]>
 *********************************************************************)

module Xstr = struct

  (* index_of_substring_from and indexlist_of_substring are part of the
     "xstr" package - Copyright 1999 by Gerd Stolpmann

     The package "xstr" is copyright by Gerd Stolpmann.

     Permission is hereby granted, free of charge, to any person obtaining
     a copy of the "xstr" software (the "Software"), to deal in the
     Software without restriction, including without limitation the rights
     to use, copy, modify, merge, publish, distribute, sublicense, and/or
     sell copies of the Software, and to permit persons to whom the
     Software is furnished to do so, subject to the following conditions:

     The above copyright notice and this permission notice shall be included
     in all copies or substantial portions of the Software. *)

  let index_of_substring_from s k_left substr =
    let l = String.length s in
    let lsub = String.length substr in
    let k_right = l - lsub in
    let c = if substr <> "" then substr.[0] else ' ' in
    let rec search k =
      if k <= k_right then begin
        if String.sub s k lsub = substr then
          k
        else
          let k_next = String.index_from s (k+1) c in
          search k_next
      end
      else raise Not_found
    in
    if substr = "" then k_left else search k_left

  let indexlist_of_substring s substr =
    let rec enumerate k =
      try
        let pos = index_of_substring_from s k substr in
        pos :: enumerate (pos+1)
      with
        Not_found -> [ ]
    in
    enumerate 0

end

let multiple_digest enzymes sequence =
  let reverse_compare b a = compare a b in
  let finish_digestion restriction_sites =
    (* flatten lists of restriction sites adding colours *)
    let add_colours restriction_sites =
      let rec add_colours colour enzymes acc = function
        | [ ] ->
            acc
        | rsites :: tl ->
            let ( loff, roff ) =
              let ( eseq, eoff ) = List.hd enzymes in
              ( eoff, ( String.length eseq ) - eoff )
            in
            (* blunt cutters (equal offsets) get no colour *)
            let c = if loff = roff then -1 else colour in
            let rec add_colour acc = function
              | [ ] ->
                  acc
              | idx :: idxs ->
                  add_colour ( ( idx, c, loff, roff ) :: acc ) idxs
            in
            let colour = if c = -1 then colour else colour + 1 in
            add_colours colour ( List.tl enzymes ) ( add_colour acc rsites ) tl
      in
      add_colours 0 enzymes [ ] restriction_sites
    in
    (* calculates fragments sizes from restriction sites *)
    let calc_fragments coloured_restriction_sites =
      let f = fun ( ( left, lc, loff, loff' ), l ) ( right, rc, roff, roff' ) ->
        let size = right - left in
        ( ( right, rc, roff, roff' ),
          ( right + roff - left - loff, lc ) ::
          ( right + roff' - left - loff', rc ) :: l )
      in
      let coloured_restriction_sites =
        coloured_restriction_sites @ [ ( String.length sequence, -1, 0, 0 ) ]
      in
      snd ( List.fold_left f ( ( 0, -1, 0, 0 ), [ ] ) coloured_restriction_sites )
    in
    let fst_compare = ( fun ( a, _, _, _ ) ( b, _, _, _ ) -> compare a b ) in
    calc_fragments ( List.sort fst_compare ( add_colours restriction_sites ) )
  in
  let rec digest_r acc = function
    | [ ] ->
        (* acc is built by prepending, so it is reversed here to pair each
           list of sites with its own enzyme in add_colours *)
        finish_digestion ( List.rev acc )
    | ( enzyme_seq, enzyme_off ) :: enzymes ->
        let acc = ( Xstr.indexlist_of_substring sequence enzyme_seq ) :: acc in
        digest_r acc enzymes
  in
  digest_r [ ] enzymes

(* Flibotte04 error model *)
let simulate_error p fragments =
  let size_error s =
    int_of_float (
      Math.RandomDistributions.normal
        ( float_of_int p )
        ( float_of_int s )
    )
  in
  (* remove fragments *)
  let fragments =
    let f acc x = if Random.int 100 < p then acc else x :: acc in
    List.fold_left f [ ] fragments
  in
  (* duplicate fragments *)
  let fragments =
    let f acc x = if Random.int 100 < p then x :: x :: acc else x :: acc in
    List.fold_left f [ ] fragments
  in
  List.rev_map ( fun ( s, ( c : int ) ) -> ( size_error s, c ) ) fragments
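As a reading aid for the listing above, the core idea of a complete digest can be reduced to a single enzyme and a single strand. The following is only a simplified sketch, not the thesis code: it ignores colours and the two-strand offsets handled by `multiple_digest`, and the names `occurrences` and `single_digest` are hypothetical.

```ocaml
(* positions where [site] occurs in [seq], from left to right *)
let occurrences seq site =
  let n = String.length seq and m = String.length site in
  let rec go i acc =
    if i + m > n then List.rev acc
    else if String.sub seq i m = site then go (i + 1) (i :: acc)
    else go (i + 1) acc
  in
  go 0 []

(* fragment lengths obtained by cutting [seq] [off] bases inside
   every occurrence of [site] *)
let single_digest seq (site, off) =
  let cuts = List.map (fun i -> i + off) (occurrences seq site) in
  let rec sizes prev = function
    | [] -> [ String.length seq - prev ]
    | c :: tl -> (c - prev) :: sizes c tl
  in
  sizes 0 cuts

(* EcoRI (G/AATTC) cuts this 17-base sequence after positions 3 and 11 *)
let () =
  assert (single_digest "AAGAATTCCCGAATTCT" ("GAATTC", 1) = [ 3; 8; 6 ])
```

The fragment lengths always sum to the length of the input sequence, which is a convenient sanity check when comparing against the output of the full program.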

(*********************************************************************
 * enzymes.ml: some predefined enzymes and enzyme_of_string utility
 * Giulio Marcon <[email protected]>
 *********************************************************************)

exception Invalid of string

module Enzymes = struct

  (* enzyme name -> ( recognition sequence, cut offset ) *)
  let dictionary =
    List.map ( fun ( key, value ) -> ( String.lowercase_ascii key, value ) )
      (
        ( "EcoRI", ( "GAATTC", 1 ) ) ::
        ( "MfeI", ( "CAATTG", 1 ) ) ::
        ( "HindIII", ( "AAGCTT", 1 ) ) ::
        ( "BamHI", ( "GGATCC", 1 ) ) ::
        ( "BclI", ( "TGATCA", 1 ) ) ::
        ( "XbaI", ( "TCTAGA", 1 ) ) ::
        ( "NcoI", ( "CCATGG", 1 ) ) ::
        ( "MluI", ( "ACGCGT", 1 ) ) ::
        ( "XhoI", ( "CTCGAG", 1 ) ) ::
        ( "SalI", ( "GTCGAC", 1 ) ) ::
        ( "Ppu10I", ( "ATGCAT", 1 ) ) ::
        ( "HaeIII", ( "GGCC", 2 ) ) ::
        ( "AluI", ( "AGCT", 2 ) ) ::
        [ ]
      )

  let e str =
    List.assoc ( String.lowercase_ascii str ) dictionary

  (* accepts either an explicit site such as G/AATTC or an enzyme name *)
  let enzyme_of_string str =
    try
      let offset = String.index str '/' in
      let strlen = String.length str in
      let sequence =
        ( String.sub str 0 offset ) ^
        ( String.sub str ( offset + 1 ) ( strlen - offset - 1 ) )
      in
      ( String.uppercase_ascii sequence, offset )
    with
    | Not_found ->
        try
          e str
        with
        | Not_found ->
            raise ( Invalid ( "invalid enzyme sequence (example of a correct " ^
                              "sequence: G/AATTC ), or enzyme not found in database" ) )

  let set_luo03 =
    [ e "EcoRI" ;
      e "BamHI" ;
      e "XbaI" ;
      e "XhoI" ;
      e "HaeIII" ]

  let set_valle1 =
    [ e "MfeI" ;
      e "MluI" ;
      e "BclI" ;
      e "Ppu10I" ]

  let set_valle2 =
    [ e "EcoRI" ;
      e "NcoI" ;
      e "BamHI" ;
      e "SalI" ]

end

let ( enzyme_of_string,
      default_set ) =
  ( Enzymes.enzyme_of_string,
    Enzymes.set_luo03 )

(*********************************************************************
 * io.ml: camlstomach i/o operations
 * - Fasta input for sequences
 * - simil-Fasta output for digestion
 * Giulio Marcon <[email protected]>
 *********************************************************************)

module Fasta = struct

  exception Invalid_format of string

  let load_single_sequence ib =
    let check_acgt str =
      let is_acgt char =
        if char = 'a' || char = 'c' || char = 'g' || char = 't'
        then true
        else false
      in
      let strlen = String.length str in
      let i = ref 0 in
      while !i < strlen && is_acgt str.[!i]
      do i := !i + 1
      done ;
      if !i = strlen
      then str
      else raise ( Invalid_format ("unknown character [" ^
                                   ( Char.escaped str.[!i] ) ^ "]" ) )
    in
    (* description line: everything after '>' up to the end of line *)
    let parse_desc () =
      Scanf.bscanf ib ">%s@\n" ( fun s -> s )
    in
    (* sequence lines: read until the next '>' or the end of input *)
    let parse_bases () =
      let rec parse_bases_r acc =
        let c =
          try
            Scanf.bscanf ib "%0c" ( fun c -> c )
          with
          | End_of_file -> '>'
        in
        if c = '>'
        then acc
        else parse_bases_r
            ( Scanf.bscanf ib "%s@\n" String.uppercase_ascii :: acc )
      in
      String.concat "" ( List.rev ( parse_bases_r [ ] ) )
    in
    let desc = parse_desc () in
    let bases = parse_bases () in
    ( desc, bases )

  let load in_channel =
    let ib = Scanf.Scanning.from_channel in_channel in
    let rec parse_seq_r () =
      (* stop at the end of input instead of recursing forever *)
      if Scanf.Scanning.end_of_input ib then [ ]
      else
        let s = load_single_sequence ib in
        s :: parse_seq_r ()
    in
    parse_seq_r ()

  let save_single_digestion out_channel ( name, fragments ) =
    Printf.fprintf out_channel ">%s\n" name ;
    let f ( s, c ) = Printf.fprintf out_channel "%d\t%d\n" s c in
    List.iter f fragments

end

(*********************************************************************
 * main.ml: camlstomach main procedure
 * Giulio Marcon <[email protected]>
 *********************************************************************)

(* process file enzymes add_error: digests with enzymes the sequences in file
   (file must be in Fasta format) *)
let process file enzymes add_error =
  let rec process_r ib =
    if Scanf.Scanning.end_of_input ib then () else
      let ( name, sequence ) = Io.Fasta.load_single_sequence ib in
      let fragments = EnzymeDigest.multiple_digest enzymes sequence in
      Io.Fasta.save_single_digestion stdout ( name, add_error fragments ) ;
      process_r ib
  in
  Utils.temp_resource
    ( fun in_channel -> process_r ( Scanf.Scanning.from_channel in_channel ) )
    ( open_in file )
    close_in

let main () =
  Random.init ( int_of_float ( Unix.time () ) ) ;
  let ( seq_file, enzymes, add_error ) = Args.get_parameters () in
  let enzymes =
    if List.length enzymes = 0
    then Enzymes.default_set
    else List.map Enzymes.enzyme_of_string enzymes
  in
  process seq_file enzymes add_error
;;

main ()

(*********************************************************************
 * math.ml : math library
 * Giulio Marcon <[email protected]>
 *********************************************************************)

let pi = 4.0 *. atan 1.0

(* the square is parenthesised explicitly: prefix minus binds tighter
   than ** in OCaml *)
let gaussian stddev mean x =
  1.0 /. ( sqrt ( 2.0 *. pi ) *. stddev ) *.
  exp ( -. ( ( x -. mean ) ** 2.0 ) /. ( 2.0 *. stddev ** 2.0 ) )

(*
Anyone attempting to generate random numbers by deterministic means is,
of course, living in a state of sin.
                                                  -- John von Neumann
*)
module RandomDistributions = struct

  (* second value of the last generated pair, kept for the next call *)
  let z2 = ref 0.0

  (* polar form of the Box-Muller transformation (1958) to generate
     random numbers with normal distribution starting from a uniform
     one *)
  (* non-polar form:
     y1 = sqrt( - 2 ln(x1) ) cos( 2 pi x2 )
     y2 = sqrt( - 2 ln(x1) ) sin( 2 pi x2 ) *)
  let normal stddev mean =
    let rec gen_two () =
      let x1 = Random.float 2.0 -. 1.0 in
      let x2 = Random.float 2.0 -. 1.0 in
      let w = x1 *. x1 +. x2 *. x2 in
      (* w = 0 is rejected too, since log 0 is not finite *)
      if w > 0.0 && w < 1.0
      then let w = sqrt ( ( -2.0 *. log w ) /. w ) in
        ( x1 *. w , x2 *. w )
      else gen_two ()
    in
    if !z2 <> 0.0
    then ( let v = mean +. !z2 *. stddev in
           z2 := 0.0 ; v )
    else ( let ( x1, x2 ) = gen_two () in
           z2 := x2 ;
           mean +. x1 *. stddev )
end

(*********************************************************************
 * utils.ml : general utilities
 * Giulio Marcon <[email protected]>
 *********************************************************************)

(* applies f to an already opened resource and closes it afterwards *)
let temp_resource f open_res close_res =
  let res = open_res in
  let v = f res in
  let _ = close_res res in
  v
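Note that `temp_resource` releases the resource only when `f` returns normally. A possible variant that also releases it when `f` raises an exception can be sketched with `Fun.protect` from the OCaml standard library (available from OCaml 4.08). The name `with_resource` and its argument order are assumptions made for this sketch; they are not part of the thesis code.

```ocaml
(* Opens a resource, applies f to it and always closes it,
   even if f raises an exception. *)
let with_resource open_res close_res f =
  let res = open_res () in
  Fun.protect ~finally:(fun () -> close_res res) (fun () -> f res)

let () =
  let closed = ref false in
  let result =
    try with_resource (fun () -> 1) (fun _ -> closed := true)
          (fun _ -> failwith "boom")
    with Failure _ -> 0
  in
  assert (result = 0);
  assert !closed
```

With input channels this would be called as `with_resource (fun () -> open_in file) close_in f`, so the channel is opened lazily and closed on every exit path.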

A.2 camlass

(*********************************************************************
 * args.ml: parameters handling
 * Giulio Marcon <[email protected]>
 *********************************************************************)

module Args = struct

  let t2 = ref 1.0
  let sources = ref [ ]
  let help = ref false
  let infile = ref ""

  let specl = [
    ("-help", Arg.Set help,
     " display this help message") ;
    ("-t2", Arg.Float ( fun f -> t2 := f ),
     "\t t2 value (default = " ^ ( string_of_float !t2 ) ^ ")") ;
  ]

  let usage = "Camlass: OCaml bac assembler.\n"
            ^ "Usage: " ^ Sys.argv.(0) ^ " [options] infile\n"
            ^ "Options:\n"
            ^ " infile bac fingerprints file"

  let get_parameters () =
    let anon s = sources := !sources @ [s] in
    Arg.parse specl anon usage ;
    if ( ( List.length !sources ) = 1 )
    then ( infile := List.hd !sources ; () )
    else ( Arg.usage specl usage ; exit 0 ) ;
    if ( !help )
    then ( Arg.usage specl usage ; exit 0 )
    else () ;
    ( !infile, !t2 )

end

let get_parameters () =
  Args.get_parameters ()


[Figure A.2: camlass modules dependency graph. Modules shown: Main, Args, Fasulo, Io, FasuloOrdering, FasuloNeighbors, FasuloMaximal, FingerprintedClone, Clone, Fingerprint, Fragment, Math, Utils.]


(*********************************************************************
 * clone.ml : clone data type and operations
 * Giulio Marcon <[email protected]>
 *********************************************************************)

module Clone = struct

  (* state *)
  let id = ref 0

  (* type *)
  type id = int
  type 'a t = id * 'a

  (* operations *)
  let eq ( id_i, _ ) ( id_j, _ ) =
    id_i = id_j

  let clone_fresh_id () =
    id := !id + 1 ; !id

end

type name = string
type left = int
type length = int
type t = ( name * left * length ) Clone.t

let clone_fresh_id = Clone.clone_fresh_id
let eq = Clone.eq
let (=) = Clone.eq


(*********************************************************************
 * fasulo.ml: interface for implementation of Fasulo's algorithm
 * Giulio Marcon <[email protected]>
 *********************************************************************)

let get_maximal_clones = FasuloMaximal.get_maximal_clones
let get_neighbors = FasuloNeighbors.get_neighbors
let get_contigs = FasuloOrdering.get_contigs


(*********************************************************************
 * fasuloMaximal.ml: implementation of Fasulo algorithm for maximal
 *                   clones
 * Giulio Marcon <[email protected]>
 *********************************************************************)

let get_maximal_clones clones =
  (* selection of one representative for each equivalence class *)
  let remove_identical_clones clones =
    let rec f_clones = function
      | [ ] ->
          [ ]
      | clone :: rest ->
          let ( eq_class, clones ) =
            let f x =
              FingerprintedClone.identical ( clone, x )
            in
            List.partition f rest
          in
          (* recurse only on the clones outside this equivalence class *)
          ( Utils.List.min ( clone :: eq_class ) ) :: f_clones clones
    in
    f_clones clones
  in
  (* inclusion is a partial order relation on clones, we want the top
     of each chain, thus all the maximal c_i, where maximal means that
     there are no c_j such that In ( c_i, c_j ) *)
  let rec f acc = function
    | [ ] ->
        acc
    | clone :: rest ->
        let acc =
          let f x =
            FingerprintedClone.inclusion ( clone, x )
          in
          if List.exists f clones
          then acc
          else clone :: acc
        in
        f acc rest
  in
  f [ ] ( remove_identical_clones clones )


(*********************************************************************
 * fasuloNeighbors.ml: implementation of Fasulo algorithm for nearest
 *                     neighbors with several improvements
 * Giulio Marcon <[email protected]>
 *********************************************************************)

type 'a partition_vertex = { v : 'a; mutable side : bool }
type 'a graph = Graph of 'a list * ( ( 'a * 'a ) list )

let get_neighbors clone clones =
  (* creates neighbor graph of a clone *)
  let create_neighbor_graph clone =
    let rec graph_of_relation spanrelations =
      let add_unique element set =
        try
          let f x = Clone.(=) x.v element.v in
          ( List.find f set, set )
        with
          Not_found -> ( element, element :: set )
      in
      let verticize clone =
        { v = clone ; side = false }
      in
      let rec graph_of_relation_r v e = function
        | [ ] ->
            Graph ( v, e )
        | ( ci, cj ) :: tl ->
            let ( ci, v ) = add_unique ( verticize ci ) v in
            let ( cj, v ) = add_unique ( verticize cj ) v in
            let e = ( cj, ci ) :: ( ci, cj ) :: e in
            graph_of_relation_r v e tl
      in
      graph_of_relation_r [ ] [ ] spanrelations
    in
    let spanning ( c_i, c_j ) =
      if FingerprintedClone.spanning ( c_i, c_j, clone )
      then Some ( c_i, c_j )
      else None
    in
    graph_of_relation ( Utils.pairwise_sym spanning clones )
  in


  (* partitions neighbor graph *)
  let partition_neighbor_graph = function Graph ( vertices, edges ) ->
    (* returns a couple ( n, n' ) where n is the number of adjacent
       vertexes on the same partition and n' is the number of adjacent
       vertexes on the opposite partition *)
    let count_neighbors vertex =
      let f ( n, n' ) ( v1, v2 ) =
        if Clone.(=) vertex.v v1.v
        then if vertex.side = v2.side
             then ( n + 1, n' )
             else ( n, n' + 1 )
        else if Clone.(=) vertex.v v2.v
        then if vertex.side = v1.side
             then ( n + 1, n' )
             else ( n, n' + 1 )
        else ( n, n' )
      in
      let ( n, n' ) = List.fold_left f ( 0, 0 ) edges in
      (* note: these are the double of the actual value due to
         the fact that the graph is undirected *)
      ( n, n' )
    in
    let count_false_positives () =
      let f sum v = sum + ( fst ( count_neighbors v ) ) in
      List.fold_left f 0 vertices
    in
    (* changes the mutable field side to propose a new
       configuration *)
    let update_side vertex =
      let ( n, n' ) = count_neighbors vertex in
      if n > n'
      then vertex.side <- not vertex.side
      else if n = n'
      then vertex.side <- Random.bool ()
      else vertex.side <- vertex.side
    in
    let rec main_loop n =
      if n = 0 then 0 else (
        let new_n =
          List.iter update_side vertices ;
          count_false_positives ()
        in
        (* termination: checks that the number of false
           positives is decreasing *)
        if new_n >= n
        then n
        else main_loop new_n
      )
    in
    let random_restarts runs =
      let randomize_sides () =
        let f v = v.side <- Random.bool () in
        List.iter f vertices
      in
      let save_partition () =
        let f v = v.side in
        List.rev_map f vertices
      in
      let restore_partition partition =
        let f v s = v.side <- s in
        (* the saved partition is reversed (rev_map), so it is
           reversed back before being applied with f *)
        List.iter2 f vertices ( List.rev partition )
      in
      let rec f runs best_partition best_fp =
        if runs <= 0 || best_fp = 0
        then best_partition
        else (
          randomize_sides () ;
          let fp = main_loop ( count_false_positives () ) in
          if fp < best_fp
          then f ( runs - 1 ) ( save_partition () ) fp
          else f ( runs - 1 ) best_partition best_fp
        )
      in
      let v = f runs ( save_partition () ) ( count_false_positives () ) in
      restore_partition v ;
      Graph ( vertices, edges )
    in
    random_restarts 10
  in
  let get_partition_sets = function Graph ( vertices, edges ) ->
    let triple c_i =
      let f1 sum c_j =
        if
          not ( Clone.(=) c_i.v c_j.v ) && c_i.side = c_j.side &&
          FingerprintedClone.spanning ( c_i.v, clone, c_j.v )
        then sum
        else sum + 1
      in
      let f2 sum ( v1, v2 ) =
        if
          Clone.(=) c_i.v v1.v && c_i.side <> v2.side ||
          Clone.(=) c_i.v v2.v && c_i.side <> v1.side
        then sum + 1
        else sum
      in
      ( List.fold_left f1 0 vertices,
        List.fold_left f2 0 edges,
        FingerprintedClone.coincidence ( c_i.v, clone ) )
    in
    let sort_and_clean n =
      let f ( _, x ) ( _, y ) = compare x y in
      let n = List.sort f n in
      List.rev_map ( fun ( x, _ ) -> x.v ) n
    in
    (* we first get the sets and we add an ordering index: *)
    let ( n, n' ) = List.fold_left
      ( fun ( n, n' ) v ->
          if v.side
          then ( ( v, triple v ) :: n, n' )
          else ( n, ( v, triple v ) :: n' )
      )
      ( [ ], [ ] ) vertices
    in
    (* we then order them by the index and we remove it: *)
    ( sort_and_clean n, sort_and_clean n' )
  in
  get_partition_sets ( partition_neighbor_graph ( create_neighbor_graph clone ) )


(*********************************************************************
 * fasuloOrdering.ml: algorithm for the ordering of clones given their
 *                    nearest neighbors lists; partially inspired by
 *                    Fasulo's algorithm for ordering
 * Giulio Marcon <[email protected]>
 *********************************************************************)

type 'a graph = Graph of 'a list * ( ( 'a * 'a ) list )

let get_contigs neighbors =
  (* build top ranked neighbors graph from neighbors data *)
  let create_top_ranked_neighbors_graph neighbors =
    let add_edge c n acc =
      if List.length n > 0
      then ( c, List.hd n ) :: acc
      else acc
    in
    let f ( v, e ) ( c, ( n, n' ) ) =
      ( c :: v, add_edge c n ( add_edge c n' e ) )
    in
    let ( v, e ) = List.fold_left f ( [ ], [ ] ) neighbors in
    Graph ( v, e )
  in
  (* grows a component by adding vertexes that are strongly connected *)
  let grow vertex ( Graph ( vertices, edges ) ) =
    (* finds the first element el of the list satisfying the predicate f
       and returns Some el and the list without the element, or (if there
       are no elements satisfying the predicate) returns None and the
       elements of the list; element order is not preserved *)
    let partition_first f list =
      (* the helper is named f_ so that the predicate f stays visible *)
      let rec f_ acc = function
        | [ ] ->
            ( None, acc )
        | hd :: tl ->
            if f hd
            then ( Some hd, acc @ tl )
            else f_ ( hd :: acc ) tl
      in
      f_ [ ] list
    in
    let grow_step vertex ( Graph ( vertices, edges ) ) =
      let ( outedge, edges ) =
        let f e = Clone.(=) ( fst e ) vertex in
        partition_first f edges
      in
      match outedge with
      | None -> ( None, Graph ( vertices, edges ) )
      | Some outedge ->
          let ( symedge, edges ) =
            let f e =
              Clone.(=) ( fst e ) ( snd outedge ) &&
              Clone.(=) ( snd e ) vertex
            in
            partition_first f edges
          in
          match symedge with
          | None -> ( None, Graph ( vertices, outedge :: edges ) )
          | Some symedge ->
              ( Some ( snd outedge ),
                let v =
                  let f = ( Clone.(=) ( snd outedge ) ) in
                  snd ( partition_first f vertices )
                in
                Graph ( v, edges ) )
    in
    let rec grow_r vertex g =
      match grow_step vertex g with
      | ( Some v, g ) -> ( match grow_r v g with ( l, g ) -> ( v :: l, g ) )
      | ( None, g ) -> ( [ ], g )
    in
    let v =
      let f = (=) vertex in
      snd ( List.partition f vertices )
    in
    grow_r vertex ( Graph ( v, edges ) )
  in
  (* enumerates strongly ordered components *)
  let rec enum_components acc = function Graph ( vertices, edges ) ->
    match vertices with
    | [ ] ->
        acc
    | v :: vertices ->
        (* Printf.printf "\r%3d%!" ( List.length vertices ) ; *)
        let ( right, g ) = grow v ( Graph ( vertices, edges ) ) in
        let ( left, g ) = grow v g in
        enum_components ( ( ( List.rev left ) @ [ v ] @ right ) :: acc ) g
  in
  (* contigs are given by strongly ordered components *)
  enum_components [ ] ( create_top_ranked_neighbors_graph neighbors )

(*
let order_contigs contigs neighbors =
  (* connected components graph:
     - each contig has two vertices (L and R) connected
       by an internal edge
     - external edges are between different contigs iff
       they are confirmed neighbors *)
  let rec create_internal_edges v e = function
    | [ ] ->
        ( v, e )
    | contig :: contigs ->
        let ( v1, v2 ) = ( List.hd contig, Utils.List.last contig ) in
        let v = List.hd contig :: Utils.List.last contigs :: v in
        let e = ( v1, v2, contig ) :: ( v2, v1, List.rev contig ) in
        create_internal_edges v e tl
  in
  create_internal_edges [ ] [ ] contigs
*)


(*********************************************************************
 * fingerprint.ml: fingerprint data type and operations
 * Giulio Marcon <[email protected]>
 *********************************************************************)

type t = Fragment.t list

let order fingerprint =
  List.sort compare fingerprint

let coincidence fingerprint_i fingerprint_j =
  let rec f acc = function
    | ( [ ], _ ) | ( _, [ ] ) ->
        acc
    | ( f_i :: fs_i, f_j :: fs_j ) ->
        if Fragment.(=) f_i f_j
        then f ( acc + 1 ) ( fs_i, fs_j )
        else if f_i < f_j
        then f acc ( fs_i, f_j :: fs_j )
        else f acc ( f_i :: fs_i, fs_j )
  in
  f 0 ( fingerprint_i, fingerprint_j )

let union fingerprint_i fingerprint_j =
  let rec f = function
    | ( [ ], rest ) | ( rest, [ ] ) ->
        rest
    | ( f_i :: fs_i, f_j :: fs_j ) ->
        if Fragment.(=) f_i f_j
        then Fragment.merge f_i f_j :: f ( fs_i, fs_j )
        else if f_i < f_j
        then f_i :: f ( fs_i, f_j :: fs_j )
        else f_j :: f ( f_i :: fs_i, fs_j )
  in
  f ( fingerprint_i, fingerprint_j )


(*********************************************************************
 * fingerprintedClone.ml : clone data type and operations
 * Giulio Marcon <[email protected]>
 *********************************************************************)

type t = Clone.t * Fingerprint.t

(* works with fingerprints that have the end fragments ( not coloured ) removed *)
module Exact = struct

  let identical ( ( c_i, f_i ), ( c_j, f_j ) ) =
    List.length f_i = List.length f_j &&
    Fingerprint.coincidence f_i f_j = List.length f_i

  let inclusion ( ( c_i, f_i ), ( c_j, f_j ) ) =
    (* different clones *)
    not ( Clone.(=) c_i c_j ) &&
    (* Co ( f_i, f_j ) = | f_i | *)
    Fingerprint.coincidence f_i f_j = List.length f_i

  let spanning ( ( c_i, f_i ), ( c_j, f_j ), ( c_k, f_k ) ) =
    (* different clones *)
    not ( Clone.(=) c_i c_j || Clone.(=) c_j c_k || Clone.(=) c_k c_i ) &&
    Fingerprint.coincidence ( Fingerprint.union f_i f_j ) f_k = List.length f_k

  let fast_spanning ( ( c_i, f_i ), ( c_j, f_j ), ( c_k, f_k ) ) =
    let rec f acc = function
      | ( _, _, [ ] ) ->
          acc
      | ( [ ], rest, fs_k ) | ( rest, [ ], fs_k ) ->
          acc + Fingerprint.coincidence rest fs_k
      | ( f_i :: fs_i, f_j :: fs_j, f_k :: fs_k ) ->
          let ( acc, fs_i', fs_j', fs_k' ) =
            if Fragment.(=) f_i f_k
            then if Fragment.(=) f_j f_k
                 then ( acc + 1, fs_i, fs_j, fs_k )
                 else if f_j < f_k
                 then ( acc + 1, fs_i, fs_j, fs_k )
                 else ( acc + 1, fs_i, f_j :: fs_j, fs_k )
            else if f_i < f_k
            then if Fragment.(=) f_j f_k
                 then ( acc + 1, fs_i, fs_j, fs_k )
                 else if f_j < f_k
                 then ( acc, fs_i, fs_j, f_k :: fs_k )
                 else ( acc, fs_i, f_j :: fs_j, f_k :: fs_k )
            else if Fragment.(=) f_j f_k
            then ( acc + 1, f_i :: fs_i, fs_j, fs_k )
            else if f_j < f_k
            then ( acc, f_i :: fs_i, fs_j, f_k :: fs_k )
            else
              (* a fragment f_k is not matched, stop! *)
              ( acc, [ ], [ ], [ ] )
              (* ( acc, f_i :: fs_i, f_j :: fs_j, fs_k ) *)
          in
          f acc ( fs_i', fs_j', fs_k' )
    in
    not ( Clone.(=) c_i c_j || Clone.(=) c_j c_k || Clone.(=) c_k c_i ) &&
    f 0 ( f_i, f_j, f_k ) = List.length f_k

end

module ExactWithEnds = struct

  let identical ( ( c_i, f_i ), ( c_j, f_j ) ) =
    List.length f_i = List.length f_j &&
    Fingerprint.coincidence f_i f_j = List.length f_i - 4

  let inclusion ( ( c_i, f_i ), ( c_j, f_j ) ) =
    (* different clones *)
    not ( Clone.(=) c_i c_j ) &&
    (* Co ( f_i, f_j ) = | f_i | *)
    Fingerprint.coincidence f_i f_j = List.length f_i - 4

  let spanning ( ( c_i, f_i ), ( c_j, f_j ), ( c_k, f_k ) ) =
    (* different clones *)
    not ( Clone.(=) c_i c_j || Clone.(=) c_j c_k || Clone.(=) c_k c_i ) &&
    Fingerprint.coincidence ( Fingerprint.union f_i f_j ) f_k = List.length f_k - 4

  let fast_spanning ( ( c_i, f_i ), ( c_j, f_j ), ( c_k, f_k ) ) =
    let rec f acc acc2 = function
      | ( _, _, [ ] ) ->
          acc
      | ( [ ], rest, fs_k ) | ( rest, [ ], fs_k ) ->
          acc + Fingerprint.coincidence rest fs_k
      | ( f_i :: fs_i, f_j :: fs_j, f_k :: fs_k ) ->
          let ( acc, acc2, fs_i', fs_j', fs_k' ) =
            if Fragment.(=) f_i f_k
            then if Fragment.(=) f_j f_k
                 then ( acc + 1, acc2, fs_i, fs_j, fs_k )
                 else if f_j < f_k
                 then ( acc + 1, acc2, fs_i, fs_j, fs_k )
                 else ( acc + 1, acc2, fs_i, f_j :: fs_j, fs_k )
            else if f_i < f_k
            then if Fragment.(=) f_j f_k
                 then ( acc + 1, acc2, fs_i, fs_j, fs_k )
                 else if f_j < f_k
                 then ( acc, acc2, fs_i, fs_j, f_k :: fs_k )
                 else ( acc, acc2, fs_i, f_j :: fs_j, f_k :: fs_k )
            else if Fragment.(=) f_j f_k
            then ( acc + 1, acc2, f_i :: fs_i, fs_j, fs_k )
            else if f_j < f_k
            then ( acc, acc2, f_i :: fs_i, fs_j, f_k :: fs_k )
            else if acc2 > 4 then
              (* more than 4 fragments f_k are not matched, stop! *)
              ( acc, acc2 + 1, [ ], [ ], [ ] )
            else
              ( acc, acc2 + 1, f_i :: fs_i, f_j :: fs_j, fs_k )
          in
          f acc acc2 ( fs_i', fs_j', fs_k' )
    in
    not ( Clone.(=) c_i c_j || Clone.(=) c_j c_k || Clone.(=) c_k c_i ) &&
    f 0 0 ( f_i, f_j, f_k ) = List.length f_k - 4

end

module Error = struct

  let inc_thresh = 99
  let span_thresh = 99

  let identical ( ( c_i, f_i ), ( c_j, f_j ) ) =
    List.length f_i = List.length f_j &&
    Fingerprint.coincidence f_i f_j * 100 / List.length f_i > inc_thresh

  let inclusion ( ( c_i, f_i ), ( c_j, f_j ) ) =
    (* different clones *)
    not ( Clone.(=) c_i c_j ) &&
    (* Co ( f_i, f_j ) = | f_i | *)
    Fingerprint.coincidence f_i f_j * 100 / List.length f_i > inc_thresh

  let spanning ( ( c_i, f_i ), ( c_j, f_j ), ( c_k, f_k ) ) =
    (* different clones *)
    not ( Clone.(=) c_i c_j || Clone.(=) c_j c_k || Clone.(=) c_k c_i ) &&
    Fingerprint.coincidence ( Fingerprint.union f_i f_j ) f_k * 100 /
    List.length f_k > span_thresh

end

let coincidence ( ( c_i, f_i ), ( c_j, f_j ) ) =
  Fingerprint.coincidence f_i f_j

let identical = Exact.identical
let inclusion = Exact.inclusion
let spanning = Exact.fast_spanning

(*
let identical = ExactWithEnds.identical
let inclusion = ExactWithEnds.inclusion
let spanning = ExactWithEnds.spanning
*)

(*
let identical = Error.identical
let inclusion = Error.inclusion
let spanning = Error.spanning
*)


(*********************************************************************
 * fragment.ml : everything about a fragment
 * Giulio Marcon <[email protected]>
 *********************************************************************)

(* fragment type and operations with exact measurements: *)
module Exact = struct

  (* type *)
  type size = int
  type colour = int
  type t = size * colour

  (* operations *)
  let eq ( size_i, colour_i ) ( size_j, colour_j ) =
    size_i = size_j && colour_i = colour_j

  let merge ( size_i, colour_i ) ( size_j, colour_j ) =
    ( ( size_i + size_j ) / 2, min colour_i colour_j )

end

type t = Exact.t
let (=) = Exact.eq
let merge = Exact.merge


(*********************************************************************
 * io.ml : input / output operations
 * Giulio Marcon <[email protected]>
 *********************************************************************)

exception Invalid_format of string

(* reads a list of clones and fragment measurements in fasta-like format *)
let load_clones in_channel : FingerprintedClone.t list =
  let ib = Scanf.Scanning.from_channel in_channel in
  let row = ref 0 in
  let rec load_clones_r () =
    try
      let clone =
        Scanf.bscanf ib ">%s start=%d length=%d\n"
          ( fun name start length ->
              let fingerprint = Fingerprint.order ( load_fragments_r () ) in
              ( ( Clone.clone_fresh_id (), ( name, start, length ) ) , fingerprint )
          )
      in
      ( row := !row + 1 ; clone :: load_clones_r () )
    with
    | End_of_file ->
        [ ]
    | Scanf.Scan_failure e ->
        raise ( Invalid_format ( "row " ^ ( string_of_int !row ) ^ ": " ^ e ) )
  and
  load_fragments_r () =
    try
      let fragment =
        Scanf.bscanf ib "%d\t%d\n"
          ( fun length color -> ( length, color ) )
      in
      ( row := !row + 1 ; fragment :: load_fragments_r () )
    with
    | End_of_file ->
        [ ]
    | Scanf.Scan_failure e ->
        [ ]
  in
  load_clones_r ()


(*********************************************************************
 * main.ml: camlass main procedure
 * Giulio Marcon <[email protected]>
 *********************************************************************)

let main bacs_file =
  (* initialization of pseudo random number generator *)
  Random.init ( int_of_float ( Unix.time () ) ) ;

  (* clones loading *)
  let clones = Utils.Io.temp_resource Io.load_clones ( open_in bacs_file ) close_in in
  Printf.eprintf "%d clones loaded.\n%!" ( List.length clones ) ;

  (* maximal clones selection *)
  let f () = Fasulo.get_maximal_clones clones in
  let clones = Utils.Io.snapshot ( bacs_file ^ ".maximal" ) f in
  Printf.eprintf "%d maximal clones selected.\n%!" ( List.length clones ) ;

  (* nearest neighbors discovery *)
  let f () = List.rev_map ( fun c -> ( c, Fasulo.get_neighbors c clones ) ) clones in
  let neighbors = Utils.Io.snapshot ( bacs_file ^ ".neighbors" ) f in
  Printf.eprintf "%d neighbors lists created.\n%!" ( List.length neighbors ) ;

  (* contigs *)
  let f () = Fasulo.get_contigs neighbors in
  let contigs = Utils.Io.snapshot ( bacs_file ^ ".contigs" ) f in
  let contigs = f () in
  Printf.eprintf "%d contigs assembled (%d with size at least 2).\n%!"
    ( List.length contigs )
    ( List.length ( List.filter ( fun x -> List.length x >= 2 ) contigs ) ) ;
  ()
;;

let ( bacs_file, t2 ) = Args.get_parameters () in
main bacs_file


(*********************************************************************
 * math.ml: mathematical utilities
 *
 * Giulio Marcon <[email protected]>
 *********************************************************************)

let pi = 4.0 *. atan 1.0

(* note: prefix -. binds tighter than **, so the squared term is
   parenthesised before it is negated *)
let gaussian stddev mean x =
  1.0 /. ( sqrt ( 2.0 *. pi ) *. stddev ) *.
  exp ( -. ( ( x -. mean ) ** 2.0 ) /. ( 2.0 *. stddev ** 2.0 ) )


(*********************************************************************
 * utils.ml : general utilities
 * Giulio Marcon <[email protected]>
 *********************************************************************)

(* general I/O utilities: *)
module Io = struct

  let marshal file data =
    let out = open_out_bin file in
    Marshal.to_channel out data [ ] ;
    close_out out

  let unmarshal file =
    let input = open_in_bin file in
    let data = Marshal.from_channel input in
    close_in input ;
    data

  let temp_resource f open_res close_res =
    let res = open_res in
    let v = f res in
    let _ = close_res res in
    v

  let snapshot file f =
    if Sys.file_exists file
    then unmarshal file
    else let v = f () in
         marshal file v ; v

end

(* list utilities: *)
module List = struct

  let rec last = function
    | [ ] ->
        raise ( Invalid_argument "Utils.List.last: empty list" )
    | last_el :: [ ] ->
        last_el
    | hd :: tl ->
        last tl

  let rec first n list =
    assert ( n >= 0 ) ;
    if n = 0
    then [ ]
    else ( List.hd list ) :: ( first ( n - 1 ) ( List.tl list ) )

  let rec skip n list =
    assert ( n >= 0 ) ;
    if n = 0
    then list
    else skip ( n - 1 ) ( List.tl list )

  let rec sub start length list =
    first length ( skip start list )

  let find el list =
    try
      Some ( List.find ( fun x -> el = x ) list )
    with
    | Not_found -> None

  let rec min = function
    | [ ] ->
        invalid_arg "Utils.List.min: empty list"
    | ( x :: [ ] ) ->
        x
    | ( x :: xs ) ->
        Pervasives.min x ( min xs )

end

let pairwise_gen symmetric f list =
  let rec pairwise_r acc = function
    | ( [ ], _ ) ->
        acc
    | ( h1 :: tl, [ ] ) ->
        pairwise_r acc ( tl, ( if symmetric then tl else list ) )
    | ( h1 :: t1, h2 :: t2 ) ->
        let acc = match f ( h1, h2 ) with
          | Some s -> s :: acc
          | None -> acc
        in
        pairwise_r acc ( h1 :: t1, t2 )
  in
  pairwise_r [ ] ( list, list )

let pairwise_sym f list =
  pairwise_gen true f list

let pairwise f list =
  pairwise_gen false f list


THESIS ABSTRACT AND DECLARATION OF AVAILABILITY FOR CONSULTATION (*)

Università Ca' Foscari - Venezia

The undersigned: GIULIO MARCON

Student number: 793939

Faculty: SCIENZE MM.FF.NN. (Mathematical, Physical and Natural Sciences)

Enrolled in the degree course in: INFORMATICA (SPECIALISTICA)

Thesis title (*): DNA Sequencing: the Computational Point of View

DECLARES THAT HIS THESIS IS:

Available for consultation immediately / Not available for consultation / Available for consultation after ... months

Venice, 30 June 2004. Student's signature:

(space for typing the abstract)

We propose and implement a solution to a computational problem arising from DNA sequencing. The original biological problem (clone ordering from fingerprinting data obtained by complete digestion of four restriction enzymes and fluorescent labelling) is formalized, compared to previously studied similar problems, and dealt with through an adaptation of algorithms originally thought for restriction mapping. Additionally, an approach to the noisy version of the problem is proposed without implementation.

(*) The title must be the final one, identical to the title printed on the cover of the work submitted to the President of the Degree Committee.

(*) To be inserted as the last page of the thesis. The abstract must not exceed one thousand characters.
