
Essentials of Genomics and Bioinformatics by C. W. Sensen

© WILEY-VCH Verlag GmbH, 2002

15 Genomic Data Representation through Images - MAGPIE as an Example

PAUL GORDON Halifax, Canada

TERRY GAASTERLAND New York, NY, USA

CHRISTOPH W. SENSEN Calgary, Canada

1 Introduction
2 The Graphical System
2.1 The Hierarchical MAGPIE Display System
2.1.1 Whole Project View
2.1.2 Coding Region Displays
2.1.3 Contiguous Sequence with ORF Evidence
2.1.4 Contiguous Sequence with Evidence
2.1.5 ORF Close-up
2.1.6 Analysis Tools Summary
2.1.7 Expanded Tool Summary
2.1.8 Base Composition
2.1.9 Sequence Repeats
2.1.10 Sequence Ambiguities
2.1.11 Sequence Strand Assembly Coverage
2.1.12 Restriction Enzyme Fragmentation
2.1.13 Agarose Gel Simulation
3 Conclusions and Open Issues
4 References


1 Introduction

Graphical display systems for complex data, which can be used to analyze what would otherwise be overwhelming amounts of data, are becoming increasingly important in many scientific fields. Molecular biology and genomics are key areas for this development, because the size and number of genomic databases and the number of analysis tools are continually increasing. A typical microbial genome has up to 1,000 open reading frames (ORFs) per megabase of genomic sequence. If just 10 tools were used to analyze such a genome, and each tool showed hits against 100 database entries (which is not an uncommon number), about one million pieces of evidence would need to be recorded and mapped along the megabase of genomic sequence. When dealing with this amount of information, the saying “A picture is worth a thousand words” is an understatement!
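The estimate above is a simple product of the three quoted figures. A minimal sketch of the arithmetic, with variable names chosen here purely for illustration:

```python
# Back-of-the-envelope count of evidence items for one megabase of sequence.
orfs_per_megabase = 1_000   # upper bound quoted for a typical microbial genome
tools = 10                  # number of analysis tools run on each ORF
hits_per_tool = 100         # database entries hit by each tool for each ORF

evidence_items = orfs_per_megabase * tools * hits_per_tool
print(f"{evidence_items:,} pieces of evidence per megabase")
```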

In 1996, GAASTERLAND and SENSEN introduced a system for the automated analysis of biological sequences, called MAGPIE (Multipurpose Automated Genome Investigation Environment) (GAASTERLAND and SENSEN, 1996). This tool-integration system executes multiple bioinformatics tools and integrates their results into an easily interpretable form. It has been used for the analysis of complete genomes, genomic DNA fragments, ESTs, proteins, and protein fragments. Initially, all MAGPIE output was in tabular format, but the need for rich visual representations of genomic analyses became evident early on. MAGPIE summarizes information about evidence supporting functional assignments for genes, information about regulatory elements in genomic sequences (including promoters, terminators, and Shine-Dalgarno sequences), metabolic pathways, and the phylogenetic origin of genes. MAGPIE sorts and ranks the evidence by strength and is able to display the level of confidence associated with database search results.

MAGPIE was the first genome analysis and annotation system to add graphical representations to the results. Based on continual user feedback from various installations, the images have been refined over time, allowing annotators to process the large quantity of relevant information quickly and efficiently. MAGPIE has some of the most comprehensive graphical capabilities available, so we use it as the example for genomic annotation that is supported by imaging. We describe the meaning of the various images, the algorithms used to create them, and the data types from which they are derived.

The MAGPIE image types reflect the fact that sequence data is stored and presented in a hierarchical manner. A MAGPIE project generally consists of, but is not limited to, related sequences generated for a particular organism. To facilitate browsing and data maintenance, sequences are organized into logical groups. For example, all sequences from a single clone are normally placed in one group, because they will be joined as sequencing progresses. The images presented in this text are from the Sulfolobus solfataricus P2 sequencing project (CHARLEBOIS et al., 1996). This project was chosen as the example because it includes all of the graphical representations that can be produced by MAGPIE. The complete Sulfolobus MAGPIE project is available at http://www.cbr.nrc.ca/sulfhome.

MAGPIE images can be classified into three categories: representation of evidence, summary of genomic features, and biochemical assay simulation, which supports the design of follow-up experiments. We discuss how these images aid the researcher in genome sequencing and annotation through pattern recognition and information filtering and how they can be used to support the validation of genomic data.

2 The Graphical System

The MAGPIE user interface, which was initially implemented with a Web-based display that supported text and table output, has been developed over time into a Web-based graphical display system. As described previously (GAASTERLAND and SENSEN, 1996), the tools used in MAGPIE include, but are not limited to: the FastA (PEARSON and LIPMAN, 1988) family of programs (including the ssearch Smith-Waterman implementation, and the protein fragment analysis tools fastf and tfastf), the BLAST (ALTSCHUL et al., 1997) family of programs (ungapped, gapped, and Position Specific Iterative), Blitz (http://www.ebi.ac.uk/bic-sw/), BLOCKS (HENIKOFF et al., 1999), ProSearch (KOLAKOWSKI et al., 1992), Genscan (BURGE and KARLIN, 1997), Glimmer (SALZBERG et al., 1998), and GeneMark (BORODOVSKY and MCININCH, 1993). To link image representations to alignments, individual tool “hits”, and other related information, the original tool outputs (e.g., BLAST or BLOCKS responses) are stored as HTML files after data processing. Hit lengths and scores are extracted during the response processing using dedicated parsers. The hits are sorted into user-defined confidence levels.
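The sorting of parsed hits into confidence levels can be pictured as a threshold lookup. The following is a minimal sketch only: the `Hit` record, the `CRITERIA` table, and every threshold value are hypothetical and are not taken from MAGPIE's actual configuration files.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    tool: str       # name of the tool that produced the hit
    subject: str    # identifier of the matched database entry
    score: float    # score extracted by the tool-specific parser
    length: int     # length of the match

# Hypothetical user-defined criteria: (minimum score, minimum length)
# for each level, listed from the strongest level (1) to the weakest (3).
CRITERIA = {
    "blast": [(200.0, 100), (100.0, 50), (50.0, 25)],
}


def confidence_level(hit, criteria=CRITERIA):
    """Return the strongest level whose criteria the hit meets, or None."""
    for level, (min_score, min_length) in enumerate(criteria.get(hit.tool, []), start=1):
        if hit.score >= min_score and hit.length >= min_length:
            return level
    return None


print(confidence_level(Hit("blast", "example_entry", 120.0, 60)))
```

Keeping the thresholds in a plain table mirrors the idea that the criteria are user-configurable rather than built into the parsers.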

The components of the modular computer code used for MAGPIE input and output are modules for Web standards written in the Perl5 (http://www.perl.com) programming language: HTML (Hypertext Markup Language), CGI (Common Gateway Interface), GIF (Graphics Interchange Format), and PNG (Portable Network Graphics). The text reporting system is based on a combination of pre-computed HTML pages and CGI programs producing HTML dynamically. A graphics library module called GD.pm (http://stein.cshl.org/www/software/GD), which is dynamically patched into the Perl5 system, is used to generate the MAGPIE graphics. GD.pm provides functionality for drawing on a two-dimensional canvas. The canvas can be translated into a browser-readable form such as PNG. Lines, basic geometric shapes, and arbitrary polygons can be drawn and color-filled using GD.pm. The drawing functions, when applied to linearly encoded data such as DNA or proteins, lend themselves to particularly succinct representations. Rulers, along which features are positioned, are drawn as straight lines. Simple ticks (small lines perpendicular to the ruler) generally represent position-specific features that occur frequently, e.g., stop codons. Unique polygons that occupy more space are used for position-specific features that occur less frequently, e.g., promoter sites. Data that cover a range, e.g., open reading frames, are displayed as boxes. These boxes may be subdivided when additional information needs to be encoded.
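The three drawing idioms (ruler, ticks, boxes) can be illustrated without GD.pm. The sketch below is not MAGPIE code: it swaps in plain SVG text output so that it needs no graphics library, and the function names, margins, and colors are assumptions made for illustration.

```python
def base_to_x(position, bases_per_pixel, left_margin=10):
    """Map a 1-based sequence position to a horizontal pixel coordinate."""
    return left_margin + (position - 1) // bases_per_pixel


def render_svg(seq_length, stops, orfs, bases_per_pixel=50):
    """Return an SVG string with a ruler, stop-codon ticks, and ORF boxes."""
    ruler_y = 30
    x_end = base_to_x(seq_length, bases_per_pixel)
    width = x_end + 10
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="60">']
    # Ruler: one straight line spanning the whole sequence.
    parts.append(f'<line x1="{base_to_x(1, bases_per_pixel)}" y1="{ruler_y}" '
                 f'x2="{x_end}" y2="{ruler_y}" stroke="black"/>')
    # Ticks: short perpendicular lines for frequent point features.
    for pos in stops:
        x = base_to_x(pos, bases_per_pixel)
        parts.append(f'<line x1="{x}" y1="{ruler_y - 4}" x2="{x}" '
                     f'y2="{ruler_y + 4}" stroke="orange"/>')
    # Boxes: rectangles for features that cover a range.
    for start, end in orfs:
        x = base_to_x(start, bases_per_pixel)
        w = max(1, base_to_x(end, bases_per_pixel) - x)
        parts.append(f'<rect x="{x}" y="{ruler_y - 14}" width="{w}" height="8" '
                     f'fill="none" stroke="blue"/>')
    parts.append("</svg>")
    return "\n".join(parts)


svg = render_svg(5000, stops=[120, 900], orfs=[(200, 1400)])
print(svg)
```

The only real logic is the position-to-pixel mapping; everything else is a direct translation of "ruler, tick, box" into drawing primitives.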

MAGPIE's graphics fulfill three main needs in a genome project once data is collected: display of the genomic features at various levels of detail in the genomic context, evaluation of the evidence supporting a feature's annotation, and quality control. Four types of information are used to generate images in MAGPIE: user preferences, sequence information, data from tool outputs as well as analysis results, and manual user annotations (verifications). User-based annotations are stored as text files, usually created via the Web interface, but they can also be imported from files using standards such as GenBank flat format or the General Feature Format (http://www.sanger.ac.uk/Software/formats/GFF/). User-configurable visualization parameters (e.g., the bases-per-pixel scale and the maximum image width) are stored in plain text configuration files, similar to the previously described (GAASTERLAND and SENSEN, 1996) configuration files for confidence level criteria and other configurable parts of MAGPIE. By default, images in MAGPIE are defined with a maximum width of 1,000 pixels. This allows landscape mode printing of the images on 8.5 x 11" paper.

While most data in MAGPIE is hierarchical and stored as text files, cross-referencing of equivalent sequence identifiers is done using binary Gnu Database Manager (GDBM) files. This exception to the text-file storage allows the rapid location of a current analysis report via an identifier for a previous version of the respective sequence (version control).

The graphical reporting system in MAGPIE has two distinct user-configurable modes: static and dynamic. In the static graphical mode, all images are pre-computed for viewing after the analysis is finished. This requires considerable disk space, but it is computationally and temporally efficient when the sequence and analyses do not change much, e.g., when a completely finished genome is analyzed. In the dynamic mode, images are created on demand, using the data extracted from the analysis. Although this requires more ad hoc computation, it is appropriate when the underlying sequences or the analyses are frequently updated, for example in the case of an ongoing genome sequencing project.

Two key features for viewing data in context are the hierarchical representation of the data and consistent display idioms. Idioms such as bolder coloring for stronger evidence (darker text and brighter hues), and the use of red and blue to indicate the forward or reverse DNA strand, pervade the images and allow complex information to be encoded in the graphics without cluttering them. Good idioms need only be learned once (COOPER, 1995). For example, by surveying the annotation and evidence strength status from Fig. 6, as well as the overlaps, it becomes evident how the clone relates to its genomic neighbors, and how much information has been gathered about the non-redundant part of the sequence.

Consistent color use improves the delivery of information in the MAGPIE images. Blue bars and borders indicate information located on the positive DNA strand (forward strand), while red bars and borders represent information located on the negative DNA strand. Black generally indicates information that is not strand specific. Coloration of analysis data is specified in a user-definable color preference file. Consistent coloring can group the evidence from similar tools by color range. For example, as shown later in the text, FastA (PEARSON and LIPMAN, 1988) hits against an EST collection will always be shown in a particular green hue, and FastA hits against a protein database may be colored in a different green hue.

Shading is used throughout the MAGPIE interface to denote the confidence level of the evidence. Stronger evidence is always displayed in darker shades. For example, description text in reports is black when the evidence is good, gray when it is moderate, and white when it is only marginally useful. As shown below, this holds true for the graphical evidence displays as well. It is easy to filter out information related to potential genomic function by following this simple concept. These representational consistencies also reduce visual clutter. Information is implicitly conveyed in the color instead of requiring explicit depiction or labeling of the displayed features.
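The "stronger evidence is darker" idiom amounts to a mapping from confidence level to shade. The sketch below is illustrative only: the text colors follow the example given above, while the blending rule toward white in `shade_for_level` is an assumption and not MAGPIE's documented behavior.

```python
# Report text colors for the three confidence levels, as described in the text.
TEXT_SHADE = {1: "black", 2: "gray", 3: "white"}


def shade_for_level(base_rgb, level, levels=3):
    """Lighten a base RGB color toward white as the confidence level weakens.

    Level 1 (strongest) returns the base color unchanged. The linear
    blend used here is a hypothetical choice for illustration.
    """
    return tuple(c + (255 - c) * (level - 1) // levels for c in base_rgb)


for level in (1, 2, 3):
    print(level, TEXT_SHADE[level], shade_for_level((0, 0, 255), level))
```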

Three key features of the individual ORF display are succinctness, pattern display, and data linking. The succinct representation of the tool responses is essential to allow the annotator a quick survey of the salient information. Some of the information, such as the subject description and discriminant score, remains in textual form within the images, while details such as the location of the hits are graphically mapped onto the images. The exact positioning of hit patterns is important to the annotator, because it can determine the relevance of the match. The comparative match display is maximized in MAGPIE by displaying results from tools that are based on similar algorithms, data sets, and evidence types (i.e., amino acid, DNA, and motif) atop each other.

Even though a succinct representation of the responses can be very useful, it is equally important to be able to access the original responses, database entries, associated metabolic pathways, and other related information. Most image types therefore contain configurable hyperlinks to Web-based information, including SRS-6, ExPASy, or the NCBI Web services.

Finally, simulation images assist the wet-lab researcher during the verification process. MAGPIE analyses, like all results generated by automated genome analysis and annotation systems, are only computer models, which often need verification through a biochemical experiment.

2.1 The Hierarchical MAGPIE Display System

In the following paragraphs, we introduce the various graphical displays that are implemented in MAGPIE. The MAGPIE hierarchy is reflected in the set of graphical images. The resolution of the images increases over several levels until an almost single base pair resolution is reached. Fig. 1 shows the hierarchical connection between the different images. Depending on the state of the analyzed sequence, not all images are present; some images are mutually exclusive.

2.1.1 Whole Project View

MAGPIE can track sequencing efforts from single sequence reads to fully assembled clones and genomes. Based on mapping information, which identifies the relationship of clones in the sequencing project, MAGPIE can automatically generate and display non-redundant sequence(s) and the non-redundant gene set. Figs. 2 and 3 show examples of the images that display the summary "whole project view" page of the Sulfolobus solfataricus genome. Acting as a starting point for the annotator, this single page contains hyperlinked images representing all of the contiguous sequences (contigs) in the MAGPIE project. Contigs are drawn to scale and color-filled according to their MAGPIE state. The states used in the Sulfolobus MAGPIE project are primary, linking, polishing, and finished, but users could define other states for other projects. The colors for the states can be set in the user preference files. Users can also define in which of the states the sequence is resolved well enough for MAGPIE to assemble larger contigs from individual clones. In the example of the Sulfolobus MAGPIE project, this would be sequence in state "polishing". Overlapping contigs are appropriately positioned in the image, and the areas of overlap are grayed out to denote redundancy. In keeping with the color usage described earlier, blue outlines denote that the contigs are in their normal (forward) orientation, while those outlined in red were reverse complemented in order to fit into the genome assembly. White text on a black background denotes the presence of manual annotation on the labeled contig. In Fig. 2, it is discernible by the black label text and gray fill that clone 1910-127 is the only clone without annotation. This clone was not annotated because it is completely redundant.

Fig. 1. Image hierarchy. From top to bottom, more detail is shown about smaller subsequences. Where connecting lines exist, the images are either juxtaposed or click-through. The images are labeled according to their figure numbers.

Fig. 2. Overlapping clone cluster in a MAGPIE project. Each sequence is hyperlinked to its MAGPIE report. All of the sequences are filled in with green, denoting finished sequence, and shaded where redundant. Red-outlined sequences are reverse complemented in the assembly. White letter labels denote the presence of annotations in the sequence (see color figures).

Fig. 3. Partially assembled fragments of BAC b07zd03-b28. Fragments are sorted by size and hyperlinked to their respective MAGPIE reports. The yellow fill indicates that sequences are in the linking state (see color figures).

Even though overlaps between MAGPIE contigs are calculated in order to remove redundancies, MAGPIE is not meant to be a full-fledged assembly engine. For contigs to be considered overlapping, the user must specify that two clones are neighbors. This is information usually known from the clone-mapping phase or derived from self-identity searches in MAGPIE. The user specification avoids spurious assemblies that may be taken for granted. The extent and orientation of the sequence overlap is determined by running a FastA similarity search. Based on the percent similarity and length criteria set for the project, the overlap is either accepted or rejected. If the overlaps do not occur at the very ends of the contigs, the match is also rejected. This helps to avoid linking contigs based on repetitive regions. Based on the one-to-one neighbor information, larger contigs are formed using the following logic:

(1) let S be a set of sets, each containing a single contig
(2) let N be the set of neighbor relationships
(3) while N is not the empty set
    a) pick a relationship R(C1,C2) from N
    b) N = N - R
    c) find S1, the set in S to which the contig C1 belongs
    d) find S2, the set in S to which the contig C2 belongs
    e) S = (S - S1 - S2) ∪ (S1 ∪ S2)

The exclusive sets in S are implemented in an array format. This format provides constant-time union of sets and logarithmic-order time set membership determination. A consensus sequence for each contig set in S is determined. When conflicts between overlapping contigs occur, the better resolved sequence of the contig in a more complete state takes precedence. The ORFs on the consensus sequence are identified and a table of ORF equivalencies across all the contigs is created subsequently.

At step 1, each contig is in its own set. Because the data set consists of non-redundant contigs, the sets in S are by default disjoint. Steps 2, 3a, and 3b iterate through all known connections. The two sets determined in steps 3c and 3d can be combined when a contig from the first set is connected to one from the second set, which is valid because of the transitive property of contig connections. In step 3e, the two now-connected sets are removed from S and replaced with their union. The resulting disjoint sets of contigs are called “supercontigs” in MAGPIE, because they may consist of more than one clone.
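The steps above correspond closely to a disjoint-set (union-find) structure. The sketch below is one way to realize them, not MAGPIE's actual code: a parent-pointer array with union by size links two roots in constant time and finds a contig's set in logarithmic time, which matches the complexity quoted above. Whether MAGPIE uses exactly this array layout is an assumption, and the function and variable names are illustrative.

```python
def build_supercontigs(contigs, neighbors):
    """Group contigs into supercontigs from pairwise neighbor relationships."""
    index = {name: i for i, name in enumerate(contigs)}
    parent = list(range(len(contigs)))   # step 1: every contig is its own set
    size = [1] * len(contigs)

    def find(i):
        # Follow parent pointers to the representative of the set.
        while parent[i] != i:
            i = parent[i]
        return i

    for c1, c2 in neighbors:             # steps 2, 3a, 3b: consume each relationship once
        r1 = find(index[c1])             # step 3c
        r2 = find(index[c2])             # step 3d
        if r1 == r2:
            continue                     # already in the same supercontig
        if size[r1] < size[r2]:
            r1, r2 = r2, r1
        parent[r2] = r1                  # step 3e: replace the two sets by their union
        size[r1] += size[r2]

    groups = {}
    for name in contigs:
        groups.setdefault(find(index[name]), []).append(name)
    return sorted(sorted(group) for group in groups.values())


print(build_supercontigs(["A", "B", "C", "D", "E"],
                         [("A", "B"), ("B", "C"), ("D", "E")]))
```

Attaching the smaller tree under the larger one keeps every tree shallow, which is where the logarithmic bound on membership lookups comes from.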

The overlap and equivalency information is used in the images described below to propagate ORF information across equivalent sequence feature displays. The algorithm has been used in the Sulfolobus solfataricus P2 genome project to successfully assemble 110 BAC, cosmid, and lambda clones into a single non-redundant contig.

2.1.2 Coding Region Displays

A number of different image types represent the ORF evidence with increasing levels of detail. Sometimes it is necessary to see the actual evidence, e.g., for decision making during the manual annotation process. At other times the larger context of the ORFs is more useful; in this case, a detailed display containing all MAGPIE evidence would be overcrowded. These needs necessitate multiple representations of the same data. The varying levels of evidence abstraction are the most powerful part of the MAGPIE graphical environment.

A user may specify a wide variety of tools to be run against all contigs in a particular MAGPIE state. Tools used by MAGPIE can also produce periodically updated results, adding to the dynamic nature of the evidence. A user may wish to store all of the MAGPIE-generated images or create them on demand. This depends on available disk space, CPU power, and the frequency with which contigs and tool outputs are updated.

As shown in Fig. 4, the same code is used to generate both images and their image maps. When the CGI is called, it generates the necessary HTML content on the Web page by setting the image maker command-line arguments to print the image to standard output. Standard output is redirected to a port on the server, and the image URL points to that same port. The CGI then redirects standard error to the former standard output. It includes (as opposed to launching as a separate process) the image-making script with the appropriate arguments. When the script prints the image to the standard output stream instead of to a file, it is configured to print the image map to standard error. The stderr data stream is now redirected to the client's HTML page. Therefore, no temporary files are created, even though two output streams are used simultaneously.

All scripts that generate a graphical representation of a contig and its ORFs may generate more than one image for the sequence. If the combined contig length and scale factor exceeds the maximum image width, the image is split into multiple “panes”. The number and size of the images must be determined before any drawing takes place. This allows the drawing canvases to be allocated in the program. The information is also used to create the required number of image references in the HTML pages. The height of the image is fixed, because it depends entirely on fixed parameters. The image width is variable because sequences are represented horizontally. The width is a function of the sequence length times the scale factor, plus constant elements such as border padding. When multiple panes are required, all but the last image have the maximum width. If the last image would be ten or fewer pixels wide, it is merged into the preceding image. This slightly exceeds the maximum permitted width, but avoids an unintelligibly small sequence display.

[Figure 4 diagram: Web client, CGI (title and info, <img src="host:port">, <imagemap>), image maker]

Fig. 4. Fileless HTML and image generation. Dashes represent client requests. The upper curved line represents the standard error output stream of the image maker redirected through the CGI standard output stream. The lower curved line represents the former's standard output redirected to the open port specified by CGI.
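The pane calculation can be sketched as follows. This is an illustration under stated assumptions, not MAGPIE's code: the scale is expressed as bases per pixel, the `padding` value of 20 pixels is invented, and the ten-pixel rule is interpreted as applying to the sequence portion of the last pane. The default maximum width of 1,000 pixels comes from the text.

```python
def pane_widths(seq_length, bases_per_pixel, max_width=1000,
                padding=20, merge_threshold=10):
    """Return the pixel width of each pane needed to draw one sequence."""
    total = -(-seq_length // bases_per_pixel)   # sequence pixels, rounded up
    per_pane = max_width - padding              # sequence pixels that fit in one pane
    full, remainder = divmod(total, per_pane)
    widths = [max_width] * full                 # all but the last pane use the maximum width
    if remainder == 0:
        return widths
    if widths and remainder <= merge_threshold:
        widths[-1] += remainder                 # merge a tiny last pane into its predecessor
    else:
        widths.append(remainder + padding)
    return widths


print(pane_widths(100_000, 50))   # three panes
print(pane_widths(98_250, 50))    # tiny remainder merged into the second pane
```

Computing the widths up front, before any drawing, is what lets the canvases be allocated and the right number of image references be written into the HTML page.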

2.1.3 Contiguous Sequence with ORF Evidence

Fig. 5 displays a sequence and its features with the highest degree of data abstraction. Images of the type shown in Fig. 5 summarize the evidence for all of the ORFs in a contig in all six reading frames. They also show additional genomic features, which may be located around coding regions: promoters, terminators, and stop codons. Links to the corresponding ORF reports are provided via the image map. ORFs are labeled sequentially from left to right. In our example from the Sulfolobus project, the 100 amino acid residue cutoff is stated in the lower left-hand corner of Fig. 5. Inter-ORF regions are analyzed separately by MAGPIE using a set of scoring criteria that is different from the one defined for large-ORF regions. The goal of the inter-ORF analysis is to identify small coding regions (e.g., small proteins and RNA-coding regions). The names of the potential small coding regions meeting the user-defined criteria are denoted with the prefix “n”. Clicking on a displayed small coding region brings up a screen that displays the evidence as shown in Fig. 9. The user must confirm that the small sequence segment is indeed a coding region. Once confirmed, the identifiers of the small ORFs are displayed with the “s” prefix. This naming convention differentiates types of small ORFs without the need for renaming ORFs when small ORFs are verified to be coding.

There are several aspects to ORF coloration. Gray-shaded boxes indicate ORF suppression when users deem particular ORFs to be non-coding. This may be when the ORF is more than a certain percentage inside another ORF (typically completely contained in an ORF on the opposite DNA strand), or when the ORF shows an unusual amino acid composition. MAGPIE can be configured to automatically suppress ORFs for either reason, or for lack of database matches. ORFs with Xs (so-called Saint Andrew's Crosses) through them have been annotated. A white background denotes that the assigned function is “putative”, “hypothetical”, or “uncharacterized”. All three words stand for unknown function. These functional assignments (or lack thereof) may be carried over from similar ORFs in other genomes. The equivalencies used were determined in the creation of Figs. 2 and 3. Outline colors for ORFs starting with ATG, GTG, and TTG are blue, green, and red, respectively. The different colors are only used if they are defined by the user to represent the valid start codons for the organism. When the ORF starts upstream of the current contig and the start codon is unknown, the outline is black. Black also indicates the use of alternative start codons, which may occur in some organisms.

Fig. 5. Contiguous sequence with open reading frames displayed. Boxes on the six reading frame lines represent possible genes. The boxes are “x”ed when annotated. Light “x”ed boxes have annotations described as hypothetical or uncharacterized. Sub-boxes in unannotated genes indicate composition characteristics, plus the best level of protein, DNA, and motif database hits. Grayed-out genes have been suppressed. Background shading and hyperlinked arrows in the corners indicate neighboring sequence overlaps. Boxes with labels that start with “n” are possible genes shorter than the specified minimum length (see color figures).

ORFs that are not validated by a user are split into three by two isometric blocks. These blocks can be colored to indicate the presence and strength of certain evidence. This is indicated in the “ORF traits” section of the Fig. 5 legend. The blocks in the upper half denote calculated sequence characteristics. Blocks “f” and “a” indicate on/off traits. A blue “f” block denotes that the codon usage in the ORF is within 10% of frequencies observed for this organism. A blue “a” block denotes that the purine (A+G) composition of this ORF is greater than 50%. This is known to be an indicator for a good coding likelihood in many prokaryotes (CHARLEBOIS et al., 1996). For organisms with a G+C% greater or smaller than 50%, the “c” block in the upper right corner of the block represents another indicator for coding sequence: G+C% codon compensation. The calculation of this parameter is based on the third (and to a lesser degree second) position codon wobble. Compensation at confidence level 1 occurs when the combined frequency of G+C% compensation is highest in the third base of the codons. For level 2, the second base G+C% compensation is the highest, followed by the compensation for the third base of the codons.
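The raw quantities behind the “a” and “c” blocks are simple base counts. The sketch below computes the purine fraction of an ORF and the G+C fraction at each of the three codon positions; the function names are illustrative. The text does not specify precisely how the per-position values are turned into compensation levels, so the sketch deliberately stops at the raw numbers.

```python
def purine_fraction(orf):
    """Fraction of A and G bases in an ORF (the quantity behind the "a" block)."""
    orf = orf.upper()
    return sum(orf.count(base) for base in "AG") / len(orf)


def gc_by_codon_position(orf):
    """G+C fraction at codon positions 1, 2, and 3 of an in-frame ORF."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3      # ignore a trailing partial codon
    fractions = []
    for offset in range(3):
        bases = orf[offset:usable:3]      # every base at this codon position
        fractions.append(sum(bases.count(b) for b in "GC") / len(bases))
    return tuple(fractions)


orf = "ATGGCGAAATAA"
print(purine_fraction(orf) > 0.5)         # on/off "a" trait as described in the text
print(gc_by_codon_position(orf))
```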

The blocks in the lower half denote database search results and the level of confidence in three levels, level one indicating the strongest evidence. The colors for the three levels are blue, cyan, and gray, respectively. The lower half trait levels are determined by comparing extracted similarity analysis scores with the user-specified criteria. The “p” block indicates the highest level of protein similarity found through the database searches. The “d” block indicates the highest level of DNA similarity found. The “m” block indicates the best level of sequence motifs found (e.g., scored Prosite hits) (KOLAKOWSKI et al., 1992). After learning the representation scheme, the annotator can quickly see the nature and strength of coding indicators for all ORFs in a sequence.

Other indicators for transcription include Shine-Dalgarno motifs (SHINE and DALGARNO, 1974), promoters, and terminators. These features are displayed in the appropriate reading frame as small black rectangles, green triangles, and red sideways T's, respectively. In keeping with the representational consistency, the candidate with the highest score for each of these features around any ORF is colored in a darker shade. Shine-Dalgarno motifs are found by matching a user-defined subsequence, which represents the reverse complement of the 3' end of the organism's 16S rRNA molecule. Promoter and terminator searching has so far been implemented for archaeal DNA sequences (GORDON and SENSEN, unpublished data).
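The Shine-Dalgarno search described above reduces to deriving a motif by reverse complementation and then matching it along the sequence. The following is a minimal sketch: the rRNA tail used in the example (written as DNA) and the test sequence are illustrative values, not data from the Sulfolobus project, and exact matching is assumed because the text does not describe any mismatch tolerance.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def find_motif(sequence, motif):
    """Return the 0-based start positions of every (possibly overlapping) exact match."""
    sequence, motif = sequence.upper(), motif.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


rrna_tail = "CCTCCT"                      # hypothetical 3' end of a 16S rRNA, as DNA
motif = reverse_complement(rrna_tail)     # the user-defined subsequence to search for
print(motif, find_motif("TTAGGAGGTTATGAAA", motif))
```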

Stop codons are marked with orange ticks within the reading frames. The location of stop codons is determined while the image is being created as follows: in each forward translation frame, the search for the next stop codon begins at the first in-frame base represented by the next pixel, thus increasing the calculation efficiency without sacrificing information in the display. For example, if a stop codon is found at base 23, and each pixel represents 50 bases, the search for the next stop codon in that frame starts at base 51. This is because 51 is the first in-frame triplet in the next pixel, representing the [50, 100] range. Because stop ticks are never drawn repeatedly in the same pixel, no rendering effort is wasted. Another display shortcut is to search for the reverse complements of the stop codons on the forward strand in the same manner when rendering the negative DNA strand. Finding stop codon reverse complements saves the effort of reverse complementing the whole sequence and inverting the information again for graphical rendering.
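The pixel-skipping search can be sketched as below. This is an illustration rather than MAGPIE's code, and it assumes 0-based coordinates with a codon assigned to the pixel that contains its first base. Under that assumption, a frame-0 stop codon occupying bases 21 to 23 at 50 bases per pixel causes the search to resume at base 51, in line with the example above. The function and constant names are illustrative.

```python
STOPS = {"TAA", "TAG", "TGA"}
# Reverse complements of the stop codons, used for the negative strand
# so that the sequence itself never has to be reverse complemented.
REVERSE_STOPS = {"TTA", "CTA", "TCA"}


def stop_tick_pixels(sequence, frame, bases_per_pixel, codons=STOPS):
    """Return the pixel columns that need a stop tick in one reading frame.

    `frame` is 0, 1, or 2. After a hit, the scan jumps to the first
    in-frame codon of the next pixel, so each pixel is reported at most once.
    """
    pixels = []
    pos = frame
    while pos + 3 <= len(sequence):
        if sequence[pos:pos + 3] in codons:
            pixel = pos // bases_per_pixel
            pixels.append(pixel)
            next_start = (pixel + 1) * bases_per_pixel
            pos = next_start + (frame - next_start) % 3   # realign to the frame
        else:
            pos += 3
    return pixels


bases = list("GCA" * 40)
bases[21:24] = "TAA"      # first stop in pixel 0
bases[30:33] = "TAG"      # second stop in pixel 0, skipped by the search
bases[51:54] = "TGA"      # stop in pixel 1
print(stop_tick_pixels("".join(bases), 0, 50))
```

Passing `REVERSE_STOPS` as `codons` applies the same scan to the negative strand while still walking the forward sequence.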

On the ends of the lower ruler, labels can be added to indicate information about the priming sites that flank the insert of the clone. Fig. 5 shows that the insert is in sp6 to t7 orientation. The information about the ends of a clone is stored in the configuration file that defines the clone neighboring relationships.

The image also displays the overlaps between contigs, which were determined during the generation of Figs. 2 and 3. Overlaps are denoted with arrows in the upper left and right hand corners of the image. The extent and orientation of the overlaps is indicated by the shading of the background between the upper and lower rulers in blue (for forward orientation of the neighbor) or red (for reverse complement orientation of the neighbor).

2.1.4 Contiguous Sequence with Evidence

Fig. 6 shows an example of a contiguous sequence display with evidence. Images of this type are similar in layout to the image shown in Fig. 5, reducing the user’s learning curve. This type of image is used when the sequence is in a primary or linking state, i.e., when multiple sequencing errors and ambiguities still may exist in the sequence. The key difference between Figs. 5 and 6 is that evidence in Fig. 6 is not abstracted into ORF traits. Evidence is displayed at its absolute position on the DNA strand. Fig. 6 demonstrates the usefulness of this display. The cyan (level 2) and blue (level 1) evidence is on the -2 DNA translation frame, and a frameshift in the 3' end of the sequence is likely, because of the neighboring ORFs on the -3 frame. Higher ranked evidence always appears above lower ranked evidence. In this way, the more pertinent information is displayed in the limited screen area available. The evidence markings are hyperlinked to the pattern matches and alignments that they represent. The ORFs are linked to a Fig. 9 view of the evidence that falls within the same boundaries on the same DNA strand.

[Figure 6 image: ruler from base 1 to 2875; reading frames +1, +2, +3, -3, -2, -1; legend: minimum length = 100 residues; start codons = ATG, GTG, TTG.]

Fig. 6. Contiguous sequence with open reading frames and evidence displayed. Evidence and ORFs are displayed in their frames and locations. This facilitates easy recognition of frameshifts and partial genes. Other characteristics are the same as in Fig. 5 (see color figures).

These images can also be used to find previously unrecognized assembly overlaps. To identify potential overlaps, a MAGPIE sequence database search of each contig against all contigs in the project is performed, and self-matches of contigs are suppressed in the subsequent analysis. Any matches potentially mating two contigs are then visible. This kind of information would not be represented in Fig. 5.

Fig. 8. Evidence summary. Evidence is grouped by type, and displayed as one line for all hits from a tool. Better evidence is darker and closer in the foreground (see color figures).

2.1.5 ORF Close-up

The main purpose of Fig. 7 is to display a close-up of the ORF and surrounding features. Links to the overlapping ORFs are provided, so that the user can check whether the inner ORF or the outer ORF is an artifact, and to provide indications for frame shifts, which might result in slightly overlapping ORFs such as ORF number 009 in the example. The coloration of the ORFs is analogous to Fig. 5. The close-up is useful for smaller ORFs, which otherwise could be difficult to read. The green arrow indicates the orientation of the ORF. ORFs on the negative DNA strand are reverse-complemented in Figs. 8 and 9 to display all evidence in 5'-3' orientation, thus they may be displayed in a different orientation than in Fig. 7.

[Figure 7 image: reading-frame display with a legend giving the minimum ORF length, the start codons (ATG, GTG, TTG) and the ORF traits.]

Fig. 7. ORF close-up. Displays the exact start and stop coordinates, and hyperlinked overlapping ORFs. Gene labels and trait sub-boxes are potentially easier to read than in Fig. 5 (see color figures).

Because of the simplicity of this image, its width is fixed. Unlike the situation in most other images, the scale factor is a function of the sequence length divided by the fixed width. The rulers have major markings every 100 pixels. For example in Fig. 7, the scale is (9900-9500)/100, or 4 bases/pixel.

2.1.6 Analysis Tools Summary

Fig. 8 contains a summary of the location of evidence from all tools along an ORF. This provides an overview of which parts of the ORF have evidence. The image has a fixed width, and it is usually shown next to the similarly sized Fig. 7. Its height is dependent on the number of tools that yielded valuable responses. This requires that all of the evidence information is loaded into memory before the canvas allocation and drawing of the image is performed.


Fig. 9. Expanded evidence. Evidence is sorted by level, tool, score, and length. The first column links to the database ID of the similar sequence. The second displays the similarity location. It is linked to the original tool report. The third names the tool used. The fourth displays the tool’s scoring of the match. The fifth displays the Enzyme Commission Number, hyperlinked to further enzyme information. The last column displays the matching sequence description (see color figures).

The graphic can also be useful in determining the location of the real start codon. One can rule out the first start codon under certain conditions by using the rare and start codon indicators at the top of the image, combined with the fact that supporting evidence may only exist from a certain start codon onwards. By default, rare codons are the ones that normally constitute less than 10% of the encoding of a particular amino acid for the particular organism. Rare codons are colored according to the color scheme for start codons. Shine-Dalgarno sequences are denoted with a black rectangle near the start codon indicators. Similar to Fig. 6, highly ranked evidence is placed on top so that best results are always shown for any region of the ORF.
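The default rare codon rule (less than 10% of the encoding of an amino acid) can be illustrated with a short Python sketch. How MAGPIE derives its codon usage is not described here, so the input (a list of in-frame coding sequences), the function name `rare_codons`, the use of the standard genetic code, and the exclusion of stop codons are assumptions of this example.

```python
from collections import Counter
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rare_codons(coding_sequences, threshold=0.10):
    """Return codons used for less than `threshold` of their amino acid.

    `coding_sequences` is assumed to hold in-frame DNA strings for one
    organism. Amino acids that never occur are skipped, and stop codons
    are not reported (both are choices made for this sketch).
    """
    counts = Counter()
    for cds in coding_sequences:
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1

    per_amino_acid = Counter()
    for codon, n in counts.items():
        per_amino_acid[CODON_TABLE[codon]] += n

    rare = set()
    for codon, aa in CODON_TABLE.items():
        total = per_amino_acid[aa]
        if aa != "*" and total and counts[codon] / total < threshold:
            rare.add(codon)
    return rare
```

For example, if lysine is encoded 19 times by AAA and once by AAG, AAG accounts for 5% of the lysine codons and is flagged as rare.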

2.1.7 Expanded Tool Summary

Fig. 9 displays in detail all of the evidence accumulated for the ORF during the MAGPIE analysis. The top ruler indicates the length of the translated ORF in amino acids. The bottom ruler indicates the position of the ORF within the contig. The rulers are numbered from right to left when the ORF is on the reverse-complement DNA strand. In this way the evidence is always presented in the same direction as the ORF translation. The evidence is separated into those database entries that have at least a single level one hit, at least one level two hit, and others. The evidence coloring and order of placement is identical to that in Fig. 5.

This kind of image can be created for any subsequence. It is also used to display evidence in the confirmation pages for small inter-ORF features. For every database subject, the hit score is shown in the third text column, and the database subject description in the last column, thus consistently high scores and consistent descriptions are easy to spot.

The three types of representational display link to more in-depth information. The first linked data in text form are the accession numbers for the database subjects that hit against the query sequence in the first text column. The accession numbers are linked to the original database entries, e.g., GenBank or EMBL, in accordance with a link-configuration file. For example, at NRC, the MAGPIE links connect to information provided by the Sequence Retrieval System (http://www.lionbio.co.uk) of the Canadian Bioinformatics Resource (http://www.cbr.nrc.ca). Other sites may configure these links to point to Entrez (http://ncbi.nlm.nih.gov/Entrez) at NCBI. If an Enzyme Commission (EC) number is associated with the database subject, the number is placed in the fourth text column and linked to a MAGPIE page listing the metabolic pathways in which the enzyme occurs.


The last but most informative linked component is the colored match coverage. The quality and types of evidence are clear because of the different colors assigned to the respective tools and respective confidence levels. The color differentiation can also be used to display other differences. For example, if protein-level BLAST analyses are performed as separate tools against eukaryotes, archaebacteria, and bacteria, and the resulting responses are colored using slightly different hues, the commonality of the gene is visually evident. Placing the mouse over the area with similarity causes a message to appear on the browser’s status line. The message contains the exact interval of the similarity on both the query and subject sequences. The similarity display is hyperlinked to the original data in the text response.

The positioning of evidence rows in this image is more complex than in Fig. 8. The logic behind this display is as follows:

(1) Separate the evidence into sets where the tool and database id for the matches are the same.

(2) Separate the tool/id hit sets into three sets where the top level match in the tool/id set is either 1, 2, or 3.

(3) Order in a descending manner the tool/id match sets within each top match-level set by the user-specified tool rankings.

(4) Order tool/id sets within a tool ranking by score. If all scores are greater than one, order in a descending manner, otherwise order in an ascending manner (e.g., expected random probability scores).

(5) Within a score, rank in descending order tool/id sets by the total length of ORF intervals they cover, effectively giving longer hits higher priority.

(6) Within a length, sort alphabetically by hit description.

(7) Within a description, sort alphabetically by database identifier.

This fine level of sorting ensures predictability for the user. Evidence is sorted in terms of relevance and lexical ordering from top to bottom. In practice, the sorting is quite fast because a differentiation is usually made between tool/id sets in step 4. Steps one and two are only executed once. The system uses Perl’s built-in sort, which is an implementation of the all-purpose quicksort (HOARE, 1962) algorithm. All of the information about database search tools, scores and hit lengths is kept in hash tables for quick reference during sorting. The speed benefit of hash table lookup outweighs its space cost. MAGPIE is usually run on servers with the capacity to execute multiple analyses in parallel. Therefore, short-term requirements for large amounts of main memory are usually dealt with easily. The total height of the image can be calculated only after the number of tool/id sets and their ordering is determined. The image width is determined by keeping track of the longest value in each of the columns while the evidence is loaded. Drawing can only begin after the height and width are determined.
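The seven ordering rules listed above amount to a single multi-key sort. The following Python sketch shows one way to express them; it is not the Perl implementation used by MAGPIE. The `HitSet` record, its field names, and the convention that a larger number in `tool_rank` means a more preferred tool are hypothetical, and the "all scores greater than one" test of step 4 is assumed to apply per tool.

```python
from dataclasses import dataclass


@dataclass
class HitSet:
    """One tool/id evidence set (steps 1 and 2 are assumed already done)."""
    tool: str
    db_id: str
    top_level: int    # best confidence level in the set: 1, 2 or 3
    score: float
    covered: int      # total length of the ORF intervals covered
    description: str


def order_hit_sets(hit_sets, tool_rank):
    """Order evidence rows from top to bottom following steps 2 to 7."""
    # Step 4: a tool sorts descending only if all of its scores exceed one;
    # otherwise it is treated like a probability (smaller is better).
    descending = {}
    for hit in hit_sets:
        descending[hit.tool] = descending.get(hit.tool, True) and hit.score > 1

    def sort_key(hit):
        score_key = -hit.score if descending[hit.tool] else hit.score
        return (
            hit.top_level,          # step 2: level 1 sets first
            -tool_rank[hit.tool],   # step 3: preferred tools first
            score_key,              # step 4
            -hit.covered,           # step 5: longer coverage first
            hit.description,        # step 6
            hit.db_id,              # step 7
        )

    return sorted(hit_sets, key=sort_key)
```

Keeping the per-tool sort direction in a dictionary plays the role of the hash tables mentioned in the text: each comparison key is built from constant-time lookups.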

2.1.8 Base Composition

Images displaying characteristics of the whole DNA sequences, e.g., the Base Composition or the Assembly Coverage Figures, are usually drawn to the same scale as Fig. 4. They can be juxtaposed on top of each other to view them in context.

Fig. 10 displays two base composition distributions along the DNA sequence. In both graphs the colored region shows the actual base composition, which is determined by sliding a window along the forward DNA strand.

The G+C% graph is configured with a mean (as indicated by a horizontal line) equal to the average of the complete genomic sequence. Denoted on the left scale in the example, the Sulfolobus solfataricus P2 genome average is 35%. The graph allows the rapid detection of areas of unusual base composition. Such aberrations may for example indicate the presence of transposable elements or other genomic anomalies. In the chosen example though, there is no great variability, indicating a low likelihood for the occurrence of transposable elements in this region of the genome. The 34.1% average base composition for this sequence is denoted on the right hand scale.

As previously mentioned, the red purine (A + G%) composition graph can be used for many species to predict the strand containing the coding sequence. Lined up against Fig. 5, ORFs that are most likely non-coding can be detected. When the purine composition is greater than 50%, the coding ORFs are likely on the positive strand. They are likely to be on the negative strand when the composition is much below 50%.

[Figure 10 image: two filled composition graphs with percentage scales, one for G + C and one for A + G (purines); legend: sequence window for averages = 500.]

Fig. 10. Base compositions. Average A + G and G + C compositions are calculated using a sliding window of 500 bases. The moving average is displayed as a filled graph, both above and below the centerline average. Unusual G + C may indicate the presence of transposable elements. Majority A + G indicates coding strand in many organisms (see color figures).

The composition percentages are smoothed out by calculating averages with a sliding window of 500 base pairs. When each pixel represents 50 base pairs on the scale, and the window for composition averaging is 500, we can use the previous five pixels' totals (50 bases per pixel times 5 pixels = 250 bases) and the next five pixels' totals for each plotted pixel column to calculate the current average. At each pixel column, we add a new total and discard the total for the first pixel column in the sliding window. Calculating the average at any location requires only averaging 10 numbers. Otherwise, in our example, 500 values would need to be averaged at the sequence ends, where the look-ahead and the memory of the values for previous columns do not exist. To avoid invading whitespace, the sliding average peak is truncated when it is outside of the user-defined scale ranges.
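The running-total scheme described above can be sketched as follows. This is a simplified Python illustration rather than the MAGPIE code: the function name `sliding_percent`, its parameters, and the way the window is clipped at the sequence ends are assumptions. The window for a pixel column covers the five columns before it and the five columns starting at it.

```python
def sliding_percent(seq, letters="GC", bases_per_pixel=50, half_window=5):
    """Percent composition per pixel column using running per-pixel totals.

    Each column's value averages the `half_window` columns before it and
    the `half_window` columns starting at it, clipped at the sequence ends.
    """
    starts = range(0, len(seq), bases_per_pixel)
    hits = [sum(seq[s:s + bases_per_pixel].count(b) for b in letters)
            for s in starts]
    sizes = [len(seq[s:s + bases_per_pixel]) for s in starts]

    # Running totals for the window of the first column.
    run_hits = sum(hits[:half_window])
    run_size = sum(sizes[:half_window])

    percents = []
    for col in range(len(hits)):
        percents.append(100.0 * run_hits / run_size)
        entering = col + half_window   # column that joins the next window
        leaving = col - half_window    # column that drops out of it
        if entering < len(hits):
            run_hits += hits[entering]
            run_size += sizes[entering]
        if leaving >= 0:
            run_hits -= hits[leaving]
            run_size -= sizes[leaving]
    return percents
```

Each step adds one pixel total and discards one, so the cost per plotted column is constant regardless of the window size. Passing `letters="AG"` gives the purine graph instead of the G + C graph.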

2.1.9 Sequence Repeats

Fig. 11 indicates the portions of the sequence that are repeated in the project. MAGPIE calculates families of repeated sequences sharing a minimum number of contiguous bases. By default, the minimum number is 20. Repeats are sorted into families of matching subsequences, which are further sorted by size in descending order. The matching sequence name is in the right hand column, while the location and size of the match are displayed as filled boxes under the ruler. Red, blue, and green boxes represent forward, reverse complement, and complement matches, respectively.
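The idea of grouping subsequences that share a minimum number of contiguous bases can be illustrated with an exact k-mer index. This Python sketch is not the MAGPIE repeat finder, whose algorithm is not described here: it only reports forward-strand exact matches of a fixed length, and the function name `repeated_kmers` is hypothetical.

```python
from collections import defaultdict


def repeated_kmers(seq, k=20):
    """Map each k-mer that occurs more than once to its 0-based positions.

    Forward-strand exact matches only; reverse complement and complement
    matches would need the corresponding transformed k-mers indexed too.
    """
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return {kmer: positions
            for kmer, positions in index.items()
            if len(positions) > 1}
```

Holding every k-mer in a dictionary trades memory for speed, which is in the same spirit as the faster, memory intensive repeats search mentioned in the text.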

When many repeats occur in a sequence, the images are very tall. In these instances, the dimensions may exceed the maximum image dimensions that can be shown by the browser. The stored image is loaded into a Java applet with scroll bars in order to overcome this browser limitation. A faster, memory intensive repeats search has recently replaced the original repeats finder. The new repeats finder exports the repeats information to enable specialized viewing of the data by Java applets accepting a special data format.


[Figure 11 image: list of repeat families, each with its family number and matching sequence names; match locations are drawn as filled boxes under the ruler.]

Fig. 11. Sequence repeats. The image has scrollbars (as part of an applet) because of its large dimensions. Repeats are sorted into families of matching subsequences, which are further sorted by size in descending order. The matching sequence name is in the left part of the right hand column, while the location and size of the match are displayed as filled boxes under the ruler. Red, blue and green boxes represent forward, reverse complement, and complement matches (see color figures).

2.1.10 Sequence Ambiguities

Fig. 12 displays the location of ambiguous bases in a contig. This image can be used in the polishing stage of the DNA sequencing project. It is usually viewed atop Fig. 13, which provides the assembly context. The generation of this image requires assembly information from the Staden package (STADEN et al., 1998). The scale in the center denotes the number of base pairs in the sequence. Red vertical ticks on the upper line represent ambiguities where there is only positive strand coverage. Blue ticks on the lower bar represent ambiguities where only the negative strand is sequenced. When an ambiguity exists in a region of double stranded coverage, the tick appears on the center scale. As an exception to the consistent use of color, this tick is colored red for better visibility. The total number of ambiguities is displayed in the lower-left corner. In our example, the second pane of the display is shown. The number of ambiguities (two) is valid for this pane only; there may be additional ambiguities in the first pane.

[Figure 12 image legend: Ambiguities this graph: 2.]

Fig. 12. Ambiguity information. This graph starts at base 50,001 because it is the second pane for a 71,519 base sequence limited to 50,000 bases per image. If the ambiguous base has been sequenced on the forward, reverse, or both strands, the tick is displayed on the top, bottom, or center line, respectively. The displayed ambiguities total is for the pane, not the sequence as a whole (see color figures).

[Figure 13 image: Assembly Info; ruler from base 1 to 38987; coverage scale from 0 to 10; negative strand coverage; coverage means (+/-/both): 1.9/1.7/3.6.]

Fig. 13. Assembly information. Histograms of average positive and negative strand assembly coverage are above and below the centerline. Breaks in the blue and red bars indicate gaps in the positive and negative strand coverage. Genomic neighbors are indicated by background shading as in Fig. 5 (see color figures).

2.1.11 Sequence Strand Assembly Coverage

The image shown in Fig. 13 summarizes the quality of the sequence assembly for emerging genomes in MAGPIE. MAGPIE can extract assembly information from the output of the Staden package assembler gap4. As in Fig. 5, background shading denotes the extent and orientation of the neighboring contigs.

Average coverage multiplicity is the average number of times any base of the contig has been sequenced. This number is separately calculated for both strands. The two values and the total for the contig are displayed at bottom-center of the image. The green area in the center of the image can be interpreted as two separate histograms, the one above the ruler quantifying the average coverage on the positive strand, while the one below the center ruler quantifies the average coverage on the negative strand. The histogram bars are truncated if greater than ten-fold coverage is reached. The histograms are used to determine the reliability of the data. This is useful where frame shifts or miscalled bases are suspected.

Further resolution of poorly covered regions is provided through the continuity of the large horizontal blue and red bars. A gap will appear in the blue or red bar if even a single base pair has not been sequenced on the forward or reverse strand, respectively. This allows the user to see how much DNA sequence polishing is required to double-strand the entire sequence assembly. The displayed overlap with other project sequences is once again useful. Gaps in the current sequence’s assembly may be less worrisome in regions overlapping with other contigs.

In order to create this image, the assembly information is read in chunks. The chunk size is equal to the scale factor for the image (50 bases/pixel by default). The base pairs on each strand are mapped onto a blank template string of 50 bases. Blanks in the template indicate gaps.

The average sequencing coverage for that pixel is calculated at the same time. The sequence averages are calculated by summing up the pixel totals. This provides major savings over the much larger individual base totals.
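The chunked coverage calculation can be sketched as follows. The sketch is written in Python and is not the MAGPIE implementation: the read representation (0-based, half-open `(start, end)` intervals for one strand) and the function name `pixel_coverage` are assumptions, and parsing of the Staden assembly output is left out.

```python
def pixel_coverage(reads, seq_len, bases_per_pixel=50):
    """Return (mean coverage, has_gap) for each pixel-sized chunk.

    `reads` holds 0-based half-open (start, end) intervals on one strand.
    A blank left in the chunk template marks a base that no read covers.
    """
    results = []
    for chunk_start in range(0, seq_len, bases_per_pixel):
        chunk_end = min(chunk_start + bases_per_pixel, seq_len)
        template = [" "] * (chunk_end - chunk_start)
        covered_bases = 0
        for start, end in reads:
            lo, hi = max(start, chunk_start), min(end, chunk_end)
            for pos in range(lo, hi):
                template[pos - chunk_start] = "x"
            covered_bases += max(0, hi - lo)
        results.append((covered_bases / len(template), " " in template))
    return results
```

The mean for the whole contig can then be obtained from the per-pixel values instead of from individual bases, which is the saving described in the text.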

2.1.12 Restriction Enzyme Fragmentation

Fig. 14 displays the location of restriction enzyme cuts on the insert. The MAGPIE user can define the set of restriction enzymes. The cloning vector and the orientation of the insert in the clone can be specified when the contig is added to the MAGPIE project. This information is taken into account during the fragment calculation. In the figure, cut locations are denoted as vertical ticks on each enzyme’s lines. The fragments are numbered from the 5' to the 3' end, including the vector sequence (which is not shown in this display). The vector is always oriented to the 5' end (i.e., left end) of the insert. Fragment numbers in green represent parts of the vector-free insert. Fragments labeled in yellow contain parts of the vector sequence. Fragments that contain only vector are not displayed in this figure, because the ruler only includes the range of the insert. Such fragments appear only in the agarose gel simulation described below. The numbering of fragments in the Fig. 14 display does not always start at number one, because of the out-of-sight vector fragments. The example clone is from a HindIII restricted library. The enzyme cuts the sequence at the very start and at the very end of the insert (i.e., there are no yellow-labeled fragments).
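The fragment numbering and labeling rules above can be illustrated with a small Python sketch. It is a simplification and not MAGPIE code: the clone is modeled as a linear construct with the vector placed 5' of the insert, only the forward strand is searched, and the names `cut_positions` and `labelled_fragments` are hypothetical.

```python
def cut_positions(seq, site, cut_offset):
    """0-based cut positions for every occurrence of a recognition site."""
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    return cuts


def labelled_fragments(vector, insert, site, cut_offset):
    """Digest the linear construct vector + insert and label each fragment.

    Fragments are numbered 5' to 3' starting at one, vector included, so
    the first insert-only fragment may carry a number greater than one.
    """
    construct = vector + insert
    cuts = cut_positions(construct, site, cut_offset)
    bounds = sorted({0, len(construct), *cuts})
    fragments = []
    for number, (start, end) in enumerate(zip(bounds, bounds[1:]), start=1):
        if end <= len(vector):
            label = "vector only"      # shown only in the gel simulation
        elif start < len(vector):
            label = "contains vector"  # yellow label
        else:
            label = "insert only"      # green label
        fragments.append((number, start, end, label))
    return fragments
```

For a HindIII site (AAGCTT, cut after the first A) that straddles the vector/insert junction, the first fragment is pure vector and the visible insert fragments start at number two, as described in the text.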

2.1.13 Agarose Gel Simulation

[Figure 14 image: ruler from base 1 to 41411; one line per restriction enzyme with cut-site ticks and numbered fragments; legend: codes are 5' to 3', with differentiation after cut.]

Fig. 14. Restriction sites. Ticks correspond to the location of cut sites for the restriction enzymes listed in the right hand margin. The fragments produced are labeled 5' to 3', with the undisplayed cloning vector on the 5' end. Fragments containing vector have yellow labels, otherwise they are green. The HindIII line has restriction sites at the ends of the sequence because the clone library was HindIII restricted. As a consequence, it has no yellow labeled fragments (see color figures).

Fig. 15. Agarose gel simulation. Fragment lanes and labels correspond to those in Fig. 14. The fragment migration is calculated using the specified standard marker “M” lane migrations. Fragments containing vector are colored yellow. Fragments composed entirely of vector are colored red. Where fragments are very close, luminescence is more intense and labels are offset for readability (see color figures).

Fig. 15 is a computer simulation of an agarose gel displaying the same restriction digests as Fig. 14. The width of the image is based on the number of restriction enzymes. The height depends on the user-configurable size of the agarose plate. Given that we know the theoretical fragment lengths, the hypothetical fragment migration distances are calculated using the reciprocal method (SCHAFFER and SEDEROFF, 1981). This is used instead of the less accurate logarithmic scale based on the sequence length. The reciprocal method calculates mobility as the inverse of the length. The exact relationship is governed by constants calculated from a least-squares fit of the marker mobility data. The reciprocal formula is:

$$(m - m_0)(L - L_0) = c$$

where $m$ is the mobility in the agarose gel and $L$ the fragment length.

The constants from the least-squares fit are $m_0$, $L_0$, and $c$. Like most of the settings for the graphical displays in MAGPIE, the marker mobility data are specified in a configuration file. The regression is performed using the traditional summation method as opposed to the matrix multiplication method. This method was chosen because of the limited efficiency of array manipulation in Perl 5.
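One way to obtain the three constants from marker data is to expand the reciprocal relationship into a form that is linear in the unknowns and solve the normal equations built from plain sums. The Python sketch below follows that route. It is an illustration under stated assumptions, not the MAGPIE regression code and not necessarily the exact fitting procedure of SCHAFFER and SEDEROFF (1981); the function names `solve3`, `fit_reciprocal` and `mobility` are hypothetical.

```python
def solve3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def fit_reciprocal(markers):
    """Fit (m - m0)(L - L0) = c to (length, mobility) marker pairs.

    Expanding the product gives m*L = m0*L + L0*m + k with k = c - m0*L0,
    which is linear in (m0, L0, k). The normal equations are accumulated
    as sums over the markers. At least three markers are needed.
    """
    rows = [(length, mob, 1.0) for length, mob in markers]
    targets = [length * mob for length, mob in markers]
    normal = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
              for i in range(3)]
    rhs = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    m0, l0, k = solve3(normal, rhs)
    return m0, l0, k + m0 * l0


def mobility(length, m0, l0, c):
    """Predicted mobility of a fragment of the given length."""
    return m0 + c / (length - l0)
```

Once the constants are fitted from the marker lanes, `mobility` gives the band position for each theoretical fragment length of a digest.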

The features of the gel are described here from top to bottom. First, the agarose percentage is displayed. This number is specified in the marker mobility configuration. The outside lanes marked “M” are the marker lanes. The inside lanes are numbered left to right in the top to bottom order of the restriction enzymes of Fig. 14. The horizontal gray line represents the location of the wells. The fragments in the marker lanes have their lengths displayed in the image margins. The bands of marker lane fragments are solidly colored to ensure they can be clearly observed. The bands in the enzyme lanes appear slightly diffused in order to more closely resemble the appearance of physical gels. These bands are also distinguishable from undifferentiated bands, which are more solid and overall brighter in color. When bands are close to each other in a lane, some labels for fragment numbers are offset in an attempt to maximize readability. As an example of these distinctions, observe fragments 16, 8, and 11 in lane 3 (an EcoRI digestion). Fragment 16 is isolated, has a diffused band, and a left justified label. On top of one another, fragments 8 and 11 are given a larger bright area. This highlights their overlap even though they occupy the same amount of space as fragment 16. The label for number 11 is offset to the right of number 8.

Fragments containing or entirely composed of vector sequence are also displayed. This is done in order to remain true to the physical manifestation. Partial vector fragments are colored yellow. Full vector fragments are colored red. This labeling is done so that fragments containing vector are not accidentally isolated from the real agarose gels. As would be expected, in the HindIII lane (number 2) there is a single fragment, number 1, which contains the entire vector.

3 Conclusions and Open Issues

The images shown in this text are extremely useful as a support tool for genome annotation. Without the images, the efficient scanning of genomic evidence would be much harder and in many cases probably impossible. More than 20 genome projects have used MAGPIE to date for the annotation of “their” genome.

Yet, the system cannot satisfy the need for more complex data queries, which will be a theme for at least the next decade. For example, most prokaryotic genomes are circular in nature, a fact that is not displayed in any of the MAGPIE graphics. While the administrator can configure particular features of the MAGPIE graphics, there is a limit to the flexibility. For example, a graphical display of a query like: “Display all the tRNA coding genes and the tRNA-synthetase coding genes in a genome and show potential relations between the two gene sets” cannot be served by the current MAGPIE environment. Refinement of the displays and the addition of input forms for eukaryotic features (GAASTERLAND et al., 2000) in MAGPIE will continue as more analysis and genomic annotation is done.

Acknowledgement

The authors wish to thank Dr. JOHN P. VAN DER MEER, Director of Research at IMB, for critically reading the manuscript.

4 References

ALTSCHUL, S. F., MADDEN, T. L., SCHÄFFER, A. A., ZHANG, J., ZHANG, Z. et al. (1997), Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. 25, 3389-3402.

BORODOVSKY, M., MCININCH, J. (1993), GeneMark: Parallel gene recognition for both DNA strands, Computers & Chemistry 17, 123-133.

BURGE, C., KARLIN, S. (1997), Prediction of complete gene structures in human genomic DNA, J. Mol. Biol. 268, 78-94.

CHARLEBOIS, R. L., GAASTERLAND, T., RAGAN, M. A., DOOLITTLE, W. F., SENSEN, C. W. (1996), The Sulfolobus solfataricus P2 genome project, FEBS Lett. 389, 88-91.

COOPER, A. (1995), About Face: The Essentials of User Interface Design. Foster City, CA: IDG Books Worldwide.

GAASTERLAND, T., SENSEN, C. W. (1996), Fully automated genome analysis that reflects user needs and preferences - A detailed introduction to the MAGPIE system architecture, Biochimie 78, 302-310.

GAASTERLAND, T., SCZYRBA, A., THOMAS, E., AYTEKIN-KURBAN, G., GORDON, P., SENSEN, C. W. (2000), MAGPIE/EGRET Annotation of the 2.9 Mb Drosophila melanogaster ADH region, Genome Research 10, 502-510.

GORDON, P., SENSEN, C. W. (1999), Bluejay: A browser for linear units in Java, in: Proc. 13th Ann. Int. Symp. High Performance Computing Systems and Applications, pp. 183-194.

HENIKOFF, S., HENIKOFF, J. G., PIETROKOVSKI, S. (1999), Blocks+: A non-redundant database of protein alignment blocks derived from multiple compilations, Bioinformatics 15, 471-479.

HOARE, C. A. R. (1962), Quicksort, Computer J. 5, 10-15.

KOLAKOWSKI, L. F., Jr., LEUNISSEN, J. A. M., SMITH, J. E. (1992), Prosearch: fast searching of protein sequences with regular expression patterns related to protein structure and function, Biotechniques 13, 919-921.

PEARSON, W. R., LIPMAN, D. J. (1988), Improved tools for biological sequence comparison, Proceedings of the National Academy of Science 85, 2444-2448.

SALZBERG, S., DELCHER, A., KASIF, S., WHITE, O. (1998), Microbial gene identification using interpolated Markov models, Nucleic Acids Res. 26, 544-548.

SCHAFFER, H. E., SEDEROFF, R. R. (1981), Least squares fit of DNA fragment length to gel mobility, Anal. Biochem. 115, 113-122.

SHINE, J., DALGARNO, L. (1974), The 3'-terminal sequence of Escherichia coli 16S ribosomal RNA: complementarity to nonsense triplets and ribosome binding sites, Proc. Nat. Acad. Sci. 71, 1342-1346.

STADEN, R., BEAL, K. F., BONFIELD, J. K. (1998), The Staden Package, Computer Methods Mol. Biol. 132, 115-130.