Page 1: [IEEE 2008 Cairo International Biomedical Engineering Conference (CIBEC) - Cairo, Egypt (2008.12.18-2008.12.20)] 2008 Cairo International Biomedical Engineering Conference - VisCHAINER:

Proceedings of the 2008 IEEE, CIBEC'08 978-1-4244-2695-9/08/$25.00 ©2008 IEEE

VisCHAINER: VISUALIZING GENOME COMPARISON

A. Othman1, A. Martin2, D. Butterstein2, M.I. Abouelhoda3,4

1 Lane Department of Computer Science and Electrical Engineering, West Virginia University, USA
2 Faculty of Engineering and Information Science, University of Ulm, Germany
3 Faculty of Engineering, Cairo University, Giza, Egypt
4 Nile University, Giza, Egypt

e-mails: [email protected], {Alfons.Martin, d.butterstein}@web.de, [email protected]

Abstract: Visualization of genome comparison data is valuable for identifying genomic structural variations and determining evolutionary events. Although there are many software tools with varying degrees of sophistication for displaying such comparisons, there is no tool for displaying dot plots of multiple genome comparisons. The dot plot mode of visualization is more appropriate and convenient than the traditional linear mode, particularly for detecting large-scale genome deletions, duplications, and rearrangements. In this paper, we present VisCHAINER, which addresses this limitation and displays dot plots of multiple genome comparisons in addition to the traditional linear mode. VisCHAINER is a stand-alone interactive visualization tool that effectively handles large amounts of genome comparison data.

I. INTRODUCTION

Since the sequencing of the first genomes, it has been believed that comparing genomic sequences to themselves or to one another is the key methodology for understanding how genomes function, organize, and evolve. The rationale is that the regions of similarity should share common functions among the compared genomes, while the regions that differ correspond to traits unique to the respective genomes. Accordingly, the main computational problem associated with whole genome comparison is to identify the regions of similarity and difference among the given genomes. To cope with the sheer volume of genomic data, a number of software tools were developed to accomplish the comparison task automatically; see [1, 2] for a review.

The output of early whole genome comparison tools was textual, but the resulting voluminous data made it difficult to identify and classify large segmental changes. It was soon recognized that graphical representations of whole genome comparison data significantly help biologists make sense of it. Hence, many visualization tools, either stand-alone packages or parts of comparison tools, were introduced to the biological community. These tools can be divided, according to the visualization mode, into three groups:

1) Tools providing dot plots. These include, among others, DOTTER [3], Gepard [4], GenoPix2D [5], and DNAVis [6].

2) Tools providing a linear representation of the genomes and the similar regions. These include, among others, ACT [7], GenomePixelizer [8], Mauve [9], GATA [10], and M-GCAT [2].

3) Tools providing both modes. To the best of our knowledge, these include CGAT [11] and REPuter [12]. However, the former visualizes solely pairwise comparisons and the latter visualizes repeats within a single sequence.

Unfortunately, the aforementioned tools have some limitations, particularly when visualizing multiple genome comparison data. On the one hand, the dot plot tools are limited to displaying pairwise genome comparisons, and the tools displaying multiple genome comparisons are restricted to the linear mode. That is, there is no tool that automatically displays dot plots of multiple genome comparisons. On the other hand, the linear mode visualization module is tightly built into most whole genome comparison tools, so that visualizing comparison data produced by other tools is not supported.

In this paper we present the program VisCHAINER, which is a stand-alone interactive visualization tool that provides both the linear and dot plot visualization modes. It can be used for visualizing pairwise as well as multiple genome comparisons.

This paper is organized as follows. In the following section, we introduce the basic terminology used in VisCHAINER. Section III explores the main features of VisCHAINER. Some implementation issues are addressed in Section IV. Conclusions and future work are given in Section V.

II. VisCHAINER AND CoCoNUT

VisCHAINER was originally developed to visualize the output of the program CoCoNUT [13], especially that produced by the module CHAINER [14]. Nevertheless, we stress that VisCHAINER is a stand-alone program and the user can display comparison data produced by other programs, provided that the input format is compatible with that of CoCoNUT.

Because of the relation between VisCHAINER and the CoCoNUT system, we briefly recall the main definitions and techniques used in CoCoNUT. (In fact, the techniques of CoCoNUT are generally used in most recent whole genome comparison tools, but with different implementation details; see [2] for a list of these tools.)


Fig. 1. (a) The set of fragments (above) and the resulting representative chains (bottom). (b) The same set of fragments as in (a), represented as rectangles in the plane. The chains $\langle 1, 4, 6 \rangle$ and $\langle 7, 8 \rangle$ are representatives of significant highest-scoring local chains, and are plotted in different colors.

A. CoCoNUT overview

To cope with large genomic sequences, CoCoNUT finds regions of high similarity using the anchor-based strategy, which is composed of three phases [13, 15]:

1) Computation of fragments (matches among genomic sequences, as defined below). This step is efficiently carried out using the enhanced suffix array [16].

2) Computation of highest-scoring chains of colinear fragments (chains are defined below). Each of these highest-scoring chains corresponds to a region of similarity. The fragments in each such chain are the anchors. These chains are efficiently computed using techniques from computational geometry.

3) Post-processing of the resulting chains, which includes, e.g., aligning the regions between the anchors of a chain using the standard dynamic programming algorithm.

The rationale for generating fragments is that the regions harboring fragments are regions of potential similarity, while regions containing no fragments are different and are excluded from further processing. To ensure that the fragments actually constitute regions of similarity and did not appear by chance, they are further processed by the chaining algorithms.
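The fragment-generation phase can be illustrated with a deliberately simple stand-in. The sketch below finds exact fragments between two sequences using a k-mer hash index and extends each seed to a maximal match. It is not the enhanced suffix array used by CoCoNUT; the function name `exact_fragments`, the parameter `k`, and the 0-based inclusive coordinates are assumptions made for illustration.

```python
from collections import defaultdict


def exact_fragments(s1, s2, k):
    """Return maximal exact matches of length >= k between s1 and s2.

    Each match is a tuple (l1, h1, l2, h2) of 0-based inclusive positions.
    Hypothetical helper: a hash-based substitute for a suffix-array search.
    """
    # Index every k-mer of the first sequence by its start positions.
    index = defaultdict(list)
    for i in range(len(s1) - k + 1):
        index[s1[i:i + k]].append(i)

    found = set()
    for j in range(len(s2) - k + 1):
        for i in index.get(s2[j:j + k], ()):
            # Skip seeds that can be extended to the left: the same match
            # is reported from its leftmost seed.
            if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
                continue
            # Extend the seed to the right as far as the characters agree.
            n = k
            while i + n < len(s1) and j + n < len(s2) and s1[i + n] == s2[j + n]:
                n += 1
            found.add((i, i + n - 1, j, j + n - 1))
    return sorted(found)
```

Hashing k-mers is easy to follow but uses more memory than a suffix array on genome-sized inputs, which is one reason a real tool would prefer the index structure cited above.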

B. Formal definitions of fragments and chains

For $1 \le i \le k$, let $S_i$ denote a string that represents one of the given $k$ DNA sequences or complete genomes. $S_i[l_i \ldots h_i]$ is the substring of $S_i$ starting at position $l_i$ and ending at position $h_i$. A fragment is a similar region occurring in the given genomes. This region is specified by the substrings $S_1[l_1 \ldots h_1], S_2[l_2 \ldots h_2], \ldots, S_k[l_k \ldots h_k]$. A fragment is called exact if $S_1[l_1 \ldots h_1] = S_2[l_2 \ldots h_2] = \cdots = S_k[l_k \ldots h_k]$, i.e., the substrings composing it are identical. If character mismatches, deletions, or insertions are allowed in the substrings composing the fragment, then we speak of a non-exact fragment.

Geometrically, a fragment $f$ of $k$ genomes can be represented by a hyper-rectangle in $\mathbb{N}_{\ge 0}^{k}$ with the two extreme corner points $\mathrm{beg}(f)$ and $\mathrm{end}(f)$. $\mathrm{beg}(f) = (l_1, l_2, \ldots, l_k)$ specifies where the fragment starts, at positions $l_1, \ldots, l_k$ in $S_1, \ldots, S_k$, respectively, and $\mathrm{end}(f) = (h_1, h_2, \ldots, h_k)$ specifies where it ends, at positions $h_1, \ldots, h_k$ in $S_1, \ldots, S_k$, respectively; see Figure 1.

Definition 2.1: Two fragments $f$ and $f'$ are colinear and non-overlapping if and only if $\mathrm{end}(f).x_i < \mathrm{beg}(f').x_i$ for all $1 \le i \le k$, where $p.x_i$ denotes the $i$-th coordinate of a point $p$.
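The geometric view and Definition 2.1 translate directly into a small data type. The following is a minimal sketch; the class name `Fragment` and the function name `precedes` are illustrative choices, not identifiers taken from CoCoNUT or VisCHAINER.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Fragment:
    """A fragment of k genomes, stored as its two extreme corner points."""
    beg: Tuple[int, ...]  # (l_1, ..., l_k): start position in each genome
    end: Tuple[int, ...]  # (h_1, ..., h_k): end position in each genome


def precedes(f, g):
    """True if f and g are colinear and non-overlapping, with f before g.

    Implements the condition end(f).x_i < beg(g).x_i for every genome i.
    """
    return all(h < l for h, l in zip(f.end, g.beg))
```

Note that the relation is ordered: `precedes(f, g)` and `precedes(g, f)` cannot both hold.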

Definition 2.2: A chain $C$ of fragments is a sequence of colinear and non-overlapping fragments $f_1, f_2, \ldots, f_\ell$, and its score is

$$\mathrm{score}(C) = \sum_{i=1}^{\ell} \mathrm{length}(f_i) - \sum_{i=1}^{\ell-1} g(f_{i+1}, f_i)$$

where $g(f_{i+1}, f_i)$ is the gap between $f_{i+1}$ and $f_i$, defined as the distance between $\mathrm{beg}(f_{i+1})$ and $\mathrm{end}(f_i)$ in the $L_1$ metric.
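As a worked instance of Definition 2.2, the sketch below evaluates the score of a chain given as an ordered list of fragments. The tuple layout `(beg, end, length)` and the function names are assumptions made for this example; the length of a fragment is taken as given rather than derived from its coordinates.

```python
def l1_gap(prev_end, next_beg):
    """L1 distance between end(f_i) and beg(f_{i+1})."""
    return sum(abs(b - e) for e, b in zip(prev_end, next_beg))


def chain_score(chain):
    """Score of a chain: total fragment length minus total gap cost.

    chain: list of (beg, end, length) tuples, already colinear and in order.
    """
    total = sum(length for _, _, length in chain)
    # Subtract the gap between each pair of consecutive fragments.
    for (_, prev_end, _), (next_beg, _, _) in zip(chain, chain[1:]):
        total -= l1_gap(prev_end, next_beg)
    return total
```

For two fragments of length 10 with end point (9, 9) and next start point (12, 15), the gap is 3 + 6 = 9 and the chain score is 20 - 9 = 11.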

Given a threshold $T$, CoCoNUT computes all local chains of score $\ge T$. These chains are the regions of high similarity.
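To make the thresholding concrete, here is a naive quadratic dynamic-programming sketch that finds, for every fragment, the best-scoring chain ending at it and reports those with score at least `threshold`. This is not the computational-geometry algorithm used by CoCoNUT, and it does not perform the clustering described in the next paragraph; the function name and the `(beg, end, length)` tuple layout are assumptions.

```python
def local_chains(fragments, threshold):
    """Return (score, chain) pairs with score >= threshold.

    fragments: list of (beg, end, length) tuples.
    chain: list of fragment indices, in chain order.
    """
    # Sorting by start point guarantees predecessors are processed first.
    order = sorted(range(len(fragments)), key=lambda i: fragments[i][0])
    best, prev = {}, {}
    for pos, i in enumerate(order):
        beg_i, _, len_i = fragments[i]
        best[i], prev[i] = len_i, None  # a chain may start at fragment i
        for j in order[:pos]:
            end_j = fragments[j][1]
            # Only colinear, non-overlapping predecessors may be chained.
            if all(e < b for e, b in zip(end_j, beg_i)):
                gap = sum(b - e for e, b in zip(end_j, beg_i))
                candidate = best[j] + len_i - gap
                if candidate > best[i]:
                    best[i], prev[i] = candidate, j
    result = []
    for i in order:
        if best[i] >= threshold:
            chain, j = [], i
            while j is not None:  # trace the chain back to its first fragment
                chain.append(j)
                j = prev[j]
            result.append((best[i], chain[::-1]))
    return result
```

The quadratic loop is acceptable for small examples only; it is shown because it mirrors the definitions directly.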

In Figure 1 (a, upper part), we show an example of fragments from two genomes in the linear mode, where the substrings composing each fragment are represented by bars and connected by a line. In part (b) of the figure, the same set of fragments is represented as two-dimensional rectangles in a 2D plot. The local chains exceeding a certain threshold, given in part (a, lower part) and in (b), are $\langle 1, 3, 6 \rangle$, $\langle 1, 4, 6 \rangle$, $\langle 7, 8 \rangle$, and $\langle 7, 9 \rangle$. However, the two chains $\langle 1, 3, 6 \rangle$ and $\langle 1, 4, 6 \rangle$ share the fragments 1 and 6, yielding the cluster $\langle 1, \{3, 4\}, 6 \rangle$. The cluster $\langle 7, \{8, 9\} \rangle$ represents the two local chains $\langle 7, 8 \rangle$ and $\langle 7, 9 \rangle$. To reduce the output size, we report the local chain of highest score in each cluster as the representative chain of this cluster.

Inversions can be taken into account by comparing the forward strand of one genome to the reverse complement strand of another genome.
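The reverse-complement comparison implies a coordinate conversion when matches are reported on the forward strand. The sketch below shows one way to do this, assuming 0-based inclusive coordinates and an unambiguous A/C/G/T alphabet; both function names are hypothetical.

```python
# Translation table for complementing nucleotides (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement strand of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_forward_coords(l, h, n):
    """Map the interval [l, h] on the reverse complement of a sequence of
    length n back to forward-strand coordinates (0-based, inclusive)."""
    return n - 1 - h, n - 1 - l
```

A match found at positions 0..1 of the reverse complement of a length-5 sequence therefore corresponds to positions 3..4 of the forward strand.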

III. FEATURES OF VisCHAINER

VisCHAINER displays fragments and chains produced by CoCoNUT. We stress, however, that VisCHAINER can visualize the output of any program, provided that it is given in the CoCoNUT format. For example, the output of the popular program BLASTZ [17], used for pairwise genome comparison, can be parsed to extract the score (length) and the boundaries of the similar regions. Fragment files are then produced in the CoCoNUT format to be displayed by VisCHAINER; see the program manuals for details.

A. VisCHAINER Sessions

VisCHAINER is based on the idea of sessions, where each session encompasses the files of a single comparison task.


Each session is independent of the other sessions. The rationale behind supporting multiple sessions is to enable the user to display different comparisons of different species within the same program and to have a clear overview of all of them. The user can save a whole session (i.e., the files opened in this session) and re-open it later.

B. Displaying comparisons in linear mode

In the linear mode (also known as bar view or alignment view), each genomic sequence is represented by a line. The substrings composing a fragment are drawn as bars on these lines, and the bars composing a fragment are connected by a line. In the example of Figure 1 (a, upper part), we sketch some fragments in the linear mode. A chain composed of the fragments $\langle f_1, f_2, \ldots, f_t \rangle$ is represented either by plotting its fragments (as sketched in Figure 1) or by bars extending from $\mathrm{beg}(f_1).x_i$ to $\mathrm{end}(f_t).x_i$ in each genome $i$; in other words, it is represented as a larger fragment whose boundaries are the two extreme points $\mathrm{beg}(f_1)$ and $\mathrm{end}(f_t)$. The latter form is called the compact representation of a chain. Figure 2 displays the compact representation of chains from three bacterial genomes in the linear mode.
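The compact representation amounts to taking the first corner of the first fragment and the last corner of the last fragment. A minimal sketch, with hypothetical function names and a chain given as an ordered list of `(beg, end)` coordinate tuples:

```python
def compact(chain):
    """Return (beg(f_1), end(f_t)) for an ordered chain of (beg, end) pairs."""
    return chain[0][0], chain[-1][1]


def bars(chain):
    """Return one (start, stop) bar per genome for the compact chain."""
    beg, end = compact(chain)
    return list(zip(beg, end))
```

For a chain over three genomes, `bars` yields three intervals, one to be drawn on each genome line in the linear mode.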

C. Displaying comparisons in 2D plots

Two-dimensional plots (also known as dot plots or matrix view) are well known for visualizing pairwise sequence similarity. For multiple genomes we face the problem of higher dimensionality, where the fragments and chains can be regarded as hyper-rectangles in a higher-dimensional space. Therefore, VisCHAINER displays projections of the comparison with respect to every pair of genomes as 2D plots.
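Projecting a k-dimensional comparison onto every pair of genomes can be sketched as follows. The function name `projections` and the `(beg, end)` tuple layout are assumptions for illustration; genomes are indexed from 0.

```python
from itertools import combinations


def projections(fragments, k):
    """Map each genome pair (i, j), i < j, to the 2D rectangles obtained by
    keeping only coordinates i and j of every fragment.

    fragments: list of (beg, end) pairs of k-dimensional points.
    """
    return {
        (i, j): [((beg[i], beg[j]), (end[i], end[j])) for beg, end in fragments]
        for i, j in combinations(range(k), 2)
    }
```

With k genomes this produces k(k-1)/2 plots, so three genomes give three projections.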

VisCHAINER displays each fragment $f$ by drawing the diagonal line connecting $\mathrm{beg}(f)$ and $\mathrm{end}(f)$. To distinguish inversions, we draw anti-diagonals for each inverted fragment or chain. By default, inverted fragments and chains are drawn in a different color. (The user can, however, change this color.)

A chain composed of the fragments $\langle f_1, f_2, \ldots, f_t \rangle$ is represented in a 2D plot either by plotting its fragments or by drawing a diagonal line connecting $\mathrm{beg}(f_1)$ and $\mathrm{end}(f_t)$, i.e., in compact form. Chains in compact form between forward and reverse strands are represented by anti-diagonal lines.
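The choice between a diagonal and an anti-diagonal comes down to which corners of the 2D rectangle are joined. The sketch below assumes that the coordinates of an inverted item have already been mapped to the forward strand, so that the start is no larger than the end on both axes; the function name `segment` is hypothetical.

```python
def segment(beg, end, inverted=False):
    """Return the endpoints of the line drawn for a 2D fragment or compact chain.

    beg, end: (x, y) corner points with beg <= end on both axes (assumed).
    """
    (x0, y0), (x1, y1) = beg, end
    if inverted:
        # Anti-diagonal: joins the upper-left corner to the lower-right one.
        return (x0, y1), (x1, y0)
    # Diagonal: joins beg to end directly.
    return (x0, y0), (x1, y1)
```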

Figure 2 shows two projections of chains in compact form produced when comparing three bacterial genomes.

D. Interactivity

The interactive components of VisCHAINER include continuous display of the coordinates, control of the appearance of the program, zooming and selecting subsets of the data, filtering data out of the display area based on score and length, and overlaying other comparisons on the currently displayed one. This section describes each of these in some detail.

1) Coordinates: The genome coordinates corresponding to the mouse coordinates are continuously displayed in the status bar of the respective window; see Figure 2. The coordinates of each fragment or chain are displayed upon selection.

2) Display options: The Look&Feel menu item makes it possible to change the global theme of the program. Through other control bars, the color of the fragments, chains, or compact chains can be changed. The user can also display either chains, fragments, the compact representation of chains, or all of them simultaneously. Moreover, any displayed plot can be exported as a JPEG image.

3) Zooming and selection: The whole comparison initially fits in the window. The user can zoom in, zoom out, and reset the zoom. It is also possible to select and display a subset of the fragments and chains. Details of the selected fragments and chains (e.g., coordinates and score) are also displayed. If the respective nucleotide sequences are loaded, VisCHAINER can display the sequences of the substrings composing the fragments. The selected fragments can be saved in a separate file for further processing.

4) Filtration: VisCHAINER allows the user to filter fragments and chains based on either their score or their length. This interactive feature is particularly important because it enables the user to easily filter out noisy fragments and chains.
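Score and length filtering of this kind can be sketched in a few lines. The record layout (dictionaries with `score` and `length` keys) and the function name `filter_items` are assumptions for illustration, not VisCHAINER's internal representation.

```python
def filter_items(items, min_score=None, min_length=None):
    """Keep fragments or chains that reach the given score and length cut-offs.

    items: iterable of dicts with 'score' and 'length' keys (hypothetical).
    A cut-off of None disables the corresponding filter.
    """
    kept = []
    for item in items:
        if min_score is not None and item["score"] < min_score:
            continue
        if min_length is not None and item["length"] < min_length:
            continue
        kept.append(item)
    return kept
```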

E. Displaying selected files and overlaying

In VisCHAINER the user can display selected files rather than whole projections. A selected file can be displayed in a separate window or overlaid on the currently opened window. This feature helps in the differential display of comparisons produced using different parameters or programs.

F. Portability and usability

The program is implemented in Java to guarantee portability. However, some settings specific to the operating system are still required, such as the handling of file paths. To overcome this, the user is prompted upon starting the program to specify the operating system (Unix/Linux or Windows) under which the program will run. The program then applies the settings appropriate to that operating system. The system itself requires no installation and is easy to run. The interface is intuitive and self-explanatory. We also provide a step-by-step manual exploring the program features in detail.

IV. IMPLEMENTATION ISSUES

VisCHAINER is implemented in Java and requires version 1.6 or higher. Older versions suffer from slow I/O operations, which dramatically affects the interactivity of the program.

To speed up the zooming and selection operations, we use a very simple range search algorithm that works well in practice. For $k$ genomic sequences, we sort the fragments and chains with respect to their coordinates in each genome. Specifically, we maintain $k$ arrays, each storing the order with respect to one genome. For a selection rectangle $([l_1..h_1], [l_2..h_2], \ldots, [l_k..h_k])$ in a certain projection, say genome 1 versus genome 2, we apply binary search over the array in which the fragments are sorted with respect to the first genome to collect the points in the range $[l_1..h_1]$. From these points we display those that lie in the range $([l_2..h_2], \ldots, [l_k..h_k])$.
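The range search just described can be sketched with sorted arrays and binary search. VisCHAINER itself is written in Java; the Python below is only an illustration, with each fragment or chain reduced to a single k-dimensional point and with the class name `RangeIndex` chosen for this example.

```python
from bisect import bisect_left, bisect_right


class RangeIndex:
    """k sorted orders over a list of k-dimensional points."""

    def __init__(self, points, k):
        self.points = points
        # order[d] lists point indices sorted by their coordinate in genome d.
        self.order = [
            sorted(range(len(points)), key=lambda p, d=d: points[p][d])
            for d in range(k)
        ]
        # keys[d] holds the matching sorted coordinates for binary search.
        self.keys = [[points[p][d] for p in self.order[d]] for d in range(k)]

    def query(self, box, dim=0):
        """Return indices of points inside box, a list of inclusive (lo, hi)
        ranges per genome. Binary search is applied on genome `dim`; the
        remaining ranges are checked by a linear scan of the candidates."""
        lo, hi = box[dim]
        a = bisect_left(self.keys[dim], lo)
        b = bisect_right(self.keys[dim], hi)
        return [
            p for p in self.order[dim][a:b]
            if all(l <= x <= h for x, (l, h) in zip(self.points[p], box))
        ]
```

The binary search narrows the candidates in logarithmic time, but the subsequent scan is linear in the number of candidates, which is consistent with the paper's description of the method as simple rather than optimal.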


Fig. 2. 2D projections of chains (compact representation) of three bacterial genomes and linear mode display of the comparisons. In the 2D plots, reverse chains are plotted in different colors. In the linear mode, small arrows indicate their orientation. On the right, we show the bar controlling whether chains or fragments are visualized on these plots.

V. CONCLUSIONS AND FUTURE WORK

VisCHAINER is an efficient interactive tool for visualizing multiple genome comparisons. Unlike other tools, it displays all pairwise 2D projections of multiple genome comparisons. In future versions, we will extend VisCHAINER to read more file formats (such as BLAST). We also plan to visualize genome annotations given in the GFF file format.

REFERENCES

[1] P. Chain, S. Kurtz, E. Ohlebusch, and T. Slezak, “An applications-focused review of comparative genomics tools: Capabilities, limitations and future challenges,” Briefings in Bioinformatics, 2003. To appear.

[2] T. Treangen and X. Messeguer, “M-GCAT: Interactively and efficiently constructing large-scale multiple genome comparison frameworks in closely related species,” BMC Bioinformatics, vol. 7:433, 2006.

[3] E. Sonnhammer and R. Durbin, “A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis,” Gene, vol. 167, pp. GC1–GC10, 1995.

[4] J. Krumsiek, R. Arnold, and T. Rattei, “Gepard: A rapid and sensitive tool for creating dotplots on genome scale,” Bioinformatics, vol. 23, no. 8, pp. 1026–1028, 2007.

[5] S. B. Cannon, A. Kozik, B. Chan, R. Michelmore, and N. Young, “DiagHunter and GenoPix2D: Programs for genomic comparisons, large-scale homology discovery and visualization,” Genome Biology, vol. 4, no. 10, p. R68, 2003.

[6] M. Fiers, H. van de Wetering, T. Peeters, J. van Wijk, and J. Nap, “DNAVis: Interactive visualization of comparative genome annotations,” Bioinformatics, vol. 22, no. 3, pp. 354–355, 2005.

[7] T. Carver, K. Rutherford, M. Berriman, M. Rajandream, B. Barrell, and J. Parkhill, “ACT: The Artemis comparison tool,” Bioinformatics, vol. 21, no. 16, pp. 3422–3423, 2005.

[8] A. Kozik, E. Kochetkova, and R. Michelmore, “GenomePixelizer - a visualization program for comparative genomics within and between species,” Bioinformatics, vol. 18, no. 2, pp. 335–336, 2002.

[9] A. Darling, B. Mau, F. Blattner, and N. Perna, “Mauve: Multiple alignment of conserved genomic sequence with rearrangement,” Genome Research, vol. 14, pp. 1394–1403, 2004.

[10] D. Nix and M. Eisen, “GATA: A graphic alignment tool for comparative sequence analysis,” BMC Bioinformatics, vol. 6:9, 2005.

[11] I. Uchiyama, T. Higuchi, and I. Kobayashi, “CGAT: A comparative genome analysis tool for visualizing alignments in the analysis of complex evolutionary changes between closely related genomes,” BMC Bioinformatics, vol. 7:472, 2006.

[12] S. Kurtz, J. V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich, “REPuter: The manifold applications of repeat analysis on a genomic scale,” Nucleic Acids Research, vol. 29, no. 22, pp. 4633–4642, 2001.

[13] “CoCoNUT: An efficient system for the analysis and comparison of genomes,” http://toolcoconut.org.

[14] M. I. Abouelhoda and E. Ohlebusch, “CHAINER: Software for comparing genomes,” in Proc. of the 12th ISMB/3rd ECCB, 2004. [Online]. Available: www.iscb.org/ismbeccb2004/short\ papers/19.pdf

[15] M. I. Abouelhoda and E. Ohlebusch, “Chaining algorithms and applications in comparative genomics,” Journal of Discrete Algorithms, vol. 3, no. 2-4, pp. 321–341, 2005.

[16] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch, “Replacing suffix trees with enhanced suffix arrays,” Journal of Discrete Algorithms, vol. 2, no. 1, pp. 53–86, 2004.

[17] S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. Hardison, D. Haussler, and W. Miller, “Human-mouse alignments with BLASTZ,” Genome Research, vol. 13, pp. 103–107, 2003.