Naturwissenschaften (2004) 91:405–421
DOI 10.1007/s00114-004-0542-8

REVIEW

Bernhard Haubold · Thomas Wiehe

Comparative genomics: methods and applications

Published online: 25 June 2004
© Springer-Verlag 2004

Abstract Interpreting the functional content of a given genomic sequence is one of the central challenges of biology today. Perhaps the most promising approach to this problem is based on the comparative method of classic biology in the modern guise of sequence comparison. For instance, protein-coding regions tend to be conserved between species. Hence, a simple method for distinguishing a functional exon from the chance absence of stop codons is to investigate its homologue from closely related species. Predicting regulatory elements is even more difficult than exon prediction, but again, comparisons pinpointing conserved sequence motifs upstream of translation start sites are helping to unravel gene regulatory networks. In addition to interspecific studies, intraspecific sequence comparison yields insights into the evolutionary forces that have acted on a species in the past. Of particular interest here is the identification of selection events such as selective sweeps. Both intra- and interspecific sequence comparisons are based on a variety of computational methods, including alignment, phylogenetic reconstruction, and coalescent theory. This article surveys the biology and the central computational ideas applied in recent comparative genomics projects. We argue that the most fruitful method of understanding the functional content of genomes is to study them in the context of related genomic sequences. In particular, such a study may reveal selection, a fundamental pointer to biological relevance.

Introduction

Comparative genomics has been part of the human genome project right from the start. The original plan was to sequence the human genome and the genomes of a small number of model organisms, including bacteria, yeast, fruit fly, the nematode worm Caenorhabditis elegans, and the mouse (National Institutes of Health and Department of Energy 1990). Today, these goals have all been achieved many times over and biologists have at their disposal 190 published genomes of free-living organisms, with more than four times this number in the pipeline (Bernal et al. 2001). Figure 1 shows a phylogenetic tree comprising 106 of the organisms whose complete genome sequence has been published. Bacteria make up the bulk of the organisms sequenced to date, reflecting both the relative technical ease with which their compact genomes can be sequenced and the medical interest in bacterial pathogens (Fig. 1). The sequenced bacterial genomes range in size from 580 kilobases (kb) encoding 470 open reading frames in Mycoplasma genitalium (Fraser et al. 1995) to 9,105 kb encoding 8,317 open reading frames in Bradyrhizobium japonicum (Kaneko et al. 2002). M. genitalium is a human pathogen causing reproductive tract infections, while B. japonicum is a soil-dwelling bacterium capable of fixing atmospheric nitrogen into organic compounds. It forms root nodules in soy bean, serving as the plant’s nitrogen source.

The archaeal genomes sequenced range in size from 1,564 kb with 1,509 open reading frames in Thermoplasma acidophilum (Ruepp et al. 2000) to 5,751 kb with 4,524 open reading frames in Methanosarcina acetivorans (Galagan et al. 2002). T. acidophilum is adapted to extremely acidic environments and has an optimal growth temperature of 59 °C. M. acetivorans, in contrast, is one of the most metabolically diverse methanogens, colonizing diverse habitats including oil wells, sewage lagoons, decaying leaves, and the stomachs of cows. Uniquely among Archaea, it forms complex multicellular structures (Galagan et al. 2002).

B. Haubold (✉)
Fachbereich Biotechnologie & Bioinformatik, Fachhochschule Weihenstephan, 85350 Freising, Germany
e-mail: [email protected]

T. Wiehe
Institut für Genetik, Universität zu Köln, Cologne, Germany

These very different bacteria and archaebacteria illustrate two broad trends in the genomics of prokaryotes:

1. Prokaryotic genomes are tightly packed with open reading frames. The four genomes mentioned so far use on average between 1.0 kb and 1.3 kb per open reading frame.

2. Specialists such as the pathogenic Mycoplasma genitalium have smaller genomes than generalists like the nitrogen-fixing B. japonicum.
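The gene densities behind the first trend can be checked directly from the genome sizes and open-reading-frame counts quoted in the Introduction. The short sketch below simply divides one by the other; the dictionary layout and the function name are illustrative choices, not part of the original article.

```python
# Genome sizes (kb) and open-reading-frame counts as quoted in the Introduction.
genomes = {
    "Mycoplasma genitalium": (580, 470),
    "Bradyrhizobium japonicum": (9105, 8317),
    "Thermoplasma acidophilum": (1564, 1509),
    "Methanosarcina acetivorans": (5751, 4524),
}


def kb_per_orf(size_kb, n_orfs):
    """Average amount of sequence (kb) available per open reading frame."""
    return size_kb / n_orfs


densities = {name: kb_per_orf(*values) for name, values in genomes.items()}

for name, value in densities.items():
    print(f"{name}: {value:.2f} kb per open reading frame")
```

All four values fall between 1.0 kb and 1.3 kb, in line with the range stated in the text.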

These trends have been recognized for some time (Casjens 1998), but do not hold in eukaryotes. The genome of yeast (12,069 kb; Goffeau et al. 1996) is roughly 200 times smaller than the mouse genome (2.4 Gb; Mouse Genome Sequencing Consortium 2002). However, the mouse genome contains only five times as many protein-coding genes as yeast. The mouse genome in turn encodes approximately the same number of proteins as the human genome, which is 14% larger (International Human Genome Sequencing Consortium 2001), the largest eukaryotic genome sequenced to date. The largest known eukaryotic genome is the genetic blueprint for a unicellular amoeba, Amoeba dubia, using a staggeringly wasteful 670 Gb of DNA sequence (Li 1997, p. 383).

Fig. 1 Phylogeny of 106 organisms whose genomes have been sequenced completely. Numbers indicate bootstrap support for a given node out of 1,000 bootstrap samples. The phylogeny was calculated on the basis of an alignment of the small subunit ribosomal RNA molecule, using the neighbor-joining algorithm (Saitou and Nei 1987) as implemented in the software Clustal W (Thompson et al. 1994), and drawn using ATV (Zmasek and Eddy 2001). Organisms listed in the Genomes online database (Bernal et al. 2001) which also had an entry in the Ribosomal database project (Cole et al. 2003) were included in the analysis

Comparing genome sizes and gene densities are two of the simplest exercises in comparative genomics. More sophisticated tasks include the description of a minimal genome for free-living organisms (Mushegian and Koonin 1996; Hutchison III et al. 1999; Makarova and Koonin 2003), the comparison of non-redundant proteomes of eukaryotes (Rubin et al. 2000), structural and functional genome annotation (Dermitzakis et al. 2002; Brachat et al. 2003; Kessler et al. 2003; Werner 2003a), and the detection of selection (Mouse Genome Sequencing Consortium 2002; Schlötterer 2003). Of particular medical interest are the comparison of bacterial pathogens (Cole 1998; Field et al. 1999; Brosch et al. 2001; Buysse 2001; Fitzgerald and Musser 2001; Andersson et al. 2002; Schoolnik 2002), medical diagnosis (Willey et al. 2002; Zhou et al. 2002), and the mapping of disease genes (Horikawa et al. 2000). All these tasks rely on comparative sequence data, which are currently being generated at an unprecedented rate. While the initial emphasis in the human genome project was on sequencing distantly related model organisms to cover the entire realm of living organisms, the advantages of comparing sequences of closely related bacteria, e.g. pathogenic and commensal strains of the same species, have been stressed for some time (Field et al. 1999). More recently, a trend has also developed towards sequencing closely related eukaryotic genomes (Boffelli et al. 2003; Clark et al. 2003; Kellis et al. 2003).

The comparison of genomes with a view towards understanding the encoded organisms’ biology raises both technical and conceptual issues, which we survey in this review. It consists of two main sections, one concerned with inter-specific and the other with intra-specific genome comparison. In each section, we first explain central computational methods, of which we consider three to be most significant: alignment, phylogenetic reconstruction, and coalescent theory. These methodological introductions are followed by examples of applications of these techniques. Our aim throughout is to highlight challenges and opportunities arising from the increasing availability of whole-genome data for related organisms. The selection of illustrative examples is highly subjective and we make no attempt to deal with the field comprehensively. Rather, we wish to convey the current excitement felt by many biologists as genomics matures from a phase of focussing on technical and infrastructure issues to a phase where one of biology’s oldest and most successful methods, that of comparing closely related organisms to infer function and evolutionary forces (Weiner 1994), is becoming an all-pervasive concern in the young science of genomics (Clark 1999; Galperin and Koonin 2003).

Inter-specific comparisons

Computational tools

Among the most important tools in comparative molecular work are alignment algorithms. An alignment of more than two sequences might subsequently be subjected to phylogenetic analysis, thus visualizing the evolutionary scenario implied by the alignment. In the following, we start with a discussion of the alignment problem in the context of genome sequences before commenting on methods for subsequent phylogenetic reconstruction.

Aligning genomic sequences

One of the classic tasks in biology is to compare organisms using homologous characters, that is, characters that have diverged from a common ancestral structure due to speciation events. For example, a bat’s wing is homologous to a human hand but not to a fly’s wing. In molecular biology, the alignment problem is equivalent to the identification of homologous characters. Correspondingly, an alignment consists of two or more rows of sequences written in such a way as to place homologous nucleotides or amino acids in the same column. Before aligning two or more sequences, a choice about their most likely evolutionary relationship has to be made. If this relationship has preserved homology across the entire length of the sequences, a global alignment procedure is appropriate. If, however, only subregions of the sequences are homologous, a local alignment algorithm should be used.

The first global pairwise alignment algorithm is also known as the Needleman–Wunsch algorithm (Needleman and Wunsch 1970) and its first local variant is often referred to as the Smith–Waterman algorithm (Smith and Waterman 1981). Both of these algorithms are optimal in the sense that they are guaranteed to return the best possible solution; and in essence they remain the gold standard for pairwise alignment today. Traditionally, the only two evolutionary events considered in these algorithms have been point mutations and insertions/deletions marked by gaps. However, in keeping with a trend in computational biology to consider evolutionary events relevant on the scale of long sequences, optimal alignment has more recently been generalized to include a recombination cost in the score scheme (Kececioglu and Gusfield 1998).
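The difference between global and local optimal alignment can be made concrete with a small dynamic-programming sketch. The function below computes only the optimal score, not the alignment itself, and uses a linear gap penalty; the score values (match +1, mismatch −1, gap −2) and the function name are assumptions chosen for illustration, not values taken from the cited papers.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal pairwise alignment score by dynamic programming.

    local=False gives a global (Needleman-Wunsch style) score,
    local=True a local (Smith-Waterman style) score.
    The scoring parameters are illustrative assumptions.
    """
    # First row: leading gaps cost nothing in a local alignment.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                       prev[j] + gap,        # gap in b
                       curr[j - 1] + gap)    # gap in a
            if local:
                cell = max(cell, 0)          # a local alignment may restart anywhere
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


print(alignment_score("ACGT", "AGT"))                  # global score
print(alignment_score("ACGAT", "GATCCT", local=True))  # local score
```

The only differences between the two modes are the initialization of the first row and column, the floor at zero, and where the final score is read from: the last cell for a global alignment, the best cell anywhere for a local one.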

With rapidly growing sequence databases, interest in faster, approximate, or heuristic (as opposed to optimal) alignment procedures has increased. Two of the most popular heuristic local pairwise alignment algorithms are implemented in the FASTA (Pearson and Lipman 1988) and BLAST (Altschul et al. 1990, 1997) software packages (Table 1; n.b. all computer programs mentioned in this review are listed in Table 1, together with their web-addresses). These programs achieve their speed by excluding from further consideration those parts of the database that do not contain an exact match to a short subsequence of the query sequence. This strategy of building an alignment only in the vicinity of a “seed” consisting of an exact match between a short fragment of the query and the database is a fundamental heuristic applied in all fast alignment programs (Baeza-Yates and Perleberg 1992).
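The seeding idea can be sketched in a few lines: index every word of length k in the database sequence, then look up each word of the query. This is a deliberately simplified illustration of exact-match seeding, not a description of how FASTA or BLAST are actually implemented; the function name and the default word length are assumptions.

```python
from collections import defaultdict


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs, 0-based, of exact k-mer matches."""
    # Index all words of length k in the subject (the "database").
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    # Look up every word of the query; only hits are considered further.
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds


print(find_seeds("ACGAT", "GATCCT"))
```

A full program would go on to extend each seed into a local alignment; regions of the database without any seed are never examined, which is where the speed-up comes from.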

Popular alignment programs such as BLAST and FASTA or the multiple alignment program Clustal W are essentially optimized for the alignment of protein sequences (Miller 2001). For instance, the nucleotide version of BLAST scores two aligned residues either as match or mismatch, in spite of the fact that transitions are known to occur more frequently than transversions and base composition may change across genomic sequences (Chiaromonte et al. 2002).
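A nucleotide scoring scheme that distinguishes transitions (purine to purine or pyrimidine to pyrimidine) from transversions might look as follows. The three score values are hypothetical placeholders chosen only to penalize transversions more heavily than transitions; they are not the values of any published matrix.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Hypothetical scores: transversions are penalized more than transitions.
SCORES = {"match": 2, "transition": -1, "transversion": -3}


def substitution_class(x, y):
    """Classify an aligned nucleotide pair."""
    if x == y:
        return "match"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def score_ungapped(a, b):
    """Score two equally long, gap-free aligned nucleotide sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(SCORES[substitution_class(x, y)] for x, y in zip(a, b))


print(substitution_class("A", "G"), substitution_class("A", "C"))
print(score_ungapped("ACGT", "ATGA"))
```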

Apart from the scoring problem, which is conceptually easy to fix in a manner similar to substitution matrices for protein alignments (Dayhoff et al. 1978; Henikoff and Henikoff 1992), aligning genomic-scale sequences creates further problems. The most obvious difficulty is perhaps that both CPU time (or patience) and computer memory are limited. Further, repetitive sequences in complex genomes tend to result in false-positive local alignments. However, global approaches do not represent an automatic solution to this issue as they are only applicable over regions with conserved gene order and orientation.

A number of programs dedicated to aligning two sequences several megabases long have been published in recent years, e.g. MUMmer (Delcher et al. 1999), BLASTZ (Schwartz et al. 2003), and AVID (Bray et al. 2003). BLASTZ uses a local approach, while MUMmer and AVID are global alignment programs. The latter two are based on an interesting data structure known as a suffix tree. As explained in Fig. 2, using such a suffix tree makes it possible to rapidly identify matching subsequences in two or more input genomes (Gusfield 1997). Both MUMmer and AVID gain much of their speed by computing a suffix tree from the two sequences to be aligned and using it to detect matches between the two sequences. Starting from these exact matches, the final alignment is constructed (Delcher et al. 1999; Bray et al. 2003).
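To illustrate what kind of match serves as a starting point, the sketch below finds maximal matches that occur exactly once in each of two sequences, using the two strings of Fig. 2 without their terminating $ symbols. Note that this is a brute-force stand-in written for clarity: it does not build a suffix tree and is far slower than the suffix-tree approach used by the programs cited above. Function names and the minimum length are assumptions.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    return sum(text.startswith(pattern, i)
               for i in range(len(text) - len(pattern) + 1))


def maximal_unique_matches(s, t, min_len=2):
    """Brute-force search for matches that are maximal and unique in both s and t.

    Returns a list of (pos_in_s, pos_in_t, matched_string), positions 0-based.
    """
    matches = []
    for i in range(len(s)):
        for j in range(len(t)):
            if s[i] != t[j]:
                continue
            # Skip matches that could be extended to the left.
            if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
                continue
            # Extend to the right as far as the sequences agree.
            length = 0
            while (i + length < len(s) and j + length < len(t)
                   and s[i + length] == t[j + length]):
                length += 1
            word = s[i:i + length]
            if (length >= min_len
                    and count_occurrences(s, word) == 1
                    and count_occurrences(t, word) == 1):
                matches.append((i, j, word))
    return matches


print(maximal_unique_matches("ACGAT", "GATCCT"))
```

For the two strings of Fig. 2 the only such match of length two or more is GAT, starting at position 3 of T1 and position 1 of T2 in the 1-based numbering of the figure.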

Table 1 Computer programs mentioned in this review

| Program | Purpose | Reference | Website |
| --- | --- | --- | --- |
| AGenDa | Gene prediction | Taher et al. (2003) | http://bibiserv.techfak.uni-bielefeld.de/agenda/ |
| ATV | View phylogenies | Zmasek and Eddy (2001) | http://www.genetics.wustl.edu/eddy/atv/ |
| AVID/MAVID | Genome alignment | Bray et al. (2003) | http://baboon.math.berkeley.edu/mavid/ |
| BLASTZ | Genome alignment | Schwartz et al. (2003) | http://www.bx.psu.edu/miller_lab/ |
| BLAST | Database search | Altschul et al. (1990) | http://www.ncbi.nlm.nih.gov/BLAST/ |
| BLAT | Database search | Kent (2002) | http://www.genomeblat.com/genomeblat/ |
| Clustal W | Multiple sequence alignment | Thompson et al. (1994) | http://bioweb.pasteur.fr/seqanal/interfaces/clustalw.html |
| coalator | Drawing coalescent trees | Unpublished | http://adenine.biz.fh-weihenstephan.de/ |
| drawStrees | Drawing suffix trees | Unpublished | http://adenine.biz.fh-weihenstephan.de/ |
| DoubleScan | Gene prediction | Meyer and Durbin (2002) | http://www.sanger.ac.uk/Software/analysis/doublescan/ |
| FASTA | Database search | Pearson and Lipman (1988) | http://fasta.bioch.virginia.edu/ |
| GenScan | Gene prediction | Burge and Karlin (1997) | http://genes.mit.edu/GENSCAN.html |
| gff2aplot | Plotting sequence comparisons | Abril et al. (2004) | http://genome.imim.es/software/ |
| ms | Coalescent simulations | Hudson (2002) | http://home.uchicago.edu/~rhudson1/source.html |
| MUMmer | Genome alignment | Delcher et al. (1999) | http://www.tigr.org/software/ |
| PHYLIP | Phylogeny reconstruction | Felsenstein (1993) | http://evolution.genetics.washington.edu/phylip.html |
| PPH | Haplotype reconstruction | Chung and Gusfield (2003) | http://wwwcsif.cs.ucdavis.edu/~gusfield/pph.html |
| SGP-2 | Gene prediction | Parra et al. (2003) | http://www1.imim.es/software/sgp2/ |
| SLAM | Gene prediction | Pachter et al. (2001) | http://bio.math.berkeley.edu/slam/ |
| TAP | Transcript assembly | Kan et al. (2001) | – |
| TwinScan | Gene prediction | Korf et al. (2001) | http://genes.cs.wustl.edu/ |

Fig. 2 Suffix tree for the two strings T1 = ACGAT$ and T2 = GATCCT$. The tree consists of nodes and connecting edges. The top node is called the root and the nodes it is connected to are referred to as its child nodes. Those nodes that have no child nodes are known as leaf nodes, to distinguish them from the internal nodes. Leaf nodes are labeled by pairs of numbers, where the first number refers to a string, and the second to a position within that string. An important feature of a suffix tree is that any string obtained by concatenating the edge labels along the path leading from the root to an internal node corresponds to a repeat sequence (Gusfield 1997). The positions of the repeat are indicated by the leaf labels below the internal node in question. For instance, the edge label of the rightmost internal node is C. It occurs at three positions in the input strings: position 2 in T1 and positions 4 and 5 in T2. By looking for internal nodes whose edge labels appear exactly once in each of the two input strings, a scaffold for fast global alignment can be built (Delcher et al. 1999). The suffix tree above was drawn using the program drawStrees (Table 1)

In order to capitalize on the advantages of both local and global alignment methods, one strategy is to combine the two (Couronne et al. 2003): A fast version of BLAST called BLAT (Kent 2002) is first used to find homologous “anchor” regions. These anchors are then filtered to detect sets with conserved order and orientation. Such sets of anchors are then aligned using a global strategy (Couronne et al. 2003).
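The filtering step, selecting anchors with conserved order, can be sketched as a search for the longest chain of anchors whose positions increase in both sequences. The code below is a simplified illustration under two assumptions: each anchor is reduced to a pair of start positions, and all anchors lie on the same strand, so orientation is not modeled. It is not the procedure published by Couronne et al. (2003), and the function name is hypothetical.

```python
def colinear_chain(anchors):
    """Longest subset of (pos_a, pos_b) anchors that increases in both coordinates.

    Simple quadratic dynamic programme; all anchors are assumed to be
    on the same strand.
    """
    anchors = sorted(anchors)
    if not anchors:
        return []
    best_len = [1] * len(anchors)   # length of the best chain ending at anchor i
    back = [-1] * len(anchors)      # predecessor of anchor i in that chain
    for i, (a_i, b_i) in enumerate(anchors):
        for j in range(i):
            a_j, b_j = anchors[j]
            if a_j < a_i and b_j < b_i and best_len[j] + 1 > best_len[i]:
                best_len[i] = best_len[j] + 1
                back[i] = j
    # Trace back from the end of the longest chain.
    end = max(range(len(anchors)), key=best_len.__getitem__)
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = back[end]
    return chain[::-1]


anchors = [(10, 12), (20, 25), (30, 5), (40, 41), (50, 60)]
print(colinear_chain(anchors))
```

In this toy input the anchor (30, 5) breaks the conserved order and is dropped; the remaining anchors would then delimit the regions handed to a global aligner.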

In summary, alignment algorithms are currently being developed further to model evolutionary events that are important on a genomic scale, such as gene duplication and recombination. In addition, available algorithms are further optimized for speed to cope with the sheer bulk of genome sequence data. Most of the ideas described in this section refer to pairwise alignment programs. However, MAVID (Bray et al. 2003) is a recent example of a program for aligning multiple genomes.

Phylogenetic analysis

When aligning more than two sequences, it is often desirable to visualize the evolutionary history implied by the alignment in a phylogenetic tree. A standard version of such a genealogy consists of a binary tree, in which every internal node is connected to three branches, two leading to child nodes and one leading to a parent node (cf. Fig. 1). The leaves of such a tree are labeled by the extant sequences (or the corresponding taxa) and branch lengths are proportional to the number of mutations.

Since the ground-breaking work on phylogenetic reconstruction by Edwards and Cavalli-Sforza (1964), three classes of methods for phylogenetic reconstruction have become widely used: distance methods, parsimony, and maximum likelihood. All three start from a multiple sequence alignment. Distance methods transform this alignment into a matrix of pairwise distances between the sequences. However, there are $\binom{n}{2}$ pairwise distances between $n$ sequences, while there are only $2(n-1)-1$ branches in the corresponding unrooted tree. This over-determination of the tree by the distance data means that a tree correctly representing all pairwise distances may not exist. If we disregard this possibility and assume that the distance data can be mapped onto a tree, it is possible to rapidly reconstruct this tree using, for example, the neighbor-joining algorithm (Saitou and Nei 1987). In contrast, maximum parsimony and maximum likelihood are methods for scoring a given tree rather than for reconstructing it. In the case of parsimony, the score consists of the total number of mutations implied by the tree. In the case of likelihood, the probability of the sequence data given the tree is used as the score. The maximally parsimonious tree is that tree implying the minimum number of mutations along its branches, while the maximally likely tree is that with the highest probability of the data given the tree. Actually finding the optimal tree involves searching through a space of $\prod_{i=4}^{n}(2i-5)$ possible unrooted bifurcating topologies that may account for the evolution of $n$ taxa. For just $n=15$, this amounts to approximately $7.9\times10^{12}$ trees; and computations may be infeasible for large $n$.
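The three counts used in this argument, the number of pairwise distances, the number of branches in an unrooted tree, and the number of unrooted bifurcating topologies, are easy to evaluate numerically. The sketch below implements the formulas exactly as given in the text; the function names are illustrative, and `math.comb` and `math.prod` require Python 3.8 or later.

```python
from math import comb, prod


def n_pairwise_distances(n):
    """Number of pairwise distances between n sequences: n choose 2."""
    return comb(n, 2)


def n_branches_unrooted(n):
    """Number of branches in an unrooted bifurcating tree with n leaves."""
    return 2 * (n - 1) - 1


def n_unrooted_topologies(n):
    """Number of unrooted bifurcating topologies: product of (2i - 5), i = 4..n."""
    return prod(2 * i - 5 for i in range(4, n + 1))


for n in (4, 10, 15):
    print(n, n_pairwise_distances(n), n_branches_unrooted(n),
          n_unrooted_topologies(n))
```

For n = 15 there are 105 pairwise distances but only 27 branches, which illustrates the over-determination, and 7,905,853,580,625 topologies, the figure rounded to 7.9 × 10^12 in the text.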

The details of phylogenetic reconstruction are described authoritatively by Felsenstein (2004) and have been implemented in several computer programs, for example the free PHYLIP package (Felsenstein 1993). In the context of this review, we only wish to draw attention to two critical assumptions made in conventional phylogeny reconstruction. The first is that sequences (and by implication the taxa they signify) are descended from a single common ancestor and the second is that this descent has occurred solely by the splitting of lineages (segregation). Neither assumption is necessarily valid. Genetic material not only segregates but can also merge in the processes of recombination (within species) or hybridization (between species). The latter process also means that there are two ancestors for one sequence.

There are methods that directly explore the appropriateness of the evolutionary tree model for a set of sequences. These include statistical geometry (Eigen et al. 1988) and methods inspired by it, such as likelihood mapping (Strimmer and von Haeseler 1997) and SplitsTree (Huson 1998). The common feature of these methods is that they allow the discovery of a tree-like structure, if one exists, without imposing this model a priori on the data.

Phylogenetic analysis is usually carried out with only one representative from each species of interest. When studying descent within a sexual species, the standard approach is to use a non-recombining molecule such as the chloroplast or mitochondrial genome as marker. Alternatively, it is possible to reconstruct the “bottleneck recombination history” of nuclear genes (Kececioglu and Gusfield 1998). This represents a modification of traditional phylogenies where, instead of one ancestor, a pair of ancestors (the protopair) is reconstructed.

Pairwise comparisons

Pairwise comparison of genomic sequence data has been applied extensively in comparative gene prediction. More recently, it has also been used to uncover the fraction of the mouse genome that is under selection. In the following, we first give an overview of pairwise gene prediction, followed by a brief survey of genome-wide selection in mouse.

Comparative gene prediction

In this section, we concentrate on gene finding, using both inter- and intra-specific sequence comparisons. In contrast to so-called ab initio methods, which search for specific sequence patterns such as splice-sites within a single sequence, similarity-based methods start by comparing a query sequence with homologous target sequences in order to identify segments which are conserved over evolutionary time.

Our first illustration of this approach to gene prediction is based on the BLAST suite of programs. BLASTX (Altschul et al. 1990; Gish and States 1993) translates a genomic query sequence into a set of amino acid sequences and compares these against a database of known protein sequences. Those segments in the genomic query which are similar to database proteins are putative exons. Alternatively, TBLASTX (Altschul et al. 1990; Gish and States 1993), which compares translated DNA sequences, can be used to look for significant local homologies. An example of this is shown in Fig. 3, where the major histocompatibility complex II (MHC II) region from mouse and human is compared. Boxes along the top x-axis and the y-axis indicate the positions of known exons and the black lines along the major diagonal highlight regions of significant homology. Notice that such homology is restricted to exonic regions.
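The translation step that precedes a BLASTX-style protein comparison can be sketched as a six-frame translation: three reading frames on the given strand and three on its reverse complement. The code below uses the standard genetic code and is only an illustration of that step; it is not BLAST code, and the function names and frame labels are assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; codons with unknown bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate all six reading frames, keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Each of the six conceptual translations would then be compared against the protein database; the six frames are also the reason why a region of homology often yields more than one similar fragment in a TBLASTX comparison.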

The success of comparative gene prediction depends on the availability of genomic sequences from organisms separated by an appropriate evolutionary distance. Unfortunately, it is not possible to make a general statement about what “appropriate evolutionary distance” means. It depends on various aspects: whether the researcher is interested in protein-coding genes or in conserved non-coding DNA, the local rate of evolution, the overall rate of divergence between the species studied, and the sensitivity of the alignment algorithm employed. Despite these complicating factors, recent studies suggest that similarity-based gene prediction can be very reliable over a fairly wide spectrum of species and evolutionary divergence times, ranging from a few million to several hundred million years (Crollius et al. 2000; Wiehe et al. 2000; Taher et al. 2003). The simple rationale underlying all these studies is that a random mutation in a functional, not necessarily protein-coding, region is frequently deleterious and hence unlikely to become fixed in the species, whereas mutations in non-functional regions tend to be ignored by natural selection. Thus, they are more likely to be fixed and constitute the bulk of extant sequence divergence. This fact is exploited not only in order to localize genes (see Fig. 3), but also to determine the intron–exon structures of genes, their regulatory regions, and to infer gene function.

Fig. 3 Comparison of the 35-kb major histocompatibility complex (MHC) II region from human and mouse using the program TBLASTX (Altschul et al. 1990; Gish and States 1993). The top panel displays the position of significant local homologies. These coincide with the positions of known exons, which are displayed as boxes along the top x-axis and the y-axis. The middle panel displays the percent identity of significant alignments. Due to the six possible reading frames, there is usually more than one highly similar fragment in a region of homology. The bottom panel displays similar information as the middle panel in the form of an average alignment score within a sliding window. The Figure was drawn using the software gff2aplot (Abril et al. 2004)
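The sliding-window summary mentioned in the caption of Fig. 3 can be sketched as follows. For simplicity the example averages percent identity over a pair of already aligned sequences rather than an alignment score, which is an assumption made here for illustration; the window size, step, and function name are likewise hypothetical.

```python
def sliding_window_identity(a, b, window=5, step=1):
    """Percent identity of two aligned sequences within a sliding window.

    a and b must have equal length; '-' denotes a gap and never counts
    as a match. Returns a list of (window_start, percent_identity).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = [x == y and x != "-" for x, y in zip(a, b)]
    return [(start, 100.0 * sum(matches[start:start + window]) / window)
            for start in range(0, len(a) - window + 1, step)]


print(sliding_window_identity("ACGTACGT", "ACGTTTTT", window=4, step=4))
```

Plotting the second value of each pair against the first gives a conservation profile along the alignment, in which conserved segments such as exons stand out as peaks.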

A typical partner species for gene prediction in human sequences is Mus musculus, one of the most important mammalian model organisms. It diverged from human about 80 million years ago. However, conservation of functional sequences can also be detected for much more distantly related taxa, such as human and Fugu (Brenner et al. 1993), human and Drosophila (Kornberg and Krasnow 2000; Rubin et al. 2000), or even human and yeast (Gavin et al. 2002).

Comparative gene prediction is also possible in more closely related organisms. For instance, the Brassicaceae, which diverged approximately 23 million years ago (Haubold and Wiehe 2001), share gene structures and regulatory units which can be detected based solely on their sequence similarity (Koch et al. 2001; Mayer et al. 2001).

Generally, the potential to identify complete gene structures from their similarity with genes in related species decreases with divergence time, leading to an increased number of false-negative predictions. In contrast, recent divergence times, usually correlated with high sequence similarity, may increase the false-positive rate (however, see the section on multiple comparisons for mitigating this problem). Since the rate of evolution varies among clades and species, no specific divergence time is uniformly optimal for comparative gene prediction. For instance, while human–avian comparisons (divergence about 300 million years ago) may work well, Brendel et al. (2002) raise concerns about the feasibility of comparative gene prediction using Arabidopsis thaliana and Oryza sativa, although the divergence time of monocots and eudicots is estimated to be 150–200 million years (Bennetzen 2002).

For the prediction of gene regulatory elements, which are often short motifs and functionally little constrained on the sequence level, other pairs or sets of species may be better suited than for the prediction of coding genes. Hardison (2000) advocates the comparison of more closely related species. Indeed, Dubchak et al. (2000) show that a three-way comparison of human, mouse, and dog sequences can help to pinpoint putative regulatory elements in non-coding DNA. Apart from variations in inter-specific rates of evolution, local fluctuations in mutation rates within a single genome may also confound gene-finding algorithms. As a result, prediction accuracy can vary a great deal even within a given species pair, depending on the chromosomal regions compared. For instance, in highly conserved regions, such as the HOX cluster, the false-positive rate may be considerably higher than in less conserved regions, such as the ERCC locus (Wiehe et al. 2000). Rate variation is due to the nature of the evolutionary process, with natural selection as one of several potential causes for deviations from a neutral molecular clock.

Recently developed similarity-based gene prediction programs working with homologous genomic sequences include Slam (Pachter et al. 2001), TwinScan (Korf et al. 2001), AGenDa (Taher et al. 2003), DoubleScan (Meyer and Durbin 2002), and SGP-2 (Parra et al. 2003). Some of these programs are used to cross-annotate the human and mouse genomes. Slam and DoubleScan are applications of generalized pair hidden Markov models (HMM), while TwinScan is an extension of GenScan (Burge and Karlin 1997) and, like its predecessor, incorporates a pseudo-HMM algorithm to generate gene models. As the authors note, TwinScan has in the past been applied to human/mouse gene prediction, but should also be useful for a range of other species.

Despite recent progress in comparative gene finding, the process unfortunately remains error-prone. As a consequence, certainty about a gene’s intron/exon structure can still only be reached by sequencing the corresponding cDNA clone. Although a number of public and private projects exist to create a comprehensive database of human (and other species) full-length cDNAs (see, for instance, http://genome.gsc.riken.jp/, http://www.nedo.go.jp/bio-e/, http://mips2.gsf.de/proj/cDNA/; Kawai et al. 2001), it is still unclear how comprehensive these collections are. Because low-level background transcription of almost the whole genome has been reported (Wong et al. 2001), it is possible that many full-length cDNAs never end up in translated protein products; and for technical reasons it is often difficult to guarantee that the 5′ end of a gene has been correctly determined. Moreover, a substantial fraction of genes or alternative gene products is expressed only during particular stages in the developmental process, in specialized tissues or cells, or in response to particular environmental conditions.

While it has been known for some time that sequence similarity is an excellent indicator of orthologous gene location in related genomes, it has only recently been realized that the exact structure and the number of possible alternative transcripts may differ widely between members of a gene pair (Reichwald 2003). This is true even for the favorite species pair of vertebrate comparative genomics, human and mouse (Nurtdinov et al. 2003). Clearly, there is more to comparative functional genomics than aligning two sequences or searching a database.

The identification of the untranslated and promoter regions of a gene tends to be even more difficult than the prediction of its coding portion. The state of the art in this field has been reviewed by Werner (2003b) and comparative methods have been advocated in this context by various authors (Kan et al. 2001; Ohler and Niemann 2001). One approach, termed transcript assembly program (TAP), is based on genomically aligned expressed sequence tags that are used to define untranslated regions (UTRs) and to detect alternative splice variants (Kan et al. 2001). However, promoters and other regulatory elements can usually tolerate much more sequence divergence than coding regions and still retain their original function. This is one source of the numerous difficulties encountered by promoter prediction programs and often prevents a meaningful comparison of promoter sequences. Because regulatory elements function in part via the structural characteristics of their constituent DNA, both the physico-chemical properties of the relevant DNA sequence and the 3-D structure of the interacting proteins have to be taken into account (Ponomarenko et al. 1999; Ohler et al. 2001). Unlike coding sequences, promoters are expected to be predicted much more reliably when very closely related organisms are compared (see the section on multiple comparisons).

Quantifying genome-wide selection by human/mouse comparison

Of the two principal evolutionary forces shaping genomes, selection and mutation, selection is the essential indicator of functional importance. Selection due to functional constraints, also known as purifying or negative selection, results in sequence conservation. The converse does not hold, however: sequence conservation alone is not sufficient for inferring selection. In order to quantify the portion of the human or mouse genome that is under purifying selection, the two genomes were compared and regions better conserved than expected from the underlying neutral mutation rate were delineated (Mouse Genome Sequencing Consortium 2002). Such regions accounted for approximately 5% of the genome, which is a surprising result given that only approximately 1.5% of the human genome is coding, with another 1% encompassing UTRs of protein-coding regions. The unaccounted 2.5% of the selected portion of the genome presumably contains control elements and non-protein-coding RNAs. The authors conclude that “characterization of the conserved sequences should be a high priority for genomics in the years ahead” (Mouse Genome Sequencing Consortium 2002, p. 553).

Multiple comparisons: phylogenetic footprinting and phylogenetic shadowing

The comparison of multiple genomes is a natural extension of pairwise inter-specific comparisons. Such comparisons are usually carried out in order to detect conserved regions across two phylogenetic scales. Deep comparisons, also known as phylogenetic footprinting (Hardison et al. 1997), reveal conservation across higher taxonomic units, for example the vertebrates (Elgar et al. 1996; Thomas and Touchman 2002). Shallow comparisons, recently termed phylogenetic shadowing (Boffelli et al. 2003), probe conservation across a group of closely related species. We present a recent example of each technique to clarify the issues involved and the insights that can be gained.

In a project aimed at discovering functional conservation across the vertebrates, a 1.8-Mb region containing ten genes, including the gene responsible for cystic fibrosis in humans, was sequenced from 12 vertebrate species, including fish, bird, rodents, and primates (Thomas et al. 2003). The corresponding multiple alignment revealed 1,194 conserved regions, or phylogenetic footprints, termed multi-species conserved sequences (MCSs). Having been conserved across the 450 million years of vertebrate evolutionary history, these elements are presumably of fundamental functional importance. Surprisingly, 65% of the bases contained in such MCSs did not overlap known coding regions or UTRs. In order to demonstrate the superiority of multiple sequence comparisons over pairwise approaches, the authors investigated how many MCSs could be detected when the human sequence was compared with one or more regions from other vertebrates. In fact, the central parameter in the detection of MCSs was the total branch length of the phylogenetic tree connecting the set of sequences analyzed. The greater this total branch length, i.e. the more species included or the more divergent these species, the more discriminatory power the data set had in the detection of conserved elements (Thomas et al. 2003). The fact that the total branch length of a phylogenetic tree is a function of both the divergence and the number of species provides the key insight motivating phylogenetic shadowing (Boffelli et al. 2003).
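The dependence of total branch length on both divergence and species number can be illustrated with a minimal sketch. The tree encoding and all branch lengths below are hypothetical and chosen for illustration only; they are not taken from Thomas et al. (2003). The point is merely that a few divergent lineages and many closely related ones can yield the same total.

```python
# A node is a pair (branch_length_to_parent, list_of_child_nodes).
# Leaves have an empty child list. This encoding is an assumption
# made for this sketch, not a standard file format.

def total_branch_length(node):
    """Sum the lengths of all branches in the (sub)tree rooted at node."""
    length, children = node
    return length + sum(total_branch_length(child) for child in children)

# "Deep" comparison: three divergent lineages (hypothetical lengths,
# in substitutions per site).
deep_tree = (0.0, [(0.9, []), (0.1, [(0.4, []), (0.4, [])])])

# "Shallow" comparison: 36 closely related lineages in a star-like tree.
shallow_tree = (0.0, [(0.05, []) for _ in range(36)])

if __name__ == "__main__":
    print(round(total_branch_length(deep_tree), 3))
    print(round(total_branch_length(shallow_tree), 3))
```

Under these made-up numbers both trees carry the same total branch length, which is the sense in which many shallow branches can substitute for a few deep ones.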

Phylogenetic shadowing, i.e. the comparison of a set of closely related species in order to detect conserved regions, addresses two potential problems associated with phylogenetic footprinting. The first is that aligning sequences becomes increasingly difficult with increasing divergence. The second limitation of phylogenetic footprinting is that conservation across higher-order taxonomic units such as vertebrates cannot lead to the discovery of, say, primate-specific regulatory elements. If we are interested in genus-specific regulatory elements (after all, we look quite different from fish) then sequences from closely related species need to be scanned for unusual conservation or divergence. Such sequences are easy to align, but often contain too few polymorphisms to allow the detection of conserved regions. However, in the study of the cystic fibrosis region in vertebrates outlined above, it was already noted that discriminatory power is a function of the total branch length of the phylogenetic tree connecting the aligned sequences (Thomas et al. 2003). In other words, discriminatory power for the detection of sequence conservation among closely related species can be gained by sequencing enough of them.

This idea was applied in a study of primates aimed at detecting functional sequences in the human genome (Boffelli et al. 2003). A 1.6-kb region upstream of the apo(a) transcription start site was sequenced in 18 Old World monkeys and hominoids and sequence conservation was monitored. This revealed conservation corresponding to two known regulatory sites (including the TATA box) and eight hitherto undescribed conserved elements. In transfection experiments, deletion of seven of these regions had a significant effect on the expression of a reporter gene (Boffelli et al. 2003).

Intra-specific comparisons

Comparisons between genomes obtained from members of the same species are of two kinds: (1) comparisons within sexual species where the biological species concept applies, and (2) comparisons between asexual species, notably microorganisms, where species boundaries are less clear-cut. With microorganisms, no new conceptual issues arise and their investigation in the context of comparative genomics takes place in the framework outlined above for inter-specific work. In contrast, molecular comparisons within bona fide species are carried out in the context of population genetics. Contemporary population genetics is strongly dependent on one particular set of mathematical and computational ideas, known as coalescent theory. In the following, we introduce the reader to coalescent theory. In its standard form, it is based on a mutation model which corresponds closely to single nucleotide polymorphisms (SNPs). We therefore proceed by explaining SNP data and its modeling using coalescent theory before finishing with a section on how to discover selection from SNPs.

Computational tool: the coalescent

SNPs are detected by aligning two or more DNA sequences from members of the same species and looking for polymorphic sites. As elaborated in the section on SNPs, the SNP is the most abundant type of polymorphism in most populations. When comparing two homologous sequences in humans, about one position in 1,000 is a SNP (International SNP Map Working Group 2001). There is currently a lot of interest in SNP data, both in the context of medical research and in the context of studying fundamental aspects of genome evolution. In both cases, it is very useful to have a clear theoretical understanding of the distribution of SNPs expected under various evolutionary scenarios. Such an understanding is provided by coalescent theory.

The power of coalescent theory is based on a fundamental change in the way the evolutionary process is imagined and subsequently modeled. Traditionally, the evolution of a set of genes was modeled and simulated forward in time: a population of genes was evolved forward from one generation to the next and, at certain time intervals, a sample was drawn from this population and the statistic of interest calculated from this sample. This resulted in a distribution of the statistic concerned, say, heterozygosity (i.e. the probability of randomly selecting two non-identical homologous alleles from a population). When used as a simulation tool, this forward procedure has two disadvantages. First, it runs in time proportional to the size of the population modeled, which might be vast. Second, as the population evolves from one generation to the next, the statistics calculated from samples are not independent but auto-correlated. In order to minimize this auto-correlation, the simulation has to run for many steps between samplings, which again increases its run-time.

A revolution in statistical genetics was brought about in the early 1980s by imagining the evolutionary process not forward but backward in time (Kingman 1982a, 1982b, 2000). A sample of extant gene sequences is the result of a genealogical process which has taken place in the past. This genealogical process simply needs to be traced back to the last common ancestor in order to capture all the evolutionary events that have affected a given sample of homologous genes. In the absence of recombination, this trace-back procedure takes time proportional to the sample size rather than the population size. Moreover, different genealogies are independent of each other, thereby avoiding the problem of autocorrelation. If we continue to disregard recombination, there is only one genealogical event: the merging of two lineages into their last common ancestor, a coalescence event. In a non-recombining population of constant size where all members have the same fitness, simulating such coalescents is simple (Hudson 1990). Figure 4 displays examples of random genealogies for various sample sizes. Notice that the height of the coalescents, also known as the time to the last common ancestor, does not appear to depend strongly on the sample size. In fact, the expected time to the last common ancestor varies only by a factor of two between samples of size two and samples of size infinity (Nordborg 2001). However, for a given sample size, the variance of time to the last common ancestor is large.
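The backward procedure can be sketched in a few lines for the simplest case described above (constant population size, no recombination, no selection). This is an illustrative sketch, not the implementation of any published program. Time is measured in coalescent units in which each pair of lineages merges at rate 1, so that the expected time to the last common ancestor is 2(1 − 1/n); the function names are our own.

```python
import random

def simulate_tmrca(n, rng):
    """Draw one time to the last common ancestor for a sample of size n.

    While k lineages remain, the waiting time to the next coalescence
    is exponential with rate k(k-1)/2 (one unit of rate per pair).
    """
    t = 0.0
    k = n
    while k > 1:
        rate = k * (k - 1) / 2
        t += rng.expovariate(rate)
        k -= 1  # two lineages merge into one
    return t

def expected_tmrca(n):
    """Expected height of the genealogy in the same time units."""
    return 2.0 * (1.0 - 1.0 / n)

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so that runs are repeatable
    for n in (2, 10, 100):
        sims = [simulate_tmrca(n, rng) for _ in range(10000)]
        print(n, round(sum(sims) / len(sims), 2), expected_tmrca(n))
```

Each replicate needs only n − 1 random draws, independent of population size, and the replicates are independent of each other. The expectation rises from 1 for n = 2 towards 2 as n grows, which is the factor of two mentioned above.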

There is a superficial similarity between a phylogeny (Fig. 1) and a coalescent (Fig. 4). The fundamental difference between the two graphs is that a phylogeny is deduced from a given data set while a coalescent is a structure for simulating data sets.

We need to add mutations to our genealogy to make it biologically meaningful. The mutation model applied most frequently is known as the infinite sites model, where no site can mutate twice. For closely related DNA molecules, such as those found among members of the same species, this is a reasonable assumption and corresponds to the observation that SNPs are usually biallelic. Under the infinite sites model, S, the expected number of polymorphic sites in a sample of n homologous sequences, is equal to the sum of all branch lengths on the coalescent multiplied by the mutation rate (Watterson 1975):

$$S = 2N\mu \sum_{i=1}^{n-1} \frac{1}{i} \qquad (1)$$


where N is the size of the gene pool and μ the mutation rate. Both μ and N are usually unknown, but from Eq. 1 it follows that their product (θ = 2Nμ) is simply the observable number of polymorphic sites between two homologous sequences.
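Equation 1 is easy to evaluate numerically. The following sketch computes the expected number of polymorphic sites for a given θ = 2Nμ and sample size n, and inverts the relation to estimate θ from an observed count. The function names are our own, and the numbers in the usage lines are illustrative.

```python
def harmonic(m):
    """Return the sum 1/1 + 1/2 + ... + 1/m (0 for m = 0)."""
    return sum(1.0 / i for i in range(1, m + 1))

def expected_segregating_sites(theta, n):
    """Eq. 1: expected number of polymorphic sites in a sample of size n,
    with theta = 2 * N * mu."""
    return theta * harmonic(n - 1)

def estimate_theta(s_observed, n):
    """Invert Eq. 1 to estimate theta from an observed count of
    polymorphic sites in a sample of size n."""
    return s_observed / harmonic(n - 1)

if __name__ == "__main__":
    # For n = 2 the expected count equals theta itself.
    print(expected_segregating_sites(1.0, 2))
    # The count grows only logarithmically with sample size.
    print(round(expected_segregating_sites(1.0, 100), 2))
```

For n = 2 the sum reduces to a single term, so the expected number of differences between two sequences equals θ.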

Figure 5 illustrates the relation between mutations on the coalescent and the resulting haplotypes. Notice that mutations on the outer branches of the genealogy correspond to polymorphisms carried only by the sequence this branch leads to (cf. haplotype 2 in Fig. 5). The number of such singleton mutations, $S_{\mathrm{singleton}}$, is equal to the length of the outer branches of the coalescent multiplied by the mutation rate (Fu and Li 1993):

$$S_{\mathrm{singleton}} = 2N\mu \qquad (2)$$

Sexual species undergo reciprocal recombination in every generation during meiosis. The coalescent was generalized early on to include this process (Hudson 1983), which transforms the coalescent tree into a more general data structure known as the ancestral recombination graph (Griffiths and Marjoram 1997). An example of such a graph is shown in Fig. 6. Notice that an ancestral recombination graph contains two kinds of events: the coalescent events already described, which reduce the number of lineages by one, and recombination events, which increase the number of lineages by one. Going backwards in time, the first event depicted in Fig. 6 is a recombination event (R), which effectively splits the chromosome at a random point into two parts, thereby uncoupling the evolutionary history of the resulting segments. The shorter segment on the left-hand side coalesces in the first coalescent event (C1) and the longer segment in the second coalescent event (C2). We can see that, as a result of recombination, the time to the most recent common ancestor varies along the length of the chromosome (Hudson 1990). As a consequence, recombination leads to an uneven distribution of SNPs in sequences long enough to be affected by recombination. In our example (Fig. 6), we would expect the alignment of the two sequences concerned to contain a lower SNP density in the left-hand fifth of the sequence than in the four-fifths on the right-hand side. In other words, as a result of recombination, SNPs are clustered along the chromosome (Hudson 1990).

Fig. 4 Examples of neutral gene genealogies for sample sizes n = 2, 10, and 100. Time is measured in 2N generations, where N is the (usually unknown) size of the gene pool. The genealogies were drawn using the program coalator

Apart from recombination, the coalescent has been extended to incorporate demography, such as population expansion, population structure (Hudson et al. 1992), gene conversion (Wiuf and Hein 2000), and selection (Hudson and Kaplan 1987). The classic survey of the field was published by Hudson (1990) with a more recent extensive review provided by Nordborg (2001). In addition, the coalescent simulation software ms (for “make sample”) is freely available (Hudson 2002). From its beginnings as a specialty of a few statistically inclined geneticists, coalescent theory has now entered the biological mainstream with both draft publications of the human genome containing coalescent simulations (Venter et al. 2001; International Human Genome Sequencing Consortium 2001); and any contemporary discussion of SNPs is underpinned by this theory.

The following sections are concerned with the biology that can be learned by making intra-specific comparisons of genome sequences.

Single nucleotide polymorphisms

SNPs, rather than gross differences at the genetic level, are thought to underlie most of the heritable diversity we see among our fellow human beings. In addition, SNPs are the most abundant genetic markers in most populations. In 2001, a map of 1.4 million human SNPs was published along with the draft of the human genome sequence (International SNP Map Working Group 2001). These SNPs were mapped from a sample size of two (but for a comment on this, see Haubold and Wiehe 2002) and, from Eq. 1, it follows that this number would grow by a factor of $\sum_{i=1}^{99} \frac{1}{i} \approx 5.2$ for a sample size of 100. The international HapMap project (http://www.hapmap.org) aims to uncover a large fraction of this as yet unknown human genetic diversity.

In the following, we concentrate on the fundamental aspects of SNPs by first surveying what has been learned about their genome-wide distribution. This is followed by an account of how SNPs are used to detect selection.

Genome-wide distribution of SNPs

It has now become clear from comparisons between coalescent simulations and genome-wide SNP data that the assumption of uniform recombination rates is not tenable (Reich et al. 2002; McVean et al. 2004). Local fluctuations in recombination rate correspond to the hypothesis that the human genome consists of islands of linkage disequilibrium due to the localization of recombination to hot spots (Goldstein 2001). Such a pattern of localized recombination could have important consequences for mapping genes through their association with one or more SNPs. In this case, the typing of relatively few SNPs might suffice to locate genes to their non-recombining genomic segment, while more precise localization within such a segment would require the application of techniques other than association surveys. Using coalescent simulations, McVean et al. (2004) have now estimated that 50% of recombination events are localized in only 10% of the human genome sequence.

Fig. 5 Relation between mutations (circles) on a coalescent and the resulting multiple alignment of haplotypes (horizontal lines). The dotted arrows connect the mutation events on the coalescent to their (random) mutation sites along the multiple alignment

Fig. 6 Ancestral recombination graph for sample size 2. Three events are shown, one recombination (R) and two coalescent events (C1, C2). After the recombination event, the chromosome is split into two lineages containing material that will end up in the final sample (ancestral material shown as a box) and material from anonymous sequences in the population (non-ancestral material shown as a line). At coalescent event C1, the left-hand short segment of the original chromosome coalesces, at C2 the longer right-hand segment (Hudson 1990)


SNP data is most useful in the form of haplotype data. However, with the exception of a large data set for chromosome 21 (Patil et al. 2001), phase is unknown for most SNPs in the databases and needs to be reconstructed computationally (Clark 1990; Gusfield 2001; Schwartz et al. 2002). This problem has been called the perfect phylogeny haplotyping problem (Gusfield 2001): given a SNP profile for n diploid individuals, find 2n haplotypes such that the evolution of sets of 2n haplotypes can be modeled by a coalescent without recombination. Figure 4 shows examples of such genealogies; and this reconstruction leads to the identification of blocks of SNPs where no recombination has occurred. A program called PPH implementing this reconstruction has recently been published (Chung and Gusfield 2003).

Detecting selection from SNPs: selective sweeps

Selection is a signature of functionally important genetic elements. From a population genetic perspective, a gene under strong positive selection is quickly “swept” to fixation, thereby erasing any neutral polymorphisms linked to it. This idea was first proposed by Maynard Smith and Haigh (1974) as the hitch-hiking effect and today is often referred to as a selective sweep. A selective sweep is characterized by a region of low genetic diversity centered around a selected gene. An alternative way of thinking about a selective sweep is to consider its effect on the shape of the corresponding coalescent and the distribution of mutations on the coalescent under selection (Tajima 1989; Fu and Li 1993; Fay and Wu 2000).
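One widely used summary of the distribution of mutations on the coalescent is Tajima's D, which contrasts the mean number of pairwise differences with the number of polymorphic sites scaled as in Eq. 1. The sketch below follows the standard textbook formulation of the statistic; the constants should be checked against Tajima (1989) before serious use, and the '0'/'1' string encoding of haplotypes is our own assumption. An excess of rare variants, as expected after a sweep or a population expansion, gives negative values.

```python
from itertools import combinations
from math import sqrt

def tajimas_d(haplotypes):
    """Tajima's D for a list of equal-length '0'/'1' haplotype strings.

    Returns None if there are fewer than four sequences or no
    polymorphic sites, since the statistic is then undefined.
    """
    n = len(haplotypes)
    columns = list(zip(*haplotypes))
    s = sum(1 for col in columns if len(set(col)) > 1)  # polymorphic sites
    if n < 4 or s == 0:
        return None
    # Mean number of pairwise differences.
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Normalizing constants of the standard formulation.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))

if __name__ == "__main__":
    singletons_only = ["1000", "0100", "0010", "0001"]
    intermediate = ["1100", "1100", "0011", "0011"]
    print(tajimas_d(singletons_only))  # negative: excess of rare variants
    print(tajimas_d(intermediate))     # positive: intermediate frequencies
```

The sign of D alone does not identify selection, since demographic change affects the statistic in the same way.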

Typical scenarios for episodes of positive selection include bacterial resistance to antibiotics and resistance to epidemic diseases in multicellular organisms. Perhaps one of the best known cases of disease resistance among animals is the resistance to AIDS onset in chimpanzees. In a study of this phenomenon, genetic diversity among MHC I genes was sampled in humans and chimpanzees (Gilad et al. 2002). The MHC I genes are responsible for presenting cells suffering from viral infections to the immune system. The genetic diversity at the chimpanzee locus was surveyed at an intronic sequence, which presumably evolves neutrally. Its genetic diversity was significantly reduced, compared with the human diversity found at the same locus. The authors suggest that present-day chimpanzees are the survivors of an HIV-like epidemic that took place 2–3 million years ago and left its genetic footprint by eliminating significant amounts of neutral genetic diversity. According to this hypothesis, the survivors of this ancient epidemic are the ancestors of today’s AIDS-resistant chimpanzees (Gilad et al. 2002). While there is no direct evidence linking an HIV-like epidemic to the observed reduction in genetic diversity, this is an intriguing hypothesis.

It should be stressed, however, that rejection of the null hypothesis of neutral evolution is necessary but by no means sufficient for inferring selection. Other factors, including the mode of recombination (Haubold et al. 2002) and demography, may lead to significant changes in the observed SNP spectrum. Demography is a particularly interesting factor in the context of comparative genomics, as in sexual organisms selection should be acting on single genes, while demography affects entire genomes uniformly. Hence genomic data, or its substitute multilocus data, can be used to dissect the relative contributions of demography and selection to the observed SNP spectrum. Glinka et al. (2003) scanned 105 loci of approximately 500 bp each along the X chromosome of 12 lines each of European and African Drosophila melanogaster. From the chromosome-wide excess of singleton mutations found in African D. melanogaster, they concluded that the population had undergone a recent expansion. In contrast, the European D. melanogaster sampled only showed localized reductions in genetic diversity, which is evidence for recent selective sweeps as the flies migrated from their sub-Saharan place of origin into new European habitats (Glinka et al. 2003).

Conclusions

The aim of comparative genomics is to use an ensemble of related genomes to better understand each individual genome in the set. The comparative method has a long tradition in biology (Darwin 1859) and its inclusion within genomics has also been advocated for some time (Elgar et al. 1996). In fact, comparative genomics might be regarded as the interpretative branch of the genomics endeavor (Clark 1999).

In the field of bacterial genomics, the sequencing of closely related organisms specifically for comparative purposes has been carried out for some time (Field et al. 1999; Andersson et al. 2002), while the technically more demanding comparative whole-genome sequencing of closely related eukaryotes is a relatively recent concept (Mouse Genome Sequencing Consortium 2002; Clark et al. 2003; Kellis et al. 2003). However, even among eukaryotic organisms, the number of whole-genome comparisons has grown rapidly recently and now encompasses plants, fungi, and animals. Among the animals, publication of the genomes of puffer fish (Fugu rubripes; Aparicio et al. 2002) and mouse (Mouse Genome Sequencing Consortium 2002) has attracted particular interest in the context of comparison with the human genome (e.g. Dermitzakis et al. 2002; Mural et al. 2002). Apart from such inter-specific comparisons, intra-specific comparisons in the form of SNP studies have created a lot of excitement, particularly since the publication of the first comprehensive map of SNPs in humans (International SNP Map Working Group 2001). Hence, the phylogenetic scale of studies in comparative genomics ranges from all cellular life forms (Koonin and Mushegian 1996), through domains (Rubin et al. 2000; Sorrells et al. 2003) and phyla (Thomas and Touchman 2002), down to the species (Kellis et al. 2003) and population (International SNP Map Working Group 2001) levels.


Most studies of comparative genomics are based on inter-specific comparisons. However, in this review, we have taken the more comprehensive view that any investigation based on genome-wide sequence comparison should be included under the comparative genomics umbrella. This means that intra-specific studies of the statistics of SNPs are discussed alongside the classic studies of gene discovery by alignment. Our central point of reference was whether a study was based on inter-specific comparisons or on intra-specific comparisons, as these two types of study rely on distinct tools and ideas.

Sequence comparisons usually start from an alignment. Pairwise alignment methods have been a staple of computational biology for decades (Needleman and Wunsch 1970; Smith and Waterman 1981; Altschul et al. 1997) and are now being honed to allow the comparison of whole genomes, often with a view to gene discovery and genome annotation (Miller 2001). However, there is a tradeoff here: closely related genomes are easy to align, but may not contain enough polymorphisms to allow the delineation of conserved regions (Boffelli et al. 2003). Extending pairwise to multiple alignments is a first step towards solving this problem. The branches along a phylogenetic tree constructed from such a multiple alignment represent all the mutations in the sample of sequences investigated. The topology of such a tree might comprise a few highly divergent sequences, as in phylogenetic footprinting (Hardison et al. 1997), or many more closely related sequences, as in phylogenetic shadowing (Boffelli et al. 2003). Perhaps surprisingly, the total number of polymorphisms in the sample can be kept constant between these two approaches. In addition to sequence conservation, evidence of positive selection is a strong indicator of biological relevance. At the level of populations, selection is traditionally investigated by simulating the distribution of polymorphisms expected under neutrality and comparing this with empirical data. The coalescent is the tool of choice for such simulations.

Today, comparative genomics provides biologists with over a thousand genomes to choose from, ranging from viruses to mammals (Zhang et al. 2003). Often this creates an “embarrassment of riches”, where it becomes difficult to decide which genes to pick for further functional study. Selection is perhaps the best guide to functional relevance. As we have shown in this review, selection at the molecular level is typically discovered through the comparison of aligned homologous sequences. An alternative to this approach has recently been developed for protein coding sequences within a single genome. Plotkin et al. (2004) determine the proportion of the nine possible single-step mutations per codon that are non-synonymous. This “volatility” measure ranges from 5/9 for a codon like CUG (Leu), to 9/9 = 1 for a codon like AUG (Met). For any given protein, a high average volatility per codon indicates recent selection for amino acid substitution (positive selection), while a low volatility indicates selection against amino acid substitution (purifying selection). In this way, selected proteins can be detected, given just a single genomic sequence. However, the volatility of any given protein still needs to be compared with the “background volatility” of all proteins in the genome. For the interpretation of genomic sequences, there seems to be no escaping the comparative paradigm, even though in some cases this can be reduced to an intra-genomic comparison.
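The per-codon proportion described above can be computed directly from the standard genetic code. The sketch below is a simplified reading of the measure as stated in the text: all nine single-step neighbours are counted, and a change to a stop codon is treated as non-synonymous. The published definition may treat stop codons differently, so this is an illustration rather than a reimplementation of Plotkin et al. (2004).

```python
from fractions import Fraction
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks stop codons.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def neighbours(codon):
    """Yield the nine codons that differ from codon at exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]

def volatility(codon):
    """Proportion of single-step neighbours that encode a different residue."""
    changed = sum(CODON_TABLE[nb] != CODON_TABLE[codon]
                  for nb in neighbours(codon))
    return Fraction(changed, 9)

def mean_volatility(coding_sequence):
    """Average volatility over the codons of an in-frame RNA sequence."""
    codons = [coding_sequence[i:i + 3]
              for i in range(0, len(coding_sequence) - 2, 3)]
    return sum(volatility(c) for c in codons) / len(codons)

if __name__ == "__main__":
    print(volatility("CUG"), volatility("AUG"))
```

As the text stresses, the mean volatility of one protein is only informative relative to the distribution over all proteins in the same genome.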

Acknowledgements We are grateful to an anonymous referee for helpful comments. B.H. is financially supported by Dehner Gartencenter GmbH and the Stifterverband für die Deutsche Wissenschaft.

References

Abril JF, Guigó R, Wiehe T (2004) gff2aplot: plotting sequence comparisons. Bioinformatics 19:2477–2479

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman D (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402

Andersson SG, Alsmark C, Canback B, Davids W, Frank C, Karlberg O, Klasson L, Antoine-Legault B, Mira A, Tamas I (2002) Comparative genomics of microbial pathogens and symbionts. Bioinformatics 18 [Suppl 2]:S17

Aparicio S, Chapman J, Stupka E, Putnam N, Chia JM, Dehal P, Christoffels A, Rash S, Hoon S, Smit A, Gelpke MD, Roach J, Oh T, Ho IY, Wong M, Detter C, Verhoef F, Predki P, Tay A, Lucas S, Richardson P, Smith SF, Clark MS, Edwards YJ, Doggett N, Zharkikh A, Tavtigian SV, Pruss D, Barnstead M, Evans C, Baden H, Powell J, Glusman G, Rowen L, Hood L, Tan YH, Elgar G, Hawkins T, Venkatesh B, Rokhsar D, Brenner S (2002) Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297:1301–1310

Baeza-Yates RA, Perleberg CH (1992) Fast and practical approximate string matching. In: Springer (ed) Proc 3rd Symp Combinatorial Pattern Matching. (Springer lecture notes in computer science, vol 644) Springer, Berlin Heidelberg New York, pp 185–192

Bennetzen J (2002) Opening the door to comparative plant biology. Science 296:60–63

Bernal A, Ear U, Kyrpides N (2001) Genomes online database (GOLD): a monitor of genome projects world-wide. Nucleic Acids Res 29:126–127

Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM (2003) Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299:1391–1394

Brachat S, Dietrich FS, Voegeli S, Zhang Z, Stuart L, Lerch A, Gates K, Gaffney T, Philippsen P (2003) Reinvestigation of the Saccharomyces cerevisiae genome annotation by comparison to the genome of a related fungus: Ashbya gossypii. Genome Biol 4:R45

Bray N, Dubchak I, Pachter L (2003) AVID: a global alignment program. Genome Res 13:97–102

Brendel V, Kurtz S, Walbot V (2002) Comparative genomics of Arabidopsis and maize: prospects and limitations. Genome Biol 3:1005

Brenner S, Elgar G, Sandford R, Macrae A, Venkatesh B, Aparicio S (1993) Characterization of the pufferfish (Fugu) genome as a compact model vertebrate genome. Nature 366:265–268

Brosch R, Pym AS, Gordon SV, Cole ST (2001) The evolution of mycobacterial pathogenicity: clues from comparative genomics. Trends Microbiol 9:452–458

Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol 268:78–94

Buysse JM (2001) The role of genomics in antibacterial target discovery. Curr Med Chem 8:1713–1726

Casjens S (1998) The diverse and dynamic structure of bacterial genomes. Annu Rev Genet 32:339–377

Chiaromonte F, Yap VB, Miller W (2002) Scoring pairwise genomic sequence alignments. Pacific Symp Biocomput 2002:115–126

Chung HR, Gusfield D (2003) Perfect phylogeny haplotyper: haplotype inferral using a tree model. Bioinformatics 19:780–781

Clark AG (1990) Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol 7:111–122

Clark AG, Gibson G, Kaufman T, Myers E, O’Grady P (2003) Draft proposal for Drosophila as a model system for comparative genomics. http://life.biology.mcmaster.ca/brian/evoldir.html

Clark MS (1999) Comparative genomics: the key to understanding the human genome project. Bioessays 21:121–130

Cole JR, Chai B, Marsh TL, Farris RJ, Wang Q, Kulam SA, Chandra S, McGarrell DM, Schmidt TM, Garrity GM, Tiedje JM (2003) The ribosomal database project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res 31:442–443

Cole ST (1998) Comparative mycobacterial genomics. Curr Opin Microbiol 1:567–571

Couronne O, Poliakov A, Bray N, Ishkhanov T, Ryaboy D, Rubin E, Pachter L, Dubchak I (2003) Strategies and tools for whole-genome alignments. Genome Res 13:73–80

Crollius HR, Jaillon O, Bernot A, Dasilva C, Bouneau L, Fischer C, Fizames C, Wincker P, Brottier P, Quetier F, Saurin W, Weissenbach J (2000) Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat Genet 25:235–238

Darwin C (1859) On the origin of species by means of natural selection or the preservation of favoured races in the struggle for life, 1985 edn. Penguin, London

Dayhoff MO, Schwartz RM, Orcutt BC (1978) A model of evolutionary change in proteins. In: Dayhoff MO (ed) Atlas of protein sequence and structure, vol 5. National Biomedical Research Foundation, Washington, D.C., pp 345–352

Delcher AL, Kasif S, Fleischmann RD, Peterson J, White W, Salzberg SL (1999) Alignment of whole genomes. Nucleic Acids Res 27:2369–2376

Dermitzakis ET, Reymond A, Lyle R, Scamuffa N, Ucla C, Deutsch S, Stevenson BJ, Flegel V, Bucher P, Jongeneel CV, Antonarakis SE (2002) Numerous potentially functional but non-genic conserved sequences on human chromosome 21. Nature 420:578–582

Dubchak I, Brudno M, Loots GG, Pachter L, Mayor C, Rubin EM, Frazer KA (2000) Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res 10:1304–1306

Edwards AWF, Cavalli-Sforza LL (1964) Reconstruction of evolutionary trees. In: Heywood VH, McNeill J (eds) Phenetic and phylogenetic classification. Systematics Association, London, pp 67–76

Eigen M, Winkler-Oswatitsch R, Dress A (1988) Statistical geometry in sequence space: a method of quantitative comparative sequence analysis. Proc Natl Acad Sci USA 85:5913–5917

Elgar G, Sandford R, Aparicio S, Macrae A, Venkatesh B, Brenner S (1996) Small is beautiful: comparative genomics with the pufferfish (Fugu rubripes). Trends Genet 12:145–150

Fay JC, Wu CI (2000) Hitchhiking under positive Darwinian selection. Genetics 155:1405–1413

Felsenstein J (1993) PHYLIP (phylogeny inference package). University of Washington, Seattle

Felsenstein J (2004) Inferring phylogenies. Sinauer, Sunderland, Mass.

Field D, Hood D, Moxon R (1999) Contribution of genomics to bacterial pathogenesis. Curr Opin Genet Dev 9:700–703

Fitzgerald JR, Musser JM (2001) Evolutionary genomics of pathogenic bacteria. Trends Microbiol 9:547–553

Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton GG, Kelley JM, Fritchman JL, Weidman JF, Small KV, Sandusky M, Fuhrmann JL, Nguyen DT, Utterback T, Saudek DM, Phillips CA, Merrick JM, Tomb J, Dougherty BA, Bott KF, Hu PC, Lucier TS, Peterson SN, Smith HO, Venter JC (1995) The minimal gene complement of Mycoplasma genitalium. Science 270:397–403

Fu YX, Li WH (1993) Statistical tests of neutrality of mutations. Genetics 133:693–709

Galagan JE, Nusbaum C, Roy A, Endrizzi MG, Macdonald P, FitzHugh W, Calvo S, Engels R, Smirnov S, Atnoor D, Brown A, Allen N, Naylor J, Stange-Thomann N, DeArellano K, Johnson R, Linton L, McEwan P, McKernan K, Talamas J, Tirrell A, Ye W, Zimmer A, Barber RD, Cann I, Graham DE, Grahame DA, Guss AM, Hedderich R, Ingram-Smith C, Kuettner HC, Krzycki JA, Leigh JA, Li W, Liu J, Mukhopadhyay B, Reeve JN, Smith K, Springer TA, Umayam LA, White O, White RH, Conway de Macario E, Ferry JG, Jarrell KF, Jing H, Macario AJ, Paulsen I, Pritchett M, Sowers KR, Swanson RV, Zinder SH, Lander E, Metcalf WW, Birren B (2002) The genome of M. acetivorans reveals extensive metabolic and physiological diversity. Genome Res 12:532–542

Galperin MY, Koonin EV (2003) Frontiers in computational genomics. (Functional genomics, vol 3) Caister, Wymondham

Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415:141–147

Gilad Y, Rosenberg S, Przeworski M, Lancet D, Skorecki K (2002) Evidence for positive selection and population structure at the human MAO-A gene. Proc Natl Acad Sci USA 99:862–867

Gish W, States D (1993) Identification of protein coding regions by database similarity search. Nat Genet 3:266–272

Glinka S, Ometto L, Mousset S, Stephan W, Lorenzo DD (2003) Demography and natural selection have shaped genetic variation in Drosophila melanogaster: a multi-locus approach. Genetics 165:1269–1278

Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C, Johnston M, Louis EJ, Mewes HW, Murakami Y, Philippsen P, Tettelin H, Oliver SG (1996) Life with 6000 genes. Science 274:546–563

Goldstein DB (2001) Islands of linkage disequilibrium. Nat Genet 29:109–111

Griffiths RC, Marjoram P (1997) An ancestral recombination graph. In: Donnelly P, Tavaré S (eds) Progress in population genetics and human evolution. (The IMA volumes in mathematics and its applications, vol 87) Springer, Berlin Heidelberg New York, pp 257–270

Gusfield D (1997) Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, Cambridge

Gusfield D (2001) Inference of haplotypes from samples of diploid populations: complexity and algorithms. J Comput Biol 8:305–323

Hardison RC (2000) Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet 16:369–372

Hardison RC, Oeltjen J, Miller W (1997) Long human–mouse sequence alignments reveal novel regulatory elements: a reason to sequence the mouse genome. Genome Res 7:959–966

Haubold B, Wiehe T (2001) Statistics of divergence times. Mol Biol Evol 18:1157–1160

Haubold B, Wiehe T (2002) Calculating the SNP-effective sample size from an alignment. Bioinformatics 18:36–38

Haubold B, Kroymann J, Ratzka A, Mitchell-Olds T, Wiehe T (2002) Recombination and gene conversion in Arabidopsis thaliana. Genetics 161:1269–1278

Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–10919

Horikawa Y, Oda N, Cox NJ, Li X, Orho-Melander M, Hara M, Hinokio Y, Lindner TH, Mashima H, Schwarz PEH, Bosque-Plata L del, Horikawa Y, Oda Y, Yoshiuchi I, Colilla S, Polonsky KS, Wei S, Concannon P, Iwasaki N, Schulze J, Baier LJ, Bogardus C, Groop L, Boerwinkle E, Hanis CL, Bell GI (2000) Genetic variation in the gene encoding calpain-10 is associated with type 2 diabetes mellitus. Nat Genet 26:163–175

Hudson RR (1983) Properties of a neutral allele model with intragenic recombination. Theor Popul Biol 23:183–201

Hudson RR (1990) Gene genealogies and the coalescent process. Oxford Surv Evol Biol 7:1–44

Hudson RR (2002) Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics 18:337–338

Hudson RR, Kaplan NL (1987) The coalescent process in models with selection and recombination. Genetics 120:831–840

Hudson RR, Slatkin M, Maddison WP (1992) Estimation of levels of gene flow from DNA sequence data. Genetics 132:583–589

Huson DH (1998) SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14:68–73

Hutchison CA III, Peterson SN, Gill SR, Cline RT, White O, Fraser CM, Smith HO, Venter CJ (1999) Global transposon mutagenesis and a minimal Mycoplasma genome. Science 286:2165

International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921

International SNP Map Working Group (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409:928–933

Kan Z, Rouchka E, Gish W, States D (2001) Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res 11:889–900

Kaneko T, Nakamura Y, Sato S, Minamisawa K, Uchiumi T, Sasamoto S, Watanabe A, Idesawa K, Iriguchi M, Kawashima K, Kohara M, Matsumoto M, Shimpo S, Tsuruoka H, Wada T, Yamada M, Tabata S (2002) Complete genomic sequence of nitrogen-fixing symbiotic bacterium Bradyrhizobium japonicum USDA110. DNA Res 9:189–197

Kawai J, Shinagawa A, Shibata K, Yoshino M, Itoh M, Ishii Y, Arakawa T, Hara A, Fukunishi Y, Konno H, Adachi J, Fukuda S, Aizawa K, Izawa M, Nishi K, Kiyosawa H, Kondo S, Yamanaka I, Saito T, Okazaki Y, Gojobori T, Bono H, Kasukawa T, Saito R, Kadota K, Matsuda H, Ashburner M, Batalov S, Casavant T, Fleischmann W, Gaasterland T, Gissi C, King B, Kochiwa H, Kuehl P, Lewis S, Matsuo Y, Nikaido I, Pesole G, Quackenbush J, Schriml LM, Staubli F, Suzuki R, Tomita M, Wagner L, Washio T, Sakai K, Okido T, Furuno M, Aono H, Baldarelli R, Barsh G, Blake J, Boffelli D, Bojunga N, Carninci P, De Bonaldo MF, Brownstein MJ, Bult C, Fletcher C, Fujita M, Gariboldi M, Gustincich S, Hill D, Hofmann M, Hume DA, Kamiya M, Lee NH, Lyons P, Marchionni L, Mashima J, Mazzarelli J, Mombaerts P, Nordone P, Ring B, Ringwald M, Rodriguez I, Sakamoto N, Sasaki H, Sato K, Schonbach C, Seya T, Shibata Y, Storch KF, Suzuki H, Toyo-oka K, Wang KH, Weitz C, Whittaker C, Wilming L, Wynshaw-Boris A, Yoshida K, Hasegawa Y, Kawaji H, Kohtsuki S, Hayashizaki Y (2001) Functional annotation of a full-length mouse cDNA collection. Nature 409:685–690

Kececioglu J, Gusfield D (1998) Reconstructing a history of recombinations from a set of sequences. Discrete Appl Math 88:239–260

Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES (2003) Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423:241–254

Kent WJ (2002) BLAT: the BLAST-like alignment tool. Genome Res 12:656–664

Kessler MM, Zeng Q, Hogan S, Cook R, Morales AJ, Cottarel G (2003) Systematic discovery of new genes in the Saccharomyces cerevisiae genome. Genome Res 13:264–271

Kingman JFC (1982a) The coalescent. Stochastic Process Appl 13:235–248

Kingman JFC (1982b) On the genealogy of large populations. J Appl Probab 19A:27–43

Kingman JFC (2000) Origins of the coalescent: 1974–1982. Genetics 154:1461–1463

Koch MA, Weisshaar B, Kroymann J, Haubold B, Mitchell-Olds T (2001) Comparative genomics and regulatory evolution: conservation and function of the chs and apetala3 promoters. Mol Biol Evol 18:1882–1891

Koonin EV, Mushegian AR (1996) Complete genome sequences of cellular life forms: glimpses of theoretical evolutionary genomics. Curr Opin Genet Dev 6:757–762

Korf I, Flicek P, Duan D, Brent MR (2001) Integrating genomic homology into gene structure prediction. Bioinformatics 17:S140–S148

Kornberg TB, Krasnow MA (2000) The Drosophila genome sequence: implications for biology and medicine. Science 287:2218–2220

Li WH (1997) Molecular evolution. Sinauer, Sunderland, Mass.

Makarova KS, Koonin EV (2003) Comparative genomics of Archaea: how much have we learned in six years, and what’s next? Genome Biol 4:115

Mayer K, Murphy G, Tarchini R, Wambutt R, Volckaert G, Pohl T, Dusterhof A, Stiekema W, Entian KD, Terryn N, Lemcke K, Haase D, Hall CR, Dodeweerd AM van, Tingey SV, Mewes HW, Bevan MW, Bancroft I (2001) Conservation of microstructure between a sequenced region of the genome of rice and multiple segments of the genome of Arabidopsis thaliana. Genome Res 11:1167–1174

Maynard Smith J, Haigh J (1974) The hitch-hiking effect of a favourable gene. Genet Res 23:23–35

McVean GAT, Myers SR, Hunt S, Deloukas P, Bentley DR, Donnelly P (2004) The fine-scale structure of recombination rate variation in the human genome. Science 304:581–584

Meyer I, Durbin R (2002) Comparative ab initio prediction of gene structures using pair HMMs. Bioinformatics 18:1309–1318

Miller W (2001) Comparison of genomic DNA sequences: solved and unsolved problems. Bioinformatics 17:391–397

Mouse Genome Sequencing Consortium (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–561

Mural RJ, Adams MD, Myers EW, Smith HO, Miklos GL, Wides R, Halpern A, Li PW, Sutton GG, Nadeau J, Salzberg SL, Holt RA, Kodira CD, Lu F, Chen L, Deng Z, Evangelista CC, Gan W, Heiman TJ, Li J, Li Z, Merkulov GV, Milshina NV, Naik AK, Qi R, Shue BC, Wang A, Wang J, Wang X, Yan X, Ye J, Yooseph S, Zhao Q, Zheng L, Zhu SC, Biddick K, Bolanos R, Delcher AL, Dew IM, Fasulo D, Flanigan MJ, Huson DH, Kravitz SA, Miller JR, Mobarry CM, Reinert K, Remington KA, Zhang Q, Zheng XH, Nusskern DR, Lai Z, Lei Y, Zhong W, Yao A, Guan P, Ji RR, Gu Z, Wang ZY, Zhong F, Xiao C, Chiang CC, Yandell M, Wortman JR, Amanatides PG, Hladun SL, Pratts EC, Johnson JE, Dodson KL, Woodford KJ, Evans CA, Gropman B, Rusch DB, Venter E, Wang M, Smith TJ, Houck JT, Tompkins DE, Haynes C, Jacob D, Chin SH, Allen DR, Dahlke CE, Sanders R, Li K, Liu X, Levitsky AA, Majoros WH, Chen Q, Xia AC, Lopez JR, Donnelly MT, Newman MH, Glodek A, Kraft CL, Nodell M, Ali F, An HJ, Baldwin-Pitts D, Beeson KY, Cai S, Carnes M, Carver A, Caulk PM, Center A, Chen YH, Cheng ML, Coyne MD, Crowder M, Danaher S, Davenport LB, Desilets R, Dietz SM, Doup L, Dullaghan P, Ferriera S, Fosler CR, Gire HC, Gluecksmann A, Gocayne JD, Gray J, Hart B, Haynes J, Hoover J, Howland T, Ibegwam C, Jalali M, Johns D, Kline L, Ma DS, MacCawley S, Magoon A, Mann F, May D, McIntosh TC, Mehta S, Moy L, Moy MC, Murphy BJ, Murphy SD, Nelson KA, Nuri Z, Parker KA, Prudhomme AC, Puri VN, Qureshi H, Raley JC, Reardon MS, Regier MA, Rogers YH, Romblad DL, Schutz J, Scott JL, Scott R, Sitter CD, Smallwood M, Sprague AC, Stewart E, Strong RV, Suh E, Sylvester K, Thomas R, Tint NN, Tsonis C, Wang G, Wang G, Williams MS, Williams SM, Windsor SM, Wolfe K, Wu MM, Zaveri J, Chaturvedi K, Gabrielian AE, Ke Z, Sun J, Subramanian G, Venter JC, Pfannkoch CM, Barnstead M, Stephenson LD (2002) A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science 296:1661–1671

Mushegian AR, Koonin EV (1996) A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc Natl Acad Sci USA 93:10268–10273

National Institutes of Health and Department of Energy (1990) Understanding our genetic inheritance. (The United States human genome project; the first five years: fiscal years 1991–1995. Technical report) National Institutes of Health and Department of Energy http://www.genome.gov

Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–453

Nordborg M (2001) Coalescent theory. In: Balding DJ, Bishop M, Cannings C (eds) Handbook of statistical genetics. Wiley, Mannheim, pp 178–212

Nurtdinov RN, Artamonova II, Mironov AA, Gelfand MS (2003) Low conservation of alternative splicing patterns in the human and mouse genomes. Hum Mol Genet 12:1313–1320

Ohler U, Niemann H (2001) Identification and analysis of eukaryotic promoters: recent computational approaches. Trends Genet 17:56–60

Ohler U, Niemann H, Liao GC, Rubin GM (2001) Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition. Bioinformatics 17 [Suppl]:S199–S206

Pachter L, Alexandersson M, Cawley S (2001) Applications of generalized pair hidden Markov models to alignment and gene finding problems. In: Press A (ed) Proceedings of the fifth annual conference on computational molecular biology. RECOMB, New York, pp 241–248

Parra G, Agarwal P, Abril J, Wiehe T, Fickett J, Guigó R (2003) Comparative gene prediction in human and mouse. Genome Res 13:108–117

Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, Lee DH, Marjoribanks C, McDonough DP, Nguyen BT, Norris MC, Sheehan JB, Shen N, Stern D, Stokowski RP, Thomas DJ, Trulson MO, Vyas KR, Frazer KA, Fodor SP, Cox DR (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294:1719–1723

Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85:2444–2448

Plotkin JP, Dushoff J, Fraser HB (2004) Detecting selection using a single genome sequence of M. tuberculosis and P. falciparum. Nature 428:942–945

Ponomarenko JV, Ponomarenko MP, Frolov AS, Vorobyev DG, Overton GC, Kolchanov NA (1999) Conformational and physicochemical DNA features specific for transcription factor binding sites. Bioinformatics 15:654–668

Reich DR, Schaffner SF, Daly MJ, McVean G, Mullikin JC, Higgins JM, Richter DJ, Lander ES, Altschuler D (2002) Human genome sequence variation and the influence of gene history, mutation and recombination. Nat Genet 32:135–142

Reichwald K (2003) Interspeziesvergleich genomischer DNA-Sequenzen zur Genidentifizierung in 240 kb des humanen und murinen X-Chromosoms. PhD thesis, Friedrich Schiller Universität, Jena

Rubin GM, Yandell MD, Wortman JR, Gabor Miklos GL, Nelson CR, Hariharan IK, Fortini ME, Li PW, Apweiler R, Fleischmann W, Cherry JM, Henikoff S, Skupski MP, Misra S, Ashburner M, Birney E, Boguski MS, Brody T, Brokstein P, Celniker SE, Chervitz SA, Coates D, Cravchik A, Gabrielian A, Galle RF, Gelbart WM, George RA, Goldstein LS, Gong F, Guan P, Harris NL, Hay BA, Hoskins RA, Li J, Li Z, Hynes RO, Jones SJ, Kuehl PM, Lemaitre B, Littleton JT, Morrison DK, Mungall C, O’Farrell PH, Pickeral OK, Shue C, Vosshall LB, Zhang J, Zhao Q, Zheng XH, Lewis S (2000) Comparative genomics of the eukaryotes. Science 287:2204–2215

Ruepp A, Graml W, Santos-Martinez ML, Koretke KK, Volker C, Mewes HW, Frishman D, Stocker S, Lupas AN, Baumeister W (2000) The genome sequence of the thermoacidophilic scavenger Thermoplasma acidophilum. Nature 407:508–513

Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425

Schlötterer C (2003) Hitchhiking mapping: functional genomics from the population genetics perspective. Trends Genet 19:32–38

Schoolnik GK (2002) Functional and comparative genomics of pathogenic bacteria. Curr Opin Microbiol 5:20–26

Schwartz R, Clark AG, Istrail S (2002) Methods for inferring block-wise ancestral history from haploid sequences. In: Guigó R, Gusfield D (eds) Lecture notes in computer science, Springer, Berlin Heidelberg New York, pp 44–59

Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W (2003) Human–mouse alignments with BLASTZ. Genome Res 13:103–107

Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197

Sorrells ME, La Rota M, Bermudez-Kandianis CE, Greene RA, Kantety R, Munkvold JD, Miftahudin A, Mahmoud A, Ma X, Gustafson PJ, Qi LL, Echalier B, Gill BS, Matthews DE, Lazo GR, Chao S, Anderson OD, Edwards H, Linkiewicz AM, Dubcovsky J, Akhunov ED, Dvorak J, Zhang D, Nguyen HT, Peng J, Lapitan NL, Gonzalez-Hernandez JL, Anderson JA, Hossain K, Kalavacharla V, Kianian SF, Choi DW, Close TJ, Dilbirligi M, Gill KS, Steber C, Walker-Simmons MK, McGuire PE, Qualset CO (2003) Comparative DNA sequence analysis of wheat and rice genomes. Genome Res 13:1818–1827

Strimmer K, Haeseler A von (1997) Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment. Proc Natl Acad Sci USA 94:6815–6819

Taher L, Rinner O, Garg S, Sczyrba A, Brudno M, Batzoglou S, Morgenstern B (2003) Agenda: homology-based gene prediction. Bioinformatics 12:1575–1577

Tajima F (1989) Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595

Thomas JW, Touchman JW (2002) Vertebrate genome sequencing: building a backbone for comparative genomics. Trends Genet 18:104–108

Thomas JW, Touchman JW, Blakesley RW, Bouffard GG, Beckstrom-Sternberg SM, Margulies EH, Blanchette M, Siepel AC, Thomas PJ, McDowell JC, Maskeri B, Hansen NF, Schwartz MS, Weber RJ, Kent WJ, Karolchik D, Bruen TC, Bevan R, Cutler DJ, Schwartz S, Elnitski L, Idol JR, Prasad AB, Lee-Lin SQ, Maduro VV, Summers TJ, Portnoy ME, Dietrich NL, Akhter N, Ayele K, Benjamin B, Cariaga K, Brinkley CP, Brooks SY, Granite S, Guan X, Gupta J, Haghighi P, Ho SL, Huang MC, Karlins E, Laric PL, Legaspi R, Lim MJ, Maduro QL, Masiello CA, Mastrian SD, McCloskey JC, Pearson R, Stantripop S, Tiongson EE, Tran JT, Tsurgeon C, Vogt JL, Walker MA, Wetherby KD, Wiggins LS, Young AC, Zhang LH, Osoegawa K, Zhu B, Zhao B, Shu CL, De Jong PJ, Lawrence CE, Smit AF, Chakravarti A, Haussler D, Green P, Miller W, Green ED (2003) Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424:788–793

Thompson JD, Higgins DG, Gibson TJ (1994) Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673–4680

Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, Levine AJ, Roberts RJ, Simon M, Slayman C, Hunkapiller M, Bolanos R, Delcher A, Dew I, Fasulo D, Flanigan M, Florea L, Halpern A, Hannenhalli S, Kravitz S, Levy S, Mobarry C, Reinert K, Remington K, Abu-Threideh J, Beasley E, Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I, Charlab R, Chaturvedi K, Deng Z, Di Francesco V, Dunn P, Eilbeck K, Evangelista C, Gabrielian AE, Gan W, Ge W, Gong F, Gu Z, Guan P, Heiman TJ, Higgins ME, Ji RR, Ke Z, Ketchum KA, Lai Z, Lei Y, Li Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina N, Moore HM, Naik AK, Narayan VA, Neelam B, Nusskern D, Rusch DB, Salzberg S, Shao W, Shue B, Sun J, Wang Z, Wang A, Wang X, Wang J, Wei M, Wides R, Xiao C, Yan C, Yao A, Ye J, Zhan M, Zhang W, Zhang H, Zhao Q, Zheng L, Zhong F, Zhong W, Zhu S, Zhao S, Gilbert D, Baumhueter S, Spier G, Carter C, Cravchik A, Woodage T, Ali F, An H, Awe A, Baldwin D, Baden H, Barnstead M, Barrow I, Beeson K, Busam D, Carver A, Center A, Cheng ML, Curry L, Danaher S, Davenport L, Desilets R, Dietz S, Dodson K, Doup L, Ferriera S, Garg N, Gluecksmann A, Hart B, Haynes J, Haynes C, Heiner C, Hladun S, Hostin D, Houck J, Howland T, Ibegwam C, Johnson J, Kalush F, Kline L, Koduru S, Love A, Mann F, May D, McCawley S, McIntosh T, McMullen I, Moy M, Moy L, Murphy B, Nelson K, Pfannkoch C, Pratts E, Puri V, Qureshi H, Reardon M, Rodriguez R, Rogers YH, Romblad D, Ruhfel B, Scott R, Sitter C, Smallwood M, Stewart E, Strong R, Suh E, Thomas R, Tint NN, Tse S, Vech C, Wang G, Wetter J, Williams S, Williams M, Windsor S, Winn-Deen E, Wolfe K, Zaveri J, Zaveri K, Abril JF, Guigó R, Campbell MJ, Sjolander KV, Karlak B, Kejariwal A, Mi H, Lazareva B, Hatton T, Narechania A, Diemer K, Muruganujan A, Guo N, Sato S, Bafna V, Istrail S, Lippert R, Schwartz R, Walenz B, Yooseph S, Allen D, Basu A, Baxendale J, Blick L, Caminha M, Carnes-Stine J, Caulk P, Chiang YH, Coyne M, Dahlke C, Mays A, Dombroski M, Donnelly M, Ely D, Esparham S, Fosler C, Gire H, Glanowski S, Glasser K, Glodek A, Gorokhov M, Graham K, Gropman B, Harris M, Heil J, Henderson S, Hoover J, Jennings D, Jordan C, Jordan J, Kasha J, Kagan L, Kraft C, Levitsky A, Lewis M, Liu X, Lopez J, Ma D, Majoros W, McDaniel J, Murphy S, Newman M, Nguyen T, Nguyen N, Nodell M, Pan S, Peck J, Peterson M, Rowe W, Sanders R, Scott J, Simpson M, Smith T, Sprague A, Stockwell T, Turner R, Venter E, Wang M, Wen M, Wu D, Wu M, Xia A, Zandieh A, Zhu X (2001) The sequence of the human genome. Science 291:1304–1351

Watterson GA (1975) On the number of segregating sites in genetical models without recombination. Theor Popul Biol 7:256–276

Weiner J (1994) The beak of the finch. Vintage, New York

Werner T (2003a) Promoters can contribute to the elucidation of protein function. Trends Biotechnol 21:9–13

Werner T (2003b) The state of the art of mammalian promoter recognition. Brief Bioinform 4:22–30

Wiehe T, Guigó R, Miller W (2000) Genome sequence comparisons: hurdles in the fast lane to functional genomics. Brief Bioinform 1:381–388

Willey JS, Dao-Ung LP, Sluyter R, Shemon AN, Li C, Taper J, Gallo J, Manoharan A (2002) A loss-of-function polymorphic mutation in the cytolytic P2X7 receptor gene and chronic lymphocytic leukaemia: a molecular study. Lancet 359:1114–1119

Wiuf C, Hein J (2000) The coalescent with gene conversion. Genetics 155:451–462

Wong GKS, Passey DA, Yu J (2001) Most of the human genome is transcribed. Genome Res 11:1975–1977

Zhang CT, Zhang R, Ou HY (2003) The Z curve database: a graphic representation of genome sequences. Bioinformatics 19:593–599

Zhou W, Goodman SN, Galizia G, Lieto C, Ferraraccio F, Pignatelli C, Purdie CA, Piris J, Morris R, Harrison DJ, Paty PB, Culliford A, Romans KE, Montgomery EA, Choti MA, Kinzler KW, Vogelstein B (2002) Counting alleles to predict recurrence of early-stage colorectal cancers. Lancet 359:219–225

Zmasek CM, Eddy SR (2001) ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics 17:383–384
