bioinformatics - llama.mshri.on.callama.mshri.on.ca/publications/grad_bioinformatics_2004.pdf ·...

BIOINFORMATICS Vol 20 no 16 2004 pages 2738ndash2750doi101093bioinformaticsbth320

Prediction of similarly acting cis-regulatorymodules by subsequence profiling andcomparative genomics in Drosophilamelanogaster and Dpseudoobscura

Yonatan H Grad1 Frederick P Roth2 Marc S Halfon3 andGeorge M Church1lowast

1The Lipper Center for Computational Genetics and Department of Genetics77 Avenue Louis Pasteur and 2Department of Biological Chemistry and MolecularPharmacology 250 Longwood Avenue Harvard Medical School BostonMassachusetts 02115 USA and 3Department of Biochemistry and Center ofExcellence in Bioinformatics SUNY at Buffalo Buffalo New York 14214 USA

Received on December 23 2003 revised on April 7 2004 accepted on April 27 2004

Advance Access publication May 4 2004

ABSTRACTMotivation To date computational searches for cis-regulatory modules (CRMs) have relied on two methodsThe first phylogenetic footprinting has been used to findCRMs in non-coding sequence but does not directly link DNAsequence with spatio-temporal patterns of expression Thesecond based on searches for combinations of transcriptionfactor (TF) binding motifs has been employed in genome-wide discovery of similarly acting enhancers but requires priorknowledge of the set of TFs acting at the CRM and the TFsrsquobinding motifsResults We propose a method for CRM discovery that com-bines aspects of both approaches in an effort to overcometheir individual limitations By treating phylogenetically foot-printed non-coding regions (PFRs) as proxies for CRMs weendeavor to find PFRs near co-regulated genes that are com-prised of similar short conserved sequences Using Markovchains as a convenient formulation to assess similarity wedevelop a sampling algorithm to search a large group ofPFRs for the most similar subset When starting with a set ofgenes involved in Drosophila early blastoderm developmentand using phylogenetic comparisons of Drosophila melano-gaster and Dpseudoobscura genomes we show here thatour algorithm successfully detects known CRMs Further weuse our similarity metric based on Markov chain discrimina-tion in a genome-wide search and uncover additional knownand many candidate early blastoderm CRMsAvailability Software is available via httparepmedharvardeduenhancersContact see httparepmedharvardeduemailhtml

lowastTo whom correspondence should be addressed

Supplementary information Can be accessed at httparepmedharvardeduenhancers

INTRODUCTIONThe complex regulation of metazoan gene expression is sub-stantially controlled through the interaction of transcriptionfactors (TFs) and cis-regulatory DNA sequences These cis-regulatory sequences are organized into modules whereeach module integrates input from a specific set of tran-scription factors to direct a corresponding spatiotemporalexpression pattern These key regulatory sequences termedcis-regulatory modules (CRMs) and often referred to aslsquoenhancersrsquo share a number of important features They areusually sim500ndash1000 bp in length are located in genomicsequence near the genes they regulate and contain one ormore binding sites for each set of TFs (Carroll et al 2001Davidson 2001) CRMs can thus be thought of as sit-ting at the nexus of gene regulatory networks they areDNA sequences which assist in translating a combinat-orial code of TF inputs into a specific gene expressionoutput

Although little is understood about the evolutionary pro-cesses affecting CRMs studies have observed that theyundergo stabilizing selection with maintenance of the overallset of TF inputs and resulting expression pattern coupled withspecies-specific gain and loss of TF binding sites a processknown as lsquoturnoverrsquo (Ludwig et al 1998 2000 Dermitzakisand Clark 2002 Dermitzakis et al 2003) The degree ofturnover appears to increase with evolutionary distance How-ever the CRMs are under functional constraint and appearto change much more slowly than non-functional sequenceThey are expected to have a degree of sequence conservation

2738 Bioinformatics vol 20 issue 16 copy Oxford University Press 2004 all rights reserved

Similarly acting cis-regulatory modules by subsequence profiling and comparative genomics

that given sufficient evolutionary distance is significantlyhigher than background and that is related to the rate of bind-ing site turnover change andor loss of CRM function (Fickettand Wasserman 2000 Levy et al 2001 Dermitzakis andClark 2002 Moses et al 2003)

Existing sequence analysis-based tools for identificationof CRMs take two main approaches The first involvesinter-species comparisons designed to take advantage of evol-utionary conservation of regulatory sequences an approachtermed lsquophylogenetic footprintingrsquo (Tagle et al 1988) Herenon-coding sequences conserved between two or more relatedspecies are treated as likely candidates for regulatory regionsThe observed conservation arises from TF binding sites thatremain stable in relative location to one another and henceform lsquoanchorsrsquo that seed long linear sequence alignmentsPhylogenetic footprinting has routinely served as a guide to thediscovery of regulatory sequences (Blackman and Meselson1986 Gumucio et al 1993 Vuillaumier et al 1997 Lootset al 2000 McGuire et al 2000 Bergman et al 2002)This method is often used as a first step to define the bound-aries of potential regulatory sequences and is followed byexperimental studies to elucidate the spatio-temporal pat-tern of expression if any that those potential regulatorysequences might direct Since metazoan genes generally pos-sess complex expression patterns directed by multiple CRMsa search for a module responsible for a particular patternrequires laborious experimental testing of all nearby candidatesequences

Recently several groups have designed a second set ofapproaches that predict CRMs by identifying clusters ofpotential TF binding sites Starting with a set of coordin-ately acting TFs and their experimentally described bindingmotifs some of these algorithms search the genome forsequences of sim500ndash1000 bp with uncommonly high concen-trations or co-occurrences of predicted binding sites (Crowleyet al 1997 Fickett and Wasserman 2000 Frith et al 2001Berman et al 2002 Halfon et al 2002 Markstein et al2002 Rajewsky et al 2002 Rebeiz et al 2002) The pre-sumption underlying these approaches is that binding-motifrich sequences provide evidence of TF binding and hencemay direct similar expression patterns In several cases asubset of predictions has been experimentally verified How-ever the arduous task of testing all predictions generatedby these algorithms and the absence of extensive and well-characterized CRM datasets have hindered careful evaluationof the false positive and false negative rates It is also import-ant to point out that all tests in Drosophila melanogaster so farhave concentrated on only a small number of regulatory net-works (chiefly gene expression in the early blastoderm) Thegeneral applicability of TF-binding motif clustering meth-ods to other pathways has not yet been examined and wemay yet learn that other pathways are regulated in a man-ner refractory to or requiring more sophistication than siteclustering-type analyses Two obstacles preventing broader

evaluation of these algorithms are that (1) binding motifshave been confidently described for only a handful of TFs and(2) there are few well-characterized networks of coordinatelyacting TFs that could serve as starting points for additionalbinding site clusteringco-occurrence studies

Just as the availability of full genome sequence promptedthe development of computational tools to comprehensivelysearch for CRMs the availability of the genomes of pairs ofrelated species presents the opportunity to combine the aboveapproaches by introducing systematic phylogenetic compar-isons into algorithms designed to decipher the regulatorycode (Cliften et al 2003 Kellis et al 2003 Moses et al2003) The sequencing of Drosophila pseudoobscura andDmelanogaster a pair of species separated by approximately24 million years of evolution (Russo et al 1995) provides thetools to pursue such aims in metazoans (Adams et al 2000httpwwwhgscbcmtmceduprojectsdrosophila)

We propose here an approach to identify similarly act-ing cis-regulatory modules given genome sequences forDmelangoaster and Dpseudoobscura and a set of co-regulated genes We hypothesize that similarly acting enhan-cers can be identified by conserved subsequence signaturesin phylogenetic footprinted regions (PFRs) Drawing inspir-ation from Gibbs-sampling based motif-finding algorithmsthat successfully identify regulatory sequence motifs near co-regulated genes (Tavazoie et al 1999 Hughes et al 2000)we developed an approach to identify the most similar phylo-genetic footprints near co-regulated genes in a program wecall PFR-Sampler (Fig 1a) As a first step we assemble alist of PFRs To do this we use AvidVista (Dubchak et al2000 Mayor et al 2000 Bray et al 2003) to identify con-served non-coding regions across the genome which we thencollate into a genome-wide set of PFRs We treat each PFRas a potential CRM and generate for each PFR a profile ofthe conserved sequences that anchor the PFR alignment Bykeeping track of the conserved sequence profile with Markovchains we can use Markov chain discrimination to assessPFR similarity (Durbin et al 1998) We then use a samplingalgorithm PFR-Sampler to identify a subset of maximallyrelated PFRs among the full initial set of phylogenetic foot-prints near a set of previously defined co-regulated genes Theconserved sequences that are discovered to anchor the foot-printed regions are candidate TF-binding sequences and thePFRs in the subset of maximally related PFRs are candidateCRMs responsible for the observed co-regulation

This method of CRM discovery effectively circumventsobstacles facing algorithms that aim to predict CRMs bysearching for co-occurrence of TF-binding motifs in that itdoes not require prior knowledge of the constellation of TFsthat might act in the pathway of interest and further does notrequire any prior information regarding TF nucleotide bindingspecificities The ability to use a combination of phylogen-etic conservation and co-expression to compensate for lack ofknowledge of TF constellation and binding sites has precedent

2739

YHGrad et al

Fig 1 Overview of PFR-Sampler and PFR-Searcher methods (a) PFR-Sampler overview A set of co-regulated genes together with theconserved sequence profiles (see Algorithm) for all genomic PFRs are input into PFR-Sampler which identifies the set of PFRs surroundingthe co-regulated genes and identifies the subset with the most similar conserved word profiles that is also most distinct from background (b)PFR-Searcher overview The training set of PFRs along with the PFRs to search are input into PFR-Searcher which constructs a model fromthe training set and reports a ranked list of the scanned PFRs according to similarity to the model

in previous work (Wasserman et al 2000) which demon-strates the feasibility of using humanndashrodent phylogeneticconservation together with co-expression data for genes activein skeletal muscle to identify binding sites for key transcriptionfactors responsible for this expression

To evaluate our algorithm we focus on the regulatorynetwork that coordinates blastoderm expression in earlyDrosophila embryo development This system is very wellcharacterized and has been used in several other studies toevaluate the TF-binding motif clustering algorithms (Fickettand Wasserman 2000 Berman et al 2002 Rajewsky et al2002) Starting only with a list of co-regulated genes and gen-ome sequences from Dmelanogaster and Dpseudoobscurawe show that our approach successfully identifies PFRs thatcorrespond to known blastoderm enhancers when blasto-dermally expressed genes are input and we anticipate thatthis approach can be more broadly applied to other sets ofco-regulated genes

Extending these findings we show that the output of thesampling algorithm can be used in genome-wide scans for

similarly acting enhancers The PFR-Sampler output set ofsimilar PFRs can be used to train a Markov chain discrimin-ation algorithm which we call PFR-Searcher (Fig 1b) andall PFRs can then be scored to evaluate the extent of similar-ity to the model Again taking phylogenetic footprint regionsas possible bona fide cis-regulatory modules we show thatthis approach performs well in recognizing additional knownblastodermal CRMs and predicting new CRMs throughout thegenome

ALGORITHMIdentification of putative regulatory regions byphylogenetic footprintingOur algorithm for identifying putative regulatory regions startswith genome sequences for Dpseudoobscura (httpwwwhgscbcmtmceduprojectsdrosophila) and Dmelanogasterand successively analyzes them for conserved non-codingsequences (CNS) conserved non-coding subsequences(CNSS) and PFRs which later are analyzed using Markov

2740


chain discrimination algorithm The genome sequenceswere first aligned by Avid global alignment soft-ware (Bray et al 2003) available online from theBerkeley Genome Pipeline (version 8 July 2003 httppipelinelblgovpseudo) Non-coding sequence was extrac-ted from these aligned sequences using Release 31 annota-tions (httpwwwfruitflyorgannotrelease3html) From theresulting non-coding sequence alignments using Vista soft-ware (Dubchak et al 2000 Mayor et al 2000) we thenidentified CNSs as groups of aligned non-coding sequenceswith gt60 conservation over 100 nt which are separ-ated by lt100 nt These parameters were estimated by anassessment of conservation and distances between alignednon-coding sequences in a few experimentally defined enhan-cers By examining the sequence alignments within a CRMCNSSs were identified as maximal contiguous stretches ofaligned conserved ungapped sequence that contained atmost two mismatches within an 8 bp sliding window Thisparameter was modeled on the observed changes of TF-binding sites between Dmelanogaster and Dpseudoobscuraand the average nucleotide difference between two TF-bindingsites for the same TF in Dmelanogaster (Ludwig et al1998 2000 Dermitzakis et al 2003) Note that CNSs arestretches of sequence alignment strings each position ofwhich pairs either a base code A or C or G or T or lsquominusrsquo(a gap) from Dmelanogaster against a base code or gap fromDpseudoobscura and that CNSSs are constituent blocks ofaligned nucleotides (no gaps) within a given CNS Henceforthwe consider only the Dmelanogaster sequence correspondingto these aligned sequence stretches

CNSSs which are ungapped are analyzed in setting upour Markov chain algorithm each CNSS string s1 s2 snis considered as a sequence of n minus 5 6-nt windowssisi+1si+2si+3si+4si+5 and each such window is considereda lsquostate transitionrsquo from the prefix 5-word sisi+1si+2si+3si+4

to the suffix 5-word si+1si+2si+3si+4si+5 Any CNS with over300 state-transitions was identified as a PFR a threshold wechose because it reflects the length of archetypal enhancersSince a CNS may be comprised of multiple CNSSs separ-ated by gaps or by stretches of sequences that fall below thethreshold of two mismatches in an 8 bp window we notethat these 300 state-transitions need not be contiguous APFR then is a CNS that is comprised of clustered CNSSscontaining in total more than 300 state-transitions

PFR-Searcher Markov chain discriminationalgorithmA maximum-likelihood (ML) Markov model is generatedfrom the PFRs by computing a frequency table of transitionsfrom all possible strings of five base codes (PFR model) anda ML model is generated by the same method for a set of ran-domly chosen PFRs (background model) As described abovewe encode a first order Markov chain using the alphabet ofall 5mers (mathematically equivalent to a fifth order Markov

chain) In an effort to improve signal-to-noise by diminish-ing the impact of sequences shared among PFRs throughoutthe genome we used all the PFRs identified on chromosomearm 2R as the background model The Markov chain discrim-ination algorithm then computes a score measured in bitsdescribing the similarity of a sequence to a model as com-pared to the background model (Durbin et al 1998) IdeallyML estimators for a given set of state-transition probabilitiesshould be constructed solely from observed frequencies butthis requires an extensive sampling of all transitions How-ever PFR datasets generally yield only sparse estimates oftransition counts Although pseudocounts are often used toovercome sparse estimates we adjusted counts in a morebiologically meaningful manner as binding sites for TFs gen-erally are similar to a consensus sequence and the 6mersconsidered in state-transitions represent possible TF-bindingsites we adjusted frequency counts for any given transitionwith all sequence transitions within a Hamming distance ofone nucleotide

c+st = cst +

sumsi d(sminussi )=1

csi t middot w

where cst indicates the counts of state-transition from words to word t is observed in the training set sequences w is theweight by which all the neighboring counts are multipliedand c+

st represents the resulting consensus adjusted countsBy these means many state-transitions that have an actualcount of 0 in the PFR data may be assigned non-zero adjus-ted counts based on a biologically supported similarity withother transitions that actually did occur Weights used in thiswork were 025 The ML estimators were then determined asfollows

a+st = c+

stsumt prime c

+st prime

where a+st is the ML estimator for the s to t transition in the

training set Using this method ML estimators were computedfrom state-transitions in PFRs to define the PFR model andfrom those in the background to define the background modelThe Markov chain discrimination score between the PFR andbackground models was then calculated for any sequence x

as the log-odds ratio

S(x) = logP(x | model+)

P (x | modelminus)=

Lsumi=1

loga+

ximinus1xi

aminusximinus1xi

for a query sequence with L transitions

PFR-Sampler sampling algorithmThe aim of the sampling algorithm is to identify a subset ofPFRs with the most closely related subsequence profiles Webegin with an input set of genes G = g1 g2 gn Eachgene gi is associated with a set of nearby PFRs (proximity isuser-defined) denoted gi1 gij In these studies all PFRswithin 50 000 bp from gene boundaries were included Where

2741

YHGrad et al

the 50 kb overlap two genes included in the analysis PFRsin the overlapping region were assigned to the locus of theclosest gene The total set of all PFRs associated with genesin G is A The object of this algorithm is to find the subset ofA termed S such that when the members of S are used as atraining model for the Markov chain discrimination test (usingthe same formulation described above for the PFR-Searcheralgorithm) the sum of their scores is maximal A maximumscore indicates that the PFRs in set S possess the most closelyrelated subsequence profiles that are the most distinct frombackground of any possible subset of A

Simulated annealing is used to avoid local minima and thusenable the sampling to search for the set of PFRs with max-imum score (Press et al 1992) The algorithm begins byinitializing the set S to a randomly selected assortment ofPFRs from A The number of initial elements is set by theuser-defined parameter lsquo-prsquo and represents the number ofsimilar PFRs the user expects to discover In this study thenumber of expected PFRs was set as equal to the number ofinput genes Once the algorithm is underway this numberis allowed to fluctuate Additionally we include a require-ment that each gene contribute no more than two PFRs tothe set S to ensure that the set is comprised of contributionsfrom multiple gene loci and that it is not overly biased bythe sequence composition of a given locus The program thencycles through each gene gi and evaluates the score of thecurrent set S and the scores under three additional modelsincluding swapping PFRs removal of a PFR and inclusion ofan additional PFR The score of the highest scoring alternativemodel is compared with the score of the current model If thebest alternative modelrsquos score is greater than that of the currentmodel S is updated to the alternative Otherwise S is updatedto the lower-scoring alternative based on a probability determ-ined from the scoring difference between the alternative andcurrent models (S) and the simulated annealing temperatureschedule where the lsquotemperaturersquo is given below by T

p = exp

(S

T

)

The initial T used in these studies is 20 with a schedule todecrease by 95 after each cycle through G The algorithmhalts when either no change is made to the set S over the courseof several cycles through G or when either 50 updates or 30cycles are completed We note that simulated annealing is astochastic optimization algorithm and the results may varyThe results are dependent on the lsquotemperaturersquo schedule andthe random number seed used Varying schedules may alter theoutput by giving more or less opportunity to move toward theglobal optimum and as is standard for simulated annealingprotocols we encourage trials with varying parameters

SoftwareThe PFR-Sampler and PFR-Searcher programs were writtenin C and use auxiliary modules written in C and PERL All

programs and instructions on their use are available at ourwebsite httparepmedharvardeduenhancers

Cross-validationWe employed lsquoleave-one-outrsquo cross-validation to assess theresults of the sampling algorithm In this procedure a set ofn PFRs (members of set S of the sampling algorithm) is usedto generate n sets of PFRs in which a different single PFR isleft out For each of the n sets the model used in the Markovchain discrimination algorithm is the set itself and scores aredetermined for the left-out PFR and each of a test set of 1000PFRs randomly selected from the genome The rank of thePFR within the set of the 1000 randomly selected PFRs is anindication of its similarity to the training set

RESULTSPhylogenetic footprintingA general and commonly mentioned architectural feature ofcis-regulatory sequences is that they are often though notalways separable and non-overlapping This feature and theevolutionary conservation of the sequences due to functionalconstraint has led to the common use of CNS in search ofCRMs Unanswered questions remain about the appropri-ate evolutionary distance that maximizes the signal-to-noiseratio for picking out meaningfully conserved sequences andabout the applicability of only a single species-pair com-parison to identify all CRMs since CRMs may evolve atdifferent rates Still there are a constantly growing number ofreported successes of CRM discovery using a range of spe-cies pairs including mousendashhuman (Loots et al 2000) andDpseudoobscurandashDmelanogaster (Bergman et al 2002)thus making genome scale analysis of CRM conservation inthese species pairs at least a reasonable if not rigorously estab-lished venture In addition simulation studies are beginningto explore and construct solid foundations for identification offunctionally constrained non-coding sequences (Pollard et al2004)

Here we focus on the DpseudoobscurandashDmelanogasterpair To assemble a complete set of genomic CNSs weobtained global alignments of these two genomes extrac-ted non-coding aligned sequence and identified conservedsequence and subsequence regions that we considered PFRs(see Algorithm) Using these procedures we generated a gen-omic complement of 24 651 PFRs with average length of817 bp average number of conserved nucleotides equal to640 bp and average of 506 state-transitions

PFR-Sampler uncovers PFRs correlating tocis-regulatory modules that regulate blastodermexpressionOne of the best studied developmental networks in metazoansis that of blastoderm development in Dmelanogaster Withmany gene loci closely studied and a number of characterized

2742


Fig 2 Examples of results from the phylogenetic footprinting procedure for loci surrounding 19 blastodermally expressed genes (Abbrevi-ations abd-A abdominal-A ftz fushi tarazu hb hunchback Kr Kruppel Ubx ultrabithorax) The loci extend 50 kb flanking the start andstopping site of each gene or to a midway point between two adjacent genes Purple arrows designate exons blue triangles indicate knownblastodermal enhancers and green semicircles indicate PFRs Figure generated using gff2ps (Abril and Guigo 2000) For full results seeSupplemental Figure 1

regulatory sequences known to direct blastodermal expres-sion it is the most attractive system with which to evaluateour hypothesis that similarly acting CRMs can be identifiedby conserved subsequence signatures in PFRs

We examined the overlap of our PFRs with enhancers knownfor a set of blastodermal development genes (Fig 2 Sup-plemental Figure 1) The locations of CRMs are from thefollowing references Berman et al (2002) Rajewsky et al(2002) and Lifanov et al (2003) We observe that the vastmajority of CRMs overlap coordinates of phylogenetic foot-prints (2730 Fig 2 and Supplemental Figure 1) Of the threethat do not appear in the set of PFRs the oc (also called otd)early enhancer and the runt stripe 5 enhancer both do notappear to be well conserved over an extended region and therunt stripe 1+7 enhancer (which is adjacent to the runt stripe5 enhancer and represented together with the stripe 5 enhan-cer as one single regulatory region in the figures) appears tobe below the conservation threshold used to generate the setof PFRs The low conservation of runt enhancers has been

observed in nearby regions of this locus offering a preced-ent for the apparent low conservation seen here (Wolff et al1999) In cases of multiple contiguous or overlapping CRMswe observed two outcomes First the contiguous block ofhairy stripe 3+4 stripe 7 and stripe 6+2 enhancers overlapsan area of high conservation which is split into two conservednon-coding regions by our phylogenetic footprinting methodThe first PFR contains the stripe 3 + 4 enhancer and partof stripe 7 enhancer and the second contains an additionalpart of the stripe 7 enhancer and stripe 6 + 2 enhancer Sim-ilarly while a conserved non-coding region appears withinthe boundaries of the tll PD enhancer the abutting ADndashPDenhancer bridges this conserved non-coding block and twoothers making unclear the functional relationship betweenthe CNSs and the enhancers Second the eve stripe 4 + 6 and1 + 5 enhancers are situated in long stretches of highly con-served regions Studies have shown that densely packed oftenoverlapping CRMs populate this sequence (Sackerson et al1999) In both cases the PFRs containing the sequence of

2743

YHGrad et al

the two sets of stripe enhancers are far longer than the exper-imentally defined minimal enhancers themselves the stripe4 + 6 enhancer is an sim600 bp enhancer embedded within aphylogentically footprinted sequence of length sim4 kb andstripe 1 + 5 is jointly sim1700 bp within a stretch of con-served sequence of sim 2600 bp As we see here and has beenreported previously in Bergman et al (2002) many enhan-cers have a convenient one-to-one correspondence with PFRsbut there are a few exceptions including the absence of realenhancers from the PFR set and the conjoining of multipletrue and differing enhancers particularly in the case of long(gt25 kb) PFRs

We ran our CRM discovery algorithm PFR-Sampler (seeAlgorithm) on loci surrounding genes with characterizedexpression in the early blastoderm The algorithm takes theapproach of sampling various subsets of PFRs to identify thoseout of a large set which bear most resemblance to each otherand uses leave-one-out cross-validation to assess the outcomeIn leave-one-out cross-validation (see Algorithm for details)each PFR is serially removed its score against the trainingset of remaining PFRs is calculated and the rank of this scoreas compared to 1000 randomly selected PFRs is determinedThe average rank of the left-out PFRs gives an indication ofhow similar the PFRs are to each other and how distinct frombackground

To explore the robustness of the PFR-Sampler program weused inputs of 10 12 14 17 and 19 blastodermally expressedgenes (Table 1) In each of these inputs the core set of 10 geneswas the same drawn from a previous TF-clustering basedCRM prediction study (Berman et al 2002) The larger inputscomprise this set of 10 plus a random selection of additionalsimilarly expressed genes from the nine annotated in Lifanovet al (2003) In each case we define the locus for each geneas including 50 kb from both the start and stop sites Wherethe 50 kb overlap two genes included in the analysis PFRsin the overlapping region were assigned to the locus of theclosest gene

Notably each run using these inputs returned high scoresand output that contained a high fraction of known blasto-dermal CRMs indicating that the program is capable ofsuccessfully identifying CRMs from a range of inputs Theinput set with the best score was set lsquocrsquo (14 gene loci)which found 24 PFRs as the most similar set within the 457total PFRs surrounding the 14 loci (Fig 3 SupplementalFigure 3) By leave-one-out analysis the 24 PFRs had anaverage rank of 23751000 To evaluate the significance ofthese results we ran our PFR-Sampler program on 100 setsof 14 randomly selected genes and constructed a distribu-tion of average outcome rank The output for set lsquocrsquo (rank2375) places this result in the 98th percentile Similarlywe analyzed the significance of the runs with various num-bers of input genes Each appeared at or above the 92ndpercentile with the significance level for set lsquocrsquo again sur-passing the others Affirming the potential for this approach

Table 1 Input sets of genes comprised of blastodermally expressed genes

Group Gene locus Locus length (bp) PFRs ()

a [set of 10] Abd-A 122 426 65eve 101 537 14gt 101 856 22h 103 280 21hb 106 502 15kni 103 033 49Kr 102 919 27run 102 885 20salm 111 292 50Ubx 178 250 60

b [set of 12] a (from above)en 104 206 58prd 103 459 16

c [set of 14] a (from above)btd 103 385 21ftz 101 904 50prd 103 459 16tll 102 005 27

d [set of 17] a (from above)btd 103 385 21en 104 206 58ftz 101 904 50gsb 102 314 29oc 119 310 11prd 103 459 16tll 102 005 27

e [set of 19] a (from above)btd 103 385 21Dll 120 333 43ems 102 765 55en 104 206 58ftz 101 904 50gsb 102 314 29oc 119 310 11prd 103 459 16tll 102 005 27

(a) The core set of 10 genes from Berman et al (2002) which are included in all sets(bndashe) Additional sets of genes of size 12 14 17 and 19 genes respectively randomlyselected from a pool of blastodermally expressed genes (Lifanov et al 2002) The geneloci include 50 kb upstream of the annotated start site and downstream of the annotatedstop site for each gene (Dmelanogaster Release 31 annotations Misra et al 2002) andthe lsquolocus lengthrsquo column indicates the number of base pairs considered The number ofPFs (for parameters defining PFRs see Methods) within each of these loci is reportedin the PFR column

to identify similarly acting CRMs the output for set lsquocrsquoincludes 17 which correspond or are in close proximity tothe locations of known enhancers (Table 2 and SupplementalFigure 2 (httparepmedharvardeduenhancers) for a sum-mary of results from each input set To evaluate the impactof alternative parameters for the simulated annealing optim-ization we evaluated the results for set lsquocrsquo using a startingtemperature increased to 40 the same stepwise decrementsand an increased number of cycles (60) and updates (100)before halting The results for set lsquocrsquo were largely the same

2744


Fig 3 Several examples of blastodermally expressed genes denoted with results of PFR-Searcher using the output from PFR-Sampler withinput set c Purple arrows designate exons blue triangles indicate known blastodermal enhancers (Lifanov et al 2003) green semicirclesindicate PFRs red semicircles indicate PFRs comprising the output of the PFR-Sampler run and yellow semicircles indicate PFRs with scoresabove threshold as determined by PFR-Searcher Figure generated using gff2ps (Abril and Guigo 2000) Gene name abbreviations as inFigure 2 For full results see Supplemental Figure 3

Table 2 Summary of output from PFR-Sampler given the five input setsdescribed in Table 1

Inputset

Numberof genes

Score Percentile PFRs inoutput

TotalPFRssearched

PFRs correspondingor near to knownenhancers

a 10 563 92 19 343 7b 12 319 96 21 417 11c 14 2375 98 24 457 17d 17 46 94 30 555 18e 19 853 93 30 653 17

The score (lsquoaverage rankrsquo) is an assessment of the similarity to output where leave-one-out cross-validation is performed on each of the output PFRs and the rank out of1000 randomly selected genes is determined the average rank for each of the PFRs inthe output set is reported here For each input set percentile is determined from thedistribution of scores of 100 sets of randomly selected genes

as reported for the shorter simulated annealing protocol (seeSupplemental Figure 3)

Included in the output were PFRs for which there is no cor-responding enhancer in the literature These may represent

false positives which by chance have a short conservedsequence profile similar to the regions known to be subjectto regulation by the TFs involved in blastoderm developmentHowever we note that the majority of these regions appearfar from the genersquos transcriptional start site and we hypothes-ize that true blastodermal CRMs residing in these locationsmay have been subject to less intense scrutiny and hencehave eluded detection We propose these PFRs correspondto candidates for novel blastodermal CRMs

No PFR-Sampler run identified phylogenetic footprintsunderlying all of the known blastoderm CRMs regulating theinput gene set It may be that some the CRMs are subjectto alternative regulation and thus have an alternative shortsequence signature representing different TF-binding sitesFor some characterized blastodermal CRMs including thosefor gsb and ems this is the case The enhancers identifiedby PFR-Sampler are known to be responsive to gap geneswhereas the enhancers directing the blastodermal expressionof gsb and ems are responsive to pair rule genes (Jones andMcGinnis 1993 Li et al 1993 Bouchard et al 2000) Inseveral other instances the false negative result may be due

2745

YHGrad et al

to the restriction that at most two PFRs from a given genelocus can contribute to the results a restriction we institutedto keep the model from being biased by a single locusrsquos gen-omic characteristics We explored this as a possible source offalse negatives as described below

Genome-wide search for blastoderm CRMs basedon sampling-derived modelSince the output of PFR-Sampler is a set of PFRs we cancombine their short sequence profiles to create a model andwe can compare any other phylogenetic footprint to this modelto determine the extent of similarity Effectively this methodprovides a tool with which to perform a genome-wide searchfor similarly acting CRMs and to move from a set of co-regulated genes and a pair of genomes to a genome-widecatalog of candidate CRMs

To perform this search we took the output from the PFR-Sampler run described above for the set of 14 blastodermalgenes and used our PFR-Searcher program based on aMarkov chain discrimination algorithm (see Algorithm) EachPFR in the genome gets assigned a score based on its simil-arity to the model as compared with background The higherthe score the more the PFR region resembles the training setBy assuming the error from the leave-one-out analyses gen-eralizes to the complete set we can derive a threshold scoreto achieve a given sensitivity Since 80 of the left-out PFRswere at rank 21000 or better and 88 at rank 41000 or betterwe elected to take as a threshold the midway point of the scoreof the 3rd ranked PFR out of 1000 Querying the complete gen-ome and selecting those with scores above the set point yielded207 PFRs of which 24 are the input PFRs (see SupplementalTable 1 httparepmedharvardeduenhancerssuptab1xls)

An indirect method of evaluating candidate CRMs con-sists of checking whether genes flanking the candidate gen-omic regions are expressed in early blastoderm developmentAlthough reporter construct assays provide stronger evidencecross-referencing with in situ and literature databases offersa reasonable and rapid first-pass assessment We surveyedtwo sources First we identified genes annotated in FlyBase(Flybase Consortium 2003) that are expressed in blastodermdevelopment in the segmentation pathway We also checkedRelease 2 of the BDGP in situ images database (Tomancaket al 2002) for staining during embryonic stages 4ndash6 in aspatial pattern consistent with segmentation For candidateCRMs located in introns we checked the expression patternsof the overlapping gene and neighboring upstream and down-stream genes For intergenic candidate CRMs we checkedthe adjacent flanking genes two on each side Of the 207PFRs with scores above our set point we found relevantexpression information from genes flanking 107 of the pre-dictions So far 79 PFRs predicted to correspond to CRMsthat drive expression in the early blastoderm are near at leastone gene that is expressed in the expected spatio-temporalpattern Using these numbers as a rough guide and assuming

that each predicted CRM near a blastodermal gene is a trueCRM we calculate that the combined PFR-SamplerPFR-Searcher method for identification of blastodermal CRMshas a hit rate of sim74 and thus a best-case false posit-ive rate of sim26 To get a sense of how significant theseresults are we generated five sets of 207 PFRs randomlyselected from across the genome and subjected them tothe same classification procedure In these five randomizedsamples the mean hit rate of PFRs was 37 plusmn 2 implyingthat our results with a 74 hit rate is considerably betterthan random We speculated above that the output of thePFR-Sampler runs might miss PFRs corresponding to trueCRMs due to the restriction that at most two PFRs from eachinput gene locus can contribute to the final output set OurPFR-Searcher program provides an opportunity to test thishypothesis by evaluating whether the false negative PFRspossess word profiles similar to the PFR-Sampler output Wesee in Figure 3 and Supplemental Figure 3 that seven addi-tional PFRs corresponding to known blastodermal enhancersturn up positive by the PFR-Searcher run thus supporting oursuspected cause of false negatives and lending encouragementthat other candidates across the genome also correspond totrue CRMs

Beyond those seven additional candidate CRMs are neargenes known to be blastodermally expressed such as cadcas dpn nub odd opa pros rpr slp1 slp2 tsh and wgFor several of these genes two or more regions scoring abovethe threshold were nearby The set of candidates also includesCRMs near genes shown in the BDGP database to have blasto-dermal stripe-like expression such as CG1815 CG10176CG33207 CG17383 and CG9924

State-transition composition of the modelThe premise of the algorithms presented here relies on thecorrespondence between conserved subsequence within PFRsand TF-binding sites We therefore sought to explore howwell the model derived from the set of 14 blastodermal geneswhich was used with PFR-Searcher as discussed above cor-responds to known TF-binding sites In essence the questionwe ask is whether the state transitions that contribute mostto the identification of a region as being similar to the modeloverlap with known TF-binding sites

We first identified the log-likelihood score associated witheach of the 4096 state transitions that comprise the model(see Supplemental Table 2) This score is an indication of thelikelihood that a given state-transition derives from the modelor from background and is the foundation of the algorithmrsquosdiscriminative power The score for each region then is a com-bination of the state transitions present their frequency andtheir log-likelihood score (see Algorithm) We chose the PFRcorresponding to the eve stripe 2 PFR as a case study becausethe binding sites within this region have been well character-ized (Ludwig et al 1998) The enhancer region is sim800 bpand lies mostly in the 5prime region of the 1600 bp phylogenetic

2746


Fig 4 Top-scoring state-transitions and their phylogenetic conservation in the PFR overlapping the eve stripe 2 element Sequence from thestripe 2 element in Dmelanogaster (mel) Derecta (ere) Dyakuba (yak) and Dpseudoobscura (pse) were aligned by ClustalW (Thompsonet al 1994) and binding site information collated from Ludwig et al (1998) Asterisks indicate columns of complete conservation in all fourspecies Regular expression matching identified locations of state-transitions within the stripe 2 element sequence (a) Sites for state-transitionequivalent to 6mer GGGTTA (rank 14) and its reverse complement TAACCC appear within known stripe 2 Kruumlppel binding sites Kr-6Kr-2 and Kr-1 along with two sites in the Dpseudoobscura stripe 2 region that are not well conserved in Dmelanogaster (b) Sites forstate-transition equivalent to 6mer ATAATC (rank 8) and its reverse complement GATTAT appear within known stripe 2 Bicoid binding sitesbcd-4 and bcd-3 as well as in a very well conserved block of sequence in the 3prime region of the stripe 2 enhancer See Supplemental Table 3 forlist of top scoring state-transitions and Supplemental Figure 4 for their mapping to the stripe 2 element sequence

footprint which presumably also includes the promoter in the3prime gene proximal region By multiplying each state trans-itionrsquos frequency including that of its reverse complementby its score as set by the model we can rank the contri-butions of each of this PFRrsquos state transitions according tohow much each state transition led to identifying this PFR assimilar to the model (a combination of the state-transitionrsquosscore given the model and background and its frequency inthe region under evaluation see Algorithm and SupplementalTable 3)

We then mapped the top 15 state-transition sequencesand their reverse complements on a ClustalW alignment(Thompson et al 1994) of the stripe 2 element fromDmelanogaster Dpseudoobscura Derecta and Dyakuba(see Supplemental Figure 3) We find that a number of thesestate-transitions are significantly contained (P = 00006 byhypergeometric distribution) within known binding sites forbicoid Kruppel hunchback and sloppy paired 1 (Fig 4Supplemental Figure 4 Ludwig et al 1998 Andrioli et al2002) This known binding sites for segmentation-related TFsare identified among the top 15 state-transitions supports theapproach taken by the algorithms described here We seethat some of these state-transitions appear in unconservedlocations in the sequence of all four species suggesting thepossibility of site turn over and consistent with recent results(Fig 4a and b Dermitzakis and Clark 2002) We observe fur-ther that some of the state-transitions that occur within knownbinding sites appear as well in highly conserved locationswhere no binding site has been characterized (Fig 4b)

DISCUSSIONWe model CRMs as functionally modular and separablesequences defined by the clusters of multiple TF-binding sitesthat populate the sequence As the overall function of anenhancer is subject to stabilizing selection in which a sub-set of TF-binding sites are evolutionarily conserved we treatthe conserved subsequences within phylogenetic footprintsas closely approximating the distinguishing sequence profileof enhancers Our implementation of algorithms to identifysimilar enhancers therefore employs phylogenetic footprint-ing as a way to define borders of enhancers and keys on theshort evolutionarily conserved words anchoring these foot-prints This approach effectively circumvents the need forprior knowledge about both the constellation of TFs acting inconcert at a given CRM and the binding motifs of these TFs

Our approach does not require that all phylogenetic foot-prints represent true enhancers it is likely that the set ofgenomic PFRs includes many non-CRM sequences from cod-ing sequences to areas of randomly observed conservationHowever our assumption that the set of genomic phylogeneticfootprints includes many corresponding to true CRMs is sup-ported by the correlation observed between PFRs and earlyblastoderm enhancers For the purposes of the algorithmspresented here the signal from PFRs corresponding to CRMscan overcome the noise generated by non-CRM PFRs

We designed two programs for CRM identification based onlooking for similarity to phylogenetic footprint word profilesIn the first PFR-Sampler we identify CRMs by searchingthe phylogenetic footprints surrounding co-regulated genes

2747

YHGrad et al

for the subset of the most similar PFRs The second PFR-Searcher constructs a model based on a set of input phylogen-etic footprints and then scores other PFRs for their similarityto the model as compared to background

We show in this study the feasibility of this approachWhen provided with an input of 10ndash19 blastodermallyexpressed genes and phylogenetic footprints from compar-ison of Dmelanogaster and Dpseudoobscura PFR-Samplersuccessfully identifies a subset of phylogenetic footprints thatcorrespond to known blastodermal enhancers Using the out-put PFRs from PFR-Sampler as the input training set forPFR-Searcher the program located many known and can-didate blastodermal enhancers multiple of which are knownto have in vivo function We estimate a false positive rateof sim26 which compares favorably to other computationalCRM prediction studies However we believe it important topoint out that computational evaluations employed here andin the other studies are a reasonable first pass but as demon-strated in our previous work (Halfon et al 2002) should notbe construed as definitive for any specific enhancer predictionor as a replacement for experimental validation

In our method we use AVID alignments of DmelanogasterDpseudoobscura sequences following the precedent set inthe global analysis by Bergman et al (2002) Benchmark-ing tests done by Bergman and Kreitman (2001) for study ofDmelanogaster ndash Dvirilis alignments and by Pollard et al(2004) on simulated data suggest that global alignment toolssuch as AVID and DiAlign perform well in the task of identi-fying conserved blocks of non-coding sequences howeveras shown by Pollard et al (2004) alignment tools have arange of success in detecting sequences subject to differingtypes of conservation pressure In the CRM-prediction methodproposed here varying the alignment algorithm may impacttwo key junctures (1) generation of PFRs and (2) profilingof conserved subsequence composition of individual PFRsSince PFRs are comprised of multiple conserved blocks ofsequence within proximity of one another the characteristicsof PFRs generated by an alignment algorithm depend not onlyon the extent of alignment and the distribution of lengths forconserved sequence blocks but also on how the conservedsubsequences cluster Alternative PFR boundaries result in ade facto difference in Markov state transition profiles sincethe same conserved subsequences used to define the PFR alsounderlie the state transition profile One avenue for furtherstudy then will be benchmarking of alignment tools withrespect to the boundaries of clusters of constrained sequences

Our programs build on previous approaches to the prob-lem of predicting cis-regulatory modules combining stud-ies to delimit sequence space in which CRMs may residewith models of CRMs as comprised of clusters of bindingsites Wasserman et al (2000) demonstrated the feasibil-ity of using phylogenetic footprints between human androdent and a set of co-regulated genes to uncover relevantTF-binding sites They noted that nearly all known binding

sites were localized to conserved blocks and that a Gibbssampling algorithm successfully identified several known TF-binding motifs However they considered only short sequencestretches upstream of coding sequences a limitation that pre-vents broad applicability of their approach since CRMs mayexist many kilobases upstream or downstream of the transcrip-tional start site The relationship between clusters of CNSs andcis-regulatory sequence has been noted in DmelangoasterndashDpseudoobscura alignments by Bergman et al (2002) whodemonstrated experimentally that a cluster of CNSs what werefer to here as a PFR near the gene apterous functions asa CRM The link between clusters of non-coding sequencesand CRMs fundamental to the work presented in our studyprovides a means by which to identify CRMs located bey-ond the promoter region What this link does not establishhowever is the relationship between a CRM and the expres-sion pattern it drives The approach we present in this paperadvances these avenues of research by providing a methodto generate precisely this relationship and using the prin-ciple highlighted in Wasserman et al (2000) to identify keyTF-binding sequences

Several studies have endeavored to specifically predict earlyblastoderm enhancers on the basis of TF-binding motif clus-tering using binding sites for bcd hb Kr kni and cad amongothers To gauge the extent of similarity to those approacheswith the ones described in this study we compared the overlapof results One study which generated word profiles withoutconsidering evolutionary conservation predicted 146 regions(Rajewsky et al 2002) There is only a small degree of over-lap between our set and theirs as only 14 of their regionswere near our predictions Another study which searched forhigh density clusters of TF-binding sites identified 28 can-didate regions (Berman et al 2002) of which 10 are near toour predictions a much higher extent of overlap A numberof the overlapping predictions are near genes known to haveblastodermal expression in a stripe-like pattern offering fur-ther encouragement that the different approaches to enhancerprediction might have correctly identified true cis-regulatoryregions at these locations

One recently developed tool for CRM prediction Stubbsuses an Hidden Markov model approach in conjunction withphylogenetic conservation to identify CRMs (Sinha et al2003) An advantage over our method conferred by the HMMapproach is the incorporation by the hidden model of poten-tially important relationships between binding motifs in thefunction of an enhancer and hence it may better define thearchitecture underlying CRMs This method however differsfrom ours in that it requires prior knowledge of both TF-binding motifs and the constellation of TFs acting together inthe regulatory network Hence it is limited both by the extentand biases of previously available information Another toolArgos avoids this limitation by searching for motifs overrep-resented within a window as compared to a background modelwithout requiring motifs as input (Rajewsky et al 2002) This

2748


method uses a shifting window to scan the genome rather thanfocusing as we do on potential CRMs predefined by phylo-genetic footprinting Further Argos does not by itself link itsCRM predictions with the expression patterns they are expec-ted to drive although through the logic outlined in Rajewskyet al (2002) and similar to the approach taken here it shouldtheoretically be possible to do so

Important caveats remain Here phylogenetic footprintingwith two species is sufficient to capture relevant informationHowever it is likely that study of other regulatory networkswill require the addition of other speciesrsquo genomes Mul-tiple species comparisons will greatly assist in elucidatingpotential regulatory regions and resolving key sequences towhich TFs may bind More fundamentally the model usedin these algorithms rewards words that are over-representedin the training set as compared to background and the fre-quency of appearance determines each wordrsquos contribution tothe overall PFR score Such a scoring procedure presumes amodel in which CRMs are densely packed with many bindingsites for each TF Although this appears to be the case for atleast a subset of blastodermal enhancers it is not clear that allenhancers work using the same model

CONCLUSIONSThis study proposes and gives preliminary results in supportof a method that employs a current model of CRM compos-ition and evolution toward the goal of predicting CRMs Weshow that analysis of PFRs surrounding co-regulated genescan identify regions sharing the most similar subsequence pro-files that these profiles can represent a functional link betweensequences and regulatory activity and that such analysis canbe used to correlate regulatory sequence and the expressionpattern it directs

The approach is predicated on a number of tools and modelsthat further development and study will continue to improveIncluded among these are algorithms for genome alignmentphylogenetic footprinting and comparisons of short sequenceprofiles and models for rates and types of constraint withinCRMs for CRM composition and for CRM architectureAdvances along these lines in simulations tools and mod-eling will be critical in deciphering the logic behind how agiven regulatory network works to coordinate the transcriptionof multiple genes

ACKNOWLEDGEMENTSThe authors would like to thank John Aach Jason Comanderand Matt Wright for critical readings of the manuscript JohnAach Sung Choe Jason Comander Vidyasagar KoduriAnthony Phillipakis Jay Shendure and Matt Wright forvaluable discussions

REFERENCESAbrilJF and GuigoR (2000) gff2ps visualizing genomic annota-

tions Bioinformatics 16 743ndash744

AdamsMD CelnikerSE HoltRA EvansCA GocayneJDAmanatidesPG SchererSE LiPW HoskinsRA GalleREet al (2000) The genome sequence of Drosophila melanogasterScience 287 2185ndash2195

AndrioliLP VasishtV TheodosopoulouE ObersteinA andSmallS (2002) Anterior repression of a Drosophila stripe enhan-cer requires three position-specific mechanisms Development129 4931ndash4940

BergmanCM PfeifferBD Rincon-LimasDE HoskinsRAGnirkeA MungallCJ WangAM KronmillerB PaclebJParkS et al (2002) Assessing the impact of comparative gen-omic sequence data on the functional annotation of the Drosophilagenome Genome Biol 3 RESEARCH0086

BermanBP NibuY PfeifferBD TomancakP CelnikerSELevineM RubinGM and EisenMB (2002) Exploiting tran-scription factor binding site clustering to identify cis-regulatorymodules involved in pattern formation in the Drosophila genomeProc Natl Acad Sci USA 99 757ndash762

BlackmanRK and MeselsonM (1986) Interspecific nucleotidesequence comparisons used to identify regulatory and struc-tural features of the Drosophila hsp82 gene J Mol Biol 188499ndash515

BouchardM St-AmandJ and CoteS (2000) Combinatorial activ-ity of pair-rule proteins on the Drosophila gooseberry earlyenhancer Dev Biol 222 135ndash146

BrayN DubchakI and PachterL (2003) AVID a global alignmentprogram Genome Res 13 97ndash102

CarrollSB GrenierJK and WeatherbeeSD (2001) From DNAto Diversity Molecular Genetics and the Evolution of AnimalDesign Blackwell Sciences Inc Malden MA

CliftenP SudarsanamP DesikanA FultonL FultonBMajorsJ WaterstonR CohenBA and JohnstonM (2003)Finding functional features in Saccharomyces genomes by phylo-genetic footprinting Science 301 71ndash76

CrowleyEM RoederK and BinaM (1997) A statistical model forlocating regulatory regions in genomic DNA J Mol Biol 2688ndash14

DavidsonEH (2001) Genomic Regulatory Systems Developmentand Evolution Academic Press San Diego CA

DermitzakisET BergmanCM and ClarkAG (2003) Tracing theevolutionary history of Drosophila regulatory regions with modelsthat identify transcription factor binding sites Mol Biol Evol 20703ndash714

DermitzakisET and ClarkAG (2002) Evolution of transcrip-tion factor binding sites in Mammalian gene regulatoryregions conservation and turnover Mol Biol Evol 191114ndash1121

DubchakI BrudnoM LootsGG PachterL MayorCRubinEM and FrazerKA (2000) Active conservation of non-coding sequences revealed by three-way species comparisonsGenome Res 10 1304ndash1306

DurbinR EddyS KroghA and MitchisonG (1998) BiologicalSequence Analysis Cambridge University Press Cambridge

FickettJW and WassermanWW (2000) Discovery and modelingof transcriptional regulatory regions Curr Opin Biotechnol 1119ndash24

Flybase Consortium (2003) The FlyBase database of the Drosophilagenome projects and community literature Nucleic Acids Res31 172ndash175

2749

YHGrad et al

FrithMC HansenU and WengZ (2001) Detection of cis-element clusters in higher eukaryotic DNA Bioinformatics 17878ndash889

GumucioDL SheltonDA BaileyWJ SlightomJL andGoodmanM (1993) Phylogenetic footprinting reveals unex-pected complexity in trans factor binding upstream from theepsilon-globin gene Proc Natl Acad Sci USA 90 6018ndash6022

HalfonMS GradY ChurchGM and MichelsonAM (2002)Computation-based discovery of related transcriptional regu-latory modules and motifs using an experimentally validatedcombinatorial model Genome Res 12 1019ndash1028

HughesJD EstepPW TavazoieS and ChurchGM (2000) Com-putational identification of cis-regulatory elements associatedwith groups of functionally related genes in Saccharomycescerevisiae J Mol Biol 296 1205ndash1214

JonesB and McGinnisW (1993) The regulation of empty spir-acles by Abdominal-B mediates an abdominal segment identityfunction Genes Dev 7 229ndash240

KellisM PattersonN EndrizziM BirrenB and LanderES(2003) Sequencing and comparison of yeast species to identifygenes and regulatory elements Nature 423 241ndash254

LevyS HannenhalliS and WorkmanC (2001) Enrichment ofregulatory signals in conserved non-coding genomic sequenceBioinformatics 17 871ndash877

LiX GutjahrT and NollM (1993) Separable regulatory elementsmediate the establishment and maintenance of cell states bythe Drosophila segment-polarity gene gooseberry EMBO J 121427ndash1436

LifanovAP MakeevVJ NazinaAG and PapatsenkoDA (2003)Homotypic regulatory clusters in Drosophila Genome Res 13579ndash588

LootsGG LocksleyRM BlankespoorCM WangZEMillerW RubinEM and FrazerKA (2000) Identification of acoordinate regulator of interleukins 4 13 and 5 by cross-speciessequence comparisons Science 288 136ndash140

LudwigMZ PatelNH and KreitmanM (1998) Functionalanalysis of eve stripe 2 enhancer evolution in Drosophila rulesgoverning conservation and change Development 125 949ndash958

LudwigMZ BergmanC PatelNH and KreitmanM (2000)Evidence for stabilizing selection in a eukaryotic enhancerelement Nature 403 564ndash567

MarksteinM MarksteinP MarksteinV and LevineMS (2002)Genome-wide analysis of clustered dorsal binding sites identifiesputative target genes in the Drosophila embryo Proc Natl AcadSci USA 99 763ndash768

MayorC BrudnoM SchwartzJR PoliakovA RubinEMFrazerKA PachterLS and DubchakI (2000) VISTA visu-alizing global DNA sequence alignments of arbitrary lengthBioinformatics 16 1046ndash1047

McGuireAM HughesJD and ChurchGM (2000) Conserva-tion of DNA regulatory motifs and discovery of new motifs inmicrobial genomes Genome Res 10 744ndash757

MisraS CrosbyMA MungallCJ MatthewsBBCampbellKS HradeckyP HuangY KaminkerJSMillburnGH ProchnikSE et al (2002) Annotation ofthe Drosophila melanogaster euchromatic genome a systematicreview Genome Biol 3 RESEARCH0083

MosesAM ChiangDY KellisM LanderES and EisenMB(2003) Position specific variation in the rate of evolution intranscription factor binding sites phylogenetically and spatiallyconserved word pairs associated with gene-expression changes inyeasts BMC Evol Biol 3 19

PollardDA BergmanCM StoyeJ CelnikerSE andEisenMB (2004) Benchmarking tools for the alignmentof functional noncoding DNA BMC Bioinformatics 5 6

PressW TeukolskyS VetterlingW and FlanneryB (1992)Numerical Recipes in C Cambridge University Press New York

RajewskyN VergassolaM GaulU and SiggiaED (2002) Com-putational detection of genomic cis-regulatory modules appliedto body patterning in the early Drosophila embryo BMC Bioin-formatics 3 30

RebeizM ReevesNL and PosakonyJW (2002) SCORE a com-putational approach to the identification of cis-regulatory modulesand target genes in whole-genome sequence data Site cluster-ing over random expectation Proc Natl Acad Sci USA 999888ndash9893

RussoCA TakezakiN and NeiM (1995) Molecular phylogenyand divergence times of drosophilid species Mol Biol Evol 12391ndash404

SackersonC FujiokaM and GotoT (1999) The even-skippedlocus is contained in a 16-kb chromatin domain Dev Biol 21139ndash52

SinhaS Van NimwegenE and SiggiaED (2003) A probabilisticmethod to detect regulatory modules Bioinformatics 19 (Suppl1) I292ndashI301

TagleDA KoopBF GoodmanM SlightomJL HessDL andJonesRT (1988) Embryonic epsilon and gamma globin genesof a prosimian primate (Galago crassicaudatus) Nucleotide andamino acid sequences developmental regulation and phylogeneticfootprints J Mol Biol 203 439ndash455

TavazoieS HughesJD CampbellMJ ChoRJ andChurchGM (1999) Systematic determination of geneticnetwork architecture Nat Genet 22 281ndash285

ThompsonJD HigginsDG and GibsonTJ (1994) CLUSTALW improving the sensitivity of progressive multiple sequencealignment through sequence weighting position-specific gappenalties and weight matrix choice Nucleic Acids Res 224673ndash4680

TomancakP BeatonA WeiszmannR et al (2002) Systematicdetermination of patterns of gene expression during Drosophilaembryogenesis Genome Biol 3 RESEARCH0088

VuillaumierS DixmerasI MessaiH LapoumeroulieCLallemandD GekasJ ChehabFF PerretC ElionJ andDenamurE (1997) Cross-species characterization of the pro-moter region of the cystic fibrosis transmembrane conductanceregulator gene reveals multiple levels of regulation Biochem J327 651ndash662

WassermanWW PalumboM ThompsonW FickettJW andLawrenceCE (2000) Humanndashmouse genome comparisons tolocate regulatory sites Nat Genet 26 225ndash228

WolffC PeplingM GergenP and KlinglerM (1999) Structureand evolution of a pair-rule interaction element runt regulatorysequences in D melanogaster and D virilis Mech Dev 8087ndash99

2750


that given sufficient evolutionary distance is significantlyhigher than background and that is related to the rate of bind-ing site turnover change andor loss of CRM function (Fickettand Wasserman 2000 Levy et al 2001 Dermitzakis andClark 2002 Moses et al 2003)

Existing sequence analysis-based tools for identificationof CRMs take two main approaches The first involvesinter-species comparisons designed to take advantage of evol-utionary conservation of regulatory sequences an approachtermed lsquophylogenetic footprintingrsquo (Tagle et al 1988) Herenon-coding sequences conserved between two or more relatedspecies are treated as likely candidates for regulatory regionsThe observed conservation arises from TF binding sites thatremain stable in relative location to one another and henceform lsquoanchorsrsquo that seed long linear sequence alignmentsPhylogenetic footprinting has routinely served as a guide to thediscovery of regulatory sequences (Blackman and Meselson1986 Gumucio et al 1993 Vuillaumier et al 1997 Lootset al 2000 McGuire et al 2000 Bergman et al 2002)This method is often used as a first step to define the bound-aries of potential regulatory sequences and is followed byexperimental studies to elucidate the spatio-temporal pat-tern of expression if any that those potential regulatorysequences might direct Since metazoan genes generally pos-sess complex expression patterns directed by multiple CRMsa search for a module responsible for a particular patternrequires laborious experimental testing of all nearby candidatesequences

Recently several groups have designed a second set ofapproaches that predict CRMs by identifying clusters ofpotential TF binding sites Starting with a set of coordin-ately acting TFs and their experimentally described bindingmotifs some of these algorithms search the genome forsequences of sim500ndash1000 bp with uncommonly high concen-trations or co-occurrences of predicted binding sites (Crowleyet al 1997 Fickett and Wasserman 2000 Frith et al 2001Berman et al 2002 Halfon et al 2002 Markstein et al2002 Rajewsky et al 2002 Rebeiz et al 2002) The pre-sumption underlying these approaches is that binding-motifrich sequences provide evidence of TF binding and hencemay direct similar expression patterns In several cases asubset of predictions has been experimentally verified How-ever the arduous task of testing all predictions generatedby these algorithms and the absence of extensive and well-characterized CRM datasets have hindered careful evaluationof the false positive and false negative rates It is also import-ant to point out that all tests in Drosophila melanogaster so farhave concentrated on only a small number of regulatory net-works (chiefly gene expression in the early blastoderm) Thegeneral applicability of TF-binding motif clustering meth-ods to other pathways has not yet been examined and wemay yet learn that other pathways are regulated in a man-ner refractory to or requiring more sophistication than siteclustering-type analyses Two obstacles preventing broader

evaluation of these algorithms are that (1) binding motifshave been confidently described for only a handful of TFs and(2) there are few well-characterized networks of coordinatelyacting TFs that could serve as starting points for additionalbinding site clusteringco-occurrence studies

Just as the availability of full genome sequence promptedthe development of computational tools to comprehensivelysearch for CRMs the availability of the genomes of pairs ofrelated species presents the opportunity to combine the aboveapproaches by introducing systematic phylogenetic compar-isons into algorithms designed to decipher the regulatorycode (Cliften et al 2003 Kellis et al 2003 Moses et al2003) The sequencing of Drosophila pseudoobscura andDmelanogaster a pair of species separated by approximately24 million years of evolution (Russo et al 1995) provides thetools to pursue such aims in metazoans (Adams et al 2000httpwwwhgscbcmtmceduprojectsdrosophila)

We propose here an approach to identify similarly act-ing cis-regulatory modules given genome sequences forDmelangoaster and Dpseudoobscura and a set of co-regulated genes We hypothesize that similarly acting enhan-cers can be identified by conserved subsequence signaturesin phylogenetic footprinted regions (PFRs) Drawing inspir-ation from Gibbs-sampling based motif-finding algorithmsthat successfully identify regulatory sequence motifs near co-regulated genes (Tavazoie et al 1999 Hughes et al 2000)we developed an approach to identify the most similar phylo-genetic footprints near co-regulated genes in a program wecall PFR-Sampler (Fig 1a) As a first step we assemble alist of PFRs To do this we use AvidVista (Dubchak et al2000 Mayor et al 2000 Bray et al 2003) to identify con-served non-coding regions across the genome which we thencollate into a genome-wide set of PFRs We treat each PFRas a potential CRM and generate for each PFR a profile ofthe conserved sequences that anchor the PFR alignment Bykeeping track of the conserved sequence profile with Markovchains we can use Markov chain discrimination to assessPFR similarity (Durbin et al 1998) We then use a samplingalgorithm PFR-Sampler to identify a subset of maximallyrelated PFRs among the full initial set of phylogenetic foot-prints near a set of previously defined co-regulated genes Theconserved sequences that are discovered to anchor the foot-printed regions are candidate TF-binding sequences and thePFRs in the subset of maximally related PFRs are candidateCRMs responsible for the observed co-regulation

This method of CRM discovery effectively circumventsobstacles facing algorithms that aim to predict CRMs bysearching for co-occurrence of TF-binding motifs in that itdoes not require prior knowledge of the constellation of TFsthat might act in the pathway of interest and further does notrequire any prior information regarding TF nucleotide bindingspecificities The ability to use a combination of phylogen-etic conservation and co-expression to compensate for lack ofknowledge of TF constellation and binding sites has precedent

2739

YHGrad et al







2740







c+st = cst +


csi t middot w



a+st = c+

stsumt prime c

+st prime





P (x | modelminus)=

Lsumi=1

loga+

ximinus1xi

aminusximinus1xi



2741

YHGrad et al



p = exp

(S

T

)








2742






2743

YHGrad et al














2744




Inputset

Numberof genes


TotalPFRssearched


a 10 563 92 19 343 7b 12 319 96 21 417 11c 14 2375 98 24 457 17d 17 46 94 30 555 18e 19 853 93 30 653 17






2745

YHGrad et al









2746








2747

YHGrad et al








2748


























2749

YHGrad et al
































2750

YHGrad et al







2740







c+st = cst +


csi t middot w



a+st = c+

stsumt prime c

+st prime





P (x | modelminus)=

Lsumi=1

loga+

ximinus1xi

aminusximinus1xi



2741

YHGrad et al



p = exp

(S

T

)








2742






2743

YHGrad et al














2744




Inputset

Numberof genes


TotalPFRssearched


a 10 563 92 19 343 7b 12 319 96 21 417 11c 14 2375 98 24 457 17d 17 46 94 30 555 18e 19 853 93 30 653 17






2745

YHGrad et al









2746








2747

YHGrad et al








2748


























2749

YHGrad et al
































2750







c+st = cst +


csi t middot w



a+st = c+

stsumt prime c

+st prime





P (x | modelminus)=

Lsumi=1

loga+

ximinus1xi

aminusximinus1xi



2741

YHGrad et al



p = exp

(S

T

)








2742






2743

YHGrad et al














2744




Inputset

Numberof genes


TotalPFRssearched


a 10 563 92 19 343 7b 12 319 96 21 417 11c 14 2375 98 24 457 17d 17 46 94 30 555 18e 19 853 93 30 653 17






2745

YHGrad et al









2746








2747

YHGrad et al








2748


























2749

YHGrad et al
































2750

YHGrad et al



p = exp

(S

T

)








2742






2743

YHGrad et al














2744




Inputset

Numberof genes


TotalPFRssearched


a 10 563 92 19 343 7b 12 319 96 21 417 11c 14 2375 98 24 457 17d 17 46 94 30 555 18e 19 853 93 30 653 17






2745

YHGrad et al









2746








2747

YHGrad et al








2748


























2749

YHGrad et al
































2750






2743

YHGrad et al














2744




Inputset

Numberof genes


TotalPFRssearched


a 10 563 92 19 343 7b 12 319 96 21 417 11c 14 2375 98 24 457 17d 17 46 94 30 555 18e 19 853 93 30 653 17






2745

YHGrad et al









2746








2747

YHGrad et al








2748


























2749

YHGrad et al
































2750

YHGrad et al














2744




Inputset

Numberof genes


TotalPFRssearched


a 10 563 92 19 343 7b 12 319 96 21 417 11c 14 2375 98 24 457 17d 17 46 94 30 555 18e 19 853 93 30 653 17






2745

YHGrad et al









2746








2747

YHGrad et al








2748


























2749

YHGrad et al
































2750




Inputset

Numberof genes


TotalPFRssearched


a 10 563 92 19 343 7b 12 319 96 21 417 11c 14 2375 98 24 457 17d 17 46 94 30 555 18e 19 853 93 30 653 17






2745

YHGrad et al









2746








2747

YHGrad et al








2748


























2749

YHGrad et al
































2750

YHGrad et al









2746








2747

YHGrad et al








2748


























2749

YHGrad et al
































2750








2747

YHGrad et al








2748


























2749

YHGrad et al
































2750

YHGrad et al








2748


























2749

YHGrad et al
































2750


























2749

YHGrad et al
































2750

YHGrad et al
































2750

bioinformatics - llama.mshri.on.callama.mshri.on.ca/publications/grad_bioinformatics_2004.pdf ·...

Documents