cwinnower algorithm for finding fuzzy dna motifs

6
cWINNOWER Algorithm for Finding Fuzzy DNA Motifs Shoudan Liang NASA AmesResearch Center NASA Advanced Supercomputing Division Moffett Field, CA 94035 USA [email protected] Abstract The c WINNOWER algorithm detects.fuzzy motifs in DNA sequences rich in protein-binding signals. A signal is de- fined as any short nucleotide pattern having up to d muta- tions differing from a motif of length i. The algorithm finds such motifs ifmultiple mutated copies of the motif(i.e., the signals) are present in the DNA sequencein sufficient abun- dance. The cWINNOWER algorithm substantially improves the sensitivity of the winnower method of Pevzner and Sze by imposing a consensus constraint, enabling it to detect much weaker signals. We studied the minimum number of detectable motifs qc as a function of sequence length N for random sequences. We found that qc increases linearly with N for a fast version of the algorithm based on counting three-member sub-cliques. Imposing consensusconstraints reduces qc by afactor of three in this case, which makes the algorithm dramatically more sensitive. Our most sensitive algorithm, which counts four-member sub-cliques, needs a minimum of only 13 signals to detect motifs in a sequence of length N = 12000for (i,d) = (15,4). tain common regulatory elements in the DNA upsb"eam of the b"anscription initiation points. Then the bioinformatics problem is to find common motifs in the upsb"eamof the clustered-and presumably co-regulated-genes that may play similar regulatory roles for each[3, 4]. Motif iden- tification should also aid the search for morphogen bind- ing sites from enhancers conb"olling early embryo develop- ment. Large-scale mapping of such cis-regulatory signals has been proposed[5] in which random genomic DNA se- quences are assessedfor their potential regulatory role in early embryo development. Detailed studies on sea urchin embryos have demonsb"ated that such enhancers often con- tain multiple b"anscription factor binding sites (34 have been identified in the case of endol6)[6]. Fuzzy motif-finding al- gorithms may prove to be a general method for identifying b"anscription factor binding sites from the large-scale map- ping of cis-regulatory elements. In a typical situation, a weak DNA signal is embedded in a set of experimentally identified DNA sequences that are enriched with the binding site. Not every sequence in the experimental set is guaranteed to contain the binding site. In addition, protein-binding DNA signals often contain ambiguous positions which can have more than one equiva- lent nucleotide. The bioinformatics problem is to determine the hidden signal, if any. Identifying such a weak signal is a non-trivial problem[7]. Pevzner and Sze[8] have formu- lated a 'grand challenge' problem of finding a hidden motif of length l in which each binding site can differ from the hidden motif in at most d places. Two length l motifs can share a common hidden pattern even when they differ in 2d places. If a graph is consbuctod where the nodes are consecutive length l patterns and two nodes are linked if two corresponding patterns differ in no more than 2d places, then in the set of patterns that are within Hamming distance d to the common motif, every node is connected to every other node. In other words, the nodes form a q-member clique (q-clique) where q is the size of the set However, the graph contains vastly more spurious connections. Pevzner and Sze proposed a winnower method that systematically 1. Introduction Transcription factors binding to fuzzy motifs in DNA is a process that underlies one of the most important modes of gene regulationin a cell. Algorithms that iden- tify such protein-binding signals in DNA are becoming especially important with the recently developedhigh- throughput techniques that showpromiseto uncover these interactions at a genome-wide scale. One such technique is genome-wide locationanalysis[l, 2], in which DNA mi- croarrays offer a means to identify the approximate loca- tions of the binding sitesof a transcription factor anywhere in the genome. Motif identificationalso proves useful for analyzingmicroarray data when the measured mRNA ex- pression levelsof all the genes on the microarray areclus- tered, and genes in the same cluster are assumed to con- Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

Upload: independent

Post on 10-Dec-2023

0 views

Category:

Documents


0 download

TRANSCRIPT

cWINNOWER Algorithm for Finding Fuzzy DNA Motifs

Shoudan LiangNASA Ames Research Center

NASA Advanced Supercomputing DivisionMoffett Field, CA 94035 USA

[email protected]

Abstract

The c WINNOWER algorithm detects.fuzzy motifs in DNAsequences rich in protein-binding signals. A signal is de-fined as any short nucleotide pattern having up to d muta-tions differing from a motif of length i. The algorithm findssuch motifs ifmultiple mutated copies of the motif(i.e., thesignals) are present in the DNA sequence in sufficient abun-dance. The cWINNOWER algorithm substantially improvesthe sensitivity of the winnower method of Pevzner and Szeby imposing a consensus constraint, enabling it to detectmuch weaker signals. We studied the minimum number ofdetectable motifs qc as a function of sequence length N forrandom sequences. We found that qc increases linearly withN for a fast version of the algorithm based on countingthree-member sub-cliques. Imposing consensus constraintsreduces qc by afactor of three in this case, which makes thealgorithm dramatically more sensitive. Our most sensitivealgorithm, which counts four-member sub-cliques, needs aminimum of only 13 signals to detect motifs in a sequenceof length N = 12000for (i,d) = (15,4).

tain common regulatory elements in the DNA upsb"eam ofthe b"anscription initiation points. Then the bioinformaticsproblem is to find common motifs in the upsb"eam of theclustered-and presumably co-regulated-genes that mayplay similar regulatory roles for each[3, 4]. Motif iden-tification should also aid the search for morphogen bind-ing sites from enhancers conb"olling early embryo develop-ment. Large-scale mapping of such cis-regulatory signalshas been proposed[5] in which random genomic DNA se-quences are assessed for their potential regulatory role inearly embryo development. Detailed studies on sea urchinembryos have demonsb"ated that such enhancers often con-tain multiple b"anscription factor binding sites (34 have beenidentified in the case of endol6)[6]. Fuzzy motif-finding al-gorithms may prove to be a general method for identifyingb"anscription factor binding sites from the large-scale map-ping of cis-regulatory elements.

In a typical situation, a weak DNA signal is embeddedin a set of experimentally identified DNA sequences thatare enriched with the binding site. Not every sequence inthe experimental set is guaranteed to contain the bindingsite. In addition, protein-binding DNA signals often containambiguous positions which can have more than one equiva-lent nucleotide. The bioinformatics problem is to determinethe hidden signal, if any. Identifying such a weak signal isa non-trivial problem[7]. Pevzner and Sze[8] have formu-lated a 'grand challenge' problem of finding a hidden motifof length l in which each binding site can differ from thehidden motif in at most d places. Two length l motifs canshare a common hidden pattern even when they differ in2d places. If a graph is consbuctod where the nodes areconsecutive length l patterns and two nodes are linked iftwo corresponding patterns differ in no more than 2d places,then in the set of patterns that are within Hamming distanced to the common motif, every node is connected to everyother node. In other words, the nodes form a q-memberclique (q-clique) where q is the size of the set However, thegraph contains vastly more spurious connections. Pevznerand Sze proposed a winnower method that systematically

1. Introduction

Transcription factors binding to fuzzy motifs in DNAis a process that underlies one of the most importantmodes of gene regulation in a cell. Algorithms that iden-tify such protein-binding signals in DNA are becomingespecially important with the recently developed high-throughput techniques that show promise to uncover theseinteractions at a genome-wide scale. One such techniqueis genome-wide location analysis[l, 2], in which DNA mi-croarrays offer a means to identify the approximate loca-tions of the binding sites of a transcription factor anywherein the genome. Motif identification also proves useful foranalyzing microarray data when the measured mRNA ex-pression levels of all the genes on the microarray are clus-tered, and genes in the same cluster are assumed to con-

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

Table 1. The table compares the smallest copynumber q necessary to prune all spuriouslinks for one random sequence of length N.WINNOWER and cWINNOWER denotes win-nower algorithms without and with consen-sus bounds. Both cases are listed for thepruning criteria base on eliminating links bycounting the number of 3-cliques (k = 2) and4-cliques (k = 3). As q increases, the percent-age of links pruned changes from essentially0% to 100% in a very narrow range. The qclisted in this table is the smallest q when thenumber of spurious links left unpruned is lessthan 1000 (the total number of spurious linksis more than a million).

k=3WINNOWER 2k=3cWINNOWER

we can enumerate each sequence and count the numberof sequences that are within d mutations from it How-ever, the computation required goes up exponentially withthe length of the pattern and becomes impossible for mod-erately long patterns. The winnower method[8] on theother hand changes the finding of hidden motifs to a graph-theoretical problem.

Define a graph of n nodes each of which represents aconsecutive sub-string of input sequences. Two nodes areconnected if they differ in less than 2d positions. The cru-cial observation is that all the sub-strings sharing a commonmotif form a clique graph. (A q-clique graph has q nodesand every pair of nodes is connected.)

Most of the connections made by the 2d mismatch cri-terion are spurious, in that they are not part of a graph thatmakes a q-clique. and so must be eliminated. For example.any node that has less than (q -1) connections to the rest ofthe graph cannot be a part of the q-clique. and these connec-tions are all spurious. Furthermore, in order for a link to bea part of the q-clique, it must be a part of at least (q - 2) tri-

angles. More specifically. suppose the link is between nodea and node b. Define a triangle neighbor set C consistingof nodes c that are connected to both a and b. The numberof nodes in C has to be larger than (q - 2). This trianglecriteria is what Pevzner and Sze called the k = 2 case[8]. Amore stringent test is the k = 3 case. in which each mem-ber of set C is required to be in four-member sub-cliques.In order to belong to set C. there must be at least (q - 3) 4-

deletes spurious links that cannot be a part of the q-clique.The chief advantage of the winnower algorithm is that

it is guaranteed to find all hidden patterns that are presentat least q times in the sample DNA sequences within Ham-ming distance d from the hidden pattern. The hidden pat-tern itself does not have to even be present in the dataset.Most popular motif finding methods [1O, 11, 12, 13, 14, 15]rely on optimization of non-linear objective functions andtherefore cannot guarantee that the pattern found attainsthe global optimum. The winnower method effectively ser-aches through all possible hidden patterns and is unique inbeing able to claim the definitive absence of signal (I, d) forcopy number q.

In this paper, we introduce a consensus bound on pat-terns belonging to the same clique that enables the algo-rithm to remove more spurious links. As a result, the algo-rithm becomes substantially more sensitive. We discoveredparameter regions where the winnower method is effectivein detecting signals. We computed minimum clique sizes,qc, required for removing all spurious links generated froma random sequence. Our main results are summarized inTable 1, which compare qc for algorithm without and withapplicaiton of the consensus bound. (For N = 12000, onlya range is given for the WINNOWER algorithm when q isin the vicinity of qc because the algorithm took too long torun.) We find that qc increases approximately linearly withthe length of the random sequence for the case of k = 2

(that eliminates links by counting the member of 3-cliques).The consensus constraint reduces qc by a factor of three fork = 2. For our most sensitive case of k = 3(that counts

4-cliques), which is slower to run, qc = 13 for the longestsequence length we tried (N = 12000). This is about a

factor of two better than without consensus constraints. Weformulated the algorithm in terms of set operations result-ing in a much simpler implementation. For the most sen-sitive case of k = 3, which usually runs too slow to beuseful, we speeded up the calculation by saving certain in-termediate results. For the longest sequences we considered(N = 12000 and (I, d) = (15,4», the k = 3 algorithm isonly a factor of three slower than k = 2.

We first review the winnower algorithm before provinga consensus constraint. We then present the cWINNOWERalgorithm in terms of set operations and discuss its com-plexity. This is followed by performance tests on random

sequences.

2. Winnower Method

Imagine the hyperspace of all possible length l patternspopulated by words cut consecutively from the input se-quences. Regions of the hyperspace that have an unusu-ally high density of l-words indicates statistical signifi-cance and presumably biological meanings. Theoretically,

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

cliques containing nodes a, band c. Finally, the size of theset C must be larger than (q - 2). This easily generalizes to

higher clique graphs.

3. Consensus Constraint

By counting the number of 3-clique and 4-clique sub-graphs. the winnower method systematically eliminateslinks that cannot be a part of the q-clique. The main dif-ficulty is that the 2d-mismatch criterion that defines a con-nection for the link is too lenient. resulting in an explosionof spurious links. Here we develop a consensus constraintfor deciding if a link between a and b belongs to a q-clique.Let P be the set of nodes including a and b as well as thenodes connected with both a and b. Let n = IPI be thenumber of nodes in set P. In order for the link (a, b) to bea part of the q-clique. we must have n ?: q. In order forthe link to belong to the q-clique. the following consensusconstraint must be satisfied: min(q,

"

{N=~G.T}(~c5sttN)1slI n

Emin(q,E6s{,Cfi=1 ;=1

n

= min(q,~6sf.C,j=1

~ q(l - d) (1)

where the Kronecker delta function b"a,b is 1 if a = b and 0

otherwise; sf is the i-th element of the j-th sequence in P;C, is the i-th column consensus of the sequences sj, i.e.

fa

{N-T.~G.T}(~6sf.N)',=1

n}:6sf.c. =;-1

We now prove the consensus consb"aint. In order for theset P of n sequences filtered by either k = 2 or k = 3

criteria to be a part of the q-clique, we must have

A useful special case is when pruning a link that has themaximum allowed mismatch 2d. The majority of links willbe of this type. For such a link, the positions of matched nu-cleotides must agree with the consensus sequence. In sucha case, using the nucleotides from the matching part of thetwo nodes instead of the n sequence consensus will improvethe bound.

If the size of the set is too big the constraint becomesuseless. In Eq.(l), since Ej=lbS?,C, ~ r~l, where rxl

denotes the smallest integer larger'or equal to x, Eq.(l) isalways satisfied when r~l ~ q(l -1).

The consensus bound can be greatly improved if a fastalgorithm can be found to choose q motifs from n that formthe best q-motif consensus. We have also derived other con-straints for three nodes. However, none proved as useful inpractice as the consensus constraint.

p>I

818 E P. ~68',~ ?:; l-d'-I

(2)=

IF> (3)~{q

for some hidden motif h. Here {.} denotes a set and 1.1 isthe number of elements in it. I is the length of the string,and d is the maximum number of allowed mismatches.

This equation implies

4. cWINNOWER Algorithm

Patterns 1 nucleotides long cut consecutively from eachof the input sequences form a set of nodes. Any two nodesare linked if their mismatches are less than 2d. because theycan be within d-mutations from a common hidden pattern.

The winnower method systematically removes spuriousedges that cannot be a link in a q-clique because they lack asufficient number of sub-cliques. Here we give some detailsof the procedures used: k = 1 counts the number of links toeach node; k = 2 counts the number of triangles (3-cliques)for a given link; and k = 3 counts 4-cliques.

q I

LL6sf,~ ~q(l- d)

jsli=l

(4)

for at least one subset of q sequences in P>. If we can showno q sequences satisfy Eq.(4), then Eq.(3) fails.

Consider the consensus sequence of all sequences in P>.By definition, the consensus sequence C maximizes thesum 2::;=1 15 sj ,c, for each position i, where n = IF> I.

t

Obviously,q I I q

I:I:5s1"" = I:I:5S1,h,j=1 i=1 iz1 j=1

I q

s: EI:5s1,c;i=1jz1

I n

s: I:min(q,I:5S?,c.}i=1 j=1'

where C' is the consensus sequence of q sequences and Cis the consensus of the whole group of n sequences thatinclude q sequences as a subset. The last inequality followsbecause

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

4.1. k = 1 pruning criterion

If the number of links, n, a node has is smaller than q-l,then the node cannot be a part of any q-clique. Prune alllinks connected to it. Apply the consensus criteria. If it failsto satisfy, prune all links. The algorithm has a complexityofO(N2), where N is the total length of the sequence.

4.2. k = 2 pruning criterion

In order to decide if the link between a and b shouldbe pruned, define a set P of all nodes c that are connectedto both a and b. Next apply the consensus criterion withn = 3 and q = 3 on a, b and c to remove those from Pthat fail the criterion. Finally, if the size of the neighboringset IPI is smaller than q - 2, then the node cannot be in

any q-clique. Prune the link between a and b. Apply theconsensus criteria. If it fails to satisfy, prune the link (a, b).The algorithm has a complexity of O(N3).

Figure 1. Fraction of spurious links left un-pruned as a function of clique size q for thecase of k = 2 after WINNOWER (dashed lines)and cWINNOWER (solid lines) algorithms areapplied to random sequences of three sizes:N = 12000 (diamonds); N = 6000 (circles);and N = 3000 (stars). cWINNOWER usesconsensus constraints whereas WINNOWERdoes not. In each case, there is an abrupttransition from one (no link is purged) to zero(all links are purged). For WINNOWER algo-rithm N = 12000 (the rightmost curve), thereare no data for q between 71 and 75 because ittakes too long to complete the calculations.The hidden pattern sought is (I, d) = (15,4),i.e., a pattern of length 15 with at most 4 muta-tions allowed. Each point is the result of aver-aging between one to ten random sequences.

4.3. k = 3 pruning criterion

The goal is to prune the link between a and b. Here wealso have a neighboring set P. The criterion of admissionto the set is more stringent. Each member, C, of the set Pmust have at least q - 3 other nodes in P that together witha, band C form (q - 3) 4-cliques. In order for the four nodes

to form a 4-clique, they must satisfy the consensus criterionwith n = 4 and q = 4. Finally, the size of the set P must belarger than q - 2. The consensus criterion is applied to P.

The complexity of the algorithm is O(N4).

4.4. Speeding up techniques

time. The extra cost is a simple sorting. By the same token,we prune links widt 2d mismatches first. The algorithmis implemented in C++ using the standard template library,which has fast set operations.

5. Consensus Constraint Improves Perfor-mance

Because the winnower algorithm is guaranteed not toprune away true q-cliques. its performance is detenninedby its ability to remove spurious links. In a typical situ-ation. almost all of the links are spurious. For example.for (l, d) = (15,4). and total sequence length N = 6000

the total number of links is about one million with the 2dmismatch criterion. whereas embedding twenty signals onlycontributes 190 links.

By rearranging the order of calculation and saving someintermediate results, the code can be made much faster forthe case of k = 3. Let Nt denote the set of neighbors con-nected to node t. Define Ib = Na n Nb for each b in Na.Notice that Ib are typically much smaller than Na becauseit is the intersection of two sets. This set of sets will be usedrepetitively in pruning links between a and its neighbors bin set Na. To determine if any c E Ib belongs to set P,we require that the size of Ic n Ib be larger than or equal toq - 3. The complexity of the algorithm is O(N4). How-

ever, there is a much larger O(N3) term comparable to thek = 2 case. The running time for q = 32, (l, d) = (15,4)and N = 12000 for k = 2 is 3 hours 45 minutes on a SGIworkstation. This is to be compared with a running time of10 hours and 15 minutes for k = 3 at q = 18.

Another technique of speeding up computations is toprune the node with the smallest numbers of links first be-cause they are the easiest to be pruned. This will have adomino effect on other nodes resulting in a shorter running

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

Figure 2. Same as Fig.1 but for the case ofk = 3. Transitions are in general not as sharpas in the case of k = 2. For the rightmostcurve, for q between 23 and 31, the calculationtakes too long to complete.

In this paper we perform tests on random sequences.This is preferable to testing on random sequences embed-ded with signals[8] because, in our experience, on randomsequences embedded with a pattern the winnower algorithmoften leaves more links unpruned (by a factor of two tothree) even when the algorithm can remove all spuriouslinks for the random sequences of the same length with-out embedded signals. The last few hundred spurious linksmixed with links from true signals can be dealt with easilyby other methods of finding clique graphs.

The algorithm prunes all spurious links that is not a partof a clique of size q, which determines the minimum num-ber of signals detectable in the input sequence. The al-gorithm that can prune all spurious links with small copynumber q is superior. A motif will repeat even in a randomsequence when its length N is sufficiently large. The prob-ability to observe a q-clique in random sequence is given byPoisson distribution Li?;q ~e-A, where). = p(N - l) is

the probability, p, of finding the motif times the total num-ber of motifs of length l in sequence of length N. Theexpected number of repeats increases linearly with the se-quence length N. Of course an algorithm may not be ableto prune all spurious links even when there is no q-clique inthe random sequence, which is the case for the tests below.

We determine the minimum q = qc as a function of totalsequence length for pattern (l, d) = (15,4). FIG.! showsthe percentage of spurious links left unpruned as a functionof q after the application of winnower algorithms with andwithout consensus bounds for the case of k = 2 (which

counts number of 3-cliques to eliminate spurious links).

The calculations were performed for random sequences oflengths N = 3000, 6000 and 12000. This is roughly equiv-alent to embedding signal to q sequences of length Ii. each

. qconsIdered by Pevmer and Sze[8]. For each length, as thecopy number q increases, a sharp transition is seen whichdefines the minimum copy number qc. For q < qc, the algo-rithm is unable to prune all the spurious links and thereforeunable to find the q-clique containing the signals. We havealso performed similar calculations for the more sensitivecase of k = 3(FIG.2), which eliminates links by countingthe number of 4-cliques. Table I lists qc for three random se-quence lengths for the winnower algorithm with and with-out consensus bounds. The minimum detectable copy num-ber qc is much smaller with the consensus bounds. For thecase of k = 2, qc increases linearly with the random se-

quence length. (This is clearly true for WINNOWER andis most likely also true for cWINNOWER.) The minimumdetectable copy number qc is three times smaller with theconsensus constraint than without for k = 2. Therefore the

consensus constraint greatly improved the sensitivity of thealgorithm. For the more sensitive k = 3 algorithm, qc alsobecomes smaller when the consensus bound is imposed.The ratio of qc with and without consensus constraints in-creases with the sequence length (see Table I).

The factor of three reduction in qc can be explained asfollows: in k = 2 of WINNOWER algorithm, a link (a, b)will be pruned if the number, n, of nodes connected to botha and b is smaller than q - 2. In the case of cWINNOWER,

a link may be pruned even when n is larger than q becauseconsensus constraint. In Section ill, we have shown thatthe consensus constraint stops working for n larger thanroughly4(1-1)q = 4(1-i\)q ~ 3q. Assuming consensusconstraint is effective until n ~ 3q, qc for cWINNOWER isthree times smaller.

For N = 12000, there is a range of q near qc where the

calculation took too long to run. The reason is that the pro-gram only stops when it has gone through all links remain-ing and there are no more links to delete. However, near qc,each time it goes through the list, it is able to delete a fewlinks but the bulk of computing time is spent on decidingwhether to delete links that end up not being deleted.

Calculations reported in FIG.! and FIG.2 were averagedbetween one and ten random sequences depending on thesequence length. Where the calculations were repeated ondifferent random sequences of the same length, the resultsare always similar, which is expected because the numberof spurious links is very large: 0.25 x 106,106, and 4 x 106for N = 3000,6000 and 12000, respectively.

Winnower assumes mutations can occur anywhere in thesequence. However in most of the regulatory sequences,some sites are more conserved than others. A partial remedyfor this is to divide the pattern into two parts. For one part-for example, in the middle of the pattern-the allowed num-

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

her of mutations is less than that for the rest of the pattern.In summary, the winnower method affords several ad-

vantages. In addition to being able to allow mutations inthe hidden pattern, it has a clear-cut criterion for signal se-lection and is also unique in showing the absence of signal,i.e. it can prove that a certain (i, d) motif occurs less than qtimes in the input sequence.

[9] S.-H. Sze, M.S. Gelfand and P. A. Pevzner "Findingweak motifs in DNA sequences", Pac Symp Biocom-put. (2002),235-246.

[10] C.E. Lawrence, S. F. Altschul, M. S. Boguski, J. S.Liu, A. F. Neuwald, J. C. Wooton, "Detecting sub-tle sequence signals: a Gibbs sampling strategy formultiple alignment", Science (1993) 262, 208-14; J.S. Liu, A. F. Neuwald, and C.E. Lawrence, "Marko-vian Structures in Biological Sequence Alignments" JAm Statist Assoc. (1999) 94, 1-15.

6 Acknowledgements

I thank Profs. Bob Laughlin (Stanford University), HaoLi (University of California San Francisco), and ZhifengShao (University of Virginia) for numerous stimulatingand infonnative discussions. I also thank Bryan Biegeland Kevin Harrington for their critical readings of themanuscript.

[11] TL. Bailey, & C. Elkan, "Fitting a mixture modelby expectation maximization to discover motifs inbiopolymers" Proc Int ConI Intell Syst Mol Bioi.(1994) 2, 28-36.

[12] G.Z. Hertz and G.D. Stormo, "Identifying DNA andprotein patterns with statistically significant align-ments of multiple sequences" Bioinfonnatics (1999)15,563-77.

References

[1] B. Ren, F. Robert, JJ. Wyrick, O. Aparicio, E.G. Jen-nings, I. Simon, J. Zeitlinger, J. Schreiber, N. Hannett,E. Kanin, TL. Volkert, C.J. Wilson, S. P. Bell, R.A.Young, "Genome-wide location and function of DNAbinding proteins", Science (2<XX» 290, 2306-9.

[13] H.J. Bussemaker, H. Li, & E.D. Siggia, "Building adictionary for genomes: identification of presumptiveregulatory sites by statistical analysis" Proc Natl AcadSci USA. (2CXX» 97,10096-100.

[14] X. Liu, DL. Brutlag,& J. S. Liu, "BioProspector: dis-covering conserved DNA motifs in upstream regula-tory regions of co-expressed genes" Pac Symp Bio-comput. (2001), 127-38.

[2] V.R. Iyer, C.E. Horak, C.S. Scafe, D. Botstein, M.Snyder, P.O.Brown, "Genomic binding sites of theyeast cell-cycle transcription factors SBF and MBF'Nature (2001) 409, 533-8.

[15] X.S. Liu, D.L. Brutlag, J.S. Liu, "An algorithm forfinding protein-DNA binding sites with applicationsto chromatin-immunoprecipitation microarray experi-ments" Nat Biotechnol. (2002) 20, 835-9.

[3] Tavazoie S, Hughes ill, Campbell MI, Cho RJ,Church OM. "Systematic determination of geneticnetwork architecture", Nat Genet. (1999) 22, 281-5.

[4] Z Zhu. Y Pilpel, GM Church. "Computational Iden-tification of Transcription Factor Binding Sites via a

Transcription-factor-centric Clustering (TFCC) Algo-rithm", J Mol Bioi. (2002) 318, 71-81.

[5] N. Harafuji, DN Keys, M. Levine, "Genome-wideidentification of tissue-specific enhancers in the Cionatadpole", Proc Natl Acad Sci USA. (2002) 99, 6802-5.

[6] E. Davidson. Genomic Regulatory Systems. Develop-ment and Evolution (Academic Press 2001).

[7] for a recent review, see for example H. Li "Compu-tational approaches to identifying transcription fac-tor binding sites in yeast genome", Methods EnzymoL(2002) 350, 484-95.

[8] P.A. Pevzner and S.-H. Sze, "Combinatorial ap-proaches to finding subtle signals in DNA sequences"Proc Int Conf Intell Svst Mol Bioi. (2CXX» 8, 269-78.

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE