
Journal of Discrete Algorithms 5 (2007) 229–242

www.elsevier.com/locate/jda

Algorithms for extracting motifs from biological weighted sequences

C. Iliopoulos a, K. Perdikuri b,c,∗, E. Theodoridis b,c, A. Tsakalidis b,c, K. Tsichlas a

a Department of Computer Science, King’s College London, London WC2R 2LS, England, UK

b Computer Engineering & Informatics Department of University of Patras, 26500 Patras, Greece

c Research Academic Computer Technology Institute (RACTI), 61 Riga Feraiou Str., 26221 Patras, Greece

Available online 16 May 2006

Abstract

In this paper we present three algorithms for the Motif Identification Problem in Biological Weighted Sequences. The first algorithm extracts repeated motifs from a biological weighted sequence. The motifs correspond to repetitive words which are approximately equal, under a Hamming distance, with probability of occurrence ≥ 1/k, where k is a small constant. The second algorithm extracts common motifs from a set of N ≥ 2 weighted sequences. In this case, the motifs consist of words that must occur with probability ≥ 1/k in 1 ≤ q ≤ N distinct sequences of the set. The third algorithm extracts maximal pairs from a biological weighted sequence. A pair in a sequence is the occurrence of the same word twice. In addition, the algorithms presented in this paper improve previous work on these problems.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Motif extraction; Biological weighted sequences

1. Introduction

DNA and protein sequences can be seen as long texts over specific alphabets encoding the genetic information of living beings. Searching specific sub-sequences over these texts is a fundamental operation for problems such as assembling the DNA chain from pieces obtained by experiments, looking for given DNA chains or determining how different two genetic sequences are. However, exact searching is of little use since the patterns rarely match the text exactly. The experimental measurements have various errors and even correct chains may have small differences, some of which are significant due to mutations and evolutionary changes.

Finding approximate repetitions or signals is needed in several applications in molecular biology. Moreover, establishing how different two sequences are is important for reconstructing the tree of evolution (phylogenetic trees). All these problems require a concept of similarity, or in other words a distance metric between two sequences. Additionally, many problems in Computational Biology involve searching for unknown repeated patterns, often called motifs, and identifying regularities in nucleic or protein sequences. Both imply inferring patterns, of unknown content at first, from one or more sequences. Regularities in a sequence may come under many guises. They may correspond to approximate repetitions randomly dispersed along the sequence, or to repetitions that occur in a periodic or approximately periodic fashion. The length and number of repeated elements one wishes to be able to identify may be highly variable. The analysis of the distribution of repeated patterns permits biologists to determine whether there exists an underlying structure and correlation at a local or global genomic level. Moreover, in the study of gene expression and regulation, it is important to be able to infer repeated structured patterns and answer various biological questions. Structured patterns correspond to an ordered collection of p boxes (always of initially unknown content) and p − 1 intervals of distances (one between each pair of successive boxes in the collection). Structured patterns make it possible to identify conserved elements recognized by different parts of the same protein or macromolecular complex, or by various complexes that then interact with one another. A maximal pair is a special case of a structured pattern with p = 2 identical boxes.

* Corresponding author.
E-mail addresses: [email protected] (C. Iliopoulos), [email protected] (K. Perdikuri), [email protected] (E. Theodoridis), [email protected] (A. Tsakalidis), [email protected] (K. Tsichlas).

1570-8667/$ – see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.jda.2006.03.018

In this work, we examine various instances of the Motif Identification Problem in weighted sequences. In particular, we are given a set of weighted sequences S = {S1, S2, . . . , Sk}, Si ∈ Σ∗, and we are asked to extract interesting motifs such that each motif occurs in at least q sequences.

Generally speaking, a weighted sequence could be defined as a sequence of (symbol, weight) pairs, S = 〈(s1, w1), (s2, w2), . . . , (sn, wn)〉, where wi is the weight of symbol si in position i (the occurrence probability of si at position i).

Biological weighted sequences can model important biological processes, such as the DNA-Protein Binding Process or Assembled DNA Chains. Thus, motif extraction from biological weighted sequences is a very important procedure in the translation of gene expression and regulation. In more detail, the extracted motifs from weighted sequences correspond in general to binding sites. These are sites in a biological molecule that will come into contact with a site in another molecule, permitting the initiation of some biological process (for instance, transcription or translation). In addition, these weighted sequences may correspond to complete chromosome sequences that have been obtained using a whole-genome shotgun strategy [12]. By keeping all the information the whole-genome shotgun produces, we would like to dig out information that was previously undetected after being faded during the consensus step. Finally, protein families can also be represented by weighted sequences ([6], Section 14.3.1); in this case weighted sequences are usually called profiles.

A great number of algorithms have been proposed in the related literature for inferring motifs in biological sequences as frequently occurring substrings [19–22]. The majority of these algorithms rely on either statistical or machine learning approaches for solving the motif inference problem. Moreover, in the past few years numerous tools have become available for the task of motif extraction from biological sequences (e.g., MEME, AlignACE, MITRA), differing from each other chiefly in their definition of what constitutes a motif, what constitutes statistical over-representation of a motif, and the method used to find statistically over-represented motifs. For a detailed evaluation of these tools the reader may refer to [23]. To the best of our knowledge, none of these tools handles weighted sequences. Moreover, in [24] the authors present a suite of software tools for the efficient and fast detection of over- or under-represented words in nucleotide sequences. The inner core of these tools rests on subtly interwoven properties of statistics, pattern matching and combinatorics on words.

In [13] the authors defined a notion of redundancy for motifs, based on the idea that some motifs could be enough to build all the others. The goal is to define a basis of motifs, in other words a set of irredundant motifs that can generate all maximal motifs by simple mechanical rules. The idea of a basis of motifs, named tiling motifs, is also used in [14]. The problem of extracting maximal irredundant motifs from a string is also studied in [15], whose authors design ad hoc data structures and constructs that lead to an O(n³) time incremental discovery algorithm.

Other approaches build all possible motifs by increasing length. These solutions have a high time and space complexity and cannot be applied in the case of weighted sequences, due to their combinatorial complexity. Finally, in [10,16] the authors use the suffix tree to spell all valid models (exact or approximate). In this way they do not build all possible motifs, but indirectly generate only those which are relevant to the text.

In addition, finding maximal pairs in ordinary sequences was first described by Gusfield in [6]. This algorithm uses a suffix tree to report all maximal pairs in a string of length n in time O(n + α) and space O(n), where α is the number of reported pairs. In [1] the authors presented methods for finding all maximal pairs under various constraints on the gap between the two substrings of the pair. In a string of length n, they find all maximal pairs with gap in an upper and lower bounded interval in time O(n log n + α). If the upper bound is removed the time is reduced to O(n + α).


1.1. Results

In this paper we present the following results:

– Maximal Pairs Problem: A set S of N weighted sequences with mean length n is given and we are asked to find all probable maximal pairs that occur in at least q of the N sequences. The time complexity of the proposed algorithm is O(Nn log(Nn) + α), where α is the size of the output, using linear space.

– Repeated Motifs Problem: A weighted sequence s of length n is given and we are asked to find all probable motifs of length ℓ that occur at least q times in s with e mismatches. The time complexity of the algorithm is O(n V²(e, ℓ) q log log n), where V(e, ℓ) is the number of words of length ℓ that have at most Hamming distance e between each other. The algorithm uses linear space.

– Common Motifs Problem: A set S of N weighted sequences with mean length n is given and we are asked to find all probable motifs of length ℓ that occur in at least q sequences of S with e mismatches. We propose an algorithm with O(nNqV(e, ℓ)) time complexity using O(nNq) space.

The structure of the paper is as follows. In Section 2 we give some basic definitions on weighted sequences to be used in the rest of the paper. In Section 3 we address the problem of extracting simple models, while in Section 4 we address the problem of Motif Extraction in weighted sequences. Finally, in Section 5 we conclude and discuss open problems in the area.

2. Preliminaries

In this section we provide some definitions needed throughout the paper. In addition, we define the problems we tackle and sketch the previous work on the problems we consider.

2.1. Basic definitions

Let Σ = {1, 2, . . . , σ} be an alphabet of cardinality σ = |Σ|. A sequence s of length n is represented by s[1..n] = s[1]s[2] · · · s[n], where s[i] ∈ Σ for 1 ≤ i ≤ n, and n = |s| is the length of s. Sequence s is also called a solid sequence in order to distinguish it from weighted sequences. An empty sequence is denoted by ε; we write Σ∗ = Σ+ ∪ {ε}. A factor f of s is a substring of s, that is, f = s[i . . . j].

A substring w is a prefix of s if s = wu for u ∈ Σ∗, and a proper prefix if u ∈ Σ+. Similarly, w is a suffix of s if s = uw for u ∈ Σ∗. We denote by T(s) the suffix tree of s, defined as the compressed trie of all the suffixes of s$, $ ∉ Σ. Let L(v) denote the path-label of node v in T(s), which results by concatenating the edge labels along the path from the root to v. Leaf v of T(s) is labeled with index i iff L(v) = s[i..n]. We define the leaf-list LL(v) of v as a list of the leaf-labels in the subtree below v.

The suffix tree is a fundamental data structure supporting a wide variety of efficient string searching algorithms. In particular, the suffix tree is well known for the efficient and simple solutions it provides to many problems concerning the identification and location either of a set of patterns or repeated substrings (contiguous or not) in a given sequence. The reader can find an extended literature of such applications in [6].

A weighted sequence is defined as follows.

Definition 1. A weighted sequence s = s1s2 · · · sn over an alphabet Σ of cardinality |Σ| = σ is a sequence of sets of couples. In particular, each si is a set {(1, πi(1)), (2, πi(2)), . . . , (σ, πi(σ))}, where πi(q) is the occurrence probability of character q at position i. For every position 1 ≤ i ≤ n,

$$\sum_{q=1}^{\sigma} \pi_i(q) = 1.$$

We represent each position of the weighted sequence as a vector that contains all the symbols of the alphabet and their corresponding probabilities; if a character does not appear in a specific position then its probability is zero. A weighted biological sequence is often represented as a d × n matrix, which is termed Position Weight Matrix, where d is the size of the respective alphabet (in the case of DNA weighted sequences d = 4) and n is the length of the sequence. Each cell pij of the Position Weight Matrix stores the probability of appearance of symbol i in the j-th position of the input sequence. An instance of a weighted (sub)sequence p is a (sub)sequence of p where a symbol has been chosen for each position. We will represent each couple (q, πi(q)) as πi(q)q, that is, the probability followed by the symbol, as shown in the example below.

Example 1. Consider the alphabet Σ = {A,C,G,T }. Then

$$
s = \begin{pmatrix}
0.3A & 0.25A & 0A & 0.4A & 0.8A & 0A \\
0C & 0.25C & 1C & 0.2C & 0.05C & 0.5C \\
0.2G & 0.5G & 0G & 0.2G & 0.1G & 0G \\
0.5T & 0T & 0T & 0.2T & 0.05T & 0.5T
\end{pmatrix} \tag{1}
$$

is the weighted matrix that represents a weighted sequence of length 6. Note that the sum of probabilities for each column is 1. In this case s[1] is the set of couples {(A, 0.3), (C, 0), (G, 0.2), (T, 0.5)}.

The following definition clarifies when a solid pattern p occurs in a weighted text t .

Definition 2. The solid pattern p = p[1 . . . m] occurs at position i of the weighted text t = t[1 . . . n] if and only if p[j] occurs at position t[i + j − 1] for all 1 ≤ j ≤ m; that is, if and only if πi+j−1(p[j]) > 0 for all 1 ≤ j ≤ m. We also say that p matches t at position i.

Example 2. Let t be the weighted sequence defined in Eq. (1). Then p = ACTA occurs in t at position 2, since π2(A) = 0.25, π3(C) = 1, π4(T) = 0.2, and π5(A) = 0.8.

Since each symbol at position i of the text t is assigned a probability of occurrence, it is logical to assume that an occurrence of a solid pattern p in the weighted text t must also have a probability of occurrence. In this way, we define how likely it is to find an occurrence of p in a specific position of t. Let p = p[1 . . . m] be a solid pattern, and t = t[1 . . . n] be a weighted text. Also assume that p matches t at position i. Then the probability of the match is defined as the product of the probabilities of the symbols of t that match p:

$$P^{i}_{\mathrm{prod}} = \prod_{j=1}^{m} \pi_{i+j-1}(p[j])$$

The following example illustrates the use of this measure:

Example 3. Let t be the weighted sequence defined by the Weighted Matrix (1) and p = ACTA, which occurs in t at position 2. Then the probability of occurrence of this match is P²prod = 0.25 · 1 · 0.2 · 0.8 = 0.04.
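Definition 2 and the product measure can be checked mechanically on the matrix of Eq. (1). The following is a minimal Python sketch, not part of the original algorithms; the names `T`, `match_probability` and `occurs` are hypothetical, and each position is assumed to be stored as a dictionary from symbol to probability.

```python
# Weighted sequence of Eq. (1): one dict per position (symbol -> probability).
# This representation is an assumption made for illustration only.
T = [
    {"A": 0.3, "C": 0.0, "G": 0.2, "T": 0.5},
    {"A": 0.25, "C": 0.25, "G": 0.5, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    {"A": 0.8, "C": 0.05, "G": 0.1, "T": 0.05},
    {"A": 0.0, "C": 0.5, "G": 0.0, "T": 0.5},
]


def match_probability(text, pattern, i):
    """Product of the probabilities of the symbols of `text` matched by
    `pattern` at 1-based position i; 0.0 if the pattern does not fit."""
    if i < 1 or i + len(pattern) - 1 > len(text):
        return 0.0
    prob = 1.0
    for j, symbol in enumerate(pattern):
        prob *= text[i - 1 + j].get(symbol, 0.0)
    return prob


def occurs(text, pattern, i):
    """Definition 2: every matched symbol has non-zero probability."""
    return match_probability(text, pattern, i) > 0.0


# Example 3: ACTA at position 2 has probability 0.25 * 1 * 0.2 * 0.8,
# i.e. approximately 0.04 in floating point.
p = match_probability(T, "ACTA", 2)
```

With k = 4 the threshold of the validity test discussed next would be 1/k = 0.25, so this particular occurrence, although possible, would not be reported as valid.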

As previously defined, a factor f in a sequence s is a substring of s. This definition can be extended to the case of weighted sequences. Thus a weighted factor f is a substring f = s[i . . . j] of a weighted sequence s whose probability of occurrence is non-zero and is given by the product of the probabilities of the symbols in the weighted sequence that match f.

From a biological point of view, in weighted sequences we are interested in repeated patterns which appear with high probability of occurrence. These patterns (or in other words factors) are called valid.

Definition 3. Given a factor f of length m at position i of a weighted text t and an integer k, we say that f is a valid factor of t at position i if and only if the probability of occurrence of f is ≥ 1/k.

Valid motifs in weighted sequences correspond to valid factors that occur at least q times in the weighted sequence. If we consider approximate motifs, then the Hamming distance between any two valid factors of the motifs must be ≤ e.

Moreover, a pair in a string is the occurrence of the same substring twice. A pair is maximal if the occurrences of the substring cannot be extended to the left or to the right without making them different. The gap of a pair is the number of characters between the two occurrences of the substring. A pair in a weighted sequence is valid if each substring appears with probability ≥ 1/k.
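To make the notion of a maximal pair concrete, here is a brute-force Python sketch for an ordinary (solid) string. It is only an illustration under simplifying assumptions: positions are 0-based, overlapping occurrences are allowed, and the cubic-time scan stands in for the suffix-tree techniques cited in the text. The function name `maximal_pairs` is hypothetical.

```python
def maximal_pairs(s):
    """Return all maximal pairs of a solid string as (i, j, length) with
    i < j (0-based starts), found by direct comparison."""
    n = len(s)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Left-maximality: the characters preceding the two
            # occurrences must differ (or the first occurrence starts s).
            if i > 0 and s[i - 1] == s[j - 1]:
                continue
            # Extend to the right as far as the occurrences agree;
            # stopping there makes the pair right-maximal.
            length = 0
            while j + length < n and s[i + length] == s[j + length]:
                length += 1
            if length > 0:
                pairs.append((i, j, length))
    return pairs


pairs = maximal_pairs("xabcyabcz")  # [(1, 5, 3)]: "abc" at 1 and 5
gaps = [j - (i + length) for i, j, length in pairs]  # [1]
```

In the weighted setting each reported occurrence would additionally have to pass the probability test ≥ 1/k.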


Fig. 1. The WST for the weighted sequence S = actt{(a,0.5)(c,0.5)}tc{(a,0.5)(c,0.3)(t,0.2)}ttt and k = 4.

Definition 4. A set Mℓ of positions inside a weighted sequence s represents a set of weighted factors of length ℓ that are similar if and only if there exists at least one motif m ∈ Σ^ℓ such that, for all elements i in Mℓ, distℓ(m, si) ≤ e.

In other words, the set Mℓ contains all motifs of length ℓ with at most e mismatches. The size of Mℓ is denoted by V(e, ℓ) and it is used as an upper bound for the maximum number of motifs reported as output.
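The Hamming-distance condition of Definition 4 can be explored with a small Python sketch. It assumes, and this reading is not stated in the text, that a quantity such as V(e, ℓ) can be understood as the size of a Hamming ball of radius e around a fixed word of length ℓ; the closed form in `ball_size` is the standard count for such a ball. All function names are hypothetical.

```python
from itertools import combinations, product
from math import comb


def hamming(u, v):
    """Number of mismatching positions between two equal-length words."""
    if len(u) != len(v):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(u, v))


def neighborhood(word, e, alphabet):
    """All words over `alphabet` within Hamming distance e of `word`."""
    result = set()
    for d in range(min(e, len(word)) + 1):
        # Choose exactly d positions and substitute a different symbol.
        for positions in combinations(range(len(word)), d):
            choices = [[c for c in alphabet if c != word[p]] for p in positions]
            for replacement in product(*choices):
                w = list(word)
                for p, c in zip(positions, replacement):
                    w[p] = c
                result.add("".join(w))
    return result


def ball_size(length, e, sigma):
    """Closed form: sum over d <= e of C(length, d) * (sigma - 1)^d."""
    return sum(comb(length, d) * (sigma - 1) ** d
               for d in range(min(e, length) + 1))


near = neighborhood("ACGT", 1, "ACGT")  # 13 words, including ACGT itself
```

The rapid growth of this count with e and ℓ is one way to see why approaches that enumerate all candidate motifs are described as combinatorially expensive.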

The extraction of valid motifs in this work is based on the use of the Weighted Suffix Tree (WST). The WST of a weighted sequence s, WST(s), is the compressed trie of all valid weighted subwords starting within each suffix si of s$, $ ∉ Σ (an example is illustrated in Fig. 1). The WST is built in linear time and space when k is a small constant. The WST was first presented in [7] as an elegant data structure for reporting the repetitions within a weighted biological sequence. In [8] the authors presented an efficient algorithm for constructing the WST, while in [25] the authors present several applications of the WST. A more formal definition of the WST follows.

Definition 5. Let S be a weighted sequence. For every suffix starting at position i we define a list of possible weighted subwords so that the probability of appearance for each one of them is at least 1/k. Denote each of them as Si,j, where j is the subword rank in arbitrary numbering. We define WST(S), the weighted suffix tree of a weighted sequence S, as the compressed trie of a portion of all the weighted subwords starting within each suffix Si of S$, $ ∉ Σ, having a probability of appearance at least 1/k. Let L(v) denote the path-label of node v in WST(S), which results by concatenating the edge labels along the path from the root to v. Leaf v of WST(S) is labeled with index i if ∃ j > 0 such that L(v) = Si,j[i..n] and π(Si,j[i · · · n]) ≥ 1/k, where j > 0 denotes the j-th weighted subword starting at position i. We define the leaf-list LL(v) of v as a list of the leaf-labels in the subtree below v.

We will use an example to illustrate the above definition. Consider again the weighted sequence shown in Fig. 1 and suppose that we are interested in storing all suffixes with probability of appearance at least 1/k = 1/4. We have the following possible prefixes for every suffix:

– Prefixes for suffix x[1 · · · 11]: S1,1 = acttatcattt, π(S1,1) = 0.25, and S1,2 = acttctcattt, π(S1,2) = 0.25.

– Prefixes for suffix x[2 · · · 11]: S2,1 = cttatcattt, π(S2,1) = 0.25, and S2,2 = cttctcattt, π(S2,2) = 0.25, etc.

The weighted suffix tree for the above subwords appears in Fig. 1.
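The subwords listed above can be reproduced by a brute-force enumeration. The Python sketch below is an illustration only and is not the linear-time construction of [8]: it assumes positions are stored as dictionaries, treats a subword as valid when its probability is at least 1/k, and reports, for a given start position, the valid subwords that cannot be extended further to the right. The names `S` and `valid_subwords` are hypothetical.

```python
# Weighted sequence of Fig. 1; solid positions carry probability 1.
S = [
    {"a": 1.0}, {"c": 1.0}, {"t": 1.0}, {"t": 1.0},
    {"a": 0.5, "c": 0.5},
    {"t": 1.0}, {"c": 1.0},
    {"a": 0.5, "c": 0.3, "t": 0.2},
    {"t": 1.0}, {"t": 1.0}, {"t": 1.0},
]


def valid_subwords(seq, start, k):
    """Right-maximal valid subwords starting at 1-based position `start`,
    returned as (word, probability) pairs."""
    threshold = 1.0 / k - 1e-12  # small slack for floating-point products
    results = []

    def extend(pos, word, prob):
        extended = False
        if pos < len(seq):
            for symbol, p in seq[pos].items():
                if p > 0.0 and prob * p >= threshold:
                    extended = True
                    extend(pos + 1, word + symbol, prob * p)
        # Report a word only when no symbol keeps it valid any longer.
        if not extended and word:
            results.append((word, prob))

    extend(start - 1, "", 1.0)
    return results


first = valid_subwords(S, 1, 4)   # acttatcattt and acttctcattt, both 0.25
second = valid_subwords(S, 2, 4)  # cttatcattt and cttctcattt, both 0.25
```

Because every branching position multiplies the probability by a factor below 1, only a bounded number of branches can stay above 1/k, which is consistent with the remark that the WST has linear size when k is a small constant.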

2.2. Definitions of problems

In this section we provide formal definitions of the problems we consider. The first problem we wish to solve is the repeated motifs problem.


Problem 1. Given a weighted sequence s and three integers 0 ≤ k < c, e ≥ 0 and q ≥ 2, for some small constant c, find all factors f with probability of occurrence ≥ 1/k such that f is present at least q times in s and the Hamming distance between every pair of occurrences is ≤ e. All these occurrences must not overlap.

The non-overlapping restriction is added because when two occurrences a and b of a model of s overlap, then it may be the case that a cancels b. More specifically, assume that a and b overlap at position i. Then, it may be the case that a uses symbol σ1 ∈ Σ with probability πi(σ1) while b uses symbol σ2 ∈ Σ with probability πi(σ2). This means that for the same model and the same position in the weighted sequence two different characters were used. To overcome this awkwardness we do not allow the occurrences of models to overlap. The second problem we wish to solve is the common motifs problem.

Problem 2. Given a set of N weighted sequences S = {si} (1 ≤ i ≤ N) and three integers 0 ≤ k < c, e ≥ 0 and 2 ≤ q ≤ N, for some small constant c, find all factors f with probability of occurrence ≥ 1/k such that f is present in at least q distinct sequences of the set and the Hamming distance between every pair of occurrences is ≤ e.

Finally, the third problem we wish to solve is the following.

Problem 3. Given a set of N weighted sequences S = {s1, s2, . . . , sN}, an integer 0 ≤ k < c and a quorum q ≤ N, for some small constant c, find all maximal pairs m such that m is valid, that is, it appears with probability greater than 1/k in at least q sequences of the set S.

2.3. Previous work

In the following, we sketch the algorithms proposed by Sagot [16] and Iliopoulos et al. [9], on which our solutions are based. The common characteristic of both papers is that the proposed algorithms make heavy use of suffix trees. As already described, the suffix tree is an indexing structure for all suffixes of a string s and it is well known that it can be constructed in linear time and linear space [11]. The generalized suffix tree is a suffix tree for more than one string. Since the suffix tree is a well known indexing structure for strings, we will assume that the reader is familiar with its basic properties and characteristics. In the discussion to follow, for reasons of clarity we discuss the algorithm on the uncompressed suffix tree (a sequence of nodes with just one child is not collapsed into a single edge).

The repeated and common motifs problems are handled in [16]. For the first problem the input is a string s of length n over an alphabet Σ and three integers: q ≥ 2 (the quorum), e ≥ 0 (the maximum number of mismatches) and the length ℓ of the wanted model. In the case where we seek to find all possible models we have to apply the algorithm for each possible length. The output of the algorithm is only the extracted motifs and not the exact positions of their appearance.

Assuming that e = 0, the algorithm for the repeated motifs problem locates each node vi that corresponds to a model mi of length ℓ and then checks if this model is valid, that is, if it satisfies the quorum constraint. This constraint is verified by checking whether the number of leaves below each node vi is at least q. If we allow for errors, then a model mi corresponds to many nodes vi1, vi2, . . . , vij on the suffix tree. The model is valid if the sum of the leaves of all these nodes is at least q. A simple linear-time preprocessing step computes the number of leaves for each node of the suffix tree. Note that occurrences of models may overlap. The space used by the algorithm is linear while the time complexity for a specific length ℓ is O(nV(e, ℓ)).
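The quorum test described above can be imitated without a suffix tree. The Python sketch below is a deliberately naive stand-in, not the algorithm of [16]: it counts the length-ℓ windows of a solid string with a `Counter` (playing the role of the leaf counts) and tries every possible model, so its running time is exponential in ℓ. The quorum is read as "at least q occurrences", as in Problem 1, and overlapping occurrences are allowed. The function name `repeated_motifs` is hypothetical.

```python
from collections import Counter
from itertools import product


def repeated_motifs(s, length, q, e, alphabet):
    """Models of the given length that have at least q (possibly
    overlapping) occurrences in s with at most e mismatches each."""
    # Occurrence count of every window; the analogue of counting the
    # leaves below the node that spells the window in a suffix tree.
    counts = Counter(s[i:i + length] for i in range(len(s) - length + 1))
    motifs = []
    for model in map("".join, product(alphabet, repeat=length)):
        # With e > 0 a model collects the counts of all windows within
        # Hamming distance e, mirroring the nodes vi1, ..., vij.
        total = sum(c for word, c in counts.items()
                    if sum(a != b for a, b in zip(model, word)) <= e)
        if total >= q:
            motifs.append(model)
    return motifs


exact = repeated_motifs("ACGTACGAACGT", 4, 2, 0, "ACGT")   # ['ACGT']
approx = repeated_motifs("ACGTACGAACGT", 4, 3, 1, "ACGT")  # includes 'ACGT'
```

Note that with mismatches allowed a model such as ACGC can be reported even though it never occurs exactly in the string, which matches the statement that only motifs, not positions, are output.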

For the common motifs problem, the input is a set of strings S = {s1, s2, . . . , sN} and two integers q ≥ 2 and e ≥ 0. First, a generalized suffix tree is constructed for S in time O(nN). Then, the mechanism to check the quorum constraint is implemented. For each node v in the suffix tree, a bit vector bv of N positions is constructed such that bv[i] = 0 when in the subtree of v there is no leaf with label i, that is, there is no occurrence of a suffix of string si in the subtree of v (otherwise bv[i] = 1). Then, the procedure is exactly the same as in the repeated motifs algorithm with the exception of the use of the bit vectors to check whether the quorum constraint is satisfied. The space requirement of this algorithm is O(nN²/w), where w is the word length of the machine. The time complexity is O(nN²V(e, ℓ)) for a specified length ℓ.

Finally, we come to the solution described in Iliopoulos et al. [9]. In this work all maximal pairs which occur in each string of a set of strings without any restrictions on the gaps are reported in O(n + α) time, where α is the size of the output, and linear space. In addition, the algorithm reports all maximal pairs which occur in each string of a set of strings with the same gap that is upper bounded by a constant. This is achieved in O(n log² n + αN log n) time, where N is the number of strings and n is the total length of the strings, using linear space.
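The bit-vector quorum test that [16] uses for the common motifs problem, described earlier in this section, can be sketched in Python as follows. This is a simplified illustration restricted to exact matching (e = 0): a dictionary keyed by length-ℓ words replaces the generalized suffix tree, and a Python integer is used as the bit vector bv. The function name `common_motifs` is hypothetical.

```python
def common_motifs(seqs, length, q):
    """Words of the given length occurring exactly in at least q
    distinct sequences, using one integer bit mask per word."""
    masks = {}
    for idx, s in enumerate(seqs):
        for i in range(len(s) - length + 1):
            word = s[i:i + length]
            # Set bit idx: sequence idx contains this word.
            masks[word] = masks.get(word, 0) | (1 << idx)
    # The quorum holds when at least q bits of the mask are set.
    return sorted(w for w, m in masks.items() if bin(m).count("1") >= q)


seqs = ["ACGTT", "TACGA", "GGACG"]
shared = common_motifs(seqs, 3, 3)  # ['ACG']
```

Packing the N bits of each vector into machine words of length w is what gives the N/w factor in the space bound quoted above.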

3. Extracting simple models

In this section we design an algorithm for reporting all maximal pairs in a set of weighted sequences. More specifically, given a set of N weighted sequences S = {s1, s2, . . . , sN}, a small integer k ≥ 0 and a quorum q ≤ N, we report all maximal pairs whose components appear with probability greater than 1/k in at least q sequences of the set S. We assume that the mean length of the weighted sequences in S is n.

We have considered two variations of this problem depending on the restrictions on the gaps. In the first version we assume that there is no restriction on the gaps of the pairs, thus one pair may appear in different sequences with different gaps. In the second version of the problem one pair has to come along with approximately the same gap, which is upper bounded by a constant value b. For solving these problems we suggest two methods that are extensions of the algorithms that are provided in [9] for these problems on plain sequences. Our solutions address these problems on weighted sequences in a simpler and more efficient manner.

Initially, a generalized weighted suffix tree gWST(S) is constructed. A generalized weighted suffix tree is similar to the generalized suffix tree and is built upon all the weighted sequences of S (an example is illustrated in Fig. 2). For the construction of gWST(S) the algorithm of [8] is used for each of the weighted sequences in S and all the produced factors are superimposed in the same compacted trie. The total time for this operation is linear in the sum of the lengths of the weighted sequences, $O(\sum_{i=1}^{N} |s_i|) = O(nN)$. The construction method is invoked for each of the weighted sequences starting from the root of the same compacted trie. The suffix links are preserved, so it is like building a generalized suffix tree from a set of regular sequences using the same auxiliary suffix tree. Thus, the space of the gWST(S) is linear in the total length of the weighted sequences. The gWST(S) is a compacted trie with out-degree of internal nodes at least 2 and at most σ = |Σ|.

The first step, as mentioned in [9], is to binarize gWST(S). Each node u with out-degree |u| ≥ 2 is replaced by a binary tree with |u| leaves and |u| − 1 internal nodes. Each new edge is labeled with the empty string ε so that all new nodes have the same path-label as node u, which they replace. Assuming that the alphabet size σ is constant, the whole procedure needs linear time and the final data structure has linear space. Refer to Algorithm 1 for an outline of the Binarize method. This method is recursive and its first call is Binarize(gWST(S), root).
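The effect of binarization can be illustrated with a short Python sketch on a plain multiway tree. It is not a transcription of Algorithm 1: the `Node` class and the helper names are hypothetical, edge labels are left implicit, and the children are grouped by halving, which is one of several shapes that give |u| leaves and |u| − 1 internal nodes sharing the path-label of u.

```python
class Node:
    """Minimal tree node; `path_label` stands for L(v)."""

    def __init__(self, path_label, children=None):
        self.path_label = path_label
        self.children = children or []


def binarize(node):
    """Return a copy of the subtree in which no node has more than two
    children; new internal nodes reuse the path label of the node they
    replace, as if reached through edges labeled with the empty string."""
    kids = [binarize(child) for child in node.children]
    return _combine(node.path_label, kids)


def _combine(path_label, kids):
    if len(kids) <= 2:
        return Node(path_label, kids)
    mid = len(kids) // 2
    return Node(path_label, [_group(path_label, kids[:mid]),
                             _group(path_label, kids[mid:])])


def _group(path_label, kids):
    # A single child is attached directly; several children hang below
    # a fresh internal node carrying the same path label.
    return kids[0] if len(kids) == 1 else _combine(path_label, kids)


root = Node("u", [Node("u" + c) for c in "abcde"])
binary_root = binarize(root)
```

For the five-child node above the result contains four internal nodes labeled "u", and the left-to-right order of the original children is preserved.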

Every index j at the leaves of the gWST(S) is organized in special leaf-lists according to:

– The weighted sequence si it belongs to

– The character to the left, at position j − 1 in the weighted sequence, defined as the left-character

There are two possible cases: either the position j − 1 of the left-character is a solid one, meaning that there is only one character with probability of appearance equal to 1, or the position is a branching one with more than one appearing character. For example, consider the weighted sequence of Fig. 3. The branching positions are {5, 8} while all the others are solid positions. Supposing that the probability of appearance threshold is 0.25, consider the produced subwords S3,1 = TTATCATTT, S10,1 = TT and S9,1 = TTT. The first one has C as left-character, the second has T, while the third can have A or G as left-character.

Fig. 2. The gWST for the weighted sequences S1 = actt{(a,0.5)(c,0.5)} and S2 = tc{(a,0.5)(c,0.3)(t,0.2)}ttt and k = 4.

input: a Generalized Weighted Suffix Tree gWST(S), a node u of gWST(S)
lfs: a queue of nodes
children: a stack of nodes
x ← number of children of u
for i ← x to 1 do
    children.add(i-th child of u)
end for
h ← log x
create a new node u′
u′.path_label ← u.path_label
lfs.push(u′)
for i ← 0 to h do
    for j ← 1 to 2^i do
        nd ← lfs.pop()
        create a new node lc as a left child of nd pointed by an edge with label ε
        create a new node rc as a right child of nd pointed by an edge with label ε
        lfs.push(lc)
        lfs.push(rc)
    end for
end for
for j ← 1 to 2^h do
    nd ← lfs.pop()
    nd1 ← children.pop()
    nd2 ← children.pop()
    make nd1 left child of nd
    make nd2 right child of nd
    Binarize(gWST(S), nd1)
    Binarize(gWST(S), nd2)
end for
delete u
return

Algorithm 1. Binarize.

Fig. 3. Weighted sequence example.

The left-character organization scheme is used to check whether two subwords, viewed as a candidate pair, cannot be extended to the left, that is, whether they form a left-maximal pair. Right-maximality is ensured by the trie-like structure of the generalized weighted suffix tree: at every internal node with path label u, every pair of indices taken from different subtrees corresponds to two subwords that share the prefix u and differ at position |u| + 1.

In weighted sequences there are subwords with more than one left-character. We introduce a new type of left-character called ‘c∗’ for all those special subwords. This new class also guarantees left-maximality, since for any left-character of an index x in that class there is at least one index y with a different one. Consider again the previous example (see Fig. 3): if we examine S3,1 = TTATCATTT with left-character C against S9,1 = TTT with left-character c∗, their prefix TT is left-maximal. Notice that subword S9,1 = TTT is inserted by S8,1 = ATTT and by S8,2 = GTTT. See [8] for a detailed description of the weighted suffix tree construction algorithm. Eventually, a leaf-list is a set of N vectors, one for each of the weighted sequences, where each vector contains σ + 1 lists, one for each of the σ + 1 choices for left-character (see Fig. 4).

When the construction of the gWST(S) is completed, a bottom-up process is initiated. Let Lℓ and Lr be the leaf-lists of the left and right descendants of a node u. The candidate maximal pairs for each of the sequences si, defined by the path label of node u, can be found by combining, for every j, the indexes of list Lℓ.si.lcj (the list for symbol cj in weighted sequence si) with the lists Lr.si.lcz, for every z ≠ j.

Fig. 4. The leaf-lists where the indexes are organized.

If overlaps are not allowed on the components of a pair, the combination of lists Lℓ.si.lcj and Lr.si.lcz is done for all x ∈ Lℓ.si.lcj and y ∈ Lr.si.lcz such that |x − y| − |path_label(u)| ≥ 0. This is implemented efficiently by keeping the lists sorted and virtually merging one list with the other. The lists can be organized as AVL trees or finger search trees, for example. If we choose the smaller list and virtually merge it with the other, the following three lemmas guarantee that the total time needed is O(Nn log(Nn) + α).
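The non-overlap condition |x − y| − |path_label(u)| ≥ 0 and the idea of scanning the smaller list against the larger one can be sketched in Python. This is only an approximation of the virtual merge: sorted Python lists with binary search stand in for the AVL or finger search trees, so the sketch costs O(|small| log |large|) plus the output size rather than the finger-search bound of the lemmas. The function name `non_overlapping_pairs` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def non_overlapping_pairs(small, large, label_len):
    """Pairs (x, y), x from the smaller sorted list and y from the larger
    sorted list, whose occurrences of length `label_len` do not overlap,
    that is, |x - y| >= label_len."""
    pairs = []
    for x in small:
        # Positions y <= x - label_len end before x starts.
        lo = bisect_right(large, x - label_len)
        # Positions y >= x + label_len start after x ends.
        hi = bisect_left(large, x + label_len)
        pairs.extend((x, y) for y in large[:lo])
        # max() avoids reporting the same y twice when label_len is 0.
        pairs.extend((x, y) for y in large[max(hi, lo):])
    return pairs


found = non_overlapping_pairs([5], [1, 4, 7, 9], 3)  # [(5, 1), (5, 9)]
```

Iterating over the smaller list is what makes the charging argument of Lemma 1 applicable: each node pays in proportion to its lighter subtree.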

Lemma 1. The sum over all nodes u of an arbitrary tree of size n of terms that are O(n1), where n1 ≤ n2 are the weights (number of leaves) of the subtrees rooted at the two children of u, is O(n log n).

Proof. The sum attains its maximum possible value when the two subtrees of every internal node have equal weight, that is, n1 = n2. The cost we pay at the root of the tree is then n/2, at its two children 2 · n/4 = n/2, at the next level 4 · n/8 = n/2, and so on. Summing these costs over the log n levels we get a total cost of O(n log n). □

Lemma 2. Two AVL trees or finger search trees of size n1 and n2, where n1 ≤ n2, can be merged in time $O\left(\log \binom{n_1+n_2}{n_1}\right)$.

Proof. See also [2] or [3]. The items of the smaller tree are inserted one by one into the larger tree, in increasing order. If the position of an item is determined by a finger search operation starting from the previously inserted item, then each insertion, except the first one, needs $O(\log d)$ time, where $d$ denotes the distance between the new item and the previously inserted one. Thus the total time cost is $O\left(\log n_2 + \sum_{2 \leq i \leq n_1} \log d_i\right)$, where $O(\log n_2)$ is the cost to find the position of the first inserted element. The time cost becomes maximum when the distances $d_i$ are equal. In this case the total merge cost is $O\left(\log n_2 + n_1 \log\frac{n_2}{n_1}\right) = O\left(n_1 \log\frac{n_2}{n_1}\right) = O\left(\log\binom{n_1+n_2}{n_1}\right)$. □
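The claim that the cost is largest when all distances are equal follows from the concavity of the logarithm. A short derivation is sketched here under two assumptions made for this illustration: $n_1 \geq 2$, and the distances travelled in the larger tree satisfy $\sum_i d_i \leq n_2$. The second line is the standard bound $\binom{a}{b} \geq (a/b)^b$, which links the per-item cost to the binomial coefficient.

```latex
\[
\sum_{i=2}^{n_1} \log d_i
  \;\le\; (n_1-1)\,\log\!\Bigl(\frac{1}{n_1-1}\sum_{i=2}^{n_1} d_i\Bigr)
  \;\le\; (n_1-1)\,\log\frac{n_2}{n_1-1}
\]
\[
n_1 \log\frac{n_1+n_2}{n_1}
  \;\le\; \log\binom{n_1+n_2}{n_1}
\]
```

The first inequality is Jensen's inequality, with equality exactly when all $d_i$ coincide.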

Lemma 3. Let $T$ be an arbitrary binary tree with $n$ leaves. The sum over all internal nodes $u \in T$ of terms $\log\binom{n_1+n_2}{n_1}$, where $n_1 \leq n_2$ are the weights of the subtrees rooted at the two children of $u$, is $O(n \log n)$.

Proof. See also [1]. We prove by induction on the number of leaves of the binary tree that the sum is upper bounded by $\log n!$. If $T$ is a single leaf then the upper bound holds. Assume inductively that the upper bound holds for any tree with at most $n - 1$ leaves and consider a tree $T$ with $n$ leaves. Suppose that the left subtree of the root has $n_1 < n$ leaves and the right subtree $n_2 < n$ leaves ($n_1 + n_2 = n$). According to the induction hypothesis the upper bound holds for the two subtrees of the root; that is, the total sum of the terms on the nodes of each of these trees is bounded by $\log n_1!$ and $\log n_2!$ respectively. Thus the total sum is at most:

$$\log n_1! + \log n_2! + \log\binom{n_1+n_2}{n_1} \leq \log n!$$

Since $\log n! = O(n \log n)$, the lemma follows. □
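The bound of Lemma 3 can be checked numerically. The following is a minimal sketch, not part of the original method: it splits the leaves of a binary tree at random (the function name and the random splitting rule are assumptions made for illustration) and compares the accumulated sum with $\log_2 n!$ and with $n \log_2 n$.

```python
import math
import random

def sum_log_binomials(n, rng):
    """Sum log2 C(n1 + n2, n1) over the internal nodes of a random
    binary tree with n leaves (each node splits its leaves at random)."""
    if n == 1:                       # a single leaf contributes nothing
        return 0.0
    n1 = rng.randint(1, n - 1)       # leaves in the left subtree
    n2 = n - n1                      # leaves in the right subtree
    return (math.log2(math.comb(n, n1))
            + sum_log_binomials(n1, rng)
            + sum_log_binomials(n2, rng))

rng = random.Random(0)
n = 200
total = sum_log_binomials(n, rng)
bound = math.log2(math.factorial(n))     # the log n! bound from the proof

print(total <= bound + 1e-6, bound <= n * math.log2(n))
```

Because $\log n_1! + \log n_2! + \log\binom{n}{n_1} = \log n!$ holds with equality, the sum should match $\log_2 n!$ up to rounding for any split.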

Before the output is retrieved, it must be checked whether at least $q$ of the weighted sequences $s_i$ yield at least one pair. This is accomplished during the virtual merging of the lists. The virtual merge is applied to all possible combinations of lists, but two more operations are spent for each item of the smaller list to check whether there is at least one candidate pair for the corresponding sequence. If at least $q$ sequences have at least one maximal pair, the rest of the output is retrieved. This check costs $O(n_1)$ additional steps per node (the smaller half); thus, according to Lemma 1, its overall cost is $O(Nn \log(Nn))$. After the reporting step, the leaf-lists $L_\ell$ and $L_r$ are merged, by merging each list $L_\ell.s_i.l_{c_j}$ with the list $L_r.s_i.l_{c_j}$ for all $i, j$. According to Lemma 3, this step costs $O(Nn \log(Nn))$ in total. The result is summarized in the following theorem.

Theorem 1. Given a set of $N$ weighted sequences $S = s_1, s_2, \ldots, s_N$, a small integer $k \geq 0$ and a quorum $q \leq N$, we can find in time $O(Nn \log(Nn) + \alpha)$ all maximal pairs $m$ such that each component of $m$ appears with probability $\geq 1/k$ and with no overlaps in at least $q$ sequences of the set $S$, where $\alpha$ is the size of the answer and $n$ the mean length of the $s_i$.

Refer to Algorithm 2 for a detailed outline of the above method. The initial call of this recursive method is Report Maximal Pairs(gWST(S), root, q).
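The combination of one list from the smaller leaf-list with one list from the larger can be sketched as follows. This is a simplified illustration, not the paper's implementation: plain sorted Python lists with binary search stand in for the AVL or finger search trees, the function names are hypothetical, and `length` (assumed to be at least 1) plays the role of $|\mathit{path\_label}(u)|$.

```python
from bisect import bisect_left, bisect_right

def non_overlapping_pairs(small, large, length):
    """Pair every x in `small` with every y in `large` with |x - y| >= length.

    Both inputs are sorted lists of start positions. Only the smaller list
    is scanned; each x costs two binary searches plus the pairs reported.
    """
    pairs = []
    for x in small:
        right = bisect_left(large, x + length)    # first y with y >= x + length
        pairs.extend((x, y) for y in large[right:])
        left = bisect_right(large, x - length)    # large[:left] holds y <= x - length
        pairs.extend((x, y) for y in large[:left])
    return pairs

def has_candidate(small, large, length):
    """Quorum pre-check: does at least one non-overlapping pair exist?"""
    return bool(small) and bool(large) and (
        large[-1] >= small[0] + length or large[0] <= small[-1] - length)

print(non_overlapping_pairs([5], [1, 4, 7, 9], 3))   # [(5, 9), (5, 1)]
```

Scanning only the smaller list is what allows Lemma 1 to bound the total work over all nodes of the tree.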

When the overlap constraint is removed, the query becomes more time consuming. The output has to be filtered by checking that the overlap of the components of a pair is the same substring. This is crucial since at each position of the overlapping region the same choices of symbols must be made in the two components. This problem is overcome by pre-processing gWST(S) to support nearest common ancestor (nca) queries in constant time [17]. When a candidate pair of indexes $x, y$ has an overlap ($y < x + |\mathit{path\_label}(u)|$), an nca(x, y) query on gWST(S) gives the longest common extension of the two sub-factors starting at positions $x$ and $y$. If the answer to this query exceeds the length of the overlap, the overlapping region is the same substring in the two factors. In this case the time complexity becomes $O((Nn)^2)$.

In the second version of the problem, a pair has to occur with approximately the same gap, upper bounded by a constant value $b$, in at least $q$ weighted sequences. The algorithm described above can be easily extended to solve this variation of the problem, so we only provide a sketch. At each internal node $u$, during the reporting step, we apply a virtual merge and for each index from the smaller list we retrieve, as described above, at most $2b$ indices for candidate pairs. The indices that overlap with the former index are validated with nca queries and some are rejected. To check whether a maximal pair with approximately the same gap occurs in at least $q$ weighted sequences we apply the following bucketing scheme. We have $b$ buckets, one for each of the permitted values of the gap. Each candidate pair is placed in one bucket according to its gap. At the end of the reporting step we scan all the buckets and report the ones that have size at least $q$. The buckets can be implemented as linear lists, and this check can be done in constant time by storing the size of each list. The reporting step is the same as in the case of unrestricted gaps. The running time of this method is determined by the actual and virtual merging steps, which as before cost $O(Nn \log(Nn))$, plus a constant number of operations at every internal node. The following theorem summarizes the result:

Theorem 2. Given a set of $N$ weighted sequences $S = s_1, s_2, \ldots, s_N$, a small integer $k \geq 0$ and a quorum $q \leq N$, we can find in time $O(Nn \log(Nn) + \alpha)$ all maximal pairs $m$ such that each component of $m$ appears with probability $\geq 1/k$ and the gap is bounded by the constant $b$, in at least $q$ sequences of the set $S$, where $\alpha$ is the size of the output.
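The bucketing scheme for bounded gaps can be illustrated with a small sketch. Several details here are assumptions made for the example and are not fixed by the text: the permitted gap values are taken to be $0, \ldots, b-1$, the gap is measured as the number of positions strictly between the two components, and each bucket stores a set of sequence identifiers (rather than a linear list with a size counter) so that a sequence is counted once per gap value.

```python
from collections import defaultdict

def gaps_meeting_quorum(candidates, b, q):
    """Return the gap values whose bucket reaches the quorum q.

    `candidates` is an iterable of (sequence_id, x, y, length) tuples: a
    candidate pair whose components start at x < y, each of the given
    length, found in weighted sequence `sequence_id`.
    """
    buckets = defaultdict(set)             # gap -> ids of sequences showing it
    for seq_id, x, y, length in candidates:
        gap = y - (x + length)             # positions between the two components
        if 0 <= gap < b:                   # keep only permitted gap values
            buckets[gap].add(seq_id)
    return sorted(g for g, ids in buckets.items() if len(ids) >= q)

candidates = [(0, 1, 6, 3), (1, 2, 7, 3), (2, 0, 4, 3), (0, 10, 15, 3)]
print(gaps_meeting_quorum(candidates, b=4, q=2))   # [2]
```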

4. Extracting simple motifs

In this section we present algorithms for the repeated and common motifs problems on weighted sequences. These algorithms are based on the algorithms of Sagot [16], with modifications that affect the time and space complexities. In particular, the models in the repeated motifs problem must be non-overlapping, while in the common motifs problem the time and space complexity is slightly improved. Since we borrow most of the algorithmic machinery from [16], we chose to focus on these modifications.

4.1. The repeated motifs problem

We are given a weighted sequence $s$ and four integers $0 < k \leq c$, $e \geq 0$, $\ell \geq 2$ and $q \geq 2$, for some constant $c$, and we are asked to find all models $m$ of length $\ell$ with probability of occurrence $\geq 1/k$ such that $m$ is present at least $q$ times in $s$ and the Hamming distance between all occurrences of $m$ is $\leq e$. The following restriction must hold:


input: binarized Generalized Weighted Suffix Tree gWST(S), node u of gWST(S), quorum q

Report Maximal Pairs(gWST(S), u.left_child, q)
Report Maximal Pairs(gWST(S), u.right_child, q)
Let l1 be the smaller of the leaf-lists of the two children of u
Let l2 be the leaf-list of the other child
counter ← 0
for all s_i ∈ S (i ← 1 to N) do
    flag ← false    /* flag shows if there is at least one maximal pair for s_i */
    for all σ1 ∈ Σ do
        for all σ2 ∈ Σ with σ2 ≠ σ1 do
            Let l′ be a list with the items of l1.s_i.lc_σ1 increased by |path_label(u)|    /* virtual step */
            virtually merge l′ into l2.s_i.lc_σ2
            for all items e ∈ l′ do
                if there are items at the right of e in l2.s_i.lc_σ2 then
                    flag ← true
                end if
            end for
            Let l′′ be a list with the items of l1.s_i.lc_σ1 decreased by |path_label(u)|    /* virtual step */
            virtually merge l′′ into l2.s_i.lc_σ2
            for all items e ∈ l′′ do
                if there are items at the left of e in l2.s_i.lc_σ2 then
                    flag ← true
                end if
            end for
        end for
    end for
    if flag = true then
        counter ← counter + 1
    end if
end for
if counter ≥ q then
    for all s_i ∈ S (i ← 1 to N) do
        for all σ1 ∈ Σ do
            for all σ2 ∈ Σ with σ2 ≠ σ1 do
                Let l′ be a list with the items of l1.s_i.lc_σ1 increased by |path_label(u)|    /* virtual step */
                virtually merge l′ into l2.s_i.lc_σ2
                for all items e ∈ l′ do
                    Retrieve all items at the right of e in l2.s_i.lc_σ2
                end for
                Let l′′ be a list with the items of l1.s_i.lc_σ1 decreased by |path_label(u)|    /* virtual step */
                virtually merge l′′ into l2.s_i.lc_σ2
                for all items e ∈ l′′ do
                    Retrieve all items at the left of e in l2.s_i.lc_σ2
                end for
            end for
        end for
    end for
end if
Merge leaf-list l1 into l2    /* merge the corresponding sub-lists */
Let u.leaf_list be the leaf-list produced by this merge
return

Algorithm 2. Report Maximal Pairs.


Restriction 1. The occurrences must not overlap. (Non-overlapping Restriction)

First, the weighted suffix tree WST(s) of $s$ is constructed, given that the minimum probability of occurrence is $1/k$. This construction is accomplished in linear time and space. Then, all models of length $\ell$ are spelled on the tree.

Some definitions are necessary before moving to the spelling mechanism. A node-occurrence of a motif $m$ is represented by a pair $(v, e_v)$, where $v$ is a node in WST(s) and $e_v$ is the number of errors between $m$ and the path label of $v$. In addition, the father node of $v$ is denoted by father(v). The following lemma, borrowed from [16], is in fact the recurrence that implements the spelling of models.

Lemma 4. A pair $(v, e_v)$ is a node-occurrence of $m' = m\alpha$ with $m \in \Sigma^{\rho}$ for $1 \leq \rho \leq \ell$ and $\alpha \in \Sigma$ iff one of the following two conditions holds:

– match: the pair $(\mathit{father}(v), e_v)$ is a node-occurrence of $m$ and the label of the arc from father(v) to $v$ is $\alpha$;
– mismatch: the pair $(\mathit{father}(v), e_v - 1)$ is a node-occurrence of $m$ and the label of the arc from father(v) to $v$ is $\gamma \neq \alpha$.
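The recurrence of Lemma 4 can be sketched in a few lines. The example below is an illustration under simplifying assumptions, not the algorithm of [16] itself: it works on an ordinary (non-weighted) string, it uses an uncompressed trie of all substrings of a fixed length instead of a weighted suffix tree, and the names `build_trie` and `spell_models` are hypothetical.

```python
def build_trie(text, length):
    """Trie of all substrings of `text` with exactly `length` characters.
    A node is a dict char -> child; the key None holds start positions."""
    root = {}
    for start in range(len(text) - length + 1):
        node = root
        for ch in text[start:start + length]:
            node = node.setdefault(ch, {})
        node.setdefault(None, []).append(start)
    return root

def spell_models(root, alphabet, length, max_errors):
    """Yield (model, positions) for every model of the given length that has
    at least one occurrence with at most `max_errors` mismatches."""
    def extend(model, occurrences):
        # occurrences: (node, errors) pairs, the node-occurrences of `model`
        if len(model) == length:
            yield model, sorted(p for node, _ in occurrences for p in node[None])
            return
        for alpha in alphabet:
            extended = []
            for node, errors in occurrences:
                for label, child in node.items():
                    if label == alpha:               # match: error count unchanged
                        extended.append((child, errors))
                    elif errors < max_errors:        # mismatch: one more error
                        extended.append((child, errors + 1))
            if extended:                             # prune models with no occurrence
                yield from extend(model + alpha, extended)
    yield from extend("", [(root, 0)])

trie = build_trie("ACGTACGA", 3)
models = dict(spell_models(trie, "ACGT", 3, 1))
print(models["ACA"])   # [0, 4]: both occurrences of ACG are one mismatch away
```

A quorum test would then keep only the models whose position list has at least $q$ entries.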

In a nutshell, Lemma 4 states that a model $m'$ of length $< \ell$ is extended by one character either with a match or with a mismatch if the total number of errors in $m'$ is $\leq e$. This procedure continues recursively until either length $\ell$ is reached or the number of errors becomes larger than $e$.

We will focus on the main problem of this algorithm on weighted sequences, which is the maintenance of the non-overlapping restriction. This constraint is enforced by filtering the output of the algorithm on the WST.

Assume that the nodes with path label of length $\ell$ constitute a set $L = \{v_1, v_2, \ldots, v_{|L|}\}$. For each $v \in L$, the leaves of its subtree are put in a sorted list $v^{\ell}$. These lists are implemented as van Emde Boas trees [18]. Since the numbers sorted are in the range $[1, n]$, we can sort them in linear time. As a result, the time complexity for this step is $\sum_{i=1}^{|L|} O(|v_i^{\ell}|)$. Since all lists are disjoint, this sum is bounded by $O(n)$.

Assume that $L' = \{v_{i_1}, v_{i_2}, \ldots, v_{i_j}\} \subseteq L$ are the nodes with path label of length $\ell$ that constitute a candidate model $m$. First the quorum constraint must be checked, that is, whether the total number of their leaves is at least $q$. If it is not, then the model is not valid. If there are at least $q$ leaves, then the non-overlapping constraint must be checked.

The naive solution would be to merge all lists $v^{\ell}_{i_1}, v^{\ell}_{i_2}, \ldots, v^{\ell}_{i_j}$ and perform $q$ queries. In this case, the time complexity for $q$ queries would be $O(q \log\log n)$ (the $\log\log n$ factor is due to the van Emde Boas trees), but the merge step requires $O(n)$ time, which is very inefficient.

This inefficiency can be circumvented by checking among all nodes in $L'$ to find the one with the minimum position of occurrence. This can be easily implemented in $O(|L'|)$ time, since the list of each node is sorted and only its first element is checked. Assume that this element is position $x_1$ on the weighted sequence $s$. Then, among all lists, the successor of the value $x_1 + |m| + 1$ is located, giving the next position $x_2$. This procedure continues until the quorum constraint is either satisfied (the final query is of the form $\geq x_{q-1} + |m| + 1$) or violated. This solution has $O(q|L| \log\log n)$ time complexity, which leads to an $O(nV^2(e, \ell)\, q \log\log n)$ time solution for the repeated motifs problem for length $\ell$, using linear space. The following theorem states the result:

Theorem 3. Given a weighted sequence $s$ and four integers $0 < k \leq c$, $e \geq 0$, $\ell \geq 2$ and $q \geq 2$, for some small constant $c$, we can locate all words $f$ of length $\ell$ with probability of occurrence $\geq 1/k$ in $O(nV^2(e, \ell)\, q \log\log n)$ time using linear space, such that $f$ is present at least $q$ times in $s$ and the Hamming distance between every pair of occurrences is $\leq e$. All these occurrences do not overlap.
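The greedy successor procedure that precedes Theorem 3 can be sketched as follows. This is an illustration under stated assumptions: sorted Python lists with binary search replace the van Emde Boas trees (so each query costs $O(\log n)$ per list rather than $O(\log\log n)$), the function name is hypothetical, and two occurrences starting at $x$ and $y \geq x + |m|$ are treated as non-overlapping; the query offset should be adjusted if a different position convention is intended.

```python
from bisect import bisect_left

def meets_quorum_without_overlap(lists, model_length, q):
    """Greedy check: can q pairwise non-overlapping occurrences be chosen?

    `lists` holds one sorted list of start positions per node of L';
    the lists are queried one by one and never merged.
    """
    def successor(value):
        # smallest position >= value over all lists, or None if there is none
        best = None
        for positions in lists:
            i = bisect_left(positions, value)
            if i < len(positions) and (best is None or positions[i] < best):
                best = positions[i]
        return best

    chosen = []
    current = successor(0)
    while current is not None and len(chosen) < q:
        chosen.append(current)
        current = successor(current + model_length)   # next start that cannot overlap
    return len(chosen) >= q, chosen

print(meets_quorum_without_overlap([[1, 9], [3, 4], [12]], 3, 3))   # (True, [1, 4, 9])
```

Each of the at most $q$ rounds touches every list once, which mirrors the $q|L'|$ factor in the stated bound.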

The successor location problem can be seen as a static data structure problem, which we call the multiset dictionary problem.

Definition 6. Given a superset $S = \{S_1, S_2, \ldots, S_x\}$ of sets $S_i \subseteq \{1, 2, \ldots, n\}$, we wish to answer $q$ successor queries on the subset $S' = \{S_{i_1}, S_{i_2}, \ldots, S_{i_x}\} \subseteq S$, where $n = \sum_{i=1}^{x} |S_i|$.

This problem can be seen as a generalization of the iterative search problem [4]. In that problem, we are given a set of $N$ catalogs and we are asked to answer $N$ queries, one on each catalog. The straightforward solution is to search in each catalog, which means that the time complexity is $O(N \log n)$, if each catalog has size $n$. If we apply the fractional cascading technique [4], then the time complexity is reduced to $O(N + \log n)$. Unfortunately, we cannot apply the same technique to the multiset dictionary problem, since we do not know in advance which catalogs are going to be used, while at the same time the queries are not confined to a single catalog but to their union. This is an interesting data structure problem, and it would be nice to see solutions with better complexity than the rather trivial $O(qx \log\log n)$.

4.2. The common motifs problem

We are given a set of $N$ weighted sequences $S = \{s_i\}$ ($1 \leq i \leq N$) and four integers $0 < k \leq c$, $e \geq 0$, $\ell \geq 2$ and $q \geq 2$, for some constant $c$, and we want to find all models $m$ of length $\ell$ with probability of occurrence larger than $1/k$ such that $m$ is present in at least $q$ strings in $S$ and the Hamming distance between all occurrences of $m$ is $\leq e$. Note that the algorithm of [16] can be applied to weighted sequences in a straightforward way. In the following, we propose some modifications that improve the time and space complexity of the algorithm.

First, the generalized weighted suffix tree gWST(S) of $S$ is constructed, given that the minimum probability of occurrence is $1/k$. This construction is accomplished in linear time and space for a small constant $k$. Then, all models of length $\ell$ are spelled on gWST(S) by using Lemma 4. Similarly, a model $m'$ of length $< \ell$ is extended by one character either with a match or with a mismatch if the total number of errors in $m'$ is $\leq e$. In Section 2.3 the solution of [16] is sketched, with $O(nN^2 V(e, \ell))$ time complexity using $O(nN^2/w)$ space. With some minor modifications, the new algorithm reduces a factor $N$ to $q$.

This additional $N$ factor in the space and time complexity comes from the check of the quorum constraint. Sagot uses a bit vector of length $N$. However, note that if the subtree of a node contains $q$ different strings then the subtrees of all its ancestors will certainly contain $q$ strings. In addition, it is not necessary to know the exact number of strings in the subtree as long as this number is at least $q$. Therefore, an array of integers of length $q$ is attached to each internal node. If this array gets full then the quorum constraint is satisfied for the node and for all its ancestors, and it is not necessary to maintain this information further.

These arrays are filled by a post-order traversal of the suffix tree. The array attached to each node is sorted. If one of the children of $v$ has a full array, then $v$ will also have a full array. In the case where there is no full array, all these arrays are merged without keeping repetitions. This can be easily accomplished in $O(|\Sigma| q)$ time, since the maximum number of children of a node is $|\Sigma|$. Given that the number of internal nodes is $O(nN)$, the pre-processing time is $O(nNq)$ while the space complexity of the suffix tree is $O(nNq)$ (less than the $O(nN^2)$ of [16]). Finally, the time complexity of the algorithm is $O(nNq V(e, \ell))$, which is better than [16] since $q$ is at most equal to $N$. The following theorem states the result:

Theorem 4. Given a set of $N$ weighted sequences $S = s_1, s_2, \ldots, s_N$ and four integers $0 < k \leq c$, $e \geq 0$, $\ell \geq 2$ and $q \geq 2$, for some constant $c$, we can find in time $O(nNq V(e, \ell))$ using $O(nNq)$ space all models $m$ of length $\ell$ with probability of occurrence larger than $1/k$ such that $m$ is present in at least $q$ strings in $S$ and the Hamming distance between all occurrences of $m$ is $\leq e$.
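The capped quorum arrays described before Theorem 4 can be sketched as a post-order pass. The tree layout below (dictionaries with a `children` list for internal nodes and a `seq` field for leaves) and the function name are assumptions made for this illustration, and `heapq.merge` is used for the multiway merge, which adds a logarithmic factor in the number of children compared with the $O(|\Sigma| q)$ bound in the text.

```python
import heapq

def attach_quorum_arrays(node, q):
    """Post-order pass storing in node['ids'] a sorted list of at most q
    distinct sequence ids found below `node`; length q means "quorum reached".

    Hypothetical layout: {'children': [...]} for internal nodes and
    {'seq': sequence_id} for leaves.
    """
    if 'seq' in node:                              # leaf: one suffix of one sequence
        node['ids'] = [node['seq']]
        return node['ids']
    child_arrays = [attach_quorum_arrays(c, q) for c in node['children']]
    full = next((a for a in child_arrays if len(a) >= q), None)
    if full is not None:                           # a full child makes the parent full
        node['ids'] = full
        return full
    merged = []
    for seq_id in heapq.merge(*child_arrays):      # multiway merge of sorted arrays
        if not merged or merged[-1] != seq_id:     # skip repetitions
            merged.append(seq_id)
            if len(merged) == q:                   # stop as soon as the array is full
                break
    node['ids'] = merged
    return merged
```

A node satisfies the quorum exactly when its array has length $q$, so the spelling step can test the constraint without a bit vector of length $N$.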

5. Discussion and further work

The algorithms presented in this paper solve various instances of the motif identification problem in weighted sequences, which is very important in the area of protein sequence analysis.

We have identified many interesting problems related to motif identification on solid or weighted sequences:

1. Structured motifs identification problem: locate all structured motifs in weighted sequences. In this problem we want to identify motifs consisting of dissimilar boxes with constraints on the gaps between the occurrences of the boxes.

2. Generalized maximal repetitions: extend the algorithm for the maximal pairs to identify general structured motifs composed of p > 2 similar parts.

3. Edit distance: we would like to come up with efficient methods for edit distance instead of Hamming distance. A good starting point is the set of methods described in [5].

4. Approximate results: as far as we know there is no work on approximate solutions for this problem. An approximation algorithm would be welcome considering the time complexities of the known algorithms for motif identification.

References

[1] G. Brodal, R. Lyngso, Ch. Pedersen, J. Stoye, Finding maximal pairs with bounded gap, Journal of Discrete Algorithms 1 (2000) 134–149.
[2] M.R. Brown, R.E. Tarjan, A fast merging algorithm, Journal of the ACM 26 (2) (1979) 211–226.
[3] M.R. Brown, R.E. Tarjan, Design and analysis of a data structure for representing sorted lists, SIAM Journal on Computing 9 (1980) 594–614.
[4] B. Chazelle, L.J. Guibas, Fractional cascading: I. A data structuring technique, Algorithmica 1 (1986) 133–162.
[5] R. Cole, L.-A. Gottlieb, M. Lewenstein, Dictionary matching and indexing with errors and don’t cares, in: Proc. of the 36th ACM Symposium on Theory of Computing (STOC), 2004, pp. 91–100.
[6] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, New York, 1997.
[7] C. Iliopoulos, Ch. Makris, I. Panagis, K. Perdikuri, E. Theodoridis, A. Tsakalidis, Computing the repetitions in a weighted sequence using weighted suffix trees, in: Proc. of the European Conference on Computational Biology (ECCB), 2003.
[8] C. Iliopoulos, Ch. Makris, I. Panagis, K. Perdikuri, E. Theodoridis, A. Tsakalidis, Efficient algorithms for handling molecular weighted sequences, Accepted for presentation in IFIP TCS, 2004.
[9] C. Iliopoulos, C. Makris, S. Sioutas, A. Tsakalidis, K. Tsichlas, Identifying occurrences of maximal pairs in multiple strings, in: Proc. of the 13th Ann. Symp. on Combinatorial Pattern Matching (CPM), 2002, pp. 133–143.
[10] L. Marsan, M.-F. Sagot, Algorithms for extracting structured motifs using a suffix tree with application to promoter and regulatory site consensus identification, Journal of Computational Biology 7 (2000) 345–360.
[11] E.M. McCreight, A space-economical suffix tree construction algorithm, Journal of the ACM 23 (1976) 262–272.
[12] E.W. Myers, Celera Genomics Corporation, The whole-genome assembly of drosophila, Science 287 (2000) 2196–2204.
[13] L. Parida, I. Rigoutsos, A. Floratos, D. Platt, Y. Gao, Pattern discovery on character sets and real-valued data: linear bound on irredundant motifs and polynomial time algorithms, in: Proceedings of the Eleventh ACM-SIAM Symposium on Discrete Algorithms (SODA 2000), 2000, pp. 297–308.
[14] N. Pisanti, M. Crochemore, R. Grossi, M.-F. Sagot, A basis of tiling motifs for generating repeated patterns and its complexity for higher quorum, in: Proc. of the 28th Symp. on Mathematical Foundations of Computer Science (MFCS), in: Lecture Notes in Computer Science, vol. 2747, Springer, Berlin, 2003, pp. 622–632.
[15] A. Apostolico, L. Parida, Incremental paradigms of motif discovery, Journal of Computational Biology 11 (1) (2004) 15–25.
[16] M.F. Sagot, Spelling approximate repeated or common motifs using a suffix tree, in: Proc. of the 3rd LATIN Symp., in: Lecture Notes in Computer Science, vol. 1380, Springer, Berlin, 1998, pp. 111–127.
[17] B. Schieber, U. Vishkin, On finding lowest common ancestors: simplifications and parallelization, SIAM Journal on Computing 17 (1988) 1253–1262.
[18] P. van Emde Boas, Preserving order in a forest in less than logarithmic time and linear space, Information Processing Letters 6 (3) (1977) 80–82.
[19] M. Tompa, An exact method for finding short motifs in sequences, with application to the ribosome binding site problem, in: Proceedings of the 7th Intl. Conf. Intelligent Systems for Molecular Biology, 1999, pp. 262–271.
[20] R. Staden, Methods for discovering novel motifs in nucleic acid sequences, Computer Applications in the Biosciences 5 (1989) 293–298.
[21] J. Buhler, M. Tompa, Finding motifs using random projections, in: The Proceedings of RECOMB 2001, 2001, pp. 69–76.
[22] A. Price, S. Ramabhadran, P.A. Pevzner, Finding subtle motifs by branching from sample strings, Bioinformatics 19 (2003) 149–155.
[23] M. Tompa, et al., Assessing computational tools for the discovery of transcription factor binding sites, Nature Biotechnology 23 (2005) 137–144.
[24] A. Apostolico, F. Gong, S. Lonardi, Verbumculus and the discovery of unusual words, Journal of Computer and Science Technology 19 (1) (2004) 22–41 (special issue in Bioinformatics).
[25] C. Iliopoulos, Ch. Makris, Y. Panagis, K. Perdikuri, E. Theodoridis, A. Tsakalidis, The weighted suffix tree: an efficient data structure for handling molecular weighted sequences and its applications, Fundamenta Informaticae 71 (2–3) (2006) 259–277.
