[ieee 2008 cairo international biomedical engineering conference (cibec) - cairo, egypt...

Proceedings of the 2008 IEEE, CIBEC'08 978-1-4244-2695-9/08/$25.00 ©2008 IEEE

FINE TUNING THE ENHANCED SUFFIX ARRAY

M.I. Abouelhoda1,2, A. Dawood2

1Faculty of Engineering, Cairo University, Giza, Egypt; 2Nile University, Giza, Egypt

e-mails: [email protected], [email protected]

Abstract: The enhanced suffix array is an indexing data structure used for a wide range of applications in Bioinformatics. It is basically the suffix array but enhanced with extra tables that provide extra information to improve the performance in theory and in practice. In this paper, we present a number of improvements to the enhanced suffix array: 1) We show how to find a pattern of length m in O(m) time, i.e., independent of the alphabet size. 2) We present an improved representation of the bucket table. 3) We improve the access time of addressing the LCP (longest common prefix) table when one byte per entry is used in representing it. The basic idea behind these improvements is the extensive use of the minimal perfect hashing technique, by which n static items can be stored in linear space while retaining O(1) access time.

I. INTRODUCTION

The ever-growing volume of biological data requires efficient data structures and algorithms to analyze and manage it. Biological data (DNA and protein sequences) are of static nature, which suggests the use of indexing data structures to speed up a variety of genome analysis tasks. In recent years, the enhanced suffix array [1] became the data structure of choice for indexing biological data and solving versatile tasks. This is due to its reduced memory consumption compared to the suffix tree [2–4] and its improved cache performance.

The enhanced suffix array is basically the suffix array [5, 6] but enhanced with a set of tables, whose content and number depend on the application at hand. The algorithms based on the enhanced suffix array not only have the same complexity as the corresponding algorithms based on the suffix tree, but are also faster in practice [1].

In this paper, we introduce the following improvements to the enhanced suffix array.

1) Alphabet-independent exact pattern matching algorithm: The exact pattern matching problem is to find a pattern of length m in a text of size n. The best algorithm for exact pattern matching based on the enhanced suffix array has time complexity of O(m log |Σ|) [7, 8], where |Σ| is the alphabet size. In this paper, we show that the exact pattern matching problem can be solved using the enhanced suffix array in O(m) time independent of the alphabet size, while retaining the alphabet-independent O(n) space consumption.

2) Improving bucket table representation: The bucket table is a look-up table of length |Σ|^d such that each string of length d maps to just one entry in the table. The entries refer to where these strings (which are prefixes of some suffixes) occur in the string. It is clear that the space requirement of the bucket table becomes prohibitive for large d and |Σ|. In this paper, we show how to overcome this problem.

3) Improving access to the lcp-table: The lcp-table stores the length of the longest common prefix between every two consecutive suffixes in the enhanced suffix array, and it requires 4 bytes per entry in the worst case. In practice, however, one byte is used to represent each entry in the lcp-table, and entries with values larger than 255 are stored in an auxiliary table. The original values are retrieved from this table in O(log τ) time using binary search, where τ is the size of the auxiliary table [1]. In this paper, we show how to retrieve these values in constant time from this auxiliary table, while retaining linear space.

All the above-mentioned new results are achieved by using the minimal perfect hashing technique (also known as storing sparse tables), with which we can store n static keys from a universe U in O(n) space, while guaranteeing O(1) query (access) time. Note that the look-up table method requires O(|U|) space to achieve constant time, while the traditional hashing technique takes O(n) space but does not guarantee a constant query time.

This paper is organized as follows: In the next section we present the enhanced suffix array. Section III recalls the basic definitions and algorithms of the minimal perfect hashing technique. In Section IV we show how to solve the exact pattern matching problem using an alphabet-independent algorithm. The improved representation of the bucket table is addressed in Section V. In Section VI we show how to guarantee O(1) access time to the lcp-table.

II. THE ENHANCED SUFFIX ARRAY

A. Basic notions and definitions

Let Σ be a finite ordered alphabet. In case of DNA, the alphabet is basically composed of four characters A, C, G, and T, each corresponding to a chemical unit called nucleotide. For protein sequences, the alphabet size is twenty, with each character corresponding to a chemical unit called amino acid. Let S be a string of length |S| = n over Σ, and let S[i] denote the character at position i in S, for 0 ≤ i < n. For i ≤ j, S[i..j] denotes the substring of S starting with the character at position i and ending with the character at position j. The i-th suffix of S, denoted by S(i), is the substring S[i..n − 1]. For ease of presentation, we use the special symbol $ ∉ Σ and assume it is higher than all other elements of Σ.


The suffix array (denoted by suftab) of the string S is an array of integers in the range 0 to n, specifying the lexicographic ordering of the n + 1 suffixes of the string S$. That is, S(suftab[0]), S(suftab[1]), ..., S(suftab[n]) is the sequence of suffixes of S$ in ascending lexicographic order; see Figure 1 (a). The suffix array can theoretically be constructed in linear time, but some non-linear time algorithms are faster in practice; see [9] for a survey.

The basic enhancement of the suffix array is the lcp-table, denoted by lcptab. The lcp-table is an array of integers in the range 0 to n. We define lcptab[0] = 0 and lcptab[i] to be the length of the longest common prefix of S(suftab[i − 1]) and S(suftab[i]), for 1 ≤ i ≤ n. Since S(suftab[n]) = $, we always have lcptab[n] = 0. The lcp-table can be computed in linear time from the suffix array [10].
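As an illustration of these two tables, the following Python sketch builds suftab and lcptab for the example string of Figure 1 by naive sorting and pairwise comparison. It is not one of the linear-time constructions cited above; the function names and the use of '~' as a stand-in that sorts '$' above all letters are assumptions made only for this example.

```python
def build_suffix_array(s):
    """Return suftab for s + '$', where '$' sorts above every other character."""
    text = s + "$"
    # Assumption of this sketch: '~' has a higher code point than any letter,
    # so replacing '$' by '~' in the sort key makes '$' the largest symbol.
    return sorted(range(len(text)), key=lambda i: text[i:].replace("$", "~"))


def build_lcp_table(s, suftab):
    """Return lcptab, with lcptab[0] = 0 and lcptab[i] = lcp of ranks i-1 and i."""
    text = s + "$"
    lcptab = [0] * len(suftab)
    for i in range(1, len(suftab)):
        a, b = text[suftab[i - 1]:], text[suftab[i]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcptab[i] = k
    return lcptab


S = "acaaacatat"
suftab = build_suffix_array(S)
lcptab = build_lcp_table(S, suftab)
print(suftab)  # [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
print(lcptab)  # [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
```

The naive construction takes quadratic time in the worst case and is meant only to make the definitions concrete on a small input.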

B. The lcp-intervals of the suffix array

Definition 2.1: An interval [i..j] of the enhanced suffix array of S is an lcp-interval of lcp-value ℓ if

1) lcptab[i] < ℓ,

2) lcptab[k] ≥ ℓ for all k with i + 1 ≤ k ≤ j,

3) lcptab[k] = ℓ for at least one k with i + 1 ≤ k ≤ j,

4) lcptab[j + 1] < ℓ.

For short, we will write ℓ-[i..j] to refer to an lcp-interval [i..j] of lcp-value ℓ. Every index k, i + 1 ≤ k ≤ j, with lcptab[k] = ℓ is called an ℓ-index.

It is not difficult to see that S(suftab[i]), S(suftab[i + 1]), ..., S(suftab[j]) of an ℓ-[i..j] share the prefix ω = S[suftab[i]..suftab[i] + ℓ − 1]. As an example, consider the table in Figure 1 (a). The interval [0..5] is a 1-interval because lcptab[0] = 0 < 1, lcptab[5 + 1] = 0 < 1, lcptab[k] ≥ 1 for all k with 1 ≤ k ≤ 5, and lcptab[2] = lcptab[4] = 1. Furthermore, 1-[0..5] is the a-interval (all the suffixes share the prefix ω = a) and the ℓ-indices of the interval [0..5] are {2, 4}.
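The four conditions of the definition can be checked mechanically. The sketch below does so on the lcp-table of Figure 1 (a); the function names are hypothetical, and the root interval [0..n] is deliberately excluded because lcptab[n + 1] is not defined.

```python
# lcp-table of S = acaaacatat$ as listed in Figure 1 (a).
lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def ell_indices(lcptab, i, j, ell):
    """Indices k in [i+1..j] whose lcp value equals ell."""
    return [k for k in range(i + 1, j + 1) if lcptab[k] == ell]


def is_lcp_interval(lcptab, i, j, ell):
    """Check conditions 1) to 4) for a non-root interval [i..j]."""
    # Assumption of this sketch: only intervals with j + 1 inside the table.
    if not (0 <= i < j < len(lcptab) - 1):
        return False
    return (
        lcptab[i] < ell                                          # condition 1
        and all(lcptab[k] >= ell for k in range(i + 1, j + 1))   # condition 2
        and bool(ell_indices(lcptab, i, j, ell))                 # condition 3
        and lcptab[j + 1] < ell                                  # condition 4
    )


print(is_lcp_interval(lcptab, 0, 5, 1))  # True
print(ell_indices(lcptab, 0, 5, 1))      # [2, 4]
print(is_lcp_interval(lcptab, 0, 3, 1))  # False, because lcptab[4] = 1 is not < 1
```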

Definition 2.2: An m-[l..r] is said to be embedded in an ℓ-[i..j] if it is a subinterval of [i..j] (i.e., i ≤ l < r ≤ j) and m > ℓ. The ℓ-[i..j] is then called the interval enclosing [l..r]. If [i..j] encloses [l..r] and there is no interval embedded in [i..j] that also encloses [l..r], then [l..r] is called a child interval of [i..j].

The parent-child relationship between the lcp-intervals constitutes a conceptual (or virtual) tree which we call the lcp-interval tree of the suffix array. The root of this tree is the 0-interval [0..n]; see Figure 1 (b), where all the lcp-intervals are plotted, along with arrows representing the parent-child relationships between them. Note that each individual suffix S(suftab[l]) of the suffix array can be considered as a singleton interval [l..l] and can be virtually placed as a leaf in the lcp-interval tree. For instance, continuing the example of Figure 1, the child intervals of [0..5] are [0..1], [2..3], and [4..5]. The interval [0..1] has two singleton child intervals: [0..0] and [1..1].

Definition 2.3: Let ℓ′-[i′..j′] be a child interval of ℓ-[i..j]. We define the branching character of the child interval [i′..j′] as the character S[suftab[i′] + ℓ].

 i   suftab   lcptab   home     S(suftab[i])
 0     2        0      [0..5]   aaacatat$
 1     3        2      [0..1]   aacatat$
 2     0        1      [2..3]   acaaacatat$
 3     4        3      -        acatat$
 4     6        1      [4..5]   atat$
 5     8        2      -        at$
 6     1        0      [6..7]   caaacatat$
 7     5        2      -        catat$
 8     7        0      [8..9]   tat$
 9     9        1      -        t$
10    10        0      -        $

(a)

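(b) [Figure: the lcp-interval tree. The root 0-[0..10] has the child intervals 1-[0..5], 2-[6..7], and 1-[8..9], reached via the branching characters a, c, and t, respectively. The interval 1-[0..5] has the child intervals 2-[0..1], 3-[2..3], and 2-[4..5], reached via the branching characters a, c, and t, respectively.]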

Fig. 1. Part (a): The suffix array of the string S = acaaacatat$. The suffix array is enhanced with the lcp-table. The non-empty home table entries are filled with the respective intervals, for illustration. Part (b): The lcp-interval tree of the enhanced suffix array of Part (a). The branching character of each interval is written on the edge incident to it.

Figure 1 (b) shows the branching characters of some lcp-intervals.
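To make the relation between ℓ-indices, child intervals, and branching characters concrete, the following sketch splits an lcp-interval at its ℓ-indices and reads off the branching character of each child. The tables are copied from Figure 1 (a); the helper names are assumptions of this example.

```python
TEXT = "acaaacatat$"
SUFTAB = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def child_intervals(i, j, ell):
    """Split the ell-interval [i..j] at its ell-indices into child intervals."""
    cuts = [k for k in range(i + 1, j + 1) if LCPTAB[k] == ell]
    starts = [i] + cuts
    ends = [k - 1 for k in cuts] + [j]
    return list(zip(starts, ends))


def branching_char(child, ell):
    """Character at offset ell of the first suffix of the child interval."""
    return TEXT[SUFTAB[child[0]] + ell]


children = child_intervals(0, 5, 1)
print(children)                                  # [(0, 1), (2, 3), (4, 5)]
print([branching_char(c, 1) for c in children])  # ['a', 'c', 't']
print(child_intervals(0, 10, 0))                 # [(0, 5), (6, 7), (8, 9), (10, 10)]
```

Singleton intervals such as (10, 10) appear as leaves, in line with the convention described above.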

III. MINIMAL PERFECT HASHING

Given a static set K of n keys drawn from a universe U of size |U|, a perfect hash function h maps each key to a unique integer number in the range [0..m − 1], where m ≥ n. Specifically, for any two keys k1 ≠ k2 in K, h(k1) ≠ h(k2); i.e., there is no collision. A perfect hash function is called minimal if m = n. By means of perfect hashing, we can answer membership queries of the form “is x in K, and if so, where can it be found?” in O(1) time.

Minimal perfect hash functions are widely used in information retrieval problems where the set of keys is static (i.e., the keys are given in advance and there are neither insertions nor deletions), which is also the case with biological data and the enhanced suffix array.

Fredman et al. [11] were the first to show how to store a sparse table in O(n) space with O(1) access time. Their approach is based on resolving collisions using a two-level data structure, rather than finding an explicit minimal perfect hash function. For n keys, the algorithm of Fredman et al. requires O(n log |U|) bits in the worst case. The data structure of Fredman et al. is constructed in expected O(n) time. Recently, many refined methods have been introduced that explicitly search for a minimal perfect hash function; see [12] for a survey. The major scheme of these algorithms is shown in Figure 2, where the representation of the hash function is separated from the storage of the keys or the associated references to them. To the best of our knowledge, the algorithm of Botelho et al. [13] is the state-of-the-art method, which represents the hash function using ≈ 2.7 bits per key. Note that O(n log |U|) bits are also needed to access the keys and check for their existence (by means of a referencing table). The algorithm of Botelho et al. finds the perfect hash function in expected O(n) time. We recommend using the algorithm of Botelho et al. because it requires the least space and is efficient in practice [13].

Fig. 2. A set of n = 8 keys and the hash function representation, depicted as a cloud, mapping each key to a unique number between 0 and 7. For a given query kx, we compute h(kx) to get an index in the referencing table which contains (depending on the application) the key itself or a reference to it. Accessing this table is indispensable because we have to check that the key stored is what we search for.
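The two-level idea attributed to Fredman et al. can be sketched as follows. This is a simplified illustration for distinct non-negative integer keys, not the construction of Botelho et al. and not a minimal perfect hash function; the class name, the prime P, the 4n space threshold, and the seeded random generator are all assumptions of this example.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime assumed to exceed every key used here


def make_hash(rng, m):
    """Draw a random function k -> ((a*k + b) mod P) mod m."""
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda k: ((a * k + b) % P) % m


class TwoLevelTable:
    """Static set of distinct integers with collision-free O(1) lookups."""

    def __init__(self, keys, seed=0):
        rng = random.Random(seed)
        keys = list(keys)
        n = max(1, len(keys))
        # First level: retry until the sum of squared bucket sizes is linear.
        while True:
            self.h1 = make_hash(rng, n)
            buckets = [[] for _ in range(n)]
            for k in keys:
                buckets[self.h1(k)].append(k)
            if sum(len(bucket) ** 2 for bucket in buckets) <= 4 * n:
                break
        # Second level: each bucket of size b gets its own table of size b*b,
        # and its hash function is redrawn until no two keys collide.
        self.sub = []
        for bucket in buckets:
            size = len(bucket) ** 2
            while True:
                h2 = make_hash(rng, size) if size else None
                slots = [None] * size
                collision = False
                for k in bucket:
                    p = h2(k)
                    if slots[p] is not None:
                        collision = True
                        break
                    slots[p] = k
                if not collision:
                    break
            self.sub.append((h2, slots))

    def __contains__(self, key):
        h2, slots = self.sub[self.h1(key)]
        # The stored key is compared with the query, as in the referencing table.
        return bool(slots) and slots[h2(key)] == key


keys = [3, 17, 42, 1000, 65536, 999983, 7, 256]
table = TwoLevelTable(keys)
print(all(k in table for k in keys), 5 in table)  # True False
```

The final comparison against the stored key mirrors the role of the referencing table: the hash value alone cannot tell whether a query key belongs to the set.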

In the following sections we show how to make use of the perfect hashing technique to improve the enhanced suffix array.

IV. EXACT PATTERN MATCHING REVISED

Abouelhoda et al. [1] showed that it is possible to solve the exact pattern matching problem in O(|Σ|m) time using the suffix array enhanced with the lcp-table and an auxiliary table called the child-table. Subsequently, Kim et al. [7] showed that it is possible to achieve O(m log |Σ|) time by reorganizing the child-table. Recently, Fischer and Heun [8] achieved the same complexity but based on a data structure supporting range minimum queries.

The common idea of the above algorithms is as follows. Given a pattern P of length m, one traverses the virtual lcp-interval tree over the suffix array. First the lcp-interval [i..j] such that P[0] = S[suftab[i]] is identified. (Recall from Subsection II-B that S[suftab[i]] is the branching character.) Assume this interval is of lcp-value ℓ < m (i.e., all the suffixes in this interval share a prefix of length ℓ); then we compare P[0..ℓ − 1] to S[suftab[i]..suftab[i] + ℓ − 1]. If this comparison fails, then we report the non-existence of the pattern in S. Otherwise, we go on and locate the child interval [i1..j1] ⊂ [i..j] such that the branching character S[suftab[i1] + ℓ] equals P[ℓ]. Let the lcp-value of [i1..j1] be ℓ′ > ℓ; then we compare the characters P[ℓ..ℓ′ − 1] to S[suftab[i1] + ℓ..suftab[i1] + ℓ′ − 1] and go on until the pattern is totally matched.

The crucial part of this algorithm is to locate a child interval with a branching character a = P[x], for some x ∈ [0..m − 1]. Abouelhoda et al. showed that this can be done in O(|Σ|) time using an extra table called the child-table, which takes O(n) space. The idea of the child-table is to store for each interval its ℓ-indices. Because the ℓ-indices bound the child intervals, keeping track of them enables traversal of the child intervals. The interval with the branching character a = P[x] can be determined by visiting the child intervals in sequential order, yielding an O(|Σ|) algorithm. Kim et al. [7] showed that the child-table can be reorganized to locate the interval of interest in O(log |Σ|) time using a binary-search-like procedure over the child intervals in O(k) space, where k is the number of child intervals (yielding total space of O(n)). They also indicated that locating this interval can be done in O(1) time using a look-up table, but this would yield O(n|Σ|) space, which is not practical for larger alphabets. Fischer and Heun [8] presented another method to locate the child interval of interest in O(log |Σ|) time by searching for the ℓ-indices bounding this child interval using range minimum queries. We show in this paper that another method based on the perfect hashing technique can be used to achieve O(1) time using O(k) space (yielding total space of O(n)).

The idea of our algorithm is to store for each lcp-interval a perfect hashing data structure containing the list of branching characters and the respective ℓ-indices. To attach the hash tables to the lcp-intervals, we need an identifier for each interval. For this purpose, we make use of the definition of the home of an lcp-interval [14].

Definition 4.1 (Strothmann [14]): For an lcp-interval [i..j] of lcp-value ℓ,

\[
\mathrm{home}([i..j]) =
\begin{cases}
i & \text{if } \mathrm{lcptab}[i] \ge \mathrm{lcptab}[j+1], \\
j & \text{otherwise.}
\end{cases}
\]

In [14] it was proven that no two lcp-intervals have the same home; Figure 1 (a) shows an example.
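The home values listed in Figure 1 (a) can be reproduced with a few lines of Python. This is only a check on the running example; the root interval is left out here because lcptab[n + 1] is not defined, which is an assumption of this sketch rather than a statement from [14].

```python
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def home(i, j):
    """Home of a non-root lcp-interval [i..j], following Definition 4.1."""
    return i if LCPTAB[i] >= LCPTAB[j + 1] else j


# The non-root lcp-intervals of Figure 1.
intervals = [(0, 5), (0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
homes = {iv: home(*iv) for iv in intervals}
print(homes)
# {(0, 5): 0, (0, 1): 1, (2, 3): 2, (4, 5): 4, (6, 7): 6, (8, 9): 8}

# No two intervals share a home position.
assert len(set(homes.values())) == len(intervals)
```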

It is clear that the home value can be computed in constant time given an interval [i..j]. Accordingly, we create a table storing the home values, and attach to each home cell the perfect hashing data structure to locate the child interval of interest in constant time. Since the total number of ℓ-indices is O(n) and the sparse table is constructed for all ℓ-indices, the total space required is O(n).
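The search procedure can be sketched in Python as follows. The sketch attaches to every home position a table from branching characters to child intervals and then descends with one table lookup per visited interval. A built-in dict stands in for the perfect hashing data structure, the lcp-value is stored next to each table, the root is given the artificial key -1, and the index is built naively; all of these are simplifying assumptions of this example, which also assumes a non-empty text and a pattern over Σ.

```python
ROOT = -1  # artificial key for the root interval, which has no home here


def home(lcptab, i, j):
    return i if lcptab[i] >= lcptab[j + 1] else j


def build_index(s):
    text = s + "$"
    n = len(s)
    # '~' stands in for '$' so that '$' sorts above all letters.
    suftab = sorted(range(n + 1), key=lambda p: text[p:].replace("$", "~"))
    lcptab = [0] * (n + 1)
    for r in range(1, n + 1):
        a, b = suftab[r - 1], suftab[r]
        while text[a + lcptab[r]] == text[b + lcptab[r]]:
            lcptab[r] += 1
    nodes = {}  # home -> (lcp-value, {branching character: child interval})

    def build(i, j, key):
        ell = min(lcptab[i + 1 : j + 1])
        cuts = [k for k in range(i + 1, j + 1) if lcptab[k] == ell]
        bounds = [i] + cuts + [j + 1]
        children = {}
        for lo, nxt in zip(bounds, bounds[1:]):
            hi = nxt - 1
            children[text[suftab[lo] + ell]] = (lo, hi)
            if lo < hi:
                build(lo, hi, home(lcptab, lo, hi))
        nodes[key] = (ell, children)

    build(0, n, ROOT)
    return text, suftab, lcptab, nodes


def find(index, pattern):
    """Return the suffix-array interval of pattern, or None if it is absent."""
    text, suftab, lcptab, nodes = index
    m = len(pattern)
    i, j, key, pos = 0, len(suftab) - 1, ROOT, 0
    while pos < m:
        child = nodes[key][1].get(pattern[pos])  # one lookup per interval
        if child is None:
            return None
        i, j = child
        if i == j:
            end = m  # singleton: compare the rest of the pattern directly
        else:
            key = home(lcptab, i, j)
            end = min(nodes[key][0], m)
        start = suftab[i]
        if text[start + pos : start + end] != pattern[pos:end]:
            return None
        pos = end
    return i, j


index = build_index("acaaacatat")
print(find(index, "aca"))  # (2, 3), i.e. occurrences at suftab[2] = 0 and suftab[3] = 4
print(find(index, "ag"))   # None
```

Each pattern character is compared once and each visited interval costs one table lookup, which is where the O(m) bound comes from once the dict is replaced by a structure with guaranteed constant access time.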

Note that nearly half of the home table is empty, i.e., it is also sparse, as shown in Figure 1 (a). This suggests representing it using the perfect hashing scheme. In practice, we can also make it much sparser if we construct the perfect hashing scheme only for intervals [i..j] such that j − i > τ, where τ is a user-defined parameter.

V. IMPROVING BUCKET TABLE REPRESENTATION

For a given parameter d, the bucket table bcktab_d stores for each substring w of S of length d the smallest integer i such that S[suftab[i]..suftab[i] + d − 1] = w. Note that the suffixes starting with w occupy a contiguous interval [i..j] of the suffix array.

In [1], the bucket table was represented as a look-up table whose mapping function, for each w = S[suftab[i]..suftab[i] + d − 1] = S[l..l + d − 1], is

\[
x = \sum_{k=0}^{d-1} |\Sigma|^{d-k-1} \, T(S[l+k])
\]

where x is an integer value representing an index of the look-up table, |Σ| is the alphabet size, and T is a function that maps each character of the alphabet to a distinct number in the range [0..|Σ| − 1].
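The mapping function reads a string of length d as a number in base |Σ|. A minimal sketch, assuming that T assigns the ranks 0 to |Σ| − 1 in alphabet order and using a hypothetical function name, evaluates it with Horner's rule:

```python
def bucket_index(w, alphabet):
    """Interpret w as a base-|alphabet| number, most significant character first."""
    rank = {c: r for r, c in enumerate(alphabet)}  # plays the role of T
    sigma = len(alphabet)
    x = 0
    for c in w:
        x = x * sigma + rank[c]
    return x


print(bucket_index("AAA", "ACGT"))  # 0
print(bucket_index("ACG", "ACGT"))  # 6  (0*16 + 1*4 + 2*1)
print(bucket_index("TTT", "ACGT"))  # 63 (the last of the 4^3 = 64 entries)
```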

If the bucket table is used to locate patterns of length d in S, then the mapping function is first computed for the given string. If bcktab_d[x] is empty, then the pattern does not exist. Otherwise, it exists and its occurrences are the suffixes in the lcp-interval bounded by bcktab_d[x] and bcktab_d[y], where bcktab_d[y] is the next non-empty entry after bcktab_d[x].

If the bucket table is used to locate substrings of length up to d characters, then we use an additional dummy character N ∉ Σ to enable accessing all the substrings of S of length up to d characters. More precisely, the bucket table is constructed over the alphabet {N} ∪ Σ, and for each prefix w′ of w of length d′ < d, we append (d − d′) N's and then compute the mapping function. For a query P of length d′ ≤ d, we append N characters such that the total length becomes d, then we launch a query as mentioned above.

The bucket table with the dummy character can be incorporated in the exact pattern matching algorithm based on the enhanced suffix array. For small queries of length d′ ≤ d, we use the algorithm mentioned above. For larger queries, where |P| > d, the bucket table is used to locate the interval containing the d-character prefix P[0..d − 1] of the query P in O(d) time. Then our algorithm of Section IV, which searches for the pattern P in S, starts from this interval instead of the interval [0..n]. The advantage of this hybrid method is that only a smaller number of the ℓ-indices will be represented using the perfect hashing data structure, which further reduces the space consumption.

However, the space consumption of the look-up table is prohibitive for large d and |Σ|. (Note that when a dummy character is used, the waste of space is greater, because the table contains entries for strings that are of no use. For example, for a prefix “AA” and d = 4, it is enough to have “AANN”, not “AANA”, “AANC”, etc.) Therefore, we suggest using the minimal perfect hashing technique to represent the bucket table. The parameter d is experimentally chosen such that the space requirement to store the table bcktab_d and the sparse tables is minimum.
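The saving can be illustrated on the running example. The sketch below stores only the d-mers that actually occur, using a Python dict as a stand-in for a minimal perfect hash function. Because hashing does not preserve the order of the entries, this sketch stores both interval boundaries per key instead of relying on the next non-empty entry; that choice, like the function name, is an assumption of the example and is not prescribed by the text.

```python
def build_sparse_bucket_table(s, d):
    """Map each d-mer occurring in s to its interval (left, right) in suftab."""
    text = s + "$"
    # '~' stands in for '$' so that '$' sorts above all letters.
    suftab = sorted(range(len(text)), key=lambda p: text[p:].replace("$", "~"))
    bcktab = {}
    for i, start in enumerate(suftab):
        w = s[start:start + d]
        if len(w) < d:
            continue  # suffix shorter than d: it has no d-mer prefix over the alphabet
        left, _ = bcktab.get(w, (i, i))
        bcktab[w] = (left, i)
    return suftab, bcktab


suftab, bcktab = build_sparse_bucket_table("acaaacatat", 2)
print(bcktab)
# {'aa': (0, 1), 'ac': (2, 3), 'at': (4, 5), 'ca': (6, 7), 'ta': (8, 8)}
print(len(bcktab), "stored keys versus", 3 ** 2, "look-up table entries")
```

On this small input 5 of the 9 possible 2-mers over {a, c, t} occur; the gap widens quickly as d and the alphabet grow.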

VI. IMPROVED ACCESS TO THE lcp-table

To further reduce the space consumption of the enhanced suffix array, the lcp-table, as mentioned in the introduction, is stored as an array such that each entry takes 1 byte. The entries with lcp values larger than 255, which compose a fraction of the total entries in practice [1], are stored in an auxiliary table.

If the lcp-table is accessed in a certain application in a sequential order, then using the auxiliary table scheme will entail no increase in the running time, because the values in the auxiliary table will be addressed sequentially as well. But if the lcp-table is accessed in a non-sequential way, as in the exact pattern matching application, then the original lcp values have to be determined by searching in the auxiliary table. This is achieved in [1] using binary search, which takes O(log τ) time, where τ is the size of the auxiliary table.

In this paper, we suggest using minimal perfect hashing to represent the auxiliary lcp table, which straightforwardly guarantees O(1) access time using O(τ) space.
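The two access schemes can be compared in a short sketch. It assumes that the byte value 255 serves as a marker, so that every entry with value 255 or more is moved to the auxiliary table, and it uses a Python dict in place of a minimal perfect hash function; the function names and sample values are made up for illustration.

```python
from bisect import bisect_left

SENTINEL = 255  # assumed marker: the real value lives in the auxiliary table


def compress_lcp(lcptab):
    """Return the one-byte table plus the sorted positions and values that overflow."""
    small = bytes(min(v, SENTINEL) for v in lcptab)
    positions = [i for i, v in enumerate(lcptab) if v >= SENTINEL]
    values = [lcptab[i] for i in positions]
    return small, positions, values


def lookup_binary(small, positions, values, i):
    """Binary search in the auxiliary table, O(log tau) per large entry."""
    if small[i] < SENTINEL:
        return small[i]
    return values[bisect_left(positions, i)]


def lookup_hashed(small, overflow, i):
    """Hash-based access to the auxiliary table, one lookup per large entry."""
    return small[i] if small[i] < SENTINEL else overflow[i]


lcptab = [0, 3, 300, 255, 254, 1000]
small, positions, values = compress_lcp(lcptab)
overflow = dict(zip(positions, values))

print([lookup_binary(small, positions, values, i) for i in range(len(lcptab))])
print([lookup_hashed(small, overflow, i) for i in range(len(lcptab))])
# both print [0, 3, 300, 255, 254, 1000]
```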

VII. EXPERIMENTS AND CONCLUSIONS

The purpose of this paper is basically to investigate the use of minimal perfect hashing with the enhanced suffix array to reach new theoretical results. Nevertheless, we ran experiments to evaluate the performance of this data structure in practice. In general, answering queries on this data structure is faster than binary search for a large number of keys, at the expense of more space consumption. For one million queries over about 3.5 million keys, it took 0.2 seconds using the perfect hashing technique, while 0.8 seconds were needed using binary search. As for the bucket table, we found, as expected, that the perfect hashing technique takes less space for large d. With the dummy character included, the advantage of using the perfect hashing representation becomes clear, even for moderate d. To take one example, we ran both techniques on the bacterial genome E. coli. For d = 12 and |Σ| = 4, there were 3474814 keys. The perfect hashing took about 43% of the space the look-up table required. For the same d and |Σ| = 5 (due to the dummy character), there were 8451811 keys. The perfect hashing took about 7% of the space the look-up table required.

REFERENCES

[1] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch, “Replacing suffix trees with enhanced suffix arrays,” J. Discrete Algorithms, vol. 2, no. 1, pp. 53–86, 2004.

[2] P. Weiner, “Linear pattern matching algorithms,” in Proc. of the 14th IEEE Symp. on Switching and Automata Theory, 1973, pp. 1–11.

[3] E. M. McCreight, “A space-economical suffix tree construction algorithm,” Journal of Algorithms, vol. 23, no. 2, pp. 262–272, 1976.

[4] D. Gusfield, Algorithms on Strings, Trees, and Sequences. New York: Cambridge University Press, 1997.

[5] U. Manber and E. W. Myers, “Suffix arrays: A new method for on-line string searches,” SIAM Journal on Computing, vol. 22, no. 5, pp. 935–948, 1993.

[6] G. Gonnet, R. Baeza-Yates, and T. Snider, “New indices for text: PAT trees and PAT arrays,” in Information Retrieval: Algorithms and Data Structures. Prentice-Hall, 1992, pp. 66–82.

[7] D. Kim, M. Kim, and H. Park, “Linearized suffix tree: an efficient index data structure with the capabilities of suffix trees and suffix arrays,” Algorithmica, vol. X, no. x, pp. x–x, 200x.

[8] J. Fischer and V. Heun, “Representation of RMQ-information and improvements in the enhanced suffix array,” in Proc. of the International Symposium on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies (ESCAPE’07), ser. LNCS, 2007.

[9] S. Puglisi, W. F. Smyth, and A. H. Turpin, “A taxonomy of suffix array construction algorithms,” J. ACM, vol. 39, no. 2, pp. 278–294, 2007.

[10] T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park, “Linear-time longest-common-prefix computation in suffix arrays and its applications,” in Proc. of the 12th Symp. on Combinatorial Pattern Matching, ser. LNCS, vol. 2089, 2001, pp. 181–192.

[11] M. Fredman, J. Komlos, and E. Szemeredi, “Storing a sparse table with O(1) worst case access time,” J. ACM, vol. 31, no. 3, pp. 538–544, 1984.

[12] Z. Czech, G. Havas, and B. Majewski, “Fundamental study perfect hashing,” Theoretical Computer Science, vol. 182, pp. 1–143, 1997.

[13] F. Botelho, R. Pagh, and N. Ziviani, “Simple and space-efficient minimal perfect hash functions,” in Proceedings of the 4t, ser. LN, vol. X. SX, 2008, pp. 1X–1X.

[14] D. Strothmann, “The affix array data structure and its applications to RNA secondary structure analysis,” Theoretical Computer Science, vol. 389, pp. 278–294, 2007.
