Compressed Compact Suffix Arrays

Veli Mäkinen

University of Helsinki

Gonzalo Navarro

University of Chile

compact compress

Introduction

We consider exact string matching on static text.

The task is to construct an index for the text such that the occurrences of a given pattern can be found efficiently.

A well-known optimal solution exists: build a suffix tree over the text.

Introduction...

The suffix-tree-based solution has a weakness:

It takes too much space!

In some applications the space usage is the real bottleneck, not the search efficiency.

Introduction...

During the last 10 years, many practical / theoretical solutions with reduced space complexities have been proposed.

The work can roughly be divided into three categories:
(1) Reducing constant factors
(2) Concrete optimization
(3) Abstract optimization

Reducing constant factors

Suffix arrays (Manber & Myers 1990)

Suffix cactuses (Kärkkäinen 1995)

Sparse suffix trees (Kärkkäinen & Ukkonen 1996)

Space-efficient suffix trees (Kurtz 1998)

Enhanced suffix arrays (Abouelhoda & Ohlebusch & Kurtz 2002)

Concrete optimization

“≈ Minimizing automata”

DAWGs (Blumer & Blumer & Haussler & McConnell & Ehrenfeucht 1983)

Compact DAWGs (Crochemore & Vérin 1997)

Compact suffix arrays (Mäkinen 2000)

Abstract optimization

Objective: Use as little space as possible to support the functionality of a given abstract definition of a data structure.

Space is measured in bits and usually given proportional to the entropy of the text.

Abstract optimization: Example

A full-text index for a given text T supports the following operations:
- Exists(P): is P a substring of T?
- Count(P): how many times does P occur in T?
- Report(P): list the occurrences of P in T.
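
To make the three operations concrete, here is a minimal sketch in Python built on a plain, uncompressed suffix array. The class name `NaiveFullTextIndex` and its method names are hypothetical, chosen only for illustration; the code shows the abstract interface, not the compressed index discussed in these slides.

```python
class NaiveFullTextIndex:
    """Plain suffix array over the text; illustrative only, not space-efficient."""

    def __init__(self, text):
        self.text = text
        # 0-based start positions of all suffixes, in lexicographic order.
        self.sa = sorted(range(len(text)), key=lambda i: text[i:])

    def _interval(self, pattern):
        """Return [lo, hi): the suffix-array interval of suffixes that start with pattern."""
        m = len(pattern)

        def prefix(k):
            start = self.sa[k]
            return self.text[start:start + m]

        lo, hi = 0, len(self.sa)
        while lo < hi:                      # first suffix whose prefix is >= pattern
            mid = (lo + hi) // 2
            if prefix(mid) < pattern:
                lo = mid + 1
            else:
                hi = mid
        first, hi = lo, len(self.sa)
        while lo < hi:                      # first suffix whose prefix is > pattern
            mid = (lo + hi) // 2
            if prefix(mid) <= pattern:
                lo = mid + 1
            else:
                hi = mid
        return first, lo

    def exists(self, pattern):
        lo, hi = self._interval(pattern)
        return hi > lo

    def count(self, pattern):
        lo, hi = self._interval(pattern)
        return hi - lo

    def report(self, pattern):
        lo, hi = self._interval(pattern)
        # 1-based text positions, sorted for readability.
        return sorted(self.sa[k] + 1 for k in range(lo, hi))
```

For example, on the text `mississippi$` the pattern `ssi` occurs twice, at text positions 3 and 6, while `isi` does not occur at all.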

Abstract optimization...

Seminal work by Jacobson 1989: rank-select queries on bit-vectors.

Rank-select-type structures for suffix trees (Munro & Raman & Rao & Clark 1996-)

Lempel-Ziv index (Kärkkäinen & Ukkonen 1996)

Abstract optimization...

Compressed suffix arrays (Grossi & Vitter 2000, Sadakane 2000, 2002)

FM-index (Ferragina & Manzini 2000)

LZ-self-index (Navarro 2002)

Space-optimal full-text indexes (Grossi & Gupta & Vitter 2003, 2004)

This paper

We use both concrete and abstract optimization.

We compress the compact suffix array into a succinct full-text index, supporting:
- Exists(P), Count(P) in O(|P| log |T|) time.
- Report(P) in O((|P|+occ) log |T|) time, where occ is the number of occurrences.

This paper...

Space requirement of our index is O(n(1+Hk log n)) bits, where Hk=Hk(T) is the order-k empirical entropy of T.

Hk: “the average number of bits needed to encode a symbol after seeing the k previous ones, using a fixed codebook”.
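
As a rough illustration of the quantity Hk, the sketch below computes the order-k empirical entropy under its usual definition: group the symbols of T by the k symbols that precede them, take the order-0 entropy of each group, and average over the text length. The function names `h0` and `hk` are hypothetical, and the code is meant only to make the definition tangible.

```python
from collections import Counter, defaultdict
from math import log2


def h0(symbols):
    """Order-0 empirical entropy, in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    return sum(c / n * log2(n / c) for c in Counter(symbols).values())


def hk(text, k):
    """Order-k empirical entropy: weighted order-0 entropy of each context's followers."""
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(k, len(text)):
        followers[text[i - k:i]].append(text[i])   # symbol seen after this k-symbol context
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)
```

On a text such as `abababab`, `hk(text, 0)` is 1 bit per symbol, while `hk(text, 1)` drops to 0 because each symbol is fully determined by the one before it.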

This paper...

In practice, our index occupies 1.67 times the text size, and this figure includes the text itself.

Search times are comparable to compressed suffix arrays that occupy O(H0 n) bits.

Our index takes O(log n) times more space than FM-index and the other space-optimal indexes.

This paper...

Simpler than the previous approaches and more efficient in practice.

No limitations on the alphabet size σ:
- FM-index assumes constant alphabet.
- Some compressed suffix arrays assume σ = polylog(n).

Big picture

Compact suffix array (CSA): some areas of a suffix array are replaced by links to similar areas.

Compressed CSA (CCSA): we use the conceptual structure of an optimal CSA as such.

We represent the links with respect to the original suffix array.

Big picture...

A bit-vector represents the boundaries of areas replaced by links.

Each area is represented by an integer denoting the start of the linked area.

Some additional structures are attached to encode the text inside CCSA, etc.

Example: suffix array

pos:  1  2  3  4  5  6  7  8  9 10 11 12
T  =  m  i  s  s  i  s  s  i  p  p  i  $

 i   sa  suffix
 1:  12  $
 2:  11  i$
 3:   8  ippi$
 4:   5  issippi$
 5:   2  ississippi$
 6:   1  mississippi$
 7:  10  pi$
 8:   9  ppi$
 9:   7  sippi$
10:   4  sissippi$
11:   6  ssippi$
12:   3  ssissippi$
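
The example can be reproduced with a few lines of Python. This is a minimal sketch that sorts the suffixes directly, which is fine for a toy text but is not how suffix arrays are built for large inputs; the function name `suffix_array` is hypothetical, and the code relies on `$` sorting before the letters, as it does in ASCII.

```python
def suffix_array(text):
    """Return the 1-based suffix array of text by sorting the suffixes directly."""
    return [i + 1 for i in sorted(range(len(text)), key=lambda i: text[i:])]


T = "mississippi$"
sa = suffix_array(T)

# Print one line per suffix-array position: index, text position, suffix.
for i, start in enumerate(sa, start=1):
    print(f"{i:2}: {start:2}  {T[start - 1:]}")
```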

Example: CSA

 i   sa
 1:  12
 2:  11
 3:   8
 4:   5
 5:   2
 6:   1
 7:  10
 8:   9
 9:   7
10:   4
11:   6
12:   3

 j   csa
 1:  (5,0,1)
 2:  (1,0,1)
 3:  (7,0,1)
 4:  (9,0,2)
 5:  (4,1,1)
 6:  (2,0,1)
 7:  (6,0,1)
 8:  (3,0,2)
 9:  (8,0,2)

Example: CCSA

 i   sa  area bit
 1:  12     1
 2:  11     1
 3:   8     1
 4:   5     1
 5:   2     0
 6:   1     1
 7:  10     1
 8:   9     1
 9:   7     1
10:   4     0
11:   6     1
12:   3     0

 j   csa      ccsa
 1:  (5,0,1)    6
 2:  (1,0,1)    1
 3:  (7,0,1)    8
 4:  (9,0,2)   11
 5:  (4,1,1)    5
 6:  (2,0,1)    2
 7:  (6,0,1)    7
 8:  (3,0,2)    3
 9:  (8,0,2)    9
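
The ccsa column and the bit-vector of this example can be regenerated from the suffix array. The sketch below rests on an assumption inferred from the tables rather than stated on the slides: each suffix-array position links to the position of the suffix that starts one character later (wrapping around at the end of the text), and an area is a maximal run of positions whose links increase by one and whose suffixes start with the same character. The name `build_ccsa` is hypothetical.

```python
def build_ccsa(text):
    """Return (sa, ccsa, bits) for text, using 1-based positions as in the example."""
    n = len(text)
    sa = [i + 1 for i in sorted(range(n), key=lambda i: text[i:])]
    pos_of = {start: i for i, start in enumerate(sa, start=1)}   # inverse of sa
    # link[i] (0-based list index) is the suffix-array position of the suffix
    # starting one character after the suffix stored at position i + 1.
    link = [pos_of[start % n + 1] for start in sa]
    ccsa, bits = [], []
    for i in range(n):
        new_area = (
            i == 0
            or link[i] != link[i - 1] + 1                    # link run is broken
            or text[sa[i] - 1] != text[sa[i - 1] - 1]        # first character changes
        )
        bits.append(int(new_area))
        if new_area:
            ccsa.append(link[i])      # one stored link per area
    return sa, ccsa, bits


sa, ccsa, bits = build_ccsa("mississippi$")
print(ccsa)   # one entry per area
print(bits)   # 1 marks the first position of an area
```

Under these assumptions the output matches the ccsa column and the bit-vector shown above.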

Example: CCSA...

pos:  1  2  3  4  5  6  7  8  9 10 11 12
T  =  m  i  s  s  i  s  s  i  p  p  i  $

 i   sa  area bit  first char
 1:  12     1         $
 2:  11     1         i
 3:   8     1         i
 4:   5     1         i
 5:   2     0         i
 6:   1     1         m
 7:  10     1         p
 8:   9     1         p
 9:   7     1         s
10:   4     0         s
11:   6     1         s
12:   3     0         s

 j   ccsa
 1:    6
 2:    1
 3:    8
 4:   11
 5:    5
 6:    2
 7:    7
 8:    3
 9:    9

Example: CCSA...

 i   sa  area bit  char bit  first char
 1:  12     1         1         $
 2:  11     1         1         i
 3:   8     1         0         i
 4:   5     1         0         i
 5:   2     0         0         i
 6:   1     1         1         m
 7:  10     1         1         p
 8:   9     1         0         p
 9:   7     1         1         s
10:   4     0         0         s
11:   6     1         0         s
12:   3     0         0         s

 j   ccsa
 1:    6
 2:    1
 3:    8
 4:   11
 5:    5
 6:    2
 7:    7
 8:    3
 9:    9

Character table:
 1:  $
 2:  i
 3:  m
 4:  p
 5:  s

Search on CCSA

We simulate the standard binary search of suffix array on CCSA.

A sub-problem in the search is to compare the pattern P against a suffix T[sa[i]...|T|].

For this, we extract the characters t_sa[i], t_sa[i]+1, t_sa[i]+2, ..., t_sa[i]+|P|-1, following the links of the CCSA.
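
The extraction step can be sketched in Python using the example structures from this transcript: the ccsa table, the bit-vector of area boundaries, the five-entry character table, and its bit-vector. The names `CCSA`, `AREA`, `FIRST`, `CHARS`, `follow` and `extract` are hypothetical. The rank and selectprev helpers are naive linear scans, and selectprev is assumed here to return the last 1 at or before the given position.

```python
CCSA = [6, 1, 8, 11, 5, 2, 7, 3, 9]            # one link per area
AREA = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0]    # 1 = an area starts at this position
FIRST = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0]   # 1 = the first character changes here
CHARS = ["$", "i", "m", "p", "s"]              # distinct characters in sorted order


def rank(bits, i):
    """Number of 1's in positions 1..i (naive scan)."""
    return sum(bits[:i])


def selectprev(bits, i):
    """Position of the last 1 at or before position i (naive scan)."""
    while bits[i - 1] == 0:
        i -= 1
    return i


def follow(i):
    """Suffix-array position of the suffix that starts one character later."""
    start = selectprev(AREA, i)                # first position of the area containing i
    return CCSA[rank(AREA, start) - 1] + (i - start)


def extract(i, m):
    """First m characters of the suffix stored at suffix-array position i."""
    out = []
    for _ in range(m):
        out.append(CHARS[rank(FIRST, i) - 1])  # first character of the current suffix
        i = follow(i)
    return "".join(out)


print(extract(4, 3))    # the three characters compared against P = "isi"
```

Starting from position 6, which holds the suffix beginning at text position 1, `extract(6, 12)` walks through the whole text.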

Example: Search on CCSA

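The structures are those of the previous example: the ccsa table, the bit-vector of area boundaries, the five-entry character table ($, i, m, p, s) and its bit-vector.

P = “isi” vs. T[sa[4]...|T|] ?

At position 4, the bit-vector of the character table has two 1's up to that position (rank 2), so the first character is entry 2 of the character table: i.

T[sa[4]...|T|] = i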

Example: Search on CCSA

P = “isi” vs. T[sa[4]...|T|] ?

Position 4 starts an area whose ccsa entry is 11, so the link leads to position 11. There the rank is 9 in the bit-vector of area boundaries and 5 in the bit-vector of the character table, so the next character is entry 5 of the character table: s.

T[sa[4]...|T|] = i s

Example: Search on CCSA

P = “isi” vs. T[sa[4]...|T|] ?

Position 11 starts the area of ccsa entry 9, whose value is 9, so the link leads to position 9. There the rank is 8 in the bit-vector of area boundaries and 5 in the bit-vector of the character table, so the next character is again s.

T[sa[4]...|T|] = i s s > P

Search on CCSA...

To follow a link in constant time, we need the operations rank(i) and selectprev(i) on bit-vectors:
- rank(i) gives the number of 1's up to position i.
- selectprev(i) gives the position of the previous 1 before position i.

Search on CCSA...

Lemma [Jacobson 89, Munro et al. 96]: A bit-vector of length n can be replaced with a structure of size n + o(n) bits so that the queries rank(i) and selectprev(i) can be supported in constant time.
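
The idea behind such structures is to store precomputed counts at regular intervals and to finish each query with a short local computation. The sketch below is a deliberately simplified one-level version, not the constant-time structure of the lemma: it keeps one count per block and scans inside the block, so a rank query costs time proportional to the block size. The class name `SampledRank` and the default block size are hypothetical.

```python
class SampledRank:
    """rank over a bit list with one stored count per block (positions are 1-based)."""

    def __init__(self, bits, block=4):
        self.bits = bits
        self.block = block
        # counts[j] = number of 1's among the first j * block bits.
        self.counts = [0]
        for j in range(0, len(bits), block):
            self.counts.append(self.counts[-1] + sum(bits[j:j + block]))

    def rank(self, i):
        """Number of 1's in positions 1..i."""
        j = i // self.block
        # Stored count for the full blocks, plus a scan of the partial block.
        return self.counts[j] + sum(self.bits[j * self.block:i])


r = SampledRank([1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0])
print(r.rank(11))
```

The real structures add a second level of counts and small lookup tables so that the local step also takes constant time.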

Search on CCSA...

Corollary: Existence and counting queries can be supported by CCSA in time O(|P| log |T|).

Reporting queries can be supported by a similar technique to access sampled suffixes.
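
A counting query can be sketched as two binary searches over suffix-array positions, each step extracting |P| characters through the links. The code below reuses the example structures from this transcript under hypothetical names (`CCSA`, `AREA`, `FIRST`, `CHARS`); rank and selectprev are naive scans here, so the sketch shows the logic of the search rather than the stated time bound. Patterns are assumed not to contain the terminator `$`.

```python
CCSA = [6, 1, 8, 11, 5, 2, 7, 3, 9]            # one link per area
AREA = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0]    # 1 = an area starts at this position
FIRST = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0]   # 1 = the first character changes here
CHARS = ["$", "i", "m", "p", "s"]
N = len(AREA)


def rank(bits, i):
    """Number of 1's in positions 1..i (naive scan)."""
    return sum(bits[:i])


def extract(i, m):
    """First m characters of the suffix at suffix-array position i."""
    out = []
    for _ in range(m):
        out.append(CHARS[rank(FIRST, i) - 1])
        start = i
        while AREA[start - 1] == 0:            # selectprev: start of the area
            start -= 1
        i = CCSA[rank(AREA, start) - 1] + (i - start)
    return "".join(out)


def count(pattern):
    """Number of occurrences of pattern, found by binary search on positions 1..N."""
    m = len(pattern)
    lo, hi = 1, N + 1
    while lo < hi:                             # first position with prefix >= pattern
        mid = (lo + hi) // 2
        if extract(mid, m) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, N + 1
    while lo < hi:                             # first position with prefix > pattern
        mid = (lo + hi) // 2
        if extract(mid, m) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


print(count("ssi"), count("isi"))
```

An existence query is then simply `count(pattern) > 0`.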

Size of CCSA

Overall we use O(n)+n’log n bits of space, where n’ is the number of entries in the main CCSA table.

We show in the paper that n’ is also the number of runs of symbols in the Burrows-Wheeler transformed text.

Finally, we show that n' ≤ 2 Hk n + σ^k.
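
The relation between n' and the runs of the Burrows-Wheeler transform can be checked on small inputs. The sketch below assumes that the text ends with a unique terminator `$` that sorts before every other character, and that an area is a maximal run of suffix-array positions whose links increase by one and whose suffixes share their first character. The function names `bwt_runs` and `ccsa_entries` are hypothetical.

```python
def sorted_suffixes(text):
    """0-based suffix array by direct sorting (small inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_runs(text):
    """Number of maximal runs of equal symbols in the Burrows-Wheeler transform."""
    sa = sorted_suffixes(text)
    bwt = [text[i - 1] for i in sa]            # i == 0 wraps around to the terminator
    return 1 + sum(bwt[j] != bwt[j - 1] for j in range(1, len(bwt)))


def ccsa_entries(text):
    """Number of areas n' under the run definition assumed above."""
    n = len(text)
    sa = sorted_suffixes(text)
    pos_of = {start: i for i, start in enumerate(sa)}
    link = [pos_of[(start + 1) % n] for start in sa]
    return 1 + sum(
        link[i] != link[i - 1] + 1 or text[sa[i]] != text[sa[i - 1]]
        for i in range(1, n)
    )


for t in ["mississippi$", "banana$", "abracadabra$", "aaaaaa$"]:
    print(t, ccsa_entries(t), bwt_runs(t))
```

For `mississippi$` both counts are 9, the number of entries in the ccsa table of the example.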

Comparison: default settings

Index   Size (times |T|)
FM      0.36
CSA     0.69
CCSA    1.67
LZ      1.5

Comparison: same sample rate

Index   Size (times |T|)
FM      0.41
CSA     0.58
CCSA    1.67

Comparison: same space

Index   Size (times |T|)
FM      1.69
CSA     1.59
CCSA    1.67
LZ      1.5

Conclusion

CCSA is much faster than the default implementations of other small indexes in reporting (except LZ-index).

However, as the basic structure of the other indexes takes less space, it is possible to implement them with a smaller sampling step so that they occupy the same space as CCSA and work as efficiently.

Future

In subsequent work we have developed an index (a cross between CCSA and FM-index) taking O(n Hk log σ) bits of space and supporting counting queries in time O(|P|).
- Optimal space/time on a constant alphabet.
- Turns the exponential additive alphabet factor of the FM-index into a logarithmic multiplicative factor.