Counting Suffix Arrays and Strings

Klaus-Bernd Schürmann, Jens Stoye
Technische Fakultät, Universität Bielefeld, Germany
18 May 2006


Dagstuhl, May 2006 - Jens Stoye Slide 2

Suffix Array Data Structure

Suffix array: the lexicographically sorted list of all suffixes.

Text to be indexed:

position    1  2  3  4  5  6  7  8  9 10 11 12 13
character   T  C  T  T  C  T  C  T  T  C  T  C  $

Suffix array (start position - suffix):

13 - $
12 - C$
10 - CTC$
 5 - CTCTTCTC$
 7 - CTTCTC$
 2 - CTTCTCTTCTC$
11 - TC$
 9 - TCTC$
 4 - TCTCTTCTC$
 6 - TCTTCTC$
 1 - TCTTCTCTTCTC$
 8 - TTCTC$
 3 - TTCTCTTCTC$
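The sorted list on this slide can be reproduced with a naive construction. The following is only a minimal sketch, not the authors' method: it sorts all suffixes directly, and the function name `suffix_array` is chosen here for illustration. It relies on `$` sorting before the letters, which holds for ASCII.

```python
def suffix_array(text):
    """Return the 1-based suffix array of text by sorting all suffixes."""
    n = len(text)
    # Sort the start positions 1..n by the suffix that begins there.
    return sorted(range(1, n + 1), key=lambda i: text[i - 1:])


text = "TCTTCTCTTCTC$"
for start in suffix_array(text):
    # Print "start position - suffix", as on the slide.
    print(f"{start:2d} - {text[start - 1:]}")
```

Sorting whole suffixes takes quadratic time or worse, so this is only suitable for small examples like the one above.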

Dagstuhl, May 2006 - Jens Stoye Slide 3

Overview

1. Classify strings sharing same suffix array

2. Counting strings sharing same suffix array

3. Counting suffix arrays: lower bound for suffix array compression

4. Summation identities

Dagstuhl, May 2006 - Jens Stoye Slide 4

1. Classify Strings for Suffix Array

t - string of length n,
P - permutation of {1,...,n},
R - inverse of P.

Theorem:

P is the suffix array of t if and only if for all \( i \in \{1,\dots,n-1\} \)

a) \( t[P[i]] \le t[P[i+1]] \) and
b) \( t[P[i]] = t[P[i+1]] \;\Rightarrow\; R[P[i]+1] < R[P[i+1]+1] \),

where b) is the same as

b) \( R[P[i]+1] > R[P[i+1]+1] \;\Rightarrow\; t[P[i]] < t[P[i+1]] \).
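To make the two conditions concrete, here is a minimal sketch that tests them for a given permutation and string and compares the result with a naively sorted suffix array. The slide does not say how R[n+1] is defined when P[i] = n; the sketch assumes R[n+1] = 0, i.e. the empty suffix ranks below every other suffix. The function names are illustrative only.

```python
from itertools import permutations, product


def suffix_array(t):
    """1-based suffix array obtained by sorting all suffixes (reference result)."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


def satisfies_theorem(P, t):
    """Check conditions a) and b) for a 1-based permutation P and a string t."""
    n = len(t)
    # R is the inverse of P. R[0] is unused; R[n+1] = 0 is an ASSUMED
    # convention for the empty suffix (not stated on the slide).
    R = [0] * (n + 2)
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    for i in range(n - 1):  # slide index i = 1..n-1
        x, y = P[i], P[i + 1]
        if t[x - 1] > t[y - 1]:  # violates a)
            return False
        if t[x - 1] == t[y - 1] and not R[x + 1] < R[y + 1]:  # violates b)
            return False
    return True


print(satisfies_theorem([1, 2, 5, 3, 4], "AABCB"))

# Exhaustive cross-check on all strings of length 4 over {A, B, C}.
for s in product("ABC", repeat=4):
    for P in permutations(range(1, 5)):
        assert satisfies_theorem(P, s) == (list(P) == suffix_array(s))
```

The first call uses the string t = AABCB and the permutation P = 1 2 5 3 4 from the example slides.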

Dagstuhl, May 2006 - Jens Stoye Slide 5

1. Classify Strings for Suffix Array

Text to be indexed:

position   1  2  3  4  5
t =        A  A  B  C  B

i   P[i]   suffix   t[P[i]]
1   1      AABCB    A
2   2      ABCB     A
3   5      B        B
4   3      BCB      B
5   4      CB       C

a) \( t[P[i]] \le t[P[i+1]] \) and
b) \( R[P[i]+1] > R[P[i+1]+1] \;\Rightarrow\; t[P[i]] < t[P[i+1]] \)

R+-descent: a position i with \( R[P[i]+1] > R[P[i+1]+1] \).

Dagstuhl, May 2006 - Jens Stoye Slide 6

1. Classify Strings for Suffix Array

Equivalences between strings

position   1  2  3  4  5
t  =       A  A  B  C  B
t2 =       A  A  C  D  C    (order-equivalent)
t3 =       A  B  D  E  D    (order-distinct)

i   P[i]   suffix of t   t[P[i]]   suffix of t2   t2[P[i]]   suffix of t3   t3[P[i]]
1   1      AABCB         A         AACDC          A          ABDED          A
2   2      ABCB          A         ACDC           A          BDED           B
3   5      B             B         C              C          D              D
4   3      BCB           B         CDC            C          DED            D
5   4      CB            C         DC             D          ED             E

Dagstuhl, May 2006 - Jens Stoye Slide 7

2. Counting Strings for Suffix Array

position   1  2  3  4  5
t  =       A  A  B  C  B
t2 =       A  A  C  D  C
t3 =       A  B  D  E  D

Base string: t. Adding a non-decreasing sequence to t[P[i]] yields t2[P[i]] and t3[P[i]]:

i   P[i]   t[P[i]]          t2[P[i]]     t[P[i]]          t3[P[i]]
1   1      A        + 0  =  A            A        + 0  =  A
2   2      A        + 0  =  A            A        + 1  =  B
3   5      B        + 1  =  C            B        + 2  =  D
4   3      B        + 1  =  C            B        + 2  =  D
5   4      C        + 1  =  D            C        + 2  =  E

Non-decreasing sequences: (0, 0, 1, 1, 1) for t2 and (0, 1, 2, 2, 2) for t3.
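The construction on this slide can be sketched in a few lines: find the R+-descents of P, build the base string by moving to the next character after each descent, then add a non-decreasing sequence to obtain further strings with the same suffix array. This is a sketch under assumptions: R[n+1] = 0 is assumed for the empty suffix, and the names `r_plus_descents`, `base_string` and `shift` are hypothetical.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def r_plus_descents(P):
    """Positions i (1-based) with R[P[i]+1] > R[P[i+1]+1]."""
    n = len(P)
    R = [0] * (n + 2)  # R[n+1] = 0 is an assumed convention
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    return [i for i in range(1, n) if R[P[i - 1] + 1] > R[P[i] + 1]]


def base_string(P):
    """Smallest string for P: the character increases only after an R+-descent."""
    descents = set(r_plus_descents(P))
    t = [""] * len(P)
    c = 0
    for i, pos in enumerate(P, start=1):
        t[pos - 1] = ALPHABET[c]
        if i in descents:
            c += 1
    return "".join(t)


def shift(P, t, offsets):
    """Add offsets[i] to t[P[i]]; offsets must be non-decreasing."""
    out = list(t)
    for pos, off in zip(P, offsets):
        out[pos - 1] = ALPHABET[ALPHABET.index(t[pos - 1]) + off]
    return "".join(out)


P = [1, 2, 5, 3, 4]
t = base_string(P)
print(r_plus_descents(P), t)
print(shift(P, t, [0, 0, 1, 1, 1]))
print(shift(P, t, [0, 1, 2, 2, 2]))
```

For the example permutation the two shifted strings correspond to t2 and t3 on the slide.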

Dagstuhl, May 2006 - Jens Stoye Slide 8

2. Counting Strings for Suffix Array

Suffix array P of length n with d R+-descents.

Number of strings over an alphabet of size a for P
= number of non-decreasing sequences of length n over a-d elements:

\[ \binom{n+a-d-1}{a-d-1} \]
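As a sanity check, the closed form can be compared with a brute-force count on the small example P = 1 2 5 3 4, which has d = 2 R+-descents. The sketch below enumerates all strings over a small alphabet and counts those whose naively sorted suffix array equals P. The helper names are illustrative, not from the talk.

```python
from itertools import product
from math import comb


def suffix_array(t):
    """1-based suffix array by sorting all suffixes (works for tuples too)."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


def count_formula(n, d, a):
    """Binomial(n + a - d - 1, a - d - 1); zero if the alphabet is too small."""
    return comb(n + a - d - 1, a - d - 1) if a > d else 0


def count_brute_force(P, a):
    """Count strings over the first a letters whose suffix array is P."""
    letters = "ABCDEFGH"[:a]
    return sum(1 for s in product(letters, repeat=len(P))
               if suffix_array(s) == list(P))


P, d = [1, 2, 5, 3, 4], 2
for a in range(1, 6):
    print(a, count_formula(len(P), d, a), count_brute_force(P, a))
```

Both columns should agree for every alphabet size, including the sizes a ≤ d for which no string exists.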

Dagstuhl, May 2006 - Jens Stoye Slide 9

2. Counting Strings for Suffix Array

Suffix array P of length n with d R+-descents.

Number of strings composed of exactly k distinct characters for P is

\[ \binom{n-d-1}{k-d-1} \]

Dagstuhl, May 2006 - Jens Stoye Slide 10

2. Counting Strings for Suffix Array

Number of strings over alphabet size 20 for suffix arrays of length n with 10 R+-descents:

 n   Strings composed of up to 20 characters   Strings composed of all 20 characters
 5                                     2,002                                       0
10                                    92,378                                       0
15                                 1,307,504                                       0
20                                10,015,005                                       1
25                                52,451,256                                   2,002
30                               211,915,132                                  92,378
35                               708,930,508                               1,307,504
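The two columns of this table follow from the binomial formulas of the preceding slides with a = k = 20 and d = 10. A short sketch to recompute them, where `binom` is a small helper introduced here so that out-of-range arguments give 0:

```python
from math import comb


def binom(n, k):
    """Binomial coefficient that is 0 outside 0 <= k <= n."""
    return comb(n, k) if 0 <= k <= n else 0


def strings_up_to(n, d, a):
    """Strings over an alphabet of size a: C(n + a - d - 1, a - d - 1)."""
    return binom(n + a - d - 1, a - d - 1)


def strings_all(n, d, k):
    """Strings using all k characters: C(n - d - 1, k - d - 1)."""
    return binom(n - d - 1, k - d - 1)


for n in range(5, 40, 5):
    print(f"{n:2d} {strings_up_to(n, 10, 20):>12,} {strings_all(n, 10, 20):>10,}")
```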

Dagstuhl, May 2006 - Jens Stoye Slide 11

2. Counting Strings for Suffix Array

Suffix array P of length n with d R+-descents.

Number of order-distinct strings over an alphabet of size a is

\[ \sum_{k=d+1}^{a} \binom{n-d-1}{k-d-1} \]

Number of order-distinct strings where all k distinct characters must appear is

\[ \binom{n-d-1}{k-d-1} \]

Dagstuhl, May 2006 - Jens Stoye Slide 12

3. Counting Suffix Arrays

Definition:

Let P be a permutation of {1,...,n}.

Position \( i \in \{1,\dots,n-1\} \) is a permutation descent if P[i] > P[i+1].

Definition:

The Eulerian number \( \left\langle {n \atop d} \right\rangle \) gives the number of permutations of {1,...,n} with exactly d permutation descents.

Dagstuhl, May 2006 - Jens Stoye Slide 13

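3. Counting Suffix Arrays

Well-known fact:

Recursive enumeration of Eulerian numbers

a) \( \left\langle {n \atop 0} \right\rangle = 1 \),

b) \( \left\langle {n \atop d} \right\rangle = 0 \) for \( n \le d \), and

c) \( \left\langle {n \atop d} \right\rangle = (d+1) \left\langle {n-1 \atop d} \right\rangle + (n-d) \left\langle {n-1 \atop d-1} \right\rangle \)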
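The recursive enumeration translates directly into a memoized function. This is a minimal sketch: the function name `eulerian` is chosen here, and rule a) is checked before rule b) so that d = 0 always yields 1.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number <n, d> via the three-part recursion."""
    if d < 0:
        return 0
    if d == 0:   # a)
        return 1
    if n <= d:   # b)
        return 0
    # c)
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


for n in range(1, 7):
    print(n, [eulerian(n, d) for d in range(n)])
```

Each printed row sums to n!, since every permutation has some number of descents between 0 and n-1.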

Dagstuhl, May 2006 - Jens Stoye Slide 14

3. Counting Suffix Arrays

Definition:
Let A(n,d) be the number of permutations of length n with d R+-descents.

Observation:
a) A(n,0) = 1
b) A(n,d) = 0 for \( n \le d \)
c) see next

Dagstuhl, May 2006 - Jens Stoye Slide 15

3. Counting Suffix Arrays

Text to be indexed:

position   1  2  3  4  5
t =        A  A  B  C  B

i   Pt[i]   suffix   t[P[i]]
1   1       AABCB    A
2   2       ABCB     A
3   5       B        B
4   3       BCB      B
5   4       CB       C

position   1  2  3  4  5  6
At =       A  A  A  B  C  B

i   PAt[i]   suffix   At[P[i]]
1   1        AAABCB   A
2   2        AABCB    A
3   3        ABCB     A
4   6        B        B
5   4        BCB      B
6   5        CB       C

(d+1) possible positions without additional R+-descent

Dagstuhl, May 2006 - Jens Stoye Slide 16

3. Counting Suffix Arrays

Text to be indexed:

position   1  2  3  4  5
t =        A  A  B  C  B

i   Pt[i]   suffix   t[P[i]]
1   1       AABCB    A
2   2       ABCB     A
3   5       B        B
4   3       BCB      B
5   4       CB       C

position   1  2  3  4  5  6
Bt =       B  A  A  B  C  B

i   PBt[i]   suffix   Bt[P[i]]
1   2        AABCB    A
2   3        ABCB     A
3   6        B        B
4   1        BAABCB   B
5   4        BCB      B
6   5        CB       C

(d+1) possible positions without additional R+-descent
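The effect of prepending one character can be explored by brute force. The sketch below shifts every entry of the old suffix array by one, inserts the new suffix (start position 1) at each possible rank, and counts the R+-descents of the result. It assumes R[n+1] = 0 for the empty suffix, and the function names are hypothetical.

```python
def count_r_plus_descents(P):
    """Number of positions i with R[P[i]+1] > R[P[i+1]+1]."""
    n = len(P)
    R = [0] * (n + 2)  # R[n+1] = 0 is an assumed convention
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    return sum(1 for i in range(n - 1) if R[P[i] + 1] > R[P[i + 1] + 1])


def descents_after_prepending(P):
    """Descent count for each rank at which the new first suffix is inserted."""
    shifted = [p + 1 for p in P]  # old suffixes now start one position later
    counts = []
    for j in range(len(P) + 1):
        Q = shifted[:j] + [1] + shifted[j:]
        counts.append(count_r_plus_descents(Q))
    return counts


P = [1, 2, 5, 3, 4]  # suffix array of AABCB, d = 2
print(count_r_plus_descents(P))
print(descents_after_prepending(P))
```

For this example, three of the six insertion ranks keep d = 2, matching the (d+1) positions named on the slide. Two of them (ranks 1 and 4) are the suffix arrays of At and Bt shown above. The other three ranks raise the count to 3.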

Dagstuhl, May 2006 - Jens Stoye Slide 17

3. Counting Suffix Arrays

Together:
a) A(n,0) = 1,
b) A(n,d) = 0 for \( n \le d \), and
c) A(n,d) = (d+1) A(n-1,d) + (n-d) A(n-1,d-1)

Theorem:
The number A(n,d) of permutations of length n with d R+-descents is the Eulerian number \( \left\langle {n \atop d} \right\rangle \).
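For small n the theorem can be checked by enumerating all permutations, counting their R+-descents, and comparing the tallies with the Eulerian recursion. This is only an illustrative sketch. It again assumes R[n+1] = 0, and the names are chosen here.

```python
from collections import Counter
from functools import lru_cache
from itertools import permutations


def count_r_plus_descents(P):
    """Number of positions i with R[P[i]+1] > R[P[i+1]+1]."""
    n = len(P)
    R = [0] * (n + 2)  # R[n+1] = 0 is an assumed convention
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    return sum(1 for i in range(n - 1) if R[P[i] + 1] > R[P[i + 1] + 1])


@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number via A(n,d) = (d+1) A(n-1,d) + (n-d) A(n-1,d-1)."""
    if d < 0:
        return 0
    if d == 0:
        return 1
    if n <= d:
        return 0
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


def tally(n):
    """A(n, d) for d = 0..n-1, counted over all n! permutations."""
    counts = Counter(count_r_plus_descents(P)
                     for P in permutations(range(1, n + 1)))
    return [counts[d] for d in range(n)]


for n in range(1, 7):
    print(n, tally(n), [eulerian(n, d) for d in range(n)])
```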

Dagstuhl, May 2006 - Jens Stoye Slide 18

3. Counting Suffix Arrays

The number of distinct suffix arrays of length n for strings over an alphabet of size a:

\[ \sum_{d=0}^{a-1} \left\langle {n \atop d} \right\rangle \]

Lower bound for compressibility of suffix arrays in the Kolmogorov sense:

\[ \log \sum_{d=0}^{a-1} \left\langle {n \atop d} \right\rangle \]

Dagstuhl, May 2006 - Jens Stoye Slide 19

3. Counting Suffix Arrays

Number of distinct suffix arrays of length n for strings over alphabet of size 20:

 n   String count (20^n)   Suffix array count
 4   160,000               24
 6   6.4 × 10^7            720
 8   2.6 × 10^10           40,320
10   1.0 × 10^13           3.6 × 10^6
12   4.1 × 10^15           4.8 × 10^8
14   1.6 × 10^18           8.7 × 10^10
16   6.6 × 10^20           2.1 × 10^13
18   2.6 × 10^23           6.4 × 10^15

Dagstuhl, May 2006 - Jens Stoye Slide 20

3. Counting Suffix Arrays

Number of distinct suffix arrays of length n for strings over alphabet of size 4:

 n   String count (4^n)   Suffix array count
 4                  256                   24
 6                4,096                  662
 8               65,536               20,160
10            1,048,576              504,046
12           16,777,216           10,670,040
14          268,435,456          202,964,470
16        4,294,967,296        3,614,083,520
18       68,719,476,736       61,786,015,150
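The suffix array counts in the two tables are sums of Eulerian numbers over d = 0, ..., a-1. The following sketch recomputes them and prints the base-2 logarithm as a bit count. The base 2 is an assumption made here (the slides write only "log"), and the function names are illustrative.

```python
from functools import lru_cache
from math import log2


@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number via the recursion from the earlier slides."""
    if d < 0:
        return 0
    if d == 0:
        return 1
    if n <= d:
        return 0
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


def suffix_array_count(n, a):
    """Distinct suffix arrays of length n over an alphabet of size a."""
    return sum(eulerian(n, d) for d in range(a))


for a in (4, 20):
    for n in range(4, 20, 2):
        count = suffix_array_count(n, a)
        # log2(count): bits needed to tell the suffix arrays apart (base assumed)
        print(f"a={a:2d} n={n:2d} strings={a ** n} "
              f"suffix arrays={count} bits={log2(count):.1f}")
```

For a = 20 and n ≤ 18 the count equals n!, because every permutation of length n ≤ 20 has fewer than 20 R+-descents.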

Dagstuhl, May 2006 - Jens Stoye Slide 21

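4. Summation Identities

Worpitzky's identity by summing up the number of strings of length n for each suffix array:

\[ a^n = \sum_{d=0}^{n-1} \left\langle {n \atop d} \right\rangle \binom{n+a-d-1}{a-d-1} = \sum_{i} \left\langle {n \atop i} \right\rangle \binom{a+i}{n} \]

Summation rule for Eulerian numbers to generate the Stirling numbers of second kind:

\[ k! \left\{ {n \atop k} \right\} = \sum_{d=0}^{k-1} \left\langle {n \atop d} \right\rangle \binom{n-d-1}{k-d-1} = \sum_{i} \left\langle {n \atop i} \right\rangle \binom{i}{n-k} \]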
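Both identities can be spot-checked numerically for small parameters. The sketch below uses the string-counting forms, i.e. a^n as the sum over d of the Eulerian number times C(n+a-d-1, a-d-1), and k! times the Stirling number of the second kind as the sum over d of the Eulerian number times C(n-d-1, k-d-1). The Stirling numbers come from their standard recurrence. All function names are chosen here for illustration.

```python
from functools import lru_cache
from math import comb, factorial


def binom(n, k):
    """Binomial coefficient that is 0 outside 0 <= k <= n."""
    return comb(n, k) if 0 <= k <= n else 0


@lru_cache(maxsize=None)
def eulerian(n, d):
    if d < 0:
        return 0
    if d == 0:
        return 1
    if n <= d:
        return 0
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling numbers of the second kind via S(n,k) = k S(n-1,k) + S(n-1,k-1)."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def worpitzky_sum(n, a):
    """Strings of length n over a characters, summed over all suffix arrays."""
    return sum(eulerian(n, d) * binom(n + a - d - 1, a - d - 1) for d in range(n))


def stirling_sum(n, k):
    """Strings of length n using all k characters, summed over all suffix arrays."""
    return sum(eulerian(n, d) * binom(n - d - 1, k - d - 1) for d in range(k))


for n in range(1, 8):
    for a in range(1, 7):
        assert worpitzky_sum(n, a) == a ** n
    for k in range(1, n + 1):
        assert stirling_sum(n, k) == factorial(k) * stirling2(n, k)
print("identities hold for n = 1..7")
```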

Dagstuhl, May 2006 - Jens Stoye Slide 22

Summary

Constructive proofs to count strings sharing the same suffix array

Constructive proof to count distinct suffix arrays yielding lower bound for suffix array compression

Constructive proofs for Worpitzky's identity and the summation rule of Eulerian numbers to count Stirling numbers of second kind

Dagstuhl, May 2006 - Jens Stoye Slide 23

Outlook

Efficient enumeration algorithm for suffix arrays

Compressed suffix arrays for fast querying in bioinformatics applications

Average case analysis under non-uniform model


Thank you for your attention!