Source: phdopen.mimuw.edu.pl/zima09/PhD-Warsaw-period.pdf
String Algorithms
Maxime Crochemore
King’s College London & Université Paris-Est
M.C. PhD Univ. Warsaw 1/63
Algorithms and combinatorics on words
⋆ Links between combinatorial properties of words and algorithms on words:
– some algorithms are based on combinatorial properties
– combinatorics is sometimes used to evaluate efficiency of algorithms
⋆ Examples:
– Text searching
– Fragment assembly and shortest common superstring
– Text indexing and suffix arrays
– Text compression and permutations
⋆ Other examples in Algorithms on Strings
[C., Hancart, Lecroq, 2007], [C., Rytter, 1994]
http://www.dcs.kcl.ac.uk/staff/mac/
⋆ Combinatorial aspects in Applied Combinatorics on Words
[Lothaire, 2004], [Lothaire, 2002]
http://igm.univ-mlv.fr/~berstel/Lothaire/index.html
Periods and borders of words
⋆ Non-empty string u, integer p, 0 < p ≤ |u|
⋆ p is a period of u if any of these equivalent conditions is satisfied:
– u[i] = u[i + p], for 1 ≤ i ≤ |u| − p
– u is a prefix of some y^k, k > 0, |y| = p
– u = yw = wz, for some strings y, z, w with |y| = |z| = p
String w is called a border of u
[Figure: borderlandborder written twice, the second copy shifted right by p = 10; the overlapping part, border, is a border of the word.]
⋆ period(u) = smallest period of u (can be |u|)
border(u) = longest proper border of u (can be empty)
⋆ Periods and borders of abaabaa
period 3, border abaa
period 6, border a
period 7, border empty string
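The correspondence between periods and borders can be sketched in a few lines of Python. This is only an illustration under the definitions above: the names `border_table` and `periods` are chosen here and do not come from the slides.

```python
def border_table(u):
    """table[i] = length of the longest proper border of u[:i]; table[0] = -1."""
    table = [-1] * (len(u) + 1)
    k = -1
    for i, c in enumerate(u):
        # Fall back along shorter borders until one can be extended by c.
        while k >= 0 and u[k] != c:
            k = table[k]
        k += 1
        table[i + 1] = k
    return table


def periods(u):
    """All periods of u in increasing order: |u| minus each border length."""
    table = border_table(u)
    result = []
    k = table[len(u)]
    while k >= 0:
        result.append(len(u) - k)
        k = table[k]
    return result


print(periods("abaabaa"))  # [3, 6, 7]
```

Each border of length k gives the period |u| − k, so the chain of borders (longest first) lists the periods from smallest to largest.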
Periodicity Lemma
Lemma 1 (Periodicity Lemma)
If p and q are periods of a word x and satisfy p+q−GCD(p, q) ≤ |x|
then GCD(p, q) is a period of x.
[Fine, Wilf, 1965]
see Algebraic Combinatorics on Words [Lothaire, 2002]
a b a a b a b a a b a b a a b · · ·
a b a a b a b a a b a
a b a a b a b a a b a a b a b · · ·
Used in the analysis of the KMP algorithm and of many other pattern matching algorithms.
Lemma 2 (Weak Periodicity Lemma)
If p and q are periods of a word x and satisfy p + q ≤ |x| then
GCD(p, q) is a period of x.
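As a sanity check, the Periodicity Lemma can be tested by brute force on short words. The sketch below is illustrative only (helper names are chosen here): it checks every binary word up to length 10, then shows that the bound cannot be lowered, using the word abaababaaba of length 11, which has periods 5 and 8 but not period GCD(5, 8) = 1.

```python
from itertools import product
from math import gcd


def is_period(x, p):
    """True if x[i] == x[i + p] wherever both positions exist."""
    return all(x[i] == x[i + p] for i in range(len(x) - p))


def lemma_holds(x):
    """Check the Periodicity Lemma on every pair of periods of x."""
    n = len(x)
    ps = [p for p in range(1, n + 1) if is_period(x, p)]
    return all(is_period(x, gcd(p, q))
               for p in ps for q in ps
               if p + q - gcd(p, q) <= n)


words = ("".join(w) for n in range(1, 11) for w in product("ab", repeat=n))
print(all(lemma_holds(w) for w in words))

x = "abaababaaba"  # length 11 = 5 + 8 - GCD(5, 8) - 1
print(is_period(x, 5), is_period(x, 8), is_period(x, 1))
```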
Proof of the weaker statement
⋆ p and q periods of x with p + q ≤ |x| and p > q
⋆ p − q period of x
[Figure: two cases for a position i of x. If i + p ≤ |x|: move right by p, then left by q, so x[i] = x[i + p] = x[i + p − q]. Otherwise i > q: move left by q, then right by p, so x[i] = x[i − q] = x[i − q + p].]
⋆ the rest like Euclid’s induction
On-line String Matching
[Figure: prefix u of the pattern matched in the text, followed by letter a in the pattern and by letter c in the text; a compatible shift aligns a shorter prefix of the pattern under the matched part.]
⋆ compatible with match:
shift = period(u)
[Morris, Pratt, 1969]
⋆ idem + not incompatible with c
[Knuth, Morris, Pratt, 1977]
⋆ best shift = period(uc)
[Simon, 1989], [Hancart, 1993]
[Breslauer, Colussi, Toniolo, 1993]
⋆ Example: text y = · · a b a a b a c · · ·, pattern x = a b a a b a a, matched prefix u = a b a a b a
– Morris-Pratt: shift = period(abaaba) = 3
– Knuth-Morris-Pratt: shift = 5 (a shift of 3 would again align an a with the mismatch position)
– best shift = period(abaabac) = 7
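The three shift rules can be compared on the slide's example with a small sketch. The function names below are illustrative and not from the slides; `i` is the number of matched pattern letters (assumed to be at least 1 and less than |x|) and `c` is the mismatching text letter.

```python
def borders(u):
    """Lengths of all borders of the non-empty word u, longest first, ending with 0."""
    table = [-1] * (len(u) + 1)
    k = -1
    for i, ch in enumerate(u):
        while k >= 0 and u[k] != ch:
            k = table[k]
        k += 1
        table[i + 1] = k
    out, k = [], table[len(u)]
    while k >= 0:
        out.append(k)
        k = table[k]
    return out


def mp_shift(x, i):
    """Morris-Pratt: period of the matched prefix u = x[:i]."""
    return i - borders(x[:i])[0]


def kmp_shift(x, i):
    """Knuth-Morris-Pratt: skip borders followed by the letter x[i] that just failed."""
    for b in borders(x[:i]):
        if x[b] != x[i]:
            return i - b
    return i + 1


def best_shift(x, i, c):
    """Best shift: period of u followed by the text letter c."""
    uc = x[:i] + c
    return len(uc) - borders(uc)[0]


x = "abaabaa"
print(mp_shift(x, 6), kmp_shift(x, 6), best_shift(x, 6, "c"))  # 3 5 7
```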
Delay
⋆ Delay: maximum number of comparisons on a text letter
⋆ MP algorithm
delay ≤ |x|
⋆ KMP algorithm
delay ≤ log_Φ(|x| + 1), Φ the golden ratio
text y · · a b a a b a c · · · · · · ·
pattern x a b a a b a b
a b a a b a a
a b a a b a a
a b a a b a a
Proof by the Periodicity Lemma
The bound is tight
⋆ Simon-Hancart algorithm
use of string-matching automaton
delay ≤ min(1 + log_2 |x|, card A)
Searching with an automaton
⋆ Uses the string-matching automaton SMA(x):
smallest deterministic automaton accepting A∗x
⋆ Example x = abaa
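[Figure: SMA(abaa) with states 0 to 4; forward arcs 0 –a→ 1 –b→ 2 –a→ 3 –a→ 4; other arcs 0 –b→ 0, 1 –a→ 1, 2 –b→ 0, 3 –b→ 2, 4 –a→ 1, 4 –b→ 2.]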
⋆ Search for abaa in:
b a b b a a b a a b a a b b a · · ·
state 0 0 1 2 0 1 1 2 3 4 2 3 4 2 0 1 · · ·
Construction of SMA(x)
⋆ Unwinding arcs
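⋆ From SMA(abaa) . . .
[Figure: SMA(abaa) with states 0 to 4; from the last state 4, letter b leads back to state 2.]
⋆ . . . to SMA(abaab)
[Figure: SMA(abaab) with states 0 to 5; the arc 4 –b→ 2 is unwound into the forward arc 4 –b→ 5, and the new state 5 gets copies of the arcs leaving state 2, namely 5 –a→ 3 and 5 –b→ 0; all other arcs are unchanged.]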
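The unwinding step can be sketched as follows. This is a minimal illustration, not the space-optimized implementation discussed on the next slides: each state stores a dictionary of its arcs to non-initial states, a missing letter is taken to lead to state 0, and the names `build_sma` and `run` are chosen here.

```python
def build_sma(x):
    """String-matching automaton of x, built letter by letter by unwinding arcs."""
    delta = [{}]      # delta[q][a] = target state; a missing letter leads to state 0
    t = 0             # state of the prefix read so far
    for a in x:
        r = delta[t].get(a, 0)        # current target of (t, a): the arc to unwind
        p = len(delta)                # new state
        delta[t][a] = p               # the arc becomes a forward arc
        delta.append(dict(delta[r]))  # p inherits the arcs leaving r (copied after the update)
        t = p
    return delta


def run(delta, text):
    """States visited while reading text, starting from state 0."""
    q, states = 0, [0]
    for c in text:
        q = delta[q].get(c, 0)
        states.append(q)
    return states


states = run(build_sma("abaa"), "babbaabaabaabba")
print(states)  # [0, 0, 1, 2, 0, 1, 1, 2, 3, 4, 2, 3, 4, 2, 0, 1]
```

An occurrence of x ends wherever the state equals |x|; in this run that happens after 9 and after 12 letters, so abaa occurs at positions 5 and 8 (counted from 0).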
Complexity
⋆ Time and space optimization: implementation of significant arcs only
– Forward arcs: spell the pattern
– Backward arcs: arcs going backwards without reaching the
initial state
Lemma 3 SMA(x) contains at most |x| backward arcs.
⋆ Consequences:
– implementation of SMA(x) in O(|x|) space
– construction in O(|x|) time, independently of the alphabet size
⋆ There is a strategy on the choice of arcs for which:
delay ≤ min(1 + log_2 |x|, card A)
[Hancart, 1993], [Breslauer, Colussi, Toniolo, 1998]
Significant arcs
⋆ Complete SMA(ananas)
[Figure: complete SMA(ananas), states 0 to 6, forward arcs spelling a n a n a s, with all other arcs drawn, including those returning to the initial state.]
⋆ Forward arcs: spell the pattern
⋆ Backward arcs: arcs going backwards without reaching the
initial state
[Figure: the same automaton with significant arcs only: the forward arcs and the backward arcs 1 –a→ 1, 3 –a→ 1, 5 –a→ 1, 6 –a→ 1 and 5 –n→ 4.]
Lemma 4 SMA(x) contains at most |x| backward arcs.
⋆ Consequence: the implementation of SMA(x) can be done in
O(|x|) time and space, independently of the alphabet size
Backward arcs in SMA
⋆ States of SMA(x) are identified with prefixes of x
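A backward arc is of the form (u, τ, vτ) (u, v prefixes of x, τ symbol) with
– vτ longest suffix of uτ that is a prefix of x, and u ≠ v
Note: uτ is not a prefix of x
Let p(u, τ) = |u| − |v|; it is a period of u because v is a border of u
[Figure: in x, the prefix v is followed by τ and the prefix u is followed by σ; the arc labelled τ leads from u back to vτ; the occurrences of v as prefix and as suffix of u are p(u, τ) apart.]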
⋆ Backward arcs to periods: p is injective
Each period p, 1 ≤ p ≤ |x|, corresponds to at most one backward arc,
thus there are at most |x| such arcs
⋆ A worst case: SMA(ab^{m−1}) has m backward arcs (a ≠ b)
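The bound on backward arcs, and the injectivity of p, can be observed on small examples. The sketch below is an illustration with names chosen here; it rebuilds the automaton by unwinding (missing letters lead to state 0) and lists the arcs that are neither forward nor directed to the initial state.

```python
def sma(x):
    """Arcs of SMA(x) to non-initial states, one dictionary per state."""
    delta, t = [{}], 0
    for a in x:
        r = delta[t].get(a, 0)        # arc to unwind
        delta[t][a] = len(delta)      # forward arc to the new state
        delta.append(dict(delta[r]))  # the new state copies the arcs of r
        t = len(delta) - 1
    return delta


def backward_arcs(x):
    """Arcs (q, letter, r) with r <= q; r is never 0 because such arcs are not stored."""
    return [(q, a, r) for q, arcs in enumerate(sma(x))
            for a, r in arcs.items() if r <= q]


def arc_periods(x):
    """p(u, τ) = |u| − |v| for each backward arc, where |u| = q and |v| = r − 1."""
    return [q - (r - 1) for q, _, r in backward_arcs(x)]


print(len(backward_arcs("ananas")))   # 5
print(sorted(arc_periods("ananas")))  # [1, 2, 3, 5, 6]
print(len(backward_arcs("abbb")))     # 4, the worst case with m = 4
```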
Backward arcs (continued)
⋆ Proof that p is injective
Two backward arcs (u, τ, vτ), (u′, τ′, v′τ′)
Assume p(u, τ) = p(u′, τ′) = p; we prove u = u′ and τ = τ′.
⋆ If v = v′ then u = u′ and also τ = τ′
[Figure: x with prefix u followed by σ; the border v of u is followed by τ in x; the two occurrences of v are p apart.]
⋆ If v′ is a proper prefix of v then v′τ′ and v′σ′ are prefixes of v,
thus τ′ = σ′, a contradiction
[Figure: as above, with u′ followed by σ′ in x and its border v′ followed by τ′; both arcs have the same period p.]
Repetition
⋆ Repetition
(w, w) repetition of (u, v) if w ≠ ε and both:
– w suffix of u or u suffix of w, and
– w prefix of v or v prefix of w
|w| is a local period of (u, v)
⋆ Local Period
localperiod(u, v) = minimum local period of (u, v)
[Figure: repetitions w w aligned under a b a b b a at four different cuts, with w = b, w = a, w = b a and w = b b a b a.]
Maximal Local Period
⋆ Word of period 5, with the local period at each of its 7 cut positions
a b a b b a
1 2 2 5 1 3 1
⋆ Note: localperiod(u, v) ≤ period(uv)
⋆ (u, v) is a critical factorization of uv if
localperiod(u, v) = period(uv)
localperiod(u, v) is maximal among all local periods
⋆ Computation of all local periods in linear time
[Duval, Kolpakov, Kucherov, Lecroq, Lefebvre, 2003]
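The local periods of the example word can be recomputed by brute force directly from the definition. This quadratic sketch is only an illustration (it is not the linear-time algorithm cited above), and its function names are chosen here.

```python
def period(x):
    """Smallest period of the non-empty word x."""
    n = len(x)
    return next(p for p in range(1, n + 1)
                if all(x[i] == x[i + p] for i in range(n - p)))


def local_period(x, i):
    """Smallest |w| such that (w, w) is a repetition at the cut (x[:i], x[i:])."""
    n = len(x)
    # A length k works if the letters k apart around the cut agree
    # wherever both of them fall inside x.
    return next(k for k in range(1, n + 1)
                if all(x[j] == x[j + k]
                       for j in range(max(0, i - k), min(i, n - k))))


x = "ababba"
locals_ = [local_period(x, i) for i in range(len(x) + 1)]
print(locals_)  # [1, 2, 2, 5, 1, 3, 1]
print([i for i, k in enumerate(locals_) if k == period(x)])  # critical cuts: [3]
```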
Critical Factorization
Theorem 1 (Critical Factorization Theorem)
Any non-empty word x can be factorized into u · v with both:
• |u| < period(x) and
• localperiod(u, v) = period(x).
[Cesari, Vincent, Duval, 1983]
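[Figure: a b a b b a cut as (a b a, b b a), with the repetition w w for w = b b a b a, of length period = 5, centred on the cut.]
Leads to a time-space optimal string-matching algorithm:
the two-way algorithm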
Two-Way String Matching
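[Figure: pattern x = uv under text y, in two situations. Top: the left-to-right scan of v matches w and then fails (pattern letter a against text letter c); the shift has length |wc|. Bottom: v matches entirely and the right-to-left scan of u fails after matching w; the shift has length period(x).]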
⋆ Time-space optimality
– Search Time: linear time (≤ 2n comparisons) with constant extra space
– Preprocessing Time: idem, based on next theorem
[Crochemore, Perrin, 1992]
⋆ Other solutions
[Galil, Seiferas, 1983], [Crochemore, 1992]
[Crochemore, Rytter, 1994], [Rytter, 2002]
Two-way string matching (continued)
Searching for pattern x = x[0 . . m − 1] in text y = y[0 . . n − 1]
⋆ (u, v) critical factorization of x:
uv = x and |u| < localperiod(u, v) = period(x)
⋆ Searching
– search for an occurrence of v by left-to-right scan
if mismatch: shift length = scan length, else
– search for a preceding occurrence of u by right-to-left scan
shift length = period of x
Example
⋆ Critical factorization of x = a b a b b a b a: u = a b a, v = b b a b a, period = 5
⋆ Searching in y = a a a b b b b b a a b b a b a b b a b a b a . . (window positions counted from 0)
(1) window at position 0: left-to-right scan length = 3, shift length = 3
(2) window at position 3: left-to-right scan length = 4, shift length = 4
(3) window at position 7: left-to-right scan matches v, right-to-left scan fails in u, shift length = period = 5
(4) window at position 12: left-to-right and right-to-left scans succeed, occurrence found, next shift length = period = 5
Two-way string matching
Searching for pattern x = x[0 . . m − 1] in text y = y[0 . . n − 1]
⋆ (u, v) critical factorization of x:
uv = x and |u| < localperiod(u, v) = period(x)
⋆ Searching
– search for an occurrence of v by left-to-right scan
if mismatch: shift length = scan length, else
– search for a preceding occurrence of u by right-to-left scan
shift length = period of x
⋆ Preprocessing
– compute factorization (u, v)
– compute the (smallest) period of x
⋆ Algorithmic complexity
– total linear time; searching in less than 2n comparisons
– only constant additional memory space required
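The searching scheme summarized above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the preprocessing is done by brute force instead of in linear time, the memorization that gives the 2n bound is omitted, and all function names are chosen here. It assumes a non-empty pattern.

```python
def period(x):
    """Smallest period of x (brute force)."""
    n = len(x)
    return next(p for p in range(1, n + 1)
                if all(x[i] == x[i + p] for i in range(n - p)))


def local_period(x, i):
    """Smallest local period at the cut (x[:i], x[i:]) (brute force)."""
    n = len(x)
    return next(k for k in range(1, n + 1)
                if all(x[j] == x[j + k]
                       for j in range(max(0, i - k), min(i, n - k))))


def critical_cut(x):
    """First cut i < period(x) whose local period equals period(x)."""
    per = period(x)
    return next(i for i in range(per) if local_period(x, i) == per)


def two_way_search(x, y):
    """Return (occurrence positions of x in y, list of shift lengths used)."""
    m, n = len(x), len(y)
    cut, per = critical_cut(x), period(x)
    pos, hits, shifts = 0, [], []
    while pos + m <= n:
        i = cut
        while i < m and x[i] == y[pos + i]:      # left-to-right scan of v
            i += 1
        if i < m:
            shift = i - cut + 1                  # mismatch: shift = scan length
        else:
            j = cut - 1
            while j >= 0 and x[j] == y[pos + j]:  # right-to-left scan of u
                j -= 1
            if j < 0:
                hits.append(pos)
            shift = per                          # shift = period of x
        shifts.append(shift)
        pos += shift
    return hits, shifts


print(two_way_search("ababbaba", "aaabbbbbaabbababbababa"))  # ([12], [3, 4, 5, 5])
```

On the pattern and text of the example slides this reproduces the shifts 3, 4, 5 and the occurrence at position 12.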
Orderings
⋆ Orderings
≤ lexicographic ordering based on the ordering ≤ of the alphabet
≼ lexicographic ordering based on the inverse ordering ≤⁻¹ of the alphabet
Theorem 2 Let x be a non-empty word.
Let x = uv with v = suffix of x that is maximal for ≤
Let x = u′v′ with v′ = suffix of x that is maximal for ≼
If |v| ≤ |v′| then (u, v) is a critical factorization of x otherwise (u′, v′) is.
Moreover, |u| < period(x) and |u′| < period(x).
[Crochemore, Perrin, 1992]
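[Examples, each word shown with its two maximal suffixes: for a b a a b a a, v = b a a b a a and v′ = a a b a a; for a b a b a a b b a b a b a, v = b b a b a b a and v′ = a a b b a b a b a.]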
Proof
Four cases, with w the shortest repetition at (u, v), assuming |v| ≤ |v′|
⋆ w suffix of u and v prefix of w
wv is a suffix of x and v ≤ w < wv, impossible since v is maximal
⋆ w suffix of u and w prefix of v
v = wz; z < v implies v = wz < wv, with wv a suffix of x, impossible
⋆ u suffix of w and v prefix of w
|w| is a period of x, so (u, v) is a critical factorization
⋆ u suffix of w and w prefix of v
v = wz and v′ = yv; z < v; yz ≺ yv implies z ≺ v;
z is then a prefix of v, hence a border of v;
|w| is a period of v and of x.
Computing Maximal Suffixes
⋆ v maximal suffix of x; |w| its period; w′ proper prefix of w
x = u v with v = w w · · · w′
Example: x = a b a c b c b a c b c b a c b c, u = a b a, w = c b c b a, w′ = c b c
⋆ The next letter is compared with the letter that follows w′ in w (here b)
⋆ Match (b): the periodicity continues, new w′ = c b c b
⋆ Smaller letter (a): new w, border-free (the whole maximal suffix)
⋆ Greater letter (c): new u and recomputation on the rest
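The three cases above correspond to the branches of the classical linear-time maximal-suffix computation, sketched below in Python. The variable names are chosen here and the word is assumed non-empty; `ms + 1` is the start of the current maximal suffix, `per` its period, and `j + k` the position of the next letter.

```python
def maximal_suffix(x):
    """Return (start, per): start index of the maximal suffix of x and its period."""
    n = len(x)
    ms, j, k, per = -1, 0, 1, 1
    while j + k < n:
        a, b = x[j + k], x[ms + k]   # next letter vs. the letter one period earlier
        if a < b:
            # Smaller letter: the suffix stays maximal but becomes border-free.
            j += k
            k = 1
            per = j - ms
        elif a == b:
            # Match: the periodicity continues.
            if k != per:
                k += 1
            else:
                j += per
                k = 1
        else:
            # Greater letter: new start, recompute on the rest.
            ms = j
            j = ms + 1
            k = per = 1
    return ms + 1, per


print(maximal_suffix("abacbcbacbcbacbc"))  # (3, 5)
```

On the slide's word this gives u = aba (start 3) and period 5, that is w = cbcba.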
Perfect factorization
Theorem 3 Any non-empty word x can be factorized into u · v with both:
• |u| < 2 period(v) and
• v starts with at most one cube of a primitive word.
[Galil, Seiferas, 1983], [Crochemore, Rytter, 1994]
[Mignosi, Restivo, Salemi, 1995] see [Lothaire, 2002]
⋆ Example word with period = 10
a a a b a a b a a b a a a b a a b a a b a a a b a a · · ·
⋆ Leads to a time-space optimal string-matching algorithm
Square prefixes
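[Figure: three squares u u, v v, w w starting at the same position.]
Lemma 5 (Three prefix squares)
If u² is a prefix of v², v² is a prefix of w², and u is primitive, then |u| + |v| ≤ |w|.
[Crochemore, Rytter, 1995]
Example with |u| + |v| = |w|:
|w| = 10: a a b a a b a a a b a a b a a b a a a b
|v| = 7: a a b a a b a a a b a a b a
|u| = 3: a a b a a b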
Squares in a word
Lemma 6 ([Fraenkel, Simpson, 1998])
No more than 2n distinct squares with primitive root occur in a word of length n.
[Figure: three squares u u, v v, w w starting at the same position of y; they cannot all be at their rightmost position in y.]
Direct proofs [Hickerson, 2004], [Ilie, 2005]
Best bound: 2n− Θ(log n) [Ilie, 2005]
Computation in linear time [Gusfield, Stoye, 1998]
Lemma 7 ([Crochemore, 1981], [Gusfield, Stoye, 1999])
The maximal number of occurrences of squares (with primitive root) in a word of length n is c n log n.
The maximum is reached by Fibonacci strings.
Theorem 4 ([Kolpakov, Kucherov, 1998])
The number of runs (maximal powers) in a word is linear in its length.
Sequencing and fragment assembly
⋆ “Reading” a biological molecular sequence
⋆ Straight sequencing for short fragments: ≤ 500 bases
⋆ Splicing long sequences, “shotgun sequencing”
=⇒ reconstruction problem
⋆ Difficulties:
– loss of orientation (double-stranded DNA)
– overlaps with errors between fragments
⋆ Formalization: F set of fragments, ε ∈ [0, 1)
compute s, a short sequence for which
∀f ∈ F, ∃a fragment of s with d(a, f) ≤ ε|a| or d(a, f̄) ≤ ε|a|, where f̄ is f read in the opposite orientation
Shortest Common Superstring
⋆ Problem: let F = {f1, f2, . . . , fn} be a factor code (no fi is a factor of another fj)
Compute a shortest word s in which each fi occurs
⋆ SCS(F) = |s|
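s = T A A T A T T A T A (positions counted from 0)
f1 = A T A T (position 2 in s)
f2 = T A T T (position 3)
f3 = T T A T (position 5)
f4 = T A T A (position 6)
f5 = T A A T (position 0)
f6 = A A T A (position 1)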
Lemma 8 Computing SCS(F) is NP-hard.
Greedy Approximation
⋆ while card F > 1 do
assemble two fragments having the longest overlap
Theorem 5 The sequence s produced by the greedy algorithm
satisfies |s| ≤ 4× SCS(F).
⋆ Conjecture : |s| ≤ 2× SCS(F)
⋆ Variations based on the computation of a Hamiltonian cycle of minimal weight
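The greedy loop can be sketched in Python as follows. This is an illustration only, not the implementation behind Theorem 5: the names `overlap` and `greedy_superstring` are hypothetical, overlaps are found by a naive scan, and ties between equally long overlaps are broken arbitrarily.

```python
def overlap(f, g):
    """Length of the longest proper suffix of f that is also a proper prefix of g."""
    for k in range(min(len(f), len(g)) - 1, 0, -1):
        if f.endswith(g[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Merge the two words with the longest overlap until a single word remains."""
    words = list(dict.fromkeys(fragments))  # remove duplicates, keep order
    if not words:
        return ""
    while len(words) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, f in enumerate(words):
            for j, g in enumerate(words):
                if i != j:
                    k = overlap(f, g)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Glue words[best_j] onto words[best_i], sharing the overlapping letters.
        merged = words[best_i] + words[best_j][best_k:]
        words = [w for t, w in enumerate(words) if t not in (best_i, best_j)]
        words.append(merged)
    return words[0]


F = ["ATAT", "TATT", "TTAT", "TATA", "TAAT", "AATA"]
s = greedy_superstring(F)
print(s, len(s))  # a superstring of F; its length depends on tie-breaking
```

Each round scans all ordered pairs, so the sketch is far from the best known running times; it only shows the merging rule.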
Graph of Overlaps
⋆ F = {f1, f2, . . . , fn}, factor code
⋆ Overlaps
ov(fi, fj) = max{|v| | fi = uv and fj = vw}
⋆ Nodes = {f1, f2, . . . , fn}
Arcs = {(fi, ov(fi, fj), fj) | 1 ≤ i, j ≤ n, i ≠ j}
⋆ ov(f, g) computed in time O(min(|f|, |g|))
[Morris and Pratt, 1970]
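s  = A T A T T A T A T
f1 = A T A T
f2 =   T A T T
f3 =       T T A T
f4 =         T A T A
f1 =           A T A T
ov(f1, f2) = 3, ov(f2, f3) = 2, ov(f3, f4) = 3, ov(f4, f1) = 3
[Figure: graph on the nodes f1, f2, f3, f4; the arcs of the cycle f1 → f2 → f3 → f4 → f1 carry the overlaps 3, 2, 3, 3]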
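A possible way to compute ov(f, g) with the border table of Morris and Pratt, sketched in Python. The names `border_table` and `ov` and the use of `None` as a separator are choices made for this sketch; only the last m letters of f and the first m letters of g are scanned, with m = min(|f|, |g|). As in the definition of ov, the overlap v may be the whole of the shorter word.

```python
def border_table(x):
    """border[k] = length of the longest border of x[:k]; border[0] = -1."""
    border = [0] * (len(x) + 1)
    border[0] = -1
    k = -1
    for i, c in enumerate(x):
        while k >= 0 and x[k] != c:
            k = border[k]
        k += 1
        border[i + 1] = k
    return border


def ov(f, g):
    """Length of the longest suffix of f that is a prefix of g."""
    m = min(len(f), len(g))
    if m == 0:
        return 0
    # The separator None matches no letter, so a border never crosses it.
    x = list(g[:m]) + [None] + list(f[-m:])
    return border_table(x)[-1]


print(ov("ATAT", "TATT"), ov("TATT", "TTAT"), ov("TTAT", "TATA"), ov("TATA", "ATAT"))
# prints: 3 2 3 3
```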
Graph of Prefixes
⋆ F = {f1, f2, . . . , fn}, factor code
⋆ Prefixes
pr(fi, fj) = |fi| − ov(fi, fj)
⋆ Nodes = {f1, f2, . . . , fn}
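Arcs = {(fi, pr(fi, fj), fj) | 1 ≤ i, j ≤ n, i ≠ j}
s  = A T A T T A T A T
f1 = A T A T
f2 =   T A T T
f3 =       T T A T
f4 =         T A T A
f1 =           A T A T
pr(f1, f2) = 1, pr(f2, f3) = 2, pr(f3, f4) = 1, pr(f4, f1) = 1; weight of the cycle: 1 + 2 + 1 + 1 = 5
[Figure: graph on the nodes f1, f2, f3, f4; the arcs of the cycle f1 → f2 → f3 → f4 → f1 carry the prefix lengths 1, 2, 1, 1]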
Technique
⋆ A cycle $(f_{i_1}, f_{i_2}, \ldots, f_{i_k}, f_{i_1})$ produces a superstring s of length:
$$|s| = \sum_{j=1}^{k} |f_{i_j}| - \sum_{j=1}^{k-1} \mathrm{ov}(f_{i_j}, f_{i_{j+1}}) = \sum_{j=1}^{k-1} \mathrm{pr}(f_{i_j}, f_{i_{j+1}}) + |f_{i_k}|$$
⋆ Computation of a Hamiltonian cycle of maximal weight
in the graph of overlaps
⋆ . . . or of minimal weight in the graph of prefixes
⋆ NP-hard problems
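The length formula can be checked on the fragments f1, f2, f3, f4 of the previous slides. The sketch below uses hypothetical helper names (`ov`, `pr`, `assemble`) and a naive overlap computation; it opens the cycle after f4, so the final return to f1 is not appended.

```python
def ov(f, g):
    """Longest proper overlap: a suffix of f that is a prefix of g (naive scan)."""
    for k in range(min(len(f), len(g)) - 1, 0, -1):
        if f.endswith(g[:k]):
            return k
    return 0


def pr(f, g):
    """Length of the prefix of f left uncovered by the overlap with g."""
    return len(f) - ov(f, g)


def assemble(order):
    """Superstring obtained by overlapping consecutive words of `order`."""
    s = order[0]
    for f, g in zip(order, order[1:]):
        s += g[ov(f, g):]
    return s


order = ["ATAT", "TATT", "TTAT", "TATA"]  # f1, f2, f3, f4
pairs = list(zip(order, order[1:]))
s = assemble(order)
by_overlaps = sum(len(f) for f in order) - sum(ov(f, g) for f, g in pairs)
by_prefixes = sum(pr(f, g) for f, g in pairs) + len(order[-1])
print(s, len(s), by_overlaps, by_prefixes)  # ATATTATA 8 8 8
```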
3×OPT Approximation
⋆ Based on cycle covers of the graph
⋆ Polynomial running time
⋆ Schema:
(i) compute a cycle cover of minimal weight for the graph of prefixes
(ii) open cycles
(iii) s = assembly with overlaps of words obtained in (ii)
⋆ Approximation : |s| ≤ 3× SCS(F)
⋆ Proof based on the Overlap Lemma
[Blum, Jiang, Li, Tromp, Yannakakis, 1994]
Periodicities
⋆ Word x, integer p, 0 < p ≤ |x|; p is a period of x if x[i] = x[i + p] for 1 ≤ i ≤ |x| − p
⋆ period(x) = smallest period of x
Lemma 9 (Overlap Lemma) Let u and v be words, p = period(u), q = period(v).
If u[1 . . p] and v[1 . . q] are not conjugate, then ov(u, v) < p + q − GCD(p, q).
⋆ Example: period(baabaabaa) = 3, period(aabaaabaaaba) = 4
ov(baabaabaa, aabaaabaaaba) = |aabaa| = 5 < 3 + 4 − 1
⋆ Straight corollary of the Periodicity Lemma
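The example of the Overlap Lemma can be replayed with a few lines of Python. The helpers `period`, `ov` and `conjugate` are naive and named only for this sketch; words are assumed non-empty.

```python
from math import gcd


def period(x):
    """Smallest p > 0 such that x[i] == x[i + p] wherever both sides are defined."""
    n = len(x)
    for p in range(1, n + 1):
        if all(x[i] == x[i + p] for i in range(n - p)):
            return p


def ov(u, v):
    """Longest proper overlap: a suffix of u that is a prefix of v (naive scan)."""
    for k in range(min(len(u), len(v)) - 1, 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0


def conjugate(a, b):
    """True when b is a rotation of a."""
    return len(a) == len(b) and a in b + b


u, v = "baabaabaa", "aabaaabaaaba"
p, q = period(u), period(v)
print(p, q, conjugate(u[:p], v[:q]), ov(u, v), p + q - gcd(p, q))
# prints: 3 4 False 5 6
```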
Better Approximation
⋆ Based on a more efficient cycle opening
⋆ Schema :
(i) compute a cycle cover of minimal weight for the graph of prefixes
(ii) open cycles at positions deduced from the Rotation Lemma
(iii) s = assembly with overlaps of words obtained in (ii)
⋆ Approximation: |s| ≤ (2 + 2/3) × SCS(F)
⋆ Proof based on the Rotation Lemma
[Jiang, Jiang, Breslauer, 1996]
Rotation Lemma
Lemma 10 (Rotation Lemma)
Let x and p = period(x) be such that |x| ≥ 3p. There is a factorization
u · v of x with both:
• |u| < p, and
• each prefix w of v with q = period(w) < p satisfies |w| ≤ (2/3)(p + q).
[Breslauer, Jiang, Jiang, 1996]
[Figure: x with prefix y y y, |y| = p; the word w with period q, drawn as z z with |z| = q, has length ≤ (2/3)(p + q)]
⋆ The bound is tight: at each position < 2n + 3 of x = (a^n b a^{n+1} b)^3
starts a factor of length ≥ 2n + 2 having a period ≥ n + 1
Choice of the rotation
⋆ y^3 prefix of x with |y| = period(x) = p
⋆ ≤ ordering on the alphabet, ⪯ inverse ordering
⋆ r conjugate of y that is a Lyndon word according to ≤
⋆ t conjugate of y that is a Lyndon word according to ⪯
⋆ i first position of r on x, j first position of t on x
⋆ If i < j and j − i ≤ p/2, we choose u such that y = uu′, r = u′u
[Figure: x = y y y · · · with |y| = p; r occurs at position i and t at position j]
⋆ If i < j and j − i > p/2, we choose u such that y = uu′, t = u′u
Proof
⋆ r Lyndon word and j − i ≤ p/2
[Figure: x = y y y · · · with |y| = p; r occurs at position i and t at position j; w = s s · · · with |s| = q]
– |w| < p since q < p and r is border-free
– If q ≥ p/2, then |w| < p ≤ (2/3)(p + q)
– Else, q < p/2, and |w| < j − i + q (Critical Factorization at j)
and j − i + q ≤ p/2 + q < (2/3)(p + q), QED
⋆ If j − i > p/2, replace r by t and t by the next occurrence of r, which
brings us back to the first case
Text Indexing
⋆ Set of factors (subwords) of a static text
⋆ Basic operations
– existence of patterns in the text
– number of occurrences of patterns
– list of positions of occurrences
⋆ Other applications
– finding repetitions in texts
– finding regularities in texts
– approximate matchings
– two-dimensional pattern matching
– . . .
Implementation of indexes
[Figure: a pattern occurring in the text is a prefix of a suffix of the text]
Implementation with efficient data structures
⋆ Suffix Trees
digital trees, PATRICIA trees (compact trees)
⋆ Suffix Automata or DAWG’s
minimal automata, compact automata
Implementation with an efficient algorithm
⋆ Suffix Arrays
binary search in the ordered list of suffixes
Suffixes
Text y ∈ A∗ of length n
⋆ Suff(y) = set of suffixes of y
⋆ Suff(ababbb)
i      0 1 2 3 4 5
y[i]   a b a b b b
suffix        position
a b a b b b   0
b a b b b     1
a b b b       2
b b b         3
b b           4
b             5
ε             6 (empty string)
Suffix Structures
⋆ Suffix trie of ababbb
[Figure: trie with 16 nodes numbered 0 to 15, edges labelled by single letters]
⋆ Its suffix automaton
[Figure: automaton with states 0 to 6 plus 2′ and 5′, transitions labelled by single letters]
⋆ Its suffix tree
[Figure: compact tree with 8 nodes numbered 0 to 7, edges labelled ab, b, bb and abbb]
⋆ Its compact suffix automaton
[Figure: automaton with transitions labelled ab, b, bb and abbb]
Linear size data structures and O(n log card A) construction time, except for suffix tries
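To see why suffix tries are the exception, here is a naive construction sketched in Python with nested dictionaries (one dictionary per node). The names `suffix_trie`, `count_nodes` and `is_factor` are hypothetical. The trie has one node per distinct factor of the text, which can be quadratic in n.

```python
def suffix_trie(y):
    """Trie of all suffixes of y, as nested dictionaries."""
    root = {}
    for i in range(len(y) + 1):
        node = root
        for letter in y[i:]:
            node = node.setdefault(letter, {})
    return root


def count_nodes(node):
    """Number of nodes in the subtrie rooted at `node`."""
    return 1 + sum(count_nodes(child) for child in node.values())


def is_factor(trie, x):
    """True when x labels a path from the root, i.e. x occurs in the text."""
    node = trie
    for letter in x:
        if letter not in node:
            return False
        node = node[letter]
    return True


trie = suffix_trie("ababbb")
print(count_nodes(trie))                          # 16
print(is_factor(trie, "abb"), is_factor(trie, "aa"))  # True False
```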
Suffix Array
⋆ Structure composed of:
lexicographically sorted list of non-empty suffixes, and
maximal lengths of common prefixes between suffixes consecutive in the list
⋆ Suffix array of ababbb
j      0 1 2 3 4 5
y[j]   a b a b b b
k   SUF[k]   LCP[k]   suffix
0   0        0        a b a b b b
1   2        2        a b b b
2   5        0        b
3   1        1        b a b b b
4   4        1        b b
5   3        2        b b b
⋆ Benefit: binary search combined with additional combinatorial features (the LCP values)
leads to O(m + log n) searching time for a pattern of length m (instead of O(m × log n) with plain binary search)
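A small Python sketch of the structure and of the plain O(m × log n) binary search. The function names are hypothetical and the construction sorts suffixes naively; the O(m + log n) refinement based on LCP values is not implemented here.

```python
def suffix_array(y):
    """SUF: starting positions of the non-empty suffixes in lexicographic order."""
    return sorted(range(len(y)), key=lambda i: y[i:])


def lcp_len(u, v):
    """Length of the longest common prefix of u and v."""
    k = 0
    while k < len(u) and k < len(v) and u[k] == v[k]:
        k += 1
    return k


def lcp_array(y, suf):
    """LCP[k] for consecutive suffixes in the sorted list, with LCP[0] = 0."""
    return [0] + [lcp_len(y[suf[k - 1]:], y[suf[k]:]) for k in range(1, len(suf))]


def occurrences(y, suf, x):
    """Positions of x in y, found by two binary searches over SUF."""
    m = len(x)
    lo, hi = 0, len(suf)
    while lo < hi:                      # first suffix whose m-prefix is >= x
        mid = (lo + hi) // 2
        if y[suf[mid]:suf[mid] + m] < x:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(suf)
    while lo < hi:                      # first suffix whose m-prefix is > x
        mid = (lo + hi) // 2
        if y[suf[mid]:suf[mid] + m] <= x:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suf[start:lo])


y = "ababbb"
suf = suffix_array(y)
print(suf, lcp_array(y, suf))           # [0, 2, 5, 1, 4, 3] [0, 2, 0, 1, 1, 2]
print(occurrences(y, suf, "ab"))        # [0, 2]
```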
Searching—Case one
⋆ Hypotheses
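Ld < x < Lf and ld ≤ |lcp(Li, Lf)| < lf
(Ld and Lf: suffixes bounding the current interval of the sorted list, Li: middle suffix, ld = |lcp(x, Ld)|, lf = |lcp(x, Lf)|)
⋆ Example
Ld   a a a c a
     a a a c b a
Li   a a b b a b a
     a a b b a b b
Lf   a a b b b a b
x    a a b b b a a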
⋆ Conclusion
Li < x < Lf and |lcp(x, Li)| = |lcp(Li, Lf)|
Searching—Case two
⋆ Hypotheses
Ld < x < Lf and ld ≤ lf < |lcp(Li, Lf)|
⋆ Example
Ld   a a a c a
     a a a c b a
Li   a a b b a b a
     a a b b a b b
Lf   a a b b b a b
x    a a b a c b
⋆ Conclusion
Ld < x < Li and |lcp(x, Li)| = |lcp(x, Lf)|
Searching—Case three
⋆ Hypotheses
Ld < x < Lf and ld ≤ lf = |lcp(Li, Lf)|
⋆ Example
Ld   a a a c a
     a a a c b a
Li   a a b b a b a
     a a b b a b b
Lf   a a b b b a b
x    a a b b a b
⋆ Conclusion
compare x and Li from position lf
Sorting Suffixes on a Bounded Integer Alphabet
⋆ Schema
[1] bucket sort positions i according to first3(y[i . . n− 1]), for i = 3q or i = 3q + 1
t[i]: rank of i in the sorted list
[2] recursively sort the suffixes of the 2/3-shorter word
t[0]t[3] · · · t[3q] · · · t[1]t[4] · · · t[3q + 1] · · ·
s[i]: rank of suffix i in the sorted list (i = 3q or i = 3q + 1)
[3] sort suffixes y[j . . n− 1] for j of the form 3q + 2 (i.e., bucket sort pairs (y[j], s[j + 1]))
[4] merge lists obtained at steps 2 and 3
Note: comparing suffixes i (first list) and j (second list) amounts to comparing:
(y[i], s[i + 1]) and (y[j], s[j + 1]) if i = 3q
(y[i]y[i + 1], s[i + 2]) and (y[j]y[j + 1], s[j + 2]) if i = 3q + 1
⋆ Running time: T(n) = T(2n/3) + O(n), hence T(n) = O(n)
[Kärkkäinen, Sanders, 2003]
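The four steps of the schema can be sketched in Python as below. This is a structural illustration, not a linear-time implementation: Python's comparison sort stands in for bucket sort, the names `sort_suffixes` and `suffix_array` are hypothetical, and an empty dummy triple at position n (added when n is a multiple of 3) is an assumption of this sketch that keeps the two groups of the reduced word apart.

```python
def sort_suffixes(y):
    """Suffix array of a list of non-negative integers, following steps [1] to [4]."""
    n = len(y)
    if n == 0:
        return []

    def letter(k):
        return y[k] if k < n else -1          # -1 stands for "past the end"

    # [1] name the triples starting at positions 3q and 3q + 1
    sample = list(range(0, n, 3))
    if n % 3 == 0:
        sample.append(n)                      # empty dummy triple (assumption)
    sample += list(range(1, n, 3))
    triples = [tuple(y[i:i + 3]) for i in sample]
    name = {t: r for r, t in enumerate(sorted(set(triples)))}
    reduced = [name[t] for t in triples]

    # [2] sort the suffixes of the reduced word, recursively if names repeat
    if len(name) == len(reduced):
        order = sorted(range(len(reduced)), key=lambda k: reduced[k])
    else:
        order = sort_suffixes(reduced)
    first = [sample[k] for k in order]
    s = {pos: r for r, pos in enumerate(first)}

    def rank(k):
        return s.get(k, -1)

    # [3] sort the suffixes at positions 3q + 2 by the pairs (y[j], s[j + 1])
    second = sorted(range(2, n, 3), key=lambda j: (y[j], rank(j + 1)))

    # [4] merge the two lists with the comparisons of the note
    def key(k, i):
        if i % 3 == 0:
            return (letter(k), rank(k + 1))
        return (letter(k), letter(k + 1), rank(k + 2))

    first = [i for i in first if i < n]       # drop the dummy position
    merged, a, b = [], 0, 0
    while a < len(first) and b < len(second):
        i, j = first[a], second[b]
        if key(i, i) < key(j, i):
            merged.append(i)
            a += 1
        else:
            merged.append(j)
            b += 1
    return merged + first[a:] + second[b:]


def suffix_array(text):
    return sort_suffixes([ord(c) for c in text])


print(suffix_array("aabaabaabba"))  # [10, 0, 3, 6, 1, 4, 7, 9, 2, 5, 8]
```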
Example
i        0  1  2  3  4  5  6  7  8  9  10
y[i]     a  a  b  a  a  b  a  a  b  b  a
Rank t   first three letters
0        a
1        a a b
2        a b a
3        a b b
4        b a
Rank s   i    Suff(11142230)
0        10   0
1        0    1 1 1 4 2 2 3 0
2        3    1 1 4 2 2 3 0
3        6    1 4 2 2 3 0
4        1    2 2 3 0
5        4    2 3 0
6        7    3 0
7        9    4 2 2 3 0
Rank   j   (y[j], s[j + 1])
0      2   (b, 2)
1      5   (b, 3)
2      8   (b, 7)
i        0  1  2  3  4  5  6  7  8  9  10
y[i]     a  a  b  a  a  b  a  a  b  b  a
r[i]     1  4  8  2  5  9  3  6  10 7  0
SUF[i]   10 0  3  6  1  4  7  9  2  5  8
Computing LCP’s of suffixes
⋆ LCP’s of suffixes
LCP[i] = |lcp(y[SUF[i − 1] . . n − 1], y[SUF[i] . . n − 1])| for 0 < i < n, and LCP[0] = LCP[n] = 0
i        0  1  2  3  4  5  6  7  8  9  10 11
y[i]     a  a  b  a  a  b  a  a  b  b  a
SUF[i]   10 0  3  6  1  4  7  9  2  5  8
LCP[i]   0  1  6  3  1  5  2  0  2  4  1  0
j   RANK[j]   suffix
0   1         a a b a a b a a b b a
3   2         a a b a a b b a
1   4         a b a a b a a b b a
4   5         a b a a b b a
Lemma 11 Let j ∈ {1, 2, . . . , n − 1} with RANK[j] > 0.
Then LCP[RANK[j − 1]]− 1 ≤ LCP[RANK[j]].
LCP algorithm
⋆ Rank RANK: RANK[j] = rank of suffix at position j
(RANK = SUF⁻¹)
⋆ Permutation SUF: SUF[k] = position of suffix of rank k
Lcp(y, n, SUF, RANK)
    ℓ ← 0
    for j ← 0 to n − 1 do
        ℓ ← max{0, ℓ − 1}
        if RANK[j] > 0 then
            i ← SUF[RANK[j] − 1]
            while i + ℓ < n and j + ℓ < n and y[i + ℓ] = y[j + ℓ] do
                ℓ ← ℓ + 1
            LCP[RANK[j]] ← ℓ
    LCP[0] ← 0
    LCP[n] ← 0
    return LCP
⋆ Running time: O(n)
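A Python transcription of Lcp, given as a sketch. The function name `lcp_table` is hypothetical, RANK is recomputed from SUF instead of being passed in, and explicit bounds tests stop the letter comparisons at the end of the text.

```python
def lcp_table(y, suf):
    """LCP table of length n + 1 computed from the text y and its suffix array."""
    n = len(y)
    rank = [0] * n
    for k, pos in enumerate(suf):
        rank[pos] = k                      # RANK is the inverse of SUF
    lcp = [0] * (n + 1)                    # LCP[0] and LCP[n] stay 0
    l = 0
    for j in range(n):
        l = max(0, l - 1)                  # Lemma 11: the value drops by at most 1
        if rank[j] > 0:
            i = suf[rank[j] - 1]           # suffix just before suffix j in the list
            while i + l < n and j + l < n and y[i + l] == y[j + l]:
                l += 1
            lcp[rank[j]] = l
    return lcp


y = "aabaabaabba"
suf = [10, 0, 3, 6, 1, 4, 7, 9, 2, 5, 8]
print(lcp_table(y, suf))  # [0, 1, 6, 3, 1, 5, 2, 0, 2, 4, 1, 0]
```

The variable l only decreases by one per iteration of the outer loop, which is what bounds the total number of letter comparisons by O(n).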
Burrows-Wheeler Transform
⋆ Text w = a1a2 · · · an a primitive word
w1, w2, . . . , wn sequence of conjugates of w in increasing order
bi = last letter of wi, then T (w) = b1b2 · · · bn
⋆ T(abracadabra) = rdarcaaaabb
 1   a a b r a c a d a b r
 2   a b r a a b r a c a d
 3   a b r a c a d a b r a
 4   a c a d a b r a a b r
 5   a d a b r a a b r a c
 6   b r a a b r a c a d a
 7   b r a c a d a b r a a
 8   c a d a b r a a b r a
 9   d a b r a a b r a c a
10   r a a b r a c a d a b
11   r a c a d a b r a a b
⋆ T (w) depends only on the conjugacy class of w
we may suppose that w is a Lyndon word, i.e. that w = w1
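The definition translates directly into a few lines of Python. This sketch sorts the conjugates explicitly, which takes quadratic space; the names `sorted_conjugates` and `bwt` are hypothetical.

```python
def sorted_conjugates(w):
    """Conjugates (rotations) of w in increasing lexicographic order."""
    return sorted(w[i:] + w[:i] for i in range(len(w)))


def bwt(w):
    """T(w): last letters of the sorted conjugates of w."""
    return "".join(c[-1] for c in sorted_conjugates(w))


print(bwt("abracadabra"))                          # rdarcaaaabb
print(bwt("aabracadabr") == bwt("abracadabra"))    # True: same conjugacy class
print(sorted_conjugates("abracadabra")[0])         # aabracadabr, the Lyndon conjugate
```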
Text Compression
⋆ Basis of bzip text compression
– initial text
aabracadabr
– BW Transform: T (aabracadabr) =
rdarcaaaabb
– Either run-length encoding (using ASCII code):
rdarca4b2
– ... or move-to-front encoding, roughly like encoding of:
17, 4, 2, 2, 4, 2, 0, 0, 0, 4, 0
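Both encodings of the transformed text can be sketched in Python. Two assumptions of this sketch, not stated on the slide: a run is written as the letter followed by its length when the length exceeds 1, and move-to-front starts from the list a, b, . . . , z with ranks counted from 0. The function names are hypothetical, and this is not the actual bzip format.

```python
from itertools import groupby
import string


def run_length(s):
    """Write each maximal run as the letter, followed by its length if > 1."""
    out = []
    for letter, run in groupby(s):
        k = len(list(run))
        out.append(letter + (str(k) if k > 1 else ""))
    return "".join(out)


def move_to_front(s, alphabet=string.ascii_lowercase):
    """Replace each letter by its current rank, then move it to the front."""
    table = list(alphabet)
    ranks = []
    for letter in s:
        k = table.index(letter)
        ranks.append(k)
        table.insert(0, table.pop(k))
    return ranks


print(run_length("rdarcaaaabb"))     # rdarca4b2
print(move_to_front("rdarcaaaabb"))
```

Runs of equal letters in the transformed text become runs of 0 after move-to-front, which is what a later entropy coder exploits.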
Related works
⋆ Reversible transformation of a word used for text compression, used
in bzip [Burrows, Wheeler, 1994]
⋆ Analysis of compression by [Manzini, 2001]
⋆ Related to combinatorics on words, Sturmian words
[Mantaci, Restivo, Sciortino, 2003]
⋆ Particular case of a bijection due to Gessel and Reutenauer
[Crochemore, Désarménien, Perrin, 2005]
⋆ Linear-time computation (adapting suffix sorting)
[Kärkkäinen, Sanders, 2003][Kim, Sim, Park, Park, 2003]
[Ko, Aluru, 2003][Nong, Zhang, Chan, 2009]
Permutation σ
⋆ Permutation σ: σ(i) = rank of i-th circular shift
⋆ w = aabracadabr
i σ(i)
1 1 a a b r a c a d a b r
2 3 a b r a c a d a b r a
3 7 b r a c a d a b r a a
4 11 r a c a d a b r a a b
5 4 a c a d a b r a a b r
6 8 c a d a b r a a b r a
7 5 a d a b r a a b r a c
8 9 d a b r a a b r a c a
9 2 a b r a a b r a c a d
10 6 b r a a b r a c a d a
11 10 r a a b r a c a d a b
σ⁻¹(j) j
1 1 a a b r a c a d a b r
9 2 a b r a a b r a c a d
2 3 a b r a c a d a b r a
5 4 a c a d a b r a a b r
7 5 a d a b r a a b r a c
10 6 b r a a b r a c a d a
3 7 b r a c a d a b r a a
6 8 c a d a b r a a b r a
8 9 d a b r a a b r a c a
11 10 r a a b r a c a d a b
4 11 r a c a d a b r a a b
Permutation π
⋆ Permutation P(w) = π: π(i) = σ(σ⁻¹(i) + 1), positions being taken cyclically (n + 1 is identified with 1)
⋆ w = aabracadabr
σ as an array
i      1  2  3  4  5  6  7  8  9  10 11
σ(i)   1  3  7  11 4  8  5  9  2  6  10
π as a cycle
π = (1 3 7 11 4 8 5 9 2 6 10)
π as an array
i      1  2  3  4  5  6  7  8  9  10 11
π(i)   3  6  7  8  9  10 11 5  2  1  4
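The two permutations can be recomputed with a short Python sketch. The function names are hypothetical, w is assumed primitive (so that its conjugates are pairwise distinct), and the lists returned are 0-indexed containers of 1-based values: `sig[i - 1]` holds σ(i).

```python
def sigma(w):
    """sig[i - 1] = rank (from 1) of the i-th circular shift of w."""
    n = len(w)
    order = sorted(range(n), key=lambda i: w[i:] + w[:i])
    sig = [0] * n
    for rank, i in enumerate(order, start=1):
        sig[i] = rank
    return sig


def pi_from_sigma(sig):
    """pi(r) = sigma(sigma^-1(r) + 1), positions taken cyclically."""
    n = len(sig)
    inv = [0] * (n + 1)
    for i, r in enumerate(sig, start=1):
        inv[r] = i                         # inv[r] = sigma^-1(r), 1-based
    # Position inv[r] + 1 has 0-based index inv[r]; the modulo wraps n + 1 to 1.
    return [sig[inv[r] % n] for r in range(1, n + 1)]


sig = sigma("aabracadabr")
print(sig)                  # [1, 3, 7, 11, 4, 8, 5, 9, 2, 6, 10]
print(pi_from_sigma(sig))   # [3, 6, 7, 8, 9, 10, 11, 5, 2, 1, 4]
```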
Inverse Transformation
⋆ w = a1a2 · · · an
y = b1b2 · · · bn with bi = last letter of wi
z = c1c2 · · · cn with ci = first letter of wi
⋆ a_i = c_{σ(i)}
b_i = a_{σ⁻¹(i) − 1}
a_i = c_{π^{i−1}(1)}
c_i = b_{π(i)}
⋆ Property:
i < j and ci = cj ⇒ π(i) < π(j)
Occurrences of a in w appear in the same
relative order in both y and z
⋆ Linear-time computation of π and w from y
⋆ w = a1 a2 b3 r4 a5 c6 a7 d8 a9 b10 r11 (each letter is tagged with its position in w)
y = r11 d8 a1 r4 c6 a9 a2 a5 a7 b10 b3
i ci bπ(i)
1 a1 · · · r11
2 a9 · · · d8
3 a2 · · · a1
4 a5 · · · r4
5 a7 · · · c6
6 b10 · · · a9
7 b3 · · · a2
8 c6 · · · a5
9 d8 · · · a7
10 r11 · · · b10
11 r4 · · · b3
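The inversion can be sketched in Python from the property above: since equal letters keep their relative order in y and z, π(i) is the position in y of the i-th letter of z, which a stable sort of the positions of y provides. The function names are hypothetical, and `sorted` is used for brevity where a counting sort would give linear time.

```python
def pi_from_transform(y):
    """pi as a list: pi[i - 1] = position (from 1) in y of the i-th letter of z."""
    # sorted() is stable, so equal letters keep the order they have in y.
    return [p + 1 for p in sorted(range(len(y)), key=lambda p: y[p])]


def inverse_transform(y):
    """Recover w from y = T(w) using a_i = c at position pi^(i-1)(1)."""
    z = sorted(y)                      # z = c1 c2 ... cn, the first letters
    pi = pi_from_transform(y)
    w, i = [], 1
    for _ in range(len(y)):
        w.append(z[i - 1])
        i = pi[i - 1]
    return "".join(w)


print(pi_from_transform("rdarcaaaabb"))   # [3, 6, 7, 8, 9, 10, 11, 5, 2, 1, 4]
print(inverse_transform("rdarcaaaabb"))   # aabracadabr
```

The word returned is the smallest conjugate w1, in line with the remark that w may be assumed to be a Lyndon word.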
Descents of permutations
⋆ Descent of permutation P (w) = π: i such that π(i) > π(i + 1)
⋆ for π = P (aabracadabr): des(π) = {7, 8, 9}
i      1  2  3  4  5  6  7  8  9  10 11
π(i)   3  6  7  8  9  10 11 5  2  1  4
⋆ Parikh vector of w: v = (n1, n2, . . . , nk) where k = #Alphabet and ni is the number of occurrences of the i-th letter in w,
with n1 + n2 + · · · + nk = n = |w|
⋆ ρ(v) = {n1, n1 + n2, . . . , n1 + · · · + nk−1}
⋆ for π = P(aabracadabr): v = (5, 2, 1, 1, 2) and ρ(v) = {5, 7, 8, 9}
⋆ From property: des(π) ⊆ ρ(v)
Theorem 6 Let v = (n1, n2, . . . , nk) be a positive vector, n1 + n2 + · · · + nk = n.
The map P : w ↦ π is one-to-one from the set of conjugacy classes of primitive words of length n on k letters with Parikh vector v onto the set of cyclic permutations on {1, 2, . . . , n} such that des(π) ⊆ ρ(v).
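The quantities of this slide are easy to recompute; the following Python sketch uses hypothetical function names and represents π as a list with `pi[i - 1]` = π(i).

```python
from collections import Counter
from itertools import accumulate


def descents(pi):
    """Set of positions i (from 1) such that pi(i) > pi(i + 1)."""
    return {i for i in range(1, len(pi)) if pi[i - 1] > pi[i]}


def parikh_vector(w):
    """Numbers of occurrences of the letters of w, in alphabetical order."""
    counts = Counter(w)
    return tuple(counts[a] for a in sorted(counts))


def rho(v):
    """Partial sums n1, n1 + n2, ..., n1 + ... + n(k-1)."""
    return set(accumulate(v[:-1]))


pi = [3, 6, 7, 8, 9, 10, 11, 5, 2, 1, 4]   # P(aabracadabr)
v = parikh_vector("aabracadabr")
print(descents(pi), v, rho(v))             # {7, 8, 9} (5, 2, 1, 1, 2) {5, 7, 8, 9}
print(descents(pi) <= rho(v))              # True
```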
From permutation to words
⋆ Words w having a given cyclic permutation π: P (w) = π
No Parikh vector given
⋆ π = P (aabracadabr)
i      1  2  3  4  5  6  7  8  9  10 11
π(i)   3  6  7  8  9  10 11 5  2  1  4
π as a cycle = (1 3 7 11 4 8 5 9 2 6 10)
on 5 letters   z = a a a a a b b c d r r
               w = a a b r a c a d a b r
on 4 letters   z = a a a a a a a c d r r
               w = a a a r a c a d a a r
on 7 letters   z = a a b b c c c d e f g
               w = a b c g b d c e a c f
number of words (up to alphabetical renaming): 2^6 · 2^1 = 128
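The three words of the slide can be rebuilt by walking along the cycle of π and reading the letters of z. In this Python sketch the function names are hypothetical, and the count reflects one reading of the slide: each of the n − 1 positions that is not a descent of π may or may not be a letter boundary of z.

```python
def word_from_permutation(pi, z):
    """w = letters of z read at positions 1, pi(1), pi(pi(1)), ..."""
    w, i = [], 1
    for _ in pi:
        w.append(z[i - 1])
        i = pi[i - 1]
    return "".join(w)


def number_of_words(pi):
    """Words with permutation pi, up to renaming: 2^(n - 1 - #descents)."""
    n = len(pi)
    des = sum(1 for i in range(1, n) if pi[i - 1] > pi[i])
    return 2 ** (n - 1 - des)


pi = [3, 6, 7, 8, 9, 10, 11, 5, 2, 1, 4]
print(word_from_permutation(pi, "aaaaabbcdrr"))   # aabracadabr
print(word_from_permutation(pi, "aaaaaaacdrr"))   # aaaracadaar
print(word_from_permutation(pi, "aabbcccdefg"))   # abcgbdceacf
print(number_of_words(pi))                        # 128
```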
Length 4
σ          π          descents   w
1 2 3 4    2 3 4 1    1          aaab, abbc, aabc, abcd
1 2 4 3    2 4 1 3    1          aabb, abcc, aacb, abdc
1 3 2 4    3 4 2 1    2          abac, acbd
1 3 4 2    3 1 4 2    2          abcb, acdb
1 4 2 3    4 3 1 2    2          acbc, adbc
1 4 3 2    4 1 2 3    1          abbb, accb, acbb, adcb
Main references
[1] Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on Strings. Cambridge University Press, 2007. 392 pages.
[2] Maxime Crochemore and Wojciech Rytter. Jewels of Stringology. World Scientific Publishing, Hong-Kong, 2002. 310 pages.
[3] See also http://monge.univ-mlv.fr/~mac/REC/text-algorithms.pdf
or http://www.mimuw.edu.pl/~rytter/BOOKS/text-algorithms.pdf