two dimensional parameterized matching

25
Two Dimensional Parameterized Matching Richard Cole Carmit Hazay Moshe Lewenstein , Dekel Tsur § Abstract Two equal length strings, or two equal sized two-dimensional texts, parameterize match (p-match) if there is a one-one mapping (relative to the alphabet) of their characters. Two-dimensional parameterized matching is the task of finding all m × m substrings of an n×n text that p-match to an m×m pattern. This models, for example, searching for color images with changing of color maps. We present two algorithm that solve the two-dimensional parameterized matching problem. The time complexities of our algorithms are O(n 2 log 2 m) and O(n 2 + m 2.5 · polylog(m)). 1 Introduction Let S and S be two equal length strings. We say that S and S parameterize match, or p-match for short, if there is a bijection π from the alphabet of S to the alphabet of S such that S [i]= π(S [i]) for every index i. In the parameterized matching problem, introduced by Baker [8,9], one is given a text T and a pattern P and the goal is to find all the substrings of T of length |P | that p-match to P . Baker introduced parameterized matching for applications that arise in software tools for analyzing source code. Other applications for parameterized matching arise in image processing and computational biology (see [2]). In [8, 9], an optimal linear time algorithm was given for the parameterized matching problem when the alphabet has constant size. An optimal algorithm for parameterized matching in the presence of an unbounded size alphabet was given in [5]. In [8, 9], the parameterized matching problem is solved by constructing parameterized suffix trees, which also allows for online p-matching. The parameterized suffix tree was further explored by Kosaraju [16] and faster constructions were given by Cole and Hariharan [12]. In [6], approximate parameterized matching was introduced and a solution for binary alphabets was given. In [15], an O(nk 1.5 + mk log m) time algorithm was given for approxi- mate parameterized matching with k mismatches, and a strong relation was shown between this problem and maximum matchings in bipartite graphs. * New York University, [email protected]. Bar-Ilan University, {harelc,moshe}@cs.biu.ac.il. ML was supported by an IBM faculty award grant. § Ben-Gurion University of the Negev, [email protected]. 1

Upload: independent

Post on 15-Nov-2023

0 views

Category:

Documents


0 download

TRANSCRIPT

Two Dimensional Parameterized Matching

Richard Cole∗ Carmit Hazay† Moshe Lewenstein†,‡ Dekel Tsur§

Abstract

Two equal length strings, or two equal sized two-dimensional texts, parameterizematch (p-match) if there is a one-one mapping (relative to the alphabet) of theircharacters. Two-dimensional parameterized matching is the task of finding all m×m

substrings of an n×n text that p-match to an m×m pattern. This models, for example,searching for color images with changing of color maps. We present two algorithm thatsolve the two-dimensional parameterized matching problem. The time complexities ofour algorithms are O(n2 log2 m) and O(n2 + m2.5 · polylog(m)).

1 Introduction

Let S and S ′ be two equal length strings. We say that S and S ′ parameterize match, orp-match for short, if there is a bijection π from the alphabet of S to the alphabet of S ′ suchthat S ′[i] = π(S[i]) for every index i. In the parameterized matching problem, introduced byBaker [8,9], one is given a text T and a pattern P and the goal is to find all the substrings of Tof length |P | that p-match to P . Baker introduced parameterized matching for applicationsthat arise in software tools for analyzing source code. Other applications for parameterizedmatching arise in image processing and computational biology (see [2]).

In [8, 9], an optimal linear time algorithm was given for the parameterized matchingproblem when the alphabet has constant size. An optimal algorithm for parameterizedmatching in the presence of an unbounded size alphabet was given in [5]. In [8, 9], theparameterized matching problem is solved by constructing parameterized suffix trees, whichalso allows for online p-matching. The parameterized suffix tree was further explored byKosaraju [16] and faster constructions were given by Cole and Hariharan [12].

In [6], approximate parameterized matching was introduced and a solution for binaryalphabets was given. In [15], an O(nk1.5 + mk log m) time algorithm was given for approxi-mate parameterized matching with k mismatches, and a strong relation was shown betweenthis problem and maximum matchings in bipartite graphs.

∗New York University, [email protected].†Bar-Ilan University, harelc,[email protected].‡ML was supported by an IBM faculty award grant.§Ben-Gurion University of the Negev, [email protected].

1

One of the interesting problems in web searching is searching for color images, see [4,7,17].If the colors are fixed, this is exact two-dimensional pattern matching [3]. However, imagescan appear under different color maps: The image is the same image, however each colorhas been recolored with a unique color. Two-dimensional parameterized search is preciselywhat is needed. An algorithm for two-dimensional parameterized matching was given in [2]and its time complexity is O(n2m log2 m log log m) for a text of size n× n and a pattern ofsize m×m.

It is an open question whether a linear time algorithm for two-dimensional parameterizedmatching exists. In this paper we show two new algorithms for the problem. The firstalgorithm is almost linear in the input size, and the second algorithm is linear in the textsize, but with a penalty in the preprocessing stage. We propose two algorithms. The firstalgorithm is a convolutions-based method that uses a novel reduction of the two-dimensionalspace to one dimension, with careful handling of out-of-boundary entries for two-dimensionalqueries. The second algorithm is a dueling based solution that utilizes properties of thetwo-dimensional form of the problem. Timewise, the first algorithm has time complexityO(n2 log2 m). The time complexity of the second algorithm is O(n2 +m2.5 ·polylog(m)). Foralphabets drawn from a large universe the time complexity increases to O(n2 log σ + m2.5 ·polylog(m)), where σ = min(m, |Σ|). However, this is unavoidable as it was shown in [5] thatone-dimensional parameterized matching requires Ω(n log σ) time in the comparison model.

One step of our algorithms is to count the number of distinct characters in every m×msubstring of an n × n string. Amir et al. [4] gave an O(n2 log m)-time algorithm for thisproblem. We show in this paper an O(n2)-time algorithm for the problem. This result maybe of independent interest.

The rest of the paper is organized as follows: Section 2 contains definitions. TheO(n2 log2 m)-time algorithm is given in Section 3. The O(n2 + m2.5 · polylog(m)) algorithmis described in Section 4. In Section 5 we present the algorithm for substrings charactercounting.

2 Preliminaries

Let S and S ′ be two-dimensional strings of equal size. We say that there is a functionmatching between S and S ′ if there is a mapping f from the alphabet of S to the alphabetof S ′ such that S ′[x, y] = f(S[x, y]) for all x and y. If the mapping f is one-to-one, we saythat S and S ′ parameterize match, or p-match for short. Note that the definition of functionmatching is asymmetric whereas the definition of parameterized matching is symmetric. Thetwo-dimensional parameterized matching problem is defined as follows:

Input: An n× n text T and an m×m pattern P .Output: All substrings of T of size m×m that p-match to P .

Observation 1. There is a parameterized matching between S and S ′ if and only if there isa function matching between S and S ′, and the number of distinct characters in S is equalto the number of distinct characters in S ′.

2

Exploiting Observation 1, our algorithms have the following structure: First, our algo-rithms create a list L of m×m substrings of T such that

1. Every string in L function matches to P .

2. Every substring of T that p-matches to P appears in L.

The second step is computing the number of distinct characters in every m ×m substringof T . The strings in L that have the same number of distinct characters as P are preciselythe substrings of T that p-match to P . This stage takes O(n2) time using the algorithm inSection 5.

The left-to-right/top-to-bottom traversal order of a two-dimensional string is an orderof the locations inside the string that starts with the locations of the first (topmost) rowfrom left to right, then the locations of the second row from left to right, and so on. Othertraversal orders are defined similarly.

We use [a, b] × [c, d] to denote the set of all pairs (x, y) of integers with a ≤ x ≤ b andc ≤ y ≤ d. We will write [a]×[c, d] instead of [a, a]×[c, d] and [a, b]×[c] instead of [a, b]×[c, c].

3 An O(n2 log2 m) algorithm

It is helpful to recall the one-dimensional parameterized matching algorithm, due to Amir,Farach, and Muthukrishnan [7]. It is similar to the Knuth-Morris-Pratt string matchingalgorithm. The key idea is to recode the occurrences of each character in terms of theirseparation; namely, if a character c occurs in the pattern (or text) at locations i1 < i2 <· · · < ik, these occurrences of c are replaced by the numbers 0, i2 − i1, i3 − i2, . . . , ik − ik−1,respectively. In other words, each character of the pattern (or text) is replaced by thedistance to the nearest occurrence of the same character to the left, or by 0 if there is no suchoccurrence. Except for the first occurrence of each character, a parameterized match in theoriginal text and pattern corresponds to a standard match in the recoded text and pattern.One perspective on this is that all occurrences of the same character in the pattern (and thetext) have been connected into a structure. Identifying a match becomes a question of findingthe alignments of the pattern against the text for which the structures in the pattern andtext match. We will use a similar construction for the two-dimensional problem. However,this will now require creating connections from each character to Θ(log m) occurrences ofthe character.

Our solution has the following form. Let t = ⌈log2 m⌉. For each location (x, y) in thepattern (or the text) we define 4t+ 4 disjoint rectangles. From each rectangle we may selecta location (x′, y′) that contain the same character as (x, y) (if there is such a location in therectangle). The location (x′, y′) is called a neighbor of (x, y).

We now describe the rectangles for location (x, y) in the pattern. The first four rectanglescomprise row x and column y partitioned at location (x, y), i.e., the rectangles are [x] ×[y + 1, m], [x] × [1, y − 1], [x + 1, m] × [y], and [1, x − 1] × [y]. Next, we define t disjoint

3

rectangles R(x,y),0, . . . , R(x,y),t−1 that cover the quadrant below and to the right of (x, y). Fori = 0, . . . , t− 2,

R(x,y),i = [x + m− 2i+1 + 1, x + m− 2i]× [y + 1, m]

andR(x,y),t−1 = [x + 1, x + m− 2t−1]× [y + 1, m].

Some of the rectangles may not fit inside the pattern, namely, they contain rows greaterthan m. These rectangles will be inactive and will not generate neighbors. For the quadrantabove and to the right of (x, y) we define the rectangles

S(x,y),i = [x−m + 2i, x−m + 2i+1 − 1]× [y + 1, m]

for i = 0, . . . , t− 2, and

S(x,y),t−1 = [x−m + 2t−1, x− 1]× [y + 1, m].

Analogous rectangles are defined for the remaining two quadrants. See Figure 1 for anexample.

Suppose that P [x, y] = c. Each active rectangle is traversed in a direction away from(x, y) until the first occurrence of c is found. More precisely, the traversal order is top-to-bottom/left-to-right in the rectangles to the right of column y, top-to-bottom/right-to-leftin the rectangles to the left of column y, top-to-bottom in the rectangle [x + 1, m]× [y], andbottom-to-top in the rectangle [1, x− 1]× [y]. The first occurrence of c in some rectangle isthe neighbor generated by the rectangle, and if there is no such occurrence then the rectangledoes not generate a neighbor.

The selection of neighbors for a location (x, y) in the text is performed again by definingrectangles around (x, y). The only difference is that the rectangles that are to the right ofcolumn y now extend until column n, and the rectangle [x + 1, m] × [y] now extends untilrow n. Note that the number of rectangles and their rows are still defined according to m.

We say that two locations (x, y) and (x′, y′) are linked if there is a series of locations(z0, w0) = (x, y), (z1, w1), . . . , (zl, wl), (zl+1, wl+1) = (x′, y′) such that for all i, either (zi, wi)is a neighbor of (zi+1, wi+1), or (zi+1, wi+1) is a neighbor of (zi, wi) (or possibly both). Thefollowing lemma shows that all the occurrences of some character in the pattern are linked.

Lemma 1. Let (x, y) and (x′, y′) be two locations in the pattern both holding the samecharacter. Then, these two locations are linked.

Proof. Clearly, if x = x′ then there is a series of locations along row x linking the locations(x, y) and (x′, y′). Similarly, these two locations are linked if y = y′. We now consider thecase that x 6= x′ and y 6= y′. W.l.o.g. suppose that x < x′ and y < y′. We claim thateither (x, y) lies in one of the active rectangles of (x′, y′) or (x′, y′) lies in one of the activerectangles of (x, y) (or possibly both).

W.l.o.g. suppose that x ≤ m/2. If (x′, y′) is inside one the active rectangles of (x, y) thenwe are done. Otherwise, there must be at least one inactive rectangle among R(x,y),0, . . . , R(x,y),t−1.

4

a

a

aa

a

a aaaa

a

a

a

aa

a

a aaaa

a

aa

Figure 1: An example of neighbors selection for a pattern of size 8 × 8 (left) and a text ofsize 11 × 11 (right). The neighbors of (3, 4) in P and T are marked in bold. The activerectangles for location (3, 4) in P and T are shown in the figure. The rectangles for location(3, 4) in P in the bottom-right quadrant are R(3,4),0 = [10] × [5, 8], R(3,4),1 = [8, 9] × [5, 8],and R(3,4),2 = [4, 7]× [5, 8]. Only the last rectangle is active. The corresponding rectanglesfor location (3, 4) in T are [10]× [5, 12], [8, 9]× [5, 12], and [4, 7]× [5, 12]. All these rectanglesare active. In this example P p-matches to T [1 . . 8, 1 . . 8], so by Observation 2 the neighborsof (3, 4) in P are aligned with the neighbors of (3, 4) in T .

Let R(x,y),j be the topmost inactive rectangle. From the fact that x ≤ m/2 it follows thatR(x,y),t−1 is active, so j ≤ t − 2. The location (x′, y′) must be inside R(x,y),j (since the lastrow of R(x,y),j is larger than m). By the symmetry of the construction we have that (x, y) isinside S(x′,y′),j. We will now show that S(x′,y′),j is active.

Since (x′, y′) ∈ R(x,y),j we have that x′ ≥ x + m − 2j+1 + 1. Moreover, from the factthat R(x,y),j does not fit inside the pattern we have that x + m− 2j > m, so x > 2j . Thus,x′ > 2j +m− 2j+1 +1 = m− 2j + 1, so x′−m + 2j > 1. The last inequality implies that therectangle S(x′,y′),j is active. We have shown that (x, y) is inside an active rectangle of (x′, y′).

We have shown that either (x, y) lies in one of the active rectangles of (x′, y′), or (x′, y′)lies in one of the active rectangles of (x, y). W.l.o.g. suppose that the latter occurs. It neednot be that (x′, y′) is a neighbor of (x, y), however. Nonetheless, by induction on y′ − y,we show that they are linked. The base case y = y′ has already been demonstrated. Let(x′′, y′′) denote the neighbor of (x, y) in the rectangle containing (x′, y′). Then y < y′′ ≤ y′.By induction, (x′′, y′′) and (x′, y′) are linked and the inductive claim follows.

We say that the neighbors of location (x, y) in P are aligned with the neighbors of location(x+a, y+b) in T if for all i, if the i-th rectangle of (x, y) in P produces a neighbor (x′, y′) thenthe i-th rectangle of (x+a, y+b) in T produces a neighbor and the neighbor is (x′+a, y′+b).

Observation 2. Let T ′ = T [a + 1 . . a + m, b + 1 . . b + m] be some substring of T .

5

1. If P p-matches to T ′ then for every location (x, y) in P , the neighbors of (x, y) in Pare aligned with the neighbors of (x + a, y + b) in T .

2. If for every (x, y), the neighbors of (x, y) in P are aligned with the neighbors of (x +a, y + b) in T , then there is a function matching between P and T ′.

Proof. Part 1 of the observation follows from the construction. To prove part 2, considertwo locations (x1, y1) and (x2, y2) in P with P [x1, y1] = P [x2, y2]. By Lemma 1, (x1, y1) islinked with (x2, y2) in P . As the neighbors of a location (x, y) in P are aligned with theneighbors of (x + a, y + b) for every (x, y), it follows that (x1 + a, y1 + b) is linked with(x2 +a, y2 + b) in T . Therefore T [x1 +a, y1 + b] = T [x2 +a, y2 + b]. Since this holds for everytwo locations in P that contain the same character, we conclude that there is a functionmatching between P and T ′.

We note that the converse of part 1 of the observation does not hold, namely, if P p-matches to T ′ then the neighbors of a location (x+ a, y + b) in T ′ are not necessarily alignedwith neighbors of (x, y) in P . For example, in Figure 1, location (3, 4) in T has a neighbor(3, 9) which has no corresponding neighbor in P .

We now describe how to transform the problem of aligning the neighbor structures intoan exact matching problem. The text and pattern are recoded as follows: Each location(x, y) in the pattern is encoded by a sequence of 4t + 4 = Θ(log m) characters c1 · · · c4t+4.The character ci corresponds to the i-th rectangle of (x, y). If the i-th rectangle of (x, y)produces a neighbor (x′, y′) then ci = (x − x′, y − y′). If the i-th rule does not produce aneighbor then ci = φ. The text is recoded similarly, except that when a rectangle yields noneighbor we set ci = 0.

Let P2 and T2 denote the recoded pattern and text. Treating φ as a wildcard, Observa-tion 2 implies that for every m×m substring T ′ of T and the corresponding substring T ′

2 ofT2,

1. If there is a parameterized matching between P and T ′ then there is an exact wildcardmatching between P2 and T ′

2.

2. If there is an exact wildcard matching between P2 and T ′2 then there is a function

matching between P and T ′.

Using standard techniques, the wildcard matching problem can be made one-dimensional.The recoded pattern and text have sizes O(m2 log m) and O(n2 log m), respectively. Thus,the wildcard matching problem can be solved in time O(n2 log2 m) [10, 11].

It remains to show how to identify the neighbors. This is readily done in O(m2 log2 m)time in the pattern and O(n2 log2 m) time in the text. We describe the approach for thepattern. Finding the neighbors in the rectangles [x]× [y+1, m], [x]× [1, y−1], [x+1, m]× [y],and [1, x − 1] × [y] for all locations (x, y) is done by simple scans of the rows and columnsof the pattern. We next describe how to find the neighbors in the remaining rectangles.The idea is to maintain for each character c, windows w1, . . . , wt−1 over the pattern. Awindow wi has a width of m columns and a height of 2i rows (except the window wt−1

6

which has height m− 2t−1). For each i, the window wi is slid down the pattern. During themovement of wi, the occurrences of c that are inside wi are kept in a balanced search tree intop-to-bottom/left-to-right order.

For a location (x, y) in P that contains the character c, we can find the neighbor of (x, y)in some rectangle R(x,y),i by doing a search on the binary search tree of wi at the time thatthe window wi covers the rows of R(x,y),i. Each search takes O(log m) time. Thus, over allcharacters and neighbors the searches take O(m2 log2 m) time. To slide a window down onerow requires deleting some character instances and adding others. This takes time O(log m)per change, and as each character instance is added once and deleted once from a windowof each size, this takes time O(m2 log2 m) over all characters and windows.

4 An O(n2 + m2.5 · polylog(m)) algorithm

In this section we present another algorithm for parameterized matching that follows the“duel-and-sweep” paradigm. This paradigm appeared in [3] (there it was named “consistencyand verification” and was used for two-dimensional exact matching) and is based upon theduelling technique [18, 19].

We begin by giving an overview of the “duel-and-sweep” algorithm for two-dimensionalexact matching. The algorithm maintains a list of candidates that initially contains all m×msubstrings of T . The list is pruned in two stages, called the duelling stage and the sweepingstage. After these stages the list contains exactly the substrings of T that match P .

Consider two candidates T1 = T [x . . x + m − 1, y . . y + m − 1] and T2 = T [x + a . . x +a + m− 1, y + b . . y + b + m− 1], where a ≥ 0 and b ≥ 0. Suppose that these two substringsshare common letters of T (in other words, the rectangles [x, x + m − 1] × [y, y + m − 1]and [x + a, x + a + m − 1] × [y + b, y + b + m − 1] have non-empty intersection). If boththese candidates match P then we conclude that when we align P against itself with offset(a, b), the overlapping areas in the two copies of P match. In other words, we have thatP1 = P [1 . . m−a, 1 . . m−b] matches P2 = P [a+1 . . m, b+1 . . m]. If on the other hand P1 doesnot match P2, then T1 and T2 cannot both match P . In this case at least one of the candidatesT1 and T2 can be ruled out in constant time in a process called duelling : Let (x′, y′) be alocation in P1 such that P1[x

′, y′] 6= P2[x′, y′]. The location (x′, y′) is called a witness for (a, b).

Let c = T [x+a+x′−1, y+b+y−1]. Clearly, we have that either c = P1[x′, y′], c = P2[x

′, y′],or neither of these equalities hold. In the first case c 6= P2[x

′, y′] = P [x′ + a, y′ + b], so T2

does not match P . In the second case c 6= P1[x′, y′] = P [x′, y′], so T1 does not match P . In

the last case, neither T1 nor T2 match P .In order to perform duels between candidates, the algorithm precomputes a witness table

that contains a witness for every mismatch offset (a, b). After performing all possible duelsbetween the candidates, we have that the remaining candidate are pairwise consistent. Amiret al. [3] showed how to perform the duels in O(n2) time. The duelling stage of Amir et al.relies on a transitivity property of the consistency relation, which follows from the fact thatexact matching is a transitive relation. Note that parameterized matching is also transitive,and this fact allows us to use the duelling stage of Amir et al. in our algorithm.

7

In the sweeping stage the algorithm checks for each candidate whether it forms a match.Using the fact that the candidates are consistent, this stage is also implemented in O(n2)time.

In the next sections we describe how to adapt the “duel-and-sweep” algorithm to two-dimensional parameterized matching.

4.1 Duelling stage

In exact matching a witness is simply one location with a mismatch between the two alignedcopies of the pattern. However, in parameterized matching one location is not sufficient torule out a match. The following definition captures the concept of a witness in parameterizedmatching.

Definition 1 (mismatch offset). Let P be a pattern of size m × m. An offset (a, b) withb ≥ 0 is called a mismatch offset of P if when P is aligned with itself with offset (a, b), theoverlapping areas in the two copies of P do not p-match (in other words, if a ≥ 0 then (a, b)is a mismatch offset if P [1 . . m− a, 1 . . m− b] does not p-match to P [a + 1 . . m, b + 1 . . m],and if a < 0 then (a, b) is a mismatch offset if P [1− a . . m, 1 . . m− b] does not p-match toP [1 . . m + a, b + 1 . . m].)

Definition 2 (witness). Let (a, b) be a mismatch offset of a pattern P . A witness w.r.t.(a, b) is a pair of locations (x, y), (x′, y′) such that one of the following holds:

1. P [x, y] = P [x′, y′] and P [x + a, y + b] 6= P [x′ + a, y′ + b].

2. P [x, y] 6= P [x′, y′] and P [x + a, y + b] = P [x′ + a, y′ + b].

If the first condition above is satisfied then the witness is called a type 1 witness, and other-wise it is called a type 2 witness.

Given a witness table for P , the duelling stage for parameterized matching is performedusing the duelling algorithm of Amir et al. [3]. In Section 4.4 we show how to construct thewitness table for P .

4.2 Sweeping stage

As in many pattern matching algorithms, we assume w.l.o.g. that n = 2m. Larger textscan be cut into overlapping pieces of size 2m× 2m which are handled independently, wheresuccessive pieces overlap in m− 1 rows or columns.

Definition 3 (strip). Let T be an n × n text. The i-th strip of T is the rectangle [1, n] ×[i, i + m− 1].

Definition 4 (predecessor). Let T be an n × n text, S a strip of T , and (x, y) a locationinside S. The predecessor of (x, y) w.r.t. S is the first location whose character is equalto P [x, y] encountered when traversing S in right-to-left/bottom-to-top order starting from(x, y).

8

a

a

aaa

aa

Figure 2: An example for the definition of predecessor. Here m = 4 and n = 8, so thefirst strip of T is the rectangle [1, 8] × [1, 4]. The predecessors (w.r.t. the first strip) of thelocations in the first strip that contain the character ‘a’ are shown in the figure ((7, 3) is thepredecessor of (8, 2), (6, 1) is the predecessor of (7, 3) etc.).

See Figure 2 for an example of the above definitions. A location and its predecessor inT will be called a location-predecessor pair. Recall that a candidate is an m ×m substringof T that survived the duelling stage. The start location of a candidate T [x . . x + m −1, y . . y + m − 1] is the location (x, y). We say that a candidate is contained in the y-thstrip if the start location of the candidate is (x, y) for some x. We say that a candidateT [x . . x + m − 1, y . . y + m− 1] contains a pair of locations (x1, y1), (x2, y2) if both (x1, y1)and (x2, y2) are inside the rectangle [x, x + m− 1]× [y, y + m− 1].

Definition 5 (mismatch pair). Let T ′ = T [x . . x + m− 1, y . . y + m− 1] be a candidate. Amismatch pair of T ′ is a pair of locations (x1, y1), (x2, y2) that is contained in T ′ such thatT [x1, y1] = T [x2, y2] and P [x1 − x + 1, y1 − y + 1] 6= P [x2 − x + 1, y2 − y + 1].

We now observe that:

Observation 3. For every candidate T ′ = T [x . . x + m− 1, y . . y + m− 1],

1. If P p-matches to T ′ then there are no mismatch pairs of T ′.

2. If there is no location-predecessor pair (w.r.t. the y-th strip) that is a mismatch pair ofT ′ then there is a function matching between P and T ′.

By Observation 3, it is suffices to check for each candidate whether the location-predecessorpairs (w.r.t. to the strip that contains the candidate) that are contained in the candidate aremismatch pairs for the candidate. Clearly, this approach is not efficient (the time complexityfor checking one candidate is Θ(m2), so the total time to check all candidate is Θ(m4) inthe worst case). However, we can use the fact that after the duelling stage, the remainingcandidates are pairwise consistent. Therefore, for each location-predecessor pair w.r.t. somestrip y, the pair is either a mismatch pair for every candidate (contained in the y-strip orin some other strip) that contains the pair or the pair is not a mismatch pair for any can-didate. Thus, for each location-predecessor pair w.r.t. some strip y we need to check only

9

two characters in P , and if there is a mismatch we can rule out all the remaining candidatesthat contain the pair. Our algorithm does not rule out all the candidates that contain thepair. Instead, it rules out the candidates with start location in column y′ ≥ y.

We first need to show how to compute all location-predecessor pairs. This is performedby going over all strips of T from left to right while maintaining the predecessor of everylocation inside the current strip. Computing the predecessor of every location in the firststrip takes O(m2) time. When going from the (y − 1)-th strip to the y-th strip, there are atmost 3n = 6m new location-predecessor pairs: (1) Every location in column y + m− 1 andits predecessor form a new pair, (2) Every location in column y + m − 1 may become thepredecessor of some location in columns y, . . . , y + m− 2, and (3) Every location in columnsy, . . . , y + m− 2 whose predecessor was in column y− 1 will form with its new predecessor anew pair. After a preprocessing step on T that will be described in Section 4.3, we can findall the new location-predecessor pairs in O(m) time.

When checking if there are location-predecessor pairs w.r.t. the current strip that aremismatch pairs, we only need to check the location-predecessor pairs which are new (i.e., didnot appear in the previous strip), as we already checked the old pairs in previous iterations.Let (x1, y1), (x2, y2) be some new location-predecessor pair for the y-th strip. As noted above,(x1, y1), (x2, y2) is either a mismatch pair for all the candidates that contain the pair, or fornone. Therefore, we only need to find one candidate (x′, y′) that contains the pair, withy′ ≥ y, and check whether P [x1 − x′ + 1, y1 − y′ + 1] = P [x2 − x′ + 1, y2 − y′ + 1]. If thesetwo character are not equal, then (x′, y′) cannot be a match, and moreover, no candidatethat contains the pair (x1, y1), (x2, y2) is a match. In other words, we can rule out everycandidate whose start location is inside the rectangle [x1 −m + 1, x2]× [y, min(y1, y2)].

We need to handle two issues: How to find a candidate that contains the pair (x1, y1), (x2, y2),and in the event of a mismatch, how to rule out the candidates in the corresponding rectangle.

We first deal with the former issue. Given a location-predecessor pair (x1, y1), (x2, y2),find the candidate which starts at the smallest row that may contain the pair (x1, y1), (x2, y2).This candidate will be denoted (x′, y′). More precisely, among all the candidates with startlocations inside the rectangle [x1−m+1, 2m]× [y, min(y1, y2)], (x′, y′) is the candidate whosestart location is in the smallest row (ties are broken arbitrarily). Clearly, if (x′, y′) does notcontain the pair (x1, y1), (x2, y2) (that is, if x′ > x2), then there are no candidates with startlocation in a column greater than y that contain the pair (x1, y1), (x2, y2).

We now describe how to find the candidate (x′, y′) in constant time. To do that, we builda 2m × 2m array A, where A[i, j] is the smallest integer r ≥ i such that (r, j) is a startlocation of a candidate. If there are no such integers then A[i, j] = 2m. The array A can beeasily computed by scanning the text column by column, from bottom to top. Now, givena location-predecessor pair (x1, y1), (x2, y2), in order to find the candidate (x′, y′) we findthe minimum value in the subrow A[x1 −m + 1, y.. min(y1, y2)]. This is the range minimaproblem [14], so after preprocessing each row of A in O(m) time per row, we can find theminimum element in some subrow of A in constant time.

We now turn to the problem of eliminating candidates. As discussed above, during thealgorithm we find rectangles that do not contain a match, and eliminate the candidates

10

with start location inside these rectangles. Instead of eliminating the candidates at the timeeach rectangle is discovered, we store the rectangle in a list L, and after obtaining all therectangles we perform a candidates elimination stage.

In the candidates elimination stage, we need to remove each candidate whose top-leftcorner is in some rectangle of L. This is done by moving a vertical sweep line from left toright. We maintain two vectors B and C, where B[i] (resp., C[i]) is the number of rectanglesin L that intersect the current sweep line, and whose top (resp., bottom) row is i. Usingthese vectors, we can compute the number of rectangles in L that contain each location onthe sweep line, and eliminate the candidates whose start location is contained in at leastone rectangle. Since the number of rectangles in L is at most 6m2, it follows that the timecomplexity of this stage is O(m2).

4.3 Text Preprocessing

In this section, we show how to compute an array of pointers that will be used to maintainthe predecessors w.r.t. the current strip.

Definition 6 (left predecessor). For a location (x, y) in T with y > m, the left predecessorof (x, y) is the predecessor of location (x, y) w.r.t. the (y −m + 1)-th strip of T .

Given the left predecessors of all the locations in T , we maintain the predecessor ofevery location w.r.t. to the current strip as follows. The predecessor and successor pointersfor the current strip form a collection of doubly linked lists, one per character, which arestored in place in T , i.e., each location holds its predecessor and successor pointers. The leftpredecessors are used in updating these lists when moving from the (y − 1)-th strip to they-th strip as follows. First, each location (x, y − 1) in column y − 1, in top-to bottom ordersay, is removed from its list by connecting its neighbors. Second, in top-to-bottom order,each location (x, y + m− 1) is inserted right after its left predecessor.

Now, we show how to compute the left predecessor of each location (x, y) in T withy > m. First, for each character σ we form a list LC

σ of candidate left predecessors. Thisconsists of the locations of σ in left-to-right/bottom-to-top order except that in each row onlythe rightmost location holding σ in columns 1 to m and the locations holding σ in columnsm + 1 to 2m are kept. Next LC

σ is traversed. We will maintain a second list Lσ which holds,in traversal order, traversed locations (x, y) with y > m for which the left predecessor hasnot yet been traversed. Our implementation of this process uses the following two propertiesof the Lσ lists.

Lemma 2. If (x, y) and (x′, y′) are two locations in Lσ, where (x, y) precedes (x′, y′) in Lσ,then x > x′ and y < y′.

Proof. Note that the traversal order is also the scan order of the locations. Now, supposefor a contradiction that either x ≤ x′ or y ≥ y′. From the fact that (x, y) appears before(x′, y′) in the left-to-right/bottom-to-top scan we have that either (1) x = x′ and y < y′, or(2) x > x′ and y ≥ y′. Since |y−y′| < m (this follows from the inequalities m < y ≤ 2m and

11

m < y′ ≤ 2m), in the first case we have that (x, y) is the left predecessor of (x′, y′), and in thesecond case we have that (x′, y′) is the left predecessor of (x, y). In both cases we have thatthe list Lσ cannot contain both (x, y) and (x′, y′), so we have reached a contradiction.

Lemma 3. Suppose that the traversal of LCσ has reached location (x, y). Let (x1, y1), . . . , (xs, ys)

be the locations in Lσ according to their order. The location (x, y) is either the left predeces-sor of (x1, y1), . . . , (xf , yf) for some 1 ≤ f ≤ s, the left predecessor of (xf , yf), . . . , (xs, ys)for some 1 < f ≤ s, or the left predecessor of none of these locations.

Proof. Consider the following cases:

1. If y > ys or if y ≤ y1 −m then (x, y) is the left predecessor of none of the locations inLσ.

2. If y1 ≤ y ≤ ys then let f be the minimum integer such that y ≤ yf . Then, (x, y) is theleft predecessor of (xf , yf), . . . , (xs, ys).

3. If neither case 1 nor 2 occurs, then y1 − m + 1 ≤ y < y1. Let f be the maxi-mum integer such that yf −m + 1 ≤ y. In this case, (x, y) is the left predecessor of(x1, y1), . . . , (xf , yf).

Lemma 3 suggests the following algorithm for computing the left predecessors. Sup-pose that currently Lσ = (x1, y1), (x2, y2), . . . , (xs, ys), and that (x, y) is the location beingtraversed. Then

Case 1 y > y1. Make (x, y) the left predecessor of all of (x1, y1), . . . , (xf , yf), where f isthe largest index such that yf − y < m.

Case 2 y > ym. Add (x, y) to the end of Lσ.

Case 3 y1 ≤ y ≤ ym. Make (x, y) the left predecessor of all of (xg, yg), . . . , (xm, ym), whereg is the least index such that y ≤ yg.

Clearly, the time complexity for computing the left predecessors is O(m2).

4.4 Pattern Preprocessing

In this section we show how to compute the witness table for the pattern. We need tocompute a witness for every mismatch offset (a, b). There are two cases to consider: whena ≥ 0 and when a < 0. In the following, we will handle the former case. The latter case issymmetrical, and thus omitted. Let l = ⌊√m⌋. We will assume throughout the section thata ≤ m− 3l and b ≤ m− 3l. At the end of this section we will show how to handle mismatchoffsets (a, b) with either a > m− 3l or b > m− 3l.

The main idea is similar to the algorithm in Section 3: We select neighbors for eachlocation in P and then we align the neighbors in regions of P by solving exact wildcard

12

Figure 3: The subsets of Da,b.

matching problems. Since we need to align regions of various sizes, the neighbor selectionscheme of Section 3 do not work here.

Let (a, b) be a mismatch offset, and let Da,b = [1, m−a]× [1, m−b] be the set of locationsin the overlap area when P is aligned against itself with offset (a, b). We partition Da,b intosubregions (see Figure 3): Da,b

1 , Da,b3 , Da,b

7 , and Da,b9 are the l × l squares in the corners of

Da,b. Da,b2 , Da,b

4 , Da,b6 , and Da,b

8 are the rectangles of width or height l forming the bordersof Da,b excluding the corners, and Da,b

5 is the remaining part of Da,b. Formally,

Da,b1 = [1, l]× [1, l],

Da,b2 = [1, l]× [l + 1, m− b− l],

Da,b3 = [1, l]× [m− b− l + 1, m− b],

and so on. We define Da,b[i1, i2, . . . , ik] = Da,bi1∪Da,b

i2∪ · · · ∪Da,b

ik.

We say that a witness w for (a, b) is non-simple if it satisfies any of the following condi-tions:

1. w is of type 1, one of the locations of w is in Da,b[1], and the other location is inDa,b[5, 6, 8, 9].

2. w is of type 2, one of the locations of w is in Da,b[9], and the other location is inDa,b[1, 2, 4, 5].

3. w is of type 1, one of the locations of w is in Da,b[3, 6] and the other location is inDa,b[7, 8].

4. w is of type 2, one of the locations of w is in Da,b[2, 3] and the other location is inDa,b[4, 7].

The algorithm comprises three stages:

1. Find simple witnesses.

13

Figure 4: The active rectangles of location (3, 4) in an 8 × 8 pattern, where l = 2. TheH(x,y),i rectangles are shown on the left and the V(x,y),i rectangles are shown on the right.

2. Find non-simple witnesses that satisfy Conditions 1 or 2 above.

3. Find non-simple witnesses that satisfy Conditions 3 or 4 above.

The three stages of the algorithm are described in the rest of this section. In each stage,we handle only those offsets for which no witness has been found so far. In the followingstages, we will describe only how to find witnesses of type 1, as handling the witnesses oftype 2 is symmetrical.

Stage 1

For each location (x, y) in P we choose 4⌊ml⌋+ 4l − 4 neighbors. Similarly to Section 3, we

define rectangles for each location in P , where each rectangle can provide one neighbor (notethat here the rectangles are not disjoint).

For a location (x, y) we define the following rectangles: For i = −⌊ml⌋, . . . , ⌊m

l⌋ − 1, let

H(x,y),i = [x + il, x + (i + 1)l − 1]× [1, y − 1]

and

V(x,y),i = [1, x− 1]× [y + il, y + (i + 1)l − 1].

See Figure 4 for an example. Furthermore, for j = −l + 1, . . . , l − 1, we define H2(x,y),j =

[x + j]× [1, y − 1] and V 2(x,y),j = [1, x− 1]× [y + j]. We say that a rectangle is active if it is

contained in [1, m]×[1, m]. For example, H(x,y),i is active if 1 ≤ x+il and x+(i−1)l−1 ≤ m.For every active rectangle of (x, y) we scan the locations inside the rectangle in a direction

away from (x, y) namely, the scan order of H(x,y),i is top-to-bottom/right-to-left, and the scanorder of V(x,y),i is left-to-right/bottom-to-top. The first occurrence of the character T [x, y]in the rectangle is the neighbor generated by the rectangle.

Consider some offset (a, b). We say that (x′, y′) is a Da,b-neighbor of (x, y) if (x′, y′) is aneighbor that is generated by an active rectangle of (x, y) and the corresponding rectangleof (x+ a, y + b) is also active (in other words, (x′, y′) is a neighbor of (x, y) that is generated

14

by a rectangle that is contained in Da,b). We say that two locations (x, y) and (x′, y′) inDa,b are Da,b-linked if there is a series of locations (z0, w0) = (x, y), (z1, w1), . . . , (zl, wl),(zl+1, wl+1) = (x′, y′) such that for all i, at least one of the locations (zi, wi) and (zi+1, wi+1)is a Da,b-neighbor of the other location.

Lemma 4. Let (a, b) be an offset. Let (x, y) and (x′, y′) be two locations inside Da,b withP [x, y] = P [x′, y′]. Suppose that neither of the following conditions hold

• One of the locations is in Da,b[1] and the other location is in Da,b[5, 6, 8, 9].

• One of the locations is in Da,b[3, 6] and the other location is in Da,b[7, 8].

Then, (x, y) and (x′, y′) are Da,b-linked.

Proof. We first claim that at least one case among the following cases must occur.

Case 1 |y − y′| < l.

Case 2 |x− x′| < l.

Case 3 The topmost location among (x, y) and (x′, y′) is in Da,b[2, 5, 8].

Case 4 The leftmost location among (x, y) and (x′, y′) is in Da,b[4, 5, 6].

For a contradiction, suppose that Cases 1–4 do not occur. W.l.o.g. we assume that x > x′,so (x′, y′) is the topmost location among (x, y) and (x′, y′). Since Case 3 does not occur, wehave that either y′ ≤ l or y′ ≥ m− b− l + 1.

Suppose first that y′ ≤ l. If y ≤ y′ then |y − y′| < l which contradicts the assumptionthat Case 1 does not occur. Therefore y > y′, so (x′, y′) is the leftmost location among (x, y)and (x′, y′). Since Case 4 does not occur, we have that either x′ ≤ l or x′ ≥ m − a − l + 1.If x′ ≥ m− a− l + 1 then m− a− l + 1 ≤ x′ < x ≤ m− a. Therefore |x− x′| < l, but thiscontradicts the assumption that Case 2 does not occur. On the other hand, if x′ ≤ l then(x′, y′) ∈ Da,b[1]. As we assumed that Cases 1 and 2 do not occur, (x, y) /∈ Da,b[1, 2, 3, 4, 7].Therefore, (x, y) ∈ Da,b[5, 6, 8, 9] which contradicts an assumption of the lemma.

From the above we conclude that y′ ≥ m− b− l + 1. Moreover, y < y′ otherwise Case 1occurs. Now, (x, y) is the leftmost location among (x, y) and (x′, y′), so it follows that eitherx ≤ l or x ≥ m − a − l + 1. The former case cannot occur otherwise Case 2 occurs. If thelatter case occurs then from the assumption that Cases 1 and 2 do not occur we obtain that(x′, y′) ∈ Da,b[3, 6] and (x, y) ∈ Da,b[7, 8] which contradicts an assumption of the lemma.Therefore, at least one case among Cases 1-4 must occur.

We now show that the first part of the lemma holds when Case 1 occurs and when Case 3occurs. The proof for Cases 2 and 4 is symmetric. W.l.o.g. we assume that x ≥ x′.

Case 1 First, if y = y′ then (x, y) and (x′, y′) are Da,b-linked by a series of locations oncolumn y (due to the V 2

·,0 rectangles). Moreover, if 0 < |y − y′| < l then we have that(x′, y′) ∈ V 2

(x,y),y′−y, so the rectangle V 2(x,y),y′−y generates a Da,b-neighbor (x′′, y′) for (x, y).

From the above, (x′′, y′) is Da,b-linked to (x′, y′) and therefore (x, y) and (x′, y′) are Da,b-linked. Similarly, (x, y) and (x′, y′) are Da,b-linked if x− x′ < l.

15

Case 3 We prove the lemma for Case 3 using induction on x−x′. The base of the induction(when x = x′) was already shown. Suppose now that x > x′. Let i be the integer such that(x′, y′) ∈ V(x,y),i. Since l+1 ≤ y′ ≤ m−b−l it follows that the rectangle V(x,y),i is contained inDa,b. Therefore, (x, y) has a Da,b-neighbor (x′′, y′′) with x < x′′ ≤ x′. By induction (x′′, y′′)and (x′, y′) are Da,b-linked, and therefore (x, y) and (x′, y′) are also Da,b-linked.

We now define strings P1 and P2. The string P1 is obtained from P by replacing eachcharacter P [x, y] in P with L = 4⌊m/l⌋ + 4l − 4 characters that encode the locations ofthe neighbors of (x, y) relative to (x, y). If the rectangle does not yield a neighbor then thecharacter for this rectangle is the wildcard character φ.

The string P2 is built in two steps. The first step is the same as for building P1, exceptthat we use 0 for rectangles that yield no neighbors. After the first step, P2 is an m ×mLstring. In the second step we expand P2 into a 2m×2mL string by padding with φ characters.

After building P1 and P2 we solve the exact wildcard matching problem on P1 and P2,where P1 is the pattern and P2 is the text. Moreover, using the algorithm of Alon andNaor [1] we find a witness for every mismatch between P1 and P2, namely, for every offset(a, b) such that P1 does not match P2[a + 1 . . a + m, b + 1 . . b + m] we find a location (x, y)such that P1[x, y] 6= P2[x + a, y + b].

By Lemma 4 we have that

• If P1 does not match P ′2 = P2[a + 1 . . a + m, b + 1 . . b + m] then (a, b) is a mismatch

offset. Moreover, from a witness to the mismatch of P1 and P ′2 we can obtain a type 1

witness for (a, b) in constant time.

• If (a, b) has a simple type 1 witness then P1 does not match P ′2.

The bottleneck in the time complexity of Stage 1 is the time for computing the mismatchwitnesses using the algorithm of Alon and Naor. The time complexity of this step is O(|P2| ·polylog(m)) = O((m3

l+ l) · polylog(m)) = O(m5/2 · polylog(m)).

Stage 2

Recall that the goal of Stage 2 is to find type 1 witnesses w for mismatch offsets (a, b) suchthat one location of w is in Da,b[1] and the other location is in Da,b[5, 6, 8, 9] = [l + 1, m −a]× [l + 1, m− b]. Stage 2 is composed of three substages.

Stage 2a

In this stage we find witnesses w for mismatch offsets (a, b) such that one location of w is inDa,b[1] and the other location is in [l+1, m−a]× [l+1, m−b−3l]. Recall that in Stage 1 wecreated one instance of exact wildcard matching and used this instance to generate witnessesfor all offsets. In this stage we create O(m2/l) instances. The main idea is to group offsetsinto sets and build an instance for each set. In each set, all the offsets are close to eachother, and we use this fact in the construction.

16

Let (a, b′) be an offset where b′ is a multiple of l. For a location (x, y) we define therectangle

Ha,b′

(x,y) = [l + 1, m− a]× [y + 1, y + m− b′ − 3l].

From each rectangle we generate a neighbor by scanning the rectangle in top-to-bottom/right-to-left order and selecting the first occurrence of P [x, y] (if there are any).

Lemma 5. Let (a, b) be an offset and let b′ = l⌊b/l⌋. Let (x, y) ∈ Da,b[1] and (x′, y′) ∈[l + 1, m− a]× [l + 1, m− b− 3l] be two locations with P [x, y] = P [x′, y′]. Then, (x, y) has a

neighbor (x′′, y′′) in the rectangle Ha,b′

(x,y) and (x′′, y′′) is Da,b-linked to (x′, y′).

Proof. It is easy to verify that (x′, y′) ∈ Ha,b′

(x,y). Thus, the rectangle Ha,b′

(x,y) generates a

neighbor; let (x′′, y′′) denote that neighbor. Due to the scan order, y′′ ≥ y′ ≥ l+1. Moreover,y′′ ≤ y +m− b′− 3l ≤ m− b− l− 1 and l + 1 ≤ x′′ ≤ m− a. Therefore, (x′′, y′′) ∈ Da,b[5, 8].We also have (x′, y′) ∈ Da,b[5, 8]. By Lemma 4, (x′, y′) and (x′′, y′′) are Da,b-linked.

We now describe how to find witnesses by building appropriate exact matching instances.For every a and b′ where b′ is a multiple of l we build strings P a,b′

1 and P a,b′

2 . The string P a,b′

1

is an l × l string, where P a,b′

1 [x, y] is the location of the neighbor of (x, y) in Ha,b′

(x,y) relative

to (x, y) (if the rectangle does not yield a neighbor then P a,b′

1 [x, y] = φ). To build P a,b′

2 wetake the string P ′ = P [a + 1 . . m, b′ + 1 . . m]. For each location (x, y) ∈ [1, l]× [1, 2l − 1] in

P ′, we compute the neighbor of (x, y) in the rectangle Ha,b′

(x,y) w.r.t. P ′. The string P a,b′

2 is an

l× (2l− 1) string consisting of the encodings of these neighbors (when a rectangle does not

yield a neighbor the corresponding character in P a,b′

2 is 0). For each P a,b′

1 and P a,b′

2 we solvethe exact wildcard matching problem on these strings and find witnesses to the mismatches.From these witnesses we obtain witnesses for the mismatch offsets.

Stage 2b

This stage finds witnesses w for mismatch offsets (a, b) such that one location of w is inDa,b[1] and the other location is in [l +1, m− a− 3l]× [l +1, m− b]. This stage is analogousto Stage 2b and we omit the details.

Stage 2c

In this stage we find witnesses w for offsets (a, b) such that one location of w is in Da,b[1],and the other location is in [m− a− 3l + 1, m− a]× [m− b− 3l + 1, m− b].

Let a′ be a multiple of l. For a location (x, y) and i = 1, . . . , 5l−2 we define the rectangle

Ha′

(x,y),i = [x−m + a′ + i]× [1, y − 1].

As before, each rectangle may generate a neighbor. The rectangles are scanned in right-to-leftorder.

17

Lemma 6. Let (a, b) be an offset and let a′ = l⌊a/l⌋. Let (x, y) ∈ [m− a− 3l + 1, m− a]×[m− b− 3l + 1, m− b] and (x′, y′) ∈ Da,b[1] be two locations with P [x, y] = P [x′, y′]. Then,there is an index 1 ≤ i ≤ 5l− 2 such that (x, y) has a neighbor (x′, y′′) in Ha′

(x,y),i and (x′, y′′)

is Da,b-linked to (x′, y′).

Proof. Let i = x′−x+m−a′. Since m−a−3l+1 ≤ x ≤ m−a and 1 ≤ x′ ≤ l we have thati ≥ 1− (m− a) + m− a′ ≥ 1 and i ≤ l− (m− a− 3l + 1) + m− a′ = 4l− 1 + a− a′ ≤ 5l− 2.Clearly, (x′, y′) ∈ Ha′

(x,y),i and the lemma follows.

For every a′ which is a multiple of l we build strings P a′

1 and P a′

2 . The string P a′

1 isa string of size (4l − 1) × (m · (5l − 2)) that contains encodings of neighbors for locations(x, y) ∈ [m− a′− 4l + 2, m− a′]× [1, m]. The string P a′

2 contains encodings of neighbors forlocations (x, y) ∈ [m− 3l +1, m]× [m− 3l +1, m]. P a′

2 is also padded with wildcards so thatits size is (5l − 2)× (2m · (5l − 2)). For each pair of strings P a′

1 and P a′

2 we solve the exactmatching with wildcard problem and find witnesses to the mismatches.

We now compute the time complexity of Stage 2. In Stage 2a and Stage 2b we constructO(m2/l) pairs of strings each of size O(l2). In Stage 2c we construct O(m/l) pairs of stringseach of size O(ml). Therefore, the time complexity of Stage 2 is O(m2l · polylog(m)) =O(m5/2 · polylog(m)).

Stage 3

Again we consider several sub-stages.

Stage 3a

In this stage we find witnesses w for mismatch offsets (a, b) such that one location of w is inDa,b[3], and the other location is in Da,b[7, 8].

This stage is similar to Stage 2c. Let a′ be a multiple of l. For a location (x, y) andi = 1, . . . , 3l − 2 we define the rectangle

Ha′,2(x,y),i = [x + m− a′ − i]× [1, y − 1].

The scan order of the rectangles is right-to-left order. We build strings P a′,21 and P a′,2

2 asfollows: The string P a′

1 is a string of size l×(m ·(3l−2)) that contains encodings of neighborsfor locations (x, y) ∈ [1, l] × [1, m]. The string P a′

2 is a string of size (2l − 1) × (2m) thatcontains encodings of neighbors for locations (x, y) ∈ [a′ +1, a′ +2l−1]× [m− l +1, m], andit is padded with wildcards.

Stage 3b

In this stage we find witnesses w for mismatch offsets (a, b) such that one location of w is inDa,b[7], and the other location is in Da,b[3, 6]. The stage is similar to Stage 3a and we omitthe details.

18

Stage 3c

Let (a, b) be some mismatch offset. For a rectangle D ⊆ Da,b define D+(a, b) = (x+a, x+b) :(x, y) ∈ D. We say that a rectangle D ⊆ Da,b is a mismatch rectangle if the number ofdistinct characters inside the region D of P is strictly less than the number of distinctcharacters inside the region D + (a, b) of P . A mismatch rectangle D is minimal if there isno mismatch rectangle that is contained in D.

If no witness for (a, b) was found during Stage 1 or Stage 2, then we have that for everytype 1 witness w.r.t. (a, b), one location of the witness is in Da,b[3, 6] and the other locationis in Da,b[7, 8]. If additionally no witness for (a, b) was found during Stage 3a or Stage 3b,then for every type 1 witness w.r.t. (a, b), one location of the witness is in Da,b[6] and theother location is in Da,b[8]. Symmetrically, we have that for every type 2 witness w.r.t. (a, b),one location of the witness is in Da,b[2] and the other location is in Da,b[4]. We now showhow to a find a witness for such an offset. Note that there are no type 2 witnesses w.r.t.(a, b) with both locations inside Da,b[5, 6, 8, 9]. This yields the following observations:

Observation 4. For every rectangle D ⊆ Da,b[5, 6, 8, 9], the number of distinct charactersinside the region D of P is less than or equal to the number of distinct characters inside theregion D + (a, b) of P with equality if and only if there is no witness w for (a, b) for whichboth locations are in D.

Proof. Since there are no type 2 witnesses w.r.t. (a, b) with both locations inside Da,b[5, 6, 8, 9],there is a function matching between the substrings of P defined by the region D + (a, b)and the region D. Therefore, the first part of the observation holds. The second part of theobservations follows from Observation 1.

Observation 5. Let D ⊆ Da,b[5, 6, 8, 9] be a minimal mismatch rectangle. Then, the loca-tions at the bottom left corner and the top right corner of D form a witness w.r.t. (a, b).

Proof. By Observation 4 there is a witness w for (a, b) for which both locations are in D.From the minimality of D the two locations of w must be at opposite corners of D. Sinceone location of w is in Da,b[6] and the other location is in Da,b[8], it follows that the locationsof w are at the bottom left corner and top right corner of D.

We do not know how to efficiently find a minimal mismatch rectangle. Instead, we willfind a rectangle D such that the coordinates of the corners of D are close to those of someminimal mismatch rectangle. The corners of D will not necessarily give a witness w.r.t.(a, b), so an additional search for the witness will be required.

Consider the following intersection counting problem: Given a set S of points in theplane where each point has a color, build a data-structure for S that can answer queries ofthe form “what is the number of distinct colors for the points inside the rectangle (x, y) ∈R

2 : x ≤ b, c ≤ y ≤ d?”. Denote by n the number of points in S. Gupta et al. [13] showeda data-structure for this problem with preprocessing time O(n log2 n) that answers queriesin O(log2 n) time. Using this result, we obtain the following lemma:

19

Lemma 7. Let P be an m × m string, and let l be an integer. After preprocessing of Pin O(m3

llog2 m) time, the following queries can be answered in O(log2 m) time: “what is

the number of distinct characters in the substring P [a . . b, c . . d]?”, where a, b, c, and d areintegers with either b = m or a is a multiple of l.

Proof. The preprocessing is as follows: Create a set of points S containing a point (x, y) forevery x = 1, . . . , m and y = 1, . . . , m. The color of (x, y) is P [x, y]. For every integer i ≤ m/l,build the data structure of Gupta et al. on the set of points Si = (x, y) ∈ S : x ≥ il. Alsobuild a data structure on S ′ = (−x, y) : (x, y) ∈ S.

Given a query (a, b, c, d), if a is a multiple of l, return the number of distinct colors inthe points of Sa/l that are inside the rectangle (x, y) ∈ R

2 : x ≤ b, c ≤ y ≤ d. If b is equalto m, return the number of distinct colors in the points of S ′ that are inside the rectangle(x, y) ∈ R

2 : x ≤ −a, c ≤ y ≤ d.

Assume that we have the data-structure of Lemma 7 on P without its first row. We nowdefine a mismatch rectangle D ⊆ Da,b[5, 6, 8, 9] by starting with D = Da,b[5, 6, 8, 9] and thenone by one moving the left, right, and top edges of D to the right, left, and down, respectively,as long as the rectangle remains a mismatch rectangle. More precisely, we construct D asfollows.

1. We find the maximum y such that the rectangle [l+1, m−a]× [y, m− b] is a mismatchrectangle. Note that from Observation 4, [l + 1, m − a] × [y′, m − b] is a mismatchrectangle for every y′ < y. Therefore, we can find the value of y using binary searchand queries to the data-structure of Lemma 7.

2. We find the minimum z such that the rectangle [l + 1, m − a] × [y, z] is a mismatchrectangle.

3. We find the maximum x such that x − 1 is a multiple of l and [x, m − a] × [y, z] is amismatch rectangle. Let D = [x, m− a]× [y, z].

From Observation 4 and the definition of D we obtain that there is a type 1 witness w.r.t.(a, b) with one endpoint in [m−a−l+1, m−a]×[y] and the other endpoint in [x, x+l−1]×[z].We find such a witness as follows.

1. Let V [1 . . |Σ|] and L[1 . . |Σ|] be arrays containing zeros.

2. For x′ = m− a− l + 1, . . . , m− a, V [P [x′, y]]← P [x′ + a, y + b] and L[P [x′, y]]← x′.

3. For x′ = x, . . . , x + l − 1, if V [P [x′, z]] /∈ 0, P [x′ + a, z + b] then output the witness(L[P [x′, z]], y), (x′, z) and stop.

By Lemma 7, the time complexity of this stage is O(m3

llog2 m + m2 log3 m + m2l) =

O(m5/2 log2 m).

20

4.5 Handling special offsets

We now describe how to handle a mismatch offset (a, b) with a > m− 3l. For each location(x, y) in P we define the following rectangles: For i = −3l+2, . . . , 3l−2, let H(x,y),i = [x+i]×[1, y−1] and H(x,y),i = [x+ i]× [y +1, m]. For each location we choose neighbors using theserectangles and then construct an exact wildcard matching problem as described in Stage 1.The size of the strings in this problem is O(m2l), so this stage takes O(m5/2 · polylog(m))time. Handling mismatch offsets (a, b) with b > m− 3l is similar.

As each stage of the pattern preprocessing takes O(m5/2 · polylog(m)) time, we concludethat the total time complexity for preprocessing the pattern is O(m5/2 · polylog(m)).

5 Substrings character counting

In this section we show an O(n2) time algorithm that counts the number of distinct charactersin every m×m substring of an n× n string T . W.l.o.g. we assume that n = 2m.

For each location (x, y) in T we associate a rectangle R(x,y) = [max(1, x−m+1), min(m+1, x)]× [max(1, y −m + 1), min(m + 1, y)]. The rectangle R(x,y) is the rectangle of all startlocations of the m ×m substrings of T that contain location (x, y). For a character c ∈ Σ,Sc is the set of all rectangles R(x,y) such that T [x, y] = c.

The counting algorithm is similar to the algorithm of [4]. The algorithm moves a strip(see Definition 3) along T and computes the number of distinct characters in every m ×msubstring of T contained in the current strip. We use the fact that a character c ∈ Σ appearsin a substring T [x . . x+m−1, y . . y +m−1] if and only if (x, y) ∈ ∪R∈Sc

R. Therefore, whenprocessing the y-th strip, the algorithm computes the intersection of ∪R∈Sc

R with column yfor all c ∈ Σ. These intersections are obtained from the intersections of ∪R∈Sc

R with columny − 1 for all c ∈ Σ.

Let S be a set of rectangles. The lower boundary (resp., upper boundary) of S is the setof all locations (x, y) such that (x, y) ∈ ∪R∈SR and (x + 1, y) /∈ ∪R∈SR (resp., (x − 1, y) /∈∪R∈SR). The boundary of S is the union of the lower and upper boundaries of S. A leftupper boundary location (resp., right upper boundary location) of S is a location (x, y) thatis inside the upper boundary of S and (x, y − 1) (resp., (x, y + 1)) is not inside the upperboundary. We define left and right lower boundary locations similarly.

The algorithm consists of two stages. In the first stage we compute the boundary of Sc

for every c ∈ Σ (more precisely, we compute the left and right boundary locations of Sc).The second stage uses this information to compute the number of distinct characters in eachm×m substring of T .

5.1 Computing the boundary

In this section we show how to compute the boundary for some set Sc.For a rectangle R = [x1, x2]× [y1, y2] let top(R) = x1, bottom(R) = x2, left(R) = y1, and

right(R) = y2. From the assumption that n = 2m we have that for every rectangle R ∈ Sc

21

either left(R) = 1 or right(R) = m + 1. Furthermore, for every R ∈ Sc either top(R) = 1 orbottom(R) = m + 1.

The algorithm consists of two steps: First, we compute the boundaries of S1c and S2

c ,where S1

c is the set of all rectangles R ∈ Sc such that top(R) = 1, and S2c = Sc \ S1

c . In thesecond stage we merge the two boundaries and obtain the boundary of Sc.

We now describe how to compute the boundary of S1c . Computing the boundary of S2

c

is symmetrical. We assume that there is at least one rectangle with left(R) = 1 and onerectangle with right(R) = m + 1 (handling the case when there are rectangles of only onetype is simpler).

From each group of rectangles R ∈ S1c with left(R) = 1 and the same value of right(R),

we pick the rectangle that has the largest value of bottom(R). Let S1, . . . , Sl be the chosenrectangles, ordered so that right(S1) < right(S2) < · · · < right(Sl). We build a set S1 bychoosing each rectangle Si such that bottom(Si) > bottom(Sj) for all j > i.

Similarly, from each group of rectangles R ∈ S1c with right(R) = m+1 and the same value

of left(R), we pick the rectangle that has the largest value of bottom(R). Let S ′1, . . . , S

′l′ be

the chosen rectangles, ordered such that left(S ′1) > left(S ′

2) > · · · > left(S ′l′). Define S2 to be

the set containing every rectangle S ′i for which bottom(S ′

i) > bottom(S ′j) for all j > i.

Now, let S1 = R1, . . . , Rk and S2 = R′1, . . . , R

′k′ where right(R1) < right(R2) < · · · <

right(Rk) and left(R′1) > left(R′

2) > · · · > left(R′k′). If right(Rk) < left(R′

k′) then the leftupper boundary locations of S1

c are

(1, 1), (1, left(R′

k′)),

the right upper boundary locations are

(1, right(Rk)), (1, m + 1),

the left lower boundary locations are

(bottom(R1), 1), (bottom(R2), right(R1) + 1), . . . ,

(bottom(Rk), right(Rk−1) + 1), (bottom(R′

k′), left(R′

k′)), . . . , (bottom(R′

1), left(R′

1))

and the right lower boundary locations are

(bottom(R1), right(R1)), . . . , (bottom(Rk), right(Rk)),

(bottom(R′

k′), left(R′

k′−1)− 1), . . . , (bottom(R′

2), left(R′

1)− 1), (bottom(R′

1), m + 1).

See Figure 5(a) for an example.Now suppose that right(Rk) ≥ left(R′

k′). Start with i = 1 and j = 1. While right(Ri) <left(R′

j), if j = k′ or bottom(Ri+1) ≥ bottom(R′j+1) then increase i by 1, and otherwise

increase j by 1. When this process finishes, we have right(Ri) ≥ left(R′j). There are sev-

eral case that we need to consider. In the first case assume that right(Ri) > left(R′j) and

bottom(Ri) < bottom(Rj). Then the left upper boundary locations of S1c are (1, 1), the right

upper boundary locations are (1, m + 1), the left lower boundary locations are

(bottom(R1), 1), (bottom(R2), right(R1) + 1), . . . , (bottom(Ri), right(Ri−1) + 1),

(bottom(R′

j), right(Ri) + 1), (bottom(R′

j−1), left(R′

j−1)), . . . , (bottom(R′

1), left(R′

1))

22

(a) (b)

Figure 5: An example of the boundary computation for S1c . Figure (a) shows the case

right(Rk) < left(R′k′) and figure (b) shows the case right(Rk) > left(R′

k′). In each figure,the left boundary locations are shown as black circles, and the right boundary locations areshown as white circles.

and the right lower boundary locations are

(bottom(R1), right(R1)), . . . , (bottom(Ri), right(Ri)),

(bottom(R′

j), left(R′

j−1)− 1), . . . , (bottom(R′

2), left(R′

1)− 1), (bottom(R′

1), m + 1).

See Figure 5(b) for an example. The other cases are similar.We now describe how to merge the boundaries of S1

c and S2c . We first define x1(y) (resp.,

x2(y)) to be the integer x such that (x, y) is a location on the lower boundary of S1c (resp.,

on the upper boundary of S2c ). If there is no such location then x1(y) is undefined.

To merge the boundaries of S1c and S2

c , we process the left and right boundary locationsof S1

c and S2c from left to right. Suppose that we process some location (x, y). There are

two possible cases. The first case is when either x1(y) < x2(y)− 1, or at least one of x1(y)and x2(y) is undefined. The second case is when both x1(y) and x2(y) are defined andx1(y) ≥ x2(y) − 1. In the former case (x, y) is also a left or right boundary location ofSc, while in the latter case it is not. The columns in which there is a change between thetwo cases above generate additional boundary locations of Sc. More precisely, suppose forexample that x1(y−1) < x2(y−1)−1 and x1(y) ≥ x2(y)−1. This means that either (x1(y), y)is a right lower boundary location of S1

c , or (x2(y), y) is a right upper boundary location ofS2

c , or both. Suppose that the first case occurs. We then have that (x2(y − 1), y − 1) is aleft upper boundary location of Sc that is not a boundary location of S1

c and S2c .

The time complexity for computing the boundary of Sc is O(|Sc|). Therefore, computingall the boundaries takes O(n2) time.

5.2 Counting the characters

For every c ∈ Σ, let LUc (resp., RUc) be the set of left (resp., right) upper boundary locationsof Sc. Let LLc (resp., RLc) be the set of left (resp., right) lower boundary locations of Sc.

23

The algorithm moves a strip along T and maintains a vector D such that∑x

i=1 D[i] givesthe number of distinct characters in the m×m substring that starts at row x in the currentstrip.

For the 1st strip, the vector D is constructed as follows: For every c ∈ Σ and every pair(x, 1) ∈ LUc, add 1 to D[x] (note that the pair (x, 1) can appear in several sets LUc whenx = 1). For every c ∈ Σ and every pair (x, 1) ∈ LLc, subtract 1 from D[x + 1]

When moving from the (y − 1)-th strip to the y-th strip we update D as follows. Forevery c ∈ Σ:

• For every pair (x, y) ∈ LUc, add 1 to D[x].

• For every pair (x, y) ∈ LLc, subtract 1 from D[x + 1].

• For every pair (x, y − 1) ∈ RUc, subtract 1 from D[x].

• For every pair (x, y − 1) ∈ RLc, add 1 to D[x + 1].

References

[1] N. Alon and M. Naor. Derandomization, witnesses for boolean matrix multiplicationand construction of perfect hash functions. Algorithmica, 16:434–449, 1996.

[2] A. Amir, Y. Aumann, M. Lewenstein, and E. Porat. Function matching. SIAM J. onComputing, 35(5):1007–1022, 2006.

[3] A. Amir, G. Benson, and M. Farach. An alphabet independent approach to two dimen-sional pattern matching. SIAM J. on Computing, 23(2):313–323, 1994.

[4] A. Amir, K. W. Church, and E. Dar. The submatrices character count problem: anefficient solution using separable values. Information and Computation, 190(1):100–116,2004.

[5] A. Amir, M. Farach, and S. Muthukrishnan. Alphabet dependence in parameterizedmatching. Information Processing Letters, 49(3):111–115, 1994.

[6] A. Apostolico, P. L. Erdos, and M. Lewenstein. Parameterized matching with mis-matches. J. of Discrete Algorithms, 5(1):135–140, 2007.

[7] G. P. Babu, B. M. Mehtre, and M. S. Kankanhalli. Color indexing for efficient imageretrieval. Multimedia Tools and Applications, 1(4):327–348, 1995.

[8] B. S. Baker. Parameterized string pattern matching: Algorithms and applications. J.of Computer and Systems Sciences, 52(1):28–42, 1996.

[9] B. S. Baker. Parameterized duplication in strings: Algorithms and an application tosoftware maintenance. SIAM J. on Computing, 26(5):1343–1362, 1997.

24

[10] P. Clifford and R. Clifford. Simple deterministic wildcard matching. Information Pro-cessing Letters, 101(2):53–54, 2007.

[11] R. Cole and R. Hariharan. Verifying candidate matches in sparse and wildcard matching.In Proc. 34th ACM Symposium on Theory Of Computing (STOC), pages 592–601, 2002.

[12] R. Cole and R. Hariharan. Faster suffix tree construction with missing suffix links.SIAM J. on Computing, 33(1):26–42, 2003.

[13] P. Gupta, R. Janardan, and M. Smid. Further results on generalized intersection search-ing problems: Counting, reporting, and dynamization. J. of Algorithms, 19(2):282–317,1995.

[14] D. Harel and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAMJ. on Computing, 13(2):338–355, 1984.

[15] C. Hazay, M. Lewenstein, and D. Sokol. Approximate parameterized matching. ACMTransactions on Algorithms, 3(3), 2007.

[16] S. R. Kosaraju. Faster algorithms for the construction of parameterized suffix trees.In Proc. 36th Symposium on Foundation of Computer Science (FOCS), pages 631–637,1995.

[17] M. Swain and D. Ballard. Color indexing. International Journal of Computer Vision,7(1):11–32, 1991.

[18] U. Vishkin. Optimal parallel pattern matching in strings. In Proc. 12th InternationalColloquium on Automata, Languages and Programming (ICALP), pages 91–113, 1985.

[19] U. Vishkin. Deterministic sampling — a new technique for fast pattern matching. SIAMJ. on Computing, 20:303–314, 1991.

25