delta-gamma-parameterized matching
TRANSCRIPT
Parameterized Matching
Carmit HazayDepartment of Computer Science
M.Sc. Thesis
Submitted in partial fulfillment of the requirements for the M.Sc.
degree in the Department of Computer Science, Bar-Ilan University.
Ramat-Gan, Israel
September 2004
1
This work was carried out under the supervision of
Dr. Moshe Lewenstein,
Department of Computer Science, Bar-Ilan University.
Acknowledgments
I would like to thank my advisor Moshe Lewenstein for his guidance and directions. He taught me a lot and
I am grateful for the time and effort he put in this work.
Also, special thanks to Dina Sokol and Dekel Tzur. It was very inspiring working with them.
Abstract
Two equal length stringss ands′, over alphabetsΣs andΣs′ , parameterize matchif there exists a
bijectionπ : Σs → Σs′ , such thatπ(s) = s′, whereπ(s) is the renaming of each character ofs via π.
Parameterized matchingis the problem of finding all parameterized matches of a pattern stringp in a
text t. It was introduced as a model for software duplication detection in software maintenance systems
and also has applications in image processing and computational biology.
Two dimensional parameterized matchingis the task of finding all locations in ann × n-length text
where them × m subtext beginning at that location and a givenm × m-length pattern p-match. This
models searching for color images with changing color maps.Our first result is an algorithm for the two
dimensional case that runs in timeO(n2 + m2.5 · polylog(m)).
Approximate parameterized matchingis the problem of finding, at each location, a bijectionπ that
maximizes the number of characters that are mapped fromp to the appropriate|p|-length substring oft.
We consider the problem for which an error threshold,k, is given and the goal is to find all locations
in t for which there exists a bijectionπ which mapsp into the appropriate|p|-length substring oft with
at mostk mismatched mapped-elements.
We show that (1) the approximate parameterized matching, when |p|=|t|, is equivalent to the maxi-
mum matching problem on graphs, implying that (2) maximum matching is reducible to the approximate
parameterized matching with thresholdk, up till anO(log |t|) factor (this can be achieved by reducing
approximate parameterized matching to the problem by usinga binary search on thek’s). Given the
best known maximum matching algorithms anO(m1.5) time, wherem = |p| = |t|, is implied for ap-
proximate parameterized matching. We show that (3) for thek threshold problem we can do this in
O(m + k1.5) time.
Our second main result (4) is anO(nk1.5 + mk log m) time algorithm wherem = |p| andn = |t|.
Contents
Table of Contents 5
Contents 5
1 Introduction 7
2 Two Dimensional Parameterized Matching 10
2.1 Preliminaries and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . 10
2.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . 10
2.3 Algorithm Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 12
2.4 Text Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 14
2.5 Pattern Preprocessing — Finding witnesses . . . . . . . . . . . . . . . . . . .. . . . . . . 15
2.5.1 Simple Witnesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5.2 Corner Witnesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 19
2.5.3 l Strip Witnesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 Approximate Parameterized Matching 23
3.1 Preliminaries and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . 23
3.2 String Comparison Problem - The Binary Case . . . . . . . . . . . . . . . . . .. . . . . . 23
3.2.1 Case 1 : Firstm/2 Characters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.2 Case 2 : Lastm/2 Characters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 Case 3 : Both Halves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .24
3.2.4 Case 4 : Less than2k + 1 Characters . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 String Comparison Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . 25
3.3.1 Reducing Maximum Matching to the String Comparison Problem . . . . . . . . .. 25
3.3.2 Mismatch Pairs and Parameterized Properties . . . . . . . . . . . . . . . . . .. . . 26
3.3.3 String Comparison Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.4 Find Mismatch Pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.5 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 29
3.4.1 The Filter Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.2 Using Precomputed Information . . . . . . . . . . . . . . . . . . . . . . . . . . .. 30
3.4.3 Putting it Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 Pattern Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 32
3.6 2-D Parameterized Matching withk Mismatches . . . . . . . . . . . . . . . . . . . . . . . 33
7
Chapter 1
1 Introduction
In the traditional pattern matching model [14,24], one seeks exact occurrences of a given patternp in a textt,
i.e. text locations where every text symbol isequalto its corresponding pattern symbol. For two equal length
stringss ands′ we say thats is aparameterized match, p-matchfor short, ofs′ if there exists a bijectionπ
from the alphabet ofs to the alphabet ofs′ such that every symbol ofs′ is equal tothe image underπ of
the corresponding symbol ofs. In theparameterized matchingproblem, introduced by Baker [10, 12], one
seeks text locations for which the patternp p-matches the substring of length|p| beginning at that location.
Baker introduced parameterized matching for applications that arise in software tools for analyzing source
code. Specifically, the application is to identify duplicate code in large software systems for reuse. Here
it is desirable to find not only exact matches between program fragments but also parameterized matches,
namely, where the two program fragments are equal but possibly use interchangeable identifiers (represent-
ing variable, constant or function names).
When program fragments that repeat are discovered, other files of theprogram are searched to find repe-
titions of the program fragment in other locations. (It is often the case that repeating fragments have many
appearances.) This exactly defines theparameterized matching problem, or p-match problemfor short. It
turns out that the parameterized matching problem arises in other applications such as in image processing
and computational biology, see [3].
In [10,12] an optimal, linear time algorithm was given for p-matching. However, it was assumed that the
alphabet was of constant size. Amir et.al. [6] presented tight bounds forp-matching in the presence of an
unbounded size alphabet.
In [11] a novel method was presented for parameterized matching by constructing parameterized suffix
trees, which also allows for online p-matching. The parameterized suffix trees are constructed by converting
the pattern string into apredecessor string. A predecessor stringof a strings has at each locationi the
distance betweeni and the location containing the previous appearance of the symbol. The first appearance
of each symbol replaced with a 0. For example, the predecessor string ofaabbaba is 0, 1, 0, 1, 3, 2, 2. A
simple and well-known fact is that:
Observation 1. s ands′ p-match iff they have the same predecessor string.
The parameterized suffix tree is constructed in a manner that every path corresponds to a suffix, in the
standard sense, but is labeled with the predecessor string of the appropriate suffix. Branching in the param-
1 INTRODUCTION 8
eterized suffix tree occurs according to the labels of the predecessor strings. The parameterized suffix tree
was further explored by Kosaraju [25] and faster constructions weregiven by Cole and Hariharan [15].
One of the interesting problems in web searching is searching for color images, see [5, 9, 30]. The sim-
plest possible case is searching for an icon in a screen, a task that the Human-Computer Interaction Lab at
the University of Maryland was confronted with. If the colors are fixed,this is exact 2-dimensional pattern
matching [4]. However, if the color maps in pattern and text differ, the exact matching algorithm would not
find the pattern. Parameterized 2-dimensional search is precisely what is needed. The fastest algorithm solv-
ing the two-dimensional parameterized matching problem was given in [3] andruns in timeO(n2 log2 m)
for ann × n text.
It is an open question whether a linear time algorithm exists. In this work we show that it is possible to get
an algorithm that is linear in the text size with a penalty in the preprocessing stage, namelyO(n2 + m2.5 ·polylog(m)). For alphabets drawn from a large universe (larger than polynomial inn) the time increases to
O(n2 log |Σ| + m2.5 · polylog(m)). However, this seems to be unavoidable as it was shown in [6] that the
one-dimensional parameterized matching requiresΩ(n log |Σ|) time.
In reality, when searching for an image, one needs to take into account theoccurrence of errors that
distort the image. Errors occur in various forms, depending upon the application. Several distance metrics
have been defined to account for such errors. Two of the most classical distance metrics are the Hamming
distance and the edit distance [28]. TheHamming distancebetween two equal-length strings is the number
of mismatching characters when the strings are aligned. The edit distance is the minimal number of character
replacements, insertions, and deletions needed to convert one string into another [28].
In [13] the parameterized match problem was considered in conjunction with the edit distance. Here
the definition of edit distance was slightly modified so that the edit operations are defined to be insertion,
deletion and parameterized replacements, i.e. the replacement of a substringwith a string that p-matches it.
An algorithm for finding the “p-edit distance” of two strings was devised whose efficiency is close to the
efficiency of the algorithms for computing the classical edit distance. Also analgorithm was suggested for
the decision variant of the problem where a parameterk is given and it is necessary to decide whether the
strings are within (p-edit) distancek.
However, it turns out that the operation of parameterized replacement relaxes the problem to an easier
problem. The reason that the problem becomes easier is that two substringsthat participate in two parame-
terized replacements are independent of each other (in the parameterizedsense).
A more rigid, but more realistic, definition for the Hamming distance variant was given in [8]. For a
pair of equal length stringss ands′ and a bijectionπ defined on the alphabet ofs, theπ-mismatch is the
Hamming distance between the image underπ of s ands′. The minimalπ-mismatch over all bijectionsπ is
the approximate parameterized match. The problem considered in [8] is to find for each locationi of a text
t the approximate parameterized match of a patternp with the substring beginning at locationi.
In [8] the problem was defined and linear time algorithms were given for the case were the pattern is
1 INTRODUCTION 9
binary or the text is binary. However, this solution does not carry over tolarger alphabets.
Unfortunately, under this definition the methods for classical string matching with errors for Hamming
distance, e.g. [20,27], also known as pattern matching with mismatches, seemto fail. Following is an outline
of the method [20] for pattern matching with mismatches.
The pattern is compared separately to each suffix of the text, beginning at locations1 <= i <= n.
Using a suffix tree of the text and precomputed longest common ancestor (LCA) information (which can be
computed once in linear time, see [22, 29]) one can find the longest common prefix of the pattern and the
corresponding suffix (in constant time). There must be a mismatch immediately afterwards. The algorithm
jumps over the mismatch and repeats the process taking into consideration the offsets of the pattern and
suffix.
When attempting to apply this technique to a parameterized suffix tree, it fails. Toillustrate this, consider
the first matching substring (up until the first error) and the next matching substring (after the error). Both
of these substrings p-match the substring of the text that they are aligned with. However, it is possible
that combined they do not form a p-match. See the example below. In the example abab p-matchescdcd
followed by a mismatch and subsequently followed byabaa p-matchingefee. However, differentπ’s are
require for the local p-matches. This also emphasizes why the definition of [13] is a simplification.
p = a b a b a a b a a ...
t = ... c d c d d e f e e ...
In this work we consider the problem ofparameterized matching with k mismatches. The parameter-
ized matching problem withk mismatches seeks allp-matchesof a patternp in a text t, with at mostk
mismatches.
For the case where|p|=|t| = m, which we call the string comparison problem we show that this problem is
in a sense equivalent to the maximum matching problem on graphs. AnO(m1.5) algorithm for the problem
follows, by using maximum matching algorithms. A matching ”lower bound” follows, inthe sense that an
o(m1.5) algorithm implies better algorithms for maximum matching. For the string comparison problem
with thresholdk, we show anO(m + k1.5) time algorithm and a ”lower bound” ofΩ(m + k1.5
log m).
The first result is an algorithm that solves the parameterized matching problem with k mismatches prob-
lem, given a pattern of lengthm and a text of lengthn, in O(nk1.5 + mk log m) time. This immedi-
ately yields a 2-dimensional algorithm of time complexityO(n2mk1.5 + m2k log m), where|p| = m2 and
|t| = n2.
10
Chapter 2
2 Two Dimensional Parameterized Matching
2.1 Preliminaries and Definitions
Let T andT ′ be two texts of sizek × k. DenoteT by T [1, 1], . . . , T [1, k], T [2, 1], . . . , T [2, k], . . . , T [k, k].
We say that there is afunction matchingbetweenT andT ′ if there is a mappingf from the alphabet of
T , to the alphabet ofT ′ such thatT ′ = f(T ), wheref(T ) = f(T [1, 1]), . . . , f(T [1, k]), f(T [2, 1]), . . . ,
f(T [k, k]). If the mappingf is one-to-one, we say thatT and T ′ parameterize match, or p-match for
short. Note that the definition of function matching is asymmetric whereas the definition of parameterized
matching is symmetric. The parameterized matching problem is defined as follows:
Input: An n × n textT and anm × m patternP .
Output: All locations(i, j) of T where them×m substring ofT beginning at location(i, j) andP p-match.
Observation 1 gives a handle on finding p-matches for 1-dimensional texts. We would like to use a similar
observation for 2-dimensional texts as well. In order to do so we define thefollowing.
Definition 1. Let T be ann × n text. Astrip is a substring ofT of sizen × m. More specifically, thei-th
strip ofT is then × m substringT [1 : n, i : i + m − 1].
LetA be ann×k string. We define thelinearizationof A to be the one-dimensional stringA[1, 1], . . . , A[1, k],
A[2, 1], . . . , A[2, k], . . . , A[n, 1], . . . , A[n, k]. Now, letT (i) denote the linearization of thei-th strip ofT and
let P ′ be the linearization ofP . Observe that:
Observation 2. Them × m subtext ofT beginning at location(i, j) andP p-match iffP ′ p-matches with
them2 length substring ofT (i) beginning at location(j − 1)m + 1.
Since we now can verify p-matches via 1-dimensional strings, we can use Observation 1. For a 2-
dimensional stringS of width m, the predecessor of location(i, j) in S is defined to be the predecessor
of the location that corresponds to(i, j) in the linearization ofS.
2.2 Overview
We begin by giving an overview of the algorithm. As in many pattern matching algorithms, we assume
w.l.o.g. that the size of the textn × n satisfiesn = 2m. Larger texts can be cut into2m × 2m pieces and
each handled separately. We also assume thatΣ = 1, . . . , 4m2.
2 TWO DIMENSIONAL PARAMETERIZED MATCHING 11
The outline of the algorithm follows the “duel-and-sweep”-paradigm that appeared in [4] (there it is
named “consistency and verification” and was used for 2-dimensional exact matching) and is based upon
ideas of dueling techniques [31, 32]. The idea of the “duel-and-sweep”-paradigm is to maintain a list of
candidate match locations inT , starting with all locations and pruning the list in two stages to a list of all
the candidates which actually match. The two stages are called thedueling stageand thesweeping stage.
However, exact matching turns out to be much simpler than parameterized matching. This is mainly because
of the ”witness” which is used in the dueling stage to differentiate between two pattern alignments (also the
sweeping stage is simpler for similar reasons). In exact matching a witness is simply one location with a
mismatch between the two alignments. This is not sufficient for parameterized matching as p-matching is
dependent on other locations. The following definition captures the concept of a witness in parameterized
matching.
Definition 2 (witness). LetP be anm × m size pattern. Consider an alignment ofP with itself aligned to
begin at location(a + 1, b + 1). A witnessrelative to alignment(a, b) is a pair of locations(x, y), (x′, y′)
such that one of the following holds:
1. P [x, y] = P [x′, y′] andP [x + a, y + b] 6= P [x′ + a, y′ + b].
2. P [x, y] 6= P [x′, y′] andP [x + a, y + b] = P [x′ + a, y′ + b].
In the dueling stage, we use the fact that if two p-matches ofP in T overlap, then there is a p-matching
between the two substrings ofP that correspond to the overlapping area of the two matches. Using a witness
array forP , we can check pairs of overlapping candidates which do not agree on their overlapping area, and
rule out at least one candidate from the pair in constant time per pair. Afterthis stage we remain only with
candidates that agree with each over.
In the sweeping stage, we need to check the remaining candidates. This is done by going over all the
strips ofT from left to right, and for each strip, checking the candidates that beginsin the first column of
the strip.
Suppose that the current strip starts at columny, and consider some candidateT ′ = T [x : x + m − 1, y :
y + m − 1] in the strip. Suppose that we know the predecessor for every location in the current strip ofT .
A pair of location inT (in the current strip) and its predecessor will be called alocation-predecessor pair.
Before we make the next observation a central concept which has a similarflavor to the witness defined
above is as follows.
Definition 3 (mismatch pair). Let R andS be two equal-length 2-dimensional strings. Amismatch pair
betweenR andS is a pair of locations(i, j), (i′, j′) such that one of the following holds,
1. R[i, j] = R[i′, j′] andS[i, j] 6= S[i′, j′] 2. R[i, j] 6= R[i′, j′] andS[i, j] = S[i′, j′].
We now observe that:
2 TWO DIMENSIONAL PARAMETERIZED MATCHING 12
Observation 3. 1. If T ′ p-matches toP , then every location-predecessor pair insideT ′ (namely, both the
location and its predecessor are inT ′) is not a mismatch pair forT ′ andP .
2. If there is no location-predecessor pair inT ′ which is a mismatch pair forT ′ and P , then there is a
function matching betweenT ′ andP .
3. If there is a function matching betweenT ′ andP , then there is a p-matching betweenT ′ andP if and
only if the number of distinct characters inT ′ is equal to the number of distinct characters inP .
By Observation 3, the algorithm for checking the candidates will have two steps: In the first step, for
each candidate, we check whether every location-predecessor pair inthe candidate is a mismatch pair. In the
second step we compute the number of distinct characters in each candidateand compare it to the number
of distinct characters inP .
In order to implement the first step, we will maintain the predecessor of everylocation in the current strip.
Computing the predecessor of every location in the first strip ofT takesO(m2) time. When going from on
strip to the next strip, only at most6m predecessors are updated, and we will show how to do the update in
O(m) time.
Clearly, if we check each location-predecessor pair for each candidate separately, then the algorithm will
not be efficient (the time complexity will beΘ(m4) in the worst case). However, we can use the fact that
after the dueling stage, all the remaining candidates agree with each other. Therefore, for each location-
predecessor pair inT , the pair is either a mismatch pair forP and every candidate that contains the pair, or
the pair is not a mismatch pair for any candidate. Thus, for each location-predecessor pair we need to check
only two characters inP , and if there is a mismatch, we can rule out all the remaining candidates which
contain the pair.
The second step is done by computing the number of distinct character in every m × m window of T .
Amir et al. [5] gave an algorithm for this problem whose time complexity isO(m2 log m). We can solve the
problem inO(m2) time, but omit for space reasons. (Amir and Cole [2] also have solved the problem in the
same time bounds).
2.3 Algorithm Details
From Observation 3, we need to check for each candidate whether the location-predecessor pairs inside it
are mismatch pairs. As discussed in Section 2.2, we will check every location-predecessor pair only once
during the entire algorithm.
The main idea is as follows: We will go over the strips ofT from left to right. When going from the
(y − 1)-th strip to they-th strip, there are at most6m new location-predecessor pairs: (1) every location
in columny + m − 1 and its predecessors form a new pair, (2) every location in columny + m − 1 may
become the predecessor of some location in columnsy, . . . , y + m − 2, and (3) every location in columns
y, . . . , y+m−2 whose predecessor was in columny−1 will form with its new predecessor a new pair. After
2 TWO DIMENSIONAL PARAMETERIZED MATCHING 13
a preprocessing step onT that will be described in Section 2.4, we can find all the new location-predecessor
pairs in timeO(m).
Now, let (r, c), (r′, c′) be some new location-predecessor pair for they-th strip. All the candidates that
contain the locations(r, c) and (r′, c′) agree with each other, so(r, c), (r′, c′) is a mismatch pair for all
of these candidates, or for none. Therefore, we only need to find onecandidate(z, w) (with w ≥ y) that
contains both locations, and check whetherP [r− z +1, c−w +1] = P [r′− z +1, c′−w +1]. If these two
characters are not equal, then(z, w) cannot be a match, and moreover, every candidate that contains(r, c)
and(r′, c′) cannot be a match. In other words, we can rule out every candidate thatbegins in the rectangle
r − m + 1, . . . , r′ × y, . . . ,min(c, c′). Note that it is possible that the predecessor of(r, c) changes at
stripy′ for y < y′ ≤ min(c, c′), so(r, c), (r′, c′) is not a location-predecessor pair for some of the candidate
that we rule out. However, this does not affect the correctness of the algorithm.
We now need to handle two issues: How to find a candidate that contains the pair (r, c), (r′, c′), and in a
case of a mismatch, how to rule out the candidates in the corresponding rectangle.
We first deal with the first issue. Given a location-predecessor pair(r, c), (r′, c′), we will find the highest
candidate that can contain the pair(r, c), (r′, c′), which will be denoted(z, w). More precisely, among all
the candidates inr − m + 1, . . . , r × y, . . . ,min(c, c′) (note that the rectangle here ends in rowr),
(z, w) is the candidate with smallest row number (ties are broken arbitrarily). Clearly, if (z, w) does not
contain the pair(r, c), (r′, c′) (that is,z > r′), then there is no candidate (that begin in column greater than
y) that contains the pair.
We now describe how to find the highest candidate in constant time. To do that,we build a2m × 2m
arrayA, whereA[i, j] is the smallest row in which a candidate starts, among all the candidates with a start
row at mosti − m + 1 and start columnj. If there is no such row, thenA[i, j] = 2m. The arrayA can be
easily computed by scanning the text column by column, from top to bottom.
Now, given a location-predecessor pair(r, c), (r′, c′), in order to find the highest candidate that contains
the pair, we need to find the minimum value among the subrowA[r, y], A[r, y + 1], . . . , A[r, min(c, c′)].
This is the range minima problem [18], so after preprocessing each row ofA in O(m) time (for each row),
we can find the minimum element in some subrow ofA in constant time.
We now turn to the problem of eliminating candidates. As discussed above, during the algorithm we find
rectangles that do not contain a match. Instead of eliminating the candidates atthe time each rectangle is
discovered, we store the rectangle in a listL, and after finding all the rectangles, we perform a candidates
elimination stage.
In the candidates elimination stage, we need to remove each candidate whose starting location is in some
rectangle ofL. This is done by moving a vertical sweep line from left to right. We maintain two vectors
V andB, whereV [i] (resp.,B[i]) is the number of rectangles inL which intersects the current sweep line,
and whose top (resp., bottom) row isi. Using these vectors, we can compute the number of rectangles that
contain each point on the sweep line, and eliminate the candidates that start in alocation that is contained by
2 TWO DIMENSIONAL PARAMETERIZED MATCHING 14
at least one rectangle. The time needed to update the vectorsV andB is after moving the sweep line is linear
in the number of new rectangles that intersects the line after its movement and thenumber of rectangles that
stopped intersecting the line. Therefore, the total time complexity is linear in the number of rectangles inL,
which is at most6m2.
2.4 Text Preprocessing
In this section, we show how to compute an array of pointers, which will be used to maintain the predecessors
of the current strip.
Definition 4 (left predecessor).Theleft predecessorof every location(i, j) in T , is the predecessor of(i, j)
in T (1) if j ≤ m, and ifj > m, the left predecessor is the predecessor of location(i, m) in T (j−m+1).
We will show how to compute the left predecessor of every location inT . For computing the left pre-
decessor of all the locations in columns1, . . . , m, let T ′ be the linearization for the first strip ofT . We
compute the left predecessors by scanningT ′ from left to right. While scanning, we only need to keep the
last encountered location of eachσ ∈ Σ. The time complexity for this stage isO(m2).
For computing the left predecessor of the locations in columnsm + 1, . . . , 2m, we scan the entire text
bottom-up left-to-right. Following the left predecessor definition, we show how to create the left predecessor
for each text location(i, j).
We first define the following relation.
Definition 5. A left downrelation between two text locations(i, j) and(k, l) occurs whenk < i andl > j.
Given a new text location(l, k) with valueT [k, l], consider the list of text locations(i1, j1), . . . , (is, js)
such thatT [k, l] = T [i1, j1] = · · · = T [is, js] for which we did not locate their left predecessors yet, ordered
by their scan order. Then we can observe the following:
Observation 4. Every two elements in the above list maintain the left down relation.
Note that the position(k, l) may already have a left predecessor computed for it ifl ≤ m. If not, then a
left predecessor will need to be computed for it as well.
Now the new text location(k, l) may be the left predecessor of some of the locations(i1, j1), . . . , (is, js).
We claim that this left predecessor must be hereditary in the following sense: either(k, l) is the predecessor
of (i1, j1), . . . , (if , jf ) for some1 ≤ f ≤ s or (k, l) is the predecessor of(if , jf ), . . . , (is, js) for some
1 ≤ f ≤ s.
To see this we make some observations regarding text location(k, l):
1. k ≤ if for everyf , 1 ≤ f ≤ s. This follows from the bottom-up left-to-right scan.
2. if l > js or if l ≤ j1 − m then(k, l) is not the left predecessor of any location.
2 TWO DIMENSIONAL PARAMETERIZED MATCHING 15
l
A1 A2
A3 A4
C
B4
B1
B2 B3
Figure 1: The subsets ofD.
3. if jf−1 < l butjf ≥ l then(k, l) is the left predecessor of(if , jf ), . . . , (is, js) (f −1 may be equal to 0).
4. if jf+1 > l + m− 1 but jf ≤ l + m− 1 then(k, l) is the left predecessor of(i1, j1), . . . , (if , jf ) (f + 1
may be equal to s+1).
By scanning the list from either side, this stage can be implemented in overallO(m2) time.
2.5 Pattern Preprocessing — Finding witnesses
Let P be anm×m string. We will show how to compute a witness for every offset(a, b) that has a witness.
Let l be some integer that will be specified later. Consider an alignment ofP against itself, with offset
(a, b). We say that(a, b) is asourceif there is a p-match between the overlapping areas in the two copies of
P . If (a, b) is not a source, then a there is at least one witness for(a, b). A witness(x, y), (x′, y′) is called
a witness of type 1if it satisfies condition 1 in Definition 2, and otherwise, it is called a witness of type 2.
There are four cases that we need to consider, according to the signs of a andb. In the following, we will
handle the case whena ≥ 0 andb ≥ 0. The other cases are symmetrical, and thus omitted.
Let (a, b) be a location which is not a source, and letD = 1, . . . , m − a × 1, . . . , m − b be the
overlapping area ofP whenP is aligned against itself with offset(a, b). We partition the setD into subsets
(see Figure 1):A1, . . . , A4 are thel × l squares in the corners ofD. B1, . . . , B4 are rectangles of width
or height l in the borders ofD excluding the corners, andC is the remaining part ofD (namely,C =
D \ (A1 ∪ · · · ∪ A4 ∪ B1 ∪ · · · ∪ B4)). Formally,
A1 = (x, y) ∈ 2 : 1 ≤ x ≤ l and1 ≤ y ≤ l,
A2 = (x, y) ∈ 2 : 1 ≤ x ≤ l andm − b − l + 1 ≤ y ≤ m − b,
etc.
We say that a witnessw for (a, b) is simpleif it does not satisfy any of the following conditions:
2 TWO DIMENSIONAL PARAMETERIZED MATCHING 16
1. w is of type 1, and one of the locations ofw is in A1.
2. w is of type 1, one of the locations ofw is in A2 ∪ B3 ∪ A4, and the other location is inA3 ∪ B4 ∪ A4.
3. w is of type 2, and one of the locations ofw is in A4.
4. w is of type 2, one of the locations ofw is in A1 ∪ B1 ∪ A2, and the other location is inA1 ∪ B2 ∪ A3.
The algorithm consists of three stages:
1. Find simple witnesses.
2. Find witnesses that satisfy conditions 1 or 3 above.
3. Find witnesses that satisfy conditions 2 or 4 above.
The three stages of the algorithm are described in the next sub-sections.In each stage of the algorithm, we
will handle only offsets do not contain witnesses that were found during the previous stages.
2.5.1 Simple Witnesses
This stage is similar to the algorithm of Amir et al. [3]: We create new stringsP1 andP2 by replacing every
characterP [x, y] in P by 8ml characters, where each of these characters is a pointer to some location(x′, y′)
in P such thatP [x′, y′] = P [x, y] or a null pointer.
For each location(x, y) in P , we define rectangles on the grid2 in following way: Fori = −ml , . . . , m
l −1, let
Hi,(x,y) = (x′, y′) ∈ 2 : y′ ≤ y andx + il ≤ x′ < x + (i + 1)l,
Hi,(x,y) = (x′, y′) ∈ 2 : y′ ≥ y andx + il ≤ x′ < x + (i + 1)l,
Vi,(x,y) = (x′, y′) ∈ 2 : x′ ≥ x andy + il ≤ y′ < y + (i + 1)l,
and
Vi,(x,y) = (x′, y′) ∈ 2 : x′ ≤ x andy + il ≤ y′ < y + (i + 1)l.
Furthermore, letH(x,y) = (x, y′) ∈ 2 : y′ ≤ y, and we similarly define the rectanglesH(x,y), V(x,y),
andV(x,y). We say that a rectangleHi,(x,y) (or otherH rectangle) is inside the squareP if there is a column
of Hi,(x,y) that is contained in1, . . . , m × 1, . . . , m, namely, if1 ≤ y + il andy + (i − 1)l − 1 ≤ m.
A V rectangle is inside the square ofP if there is a row of the rectangle that is contained in1, . . . , m ×1, . . . , m.
The stringP1 is constructed by replacing every characterP [x, y] in P by the charactersc1, . . . , c8m/l+4
(P1 is anm × ((8ml + 4)m) string) which are defined as follows:
• For i = −ml , . . . , m
l − 1, if the rectangleHi,(x,y) is inside the square ofP , traverse over the elements of
Hi,(x,y) in a column major order (from right to left), until reaching a location(x′, y′) for whichP [x′, y′] =
P [x, y], if there is such a location. If a location(x′, y)′ was found, then setcm/l+i+1 = (x − x′, y − y′),
2 TWO DIMENSIONAL PARAMETERIZED MATCHING 17
and we say in this case thatcm/l+i+1 points to(x′, y′). Otherwise (namely, the rectangleHi,(x,y) does not
contain a character equal toP [x, y]), setcm/l+i+1 = φ.
If the rectangleHi,(x,y) is not inside the square ofP , then setcm/l+i+1 = φ.
• For i = −ml , . . . , m
l − 1, the characterc3m/l+i+1 is built from the rectangleHi,(x,y) in the same way
described above, except that the rectangle is traversed from right to left, and furthermore, if the rectangle
does not contain a character equal toP [x, y], thenc3m/l+i+1 = 0.
• For i = −ml , . . . , m
l − 1, the charactersc5m/l+i+1 andc7m/l+i+1 are built from the rectanglesVi,(x,y)
andVi,(x,y) the same as above, except that the rectangles are traversed in a row majororder (if the cor-
responding rectangle does not contain a character equal toP [x, y], we useφ for c5m/l+i+1 and0 for
c7m/l+i+1).
• The characterc8m/l+1 corresponds to the rectangleH(x,y), and it is handled the same as the characters
c1, . . . , c2m/l. Furthermore, the charactersc8m/l+2, c8m/l+3, andc8m/l+4 correspond to the rectangles
H(x,y), V(x,y), andV(x,y), respectively.
The stringP2 is built in two steps. The first step is the same as buildingP1, except that we replace the roles
of φ and0, that is, we useφ for characters that correspond to the rectanglesHi,(x,y) andVi,(x,y), and0 for
characters that correspond to the rectanglesHi,(x,y) andVi,(x,y). After the first step,P2 is anm × (8ml · m)
string. In the second step, we expandP2 into a2m × 2(8ml · m) string, where all the new characters areφ.
After building P1 andP2, we solve the (standard) matching problem with don’t care symbols onP1 and
P2, whereP1 is the pattern,P2 is the text, andφ is the don’t care symbol. Moreover, using the algorithm
of Alon and Naor [1], we find witnesses for every mismatch betweenP1 andP2, namely, for every(a, b)
such thatP1 does not match toP2[a + 1 : a + m, b + 1 : b + m], we find a location(x, y) such that
P1[x, y] 6= P2[x + a, y + b].
Consider some fixed pair(a, b), and defineP ′2 = P2[a + 1 : a + m, b + 1 : b + m].
Claim 1. If P1 does not match toP ′2, then (a, b) is not a source. Moreover, from every witness to the
mismatch ofP1 andP ′2 we can obtain a witness for(a, b) in constant time.
The converse of Claim 1 is not true. A weaker result is given in the following lemma.
Lemma 2. If (a, b) has a simple witness thenP1 does not match toP ′2.
Proof. Suppose thatw = (x, y), (x′, y′) is a simple witness for(a, b). W.l.o.g. we assume thatx ≥ x′,
and moreover, ifx = x′, we assume thaty > y′. We will deal with the case whenw is a witness of type 1,
and omit the proof for the case whenw is of type 2. We consider 3 cases.
Case 1 In the first case, suppose that eitherx = x′ or (x′, y′) is in B1 ∪ C ∪ B4, or in other words,
l + 1 ≤ y′ ≤ m − b − l. We prove the lemma using induction onx − x′.
2 TWO DIMENSIONAL PARAMETERIZED MATCHING 18
If x = x′, we have that(x, y′) ∈ H(x,y). Thus, the character ofP1 that corresponds toH(x,y) points
to some location(x, y′′) such thaty′ ≤ y′′ < y andP [x, y′′] = P [x, y]. If P [x + a, y′′ + b] 6= P [x +
a, y + b], then the character ofP2 that corresponds toH(x+a,y+b) either points to some location other
than P [x + a, y′′ + b], or is equal to0. In both cases, we conclude thatP1 does not match toP ′2. If
P [x + a, y′′ + b] = P [x + a, y + b], thenw′ = (x, y′′), (x, y′) is a simple witness for(a, b) of type 1.
We can now use the argument above onw′ and either obtain thatP1 does not match toP ′2, or obtain a new
witnessw′′. This process can be continued, and since this process must stop, we conclude thatP1 does not
match toP ′2.
Now, suppose thatx > x′, and we proved case 1 of the lemma for every witness in which the difference
between the rows in the two locations is less thanx − x′. Sincel + 1 ≤ y′ ≤ m − b − l, it follows that
the rectangleVi,(x,y) that contains the location(x′, y′) is inside the square ofP . Therefore, the character of
P1 that corresponds toVi,(x,y) points to some location(x′′, y′′), wherex ≤ x′′ < x′. If (x′′, y′′) then we
are done since the character ofP2 that corresponds toHi,(x+a,y+b) either points to some location other than
P [x′′+a, y′′+b], or it is equal to0. Otherwise, we use the induction hypothesis onw′ = (x′′, y′′), (x′, y′).
Note that there is a slight technical problem since it is possible that(x′′, y′′) ∈ A1 and thereforew′ is not
simple. However, the arguments we used above still work onw′, so we still obtain thatP1 does not match
to P ′2.
Case 2 The second case is when the eithery = y′ or leftmost location among(x, y) and (x′, y′) is in
B2 ∪ C ∪ B3. The proof for the case is symmetrical to the proof of case 1 (we use the rectanglesV(x∗,y∗)
andHi,(x∗,y∗) instead ofH(x,y) andVi,(x,y), where(x∗, y∗) is the rightmost location).
Case 3 Assume that cases 1 and 2 do not occur. Since case 1 does not occur,we have that eithery′ ≤ l or
y′ ≥ m − b − l + 1. We consider 4 sub-cases:
1. y′ ≤ l andy > y′. Since(x′, y′) is the leftmost location and case 2 does not occur, we have that either
x′ ≤ l or x′ ≥ m− a− l + 1. If x′ ≤ l then(x′, y′) ∈ A1, and thereforew is not simple, a contradiction.
If x′ ≥ m − a − l + 1, then(x, y), (x′, y′) ∈ A3 ∪ B4 ∪ A4 which again contradicts the assumption that
w is simple.
2. y′ ≤ l andy < y′. In this case, the rectangleV0,(x,y), which contains the point(x′, y′), is inside the
square ofP . Therefore, using the same arguments as in case 1, we obtain thatP1 does not match toP ′2.
3. y′ ≥ m − b − l + 1 andy > y′. We have that(x, y), (x′, y′) ∈ A2 ∪ B3 ∪ A4, which contradicts the
assumption thatw is simple.
4. y′ ≥ m − b − l + 1 andy < y′. Since(x, y) is the leftmost location, it follows that eitherx ≤ l or
x ≥ m − a − l + 1. In the former case, the rectangleH0,(x′,y′), which contains the location(x, y), is
inside the square ofP . It follows thatP1 does not match toP ′2. In the latter case, we obtain thatw is not
simple, a contradiction.
2 TWO DIMENSIONAL PARAMETERIZED MATCHING 19
From Claim 1 and Lemma 2 we conclude that stage 1 of the algorithm correctly finds a witness for every
offset (a, b) that has simple witnesses. The time complexity of this stage isO(|P2| · log4+o(1) |P1|) =
O(m3
l · log4+o(1) m).
2.5.2 Corner Witnesses
In the following stages, we will describe only how to find witnesses of type 1,as handling the witnesses of
type 2 is symmetrical.
The second stage is composed of four sub-stages.
Stage 2a
In this stage we find all the witnessesw such that the two locations ofw are inA1. For each location(x, y)
in P , we define rectangles as follows:
H2i,(x,y) = (x + i, y′) ∈ 2 : y′ ≤ y
V 2i,(x,y) = (x′, y + i) ∈ 2 : x′ ≤ x.
Using these rectangles, we build stringsP 21 andP 2
2 by replacing each characterP [x, y] in P by 4l characters
that correspond to the rectanglesH2i,(x,y) andV 2
i,(x,y) for i = −l + 1, . . . , l − 1 (the construction is similar
to the construction ofP1 andP2). Then, we solve the matching problem forP 21 andP 2
2 and find witnesses
for the mismatches. The correctness of this sub-stage follows from the fact that for two locations inA1, one
of the location is inside the rectangles of the other locations.
Stage 2b
In this stage we find all the witnessesw such that one location ofw is in A1, and the other location is in
B1 ∪ B2 ∪ C ∪ A3 ∪ B4.
As in the previous stages, we encode a character ofP by storing pointers to other occurrences of the
character inP . However, instead of creating one matching problem instance, we will createm2/l instances.
The main idea is to group the different offsets into set and build a matching problem instance for each set,
and then use the instance to find witnesses for the offsets in the set. In eachset, all the offsets are close to
each other, and we use this fact in our construction.
Consider a set of offsets(a, b), (a, b+1), . . . , (a, b+l−1), whereb is a multiple ofl. Define the rectangles
H3,a,b(x,y) = (x′, y′) ∈ 2 : y ≤ y′ ≤ m − b − l and0 ≤ x′ ≤ m − a
H4,a,b(x,y) = (x′, y′) ∈ 2 : y′ ≤ y andx ≤ x′ ≤ m − a,
2 TWO DIMENSIONAL PARAMETERIZED MATCHING 20
and build stringsP 3,a,b1 andP 3,a,b
2 as follows: For every(x, y) ∈ A1, the stringP 3,a,b1 contains two characters
for P [x, y] corresponding to the rectanglesH3,a,b(x,y) andH4,a,b
(x,y). For every location(x, y) with a + 1 ≤ x ≤a + l andb + 1 ≤ y ≤ b + 2l − 1, the stringP 3,a,b
2 contains two characters forP [x, y] corresponding to the
rectanglesH3,a,b(x,y) andH4,a,b
(x,y). Moreover, the stringP 3,a,b2 is padded with don’t care symbols.
As in the previous stages, we solve the matching problem forP 3,a,b1 andP 3,a,b
2 and find witnesses for the
mismatches.
Stage 2c
This stage finds all the witnessesw such that one location ofw is in A1, and the other location is inA2∪B3.
The algorithm is similar to stage 2b: For every set of offsets(a, b), (a + 1, b), . . . , (a + l − 1, b), wherea is
a multiple ofl, define
V 3,a,b(x,y) = (x′, y′) ∈ 2 : x ≤ x′ ≤ m − a − l and0 ≤ y′ ≤ m − b
V 4,a,b(x,y) = (x′, y′) ∈ 2 : x′ ≤ x andy ≤ y′ ≤ m − b,
and build stringsP 4,a,b1 andP 4,a,b
2 . Then, solve the matching problem forP 4,a,b1 andP 4,a,b
2 and find witnesses
for the mismatches.
Stage 2d
In this final sub-stage, we find all the witnessesw such that one location ofw is in A1, and the other location
is in A4. Recall that
H2i,(x,y) = (x + i, y′) ∈ 2 : y′ ≤ y.
For every set of offsets(a + i, b + j) : i = 0, . . . , l − 1 andj = 0, . . . , l − 1 wherea and b are
multiples ofl, build stringsP 5,a,b1 andP 5,a,b
2 : For every location(x, y) with m − a − 2l + 2 ≤ x ≤ m and
m − b − 2l + 2 ≤ y ≤ m, P 5,a,b1 contains3l − 3 characters forP [x, y], corresponding to the rectangles
H2i,(x,y) for i = m − a − 3l + 2, . . . , m − a − 1. Similarly, for every(x, y) with 1 ≤ x ≤ 2l − 1 and
1 ≤ y ≤ 2l − 1, P 5,a,b2 contains3l − 3 characters forP [x, y], corresponding to the rectanglesH2
i,(x,y) for
i = m − a − 3l + 2, . . . , m − a − 1.
The total time complexity of stage 2 isO(m2l · log4+o(1) m).
2.5.3 l Strip Witnesses
Let (a, b) be some offset, for which no witness was found during the previous stages. We first look for
witnesses with one location inB3 ∪ A4 and the other location inA3 ∪ B4 ∪ A4. DefineD′ = D \ (A1 ∪B1 ∪ A2). Since there are no simple witnesses for(a, b), and no witnesses that satisfy condition 3 (in the
definition of simple witness), we conclude that there are no type 2 witnesses with both locations insideD′.
Therefore, we make the following observation:
2 TWO DIMENSIONAL PARAMETERIZED MATCHING 21
Claim 3. For every rectangleD′′ ⊆ D′, the number of distinct characters inside the regionD′′ of P is less
than or equal to the number of distinct characters inside the regionD′′ +(a, b) = (x+ a, x+ b) : (x, y) ∈D′′ of P , with equality if and only if there is no witnessw for (a, b) whose both locations are inD′′.
From Claim 3, we devise the following algorithm for finding a witness inD′: Check the number of distinct
characters inside the regionsD′ andD′ + (a, b) of P . If these numbers are equal then stop. Otherwise, find
a minimal rectangleD′′ in D′ for which the number of distinct characters inD′′ is strictly less than the
number of distinct characters inD′′ + (a, b). Then, one of the pairs of opposite corners of the rectangleD′′
is a witness for(a, b). The problem with this algorithm is that we do not know how to efficiently find such
a rectangleD′′. Instead, we will find a rectangleD∗ which will be approximately equal toD′′.
In order to findD∗, consider the following intersection counting problem: Given a setS of points in the
plane, where each point has a color, preprocessS in order to answer efficiently queries of the form “what is
the number of distinct color in the points inside the rectangle(−∞, b]× [c, d]?”. Denote byn the number of
points inS. Gupta et al. [21] showed a data-structure for this problem with preprocessing timeO(n log2 n)
which answers queries inO(log2 n) time. Using this result, we obtain the following lemma:
Lemma 4. LetP be anm × m string, andl be an integer. After preprocessing ofP in O(m3
l log2 m) time,
we can answer the following queries inO(log2 m) time: “what is the number of distinct characters in the
substringP [a : b, c : d]?”, for every integersa, b, c, andd, such that one these integers is either1, m, or a
multiple ofl.
Proof. We will show how to answer queries in whicha is either a multiple ofl or equal to1. Answering
other queries is done similarly.
The preprocessing is as follows: Create a set of pointsS containing a point(x, y) for everyx = 1, . . . , m
andy = 1, . . . , m. The color of(x, y) is P [x, y]. For every integeri ≤ m/l, build the data structure of
Gupta et al. on the set of pointsSi = (x, y) ∈ S : x ≥ il.
Given a query(a, b, c, d), If a is a multiple ofl, return the number of distinct colors of the setSa/l in
the rectangle(−∞, b] × [c, d]. If a is equal to1, return the number of distinct colors of the setSa/l in the
rectangle(−∞, b] × [c, d].
We now return to the problem of findingD∗. The algorithm is as follows: First, build the data-structure of
Lemma 4 onP . Then, compute the number of distinct characters inside the regionsD′ andD′ +(a, b) of P .
If these numbers are equal, stop. Otherwise, find a minimal rectangleD∗ = x, . . . , m − a × y, . . . , zsuch that the number of distinct characters inside the regionsD∗ andD∗ + (a, b) of P are not equal, and
x is a multiple ofl. FindingD∗ is done using binary search and queries to the data-structure of Lemma 4
(note that the queries we make are of the formx′, . . . , m− a × y′, . . . , z′ wherex′ is a multiple ofl or
x′, . . . , m × y′, . . . , z′).
2 TWO DIMENSIONAL PARAMETERIZED MATCHING 22
Let X be the set of all locations(c, d) such thatc ∈ x, . . . , x+ l− 1∪m−a− l +1, . . . , m−a and
d ∈ y, z. From Claim 3 and the minimality ofX, we obtain that there is a witnessw for (a, b) (of type 1)
whose both locations are inX. We find such a witness as follows:
1. Initialize tablesV [1..m] andL[1..m2] to zeros.
2. Go over the locations inX in some order. For every location(c, d), if V [P [c, d]] /∈ 0, P [c + a, d + b]output the witnessL[P [c, d]], (c, d). Otherwise, setV [P [c, d]] ← P [c+a, d+b] andL[P [c, d]] ← (c, d).
By Lemma 4, the time complexity of this stage isO(m3
l log2 m + m2 log3 m + m2l) (note that the initial-
ization of the tablesV andL above takesO(1) time). Therefore, the total time complexity of the algorithm
is O((ml + l)m2 · log4+o(1) m). The last expression isO(m5/2 · log4+o(1) m) whenl = Θ(
√m).
23
Chapter 3
3 Approximate Parameterized Matching
3.1 Preliminaries and Definitions
Given astrings = s1s2...sn of length|s| = n over an alphabetΣs, and a bijectionπ from Σs to some other
alphabetΣ′s, theimageof s underπ is the strings′ = π(s) = π(s1) · ... ·π(sn), that is obtained by applying
π to all characters ofs.
Given two equal-length stringsw andu over alphabetsΣw andΣu and a bijectionπ from Σw to Σu
theπ-mismatch betweenw andu is Ham(π(w), u), whereHam(s, t) is the Hamming distance between
s andt. We say thatw parameterizek-matchesu if there exists aπ such that theπ-mismatch≤ k. The
approximate parameterize matchbetweenw andu is the minimumπ-mismatch (over all bijectionsπ).
For given input textx, |x| = n, and patterny, |y| = m, the problem ofapproximate parameterized
matchingfor y in x consists of computing theapproximate parameterized matchbetweeny and every (con-
secutive) substring of lengthm of x. Hence, approximate parameterized searching requires computing theπ
yielding minimumπ-mismatch fory at each position ofx. Of course, the bestπ is not necessarily the same
at every position. The problem ofparameterized matching withk mismatchesconsists of computing for
each locationi of x whethery parameterizedk-matches the (consecutive) substring of lengthm beginning
at locationi. Sometimesπ itself is also desired (however, here anyπ with π-mismatch of no more thank
will be satisfactory).
3.2 String Comparison Problem - The Binary Case
We begin by presenting observations on the simpler variant of the problem where the pattern alphabet is
binary. We introduce an algorithm that takesO(nk) time for the following problem:
• Input: Two strings, t = t1, t2, ..., tn over Σt and p = p1, p2, ..., pm over Σ2, and an integer k.
• Output: All locations i in t where p parameterize k-matches ti, . . . , ti+m−1.
Assume thatΣ2 = 0, 1. There are four cases forp:
1. The firstm/2 characters ofp contains at least2k + 1 zeros and2k + 1 ones.
2. The lastm/2 characters ofp contains at least2k + 1 zeros and2k + 1 ones.
3. The firstm/2 characters ofp contains at least2k + 1 zeros and the lastm/2 characters ofp contains at
least2k + 1 ones, or the opposite.
4. One of the characters0, 1, appears in the pattern less than 2k+1 times.
3 APPROXIMATE PARAMETERIZED MATCHING 24
We now examine each case separately;
3.2.1 Case 1 : Firstm/2 Characters
To handle that case, we will choose two sample sequences of the first2k + 1 ones and2k + 1 zeros fromp.
These sample sequences will be checked against the appropriate alignment at each text location, from left
to right. If there is a place with more than k mismatches, it will be eliminated; otherwise, there will be at
leastk + 1 matches to one specific character of the text for each sample sequence. We call this character the
matched character.
Lemma 5. Two locations at distance at mostm/2 of each other will be candidates only if the matched
characters of their sample sequences are the same.
Proof. : Let us assume that there are two candidates within distancem/2 of each other. Then their overlap
includes the right candidate samples. If its matched characters are different from those of the left candi-
date (even by one character), then the left candidate must be eliminated, since it already has more thank
mismatches.
3.2.2 Case 2 : Lastm/2 Characters
This case is symmetric to the above case by reversing the checking order from right to left.
3.2.3 Case 3 : Both Halves
The first stage of running over the text checking the sample sequences isdone in the same manner as before.
The only difference is that for each group of candidates there will be nomore than three different matched
characters following the same reason that was introduced in the first case.
Now, for the above three cases, since we already determinedπ betweenp and every text location that
was not eliminated, we can run an algorithm for searching an exact patternmatching withk mismatches
in O(n√
klogk) time [7], for text size of3m/2 for eachπ. We partition the text into pieces that ensure
us at leastm/2 characters in the overlap of any two candidates, and since we have a constant number ofπ
functions, we also have a constant number of times to run the algorithm.
3.2.4 Case 4 : Less than2k + 1 Characters
In that case we will check the entire pattern against the first place in the text.In the next potential match, we
need to check at most2k text locations, since all the alignments of the pattern against itself are different in
at most2k places.
3 APPROXIMATE PARAMETERIZED MATCHING 25
Hence, the algorithm runs inO(nk) time. unfortunately, expanding these results to the general version
(where the pattern alphabet is arbitrary) does not lead to algorithm with attractive running time, hence we
present now the algorithm for the general case.
3.3 String Comparison Problem
We begin by evaluating two equal-length strings for a parameterizedk-match as follows:
• Input: Two strings, s = s1, s2, ..., sm and s′ = s′1, s′2, ..., s
′m, and an integer k.
• Output: True, if there exists π such that the π-mismatch of s and t is no more than k.
In standard string comparison, we simply compare the characters and count the number of mismatches in
O(m) time. In approximate parameterized matching the naive method of solving the problem is to check
theπ-mismatch for every possibleπ. However, this takes exponential time.
One way to solve the problem is to reduce the problem to a maximal bipartite weighted matching in a
graph. We construct a bipartite graphB = (U ∪V, E) in the following way.U = Σs andV = Σs′ . There is
an edge betweena ∈ Σs andb ∈ Σs′ iff there is at least onea in s that is aligned with ab in s′. The weight
of this edge is the number ofa’s in s that are aligned tob’s in s′. It is easy to see that:
Observation 5. A maximal weighted matching inB corresponds to a minimalπ-mismatch, whereπ is
defined by the edges of the matching.
The problem of maximum bipartite matching has been widely studied and efficientalgorithms have been
devised, e.g. [16, 17, 19, 23]. For integer weights, where the largestweight is bounded by|V |, the fastest
algorithm runs in timeO(E√
V ) [23]. This solution yields anO(m1.5) time algorithm for the problem of
parameterized string comparison with mismatches. The advantage of the algorithm is that it views the whole
string at once and efficiently finds the best function. In general, for theparameterized pattern matching
problem with inputst of lengthn, andp of lengthm, the bipartite matching solution yields anO(nm1.5)
time algorithm.
3.3.1 Reducing Maximum Matching to the String Comparison Problem
Lemma 6. Let B = (U ∪ V, E) be a bipartite graph. If the string comparison problem can be solved in
O(f(m)) time then a maximum matching inB can be found inO(f(|E|) time.
Proof. Take an enumeration of the nodesu1, u2, ..., u|U | andv1, v2, ..., v|V | and an enumeration of the
edgese1, ..., e|E|. Create two stringssU andsV of length|E|, where for edgeei = uj , vl we set theith
location ofsU to beuj and theith location ofsV to bevj .
Obviously a bijection from the ”alphabet” ofsU to the ”alphabet” ofsV corresponds to a matching inB.
Hence, the lemma follows.
3 APPROXIMATE PARAMETERIZED MATCHING 26
Since it is always true that|E| < |V |2 it follows that |E|1/4 <√
|V |. Hence, it would be surprising to
solve the string comparison problem ino(m1.25). In fact, even for graphs with a linear number of edges the
best known algorithm isO(E√
V ) and hence an algorithm for the string comparison problem ino(m1.5)
would have implications on the maximum matching problem as well.
Note that for the string comparison problem one can reduce the approximateparameterized matching
version to the parameterized matching withk mismatches version. This is done by binary searching onk
to find the optimal solution for the string comparison problem in the approximate parameterized matching
version. Hence an algorithm for detecting whether the string comparison problem has a bijection with fewer
thank mismatches with timeo( k1.5
log k + m) would imply a faster maximum matching algorithm.
3.3.2 Mismatch Pairs and Parameterized Properties
Obviously, one may use the solution of approximate parameterized string matching to solve the problem
of parameterized matching withk mismatches. However, it is desirable to construct an algorithm with
running time dependant onk rather thanm. The bipartite matching method does not seem to give insight
into how to achieve this goal. In order to give an algorithm dependent onk we introduce a new method
for detecting whether two equal-length strings have a parameterizedk-match. The ideas used in this new
comparison algorithm will be used in the following section to yield an efficient solution for the problem of
parameterized matching withk mismatches.
When faced with the problem ofp-matching with mismatches, the initial difficulty is to decide, for a
given location, whether it is a match or a mismatch. Simple comparisons do not suffice, since any symbol
can match (or mismatch) every other symbol. In the bipartite solution, this difficultyis overcome by viewing
the entire string at once. A bijection is found, over all characters, excluding at mostk locations. Thus, it
is obvious that the locations that are excluded from the matching are “mismatched” locations. For our
purposes, we would like to be able to decide locally,i.e. beforeπ is ascertained, whether a given location is
“good” or “bad.”
Recall definition 3 of a mismatch pair for two dimensional strings. The exact same definition holds for
one dimensional strings as well. A pair of locations (i,j) such that one of the following holds,
1. si = sj and s′i 6= s′j 2. si 6= sj and s′i = s′j .
It immediately follows from the definition that:
Lemma 7. Given two equal-length strings,s ands′, if (i, j) is a mismatch pair betweens ands′, then for
every bijectionΠ : Σs → Σs′ either locationi or locationj, or both locationsi andj are mismatches, i.e.
π(si) 6= s′i or π(sj) 6= s′j or both are unequal.
Conversely,
3 APPROXIMATE PARAMETERIZED MATCHING 27
Lemma 8. Lets ands′ be two equal-length strings and letS ⊂ 1, ..., m be a set of locations which does
not contain any mismatch pair. Then there exists a bijectionπ : Σs → Σs′ that is parameterized onS, i.e.
for everyi ∈ S, π(si) = s′i.
Proof: S does not contain any mismatch pair. Hence, for any two locationsi andj in S if si=sj then
s′i=s′j , and ifsi 6= sj thens′i 6= s′j . Hence there is a bijectionπ : Σs → Σs′ that is a parameterized match
on the set of locationsS.
The idea of our algorithm is to count the mismatch pairs betweens ands′. Since each pair contributes
at leastone error in the p-match betweens ands′ it follows immediately from Lemma 7 that if there are
more thank mismatch pairs thens does not parameterizek-matchs′. We claim that if there are fewer than
k/2 + 1 mismatch pairs thens parameterizek-matchess′.
Definition 6. Given two equal-length strings,s and s′, a collectionL of mismatch pairs is said to be a
maximal disjoint collectionif (1) all mismatch pairs inL are disjoint, i.e. do not share a common location,
and (2) there is no mismatch pair that can be added toL without violating disjointness.
Corollary 1. Let s ands′ be two strings, and letL be a maximal disjoint collection of mismatch pairs ofs
ands′. If |L| > k, then for everyΠ : Σs → Σ′s, theΠ-mismatch is greater thank. If |L| ≤ k/2 then there
existsΠ : Σs → Σ′s such that theΠ-mismatch counter is less than or equal tok.
Proof: Combining Lemma 7 and Lemma 8 yields the proof.
3.3.3 String Comparison Algorithm
The method uses Corollary 1. First the mismatch pairs need to be found. In fact, by Corollary 1 onlyk + 1
of them need to be found sincek + 1 mismatch pairs implies that there is no parameterizedk-match. After
the mismatch pairs are found if the number of mismatch pairsmp is less thank/2 or more thank we can
announce a mismatch or match immediately according to Corollary 1. The difficult case to detect is whether
there indeed is a parameterizedk-match fork/2 < mp ≤ k. This is done with the bipartite matching
algorithm. However, here the bipartite graph needs to be constructed somewhat differently. While the
case wheremp ≤ k/2 implies an immediate match, for simplicity of presentation we will not differentiate
between the case ofmp ≤ k/2 andk/2 < mp ≤ k. Thus from here on we will only bother with the cases
mp ≤ k andmp > k.
3.3.4 Find Mismatch Pairs
In order to find the mismatch pairs of equal-length stringss and s′ we use a simple stack scheme, one
stack for each character in the alphabet. First mismatch pairs are searched for according to the first rule of
definition 3, i.e. si = sj but s′i 6= s′j . Then within the leftover locations mismatch pairs are seeked that
satisfy the second rule of definition 3, i.e.si 6= sj buts′i = s′j .
3 APPROXIMATE PARAMETERIZED MATCHING 28
Algorithm: Find mismatch pairs(s,s’)
1. Construct a stack Sσ for each σ ∈ Σs.
2. For i = 1 to m do
if empty(Ssi) then push(Ssi
, < s′i, i >)
else if top(Ssi) =< s′i, ∗ > then push(Ssi
, < s′i, i >)
else < s′∗, j >← pop(Ssi); add (i, j) to the mismatch pairs
3. Set L to be all locations of 1, ..., m that do not appear in the mismatch pairs.
4. Reverse the roles of s and s′ and do step 2 while scanning items only from L
(i.e. here step 2 begins - ”For j ∈ L do”)
Time Complexity: The algorithm that finds all mismatch pairs between two strings of lengthm runs in
O(m) time. Each character in the string is pushed/popped onto at most one stack.
3.3.5 Verification
The verification is performed only when the number of mismatch pairs that werefound, mp, satisfies
mp ≤ k. Verification consists of a procedure that finds the minimalπ-mismatch over all bijectionsπ.
The technique used is similar to the bipartite matching algorithm discussed in the beginning of the section.
Let L ⊂ 1, ..., m be the locations that appear in a mismatch pair. We say that a symbola ∈ Σs is
mismatchedif there is a locationi ∈ L such thatsi = a. Likewise,b ∈ Σs′ is mismatchedif there is a
locationi ∈ L such thats′i = b. A symbola ∈ Σs, or b ∈ Σs′ , is free if it is not mismatched.
Construct a bipartite graphB = (U ∪ V, E) defined as follows.U contains all mismatched symbols
a ∈ Σs andV contains all mismatched symbolsb ∈ Σs′ . Moreover,U contains all free symbolsa ∈ Σs for
which there is a locationi ∈ 1, ..., m − L such thatsi = a ands′i is a mismatched symbol. Likewise,V
contains all free symbolsb ∈ Σs′ for which there is a locationi ∈ 1, ..., m − L such thats′i = b andsi is
a mismatched symbol. The edges, as before, have weights that correspond to the number of locations where
they are aligned with each other.
Lemma 9. The bipartite graphB is of sizeO(k). Moreover, given the mismatch pairs it can be constructed
in O(k) time.
Proof: This lemma follows by observing that for anya ∈ Σs all appearances ofa in locations of
1, ..., m− L are aligned with a given symbolb ∈ Σs′ by definition of mismatch pairs. Symmetrically, the
same is true for anyb ∈ Σs′ . On the other hand,L is of sizeO(k).
While the lemma states that the bipartite graph is of sizeO(k) it still may be the case that the edge weights
may be substantially larger. However, if there exists an edgee = (a, b) with weight> k then it must be
thatπ(a) = b for otherwise we immediately have> k mismatches. Thus, before computing the maximum
3 APPROXIMATE PARAMETERIZED MATCHING 29
matching, we drop every edge with weight> k, along with their vertices. What remains is a bipartite graph
with edge weights between 1 andk.
Theorem 10. Given two equal-length stringss ands′, with mp ≤ k mismatch pairs. It is possible to verify
whether there is a parameterizedk-match betweens ands′ in O(k1.5) time.
Proof: Since the size of the bipartite graphB isO(k), by Lemma 9, it follows that the maximum weighted
bipartite matching can be solved inO(k1.5) time algorithm [23].
We need to show that its solution solves our problem. The matchingM clearly yields a one-to-one
functionπ : Σs → Σ′s, since at most one edge is chosen per character ins ands′. Since the matching is
maximal, it is the optimal function, and every character not included in the matching is an error. If the sum
of the edge weights inE − M is more thank then there is no parameterizedk-match betweens ands′.
Time Complexity: Given two equal-length stringss ands′, it is possible to determine whethers param-
eterizedk-matchess′ in O(m + k1.5) time, O(m) to find the mismatch pairs andO(k1.5) to check these
pairs with the appropriate bipartite matching.
3.4 The Algorithm
We are now ready to introduce our algorithm for the problem of parameterized pattern matching withk
mismatches:
• Input: Two strings, t = t1, t2, ..., tn and p = p1, p2, ..., pm, and an integer k.
• Output: All locations i in t where p parameterized k-matches ti, . . . , ti+m−1.
Our algorithm has two phases, thepattern preprocessing phaseand thetext scanning phase. In this
section we present the text scanning phase and will assume that the patternpreprocessing phase is given.
In the following section we describe an efficient method to preprocess the pattern for the needs of the text
scanning phase.
Definition 7. Let s and s′ be two-equal length strings. LetL be a collection of disjoint mismatch pairs
betweens ands′. If L is maximal and|L| ≤ k thenL is said to be ak-good witnessfor s ands′. If |L| > k
thenL is said to be ak-bad witnessfor s ands′.
The text scanning phase has two stages the (a)filter stageand the (b)verification stage. In the filter stage
for each text locationi we find either ak-good witness or ak-bad witness forp andti . . . ti+m−1. Obviously,
by Corollary 1, if there is ak-bad witness thenp cannot parameterizek-matchti . . . ti+m−1. Hence, after
the filter stage, it remains to verify for those locationsi which have ak-good witness whetherp parameterize
k-matchesti . . . ti+m−1. The verification stage is identical to the verification procedure in Section 3.3.5 and
hence we will not dwell on this stage.
3 APPROXIMATE PARAMETERIZED MATCHING 30
3.4.1 The Filter Stage
The underlying idea of the filter stage is similar to [26]. Like the KMP algorithm [24] one location after
another is evaluated utilizing the knowledge accumulated at the previous locations combined with the in-
formation gathered in the pattern preprocessing stage. The information of the pattern preprocessing stage
is:
Output of pattern preprocessing: For each location i of p, a maximal disjoint collection of mis-
match pairs L for some pattern prefix p1, . . . , pj−i+1 and pi, . . . , pj such that either |L| = 3k + 3 or
j = m and |L| ≤ 3k + 3.
As the first step of the algorithm we consider the first text location, i.e. whentext t1 . . . tm is aligned
with p1 . . . pm. Using the method in Section 3.3.4 we can find all the mismatch pairs inO(m) time, and
hence find ak-good ork-bad witness for the first text location. When evaluating subsequent locations i
we maintain the following invariant:For all locationsl < i we have either ak-good ork-bad witness for
locationl.
It is important to observe that, when evaluating locationi of the text, if we discover a maximal disjoint
collection of mismatch pairs of sizek +1, i.e. ak-bad witness, between a prefixp1, ..., pj of the pattern and
ti, . . . , ti+j−1 we will stop our search, since this immediately implies thatp1, . . . , pm andti, . . . , ti+m−1
have ak-bad witness. We say thati + j is astop locationfor i. If we have ak-good witness at locationi
theni + m is its stop location. When evaluating locationi of the text, each of the previous locations has a
stop location. The locationℓ which has a maximal stop location, over all locationsl < i, is called amaximal
stopper fori.
The reason for working with the maximal stopper fori is that the text beyond the maximal stopper has
not yet been scanned. If we can show that the time to compute ak-bad witness or find a maximal dis-
joint collection of mismatch pairs for locationi up until the maximal stopper, i.e. forp1, . . . , pℓ′−i+1 with
ti, . . . , tℓ′ , isO(k) then we will spend overallO(n+nk) time for computing,O(nk) to compute up until the
maximal stopper for each location andO(n) for the scanning and updating the current maximal collection
of mismatch pairs.
3.4.2 Using Precomputed Information
The situation now is that we are evaluating a locationi of t. Let ℓ be the maximal stopper fori. We utilize
two pieces of precomputed information; (1) the pattern preprocessing forlocationi− ℓ + 1 of p and (2) the
k-good witness ork-bad witness (of sizek+1) for locationℓ of t. Let ℓ′ be the stop location ofℓ. We would
like to evaluate the overlap ofp1, . . . , pℓ′−i+1 with ti, . . . , tℓ′ and to find a maximal disjoint collectionL
of mismatch pairs on this overlap, or, if it exists, a maximal disjoint collectionL of mismatch pairs of size
k + 1 on a prefix of the overlap.
There are two cases. One possibility is that the pattern preprocessing returns3k + 3 mismatch pairs for
3 APPROXIMATE PARAMETERIZED MATCHING 31
a prefixp1, . . . , pj andpℓ−i+1, . . . , pℓ−i+j+1 wherej ≤ ℓ′ − i + 1. Yet, there are≤ k + 1 mismatch pairs
betweenp1, . . . , pℓ′−ℓ+1 andtℓ, . . . , tℓ′ .
Lemma 11. Lets, s′ ands′′ be three equal-length strings such that there is a maximal collection of mismatch
pairsLs,s′ betweens ands′ of size≤ M and a maximal collection of mismatch pairsLs′,s′′ betweens′ and
s′′ of size≥ 3M . Then there must be a maximal collection of mismatch pairsLs,s′′ betweens ands′′ of size
≥ M .
Proof. Since each mismatch pair is composed of 2 locations, there must be at leastM mismatch pairs
betweens′ ands′′ which do not participate in mismatch pairs betweens ands′. It follows from the definition
of mismatch pairs that theseM mismatch pairs are also mismatch pairs betweens ands′′.
SetM to bek + 1 and one can use the Lemma for the case at hand, namely, there must be at leastk + 1
mismatch pairs betweenp1, . . . , pℓ′−i+1 andti, . . . , tℓ′ , which defines ak-bad witness.
The second case is where the pattern preprocessing returns fewer than 3k + 3 mismatch pairs or it returns
3k + 3 mismatch pairs but it does so for a prefixp1, . . . , pj andpℓ−i+1, . . . , pℓ−i+j+1 wherej > ℓ′ − i + 1.
However, we still have an upper bound ofk +1 mismatch pairs betweenp1, . . . , pℓ′−ℓ+1 andtℓ, . . . , tℓ′ . So,
we can utilize the fact that:
Lemma 12. Given three strings,s, s′, s′′ such that there are maximal disjoint collections of mismatch pairs,
Ls,s′ andLs′,s′′ of sizesO(k). In O(k) time one can find ak-good ork-bad witness fors ands′′.
Proof. Let Ls,s′ be the indices that are not in any mismatch pair betweens ands′. By Lemma 8 there exists
a bijectionπ : Σs → Σs′ such thatπ defines a parameterized match onLs,s′ . Likewise, if Ls′,s′′ are the
indices that are not in any mismatch pair betweens′ ands′′, then there is a bijectionπ′ : Σs′ → Σs′′ that
defines a parameterized match onLs′,s′′ .
The set of indices that participates in no mismatch pair in either of the pairs isLs,s′ ∩ Ls′,s′′ . On this set
π′ · π() defines a parameterized match. However, the number of locations participating in mismatch pairs is
bounded byk. Using linked lists for each character one can now construct the, at mostO(k) mismatch pairs
in O(k) time.
3.4.3 Putting it Together
The filter stage takesO(nk) time and the verification stage takesO(nk1.5). Hence,
Theorem 13. Given the preprocessing stage we can announce for each locationi of t whetherp parame-
terizedk-matchesti, ..., ti+m−1 in O(nk1.5) time.
3 APPROXIMATE PARAMETERIZED MATCHING 32
3.5 Pattern Preprocessing
In this section we solve the pattern preprocessing necessary for the general algorithm.• Input: A pattern p = p1, . . . , pm and an integer k.
• Output: For each location i, 1 ≤ i ≤ m a maximal disjoint collection of mismatch pairs L for the
minimal pattern prefix p1, . . . , pj−i+1 and pi, . . . , pj such that either |L| = 3k + 3 or j = m and
|L| ≤ 3k + 3.
The naive method takesO(m2), by applying the method from Section 3.3.4. However, this does not
exploit any previous information for a new iteration. Since every alignment combines two previous align-
ments, we can get a better result. Assume, w.l.o.g., thatm is a power of 3. We divide the set of alignments
into log3 m + 1 sequences as follows;
R1 = [2, 3], R3 = [4, 5, 6, 7, 8, 9], ..., Ri = [3i−1 + 1, ..., 3i], ..., Rlog3
m = [3log3 m−1 + 1, ..., 3log3 m]
1 ≤ i ≤ log3 m.
At each step, we compute the desired mismatches for each set of alignmentsRi. Stepi uses the in-
formation computed in steps1...i − 1. We further divide each set into two halves, and compute the first
half followed by the second half. This is possible since eachRi contains an even number of elements, as
3i − 3i−1 = 2 ∗ 3i−1. We split each sequenceRi into two equal length sequencesR1i [3
i−1 + 1, ..., 2 ∗ 3i−1]
andR2i = [2 ∗ 3i−1 + 1, . . . , 3i]. For each of the new sequences we have that:
Lemma 14. If r1, r2 ∈ R1i (the first half of setRi) or r1, r2 ∈ R2
i (the second half ofRi) such thatr1 < r2
thenr2 − r1 ∈ Rj for j < i.
Proof: We assume thatr1, r2 ∈ R1i (the proof for the other case is symmetric). Obviouslyr2 ≤ 2 ∗ 3i−1
andr1 ≥ 3i−1 + 1, thenr2 − r1 ≤ 2 ∗ 3i−1 − 3i−1 − 1 = 3i−1 − 1.
This gives a handle on how to compute our desired output. Denotefi = minj ∈ R1i andmi =
minj ∈ R2i the representatives of their sequences. We compute the following two stages for each group
Ri, in orderR1, . . . , Rlog3
m,
1. Compute a maximal disjoint collection of mismatch pairsL for some pattern prefixp1, . . . , pj+1 and
pfi, . . . , pfi+j such that|L| = (3k + 3)3log3 m−i or fi + j = m and|L| ≤ (3k + 3)3log3 m−i. Do the
same withpmi, . . . , pmi+j .
2. Now for eachj ∈ R1i apply the algorithm for the text described in the previous section on the pattern p,
the pattern shifted byfi (R1i ’s representative) and the pattern shifted byj. Do the same withmi for R2
i .
The central idea and importance behind the choice of the number of mismatch pairs that we seek is to
satisfy Lemma 11. It can be verified that indeed our choice of sizes always satisfies that there are 3 times as
many mismatch pairs as in the previous iteration. Therefore,
Theorem 15. Given a patternp of lengthm. It is possible to precompute its3k + 3 mismatch pairs at each
alignment inO(km log3 m) time.
3 APPROXIMATE PARAMETERIZED MATCHING 33
The time complexity for each groupRi with O(3i−1) members is3log3
m−i+1(3k + 3) for finding the
mismatches, multiply byO(3i−1) members, which isO(mk) for a group, plusO(m) for the first and the
middle elements. All together it isO(km log3 m).
Corollary 2. Given a patternp and textt, we can solve the parameterized matching withk mismatches
problem inO(nk1.5 + km log m) time.
3.6 2-D Parameterized Matching withk Mismatches
Two-dimensional parameterized matching has applications in image search with variable colors and er-
rors. In this section we present an extension of our algorithm for 2-dimensional approximate parameterized
matching.
• Input: Two images, text T [1...n][1...n], and pattern P [1...m][1...m], and an integer k.
• Output: all locations (i, j) in T such that P parameterized k-matches the m × m image whose
upper left corner is T [i, j].
We begin with the assumption that the text has exactlym rows, and then multiply the resulting time
complexity byn. Both the text and the pattern are linearized column-by-column. Our string matching
algorithm of section 3.4 can be applied to the linear pattern and text. Care must be taken that only those
locations whose index is divisible bym will be considered. This can clearly be accomplished inO(n) time.
Thus, the time complexity for eachm rows of the text isO(nmk1.5).
Theorem 16. Given anm × m imageP and ann × n imageT , we can solve the parameterized matching
with k mismatches problem inO(n2mk1.5 + m2k log m) time.
REFERENCES 34
References
[1] N. Alon and M. Naor. Derandomization, witnesses for boolean matrix multiplication and construction
of perfect hash functions.Algorithmica, 16:434–449, 1996.
[2] A. Amir. Personal communication.
[3] A. Amir, Y. Aumann, R. Cole, M. Lewenstein, and E. Porat. Function matching: algorithms, applica-
tions and a lower bound. InProc. of the 30th International Colloquium on Automata, Languages and
Programming (ICALP), pages 929–942, 2003.
[4] A. Amir, G. Benson, and M. Farach. An alphabet independent approach to two dimensional pattern
matching.SIAM J. on Computing, 23(2):313–323, 1994.
[5] A. Amir, K. W. Church, and E. Dar. Separable attributes: a techniquefor solving the sub matrices
character count problem. InProc. 13th Symposium on Discrete Algorithms (SODA), pages 400–401,
2002.
[6] A. Amir, M. Farach, and S. Muthukrishnan. Alphabet dependencein parameterized matching.Infor-
mation Processing Letters, 49:111–115, 1994.
[7] A. Amir, M. Lewenstein, and E. Porat. Faster algorithms for string matching with k mismatches. In
Proc. 11th Symposium on Discrete Algorithms (SODA), pages 794–803, 2000.
[8] A. Apostolico, P. Erdos, and M. Lewenstein. Parameterized matching with mismatches.To appear in
j. of Discrete Algorithms.
[9] G.P. Babu, B.M. Mehtre, and M.S. Kankanhalli. Color indexing for efficient image retrieval.Multime-
dia Tools and Applications, 1(4):327–348, 1995.
[10] B. S. Baker. A theory of parameterized pattern matching: algorithms and applications. InProc. 25th
ACM Symposium on the Theory of Computation (STOC), pages 71–80, 1993.
[11] B. S. Baker. Parameterized string pattern matching.J. of Computer and System Sciences, 52(1):28–42,
1996.
[12] B. S. Baker. Parameterized duplication in strings: Algorithms and an application to software mainte-
nance.SIAM J. on Computing, 26(5):1343–1362, 1997.
[13] B. S. Baker. Parameterized diff. InProc. 10th Symposium on Discrete Algorithms (SODA), pages
854–855, 1999.
[14] R.S. Boyer and J.S. Moore. A fast string searching algorithm.Comm. ACM, 20:762–772, 1977.
REFERENCES 35
[15] R. Cole and R. Hariharan. Faster suffix tree construction with missingsuffix links. InProc. 32nd ACM
Symposium on Theory of Computing (STOC), pages 407–415, 2000.
[16] M.L. Fredman and R.E. Tarjan. Fibonacci heaps and their use in improved network optimizations
algorithms.J. of the ACM, 34(3):596–615, 1987.
[17] H. N. Gabow. Scaling algorithms for network problems.J. of Computer and System Sciences, 31:148–
168, 1985.
[18] H. N. Gabow, J. L. Bentley, and R. E. Tarjan. Scaling and related techniques for geometry problems.
Proc. 16th ACM Symposium on Theory of Computing (STOC), 67:135–143, 1984.
[19] H. N. Gabow and R.E. Tarjan. Faster scaling algorithms for network problems.SIAM J. on Computing,
18:1013–1036, 1989.
[20] Z. Galil and R. Giancarlo. Improved string matching withk mismatches.SIGACT News, 17(4):52–54,
1986.
[21] P. Gupta, R. Janardan, and M. Smid. Further results on generalized intersection searching problems:
Counting, reporting, and dynamization.J. of Algorithms, 19(2):282–317, 1995.
[22] D. Harel and R.E. Tarjan. Fast algorithms for finding nearest common ancestor.J. of Computer and
System Sciences, 13:338–355, 1984.
[23] M.Y. Kao, T.W. Lam, W.K. Sung, and H.F. Ting. A decomposition theoremfor maximum weight
bipartite matching.SIAM J. on Computing, 31(1):18–26, 2001.
[24] D.E. Knuth, J.H. Morris, and V.R. Pratt. Fast pattern matching in strings. SIAM J. on Computing,
6:323–350, 1977.
[25] S. R. Kosaraju. Faster algorithms for the construction of parameterized suffix trees.Proc. 36th IEEE
FOCS, pages 631–637, 1995.
[26] G.M. Landau and U. Vishkin. Introducing efficient parallelism into approximate string matching.Proc.
18th ACM Symposium on Theory of Computing (STOC), pages 220–230, 1986.
[27] G.M. Landau and U. Vishkin. Fast string matching with k differences.J. of Computer and System
Sciences, 37(1):63–78, 1988.
[28] V. I. Levenshtein. Binary codes capable of correcting, deletions, insertions and reversals.Soviet Phys.
Dokl., 10:707–710, 1966.
REFERENCES 36
[29] B. Schieber and U. Vishkin. On finding lowest common ancestors: Simplification and parallelization.
SIAM J. on Computing, 17:1253–1262, 1988.
[30] M. Swain and D. Ballard. Color indexing.International Journal of Computer Vision, 7(1):11–32,
1991.
[31] U. Vishkin. Optimal parallel pattern matching in strings.Proc.12th ICALP, pages 91–113, 1985.
[32] U. Vishkin. Deterministic sampling - a new technique for fast pattern matching. SIAM J. on Computing,
20:303–314, 1991.