delta-gamma-parameterized matching

Parameterized Matching

Carmit Hazay

Department of Computer Science

M.Sc. Thesis

Submitted in partial fulfillment of the requirements for the M.Sc.

degree in the Department of Computer Science, Bar-Ilan University.

Ramat-Gan, Israel

September 2004


This work was carried out under the supervision of

Dr. Moshe Lewenstein,

Department of Computer Science, Bar-Ilan University.

Acknowledgments

I would like to thank my advisor Moshe Lewenstein for his guidance and directions. He taught me a lot and

I am grateful for the time and effort he put in this work.

Also, special thanks to Dina Sokol and Dekel Tzur. It was very inspiring working with them.

Abstract

Two equal-length strings s and s′, over alphabets Σ_s and Σ_s′, parameterize match if there exists a bijection π : Σ_s → Σ_s′ such that π(s) = s′, where π(s) is the renaming of each character of s via π. Parameterized matching is the problem of finding all parameterized matches of a pattern string p in a text t. It was introduced as a model for software duplication detection in software maintenance systems and also has applications in image processing and computational biology.

Two dimensional parameterized matching is the task of finding all locations in an n × n text where the m × m subtext beginning at that location and a given m × m pattern p-match. This models searching for color images with changing color maps. Our first result is an algorithm for the two dimensional case that runs in time O(n^2 + m^2.5 · polylog(m)).

Approximate parameterized matching is the problem of finding, at each location, a bijection π that maximizes the number of characters that are mapped from p to the appropriate |p|-length substring of t. We consider the problem in which an error threshold k is given and the goal is to find all locations in t for which there exists a bijection π that maps p into the appropriate |p|-length substring of t with at most k mismatched mapped elements.

We show that (1) approximate parameterized matching, when |p| = |t|, is equivalent to the maximum matching problem on graphs, implying that (2) maximum matching is reducible to approximate parameterized matching with threshold k, up to an O(log |t|) factor (this is achieved by reducing approximate parameterized matching to the threshold problem using a binary search on k). Given the best known maximum matching algorithms, an O(m^1.5) time bound, where m = |p| = |t|, follows for approximate parameterized matching. We show that (3) for the k threshold problem we can do this in O(m + k^1.5) time.

Our second main result (4) is an O(nk^1.5 + mk log m) time algorithm, where m = |p| and n = |t|.

Contents

1 Introduction
2 Two Dimensional Parameterized Matching
    2.1 Preliminaries and Definitions
    2.2 Overview
    2.3 Algorithm Details
    2.4 Text Preprocessing
    2.5 Pattern Preprocessing - Finding Witnesses
        2.5.1 Simple Witnesses
        2.5.2 Corner Witnesses
        2.5.3 l Strip Witnesses
3 Approximate Parameterized Matching
    3.1 Preliminaries and Definitions
    3.2 String Comparison Problem - The Binary Case
        3.2.1 Case 1: First m/2 Characters
        3.2.2 Case 2: Last m/2 Characters
        3.2.3 Case 3: Both Halves
        3.2.4 Case 4: Less than 2k + 1 Characters
    3.3 String Comparison Problem
        3.3.1 Reducing Maximum Matching to the String Comparison Problem
        3.3.2 Mismatch Pairs and Parameterized Properties
        3.3.3 String Comparison Algorithm
        3.3.4 Find Mismatch Pairs
        3.3.5 Verification
    3.4 The Algorithm
        3.4.1 The Filter Stage
        3.4.2 Using Precomputed Information
        3.4.3 Putting it Together
    3.5 Pattern Preprocessing
    3.6 2-D Parameterized Matching with k Mismatches
References


Chapter 1

1 Introduction

In the traditional pattern matching model [14, 24], one seeks exact occurrences of a given pattern p in a text t, i.e. text locations where every text symbol is equal to its corresponding pattern symbol. For two equal-length strings s and s′ we say that s is a parameterized match, p-match for short, of s′ if there exists a bijection π from the alphabet of s to the alphabet of s′ such that every symbol of s′ is equal to the image under π of the corresponding symbol of s. In the parameterized matching problem, introduced by Baker [10, 12], one seeks text locations for which the pattern p p-matches the substring of length |p| beginning at that location. Baker introduced parameterized matching for applications that arise in software tools for analyzing source code. Specifically, the application is to identify duplicate code in large software systems for reuse. Here it is desirable to find not only exact matches between program fragments but also parameterized matches, namely, where the two program fragments are equal but possibly use interchangeable identifiers (representing variable, constant or function names).

When program fragments that repeat are discovered, other files of the program are searched to find repetitions of the program fragment in other locations. (It is often the case that repeating fragments have many appearances.) This exactly defines the parameterized matching problem, or p-match problem for short. It turns out that the parameterized matching problem arises in other applications such as image processing and computational biology, see [3].

In [10, 12] an optimal, linear time algorithm was given for p-matching. However, it was assumed that the alphabet was of constant size. Amir et al. [6] presented tight bounds for p-matching in the presence of an unbounded size alphabet.

In [11] a novel method was presented for parameterized matching by constructing parameterized suffix trees, which also allows for online p-matching. The parameterized suffix trees are constructed by converting the pattern string into a predecessor string. A predecessor string of a string s has at each location i the distance between i and the location containing the previous appearance of the symbol. The first appearance of each symbol is replaced with a 0. For example, the predecessor string of aabbaba is 0, 1, 0, 1, 3, 2, 2. A simple and well-known fact is that:

Observation 1. s and s′ p-match iff they have the same predecessor string.
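The predecessor string and Observation 1 can be illustrated with a minimal Python sketch. The function names `predecessor_string` and `p_match` are illustrative choices, not names used in the cited works, and the sketch ignores the suffix tree machinery entirely.

```python
def predecessor_string(s):
    """Return the predecessor string of s as a list of integers."""
    last_seen = {}  # symbol -> index of its most recent occurrence
    pred = []
    for i, ch in enumerate(s):
        # 0 marks a first appearance; otherwise store the distance back
        # to the previous occurrence of the same symbol.
        pred.append(i - last_seen[ch] if ch in last_seen else 0)
        last_seen[ch] = i
    return pred


def p_match(s, t):
    """Observation 1: equal-length strings p-match iff their predecessor strings agree."""
    return len(s) == len(t) and predecessor_string(s) == predecessor_string(t)


print(predecessor_string("aabbaba"))  # [0, 1, 0, 1, 3, 2, 2]
print(p_match("abab", "cdcd"))        # True
print(p_match("abab", "cdcc"))        # False
```

The first call reproduces the example from the text; the comparison runs in linear time for a hashable alphabet.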

The parameterized suffix tree is constructed in a manner that every path corresponds to a suffix, in the standard sense, but is labeled with the predecessor string of the appropriate suffix. Branching in the parameterized suffix tree occurs according to the labels of the predecessor strings. The parameterized suffix tree was further explored by Kosaraju [25] and faster constructions were given by Cole and Hariharan [15].

One of the interesting problems in web searching is searching for color images, see [5, 9, 30]. The simplest possible case is searching for an icon in a screen, a task that the Human-Computer Interaction Lab at the University of Maryland was confronted with. If the colors are fixed, this is exact 2-dimensional pattern matching [4]. However, if the color maps in pattern and text differ, the exact matching algorithm would not find the pattern. Parameterized 2-dimensional search is precisely what is needed. The fastest algorithm solving the two-dimensional parameterized matching problem was given in [3] and runs in time O(n^2 log^2 m) for an n × n text.

It is an open question whether a linear time algorithm exists. In this work we show that it is possible to get an algorithm that is linear in the text size with a penalty in the preprocessing stage, namely O(n^2 + m^2.5 · polylog(m)). For alphabets drawn from a large universe (larger than polynomial in n) the time increases to O(n^2 log |Σ| + m^2.5 · polylog(m)). However, this seems to be unavoidable as it was shown in [6] that one-dimensional parameterized matching requires Ω(n log |Σ|) time.

In reality, when searching for an image, one needs to take into account the occurrence of errors that distort the image. Errors occur in various forms, depending upon the application. Several distance metrics have been defined to account for such errors. Two of the most classical distance metrics are the Hamming distance and the edit distance [28]. The Hamming distance between two equal-length strings is the number of mismatching characters when the strings are aligned. The edit distance is the minimal number of character replacements, insertions, and deletions needed to convert one string into another [28].

In [13] the parameterized match problem was considered in conjunction with the edit distance. Here the definition of edit distance was slightly modified so that the edit operations are defined to be insertion, deletion and parameterized replacement, i.e. the replacement of a substring with a string that p-matches it. An algorithm for finding the "p-edit distance" of two strings was devised whose efficiency is close to the efficiency of the algorithms for computing the classical edit distance. Also, an algorithm was suggested for the decision variant of the problem where a parameter k is given and it is necessary to decide whether the strings are within (p-edit) distance k.

However, it turns out that the operation of parameterized replacement relaxes the problem to an easier problem. The reason that the problem becomes easier is that two substrings that participate in two parameterized replacements are independent of each other (in the parameterized sense).

A more rigid, but more realistic, definition for the Hamming distance variant was given in [8]. For a pair of equal-length strings s and s′ and a bijection π defined on the alphabet of s, the π-mismatch is the Hamming distance between the image under π of s and s′. The minimal π-mismatch over all bijections π is the approximate parameterized match. The problem considered in [8] is to find for each location i of a text t the approximate parameterized match of a pattern p with the substring beginning at location i.
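To make the definition concrete, the following is a brute-force Python sketch that tries every mapping π for two short strings. It is exponential in the alphabet size and is meant only as an illustration of the definition, not as any of the algorithms discussed in this work. The function name `approx_p_match` is hypothetical, and padding the target alphabet with dummy symbols when the alphabets differ in size is an assumption made here so that every symbol of s has an image.

```python
from collections import Counter
from itertools import permutations


def approx_p_match(s, t):
    """Minimal pi-mismatch between equal-length strings s and t (brute force)."""
    assert len(s) == len(t)
    # pair_count[(a, b)] = number of positions i with s[i] == a and t[i] == b
    pair_count = Counter(zip(s, t))
    src = sorted(set(s))
    dst = sorted(set(t))
    # Assumption: pad with dummy targets (None) so an injective map always exists.
    dst += [None] * max(0, len(src) - len(dst))
    best = 0
    for image in permutations(dst, len(src)):
        # Positions that agree when each src symbol is renamed to its image.
        matched = sum(pair_count[(a, b)] for a, b in zip(src, image))
        best = max(best, matched)
    return len(s) - best


print(approx_p_match("abab", "cdcd"))            # 0
print(approx_p_match("ababaabaa", "cdcddefee"))  # 4
```

In the second call the best mapping sends a to e and b to d, which agrees on 5 of the 9 positions.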

In [8] the problem was defined and linear time algorithms were given for the case where the pattern is binary or the text is binary. However, this solution does not carry over to larger alphabets.

Unfortunately, under this definition the methods for classical string matching with errors for Hamming distance, e.g. [20, 27], also known as pattern matching with mismatches, seem to fail. Following is an outline of the method of [20] for pattern matching with mismatches.

The pattern is compared separately to each suffix of the text, beginning at locations 1 ≤ i ≤ n. Using a suffix tree of the text and precomputed lowest common ancestor (LCA) information (which can be computed once in linear time, see [22, 29]) one can find the longest common prefix of the pattern and the corresponding suffix (in constant time). There must be a mismatch immediately afterwards. The algorithm jumps over the mismatch and repeats the process, taking into consideration the offsets of the pattern and suffix.

When attempting to apply this technique to a parameterized suffix tree, it fails. To illustrate this, consider the first matching substring (up until the first error) and the next matching substring (after the error). Both of these substrings p-match the substring of the text that they are aligned with. However, it is possible that combined they do not form a p-match. See the example below. In the example, abab p-matches cdcd, followed by a mismatch and subsequently followed by abaa p-matching efee. However, different π's are required for the local p-matches. This also emphasizes why the definition of [13] is a simplification.

p = a b a b a a b a a ...

t = ... c d c d d e f e e ...
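The example can be checked mechanically. The sketch below uses a hypothetical helper `p_match` that tests directly whether the aligned symbol pairs define a one-to-one mapping; the mismatching position in the middle is left out so that only the two locally matching segments are compared.

```python
def p_match(s, t):
    """True iff the aligned symbol pairs define a one-to-one mapping from s to t."""
    if len(s) != len(t):
        return False
    forward, backward = {}, {}
    for a, b in zip(s, t):
        # setdefault records the first image seen and returns it afterwards.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


first_p, first_t = "abab", "cdcd"    # segment before the mismatch
second_p, second_t = "abaa", "efee"  # segment after the mismatch

print(p_match(first_p, first_t))                        # True
print(p_match(second_p, second_t))                      # True
print(p_match(first_p + second_p, first_t + second_t))  # False
```

The first segment needs a mapped to c, the second needs a mapped to e, so no single π serves both.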

In this work we consider the problem of parameterized matching with k mismatches. The parameterized matching problem with k mismatches seeks all p-matches of a pattern p in a text t, with at most k mismatches.

For the case where |p| = |t| = m, which we call the string comparison problem, we show that this problem is in a sense equivalent to the maximum matching problem on graphs. An O(m^1.5) algorithm for the problem follows, by using maximum matching algorithms. A matching "lower bound" follows, in the sense that an o(m^1.5) algorithm implies better algorithms for maximum matching. For the string comparison problem with threshold k, we show an O(m + k^1.5) time algorithm and a "lower bound" of Ω(m + k^1.5 / log m).

The first result is an algorithm that solves the parameterized matching problem with k mismatches, given a pattern of length m and a text of length n, in O(nk^1.5 + mk log m) time. This immediately yields a 2-dimensional algorithm of time complexity O(n^2 m k^1.5 + m^2 k log m), where |p| = m^2 and |t| = n^2.


Chapter 2

2 Two Dimensional Parameterized Matching

2.1 Preliminaries and Definitions

Let T and T′ be two texts of size k × k. Denote T by T[1, 1], ..., T[1, k], T[2, 1], ..., T[2, k], ..., T[k, k]. We say that there is a function matching between T and T′ if there is a mapping f from the alphabet of T to the alphabet of T′ such that T′ = f(T), where f(T) = f(T[1, 1]), ..., f(T[1, k]), f(T[2, 1]), ..., f(T[k, k]). If the mapping f is one-to-one, we say that T and T′ parameterize match, or p-match for short. Note that the definition of function matching is asymmetric whereas the definition of parameterized matching is symmetric. The parameterized matching problem is defined as follows:

Input: An n × n text T and an m × m pattern P.

Output: All locations (i, j) of T where the m × m substring of T beginning at location (i, j) and P p-match.

Observation 1 gives a handle on finding p-matches for 1-dimensional texts. We would like to use a similar observation for 2-dimensional texts as well. In order to do so we define the following.

Definition 1. Let T be an n × n text. A strip is a substring of T of size n × m. More specifically, the i-th strip of T is the n × m substring T[1 : n, i : i + m − 1].

Let A be an n × k string. We define the linearization of A to be the one-dimensional string A[1, 1], ..., A[1, k], A[2, 1], ..., A[2, k], ..., A[n, 1], ..., A[n, k]. Now, let T(i) denote the linearization of the i-th strip of T and let P′ be the linearization of P. Observe that:

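Observation 2. The m × m subtext of T beginning at location (i, j) and P p-match iff P′ p-matches the m^2-length substring of T(j) beginning at location (i − 1)m + 1. (The strip is selected by the column j and the offset by the row i, in agreement with Definition 1, where strips are indexed by their first column.)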

Since we can now verify p-matches via 1-dimensional strings, we can use Observation 1. For a 2-dimensional string S of width m, the predecessor of location (i, j) in S is defined to be the predecessor of the location that corresponds to (i, j) in the linearization of S.
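The reduction from two dimensions to one can be sketched in Python as follows. The helper names are illustrative, and the sketch assumes that a location (i, j) is read as (row, column) with 1-based indices, so that the strip is chosen by the column j and the offset inside its linearization by the row i. It is a naive check of a single location, not the algorithm of this chapter.

```python
def p_match(s, t):
    """One-to-one symbol mapping check for equal-length sequences."""
    fwd, bwd = {}, {}
    return len(s) == len(t) and all(
        fwd.setdefault(a, b) == b and bwd.setdefault(b, a) == a
        for a, b in zip(s, t)
    )


def linearize(block):
    """Row-major linearization of a 2-dimensional string (list of rows)."""
    return [sym for row in block for sym in row]


def strip(T, j, m):
    """The j-th strip of T (1-based): all rows, columns j .. j + m - 1."""
    return [row[j - 1:j - 1 + m] for row in T]


def matches_at(T, P, i, j):
    """Check the m x m subtext of T at (i, j) against P via the strip linearization."""
    m = len(P)
    lin = linearize(strip(T, j, m))
    start = (i - 1) * m  # 0-based form of position (i - 1)m + 1
    return p_match(linearize(P), lin[start:start + m * m])


T = [[1, 2, 3, 1],
     [3, 4, 1, 2],
     [2, 2, 4, 4],
     [1, 3, 4, 2]]
P = [["x", "y"],
     ["z", "x"]]

print(matches_at(T, P, 2, 2))  # True: the subtext is [[4, 1], [2, 4]]
print(matches_at(T, P, 1, 1))  # False: the subtext is [[1, 2], [3, 4]]
```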

2.2 Overview

We begin by giving an overview of the algorithm. As in many pattern matching algorithms, we assume w.l.o.g. that the size of the text n × n satisfies n = 2m. Larger texts can be cut into 2m × 2m pieces and each handled separately. We also assume that Σ = {1, ..., 4m^2}.


The outline of the algorithm follows the "duel-and-sweep" paradigm that appeared in [4] (there it is named "consistency and verification" and was used for 2-dimensional exact matching) and is based upon ideas of dueling techniques [31, 32]. The idea of the "duel-and-sweep" paradigm is to maintain a list of candidate match locations in T, starting with all locations and pruning the list in two stages to a list of all the candidates which actually match. The two stages are called the dueling stage and the sweeping stage. However, exact matching turns out to be much simpler than parameterized matching. This is mainly because of the "witness" which is used in the dueling stage to differentiate between two pattern alignments (the sweeping stage is also simpler for similar reasons). In exact matching a witness is simply one location with a mismatch between the two alignments. This is not sufficient for parameterized matching, as p-matching is dependent on other locations. The following definition captures the concept of a witness in parameterized matching.

Definition 2 (witness). Let P be an m × m size pattern. Consider an alignment of P with itself aligned to begin at location (a + 1, b + 1). A witness relative to alignment (a, b) is a pair of locations (x, y), (x′, y′) such that one of the following holds:

1. P[x, y] = P[x′, y′] and P[x + a, y + b] ≠ P[x′ + a, y′ + b].

2. P[x, y] ≠ P[x′, y′] and P[x + a, y + b] = P[x′ + a, y′ + b].

In the dueling stage, we use the fact that if two p-matches of P in T overlap, then there is a p-matching between the two substrings of P that correspond to the overlapping area of the two matches. Using a witness array for P, we can check pairs of overlapping candidates which do not agree on their overlapping area, and rule out at least one candidate from the pair in constant time per pair. After this stage we are left only with candidates that agree with each other.

In the sweeping stage, we need to check the remaining candidates. This is done by going over all the strips of T from left to right, and for each strip, checking the candidates that begin in the first column of the strip.

Suppose that the current strip starts at column y, and consider some candidate T′ = T[x : x + m − 1, y : y + m − 1] in the strip. Suppose that we know the predecessor for every location in the current strip of T. A pair consisting of a location in T (in the current strip) and its predecessor will be called a location-predecessor pair.

Before making the next observation, we define a central concept which has a similar flavor to the witness defined above.

Definition 3 (mismatch pair). Let R and S be two equal-size 2-dimensional strings. A mismatch pair between R and S is a pair of locations (i, j), (i′, j′) such that one of the following holds:

1. R[i, j] = R[i′, j′] and S[i, j] ≠ S[i′, j′].

2. R[i, j] ≠ R[i′, j′] and S[i, j] = S[i′, j′].

We now observe that:




Observation 3.

1. If T′ p-matches P, then every location-predecessor pair inside T′ (namely, both the location and its predecessor are in T′) is not a mismatch pair for T′ and P.

2. If there is no location-predecessor pair in T′ which is a mismatch pair for T′ and P, then there is a function matching between T′ and P.

3. If there is a function matching between T′ and P, then there is a p-matching between T′ and P if and only if the number of distinct characters in T′ is equal to the number of distinct characters in P.
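Part 3 of the observation can be illustrated on linearized strings with a short Python sketch. The names `function_match` and `p_match_by_counting` are illustrative, and the sketch checks the mapping directly rather than through location-predecessor pairs.

```python
def function_match(s, t):
    """True iff some mapping f on symbols (not necessarily one-to-one) gives f(s) = t."""
    f = {}
    return len(s) == len(t) and all(f.setdefault(a, b) == b for a, b in zip(s, t))


def p_match_by_counting(s, t):
    """A function matching plus equal numbers of distinct symbols gives a p-match."""
    return function_match(s, t) and len(set(s)) == len(set(t))


print(function_match("abca", "xyxx"))       # True: a -> x, b -> y, c -> x
print(p_match_by_counting("abca", "xyxx"))  # False: 3 distinct symbols vs 2
print(p_match_by_counting("abca", "xyzx"))  # True
```

A function from a finite set onto a set of the same size must be one-to-one, which is why comparing the two counts suffices.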

By Observation 3, the algorithm for checking the candidates will have two steps: In the first step, for each candidate, we check whether any location-predecessor pair inside the candidate is a mismatch pair. In the second step we compute the number of distinct characters in each candidate and compare it to the number of distinct characters in P.

In order to implement the first step, we will maintain the predecessor of every location in the current strip. Computing the predecessor of every location in the first strip of T takes O(m^2) time. When going from one strip to the next strip, at most 6m predecessors are updated, and we will show how to do the update in O(m) time.

Clearly, if we check each location-predecessor pair for each candidate separately, then the algorithm will not be efficient (the time complexity will be Θ(m^4) in the worst case). However, we can use the fact that after the dueling stage, all the remaining candidates agree with each other. Therefore, for each location-predecessor pair in T, the pair is either a mismatch pair for P and every candidate that contains the pair, or the pair is not a mismatch pair for any candidate. Thus, for each location-predecessor pair we need to check only two characters in P, and if there is a mismatch, we can rule out all the remaining candidates which contain the pair.

The second step is done by computing the number of distinct characters in every m × m window of T. Amir et al. [5] gave an algorithm for this problem whose time complexity is O(m^2 log m). We can solve the problem in O(m^2) time, but omit the details for space reasons. (Amir and Cole [2] have also solved the problem within the same time bound.)

2.3 Algorithm Details

From Observation 3, we need to check for each candidate whether the location-predecessor pairs inside it

are mismatch pairs. As discussed in Section 2.2, we will check every location-predecessor pair only once

during the entire algorithm.

The main idea is as follows: We will go over the strips of T from left to right. When going from the (y − 1)-th strip to the y-th strip, there are at most 6m new location-predecessor pairs: (1) every location in column y + m − 1 and its predecessor form a new pair, (2) every location in column y + m − 1 may become the predecessor of some location in columns y, ..., y + m − 2, and (3) every location in columns y, ..., y + m − 2 whose predecessor was in column y − 1 will form a new pair with its new predecessor. After a preprocessing step on T that will be described in Section 2.4, we can find all the new location-predecessor pairs in time O(m).

Now, let (r, c), (r′, c′) be some new location-predecessor pair for the y-th strip. All the candidates that contain the locations (r, c) and (r′, c′) agree with each other, so (r, c), (r′, c′) is a mismatch pair for all of these candidates, or for none. Therefore, we only need to find one candidate (z, w) (with w ≥ y) that contains both locations, and check whether P[r − z + 1, c − w + 1] = P[r′ − z + 1, c′ − w + 1]. If these two characters are not equal, then (z, w) cannot be a match, and moreover, every candidate that contains (r, c) and (r′, c′) cannot be a match. In other words, we can rule out every candidate that begins in the rectangle {r − m + 1, ..., r′} × {y, ..., min(c, c′)}. Note that it is possible that the predecessor of (r, c) changes at strip y′ for y < y′ ≤ min(c, c′), so (r, c), (r′, c′) is not a location-predecessor pair for some of the candidates that we rule out. However, this does not affect the correctness of the algorithm.

We now need to handle two issues: how to find a candidate that contains the pair (r, c), (r′, c′), and, in case of a mismatch, how to rule out the candidates in the corresponding rectangle.

We first deal with the first issue. Given a location-predecessor pair (r, c), (r′, c′), we will find the highest candidate that can contain the pair (r, c), (r′, c′), which will be denoted (z, w). More precisely, among all the candidates in {r − m + 1, ..., r} × {y, ..., min(c, c′)} (note that the rectangle here ends in row r), (z, w) is the candidate with the smallest row number (ties are broken arbitrarily). Clearly, if (z, w) does not contain the pair (r, c), (r′, c′) (that is, z > r′), then there is no candidate beginning in column y or later that contains the pair.

We now describe how to find the highest candidate in constant time. To do that, we build a 2m × 2m array A, where A[i, j] is the smallest row in which a candidate starts, among all the candidates with a start row of at least i − m + 1 and start column j. If there is no such row, then A[i, j] = 2m. The array A can be easily computed by scanning the text column by column, from top to bottom.

Now, given a location-predecessor pair (r, c), (r′, c′), in order to find the highest candidate that contains the pair, we need to find the minimum value among the subrow A[r, y], A[r, y + 1], ..., A[r, min(c, c′)]. This is the range minima problem [18], so after preprocessing each row of A in O(m) time (for each row), we can find the minimum element in any subrow of A in constant time.
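As an illustration of constant-time range minimum queries over one row of A, here is a Python sketch based on a sparse table. This is a simpler, swapped-in technique: its preprocessing takes O(m log m) time per row rather than the O(m) of the structure cited in the text, so it illustrates the query interface only and does not attain the stated bound. The class name `RangeMin` is hypothetical.

```python
class RangeMin:
    """Sparse table: O(n log n) preprocessing, O(1) range-minimum queries."""

    def __init__(self, values):
        # table[k][i] holds the minimum of values[i .. i + 2**k - 1].
        self.table = [list(values)]
        n, span = len(values), 1
        while 2 * span <= n:
            prev = self.table[-1]
            self.table.append(
                [min(prev[i], prev[i + span]) for i in range(n - 2 * span + 1)]
            )
            span *= 2

    def query(self, lo, hi):
        """Minimum of values[lo .. hi], inclusive and 0-based, with lo <= hi."""
        level = (hi - lo + 1).bit_length() - 1
        row = self.table[level]
        # Two overlapping blocks of length 2**level cover the whole range.
        return min(row[lo], row[hi - (1 << level) + 1])


row_of_A = [5, 3, 8, 1, 6]
rm = RangeMin(row_of_A)
print(rm.query(0, 2))  # 3
print(rm.query(2, 4))  # 1
print(rm.query(4, 4))  # 6
```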

We now turn to the problem of eliminating candidates. As discussed above, during the algorithm we find rectangles that do not contain a match. Instead of eliminating the candidates at the time each rectangle is discovered, we store the rectangle in a list L, and after finding all the rectangles, we perform a candidates elimination stage.

In the candidates elimination stage, we need to remove each candidate whose starting location is in some rectangle of L. This is done by moving a vertical sweep line from left to right. We maintain two vectors V and B, where V[i] (resp., B[i]) is the number of rectangles in L which intersect the current sweep line, and whose top (resp., bottom) row is i. Using these vectors, we can compute the number of rectangles that contain each point on the sweep line, and eliminate the candidates that start in a location that is contained in at least one rectangle. The time needed to update the vectors V and B after moving the sweep line is linear in the number of new rectangles that intersect the line after its movement and the number of rectangles that stopped intersecting the line. Therefore, the total time complexity is linear in the number of rectangles in L, which is at most 6m^2.
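The elimination stage can be sketched with a two-dimensional difference array, a closely related but different technique from the sweep line with the vectors V and B described above: each rectangle is recorded with four updates, and one pass of prefix sums yields the number of rectangles covering every location. The function name `eliminate`, the 0-based inclusive rectangle format (top, bottom, left, right), and the representation of candidates as a set are assumptions of this sketch.

```python
def eliminate(candidates, rectangles, size):
    """Return the candidates that are not covered by any rectangle.

    candidates: set of (row, col); rectangles: list of (top, bottom, left, right),
    all 0-based and inclusive; size: side length of the text.
    """
    diff = [[0] * (size + 1) for _ in range(size + 1)]
    for top, bottom, left, right in rectangles:
        diff[top][left] += 1
        diff[top][right + 1] -= 1
        diff[bottom + 1][left] -= 1
        diff[bottom + 1][right + 1] += 1

    survivors = set()
    cover = [0] * size  # cover[c] = rectangles covering (r, c) for the current row r
    for r in range(size):
        running = 0  # prefix sum of diff along row r
        for c in range(size):
            running += diff[r][c]
            cover[c] += running
            if cover[c] == 0 and (r, c) in candidates:
                survivors.add((r, c))
    return survivors


cands = {(0, 0), (1, 1), (2, 2), (3, 0)}
print(sorted(eliminate(cands, [(0, 1, 0, 1)], 4)))  # [(2, 2), (3, 0)]
```

The running time is linear in the number of rectangles plus the number of text locations.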

2.4 Text Preprocessing

In this section, we show how to compute an array of pointers, which will be used to maintain the predecessors

of the current strip.

Definition 4 (left predecessor). The left predecessor of a location (i, j) in T is the predecessor of (i, j) in T(1) if j ≤ m; if j > m, the left predecessor is the predecessor of location (i, m) in T(j − m + 1).

We will show how to compute the left predecessor of every location in T. For computing the left
predecessor of all the locations in columns 1, …, m, let T′ be the linearization of the first strip of T. We
compute the left predecessors by scanning T′ from left to right. While scanning, we only need to keep the
last encountered location of each σ ∈ Σ. The time complexity of this stage is $O(m^2)$.
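The left-to-right scan can be illustrated with a short sketch. It assumes the linearization T′ is already available as a plain sequence of symbols; the function name and the use of `None` for "no predecessor" are choices made for this example only.

```python
def predecessors(linearized):
    """For each position, return the previous position holding the same symbol."""
    last_seen = {}  # last encountered position of each symbol
    pred = []
    for pos, symbol in enumerate(linearized):
        pred.append(last_seen.get(symbol))  # None if the symbol is new
        last_seen[symbol] = pos
    return pred
```

Since each position is visited once and only a dictionary lookup is done per symbol, the scan is linear in the length of the linearization (expected time, given hashing).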

For computing the left predecessor of the locations in columns m + 1, …, 2m, we scan the entire text
bottom-up and left-to-right. Following the definition of the left predecessor, we show how to create the left
predecessor for each text location (i, j).

We first define the following relation.

Definition 5. A left down relation between two text locations (i, j) and (k, l) occurs when k < i and l > j.

Given a new text location (k, l) with value T[k, l], consider the list of text locations $(i_1, j_1), \dots, (i_s, j_s)$
such that $T[k, l] = T[i_1, j_1] = \dots = T[i_s, j_s]$ and whose left predecessors have not yet been located, ordered
by their scan order. Then we can observe the following:

Observation 4. Every two elements in the above list maintain the left down relation.

Note that the position (k, l) may already have a left predecessor computed for it if l ≤ m. If not, then a
left predecessor will need to be computed for it as well.

Now the new text location (k, l) may be the left predecessor of some of the locations $(i_1, j_1), \dots, (i_s, j_s)$.
We claim that this left predecessor must be hereditary in the following sense: either (k, l) is the predecessor
of $(i_1, j_1), \dots, (i_f, j_f)$ for some 1 ≤ f ≤ s, or (k, l) is the predecessor of $(i_f, j_f), \dots, (i_s, j_s)$ for some
1 ≤ f ≤ s.

To see this we make some observations regarding text location (k, l):

1. $k \le i_f$ for every f, 1 ≤ f ≤ s. This follows from the bottom-up left-to-right scan.

2. If $l > j_s$ or $l \le j_1 - m$, then (k, l) is not the left predecessor of any location.


Figure 1: The subsets of D.

3. If $j_{f-1} < l$ but $j_f \ge l$, then (k, l) is the left predecessor of $(i_f, j_f), \dots, (i_s, j_s)$ (f − 1 may be equal to 0).

4. If $j_{f+1} > l + m - 1$ but $j_f \le l + m - 1$, then (k, l) is the left predecessor of $(i_1, j_1), \dots, (i_f, j_f)$ (f + 1
may be equal to s + 1).

By scanning the list from either side, this stage can be implemented in overall $O(m^2)$ time.

2.5 Pattern Preprocessing — Finding witnesses

Let P be an m × m string. We will show how to compute a witness for every offset (a, b) that has a witness.

Let l be some integer that will be specified later. Consider an alignment of P against itself, with offset
(a, b). We say that (a, b) is a source if there is a p-match between the overlapping areas in the two copies of
P. If (a, b) is not a source, then there is at least one witness for (a, b). A witness (x, y), (x′, y′) is called
a witness of type 1 if it satisfies condition 1 in Definition 2, and otherwise it is called a witness of type 2.

There are four cases that we need to consider, according to the signs of a and b. In the following, we
handle the case when a ≥ 0 and b ≥ 0. The other cases are symmetrical, and thus omitted.

Let (a, b) be an offset which is not a source, and let $D = \{1, \dots, m-a\} \times \{1, \dots, m-b\}$ be the
overlapping area of P when P is aligned against itself with offset (a, b). We partition the set D into subsets
(see Figure 1): $A_1, \dots, A_4$ are the l × l squares in the corners of D; $B_1, \dots, B_4$ are rectangles of width
or height l on the borders of D, excluding the corners; and C is the remaining part of D (namely,
$C = D \setminus (A_1 \cup \dots \cup A_4 \cup B_1 \cup \dots \cup B_4)$). Formally,

$A_1 = \{(x, y) \in \mathbb{Z}^2 : 1 \le x \le l \text{ and } 1 \le y \le l\}$,

$A_2 = \{(x, y) \in \mathbb{Z}^2 : 1 \le x \le l \text{ and } m-b-l+1 \le y \le m-b\}$,

etc.
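As an illustration of the partition, the following sketch classifies a location of D into one of the nine subsets. Only $A_1$ and $A_2$ are defined explicitly in the text; the placement of $A_3$, $A_4$ and $B_1, \dots, B_4$ used here is inferred from the unions that appear in the definition of a simple witness ($A_1 \cup B_1 \cup A_2$ as the top strip, $A_1 \cup B_2 \cup A_3$ as the left strip, and so on), so it should be read as an assumption. The function name is hypothetical, and the sketch assumes m − a ≥ 2l and m − b ≥ 2l so that the corner squares do not overlap.

```python
def region(x, y, a, b, l, m):
    """Classify the 1-based location (x, y) of D = {1..m-a} x {1..m-b}."""
    height, width = m - a, m - b
    if not (1 <= x <= height and 1 <= y <= width):
        raise ValueError("location is outside the overlap area D")
    top, bottom = x <= l, x >= height - l + 1
    left, right = y <= l, y >= width - l + 1
    if top and left:
        return "A1"
    if top and right:
        return "A2"
    if bottom and left:
        return "A3"
    if bottom and right:
        return "A4"
    if top:
        return "B1"
    if left:
        return "B2"
    if right:
        return "B3"
    if bottom:
        return "B4"
    return "C"
```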

We say that a witness w for (a, b) is simple if it does not satisfy any of the following conditions:

1. w is of type 1, and one of the locations of w is in $A_1$.

2. w is of type 1, one of the locations of w is in $A_2 \cup B_3 \cup A_4$, and the other location is in $A_3 \cup B_4 \cup A_4$.

3. w is of type 2, and one of the locations of w is in $A_4$.

4. w is of type 2, one of the locations of w is in $A_1 \cup B_1 \cup A_2$, and the other location is in $A_1 \cup B_2 \cup A_3$.

The algorithm consists of three stages:

1. Find simple witnesses.

2. Find witnesses that satisfy conditions 1 or 3 above.

3. Find witnesses that satisfy conditions 2 or 4 above.

The three stages of the algorithm are described in the next subsections. In each stage of the algorithm, we
handle only offsets for which no witness was found during the previous stages.

2.5.1 Simple Witnesses

This stage is similar to the algorithm of Amir et al. [3]: We create new strings $P_1$ and $P_2$ by replacing every
character P[x, y] in P by 8m/l characters, where each of these characters is either a pointer to some location (x′, y′)
in P such that P[x′, y′] = P[x, y], or a null pointer.

For each location (x, y) in P, we define rectangles on the grid $\mathbb{Z}^2$ in the following way: For $i = -m/l, \dots, m/l - 1$, let

$H_{i,(x,y)} = \{(x', y') \in \mathbb{Z}^2 : y' \le y \text{ and } x + il \le x' < x + (i+1)l\}$,

$\overline{H}_{i,(x,y)} = \{(x', y') \in \mathbb{Z}^2 : y' \ge y \text{ and } x + il \le x' < x + (i+1)l\}$,

$V_{i,(x,y)} = \{(x', y') \in \mathbb{Z}^2 : x' \ge x \text{ and } y + il \le y' < y + (i+1)l\}$,

and

$\overline{V}_{i,(x,y)} = \{(x', y') \in \mathbb{Z}^2 : x' \le x \text{ and } y + il \le y' < y + (i+1)l\}$.

Furthermore, let $H_{(x,y)} = \{(x, y') \in \mathbb{Z}^2 : y' \le y\}$, and we similarly define the rectangles $\overline{H}_{(x,y)}$, $V_{(x,y)}$,
and $\overline{V}_{(x,y)}$. We say that a rectangle $H_{i,(x,y)}$ (or another H rectangle) is inside the square of P if there is a column
of $H_{i,(x,y)}$ that is contained in $\{1, \dots, m\} \times \{1, \dots, m\}$, namely, if $1 \le y + il$ and $y + (i-1)l - 1 \le m$.
A V rectangle is inside the square of P if there is a row of the rectangle that is contained in $\{1, \dots, m\} \times \{1, \dots, m\}$.

The string $P_1$ is constructed by replacing every character P[x, y] in P by the characters $c_1, \dots, c_{8m/l+4}$
($P_1$ is an $m \times ((8m/l + 4)m)$ string), which are defined as follows:

• For $i = -m/l, \dots, m/l - 1$, if the rectangle $H_{i,(x,y)}$ is inside the square of P, traverse the elements of
$H_{i,(x,y)}$ in column-major order (from right to left) until reaching a location (x′, y′) for which P[x′, y′] =
P[x, y], if there is such a location. If a location (x′, y′) was found, then set $c_{m/l+i+1} = (x - x', y - y')$,
and we say in this case that $c_{m/l+i+1}$ points to (x′, y′). Otherwise (namely, the rectangle $H_{i,(x,y)}$ does not
contain a character equal to P[x, y]), set $c_{m/l+i+1} = \phi$.
If the rectangle $H_{i,(x,y)}$ is not inside the square of P, then set $c_{m/l+i+1} = \phi$.

• For $i = -m/l, \dots, m/l - 1$, the character $c_{3m/l+i+1}$ is built from the rectangle $\overline{H}_{i,(x,y)}$ (the H rectangle
with y′ ≥ y) in the same way described above, except that the rectangle is traversed from left to right, and
furthermore, if the rectangle does not contain a character equal to P[x, y], then $c_{3m/l+i+1} = 0$.

• For $i = -m/l, \dots, m/l - 1$, the characters $c_{5m/l+i+1}$ and $c_{7m/l+i+1}$ are built from the rectangles $V_{i,(x,y)}$
and $\overline{V}_{i,(x,y)}$ in the same way as above, except that the rectangles are traversed in row-major order (if the
corresponding rectangle does not contain a character equal to P[x, y], we use φ for $c_{5m/l+i+1}$ and 0 for
$c_{7m/l+i+1}$).

• The character $c_{8m/l+1}$ corresponds to the rectangle $H_{(x,y)}$, and it is handled the same as the characters
$c_1, \dots, c_{2m/l}$. Furthermore, the characters $c_{8m/l+2}$, $c_{8m/l+3}$, and $c_{8m/l+4}$ correspond to the rectangles
$\overline{H}_{(x,y)}$, $V_{(x,y)}$, and $\overline{V}_{(x,y)}$, respectively.

The string $P_2$ is built in two steps. The first step is the same as building $P_1$, except that we swap the roles
of φ and 0; that is, we use φ for characters that correspond to the rectangles $\overline{H}_{i,(x,y)}$ and $\overline{V}_{i,(x,y)}$, and 0 for
characters that correspond to the rectangles $H_{i,(x,y)}$ and $V_{i,(x,y)}$. After the first step, $P_2$ is an $m \times (\frac{8m}{l} \cdot m)$
string. In the second step, we expand $P_2$ into a $2m \times 2(\frac{8m}{l} \cdot m)$ string, where all the new characters are φ.

After building $P_1$ and $P_2$, we solve the (standard) matching problem with don’t care symbols on $P_1$ and
$P_2$, where $P_1$ is the pattern, $P_2$ is the text, and φ is the don’t care symbol. Moreover, using the algorithm
of Alon and Naor [1], we find witnesses for every mismatch between $P_1$ and $P_2$; namely, for every (a, b)
such that $P_1$ does not match $P_2[a+1 : a+m, b+1 : b+m]$, we find a location (x, y) such that
$P_1[x, y] \ne P_2[x+a, y+b]$.

Consider some fixed pair (a, b), and define $P'_2 = P_2[a+1 : a+m, b+1 : b+m]$.

Claim 1. If $P_1$ does not match $P'_2$, then (a, b) is not a source. Moreover, from every witness to the
mismatch of $P_1$ and $P'_2$ we can obtain a witness for (a, b) in constant time.

The converse of Claim 1 is not true. A weaker result is given in the following lemma.

Lemma 2. If (a, b) has a simple witness then $P_1$ does not match $P'_2$.

Proof. Suppose that w = (x, y), (x′, y′) is a simple witness for (a, b). W.l.o.g. we assume that x ≥ x′,
and moreover, if x = x′, we assume that y > y′. We will deal with the case when w is a witness of type 1,
and omit the proof for the case when w is of type 2. We consider 3 cases.

Case 1 In the first case, suppose that either x = x′ or (x′, y′) is in $B_1 \cup C \cup B_4$, or in other words,
l + 1 ≤ y′ ≤ m − b − l. We prove the lemma using induction on x − x′.


If x = x′, we have that (x, y′) ∈ $H_{(x,y)}$. Thus, the character of $P_1$ that corresponds to $H_{(x,y)}$ points
to some location (x, y′′) such that y′ ≤ y′′ < y and P[x, y′′] = P[x, y]. If P[x + a, y′′ + b] ≠ P[x +
a, y + b], then the character of $P_2$ that corresponds to $H_{(x+a,y+b)}$ either points to some location other
than (x + a, y′′ + b), or is equal to 0. In both cases, we conclude that $P_1$ does not match $P'_2$. If
P[x + a, y′′ + b] = P[x + a, y + b], then w′ = (x, y′′), (x, y′) is a simple witness for (a, b) of type 1.
We can now use the argument above on w′ and either obtain that $P_1$ does not match $P'_2$, or obtain a new
witness w′′. This process can be continued, and since it must stop, we conclude that $P_1$ does not
match $P'_2$.

Now, suppose that x > x′, and that case 1 of the lemma has been proved for every witness in which the difference
between the rows of the two locations is less than x − x′. Since l + 1 ≤ y′ ≤ m − b − l, it follows that
the rectangle $\overline{V}_{i,(x,y)}$ (the V rectangle with x′ ≤ x) that contains the location (x′, y′) is inside the square of P.
Therefore, the character of $P_1$ that corresponds to this rectangle points to some location (x′′, y′′), where
x′ ≤ x′′ < x. If P[x′′ + a, y′′ + b] ≠ P[x + a, y + b], then we are done, since the character of $P_2$ that corresponds
to the same rectangle at location (x + a, y + b) either points to some location other than
(x′′ + a, y′′ + b), or is equal to 0. Otherwise, we use the induction hypothesis on w′ = (x′′, y′′), (x′, y′).
Note that there is a slight technical problem, since it is possible that (x′′, y′′) ∈ $A_1$ and therefore w′ is not
simple. However, the arguments we used above still work on w′, so we still obtain that $P_1$ does not match
$P'_2$.

Case 2 The second case is when either y = y′ or the leftmost location among (x, y) and (x′, y′) is in
$B_2 \cup C \cup B_3$. The proof for this case is symmetrical to the proof of case 1 (we use the rectangles $V_{(x^*,y^*)}$
and $H_{i,(x^*,y^*)}$ in place of the H and V rectangles used in case 1, where (x∗, y∗) is the rightmost location).

Case 3 Assume that cases 1 and 2 do not occur. Since case 1 does not occur, we have that either y′ ≤ l or
y′ ≥ m − b − l + 1. We consider 4 sub-cases:

1. y′ ≤ l and y > y′. Since (x′, y′) is the leftmost location and case 2 does not occur, we have that either
x′ ≤ l or x′ ≥ m − a − l + 1. If x′ ≤ l then (x′, y′) ∈ $A_1$, and therefore w is not simple, a contradiction.
If x′ ≥ m − a − l + 1, then (x, y), (x′, y′) ∈ $A_3 \cup B_4 \cup A_4$, which again contradicts the assumption that
w is simple.

2. y′ ≤ l and y < y′. In this case, the rectangle $\overline{V}_{0,(x,y)}$ (the V rectangle with x′ ≤ x), which contains the point
(x′, y′), is inside the square of P. Therefore, using the same arguments as in case 1, we obtain that $P_1$ does not match $P'_2$.

3. y′ ≥ m − b − l + 1 and y > y′. We have that (x, y), (x′, y′) ∈ $A_2 \cup B_3 \cup A_4$, which contradicts the
assumption that w is simple.

4. y′ ≥ m − b − l + 1 and y < y′. Since (x, y) is the leftmost location, it follows that either x ≤ l or
x ≥ m − a − l + 1. In the former case, the rectangle $H_{0,(x',y')}$, which contains the location (x, y), is
inside the square of P. It follows that $P_1$ does not match $P'_2$. In the latter case, we obtain that w is not
simple, a contradiction.


From Claim 1 and Lemma 2 we conclude that stage 1 of the algorithm correctly finds a witness for every
offset (a, b) that has a simple witness. The time complexity of this stage is
$O(|P_2| \cdot \log^{4+o(1)} |P_1|) = O(\frac{m^3}{l} \cdot \log^{4+o(1)} m)$.

2.5.2 Corner Witnesses

In the following stages, we describe only how to find witnesses of type 1, as handling the witnesses of
type 2 is symmetrical.

The second stage is composed of four sub-stages.

Stage 2a

In this stage we find all the witnesses w such that the two locations of w are in $A_1$. For each location (x, y)
in P, we define rectangles as follows:

$H^2_{i,(x,y)} = \{(x + i, y') \in \mathbb{Z}^2 : y' \le y\}$,

$V^2_{i,(x,y)} = \{(x', y + i) \in \mathbb{Z}^2 : x' \le x\}$.

Using these rectangles, we build strings $P^2_1$ and $P^2_2$ by replacing each character P[x, y] in P by 4l characters
that correspond to the rectangles $H^2_{i,(x,y)}$ and $V^2_{i,(x,y)}$ for $i = -l+1, \dots, l-1$ (the construction is similar
to the construction of $P_1$ and $P_2$). Then, we solve the matching problem for $P^2_1$ and $P^2_2$ and find witnesses
for the mismatches. The correctness of this sub-stage follows from the fact that for two locations in $A_1$, one
of the locations is inside the rectangles of the other location.

Stage 2b

In this stage we find all the witnesses w such that one location of w is in $A_1$, and the other location is in
$B_1 \cup B_2 \cup C \cup A_3 \cup B_4$.

As in the previous stages, we encode a character of P by storing pointers to other occurrences of the
character in P. However, instead of creating one matching problem instance, we create $m^2/l$ instances.
The main idea is to group the different offsets into sets and build a matching problem instance for each set,
and then use the instance to find witnesses for the offsets in the set. In each set, all the offsets are close to
each other, and we use this fact in our construction.

Consider a set of offsets (a, b), (a, b + 1), …, (a, b + l − 1), where b is a multiple of l. Define the rectangles

$H^{3,a,b}_{(x,y)} = \{(x', y') \in \mathbb{Z}^2 : y \le y' \le m-b-l \text{ and } 0 \le x' \le m-a\}$,

$H^{4,a,b}_{(x,y)} = \{(x', y') \in \mathbb{Z}^2 : y' \le y \text{ and } x \le x' \le m-a\}$,

and build strings $P^{3,a,b}_1$ and $P^{3,a,b}_2$ as follows: For every (x, y) ∈ $A_1$, the string $P^{3,a,b}_1$ contains two characters
for P[x, y] corresponding to the rectangles $H^{3,a,b}_{(x,y)}$ and $H^{4,a,b}_{(x,y)}$. For every location (x, y) with
a + 1 ≤ x ≤ a + l and b + 1 ≤ y ≤ b + 2l − 1, the string $P^{3,a,b}_2$ contains two characters for P[x, y] corresponding to the
rectangles $H^{3,a,b}_{(x,y)}$ and $H^{4,a,b}_{(x,y)}$. Moreover, the string $P^{3,a,b}_2$ is padded with don’t care symbols.
As in the previous stages, we solve the matching problem for $P^{3,a,b}_1$ and $P^{3,a,b}_2$ and find witnesses for the
mismatches.

Stage 2c

This stage finds all the witnesses w such that one location of w is in $A_1$, and the other location is in $A_2 \cup B_3$.
The algorithm is similar to stage 2b: For every set of offsets (a, b), (a + 1, b), …, (a + l − 1, b), where a is
a multiple of l, define

$V^{3,a,b}_{(x,y)} = \{(x', y') \in \mathbb{Z}^2 : x \le x' \le m-a-l \text{ and } 0 \le y' \le m-b\}$,

$V^{4,a,b}_{(x,y)} = \{(x', y') \in \mathbb{Z}^2 : x' \le x \text{ and } y \le y' \le m-b\}$,

and build strings $P^{4,a,b}_1$ and $P^{4,a,b}_2$. Then, solve the matching problem for $P^{4,a,b}_1$ and $P^{4,a,b}_2$ and find witnesses
for the mismatches.

Stage 2d

In this final sub-stage, we find all the witnesses w such that one location of w is in $A_1$, and the other location
is in $A_4$. Recall that

$H^2_{i,(x,y)} = \{(x + i, y') \in \mathbb{Z}^2 : y' \le y\}$.

For every set of offsets $\{(a + i, b + j) : i = 0, \dots, l-1 \text{ and } j = 0, \dots, l-1\}$, where a and b are
multiples of l, build strings $P^{5,a,b}_1$ and $P^{5,a,b}_2$: For every location (x, y) with m − a − 2l + 2 ≤ x ≤ m and
m − b − 2l + 2 ≤ y ≤ m, $P^{5,a,b}_1$ contains 3l − 3 characters for P[x, y], corresponding to the rectangles
$H^2_{i,(x,y)}$ for $i = m-a-3l+2, \dots, m-a-1$. Similarly, for every (x, y) with 1 ≤ x ≤ 2l − 1 and
1 ≤ y ≤ 2l − 1, $P^{5,a,b}_2$ contains 3l − 3 characters for P[x, y], corresponding to the rectangles $H^2_{i,(x,y)}$ for
$i = m-a-3l+2, \dots, m-a-1$.

The total time complexity of stage 2 is $O(m^2 l \cdot \log^{4+o(1)} m)$.

2.5.3 l Strip Witnesses

Let (a, b) be some offset for which no witness was found during the previous stages. We first look for
witnesses with one location in $B_3 \cup A_4$ and the other location in $A_3 \cup B_4 \cup A_4$. Define $D' = D \setminus (A_1 \cup B_1 \cup A_2)$.
Since there are no simple witnesses for (a, b), and no witnesses that satisfy condition 3 (in the
definition of a simple witness), we conclude that there are no type 2 witnesses with both locations inside D′.
Therefore, we make the following observation:


Claim 3. For every rectangle D′′ ⊆ D′, the number of distinct characters inside the region D′′ of P is less
than or equal to the number of distinct characters inside the region D′′ + (a, b) = {(x + a, y + b) : (x, y) ∈ D′′}
of P, with equality if and only if there is no witness w for (a, b) both of whose locations are in D′′.

From Claim 3, we devise the following algorithm for finding a witness in D′: Check the number of distinct
characters inside the regions D′ and D′ + (a, b) of P. If these numbers are equal then stop. Otherwise, find
a minimal rectangle D′′ in D′ for which the number of distinct characters in D′′ is strictly less than the
number of distinct characters in D′′ + (a, b). Then, one of the pairs of opposite corners of the rectangle D′′
is a witness for (a, b). The problem with this algorithm is that we do not know how to efficiently find such
a rectangle D′′. Instead, we will find a rectangle D∗ which will be approximately equal to D′′.

In order to find D∗, consider the following intersection counting problem: Given a set S of points in the
plane, where each point has a color, preprocess S in order to answer efficiently queries of the form “what is
the number of distinct colors among the points inside the rectangle (−∞, b] × [c, d]?”. Denote by n the number of
points in S. Gupta et al. [21] showed a data structure for this problem with preprocessing time $O(n \log^2 n)$
which answers queries in $O(\log^2 n)$ time. Using this result, we obtain the following lemma:

Lemma 4. Let P be an m × m string, and l be an integer. After preprocessing P in $O(\frac{m^3}{l} \log^2 m)$ time,
we can answer the following queries in $O(\log^2 m)$ time: “what is the number of distinct characters in the
substring P[a : b, c : d]?”, for all integers a, b, c, and d such that one of these integers is either 1, m, or a
multiple of l.

Proof. We will show how to answer queries in which a is either a multiple of l or equal to 1. Answering
other queries is done similarly.

The preprocessing is as follows: Create a set of points S containing a point (x, y) for every x = 1, …, m
and y = 1, …, m. The color of (x, y) is P[x, y]. For every integer i ≤ m/l, build the data structure of
Gupta et al. on the set of points $S_i = \{(x, y) \in S : x \ge il\}$.

Given a query (a, b, c, d), if a is a multiple of l, return the number of distinct colors of the set $S_{a/l}$ in
the rectangle (−∞, b] × [c, d]. If a is equal to 1, return the number of distinct colors of the set $S_0$ (which
contains all the points of S) in the rectangle (−∞, b] × [c, d].

We now return to the problem of finding D∗. The algorithm is as follows: First, build the data structure of
Lemma 4 on P. Then, compute the number of distinct characters inside the regions D′ and D′ + (a, b) of P.
If these numbers are equal, stop. Otherwise, find a minimal rectangle D∗ = {x, …, m − a} × {y, …, z}
such that the numbers of distinct characters inside the regions D∗ and D∗ + (a, b) of P are not equal, and
x is a multiple of l. Finding D∗ is done using binary search and queries to the data structure of Lemma 4
(note that the queries we make are of the form {x′, …, m − a} × {y′, …, z′} where x′ is a multiple of l, or
{x′, …, m} × {y′, …, z′}).



Let X be the set of all locations (c, d) such that c ∈ {x, …, x + l − 1} ∪ {m − a − l + 1, …, m − a} and
d ∈ {y, z}. From Claim 3 and the minimality of D∗, we obtain that there is a witness w for (a, b) (of type 1)
both of whose locations are in X. We find such a witness as follows:

1. Initialize tables V[1..m] and $L[1..m^2]$ to zeros.

2. Go over the locations in X in some order. For every location (c, d), if V[P[c, d]] ∉ {0, P[c + a, d + b]},
output the witness L[P[c, d]], (c, d). Otherwise, set V[P[c, d]] ← P[c + a, d + b] and L[P[c, d]] ← (c, d).
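The two-table scan above can be sketched as follows. This is an illustrative version only: it uses dictionaries in place of the arrays V and L, 0-based indices, and a pattern given as a list of rows; the function name and the `locations` argument (standing for the set X) are assumptions of the example.

```python
def find_type1_witness(P, a, b, locations):
    """Scan `locations` and return a pair of locations holding the same
    character in P whose images under the offset (a, b) differ, or None."""
    image = {}  # plays the role of V: character -> character seen at the shifted location
    where = {}  # plays the role of L: character -> last location where it was seen
    for (c, d) in locations:
        ch = P[c][d]
        shifted = P[c + a][d + b]
        if ch in image and image[ch] != shifted:
            return where[ch], (c, d)
        image[ch] = shifted
        where[ch] = (c, d)
    return None
```

For example, with `P = [list("aba"), list("cde")]`, offset `(1, 0)` and the three locations of the first row, the character `a` is seen with images `c` and `e`, so the pair of its two locations is reported.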

By Lemma 4, the time complexity of this stage is $O(\frac{m^3}{l} \log^2 m + m^2 \log^3 m + m^2 l)$ (note that the
initialization of the tables V and L above takes O(1) time). Therefore, the total time complexity of the algorithm
is $O((\frac{m}{l} + l) m^2 \cdot \log^{4+o(1)} m)$. The last expression is $O(m^{5/2} \cdot \log^{4+o(1)} m)$ when $l = \Theta(\sqrt{m})$.


Chapter 3

3 Approximate Parameterized Matching

3.1 Preliminaries and Definitions

Given a string $s = s_1 s_2 \dots s_n$ of length |s| = n over an alphabet Σs, and a bijection π from Σs to some other
alphabet Σs′, the image of s under π is the string $s' = \pi(s) = \pi(s_1) \cdot \ldots \cdot \pi(s_n)$ that is obtained by applying
π to all characters of s.

Given two equal-length strings w and u over alphabets Σw and Σu and a bijection π from Σw to Σu,
the π-mismatch between w and u is Ham(π(w), u), where Ham(s, t) is the Hamming distance between
s and t. We say that w parameterize k-matches u if there exists a π such that the π-mismatch is at most k. The
approximate parameterize match between w and u is the minimum π-mismatch (over all bijections π).

For given input text x, |x| = n, and pattern y, |y| = m, the problem of approximate parameterized
matching for y in x consists of computing the approximate parameterized match between y and every
(consecutive) substring of length m of x. Hence, approximate parameterized searching requires computing the π
yielding the minimum π-mismatch for y at each position of x. Of course, the best π is not necessarily the same
at every position. The problem of parameterized matching with k mismatches consists of computing for
each location i of x whether y parameterize k-matches the (consecutive) substring of length m beginning
at location i. Sometimes π itself is also desired (however, here any π with π-mismatch of no more than k
will be satisfactory).
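To make the definition concrete, the following brute-force sketch computes the approximate parameterized match of two short equal-length strings by trying every mapping. It is exponential in the alphabet size and is meant only as an executable restatement of the definition. One assumption is made that the text does not spell out: when the two alphabets differ in size, symbols of w may be sent to a placeholder that matches nothing, which is equivalent to padding the smaller alphabet with unused symbols.

```python
from itertools import permutations

def approx_parameterized_match(w, u):
    """Minimum pi-mismatch between equal-length strings w and u (brute force)."""
    if len(w) != len(u):
        raise ValueError("strings must have equal length")
    sigma_w = sorted(set(w))
    sigma_u = sorted(set(u))
    # Pad the target alphabet with placeholders (None) so that every symbol
    # of w can receive a distinct image.
    targets = sigma_u + [None] * max(0, len(sigma_w) - len(sigma_u))
    best = len(w)
    for image in permutations(targets, len(sigma_w)):
        pi = dict(zip(sigma_w, image))
        mismatches = sum(pi[a] != b for a, b in zip(w, u))
        best = min(best, mismatches)
    return best
```

With this helper, w parameterize k-matches u exactly when `approx_parameterized_match(w, u) <= k`.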

3.2 String Comparison Problem - The Binary Case

We begin by presenting observations on the simpler variant of the problem where the pattern alphabet is
binary. We introduce an algorithm that takes O(nk) time for the following problem:

• Input: Two strings, $t = t_1, t_2, \dots, t_n$ over Σt and $p = p_1, p_2, \dots, p_m$ over Σ2, and an integer k.

• Output: All locations i in t where p parameterize k-matches $t_i, \dots, t_{i+m-1}$.

Assume that Σ2 = {0, 1}. There are four cases for p:

1. The first m/2 characters of p contain at least 2k + 1 zeros and 2k + 1 ones.

2. The last m/2 characters of p contain at least 2k + 1 zeros and 2k + 1 ones.

3. The first m/2 characters of p contain at least 2k + 1 zeros and the last m/2 characters of p contain at
least 2k + 1 ones, or the opposite.

4. One of the characters 0, 1 appears in the pattern fewer than 2k + 1 times.


We now examine each case separately.

3.2.1 Case 1: First m/2 Characters

To handle this case, we choose two sample sequences, consisting of the first 2k + 1 ones and the first 2k + 1 zeros of p.
These sample sequences are checked against the appropriate alignment at each text location, from left
to right. If a location has more than k mismatches, it is eliminated; otherwise, there are at
least k + 1 matches to one specific character of the text for each sample sequence. We call this character the
matched character.

Lemma 5. Two locations at distance at most m/2 from each other will both be candidates only if the matched
characters of their sample sequences are the same.

Proof. Assume that there are two candidates within distance m/2 of each other. Then their overlap
includes the samples of the right candidate. If its matched characters are different from those of the left
candidate (even by one character), then the left candidate must be eliminated, since it already has more than k
mismatches.

3.2.2 Case 2: Last m/2 Characters

This case is symmetric to the above case by reversing the checking order from right to left.

3.2.3 Case 3: Both Halves

The first stage of running over the text and checking the sample sequences is done in the same manner as before.
The only difference is that for each group of candidates there will be no more than three different matched
characters, for the same reason as in the first case.

Now, for the above three cases, since we have already determined π between p and every text location that
was not eliminated, we can run an algorithm for pattern matching with k mismatches
in $O(n\sqrt{k \log k})$ time [7], on a text of size 3m/2, for each π. We partition the text into pieces that ensure
at least m/2 characters in the overlap of any two candidates, and since we have a constant number of π
functions, we run the algorithm only a constant number of times.

3.2.4 Case 4: Less than 2k + 1 Characters

In this case we check the entire pattern against the first place in the text. In the next potential match, we
need to check at most 2k text locations, since all the alignments of the pattern against itself differ in
at most 2k places.



Hence, the algorithm runs in O(nk) time. Unfortunately, extending these results to the general version
(where the pattern alphabet is arbitrary) does not lead to an algorithm with an attractive running time; hence we
now present the algorithm for the general case.

3.3 String Comparison Problem

We begin by evaluating two equal-length strings for a parameterized k-match as follows:

• Input: Two strings, $s = s_1, s_2, \dots, s_m$ and $s' = s'_1, s'_2, \dots, s'_m$, and an integer k.

• Output: True, if there exists π such that the π-mismatch of s and s′ is no more than k.

In standard string comparison, we simply compare the characters and count the number of mismatches in
O(m) time. In approximate parameterized matching the naive method of solving the problem is to check
the π-mismatch for every possible π. However, this takes exponential time.

One way to solve the problem is to reduce it to maximum weight bipartite matching in a
graph. We construct a bipartite graph B = (U ∪ V, E) in the following way: U = Σs and V = Σs′. There is
an edge between a ∈ Σs and b ∈ Σs′ iff there is at least one a in s that is aligned with a b in s′. The weight
of this edge is the number of a’s in s that are aligned with b’s in s′. It is easy to see that:

Observation 5. A maximum weight matching in B corresponds to a minimal π-mismatch, where π is
defined by the edges of the matching.
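The reduction can be sketched as follows. The edge weights are built exactly as described above; for the matching step the sketch substitutes a standard cubic-time Hungarian (assignment) algorithm on a zero-padded square matrix, which is not one of the faster algorithms cited in the text and is used here only to keep the example self-contained. The function name is hypothetical.

```python
from collections import Counter

def min_pi_mismatch(s, t):
    """Minimum pi-mismatch of equal-length strings via maximum weight matching."""
    m = len(s)
    weight = Counter(zip(s, t))  # weight[(a, b)] = number of a's aligned with b's
    U, V = sorted(set(s)), sorted(set(t))
    n = max(len(U), len(V))
    # Square, 1-based cost matrix; cost = -weight, missing edges cost 0.
    cost = [[0] * (n + 1) for _ in range(n + 1)]
    for i, a in enumerate(U, 1):
        for j, b in enumerate(V, 1):
            cost[i][j] = -weight[(a, b)]
    INF = float("inf")
    u = [0] * (n + 1)    # row potentials
    v = [0] * (n + 1)    # column potentials
    p = [0] * (n + 1)    # p[j] = row currently assigned to column j
    way = [0] * (n + 1)  # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0][j] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:  # flip the assignment along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    matched_weight = sum(-cost[p[j]][j] for j in range(1, n + 1))
    return m - matched_weight
```

The value returned is m minus the weight of the matching, since every aligned pair covered by a matched edge is a location on which π agrees.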

The problem of maximum bipartite matching has been widely studied and efficient algorithms have been
devised, e.g. [16, 17, 19, 23]. For integer weights, where the largest weight is bounded by |V|, the fastest
algorithm runs in time $O(E\sqrt{V})$ [23]. This solution yields an $O(m^{1.5})$ time algorithm for the problem of
parameterized string comparison with mismatches. The advantage of the algorithm is that it views the whole
string at once and efficiently finds the best function. In general, for the parameterized pattern matching
problem with inputs t of length n, and p of length m, the bipartite matching solution yields an $O(nm^{1.5})$
time algorithm.

3.3.1 Reducing Maximum Matching to the String Comparison Problem

Lemma 6. Let B = (U ∪ V, E) be a bipartite graph. If the string comparison problem can be solved in
O(f(m)) time then a maximum matching in B can be found in O(f(|E|)) time.

Proof. Take an enumeration of the nodes $u_1, u_2, \dots, u_{|U|}$ and $v_1, v_2, \dots, v_{|V|}$ and an enumeration of the
edges $e_1, \dots, e_{|E|}$. Create two strings $s_U$ and $s_V$ of length |E|, where for edge $e_i = (u_j, v_l)$ we set the i-th
location of $s_U$ to be $u_j$ and the i-th location of $s_V$ to be $v_l$.

Obviously a bijection from the “alphabet” of $s_U$ to the “alphabet” of $s_V$ corresponds to a matching in B,
and the locations on which the bijection agrees are exactly the edges of that matching.
Hence, the lemma follows.
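The construction in the proof can be illustrated on small graphs. In the sketch below, `edges_to_strings` builds the two sequences from an edge list, and `max_matching_via_strings` recovers the maximum matching size as the largest number of locations on which some injective symbol mapping agrees. The comparison itself is done by brute force, purely for illustration; the function names are hypothetical, and the edges are assumed to be distinct.

```python
from itertools import permutations

def edges_to_strings(edges):
    """Build the sequences s_U and s_V of the reduction from a list of edges (u, v)."""
    s_u = [u for u, _ in edges]
    s_v = [v for _, v in edges]
    return s_u, s_v

def max_matching_via_strings(edges):
    """Maximum matching size = |E| minus the minimum pi-mismatch of (s_U, s_V)."""
    s_u, s_v = edges_to_strings(edges)
    left, right = sorted(set(s_u)), sorted(set(s_v))
    # Placeholders (None) let left nodes stay unmatched when |U| > |V|.
    targets = right + [None] * max(0, len(left) - len(right))
    best = 0
    for image in permutations(targets, len(left)):
        pi = dict(zip(left, image))
        agree = sum(pi[a] == b for a, b in zip(s_u, s_v))
        best = max(best, agree)
    return best
```

For the path u1-v1, u1-v2, u2-v1 the mapping u1 → v2, u2 → v1 agrees on two locations, which is the size of a maximum matching of that graph.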



Since it is always true that $|E| < |V|^2$, it follows that $|E|^{1/4} < \sqrt{|V|}$. Hence, it would be surprising to
solve the string comparison problem in $o(m^{1.25})$. In fact, even for graphs with a linear number of edges the
best known algorithm is $O(E\sqrt{V})$, and hence an algorithm for the string comparison problem in $o(m^{1.5})$
would have implications for the maximum matching problem as well.

Note that for the string comparison problem one can reduce the approximate parameterized matching
version to the parameterized matching with k mismatches version. This is done by binary searching on k
to find the optimal solution for the string comparison problem in the approximate parameterized matching
version. Hence an algorithm for detecting whether the string comparison problem has a bijection with fewer
than k mismatches with time $o(\frac{k^{1.5}}{\log k} + m)$ would imply a faster maximum matching algorithm.

3.3.2 Mismatch Pairs and Parameterized Properties

Obviously, one may use the solution of approximate parameterized string matching to solve the problem
of parameterized matching with k mismatches. However, it is desirable to construct an algorithm with
running time dependent on k rather than m. The bipartite matching method does not seem to give insight
into how to achieve this goal. In order to give an algorithm dependent on k we introduce a new method
for detecting whether two equal-length strings have a parameterized k-match. The ideas used in this new
comparison algorithm will be used in the following section to yield an efficient solution for the problem of
parameterized matching with k mismatches.

When faced with the problem of p-matching with mismatches, the initial difficulty is to decide, for a
given location, whether it is a match or a mismatch. Simple comparisons do not suffice, since any symbol
can match (or mismatch) every other symbol. In the bipartite solution, this difficulty is overcome by viewing
the entire string at once. A bijection is found, over all characters, excluding at most k locations. Thus, it
is obvious that the locations that are excluded from the matching are “mismatched” locations. For our
purposes, we would like to be able to decide locally, i.e. before π is ascertained, whether a given location is
“good” or “bad.”

Recall Definition 3 of a mismatch pair for two dimensional strings. The exact same definition holds for
one dimensional strings as well: a pair of locations (i, j) is a mismatch pair if one of the following holds:

1. $s_i = s_j$ and $s'_i \ne s'_j$.

2. $s_i \ne s_j$ and $s'_i = s'_j$.
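The two conditions say that exactly one of the equalities $s_i = s_j$ and $s'_i = s'_j$ holds, which gives a one-line test. The sketch below uses 0-based indices and a hypothetical function name.

```python
def is_mismatch_pair(s, t, i, j):
    """True if locations i and j form a mismatch pair between s and t:
    the characters are equal in exactly one of the two strings."""
    return (s[i] == s[j]) != (t[i] == t[j])
```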

It immediately follows from the definition that:

Lemma 7. Given two equal-length strings, s and s′, if (i, j) is a mismatch pair between s and s′, then for
every bijection π : Σs → Σs′ either location i or location j, or both locations i and j, are mismatches, i.e.
$\pi(s_i) \ne s'_i$ or $\pi(s_j) \ne s'_j$ or both are unequal.

Conversely,


Lemma 8. Let s and s′ be two equal-length strings and let S ⊂ {1, ..., m} be a set of locations which does not contain any mismatch pair. Then there exists a bijection π : Σ_s → Σ_{s′} that is parameterized on S, i.e. for every i ∈ S, π(s_i) = s′_i.

Proof: S does not contain any mismatch pair. Hence, for any two locations i and j in S, if s_i = s_j then s′_i = s′_j, and if s_i ≠ s_j then s′_i ≠ s′_j. The assignment π(s_i) = s′_i for i ∈ S is therefore well defined and one-to-one. Hence there is a bijection π : Σ_s → Σ_{s′} that is a parameterized match on the set of locations S.
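To make the two lemmas concrete, the following minimal Python sketch tests the mismatch-pair condition and tries to build the partial map π(s_i) = s′_i on a set of locations S. The function names `is_mismatch_pair` and `bijection_on` and the 0-based indexing are assumptions made for illustration; they do not appear in the original text.

```python
def is_mismatch_pair(s, t, i, j):
    """True if (i, j) satisfies rule 1 or rule 2 of the mismatch-pair definition."""
    # Exactly one of the two strings has equal symbols at locations i and j.
    return (s[i] == s[j]) != (t[i] == t[j])


def bijection_on(s, t, locations):
    """Build the partial map pi with pi[s[i]] == t[i] for every i in `locations`.

    Returns None when `locations` contains a mismatch pair, since no
    one-to-one map can then agree with t on all of them (Lemma 7).
    """
    pi, inverse = {}, {}
    for i in locations:
        # Rule 1 violation: the same symbol of s would need two images.
        if pi.setdefault(s[i], t[i]) != t[i]:
            return None
        # Rule 2 violation: two symbols of s would share one image.
        if inverse.setdefault(t[i], s[i]) != s[i]:
            return None
    return pi


s, t = "abab", "abba"
print(is_mismatch_pair(s, t, 0, 2))   # True: s[0] == s[2] but t[0] != t[2]
print(bijection_on(s, t, [0, 1]))     # {'a': 'a', 'b': 'b'}
print(bijection_on(s, t, [0, 2]))     # None
```

The returned dictionary is only the part of π that is forced by S; extending it to the remaining symbols of the alphabet is left out of the sketch.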

The idea of our algorithm is to count the mismatch pairs between s and s′. Since each pair contributes at least one error to the p-match between s and s′, it follows immediately from Lemma 7 that if there are more than k pairwise disjoint mismatch pairs then s does not parameterize k-match s′. We claim that if a maximal disjoint collection (made precise in Definition 6 below) contains fewer than k/2 + 1 mismatch pairs then s parameterize k-matches s′.

Definition 6. Given two equal-length strings s and s′, a collection L of mismatch pairs is said to be a maximal disjoint collection if (1) all mismatch pairs in L are disjoint, i.e. do not share a common location, and (2) there is no mismatch pair that can be added to L without violating disjointness.

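Corollary 1. Let s and s′ be two strings, and let L be a maximal disjoint collection of mismatch pairs of s and s′. If |L| > k, then for every π : Σ_s → Σ_{s′} the π-mismatch is greater than k. If |L| ≤ k/2 then there exists π : Σ_s → Σ_{s′} such that the π-mismatch is less than or equal to k.

Proof: By Lemma 7, every bijection errs on at least one location of each pair in L, and the pairs are disjoint, so the π-mismatch is at least |L|. Conversely, since L is maximal, the locations outside L contain no mismatch pair, so by Lemma 8 some bijection is a parameterized match on all of them and errs on at most 2|L| locations.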

3.3.3 String Comparison Algorithm

The method uses Corollary 1. First the mismatch pairs need to be found. In fact, by Corollary 1 only k + 1 of them need to be found, since k + 1 disjoint mismatch pairs imply that there is no parameterized k-match. After the mismatch pairs are found, if the number of mismatch pairs mp is at most k/2 or more than k, we can announce a match or a mismatch, respectively, immediately according to Corollary 1. The difficult case is to detect whether there indeed is a parameterized k-match for k/2 < mp ≤ k. This is done with the bipartite matching algorithm. However, here the bipartite graph needs to be constructed somewhat differently. While the case mp ≤ k/2 implies an immediate match, for simplicity of presentation we will not differentiate between the case mp ≤ k/2 and the case k/2 < mp ≤ k. Thus from here on we will only bother with the cases mp ≤ k and mp > k.

3.3.4 Find Mismatch Pairs

In order to find the mismatch pairs of equal-length strings s and s′ we use a simple stack scheme, one stack for each character in the alphabet. First, mismatch pairs are searched for according to the first rule of Definition 3, i.e. s_i = s_j but s′_i ≠ s′_j. Then, within the leftover locations, mismatch pairs are sought that satisfy the second rule of Definition 3, i.e. s_i ≠ s_j but s′_i = s′_j.


Algorithm: Find mismatch pairs(s, s′)

1. Construct a stack S_σ for each σ ∈ Σ_s.

2. For i = 1 to m do

     if empty(S_{s_i}) then push(S_{s_i}, ⟨s′_i, i⟩)

     else if top(S_{s_i}) = ⟨s′_i, ∗⟩ then push(S_{s_i}, ⟨s′_i, i⟩)

     else ⟨s′_∗, j⟩ ← pop(S_{s_i}); add (i, j) to the mismatch pairs

3. Set L to be all locations of {1, ..., m} that do not appear in the mismatch pairs.

4. Reverse the roles of s and s′ and do step 2 while scanning items only from L (i.e. here step 2 begins "For j ∈ L do").

Time Complexity: The algorithm that finds all mismatch pairs between two strings of length m runs in O(m) time. In each of the two passes, every location is pushed onto at most one stack and popped at most once.
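The stack scheme above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a tuned implementation: the name `find_mismatch_pairs`, the 0-based locations, and the use of a dictionary of lists as the per-character stacks are choices made here.

```python
from collections import defaultdict


def find_mismatch_pairs(s, t):
    """Return a maximal disjoint collection of mismatch pairs between s and t."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")

    def one_pass(a, b, positions):
        # One stack per symbol of `a`; all entries in a stack carry the same
        # symbol of `b`, so only the top needs to be inspected.
        stacks = defaultdict(list)
        pairs = []
        for i in positions:
            stack = stacks[a[i]]
            if not stack or stack[-1][0] == b[i]:
                stack.append((b[i], i))
            else:
                _, j = stack.pop()          # a[i] == a[j] but b[i] != b[j]
                pairs.append((j, i))
        return pairs

    # Step 2: pairs with s[i] == s[j] and t[i] != t[j].
    first = one_pass(s, t, range(len(s)))
    used = {loc for pair in first for loc in pair}
    # Step 3: locations not yet covered by a mismatch pair.
    leftover = [i for i in range(len(s)) if i not in used]
    # Step 4: roles reversed, pairs with t[i] == t[j] and s[i] != s[j].
    second = one_pass(t, s, leftover)
    return first + second


print(find_mismatch_pairs("abab", "abba"))   # [(0, 2), (1, 3)]
print(find_mismatch_pairs("abc", "xxy"))     # [(0, 1)]
print(find_mismatch_pairs("abab", "cdcd"))   # []
```

Each pair is reported with the earlier location first. With hashable symbols and expected constant-time dictionary access, both passes take time linear in the string length.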

3.3.5 Verification

The verification is performed only when the number of mismatch pairs that were found, mp, satisfies mp ≤ k. Verification consists of a procedure that finds the minimal π-mismatch over all bijections π. The technique used is similar to the bipartite matching algorithm discussed at the beginning of the section.

Let L ⊂ {1, ..., m} be the set of locations that appear in a mismatch pair. We say that a symbol a ∈ Σ_s is mismatched if there is a location i ∈ L such that s_i = a. Likewise, b ∈ Σ_{s′} is mismatched if there is a location i ∈ L such that s′_i = b. A symbol a ∈ Σ_s, or b ∈ Σ_{s′}, is free if it is not mismatched.

Construct a bipartite graph B = (U ∪ V, E) defined as follows. U contains all mismatched symbols a ∈ Σ_s and V contains all mismatched symbols b ∈ Σ_{s′}. Moreover, U contains all free symbols a ∈ Σ_s for which there is a location i ∈ {1, ..., m} − L such that s_i = a and s′_i is a mismatched symbol. Likewise, V contains all free symbols b ∈ Σ_{s′} for which there is a location i ∈ {1, ..., m} − L such that s′_i = b and s_i is a mismatched symbol. The edges, as before, have weights that correspond to the number of locations where their endpoints are aligned with each other.

Lemma 9. The bipartite graph B is of size O(k). Moreover, given the mismatch pairs, it can be constructed in O(k) time.

Proof: This lemma follows by observing that for any a ∈ Σ_s all appearances of a in locations of {1, ..., m} − L are aligned with one given symbol b ∈ Σ_{s′}, by the definition of mismatch pairs. Symmetrically, the same is true for any b ∈ Σ_{s′}. On the other hand, L is of size O(k).

While the lemma states that the bipartite graph is of size O(k), the edge weights may still be substantially larger. However, if there exists an edge e = (a, b) with weight > k then it must be that π(a) = b, for otherwise we immediately have > k mismatches. Thus, before computing the maximum matching, we drop every edge with weight > k, along with its vertices. What remains is a bipartite graph with edge weights between 1 and k.

Theorem 10. Given two equal-length strings s and s′ with mp ≤ k mismatch pairs, it is possible to verify whether there is a parameterized k-match between s and s′ in O(k^{1.5}) time.

Proof: Since the size of the bipartite graph B is O(k) by Lemma 9, the maximum weighted bipartite matching can be computed in O(k^{1.5}) time by the algorithm of [23].

We need to show that its solution solves our problem. The matching M clearly yields a one-to-one function π : Σ_s → Σ_{s′}, since at most one edge is chosen per character of s and of s′. Since the matching has maximum weight, it is the optimal function, and every location whose aligned pair of characters is not an edge of the matching is an error. If the sum of the edge weights in E − M is more than k then there is no parameterized k-match between s and s′.

Time Complexity: Given two equal-length strings s and s′, it is possible to determine whether s parameterize k-matches s′ in O(m + k^{1.5}) time: O(m) to find the mismatch pairs and O(k^{1.5}) to check these pairs with the appropriate bipartite matching.
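For intuition about what the verification computes, the sketch below finds the minimal π-mismatch by exhaustive search over all one-to-one assignments of symbols, using the alignment counts as edge weights. It deliberately swaps the O(k^{1.5}) matching algorithm of [23] for brute force and works on the full alignment graph rather than the reduced graph B, so it is only usable on tiny alphabets. The names `min_pi_mismatch` and `parameterized_k_match` are hypothetical.

```python
from collections import Counter
from itertools import permutations


def min_pi_mismatch(s, t):
    """Minimal number of mismatched locations over all one-to-one symbol maps."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    # weight[(a, b)] = number of locations where a in s is aligned with b in t.
    weight = Counter(zip(s, t))
    left = sorted(set(s))
    # None is a placeholder image: the symbol is mapped to nothing that occurs
    # in t, so all of its locations count as mismatches.
    right = sorted(set(t)) + [None] * len(left)
    best = 0
    for image in permutations(right, len(left)):
        total = sum(weight[(a, b)] for a, b in zip(left, image) if b is not None)
        best = max(best, total)
    # Locations not covered by the best assignment are the errors.
    return len(s) - best


def parameterized_k_match(s, t, k):
    """True if some one-to-one map leaves at most k mismatched locations."""
    return min_pi_mismatch(s, t) <= k


print(min_pi_mismatch("abab", "abba"))            # 2
print(min_pi_mismatch("abc", "xxy"))              # 1
print(parameterized_k_match("abab", "abba", 1))   # False
```

The value returned is m minus the weight of a maximum weight matching, which is the quantity the verification stage compares against k.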

3.4 The Algorithm

We are now ready to introduce our algorithm for the problem of parameterized pattern matching with k mismatches:

• Input: Two strings, t = t1, t2, ..., tn and p = p1, p2, ..., pm, and an integer k.

• Output: All locations i in t where p parameterized k-matches ti, . . . , ti+m−1.

Our algorithm has two phases, the pattern preprocessing phase and the text scanning phase. In this section we present the text scanning phase and will assume that the pattern preprocessing phase is given. In the following section we describe an efficient method to preprocess the pattern for the needs of the text scanning phase.

Definition 7. Let s and s′ be two equal-length strings. Let L be a collection of disjoint mismatch pairs between s and s′. If L is maximal and |L| ≤ k then L is said to be a k-good witness for s and s′. If |L| > k then L is said to be a k-bad witness for s and s′.

The text scanning phase has two stages: (a) the filter stage and (b) the verification stage. In the filter stage, for each text location i we find either a k-good witness or a k-bad witness for p and t_i, ..., t_{i+m−1}. Obviously, by Corollary 1, if there is a k-bad witness then p cannot parameterize k-match t_i, ..., t_{i+m−1}. Hence, after the filter stage, it remains to verify, for those locations i which have a k-good witness, whether p parameterize k-matches t_i, ..., t_{i+m−1}. The verification stage is identical to the verification procedure in Section 3.3.5 and hence we will not dwell on this stage.



3.4.1 The Filter Stage

The underlying idea of the filter stage is similar to [26]. Like the KMP algorithm [24], one location after another is evaluated utilizing the knowledge accumulated at the previous locations combined with the information gathered in the pattern preprocessing stage. The information of the pattern preprocessing stage is:

Output of pattern preprocessing: For each location i of p, a maximal disjoint collection of mismatch pairs L between some pattern prefix p_1, ..., p_{j−i+1} and p_i, ..., p_j such that either |L| = 3k + 3 or j = m and |L| ≤ 3k + 3.

As the first step of the algorithm we consider the first text location, i.e. when the text t_1, ..., t_m is aligned with p_1, ..., p_m. Using the method in Section 3.3.4 we can find all the mismatch pairs in O(m) time, and hence find a k-good or k-bad witness for the first text location. When evaluating subsequent locations i we maintain the following invariant: for all locations l < i we have either a k-good or a k-bad witness for location l.

It is important to observe that, when evaluating location i of the text, if we discover a maximal disjoint collection of mismatch pairs of size k + 1, i.e. a k-bad witness, between a prefix p_1, ..., p_j of the pattern and t_i, ..., t_{i+j−1}, we stop our search, since this immediately implies that p_1, ..., p_m and t_i, ..., t_{i+m−1} have a k-bad witness. We say that i + j is a stop location for i. If we have a k-good witness at location i then i + m is its stop location. When evaluating location i of the text, each of the previous locations has a stop location. The location ℓ which has a maximal stop location, over all locations l < i, is called a maximal stopper for i.

The reason for working with the maximal stopper for i is that the text beyond its stop location has not yet been scanned. Let ℓ′ denote the stop location of the maximal stopper. If we can show that the time to compute a k-bad witness or find a maximal disjoint collection of mismatch pairs for location i up until ℓ′, i.e. for p_1, ..., p_{ℓ′−i+1} with t_i, ..., t_{ℓ′}, is O(k), then we will spend overall O(n + nk) time: O(nk) to compute up until the maximal stopper for each location and O(n) for scanning and updating the current maximal collection of mismatch pairs.

3.4.2 Using Precomputed Information

The situation now is that we are evaluating a location i of t. Let ℓ be the maximal stopper for i. We utilize two pieces of precomputed information: (1) the pattern preprocessing for location i − ℓ + 1 of p and (2) the k-good witness or k-bad witness (of size k + 1) for location ℓ of t. Let ℓ′ be the stop location of ℓ. We would like to evaluate the overlap of p_1, ..., p_{ℓ′−i+1} with t_i, ..., t_{ℓ′} and to find a maximal disjoint collection L of mismatch pairs on this overlap, or, if it exists, a maximal disjoint collection L of mismatch pairs of size k + 1 on a prefix of the overlap.

There are two cases. One possibility is that the pattern preprocessing returns 3k + 3 mismatch pairs between a prefix p_1, ..., p_j and p_{i−ℓ+1}, ..., p_{i−ℓ+j}, where j ≤ ℓ′ − i + 1. Yet there are at most k + 1 mismatch pairs between p_1, ..., p_{ℓ′−ℓ+1} and t_ℓ, ..., t_{ℓ′}.

Lemma 11. Let s, s′ and s′′ be three equal-length strings such that there is a maximal disjoint collection of mismatch pairs L_{s,s′} between s and s′ of size ≤ M and a maximal disjoint collection of mismatch pairs L_{s′,s′′} between s′ and s′′ of size ≥ 3M. Then there must be a maximal disjoint collection of mismatch pairs L_{s,s′′} between s and s′′ of size ≥ M.

Proof. Since each mismatch pair is composed of 2 locations, L_{s,s′} covers at most 2M locations, and each of these locations lies in at most one pair of L_{s′,s′′} because that collection is disjoint. Hence there must be at least 3M − 2M = M mismatch pairs between s′ and s′′ whose locations do not participate in mismatch pairs between s and s′. For such a pair (i, j), maximality of L_{s,s′} implies that (i, j) is not a mismatch pair between s and s′, i.e. s_i = s_j exactly when s′_i = s′_j. It follows from the definition of mismatch pairs that these M mismatch pairs are also mismatch pairs between s and s′′.

Set M to be k + 1 and one can use the lemma for the case at hand; namely, there must be at least k + 1 mismatch pairs between p_1, ..., p_{ℓ′−i+1} and t_i, ..., t_{ℓ′}, which defines a k-bad witness.

The second case is where the pattern preprocessing returns fewer than 3k + 3 mismatch pairs, or it returns 3k + 3 mismatch pairs but does so for a prefix p_1, ..., p_j and p_{i−ℓ+1}, ..., p_{i−ℓ+j} where j > ℓ′ − i + 1. However, we still have an upper bound of k + 1 mismatch pairs between p_1, ..., p_{ℓ′−ℓ+1} and t_ℓ, ..., t_{ℓ′}. So, we can utilize the fact that:

Lemma 12. Given three strings s, s′, s′′ such that there are maximal disjoint collections of mismatch pairs L_{s,s′} and L_{s′,s′′} of sizes O(k), one can find a k-good or k-bad witness for s and s′′ in O(k) time.

Proof. Let L̄_{s,s′} be the set of indices that are not in any mismatch pair of L_{s,s′}. By Lemma 8 there exists a bijection π : Σ_s → Σ_{s′} such that π defines a parameterized match on L̄_{s,s′}. Likewise, if L̄_{s′,s′′} is the set of indices that are not in any mismatch pair of L_{s′,s′′}, then there is a bijection π′ : Σ_{s′} → Σ_{s′′} that defines a parameterized match on L̄_{s′,s′′}.

The set of indices that participate in no mismatch pair of either collection is L̄_{s,s′} ∩ L̄_{s′,s′′}. On this set the composition π′ ∘ π defines a parameterized match. Moreover, the number of locations participating in mismatch pairs is bounded by O(k). Using linked lists for each character, one can now construct the at most O(k) mismatch pairs in O(k) time.

3.4.3 Putting it Together

The filter stage takes O(nk) time and the verification stage takes O(nk^{1.5}) time. Hence,

Theorem 13. Given the preprocessing stage, we can announce for each location i of t whether p parameterize k-matches t_i, ..., t_{i+m−1} in O(nk^{1.5}) time.


3.5 Pattern Preprocessing

In this section we solve the pattern preprocessing necessary for the general algorithm.

• Input: A pattern p = p_1, ..., p_m and an integer k.

• Output: For each location i, 1 ≤ i ≤ m, a maximal disjoint collection of mismatch pairs L between the minimal pattern prefix p_1, ..., p_{j−i+1} and p_i, ..., p_j such that either |L| = 3k + 3 or j = m and |L| ≤ 3k + 3.

The naive method takes O(m^2) time, by applying the method from Section 3.3.4 to each alignment. However, this does not exploit any previously computed information in a new iteration. Since every alignment combines two previous alignments, we can get a better result. Assume, w.l.o.g., that m is a power of 3. We divide the set of alignments into log_3 m + 1 sequences as follows:

R_1 = [2, 3], R_2 = [4, 5, 6, 7, 8, 9], ..., R_i = [3^{i−1} + 1, ..., 3^i], ..., R_{log_3 m} = [3^{log_3 m − 1} + 1, ..., 3^{log_3 m}], for 1 ≤ i ≤ log_3 m.

At each step, we compute the desired mismatches for each set of alignments R_i. Step i uses the information computed in steps 1, ..., i − 1. We further divide each set into two halves, and compute the first half followed by the second half. This is possible since each R_i contains an even number of elements, as 3^i − 3^{i−1} = 2 · 3^{i−1}. We split each sequence R_i into two equal-length sequences R^1_i = [3^{i−1} + 1, ..., 2 · 3^{i−1}] and R^2_i = [2 · 3^{i−1} + 1, ..., 3^i]. For each of the new sequences we have that:

Lemma 14. If r_1, r_2 ∈ R^1_i (the first half of set R_i) or r_1, r_2 ∈ R^2_i (the second half of R_i) such that r_1 < r_2, then r_2 − r_1 ∈ R_j for some j < i.

Proof: We assume that r_1, r_2 ∈ R^1_i (the proof for the other case is symmetric). Obviously r_2 ≤ 2 · 3^{i−1} and r_1 ≥ 3^{i−1} + 1, so r_2 − r_1 ≤ 2 · 3^{i−1} − 3^{i−1} − 1 = 3^{i−1} − 1.
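The grouping of alignments and the bound used in the proof of Lemma 14 can be checked numerically with a short Python sketch. The helper names `alignment_groups`, `halves` and `max_gap_within_halves` are hypothetical, and m is assumed to be a power of 3 as in the text.

```python
def alignment_groups(m):
    """Split the alignments 2..m into groups R_i = [3^(i-1) + 1, ..., 3^i]."""
    groups = []
    low = 1
    while low < m:
        high = low * 3
        groups.append(list(range(low + 1, high + 1)))
        low = high
    return groups


def halves(group):
    """Split R_i into its first half R^1_i and second half R^2_i."""
    mid = len(group) // 2
    return group[:mid], group[mid:]


def max_gap_within_halves(m):
    """For each group R_i, the largest r2 - r1 with r1, r2 in the same half."""
    gaps = []
    for group in alignment_groups(m):
        first, second = halves(group)
        gaps.append(max(first[-1] - first[0], second[-1] - second[0]))
    return gaps


print(alignment_groups(9))          # [[2, 3], [4, 5, 6, 7, 8, 9]]
print(halves([4, 5, 6, 7, 8, 9]))   # ([4, 5, 6], [7, 8, 9])
print(max_gap_within_halves(27))    # [0, 2, 8]
```

For m = 27 the largest gaps are 0, 2 and 8, which equal 3^{i−1} − 1 for i = 1, 2, 3, in agreement with the bound in the proof.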

This gives a handle on how to compute our desired output. Denote by f_i = min{j ∈ R^1_i} and m_i = min{j ∈ R^2_i} the representatives of their sequences. We compute the following two stages for each group R_i, in the order R_1, ..., R_{log_3 m}:

1. Compute a maximal disjoint collection of mismatch pairs L between some pattern prefix p_1, ..., p_{j+1} and p_{f_i}, ..., p_{f_i+j} such that |L| = (3k + 3) 3^{log_3 m − i} or f_i + j = m and |L| ≤ (3k + 3) 3^{log_3 m − i}. Do the same with p_{m_i}, ..., p_{m_i+j}.

2. Now for each j ∈ R^1_i apply the algorithm for the text described in the previous section on the pattern p, the pattern shifted by f_i (the representative of R^1_i) and the pattern shifted by j. Do the same with m_i for R^2_i.

The central idea behind the choice of the number of mismatch pairs that we seek is to satisfy Lemma 11. It can be verified that our choice of sizes indeed always guarantees that there are 3 times as many mismatch pairs as in the previous iteration. Therefore,

Theorem 15. Given a pattern p of length m, it is possible to precompute its 3k + 3 mismatch pairs at each alignment in O(km log_3 m) time.



The time complexity for each group R_i, which has O(3^{i−1}) members, is 3^{log_3 m − i + 1}(3k + 3) per member for finding the mismatches, multiplied by O(3^{i−1}) members, which is O(mk) for a group, plus O(m) for the first and the middle elements. Altogether it is O(km log_3 m).
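Written out, the count above amounts to the following short calculation. It is only a restatement of the bound in the text, using the per-member cost 3^{log_3 m − i + 1}(3k + 3) exactly as given there.

```latex
\[
  \underbrace{3^{\log_3 m - i + 1}(3k+3)}_{\text{cost per member of } R_i}
  \cdot \underbrace{O(3^{i-1})}_{\text{members of } R_i}
  = O\!\left(3^{\log_3 m}(3k+3)\right) = O(mk),
\]
\[
  \sum_{i=1}^{\log_3 m} \bigl( O(mk) + O(m) \bigr) = O(km \log_3 m).
\]
```

The factor 3^{i−1} from the group size cancels the factor 3^{−(i−1)} in the per-member cost, which is why every group costs the same O(mk).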

Corollary 2. Given a pattern p and text t, we can solve the parameterized matching with k mismatches problem in O(nk^{1.5} + km log m) time.

3.6 2-D Parameterized Matching with k Mismatches

Two-dimensional parameterized matching has applications in image search with variable colors and errors. In this section we present an extension of our algorithm to 2-dimensional approximate parameterized matching.

• Input: Two images, text T [1...n][1...n], and pattern P [1...m][1...m], and an integer k.

• Output: all locations (i, j) in T such that P parameterized k-matches the m × m image whose

upper left corner is T [i, j].

We begin with the assumption that the text has exactly m rows, and then multiply the resulting time complexity by n. Both the text and the pattern are linearized column-by-column. Our string matching algorithm of Section 3.4 can be applied to the linearized pattern and text. Care must be taken that only those locations whose index is divisible by m will be considered. This can clearly be accomplished in O(n) time. Thus, the time complexity for each m rows of the text is O(nmk^{1.5}).
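The linearization step can be illustrated with a small Python sketch. It assumes 0-based offsets, under which the column starts are exactly the offsets divisible by m; the helper names `linearize_columns` and `candidate_starts` are made up for this example.

```python
def linearize_columns(rows):
    """Concatenate the columns of a rectangular array, left to right."""
    n_rows, n_cols = len(rows), len(rows[0])
    return [rows[r][c] for c in range(n_cols) for r in range(n_rows)]


def candidate_starts(n_cols, m):
    """0-based offsets in a linearized m-row strip where an m x m window starts."""
    # A window starting at column c occupies offsets c*m .. c*m + m*m - 1.
    return [c * m for c in range(n_cols - m + 1)]


strip = ["abc",
         "def"]                     # m = 2 rows, 3 columns
flat = linearize_columns(strip)
print(flat)                         # ['a', 'd', 'b', 'e', 'c', 'f']
print(candidate_starts(3, 2))       # [0, 2]
print(flat[2:2 + 2 * 2])            # ['b', 'e', 'c', 'f']
```

The last line shows that the window starting at offset 2 covers columns 2 and 3 of the strip, i.e. one m × m candidate position.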

Theorem 16. Given an m × m image P and an n × n image T, we can solve the parameterized matching with k mismatches problem in O(n^2 m k^{1.5} + m^2 k log m) time.


References

[1] N. Alon and M. Naor. Derandomization, witnesses for boolean matrix multiplication and construction of perfect hash functions. Algorithmica, 16:434–449, 1996.

[2] A. Amir. Personal communication.

[3] A. Amir, Y. Aumann, R. Cole, M. Lewenstein, and E. Porat. Function matching: algorithms, applications and a lower bound. In Proc. of the 30th International Colloquium on Automata, Languages and Programming (ICALP), pages 929–942, 2003.

[4] A. Amir, G. Benson, and M. Farach. An alphabet independent approach to two dimensional pattern matching. SIAM J. on Computing, 23(2):313–323, 1994.

[5] A. Amir, K. W. Church, and E. Dar. Separable attributes: a technique for solving the sub matrices character count problem. In Proc. 13th Symposium on Discrete Algorithms (SODA), pages 400–401, 2002.

[6] A. Amir, M. Farach, and S. Muthukrishnan. Alphabet dependence in parameterized matching. Information Processing Letters, 49:111–115, 1994.

[7] A. Amir, M. Lewenstein, and E. Porat. Faster algorithms for string matching with k mismatches. In Proc. 11th Symposium on Discrete Algorithms (SODA), pages 794–803, 2000.

[8] A. Apostolico, P. Erdos, and M. Lewenstein. Parameterized matching with mismatches. To appear in J. of Discrete Algorithms.

[9] G.P. Babu, B.M. Mehtre, and M.S. Kankanhalli. Color indexing for efficient image retrieval. Multimedia Tools and Applications, 1(4):327–348, 1995.

[10] B. S. Baker. A theory of parameterized pattern matching: algorithms and applications. In Proc. 25th ACM Symposium on the Theory of Computation (STOC), pages 71–80, 1993.

[11] B. S. Baker. Parameterized string pattern matching. J. of Computer and System Sciences, 52(1):28–42, 1996.

[12] B. S. Baker. Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM J. on Computing, 26(5):1343–1362, 1997.

[13] B. S. Baker. Parameterized diff. In Proc. 10th Symposium on Discrete Algorithms (SODA), pages 854–855, 1999.

[14] R.S. Boyer and J.S. Moore. A fast string searching algorithm. Comm. ACM, 20:762–772, 1977.

















[15] R. Cole and R. Hariharan. Faster suffix tree construction with missing suffix links. In Proc. 32nd ACM Symposium on Theory of Computing (STOC), pages 407–415, 2000.

[16] M.L. Fredman and R.E. Tarjan. Fibonacci heaps and their use in improved network optimizations algorithms. J. of the ACM, 34(3):596–615, 1987.

[17] H. N. Gabow. Scaling algorithms for network problems. J. of Computer and System Sciences, 31:148–168, 1985.

[18] H. N. Gabow, J. L. Bentley, and R. E. Tarjan. Scaling and related techniques for geometry problems. Proc. 16th ACM Symposium on Theory of Computing (STOC), 67:135–143, 1984.

[19] H. N. Gabow and R.E. Tarjan. Faster scaling algorithms for network problems. SIAM J. on Computing, 18:1013–1036, 1989.

[20] Z. Galil and R. Giancarlo. Improved string matching with k mismatches. SIGACT News, 17(4):52–54, 1986.

[21] P. Gupta, R. Janardan, and M. Smid. Further results on generalized intersection searching problems: Counting, reporting, and dynamization. J. of Algorithms, 19(2):282–317, 1995.

[22] D. Harel and R.E. Tarjan. Fast algorithms for finding nearest common ancestor. J. of Computer and System Sciences, 13:338–355, 1984.

[23] M.Y. Kao, T.W. Lam, W.K. Sung, and H.F. Ting. A decomposition theorem for maximum weight bipartite matching. SIAM J. on Computing, 31(1):18–26, 2001.

[24] D.E. Knuth, J.H. Morris, and V.R. Pratt. Fast pattern matching in strings. SIAM J. on Computing, 6:323–350, 1977.

[25] S. R. Kosaraju. Faster algorithms for the construction of parameterized suffix trees. Proc. 36th IEEE FOCS, pages 631–637, 1995.

[26] G.M. Landau and U. Vishkin. Introducing efficient parallelism into approximate string matching. Proc. 18th ACM Symposium on Theory of Computing (STOC), pages 220–230, 1986.

[27] G.M. Landau and U. Vishkin. Fast string matching with k differences. J. of Computer and System Sciences, 37(1):63–78, 1988.

[28] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys. Dokl., 10:707–710, 1966.














[29] B. Schieber and U. Vishkin. On finding lowest common ancestors: Simplification and parallelization. SIAM J. on Computing, 17:1253–1262, 1988.

[30] M. Swain and D. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11–32, 1991.

[31] U. Vishkin. Optimal parallel pattern matching in strings. Proc. 12th ICALP, pages 91–113, 1985.

[32] U. Vishkin. Deterministic sampling - a new technique for fast pattern matching. SIAM J. on Computing, 20:303–314, 1991.
