overlap matching

Overlap Matching

By Itamar Nabriski

A. Amir, R. Cole, G. Landau, R. Hariharan, M. Lewenstein, E. Porat, Overlap Matching, Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms (2001) 279-288.

Upload: fitzgerald-benjamin

Post on 30-Dec-2015


Page 1: Overlap Matching

Overlap Matching

By Itamar Nabriski

A. Amir, R. Cole, G. Landau, R. Hariharan, M. Lewenstein, E. Porat, Overlap Matching, Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms (2001) 279-288.

Page 2: Overlap Matching

Lecture Structure

Discrete Convolutions
Overlap Matching Problem Definition
Overlap Matching Algorithm
Reduction from Swap Matching

Page 3: Overlap Matching

Discrete Convolutions

Page 4: Overlap Matching

Definition 1

Let T be a function whose domain is {0,…,n-1}.

Let P be a function whose domain is {0,…,m-1}.

(We'll view them as arrays of numbers of length n and m respectively.)

The convolution of T and P at index j is defined as follows:

$$(T \times P)[j] = \sum_{i=0}^{m-1} T[j+i]\,P[i], \qquad j \in \{0,\dots,n-m\}$$
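As a rough illustration of Definition 1, the sum can be evaluated directly from the formula. This is a minimal sketch; the function names `convolution_at` and `all_convolutions_naive` are illustrative and do not come from the slides.

```python
def convolution_at(T, P, j):
    """Return (T x P)[j], the sum of T[j+i] * P[i] for i in 0..m-1."""
    return sum(T[j + i] * P[i] for i in range(len(P)))


def all_convolutions_naive(T, P):
    """Evaluate the convolution at every valid index j in {0, ..., n-m}."""
    n, m = len(T), len(P)
    return [convolution_at(T, P, j) for j in range(n - m + 1)]


T = [1, 0, 1, 0, 0, 1, 1, 1]
P = [0, 1, 1, 0]
print(all_convolutions_naive(T, P))  # [1, 1, 0, 1, 2]
```

Each index costs O(m) here, so the whole list costs O(nm).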

Page 5: Overlap Matching

Exempli Gratia

T = Ronaldinho (n = 10)

P = Deco (m = 4)

(T×P)[2] = n·D + a·e + l·c + d·o

(assume, for now, that each letter represents a number)

Thus the naive computation time for a single index is O(m).

Page 6: Overlap Matching

Computing All The Convolutions

Since j ∈ {0,…,n-m}, the number of possible convolutions is O(n).

Naive approach: for each j pay O(m) time, for a total of O(nm).

Devious approach: using the Fast Fourier Transform (FFT), all indexes are computed together at an average cost of O(log m) per index j, for a total of O(n log m)

(machine word size must be O(log m) though).
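The FFT route can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions: inputs are integers (results are rounded), the names `fft` and `all_convolutions_fft` are hypothetical, and the code transforms the whole text at once, which costs O(n log n). The O(n log m) bound is usually obtained by the standard refinement of cutting T into overlapping blocks of length about 2m and convolving each block separately, which is not shown here.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def all_convolutions_fft(T, P):
    """Return [(T x P)[j] for j in 0..n-m] using one polynomial product."""
    n, m = len(T), len(P)
    size = 1
    while size < n + m:
        size *= 2
    # Reversing P turns the sliding dot product into a polynomial product.
    fa = fft([complex(x) for x in T] + [0j] * (size - n))
    fb = fft([complex(x) for x in reversed(P)] + [0j] * (size - m))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    # Coefficient j + m - 1 of the product equals sum_i T[j+i] * P[i].
    return [round(prod[j + m - 1].real / size) for j in range(n - m + 1)]


print(all_convolutions_fft([1, 0, 1, 0, 0, 1, 1, 1], [0, 1, 1, 0]))
```

For large inputs a library FFT would be the practical choice; the recursion above is only meant to make the idea concrete.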

Page 7: Overlap Matching

Using Convolutions

Preprocessing: before we perform convolutions on T and P, we preprocess each letter using a constant number of constant-time functions (O(n) in total), retaining the running time of O(n log m).

Using several convolutions: for each index j we can perform a constant number of convolutions, retaining the running time of O(n log m).

Postprocessing: for each index j we can use a constant-time function f to combine the outputs of the constant number of convolutions, retaining the running time of O(n log m).

Page 8: Overlap Matching

Using Convolutions - Example

Testing for exact matching at index j

∑ = {a,b}

T = ababbaaa

P = abba

Thus, for example, T and P exactly match at index j = 2:

T = ababbaaa
P =   abba

Page 9: Overlap Matching

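Using Convolutions - Example

Testing for exact matching at index j

Preprocessing functions (x is a letter):

$$\mathrm{prep}_a(x) = \begin{cases} 1 & \text{if } x = a \\ 0 & \text{otherwise} \end{cases}$$

$$\overline{\mathrm{prep}}_a(x) = \begin{cases} 1 & \text{if } x \neq a \\ 0 & \text{otherwise} \end{cases}$$

(and likewise $\mathrm{prep}_b$ and $\overline{\mathrm{prep}}_b$ for the letter b)

When we write prep(S), S being a string, we mean we apply prep to all characters of S.

Page 10: Overlap Matching

Using Convolutions - Example

Testing for exact matching at index j

Convolutions we will use:

$$(\mathrm{prep}_a(T) \times \overline{\mathrm{prep}}_a(P))[j]$$

$$(\mathrm{prep}_b(T) \times \overline{\mathrm{prep}}_b(P))[j]$$

Postprocessing function f:

$$f(x,y) = \begin{cases} 1 & \text{if } x = y = 0 \\ 0 & \text{otherwise} \end{cases}$$

For every index j,

$$f\big((\mathrm{prep}_a(T) \times \overline{\mathrm{prep}}_a(P))[j],\ (\mathrm{prep}_b(T) \times \overline{\mathrm{prep}}_b(P))[j]\big) = 1$$

iff there is an exact match of P and T at index j.

Page 11: Overlap Matching

Using Convolutions - Example

Testing for exact matching at index j

T = ababbaaa

$\mathrm{prep}_a(T)$ = 10100111

$\mathrm{prep}_b(T)$ = 01011000

P = abba

$\overline{\mathrm{prep}}_a(P)$ = 0110

$\overline{\mathrm{prep}}_b(P)$ = 1001

At j = 2:

10100111
  0110

$(\mathrm{prep}_a(T) \times \overline{\mathrm{prep}}_a(P))[2]$ = 1*0+0*1+0*1+1*0 = 0

01011000
  1001

$(\mathrm{prep}_b(T) \times \overline{\mathrm{prep}}_b(P))[2]$ = 0*1+1*0+1*0+0*1 = 0

f(0,0) = 1, so there is an exact match at index 2.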
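The exact-matching test can be sketched end to end as below. This sketch assumes the reading suggested by the worked numbers: the text is preprocessed with an indicator for a letter, and the pattern with the complement of that indicator, so each convolution counts positions where T has the letter and P does not. The helper names (`prep`, `prep_not`, `convolve`, `exact_matches`) are illustrative, and the convolution is computed naively to keep the example self-contained.

```python
def prep(letter, s):
    """1 where the character equals `letter`, 0 elsewhere."""
    return [1 if x == letter else 0 for x in s]


def prep_not(letter, s):
    """1 where the character differs from `letter`, 0 elsewhere."""
    return [1 if x != letter else 0 for x in s]


def convolve(T, P):
    """Naive (T x P)[j] for every j in 0..n-m."""
    n, m = len(T), len(P)
    return [sum(T[j + i] * P[i] for i in range(m)) for j in range(n - m + 1)]


def f(x, y):
    """Postprocessing function: 1 iff both convolution values are zero."""
    return 1 if x == 0 and y == 0 else 0


def exact_matches(text, pattern):
    """Indexes j where pattern matches text exactly, for the alphabet {a, b}."""
    conv_a = convolve(prep("a", text), prep_not("a", pattern))
    conv_b = convolve(prep("b", text), prep_not("b", pattern))
    return [j for j, (x, y) in enumerate(zip(conv_a, conv_b)) if f(x, y) == 1]


print(exact_matches("ababbaaa", "abba"))  # [2]
```

Swapping `convolve` for an FFT-based routine would give the O(n log m) bound discussed earlier without changing the pre- and postprocessing.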

Page 12: Overlap Matching

Overlap Matching Problem Definition

Page 13: Overlap Matching

Structural String

Linear structure made of segments.

Each segment can be marked or unmarked.

A Structural String is a concatenation of segments.

Exempli gratia:

The_Walls_Of_Jericho (indexes 0 to 19)

Thus, the actual characters are not important.

Page 14: Overlap Matching

Definition of Overlap Matching

Input

Structural String T (text) of length n

Structural String P (pattern) of length m

m ≤ n

Both T and P have some marked segments

Output

All locations k in T where, if P is aligned at k, every overlap between a marked segment of T and a marked segment of P has even length
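The definition can be checked by brute force, which is useful as a reference when testing faster methods. In this sketch the marked segments are represented as inclusive (start, end) index pairs, an overlap of length zero counts as even, and the function names are illustrative; none of these choices come from the slides.

```python
def overlap_length(seg_a, seg_b):
    """Number of indexes shared by two inclusive (start, end) segments."""
    lo = max(seg_a[0], seg_b[0])
    hi = min(seg_a[1], seg_b[1])
    return max(0, hi - lo + 1)


def overlap_matches_bruteforce(n, text_segments, m, pattern_segments):
    """All k in 0..n-m where every marked overlap has even length."""
    result = []
    for k in range(n - m + 1):
        shifted = [(s + k, e + k) for s, e in pattern_segments]
        if all(overlap_length(t, p) % 2 == 0
               for t in text_segments for p in shifted):
            result.append(k)
    return result


# Text of length 6 with one marked segment at indexes 1-2,
# pattern of length 3 with one marked segment at indexes 0-1.
print(overlap_matches_bruteforce(6, [(1, 2)], 3, [(0, 1)]))  # [1, 3]
```

This takes time proportional to the number of segment pairs at every alignment, so it is only a specification-level check, not the algorithm of the lecture.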

Page 15: Overlap Matching

Example

T = Franz_Beckenbauer

P = The_Kaiser

P aligned to T at j = 3:

Franz_Beckenbauer
   The_Kaiser

Overlaps: j = 3 is not a valid overlap match, because it has an odd overlap.

Page 16: Overlap Matching

Overlap Matching Algorithm

Page 17: Overlap Matching

General Preprocessing

Each segment can start at either an even or an odd index and end at either an even or an odd index.

We will produce from T four new strings Too, Tee, Toe and Teo.

Too will have 1's in the place of all characters belonging to segments that start and end at an odd index and 0's otherwise. For example, for a segment spanning indexes 3 to 7:

Too = 0001111100

And analogously for the other segment types …

Page 18: Overlap Matching

General Preprocessing

Since the pattern P tends to move around, we will need to treat its segment indexes a bit differently.

We will produce from P eight new strings POoo, POee, POoe, POeo, PEoo, PEee, PEoe and PEeo.

The big 'O' in POee means: all the segments in P that start and end at an even location relative to T's index, when P is aligned to an odd index of T.

And analogously for the other segment types …

(don't worry, there is an example in the next slide …)

Page 19: Overlap Matching

General Preprocessing

T   = 012345678 (indexes)
Too = 000000000
Tee = 000010000
Teo = 000000110
Toe = 011000000

Page 20: Overlap Matching

General Preprocessing

P = 0123456 (indexes)

Since P is always aligned to T at some index j, we treat P's indexes relative to T, thus:

P = j+0, j+1, j+2, j+3, j+4, j+5, j+6

Assume j is now odd; then for that location we will use the four PO's:

POoo = 0000000
POee = 0001110
POeo = 0000000
POoe = 1100000
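The general preprocessing can be sketched as a single helper that sorts marked segments by the parity of their start and end indexes. The segment positions used below (T: 1-2, 4, 6-7; P: 0-1, 3-5) are inferred from the 0/1 strings on the slides, and the function name and the `offset` parameter (standing in for the parity of the alignment index j) are assumptions made for this illustration.

```python
def split_by_parity(length, segments, offset=0):
    """Build the oo/ee/oe/eo indicator strings for inclusive segments.

    `offset` is added to every index before taking its parity, so
    offset=1 gives the pattern strings used when j is odd (the PO's)
    and offset=0 gives the text strings or the PE's.
    """
    names = {0: "e", 1: "o"}
    out = {kind: [0] * length for kind in ("oo", "ee", "oe", "eo")}
    for start, end in segments:
        kind = names[(start + offset) % 2] + names[(end + offset) % 2]
        for i in range(start, end + 1):
            out[kind][i] = 1
    return {kind: "".join(map(str, bits)) for kind, bits in out.items()}


text_strings = split_by_parity(9, [(1, 2), (4, 4), (6, 7)])
print(text_strings["oe"], text_strings["ee"], text_strings["eo"])

pattern_odd_j = split_by_parity(7, [(0, 1), (3, 5)], offset=1)
print(pattern_odd_j["oe"], pattern_odd_j["ee"])
```

Only the parity of j matters, so both pattern variants can be built once, in O(m) time, before any convolution is run.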

Page 21: Overlap Matching

General Preprocessing

Thus for every location j we have 16 (a constant number of) possible Text-Pattern pairings:

{Too,Tee,Toe,Teo} × {PXoo,PXee,PXoe,PXeo}, where X = parity(j)

If we can determine, using convolutions, for each pairing whether it contains only even overlaps, we can solve the Overlap Matching problem in O(n log m) time.

Page 22: Overlap Matching

Case 1

Case 1 occurs when for Tab, Pcd either a=c or b=d.

This covers 12 of the 16 cases.

We now show a solution for when a=c. This covers 8 cases. We use the same solution on the reversed* strings of T and P to solve the 4 remaining cases (reversal exchanges the roles of segment starts and ends, so the condition b=d turns into a=c).

* Computing the reversed strings does not alter the run time (do it during general preprocessing).

Page 23: Overlap Matching

Case 1 (a=c)

For every two marked segments St in Tab starting at index x and Sp in Pcd starting at index y:

|x-y| is always even (since even - even = even and odd - odd = even).

We now create a convolution that will return 0 for index j iff there is no odd overlap at j.

Page 24: Overlap Matching

Case 1 (a=c)

For every segment in Tab we replace the 1's by an alternating series of 1's and -1's beginning with 1:

0001111 becomes 0 0 0 1 -1 1 -1

In the case where we have only even (and/or no) overlaps:

T: 0 0 1 -1 1 -1 1
P: 0 1 1
overlap sum = 1 - 1 = 0

In the case where we have at least one odd overlap (T segment at indexes 3 to 6, P segment at indexes 1 to 5):

T: 0 0 0 1 -1 1 -1
P: 0 1 1 1 1 1
overlap sum = 1 - 1 + 1 = 1 > 0
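The alternating-sign trick of Case 1 can be sketched as follows. The sketch assumes inclusive (start, end) segments, evaluates the convolution at one index naively, and uses illustrative names (`alternate`, `case1_value`). Because segment starts in T and P share a parity in this case, every overlap begins at an even offset inside its T segment, so it contributes 0 if its length is even and +1 if it is odd.

```python
def alternate(length, segments):
    """Replace each marked segment by 1, -1, 1, -1, ... starting with 1."""
    out = [0] * length
    for start, end in segments:
        for k in range(end - start + 1):
            out[start + k] = 1 if k % 2 == 0 else -1
    return out


def case1_value(T_alt, P_bits, j):
    """Convolution of the alternating text with the 0/1 pattern at index j."""
    return sum(T_alt[j + i] * P_bits[i] for i in range(len(P_bits)))


# Odd overlap: T segment at 3-6, P segment at 1-5, aligned at j = 0.
print(case1_value(alternate(7, [(3, 6)]), [0, 1, 1, 1, 1, 1], 0))  # 1

# Even overlap: T segment at 2-6, P segment at 1-2, aligned at j = 1.
print(case1_value(alternate(7, [(2, 6)]), [0, 1, 1], 1))  # 0
```

Since no overlap can contribute a negative amount, the total at index j is zero exactly when no odd overlap exists there.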

Page 25: Overlap Matching

Case 2

Case 2 occurs when Toe, Peo (Teo, Poe is symmetric).

If a segment in Toe is contained in a segment in Peo or vice versa then the overlap is even, otherwise the overlap is odd.

T: 0111100111111110
P: 111100111100
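The parity claim for Case 2 can be confirmed exhaustively on small indexes. This is only a sanity check written for illustration, with hypothetical helper names; it enumerates every Toe-type segment (odd start, even end) against every Peo-type segment (even start, odd end) and compares "overlap is even" with "one segment contains the other".

```python
def segments(start_parity, end_parity, limit):
    """All inclusive segments below `limit` with the given endpoint parities."""
    return [(s, e) for s in range(limit) for e in range(s, limit)
            if s % 2 == start_parity and e % 2 == end_parity]


def overlap_length(a, b):
    """Number of indexes shared by two inclusive segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def is_containment(a, b):
    """True if one segment lies entirely inside the other."""
    return (a[0] <= b[0] and b[1] <= a[1]) or (b[0] <= a[0] and a[1] <= b[1])


def case2_claim_holds(limit=12):
    """Check: a nonempty overlap is even iff it is a containment."""
    for t in segments(1, 0, limit):      # Toe: odd start, even end
        for p in segments(0, 1, limit):  # Peo: even start, odd end
            k = overlap_length(t, p)
            if k > 0 and (k % 2 == 0) != is_containment(t, p):
                return False
    return True


print(case2_claim_holds())
```

The reason it holds: both segment types have even length, so a contained segment gives an even overlap, while a partial overlap runs between two endpoints of equal parity and therefore has odd length.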

Page 26: Overlap Matching

Case 2

Containment Elimination Property

The convolution at index j gives zero if all overlaps are containments; otherwise it gives a positive result.

To achieve this we will actually use 3 convolutions; a combination of their outputs will give us the desired answer.

Page 27: Overlap Matching

Case 2

Fleshing Out The Solution

For each segment St in Toe that starts at index st, replace the segment's 1's by st,1…1,-st.

For each segment Sp in Peo that starts at index sp, replace the segment's 1's by sp,1…1,-sp.

For example, with st = 3:

0001111 becomes 0 0 0 3 1 1 -3

Page 28: Overlap Matching

Case 2

Containment:

St:  st, 1, 1, ………………………………, 1, -st
Sp:         sp, 1, ……, 1, -sp
value: sp + (len(Sp)-2) + -sp = len(Sp)-2

St:         st, 1, ……, 1, -st
Sp:  sp, 1, 1, ………………………………, 1, -sp
value: st + (len(St)-2) + -st = len(St)-2

No Containment (overlap of length k):

St:  st, 1, 1, …………, 1, -st
Sp:            sp, 1, …………, 1, -sp
value: sp + (k-2) + -st

St:            st, 1, …………, 1, -st
Sp:  sp, 1, 1, …………, 1, -sp
value: st + (k-2) + -sp

Page 29: Overlap Matching

Case 2

Problem 1

The indexes of the pattern Peo change for each index j, raising the preprocessing time to O(m) for each convolution!

Problem 2

We need to find a way to remove "the size of the overlap - 2" from the resulting convolution.

Containment: len(Sp)-2 and len(St)-2 both become 0 once "overlap - 2" is removed.

No containment: sp + (k-2) + -st becomes sp - st > 0, and st + (k-2) + -sp becomes st - sp > 0, once "overlap - 2" is removed.

Page 30: Overlap Matching

Case 2

Solving Problem 2

Perform another convolution, the "Overlap Length Convolution", and subtract its value from the main convolution.

Every segment in both Toe and Peo is replaced by 0,1,1,…,1,0, giving us "size of overlap - 2" for each overlap. For example:

0001111 becomes 0000110

Overlap of length 4:

0000110
0111100
= 0+1+1+0 = 2 = "overlap - 2"

Page 31: Overlap Matching

Case 2

Solving Problem 1

The trouble is with the pattern Peo segments, whose indexes change with each index j. Instead, index the pattern segments relative to Peo itself ("Zero Containment Convolution").

Pattern indexes relative to T (P aligned at j = 2):

T: 0 0 0 3 1 1 -3
P: 0 0 4 1 1 -4 0

Pattern indexes relative to P:

T: 0 0 0 3 1 1 -3
P: 0 0 2 1 1 -2 0

Page 32: Overlap Matching

Case 2

Solving Problem 1

We created a new problem: overlap convolutions can be negative, and thus the overall convolution at index j can turn out to be zero when there is an odd overlap.

T: 0 0 0 0 0 1 1 7 1 1 -7 0 0
P: 0 0 2 1 1 -2 0 0 0
= 2+1-7 = -4

Page 33: Overlap Matching

Case 2

Solving Problem 1

We want to get the benefits of both worlds. Towards that end we'll add to the result a third convolution, "The Shifting Convolution". This simply corrects the problem caused by using the pattern indexes.

Every segment in T is replaced by 1,0…0,1 and every segment in P is replaced by 0,1…,1,0, and the result is multiplied by index j.

T: 0 1 0 0 1 0 0 1 0 0 1
P: 0 1 1 1 1 1 1 0 0
= 2

2 * j = 2 * 2 = 4

This replenishes our "losses".

Page 34: Overlap Matching

Case 2

Solving Problem 1

T: 0 1 0 0 1 0 0 0 0 0 0
P: 0 1 1 1 0 0 0
= 1

T: 0 1 0 0 0 0 0 0 1 0 0
P: 0 1 1 1 0 0 0
= 0

Thus, the convolution gives 0 for each containment overlap and 1 for each non-containment overlap.

Thus, multiplying by j, we return "one j" to each non-containment overlap.

Page 35: Overlap Matching

Case 2

Final Algorithm

Thus we implement the "Containment Elimination Property" by:

Zero Containment Convolution + Shifting Convolution - Overlap Length Convolution = Containment Elimination Property

Page 36: Overlap Matching


Very Powerful Technique!

Amazing! He’s a master of the “Shifting Convolution”

Page 37: Overlap Matching

Case 3

Case 3 occurs when Too, Pee (Tee, Poo is symmetric).

If a segment in Too is contained in a segment in Pee or vice versa then the overlap is odd, otherwise the overlap is even.

T: 011100011111100
P: 111000111000

Page 38: Overlap Matching

Case 3 - Using Case 2

Containment: the Case 2 convolution gives sp + (len(Sp)-2) + -sp = len(Sp)-2 when Sp is contained in St, and st + (len(St)-2) + -st = len(St)-2 when St is contained in Sp.

No Containment (overlap of length k): the Case 2 convolution gives sp + (k-2) + -st when St starts first, and st + (k-2) + -sp when Sp starts first.

Page 39: Overlap Matching

Case 3

We'll use the same convolution as in Case 2 and two additional ones:

Conv1: Replace every segment in Too of length len by 0,1,2,…,len-1. Replace Pee segments by 1,0,…,0.

0111110 becomes 0012340

Conv2 (opposite of Conv1): Replace every segment in Pee of length len by 0,1,2,…,len-1. Replace Too segments by 1,0,…,0.

01110 becomes 01000

Page 40: Overlap Matching

Case 3

The first convolution gives us the length of all areas like the one marked in green:

T: 0 0 1 2 3 4 0 0 0
P: 0 1 0 0 0
= 3

The second convolution is zero here:

T: 0 1 0 0 0 0 0
P: 0 0 1 2 0
= 0

If, for some overlap, the first convolution is positive the second will be zero, and vice versa.

For every two overlapping segments, it tells us by how much St is "ahead" of Sp.

Page 41: Overlap Matching

Case 3

The reverse case (second convolution):

T: 0 0 0 1 0 0 0 0 0
P: 0 1 2 3 4 0 0
= 3

And the first one is now zero:

T: 0 0 0 0 1 2 0 0 0
P: 1 0 0 0 0 0 0
= 0
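The two extra Case 3 convolutions can be reproduced on the slide numbers with a short sketch. It assumes inclusive (start, end) segments and naive evaluation at one index; `ramp`, `start_marks` and `conv_at` are illustrative names. Conv1 pairs a ramp on the text with start markers on the pattern, and Conv2 does the opposite.

```python
def ramp(length, segments):
    """Replace each segment of length len by 0, 1, 2, ..., len-1."""
    out = [0] * length
    for start, end in segments:
        for k in range(end - start + 1):
            out[start + k] = k
    return out


def start_marks(length, segments):
    """Replace each segment by 1, 0, ..., 0 (a marker on its first index)."""
    out = [0] * length
    for start, _ in segments:
        out[start] = 1
    return out


def conv_at(A, B, j):
    """Naive convolution of A with B at index j."""
    return sum(A[j + i] * B[i] for i in range(len(B)))


# Pattern segment 0-4 starts before text segment 3-5 (alignment j = 0).
T_segs, P_segs, n, m = [(3, 5)], [(0, 4)], 9, 7
conv1 = conv_at(ramp(n, T_segs), start_marks(m, P_segs), 0)
conv2 = conv_at(start_marks(n, T_segs), ramp(m, P_segs), 0)
print(conv1, conv2)  # 0 3
```

Whichever segment starts later has its start marker land on the other segment's ramp, so exactly one of the two values records the distance between the two starts.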

Page 42: Overlap Matching

Case 3

This is true also for containments:

T: 0 0 1 2 3 4 5 6 0
P: 0 1 0 0 0
= 3

The convolution from Case 2 gives the same value for non-containments and zero for containments.

Page 43: Overlap Matching

Case 3

Thus:

Conv1 + Conv2 - ConvCase2 = positive: there are containments, i.e. an odd overlap

Conv1 + Conv2 - ConvCase2 = 0: there are no containments, i.e. only even overlaps

Page 44: Overlap Matching

Overlap Matching

Algorithm Final Outcome

Each case (1, 2, 3) takes O(n log m):

1. A constant number of preprocessing functions: O(n)

2. A constant number of convolutions: O(n log m)

3. A constant-time computable function: O(1) per index

for a total runtime of O(n log m).

Page 45: Overlap Matching

Swap Matching

Page 46: Overlap Matching

Swap Matching

CONNER_MACLEOD

CNONRE_MALCEDO

Page 47: Overlap Matching

Swap Matching

Formal Definition

Let S = s1,…,sn be a string over alphabet ∑.

A swap permutation for S is a permutation π : {1,…,n} → {1,…,n} such that:

1. If π(i) = j then π(j) = i

2. For all i, π(i) ∈ { i-1 , i , i+1 }

3. If π(i) ≠ i then sπ(i) ≠ si
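A direct way to test whether a pattern swap matches a text at a given index is a left-to-right scan, sketched below. It assumes that the pattern is the string being permuted and uses illustrative function names. The scan is greedy: where the characters already agree no swap is possible (a swap would bring in a different character), and where they disagree the only option is to swap with the next pattern character. Condition 3 of the definition holds automatically, since a swap is only used when the two swapped characters differ.

```python
def swap_matches_at(text, pattern, j):
    """True if some swap permutation of pattern equals text[j:j+m]."""
    m = len(pattern)
    i = 0
    while i < m:
        if pattern[i] == text[j + i]:
            i += 1  # characters agree, nothing to swap
        elif (i + 1 < m
              and pattern[i + 1] == text[j + i]
              and pattern[i] == text[j + i + 1]):
            i += 2  # swap pattern[i] with pattern[i+1]
        else:
            return False
    return True


def swap_match_positions(text, pattern):
    """All indexes j where pattern swap matches text."""
    return [j for j in range(len(text) - len(pattern) + 1)
            if swap_matches_at(text, pattern, j)]


print(swap_match_positions("CONNER_MACLEOD", "CNONRE_MALCEDO"))  # [0]
print(swap_match_positions("abba", "ab"))  # [0, 2]
```

This check costs O(m) per index, so O(nm) overall; it serves as a reference for the faster convolution-based method of the lecture.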

Page 48: Overlap Matching

Swap Matching

Lemma (will not be proven):

A solution to swap matching over alphabet {a,b} of time O(f(n,m)) implies a solution of time O(log|∑| f(n,m)) over alphabet ∑.

And there exists an algorithm to do so.

A. Amir, Y. Aumann, G. Landau, M. Lewenstein, N. Lewenstein, Pattern matching with swaps, J. Algorithms 37 (2) (2000) 247-266.

Page 49: Overlap Matching

Swap Matching

abbbbaaaababbb

Maximal Alternating Segment (MAS)
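Splitting a string into maximal alternating segments takes one pass. The sketch below assumes the usual meaning of the term, a maximal substring in which every two adjacent characters differ, and treats a single character as a segment of length one; the function name and the inclusive (start, end) output format are choices made for this illustration.

```python
def maximal_alternating_segments(s):
    """Return inclusive (start, end) pairs of maximal alternating runs."""
    segments = []
    start = 0
    for i in range(1, len(s) + 1):
        # A run ends at the end of the string or where two equal
        # characters are adjacent.
        if i == len(s) or s[i] == s[i - 1]:
            segments.append((start, i - 1))
            start = i
    return segments


print(maximal_alternating_segments("abbbbaaaababbb"))
```

On the slide's string the longest segment found is "abab" at indexes 8 to 11.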

Page 50: Overlap Matching

Swap Matching

Lemma:

The pattern P does NOT match in a particular alignment iff there exists a MAS A in T and a MAS B in P such that:

1. The characters of A and B misalign in the overlap

2. The overlap is of odd length

Page 51: Overlap Matching

Swap Matching

Lemma Intuition

aaababbbaba

aaababbbabab

Even overlap mismatch

Odd overlap mismatch

Page 52: Overlap Matching

Swap Matching

Proof → (by contradiction):

Assume P is aligned to T at index j, we can't swap match, and the two MAS A, B do not exist. Then one of the following holds:

1. All MAS overlaps match exactly. Contradiction: we don't even need to swap.

2. There exists at least one pair A, B that does not match exactly in an even overlap: w.l.o.g. overlapA = (ab)*, overlapB = (ba)*. We can swap within the overlap boundaries and get the desired result. Contradiction.

Thus, there must be a pair of MAS A, B that have a misaligned odd overlap.

Page 53: Overlap Matching

Swap Matching

Proof ← (by contradiction):

Assume there exist MAS A, B that misalign in an odd overlap, and P and T swap match at index j:

w.l.o.g. overlapA = (ab)*a, overlapB = (ba)*b

Then we must swap with letters outside of the overlap, but by the definition of MAS this will not help and we can't swap match. Contradiction.

Page 54: Overlap Matching

Swap Matching

Algorithm

Construct from T:

1. Teven-a, where all MAS with a's on even indexes are marked segments.

2. Todd-a, where all MAS with a's on odd indexes are marked segments.

T       = ababbbbaaab
Teven-a = ababbbaaaab
Todd-a  = ababbbaaaab

Page 55: Overlap Matching

Swap Matching

Algorithm

We provide a similar construction for P: Peven-a, Podd-a, using P's indexes!

When matching, if the index j of T is odd we will use one for the other (Peven-a becomes Podd-a and vice versa).

Peven-a = abbba (indexes 0 to 4)

Aligned at T's index 3 it becomes Podd-a:

abbba (indexes 3 to 7)

Page 56: Overlap Matching

Swap Matching

Algorithm

If index j is even, T swap matches P iff Teven-a overlap matches Podd-a at j and Todd-a overlap matches Peven-a at j.

If index j is odd, T swap matches P iff Teven-a overlap matches Peven-a at j and Todd-a overlap matches Podd-a at j.

Page 57: Overlap Matching

Swap Matching

Algorithm - Why does it work?

An even-a MAS and an odd-a MAS will never exactly match:

even-a MAS: ababa
odd-a MAS:  babab

By the lemma, if their overlap is odd then swap matching is not possible, and this is exactly what we examine using the Overlap Matching method.

Page 58: Overlap Matching

Swap Matching

Algorithm

Runtime O(n log m):

1. We pay O(n) to segment the strings into MAS.

2. We pay O(n log m) to run overlap matching.

Thus, for an alphabet ∑ we can swap match in O(n log m log|∑|) time.

This improves on the previous deterministic upper bound of O(n m^(1/3) log m log|∑|).

A. Amir, Y. Aumann, G. Landau, M. Lewenstein, N. Lewenstein, Pattern matching with swaps, J. Algorithms 37 (2) (2000) 247-266.