
Page 1: Speaker: C. C. Lin Adviser: R. C. T. Lee

String Matching with k Mismatches by Using Kangaroo Method

Efficient string matching with k mismatches, Landau, G.M., and Vishkin, U., Theoret. Comput. Sci. 43, 1986, pp. 239-249

Speaker: C. C. Lin
Adviser: R. C. T. Lee

Page 2:

Problem definition:

Input: A text T of length n, a pattern P of length m, and a mismatch threshold k.

Output: All substrings of T of length m that match P with at most k mismatches.

T = A G C T G C D C A C G I A B...
P = A G C C

Aligning P at text positions 1, 2, 3 and 4 gives 1, 4, 3 and 2 mismatches, respectively.

If k = 2, positions 1 (AGCT) and 4 (TGCD) qualify among these four.
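As a baseline for the problem definition, the following is a minimal brute-force sketch: it counts mismatches window by window in O(nm) time and does not use the Kangaroo method. The function names are illustrative, and the text is only the visible prefix of the slide's example (the part hidden behind "..." is not used).

```python
def count_mismatches(window: str, pattern: str) -> int:
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(window, pattern))


def k_mismatch_naive(text: str, pattern: str, k: int) -> list:
    """0-based start positions whose window differs from pattern in at most k places."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if count_mismatches(text[i:i + m], pattern) <= k
    ]


# Visible prefix of the slide's text; the elided remainder is unknown.
text = "AGCTGCDCACGIAB"
pattern = "AGCC"
print([count_mismatches(text[i:i + 4], pattern) for i in range(4)])  # [1, 4, 3, 2]
print(k_mismatch_naive(text, pattern, 2))  # [0, 3]
```

Positions are 0-based here, so `[0, 3]` corresponds to text positions 1 and 4 on the slide.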

Page 3:

The concept of the Kangaroo method is illustrated below. Assume that it is known beforehand that t_1 t_2 … t_a = p_1 p_2 … p_a and that t_{a+1} is not equal to p_{a+1}.

Then we do not have to compare t_1 t_2 … t_{a+1} with p_1 p_2 … p_{a+1} character by character; we record one mismatch and jump directly to matching the suffixes beginning at t_{a+2} and p_{a+2}.

Text:    t_1 t_2 … t_a  t_{a+1}  t_{a+2} t_{a+3} … t_k …
Pattern: p_1 p_2 … p_a  p_{a+1}  p_{a+2} p_{a+3} … p_k …
                        (mismatch)
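The jump can be sketched for a single alignment as follows. This is only an illustration under stated assumptions: `lce_naive` and `mismatch_positions` are hypothetical names, the common-prefix length is found by scanning characters (a stand-in for the constant-time query described later in the deck), and the alignment at 0-based position 11 of the text from the following pages is chosen here for illustration.

```python
from typing import List, Optional


def lce_naive(text: str, i: int, pattern: str, j: int) -> int:
    """Length of the longest common prefix of text[i:] and pattern[j:].

    Scans character by character; it stands in for an O(1) query.
    """
    length = 0
    while (i + length < len(text) and j + length < len(pattern)
           and text[i + length] == pattern[j + length]):
        length += 1
    return length


def mismatch_positions(text: str, pattern: str, start: int, k: int) -> Optional[List[int]]:
    """Kangaroo verification of the alignment of pattern at text[start:].

    Returns the mismatching pattern offsets (at most k of them), or None as
    soon as mismatch number k + 1 is found. Assumes start + len(pattern)
    does not exceed len(text).
    """
    found = []
    j = 0
    m = len(pattern)
    while j < m:
        j += lce_naive(text, start + j, pattern, j)  # hop over the equal run
        if j >= m:
            break
        found.append(j)          # text[start + j] != pattern[j]
        if len(found) > k:
            return None
        j += 1                   # resume right after the mismatch
    return found


T = "ABCCABDADBDETADBAADFDAAEERDXTDADCT"
P = "ETBDBCCDFDC"
print(mismatch_positions(T, P, 11, 4))  # [2, 5, 6, 10]
print(mismatch_positions(T, P, 11, 3))  # None
```

Each loop iteration costs one common-prefix query plus constant work, so an alignment needs at most k + 1 queries.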

Page 4:

T = ABCCABDADBDETADBAADFDAAEERDXTDADCT…
P = ETBDBCCDFDC

The Kangaroo method proceeds as follows.

Start: mismatches = 0

Page 5:

T = ABCCABDADBDETADBAADFDAAEERDXTDADCT…
P = ETBDBCCDFDC

The Kangaroo method proceeds as follows.

mismatches = 1

Page 6:

T = ABCCABDADBDETADBAADFDAAEERDXTDADCT…
P = ETBDBCCDFDC

The Kangaroo method proceeds as follows.

mismatches = 2

Page 7:

T = ABCCABDADBDETADBAADFDAAEERDXTDADCT…
P = ETBDBCCDFDC

The Kangaroo method proceeds as follows.

mismatches = 3

Page 8:

T = ABCCABDADBDETADBAADFDAAEERDXTDADCT…
P = ETBDBCCDFDC

The Kangaroo method proceeds as follows.

mismatches = 4

Page 9:

We continue the above process. Whenever it is known that a substring of T exactly matches a substring of P, we skip this substring. The process stops when k+1 mismatches have been found.

Input: T=ABAABBCCDD, P=ACDCB and k=2.

T = ABAABBCCDD
P = ACDCB

When the third mismatch is found (mismatches = 3 > k), we stop and discard “ABAAB”; then we start to compare “BAABB” with “ACDCB”.

Page 10:

Before we introduce the Kangaroo algorithm, we first introduce the suffix tree and the lowest common ancestor of two nodes.

The Kangaroo algorithm relies on the properties of both.

Page 11:

S = ABCDEADDBE

The suffix tree of a string of length n can be constructed in O(n) time.

Weiner, 1973; McCreight, 1976; Ukkonen, 1995

[Figure: suffix tree of S$ = ABCDEADDBE$, with leaves numbered 1 to 10 by suffix start position.]

Page 12:

After O(n) preprocessing at construction time, the lowest common ancestor of any two leaf nodes can be found in O(1) time.

Harel and Tarjan, 1984

[Figure: the suffix tree of S$ = ABCDEADDBE$ from the previous page.]
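The constant-time query can be sketched without building a suffix tree. The block below swaps in a different, commonly used structure: a suffix array with an LCP array (computed with Kasai's method) and a sparse table for range-minimum queries. It is not the suffix tree with lowest-common-ancestor lookups that the slides describe, and the suffix array is built with a naive sort, so preprocessing here is not linear time. The name `build_lce` and the `#` and `$` separators are assumptions for illustration; the separators must not occur in T or P.

```python
def build_lce(s: str):
    """Return lce(i, j): longest common prefix length of s[i:] and s[j:]."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])   # naive suffix array
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r

    # lcp[r] = common prefix length of suffixes sa[r - 1] and sa[r] (Kasai)
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0

    # table[p][r] = min(lcp[r : r + 2**p])
    table = [lcp]
    p = 1
    while (1 << p) <= n:
        prev = table[-1]
        half = 1 << (p - 1)
        table.append([min(prev[r], prev[r + half])
                      for r in range(n - (1 << p) + 1)])
        p += 1

    def lce(i: int, j: int) -> int:
        if i == j:
            return n - i
        lo, hi = sorted((rank[i], rank[j]))
        lo += 1                                   # minimum over lcp[lo .. hi]
        q = (hi - lo + 1).bit_length() - 1
        return min(table[q][lo], table[q][hi - (1 << q) + 1])

    return lce


# Text and pattern joined with two distinct separators.
lce = build_lce("ABCCBDCDBC#ABCD$")
print(lce(0, 11))  # 3: "ABCCBDCDBC" and "ABCD" share the prefix "ABC"
print(lce(6, 13))  # 2: "CDBC" and "CD" share the prefix "CD"
```

The minimum of the LCP array between the two suffix ranks plays the role of the depth of the lowest common ancestor of the two leaves.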

Page 13:

The Kangaroo method constructs a suffix tree for the text T and the pattern P. Let X denote the leaf corresponding to the suffix of T that starts at the text location being verified, and let Y denote the leaf corresponding to the pattern. The path label of the lowest common ancestor of X and Y is the common prefix of the two suffixes, so each lowest-common-ancestor query locates the next mismatch, and a text location is verified with k mismatches in O(k) time.

The next page illustrates the Kangaroo method.

Page 14:

Two suffix strings:

ANBECF$
ANCEC$

[Figure: the two suffixes share the branch “AN”, which then splits into “BECF$” and “CEC$”.]

Their lowest common ancestor shows that they have the same prefix “AN” followed by a mismatch, “B” versus “C”.

We now have to find whether there are any mismatches between ECF and EC.

mismatches=1

Page 15:

We get the remaining suffix strings:

ECF$
EC$

[Figure: the two suffixes share the branch “EC”, which then splits into “F$” and “$”.]

They have the same prefix “EC”, and because we reach $, the verification is finished.

mismatches=1

Thus the number of mismatches between “ANBECF” and “ANCEC” is 1.
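The two-step example can be replayed with a short sketch. It uses `os.path.commonprefix` from the Python standard library, which compares strings character by character, as a stand-in for the lowest-common-ancestor query; `trace_mismatches` is a hypothetical helper name, and the comparison ends where the shorter string would reach its `$` terminator.

```python
from os.path import commonprefix


def trace_mismatches(a: str, b: str) -> list:
    """Compare a and b by repeated common-prefix jumps.

    Each reported tuple is (0-based offset, char in a, char in b). The
    comparison ends when the shorter string is exhausted, which is where
    its '$' terminator would be reached on the slide.
    """
    limit = min(len(a), len(b))
    offset = 0
    mismatches = []
    while True:
        offset += len(commonprefix([a[offset:], b[offset:]]))  # the jump
        if offset >= limit:
            return mismatches
        mismatches.append((offset, a[offset], b[offset]))
        offset += 1                                            # skip the mismatch


print(trace_mismatches("ANBECF", "ANCEC"))  # [(2, 'B', 'C')]
```

The first jump covers “AN”, the mismatch “B” versus “C” is recorded, and the second jump covers “EC” and ends the comparison.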

Page 16:

By finding the lowest common ancestor of the text suffix and the pattern suffix in the suffix tree, we avoid comparing all characters.

This is useful when the text and the pattern have many equal characters, because those equal characters are skipped rather than compared.

Finding the lowest common ancestor of two suffixes amounts to finding the next mismatch between the two strings.

Page 17:

Input: T=ABCCBDCDBC, P=ABCD and k=2.

The suffix tree of T and P is:

[Figure: suffix tree containing every suffix of T$ = ABCCBDCDBC$ and of P$ = ABCD$.]

Page 18:

The lowest common ancestor of “ABCD” and “ABCCBDCDBC” has path label “ABC”; the next characters, “D” and “C”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=1. The pattern is exhausted, so return “ABCC”.

Page 19:

The lowest common ancestor of “ABCD” and “BCCBDCDBC” is the root; the first characters, “A” and “B”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=1.

Page 20:

The lowest common ancestor of “BCD” and “CCBDCDBC” is the root; the first characters, “B” and “C”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=2.

Page 21:

The lowest common ancestor of “CD” and “CBDCDBC” has path label “C”; the next characters, “D” and “B”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=3, which exceeds k. Discard “BCCB”.

Page 22:

The lowest common ancestor of “ABCD” and “CCBDCDBC” is the root; the first characters, “A” and “C”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=1.

Page 23:

The lowest common ancestor of “BCD” and “CBDCDBC” is the root; the first characters, “B” and “C”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=2.

Page 24:

The lowest common ancestor of “CD” and “BDCDBC” is the root; the first characters, “C” and “B”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=3, which exceeds k. Discard “CCBD”.

Page 25:

The lowest common ancestor of “ABCD” and “CBDCDBC” is the root; the first characters, “A” and “C”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=1.

Page 26:

The lowest common ancestor of “BCD” and “BDCDBC” has path label “B”; the next characters, “C” and “D”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=2.

Page 27:

The lowest common ancestor of “D” and “CDBC” is the root; the first characters, “D” and “C”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=3, which exceeds k. Discard “CBDC”.

Page 28:

The lowest common ancestor of “ABCD” and “BDCDBC” is the root; the first characters, “A” and “B”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=1.

Page 29:

The lowest common ancestor of “BCD” and “DCDBC” is the root; the first characters, “B” and “D”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=2.

Page 30:

The lowest common ancestor of “CD” and “CDBC” has path label “CD”; the pattern ends there, so no further mismatch is counted.

T=ABCCBDCDBC, P=ABCD, mismatches=2. Return “BDCD”.

Page 31:

The lowest common ancestor of “ABCD” and “DCDBC” is the root; the first characters, “A” and “D”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=1.

Page 32:

The lowest common ancestor of “BCD” and “CDBC” is the root; the first characters, “B” and “C”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=2.

Page 33:

The lowest common ancestor of “CD” and “DBC” is the root; the first characters, “C” and “D”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=3, which exceeds k. Discard “DCDB”.

Page 34:

The lowest common ancestor of “ABCD” and “CDBC” is the root; the first characters, “A” and “C”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=1.

Page 35:

The lowest common ancestor of “BCD” and “DBC” is the root; the first characters, “B” and “D”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=2.

Page 36:

The lowest common ancestor of “CD” and “BC” is the root; the first characters, “C” and “B”, mismatch.

T=ABCCBDCDBC, P=ABCD, mismatches=3, which exceeds k. Discard “CDBC”.

Page 37:

Input: T=ABCCBDCDBC, P=ABCD and k=2.

Output: “ABCC” and “BDCD”.
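The output above can be reproduced with a small end-to-end sketch. As an assumption made only to keep the example short, the common-prefix lengths come from an O(nm) dynamic-programming table rather than from a suffix tree; the jumping loop is what the walkthrough illustrates. `kangaroo_search` is a hypothetical name, and reported positions are 1-based to match the slides.

```python
def kangaroo_search(text: str, pattern: str, k: int) -> list:
    """Return (1-based position, substring, mismatches) for every window of
    text that matches pattern with at most k mismatches."""
    n, m = len(text), len(pattern)

    # lce[i][j] = length of the longest common prefix of text[i:] and pattern[j:]
    lce = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if text[i] == pattern[j]:
                lce[i][j] = lce[i + 1][j + 1] + 1

    results = []
    for start in range(n - m + 1):
        mismatches, j = 0, 0
        while mismatches <= k:
            j += lce[start + j][j]       # jump over the matching run
            if j >= m:                   # pattern exhausted
                break
            mismatches += 1              # text[start + j] != pattern[j]
            j += 1
        if mismatches <= k:
            results.append((start + 1, text[start:start + m], mismatches))
    return results


print(kangaroo_search("ABCCBDCDBC", "ABCD", 2))
# [(1, 'ABCC', 1), (5, 'BDCD', 2)]
```

The inner loop runs at most k + 1 times per window, which is the property the complexity argument relies on.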

Page 38:

To use the Kangaroo method, we construct a suffix tree for the text T of length n and the pattern P of length m in O(n+m) time.

With the Kangaroo method, each mismatch is found in O(1) time. We stop as soon as more than k mismatches have been found, so verifying one text location takes O(k) time.

Page 39:

Thus, the time complexity of finding all locations of the text T that match the pattern P with at most k mismatches is O(nk).
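The bound can be written out term by term. This is a sketch of the accounting, assuming k ≥ 1 and m ≤ n, with n − m + 1 candidate locations and at most k + 1 lowest-common-ancestor queries per location:

```latex
\begin{align*}
\text{preprocessing} &= O(n + m) \\
\text{verification}  &= (n - m + 1) \cdot O(k + 1) \\
\text{total}         &= O(n + m) + O\bigl((n - m + 1)(k + 1)\bigr) = O(nk)
\end{align*}
```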

Page 40:

References

For Construction of Suffix Trees:
[M76] McCreight, E.M., A Space-Economical Suffix Tree Construction Algorithm, J. ACM 23 (1976): 262-272.
[U95] Ukkonen, E., On-line Construction of Suffix Trees, Algorithmica 14 (1995): 249-260.

For Finding Lowest Common Ancestor:
[HT84] Harel, D. and Tarjan, R.E., Fast Algorithms for Finding Nearest Common Ancestors, SIAM Journal on Computing 13 (1984): 338-355.

Page 41:

References

For String Matching with k Mismatches:

[LV86] Landau, G.M., and Vishkin, U., Efficient string matching with k mismatches, Theoret. Comput. Sci. 43 (1986): 239-249.

Page 42:

Thank you