pattern matching in the streaming model

Pattern Matching in the streaming model Ely Porat Google inc & Bar-Ilan University


Post on 16-Jan-2016


TRANSCRIPT

Page 1: Pattern Matching in the streaming model

Pattern Matching in the streaming model

Ely Porat, Google Inc. & Bar-Ilan University

Page 2: Pattern Matching in the streaming model

Problem definition - Pattern Matching

Given a text T of length n and a pattern P of length m, the problem is to find all the substrings of T that are equal to P.

[Figure: text T (length n) and pattern P (length m)]

Page 3: Pattern Matching in the streaming model

Problem definition - Online Pattern Matching

• We get the text character by character.

[Figure: pattern P and text T]

Page 4: Pattern Matching in the streaming model

Motivation…

• Stock market

Page 5: Pattern Matching in the streaming model
Page 6: Pattern Matching in the streaming model

Motivation…

• Espionage

The rest we monitor

Page 7: Pattern Matching in the streaming model

Motivation…

• Viruses and malware

Software solutions:
  Snort: 73.5Mb
  ClamAV: 1.48Gb

Using TCAMs:
  Snort: 680Kb
  ClamAV: 25Mb

Our solution (software):
  Snort: 51Kb
  ClamAV: 216Kb

Page 8: Pattern Matching in the streaming model

Motivation…

• Monitoring internet traffic

Page 9: Pattern Matching in the streaming model

Streaming model

250 BPS 250 BPS

We can't store the whole input

In our case we seek algorithms that require poly(log m) space.

Page 10: Pattern Matching in the streaming model

Related work

• Karp-Rabin: Randomized Algorithm for exact pattern matching

• Clifford, Porat, and Porat: A black box algorithm for online approximate pattern matching
  o Almost any pattern matching algorithm can be converted to run online.

Page 11: Pattern Matching in the streaming model

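Karp-Rabin Algorithm

Pattern: $p_0 p_1 p_2 p_3 \ldots p_{m-1}$

Text: $t_0\, t_1\, t_2 \ldots t_i\, t_{i+1} \ldots t_{i+m-1}\, t_{i+m} \ldots t_n$

Pattern signature: $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \ldots + p_{m-1} \bmod q$

$S_i = t_i r^{m-1} + t_{i+1} r^{m-2} + \ldots + t_{i+m-1} \bmod q$

$S_{i+1} = t_{i+1} r^{m-1} + \ldots + t_{i+m-1} r + t_{i+m} \bmod q$

$S_{i+1} = S_i r + t_{i+m} - t_i r^m \bmod q$

r is chosen at random.

Requires O(m) memory.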
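The rolling update on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: the function name `karp_rabin_matches`, the default modulus `q = 2^61 - 1`, and reporting matches by signature alone (no verification step) are assumptions made for the example. The text is held in memory because the update needs the outgoing character, which is where the O(m) memory comes from.

```python
import random

def karp_rabin_matches(text, pattern, q=(1 << 61) - 1, r=None):
    """Return start indices i where the signature of text[i:i+m] equals Sig(pattern)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    if r is None:
        r = random.randrange(2, q - 1)   # r is chosen at random, as on the slide
    r_m = pow(r, m, q)                   # r^m, used to drop the outgoing character
    sig_p = 0
    s = 0
    for k in range(m):
        sig_p = (sig_p * r + ord(pattern[k])) % q   # p_0 r^(m-1) + ... + p_(m-1)
        s = (s * r + ord(text[k])) % q              # S_0
    matches = []
    for i in range(n - m + 1):
        if s == sig_p:                   # equal signatures: a match with high probability
            matches.append(i)
        if i + m < n:
            # S_(i+1) = S_i * r + t_(i+m) - t_i * r^m  (mod q)
            s = (s * r + ord(text[i + m]) - ord(text[i]) * r_m) % q
    return matches
```

For example, `karp_rabin_matches("abracadabra", "abra")` reports the occurrences starting at positions 0 and 7, barring an unlucky signature collision.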

Page 12: Pattern Matching in the streaming model

The idea - Simple case

The pattern starts with z, and there are no more z's in the pattern.

[Figure: pattern P beginning with z; text T with occurrences of z, each labeled "Start signing", and "Signature" labels]
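The simple case can be sketched as a streaming matcher that keeps only one running signature and one counter. This is an illustrative sketch under stated assumptions: the function name `stream_match_unique_first` and the parameter defaults are hypothetical, and the signature is the Karp-Rabin form from the earlier slide. Because z cannot occur inside a match, seeing z always restarts the signing.

```python
import random

def stream_match_unique_first(stream, pattern, q=(1 << 61) - 1, r=None):
    """Yield start positions of matches; requires pattern[0] to occur nowhere else in pattern."""
    z = pattern[0]
    if z in pattern[1:]:
        raise ValueError("first character must be unique in the pattern")
    if r is None:
        r = random.randrange(2, q - 1)
    m = len(pattern)
    sig_p = 0
    for c in pattern:
        sig_p = (sig_p * r + ord(c)) % q
    sig, length = 0, 0                   # length == 0 means "not signing"
    for pos, c in enumerate(stream):
        if c == z:
            sig, length = ord(c) % q, 1  # start signing (any older candidate is dead)
        elif length > 0:
            sig, length = (sig * r + ord(c)) % q, length + 1
        if length == m:
            if sig == sig_p:
                yield pos - m + 1
            length = 0                   # stop signing until the next z
```

The state between characters is two integers, regardless of the pattern length.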

Page 13: Pattern Matching in the streaming model

Case 1

There is a prefix U such that U appears only once in the pattern, with |U| <= m/2.

Seek U in recursion.

[Figure: pattern P with prefix U; text T with occurrences of U, labeled "Start signing" and "Signature"]

Page 14: Pattern Matching in the streaming model

Case 2: No small U

Look at the first m/2 characters (W). They appear again somewhere.

Option 1

P = v v v v v v v v, followed by a prefix of v

Option 2

P = v v v v w

w isn't a prefix of v and v isn't a prefix of w

|v| <= m/2
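The two options can be told apart mechanically. The sketch below assumes, as an interpretation not spelled out on the slide, that v is the shortest period of the first m/2 characters; the function name `decompose` is hypothetical, and the pattern is assumed to be non-empty. It returns v, the number of leading copies of v, the remainder w, and which option applies.

```python
def decompose(pattern):
    """Split pattern as v repeated `reps` times followed by w; report option 1 or 2."""
    m = len(pattern)
    half = pattern[: max(1, m // 2)]
    # Shortest period of the first half, via the KMP failure function.
    fail = [0] * len(half)
    k = 0
    for i in range(1, len(half)):
        while k and half[i] != half[k]:
            k = fail[k - 1]
        if half[i] == half[k]:
            k += 1
        fail[i] = k
    period = len(half) - fail[-1]
    v = pattern[:period]
    reps = 0
    while pattern.startswith(v, reps * period):   # count leading copies of v
        reps += 1
    w = pattern[reps * period:]
    option = 1 if v.startswith(w) else 2          # option 1: the tail is a prefix of v
    return v, reps, w, option
```

For instance, `decompose("abababc")` gives v = "ab" repeated 3 times with w = "c", which is option 2.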

Page 15: Pattern Matching in the streaming model

Solving case 2

Option 2

P = v v v v w

|v| <= m/2

Search in recursion for v, and count how many times you found it.

Sign on w.

[Figure: text T containing v v v v, with "Start signing" and "Signature" labels]

Page 16: Pattern Matching in the streaming model

Solving case 2 - continue

Option 2

P = v v v v w

|v| <= m/2

Search in recursion for v, and count how many times you found it.

Sign on w.

Using O(log m) signatures and counters in the worst case.

[Figure: text T containing runs of v, with spans labeled ">m/2" and "<m/2", and "Start signing" and "Signature" labels]

Page 17: Pattern Matching in the streaming model

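Karp-Rabin Algorithm

Pattern: $p_0 p_1 p_2 p_3 \ldots p_{m-1}$

Text: $t_0\, t_1\, t_2 \ldots t_i\, t_{i+1} \ldots t_{i+m-1}\, t_{i+m} \ldots t_n$

Pattern signature: $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \ldots + p_{m-1} \bmod q$

$S_i = t_i r^{m-1} + t_{i+1} r^{m-2} + \ldots + t_{i+m-1} \bmod q$

$S_{i+1} = t_{i+1} r^{m-1} + \ldots + t_{i+m-1} r + t_{i+m} \bmod q$

$S_{i+1} = S_i r + t_{i+m} - t_i r^m \bmod q$

r is chosen at random.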

Page 18: Pattern Matching in the streaming model

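Rothschild signature 07

Pattern: $p_0 p_1 p_2 p_3 \ldots p_{m-1}$

$p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \ldots + p_{m-1} \bmod q$

$p_0 + p_1 r + p_2 r^2 + \ldots + p_{m-1} r^{m-1} \bmod q$

Text: $t_0\, t_1\, t_2\, t_3 \ldots t_i$

$S_i = \sum_{j=0}^{i} t_j r^j \bmod q$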

Page 19: Pattern Matching in the streaming model

Forward signatures

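There is a prefix U such that U appears only once in the pattern, with |U| <= m/2.

Seek U in recursion.

Calculate $X = S_i + Sig \cdot r^{i+1}$.

Remember X for this position.

Check if equal to X.

[Figure: pattern P with prefix U; text T with an occurrence of U and "Signature" labels]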
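The role of X can be checked numerically: if a word with forward signature Sig occupies the m text positions right after position i, then the prefix signature at the end of that word must equal $S_i + Sig \cdot r^{i+1}$. The sketch below only demonstrates this identity; the helper names (`prefix_signatures`, `forward_signature`, `expected_after`) and the sample text are assumptions for illustration.

```python
def prefix_signatures(text, q, r):
    """S_i = sum_{j<=i} t_j r^j mod q, for every position i of the text."""
    out, s, r_pow = [], 0, 1
    for c in text:
        s = (s + ord(c) * r_pow) % q
        out.append(s)
        r_pow = r_pow * r % q
    return out

def forward_signature(word, q, r):
    """Sig(x_0 ... x_k) = sum_j x_j r^j mod q."""
    return sum(ord(c) * pow(r, j, q) for j, c in enumerate(word)) % q

def expected_after(s_i, sig, i, q, r):
    """X = S_i + Sig * r^(i+1): the prefix signature expected where the word would end."""
    return (s_i + sig * pow(r, i + 1, q)) % q

q, r = (1 << 61) - 1, 256
text = "xxabcxx"
S = prefix_signatures(text, q, r)
sig = forward_signature("abc", q, r)
# "abc" occupies positions 2..4, so X computed at i = 1 must equal S_4.
print(expected_after(S[1], sig, 1, q, r) == S[4])
```

Only X needs to be remembered for a candidate position, not the characters that follow it.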

Page 20: Pattern Matching in the streaming model

Example: q=7 r=3

P: 0, 1, 1, 0, 1, 1, 1

Prefix of P: 0, 1, 1

Bit sequence: 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1

[Figure: worked run over the text T, with rows for Level 1, Level 2 and Level 3, the powers r^i, and the signature values]
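The signature values for this example's parameters can be recomputed directly. The sketch below assumes the forward signature $\sum_j x_j r^j \bmod q$ from the earlier slides; the function name `forward_signature` is chosen for illustration.

```python
def forward_signature(bits, q=7, r=3):
    """Sig(x_0 ... x_k) = sum_j x_j r^j mod q."""
    return sum(b * pow(r, j, q) for j, b in enumerate(bits)) % q

powers = [pow(3, j, 7) for j in range(7)]
print(powers)                                      # [1, 3, 2, 6, 4, 5, 1]
print(forward_signature([0, 1, 1]))                # 5
print(forward_signature([0, 1, 1, 0, 1, 1, 1]))    # 1
```

The powers of 3 modulo 7 repeat with period 6, so signatures of long texts reuse the same six multipliers.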

Page 21: Pattern Matching in the streaming model

Worst case - time

Text: $t_0\, t_1\, t_2\, t_3 \ldots t_i$

$X_1, X_2, \ldots, X_{\log m}$

Check using hash table: $X_1 = X_2 = \ldots = X_{\log m}$ ???

We can work in a lazy approach without blowup in the memory.

Time: O(1)

Amortized O(1), but what about worst case?

Page 22: Pattern Matching in the streaming model

Average / Random/ Smooth case

P: m

$\log_{\Sigma} m$

$\log_{\Sigma} \log_{\Sigma} m$

Total number of iterations is $O(\log^{*}_{\Sigma} m)$.

Page 23: Pattern Matching in the streaming model

Worst case

P: m, m/2, m/4

Total number of iterations is O(log m), giving O(log m log δ) space.

Page 24: Pattern Matching in the streaming model

Multi-Pattern search (dictionary matching)

• Given a set of patterns $D = \{P_1, P_2, P_3, \ldots, P_d\}$
  – The patterns can be of different lengths.

• We want to report whenever one of the patterns appears.

• Our algorithm will require $O(\sum_{i=1}^{d} \log|P_i|)$ memory, and will require O(log d) time per text character.

Page 25: Pattern Matching in the streaming model

Multi-Pattern search (dictionary matching)

• Denote $M = \max_i |P_i|$

• Our algorithm will have 2 cases:
  – Case 1: d>M
  – Case 2: d<M

Page 26: Pattern Matching in the streaming model

Case 1: d>M

• In this case we can allocate an array of size M+1

Text: $t_0\, t_1\, t_2\, t_3 \ldots t_{l-M}\, t_{l-M+1} \ldots t_l$

Window: $S_{l-M}\, S_{l-M+1} \ldots S_l$

$S_i = \sum_{j=0}^{i} t_j r^j \bmod q$

It is easy to maintain such a sliding window in O(1) time and O(M) memory.
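One way to keep such a window is a bounded double-ended queue. The sketch below is an assumption about how the array could be realized, not the speaker's implementation; the class name `SignatureWindow` is hypothetical, symbols are taken to be small integers, and the window is seeded with the empty-prefix signature 0 so that $S_{l-M}$ is available once M characters have arrived.

```python
from collections import deque

class SignatureWindow:
    """Keeps the last M+1 prefix signatures S_(l-M) ... S_l."""

    def __init__(self, M, q, r):
        self.q, self.r = q, r
        self.window = deque([0], maxlen=M + 1)   # starts with the empty prefix, signature 0
        self.r_pow = 1                           # r^l for the next incoming position l
        self.l = -1                              # index of the last character seen

    def push(self, symbol):
        """Append t_l and return S_l = S_(l-1) + t_l * r^l mod q, in O(1) time."""
        s = (self.window[-1] + symbol * self.r_pow) % self.q
        self.window.append(s)                    # the oldest entry falls out automatically
        self.r_pow = self.r_pow * self.r % self.q
        self.l += 1
        return s
```

With q = 7 and r = 3, pushing the bits 0, 1, 1, 0 returns the signatures 0, 3, 5, 5.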

Page 27: Pattern Matching in the streaming model

Case 1: d>M - continue

$Sig(x_0 x_1 x_2 \ldots x_i) = \sum_{j=0}^{i} x_j r^j \bmod q$

For each P_i in D: (P_i = a_0 a_1 a_2 … a_{m_i-1})
  e = m_i
  while e != 0:
    find j s.t. 2^j <= e and 2^(j+1) > e
    e = e - 2^j
    if e != 0: HashTable(Sig(a_e a_{e+1} … a_{m_i-1}))
  HashTable(Sig(a_0 a_1 … a_{m_i-1}), match_i)

Example

P_i = a_0 a_1 a_2 … a_38

We will store in the hash table:

Sig(a_7 a_8 … a_38)

Sig(a_3 a_4 … a_38)

Sig(a_1 a_2 … a_38)

Sig(a_0 a_1 … a_38), match_i

We will store at most log |P_i| points
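The loop above can be sketched in Python as follows. The names `checkpoint_offsets`, `signature` and `build_table`, and the use of a plain dict as the hash table, are assumptions for illustration; a stored value of `None` marks a partial suffix and a pattern index marks a full match.

```python
def checkpoint_offsets(m):
    """Start offsets e of the stored suffixes for a pattern of length m (0 = full pattern)."""
    offsets, e = [], m
    while e != 0:
        j = e.bit_length() - 1        # largest j with 2^j <= e < 2^(j+1)
        e -= 1 << j
        offsets.append(e)
    return offsets

def signature(word, q, r):
    """Sig(x_0 ... x_i) = sum_j x_j r^j mod q."""
    return sum(ord(c) * pow(r, j, q) for j, c in enumerate(word)) % q

def build_table(patterns, q, r):
    """Hash table from suffix signatures to None (partial) or a pattern index (match)."""
    table = {}
    for idx, p in enumerate(patterns):
        for e in checkpoint_offsets(len(p)):
            sig = signature(p[e:], q, r)
            if e == 0:
                table[sig] = idx              # full pattern: record the match
            else:
                table.setdefault(sig, None)   # partial suffix: keep any existing match
    return table

print(checkpoint_offsets(39))   # [7, 3, 1, 0]
```

For a pattern of length 39 the offsets are 7, 3, 1 and 0, which agrees with the four stored signatures in the example.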

Page 28: Pattern Matching in the streaming model

Case 1: d>M - continue

$2^i$

$2^i + 2^j$

$2^i + 2^j + 2^l$

At most $\log |P_i|$ levels

Page 29: Pattern Matching in the streaming model

Case 1: d>M

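• In this case we can allocate an array of size M+1

Text: $t_0\, t_1\, t_2\, t_3 \ldots t_{l-M}\, t_{l-M+1} \ldots t_l$

Window: $S_{l-M}\, S_{l-M+1} \ldots S_l$

$S_i = \sum_{j=0}^{i} t_j r^j \bmod q$

$Sig(x_0 x_1 x_2 \ldots x_i) = \sum_{j=0}^{i} x_j r^j \bmod q$

Notice that it takes O(1) to calculate $Sig(t_i t_{i+1} \ldots t_l)$:

$Sig(t_i t_{i+1} \ldots t_l) = \frac{S_l - S_{i-1}}{r^i}$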
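The division by $r^i$ is a multiplication by a modular inverse, which exists when q is prime and r is not a multiple of q. A minimal sketch, assuming Python 3.8+ (where `pow` accepts a negative exponent with a modulus) and a hypothetical helper name `range_signature`:

```python
def range_signature(S, i, l, q, r):
    """Sig(t_i ... t_l) = (S_l - S_(i-1)) / r^i mod q, from prefix signatures S."""
    before = S[i - 1] if i > 0 else 0
    # pow(r, -i, q) is the modular inverse of r^i. Here it costs O(log i);
    # keeping the inverse powers alongside the window would make it O(1).
    return (S[l] - before) * pow(r, -i, q) % q

# Prefix signatures of 0,1,1,0,1,1,1 with q=7, r=3.
S = [0, 3, 5, 5, 2, 0, 1]
print(range_signature(S, 4, 6, 7, 3))   # 6, the signature of 1,1,1: 1 + 3 + 2
```

The result does not depend on where the substring sits in the text, which is what lets it be looked up in a table built from the patterns alone.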

Page 30: Pattern Matching in the streaming model

Case 1: d>M - continue

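We will do binary search over the sliding window

Window: $S_{l-M}\, S_{l-M+1} \ldots S_l$

Probe at $l-2^j$: is $\frac{S_l - S_{l-2^j}}{r^{l-2^j+1}}$ in the HashTable? No

Probe at $l-2^{j-1}$: is $\frac{S_l - S_{l-2^{j-1}}}{r^{l-2^{j-1}+1}}$ in the HashTable? Yes

Probe at $l-2^{j-1}-2^{j-2}$: is $\frac{S_l - S_{l-2^{j-1}-2^{j-2}}}{r^{l-2^{j-1}-2^{j-2}+1}}$ in the HashTable?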
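The probing sequence can be sketched for the simplest setting of a single pattern. Everything named here (`signature`, `prefix_signatures`, `build_table`, `match_ends_here`) is hypothetical, and the sketch keeps all prefix signatures in a list instead of a window of size M+1. The slides do not spell out how probes belonging to different patterns interact, so the sketch deliberately uses one pattern only. It assumes Python 3.8+ for `pow` with a negative exponent.

```python
def signature(word, q, r):
    """Sig(x_0 ... x_i) = sum_j x_j r^j mod q."""
    return sum(ord(c) * pow(r, j, q) for j, c in enumerate(word)) % q

def prefix_signatures(text, q, r):
    """S_i for every position i of the text."""
    out, s, r_pow = [], 0, 1
    for c in text:
        s = (s + ord(c) * r_pow) % q
        out.append(s)
        r_pow = r_pow * r % q
    return out

def build_table(pattern, q, r):
    """Signatures of the pattern suffixes of length 2^i, 2^i + 2^j, ...; True marks the full pattern."""
    table, e = {}, len(pattern)
    while e:
        e -= 1 << (e.bit_length() - 1)
        table[signature(pattern[e:], q, r)] = (e == 0)
    return table

def match_ends_here(S, l, table, M, q, r):
    """Binary search over suffix lengths of t_0..t_l, extending on every 'Yes'."""
    length = 0
    for j in range(M.bit_length() - 1, -1, -1):
        cand = length + (1 << j)
        if cand > min(M, l + 1):
            continue                              # probe would leave the window
        start = l - cand + 1
        before = S[start - 1] if start > 0 else 0
        sig = (S[l] - before) * pow(r, -start, q) % q
        if sig in table:                          # "Is it in the HashTable?" Yes
            length = cand
            if table[sig]:
                return True
    return False

q, r = (1 << 61) - 1, 256
text = "xxabcdefgxabcdefg"
S = prefix_signatures(text, q, r)
table = build_table("abcdefg", q, r)
print([l for l in range(len(text)) if match_ends_here(S, l, table, 7, q, r)])
```

Each text position costs at most one probe per bit of M, which matches the logarithmic time bound stated for case 1.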

Page 31: Pattern Matching in the streaming model

Case 2: d<M

• In this case we will split our dictionary D into 2 dictionaries:
  – D1 – all the patterns shorter than d. On this dictionary we will run case 1.
  – D2 – all the patterns longer than d. We need only to deal with this case.

Page 32: Pattern Matching in the streaming model

Case 2: d<M - continue

For each $P_i$ in D2:

$P_i = a_0 a_1 a_2 \ldots a_{d-1} a_d \ldots a_m$

$SP_i = Sig(a_0 a_1 \ldots a_{d-1})$

Store $SP_i$ in the hash table.

Page 33: Pattern Matching in the streaming model

Case 2: d<M - continue

If P_i contains a periodic prefix of length more than d:

P_i = u u u u u u v . . a_m

[Figure: positions along the periodic prefix labeled "SP_i", and one position labeled "w.h.p. won't be SP_i"]

We store as well the number of times we need to see SP_i.

We will start a process which will seek P_i only after seeing SP_i enough times. Therefore the minimum number of characters we have to see between 2 processes of P_i is at least d.

Page 34: Pattern Matching in the streaming model

Case 2: d<M - continue

• We run the algorithm from the beginning of the lecture.

• Amortized, it takes O(1/d) time per pattern per text character.

• Overall, it takes O(1) amortized time per text character.

• With the lazy approach we get O(1) time in the worst case.

Page 35: Pattern Matching in the streaming model

Open problems

• Multi pattern search case 2 takes O(1) time, however case 1 takes O(log d)
  – Improve case 1 to be O(1)
  – With a heuristic, almost all of the dictionary takes O(1) time and O(1) space per pattern.

• Lower bound
  – We believe that the single pattern search lower bound is Ω(log m log δ)

• Supporting wildcards & mismatches