Improve sketching of Hamming Distance with Error Correcting

Ely Porat

Bar-Ilan University

Google Inc

Ohad Lipsky

Bar-Ilan University

Check Point Inc

December 2003

Problem Definition (1)

[Figure: Alice holds T_A and Bob holds T_B, each of length n.]

hamm(T_A, T_B)

Given: k, a bound on the number of mismatches.

Problem Definition (2)

[Figure: T_A and T_B, each of length n, with their sketches S_A and S_B.]

• Calculate hamm(T_A, T_B) given only S_A and S_B.

• Find the mistakes.

Given: k, a bound on the number of mismatches.

Motivations

• Databases

• Internet

• Error Correcting

[Figure: Router A, Router B, Router C and Router D.]

Outline:

• Simple Solution

• Error Correcting

• Improved Solution

• Improve more

• Recursion

• File sharing


Simplest Solution - O(k^2 log(1/δ))

• Binary alphabet.

• Allocate k^2 cells.

• Take the input array and hash each bit to one of the cells.

• In each cell, store the XOR of all the values hashed to it.
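The steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: it assumes the hash is a shared random table over positions, and that repeated runs are combined by keeping the largest estimate (collisions can only hide mismatches, never invent them). The names `make_hash`, `xor_sketch` and `sketch_distance` are made up for the example.

```python
import random

def make_hash(n, cells, seed):
    """Hypothetical shared hash: a random cell for every position 0..n-1."""
    rng = random.Random(seed)
    return [rng.randrange(cells) for _ in range(n)]

def xor_sketch(bits, h, cells):
    """Each cell stores the XOR of all the bits hashed to it."""
    sketch = [0] * cells
    for i, bit in enumerate(bits):
        sketch[h[i]] ^= bit
    return sketch

def estimate_hamming(sketch_a, sketch_b):
    """Count differing cells; exact when no two mismatches share a cell."""
    return sum(x != y for x, y in zip(sketch_a, sketch_b))

def sketch_distance(bits_a, bits_b, k, repetitions=8, seed=0):
    """Repeat with independent hashes and keep the largest estimate (assumed rule)."""
    cells = k * k
    best = 0
    for r in range(repetitions):
        h = make_hash(len(bits_a), cells, seed + r)
        est = estimate_hamming(xor_sketch(bits_a, h, cells),
                               xor_sketch(bits_b, h, cells))
        best = max(best, est)
    return best

bits_a = [0] * 64
bits_b = list(bits_a)
for pos in (5, 20, 41):          # three mismatches, within the bound k = 3
    bits_b[pos] ^= 1
estimate = sketch_distance(bits_a, bits_b, k=3)
```

A cell differs exactly when an odd number of mismatches land in it, so the estimate never exceeds the true distance and always has the same parity.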

[Figure: four cells holding 0 1 1 0.]

Simplest Solution - O(k^2 log(1/δ))

[Figure: two rows of cells, 1 1 0 0 and 0 1 0 0.]

Simplest Solution - O(k^2 log(1/δ))

• By the birthday paradox, the probability that two errors fall into the same cell is < 1/2.

• The log(1/δ) factor comes from repeating the scheme to get a failure probability of δ.
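The bound can be checked with a short calculation. It assumes the hash sends each position to a uniformly random cell, pairwise independently, and it writes δ for the target failure probability (the symbol is an assumption here); the union bound runs over all pairs of the at most k errors:

```latex
\[
\Pr[\text{two errors share a cell}]
  \;\le\; \binom{k}{2}\cdot\frac{1}{k^{2}}
  \;=\; \frac{k-1}{2k}
  \;<\; \frac{1}{2},
\qquad
\left(\tfrac{1}{2}\right)^{\log_{2}(1/\delta)} \;=\; \delta .
\]
```

So log_2(1/δ) independent repetitions, each failing with probability below 1/2, all fail together with probability below δ.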

[Figure: four cells holding 0 1 1 0.]

Alphabet

• Denote by S the size of the alphabet.

• We can encode each letter by its unary representation.

• The only effect is that each mistake will be counted twice.

0 - 1000000…0
1 - 0100000…0
...
S-1 - 0000000…1

Example: 0 - 1000000…0, 5 - 0000010…0
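A small Python illustration of the unary encoding and of why every mismatch is counted twice. The helper names and the sample texts are invented for the example.

```python
def unary_encode(text, alphabet_size):
    """Encode each letter c in 0..alphabet_size-1 as a block with a single 1 at index c."""
    bits = []
    for c in text:
        block = [0] * alphabet_size
        block[c] = 1
        bits.extend(block)
    return bits

def hamming(x, y):
    """Number of positions where the two sequences differ."""
    return sum(a != b for a, b in zip(x, y))

text_a = [0, 5, 2, 7]
text_b = [0, 3, 2, 7]                      # one mismatch, at position 1
bits_a = unary_encode(text_a, alphabet_size=8)
bits_b = unary_encode(text_b, alphabet_size=8)

letter_distance = hamming(text_a, text_b)  # mismatches between the letters
bit_distance = hamming(bits_a, bits_b)     # mismatches between the encodings
```

The mismatched letter turns one 1 into a 0 and one 0 into a 1, so `bit_distance` is twice `letter_distance`.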


Error correcting - O(k^2 log NS)

• Here we allocate two kinds of k^2 cells: k^2 cells of log S bits and k^2 cells of log NS bits.

[Figure: a row of C1 cells holding 5 8 3 2 and a row of C2 cells holding 15 6 7 8.]

C1[h(A[i])] += A[i]

C2[h(A[i])] += i*A[i]

Error correcting - O(k^2 log NS)

• As before, with probability > 1/2 no two errors fall into the same cell.

[Figure: a row of C1 cells holding 5 8 3 2 and a row of C2 cells holding 15 6 7 8.]

C1[h(A[i])] += A[i]

C2[h(A[i])] += i*A[i]

Error correcting - O(k^2 log NS)

• We get from the red cells:

C1[h(A[i])] += A[i]

[Figure: the two C1 rows, 5 8 3 2 and 5 6 3 2, differ in one cell; the mismatched letters are 5 and 3.]

8 - 6 = 5 - 3

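Error correcting - O(k^2 log NS)

• We get from the blue cells:

C2[h(A[i])] += i*A[i]

[Figure: the two C2 rows, 15 11 7 5 and 15 9 7 5, differ in one cell; the mismatched letters are 5 and 3; positions are numbered 0 1 2.]

11 - 9 = i*(5 - 3) => i = 1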

Error correcting - O(k2logNS)

• The probability of success is about 1/2.

• To lower the failure probability we will run it 3 times.

• We will get a list of possible mistakes each time.

• Output all the mistakes that appear in at least 2 of the 3 runs.
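A hedged Python sketch of the two kinds of cells and of the 2-out-of-3 vote. It assumes the hash is applied to the position i (both parties must route the same position to the same cell), uses plain integer sums, and invents the names `build_cells`, `find_mistakes` and `correct` for illustration.

```python
import random
from collections import Counter

def make_hash(n, cells, seed):
    """Hypothetical shared hash: a random cell for every position 0..n-1."""
    rng = random.Random(seed)
    return [rng.randrange(cells) for _ in range(n)]

def build_cells(text, h, cells):
    """C1 accumulates the letters, C2 accumulates position * letter."""
    c1 = [0] * cells
    c2 = [0] * cells
    for i, a in enumerate(text):
        c1[h[i]] += a
        c2[h[i]] += i * a
    return c1, c2

def find_mistakes(cells_a, cells_b, h):
    """Return {position: a_i - b_i} for cells that look like a single mistake."""
    (c1a, c2a), (c1b, c2b) = cells_a, cells_b
    found = {}
    for c in range(len(c1a)):
        d1 = c1a[c] - c1b[c]            # equals a_i - b_i for a lone mistake
        d2 = c2a[c] - c2b[c]            # equals i * (a_i - b_i) for a lone mistake
        if d1 == 0 or d2 % d1 != 0:
            continue
        i = d2 // d1
        if 0 <= i < len(h) and h[i] == c:
            found[i] = d1
    return found

def correct(text_a, text_b, k, runs=3):
    """Keep the mistakes reported by at least 2 of the runs."""
    cells = k * k
    votes = Counter()
    for seed in range(runs):
        h = make_hash(len(text_a), cells, seed)
        found = find_mistakes(build_cells(text_a, h, cells),
                              build_cells(text_b, h, cells), h)
        votes.update(found.items())
    return {pos: diff for (pos, diff), n_votes in votes.items() if n_votes >= 2}

text_a = [4, 5, 1, 7, 2, 6]
text_b = [4, 3, 1, 7, 2, 6]             # letter 5 became 3 at position 1
mistakes = correct(text_a, text_b, k=2)
```

With a single mismatch no collision is possible, so every run reports position 1 with difference 5 - 3 = 2.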


O(k log^2 k) - Solution

• The idea is two-stage hashing:

[Figure: the first stage hashes into k/log k cells; w.h.p. each cell gets O(log k) mismatches.]

Bar-Yossef, Jayram, Kumar, Sivakumar 03

O(k log^2 k) - Solution

[Figure: the O(log k) mismatches of one first-stage cell are hashed into O(log^2 k) cells; each cell keeps an accumulated XOR.]

The probability of failure is less than 1/2.

Run it 2 log k times and take the max.

=> failure probability less than 1/k^2

Space = O(log^3 k)

Bar-Yossef, Jayram, Kumar, Sivakumar 03

O(k log^2 k) - Solution

[Figure: k/log k first-stage cells, each with a second-stage structure of size O(log^3 k).]

Total space: O(k log^2 k)

P(Failure) <= k/log k * 1/k^2 < 1/k

Bar-Yossef, Jayram, Kumar, Sivakumar 03
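The arithmetic behind these bounds, written out as a worked calculation. It assumes logarithms to base 2, independent repetitions, and k > 2 so that log k > 1:

```latex
\[
\left(\tfrac{1}{2}\right)^{2\log_{2}k} = \frac{1}{k^{2}},
\qquad
\underbrace{O(\log^{2}k)}_{\text{cells per run}}\cdot\underbrace{2\log k}_{\text{runs}} = O(\log^{3}k),
\]
\[
\frac{k}{\log k}\cdot O(\log^{3}k) = O(k\log^{2}k),
\qquad
\frac{k}{\log k}\cdot\frac{1}{k^{2}} = \frac{1}{k\log k} < \frac{1}{k}.
\]
```

The last inequality is a union bound over the k/log k first-stage cells.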

O(k 2^(log* k) log k) - Idea (recursion)

[Figure: a first stage of k/log k cells, then a second stage of log k/log log k cells.]

Pr(F) < 1/log^c k

log k/log log k runs, take max

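Error Correcting O(k log NS)

[Figure: Alice holds T_A and Bob holds T_B, each of length n.]

r_0 r_1 r_2 …, with p = Θ(N^3 S)

\[
r_i = \begin{cases} \text{random} & \text{w.p. } 1/k \\ 0 & \text{o.w.} \end{cases}
\]

\[
\varphi_1(T_A) = \sum_{i=0}^{n-1} r_i a_i \bmod p,
\qquad
\varphi_1(T_B) = \sum_{i=0}^{n-1} r_i b_i \bmod p
\]

\[
\varphi_1(T_A) - \varphi_1(T_B) =
\begin{cases}
0 & \text{no mistake} \\
r_j (a_j - b_j) & \text{one mistake} \\
\text{random} & \text{more than one}
\end{cases}
\]

Constant probability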

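Error Correcting O(k log NS)

[Figure: Alice holds T_A and Bob holds T_B, each of length n.]

\[
\varphi_1(T_A) = \sum_{i=0}^{n-1} r_i a_i \bmod p,
\qquad
\varphi_1(T_B) = \sum_{i=0}^{n-1} r_i b_i \bmod p
\]

\[
\varphi_1(T_A) - \varphi_1(T_B) =
\begin{cases}
0 & \text{no mistake} \\
r_j (a_j - b_j) & \text{one mistake} \\
\text{random} & \text{more than one}
\end{cases}
\]

\[
\varphi'_1(T_A) = \sum_{i=0}^{n-1} i\, r_i a_i \bmod p,
\qquad
\varphi'_1(T_B) = \sum_{i=0}^{n-1} i\, r_i b_i \bmod p
\]

\[
\frac{\varphi'_1(T_A) - \varphi'_1(T_B)}{\varphi_1(T_A) - \varphi_1(T_B)}
= \frac{j\, r_j (a_j - b_j)}{r_j (a_j - b_j)} = j
\]

If we are wrong, then w.h.p. j > n.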

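Error Correcting O(k log NS)

[Figure: Alice holds T_A and Bob holds T_B, each of length n.]

\[
\frac{\varphi'_1(T_A) - \varphi'_1(T_B)}{\varphi_1(T_A) - \varphi_1(T_B)} = j
\]

From j we get r_j, and hence a_j - b_j.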
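The single-mistake decoding can be sketched in Python: each party keeps the two sums, the first weighted by r_i and the second by i*r_i, both mod p, and the quotient of the two differences gives the position j. This is an illustration under stated assumptions: the modulus `P` is a fixed prime chosen for the example (the slides only give its order of magnitude), a "random" r_i is taken to be a uniform nonzero residue, and all function names are hypothetical.

```python
import random

P = 2_147_483_647  # assumed prime modulus (2^31 - 1), standing in for p

def draw_coefficients(n, k, seed):
    """r_i is a random nonzero residue with probability 1/k, and 0 otherwise."""
    rng = random.Random(seed)
    return [rng.randrange(1, P) if rng.random() < 1 / k else 0 for _ in range(n)]

def sketch_pair(text, r):
    """Return (sum of r_i * a_i, sum of i * r_i * a_i), both mod P."""
    s = sum(ri * a for ri, a in zip(r, text)) % P
    s_prime = sum(i * ri * a for i, (ri, a) in enumerate(zip(r, text))) % P
    return s, s_prime

def recover(pair_a, pair_b, r, n):
    """Try to decode one mistake; return (j, (a_j - b_j) mod P) or None."""
    d = (pair_a[0] - pair_b[0]) % P
    d_prime = (pair_a[1] - pair_b[1]) % P
    if d == 0:
        return None                      # no sampled mistake (or it cancelled)
    j = d_prime * pow(d, -1, P) % P      # quotient of the two differences
    if j >= n or r[j] == 0:
        return None                      # more than one mistake was sampled
    return j, d * pow(r[j], -1, P) % P

# Hand-picked coefficients so the example is deterministic.
r = [0, 7, 0, 11]
text_a = [4, 5, 1, 7]
text_b = [4, 3, 1, 7]                    # letter 5 became 3 at position 1
result = recover(sketch_pair(text_a, r), sketch_pair(text_b, r), r, len(text_a))
```

A negative difference comes back as a residue mod `P`; mapping it to a signed value is left out of the sketch.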

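Error Correcting O(k log NS)

[Figure: Alice holds T_A and Bob holds T_B, each of length n.]

\[
\varphi_1(T_A), \varphi'_1(T_A);\quad
\varphi_2(T_A), \varphi'_2(T_A);\quad
\dots;\quad
\varphi_{ck \ln k}(T_A), \varphi'_{ck \ln k}(T_A)
\]

O(k ln k)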

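Recursion

[Figure: Alice holds T_A and Bob holds T_B, each of length n.]

First level, ck pairs:

\[
\varphi_1(T_A), \varphi'_1(T_A);\quad
\varphi_2(T_A), \varphi'_2(T_A);\quad
\dots;\quad
\varphi_{ck}(T_A), \varphi'_{ck}(T_A),
\qquad
r_i = \begin{cases} \text{random} & \text{w.p. } 1/k \\ 0 & \text{o.w.} \end{cases}
\]

Second level, ck/2 pairs:

\[
\varphi_1(T_A), \varphi'_1(T_A);\quad
\varphi_2(T_A), \varphi'_2(T_A);\quad
\dots;\quad
\varphi_{ck/2}(T_A), \varphi'_{ck/2}(T_A),
\qquad
r_i = \begin{cases} \text{random} & \text{w.p. } 2/k \\ 0 & \text{o.w.} \end{cases}
\]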

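Recursion

[Figure: Alice holds T_A and Bob holds T_B, each of length n.]

ck pairs: r_i random w.p. 1/k, 0 o.w.

ck/2 pairs: r_i random w.p. 2/k, 0 o.w.

ck/4 pairs: r_i random w.p. 4/k, 0 o.w.

\[
ck + \frac{ck}{2} + \frac{ck}{4} + \dots \le 2ck
\]

O(k log NS)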

Complexity

[Figure: T_A and T_B, each of length n, with their sketches S_A and S_B.]

Size: O(k log NS)

Computing sketch: O(n log k)

Comparing sketches: O(k log k)

O(k log k) - Solution

• We can just encode in unary and hash the input to k^3 cells, and then run the O(k log NS) = O(k log k) algorithm. The hashed input has length N = k^3 over a binary alphabet, so log NS = O(log k).

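Reed-Solomon Codes

\[
\begin{pmatrix} a_0 & a_1 & a_2 & \cdots & a_{n-1} \end{pmatrix}
\begin{pmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & 2 & 3 & \cdots & 2k \\
1 & 2^2 & 3^2 & \cdots & (2k)^2 \\
\vdots & \vdots & \vdots & & \vdots \\
1 & 2^{n-1} & 3^{n-1} & \cdots & (2k)^{n-1}
\end{pmatrix}
=
\begin{pmatrix} p(1) & p(2) & \cdots & p(2k) \end{pmatrix}
\]

\[
p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1}
\]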

We managed to develop a deterministic algorithm based on that, but the encoding and the decoding are slower.

Amir, Farach 95

Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01

Bar-Yossef, Jayram, Kumar, Sivakumar 03

Efremenko, Porat, Rothschild 06

Efremenko, Porat 07

File Sharing

[Figure: a source holding a file of n blocks; Napster.]

The source needs to stay until someone has the whole file (and is willing to stay).

There is a bottleneck at the end.

File Sharing

[Figure: a source holding a file of n blocks; emule/kazaa/torrent.]

The source has to send n ln n blocks before disconnecting.

Sometimes there are some bottlenecks.
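One way to read the n ln n figure is as a coupon-collector bound. That reading is an assumption, not something the slides state: if every block the source sends is a uniformly random one of the n blocks, the expected number of sends until every block has gone out at least once is n times the n-th harmonic number, which grows like n ln n. A short Python check, with a made-up function name:

```python
from fractions import Fraction
import math

def expected_blocks(n):
    """Expected number of uniformly random sends until all n blocks were sent: n * H_n."""
    return n * sum(Fraction(1, i) for i in range(1, n + 1))

n = 100
exact = float(expected_blocks(n))   # n * H_n, computed exactly and then rounded
approx = n * math.log(n)            # the n ln n leading term
```

For n = 100 the exact expectation is a little above n ln n; the gap is a lower-order term linear in n.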

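Improved File Sharing - Ver 1

[Figure: the source holds a file of n blocks a_0 a_1 a_2 … a_{n-1}.]

\[
p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1},
\qquad a_i \in F_{2^b}
\]

\[
(\alpha_0, p(\alpha_0)),\ (\alpha_1, p(\alpha_1)),\ (\alpha_2, p(\alpha_2)),\ \dots,\ (\alpha_{n^6}, p(\alpha_{n^6}))
\]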

Improved File Sharing - Ver 1

Each client that gets n points can recreate the file.

There is no more n ln n.

Almost no bottlenecks.
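A minimal Python sketch of why any n points are enough: the file blocks are the coefficients of a polynomial of degree below n, and n evaluations at distinct points determine it by Lagrange interpolation. For simplicity the example works modulo a small prime `P` rather than in the field F_{2^b} used on the slides; that substitution, the sample file and the function names are assumptions made for illustration.

```python
P = 257  # assumed small prime field, standing in for F_{2^b}

def evaluate(coeffs, x):
    """Horner evaluation of p(x) = a_0 + a_1 x + ... + a_{n-1} x^{n-1} mod P."""
    acc = 0
    for a in reversed(coeffs):
        acc = (acc * x + a) % P
    return acc

def interpolate(points):
    """Recover the n coefficients from n points with distinct x values."""
    n = len(points)
    coeffs = [0] * n
    for j, (xj, yj) in enumerate(points):
        basis = [1]                      # coefficients of prod_{m != j} (x - x_m)
        denom = 1
        for m, (xm, _) in enumerate(points):
            if m == j:
                continue
            # Multiply the basis polynomial by (x - xm), low degree first.
            basis = [(lo - xm * hi) % P
                     for lo, hi in zip([0] + basis, basis + [0])]
            denom = denom * (xj - xm) % P
        scale = yj * pow(denom, -1, P) % P
        for d in range(n):
            coeffs[d] = (coeffs[d] + scale * basis[d]) % P
    return coeffs

file_blocks = [10, 20, 30, 40]                        # n = 4 blocks
received = [(x, evaluate(file_blocks, x)) for x in (3, 7, 11, 200)]
recovered = interpolate(received)
```

Which four points arrive does not matter, as long as their x values are distinct.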

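Improved File Sharing - Ver 2

[Figure: the source holds a file of n blocks a_0 a_1 a_2 … a_{n-1}, with a_i ∈ F_{2^b}.]

Send linear equations on the file.

\[
\begin{pmatrix}
r_{0,0} & r_{0,1} & \cdots & r_{0,n-1} \\
r_{1,0} & r_{1,1} & \cdots & r_{1,n-1} \\
\vdots & \vdots & & \vdots \\
r_{n-1,0} & r_{n-1,1} & \cdots & r_{n-1,n-1}
\end{pmatrix}
\]

\[
\Pr[\text{success}]
= \left(1 - \frac{2^{b(n-1)}}{2^{bn}}\right)
  \left(1 - \frac{2^{b(n-2)}}{2^{bn}}\right)
  \cdots
  \left(1 - \frac{2^{b(n-i)}}{2^{bn}}\right)
  \cdots
  \left(1 - \frac{1}{2^{bn}}\right)
\ge 1 - 2^{-b+1}
\]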
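The success probability can be evaluated exactly for small parameters. The Python sketch below assumes the n received equations have independent, uniformly random coefficients in the field with 2^b elements, so that success means the n coefficient vectors are linearly independent; the function name is illustrative.

```python
from fractions import Fraction

def success_probability(n, b):
    """Probability that n uniformly random vectors of length n over the field with 2^b elements are independent."""
    q = 2 ** b
    prob = Fraction(1)
    for i in range(n):
        # The next vector must avoid the q^i vectors spanned by the previous i vectors.
        prob *= 1 - Fraction(q ** i, q ** n)
    return prob

exact = success_probability(n=10, b=8)
lower_bound = 1 - Fraction(1, 2 ** 7)   # 1 - 2^(-b+1) with b = 8
```

For b = 8 the exact value already lies above the bound, and larger fields push it closer to 1.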

Improved File Sharing - Ver 2

[Figure: the source holds a file of n blocks a_0 a_1 a_2 … a_{n-1}.]

Problems:
1. Heavy to encode: for each packet we need to go over the whole file.
2. Very heavy to decode: O(n^2) block operations + O(n^3) field operations.

Facts:
1. If you get n(1/2-ε) random combinations of two blocks, you won't have dependencies w.h.p.
2. If you have d pair combinations, you can easily reduce your system to n-d variables.

Solution: Use sparse functionals

Improved File Sharing - Ver 2

[Figure: the source holds a file of n blocks a_0 a_1 a_2 … a_{n-1}.]

Features:
1. Backward compatibility.
2. Even if you don't have the whole file, you can mix functionals.
