Data Stream Methods (Rutgers University, muthu/198-4.pdf)
Plan of attack
• Frequent Items / Heavy Hitters
• Counting Distinct Elements
• Clustering items in Streams
Motivating Distinct Elements
Many network flows between (source, dest) pairs
Want a snapshot at time t of the flows
This defines a (massive) vector, and we ask:
• Summarise the current state
• How does the state at time t compare with the state at time t'?
• Which past situation does this most resemble, etc.?
Counting Distinct Values
Application 1: Maintaining the number of distinct values in a relation with inserts and deletes.

Important to know the number of values for query optimization, approximate query answering, join size estimation, etc.

Fully dynamic case, with inserts and deletes: sampling from the relation itself has been shown to be inaccurate.

Computing the answer with a scan of the relation will be slow and will consume a lot of memory.
Application to Networks

Application 2: Many questions are possible about network streams:
• How many packet flows between distinct (source, destination) pairs?
• How many flows are losing packets (where packets in on one side do not equal packets out)?
• Denial of service attacks are signalled by large numbers of requests (from spoofed IPs), hence many distinct sources.

All of these can be solved by computing distinct values, or extensions thereof.
Exact Algorithm
• Keep an array, a[1..U], initially all zero
• Also keep a counter C
• Every time an item i arrives, look at a[i]
• If it is zero, increment C and set a[i] = 1
• Return C as the number of distinct items
• Time: O(1) per update and per query
• But space is O(U)
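The steps above transcribe directly into code; a minimal Python sketch (the class and method names are mine):

```python
# The exact algorithm above: O(1) per update and query, but O(U) space.
class ExactDistinct:
    def __init__(self, U):
        self.a = [0] * (U + 1)   # a[1..U], initially all zero
        self.C = 0               # counter of distinct items seen so far

    def update(self, i):
        if self.a[i] == 0:       # first arrival of item i
            self.a[i] = 1
            self.C += 1

    def query(self):
        return self.C
```

For example, feeding the stream 1, 3, 2, 1, 2, 3, 4 to `ExactDistinct(10)` and querying returns 4.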
Lower bound

• Use the same trick as last time: take a bitstring B and encode it as a stream: i is in the stream if B[i] = 1, and i is not in the stream otherwise.
• Feed this stream to the algorithm.
• To test whether any item is in the bitstring, keep a copy of the memory contents of the algorithm.
• Query the number of distinct items.
• Then send item i, and query again.
• If the number of distinct items has increased, then bit i = 0, else bit i = 1.
• Roll back the memory contents, and repeat...
Lower bounds contd.

This way we can extract the entire bitstring B, so the memory space must be at least U bits.

This holds even probabilistically: even if the procedure is allowed to be wrong with some probability, it still requires Ω(U) bits, by a reduction from another communication complexity problem, Index.

So can we make any progress on this problem? Yes, if we approximate: find some approximation d of the true answer D so that (1-ε)D < d < (1+ε)D with probability 1-δ.

If we can choose the parameters ε and δ, then this is a very powerful approximation scheme.
Probabilistic Counting
• The approach of Probabilistic Counting, due to Flajolet & Martin (1985), is a powerful method of approximating the number of distinct elements.
• A detailed analysis was given in the paper of Alon, Matias & Szegedy, 1996.
• Fairly simple to implement, and has some nice properties.
Probabilistic Counting
• The basic idea:
• Keep an array a[1..log U], initially 0.
• Use a hash function f: 1..U → 0..log U
• Compute f(i) for every item in the stream, and set a[f(i)] to 1
• Somehow extract from this the approximate number of distinct items.
Probabilistic Counting

What kind of hash function to use? We will use Universal hash functions from last time (remember, these can be represented in small space).

If we apply them directly, then how long before we have covered all cells in the array?

The coupon collector problem again... if the number of distinct items is more than (log U ln log U) then all cells will be covered.

Instead, we'll do something a bit different.
Probabilistic Counting

Suppose the probability of mapping item i to a[r] is 1/2^r.

Then ½ of the items fall in the first cell, ¼ in the second, 1/8 in the third, and so on.

(Each item falls in the same cell every time it is encountered, so it is as if only one copy of each distinct item arrives.)

If there are D distinct items, then we might expect log D cells to be occupied...

a: ½ ¼ 1/8 1/16 1/32 1/64 ...
Probabilistic Counting

Let's do this formally:
Let f be drawn from a family of strongly 2-Universal hash functions mapping onto 0..U-1.
Let r(x) be the function that returns the number of trailing zeros in the binary representation of x.
Hence r(12) = r(1100₂) = 2, and r(257) = 0.
For each item i in the data stream, set a[r(f(i))] = 1.
Let R be the maximum j such that a[j] = 1. Output 2^R as the approximate number of distinct values.
Example

Suppose the stream is 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1...
Let f(x) = 3x + 1 mod 5 (writing 5 in place of 0).
So the transformed stream (f applied to each item) is 4, 5, 2, 4, 2, 5, 3, 5, 4, 2, 5, 4.
We compute r of each item in the stream: 2, 0, 1, 2, 1, 0, 0, 0, 2, 1, 0, 2.
Hence: a[0] = 1, a[1] = 1, a[2] = 1, a[3] = 0, a[4] = 0.
So R = 2. Output 2² = 4.
(We got lucky this time, on a toy example. How will things work out in general? What can we prove about this approach?)
Probabilistic Counting

What is the expectation of the quantity R?

f() distributes uniformly over 0..U-1, so
  Pr[r(f(i)) ≥ j] = 2^(-j) = p_j

Let Z_j be the number of distinct items i in the stream for which r(f(i)) ≥ j.

By linearity of expectation, we get
  E(Z_j) = D(1·p_j + 0·(1-p_j)) = D·p_j = D/2^j

What is the variance? (We will use E(XY) = E(X)E(Y), due to pairwise independence.)
Variance

Var(X) = E(X²) – E(X)²
Var(X+Y) = E(X² + Y² + 2XY) – E(X+Y)²
         = E(X²) + E(Y²) + 2E(XY) – (E(X) + E(Y))²
         = E(X²) – E(X)² + E(Y²) – E(Y)² + 2E(X)E(Y) – 2E(X)E(Y)
         = Var(X) + Var(Y)   (for pairwise independence)

The variance for a single item i is
  p_j·1² – (1·p_j)² = p_j(1 – p_j) = 2^(-j)(1 – 2^(-j))

The variance for D of these is D·2^(-j)(1 – 2^(-j)) < D/2^j = E(Z_j)
Probability Bounds

Markov Inequality: for a random variable Y which takes only non-negative values,
  Pr[Y ≥ k] ≤ E(Y)/k
(This will be < 1 only for k > E(Y).)

Chebyshev's Inequality: for a random variable Y,
  Pr[|Y – E(Y)| ≥ k] ≤ Var(Y)/k²

Proof: Set X = (Y – E(Y))².
  E(X) = E(Y² + E(Y)² – 2Y·E(Y)) = E(Y²) + E(Y)² – 2E(Y)² = Var(Y)
So Pr[|Y – E(Y)| ≥ k] = Pr[(Y – E(Y))² ≥ k²]. Using Markov:
  ≤ E((Y – E(Y))²)/k² = Var(Y)/k²
Applying Probability Bounds

What are the chances that the highest entry, R, is too big (so we will overestimate)?
Then some entry a[j] = 1 with 2^j > cD for some constant c > 1.

Use the Markov inequality for the probability of this event occurring:
  Pr[Z_j ≥ 1] ≤ E(Z_j)/1 = D/2^j < 1/c
Applying Probability Bounds

What if the answer R is too small, so we underestimate?
Then some entry a[j] = 0 with 2^j < D/c.

What is the probability that entry j is zero?
  Pr[Z_j = 0] = Pr[|Z_j – E(Z_j)| ≥ E(Z_j)]
             ≤ Var(Z_j)/E(Z_j)²   (by Chebyshev)
             < E(Z_j)/E(Z_j)²
             = 1/E(Z_j) = 2^j/D < 1/c
Putting the bounds together

The probability that the answer we get is NOT between D/c and cD is at most 2/c.

Fairly weak bounds; in practice it performs pretty well, and tighter analysis is possible.

Some heuristics:
– Use the average of several different runs (do each run in parallel) using different hash functions
– Use the location of the smallest zero entry, not the highest one (need to do some scaling of the result)
A Different Approach

Start again with Distinct Items.
Represent the current state as a vector, and represent the problem as a problem on that vector:
• Initially, the vector is zero
• Add one to entry i when i arrives in the stream
• Subtract one from entry j when j departs from the stream

This is the exact algorithm we originally came up with.
• Distinct items = number of non-zero entries
• (Frequent items = indices of entries with value > n/k)
Vector Reduction
So, we have a vector a, which is being updated. Formally, we want to approximate
  |{i | a[i] ≠ 0}|

Suppose instead we computed the L_p norm of a (raised to the power p):
  Σ_i |a[i]|^p

What do we get?
p = 2: sum of squares
p = 1: sum of the (absolute) entries
p < 1? As p → 0: if a[i] = 0, we get 0, but if a[i] ≠ 0, we get 1.
Hamming Norm of a Stream

We call the number of non-zero entries in the vector the Hamming norm of the vector (we will explain why later). For a vector generated by a stream, call this the Hamming norm of the stream (= the number of distinct items, but more general).

Example, when the stream consists of integer updates:
(5,+3), (2,-1), (3,+2), (7,+9), (5,-2), (6,-1), (6,-3), (2,+1), (4,+2), (3,-2), (7,-5), (5,+2), (6,-2), (4,-3), (5,-1)

index:  1  2  3  4  5  6  7  8
value:  0  0  0 -1  2 -6  4  0

The Hamming norm of the stream is 4 (4 non-zero entries).
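The example can be checked by replaying the updates exactly (ignoring space constraints for the moment):

```python
from collections import defaultdict

# Replay the example update stream and count the non-zero entries of the
# resulting vector.
updates = [(5, +3), (2, -1), (3, +2), (7, +9), (5, -2), (6, -1), (6, -3),
           (2, +1), (4, +2), (3, -2), (7, -5), (5, +2), (6, -2), (4, -3), (5, -1)]

a = defaultdict(int)
for i, delta in updates:
    a[i] += delta            # (i, delta) means "add delta to entry i"

hamming_norm = sum(1 for v in a.values() if v != 0)
print(hamming_norm)          # 4
```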
Zeroing in on Hamming Norm

We can approximate the Hamming norm by finding the L_p norm to the power p, for small enough p, provided we guarantee that the total in any entry is < B.

The Hamming norm of vector a is |a|_H = Σ |a_i|^0, where 0^0 is defined to be 0.

The L_p norm of a vector is (Σ |a_i|^p)^(1/p) = ||a||_p

  |a|_H = Σ |a_i|^0 ≤ Σ |a_i|^p ≤ Σ B^p |a_i|^0 = B^p |a|_H

Setting B^p = (1+ε) means |a|_H ≤ ||a||_p^p ≤ (1+ε) |a|_H

This fixes p = ε / log B, so we can approximate the Hamming norm.
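A quick numeric check of the sandwich above, on a small example vector of my own. Here p is set from log(1+ε) so that B^p = 1+ε holds exactly; the slide's p = ε/log B is the small-ε approximation of this:

```python
import math

# Check |a|_H <= sum |a_i|^p <= (1+eps) |a|_H for integer entries bounded by B.
a = [0, 3, -5, 1, 0, 7]                      # example vector, each |a_i| <= B
B, eps = 7, 0.05
p = math.log(1 + eps) / math.log(B)          # so that B**p == 1 + eps

hamming = sum(1 for v in a if v != 0)        # |a|_H
lp_p = sum(abs(v) ** p for v in a if v != 0) # L_p norm to the power p

assert hamming <= lp_p <= (1 + eps) * hamming
```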
Intuition: Sum of squares

Suppose we compute the sum of squares of a, Σ a[i]².
This would be easy if we received all of each a[i] together.
Suppose we did store a.
Compute a vector r by drawing each entry of r from a Gaussian (Normal) distribution.
Compute s = r • a = Σ r[i]·a[i].
What is the expectation of s?
Gaussian Distribution

We know the following about sums of Gaussians: if X, Y are Gaussian then cX + dY is Gaussian, with
  expectation c·E(X) + d·E(Y)
  variance c²·Var(X) + d²·Var(Y)

So, if each member of r is drawn from Normal(0,1) (mean = 0, variance = 1), then:
  E(s) = 0
  Var(s) = a[1]² + a[2]² + ... = Σ a[i]²

Suppose we output s². What is E(s²)?
Expected Results

Var(X) = E(X²) – E(X)², so
  E(s²) = Var(s) + E(s)² = Σ a[i]² + 0

which is what we want. How does this help? We can compute s incrementally:
Initially s is 0.
When we see i arrive in the stream, we compute s = s + r[i].
After we have seen the whole of the stream, we have computed s = r • a, without explicitly storing a...
... but instead we have explicitly stored r.
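A runnable illustration of the incremental estimator. A single s² is very noisy, so this averages many independent trials; the toy stream, the vector length, and the number of repetitions are my choices:

```python
import random
import statistics
from collections import Counter

random.seed(0)
stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]          # items arriving one by one
true_sum_sq = sum(c * c for c in Counter(stream).values())  # sum of a[i]^2 = 42

def one_trial(U=4):
    r = [random.gauss(0.0, 1.0) for _ in range(U)]     # r[i] ~ N(0, 1)
    s = 0.0
    for i in stream:
        s += r[i - 1]          # the incremental update: s = s + r[i]
    return s * s               # E(s^2) = sum of a[i]^2

avg = statistics.mean(one_trial() for _ in range(5000))
# avg should land near true_sum_sq = 42
```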
Saving Space

We don't have to store r. We just need that every time we ask for r[i] we get the same answer.

Suppose we use a random number generator. Every time we ask for r[i], seed the rng with i, and extract a "random" number in [0..1]. Put this into the Box-Muller transform, which outputs a value drawn from N(0,1). Every time we do this, we will get the same result.

So it works, and the space is reduced.
Some Extra Details

Doing this for a single value will be fairly inaccurate. It is possible to improve the accuracy by repeating several times and taking the average.

The analysis is skipped in this presentation; one needs to analyze the variance (we'll analyze the next algorithm in painful detail).

Bottom line: to get an approximation that is correct within a factor of 1 ± ε with probability 1-δ requires space O(1/ε² log 1/δ).

We'd like to do the same thing for the sum of |a[i]|^p.
Recapping our approach

An exact answer is not possible in small space, so we find an approximate answer with probability guarantees. We will use statistical distributions with provable properties.

• Pairs (i, j) arrive (meaning "add j to location i")
• The total of values x_i is bounded: |x_i| < U

We will create a small summarizing "sketch" for the stream that allows distinct items to be approximated.
Stable Distributions
Let X be a random variable distributed with a stable distribution. Stable distributions have the property that

  a_1 X_1 + a_2 X_2 + a_3 X_3 + … + a_n X_n ~ ||(a_1, a_2, a_3, …, a_n)||_p · X

if X_1 … X_n are stable with stability parameter p.

The Gaussian distribution is stable with parameter 2.

Stable distributions exist and can be simulated for all parameters 0 < p < 2.

So, let x_{1,1} … x_{k,n} be a matrix of values drawn from a stable distribution with parameter p...
Computing the Sketch
• Sketch s is s_1 ... s_k for small k
• Initially 0
• When item i in the stream arrives, we update the sketch:
  s_j ← s_j + x_{j,i}   for j = 1 to k
• The result is
  s_j = x_j • a = a_1 X_1 + a_2 X_2 + a_3 X_3 + … + a_n X_n

So we get what we wanted.
Finding the Hamming Norm

We can use the sketch to extract the number of distinct items:
Compute median(|s_1|^p ... |s_k|^p).
We know each s_j is distributed as ||a||_p X, so
  median|s_j| is distributed as median(||a||_p |X|) = ||a||_p median(|X|)

• Bound the probability of being far from the correct answer
• We take the median of k = 3/ε² log 1/δ repeats
Probability Calculation
• Let min be defined by Pr[|X| < min] = ½ - ε
• Suppose Pr[|X| < median|s_j|] < ½ - ε
• Then median|s_j| < min
• Then at least k/2 values are smaller than min
• Define Y_i = 0 if |s_i| < min, 1 otherwise
• We want to know: what is Pr[Σ Y_i < k/2]?
• The Y_i are independent, with Pr[Y_i = 1] = ½ + ε
• E(Σ Y_i) = k(½ + ε)
Chernoff Bound
For independent 0/1 trials, X = Σ X_i and 0 < ρ < 1:
  Pr[X < (1-ρ)E(X)] < exp(-E(X)ρ²/2)

• Apply this here: we want to know Pr[Σ Y_i < k/2]
• k/2 = k(½ + ε)·(½)/(½ + ε) = E(Y)·½/(½ + ε) ~ (1 – 2ε)E(Y)
• So Pr[Σ Y_i < k/2] < exp(-k(½ + ε)ε²/2)
                    = exp(-3 log(1/δ)(½ + ε)/2)
                    < exp(-¾ log 1/δ) < δ/2
Using the bound

So Pr[ Pr[|X| < median|s_j|] < ½ - ε ] < δ/2.
We can make a similar argument to show the same δ/2 bound for the ½ + ε side.

Write F(x) for the cumulative distribution function of |X|. Then
  Pr[F(median|s_j|) ∈ [½ - ε, ½ + ε]] > 1-δ
  Pr[median|s_j| ∈ [F⁻¹(½ - ε), F⁻¹(½ + ε)]] > 1-δ

Since the derivative of F is bounded around the median, we conclude
  Pr[median|s_j| ∈ [F⁻¹(½)(1 – O(ε)), F⁻¹(½)(1 + O(ε))]] > 1-δ
Consequences
Pr[(1-ε) median|X| ≤ median|s_j| ≤ (1+ε) median|X|] > 1-δ

• The overall probability we are within (1 ± ε) is ≥ 1-δ
• This sets k = O(1/ε² log 1/δ) repetitions

But... we need to store all the x_{i,j}: O(kn) storage, which is more than just storing the vector a!
Reducing space needs
• x_{i,j} must be from a stable distribution with parameter p
• x_{i,j} must be the same every time it is used

We will generate values from a stable distribution by transforming from a uniform distribution.

Use a random number generator that is good enough so that f(x) appears to be drawn from a uniform distribution. Then
  x_{1,j} = stable(f(x)),  x_{2,j} = stable(f(f(x))),  x_{3,j} = stable(f(f(f(x)))),  etc.
Generating Stable Distributions

• Compute r1, r2 as uniform random variables in the range [0...1]
• Set θ = π(r1 - ½)
• Define
  stable(r1, r2, p) = sin(pθ)/cos(θ)^(1/p) · (cos(θ(1-p))/ln(1/r2))^((1-p)/p)
• stable(r1, r2, p) is distributed with a stable distribution with parameter p
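A direct transcription of the generator (this is the Chambers-Mallows-Stuck method; keeping r2 strictly inside (0,1) to avoid ln(1/0) is my addition):

```python
import math
import random

def stable(r1, r2, p):
    # One draw from a p-stable distribution from two uniforms r1, r2 in (0,1).
    theta = math.pi * (r1 - 0.5)
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
            * (math.cos(theta * (1.0 - p)) / math.log(1.0 / r2)) ** ((1.0 - p) / p))

# p = 1 recovers the Cauchy distribution: the formula collapses to tan(theta).
random.seed(0)
draws = [stable(random.uniform(1e-12, 1 - 1e-12),
                random.uniform(1e-12, 1 - 1e-12), 0.5) for _ in range(5)]
```

As a sanity check, for p = 1 and r1 = 0.75 we have θ = π/4 and the formula gives tan(π/4) = 1 regardless of r2.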
Guaranteed Accuracy

One estimate is not accurate (the variance is high), so repeat several times independently: keep k copies based on independent drawings of the vector x.

Store the values in the short L0 sketch, sk[1…k].

Find median_i(|sk[i]|), and scale by median(|Stable(p,0)|) = m.

Fix k = O(1/ε² log 1/δ). Then
  (1-ε) |a|_H ≤ median(sk)/m ≤ (1+ε)² |a|_H   with probability 1-δ
Complete Algorithm
initialize sk[1…k] = 0.0
for all tuples (i, j) do
    initialize random with i
    for s = 1 to k do
        r1 = random(); r2 = random()
        sk[s] = sk[s] + j·stable(r1, r2, p)
for s = 1 to k do
    sk[s] = |sk[s]|^p
return median(sk) · scalefactor(p)
Simple to implement, can run quickly, small space
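A self-contained Python rendering of the whole algorithm. The integer seeding scheme for replaying x_{s,i} and the empirically estimated scale factor are my choices for the sketch, not prescribed by the slides:

```python
import math
import random
import statistics

def stable(r1, r2, p):
    # Chambers-Mallows-Stuck draw from a p-stable distribution
    theta = math.pi * (r1 - 0.5)
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
            * (math.cos(theta * (1.0 - p)) / math.log(1.0 / r2)) ** ((1.0 - p) / p))

def x(s, i, p):
    # x_{s,i}: re-seeding with (s, i) replays the same value every time,
    # so the k-by-n matrix is never stored
    rng = random.Random(s * 1000003 + i)
    return stable(rng.uniform(1e-12, 1 - 1e-12), rng.uniform(1e-12, 1 - 1e-12), p)

def l0_sketch_estimate(updates, k=200, p=0.02):
    sk = [0.0] * k
    for i, j in updates:                     # (i, j) means "add j to entry i"
        for s in range(k):
            sk[s] += j * x(s, i, p)
    est = statistics.median(abs(v) ** p for v in sk)
    # scale factor median(|Stable(p,0)|^p), estimated empirically here
    rng = random.Random(0)
    m = statistics.median(
        abs(stable(rng.uniform(1e-12, 1 - 1e-12),
                   rng.uniform(1e-12, 1 - 1e-12), p)) ** p
        for _ in range(10001))
    return est / m
```

Run on the earlier example stream of 15 signed updates, this should return a value close to the true Hamming norm of 4.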
How to measure streams?
The state at any time defines a massive vector
• Hamming norm: Σ (x_i ≠ 0)
  The number of non-zero entries of the vector
• Union size: Σ (x_i + y_i ≠ 0)
• Hamming difference: Σ ((x_i - y_i) ≠ 0) = Σ (x_i ≠ y_i)
  This is the number of places where the vectors differ - a fundamental concept.
Properties
The difference and union of streams are easy to compute:
  sk(a + b) = sk(a) + sk(b)
  sk(a - b) = sk(a) - sk(b)
by linearity of the dot product, so we can approximate |a - b|_H and |a + b|_H with the same accuracy.

Space usage is small: the L0 sketch consists of O(1/ε² log 1/δ) counters.
Time per item is the cost of updating each counter: O(1/ε² log 1/δ).
Practical Use
So with O(k) space we can create a sketch to allow rapid comparison of huge streaming vectors. Note k << n; in fact, k is almost independent of n.

Implemented and tested in:
[C, Indyk, Koudas, Muthukrishnan '02] - on massive tabular data, looking for clusterings using sketch computations to speed up comparisons for L1, L2 and other Lp distances.
[C, Datar, Indyk, Muthukrishnan '02] - on streaming vectors, to count the number of distinct elements and to find the Hamming norm and Hamming distance.
Experimental Evaluation
Data Sets
• Generated synthetic data from Zipf distributions with a range of parameters
• Took real NetFlow data from one of AT&T's networks
• Each data stream was around 20MB; the working space was around a few KB

Parameters: we fixed p = 0.02 (as small as possible), which sets the scale factor median(|Stable(0.02,0)|) = 1.425
Existing Techniques
Compared against the “probabilistic counting”algorithm of Flajolet and Martin
+ Uses a similar amount of space
+ Operates in the data stream model
+ Fast per-item processing
– Can't cope with all situations (e.g. negative values)
– Can’t find the difference between two streams
Hamming Norm Tests
• Performance of our algorithm is better than FM85
• Improves with more workspace
• Somewhat slower in practice
• Shows that FM85 can't cope when values are allowed to be negative, but L0 sketches retain their accuracy.
• Good performance (~7% error), small memory cost
• Performance on finding the union of streams (not shown) is also good.
Conclusions
We examined techniques for computing the number of distinct items.

We can approximate the Hamming norm, the number of distinct items, and the Hamming difference with only a few KB of space.
Suitable for indexing streams
The "L0 sketch" can be used as a surrogate for the stream in other computations: clustering, searching, querying, all based only on the sketches.
Bonus Material: Dominance Norms
The "worst case influence" is important to know. Suppose we are receiving a number of signals. The stream consists of pairs (i, a_i), meaning signal i took value a_i.

Take the sum of the maximum of each signal: Σ_i max a_i (so not quite the cash register model).

We define this to be the dominance norm of the stream. Can we compute the dominance norm?
Example
Stream consists of:
(5,3), (2,1), (3,2), (7,9), (5,2), (6,1), (6,3), (2,1), (4,2), (3,2), (7,5), (5,2), (6,2), (4,3), (5,1)

Maximum value per signal:

signal i:  1  2  3  4  5  6  7  8
max ai:    0  1  2  3  3  3  9  0

Worst-case influence is 1+2+3+3+3+9 = 21. We will use counting distinct elements as a tool to help us answer this question.
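As a sanity check, the worst-case influence can be computed exactly in one pass by keeping a running maximum per signal (a small illustrative sketch, not part of the original slides):

```python
from collections import defaultdict

def dominance_norm(stream):
    """Exact dominance norm: running max per signal, then sum the maxima."""
    best = defaultdict(int)
    for i, a in stream:
        best[i] = max(best[i], a)
    return sum(best.values())

stream = [(5, 3), (2, 1), (3, 2), (7, 9), (5, 2), (6, 1), (6, 3), (2, 1),
          (4, 2), (3, 2), (7, 5), (5, 2), (6, 2), (4, 3), (5, 1)]
print(dominance_norm(stream))  # 21
```

This keeps one counter per signal, which is exactly the state the streaming algorithm below avoids.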
Approximating Dominance Norm
• Want to approximate the dominance norm, as before, up to a (1 ± ε) factor
• Consider just approximating for a single signal
• We see a stream of values for this signal
• Want to take the max of these
• Suppose we represent ai as 1+1+2+4+...+2^j
• Total = 2^(j+1), where 2^j ≤ ai < 2^(j+1)
A 2-approximation

We will insert symbols x0, x1, x2, ..., xj into separate distinct-elements algorithms D0, D1, D2, ..., Dj.

If we do this for every ai encountered, then each symbol xk is counted once, no matter how often it is inserted (the distinct-elements algorithms ignore repeats).

So we can compute max(ai) up to a factor of 2. If we do the same for every signal i, then we can compute the dominance norm up to a factor of 2:
Output D0 + D1 + 2·D2 + ... + 2^(j-1)·Dj
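The scheme can be sketched as follows. For illustration, exact Python sets stand in for the approximate distinct-element structures D0, ..., Dj; with real sketches (FM85 or L0 sketches) the space would be small, so the set-based version below only shows the logic, not the space saving:

```python
from collections import defaultdict

def approx_dominance(stream):
    # D[k] records which signals have taken a value >= 2^k.
    # Exact sets stand in for approximate distinct-element sketches,
    # so repeats of the same (signal, level) symbol are counted once.
    D = defaultdict(set)
    for i, a in stream:
        j = a.bit_length() - 1        # largest j with 2^j <= a
        for k in range(j + 1):        # insert signal i's symbol at levels 0..j
            D[k].add(i)
    # A signal whose max lies in [2^j, 2^(j+1)) contributes
    # 1 + 1 + 2 + ... + 2^(j-1) = 2^j, within a factor 2 of its max:
    # output D0 + D1 + 2*D2 + ... + 2^(j-1)*Dj.
    return len(D[0]) + sum(2 ** (k - 1) * len(D[k]) for k in range(1, len(D)))

stream = [(5, 3), (2, 1), (3, 2), (7, 9), (5, 2), (6, 1), (6, 3), (2, 1),
          (4, 2), (3, 2), (7, 5), (5, 2), (6, 2), (4, 3), (5, 1)]
print(approx_dominance(stream))  # 17, within a factor 2 of the true norm 21
```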
Generalizing

• Instead of powers of 2, we can use powers of (1+ε)
• This will allow us to make a (1+ε) approximation
Analysis
• How much space do we need?
• Suppose B is the maximum value seen
• Then we need j algorithms, with (1+ε)^j > B
• j = log B / log(1+ε) ≈ (log B)/ε
• Space for each algorithm = O(1/ε^2 · log 1/δ)
• Total space = O(1/ε^3 · log B · log 1/δ)
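To get a feel for the level count j = log B / log(1+ε), here is the arithmetic for an assumed maximum value B = 10^6 and ε = 0.1 (illustrative numbers, not from the slides):

```python
import math

B, eps = 10**6, 0.1
j_exact = math.log(B) / math.log(1 + eps)   # levels needed so (1+eps)^j > B
j_approx = math.log(B) / eps                # using log(1+eps) ~ eps for small eps
print(round(j_exact), round(j_approx))      # 145 and 138: the approximation is close
```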
Min dominance?
HW: Suppose instead you wish to compute the best-case influence.

That is, the sum of the minimum of each signal, Σi min ai.

Either design an efficient algorithm to solve this problem on the stream, or give a lower bound on the space required.
References

N. Alon, Y. Matias, M. Szegedy, “The Space Complexity of Approximating the Frequency Moments”, STOC 1996

G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan, “Comparing Data Streams Using Hamming Norms”, VLDB 2002

G. Cormode, P. Indyk, N. Koudas, S. Muthukrishnan, “Fast Mining of Tabular Data via Approximate Distance Computations”, ICDE 2002

P. Flajolet, G. N. Martin, “Probabilistic Counting”, FOCS 1983

P. Indyk, “Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computations”, FOCS 2000

J. Nolan, “An Introduction to Stable Distributions” (on web)