1
Finding (Recently) Frequent Items in Distributed Data Streams
Amit Manjhi, Vladislav Shkapenyuk, Kedar Dhamdhere, Christopher Olston
CMU-CS-05
Speaker: 董原賓  Advisor: 柯佳伶
2
Introduction
Mining distributed data streams
Challenge: transfer among nodes is costly
Goal: minimize communication requirements
Applications:
Monitor usage in large-scale distributed systems, e.g., a Content Delivery Network
Detect malicious activities in networked systems, e.g., worms
3
Definition
Data streams S1, S2, …, Sm, m ≧ 1
R: root node
ε: error tolerance, 0 ≦ ε ≦ s
s: minimum support
c(u): frequency of item u in S, u ∈ universe U of items
N = ∑u∈U c(u)
ĉ(u): estimated count of c(u), satisfying max{0, c(u) − ε·N} ≦ ĉ(u) ≦ c(u)
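The single-stream estimates ĉ(u) with this guarantee can be produced by lossy counting, the algorithm the talk's Algorithms 1a/2a build on. A minimal sketch (class name and structure are illustrative, not from the paper):

```python
from math import ceil

class LossyCounter:
    """Lossy counting sketch: guarantees max{0, c(u) - eps*N} <= estimate(u) <= c(u)."""

    def __init__(self, eps):
        self.eps = eps              # error tolerance
        self.n = 0                  # items seen so far (N)
        self.counts = {}            # item -> (count, max possible undercount at insertion)
        self.bucket = 1             # current bucket id
        self.width = ceil(1 / eps)  # bucket width

    def add(self, item):
        self.n += 1
        if item in self.counts:
            f, delta = self.counts[item]
            self.counts[item] = (f + 1, delta)
        else:
            self.counts[item] = (1, self.bucket - 1)
        # at bucket boundaries, prune entries that may be infrequent
        if self.n % self.width == 0:
            self.counts = {u: (f, d) for u, (f, d) in self.counts.items()
                           if f + d > self.bucket}
            self.bucket += 1

    def estimate(self, item):
        # 0 for pruned/unseen items; never overestimates the true count
        return self.counts.get(item, (0, 0))[0]
```

Space usage stays bounded because every pruning pass discards items whose stored count cannot exceed ε·N.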
4
Definition
T: a period of time units, an epoch
α: decay rate, 0 < α ≦ 1
(ε,α)-synopsis: consists of S:ĉ(u) and S:n
l ≧ 2: number of levels in the hierarchy
ε ≧ ε1 ≧ ε2 ≧ … ≧ εl−1
d ≧ 2: fanout (degree) of all non-leaf nodes in the hierarchy
5
Main Approach
1. Every monitor node Mi runs a single-stream approximate frequency counting algorithm
2. The m monitor nodes M1, M2, …, Mm relay data every T time units toward the central root node R
3. Every T time units, each monitor node sends its (εl−1,1)-synopsis to its parent node
6
Main Approach
4. Each parent node combines the d (εl−1,1)-synopses from its d children into a single (εl−2,1)-synopsis using
Algorithm 1a (based on lossy counting) or Algorithm 1b (based on majority counting)
5. The root node combines the d (ε1,1)-synopses using
Algorithm 2a (based on lossy counting) or Algorithm 2b (based on majority counting)
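The combining step of Algorithm 1a can be sketched as follows: sum the children's counts and stream lengths, subtract the extra slack introduced by moving from the child precision to the coarser parent precision, and prune counts that drop to zero or below. The (counts_dict, n) representation is an illustrative modeling choice, not the paper's data structure:

```python
def merge_synopses(children, eps_child, eps_parent):
    """Combine child (eps_child,1)-synopses into one (eps_parent,1)-synopsis,
    Algorithm 1a style (lossy-counting-based).  `children` is a list of
    (counts_dict, n) pairs with eps_parent >= eps_child."""
    total_n = sum(n for _, n in children)
    merged = {}
    for counts, _ in children:
        for item, c in counts.items():
            merged[item] = merged.get(item, 0) + c
    # extra undercount allowed at the parent's coarser precision
    slack = (eps_parent - eps_child) * total_n
    return {u: c - slack for u, c in merged.items() if c - slack > 0}, total_n
```

On the example of slide 9 (S1 and S2 with ε2 = 0.03 merged at ε1 = 0.05), only I1 survives, with count ≈ 8.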
7
Algorithm 1a and 1b
8
Example
Use Algorithm 1a with 27 distinct items (I1~I27), partitioned into categories:
A: contains I1
B: contains I2~I14
C: contains I15~I27
ε1 ≈ ε = 0.05, ε2 = 0.03
9
Example
S1 = S3 = { I1:9, I2~I14:6, I15~I27:1 }
S2 = S4 = { I1:9, I2~I14:1, I15~I27:6 }
S1:n = S2:n = S3:n = S4:n = 100
The lossy counting algorithm undercounts each item's frequency by up to ε2·100 = 0.03·100 = 3:
S1 = S3 = { I1:6, I2~I14:3, I15~I27:0 }
S2 = S4 = { I1:6, I2~I14:0, I15~I27:3 }
Items whose counts fall to zero or below are eliminated:
S1 = S3 = { I1:6, I2~I14:3 }
S2 = S4 = { I1:6, I15~I27:3 }
Link load M1→l1 = 14 (each of the four leaf links carries 14 counts)
l1 = { I1:8 }
l2 = { I1:8 }
l1:n = S1:n + S2:n = 100 + 100 = 200
l1:ĉ(I1) = S1:ĉ(I1) + S2:ĉ(I1) = 6 + 6 = 12
l1:ĉ(I1) = l1:ĉ(I1) − (ε1 − ε2)·l1:n = 12 − (0.05 − 0.03)·200 = 12 − 4 = 8
l1:ĉ(I2) = S1:ĉ(I2) + S2:ĉ(I2) = 3 + 0 = 3
l1:ĉ(I2) = l1:ĉ(I2) − (ε1 − ε2)·l1:n = 3 − (0.05 − 0.03)·200 = 3 − 4 = −1 → deleted
Link load l1→R = 1 (each of the two links into R carries 1 count)
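The arithmetic above can be checked directly (a transcription of the slide's numbers, exact up to floating-point rounding):

```python
# Reproduce the slide's computation for link l1.
eps1, eps2 = 0.05, 0.03
n_l1 = 100 + 100              # l1:n = S1:n + S2:n = 200
slack = (eps1 - eps2) * n_l1  # extra undercount introduced at l1: 4
c_I1 = 6 + 6 - slack          # 12 - 4 = 8   -> I1 is kept
c_I2 = 3 + 0 - slack          # 3 - 4 = -1   -> I2 is deleted
```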
10
Example
11
Minimizing Total Load on the Root Node
Use Algorithm 1a at all applicable nodes and set εi = 0 for all 2 ≦ i ≦ l−1; we term this strategy MinRootLoad
12
Minimizing Worst Case Maximum Load on Any Link
Definitions: for 2 ≦ i ≦ l − 2, Δi = εi − εi+1, and Δl−1 = εl−1
I: the contents of all input streams S1, …, Sm
𝕀: the set of all possible instances of I
Communication hierarchy T, defined by degree d and number of levels l
w: maximum load on any link
Worst-case load W = max{ w(I) : I ∈ 𝕀 }
13
Minimizing Worst Case Maximum Load on Any Link
𝕀wc: the set of worst-case inputs; for any I ∈ 𝕀 − 𝕀wc and any I′ ∈ 𝕀wc, w(I) ≦ w(I′)
Worst-case properties:
For any two input streams Si and Sj, there is no item occurrence common to both Si and Sj
For any input stream Si, all items occurring in Si occur with equal frequency
For any two input streams Si and Sj, both the number of item occurrences and the number of distinct items in Si and Sj are equal
14
Minimizing Worst Case Maximum Load on Any Link
Level X: if all counts are dropped at level X, the most heavily loaded link(s) are the ones leading to level X
15
Minimizing Worst Case Maximum Load on Any Link
After solving the equation, we obtain Δi = ε1 · …, for 2 ≦ i ≦ l−1, and
Δl−1 = ε1 · …
The maximum possible load on any link is Lwc = …; we term this strategy MinMaxLoad_WC
16
Good Precision Gradients for Non-Worst-Case Inputs
Real data is unlikely to exhibit worst-case characteristics
Two extreme opposite cases:
1. Items on each input stream are disjoint
Solution: use strategy MinMaxLoad_WC
2. All input streams contain identical distributions of items
Solution: there is no benefit to delaying pruning; set ε1 = ε2 = … = εl−1 = ε (strategy SS2)
17
Good Precision Gradients for Non-Worst-Case Inputs
Most real-world data falls somewhere between these two extremes
Li: the number of locally frequent items in Si
Gi: the number of items that are locally frequent in Si and globally frequent in S, where S = S1 ∪ S2 ∪ … ∪ Sm
Commonality parameter γ ∈ [0, 1]
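The slide's formula for γ is an image and did not survive extraction; a plausible reading, stated here as an assumption rather than the paper's exact definition, is the fraction of locally frequent items that are also globally frequent:

```python
def commonality(L, G):
    """Assumed form of the commonality parameter: gamma = sum(G_i) / sum(L_i),
    where L[i] counts the locally frequent items in stream S_i and G[i] counts
    those of them that are also globally frequent.  gamma = 1 means every
    locally frequent item is globally frequent (identical streams); gamma near
    0 means almost none are (largely disjoint streams)."""
    return sum(G) / sum(L)
```

On the talk's data sets, γ ranged from 0.571 (Bulletin Board) to 0.839 (Auction).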
18
Good Precision Gradients for Non-Worst-Case Inputs
A natural hybrid strategy is to use a linear combination of MinMaxLoad_WC and SS2:
Set εi = (1 − γ) · …
εl−1 = (1 − γ) · …
We term this hybrid strategy MinMaxLoad_NWC
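The εi formulas on the slide are images and are lost; purely as an illustrative assumption, a convex combination of the two strategies' precision settings could take this shape:

```python
def hybrid_eps(eps_wc_i, eps, gamma):
    """Illustrative assumption, not the paper's exact formula: blend the
    MinMaxLoad_WC setting for level i (eps_wc_i) with the uniform SS2
    setting (eps), weighted by the commonality parameter gamma.
    gamma = 0 recovers MinMaxLoad_WC; gamma = 1 recovers SS2."""
    return (1.0 - gamma) * eps_wc_i + gamma * eps
```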
19
Experiment
Data sets:
1. Traffic logs from the Internet, used to identify hosts receiving large numbers of packets recently; data were collected for one full day
2. Java Servlet versions of two publicly available dynamic Web application benchmarks:
RUBiS, modeled after eBay, an online auction site, and RUBBoS, modeled after Slashdot, an online bulletin board
Each benchmark ran for 40 hours
20
Experiment
Simulated environment: 216 monitoring nodes (m = 216)
Communication hierarchy of fanout six (d = 6) consisting of four levels (l = 4)
Set s = 0.01, ε = 0.1·s, and ε1 = 0.9·ε
The epoch T is 5 minutes for data set 1 and 15 minutes for data set 2
Data characteristics: data set 1 (Internet traffic) γ = 0.675; data set 2, Auction γ = 0.839 and Bulletin Board γ = 0.571