1 finding (recently) frequent items in distributed data streams amit manjhi, vladislav shkapenyuk,...

1

Finding (Recently) Frequent Items in Distributed Data Streams

Amit Manjhi, Vladislav Shkapenyuk, Kedar Dhamdhere, Christopher Olston

CMU-CS-05

Speaker: 董原賓
Advisor: 柯佳伶

2

Introduction

Mining distributed data streams
Challenge: transfer among nodes is costly
Goal: minimize communication requirements

Applications:
Monitor usage in large-scale distributed systems (e.g. Content Delivery Networks)
Detect malicious activities in networked systems (e.g. worm detection)

3

Definition
Data streams S1, S2, …, Sm, m ≧ 1
R: root node
s: minimum support
ε: error tolerance, 0 ≦ ε ≦ s
c(u): frequency of item u in S, u ∈ universe U of items
N = ∑u∈U c(u)
ĉ(u): estimate of the count c(u), with max{0, c(u) − ε·N} ≦ ĉ(u) ≦ c(u)
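The error bound above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `within_tolerance` and the toy stream are illustrative assumptions. It tests whether a set of estimates ĉ(u) satisfies max{0, c(u) − ε·N} ≦ ĉ(u) ≦ c(u) for every item.

```python
from collections import Counter

def within_tolerance(stream, estimates, eps):
    """Check max{0, c(u) - eps*N} <= c_hat(u) <= c(u) for every item u."""
    true_counts = Counter(stream)
    n = len(stream)  # N = sum of c(u) over all items
    for u, c in true_counts.items():
        c_hat = estimates.get(u, 0)  # a missing item is estimated as 0
        if not (max(0, c - eps * n) <= c_hat <= c):
            return False
    # Items that never occur have c(u) = 0, so their estimate must be 0 too.
    return all(u in true_counts or v == 0 for u, v in estimates.items())

stream = ["a"] * 6 + ["b"] * 3 + ["c"]  # N = 10
print(within_tolerance(stream, {"a": 5, "b": 2}, eps=0.1))  # True
print(within_tolerance(stream, {"a": 4, "b": 2}, eps=0.1))  # False
```

With N = 10 and ε = 0.1, each estimate may undercount by at most 1, so ĉ(a) = 4 violates the bound while ĉ(a) = 5 does not.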

4

Definition
T: length of an epoch, in time units
α: decay rate, 0 < α ≦ 1
(ε,α)-synopsis of S: consists of the estimated counts S:ĉ(u) and the total item count S:n
l ≧ 2: number of levels in the hierarchy
ε ≧ ε1 ≧ ε2 ≧ … ≧ ε_{l-1}
d ≧ 2: fanout (degree) of all non-leaf nodes in the hierarchy

5

Main Approach

1. Every monitor node Mi uses a single-stream approximate frequency counting algorithm

2. m monitor nodes M1,M2,…,Mm relay data every T time units to central root node R

3. Every T time units, each monitor node sends its (ε_{l-1},1)-synopsis to its parent node
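Step 1 does not say which single-stream algorithm is used. As an illustration only, the sketch below assumes standard lossy counting (the slides name lossy counting as the basis of Algorithm 1a) with no decay (α = 1). The function name `lossy_count` and the returned `(counts, n)` pair, standing in for a synopsis, are assumptions of this sketch.

```python
import math

def lossy_count(stream, eps):
    """Return (estimates, n); each estimate undercounts by at most eps * n."""
    width = math.ceil(1 / eps)  # bucket width
    entries = {}                # item -> (count, maximum possible undercount)
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)  # id of the current bucket
        if item in entries:
            count, delta = entries[item]
            entries[item] = (count + 1, delta)
        else:
            # The item may have been pruned in any earlier bucket.
            entries[item] = (1, bucket - 1)
        if n % width == 0:
            # Bucket boundary: drop entries that cannot exceed the threshold.
            entries = {u: (c, d) for u, (c, d) in entries.items()
                       if c + d > bucket}
    return {u: c for u, (c, _) in entries.items()}, n

stream = ["a"] * 6 + ["b"] * 3 + ["c"]
print(lossy_count(stream, eps=0.2))  # ({'a': 6, 'b': 3}, 10)
```

The rare item `c` is pruned at the second bucket boundary, which is the kind of undercounting the later example relies on.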

6

Main Approach

4. The parent node combines the d (ε_{l-1},1)-synopses from its d children into a single (ε_{l-2},1)-synopsis using
Algorithm 1a (based on lossy counting) or Algorithm 1b (based on majority counting)

5. The root node combines the d (ε1,1)-synopses using
Algorithm 2a (based on lossy counting) or Algorithm 2b (based on majority counting)

7

Algorithm 1a and 1b

8

Example

Use Algorithm 1a
27 distinct items (I1~I27), partitioned into three categories:
A: contains I1
B: contains I2~I14
C: contains I15~I27

ε1 ≈ ε = 0.05, ε2 = 0.03

9

Example

Input streams (item:count):
S1 = S3 = { I1:9, I2~I14:6, I15~I27:1 }
S2 = S4 = { I1:9, I2~I14:1, I15~I27:6 }
S1:n = S2:n = S3:n = S4:n = 100

In this example, the lossy counting algorithm undercounts each item's frequency by ε2 · 100 = 0.03 · 100 = 3:
S1 = S3 = { I1:6, I2~I14:3, I15~I27:0 }
S2 = S4 = { I1:6, I2~I14:0, I15~I27:3 }

Items whose counts fall to zero or below are eliminated:
S1 = S3 = { I1:6, I2~I14:3 }
S2 = S4 = { I1:6, I15~I27:3 }

Link load M1 → l1 = 14 (I1 plus 13 other items); each of the four links from a monitor node to its parent carries 14 counts

At node l1:
l1:n = S1:n + S2:n = 100 + 100 = 200

For I1:
l1:ĉ(I1) = S1:ĉ(I1) + S2:ĉ(I1) = 6 + 6 = 12
l1:ĉ(I1) = l1:ĉ(I1) − (ε1 − ε2) · l1:n = 12 − (0.05 − 0.03) · 200 = 12 − 4 = 8

For I2:
l1:ĉ(I2) = S1:ĉ(I2) + S2:ĉ(I2) = 3 + 0 = 3
l1:ĉ(I2) = l1:ĉ(I2) − (ε1 − ε2) · l1:n = 3 − (0.05 − 0.03) · 200 = 3 − 4 = −1, so I2 is deleted

Result:
l1 = { I1:8 }
l2 = { I1:8 }

Link load l1 → R = 1; each of the two links into the root carries 1 count
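The combining step in this example can be sketched in code. This is a reconstruction inferred from the arithmetic shown here (sum the children's counts and n, subtract (ε1 − ε2)·n, delete non-positive counts); it is not a transcription of Algorithm 1a, and the function name `combine` and the `(counts, n)` representation are assumptions. `Fraction` is used so that 0.05 − 0.03 is exact.

```python
from fractions import Fraction

def combine(children, eps_parent, eps_child):
    """Merge child synopses, each a (counts, n) pair, into one parent synopsis."""
    n = sum(child_n for _, child_n in children)
    totals = {}
    for counts, _ in children:
        for u, c in counts.items():
            totals[u] = totals.get(u, 0) + c
    cut = (eps_parent - eps_child) * n  # extra undercount allowed at the parent
    merged = {u: c - cut for u, c in totals.items() if c - cut > 0}
    return merged, n

# Synopses sent by M1 and M2 in the example (14 counts each).
S1 = {"I1": 6, **{f"I{i}": 3 for i in range(2, 15)}}   # I2~I14
S2 = {"I1": 6, **{f"I{i}": 3 for i in range(15, 28)}}  # I15~I27

l1, n = combine([(S1, 100), (S2, 100)], Fraction("0.05"), Fraction("0.03"))
print({u: int(c) for u, c in l1.items()}, n)  # {'I1': 8} 200
```

Only I1 survives at node l1, so the link from l1 to the root carries a single count, matching the link load of 1 in the example.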

10

Example

11

Minimizing Total Load on the Root Node

Use Algorithm 1a at all applicable nodes and set εi = 0 for all 2 ≦ i ≦ l−1. We term this strategy MinRootLoad.

12

Minimizing Worst Case Maximum Load on Any Link

Definition: Δi = εi − εi+1 for 2 ≦ i ≦ l−2, and Δ_{l-1} = ε_{l-1}

I: the contents of all input streams S1, …, Sm

ℐ: the set of all possible instances of I

T: communication hierarchy, defined by degree d and number of levels l

w: maximum load on any link

W: worst-case load

13


Iwc: denotes the set of worst-case inputs; for all instances I ∈ ℐ − Iwc and I′ ∈ Iwc, for any

Worst-case properties:

1. For any two input streams Si and Sj, there is no item occurrence common to both Si and Sj

2. For any input stream Si, all items occurring in Si occur with equal frequency

3. For any two input streams Si and Sj, both the number of item occurrences and the number of distinct items in Si and Sj are equal
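The worst-case conditions listed above are simple to test on concrete data. A minimal sketch follows, with streams represented as Python lists of item occurrences; the function name `satisfies_worst_case` is an assumption made for illustration.

```python
from collections import Counter

def satisfies_worst_case(streams):
    """Return True if the streams meet all of the worst-case conditions."""
    counts = [Counter(s) for s in streams]
    # No item may occur in two different streams.
    seen = set()
    for c in counts:
        items = set(c)
        if seen & items:
            return False
        seen |= items
    # Within each stream, all items must occur with equal frequency.
    if any(len(set(c.values())) > 1 for c in counts):
        return False
    # All streams must agree on number of occurrences and of distinct items.
    sizes = {(sum(c.values()), len(c)) for c in counts}
    return len(sizes) <= 1

print(satisfies_worst_case([["a", "a", "b", "b"], ["c", "c", "d", "d"]]))  # True
print(satisfies_worst_case([["a", "b"], ["a", "c"]]))                      # False
```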

14


Level X: the level at which all counts are dropped. The most heavily loaded link(s) are the ones leading to level X.

15


After solving the equation, we obtain Δi = ε1 · , 2 ≦ i ≦ l-1 and

Δl-1 = ε1 ·

The maximum possible load on any link is Lwc. We term this strategy MinMaxLoad_WC.

16

Good Precision Gradients for Non-Worst-Case Inputs

Real data is unlikely to exhibit worst-case characteristics

Two opposite extreme cases:

1. Items on each input stream are disjoint
Solution: use strategy MinMaxLoad_WC

2. All input streams contain identical distributions of items
Solution: there is no benefit to delaying pruning, so set ε1 = ε2 = … = ε_{l-1} = ε (strategy SS2)

17


Most real-world data falls somewhere between these two extremes

Li: the number of locally frequent items in Si

Gi: the number of items that are locally frequent in Si and globally frequent in S, where S = S1 ∪ S2 ∪ … ∪ Sm

Commonality parameter γ ∈ [0, 1]
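The quantities Li and Gi can be computed directly from the streams. The sketch below makes two assumptions that are not stated on this slide: an item counts as frequent when its count is at least s times the stream length, and γ is taken as the ratio (∑ Gi) / (∑ Li), which is consistent with γ ∈ [0, 1] but is not confirmed by the slide. The function name `commonality` is illustrative.

```python
from collections import Counter

def commonality(streams, s):
    """Return (L, G, gamma) for minimum support s."""
    local = [Counter(st) for st in streams]
    total = Counter()
    for c in local:
        total.update(c)
    n_total = sum(total.values())
    # ASSUMPTION: "frequent" means count >= s * (number of occurrences).
    global_frequent = {u for u, c in total.items() if c >= s * n_total}
    L, G = [], []
    for c in local:
        n_i = sum(c.values())
        frequent_here = {u for u, cnt in c.items() if cnt >= s * n_i}
        L.append(len(frequent_here))
        G.append(len(frequent_here & global_frequent))
    # ASSUMPTION: gamma = sum(G_i) / sum(L_i); the slide's formula is missing.
    gamma = sum(G) / sum(L) if sum(L) else 0.0
    return L, G, gamma

identical = [["a"] * 8 + ["b"] * 2] * 2
disjoint = [["a"] * 10, ["b"] * 10, ["c"] * 10]
print(commonality(identical, s=0.5))  # ([1, 1], [1, 1], 1.0)
print(commonality(disjoint, s=0.5))   # ([1, 1, 1], [0, 0, 0], 0.0)
```

Under these assumptions, identical streams give γ = 1 and disjoint streams give γ = 0, matching the two extreme cases.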

18


A natural hybrid strategy is to use a linear combination of MinMaxLoad_WC and SS2

Set εi = (1 - γ) ·

εl-1 = (1 - γ) ·

We term this hybrid strategy MinMaxLoad_NWC

19

Experiment

Data sets:

1. Traffic logs from the Internet, used to identify hosts that recently received large numbers of packets
Data were collected for one full day

2. Java Servlet versions of two publicly available dynamic Web application benchmarks:
RUBiS, modeled after eBay, an online auction site
RUBBoS, modeled after Slashdot, an online bulletin board
Each benchmark was run for 40 hours

20

Experiment

Simulated environment:
216 monitoring nodes (m = 216)
Communication hierarchy of fanout six (d = 6) consisting of four levels (l = 4)
s = 0.01, ε = 0.1·s, and ε1 = 0.9·ε
An epoch T is 5 minutes for data set 1 and 15 minutes for data set 2

Data characteristics:
Data set 1 (Internet): γ = 0.675
Data set 2: Auction γ = 0.839, Bulletin Board γ = 0.571
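As a quick consistency check on the simulated environment, the following sketch derives the number of monitor nodes and the error tolerances from d, l and s. It assumes a complete d-ary hierarchy in which all monitor nodes sit at the bottom level; the helper name `level_sizes` is hypothetical.

```python
def level_sizes(d, l):
    """Number of nodes at each level of a complete d-ary hierarchy with l levels."""
    return [d ** i for i in range(l)]

d, l = 6, 4
sizes = level_sizes(d, l)
m = sizes[-1]        # monitor nodes at the bottom level
s = 0.01             # minimum support
eps = 0.1 * s        # overall error tolerance
eps1 = 0.9 * eps     # tolerance for synopses arriving at the root

print(sizes)  # [1, 6, 36, 216]
print(m)      # 216
```

With d = 6 and l = 4 the bottom level has 6³ = 216 nodes, which agrees with m = 216.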

21

Experiment

22

Experiment

23

Experiment

24

Experiment
