Finding (Recently) Frequent Items in Distributed Data Streams

Amit Manjhi, Vladislav Shkapenyuk, Kedar Dhamdhere, Christopher Olston

CMU-CS-05

Speaker: 董原賓
Advisor: 柯佳伶

Introduction

Mining distributed data streams
Challenge: transfer among nodes is costly
Goal: minimize communication requirements

Applications:
Monitor usage in large-scale distributed systems
  Content Delivery Network
Detect malicious activities in networked systems
  Detect worms

Definition

Data streams S1, S2, …, Sm, m ≧ 1
R: root node
s: minimum support
ε: error tolerance, 0 ≦ ε ≦ s
c(u): frequency of item u in S, u ∈ universe U of items
N = ∑u∈U c(u)
ĉ(u): estimated count of c(u), max{0, c(u) − ε·N} ≦ ĉ(u) ≦ c(u)
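
The guarantee on the estimated count can be written as a small executable check. This is only an illustrative Python sketch: the function name `within_tolerance` and the toy counts are assumptions made here, not part of the slides.

```python
def within_tolerance(true_counts, estimates, eps):
    """Return True if every estimate satisfies max{0, c(u) - eps*N} <= est(u) <= c(u)."""
    n = sum(true_counts.values())  # N = sum of c(u) over all items
    for item, c in true_counts.items():
        est = estimates.get(item, 0)  # an item missing from the synopsis counts as 0
        if not (max(0, c - eps * n) <= est <= c):
            return False
    return True


# Hypothetical stream with N = 100 and eps = 0.1, so undercounting by up to 10 is allowed.
true_counts = {"a": 60, "b": 30, "c": 10}
print(within_tolerance(true_counts, {"a": 55, "b": 25}, eps=0.1))  # "c" may be dropped
print(within_tolerance(true_counts, {"a": 45, "b": 25}, eps=0.1))  # "a" is off by 15
```

The first call prints True and the second False: an estimate may fall short of the true count by at most ε·N, and it may never exceed it.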

Definition

T: a period of time units, an epoch
α: decay rate, 0 < α ≦ 1
(ε,α)-synopsis: consists of S:ĉ(u) and S:n
l ≧ 2: number of levels in the hierarchy
ε ≧ ε1 ≧ ε2 ≧ … ≧ εl-1
d ≧ 2: fanout (degree) of all non-leaf nodes in the hierarchy

Main Approach

1. Every monitor node Mi uses a single-stream approximate frequency counting algorithm

2. The m monitor nodes M1, M2, …, Mm relay data every T time units to the central root node R

3. Every T time units, each monitor node sends its (εl-1,1)-synopsis to its parent node

4. The parent node combines the d (εl-1,1)-synopses from its d children into a single (εl-2,1)-synopsis using either
   Algorithm 1a: based on lossy counting, or
   Algorithm 1b: based on majority counting

5. The root node combines the d (ε1,1)-synopses using either
   Algorithm 2a: based on lossy counting, or
   Algorithm 2b: based on majority counting
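
Step 1 leaves the single-stream counting algorithm open. As one possible choice, here is a minimal Python sketch of standard lossy counting, the scheme that Algorithms 1a and 2a are said to build on. The function name `lossy_count` and the toy stream are assumptions made for illustration; this is not the authors' code.

```python
import math


def lossy_count(stream, eps):
    """Approximate counts with est(u) <= c(u) <= est(u) + eps * n."""
    width = math.ceil(1 / eps)        # bucket width
    entries = {}                      # item -> [count, maximum possible undercount]
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)  # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            entries[item] = [1, bucket - 1]
        if n % width == 0:
            # At each bucket boundary, drop entries that cannot be frequent.
            entries = {u: fd for u, fd in entries.items() if fd[0] + fd[1] > bucket}
    return {u: fd[0] for u, fd in entries.items()}, n


# Hypothetical stream: "a" alternates with 50 items that each occur once.
stream = []
for i in range(50):
    stream += ["a", f"x{i}"]
print(*lossy_count(stream, eps=0.1))   # {'a': 50} 100
```

Only the frequent item survives, and its count is exact here because it entered the table in the first bucket.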

Algorithm 1a and 1b

Example

Use Algorithm 1a

27 distinct items (I1~I27), partitioned into three categories:
A: contains I1
B: contains I2~I14
C: contains I15~I27

ε1 ≈ ε = 0.05, ε2 = 0.03

Input streams:
S1 = S3 = { I1:9, I2~I14:6, I15~I27:1 }
S2 = S4 = { I1:9, I2~I14:1, I15~I27:6 }
S1:n = S2:n = S3:n = S4:n = 100

The lossy counting algorithm undercounts each item's frequency by ε2 · 100 = 0.03 · 100 = 3:
S1 = S3 = { I1:6, I2~I14:3, I15~I27:0 }
S2 = S4 = { I1:6, I2~I14:0, I15~I27:3 }

Items whose counts do not stay above zero are eliminated:
S1 = S3 = { I1:6, I2~I14:3 }
S2 = S4 = { I1:6, I15~I27:3 }

Link load M1 → l1 = 14 (likewise 14 on the link from each of the other three monitor nodes)
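
The monitor-node step of this example can be reproduced in a few lines of Python. The sketch assumes, as the slide does, that lossy counting undercounts every item by the full ε2·n; the helper name `prune_leaf` is made up here. `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction


def prune_leaf(counts, n, eps):
    """Subtract the worst-case undercount eps*n and keep only positive counts."""
    cut = eps * n
    return {item: c - cut for item, c in counts.items() if c - cut > 0}


# Stream S1: I1 occurs 9 times, I2..I14 six times each, I15..I27 once each.
s1 = {"I1": 9}
s1.update({f"I{k}": 6 for k in range(2, 15)})
s1.update({f"I{k}": 1 for k in range(15, 28)})

n = sum(s1.values())                            # 100 item occurrences
synopsis = prune_leaf(s1, n, Fraction("0.03"))  # eps2 = 0.03
print(len(synopsis), synopsis["I1"], synopsis["I2"])   # 14 6 3
```

The 14 surviving entries are what the monitor node sends upward, which is the link load of 14.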

At intermediate node l1, item I1:
l1:n = S1:n + S2:n = 100 + 100 = 200
l1:ĉ(I1) = S1:ĉ(I1) + S2:ĉ(I1) = 6 + 6 = 12
l1:ĉ(I1) = l1:ĉ(I1) − (ε1 − ε2)·l1:n = 12 − (0.05 − 0.03)·200 = 12 − 4 = 8

At intermediate node l1, item I2:
l1:n = S1:n + S2:n = 100 + 100 = 200
l1:ĉ(I2) = S1:ĉ(I2) + S2:ĉ(I2) = 3 + 0 = 3
l1:ĉ(I2) = l1:ĉ(I2) − (ε1 − ε2)·l1:n = 3 − (0.05 − 0.03)·200 = 3 − 4 = −1, so I2 is deleted

Resulting synopses:
l1 = { I1:8 }
l2 = { I1:8 }

Link load l1 → R = 1 (likewise 1 on the link l2 → R)
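
The combination step at l1 can be checked the same way. The following Python sketch is inferred from the arithmetic of this worked example (sum the children, then subtract (ε1 − ε2)·n and drop non-positive counts); it is not the pseudocode of Algorithm 1a, and the name `combine` is an assumption.

```python
from fractions import Fraction


def combine(children, eps_parent, eps_child):
    """children: list of (counts, n) synopses; returns the parent's (counts, n)."""
    n = sum(child_n for _, child_n in children)
    summed = {}
    for counts, _ in children:
        for item, c in counts.items():
            summed[item] = summed.get(item, 0) + c
    cut = (eps_parent - eps_child) * n   # extra error the parent may introduce
    merged = {item: c - cut for item, c in summed.items() if c - cut > 0}
    return merged, n


# Pruned synopses of S1 and S2 as they arrive at l1.
s1 = {"I1": 6, **{f"I{k}": 3 for k in range(2, 15)}}
s2 = {"I1": 6, **{f"I{k}": 3 for k in range(15, 28)}}

l1, l1_n = combine([(s1, 100), (s2, 100)], Fraction("0.05"), Fraction("0.03"))
print(l1_n, len(l1), l1["I1"])   # 200 1 8
```

Only I1 survives with count 8, so a single entry travels on the link from l1 to R.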

Minimizing Total Load on the Root Node

Use Algorithm 1a at all applicable nodes and set εi = 0 for all 2 ≦ i ≦ l−1, so that no counts are pruned until the nodes that report directly to the root. We term this strategy MinRootLoad.
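
As a small illustration, the MinRootLoad precision gradient can be built as follows in Python. The function name and the dictionary representation (index i mapped to εi) are assumptions made for this sketch.

```python
def min_root_load_gradient(eps1, levels):
    """Return {i: eps_i} for i = 1..levels-1, with eps_i = 0 for every i >= 2."""
    if levels < 2:
        raise ValueError("the hierarchy needs at least two levels")
    return {i: (eps1 if i == 1 else 0.0) for i in range(1, levels)}


print(min_root_load_gradient(0.05, 4))   # {1: 0.05, 2: 0.0, 3: 0.0}
```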

Minimizing Worst Case Maximum Load on Any Link

Definition: for 2 ≦ i ≦ l − 2, Δi = εi − εi+1, and Δl−1 = εl-1

I: the contents of all input streams S1, …, Sm
ℐ: the set of all possible instances of I
T: the communication hierarchy, defined by degree d and number of levels l
w: the maximum load on any link
W: the worst-case load, i.e. the largest w over all instances in ℐ

ℐwc: the set of worst-case inputs, for all instances I ∈ ℐ − ℐwc, I′ ∈ ℐwc, for any …

Worst-case properties:

1. For any two input streams Si and Sj, there is no item occurrence common to both Si and Sj

2. For any input stream Si, all items occurring in Si occur with equal frequency

3. For any two input streams Si and Sj, both the number of item occurrences and the number of distinct items in Si and Sj are equal
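
The three worst-case properties are easy to test on a concrete input. The Python sketch below represents each stream as a dictionary from item to count; that representation and the function name are assumptions made for illustration.

```python
from itertools import combinations


def has_worst_case_properties(streams):
    """streams: list of {item: count} dicts, one per input stream."""
    # Property 1: no item occurs in two different streams.
    for a, b in combinations(streams, 2):
        if set(a) & set(b):
            return False
    # Property 2: within each stream, all items occur equally often.
    for s in streams:
        if len(set(s.values())) > 1:
            return False
    # Property 3: every stream has the same number of occurrences and distinct items.
    shapes = {(sum(s.values()), len(s)) for s in streams}
    return len(shapes) <= 1


print(has_worst_case_properties([{"a": 5, "b": 5}, {"c": 5, "d": 5}]))  # True
print(has_worst_case_properties([{"a": 5, "b": 5}, {"a": 5, "d": 5}]))  # False: "a" is shared
```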

Level X: the level at which all counts are dropped; the most heavily loaded link(s) are the ones leading to level X

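After solving the equation, we obtain
Δi = ε1 · [expression not preserved in this transcript], 2 ≦ i ≦ l-1, and
Δl-1 = ε1 · [expression not preserved in this transcript]

The maximum possible load on any link is Lwc = [expression not preserved in this transcript]

We term this strategy MinMaxLoad_WC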

Good Precision Gradients for Non-Worst-Case Inputs

Real data is unlikely to exhibit worst-case characteristics

Two opposite extreme cases:

1. Items on each input stream are disjoint
   Solution: use strategy MinMaxLoad_WC

2. All input streams contain identical distributions of items
   Solution: there is no benefit to delaying pruning, so set ε1 = ε2 = … = εl-1 = ε (strategy SS2)

Most real-world data falls somewhere between these two extremes

Li: the number of locally frequent items in Si

Gi: the number of items that are locally frequent in Si and globally frequent in S, S = S1 ∪ S2 ∪ … ∪ Sm

Commonality parameter γ ∈ [0, 1] [defining expression not preserved in this transcript]

A natural hybrid strategy is to use a linear combination of MinMaxLoad_WC and SS2

Set εi = (1 - γ) · [expression not preserved in this transcript]
εl-1 = (1 - γ) · [expression not preserved in this transcript]

We term this hybrid strategy MinMaxLoad_NWC
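
The exact coefficients of the hybrid are not recoverable here, so the following Python sketch only illustrates the general idea of a linear combination. It assumes weight (1 − γ) on a caller-supplied MinMaxLoad_WC gradient and weight γ on the SS2 setting εi = ε, which matches the two extreme cases (γ = 0 for disjoint streams, γ = 1 for identical distributions). The function name and the sample gradient values are hypothetical.

```python
def hybrid_gradient(wc_gradient, eps, gamma):
    """Blend a worst-case gradient [eps_1, ..., eps_{l-1}] with the flat SS2 gradient."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    # Assumed form: eps_i = (1 - gamma) * eps_i(WC) + gamma * eps
    return [(1 - gamma) * e_wc + gamma * eps for e_wc in wc_gradient]


wc = [0.05, 0.03, 0.01]                         # made-up worst-case gradient
print(hybrid_gradient(wc, eps=0.05, gamma=0.0))  # reproduces the worst-case gradient
print(hybrid_gradient(wc, eps=0.05, gamma=1.0))  # every level uses eps, as in SS2
```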

Experiment

Data sets:

1. Traffic logs from the Internet, used to identify hosts receiving large numbers of packets recently
   Data were collected for one full day

2. Java Servlet versions of two publicly available dynamic Web application benchmarks:
   RUBiS is modeled after eBay, an online auction site
   RUBBoS is modeled after Slashdot, an online bulletin board
   Each benchmark was run for 40 hours

Simulated environment:
216 monitoring nodes (m = 216)
Communication hierarchy of fanout six (d = 6) consisting of four levels (l = 4)
s = 0.01, ε = 0.1·s, and ε1 = 0.9·ε
An epoch T is 5 minutes for data set 1 and 15 minutes for data set 2

Data characteristics:
Data set 1 (Internet traffic): γ = 0.675
Data set 2: Auction γ = 0.839, Bulletin Board γ = 0.571
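
The stated parameters are mutually consistent, which a few lines of Python can confirm. The sketch assumes a complete d-ary hierarchy with the monitor nodes as its leaves; the helper name `leaf_count` is made up here.

```python
import math


def leaf_count(fanout, levels):
    """Number of leaves in a complete tree with the given fanout and number of levels."""
    return fanout ** (levels - 1)


d, l = 6, 4
s = 0.01
eps = 0.1 * s        # overall error tolerance
eps1 = 0.9 * eps     # tolerance of the synopses sent to the root

print(leaf_count(d, l))                                           # 216
print(math.isclose(eps, 0.001), math.isclose(eps1, 0.0009))       # True True
```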
