

1

Distributed Streams Algorithms for Sliding Windows

Phillip B. Gibbons,

Srikanta Tirthapura

2

Abstract

• Algorithms for estimating aggregate functions over a "sliding window" of the N most recent data items in one or more streams.

3

Single stream

• The first ε-approximation scheme for the number of 1's in a sliding window.

• The first ε-approximation scheme for the sum of integers in [0..R] in a sliding window.

• Both algorithms are optimal in worst-case time and space.

• Both algorithms are deterministic.

4

Distributed Streams

• The first randomized ε-approximation scheme for the number of 1's in a sliding window over the union of distributed streams.

5

Usage

• Network Monitoring

• Data Warehousing

• Telecommunications

• Sensor Networks

6

• Multiple Data Sources - the Distributed Streams Model

• Only the most recent data is important - “Sliding Window”

7

The Goal of the Algorithms

• Approximating a function F while minimizing:

1. The total memory

2. The time taken by each party to process a data item

3. The time to produce an estimate (the query time)

8

Definition 1 - An (ε, δ)-approximation scheme for a quantity X

• A randomized procedure that, given any positive ε < 1 and δ < 1, computes an estimate of X that is within a relative error of ε with probability at least 1 - δ.

• ε-approximate: an estimate whose worst-case relative error is at most ε (illustrated below).
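As a minimal illustration (not from the paper), the guarantee in Definition 1 can be written as a predicate; the function name is a hypothetical helper:

```python
def within_relative_error(estimate, actual, eps):
    # Definition 1's guarantee: the estimate's relative error is at most eps.
    # For the randomized schemes this must hold with probability >= 1 - delta.
    return abs(estimate - actual) <= eps * actual
```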

9

An Example of the Basic Counting Problem

10

Algorithms for Distributed Streams

• Each party observes only its own stream

• Each party communicates with the other parties only when an estimate is requested

• Each party sends a message to a Referee, who computes the estimate

11

The Idea

• Storing a wave consisting of many random samples of the stream.

• Samples that contain only recent items are kept with higher probability, while those containing old items are kept with lower probability.

12

Contributions

• Introducing a data structure called waves.

• Presenting the first ε-approximation scheme for Basic Counting.

• Presenting the first ε-approximation scheme for the sum of integers in [0..R]. Both are optimal in worst-case space, processing time, and query time.

13

Contributions

• Presenting the first randomized ε-approximation scheme for the number of 1's in a sliding window over the union of distributed streams.

14

Related Work

• From the paper of Datar et al.:

• Using the Exponential Histogram data structure

15

Exponential Histogram

• Maintain more information about recently seen items, less about old items.

• The k0 most recent 1's are assigned to individual buckets of size 1.

• The k1 next most recent 1's are assigned to buckets of size 2.

• The k2 next most recent 1's are assigned to buckets of size 4.

• And so on, until the last of the N most recent items is assigned to some bucket.

16

Exponential Histogram

• Each ki is restricted to one of two consecutive values determined by ε.

• The last bucket is discarded if its position no longer falls within the window.

• If the new item is a 1, it is assigned to a new bucket of size 1.

• If this makes k0 too large, the two least recent buckets of size 1 are merged to form a bucket of size 2.

• If k1 is now too large, the two least recent buckets of size 2 are merged.

• And so on, resulting in a cascade of up to log N bucket merges in the worst case (see the sketch below).

• The approach using waves avoids this cascading.
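The two slides above describe the Exponential Histogram of Datar et al. only at a high level. The following is a rough sketch of the bucket maintenance and its cascading merges; the parameter k (maximum buckets per size) and the list representation are illustrative, not the paper's exact formulation:

```python
def eh_insert(buckets, t, bit, N, k):
    """Sketch of one update step of an Exponential Histogram (Datar et al.).
    `buckets` is a list of (timestamp_of_most_recent_1, size) pairs, oldest
    first; `k` is an illustrative bound on the number of buckets allowed per
    size (the exact bound in Datar et al. depends on the target error)."""
    # Discard the oldest bucket once it falls outside the window of size N.
    if buckets and buckets[0][0] <= t - N:
        buckets.pop(0)
    if bit != 1:
        return
    buckets.append((t, 1))          # each new 1 starts its own size-1 bucket
    # Cascading merges: while some size has too many buckets, merge the two
    # oldest buckets of that size into one of twice the size. Up to log N
    # merges can trigger in the worst case; this is the cascade waves avoid.
    size = 1
    while sum(1 for _, s in buckets if s == size) > k:
        i, j = [idx for idx, (_, s) in enumerate(buckets) if s == size][:2]
        buckets[j] = (buckets[j][0], 2 * size)   # keep the newer timestamp
        del buckets[i]
        size *= 2

# illustrative usage: feed a short bit stream through the histogram
buckets = []
for t, b in enumerate([1, 0, 1, 1, 1, 0, 1], start=1):
    eh_insert(buckets, t, b, N=4, k=2)
```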

17

The Basic Wave

• Assumption: 1/ε is an integer.

• Counters:
1. pos - the current length of the stream
2. rank - the current number of 1's in the stream

• The wave contains the positions of the recent 1's in the stream, arranged at l = log(2εN) different "levels".

• For i = 0, 1, ..., l-1, level i contains the positions of the 1/ε + 1 most recent 1-bits whose 1-rank is a multiple of 2^i.

18

An Example of the Basic Wave

• The crest of the wave is always over the largest 1-rank

• N = 48, 1/ε = 3, l = 5

19

Estimation Steps:

• Let s = max(0, pos - n + 1) {estimating the number of 1's in [s, pos]}

• Let p1 be the maximum stored position less than s, and p2 the minimum stored position greater than or equal to s.

• Let r1 and r2 be the 1-ranks of p1 and p2, respectively.

• Return rank - r + 1, where r = r2 if r2 - r1 = 1, and otherwise r = (r1 + r2)/2 (see the sketch below).
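A minimal sketch of these steps in Python. The function name and the flat `stored` representation are illustrative (the real wave keeps the positions in per-level queues); the edge cases follow the query procedure given later on slide 26:

```python
def estimate_ones(stored, pos, rank, n):
    """Sketch of the estimation steps above.  `stored` is an iterable of
    (position, 1-rank) pairs currently held in the wave."""
    s = max(0, pos - n + 1)                      # window is [s, pos]
    older = [r for (p, r) in stored if p < s]    # candidates for r1
    newer = [r for (p, r) in stored if p >= s]   # candidates for r2
    if not older:
        return rank                              # whole stream is in the window
    if not newer:
        return 0
    r1, r2 = max(older), min(newer)              # 1-ranks of p1 and p2
    r = r2 if r2 - r1 == 1 else (r1 + r2) / 2
    return rank - r + 1
```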

20

LEMMA 1

• The procedure returns an estimate that is within a relative error of ε of the actual number of 1's in the window.

21

Proof

• Let j be the smallest-numbered level containing position p1. (If j = 0, then r2 = r1 + 1 and the estimate is exact, so assume j >= 1.)

• By returning the midpoint of the range [r1, r2], we guarantee that the absolute error is at most (r2 - r1)/2.

• There is a gap of at most 2^j between r1 and the next larger stored 1-rank r2, since level j stores every 2^j-th 1-rank.

• Thus the absolute error in our estimate is at most 2^(j-1).

• Let r3 be the earliest 1-rank at level j-1. Then r3 > r1 and r3 >= r2.

• By definition, level j-1 holds 1/ε + 1 1-ranks spaced 2^(j-1) apart, so rank - r3 >= 2^(j-1)/ε. The number of 1's in the window is at least rank - r2 + 1 >= rank - r3 + 1 > 2^(j-1)/ε, so the relative error is at most ε.

22

Improvement

• Use modulo-N' counters for pos and rank, and store the positions in the wave as numbers modulo N', so each takes only log N' bits.

• Keep track of both the largest 1-rank discarded (r1) and the smallest 1-rank (r2) still in the wave, so that a Basic Counting query is answered in O(1) time.

• Instead of storing a single position in multiple levels, store each position only at its maximal level.

23

Improvement

24

Improvement

• The positions at each level are stored in a fixed-length queue, so that each time a new position is added, the position at the end of the queue is removed.

• Maintain a doubly linked list of the positions in the wave, in increasing order.

• By storing the difference between consecutive positions instead of the absolute positions, the space is reduced from O((1/ε) log(εN)) words of O(log N) bits to O((1/ε) log^2(εN)) bits.

25

The Deterministic Wave Algorithm

• Upon receiving a stream bit b (see the sketch below):

1. Increment pos (modulo N' = 2N).

2. If the head (p, r) of the linked list L has expired (p <= pos - N), then discard it from L and from its queue, and store r as the largest 1-rank discarded.

3. If b = 1 then do:
(a) Increment rank, and determine the corresponding wave level j, the largest j such that rank is a multiple of 2^j.
(b) If the level-j queue is full, discard the tail of the queue and splice it out of L.
(c) Add (pos, rank) to the head of the level-j queue and to the tail of L.
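A hedged sketch of this per-item update (and of the query on the next slide) in Python. Assumptions: the class name is illustrative, the modulo-N' counters are omitted, and expiry scans the oldest entry of every level instead of using the paper's linked list across levels, so the per-item cost here is O(log(εN)) rather than the paper's O(1):

```python
from collections import deque
from math import ceil, log2

class DeterministicWave:
    """Sketch of the deterministic wave for Basic Counting: the number of
    1's among the last N stream bits, within relative error eps."""

    def __init__(self, N, eps):
        self.N = N
        self.pos = 0                        # current stream length
        self.rank = 0                       # number of 1's seen so far
        self.levels = max(1, ceil(log2(2 * eps * N)))   # l = log(2*eps*N)
        self.cap = ceil(1 / eps) + 1        # queue capacity per level
        # each queue holds (position, 1-rank) pairs, oldest at the left
        self.queues = [deque() for _ in range(self.levels)]
        self.r1 = None                      # largest 1-rank expired so far

    def _expire(self):
        # step 2: drop entries whose position has left the window
        for q in self.queues:
            while q and q[0][0] <= self.pos - self.N:
                _, r = q.popleft()
                self.r1 = r if self.r1 is None else max(self.r1, r)

    def update(self, bit):
        self.pos += 1
        self._expire()
        if bit == 1:
            self.rank += 1
            # step 3a: level j = largest j with rank divisible by 2^j,
            # i.e. the number of trailing zeros of rank (capped at the top level)
            j = min((self.rank & -self.rank).bit_length() - 1, self.levels - 1)
            q = self.queues[j]
            if len(q) == self.cap:          # step 3b: capacity eviction
                q.popleft()                 #   (not recorded in r1)
            q.append((self.pos, self.rank)) # step 3c

    def query(self):
        # slide 26: estimate the number of 1's in the last N bits
        if self.r1 is None:
            return self.rank                # nothing has expired: exact
        oldest = min((q[0] for q in self.queues if q), default=None)
        if oldest is None:
            return 0
        r2 = oldest[1]
        r = r2 if r2 - self.r1 == 1 else (self.r1 + r2) / 2
        return self.rank - r + 1
```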

26

Answering a query for a sliding window of size N:

• 1. Let r1 be the largest 1-rank discarded. (If there is no such r1, return rank as the exact answer.) Let r2 be the 1-rank at the head of the linked list L. (If L is empty, return 0.)

• 2. Return rank - r + 1, where r = r2 if r2 - r1 = 1, and otherwise r = (r1 + r2)/2.

27

• Space - O((1/ε) log(εN)) memory words of O(log N) bits

• Process time for each item - O(1)
• Estimate time - O(1)

• In related work (Datar et al.):

• Space - O((1/ε) log(εN)) buckets of O(log N) bits

• Process time for each item - O(log(εN))

28

Sum of Bounded Integers

• The sum over a sliding window can range from 0 to NR.

• Let N' be the smallest power of 2 greater than or equal to 2RN.

• Counters (modulo N'):
pos - the current length
total - the running sum

• l = log(2εNR) levels.

• Store a triple (p, v, z) for each selected item:
p - the position of the item
v - the value of the data item
z - the partial sum through this item

29

• The answer to a query is the midpoint of the interval [total - z2 + v2, total - z1].

30

The Algorithm for the Sum of the Last N Items in a Data Stream

• Upon receiving a stream value v between 0 and R:

• 1. Increment pos (modulo N').

• 2. If the head (p, v', z) of the linked list L has expired (p <= pos - N), then discard it from L and from its queue, and store z as the largest partial sum discarded.

• 3. If v > 0 then do:

• (a) Determine the largest j such that some number in (total, total + v] is a multiple of 2^j. Add v to total.

• (b) If the level-j queue is full, discard the tail of the queue and splice it out of L.

• (c) Add (pos, v, total) to the head of the level-j queue and to the tail of L.

31

Step 3a

• The desired wave level is the largest position j such that some number y in the interval (total, total + v] has 0's in all bit positions less than j.

• y - 1 and y differ in bit position j.

• If bit j changes from 1 to 0 at any point in (total, total + v], then j is not the largest.

• j is the position of the most significant bit that is 0 in total and 1 in total + v.

• j is the position of the most significant bit that is 1 in the bitwise XOR of total and total + v (see the sketch below).
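This bit trick is easy to state in code; a minimal sketch (the function name is illustrative, and the modulo-N' wraparound of the real counters is ignored):

```python
def sum_wave_level(total, v):
    """Step 3a of the sum algorithm above: the largest j such that some
    number in (total, total + v] is a multiple of 2**j equals the position
    of the most significant bit of total XOR (total + v)."""
    assert v > 0
    return (total ^ (total + v)).bit_length() - 1

# e.g. total = 5, v = 3: 8 lies in (5, 8] and is a multiple of 2**3, so j = 3
assert sum_wave_level(5, 3) == 3
assert sum_wave_level(8, 1) == 0   # only 9 lies in (8, 9], divisible by 2**0
```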

32

Answering a query for a sliding window of size N:

• 1. Let z1 be the largest partial sum discarded from L. (If there is no such z1, return total as the exact answer.) Let (p2, v2, z2) be the head of the linked list L. (If L is empty, return 0.)

• 2. Return total - (z1 + z2 - v2)/2.

33

• Space - O((1/ε)(log N + log R)) memory words of O(log N + log R) bits

• Process time for each item - O(1)
• Estimate time - O(1)

• In related work (Datar et al.):

• Space - O((1/ε)(log N + log R)) buckets of log N + log(log N + log R) bits

• Process time for each item - O(log N + log R)

34

Distributed Streams

• Three definitions for sliding windows over a collection of t > 1 distributed streams:

1. Seeking the total number of 1's in the last N items in each of the t streams (tN items in total).

2. A single logical stream has been split arbitrarily among the parties. Each party receives items that include a sequence number in the logical stream. Seeking the total number of 1's in the last N items in the logical stream.

3. Seeking the total number of 1's in the last N items in the position-wise union of the t streams.

35

Solution for the First Scenario:

• Apply the single-stream algorithm to each stream.

• To answer a query, each party sends its count to the Referee.

• The Referee sums the answers.

• Because each individual count is within relative error ε, so is the total.

36

Solution for the Second Scenario:

• To answer a query, each party sends its wave to the Referee.

• The Referee computes the maximum sequence number over all the parties, uses each wave to obtain an estimate over the resulting window, and sums the results.

• Because each individual count is within relative error ε, so is the total.

37

Randomized Waves

• Contain the positions of the recent 1's in the data stream, stored at different levels.

• Each level i contains the most recently selected positions of the 1-bits, where a position is selected into level i with probability 2^-i.

• The deterministic wave selects 1 out of every 2^i 1-bits at regular intervals.

• A randomized wave selects an expected 1 out of every 2^i 1-bits at random intervals.

• The randomized wave retains more positions per level.

38

The Basic Randomized Wave

• Let N' be the smallest power of 2 that is at least 2N.

• Let d = log N'.

• Let δ < 1 be the desired error probability.

• Each party Pj maintains a basic randomized wave for its stream, consisting of d + 1 queues, Qj(0), ..., Qj(d), one for each level.

• Use a pseudo-random hash function h to map positions to levels according to an exponential (geometric) distribution, so that a position hashes to level i or higher with probability 2^-i (see the sketch below).
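A sketch of the level-assignment hash under the distribution just described. The paper calls for pseudo-random hash functions with limited independence; counting the trailing zero bits of a keyed hash value is only an illustration of the required geometric behavior, and the function name and seed are assumptions:

```python
import hashlib

def level_hash(position, d, seed=b"wave-seed"):
    """Level hash h with Pr[h(p) >= i] = 2**-i for i <= d, so a position
    is stored in levels 0..h(p) and appears in level i with probability
    2**-i."""
    digest = hashlib.blake2b(position.to_bytes(8, "big"), key=seed).digest()
    x = int.from_bytes(digest, "big")
    if x == 0:
        return d
    trailing_zeros = (x & -x).bit_length() - 1
    return min(trailing_zeros, d)
```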

39

The Steps for Maintaining the Randomized Wave:

• Party Pj, upon receiving a stream bit b (sketched below):

1. Increment pos (modulo N').
2. Discard any position p at the tail of a queue that has expired (p <= pos - N).
3. If b = 1, then for l = 0, ..., h(pos) do:
(a) If the level-l queue Qj(l) is full, then discard the tail of Qj(l).
(b) Add pos to the head of Qj(l).

• The sample for each level, stored in a queue, contains the c/ε^2 most recent positions selected into the level (c = 36).
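A minimal sketch of this per-item step for one party; the function name is illustrative, `h` is the level hash (see the sketch after the previous slide), and `cap` is the per-level sample size (c/ε^2 with c = 36 in the slides):

```python
from collections import deque

def randomized_wave_update(queues, pos, bit, N, cap, h):
    """One update of a party's randomized wave.  `queues[l]` holds the
    positions sampled into level l, oldest first; `pos` is the already
    incremented position of the arriving bit."""
    for q in queues:                        # step 2: expire old positions
        while q and q[0] <= pos - N:
            q.popleft()
    if bit == 1:
        for l in range(min(h(pos), len(queues) - 1) + 1):   # step 3
            if len(queues[l]) == cap:       # step 3(a): full queue
                queues[l].popleft()
            queues[l].append(pos)           # step 3(b)

# illustrative usage: d + 1 empty level queues for one party
# queues = [deque() for _ in range(d + 1)]
```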

40

• Consider a queue Qj(l), and let i be the earliest position it contains. Then Qj(l) contains all the 1-bits in the interval [i, pos] whose positions hash to a value greater than or equal to l.

• As we move from level l to level l + 1, this range can only increase.

• The queues at lower-numbered levels may have ranges that fail to contain the window, but as we move to higher levels, we will find a level whose range contains the window.

41

Answering a query for a sliding window of size n <= N

• After each party has observed pos bits:

1. Each party j sends its wave, {Qj(0), ..., Qj(log N')}, to the Referee. Let s = max(0, pos - n + 1); then W = [s, pos] is the desired window.
2. For j = 1, ..., t, let lj be the minimum level such that the tail of Qj(lj) is a position p <= s.
3. Let l* = max{lj : j = 1, ..., t}. Let U be the union of all positions in Q1(l*), ..., Qt(l*).
4. Return 2^(l*) * |U ∩ W|, the number of distinct sampled positions in the window scaled by the sampling rate (see the sketch below).
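A hedged sketch of the Referee's side of this query. The representation and the step-4 estimator are reconstructions consistent with the slides, not verbatim from the paper, and the fallback to the top level is an assumption for the case where no level reaches back to s:

```python
def referee_estimate(waves, pos, n):
    """Referee computation for the union-counting query.  `waves[j]` is
    party j's list of level queues Qj(0..d), each a list of stored
    positions, oldest first."""
    s = max(0, pos - n + 1)                 # window W = [s, pos]
    # step 2: for each party, the minimum level whose oldest stored
    # position reaches back to (or before) the start of the window
    party_levels = [
        next((l for l, q in enumerate(Q) if q and q[0] <= s), len(Q) - 1)
        for Q in waves
    ]
    l_star = max(party_levels)              # step 3
    union = set()
    for Q in waves:                         # distinct positions in the window
        union.update(p for p in Q[l_star] if p >= s)
    return (2 ** l_star) * len(union)       # step 4
```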

42

• The algorithm returns an estimate for the Union Counting Problem, for any sliding window of size n <= N, that is within relative error ε with probability greater than 2/3.

• Space - (d + 1) levels of c/ε^2 positions each per party, i.e., O((1/ε^2) log^2 N) bits per party.
