1 distributed streams algorithms for sliding windows phillip b. gibbons, srikanta tirthapura


Distributed Streams Algorithms for Sliding Windows

Phillip B. Gibbons,

Srikanta Tirthapura


Abstract

• Algorithms for estimating aggregate functions over a “sliding window” of the N most recent data items in one or more streams.


Single stream

• The first ε-approximation scheme for the number of 1’s in a sliding window.

• The first ε-approximation scheme for the sum of integers in [0..R] in a sliding window.

• Both algorithms are optimal in worst-case time and space.

• Both algorithms are deterministic.


Distributed Streams

• The first randomized (ε, δ)-approximation scheme for the number of 1’s in a sliding window over the union of distributed streams.


Usage

• Network Monitoring

• Data Warehousing

• Telecommunications

• Sensor Networks


• Multiple Data Source - Distributed Stream Model

• Only the most recent data is important - “Sliding Window”


The Goal of the Algorithms

• Approximate a function F while minimizing:

• 1. The total memory

• 2. The time taken by each party to process a data item

• 3. The time to produce an estimate (the query time)


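Definition 1: An (ε, δ)-approximation scheme for a quantity X

• A randomized procedure that, given any positive ε < 1 and δ < 1, computes an estimate X' of X that is within a relative error of ε with probability at least 1 - δ:

Pr[ |X' - X| <= εX ] >= 1 - δ

• ε-approximate: an estimate whose worst-case relative error is at most ε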


An Example for Basic Counting Problem


Algorithms for Distributed Stream

• Each party observes only its own stream

• Each party communicates with other parties only when an estimate is requested

• Each party sends a message to a Referee who computes the estimate


The Idea

• Storing a wave consisting of many random samples of the stream.

• Samples that contain only the recent items are sampled at a high probability, while those containing old items are sampled at a lower probability


Contributions

• Introducing a data structure called waves

• Presenting the first ε-approximation scheme for Basic Counting.

• Presenting the first ε-approximation scheme for the sum of integers in [0..R]. Both schemes are optimal in worst-case space, processing time and query time.


Contributions

• Presenting the first randomized (ε, δ)-approximation scheme for the number of 1’s in a sliding window over the union of distributed streams


Related Work

• From the paper of Datar et al.:

• Uses the Exponential Histogram data structure


Exponential Histogram

• Maintain more information about recently seen items, less about old items.

• The k0 most recent 1’s are assigned to individual buckets (size 1).

• The k1 next most recent 1’s are assigned to buckets of size 2.

• The k2 next most recent 1’s are assigned to buckets of size 4.

• And so on, until all the 1’s among the last N items are assigned to some bucket.


Exponential Histogram

• Each ki is either ⌈1/(2ε)⌉ or ⌈1/(2ε)⌉+1.

• The last bucket is discarded if its position no longer falls within the window.

• If the new item is a 1, it is assigned to a new bucket of size 1.

• If this makes k0 = ⌈1/(2ε)⌉+2, then the two least recent buckets of size 1 are merged to form a bucket of size 2.

• If k1 is now too large, the two least recent buckets of size 2 are merged.

• And so on, resulting in a cascade of up to log N bucket merges in the worst case.

• The approach using waves avoids this cascading.
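
The bucket rules above can be sketched in a few lines of Python. This is a simplified illustration, not the code of Datar et al.: the class name, the `max_per_size` cap (standing in for the largest allowed ki) and the `merges_last_update` counter are assumptions made for this example, and positions are plain integers.

```python
class ExponentialHistogram:
    """Simplified exponential histogram over the last `window` bits."""

    def __init__(self, window, max_per_size):
        self.window = window
        self.max_per_size = max_per_size   # hypothetical cap on buckets of one size
        self.pos = 0
        self.buckets = []                  # [newest position, size], newest bucket first
        self.merges_last_update = 0

    def add(self, bit):
        self.pos += 1
        self.merges_last_update = 0
        # Discard the last (oldest) bucket once it falls out of the window.
        if self.buckets and self.buckets[-1][0] <= self.pos - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.pos, 1])
        size = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.max_per_size:
                break
            # Merge the two least recent buckets of this size.
            newer, older = same[-2], same[-1]
            self.buckets[newer][1] = 2 * size   # merged bucket keeps the newer position
            del self.buckets[older]
            self.merges_last_update += 1
            size *= 2                           # the merge may overflow the next size

    def estimate(self):
        """Total of all buckets minus half of the oldest one."""
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] // 2
```

With `max_per_size=2`, the seventh consecutive 1 triggers two merges within a single update, which is the kind of cascade the wave is designed to avoid.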


The Basic Wave

• Assumption: 1/ε is an integer.

• Counters:

1. pos - the current length of the stream
2. rank - the current number of 1’s in the stream

• The wave contains the positions of the recent 1’s in the stream, arranged at different “levels”.

• For i=0,1,..,l-1, level i contains the positions of the 1/ε+1 most recent 1-bits whose 1-rank is a multiple of 2^i.


An Example for Basic Wave

• The crest of the wave is always over the largest 1-rank

• N=48, 1/ε=3, l=5


Estimation Steps:

• Let s=max(0,pos-n+1). {We estimate the number of 1’s in [s,pos].}

• Let p1 be the maximum position in the wave that is less than s, and p2 the minimum position in the wave that is greater than or equal to s.

• Let r1 and r2 be the 1-ranks of p1 and p2 respectively.

• Return rank-r+1, where r=r2 if r2-r1=1 and otherwise r=(r1+r2)/2.
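
The basic wave and its estimation steps can be tried out with a small Python sketch. It is a naive illustration under stated assumptions, not the authors' implementation: the function names are invented here, 1/ε is passed as the integer `eps_inv`, each level is assumed to keep the 1/ε+1 most recent qualifying positions, the number of levels is taken as ⌈log2(2εN)⌉ (which matches the example N=48, 1/ε=3, l=5), and every level is seeded with rank 0 at position 0 so that p1 exists early in the stream.

```python
import math
from collections import deque

def build_basic_wave(bits, eps_inv, max_window):
    """Read `bits` and return (pos, rank, levels) for the basic wave."""
    num_levels = max(1, math.ceil(math.log2(2 * max_window / eps_inv)))
    # Each level keeps the 1/eps + 1 most recent (position, 1-rank) pairs.
    # Assumption of this sketch: rank 0 at position 0 seeds every level.
    levels = [deque([(0, 0)], maxlen=eps_inv + 1) for _ in range(num_levels)]
    pos = rank = 0
    for b in bits:
        pos += 1
        if b == 1:
            rank += 1
            for i, level in enumerate(levels):
                if rank % (2 ** i) == 0:
                    level.append((pos, rank))
    return pos, rank, levels

def estimate_ones(pos, rank, levels, n):
    """Estimate the number of 1's among the last n positions (n <= max_window)."""
    s = max(0, pos - n + 1)
    stored = sorted({entry for level in levels for entry in level})
    before = [r for p, r in stored if p < s]    # 1-ranks of positions left of the window
    inside = [r for p, r in stored if p >= s]   # 1-ranks of positions in the window
    if not before:
        return rank        # the window covers the whole stream
    if not inside:
        return 0           # no stored 1 falls inside the window
    r1, r2 = before[-1], inside[0]
    r = r2 if r2 - r1 == 1 else (r1 + r2) / 2
    return rank - r + 1
```

Comparing `estimate_ones` against a direct count of the last n bits on a random stream is a quick way to observe the relative error staying within ε.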


LEMMA 1

• The procedure returns an estimate that is within a relative error of ε of the actual number of 1’s in the window.


Proof

• Let j be the smallest numbered level containing position p1.

• By returning the midpoint of the range [r1,r2], we guarantee that the absolute error is at most (r2-r1)/2.

• There is at most a 2^j gap between r1 and its next larger 1-rank r2.

• Thus the absolute error in our estimate is at most 2^(j-1).

• Let r3 be the earliest 1-rank at level j-1.

• r3 > r1, so r3 >= r2.

• By definition of level j-1, rank-r3 >= (1/ε)·2^(j-1). The window therefore holds at least rank-r2+1 > (1/ε)·2^(j-1) 1’s, so the relative error is less than ε.


Improvement

• Use modulo N’ counters for pos and rank, and store the positions in the wave as modulo N’ numbers, so each takes only log N’ bits.

• Keep track of both the largest 1-rank discarded (r1) and the smallest 1-rank still in the wave (r2), so a query for the number of 1’s is answered in O(1) time.

• Instead of storing a single position in multiple levels, store each position only at its maximal level.


Improvement

• The positions at each level are stored in a fixed-length queue, so that each time a new position is added, the position at the end of the queue is removed.

• Maintain a doubly linked list of the positions in the wave in increasing order.

• Store the difference between consecutive positions instead of the absolute positions to reduce the space.


The deterministic wave algorithm

• Upon receiving a stream bit b:

1. Increment pos (modulo N’=2N).
2. If the head (p,r) of the linked list L has expired (p<=pos-N), then discard it from L and from its queue, and store r as the largest 1-rank discarded.
3. If b=1 then do:
(a) Increment rank, and determine the corresponding wave level j, the largest j such that rank is a multiple of 2^j.
(b) If the level j queue is full, discard the tail of the queue and splice it out of L.
(c) Add (pos,rank) to the head of the level j queue and the tail of L.


Answering a query for a sliding window of size N:

• 1. Let r1 be the largest 1-rank discarded. (If there is no such r1, return rank as the exact answer.) Let r2 be the 1-rank at the head of the linked list L. (If L is empty, return 0.)

• 2. Return rank-r+1, where r=r2 if r2-r1=1 and otherwise r=(r1+r2)/2.
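
The update and query steps of the deterministic wave can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the class and attribute names are invented here, positions and ranks are plain Python integers instead of modulo N’ counters, an `OrderedDict` stands in for the doubly linked list L, and the queue capacities (half of 1/ε+1 per level, rounded up, and 1/ε+1 at the top level) are a guess at what storing each position only at its maximal level requires. The sketch also treats rank 0 as already discarded once the stream is longer than N, and returns rank exactly only while the stream still fits in the window.

```python
import math
from collections import OrderedDict, deque

class DeterministicWave:
    def __init__(self, window, eps_inv):
        self.N = window
        k = eps_inv + 1
        self.num_levels = max(1, math.ceil(math.log2(2 * window / eps_inv)))
        # Assumed capacities: ceil(k/2) per level, k at the top level.
        self.caps = [math.ceil(k / 2)] * (self.num_levels - 1) + [k]
        self.queues = [deque() for _ in range(self.num_levels)]
        self.L = OrderedDict()   # pos -> (rank, level), oldest first
        self.pos = 0
        self.rank = 0
        self.r1 = 0              # largest 1-rank discarded; 0 is a sentinel

    def update(self, bit):
        self.pos += 1
        # Step 2: expire the head of L if it left the window.
        if self.L:
            p = next(iter(self.L))
            if p <= self.pos - self.N:
                r, lvl = self.L.pop(p)
                self.queues[lvl].popleft()
                self.r1 = r
        # Step 3: record a new 1 at its maximal level.
        if bit == 1:
            self.rank += 1
            trailing_zeros = (self.rank & -self.rank).bit_length() - 1
            j = min(trailing_zeros, self.num_levels - 1)
            q = self.queues[j]
            if len(q) == self.caps[j]:
                del self.L[q.popleft()]   # splice the oldest level-j entry out of L
            q.append(self.pos)
            self.L[self.pos] = (self.rank, j)

    def query(self):
        if self.pos <= self.N:
            return self.rank              # whole stream is inside the window
        if not self.L:
            return 0
        r2 = next(iter(self.L.values()))[0]
        r = r2 if r2 - self.r1 == 1 else (self.r1 + r2) / 2
        return self.rank - r + 1
```

Both `update` and `query` touch only the head of L and one level queue, so each runs in constant time apart from the level computation, which is a few integer operations.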


• Space - O((1/ε)·log^2(εN)) bits

• Processing time for each item - O(1)

• Estimate time - O(1)

• In related work (Datar et al.):

• Space - O((1/ε)·log^2(εN)) bits

• Processing time for each item - O(log(εN))


Sum of Bounded Integers

• The sum over a sliding window can range from 0 to NR.

• Let N’ be the smallest power of 2 greater than or equal to 2RN.

• Counters (modulo N’):
pos - the current length of the stream
total - the running sum

• l=log(2εNR) levels.

• Store a triple (p,v,z) for each item:
p - the position of the data item
v - the value of the data item
z - the partial sum through this item


• The answer to a query is the midpoint of the interval [total-z2+v2,total-z1)


The Algorithm for the sum of the last N items in a data stream

• Upon receiving a stream value v between 0 and R:

• 1. Increment pos (modulo N’).

• 2. If the head (p,v’,z) of the linked list L has expired (p<=pos-N), then discard it from L and from its queue, and store z as the largest partial sum discarded.

• 3. If v>0 then do:

• (a) Determine the largest j such that some number in (total,total+v] is a multiple of 2^j. Add v to total.

• (b) If the level j queue is full, discard the tail of the queue and splice it out of L.

• (c) Add (pos,v,total) to the head of the level j queue and the tail of L.


Step 3a

• The desired wave level is the largest position j such that some number y in the interval (total,total+v] has 0’s in all bit positions less than j.

• y-1 and y differ in bit position j.

• If bit j changes from 1 to 0 at any point in [total,total+v], then j is not the largest.

• j is the position of the most significant bit that is 0 in total and 1 in total+v.

• Equivalently, j is the position of the most significant bit that is 1 in the bitwise XOR of total and total+v.
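
The XOR rule for Step 3a is short enough to check against a brute-force search. The sketch below is only an illustration: the function names are made up for this example, and it works on unbounded Python integers rather than modulo N’ counters.

```python
def wave_level(total, v):
    """Largest j such that some multiple of 2**j lies in (total, total + v]."""
    assert total >= 0 and v > 0
    # The most significant bit set in the XOR is the highest bit that changes.
    return (total ^ (total + v)).bit_length() - 1

def wave_level_bruteforce(total, v):
    """Same quantity, found by testing every number in the interval."""
    j = 0
    while any(y % (2 ** (j + 1)) == 0 for y in range(total + 1, total + v + 1)):
        j += 1
    return j
```

For example, with total=5 and v=2 the interval (5,7] contains 6, a multiple of 2 but not of 4, and `wave_level(5, 2)` returns 1.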


Answering a query for a sliding window of size N :

• 1. Let z1 be the largest partial sum discarded from L. (If there is no such z1, return total as the exact answer.) Let (p2,v2,z2) be the head of the linked list L. (If L is empty, return 0.)

• 2. Return [total - (z1+z2-v2)/2]
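
Putting the update and query steps for sums together gives a sketch like the one below. As before, this is an illustration under assumptions rather than the authors' implementation: names are invented, arithmetic is on plain integers instead of modulo N’ counters, an `OrderedDict` plays the role of L, the queue capacities are assumed (half of 1/ε+1 rounded up per level, 1/ε+1 at the top), the number of levels is taken as ⌈log2(2εNR)⌉, and a partial sum of 0 is used as a sentinel for z1 once the stream is longer than N.

```python
import math
from collections import OrderedDict, deque

class SumWave:
    def __init__(self, window, max_value, eps_inv):
        self.N = window
        k = eps_inv + 1
        self.num_levels = max(
            1, math.ceil(math.log2(2 * window * max_value / eps_inv)))
        self.caps = [math.ceil(k / 2)] * (self.num_levels - 1) + [k]  # assumed
        self.queues = [deque() for _ in range(self.num_levels)]
        self.L = OrderedDict()   # pos -> (v, z, level), oldest first
        self.pos = 0
        self.total = 0
        self.z1 = 0              # largest partial sum discarded; 0 is a sentinel

    def update(self, v):
        self.pos += 1
        # Step 2: expire the head of L if it left the window.
        if self.L:
            p = next(iter(self.L))
            if p <= self.pos - self.N:
                _, z, lvl = self.L.pop(p)
                self.queues[lvl].popleft()
                self.z1 = z
        # Step 3: store a non-zero item at the level given by the XOR rule.
        if v > 0:
            j = (self.total ^ (self.total + v)).bit_length() - 1
            j = min(j, self.num_levels - 1)
            self.total += v
            q = self.queues[j]
            if len(q) == self.caps[j]:
                del self.L[q.popleft()]
            q.append(self.pos)
            self.L[self.pos] = (v, self.total, j)

    def query(self):
        if self.pos <= self.N:
            return self.total    # whole stream is inside the window
        if not self.L:
            return 0
        v2, z2, _ = next(iter(self.L.values()))
        return self.total - (self.z1 + z2 - v2) / 2
```

The returned value is the midpoint between the smallest and largest sums that are consistent with what the wave still remembers.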


• Space - O((1/ε)(log N + log R)) memory words of O(log N + log R) bits each

• Processing time for each item - O(1)

• Estimate time - O(1)

• In related work (Datar et al.):

• Space - O((1/ε)(log N + log R)) buckets of log N + log(log N + log R) bits each

• Processing time for each item - O(log N + log R)


Distributed Streams

• Three definitions for a sliding window over a collection of t>1 distributed streams:

1. Seeking the total number of 1’s in the last N items in each of the t streams (tN items in total).

2. A single logical stream has been split arbitrarily among the parties. Each party receives items that include a sequence number in the logical stream. Seeking the total number of 1’s in the last N items in the logical stream.

3. Seeking the total number of 1’s in the last N items in the position-wise union of the t streams.


Solution for First Scenario :

• Applying single stream algorithm to each stream.

• To answer a query, each party sends its count to the Referee.

• The Referee sums the answers.

• Because each individual count is within ε relative error, so is the total.
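
A tiny numeric check shows why summing works: the absolute errors add up to at most ε times the sum of the true counts. The function name and the numbers below are made up for illustration.

```python
def referee_total(party_estimates):
    """The Referee's answer in the first scenario: the sum of the parties' counts."""
    return sum(party_estimates)

eps = 0.1
true_counts = [40, 10, 25]        # hypothetical true counts per stream
estimates = [44.0, 9.0, 27.5]     # each within 10% of its true count

total_estimate = referee_total(estimates)
total_true = sum(true_counts)
total_error = abs(total_estimate - total_true)
```

Here the total estimate is 80.5 against a true total of 75, an error of 5.5, which is below the allowed ε·75 = 7.5.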


Solution for Second Scenario :

• To answer a query, each party sends its wave to the Referee.

• The Referee computes the maximum sequence number over all the parties, uses each wave to obtain an estimate over the resulting window, and sums the results.

• Because each individual count is within ε relative error, so is the total.


Randomized Waves

• Contains the positions of the recent 1’s in the data stream, stored at different levels.

• Each level i contains the most recently selected positions of the 1-bits, where a position is selected into level i with probability 2^(-i).

• The deterministic wave selects 1 out of every 2^i 1-bits at regular intervals.

• A randomized wave selects an expected 1 out of every 2^i 1-bits at random intervals.

• The randomized wave retains more positions per level.


The Basic Randomized Wave

• Let N’ be the smallest power of 2 that is at least 2N.

• Let d=log N’.

• Let ε<1 be the desired relative error.

• Each party Pj maintains a basic randomized wave for its stream consisting of d+1 queues, Qj(0),..,Qj(d), one for each level.

• A pseudo-random hash function h maps positions to levels according to an exponential distribution.


The Steps for Maintaining the Randomized Wave:

• Party Pj, upon receiving a stream bit b:

1. Increment pos (modulo N’).
2. Discard any position p at the tail of a queue that has expired (p<=pos-N).
3. If b=1 then for l=0,..,h(pos) do:
(a) If the level l queue Qj(l) is full, then discard the tail of Qj(l).
(b) Add pos to the head of Qj(l).

• The sample for each level, stored in a queue, contains the c/ε^2 most recent positions selected into the level (c=36).
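
The maintenance steps can be sketched in Python as below. Several details are assumptions of this example and not taken from the slides: the names `level_hash` and `RandomizedWave`, the use of SHA-256 as a stand-in for the pseudo-random hash h (all parties would have to share the same `seed`), the queue capacity c/ε² rounded up, and plain integers in place of modulo N’ counters. In each `deque` the oldest position (the tail in the slides' wording) sits on the left.

```python
import hashlib
import math
from collections import deque

def level_hash(pos, d, seed=0):
    """Map a position to a level in 0..d; level >= l has probability 2**-l."""
    digest = hashlib.sha256(f"{seed}:{pos}".encode()).digest()
    x = int.from_bytes(digest[:8], "big")
    trailing_zeros = (x & -x).bit_length() - 1 if x else 64
    return min(trailing_zeros, d)

class RandomizedWave:
    def __init__(self, window, eps, c=36, seed=0):
        n_prime = 1
        while n_prime < 2 * window:      # smallest power of 2 that is >= 2N
            n_prime *= 2
        self.N = window
        self.d = n_prime.bit_length() - 1
        self.cap = math.ceil(c / eps ** 2)   # assumed queue capacity
        self.queues = [deque() for _ in range(self.d + 1)]
        self.pos = 0
        self.seed = seed

    def update(self, bit):
        self.pos += 1
        # Step 2: drop expired positions from the old end of every queue.
        for q in self.queues:
            while q and q[0] <= self.pos - self.N:
                q.popleft()
        # Step 3: a 1-bit enters levels 0..h(pos).
        if bit == 1:
            for l in range(level_hash(self.pos, self.d, self.seed) + 1):
                q = self.queues[l]
                if len(q) == self.cap:
                    q.popleft()
                q.append(self.pos)
```

Because every selected position enters all levels from 0 up to h(pos), each queue is a thinned-out version of the one below it until a lower queue overflows.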


• Consider a queue Qj(l) whose tail is at position i. Then Qj(l) contains all the 1-bits in the interval [i,pos] whose positions hash to a value greater than or equal to l. Call [i,pos] the range of Qj(l).

• As we move from level l to l+1, the range may increase.

• The queues at lower numbered levels may have ranges that fail to contain the window, but as we move to higher levels, we will find a level whose range contains the window.


Answering a query for a sliding window of size n<=N

• After each party has observed pos bits:

1. Each party j sends its wave, {Qj(0),..,Qj(log N’)}, to the Referee. Let s=max(0,pos-n+1). Then W=[s,pos] is the desired window.
2. For j=1,..,t, let lj be the minimum level such that the tail of Qj(lj) is a position p<=s.
3. Let l*=max{lj}, j=1,..,t. Let U be the union of all positions in Q1(l*),..,Qt(l*).
4. Return 2^(l*)·|U ∩ W|.
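
The Referee's side of the query can be sketched as a single function. The input format is an assumption made for this example: each party's wave is a list indexed by level, and each level is a pair `(range_start, positions)`, where `range_start` is the left end of the interval that the queue is known to cover (generalizing the tail test in step 2 so that empty queues are handled). The function name is also invented here, and the scaling by 2 to the power l* reflects the selection probability of that level.

```python
def referee_union_count(waves, pos, n):
    """Estimate the number of 1's in the union of the streams over [s, pos]."""
    s = max(0, pos - n + 1)
    # Step 2: for each party, the lowest level whose range covers the window.
    # Assumption: the top level is used if no lower level qualifies.
    chosen = []
    for wave in waves:
        top = len(wave) - 1
        lj = next((l for l, (start, _) in enumerate(wave) if start <= s), top)
        chosen.append(lj)
    # Step 3: use the highest such level for every party and take the union.
    l_star = max(chosen)
    union = set()
    for wave in waves:
        union.update(p for p in wave[l_star][1] if s <= p <= pos)
    # Step 4: each sampled position stands for 2**l_star positions.
    return (2 ** l_star) * len(union)
```

If every party's level-0 queue covers the window, l* is 0 and the function returns the exact size of the union.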


• The algorithm returns an estimate for the Union Counting Problem for any sliding window of size n<=N that is within a relative error ε with probability greater than 2/3.

• space -