
Approximate Frequency Counts over Data Streams

Loo Kin Kong
4th Oct., 2002

Plan
  Motivation
  Paper review: Approximate Frequency Counts over Data Streams
  Finding frequent items
  Finding frequent itemsets
  Performance
  Conclusion

Motivation
  In some new applications, data arrive as a continuous “stream”
    The sheer volume of a stream over its lifetime is huge
    Queries require timely answers
  Examples:
    Stock ticks
    Network traffic measurements

Frequent itemset mining on offline databases vs data streams
  Level-wise algorithms are often used to mine offline databases
    E.g., the Apriori algorithm and its variants
    At least 2 database scans are needed
  Level-wise algorithms cannot be applied to mine data streams
    A data stream cannot be scanned multiple times

Paper review: Approximate Frequency Counts over Data Streams
  By G. S. Manku and R. Motwani
  Published in VLDB 02
  Main contributions of the paper:
    Proposed 2 algorithms to find the frequent items that appear in a data stream of items
    Extended the algorithms to find frequent itemsets

Notations
  Let N denote the current length of the stream
  Let s ∈ (0,1) denote the support threshold
  Let ε ∈ (0,1) denote the error tolerance

Goals of the paper
  The algorithms ensure that:
    All itemsets whose true frequency exceeds sN are reported
    No itemset whose true frequency is less than (s - ε)N is output
    Estimated frequencies are less than the true frequencies by at most εN
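As a small worked example of these three guarantees, the sketch below turns s, ε and N into concrete counts. The function name `output_thresholds` and the sample values (s = 1%, ε = 0.1%, N = 1,000,000) are illustrative assumptions, not taken from the slides.

```python
def output_thresholds(s: float, eps: float, n: int) -> dict:
    """Return the counts that delimit the three guarantees for a stream of length n.

    Hypothetical helper for illustration only.
    """
    return {
        # every itemset with true frequency above this is reported
        "must_report_above": s * n,
        # no itemset with true frequency below this is output
        "never_report_below": (s - eps) * n,
        # an estimate falls short of the true frequency by at most this much
        "max_undercount": eps * n,
    }


if __name__ == "__main__":
    for name, value in output_thresholds(0.01, 0.001, 1_000_000).items():
        print(f"{name}: {value:.0f}")
```

Itemsets whose true frequency lies between (s - ε)N and sN may or may not be reported; that band is the price paid for answering in one pass.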

The simple case: finding frequent items
  Each transaction in the stream contains only 1 item
  2 algorithms were proposed, namely:
    Sticky Sampling Algorithm
    Lossy Counting Algorithm
  Features of the algorithms:
    Sampling techniques are used
    Frequency counts found are approximate, but the error is guaranteed not to exceed a user-specified tolerance level
    For Lossy Counting, all frequent items are reported

Sticky Sampling Algorithm
  User input includes 3 values, namely:
    Support threshold s
    Error tolerance ε
    Probability of failure δ
  Counts are kept in a data structure S
  Each entry in S is of the form (e, f), where:
    e is the item
    f is the frequency of e in the stream since the entry was inserted in S
  When queried about the frequent items, all entries (e, f) such that f ≥ (s - ε)N are output

Sticky Sampling Algorithm (cont’d)
  1. S ← ∅; N ← 0; t ← (1/ε) log(1/(sδ)); r ← 1
  2. e ← next transaction; N ← N + 1
  3. if (e, f) exists in S do
  4.     increment the count f
  5. else if random(0,1) < 1/r do
  6.     insert (e, 1) to S
  7. endif
  8. if N = 2t × 2^n for some integer n ≥ 0 do
  9.     r ← 2r
  10.    halfSampRate(S)
  11. endif
  12. Goto 2
  S: The set of all counts
  e: Transaction (item)
  N: Current length of stream
  r: Sampling rate (a new item is sampled with probability 1/r)
  t: (1/ε) log(1/(sδ))

Sticky Sampling Algorithm: halfSampRate()
  1. function halfSampRate(S)
  2. for every entry (e, f) in S do
  3.     while random(0,1) < 0.5 and f > 0 do
  4.         f ← f - 1
  5.     if f = 0 do
  6.         remove the entry from S
  7.     endif
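The Sticky Sampling pseudocode can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, t is taken to be (1/ε) log(1/(sδ)) with the natural logarithm, a new item is sampled with probability 1/r, and because t is generally not an integer the rate change is triggered with `>=` rather than an exact equality test.

```python
import math
import random


class StickySampling:
    """Illustrative sketch of Sticky Sampling for single-item transactions."""

    def __init__(self, s, eps, delta, rng=None):
        self.s = s
        self.eps = eps
        self.t = (1.0 / eps) * math.log(1.0 / (s * delta))
        self.rate = 1                  # r: unseen items are sampled with probability 1/r
        self.boundary = 2 * self.t     # stream length at which r doubles next
        self.n = 0                     # N: current length of the stream
        self.counts = {}               # S: item -> f
        self.rng = rng or random.Random()

    def add(self, item):
        self.n += 1
        if item in self.counts:
            self.counts[item] += 1
        elif self.rng.random() < 1.0 / self.rate:
            self.counts[item] = 1
        if self.n >= self.boundary:    # N has reached 2t, 4t, 8t, ...
            self.rate *= 2
            self.boundary *= 2
            self._halve_sampling_rate()

    def _halve_sampling_rate(self):
        """For each entry, toss a fair coin and decrement f until the first head."""
        for item in list(self.counts):
            while self.counts[item] > 0 and self.rng.random() < 0.5:
                self.counts[item] -= 1
            if self.counts[item] == 0:
                del self.counts[item]

    def frequent(self):
        """Return all entries with f >= (s - eps) * N."""
        threshold = (self.s - self.eps) * self.n
        return {e: f for e, f in self.counts.items() if f >= threshold}
```

While r = 1 every item is counted exactly; after each rate change the stored counts are thinned so that they look as if they had been collected at the new rate from the start. Passing a seeded `random.Random` makes a run reproducible.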

Lossy Counting Algorithm
  Incoming data stream is conceptually divided into buckets of ⌈1/ε⌉ transactions
  Counts are kept in a data structure D
  Each entry in D is of the form (e, f, Δ), where:
    e is the item
    f is the frequency of e in the stream since the entry was inserted in D
    Δ is the maximum possible count of e in the stream before e was added to D

Lossy Counting Algorithm (cont’d)
  1. D ← ∅; N ← 0
  2. w ← ⌈1/ε⌉; b ← 1
  3. e ← next transaction; N ← N + 1
  4. if (e, f, Δ) exists in D do
  5.     f ← f + 1
  6. else do
  7.     insert (e, 1, b - 1) to D
  8. endif
  9. if N mod w = 0 do
  10.    prune(D, b); b ← b + 1
  11. endif
  12. Goto 3
  D: The set of all counts
  N: Current length of stream
  e: Transaction (item)
  w: Bucket width
  b: Current bucket id

Lossy Counting Algorithm – prune()
  1. function prune(D, b)
  2. for each entry (e, f, Δ) in D do
  3.     if f + Δ ≤ b do
  4.         remove the entry from D
  5.     endif
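A minimal Python sketch of Lossy Counting for single-item transactions is given below. The class and method names are hypothetical, the bucket width is assumed to be ⌈1/ε⌉, and the query rule f ≥ (s - ε)N is assumed to be the same as the one stated for Sticky Sampling.

```python
import math


class LossyCounting:
    """Illustrative sketch of Lossy Counting for single-item transactions."""

    def __init__(self, eps):
        self.eps = eps
        self.width = math.ceil(1.0 / eps)  # w: bucket width
        self.bucket = 1                    # b: current bucket id
        self.n = 0                         # N: current length of the stream
        self.entries = {}                  # D: item -> [f, delta]

    def add(self, item):
        self.n += 1
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # delta = b - 1 bounds how often the item may have been missed
            self.entries[item] = [1, self.bucket - 1]
        if self.n % self.width == 0:       # bucket boundary
            self._prune()
            self.bucket += 1

    def _prune(self):
        """Drop every entry with f + delta <= b."""
        self.entries = {
            e: fd for e, fd in self.entries.items() if fd[0] + fd[1] > self.bucket
        }

    def frequent(self, s):
        """Return all entries with f >= (s - eps) * N."""
        threshold = (s - self.eps) * self.n
        return {e: fd[0] for e, fd in self.entries.items() if fd[0] >= threshold}


if __name__ == "__main__":
    lc = LossyCounting(eps=0.2)            # bucket width 5
    for item in "abacadaeaf":
        lc.add(item)
    print(lc.entries)                      # only "a" survives both prunes
```

Unlike Sticky Sampling, nothing here is random: the same stream always produces the same D.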

Lossy Counting
  Lossy Counting guarantees that:
    When deletion occurs, b ≤ εN
    If an entry (e, f, Δ) is deleted, f_e ≤ b, where f_e is the actual frequency count of e
    Hence, if an entry (e, f, Δ) is deleted, f_e ≤ εN
    Finally, f ≤ f_e ≤ f + εN
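The chain of inequalities can be spelled out step by step. The derivation below is a reconstruction of the standard argument, assuming a bucket width w = ⌈1/ε⌉ and pruning exactly at bucket boundaries as in the pseudocode; it is a sketch rather than a quotation from the paper.

```latex
% Pruning happens only when N = b w, and w = \lceil 1/\varepsilon \rceil \ge 1/\varepsilon, so
b = \frac{N}{w} \le \varepsilon N .

% An entry inserted during bucket b' carries \Delta = b' - 1, which bounds the
% occurrences of e that were missed before insertion. Hence, while the entry is in D,
f \le f_e \le f + \Delta .

% The entry is deleted at the boundary of bucket b only if f + \Delta \le b, therefore
f_e \le f + \Delta \le b \le \varepsilon N .

% For an entry still in D, \Delta \le b_{\mathrm{current}} - 1 \le N / w \le \varepsilon N, giving
f \le f_e \le f + \varepsilon N .
```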

Sticky Sampling vs Lossy Counting
  Sticky Sampling is non-deterministic, while Lossy Counting is deterministic
  Experimental results show that Lossy Counting requires fewer entries than Sticky Sampling

The more complex case: finding frequent itemsets
  The Lossy Counting algorithm is extended to find frequent itemsets
  Transactions in the data stream may contain any number of items

Overview of the algorithm
  Incoming data stream is conceptually divided into buckets of ⌈1/ε⌉ transactions
  Counts are kept in a data structure D
  Multiple buckets (β of them, say) are processed in a batch
  Each entry in D is of the form (set, f, Δ), where:
    set is the itemset
    f is the frequency of set in the stream since the entry was inserted in D
    Δ is the maximum possible count of set in the stream before set was added to D

Overview of the algorithm (cont’d)
  D is updated by the operations UPDATE_SET and NEW_SET
  UPDATE_SET updates and deletes entries in D
    For each entry (set, f, Δ), count the occurrences of set in the batch and update the entry
    If an updated entry satisfies f + Δ ≤ b_current, the entry is removed from D
  NEW_SET inserts new entries into D
    If an itemset set has frequency f ≥ β in the batch and set does not occur in D, create a new entry (set, f, b_current - β)
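One batch of UPDATE_SET followed by NEW_SET might look like the sketch below. Several things here are assumptions made for illustration: the function name `process_batch`, the `max_size` cap, the reading of b_current as the id of the last bucket in the batch, and the insertion threshold f ≥ β, which is taken from the Manku and Motwani paper. The subset counting is deliberately naive (it enumerates every subset up to `max_size`), which is exactly what a real implementation tries to avoid.

```python
from collections import Counter
from itertools import combinations


def process_batch(D, batch, b_current, beta, max_size=3):
    """Apply UPDATE_SET and then NEW_SET to D for one batch of beta buckets.

    D maps frozenset -> [f, delta]; batch is a list of transactions (iterables of items).
    """
    # Naive counting of all subsets up to max_size in this batch.
    batch_counts = Counter()
    for transaction in batch:
        items = sorted(set(transaction))
        for k in range(1, min(max_size, len(items)) + 1):
            for subset in combinations(items, k):
                batch_counts[frozenset(subset)] += 1

    known = set(D)  # itemsets that were in D before this batch

    # UPDATE_SET: refresh existing entries, then drop those with f + delta <= b_current.
    for itemset in known:
        D[itemset][0] += batch_counts.get(itemset, 0)
        f, delta = D[itemset]
        if f + delta <= b_current:
            del D[itemset]

    # NEW_SET: insert itemsets that were not in D and occur at least beta times.
    for itemset, f in batch_counts.items():
        if itemset not in known and f >= beta:
            D[itemset] = [f, b_current - beta]
    return D


if __name__ == "__main__":
    D = {}
    process_batch(D, [["a", "b"], ["a", "b"], ["a", "c"], ["d"]], b_current=2, beta=2)
    process_batch(D, [["a"], ["a"], ["c"], ["c"]], b_current=4, beta=2)
    print(D)
```

Each itemset is handled by exactly one of the two operations per batch, so an entry removed by UPDATE_SET is not re-inserted by NEW_SET in the same batch.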

Implementation
  Challenges:
    Avoid enumerating all subsets of a transaction
    The data structure must be compact for better space efficiency
  3 major modules:
    Buffer
    Trie
    SetGen

Implementation (cont’d)
  Buffer: repeatedly reads in a batch of buckets of transactions, where each transaction is a set of item-id’s, into available main memory
  Trie: maintains the data structure D
  SetGen: generates subsets of item-id’s along with their frequency counts in the current batch
    Not all possible subsets need to be generated
    If a subset S is not inserted into D after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered
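The superset-pruning rule can be illustrated with a level-wise sketch. This is not the paper's SetGen implementation: the function name `setgen`, the level-by-level strategy, and the `keep` callback (standing in for "the subset is in D after UPDATE_SET and NEW_SET") are all assumptions made to keep the example short.

```python
from collections import Counter


def setgen(batch, keep):
    """Count subsets in a batch, never extending a subset that was not kept.

    keep(itemset, count) -> bool is a hypothetical stand-in for the test
    "itemset is in D after UPDATE_SET and NEW_SET".
    """
    transactions = [sorted(set(t)) for t in batch]
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    result = {}
    while counts:
        kept = {s: c for s, c in counts.items() if keep(s, c)}
        result.update(kept)
        counts = Counter()
        for t in transactions:
            for s in kept:
                if not s.issubset(t):
                    continue
                largest = max(s)
                # Extend s only with larger items so each candidate is built once.
                for item in t:
                    if item > largest:
                        candidate = s | {item}
                        # Skip the candidate if any of its subsets was dropped.
                        if all(candidate - {x} in kept for x in candidate):
                            counts[candidate] += 1
    return result


if __name__ == "__main__":
    batch = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
    print(setgen(batch, keep=lambda itemset, count: count >= 2))
```

In this run {d} is dropped at the first level, so {b, d} is never counted, and {a, b, c} is skipped because {b, c} was dropped at the second level.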

Performance
  IBM dataset (T10 I4 D1000K / 10K items)

Performance (cont’d)
  Compared with Apriori
  IBM dataset (T10 I4 D1000K / 10K items)

Conclusion
  Sticky Sampling and Lossy Counting are 2 approximate algorithms that can find frequent items
  Both algorithms produce frequency counts within a user-specified error tolerance level, though Sticky Sampling is non-deterministic
  Lossy Counting can be extended to find frequent itemsets

Reference
  G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In VLDB 02, Hong Kong, 2002.
