
Approximate Frequency Counts over Data Streams

Loo Kin Kong
4th Oct., 2002


Plan
- Motivation
- Paper review: Approximate Frequency Counts over Data Streams
- Finding frequent items
- Finding frequent itemsets
- Performance
- Conclusion


Motivation
- In some new applications, data come as a continuous "stream"
  - The sheer volume of a stream over its lifetime is huge
  - Queries require timely answers
- Examples:
  - Stock ticks
  - Network traffic measurements


Frequent itemset mining on offline databases vs data streams
- Often, level-wise algorithms are used to mine offline databases
  - E.g., the Apriori algorithm and its variants
  - At least 2 database scans are needed
- Level-wise algorithms cannot be applied to mine data streams
  - Cannot go through the data stream multiple times


Paper review: Approximate Frequency Counts over Data Streams
- By G. S. Manku and R. Motwani
- Published in VLDB 02
- Main contributions of the paper:
  - Proposed 2 algorithms to find frequent items that appear in a data stream of items
  - Extended the algorithms to find frequent itemsets


Notations
Some notations:
- Let N denote the current length of the stream
- Let s ∈ (0,1) denote the support threshold
- Let ε ∈ (0,1) denote the error tolerance


Goals of the paper

The algorithm ensures that:
- All itemsets whose true frequency exceeds sN are reported
- No itemset whose true frequency is less than (s − ε)N is output
- Estimated frequencies are less than the true frequencies by at most εN
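
As a rough illustration of the three guarantees, the sketch below compares a reported answer against exact counts computed offline. The function name `check_guarantees` and its arguments are hypothetical and do not come from the paper; it only restates the guarantees as executable checks.

```python
from collections import Counter

def check_guarantees(stream, reported, s, eps):
    """Check the three guarantees for `reported`, a dict {item: estimated count}.

    Returns a tuple of three booleans, one per guarantee.
    """
    n = len(stream)
    true = Counter(stream)
    # 1. Every item whose true frequency exceeds s*N is reported.
    no_false_negatives = all(e in reported for e, c in true.items() if c > s * n)
    # 2. No reported item has a true frequency below (s - eps)*N.
    no_bad_positives = all(true[e] >= (s - eps) * n for e in reported)
    # 3. Estimates never overcount and undercount by at most eps*N.
    bounded_error = all(0 <= true[e] - f <= eps * n for e, f in reported.items())
    return no_false_negatives, no_bad_positives, bounded_error
```

For example, with N = 100, s = 0.25 and ε = 0.05, an item seen 30 times must be reported, an item seen 10 times must not be, and a reported estimate may fall short of the true count by at most 5.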


The simple case: finding frequent items
- Each transaction in the stream contains only 1 item
- 2 algorithms were proposed, namely:
  - Sticky Sampling Algorithm
  - Lossy Counting Algorithm
- Features of the algorithms:
  - Sampling techniques are used
  - Frequency counts found are approximate but error is guaranteed not to exceed a user-specified tolerance level
  - For Lossy Counting, all frequent items are reported


Sticky Sampling Algorithm
- User input includes 3 values, namely:
  - Support threshold s
  - Error tolerance ε
  - Probability of failure δ
- Counts are kept in a data structure S
- Each entry in S is in the form (e,f), where:
  - e is the item
  - f is the frequency of e in the stream since the entry was inserted in S
- When queried about the frequent items, all entries (e,f) such that f ≥ (s − ε)N are output


Sticky Sampling Algorithm (cont'd)
1. S ← ∅; N ← 0; t ← (1/ε) log(1/(sδ)); r ← 1
2. e ← next transaction; N ← N + 1
3. if (e,f) exists in S do
4.   increment the count f
5. else if random(0,1) < 1/r do
6.   insert (e,1) to S
7. endif
8. if N = 2t × 2^n for some integer n ≥ 0 do
9.   r ← 2r
10.  halfSampRate(S);
11. endif
12. Goto 2;

S: The set of all counts
e: Transaction (item)
N: Curr. len. of stream
r: Sampling rate
t: (1/ε) log(1/(sδ))


Sticky Sampling Algorithm: halfSampRate()
1. function halfSampRate(S)
2. for every entry (e,f) in S do
3.   while random(0,1) < 0.5 and f > 0 do
4.     f ← f − 1
5.   if f = 0 do
6.     remove the entry from S
7.   endif
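
The two Sticky Sampling routines can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the class and method names are illustrative, a new item is sampled with probability 1/r, and because t is generally not an integer the rate change is triggered when N first reaches 2t, 4t, 8t, ... rather than on exact equality (an assumption of this sketch).

```python
import math
import random

class StickySampler:
    """Illustrative Sticky Sampling sketch; names are not from the paper."""

    def __init__(self, s, eps, delta, rng=None):
        self.s, self.eps = s, eps
        self.t = (1 / eps) * math.log(1 / (s * delta))
        self.r = 1                   # current sampling rate
        self.n = 0                   # current stream length N
        self.boundary = 2 * self.t   # rate doubles at N = 2t, 4t, 8t, ...
        self.counts = {}             # S: item -> f
        self.rng = rng or random.Random()

    def add(self, e):
        self.n += 1
        if e in self.counts:
            self.counts[e] += 1
        elif self.rng.random() < 1 / self.r:   # sample with probability 1/r
            self.counts[e] = 1
        if self.n >= self.boundary:            # assumption: >= instead of =
            self.r *= 2
            self.boundary *= 2
            self._half_samp_rate()

    def _half_samp_rate(self):
        for e in list(self.counts):
            # Decrement f once per coin toss that comes up below 0.5.
            while self.counts[e] > 0 and self.rng.random() < 0.5:
                self.counts[e] -= 1
            if self.counts[e] == 0:
                del self.counts[e]

    def query(self):
        # Output every entry with f >= (s - eps) * N.
        threshold = (self.s - self.eps) * self.n
        return {e: f for e, f in self.counts.items() if f >= threshold}
```

Passing a seeded `random.Random` as `rng` makes a run repeatable, which is convenient when comparing the estimates with exact counts.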


Lossy Counting Algorithm
- Incoming data stream is conceptually divided into buckets of 1/ε transactions
- Counts are kept in a data structure D
- Each entry in D is in the form (e, f, Δ), where:
  - e is the item
  - f is the frequency of e in the stream since the entry was inserted in D
  - Δ is the maximum count of e in the stream before e was added to D


Lossy Counting Algorithm (cont'd)
1. D ← ∅; N ← 0
2. w ← 1/ε; b ← 1
3. e ← next transaction; N ← N + 1
4. if (e,f,Δ) exists in D do
5.   f ← f + 1
6. else do
7.   insert (e,1,b−1) to D
8. endif
9. if N mod w = 0 do
10.  prune(D, b); b ← b + 1
11. endif
12. Goto 3;

D: The set of all counts
N: Curr. len. of stream
e: Transaction (item)
w: Bucket width
b: Current bucket id


Lossy Counting Algorithm: prune()
1. function prune(D, b)
2. for each entry (e,f,Δ) in D do
3.   if f + Δ ≤ b do
4.     remove the entry from D
5.   endif
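
A compact Python sketch of Lossy Counting for single items is given below. The class and method names are illustrative. The bucket width is rounded up to an integer, w = ⌈1/ε⌉, as in the original paper, and the `query` method reuses the output rule f ≥ (s − ε)N stated for Sticky Sampling, which is an assumption of this sketch as far as these slides are concerned.

```python
import math

class LossyCounter:
    """Illustrative Lossy Counting sketch for a stream of single items."""

    def __init__(self, eps):
        self.eps = eps
        self.w = math.ceil(1 / eps)   # bucket width
        self.b = 1                    # current bucket id
        self.n = 0                    # current stream length N
        self.entries = {}             # D: item -> [f, delta]

    def add(self, e):
        self.n += 1
        if e in self.entries:
            self.entries[e][0] += 1
        else:
            # delta = b - 1 bounds how often e may have occurred unseen.
            self.entries[e] = [1, self.b - 1]
        if self.n % self.w == 0:      # bucket boundary
            self._prune()
            self.b += 1

    def _prune(self):
        # Remove every entry with f + delta <= b.
        stale = [e for e, (f, d) in self.entries.items() if f + d <= self.b]
        for e in stale:
            del self.entries[e]

    def query(self, s):
        threshold = (s - self.eps) * self.n
        return {e: f for e, (f, d) in self.entries.items() if f >= threshold}
```

Unlike Sticky Sampling, there is no randomness here, so the same stream always yields the same entries.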


Lossy Counting
Lossy Counting guarantees that:
- When deletion occurs, b ≤ εN
- If an entry (e, f, Δ) is deleted, fe ≤ b, where fe is the actual frequency count of e
- Hence, if an entry (e, f, Δ) is deleted, fe ≤ εN
- Finally, f ≤ fe ≤ f + εN
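
The chain of inequalities can be spelled out as follows. This is a sketch of the reasoning, assuming the bucket width is w = ⌈1/ε⌉ as in the original paper and that pruning happens only at bucket boundaries, where N = bw.

```latex
\begin{align*}
  N = b\,w,\quad w = \lceil 1/\varepsilon \rceil \ge 1/\varepsilon
    &\;\Longrightarrow\; b = N/w \le \varepsilon N, \\
  \text{entry } (e, f, \Delta) \text{ deleted}
    &\;\Longrightarrow\; f_e \le f + \Delta \le b \le \varepsilon N, \\
  \text{entry } (e, f, \Delta) \text{ present}
    &\;\Longrightarrow\; f \le f_e \le f + \Delta \le f + \varepsilon N .
\end{align*}
```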


Sticky Sampling vs Lossy Counting
- Sticky Sampling is non-deterministic, while Lossy Counting is deterministic
- Experimental results show that Lossy Counting requires fewer entries than Sticky Sampling


The more complex case: finding frequent itemsets
- The Lossy Counting algorithm is extended to find frequent itemsets
- Transactions in the data stream contain any number of items


Overview of the algorithm
- Incoming data stream is conceptually divided into buckets of 1/ε transactions
- Counts are kept in a data structure D
- Multiple buckets (β of them, say) are processed in a batch
- Each entry in D is in the form (set, f, Δ), where:
  - set is the itemset
  - f is the frequency of set in the stream since the entry was inserted in D
  - Δ is the maximum count of set in the stream before set was added to D


Overview of the algorithm (cont'd)
- D is updated by the operations UPDATE_SET and NEW_SET
- UPDATE_SET updates and deletes entries in D
  - For each entry (set, f, Δ), count the occurrences of set in the batch and update the entry
  - If an updated entry satisfies f + Δ ≤ bcurrent, the entry is removed from D
- NEW_SET inserts new entries into D
  - If a set set has frequency f ≥ β in the batch and set does not occur in D, create a new entry (set, f, bcurrent − β)
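
One batch of the itemset algorithm might be sketched as below. This is a naive illustration under stated assumptions, not the paper's implementation: `process_batch` and `max_size` are hypothetical names, `b_current` is taken to be the id of the last bucket in the batch, and every subset up to `max_size` items is enumerated outright, which is exactly what a practical implementation would try to avoid.

```python
from collections import Counter
from itertools import combinations

def process_batch(D, batch, b_current, beta, max_size=3):
    """Apply UPDATE_SET, then NEW_SET, to D for one batch of beta buckets.

    D maps frozenset -> [f, delta] and is updated in place.
    """
    # Naively count every subset (up to max_size items) seen in the batch.
    batch_counts = Counter()
    for txn in batch:
        items = sorted(set(txn))
        for k in range(1, min(max_size, len(items)) + 1):
            for combo in combinations(items, k):
                batch_counts[frozenset(combo)] += 1

    # UPDATE_SET: refresh existing entries, drop those with f + delta <= b_current.
    for itemset in list(D):
        D[itemset][0] += batch_counts.get(itemset, 0)
        f, delta = D[itemset]
        if f + delta <= b_current:
            del D[itemset]

    # NEW_SET: insert sets seen at least beta times that are not yet in D.
    for itemset, f in batch_counts.items():
        if f >= beta and itemset not in D:
            D[itemset] = [f, b_current - beta]
    return D
```

Calling the function once per batch, with `b_current` advanced by `beta` each time, keeps D in step with the stream.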


Implementation
- Challenges:
  - Not to enumerate all subsets of a transaction
  - Data structure must be compact for better space efficiency
- 3 major modules:
  - Buffer
  - Trie
  - SetGen


Implementation (cont'd)
- Buffer: repeatedly reads in a batch of buckets of transactions, where each transaction is a set of item-id's, into available main memory
- Trie: maintains the data structure D
- SetGen: generates subsets of item-id's along with their frequency counts in the current batch
  - Not all possible subsets need to be generated
  - If a subset S is not inserted into D after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered
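
The superset-pruning idea behind SetGen can be illustrated with a level-wise, Apriori-style sketch. This is a stand-in technique chosen for brevity, not the SetGen module described in the paper. The function name and the `keep` callback are hypothetical; `keep(itemset, count)` stands for "the itemset ends up in D after UPDATE_SET and NEW_SET".

```python
from collections import Counter
from itertools import combinations

def generate_counted_subsets(batch, keep):
    """Count itemsets in a batch level by level, skipping supersets of dropped sets."""
    txns = [frozenset(t) for t in batch]
    # Level 1: count single items.
    level = Counter(frozenset([i]) for t in txns for i in t)
    k = 1
    result = {}
    while level:
        kept = {s for s, c in level.items() if keep(s, c)}
        result.update({s: level[s] for s in kept})
        k += 1
        nxt = Counter()
        for t in txns:
            for combo in combinations(sorted(t), k):
                cand = frozenset(combo)
                # Consider a k-itemset only if all its (k-1)-subsets were kept.
                if all(cand - {i} in kept for i in cand):
                    nxt[cand] += 1
        level = nxt
    return result
```

With `keep = lambda s, c: c >= 2`, for instance, a pair is counted only when both of its items occurred at least twice in the batch, and a triple only when all three of its pairs did.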


Performance
- IBM dataset (T10 I4 D1000K / 10K items)


Performance (cont'd)
- Compared with Apriori
- IBM dataset (T10 I4 D1000K / 10K items)


Conclusion
- Sticky Sampling and Lossy Counting are 2 approximate algorithms that can find frequent items
- Both algorithms produce frequency counts within a user-specified error tolerance level, though Sticky Sampling is non-deterministic
- Lossy Counting can be extended to find frequent itemsets


Reference
- G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In VLDB 02, Hong Kong, 2002


Q & A