what’s hot and what’s not: tracking most frequent items dynamically by graham cormode & s....
Post on 15-Jan-2016
![Page 1: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/1.jpg)
What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically
By Graham Cormode
& S. Muthukrishnan
Rutgers University, Piscataway NY
Presented by Tal Sterenzy
![Page 2: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/2.jpg)
Motivation
• A basic statistic on a database relation is which items are hot, i.e. occur frequently
• Goal: dynamically maintain the hot items in the presence of delete and insert transactions
• Examples:
  • DBMS: keep statistics to improve performance
  • Telecommunication networks: network connections start and end over time
![Page 3: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/3.jpg)
Overview
• Definitions
• Prior work
• Algorithm description & analysis
• Experimental results
• Summary
![Page 4: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/4.jpg)
Formal definition
• Sequence of n transactions on m items [1…m]
• $n_i(t)$: net occurrence of item i at time t, i.e. the number of times it has been inserted minus the number of times it has been deleted
• $f_i(t) = n_i(t) / \sum_{j=1}^{m} n_j(t)$: current frequency of item i at time t
• $f^*(t) = \max_i f_i(t)$: frequency of the most frequent item at time t
• The k most frequent items at time t are those with the k largest $f_i(t)$
![Page 5: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/5.jpg)
Finding k hot items
• k is a parameter
• Item i is a hot item if $f_i(t) > 1/(k+1)$
• Hot items are frequent items that make up a significant fraction of the entire dataset
• There can be at most k hot items, and there can be none
• Assume the basic integrity constraint $n_i(t) \ge 0$: an item is never deleted more often than it has been inserted
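As a point of reference, the definitions can be checked with an exact (non-sketched) baseline that stores one counter per item. This is only an illustrative sketch: the `("ins", item)` / `("del", item)` transaction encoding and the function names are assumptions, not part of the slides.

```python
from collections import Counter

def net_counts(transactions):
    """Return the net occurrence n_i of every item after all transactions."""
    counts = Counter()
    for op, item in transactions:
        counts[item] += 1 if op == "ins" else -1
    return counts

def hot_items(transactions, k):
    """Return the items whose frequency exceeds 1/(k+1)."""
    counts = net_counts(transactions)
    total = sum(counts.values())
    if total <= 0:
        return set()
    # f_i > 1/(k+1) is equivalent to n_i * (k+1) > total (no float rounding).
    return {i for i, n in counts.items() if n * (k + 1) > total}

stream = [("ins", 7), ("ins", 7), ("ins", 3), ("ins", 5), ("del", 5), ("ins", 7)]
print(hot_items(stream, k=2))
```

This baseline needs space proportional to the number of distinct items, which is exactly what the group-testing summary is meant to avoid.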
![Page 6: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/6.jpg)
Our algorithm
• A highly efficient, randomized algorithm for maintaining hot items in a dynamically changing database
• Monitors the changes to the data distribution and maintains a summary of size O(k log k log m)
• When queried, finds all hot items in time O(k log k log m) with probability 1−δ
• No need to scan the underlying relation
![Page 7: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/7.jpg)
Small tail assumption
• Restriction: let $f_1 \ge f_2 \ge \dots \ge f_m$ be the frequencies of the items in decreasing order
• A set of frequencies has a small tail if $\sum_{i>k} f_i(t) \le 1/(k+1)$
• If there are k hot items, then the small tail property holds
• If the small tail property holds, some of the top k items might still not be hot
• We shall analyze our solution in the presence and absence of this small tail property (STP)
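A small sketch of how the small tail property could be tested on exact counts. It assumes the condition is "the frequencies outside the top k sum to at most 1/(k+1)"; the function name and the use of raw counts instead of frequencies are illustrative choices.

```python
def has_small_tail(counts, k):
    """Check the small tail property on a list of net item counts."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    tail = sum(ordered[k:])  # everything outside the k largest counts
    # tail / total <= 1/(k+1), written with integers only
    return tail * (k + 1) <= total

print(has_small_tail([5, 4, 1], k=2))     # tail is 1 out of 10
print(has_small_tail([3, 3, 3, 3], k=2))  # tail is 6 out of 12
```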
![Page 8: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/8.jpg)
Prior work – why is it not adaptable?
• All these algorithms hold counters:
  • incremented when the item is observed
  • decremented or reallocated under certain circumstances
• These algorithms can't be adapted directly to insertions and deletions: after an item is inserted and later deleted, the state of the algorithm differs from the state it would have reached had those transactions never occurred
• Work on dynamic data is sparse and provides no guarantees for the fully dynamic case with deletions
![Page 9: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/9.jpg)
Our algorithm - idea
• Do not keep counters of individual items, but rather of subsets of items
• Ideas from group testing:
  • Design a number of tests, each of which groups together some of the m items, in order to find up to k items which test positive
  • Here: find the k items that are hot
  • Minimize the number of tests, where each group consists of a subset of items
![Page 10: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/10.jpg)
General procedure
• For each transaction on item i, determine which subsets it is included in: S(i)
• Each subset has a counter:
  • For an insertion: increment the counters of all subsets in S(i)
  • For a deletion: decrement the counters of all subsets in S(i)
• The test will be: does the counter exceed a threshold?
• Identifying the hot items is done by combining test results from several groups
![Page 11: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/11.jpg)
The challenge is choosing the subsets
• Bounding the number of required subsets
• Finding a concise representation of the groups
• Giving an efficient way to go from the results of the tests to the set of hot items
Let's start with a simple case: k = 1 (frequency > 1/2)
• Deterministic algorithm for maintaining the majority item
![Page 12: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/12.jpg)
Finding majority item
• For insertions only: constant time and space
• Keep log m + 1 counters:
  • 1 counter of the items “alive”: $c_0 = n(t) = \sum_i n_i(t)$
  • The rest are labeled $c_1 \dots c_{\log m}$, one per group
• Each group represents a bit in the binary representation of the item
• Each group consists of half of the items
![Page 13: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/13.jpg)
Finding majority item – cont.
• bit(i, j) – reports the value of the j-th bit in the binary representation of i
• gt(i, j) – returns 1 if i > j, 0 otherwise
Scheme:
• Insertion of item i: increment each counter $c_j$ such that bit(i, j) = 1, in time O(log m)
• Deletion of item i: decrement each counter $c_j$ such that bit(i, j) = 1, in time O(log m)
• Query: if there is a majority, then it is given by $\sum_{j=1}^{\log_2 m} 2^{j-1}\, gt(c_j, c_0/2)$, computed in time O(log m)
![Page 14: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/14.jpg)
Finding majority item – cont.
• Theorem: the algorithm finds a majority item, if there is one, in time O(log m) per operation
• The state of the data structure is the same whether an item had I insertions and D deletions, or just c = I − D insertions
• So it suffices to consider insertions only: for every bit, the half that contains the majority item holds more than half of the total count, so the majority is found
![Page 15: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/15.jpg)
UpdateCounters procedure
int c[0…log m]
UpdateCounters(i, transtype, c[0…log m])
    if (transtype = ins) then diff = +1 else diff = -1
    c[0] = c[0] + diff
    for j = 1 to log m do
        if (transtype = ins)
            c[j] = c[j] + bit(i, j)
        else
            c[j] = c[j] - bit(i, j)
![Page 16: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/16.jpg)
FindMajority procedure
FindMajority(c[0 … log m])
    position = 0, t = 1
    for j = 1 to log m do
        if (c[j] > c[0]/2) then
            position = position + t
        t = 2 * t
    return position
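The two procedures can be sketched in Python as follows. This is a minimal illustration, assuming item identifiers are 0-based integers in [0, m) so that ⌈log2 m⌉ bit counters suffice; the class and method names are not from the slides.

```python
class MajorityTracker:
    """Maintains c[0] (live items) and one counter per bit position."""

    def __init__(self, m):
        self.bits = max(1, (m - 1).bit_length())
        self.c = [0] * (self.bits + 1)

    def update(self, item, is_insert):
        """Apply one insertion or deletion of `item` in O(log m) time."""
        diff = 1 if is_insert else -1
        self.c[0] += diff
        for j in range(1, self.bits + 1):
            if (item >> (j - 1)) & 1:  # bit(item, j), j = 1 is the lowest bit
                self.c[j] += diff

    def find_majority(self):
        """Return the majority item; meaningful only if a majority exists."""
        position, t = 0, 1
        for j in range(1, self.bits + 1):
            if 2 * self.c[j] > self.c[0]:  # c[j] > c[0] / 2
                position += t
            t *= 2
        return position

tracker = MajorityTracker(m=8)
for item in (5, 5, 5, 2, 3):
    tracker.update(item, is_insert=True)
tracker.update(2, is_insert=False)
print(tracker.find_majority())
```

If no item holds a strict majority, the returned value is arbitrary, so a caller would need a separate check before trusting it.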
![Page 17: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/17.jpg)
Randomized constructions for finding hot items
• Observation: if we select subsets that contain exactly one hot item, applying the majority algorithm to each subset will identify the hot item
• Definition: let $F \subseteq [1 \dots m]$ denote the set of hot items. A set $S \subseteq [1 \dots m]$ is a good set if $|S \cap F| = 1$
![Page 18: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/18.jpg)
How many subsets do we need?
• Theorem: picking O(k log k) subsets, each by drawing m/k items uniformly from [1…m], means that with constant probability we have included k good subsets $S_1 \dots S_k$ such that $\bigcup_i (F \cap S_i) = F$
• Proof: let p be the probability that a subset picks exactly one item from F: $p = \frac{m}{k} \cdot \frac{k}{m} \left(1 - \frac{k}{m}\right)^{m/k - 1}$, and for $1 \le k \le m/2$ this gives $p \ge 1/4$
• O(k log k) subsets will guarantee with constant probability that we have a good subset for each hot item (coupon collector's problem)
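The constant lower bound on p can be checked numerically. The sketch below assumes that exactly k of the m items are hot, that the m/k draws are independent and with replacement, and that k divides m; the function name is illustrative.

```python
def prob_exactly_one_hot(m, k):
    """Probability that m/k uniform draws hit the k hot items exactly once."""
    draws = m // k
    q = k / m  # chance that a single draw is a hot item
    return draws * q * (1 - q) ** (draws - 1)

for k in (1, 2, 8, 64, 512):
    print(k, round(prob_exactly_one_hot(1024, k), 3))
```

Under these assumptions p stays roughly between 1/e and 1/2 over the whole range 1 ≤ k ≤ m/2, comfortably above 1/4.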
![Page 19: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/19.jpg)
Coupon collector problem
• p is the probability that a coupon is good
• X – number of trials required to collect at least one of each type of coupon
• Epoch i begins after the i-th success and ends with the (i+1)-th success
• $X_i$ – number of trials in the i-th epoch
• $X_i$ is distributed geometrically with $p_i = p(k-i)/k$
$$E[X] = \sum_{i=0}^{k-1} E[X_i] = \sum_{i=0}^{k-1} \frac{1}{p_i} = \sum_{i=0}^{k-1} \frac{k}{p(k-i)} = \frac{k}{p} \sum_{i=1}^{k} \frac{1}{i} = \frac{k}{p} \ln k + O(k) = O(k \log k)$$
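The expectation can also be evaluated exactly for small k. This sketch assumes each epoch is geometric with success probability p(k − i)/k as stated above; the function name is illustrative.

```python
from fractions import Fraction

def expected_trials(k, p):
    """E[X] = sum over epochs i = 0..k-1 of k / (p * (k - i)) = (k/p) * H_k."""
    harmonic_part = sum(Fraction(k, k - i) for i in range(k))
    return harmonic_part / Fraction(p)

print(expected_trials(3, Fraction(1, 2)))  # (1 + 3/2 + 3) / (1/2)
```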
![Page 20: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/20.jpg)
Defining the groups with universal hash functions
• The groups are chosen in a pseudo-random way using universal hash functions:
  • Fix a prime P > 2k
  • a, b are drawn uniformly from [0…P−1]
  • Then set: $h_{a,b}(x) = ((ax + b) \bmod P) \bmod 2k$ and $S_{a,b,i} = \{x \mid h_{a,b}(x) = i\}$
• Fact: over all choices of a and b, for $x \ne y$: $\Pr(h_{a,b}(x) = h_{a,b}(y)) \le \frac{1}{2k}$
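A sketch of drawing one such hash function and listing the 2k groups it induces. The modulus 2^31 − 1 (a Mersenne prime, assumed here to exceed every item identifier) and the helper names are illustrative assumptions.

```python
import random

MERSENNE_PRIME = 2**31 - 1  # assumed prime modulus P

def draw_hash(k, rng, prime=MERSENNE_PRIME):
    """Draw a, b uniformly from [0, P) and return h(x) = ((a*x + b) mod P) mod 2k."""
    a = rng.randrange(prime)
    b = rng.randrange(prime)

    def h(x):
        return ((a * x + b) % prime) % (2 * k)

    return h

def groups(h, m, k):
    """Materialize S_0 .. S_{2k-1} for items 1..m (for illustration only)."""
    sets = [[] for _ in range(2 * k)]
    for x in range(1, m + 1):
        sets[h(x)].append(x)
    return sets

h = draw_hash(k=3, rng=random.Random(0))
for index, members in enumerate(groups(h, m=50, k=3)):
    print(index, members)
```

In the streaming setting the groups are never materialized like this; only the pair (a, b) is stored and h is evaluated per transaction.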
![Page 21: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/21.jpg)
Choosing and updating the subsets
• We choose T = log(k/δ) values of a and b, which creates 2kT = 2k log(k/δ) subsets of items
• Processing an item i means:
  • Determine which T sets i belongs to
  • For each one: update log m counters based on the bit representation of i
• If the set is good, this gives us the hot item
![Page 22: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/22.jpg)
Space requirements
• a and b are O(m), so storing all of them takes O(log(k/δ) log m)
• Number of counters: 2k log(k/δ) (log m + 1)
• Total space: O(k log(k/δ) log m)
Diagram: log(k/δ) choices of a, b × 2k subsets × (log m + 1) counters
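To get a feel for the sizes involved, the counter count can be computed directly. The sketch assumes base-2 logarithms rounded up to whole numbers; the function name is illustrative.

```python
import math

def summary_size(k, delta, m):
    """Return (T, number of counters) for parameters k, delta and m items."""
    T = math.ceil(math.log2(k / delta))  # repetitions, i.e. choices of (a, b)
    bits = math.ceil(math.log2(m))       # bit counters per subset
    return T, 2 * k * T * (bits + 1)

print(summary_size(k=4, delta=0.25, m=1024))
```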
![Page 23: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/23.jpg)
Probability of each hot item being in at least one good subset is at least 1−δ
• Consider one hot item i: in each of the T repetitions we put it in one of 2k groups
• The expected total frequency f of the other items in its group: $E[f] = \sum_{j \ne i} \frac{f_j}{2k} \le \frac{1}{2k} \cdot \frac{k}{k+1} = \frac{1}{2(k+1)}$
• If f < 1/(k+1), the majority will be found: success
• If f > 1/(k+1), the majority can't be found: failure
• Probability of failure < 1/2 (by the Markov inequality)
• Probability to fail in every one of the T repetitions < $(1/2)^{\log(k/\delta)} = \delta/k$
• Probability of any hot item failing is at most δ
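The steps of this argument can be written out as follows. This is a sketch that assumes logarithms are base 2, that the T repetitions use independent choices of a and b, and that f denotes the total frequency of the other items sharing a group with the hot item.

$$\Pr\left[f \ge \frac{1}{k+1}\right] \le \frac{E[f]}{1/(k+1)} \le \frac{1/(2(k+1))}{1/(k+1)} = \frac{1}{2}$$

$$\Pr[\text{a given hot item fails in all } T \text{ repetitions}] \le \left(\frac{1}{2}\right)^{T} = 2^{-\log_2(k/\delta)} = \frac{\delta}{k}$$

$$\Pr[\text{some hot item fails}] \le k \cdot \frac{\delta}{k} = \delta$$

The last line is a union bound over the at most k hot items.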
![Page 24: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/24.jpg)
Detecting good subsets
• Given a subset $S_{a,b,i}$ and its associated counters $c_0 \dots c_{\log m}$, it is possible to detect deterministically whether the subset is a good subset
• Proof: a subset can fail in two cases:
  • No hot items (assuming STP): then $c_0 \le c/(k+1)$
  • More than one hot item: there will be a j such that $c_j > c/(k+1)$ and $c_0 - c_j > c/(k+1)$
• Otherwise a good subset is determined
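The two failure cases translate into a short check on one subset's counters. In this sketch `total` stands for c, the number of live items overall, and `counters[0]` for the subset's own count; the function name and the returned labels are illustrative.

```python
def classify_subset(counters, total, k):
    """Classify one subset from its counters [c_0, c_1, ..., c_logm]."""
    c0 = counters[0]
    if c0 * (k + 1) <= total:  # c_0 <= c / (k+1)
        return "no hot item"
    for cj in counters[1:]:
        # Both halves of this bit split exceed the threshold c / (k+1).
        if cj * (k + 1) > total and (c0 - cj) * (k + 1) > total:
            return "more than one hot item"
    return "good"

print(classify_subset([6, 6, 0], total=12, k=2))
print(classify_subset([10, 5, 5], total=12, k=2))
print(classify_subset([3, 3, 0], total=12, k=2))
```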
![Page 25: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/25.jpg)
ProcessItem procedure
Initialize c[0 … 2Tk − 1][0 … log m]
Draw a[1 … T], b[1 … T], c = 0
ProcessItem(i, transtype, T, k)
    if (transtype = ins) then
        c = c + 1
    else
        c = c − 1
    for x = 1 to T do
        index = 2k(x − 1) + ((i * a[x] + b[x]) mod P) mod 2k
        UpdateCounters(i, transtype, c[index])
![Page 26: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/26.jpg)
GroupTest procedure
GroupTest(T, k, b)
    for i = 0 to 2Tk − 1 do
        if c[i][0] > c * b
            position = 0; t = 1
            for j = 1 to log m do
                if (c[i][j] > c * b and c[i][0] − c[i][j] > c * b) then
                    skip to next i
                if c[i][j] > c * b
                    position += t
                t = 2 * t
            output position
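Putting the update and query procedures together gives the following Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: item identifiers are 0-based integers in [0, m), the prime is 2^31 − 1, logarithms are base 2 and rounded up, and all names are hypothetical.

```python
import math
import random

class HotItemTracker:
    def __init__(self, m, k, delta, prime=2**31 - 1, seed=None):
        rng = random.Random(seed)
        self.k, self.P = k, prime
        self.bits = max(1, (m - 1).bit_length())
        self.T = max(1, math.ceil(math.log2(k / delta)))
        self.ab = [(rng.randrange(prime), rng.randrange(prime))
                   for _ in range(self.T)]
        self.total = 0  # the scalar c: number of live items
        self.c = [[0] * (self.bits + 1) for _ in range(2 * k * self.T)]

    def process_item(self, item, is_insert):
        diff = 1 if is_insert else -1
        self.total += diff
        for x, (a, b) in enumerate(self.ab):
            group = ((a * item + b) % self.P) % (2 * self.k)
            row = self.c[2 * self.k * x + group]
            row[0] += diff
            for j in range(1, self.bits + 1):
                if (item >> (j - 1)) & 1:
                    row[j] += diff

    def group_test(self):
        """Return the candidate hot items (frequency above 1/(k+1))."""
        found = set()
        scale = self.k + 1  # compare c * (k+1) > total instead of dividing
        for row in self.c:
            if row[0] * scale <= self.total:
                continue  # subset cannot contain a hot item
            position, t, good = 0, 1, True
            for j in range(1, self.bits + 1):
                ones_heavy = row[j] * scale > self.total
                zeros_heavy = (row[0] - row[j]) * scale > self.total
                if ones_heavy and zeros_heavy:
                    good = False  # more than one hot item in this subset
                    break
                if ones_heavy:
                    position += t
                t *= 2
            if good:
                found.add(position)
        return found

tracker = HotItemTracker(m=16, k=2, delta=0.1, seed=1)
for _ in range(10):
    tracker.process_item(5, True)
for light in (1, 2, 3):
    tracker.process_item(light, True)
tracker.process_item(9, True)
tracker.process_item(9, False)
print(tracker.group_test())
```

In this small example item 5 accounts for 10 of the 13 live items, so every group containing it decodes to 5 regardless of which light items collide with it, and every group without it falls below the threshold.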
![Page 27: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/27.jpg)
Algorithm correctness
• With probability at least 1−δ, calling the GroupTest(log(k/δ), k, 1/(k+1)) procedure finds all hot items
  • Time to process an item is O(log(k/δ) log m)
  • Time to get all hot items is O(k log(k/δ) log m)
• With or without STP, we are still guaranteed to include all hot items with high probability
• Without STP, we might output infrequent items
![Page 28: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/28.jpg)
Algorithm correctness – cont.
When will an infrequent item be output? (no STP)
• A set with 2 or more hot items will be detected
• A set with one hot item will never fail: even if there is a split in which the half without the hot item exceeds the threshold, this will be detected
• A set with no hot item makes the algorithm fail only if, for all log m splits, one half exceeds the threshold and the other does not
![Page 29: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/29.jpg)
Algorithm properties
• The set of counters created with T = log(k/δ) can be used to find hot items with parameter k’ for any k’ < k with probability of success 1 − δ by calling GroupTest(log(k/δ), k, 1/(k’+1))
Proof: in the proof of the probability for k hot items, the expected frequency of the other items in the group becomes $E[f] \le \frac{1}{2k} \cdot \frac{k'}{k'+1} \le \frac{1}{2(k'+1)}$
![Page 30: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/30.jpg)
Experiments
• The GroupTesting algorithm was compared to the Lossy Counting and Frequent algorithms
• The authors implemented them so that when an item is deleted, the corresponding counter is decremented if one exists
• Recall is the proportion of the hot items that are found by the method out of the total number of hot items
• Precision is the proportion of the items output by the algorithm that are actually hot
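The two measures can be computed from the set of reported items and the set of truly hot items. In this sketch the function name is illustrative, and returning 1.0 when a denominator is empty is a convention assumed here, not something the slides specify.

```python
def recall_precision(reported, truly_hot):
    """Return (recall, precision) for a set of reported items."""
    reported, truly_hot = set(reported), set(truly_hot)
    hits = len(reported & truly_hot)
    recall = hits / len(truly_hot) if truly_hot else 1.0
    precision = hits / len(reported) if reported else 1.0
    return recall, precision

print(recall_precision(reported={1, 2, 3}, truly_hot={2, 3, 4, 5}))
```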
![Page 31: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/31.jpg)
Synthetic data (Recall)
Zipf parameter for hot items: 0 – uniformly distributed, 3 – highly skewed
![Page 32: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/32.jpg)
Synthetic data (Precision)
Zipf parameter for hot items: 0 – uniformly distributed, 3 – highly skewed
![Page 33: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/33.jpg)
Real data (Recall)
Real data was obtained from one of AT&T's networks for part of a day.
![Page 34: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/34.jpg)
Real data (Precision)
Real data has no guarantee of having the small tail property
![Page 35: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/35.jpg)
Varying frequency at query time
The data structure was built for queries at the 0.5% level, but was then tested with queries ranging from 10% to 0.02%
![Page 36: What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649d615503460f94a427c4/html5/thumbnails/36.jpg)
Conclusions and extensions
A new method that can cope with dynamic datasets is proposed.
It would be interesting to use the algorithm to compare the differences in frequencies between different datasets.
Can we find combinatorial designs that achieve the same properties with a deterministic construction for maintaining hot items?