false positive or false negative: mining frequent itemsets from high speed transactional data...
![Page 1: False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Aoying](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56649ef15503460f94c02abc/html5/thumbnails/1.jpg)
False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams
Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Aoying Zhou
VLDB 2004
Introduction
Mining data streams:
– Data items arrive continuously
– One scan of the data
– Limited memory
– Bounded error
Introduction
This paper develops an algorithm for effectively mining frequent itemsets with a bound on memory consumption
It uses a false-negative approach
False Positive
Most existing algorithms for mining frequent itemsets are false-positive oriented
– They control memory consumption with an error parameter ε
– They allow items whose support is below the minimum support s but above s – ε to be reported as frequent
Approximate frequency counts over data streams (VLDB 02)
False Positive
Memory bound: $O\left(\frac{1}{\varepsilon}\log(\varepsilon N)\right)$
Dilemma of the false-positive approach
– The smaller ε is, the fewer false-positive items are included
– Memory consumption grows in proportion to 1/ε
– In Apriori, the frequent k-itemsets generate the candidate (k+1)-itemsets, so a false-positive k-itemset can add candidates at level k+1
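To see the dilemma numerically, the sketch below evaluates the growth term (1/ε)·log(εN) of the bound for a few values of ε. The function name, the natural logarithm, and the dropped constant factor are assumptions made for illustration, so the printed values show relative growth only, not real memory sizes.

```python
import math


def false_positive_bound(eps: float, n: int) -> float:
    """Growth term (1/eps) * log(eps * n) of the O(.) memory bound.

    Assumes the natural logarithm and ignores the hidden constant.
    """
    return (1.0 / eps) * math.log(eps * n)


if __name__ == "__main__":
    n = 1_000_000  # hypothetical stream length
    for eps in (0.01, 0.001, 0.0001):
        print(f"eps={eps}: {false_positive_bound(eps, n):.0f}")
```

Shrinking ε by a factor of ten multiplies the 1/ε factor by ten, which is the cost of admitting fewer false positives.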
False Positive & False Negative
[Figure: a support axis marked at s – ε, s, and s + ε. For each of the false-positive and false-negative approaches, it marks one region in which all itemsets are output and one region in which only some are output.]
False Negative
Error control and pruning
– ε: prunes data and controls the error bound; it is changeable
– ε decreases and approaches zero as the number of observations increases
– ε is inversely related to n
– s: minimum support
– n: number of observations
False Negative
Memory control
– δ: reliability; δ, instead of ε, controls memory consumption
– Memory consumption is related to ln(1/δ)
This approach does not allow 1-itemsets with support below s to be reported as frequent
Comparison: False Positive & False Negative
Recall and Precision
A: true frequent itemsets
B: obtained frequent itemsets
– Recall = $\frac{|A \cap B|}{|A|}$
– Precision = $\frac{|A \cap B|}{|B|}$
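The two measures can be computed directly from sets of itemsets. The sketch below is a minimal illustration; the function name, the use of `frozenset` to represent an itemset, and the convention of returning 1.0 for an empty denominator are assumptions, not part of the slides.

```python
from typing import FrozenSet, Set, Tuple

Itemset = FrozenSet[str]


def recall_precision(true_sets: Set[Itemset], mined_sets: Set[Itemset]) -> Tuple[float, float]:
    """Return (recall, precision) for A = true_sets and B = mined_sets."""
    hits = len(true_sets & mined_sets)  # |A ∩ B|
    # Assumed convention: an empty denominator yields 1.0.
    recall = hits / len(true_sets) if true_sets else 1.0
    precision = hits / len(mined_sets) if mined_sets else 1.0
    return recall, precision


if __name__ == "__main__":
    a = {frozenset({"A"}), frozenset({"B"}), frozenset({"A", "B"})}
    b = {frozenset({"A"}), frozenset({"C"})}
    print(recall_precision(a, b))
```

A false-positive miner keeps recall at 1 and loses precision; a false-negative miner does the opposite.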
Comparison: False Positive
ε = s/10, δ = 0.1
| s (%) | True size | Mined size | Recall | Precision |
|-------|-----------|------------|--------|-----------|
| 0.08  | 21,361    | 126,307    | 1.00   | 0.17      |
| 0.10  | 12,252    | 68,275     | 1.00   | 0.18      |
| 0.20  | 2,359     | 23,154     | 1.00   | 0.16      |
Comparison: False Negative
s + ε: minimum support
| s (%) | True size | Mined size | Recall | Precision |
|-------|-----------|------------|--------|-----------|
| 0.08  | 21,361    | 18,351     | 0.86   | 1.00      |
| 0.10  | 12,252    | 10,411     | 0.85   | 1.00      |
| 0.20  | 2,359     | 1,739      | 0.74   | 1.00      |
Chernoff Bound
The Chernoff bound gives a probabilistic guarantee on estimates of statistics about the underlying data
$\Pr\{T \ge e \cdot E[T]\} \le e^{-E[T]}$
Example: pick a lottery number from 0000, 0001, …, 9999
– 1,000,000 people each buy a $1 ticket
– E[#winners] = 100
– $\Pr\{T \ge 273\} \le e^{-100}$
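The arithmetic of the lottery example can be checked in a few lines. The sketch assumes a single winning number drawn uniformly from the 10,000 possibilities and independent ticket choices; the function name is hypothetical.

```python
import math
from typing import Tuple


def chernoff_e_bound(expected: float) -> Tuple[float, float]:
    """Return (threshold, bound) for Pr{T >= e * E[T]} <= exp(-E[T])."""
    return math.e * expected, math.exp(-expected)


if __name__ == "__main__":
    tickets = 1_000_000
    numbers = 10_000  # 0000 through 9999
    expected_winners = tickets / numbers  # each ticket wins with probability 1/10,000
    threshold, bound = chernoff_e_bound(expected_winners)
    print(expected_winners, threshold, bound)
```

The threshold e · 100 is just under 272, so the slide's value of 273 lies above it and the bound e^(-100) applies.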
Chernoff Bound
Bernoulli trials (coin flips):
– Pr[o_i = 1] = p, Pr[o_i = 0] = 1 – p
– r: number of heads in n coin flips
– np: expectation of r
For any γ > 0
Chernoff Bound
Rewrite r as r/n (the running support) and take the minimum support s as p
Replace sγ with ε
Set the right-hand side of the inequality to δ
– Pr{|RunningSupport – TrueSupport| ≥ ε_n} ≤ δ
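The substitution steps can be written out as a short derivation. The starting inequality is an assumption: the transcript does not preserve the slide's formula, so a common two-sided Chernoff form for Bernoulli trials is used here. It was chosen because it leads to an ε_n that matches the $2s\ln(2/\delta)$ fragments on the Memory Bound slide. The symbol $\bar r = r/n$ is introduced only for this sketch.

```latex
\begin{align*}
\Pr\{|r - np| \ge np\gamma\} &\le 2e^{-np\gamma^{2}/2}
  && \text{(assumed starting form)} \\
\Pr\{|\bar r - s| \ge s\gamma\} &\le 2e^{-ns\gamma^{2}/2}
  && (\bar r = r/n,\ p = s) \\
\Pr\{|\bar r - s| \ge \varepsilon\} &\le 2e^{-n\varepsilon^{2}/(2s)}
  && (\varepsilon = s\gamma) \\
\delta = 2e^{-n\varepsilon_n^{2}/(2s)}
  \;&\Longrightarrow\;
  \varepsilon_n = \sqrt{\frac{2s\ln(2/\delta)}{n}}
\end{align*}
```

Under this form, ε_n shrinks in proportion to $1/\sqrt{n}$, so it approaches zero as more observations arrive.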
Frequent or Infrequent
A pattern X is potentially infrequent in terms of n if count(X)/n < s – ε_n
A pattern X is potentially frequent in terms of n if it is not potentially infrequent in terms of n
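The test on this slide is a single comparison once ε_n is known. The sketch below assumes ε_n = sqrt(2·s·ln(2/δ)/n), a form reconstructed from fragments elsewhere in the deck rather than stated on this slide; both function names are hypothetical.

```python
import math


def epsilon_n(s: float, delta: float, n: int) -> float:
    """Assumed running error bound: sqrt(2 * s * ln(2/delta) / n)."""
    return math.sqrt(2.0 * s * math.log(2.0 / delta) / n)


def is_potentially_infrequent(count: int, n: int, s: float, delta: float) -> bool:
    """True if count/n < s - epsilon_n, i.e. the pattern may be pruned."""
    return count / n < s - epsilon_n(s, delta, n)


def is_potentially_frequent(count: int, n: int, s: float, delta: float) -> bool:
    """A pattern is potentially frequent if it is not potentially infrequent."""
    return not is_potentially_infrequent(count, n, s, delta)
```

With s = 0.1, δ = 0.1, and n = 10,000, the pruning threshold s – ε_n is roughly 0.092, so a pattern seen 900 times is potentially infrequent while one seen 950 times is not.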
FDPM-1(s, δ)
[Figure: items from the source stream (A, B, C, D, …) are inserted into a table of (item, count) entries. When memory is full, a new ε_n is computed and infrequent items are deleted.]
FDPM-1(s, δ)
The algorithm ensures:
– Every item whose true frequency exceeds sN is output with probability at least 1 – δ
– No item whose true frequency is less than sN is output
– The probability that the estimated support equals the true support is no less than 1 – δ
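The behaviour described on the FDPM-1 slides (count items, and when the table is full compute a new ε_n and delete infrequent entries) can be sketched as follows. This is an illustration in the spirit of FDPM-1, not the authors' pseudocode, which the transcript does not preserve. The table capacity (2 + 2·ln(2/δ))/s, the formula for ε_n, and all names are assumptions.

```python
import math
from typing import Dict, Hashable, Iterable


def fdpm1_sketch(stream: Iterable[Hashable], s: float, delta: float) -> Dict[Hashable, int]:
    """False-negative style frequent-item mining over a single pass."""
    log_term = math.log(2.0 / delta)
    capacity = (2.0 + 2.0 * log_term) / s  # assumed table size
    counts: Dict[Hashable, int] = {}
    n = 0  # number of observations so far
    for item in stream:
        n += 1
        counts[item] = counts.get(item, 0) + 1
        if len(counts) > capacity:  # "memory is full"
            eps_n = math.sqrt(2.0 * s * log_term / n)  # compute new eps_n
            keep_from = (s - eps_n) * n
            # Delete potentially infrequent items: count/n < s - eps_n.
            counts = {k: c for k, c in counts.items() if c >= keep_from}
    # Report only items whose counted support reaches s, so no item
    # below the minimum support is output.
    return {k: c for k, c in counts.items() if c >= s * n}


if __name__ == "__main__":
    # Hypothetical stream: "a" on every other position, unique noise items otherwise.
    demo = ["a" if i % 2 == 0 else f"x{i}" for i in range(1000)]
    print(fdpm1_sketch(demo, s=0.1, delta=0.1))
```

Because a deleted item restarts from zero if it reappears, counts can only be underestimated. That is why such a scheme can miss a frequent item but cannot report an infrequent one.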
Memory Bound
Sup(X) ≥ (s – ε_n) n
|P| ≤ 1/(s – ε_n), when s – ε_n > 0
$|P| = n = \frac{1}{s - \varepsilon_n} = \frac{1}{s - \sqrt{2s\ln(2/\delta)/n}}$
$n \approx \frac{2 + 2\ln(2/\delta)}{s}$
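The closed form follows from solving the fixed-point equation for n. The derivation below is a sketch that assumes ε_n = sqrt(2·s·ln(2/δ)/n); the shorthands $L = \ln(2/\delta)$ and $x = \sqrt{sn}$ are introduced only here.

```latex
\begin{align*}
n\left(s - \sqrt{\frac{2sL}{n}}\right) = 1
  \;&\Longleftrightarrow\; x^{2} - \sqrt{2L}\,x - 1 = 0,
  \qquad x = \sqrt{sn} \\
x &= \frac{\sqrt{2L} + \sqrt{2L + 4}}{2} \\
sn = x^{2} &= L + 1 + \sqrt{L^{2} + 2L} \;\le\; 2L + 2
  && \text{since } \sqrt{L^{2} + 2L} \le L + 1 \\
n &\le \frac{2 + 2\ln(2/\delta)}{s}
\end{align*}
```

The number of entries therefore grows with ln(1/δ) and 1/s only, and does not depend on the stream length.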
FDPM-2(s, δ)
Mining Frequent Itemsets from a Data Stream
[Figure: transactions from the source stream (such as {A,B}, …, {E,F,G}) are counted as itemsets ({A}, {B}, {AB}, {E}, {F}, {EF}, …) in tables labelled P and F. When memory is full, a new ε_n is computed and infrequent itemsets are deleted.]
Conclusion
False-negative approach
Limited memory
Error bound that holds with some probability