An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
DESCRIPTION
An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets. Fabio Vandin, DEI - Università di Padova, CS Dept. - Brown University. Joint work with: A. Kirsch, M. Mitzenmacher, A. Pietracaprina, G. Pucci, E. Upfal.
TRANSCRIPT
AlgoDEEP 160410 1
An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
Fabio Vandin
DEI - Università di Padova
CS Dept. - Brown University
Joint work with A. Kirsch, M. Mitzenmacher, A. Pietracaprina, G. Pucci, E. Upfal
AlgoDEEP 160410 2
Data Mining
Discovery of hidden patterns (e.g., correlations, association rules, clusters, anomalies, etc.) from large data sets
When is a pattern significant?
Open problem: development of rigorous (mathematical/statistical) approaches to assess significance and to discover significant patterns efficiently
AlgoDEEP 160410 3
Frequent Itemsets (1)
Dataset $D$ of transactions over a set of items $I$ ($D \subseteq 2^I$)
Support of an itemset $X \in 2^I$ in $D$ = number of transactions that contain $X$
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
support({Beer, Diaper}) = 3. Significant?
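The support count on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name `support` and the representation of transactions as Python sets are illustrative choices, not part of the talk.

```python
# Toy dataset from the slide: five transactions over six items.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset, dataset):
    """Number of transactions in `dataset` that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for transaction in dataset if items <= transaction)


print(support({"Beer", "Diaper"}, transactions))  # 3 (TIDs 2, 3 and 4)
```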
AlgoDEEP 160410 4
Frequent Itemsets (2)
Original formulation of the problem [Agrawal et al. 93]:
  input: dataset $D$ over $I$, support threshold $s$
  output: all itemsets of support $\ge s$ in $D$ (frequent itemsets)
Rationale: significance = high support ($\ge s$)
Drawbacks:
  Threshold $s$ is hard to fix:
    too low: possible output explosion and spurious discoveries (false positives)
    too high: loss of interesting itemsets (false negatives)
  No guarantee of significance of output itemsets
Alternative formulations proposed to mitigate the above drawbacks: closed itemsets, maximal itemsets, top-K itemsets
AlgoDEEP 160410 5
Significance
Focus on statistical significance: significance w.r.t. a random model
We address the following questions:
  What support level makes an itemset significantly frequent?
  How to narrow the search down to significant itemsets?
Goal: minimize false discoveries and improve quality of subsequent analysis
AlgoDEEP 160410 6
Related Work
Many works consider significance of itemsets in isolation
  E.g., [Silverstein, Brin, Motwani 98]: rigorous statistical framework (with flaws); $\chi^2$ test to assess degree of dependence of items in an itemset
Global characteristics of dataset taken into account in [Gionis, Mannila et al. 06]: deviation from random dataset w.r.t. number of frequent itemsets; no rigorous statistical grounding
AlgoDEEP 160410 7
Statistical Tests
Standard statistical test:
  null hypothesis $H_0$ ($\approx$ not significant)
  alternative hypothesis $H_1$
$H_0$ is tested against $H_1$ by observing a certain statistic $s$
p-value = Prob(obs $\ge s$ | $H_0$ is true)
Significance level $\alpha$ = probability of rejecting $H_0$ when it is true (false positive). Also called probability of Type I error
AlgoDEEP 160410 8
Random Model
$I$ = set of $n$ items
$D$ = input dataset of $t$ transactions over $I$
For $i \in I$:
  $n(i)$ = support of $\{i\}$ in $D$
  $f_i = n(i)/t$ = frequency of $i$ in $D$
$\hat{D}$ = random dataset of $t$ transactions over $I$: item $i$ is included in transaction $j$ with probability $f_i$, independently of all other events
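The random model can be sketched in Python as follows. The function names, the tiny `observed` dataset, and the fixed seed are illustrative assumptions; only the sampling rule (each item enters each transaction independently with probability equal to its observed frequency) comes from the slide.

```python
import random
from collections import Counter


def item_frequencies(dataset):
    """f_i = n(i) / t for every item i that occurs in the dataset."""
    t = len(dataset)
    counts = Counter(item for transaction in dataset for item in transaction)
    return {item: count / t for item, count in counts.items()}


def random_dataset(frequencies, t, rng):
    """Draw t transactions; item i enters each one independently with prob. f_i."""
    return [
        {item for item, f in frequencies.items() if rng.random() < f}
        for _ in range(t)
    ]


# Hypothetical input dataset, used only to exercise the two functions.
observed = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}]
freqs = item_frequencies(observed)
sample = random_dataset(freqs, t=len(observed), rng=random.Random(0))
```

The random dataset has the same number of transactions and, in expectation, the same item frequencies as the input, but no correlation between items.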
AlgoDEEP 160410 9
Naïve Approach (1)
For each itemset $X = \{i_1, i_2, \dots, i_k\} \subseteq I$:
  $f_X = f_{i_1} f_{i_2} \cdots f_{i_k}$ = expected frequency of $X$ in the random dataset $\hat{D}$
  null hypothesis $H_0(X)$: the support of $X$ in $D$ conforms with $\hat{D}$ (i.e., it is as drawn from Binomial($t$, $f_X$))
  alternative hypothesis $H_1(X)$: the support of $X$ in $D$ does not conform with $\hat{D}$
AlgoDEEP 160410 10
Naiumlve Approach (2)
Statistic of interest: $s_X$ = support of $X$ in $D$
Reject $H_0(X)$ if
  p-value = Prob($B(t, f_X) \ge s_X$) $\le \alpha$
Significant itemsets = $\{X \subseteq I : H_0(X) \text{ is rejected}\}$
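The single-itemset test can be sketched with the Python standard library. The helper name `binomial_tail` is an illustrative choice, and the numbers in the example are taken from the five-transaction Bread/Milk toy dataset shown earlier in the talk ($f_{Beer} = 3/5$, $f_{Diaper} = 4/5$, observed support 3); the exact summation is only practical for small $t$.

```python
import math


def binomial_tail(t, p, s):
    """Prob(B(t, p) >= s), computed as 1 - Prob(B(t, p) <= s - 1)."""
    if s <= 0:
        return 1.0
    cdf = sum(
        math.comb(t, j) * p**j * (1 - p) ** (t - j)
        for j in range(min(s, t + 1))
    )
    return max(0.0, 1.0 - cdf)  # clamp tiny negative values caused by rounding


t = 5                         # number of transactions in the toy dataset
f_x = (3 / 5) * (4 / 5)       # f_X = f_Beer * f_Diaper under the random model
s_x = 3                       # observed support of {Beer, Diaper}
alpha = 0.05

p_value = binomial_tail(t, f_x, s_x)
rejected = p_value <= alpha
print(round(p_value, 4), rejected)  # 0.4625 False
```

With so few transactions the pair is far from significant: a support of 3 is entirely plausible under independence.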
AlgoDEEP 160410 11
Naïve Approach (3)
What's wrong?
$D$ with $t = 1{,}000{,}000$ transactions over $n = 1000$ items, each item with frequency $1/1000$
Pair $\{i, j\}$ that occurs 7 times: is it statistically significant?
In $\hat{D}$ (random dataset):
  $E[\text{support}(\{i, j\})] = 1$
  p-value = Prob($\{i, j\}$ has support $\ge 7$) $\simeq 0.0001$
$\{i, j\}$ must be significant
AlgoDEEP 160410 12
Naïve Approach (4)
Expected number of pairs with support $\ge 7$ in random dataset is $\simeq 50$
  existence of $\{i, j\}$ with support $\ge 7$ is not such a rare event
  returning $\{i, j\}$ as significant itemset could be a false discovery
However, 300 (disjoint) pairs with support $\ge 7$ in $D$ is an extremely rare event (prob $\le 2^{-300}$)
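The numbers in this example (1,000,000 transactions, 1000 items of frequency 1/1000, a pair with support 7) can be reproduced approximately. The sketch below assumes a Poisson approximation to the binomial support of a single pair, which is reasonable here because $t$ is large and the pair frequency is tiny; the function name `poisson_tail` is illustrative.

```python
import math


def poisson_tail(lam, s):
    """Prob(Poisson(lam) >= s) = 1 - sum_{j < s} e^{-lam} lam^j / j!."""
    cdf = sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(s))
    return max(0.0, 1.0 - cdf)


t, n, f = 1_000_000, 1000, 1 / 1000

expected_support = t * f * f                 # E[support({i, j})] = t * f_i * f_j
p_value = poisson_tail(expected_support, 7)  # Prob(support >= 7) for one fixed pair
num_pairs = math.comb(n, 2)                  # number of candidate pairs
expected_pairs = num_pairs * p_value         # expected number of pairs with support >= 7

print(expected_support, p_value, num_pairs, expected_pairs)
```

Under this approximation the p-value comes out slightly below $10^{-4}$ and the expected number of such pairs is in the low forties; rounding the p-value up to $10^{-4}$ gives the figure of about 50 quoted on the slide. Either way, tens of pairs are expected to reach support 7 purely by chance.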
AlgoDEEP 160410 13
Multi-Hypothesis test (1)
Looking for significant itemsets of size $k$ ($k$-itemsets) involves testing simultaneously for $m = \binom{n}{k}$ null hypotheses $\{H_0(X)\}_{|X| = k}$
How to combine $m$ tests while minimizing false positives?
AlgoDEEP 160410 14
Multi-Hypothesis test (2)
$V$ = number of false positives
$R$ = total number of rejected null hypotheses = number of itemsets flagged as significant
False Discovery Rate (FDR) = $E[V/R]$ (FDR = 0 when $R = 0$)
GOAL: maximize $R$ while ensuring FDR $\le \beta$
[Benjamini-Yekutieli '01]: reject hypothesis with $i$-th smallest p-value if $\le i\beta/m$
  $m = \binom{n}{k}$; does not yield a support threshold for mining
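A step-up procedure of this kind can be sketched as follows. The rule $p_{(i)} \le i\beta/m$ is the one stated on the slide; the function name, the `arbitrary_dependence` flag (which divides the threshold by the harmonic sum $\sum_{j=1}^{m} 1/j$, the correction used for arbitrarily dependent tests), and the example p-values are assumptions added for illustration.

```python
def fdr_step_up(p_values, beta, arbitrary_dependence=False):
    """Return the (sorted) indices of the hypotheses rejected by a step-up FDR rule."""
    m = len(p_values)
    if m == 0:
        return []
    # Optional harmonic correction c(m) = sum_{j=1..m} 1/j; 1.0 reproduces i*beta/m.
    c = sum(1.0 / j for j in range(1, m + 1)) if arbitrary_dependence else 1.0
    order = sorted(range(m), key=lambda idx: p_values[idx])
    cutoff = 0  # largest rank i such that p_(i) <= i * beta / (m * c)
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * beta / (m * c):
            cutoff = rank
    # Step-up: reject every hypothesis ranked at or below the cutoff.
    return sorted(order[:cutoff])


# Hypothetical p-values for ten tests.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(fdr_step_up(p, beta=0.05))                             # [0, 1]
print(fdr_step_up(p, beta=0.05, arbitrary_dependence=True))  # [0]
```

Note that the procedure needs one p-value per candidate itemset, so for itemset mining it requires enumerating all $\binom{n}{k}$ hypotheses rather than producing a support threshold.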
AlgoDEEP 160410 15
Our Approach
$Q(k, s)$ = obs. number of $k$-itemsets of support $\ge s$
null hypothesis $H_0(s)$: the number of $k$-itemsets of support $\ge s$ in $D$ conforms with the random dataset $\hat{D}$
alternative hypothesis $H_1(s)$: the number of $k$-itemsets of support $\ge s$ in $D$ does not conform with $\hat{D}$
Problem: how to compute the p-value of $Q(k, s)$?
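The observed statistic $Q(k, s)$ is straightforward to compute on a small dataset. The sketch below uses brute-force enumeration of the $k$-subsets of each transaction, which is an illustrative choice (a real miner would use a frequent-itemset algorithm), and reuses the five-transaction toy dataset from the start of the talk. It assumes $s \ge 1$, since itemsets of support 0 are never enumerated.

```python
from collections import Counter
from itertools import combinations


def count_frequent_k_itemsets(dataset, k, s):
    """Q(k, s): number of k-itemsets contained in at least s transactions (s >= 1)."""
    supports = Counter()
    for transaction in dataset:
        # Sorting gives each itemset one canonical tuple representation.
        for itemset in combinations(sorted(transaction), k):
            supports[itemset] += 1
    return sum(1 for count in supports.values() if count >= s)


transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

print(count_frequent_k_itemsets(transactions, k=2, s=3))  # 4
print(count_frequent_k_itemsets(transactions, k=2, s=2))  # 8
```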
AlgoDEEP 160410 16
Main Results (PODS 2009)
Result 1 (Poisson approx.):
  $Q(k, s)$ = number of $k$-itemsets of support $\ge s$ in the random dataset $\hat{D}$
  Theorem: there exists $s_{\min}$ such that, for $s \ge s_{\min}$, $Q(k, s)$ is well approximated by a Poisson distribution
Result 2: methodology to establish a support threshold for discovering significant itemsets with small FDR
AlgoDEEP 160410 17
Approximation Result (1)
Based on Chen-Stein method (1975)
$Q(k, s)$ = number of $k$-itemsets of support $\ge s$ in random dataset $\hat{D}$
$U \sim \text{Poisson}(\lambda)$, $\lambda = E[Q(k, s)]$
Theorem: for $k = O(1)$, $t = \text{poly}(n)$, for a large range of item distributions and supports $s$:
  distance($Q(k, s)$, $U$) $= O(1/n)$
AlgoDEEP 160410 18
Approximation Result (2)
Corollary: there exists $s_{\min}$ s.t. $Q(k, s)$ is well approximated by a Poisson distribution for $s \ge s_{\min}$
In practice: Monte-Carlo method to determine $s_{\min}$ s.t., with probability at least $1 - \delta$,
  distance($Q(k, s)$, $U$) $\le \varepsilon$ for all $s \ge s_{\min}$
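The idea of checking the Poisson approximation by simulation can be illustrated as follows. This is only a sketch and not the Monte-Carlo procedure of the paper: it draws random datasets from the independence model, estimates $\lambda$ as the sample mean of $Q(k, s)$, and measures an empirical total variation distance to Poisson($\lambda$). Reading "distance" as total variation, and all parameter values (20 items, frequency 0.05, 200 transactions, 300 trials), are assumptions.

```python
import math
import random
from collections import Counter
from itertools import combinations


def q_ks(dataset, k, s):
    """Number of k-itemsets with support >= s in the dataset (s >= 1)."""
    supports = Counter()
    for transaction in dataset:
        for itemset in combinations(sorted(transaction), k):
            supports[itemset] += 1
    return sum(1 for count in supports.values() if count >= s)


def sample_q(freqs, t, k, s, rng):
    """Q(k, s) for one random dataset drawn from the independence model."""
    dataset = [
        [i for i, f in enumerate(freqs) if rng.random() < f] for _ in range(t)
    ]
    return q_ks(dataset, k, s)


def poisson_distance(samples):
    """Sample mean and empirical total variation distance to Poisson(mean)."""
    lam = sum(samples) / len(samples)
    empirical = Counter(samples)
    dist, poisson_mass = 0.0, 0.0
    for q in range(max(samples) + 1):
        pmf = math.exp(-lam) * lam**q / math.factorial(q)
        poisson_mass += pmf
        dist += abs(empirical[q] / len(samples) - pmf)
    dist += max(0.0, 1.0 - poisson_mass)  # Poisson mass beyond the largest sample
    return lam, dist / 2


rng = random.Random(0)
freqs = [0.05] * 20  # hypothetical: 20 items, each with frequency 0.05
samples = [sample_q(freqs, t=200, k=2, s=4, rng=rng) for _ in range(300)]
lam, dist = poisson_distance(samples)
print(lam, dist)
```

Repeating the measurement for increasing $s$ and keeping the smallest $s$ from which the distance stays below a chosen $\varepsilon$ gives a rough, illustrative counterpart of $s_{\min}$.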
AlgoDEEP 160410 19
Support threshold for mining significant itemsets (1)
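Determine $s_{\min}$ and let $h$ be such that $s_{\min} + 2^h$ is the maximum support for an itemset
Fix $\alpha_1, \alpha_2, \dots, \alpha_h$ such that $\sum_i \alpha_i \le \alpha$
Fix $\beta_1, \beta_2, \dots, \beta_h$ such that $\sum_i \beta_i \le \beta$
For $i = 1$ to $h$:
  $s_i = s_{\min} + 2^i$
  $Q(k, s_i)$ = obs. number of $k$-itemsets of support $\ge s_i$
  $H_0(k, s_i)$: $Q(k, s_i)$ conforms with Poisson($\lambda_i = E[Q(k, s_i)]$)
  reject $H_0(k, s_i)$ if p-value of $Q(k, s_i) < \alpha_i$ and $Q(k, s_i) \ge \lambda_i / \beta_i$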
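The loop above can be sketched in Python under a few stated assumptions: the candidate thresholds are read as $s_i = s_{\min} + 2^i$, the rejection condition as $Q(k, s_i) \ge \lambda_i / \beta_i$, the budgets are split evenly ($\alpha_i = \alpha/h$, $\beta_i = \beta/h$), and the observed counts and expectations $\lambda_i$ are supplied by the caller. All names and the example numbers are hypothetical.

```python
import math


def poisson_tail(lam, q):
    """Prob(Poisson(lam) >= q), via 1 - CDF; clamped at 0 against rounding."""
    term = math.exp(-lam)  # Prob(Poisson(lam) = 0)
    cdf = 0.0
    for j in range(q):
        cdf += term
        term *= lam / (j + 1)  # move from Prob(= j) to Prob(= j + 1)
    return max(0.0, 1.0 - cdf)


def select_threshold(s_min, h, observed_q, lam, alpha, beta):
    """Smallest s_i = s_min + 2**i whose null hypothesis is rejected, else None."""
    alpha_i = alpha / h  # assumption: split both budgets evenly over the h tests
    beta_i = beta / h
    for i in range(1, h + 1):
        s_i = s_min + 2**i
        q, lam_i = observed_q[s_i], lam[s_i]
        if poisson_tail(lam_i, q) < alpha_i and q >= lam_i / beta_i:
            return s_i
    return None


# Hypothetical observed counts Q(k, s_i) and expectations lambda_i.
observed_q = {12: 40, 14: 30, 18: 25}
lam = {12: 30.0, 14: 0.4, 18: 0.01}
print(select_threshold(10, 3, observed_q, lam, alpha=0.05, beta=0.05))  # 14
```

In the example, $s_1 = 12$ is not rejected because 40 itemsets against an expectation of 30 is neither unlikely enough nor large enough relative to $\lambda_1/\beta_1$, while $s_2 = 14$ passes both conditions.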
AlgoDEEP 160410 20
Support threshold for mining significant itemsets (2)
Theorem: let $s^*$ be the minimum $s$ such that $H_0(k, s)$ was rejected. We have:
  1. With significance level $\alpha$, the number of $k$-itemsets of support $\ge s^*$ is significant
  2. The $k$-itemsets with support $\ge s^*$ are significant with FDR $\le \beta$
AlgoDEEP 160410 21
Experiments: benchmark datasets
FIMI repository: http://fimi.cs.helsinki.fi/data/
[Table of benchmark datasets; recoverable column headers: items, frequencies range, avg. trans. length]
AlgoDEEP 160410 22
Experiments: results (1)
Test: $\alpha = 0.05$, $\beta = 0.05$
$Q_{k,s}$ = number of $k$-itemsets of support $\ge s$ in $D$
$\lambda(s)$ = expected number of $k$-itemsets with support $\ge s$
Itemset of size 154 with support $\ge 7$
AlgoDEEP 160410 23
Experiments: results (2)
Comparison with standard application of Benjamini-Yekutieli, FDR $\le 0.05$
$R$ = output (standard approach)
$Q_{k,s}$ = output (our approach)
$r = |Q_{k,s}| / |R|$
AlgoDEEP 160410 24
Conclusions
Poisson approximation for the number of $k$-itemsets of support $s \ge s_{\min}$ in a random dataset
A statistically sound method to determine a support threshold for mining significant frequent itemsets with controlled FDR
AlgoDEEP 160410 25
Future Work
Deal with false negatives
Software package
Application of the method to other frequent pattern problems
AlgoDEEP 160410 26
Questions
Thank you
AlgoDEEP 160410 2
Data Mining
Discovery of hidden patterns (eg correlations association rules clusters anomalies etc) from large data sets
When is a pattern significant
Open problem development of rigorous (mathematicalstatistical) approaches to assess significance and to discover significant patterns efficiently
AlgoDEEP 160410 3
Frequent Itemsets (1)
Dataset DD of transactions over set of items I (D sube 2I) Support of an itemset X isin 2I in D =
number of transactions that contain X
TID Items
1 Bread Milk
2 Bread Diaper Beer Eggs
3 Milk Diaper Beer Coke
4 Bread Milk Diaper Beer
5 Bread Milk Diaper Coke
support(BeerDiaper) = 3Significant
AlgoDEEP 160410 4
Original formulation of the problem [Agrawal et al 93] input dataset D over I support threshold s output all itemsets of support ge s in D (frequent itemsets )
Rationale significance = high support (ge s)
Drawbacks Threshold s hard to fix
too low possible output explosion and spurious discoveries (false positives) too high loss of interesting itemsets (false negatives)
No guarantee of significance of output itemsets Alternative formulations proposed to mitigate the above
drawbacks Closed itemsets maximal itemsets top-K itemsets
Frequent Itemsets (2)
AlgoDEEP 160410 5
Significance
Focus on statistical significance significance wrt random model
We address the following questions What support level makes an itemset significantly
frequent How to narrow the search down to significant
itemsets Goal minimize false discoveries and improve
quality of subsequent analysis
AlgoDEEP 160410 6
Many works consider significance of itemsets in isolation Eg [Silverstein Brin Motwani 98] rigorous statistical framework (with flaws) 2 test to assess degree of dependence of items in an
itemset
Global characteristics of dataset taken into account in [Gionis Mannila et al 06] deviation from random dataset wrt number of
frequent itemsets no rigouros statistical grounding
Related Work
AlgoDEEP 160410 7
Statistical Tests
Standard statistical test null hypothesis H0 (asympnot significant)
alternative hypothesis H1 H0 is tested against H1 by observing a certain statistic
s p-value = Prob( obs ge s | H0 is true ) Significance level α = probability of rejecting H0 when
it is true (false positive) Also called probability of Type I error
AlgoDEEP 160410 8
Random Model
I = set of n items
D = input dataset of t transactions over I i ∊ I
n(i) = support of i in D fi= n(i)t = frequency of i in D
D = random dataset of t transactions over I Item i is included in transaction j with probability
fi independently of all other events
AlgoDEEP 160410 9
For each itemset X = i1 i2 ik sube I
fX = fi1fi2 hellip fik expected frequency of X in D
null hypothesis H0(X) the support of X in D conforms with D (ie it is as drawn from Binomial(t fX ) )
alternative hypothesis H1(X) the support of X in D does not conforms with D
Naiumlve Approach (1)
AlgoDEEP 160410 10
Naiumlve Approach (2)
Statistic of interest sx = support of X in D
Reject H0(X) if
p-value = Prob(B(t fX) ge sX) le α
Significant itemsets = X sube I H0(X) is rejected
AlgoDEEP 160410 11
Whatrsquos wrong D with t=1000000 transactions over n=1000 items
each item with frequency 11000 Pair ij that occurs 7 times is it statistically
significant In D (random dataset)
E[support(ij)] = 1 p-value = Prob(ij has support ge 7 ) 00001≃
ij must be significant
Naiumlve Approach (3)
AlgoDEEP 160410 12
Expected number of pairs with support ge 7 in random dataset is ≃ 50
existence of ij with support ge 7 is not such a rare event
returning ij as significant itemset could be a false discovery
However 300 (disjoint) pairs with support ge 7 in D is an extremely rare event (prob le 2-300)
Naiumlve Approach (4)
AlgoDEEP 160410 13
Multi-Hypothesis test (1)
Looking for significant itemsets of size k (k-itemsets) involves testing simultaneously for
m= null hypotheses H0(X)|X|=k
How to combine m tests while minimizing false positives
AlgoDEEP 160410 14
Multi-Hypothesis test (2)
V = number of false positives R = total number rejected null hypotheses = number itemsets flagged as significant False Discovery Rate (FDR) = E[VR] (FDR=0 when R=0)
GOAL maximize R while ensuring FDR le β
[Benjamini-Yekutieli rsquo01] Reject hypothesis with indashth smallest p-value if le iβm
m = does not yield a support threshold for mining
AlgoDEEP 160410 15
Our Approach
Q(k s) = obs number of k-itemsets of support ge s
null hypothesis H0(s) the number of k-itemsets of support s in D conforms with D
alternative hypothesis H1(s) the number of k-itemsets of support s in D does not conforms with D
Problem how to compute the p-value of Q(k s)
AlgoDEEP 160410 16
Main Results (PODS 2009)
Result 1 (Poisson approx) Q(ks)= number of k-itemsets of support ge s in D Theorem Exists smin for sgesmin Q(ks) is well
approximated by a Poisson distribution
Result 2 Methodology to establish a support threshold for
discovering significant itemsets with small FDR
AlgoDEEP 160410 17
Approximation Result (1)
Based on Chen-Stein method (1975)
Q(ks) = number of k-itemsets of support ge s in random dataset D
U~Poisson(λ) λ = E[Q(ks)]
Theorem for k=O(1) t=poly(n) for a large range of item distributions and supports s
distance (Q(ks) U) =O(1n)
AlgoDEEP 160410 18
Approximation Result (2)
Corollary there exists smin st Q(ks) is well approximated by a Poisson distribution for sgesmin
In practice Monte-Carlo method to determine smin st with probability at least 1-δ
distance (Q(ks) U) le εfor all sgesmin
AlgoDEEP 160410 19
Support threshold for mining significant itemsets (1)
Determine smin and let h be such that smin +2h is the maximum support for an itemset
Fix α1 α2 αh such that sumαile α Fix β1 β2 βh such that sum βile β For i=1 to h
si= smin +2i
Q(k si) = obs number of k-itemsets of support ge si
H0(ksi) Q(ksi) conforms with Poisson(λi= E[Q(k si)]) reject H0(ksi) if
p-value of Q(ksi) lt αi and Q(ksi) ge λi βi
AlgoDEEP 160410 20
Support threshold for mining significant itemsets (2)
Theorem Let s be the minimum s such that H0(ks) was rejected We have
1With significance level α the number of k-itemsets of support ge s is significant
2The k-itemsets with support ge s are significant with FDR le β
AlgoDEEP 160410 21
FIMI repositoryhttpfimicshelsinkifidata
Experiments benchmark datasets
avg trans lengthitems frequencies range
AlgoDEEP 160410 22
Test α = 005 β = 005 Qks = number of k-itemsets of support ge s in D λ(s) = expected number of k-itemsets with support ge s
Itemset of size 154 with support ge 7
Experiments results (1)
AlgoDEEP 160410 23
Experiments results (2)
Comparison with standard application of Benjamini Yekutieli FDR le 005 R = output (standard approach) Qks = output (our approach) r = |Qks||R|
AlgoDEEP 160410 24
Poisson approximation for number of k-itemsets of support s ge smin in a random dataset
An statistically sound method to determine a support threshold for mining significant frequent itemsets with controlled FDR
Conclusions
AlgoDEEP 160410 25
Deal with false negatives
Software package
Application of the method to other frequent pattern problems
Future Work
AlgoDEEP 160410 26
Questions
Thank you
AlgoDEEP 160410 3
Frequent Itemsets (1)
Dataset DD of transactions over set of items I (D sube 2I) Support of an itemset X isin 2I in D =
number of transactions that contain X
TID Items
1 Bread Milk
2 Bread Diaper Beer Eggs
3 Milk Diaper Beer Coke
4 Bread Milk Diaper Beer
5 Bread Milk Diaper Coke
support(BeerDiaper) = 3Significant
AlgoDEEP 160410 4
Original formulation of the problem [Agrawal et al 93] input dataset D over I support threshold s output all itemsets of support ge s in D (frequent itemsets )
Rationale significance = high support (ge s)
Drawbacks Threshold s hard to fix
too low possible output explosion and spurious discoveries (false positives) too high loss of interesting itemsets (false negatives)
No guarantee of significance of output itemsets Alternative formulations proposed to mitigate the above
drawbacks Closed itemsets maximal itemsets top-K itemsets
Frequent Itemsets (2)
AlgoDEEP 160410 5
Significance
Focus on statistical significance significance wrt random model
We address the following questions What support level makes an itemset significantly
frequent How to narrow the search down to significant
itemsets Goal minimize false discoveries and improve
quality of subsequent analysis
AlgoDEEP 160410 6
Many works consider significance of itemsets in isolation Eg [Silverstein Brin Motwani 98] rigorous statistical framework (with flaws) 2 test to assess degree of dependence of items in an
itemset
Global characteristics of dataset taken into account in [Gionis Mannila et al 06] deviation from random dataset wrt number of
frequent itemsets no rigouros statistical grounding
Related Work
AlgoDEEP 160410 7
Statistical Tests
Standard statistical test null hypothesis H0 (asympnot significant)
alternative hypothesis H1 H0 is tested against H1 by observing a certain statistic
s p-value = Prob( obs ge s | H0 is true ) Significance level α = probability of rejecting H0 when
it is true (false positive) Also called probability of Type I error
AlgoDEEP 160410 8
Random Model
I = set of n items
D = input dataset of t transactions over I i ∊ I
n(i) = support of i in D fi= n(i)t = frequency of i in D
D = random dataset of t transactions over I Item i is included in transaction j with probability
fi independently of all other events
AlgoDEEP 160410 9
For each itemset X = i1 i2 ik sube I
fX = fi1fi2 hellip fik expected frequency of X in D
null hypothesis H0(X) the support of X in D conforms with D (ie it is as drawn from Binomial(t fX ) )
alternative hypothesis H1(X) the support of X in D does not conforms with D
Naiumlve Approach (1)
AlgoDEEP 160410 10
Naiumlve Approach (2)
Statistic of interest sx = support of X in D
Reject H0(X) if
p-value = Prob(B(t fX) ge sX) le α
Significant itemsets = X sube I H0(X) is rejected
AlgoDEEP 160410 11
Whatrsquos wrong D with t=1000000 transactions over n=1000 items
each item with frequency 11000 Pair ij that occurs 7 times is it statistically
significant In D (random dataset)
E[support(ij)] = 1 p-value = Prob(ij has support ge 7 ) 00001≃
ij must be significant
Naiumlve Approach (3)
AlgoDEEP 160410 12
Expected number of pairs with support ge 7 in random dataset is ≃ 50
existence of ij with support ge 7 is not such a rare event
returning ij as significant itemset could be a false discovery
However 300 (disjoint) pairs with support ge 7 in D is an extremely rare event (prob le 2-300)
Naiumlve Approach (4)
AlgoDEEP 160410 13
Multi-Hypothesis test (1)
Looking for significant itemsets of size k (k-itemsets) involves testing simultaneously for
m= null hypotheses H0(X)|X|=k
How to combine m tests while minimizing false positives
AlgoDEEP 160410 14
Multi-Hypothesis test (2)
V = number of false positives R = total number rejected null hypotheses = number itemsets flagged as significant False Discovery Rate (FDR) = E[VR] (FDR=0 when R=0)
GOAL maximize R while ensuring FDR le β
[Benjamini-Yekutieli rsquo01] Reject hypothesis with indashth smallest p-value if le iβm
m = does not yield a support threshold for mining
AlgoDEEP 160410 15
Our Approach
Q(k s) = obs number of k-itemsets of support ge s
null hypothesis H0(s) the number of k-itemsets of support s in D conforms with D
alternative hypothesis H1(s) the number of k-itemsets of support s in D does not conforms with D
Problem how to compute the p-value of Q(k s)
AlgoDEEP 160410 16
Main Results (PODS 2009)
Result 1 (Poisson approx) Q(ks)= number of k-itemsets of support ge s in D Theorem Exists smin for sgesmin Q(ks) is well
approximated by a Poisson distribution
Result 2 Methodology to establish a support threshold for
discovering significant itemsets with small FDR
AlgoDEEP 160410 17
Approximation Result (1)
Based on Chen-Stein method (1975)
Q(ks) = number of k-itemsets of support ge s in random dataset D
U~Poisson(λ) λ = E[Q(ks)]
Theorem for k=O(1) t=poly(n) for a large range of item distributions and supports s
distance (Q(ks) U) =O(1n)
AlgoDEEP 160410 18
Approximation Result (2)
Corollary there exists smin st Q(ks) is well approximated by a Poisson distribution for sgesmin
In practice Monte-Carlo method to determine smin st with probability at least 1-δ
distance (Q(ks) U) le εfor all sgesmin
AlgoDEEP 160410 19
Support threshold for mining significant itemsets (1)
Determine smin and let h be such that smin +2h is the maximum support for an itemset
Fix α1 α2 αh such that sumαile α Fix β1 β2 βh such that sum βile β For i=1 to h
si= smin +2i
Q(k si) = obs number of k-itemsets of support ge si
H0(ksi) Q(ksi) conforms with Poisson(λi= E[Q(k si)]) reject H0(ksi) if
p-value of Q(ksi) lt αi and Q(ksi) ge λi βi
AlgoDEEP 160410 20
Support threshold for mining significant itemsets (2)
Theorem Let s be the minimum s such that H0(ks) was rejected We have
1With significance level α the number of k-itemsets of support ge s is significant
2The k-itemsets with support ge s are significant with FDR le β
AlgoDEEP 160410 21
FIMI repositoryhttpfimicshelsinkifidata
Experiments benchmark datasets
avg trans lengthitems frequencies range
AlgoDEEP 160410 22
Test α = 005 β = 005 Qks = number of k-itemsets of support ge s in D λ(s) = expected number of k-itemsets with support ge s
Itemset of size 154 with support ge 7
Experiments results (1)
AlgoDEEP 160410 23
Experiments results (2)
Comparison with standard application of Benjamini Yekutieli FDR le 005 R = output (standard approach) Qks = output (our approach) r = |Qks||R|
AlgoDEEP 160410 24
Poisson approximation for number of k-itemsets of support s ge smin in a random dataset
An statistically sound method to determine a support threshold for mining significant frequent itemsets with controlled FDR
Conclusions
AlgoDEEP 160410 25
Deal with false negatives
Software package
Application of the method to other frequent pattern problems
Future Work
AlgoDEEP 160410 26
Questions
Thank you
AlgoDEEP 160410 4
Original formulation of the problem [Agrawal et al 93] input dataset D over I support threshold s output all itemsets of support ge s in D (frequent itemsets )
Rationale significance = high support (ge s)
Drawbacks Threshold s hard to fix
too low possible output explosion and spurious discoveries (false positives) too high loss of interesting itemsets (false negatives)
No guarantee of significance of output itemsets Alternative formulations proposed to mitigate the above
drawbacks Closed itemsets maximal itemsets top-K itemsets
Frequent Itemsets (2)
AlgoDEEP 160410 5
Significance
Focus on statistical significance significance wrt random model
We address the following questions What support level makes an itemset significantly
frequent How to narrow the search down to significant
itemsets Goal minimize false discoveries and improve
quality of subsequent analysis
AlgoDEEP 160410 6
Many works consider significance of itemsets in isolation Eg [Silverstein Brin Motwani 98] rigorous statistical framework (with flaws) 2 test to assess degree of dependence of items in an
itemset
Global characteristics of dataset taken into account in [Gionis Mannila et al 06] deviation from random dataset wrt number of
frequent itemsets no rigouros statistical grounding
Related Work
AlgoDEEP 160410 7
Statistical Tests
Standard statistical test null hypothesis H0 (asympnot significant)
alternative hypothesis H1 H0 is tested against H1 by observing a certain statistic
s p-value = Prob( obs ge s | H0 is true ) Significance level α = probability of rejecting H0 when
it is true (false positive) Also called probability of Type I error
AlgoDEEP 160410 8
Random Model
I = set of n items
D = input dataset of t transactions over I i ∊ I
n(i) = support of i in D fi= n(i)t = frequency of i in D
D = random dataset of t transactions over I Item i is included in transaction j with probability
fi independently of all other events
AlgoDEEP 160410 9
For each itemset X = i1 i2 ik sube I
fX = fi1fi2 hellip fik expected frequency of X in D
null hypothesis H0(X) the support of X in D conforms with D (ie it is as drawn from Binomial(t fX ) )
alternative hypothesis H1(X) the support of X in D does not conforms with D
Naiumlve Approach (1)
AlgoDEEP 160410 10
Naiumlve Approach (2)
Statistic of interest sx = support of X in D
Reject H0(X) if
p-value = Prob(B(t fX) ge sX) le α
Significant itemsets = X sube I H0(X) is rejected
AlgoDEEP 160410 11
Whatrsquos wrong D with t=1000000 transactions over n=1000 items
each item with frequency 11000 Pair ij that occurs 7 times is it statistically
significant In D (random dataset)
E[support(ij)] = 1 p-value = Prob(ij has support ge 7 ) 00001≃
ij must be significant
Naiumlve Approach (3)
AlgoDEEP 160410 12
Expected number of pairs with support ge 7 in random dataset is ≃ 50
existence of ij with support ge 7 is not such a rare event
returning ij as significant itemset could be a false discovery
However 300 (disjoint) pairs with support ge 7 in D is an extremely rare event (prob le 2-300)
Naiumlve Approach (4)
AlgoDEEP 160410 13
Multi-Hypothesis test (1)
Looking for significant itemsets of size k (k-itemsets) involves testing simultaneously for
m= null hypotheses H0(X)|X|=k
How to combine m tests while minimizing false positives
AlgoDEEP 160410 14
Multi-Hypothesis test (2)
V = number of false positives R = total number rejected null hypotheses = number itemsets flagged as significant False Discovery Rate (FDR) = E[VR] (FDR=0 when R=0)
GOAL maximize R while ensuring FDR le β
[Benjamini-Yekutieli rsquo01] Reject hypothesis with indashth smallest p-value if le iβm
m = does not yield a support threshold for mining
AlgoDEEP 160410 15
Our Approach
Q(k s) = obs number of k-itemsets of support ge s
null hypothesis H0(s) the number of k-itemsets of support s in D conforms with D
alternative hypothesis H1(s) the number of k-itemsets of support s in D does not conforms with D
Problem how to compute the p-value of Q(k s)
AlgoDEEP 160410 16
Main Results (PODS 2009)
Result 1 (Poisson approx) Q(ks)= number of k-itemsets of support ge s in D Theorem Exists smin for sgesmin Q(ks) is well
approximated by a Poisson distribution
Result 2 Methodology to establish a support threshold for
discovering significant itemsets with small FDR
AlgoDEEP 160410 17
Approximation Result (1)
Based on Chen-Stein method (1975)
Q(ks) = number of k-itemsets of support ge s in random dataset D
U~Poisson(λ) λ = E[Q(ks)]
Theorem for k=O(1) t=poly(n) for a large range of item distributions and supports s
distance (Q(ks) U) =O(1n)
AlgoDEEP 160410 18
Approximation Result (2)
Corollary there exists smin st Q(ks) is well approximated by a Poisson distribution for sgesmin
In practice Monte-Carlo method to determine smin st with probability at least 1-δ
distance (Q(ks) U) le εfor all sgesmin
AlgoDEEP 160410 19
Support threshold for mining significant itemsets (1)
Determine smin and let h be such that smin +2h is the maximum support for an itemset
Fix α1 α2 αh such that sumαile α Fix β1 β2 βh such that sum βile β For i=1 to h
si= smin +2i
Q(k si) = obs number of k-itemsets of support ge si
H0(ksi) Q(ksi) conforms with Poisson(λi= E[Q(k si)]) reject H0(ksi) if
p-value of Q(ksi) lt αi and Q(ksi) ge λi βi
AlgoDEEP 160410 20
Support threshold for mining significant itemsets (2)
Theorem Let s be the minimum s such that H0(ks) was rejected We have
1With significance level α the number of k-itemsets of support ge s is significant
2The k-itemsets with support ge s are significant with FDR le β
AlgoDEEP 160410 21
FIMI repositoryhttpfimicshelsinkifidata
Experiments benchmark datasets
avg trans lengthitems frequencies range
AlgoDEEP 160410 22
Test α = 005 β = 005 Qks = number of k-itemsets of support ge s in D λ(s) = expected number of k-itemsets with support ge s
Itemset of size 154 with support ge 7
Experiments results (1)
AlgoDEEP 160410 23
Experiments results (2)
Comparison with standard application of Benjamini Yekutieli FDR le 005 R = output (standard approach) Qks = output (our approach) r = |Qks||R|
AlgoDEEP 160410 24
Poisson approximation for number of k-itemsets of support s ge smin in a random dataset
An statistically sound method to determine a support threshold for mining significant frequent itemsets with controlled FDR
Conclusions
AlgoDEEP 160410 25
Deal with false negatives
Software package
Application of the method to other frequent pattern problems
Future Work
AlgoDEEP 160410 26
Questions
Thank you
AlgoDEEP 160410 5
Significance
Focus on statistical significance significance wrt random model
We address the following questions What support level makes an itemset significantly
frequent How to narrow the search down to significant
itemsets Goal minimize false discoveries and improve
quality of subsequent analysis
AlgoDEEP 160410 6
Many works consider significance of itemsets in isolation Eg [Silverstein Brin Motwani 98] rigorous statistical framework (with flaws) 2 test to assess degree of dependence of items in an
itemset
Global characteristics of dataset taken into account in [Gionis Mannila et al 06] deviation from random dataset wrt number of
frequent itemsets no rigouros statistical grounding
Related Work
AlgoDEEP 160410 7
Statistical Tests
Standard statistical test null hypothesis H0 (asympnot significant)
alternative hypothesis H1 H0 is tested against H1 by observing a certain statistic
s p-value = Prob( obs ge s | H0 is true ) Significance level α = probability of rejecting H0 when
it is true (false positive) Also called probability of Type I error
AlgoDEEP 160410 8
Random Model
I = set of n items
D = input dataset of t transactions over I i ∊ I
n(i) = support of i in D fi= n(i)t = frequency of i in D
D = random dataset of t transactions over I Item i is included in transaction j with probability
fi independently of all other events
AlgoDEEP 160410 9
For each itemset X = i1 i2 ik sube I
fX = fi1fi2 hellip fik expected frequency of X in D
null hypothesis H0(X) the support of X in D conforms with D (ie it is as drawn from Binomial(t fX ) )
alternative hypothesis H1(X) the support of X in D does not conforms with D
Naiumlve Approach (1)
AlgoDEEP 160410 10
Naiumlve Approach (2)
Statistic of interest sx = support of X in D
Reject H0(X) if
p-value = Prob(B(t fX) ge sX) le α
Significant itemsets = X sube I H0(X) is rejected
AlgoDEEP 160410 11
Whatrsquos wrong D with t=1000000 transactions over n=1000 items
each item with frequency 11000 Pair ij that occurs 7 times is it statistically
significant In D (random dataset)
E[support(ij)] = 1 p-value = Prob(ij has support ge 7 ) 00001≃
ij must be significant
Naiumlve Approach (3)
AlgoDEEP 160410 12
Expected number of pairs with support ge 7 in random dataset is ≃ 50
existence of ij with support ge 7 is not such a rare event
returning ij as significant itemset could be a false discovery
However 300 (disjoint) pairs with support ge 7 in D is an extremely rare event (prob le 2-300)
Naiumlve Approach (4)
AlgoDEEP 160410 13
Multi-Hypothesis test (1)
Looking for significant itemsets of size k (k-itemsets) involves testing simultaneously for
m= null hypotheses H0(X)|X|=k
How to combine m tests while minimizing false positives
AlgoDEEP 160410 14
Multi-Hypothesis test (2)
V = number of false positives R = total number rejected null hypotheses = number itemsets flagged as significant False Discovery Rate (FDR) = E[VR] (FDR=0 when R=0)
GOAL maximize R while ensuring FDR le β
[Benjamini-Yekutieli rsquo01] Reject hypothesis with indashth smallest p-value if le iβm
m = does not yield a support threshold for mining
AlgoDEEP 160410 15
Our Approach
Q(k s) = obs number of k-itemsets of support ge s
null hypothesis H0(s) the number of k-itemsets of support s in D conforms with D
alternative hypothesis H1(s) the number of k-itemsets of support s in D does not conforms with D
Problem how to compute the p-value of Q(k s)
AlgoDEEP 160410 16
Main Results (PODS 2009)
Result 1 (Poisson approx) Q(ks)= number of k-itemsets of support ge s in D Theorem Exists smin for sgesmin Q(ks) is well
approximated by a Poisson distribution
Result 2 Methodology to establish a support threshold for
discovering significant itemsets with small FDR
Approximation Result (1)
Based on the Chen-Stein method (1975)
Q̂(k,s) = number of k-itemsets of support ≥ s in random dataset D̂
U ~ Poisson(λ), λ = E[Q̂(k,s)]
Theorem: for k = O(1), t = poly(n), and for a large range of item distributions and supports s:
distance(Q̂(k,s), U) = O(1/n)
Approximation Result (2)
Corollary: there exists s_min such that Q̂(k,s) is well approximated by a Poisson distribution for s ≥ s_min
In practice: Monte Carlo method to determine s_min such that, with probability at least 1-δ,
distance(Q̂(k,s), U) ≤ ε for all s ≥ s_min
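To get a feel for the approximation, one can simulate random datasets and compare the empirical distribution of the count with a Poisson distribution of the same mean. The sketch below is only an illustration and is not the Monte Carlo procedure of the talk: the function names, the toy parameters, and the use of the sample mean as λ are all assumptions, and the estimated distance includes sampling noise from the finite number of trials.

```python
import random
from collections import Counter
from itertools import combinations
from math import exp, lgamma, log

def random_columns(freqs, t, rng):
    """One bitmask per item: bit j is set when the item is in transaction j."""
    cols = []
    for f in freqs:
        mask = 0
        for j in range(t):
            if rng.random() < f:
                mask |= 1 << j
        cols.append(mask)
    return cols

def count_frequent(cols, k, s):
    """Number of k-itemsets whose support (popcount of the AND) is >= s."""
    count = 0
    for combo in combinations(cols, k):
        common = combo[0]
        for c in combo[1:]:
            common &= c
        if bin(common).count("1") >= s:
            count += 1
    return count

def poisson_pmf(lam, q):
    return exp(-lam + q * log(lam) - lgamma(q + 1))

def empirical_distance(freqs, t, k, s, trials, rng):
    """Sample mean of the count and estimated variation distance to Poisson(mean)."""
    samples = [count_frequent(random_columns(freqs, t, rng), k, s) for _ in range(trials)]
    lam = sum(samples) / trials
    if lam == 0:
        return lam, 0.0  # every sample is 0, which matches Poisson(0)
    emp = Counter(samples)
    top = max(samples)
    diff = sum(abs(emp[q] / trials - poisson_pmf(lam, q)) for q in range(top + 1))
    tail = max(0.0, 1.0 - sum(poisson_pmf(lam, q) for q in range(top + 1)))
    return lam, 0.5 * (diff + tail)

# Toy setting (assumed): 20 items of frequency 0.1, 200 transactions, pairs, s = 6.
lam, dist = empirical_distance([0.1] * 20, t=200, k=2, s=6, trials=200, rng=random.Random(0))
print(f"estimated lambda = {lam:.2f}, estimated distance = {dist:.3f}")
```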
Support threshold for mining significant itemsets (1)
Determine s_min and let h be such that s_min + 2^h is the maximum support for an itemset
Fix α_1, α_2, ..., α_h such that Σ α_i ≤ α
Fix β_1, β_2, ..., β_h such that Σ β_i ≤ β
For i = 1 to h:
s_i = s_min + 2^i
Q(k,s_i) = observed number of k-itemsets of support ≥ s_i
H0(k,s_i): Q(k,s_i) conforms with Poisson(λ_i = E[Q̂(k,s_i)])
reject H0(k,s_i) if p-value of Q(k,s_i) < α_i and Q(k,s_i) ≥ λ_i/β_i
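The loop above can be sketched as follows. This is a hedged illustration, not the authors' software: observed_count and expected_count are hypothetical callables supplied by the caller (for example from a frequent itemset miner and from a Monte Carlo estimate), the uniform split α_i = α/h, β_i = β/h is just one admissible choice, and h is taken as the largest integer with s_min + 2^h no larger than the maximum support.

```python
from math import exp, lgamma, log

def poisson_upper_tail(lam, q):
    """Prob(U >= q) for U ~ Poisson(lam)."""
    if q <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # Terms are evaluated in log space so that large lam does not underflow.
    below = sum(exp(-lam + j * log(lam) - lgamma(j + 1)) for j in range(q))
    return max(0.0, 1.0 - below)

def support_threshold(s_min, s_max, observed_count, expected_count, alpha, beta):
    """Return the smallest s_i whose null hypothesis is rejected, or None."""
    gap = s_max - s_min
    if gap < 2:
        return None
    h = gap.bit_length() - 1          # largest h with s_min + 2**h <= s_max (assumed)
    alpha_i = alpha / h               # uniform split, so the alpha_i sum to alpha
    beta_i = beta / h                 # uniform split, so the beta_i sum to beta
    for i in range(1, h + 1):
        s_i = s_min + 2 ** i
        q = observed_count(s_i)       # Q(k, s_i) in the input dataset
        lam = expected_count(s_i)     # lambda_i in the random dataset
        p_value = poisson_upper_tail(lam, q)
        if p_value < alpha_i and q >= lam / beta_i:
            return s_i
    return None

# Toy inputs (made up for illustration only).
toy_observed = lambda s: 40 if s <= 18 else 5
toy_expected = lambda s: 30.0 if s <= 14 else 0.01
print(support_threshold(10, 74, toy_observed, toy_expected, alpha=0.05, beta=0.05))
```

Because the supports s_i are visited in increasing order, the first rejection is also the minimum rejected support.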
Support threshold for mining significant itemsets (2)
Theorem: let s* be the minimum s such that H0(k,s) was rejected. We have:
1. With significance level α, the number of k-itemsets of support ≥ s* is significant
2. The k-itemsets with support ≥ s* are significant with FDR ≤ β
Experiments: benchmark datasets
FIMI repository: http://fimi.cs.helsinki.fi/data/
[Table of benchmark datasets; columns: avg. trans. length, items, frequencies range]
Experiments: results (1)
Test: α = 0.05, β = 0.05
Q(k,s) = number of k-itemsets of support ≥ s in D
λ(s) = expected number of k-itemsets with support ≥ s
Itemset of size 154 with support ≥ 7
Experiments: results (2)
Comparison with standard application of Benjamini-Yekutieli, FDR ≤ 0.05
R = output (standard approach)
Q(k,s) = output (our approach)
r = |Q(k,s)| / |R|
Conclusions
Poisson approximation for the number of k-itemsets of support s ≥ s_min in a random dataset
A statistically sound method to determine a support threshold for mining significant frequent itemsets with controlled FDR
Future Work
Deal with false negatives
Software package
Application of the method to other frequent pattern problems
Questions
Thank you
Related Work
Many works consider significance of itemsets in isolation
E.g. [Silverstein, Brin, Motwani 98]: rigorous statistical framework (with flaws); χ² test to assess degree of dependence of items in an itemset
Global characteristics of dataset taken into account in [Gionis, Mannila et al. 06]: deviation from random dataset w.r.t. number of frequent itemsets; no rigorous statistical grounding
Statistical Tests
Standard statistical test:
null hypothesis H0 (≈ not significant)
alternative hypothesis H1
H0 is tested against H1 by observing a certain statistic s
p-value = Prob(obs ≥ s | H0 is true)
Significance level α = probability of rejecting H0 when it is true (false positive). Also called probability of Type I error
Random Model
I = set of n items
D = input dataset of t transactions over I
For i ∈ I: n(i) = support of i in D; f_i = n(i)/t = frequency of i in D
D̂ = random dataset of t transactions over I: item i is included in transaction j with probability f_i, independently of all other events
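The random model is straightforward to simulate. The sketch below assumes transactions are represented as Python sets of item names; the function names and the seed are illustrative, and the toy input is the five-transaction example from the Frequent Itemsets slide.

```python
import random

def item_frequencies(dataset):
    """f_i = n(i)/t for every item i appearing in the dataset."""
    t = len(dataset)
    items = sorted(set().union(*dataset))
    return {i: sum(1 for tr in dataset if i in tr) / t for i in items}

def random_dataset(freqs, t, rng):
    """t transactions; item i enters each one independently with probability f_i."""
    return [{i for i, f in freqs.items() if rng.random() < f} for _ in range(t)]

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
freqs = item_frequencies(transactions)
sample = random_dataset(freqs, t=20000, rng=random.Random(1))
bread_freq = sum(1 for tr in sample if "Bread" in tr) / len(sample)
print(freqs["Bread"], round(bread_freq, 3))
```

The random dataset preserves the item frequencies in expectation but removes any correlation between items, which is what makes it a usable null model.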
Naïve Approach (1)
For each itemset X = {i_1, i_2, ..., i_k} ⊆ I:
f_X = f_{i_1} f_{i_2} ... f_{i_k} = expected frequency of X in D̂
null hypothesis H0(X): the support of X in D conforms with D̂ (i.e., it is as drawn from Binomial(t, f_X))
alternative hypothesis H1(X): the support of X in D does not conform with D̂
Naïve Approach (2)
Statistic of interest: s_X = support of X in D
Reject H0(X) if
p-value = Prob(B(t, f_X) ≥ s_X) ≤ α
Significant itemsets = {X ⊆ I : H0(X) is rejected}
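A minimal sketch of this per-itemset test, assuming transactions are Python sets and using the five-transaction example from the Frequent Itemsets slide as input. The function names are hypothetical, and the tail is computed as one minus the lower tail, which is adequate for small supports but loses precision for extremely small p-values.

```python
from itertools import combinations
from math import comb, prod

def support(dataset, itemset):
    """Number of transactions containing every item of itemset."""
    return sum(1 for tr in dataset if itemset <= tr)

def binomial_upper_tail(t, p, s):
    """Prob(B(t, p) >= s)."""
    below = sum(comb(t, j) * p**j * (1 - p) ** (t - j) for j in range(s))
    return max(0.0, 1.0 - below)

def naive_significant(dataset, k, alpha):
    """All k-itemsets X whose H0(X) is rejected at level alpha, tested one by one."""
    t = len(dataset)
    items = sorted(set().union(*dataset))
    freq = {i: support(dataset, {i}) / t for i in items}
    flagged = []
    for combo in combinations(items, k):
        f_x = prod(freq[i] for i in combo)      # expected frequency under independence
        s_x = support(dataset, set(combo))      # observed support in the input dataset
        p_value = binomial_upper_tail(t, f_x, s_x)
        if p_value <= alpha:
            flagged.append((combo, s_x, p_value))
    return flagged

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
p_beer_diaper = binomial_upper_tail(5, 0.6 * 0.8, support(transactions, {"Beer", "Diaper"}))
print(round(p_beer_diaper, 4))
```

On this tiny dataset the pair {Beer, Diaper} has support 3 but a p-value far above 0.05, so high support alone does not make it significant under the random model.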
AlgoDEEP 160410 12
Expected number of pairs with support ge 7 in random dataset is ≃ 50
existence of ij with support ge 7 is not such a rare event
returning ij as significant itemset could be a false discovery
However 300 (disjoint) pairs with support ge 7 in D is an extremely rare event (prob le 2-300)
Naiumlve Approach (4)
AlgoDEEP 160410 13
Multi-Hypothesis test (1)
Looking for significant itemsets of size k (k-itemsets) involves testing simultaneously for
m= null hypotheses H0(X)|X|=k
How to combine m tests while minimizing false positives
AlgoDEEP 160410 14
Multi-Hypothesis test (2)
V = number of false positives R = total number rejected null hypotheses = number itemsets flagged as significant False Discovery Rate (FDR) = E[VR] (FDR=0 when R=0)
GOAL maximize R while ensuring FDR le β
[Benjamini-Yekutieli rsquo01] Reject hypothesis with indashth smallest p-value if le iβm
m = does not yield a support threshold for mining
AlgoDEEP 160410 15
Our Approach
Q(k s) = obs number of k-itemsets of support ge s
null hypothesis H0(s) the number of k-itemsets of support s in D conforms with D
alternative hypothesis H1(s) the number of k-itemsets of support s in D does not conforms with D
Problem how to compute the p-value of Q(k s)
AlgoDEEP 160410 16
Main Results (PODS 2009)
Result 1 (Poisson approx) Q(ks)= number of k-itemsets of support ge s in D Theorem Exists smin for sgesmin Q(ks) is well
approximated by a Poisson distribution
Result 2 Methodology to establish a support threshold for
discovering significant itemsets with small FDR
AlgoDEEP 160410 17
Approximation Result (1)
Based on the Chen-Stein method (1975)
Q(k, s) = number of k-itemsets of support ≥ s in a random dataset D
U ~ Poisson(λ), λ = E[Q(k, s)]
Theorem: for k = O(1), t = poly(n), and for a large range of item distributions and supports s,
distance(Q(k, s), U) = O(1/n)
AlgoDEEP 160410 18
Approximation Result (2)
Corollary: there exists s_min such that Q(k, s) is well approximated by a Poisson distribution for s ≥ s_min
In practice: Monte Carlo method to determine s_min such that, with probability at least 1 - δ,
distance(Q(k, s), U) ≤ ε for all s ≥ s_min
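A simplified Monte Carlo illustration of the idea above, not the authors' procedure: it simulates random datasets, estimates the empirical distribution of Q(k, s), and compares it with a Poisson of the same mean. It assumes that "distance" means total variation distance and that items are placed independently; it does not provide the 1 - δ guarantee, and the empirical distance also includes sampling noise. All function names and parameters are hypothetical.

```python
import math
import random
from collections import Counter
from itertools import combinations

def random_dataset(freqs, t, rng):
    """Draw t transactions; item i joins each one independently with prob. freqs[i]."""
    return [[i for i, f in enumerate(freqs) if rng.random() < f] for _ in range(t)]

def supports(dataset, k):
    """Support of every k-itemset that occurs at least once."""
    counts = Counter()
    for transaction in dataset:
        counts.update(combinations(transaction, k))  # items are already sorted
    return counts

def poisson_pmf(lam, j):
    return math.exp(-lam) * lam**j / math.factorial(j)

def tv_distance_to_poisson(samples):
    """Total variation distance between the empirical law and Poisson(sample mean)."""
    n_samples = len(samples)
    lam = sum(samples) / n_samples
    empirical = Counter(samples)
    top = max(samples)
    body = sum(abs(empirical[j] / n_samples - poisson_pmf(lam, j))
               for j in range(top + 1))
    # Poisson mass above the largest observed value has no empirical counterpart.
    tail = max(0.0, 1.0 - sum(poisson_pmf(lam, j) for j in range(top + 1)))
    return 0.5 * (body + tail)

def estimate_smin(freqs, t, k, s_values, eps, runs, rng):
    """Smallest candidate s such that s and all larger candidates are within eps."""
    samples = {s: [] for s in s_values}
    for _ in range(runs):
        sup = supports(random_dataset(freqs, t, rng), k)
        for s in s_values:
            samples[s].append(sum(1 for c in sup.values() if c >= s))
    smin = None
    for s in sorted(s_values, reverse=True):
        if tv_distance_to_poisson(samples[s]) > eps:
            break
        smin = s
    return smin

# Small hypothetical instance: 15 items of frequency 0.1, 300 transactions, pairs.
rng = random.Random(0)
smin_estimate = estimate_smin([0.1] * 15, t=300, k=2,
                              s_values=range(3, 12), eps=0.15, runs=200, rng=rng)
print("estimated s_min:", smin_estimate)
```

Scanning the candidates from the largest support downwards mirrors the "for all s ≥ s_min" requirement: the scan stops at the first candidate whose estimated distance exceeds ε.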
AlgoDEEP 160410 19
Support threshold for mining significant itemsets (1)
Determine s_min and let h be such that s_min + 2^h is the maximum support for an itemset
Fix α_1, α_2, ..., α_h such that Σ α_i ≤ α
Fix β_1, β_2, ..., β_h such that Σ β_i ≤ β
For i = 1 to h:
  s_i = s_min + 2^i
  Q(k, s_i) = observed number of k-itemsets of support ≥ s_i
  H0(k, s_i): Q(k, s_i) conforms with Poisson(λ_i), λ_i = E[Q(k, s_i)]
  reject H0(k, s_i) if p-value of Q(k, s_i) < α_i and Q(k, s_i) ≥ λ_i / β_i
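The loop above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the candidate thresholds are read as s_i = s_min + 2^i and the second rejection condition as Q(k, s_i) ≥ λ_i / β_i (both are readings of the flattened slide text), the budgets are split evenly (α_i = α/h, β_i = β/h), and the observed counts and Poisson means in the example are made up.

```python
import math

def poisson_tail(lam, q):
    """P(U >= q) for U ~ Poisson(lam); tiny tails may round to 0.0."""
    if q <= 0:
        return 1.0
    term = math.exp(-lam)      # P(U = 0)
    cdf = term
    for j in range(1, q):
        term *= lam / j        # P(U = j) from P(U = j - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def find_support_threshold(s_min, h, observed_q, lam, alpha, beta):
    """Return the smallest s_i whose null hypothesis is rejected, or None.

    observed_q(s) gives Q(k, s) in the real dataset; lam(s) gives E[Q(k, s)]
    in the random model. Both are supplied by the caller.
    """
    alpha_i = alpha / h        # even split, so that sum(alpha_i) <= alpha
    beta_i = beta / h          # even split, so that sum(beta_i) <= beta
    for i in range(1, h + 1):
        s_i = s_min + 2**i
        q, lam_i = observed_q(s_i), lam(s_i)
        p_value = poisson_tail(lam_i, q)
        if p_value < alpha_i and q >= lam_i / beta_i:
            return s_i         # first rejection = minimum rejected support
    return None

# Made-up numbers, for illustration only.
obs = {12: 40, 14: 25, 18: 6}
means = {12: 30.0, 14: 0.5, 18: 0.01}
s_star = find_support_threshold(10, 3, obs.get, means.get, alpha=0.05, beta=0.05)
print("support threshold:", s_star)
```

In this made-up instance the first two candidates fail the condition Q ≥ λ_i / β_i even though the second has a tiny p-value, which shows that the two conditions play different roles: the first controls the significance level, the second bounds the fraction of false discoveries.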
AlgoDEEP 160410 20
Support threshold for mining significant itemsets (2)
Theorem: let s* be the minimum s such that H0(k, s) was rejected. We have:
1. With significance level α, the number of k-itemsets of support ≥ s* is significant
2. The k-itemsets with support ≥ s* are significant with FDR ≤ β
AlgoDEEP 160410 21
FIMI repository: http://fimi.cs.helsinki.fi/data/
Experiments: benchmark datasets
[Table of benchmark datasets; recoverable column headers: avg. trans. length, items frequencies range]
AlgoDEEP 160410 22
Test: α = 0.05, β = 0.05
Q_{k,s} = number of k-itemsets of support ≥ s in D
λ(s) = expected number of k-itemsets with support ≥ s
Itemset of size 154 with support ≥ 7
Experiments: results (1)
AlgoDEEP 160410 23
Experiments: results (2)
Comparison with standard application of Benjamini-Yekutieli, FDR ≤ 0.05
R = output (standard approach)
Q_{k,s} = output (our approach)
r = |Q_{k,s}| / |R|
AlgoDEEP 160410 24
Poisson approximation for the number of k-itemsets of support s ≥ s_min in a random dataset
A statistically sound method to determine a support threshold for mining significant frequent itemsets with controlled FDR
Conclusions
AlgoDEEP 160410 25
Deal with false negatives
Software package
Application of the method to other frequent pattern problems
Future Work
AlgoDEEP 160410 26
Questions
Thank you