Association Rule Mining
TRANSCRIPT
8/12/2019 Association Rule Mining
Association Rule Mining
Mining Association Rules in Large Databases
Association rule mining
Algorithms: Apriori and FP-Growth
Max and closed patterns
Mining various kinds of association/correlation rules
Max-patterns & Closed-patterns
If there are frequent patterns with many items, enumerating all of them is costly.
We may be interested in finding the boundary frequent patterns.
Two types:
Max-patterns
A frequent pattern {a1, ..., a100} contains C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27*10^30 frequent sub-patterns!
Max-pattern: a frequent pattern without a proper frequent super-pattern
BCDE, ACD are max-patterns
BCD is not a max-pattern

Tid | Items
10  | A,B,C,D,E
20  | B,C,D,E
30  | A,C,D,F

Min_sup = 2
Maximal Frequent Itemset
[Figure: itemset lattice showing the border between the frequent and infrequent itemsets; the maximal itemsets lie just inside the border]
An itemset is maximal frequent if none of its immediate supersets is frequent
Closed Itemset
An itemset is closed if none of its immediate supersets has the same support as the itemset
TID Items
1 {A,B}
2 {B,C,D}
3 {A,B,C,D}
4 {A,B,D}
5 {A,B,C,D}
Itemset Support
{A} 4
{B} 5
{C} 3
{D} 4
{A,B} 4
{A,C} 2
{A,D} 3
{B,C} 3
{B,D} 4
{C,D} 3
Itemset Support
{A,B,C} 2
{A,B,D} 3
{A,C,D} 2
{B,C,D} 3
{A,B,C,D} 2
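The definitions above can be checked by brute force on this toy database. A minimal Python sketch (min_sup = 2 is an assumption here, matching the smallest count in the support tables):

```python
from itertools import combinations

# Transaction database from the slide (TID -> items)
db = {1: {'A', 'B'}, 2: {'B', 'C', 'D'}, 3: {'A', 'B', 'C', 'D'},
      4: {'A', 'B', 'D'}, 5: {'A', 'B', 'C', 'D'}}
min_sup = 2  # assumed: the slide's support tables list counts down to 2

items = sorted(set().union(*db.values()))

def support(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in db.values() if itemset <= t)

# All frequent itemsets (brute force is fine for 4 items)
frequent = {frozenset(c): support(frozenset(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= min_sup}

# Maximal frequent: no frequent proper superset
maximal = {s for s in frequent if not any(s < t for t in frequent)}

# Closed: no proper superset with the same support
closed = {s for s in frequent
          if not any(s < t and frequent[t] == frequent[s] for t in frequent)}
```

On this database the computed supports match the tables (e.g. {A,B} has support 4); {A,B,C,D} is the only maximal frequent itemset, while the closed itemsets include {A,B} and {A,B,D} but not {A,C} or {A,D}.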
Maximal vs Closed Itemsets

MaxMiner: Mining Max-patterns
Idea: generate the complete set-enumeration tree one level at a time, pruning where applicable.

(ABCD)
A (BCD)  B (CD)  C (D)  D ()
AB (CD)  AC (D)  AD ()  BC (D)  BD ()  CD ()
ABC (D)  ABD ()  ACD ()  BCD ()
ABCD ()
Local Pruning Techniques (e.g. at node A)
Check the frequency of ABCD and AB, AC, AD.
If ABCD is frequent, prune the whole sub-tree.
If AC is NOT frequent, remove C from the parenthesis before expanding.

(ABCD)
A (BCD)  B (CD)  C (D)  D ()
AB (CD)  AC (D)  AD ()  BC (D)  BD ()  CD ()
ABC (D)  ABD ()  ACD ()  BCD ()
ABCD ()

Algorithm MaxMiner
Initially, generate one node N = ∅, where h(N) = ∅ and t(N) = {A,B,C,D}.
Consider expanding N:
If h(N) ∪ t(N) is frequent, do not expand N.
If for some i ∈ t(N), h(N) ∪ {i} is NOT frequent, remove i from t(N) before expanding N.
Apply global pruning techniques.
Global Pruning Technique (across sub-trees)
When a max pattern is identified (e.g. ABCD), prune all nodes (e.g. B, C and D) where h(N) ∪ t(N) is a subset of it (e.g. of ABCD).

(ABCD)
A (BCD)  B (CD)  C (D)  D ()
AB (CD)  AC (D)  AD ()  BC (D)  BD ()  CD ()
ABC (D)  ABD ()  ACD ()  BCD ()
ABCD ()
Example

Tid | Items
10  | A,B,C,D,E
20  | B,C,D,E
30  | A,C,D,F

Min_sup = 2, root node (ABCDEF)

Items  | Frequency
A      | 2
B      | 2
C      | 3
D      | 3
E      | 2
F      | 1
ABCDEF | 0

F is infrequent, so it is removed from the tail; ABCDEF is infrequent, so the root is expanded:
A (BCDE)  B (CDE)  C (DE)  D (E)  E ()
Max patterns: (none yet)
Example (Node A)

Tid | Items
10  | A,B,C,D,E
20  | B,C,D,E
30  | A,C,D,F

Min_sup = 2

Items | Frequency
AB    | 1
AC    | 2
AD    | 2
AE    | 1
ABCDE | 1

ABCDE = h(A) ∪ t(A) is infrequent, so node A is expanded; B and E are removed from its tail:
A (BCDE)  B (CDE)  C (DE)  D (E)  E ()
AC (D)  AD ()
Max patterns: (none yet)

Example (Node B)

Tid | Items
10  | A,B,C,D,E
20  | B,C,D,E
30  | A,C,D,F

Min_sup = 2

Items | Frequency
BCDE  | 2

BCDE = h(B) ∪ t(B) is frequent, so node B is not expanded (BC, BD, BE need not be counted):
A (BCDE)  B (CDE)  C (DE)  D (E)  E ()
AC (D)  AD ()
Max patterns: BCDE
Example (Node AC)

Tid | Items
10  | A,B,C,D,E
20  | B,C,D,E
30  | A,C,D,F

Min_sup = 2

Items | Frequency
ACD   | 2

ACD = h(AC) ∪ t(AC) is frequent, so node AC is not expanded:
A (BCDE)  B (CDE)  C (DE)  D (E)  E ()
AC (D)  AD ()
Max patterns: BCDE, ACD
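The walk-through above can be reproduced with a compact, simplified MaxMiner sketch: `expand` applies the two local pruning rules and `record` the global one. Two assumptions differ from the real algorithm: items are kept in the slide's fixed A..F order (MaxMiner orders them by frequency), and the tree is expanded depth-first rather than level by level; the resulting max patterns are the same.

```python
# Transaction database from the slide, min_sup = 2
db = [{'A', 'B', 'C', 'D', 'E'}, {'B', 'C', 'D', 'E'}, {'A', 'C', 'D', 'F'}]
min_sup = 2

def sup(s):
    return sum(1 for t in db if s <= t)

max_patterns = []  # discovered max patterns

def record(pattern):
    # Global pruning: ignore patterns subsumed by a known max pattern,
    # and drop previously recorded patterns subsumed by this one.
    if not any(pattern <= m for m in max_patterns):
        max_patterns[:] = [m for m in max_patterns if not m <= pattern]
        max_patterns.append(pattern)

def expand(head, tail):
    # Local pruning 1: if h(N) ∪ t(N) is frequent, do not expand N.
    if sup(head | set(tail)) >= min_sup:
        record(frozenset(head | set(tail)))
        return
    # Local pruning 2: if h(N) ∪ {i} is infrequent, remove i from t(N).
    tail = [i for i in tail if sup(head | {i}) >= min_sup]
    for k, i in enumerate(tail):
        expand(head | {i}, tail[k + 1:])

# Root node: h(N) = {}, t(N) = frequent single items (F is pruned here)
expand(set(), [i for i in 'ABCDEF' if sup({i}) >= min_sup])
```

Running this yields exactly the two max patterns found in the trace: BCDE and ACD.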
Frequent Closed Patterns
For a frequent itemset X, if there exists no item y s.t. every transaction containing X also contains y, then X is a frequent closed pattern
ab is a frequent closed pattern
Concise representation of frequent patterns
Reduces the number of patterns and rules (N. Pasquier et al., ICDT'99)

TID | Items
10  | a, b, c
20  | a, b, c
30  | a, b, d
40  | a, b, d
50  | e, f

Min_sup = 2
Max Pattern vs. Frequent Closed Pattern
max pattern => closed pattern
If itemset X is a max pattern, adding any item to it would not give a frequent pattern; thus there exists no item y s.t. every transaction containing X also contains y.
closed pattern =/=> max pattern
ab is a closed pattern, but not max

TID | Items
10  | a, b, c
20  | a, b, c
30  | a, b, d
40  | a, b, d
50  | e, f

Min_sup = 2
Mining Frequent Closed Patterns: CLOSET
Flist: list of all frequent items in support-ascending order
Flist: d-a-f-e-c
Divide the search space:
Patterns having d
Patterns having a but not d, etc.
Find frequent closed patterns recursively
Among the transactions having d, cfa is frequent closed, so
cfad is a frequent closed pattern
(J. Pei, J. Han & R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00)

TID | Items
10  | a, c, d, e, f
20  | a, b, e
30  | c, e, f
40  | a, c, d, f
50  | c, e, f

Min_sup = 2
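The Flist and the "patterns having d" step can be checked directly. A sketch (note that items c, e, f all have support 4, so the order among them depends on tie-breaking; only d and a have a forced position in the Flist):

```python
from collections import Counter

# Transaction database from the CLOSET slide, min_sup = 2
db = [{'a', 'c', 'd', 'e', 'f'}, {'a', 'b', 'e'}, {'c', 'e', 'f'},
      {'a', 'c', 'd', 'f'}, {'c', 'e', 'f'}]
min_sup = 2

counts = Counter(i for t in db for i in t)
# Flist: frequent items in support-ascending order (b is infrequent)
f_list = sorted((i for i in counts if counts[i] >= min_sup),
                key=lambda i: counts[i])

# "Patterns having d": project onto the transactions that contain d.
# The items common to ALL of them form the closure of {d}: the closed
# pattern cfad, with support = number of projected transactions (2).
d_transactions = [t for t in db if 'd' in t]
closure_of_d = frozenset(set.intersection(*d_transactions))
```

Here `closure_of_d` comes out as {a, c, d, f}, i.e. the frequent closed pattern cfad from the slide.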
Multi-level Association: Uniform Support vs. Reduced Support
Uniform Support: the same minimum support for all levels
+ One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
Lower-level items do not occur as frequently. If the support threshold is
too high => miss low-level associations
too low => generate too many high-level associations
Multi-level Association: Uniform Support vs. Reduced Support
Reduced Support: reduced minimum support at lower levels. There are 4 search strategies:
Level-by-level independent: independent search at all levels (no misses)
Level-cross filtering by k-itemset: prune a k-pattern if the corresponding k-pattern at the upper level is infrequent
Level-cross filtering by single item: prune an item if its parent node is infrequent
Controlled level-cross filtering by single item: consider subfrequent items that pass a passage threshold
Uniform Support
Multi-level mining with uniform support:
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 5%): full fat Milk [support = 6%], Skim Milk [support = 4%, pruned: X]

Reduced Support
Multi-level mining with reduced support:
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 3%): full fat Milk [support = 6%], Skim Milk [support = 4%]
Pattern Evaluation
Association rule algorithms tend to produce too many rules
many of them are uninteresting or redundant
Redundant if {A,B,C} -> {D} and {A,B} -> {D} have the same support & confidence
Interestingness measures can be used to prune/rank the derived patterns
In the original formulation of association rules, support & confidence are the only measures used
Computing Interestingness Measure
Given a rule X -> Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X -> Y:

      | Y   | ¬Y  |
X     | f11 | f10 | f1+
¬X    | f01 | f00 | f0+
      | f+1 | f+0 | |T|

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

Used to define various measures: support, confidence, lift, Gini, J-measure, etc.
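A small helper shows how the f-counts are obtained from a transaction list (the four-transaction database below is a made-up illustration, not from the slides):

```python
def contingency(db, X, Y):
    """2x2 contingency counts (f11, f10, f01, f00) for rule X -> Y,
    where db is a list of transactions (sets) and X, Y are itemsets."""
    f11 = sum(1 for t in db if X <= t and Y <= t)       # X and Y
    f10 = sum(1 for t in db if X <= t and not Y <= t)   # X and not-Y
    f01 = sum(1 for t in db if not X <= t and Y <= t)   # not-X and Y
    f00 = len(db) - f11 - f10 - f01                     # neither
    return f11, f10, f01, f00

# Toy illustration: one transaction in each cell of the table
db = [{'a', 'b'}, {'a'}, {'b'}, set()]
print(contingency(db, {'a'}, {'b'}))  # (1, 1, 1, 1)
```

The four counts always sum to |T|, the total number of transactions.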
Drawback of Confidence

      | Coffee | ¬Coffee | Total
Tea   | 15     | 5       | 20
¬Tea  | 75     | 5       | 80
Total | 90     | 10      | 100

Association Rule: Tea -> Coffee
Confidence = P(Coffee|Tea) = 15/20 = 0.75
but P(Coffee) = 0.9
Although confidence is high, the rule is misleading:
P(Coffee|¬Tea) = 75/80 = 0.9375
Statistical Independence
Population of 1000 students
600 students know how to swim (S)
700 students know how to bike (B)
420 students know how to swim and bike (S,B)
P(S,B) = 420/1000 = 0.42
P(S) * P(B) = 0.6 * 0.7 = 0.42
P(S,B) = P(S) * P(B) => statistical independence
P(S,B) > P(S) * P(B) => positively correlated
P(S,B) < P(S) * P(B) => negatively correlated
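The three cases can be wrapped in a small helper (an illustrative sketch; the `eps` tolerance is an assumption to absorb floating-point noise):

```python
def correlation(p_xy, p_x, p_y, eps=1e-9):
    """Classify X and Y as independent / positively / negatively
    correlated by comparing P(X,Y) against P(X) * P(Y)."""
    d = p_xy - p_x * p_y
    if abs(d) < eps:
        return "independent"
    return "positively correlated" if d > 0 else "negatively correlated"

# Swim/bike example from the slide: 0.42 = 0.6 * 0.7
print(correlation(0.42, 0.6, 0.7))  # independent
```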
Statistical-based Measures
Measures that take into account statistical dependence:

Lift = P(Y|X) / P(Y)
Interest = P(X,Y) / (P(X) P(Y))
PS = P(X,Y) - P(X) P(Y)
φ-coefficient = (P(X,Y) - P(X) P(Y)) / sqrt(P(X)[1 - P(X)] P(Y)[1 - P(Y)])
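The four formulas can be computed directly from the f-counts of a contingency table (a sketch, using the tea/coffee counts from the earlier slide as input):

```python
from math import sqrt

def measures(f11, f10, f01, f00):
    """Lift, Interest, PS and phi-coefficient for rule X -> Y."""
    n = f11 + f10 + f01 + f00
    p_x, p_y, p_xy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
    lift = (p_xy / p_x) / p_y          # P(Y|X) / P(Y)
    interest = p_xy / (p_x * p_y)      # algebraically equal to lift
    ps = p_xy - p_x * p_y
    phi = ps / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    return lift, interest, ps, phi

# Tea/coffee table: f11=15, f10=5, f01=75, f00=5
lift, interest, ps, phi = measures(15, 5, 75, 5)
```

For tea/coffee this gives lift = 0.75/0.9 ≈ 0.8333, PS = -0.03 and φ = -0.25, all indicating a (mild) negative association.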
Example: Lift/Interest

      | Coffee | ¬Coffee | Total
Tea   | 15     | 5       | 20
¬Tea  | 75     | 5       | 80
Total | 90     | 10      | 100

Association Rule: Tea -> Coffee
Confidence = P(Coffee|Tea) = 0.75
but P(Coffee) = 0.9
Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
Drawback of Lift & Interest

      | Y  | ¬Y | Total
X     | 10 | 0  | 10
¬X    | 0  | 90 | 90
Total | 10 | 90 | 100

Lift = 0.1 / ((0.1)(0.1)) = 10

      | Y  | ¬Y | Total
X     | 90 | 0  | 90
¬X    | 0  | 10 | 10
Total | 90 | 10 | 100

Lift = 0.9 / ((0.9)(0.9)) = 1.11

Statistical independence:
If P(X,Y) = P(X) P(Y) => Lift = 1
There are lots of measures proposed in the literature.
Some measures are good for certain applications, but not for others.
What criteria should we use to determine whether a measure is good or bad?
What about Apriori-style support-based pruning? How does it affect these measures?
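Both tables above can be verified with the lift computation (a sketch):

```python
def lift(f11, f10, f01, f00):
    """Lift = P(X,Y) / (P(X) * P(Y)) from contingency counts."""
    n = f11 + f10 + f01 + f00
    return (f11 / n) / (((f11 + f10) / n) * ((f11 + f01) / n))

# Table 1: X and Y always co-occur, but only in 10% of transactions
l1 = lift(10, 0, 0, 90)   # 0.1 / (0.1 * 0.1) = 10
# Table 2: X and Y always co-occur, in 90% of transactions
l2 = lift(90, 0, 0, 10)   # 0.9 / (0.9 * 0.9) = 1.11...
```

The rarer (but equally perfect) association receives the much higher lift score, which is exactly the drawback the slide illustrates.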
Properties of a Good Measure
Piatetsky-Shapiro: 3 properties a good measure M must satisfy:
M(A,B) = 0 if A and B are statistically independent
M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged
M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged
Comparing Different Measures
10 examples of contingency tables, and rankings of the contingency tables using various measures:

Example | f11  | f10  | f01  | f00
E1      | 8123 | 83   | 424  | 1370
E2      | 8330 | 2    | 622  | 1046
E3      | 9481 | 94   | 127  | 298
E4      | 3954 | 3080 | 5    | 2961
E5      | 2886 | 1363 | 1320 | 4431
E6      | 1500 | 2000 | 500  | 6000
E7      | 4000 | 2000 | 1000 | 3000
E8      | 4000 | 2000 | 2000 | 2000
E9      | 1720 | 7121 | 5    | 1154
E10     | 61   | 2483 | 4    | 7452
Property under Variable Permutation

      | B | ¬B
A     | p | q
¬A    | r | s

      | A | ¬A
B     | p | r
¬B    | q | s

Does M(A,B) = M(B,A)?
Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.
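The permutation property is easy to test numerically: swapping A and B transposes the table, i.e. exchanges q and r (equivalently f10 and f01). Using the tea/coffee counts from the earlier slide as input (a sketch):

```python
def confidence(f11, f10, f01, f00):
    """Confidence of X -> Y."""
    return f11 / (f11 + f10)

def lift(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return (f11 / n) / (((f11 + f10) / n) * ((f11 + f01) / n))

t = (15, 5, 75, 5)                  # tea/coffee contingency counts
swapped = (t[0], t[2], t[1], t[3])  # permute the two variables

sym = abs(lift(*t) - lift(*swapped)) < 1e-12   # True: lift is symmetric
asym = confidence(*t) != confidence(*swapped)  # True: confidence is not
```

Confidence changes from 15/20 = 0.75 to 15/90 ≈ 0.17 under the swap, while lift is unchanged.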