TRANSCRIPT
Association Rule Mining
Data Mining and Knowledge Discovery Prof. Carolina Ruiz and Weiyang Lin
Department of Computer Science, Worcester Polytechnic Institute
Sample Applications
Commercial:
- Market basket analysis
- Cross-marketing
- Attached mailing
- Store layout, catalog design
- Customer segmentation based on buying patterns

Scientific:
- Genetic analysis
- Analysis of medical data

Industrial
Transactions and Assoc. Rules
Transaction Id   Purchased Items
1                {a, b, c}
2                {a, d}
3                {a, c}
4                {b, e, f}

Association Rule: a → c
support: 50% (percentage of transactions that contain both a and c) = P(a & c)
confidence: 66% (percentage of transactions that contain a and also contain c) = P(c | a)
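The two measures above can be computed directly from the table. A minimal Python sketch (the `support` and `confidence` helper names are my own, not from the slides):

```python
# The four example transactions from the table above.
transactions = [
    {"a", "b", "c"},  # TID 1
    {"a", "d"},       # TID 2
    {"a", "c"},       # TID 3
    {"b", "e", "f"},  # TID 4
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = support(X & Y) / support(X)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"a", "c"}, transactions))       # 0.5 (2 of 4 transactions)
print(confidence({"a"}, {"c"}, transactions))  # 2/3 (2 of the 3 containing a)
```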
Association Rules - Intuition
Given a set of transactions where each transaction is a set of items
Find all rules X → Y that correlate the presence of one set of items X with another set of items Y
- Example: 98% of people who purchase diapers and baby food also buy beer.
- Any number of items in the antecedent and in the consequent of a rule.
- Possible to specify constraints on rules
Mining Association Rules
Problem Statement
Given:
- a set of transactions (each transaction is a set of items)
- a user-specified minimum support
- a user-specified minimum confidence

Find: all association rules whose support and confidence are greater than or equal to the user-specified minimum support and minimum confidence
Naïve Procedure to mine rules
List all the subsets of the set of items.
For each subset:
- Split the subset into two parts (one for the antecedent and one for the consequent of the rule)
- Compute the support of the rule
- Compute the confidence of the rule
- IF support and confidence are no lower than the user-specified minimum support and confidence THEN output the rule
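The procedure above can be sketched directly in Python (`naive_rules` is my own naming; this is exactly the exhaustive enumeration described, so it is exponential in the number of items):

```python
from itertools import combinations

def naive_rules(transactions, items, min_sup, min_conf):
    """Enumerate every subset of items, every antecedent/consequent
    split, and keep the rules meeting both thresholds."""
    n = len(transactions)
    def sup(s):
        return sum(1 for t in transactions if s <= t) / n
    rules = []
    for size in range(2, len(items) + 1):        # subsets with >= 2 items
        for subset in combinations(sorted(items), size):
            whole = frozenset(subset)
            s = sup(whole)
            for k in range(1, size):             # antecedent size
                for ante in combinations(subset, k):
                    X = frozenset(ante)
                    if s >= min_sup and sup(X) > 0 and s / sup(X) >= min_conf:
                        rules.append((X, whole - X, s, s / sup(X)))
    return rules

transactions = [{"a", "b", "c"}, {"a", "d"}, {"a", "c"}, {"b", "e", "f"}]
rules = naive_rules(transactions, {"a", "b", "c", "d", "e", "f"}, 0.5, 0.6)
# a → c qualifies: support 0.5, confidence 2/3
```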
Complexity: Let n be the number of items. The number of rules naively considered is:
$$\sum_{i=2}^{n} \left[ \binom{n}{i} \sum_{k=1}^{i-1} \binom{i}{k} \right] \;=\; \sum_{i=2}^{n} \binom{n}{i}\,(2^{i} - 2) \;=\; 3^{n} - 2^{n+1} + 1$$
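The closed form can be sanity-checked against a direct evaluation of the double sum (a quick script of my own, not from the slides):

```python
from math import comb

def naive_rule_count(n):
    """Directly evaluate the double sum over subset sizes i
    and antecedent sizes k."""
    return sum(comb(n, i) * sum(comb(i, k) for k in range(1, i))
               for i in range(2, n + 1))

# Matches the closed form 3^n - 2^(n+1) + 1 for small n.
for n in range(2, 10):
    assert naive_rule_count(n) == 3**n - 2**(n + 1) + 1
```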
The Apriori Algorithm
1. Find all frequent itemsets: sets of items whose support is greater than or equal to the user-specified minimum support.
2. Generate the desired rules: if {a, b, c, d} and {a, b} are frequent itemsets, then compute the ratio
conf(a & b → c & d) = P(c & d | a & b) = P(a & b & c & d)/P(a & b) = support({a, b, c, d})/support({a, b}).
If conf >= minconf, then add the rule a & b → c & d.
The Apriori Algorithm — Example (slide taken from J. Han & M. Kamber's Data Mining book)

Min. supp = 50%, i.e. min support count = 2

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

C1 (scan D):
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (scan D):
itemset   sup.
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2:
itemset   sup.
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3:
itemset
{2 3 5}

L3 (scan D):
itemset   sup.
{2 3 5}   2
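The passes above can be traced with a compact Python sketch (my own simplification: it omits the subset-prune step when forming candidates, so candidates such as {1, 2, 3} are eliminated by the support count instead):

```python
# The example database D and minimum support count from the slide.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_count = 2

def counts(candidates):
    """Support count of each candidate itemset over D."""
    return {c: sum(1 for t in D if c <= t) for c in candidates}

# Pass 1: C1 = all single items; L1 = those with enough support.
C1 = {frozenset({i}) for t in D for i in t}
L = [{c for c, n in counts(C1).items() if n >= min_count}]

# Passes k >= 2: join L(k-1) with itself, count, keep frequent sets.
k = 2
while L[-1]:
    Ck = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k}
    L.append({c for c, n in counts(Ck).items() if n >= min_count})
    k += 1

# L[0], L[1], L[2] reproduce L1, L2, L3 from the tables above.
```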
Apriori Principle
Key observation: Every subset of a frequent itemset is also a frequent itemset.
Or equivalently: The support of an itemset is greater than or equal to the support of any superset of the itemset.
Apriori - Compute Frequent Itemsets
Making multiple passes over the data:

for pass k {
    candidate generation: Ck := Lk-1 joined with Lk-1;
    support counting in Ck;
    Lk := all candidates in Ck with minimum support;
}
terminate when Lk == ∅ or Ck+1 == ∅

Frequent-Itemsets = ∪k Lk

Lk - set of frequent itemsets of size k (those with min. support)
Ck - set of candidate itemsets of size k (potentially frequent itemsets)
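The candidate-generation step, including the prune justified by the Apriori principle, can be sketched as follows (the `apriori_gen` name echoes the literature; the code itself is my own illustration):

```python
from itertools import combinations

def apriori_gen(prev, k):
    """Join the frequent (k-1)-itemsets `prev` with themselves, then
    prune any candidate that has an infrequent (k-1)-subset."""
    joined = {a | b for a in prev for b in prev if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}

# From the example above: only {2, 3, 5} survives, because e.g.
# {1, 2, 3} has the infrequent subset {1, 2}.
L2 = {frozenset(p) for p in [(1, 3), (2, 3), (2, 5), (3, 5)]}
print(apriori_gen(L2, 3))
```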
Apriori – Generating rules
For each frequent itemset:
- Generate the desired rules: if {a, b, c, d} and {a, b} are frequent itemsets, then compute the ratio
conf(a & b → c & d) = support({a, b, c, d})/support({a, b}).
If conf >= minconf, then add the rule a & b → c & d.
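Rule generation from a single frequent itemset can be sketched as below (the `rules_from` helper is my own naming; the `support` dictionary holds values computed from the example database D):

```python
from itertools import combinations

def rules_from(itemset, support, min_conf):
    """Try every non-empty proper subset as an antecedent and keep
    the rules whose confidence reaches min_conf."""
    out = []
    for k in range(1, len(itemset)):
        for ante in combinations(sorted(itemset), k):
            X = frozenset(ante)
            conf = support[itemset] / support[X]
            if conf >= min_conf:
                out.append((X, itemset - X, conf))
    return out

# Supports taken from the example database D above.
support = {
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({5}): 0.75,
    frozenset({2, 3}): 0.5, frozenset({2, 5}): 0.75, frozenset({3, 5}): 0.5,
    frozenset({2, 3, 5}): 0.5,
}
rules = rules_from(frozenset({2, 3, 5}), support, 0.75)
# Two rules reach 75% confidence: 2 & 3 → 5 and 3 & 5 → 2 (both 100%).
```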