Association Rule Mining Data Mining and Knowledge Discovery Prof. Carolina Ruiz and Weiyang Lin Department of Computer Science Worcester Polytechnic Institute


Page 1:

Association Rule Mining

Data Mining and Knowledge Discovery Prof. Carolina Ruiz and Weiyang Lin

Department of Computer Science
Worcester Polytechnic Institute

Page 2:

Sample Applications

Commercial
- market basket analysis
- cross-marketing
- attached mailing
- store layout, catalog design
- customer segmentation based on buying patterns

Scientific
- genetic analysis
- analysis of medical data

Industrial

Page 3:

Transactions and Assoc. Rules

Association Rule: a → c

confidence: 66% (percentage of transactions that contain a which also contain c) = P(c | a)

support: 50% (percentage of transactions that contain both a and c) = P(a & c)

Transaction Id | Purchased Items
1              | {a, b, c}
2              | {a, d}
3              | {a, c}
4              | {b, e, f}
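The support and confidence figures above can be checked with a short computation (the helper name `support` is my own, not from the slides):

```python
# Support and confidence of the rule a -> c over the four example transactions.
transactions = [
    {"a", "b", "c"},
    {"a", "d"},
    {"a", "c"},
    {"b", "e", "f"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

sup_ac = support({"a", "c"}, transactions)        # P(a & c) = 2/4 = 50%
conf_ac = sup_ac / support({"a"}, transactions)   # P(c | a) = 2/3, i.e. ~66%

print(f"support = {sup_ac:.0%}, confidence = {conf_ac:.1%}")
```

Note that the confidence is exactly 2/3; the slide rounds it down to 66%.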

Page 4:

Association Rules - Intuition

Given a set of transactions where each transaction is a set of items

Find all rules X → Y that correlate the presence of one set of items X with another set of items Y

- Example: 98% of people who purchase diapers and baby food also buy beer.

- Any number of items may appear in the antecedent and in the consequent of a rule.

- It is possible to specify constraints on the rules.

Page 5:

Mining Association Rules

Problem Statement

Given:
- a set of transactions (each transaction is a set of items)
- a user-specified minimum support
- a user-specified minimum confidence

Find: all association rules whose support and confidence are greater than or equal to the user-specified minimum support and minimum confidence.

Page 6:

Naïve Procedure to mine rules

List all the subsets of the set of items.

For each subset:
- split the subset into two parts (one for the antecedent and one for the consequent of the rule)
- compute the support of the rule
- compute the confidence of the rule
- IF support and confidence are no lower than the user-specified minimum support and confidence THEN output the rule

Complexity: Let n be the number of items. The number of rules naively considered is:

Σ_{i=2}^{n} [ C(n, i) · Σ_{k=1}^{i-1} C(i, k) ]
  = Σ_{i=2}^{n} [ C(n, i) · (2^i − 2) ]
  = 3^n − 2^(n+1) + 1

where C(n, i) is the binomial coefficient "n choose i".
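The closed form can be checked against a direct count of the antecedent/consequent splits (a small sketch; `naive_rule_count` is my own helper name):

```python
from math import comb

def naive_rule_count(n):
    """Count rules the naive procedure considers over n items.

    Each subset of size i >= 2 can be split into a nonempty
    antecedent and a nonempty consequent in 2**i - 2 ways.
    """
    return sum(comb(n, i) * (2**i - 2) for i in range(2, n + 1))

# The direct count matches the closed form 3^n - 2^(n+1) + 1.
for n in range(2, 10):
    assert naive_rule_count(n) == 3**n - 2**(n + 1) + 1
```

Even for modest n the count is large (e.g. n = 10 already gives tens of thousands of candidate rules), which motivates the Apriori algorithm on the next slide.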

Page 7:

The Apriori Algorithm

1. Find all frequent itemsets: sets of items whose support is greater than or equal to the user-specified minimum support.

2. Generate the desired rules: if {a, b, c, d} and {a, b} are frequent itemsets, then compute the ratio

conf (a & b → c & d) = P(c & d | a & b) = P( a & b & c & d)/P(a & b) = support({a, b, c, d})/support({a, b}).

If conf >= minconf, then add the rule a & b → c & d.
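The confidence test in step 2 can be sketched as follows; the support values below are made up for illustration, not taken from a real dataset:

```python
# Rule-generation step on two frequent itemsets.
# Support values are illustrative only.
supports = {
    frozenset("ab"): 0.40,
    frozenset("abcd"): 0.30,
}
min_conf = 0.70

itemset = frozenset("abcd")
antecedent = frozenset("ab")
conf = supports[itemset] / supports[antecedent]   # P(c & d | a & b) ~ 0.75

if conf >= min_conf:
    consequent = itemset - antecedent             # {c, d}
    print(f"rule: {sorted(antecedent)} -> {sorted(consequent)}, conf = {conf:.2f}")
```

Because both itemsets are frequent, both support values are already available from step 1; no extra pass over the data is needed.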

Page 8:

The Apriori Algorithm — Example (slide taken from J. Han & M. Kamber's Data Mining book)

Min. support = 50%, i.e. min support count = 2

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1:
itemset  sup.
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1:
itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

C2 (L1 joined with L1):
{1 2} {1 3} {1 5} {2 3} {2 5} {3 5}

Scan D → C2 counts:
itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2:
itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3:
{2 3 5}

Scan D → L3:
itemset    sup
{2 3 5}    2
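The passes above can be reproduced with a short brute-force check (for this small example, candidates are generated as all k-subsets of the frequent items rather than by the join step):

```python
from itertools import combinations

# Han & Kamber example: database D with min support count = 2.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
MIN_COUNT = 2

def sup_count(itemset):
    """Number of transactions in D containing the itemset."""
    return sum(1 for t in D if itemset <= t)

items = sorted({i for t in D for i in t})
L1 = {frozenset([i]) for i in items if sup_count(frozenset([i])) >= MIN_COUNT}
freq_items = sorted({i for s in L1 for i in s})
L2 = {frozenset(c) for c in combinations(freq_items, 2)
      if sup_count(frozenset(c)) >= MIN_COUNT}
L3 = {frozenset(c) for c in combinations(freq_items, 3)
      if sup_count(frozenset(c)) >= MIN_COUNT}

print(sorted(map(sorted, L1)))  # [[1], [2], [3], [5]]
print(sorted(map(sorted, L2)))  # [[1, 3], [2, 3], [2, 5], [3, 5]]
print(sorted(map(sorted, L3)))  # [[2, 3, 5]]
```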

Page 9:

Apriori Principle

Key observation: Every subset of a frequent itemset is also a frequent itemset.

Or equivalently: the support of an itemset is greater than or equal to the support of any superset of the itemset.

Page 10:

Apriori - Compute Frequent Itemsets

Make multiple passes over the data:

for pass k {
    candidate generation: Ck := Lk-1 joined with Lk-1;
    support counting in Ck;
    Lk := all candidates in Ck with minimum support;
}
terminate when Lk == ∅ or Ck+1 == ∅

Frequent-Itemsets = ∪k Lk

Lk: set of frequent itemsets of size k (those with min. support)
Ck: set of candidate itemsets of size k (potentially frequent itemsets)
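The multi-pass loop above can be sketched in Python as follows (a minimal sketch; the function and variable names are my own):

```python
from itertools import combinations

def apriori(transactions, min_sup_count):
    """Return all frequent itemsets as frozensets.

    Candidate generation joins L(k-1) with L(k-1), prunes candidates
    that have an infrequent (k-1)-subset (the Apriori principle),
    then counts support to obtain Lk.
    """
    def sup_count(c):
        return sum(1 for t in transactions if c <= t)

    # Pass 1: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items
          if sup_count(frozenset([i])) >= min_sup_count}
    frequent = set(Lk)

    k = 2
    while Lk:
        # Candidate generation: L(k-1) joined with L(k-1).
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune candidates with an infrequent (k-1)-subset.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Support counting in Ck; keep those with minimum support.
        Lk = {c for c in Ck if sup_count(c) >= min_sup_count}
        frequent |= Lk
        k += 1
    return frequent
```

On the Han & Kamber database D from the example slide (min support count 2), this returns the nine frequent itemsets L1 ∪ L2 ∪ L3.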

Page 11:

Apriori – Generating rules

For each frequent itemset, generate the desired rules: if {a, b, c, d} and {a, b} are frequent itemsets, then compute the ratio

conf(a & b → c & d) = support({a, b, c, d}) / support({a, b}).

If conf >= minconf, then add the rule a & b → c & d.