association rule mining

15
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP 790-90 Seminar BCB 713 Module Spring 2011

Upload: jaime-bryant

Post on 30-Dec-2015

52 views

Category:

Documents


0 download

DESCRIPTION

Association Rule Mining. COMP 790-90 Seminar BCB 713 Module Spring 2011. Outline. What is association rule mining? Methods for association rule mining Extensions of association rule. What Is Association Rule Mining?. - PowerPoint PPT Presentation

TRANSCRIPT

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Association Rule Mining

COMP 790-90 Seminar

BCB 713 Module

Spring 2011

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

2

Outline

What is association rule mining?

Methods for association rule mining

Extensions of association rule

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

3

What Is Association Rule Mining?

Frequent patterns: patterns (set of items, sequence, etc.) that occur frequently in a database [AIS93]

Frequent pattern mining: finding regularities in data

What products were often purchased together? Beer and diapers?!

What are the subsequent purchases after buying a car?

Can we automatically profile customers?

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

4

Why Essential?

Foundation for many data mining tasksAssociation rules, correlation, causality, sequential patterns, structural patterns, spatial and multimedia patterns, associative classification, cluster analysis, iceberg cube, …

Broad applicationsBasket data analysis, cross-marketing, catalog design, sale campaign analysis, web log (click stream) analysis, …

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

5

Basics

Itemset: a set of itemsE.g., acm={a, c, m}

Support of itemsetsSup(acm)=3

Given min_sup=3, acm is a frequent patternFrequent pattern mining: find all frequent patterns in a database

TID Items bought

100 f, a, c, d, g, I, m, p

200 a, b, c, f, l,m, o

300 b, f, h, j, o

400 b, c, k, s, p

500 a, f, c, e, l, p, m, n

Transaction database TDB

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

6

Frequent Pattern Mining: A Road Map

Boolean vs. quantitative associations age(x, “30..39”) ^ income(x, “42..48K”) buys(x, “car”) [1%, 75%]

Single dimension vs. multiple dimensional associations

Single level vs. multiple-level analysisWhat brands of beers are associated with what brands of diapers?

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

7

Extensions & Applications

Correlation, causality analysis & mining interesting rules

Maxpatterns and frequent closed itemsets

Constraint-based mining

Sequential patterns

Periodic patterns

Structural Patterns

Computing iceberg cubes

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

8

Frequent Pattern Mining Methods

Apriori and its variations/improvements Mining frequent-patterns without candidate generationMining max-patterns and closed itemsetsMining multi-dimensional, multi-level frequent patterns with flexible support constraintsInterestingness: correlation and causality

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

9

Apriori: Candidate Generation-and-test

Any subset of a frequent itemset must be also frequent — an anti-monotone property

A transaction containing {beer, diaper, nuts} also contains {beer, diaper}

{beer, diaper, nuts} is frequent {beer, diaper} must also be frequent

No superset of any infrequent itemset should be generated or tested

Many item combinations can be pruned

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

10

Apriori-based Mining

Generate length (k+1) candidate itemsets from length k frequent itemsets, and

Test the candidates against DB

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

11

Apriori Algorithm

A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994)

TID Items10 a, c, d20 b, c, e30 a, b, c, e40 b, e

Min_sup=2

Itemset Supa 2b 3c 3d 1e 3

Data base D 1-candidates

Scan D

Itemset Supa 2b 3c 3e 3

Freq 1-itemsetsItemset

abacaebcbece

2-candidates

Itemset Supab 1ac 2ae 1bc 2be 3ce 2

Counting

Scan D

Itemset Supac 2bc 2be 3ce 2

Freq 2-itemsetsItemset

bce

3-candidates

Itemset Supbce 2

Freq 3-itemsets

Scan D

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

12

The Apriori Algorithm

Ck: Candidate itemset of size k

Lk : frequent itemset of size k

L1 = {frequent items};

for (k = 1; Lk !=; k++) doCk+1 = candidates generated from Lk;for each transaction t in database do

increment the count of all candidates in Ck+1 that are contained in t

Lk+1 = candidates in Ck+1 with min_support

return k Lk;

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

13

Important Details of Apriori

How to generate candidates?Step 1: self-joining Lk

Step 2: pruning

How to count supports of candidates?

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

14

How to Generate Candidates?

Suppose the items in Lk-1 are listed in an orderStep 1: self-join Lk-1 INSERT INTO Ck

SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1

FROM Lk-1 p, Lk-1 qWHERE p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 <

q.itemk-1

Step 2: pruningFor each itemset c in Ck do

For each (k-1)-subsets s of c do if (s is not in Lk-1) then delete c from Ck

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

15

Example of Candidate-generation

L3={abc, abd, acd, ace, bcd}

Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace

Pruning:acde is removed because ade is not in L3

C4={abcd}