
Association Rule Mining

Zhenjiang Lin
Group Presentation

April 10, 2007

2

Overview

Associations
Market Basket Analysis
Basic Concepts
Frequent Itemsets
Generating Frequent Itemsets: Apriori, FP-Growth

Applications

3

Association Rule Learners

to discover elements that co-occur frequently within a data set consisting of multiple independent selections of elements (such as purchasing transactions), and

to discover rules, such as implication or correlation, which relate co-occurring elements.

to answer questions such as "if a customer purchases product A, how likely is he to purchase product B?" and "what products will a customer buy if he buys products C and D?"

to reduce a potentially huge amount of information to a small, understandable set of statistically supported statements.

also known as “market basket analysis”.

4

Associations

Rules expressing relationships between items

Example:

cereal, milk => fruit

“People who bought cereal and milk also bought fruit.”

Stores might want to offer specials on milk and cereal to get people to buy more fruit.

5

Market Basket Analysis

Analyze tables of transactions

Can we hypothesize?
Chips => Salsa
Lettuce => Spinach

Person Basket

A Chips, Salsa, Cookies, Crackers, Coke, Beer

B Lettuce, Spinach, Oranges, Celery, Apples, Grapes

C Chips, Salsa, Frozen Pizza, Frozen Cake

D Lettuce, Spinach, Milk, Butter

6

Market Baskets

In general, data consists of

TID - Transaction ID
Basket - Subset of items

7

Basic Concepts

Set of items: I = {i1, i2, ..., im}

Transaction: T, where T ⊆ I

D - set of transactions (i.e., our data)

Association Rule: A => B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅

8

Measuring Interesting Rules

Support: ratio of the number of transactions containing A and B to the total number of transactions

s(A => B) = |{T ∈ D | A ∪ B ⊆ T}| / |D|

Confidence: ratio of the number of transactions containing A and B to the number of transactions containing A

c(A => B) = |{T ∈ D | A ∪ B ⊆ T}| / |{T ∈ D | A ⊆ T}|
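The two measures above are easy to state in code. A minimal runnable sketch (the cereal/milk/fruit data is illustrative, not from the slides):

```python
def support(D, A, B):
    """s(A => B): fraction of transactions in D containing all of A and B."""
    both = A | B
    return sum(1 for T in D if both <= T) / len(D)

def confidence(D, A, B):
    """c(A => B): of the transactions containing A, the fraction also containing B."""
    both = A | B
    return sum(1 for T in D if both <= T) / sum(1 for T in D if A <= T)

D = [{"cereal", "milk", "fruit"},
     {"cereal", "milk"},
     {"milk", "fruit"},
     {"cereal", "milk", "fruit"}]
print(support(D, {"cereal", "milk"}, {"fruit"}))     # 0.5 (2 of 4 transactions)
print(confidence(D, {"cereal", "milk"}, {"fruit"}))  # 2/3 (2 of the 3 with cereal and milk)
```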

9

Measuring Interesting Rules

Rules are included/excluded based on two metrics

Minimum support level - how frequently all of the items in a rule appear in transactions

Minimum confidence level - how frequently the left-hand side of a rule implies the right-hand side

10

Market Basket Analysis

What is I?
What is T for person B?
What is s(Chips => Salsa)?
What is c(Chips => Salsa)?

Person Basket

A Chips, Salsa, Cookies, Crackers, Coke, Beer

B Lettuce, Spinach, Oranges, Celery, Apples, Grapes

C Chips, Salsa, Frozen Pizza, Frozen Cake

D Lettuce, Spinach, Milk, Butter, Chips
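The slide's questions can be checked with a few lines of code (baskets copied from the table above; note that basket D on this slide includes Chips):

```python
baskets = {
    "A": {"Chips", "Salsa", "Cookies", "Crackers", "Coke", "Beer"},
    "B": {"Lettuce", "Spinach", "Oranges", "Celery", "Apples", "Grapes"},
    "C": {"Chips", "Salsa", "Frozen Pizza", "Frozen Cake"},
    "D": {"Lettuce", "Spinach", "Milk", "Butter", "Chips"},
}
D = list(baskets.values())
both = sum(1 for T in D if {"Chips", "Salsa"} <= T)  # persons A and C -> 2
chips = sum(1 for T in D if "Chips" in T)            # persons A, C, D -> 3
print("s(Chips => Salsa) =", both / len(D))  # 0.5
print("c(Chips => Salsa) =", both / chips)   # 2/3
```

Here I is the union of all items across baskets, and T for person B is simply her basket; so s = 2/4 and c = 2/3.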

11

Frequent Itemsets

itemset – any set of items
k-itemset – an itemset containing k items
frequent itemset – an itemset that satisfies a minimum support level

If I contains m items, how many itemsets are there?

12

Strong Association Rules

Given an itemset, it's easy to generate association rules

Given itemset {Chips, Salsa}:
∅ => Chips, Salsa
Chips => Salsa
Salsa => Chips
Chips, Salsa => ∅

Strong rules are interesting
Generally defined as those rules satisfying minimum support and minimum confidence

13

Association Rule Mining

Two basic steps

Find all frequent itemsets
    Satisfying minimum support

Find all strong association rules
    Generate association rules from frequent itemsets
    Keep rules satisfying minimum confidence

14

Generating Frequent Itemsets

Naïve algorithm

n <- |D|
for each subset s of I do
    l <- 0
    for each transaction T in D do
        if s is a subset of T then
            l <- l + 1
    if minimum support <= l/n then
        add s to frequent subsets
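A direct, runnable version of the pseudocode above (the data here is illustrative; the point is the loop over every subset of I):

```python
from itertools import combinations

def naive_frequent_itemsets(D, min_sup):
    I = sorted(set().union(*D))  # the full item set
    n = len(D)
    frequent = []
    # enumerate every non-empty subset s of I -- exponential in |I|
    for k in range(1, len(I) + 1):
        for s in combinations(I, k):
            l = sum(1 for T in D if set(s) <= T)
            if min_sup <= l / n:
                frequent.append(frozenset(s))
    return frequent

D = [{"Chips", "Salsa"}, {"Chips", "Salsa"}, {"Chips"}, {"Salsa", "Milk"}]
print(naive_frequent_itemsets(D, 0.5))
```

With minimum support 0.5, only {Chips}, {Salsa}, and {Chips, Salsa} survive.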

15

Generating Frequent Itemsets

Analysis of naïve algorithm

2^m subsets of I
Scan n transactions for each subset
O(2^m · n) tests of s being a subset of T

Growth is exponential in the number of items!

Can we do better?

16

Generating Frequent Itemsets

Frequent itemsets support the apriori property

If A is not a frequent itemset, then any superset of A is not a frequent itemset.

Proof: Let n be the number of transactions. Suppose A is a subset of l transactions. If A' ⊇ A, then A' is a subset of l' ≤ l transactions. Thus, if l/n < minimum support, so is l'/n.

17

Generating Frequent Itemsets

Central idea: build candidate k-itemsets from frequent (k-1)-itemsets

Approach
Find all frequent 1-itemsets
Extend (k-1)-itemsets to candidate k-itemsets
Prune candidate itemsets that do not meet the minimum support

18

Generating Frequent Itemsets (Basic Apriori)

L1 = {frequent 1-itemsets}
for (k = 2; L(k-1) is not empty; k++) {
    Ck = generate k-itemset candidates from L(k-1)
    for each transaction t in D {
        Ct = subset(Ck, t)  // the candidates that are subsets of t
        for each candidate c in Ct { c.count++ }
    }
    Lk = {c in Ck | c.count >= min_sup}
}
The frequent itemsets are the union of the Lk
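A compact runnable sketch of the loop above, with the join and prune steps made explicit (min_count is an absolute support count here; the data is the 9-transaction example used later in the slides):

```python
from itertools import combinations

def apriori(D, min_count):
    items = sorted(set().union(*D))
    # L1: frequent 1-itemsets
    L = [frozenset([i]) for i in items
         if sum(1 for T in D if i in T) >= min_count]
    frequent = list(L)
    k = 2
    while L:
        # join: unions of frequent (k-1)-itemsets that form k-itemsets
        C = {a | b for a in L for b in L if len(a | b) == k}
        # prune by the apriori property: every (k-1)-subset must be frequent
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # count support in one pass over the transactions
        L = [c for c in C if sum(1 for T in D if c <= T) >= min_count]
        frequent.extend(L)
        k += 1
    return frequent

D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3","I6"}, {"I1","I2","I4"},
     {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
     {"I1","I2","I3"}]
result = apriori(D, 2)
print(frozenset({"I1", "I2", "I5"}) in result)  # True
```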

19

FP Growth (Han, Pei, Yin 2000)

One problematic aspect of Apriori is the candidate generation
Source of exponential growth

Another approach is to use a divide-and-conquer strategy

Idea: compress the database into a frequent pattern tree representing frequent items

20

FP Growth (Tree construction)

Initially, scan database for frequent 1-itemsets
Place resulting set in a list L in descending order by frequency (support)

Construct an FP-tree
Create a root node labeled null
Scan database
Process the items in each transaction in L order
From the root, add nodes in the order in which items appear in the transactions
Link nodes representing items along different branches
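The two scans above can be sketched as a counted prefix tree (a minimal illustration only; a full FP-tree also keeps a header table with node-links, omitted here):

```python
class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(D, min_count):
    # scan 1: count items and build the ordered list L (descending frequency)
    freq = {}
    for T in D:
        for i in T:
            freq[i] = freq.get(i, 0) + 1
    L = sorted((i for i in freq if freq[i] >= min_count),
               key=lambda i: (-freq[i], i))  # ties broken by item name
    rank = {i: r for r, i in enumerate(L)}
    # scan 2: insert each transaction's frequent items in L order
    root = Node(None)
    for T in D:
        node = root
        for i in sorted((i for i in T if i in rank), key=rank.get):
            node = node.children.setdefault(i, Node(i))
            node.count += 1
    return root, L

D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3","I6"}, {"I1","I2","I4"},
     {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
     {"I1","I2","I3"}]
root, L = build_fp_tree(D, 2)
print(L)                          # ['I2', 'I1', 'I3', 'I4', 'I5']
print(root.children["I2"].count)  # 7
print(root.children["I1"].count)  # 2 (transactions 5 and 7, which lack I2)
```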

21

Frequent 1-itemsets

Minimum support of 20% (frequency of 2)

Frequent 1-itemsets: I1, I2, I3, I4, I5

Construct list
L = {(I2,7), (I1,6), (I3,6), (I4,2), (I5,2)}

TID  Items
1    I1,I2,I5
2    I2,I4
3    I2,I3,I6
4    I1,I2,I4
5    I1,I3
6    I2,I3
7    I1,I3
8    I1,I2,I3,I5
9    I1,I2,I3

22

Build FP-Tree

Create root node null
Scan database

Transaction 1: I1, I2, I5
Order: I2, I1, I5

Process transaction
Add nodes in item order
Label with items, count

[Tree after transaction 1: null -> (I2,1) -> (I1,1) -> (I5,1)]

Maintain header table
[Header table: I2:1, I1:1, I3:0, I4:0, I5:1]

23

Build FP-Tree

[Tree after transaction 2 (I2, I4): null -> (I2,2) -> (I1,1) -> (I5,1), plus new branch (I2,2) -> (I4,1)]

[Header table: I2:2, I1:1, I3:0, I4:1, I5:1]

TID  Items
1    I1,I2,I5
2    I2,I4
3    I2,I3,I6
4    I1,I2,I4
5    I1,I3
6    I2,I3
7    I1,I3
8    I1,I2,I3,I5
9    I1,I2,I3

24

Mining the FP-Tree

Start at the last item in the table

Find all paths containing the item
Follow the node-links

Identify conditional patterns
Patterns in paths with required frequency

Build conditional FP-tree C
Append item to all paths in C, generating frequent patterns
Mine C recursively (appending item)
Remove item from table and tree

25

Mining the FP-Tree

[Full FP-tree:
 null -> (I2,7); (I2,7) -> (I1,4), (I3,2), (I4,1)
 (I1,4) -> (I5,1), (I3,2), (I4,1); that (I3,2) -> (I5,1)
 null -> (I1,2) -> (I3,2)]

[Header table: I2:7, I1:6, I3:6, I4:2, I5:2]

Mining I5:
Prefix paths: (I2 I1, 1), (I2 I1 I3, 1)
Conditional path: (I2 I1, 2)
Conditional FP-tree: null -> (I2,2) -> (I1,2)
Frequent pattern generated: (I2 I1 I5, 2)

26

Applications

Web Personalization
Genomic Data

27

Web Personalization

"Effective Personalization Based on Association Rule Discovery from Web Usage Data," Mobasher et al., ACM Workshop on Web Information and Data Management, 2001

Personalization and recommendation systems
e.g., Amazon.com's recommended books

28

Data Preprocessing

Identify set of pageviews P
Which files result in a single browser display (complicated by frames, images, etc.)
P = {p1, ..., pn}

Transactions T
From session IDs or cookies
T = {t1, ..., tm}

29

Data Preprocessing

A transaction t consists of
t = {(p1^t, w(p1^t)), ..., (pl^t, w(pl^t))}

The w is a weight associated with the pageview
Could be binary (purchase or non-purchase)
Could be related to amount of time spent on the page

30

Data Preprocessing

In the paper, they considered only pageviews in a transaction with w(p) = 1

Ordering of pageviews didn't matter

31

Recommendation Engine

Has to run online, i.e., must be fast
Generate frequent itemsets first and store them in a graph data structure for efficient searching

Maintains a history of the user's current session
Sets a window size w (e.g., 3)
Consider pageviews A, B, C: window = {A,B,C}
If user then visits D: window = {B,C,D}
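The sliding session window described above is only a few lines of code (a sketch; the window size and page names follow the slide's example):

```python
from collections import deque

def visit(window, pageview, w=3):
    """Record a pageview, keeping only the w most recent in the window."""
    window.append(pageview)
    if len(window) > w:
        window.popleft()
    return set(window)

history = deque()
for p in ["A", "B", "C"]:
    current = visit(history, p)
print(current)              # {'A', 'B', 'C'}
print(visit(history, "D"))  # {'B', 'C', 'D'} -- A has dropped out
```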

32

Genomic Data

"Finding Association Rules on Heterogenous Genome Data," Satou et al.

Combined data from PDB, SWISS-PROT, and PROSITE

Protein Name | sequence feature1 | sequence feature2 | structure feature1 | function1 | function2
name1        | 1                 | 0                 | 1                  | 0         | 1
name2        | 0                 | 0                 | 1                  | 1         | 0

33

Genomic Data

After mining, 182,388 association rules were generated (minimum support = 5, minimum confidence = 65%)

Post-process results with max support of 30
Itemsets appearing too frequently aren't interesting

Reduced to 381 rules

34

Genomic Data

Rules generated were corroborated by biological background data
Found common substructures in serine endopeptidases

Rules were not distributed well over protein families
Still some work to be done on the data preprocessing stage

35

Association Rule Summary

Association rule mining is a fundamental tool in data mining

Several algorithms
Apriori: use a provable mathematical property to improve performance
FP-Growth: stop candidate generation, use an effective data structure
Correlation Rules: evaluate interestingness based on statistics
Query Flocks: generalize the approach with the purpose of query optimization (incorporation into database systems)

36

Association Rule Summary

There exist several extensions

Hierarchical attributes (e.g., year -> month -> week -> day, or computer -> luggable -> handheld -> palm)
Multilevel/multidimensional
Numerical attributes
Constraint-based