Ch5 Mining Frequent Patterns, Associations, and Correlations Dr. Bernard Chen Ph.D. University of Central Arkansas

Upload: tobias-nichols

Post on 28-Dec-2015



Outline

Association Rules
Association Rules with FP tree
Misleading Rules
Multi-level Association Rules


What Is Frequent Pattern Analysis?

Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.

First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining.


What Is Frequent Pattern Analysis?

Motivation: finding inherent regularities in data
  What products are often purchased together? Bread and milk?
  What are the subsequent purchases after buying a PC?
  What kinds of DNA are sensitive to this new drug?
  Can we automatically classify web documents?

Applications
  Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.


Association Rules


For a rule X => Y:

support, s: the probability that a transaction contains X ∪ Y (that is, all items of both X and Y)

confidence, c: the conditional probability that a transaction containing X also contains Y


Association Rules

Let’s have an example

TID    Items
T100   1, 2, 5
T200   2, 4
T300   2, 3
T400   1, 2, 4
T500   1, 3
T600   2, 3
T700   1, 3
T800   1, 2, 3, 5
T900   1, 2, 3
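The support and confidence definitions can be checked directly on the nine example transactions. The following is a minimal sketch; the function names `support` and `confidence` are illustrative choices, not part of the slides.

```python
from fractions import Fraction

# The nine example transactions (TID -> set of item IDs).
TRANSACTIONS = {
    "T100": {1, 2, 5}, "T200": {2, 4}, "T300": {2, 3},
    "T400": {1, 2, 4}, "T500": {1, 3}, "T600": {2, 3},
    "T700": {1, 3}, "T800": {1, 2, 3, 5}, "T900": {1, 2, 3},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return Fraction(hits, len(transactions))

def confidence(lhs, rhs, transactions):
    """Conditional probability that a transaction with `lhs` also has `rhs`."""
    both = support(set(lhs) | set(rhs), transactions)
    return both / support(lhs, transactions)

print(support({1, 2}, TRANSACTIONS))       # 4/9
print(confidence({1}, {2}, TRANSACTIONS))  # 2/3
print(confidence({2}, {1}, TRANSACTIONS))  # 4/7
```

Exact fractions are used so the results can be compared with hand-computed counts such as 4/9.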


Association Rules with Apriori

Minimum support = 2/9
Minimum confidence = 60%


The Apriori Algorithm

Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t
      Lk+1 = candidates in Ck+1 with min_support
  end
  return ∪k Lk;
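The pseudo-code above can be sketched in Python as follows. This is one straightforward reading of it rather than an optimized implementation: the join step is a plain pairwise union, the minimum support is given as an absolute count, and the name `apriori` is an illustrative choice.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 1
    while current:
        # Join: unions of two frequent k-itemsets that have size k + 1.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        # Scan the database to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent

# The nine example transactions, with minimum support 2/9 (count >= 2).
DB = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
      {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
result = apriori(DB, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

On this data the sketch finds thirteen frequent itemsets, the largest being {1, 2, 3} and {1, 2, 5}, each with count 2.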


Strong Association Rule

Strong association rules are rules over frequent itemsets that also satisfy the minimum confidence.

For example, from the frequent itemset {I1, I2} (minimum confidence = 60%):
  Confidence(I1->I2) = 4/6 ≈ 66.7% (strong association rule!)
  Confidence(I2->I1) = 4/7 ≈ 57.1% (not strong)
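Generating the strong rules from one frequent itemset amounts to trying every non-empty proper subset as the left-hand side. The sketch below does this for {I1, I2} on the nine example transactions; `strong_rules` and `count` are hypothetical helper names.

```python
from itertools import combinations

# The nine example transactions.
TRANSACTIONS = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
                {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]

def count(itemset, transactions):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def strong_rules(itemset, transactions, min_conf):
    """List (lhs, rhs, confidence) for rules from `itemset` meeting min_conf."""
    items = frozenset(itemset)
    whole = count(items, transactions)
    rules = []
    for size in range(1, len(items)):
        for lhs in combinations(sorted(items), size):
            conf = whole / count(lhs, transactions)
            if conf >= min_conf:
                rules.append((set(lhs), set(items) - set(lhs), conf))
    return rules

rules = strong_rules({1, 2}, TRANSACTIONS, min_conf=0.6)
for lhs, rhs, conf in rules:
    print(lhs, "->", rhs, round(conf, 3))
```

Only I1 -> I2 survives the 60% threshold; I2 -> I1 has confidence 4/7 and is dropped.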


Exercise

A dataset has five transactions. Let min_support = 60% and min_confidence = 80%.

Find all frequent itemsets using Apriori, and all strong association rules.

TID   Items_bought
T1    M, O, N, K, E, Y
T2    D, O, N, K, E, Y
T3    M, A, K, E
T4    M, U, C, K, Y
T5    C, O, O, K, I, E


Association Rules with Apriori

Frequent 1-itemsets:
  K:5  E:4  M:3  O:3  Y:3

Candidate 2-itemsets:
  KE:4  KM:3  KO:3  KY:3  EM:2  EO:3  EY:2  MO:1  MY:2  OY:2

Frequent 2-itemsets:
  KE  KM  KO  KY  EO

Frequent 3-itemset:
  KEO:3


Mining Frequent Itemsets without Candidate Generation

In many cases, the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to a good performance gain.

However, it suffers from two nontrivial costs:
  It may generate a huge number of candidates (for example, 10^4 frequent 1-itemsets lead to more than 10^7 candidate 2-itemsets).
  It may need to scan the database many times.
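The "more than 10^7" figure follows from counting unordered pairs. A quick check, assuming every pair of the 10^4 frequent items becomes a candidate:

```python
from math import comb

n_items = 10**4
# Every unordered pair of frequent items is a candidate 2-itemset.
candidate_pairs = comb(n_items, 2)
print(candidate_pairs)          # 49995000
print(candidate_pairs > 10**7)  # True
```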


Association Rules with Apriori

Minimum support = 2/9
Minimum confidence = 70%


Bottleneck of Frequent-pattern Mining

Multiple database scans are costly.
Mining long patterns needs many passes of scanning and generates lots of candidates.
  To find the frequent itemset i1 i2 … i100:
    # of scans: 100
    # of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$
Bottleneck: candidate generation and test.
Can we avoid candidate generation?
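The candidate count for a 100-item pattern can be verified with exact integer arithmetic. This small check sums the binomial coefficients and compares the total with 2^100 - 1:

```python
from math import comb

# Number of non-empty subsets of a 100-item pattern.
total = sum(comb(100, k) for k in range(1, 101))
print(total == 2**100 - 1)  # True
print(f"{total:.2e}")       # 1.27e+30
```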


Mining Frequent Patterns Without Candidate Generation

Grow long patterns from short ones using local frequent items
  “abc” is a frequent pattern
  Get all transactions having “abc”: DB|abc
  “d” is a local frequent item in DB|abc → “abcd” is a frequent pattern
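The pattern-growth idea can be illustrated with a projected database. The five-transaction database below is invented purely for illustration, and `project` and `local_frequent_items` are hypothetical helper names.

```python
from collections import Counter

def project(transactions, pattern):
    """Return DB|pattern: transactions containing `pattern`, minus its items."""
    pattern = set(pattern)
    return [set(t) - pattern for t in transactions if pattern <= set(t)]

def local_frequent_items(projected, min_count):
    """Items that occur at least `min_count` times in the projected database."""
    counts = Counter(item for t in projected for item in t)
    return {item: c for item, c in counts.items() if c >= min_count}

# Hypothetical toy database; each string is one transaction of single-letter items.
DB = ["abcd", "abcde", "abcf", "abd", "cde"]

projected = project(DB, "abc")                # three transactions contain "abc"
local = local_frequent_items(projected, min_count=2)
print(local)                                  # {'d': 2}
```

Because "d" is frequent inside DB|abc, "abcd" is frequent in the full database, and no candidate outside the projection had to be generated.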


Process of FP growth

Scan DB once, find frequent 1-itemset (single item pattern)

Sort frequent items in frequency descending order

Scan DB again, construct FP-tree
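The three steps above can be sketched as a two-scan tree construction. This is a minimal version under some assumptions: ties in frequency are broken by item value, the header table and node-links are omitted, and `FPNode` and `build_fp_tree` are illustrative names.

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, its count, and its children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Scan 1: count items and keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Sort frequent items by descending frequency (ties: smaller item first).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}
    # Scan 2: insert each transaction's frequent items in that order.
    root = FPNode(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, node)
            child.count += 1
            node = child
    return root, order

# The nine example transactions, minimum support count 2.
DB = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
      {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
root, order = build_fp_tree(DB, min_count=2)
print(order)                                 # [2, 1, 3, 4, 5]
print(root.children[2].count)                # 7
print(root.children[2].children[1].count)    # 4
```

Shared prefixes are what make the tree compact: seven of the nine transactions pass through the single node for item 2.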


Association Rules

Let’s have an example

TID    Items
T100   1, 2, 5
T200   2, 4
T300   2, 3
T400   1, 2, 4
T500   1, 3
T600   2, 3
T700   1, 3
T800   1, 2, 3, 5
T900   1, 2, 3


FP Tree


Mining the FP tree


Benefits of the FP-tree Structure

Completeness
  Preserves complete information for frequent pattern mining
  Never breaks a long pattern of any transaction

Compactness
  Reduces irrelevant info: infrequent items are gone
  Items are in frequency-descending order: the more frequently an item occurs, the more likely its node is to be shared
  Never larger than the original database (not counting node-links and the count field)
  For the Connect-4 DB, the compression ratio could be over 100


Exercise

A dataset has five transactions. Let min_support = 60% and min_confidence = 80%.

Find all frequent itemsets using FP Tree.

TID   Items_bought
T1    M, O, N, K, E, Y
T2    D, O, N, K, E, Y
T3    M, A, K, E
T4    M, U, C, K, Y
T5    C, O, O, K, I, E


Association Rules with FP Tree

Frequent items in descending order of support count:
  K:5  E:4  M:3  O:3  Y:3
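Before the FP-tree is built, each transaction is reduced to its frequent items and rewritten in this order. A minimal sketch for the exercise data, assuming a minimum support count of 3 (60% of five transactions) and alphabetical tie-breaking among M, O, and Y:

```python
from collections import Counter

TRANSACTIONS = {
    "T1": ["M", "O", "N", "K", "E", "Y"],
    "T2": ["D", "O", "N", "K", "E", "Y"],
    "T3": ["M", "A", "K", "E"],
    "T4": ["M", "U", "C", "K", "Y"],
    "T5": ["C", "O", "O", "K", "I", "E"],
}
MIN_COUNT = 3  # 60% of 5 transactions

# Count each item once per transaction (T5 lists O twice).
counts = Counter(item for items in TRANSACTIONS.values() for item in set(items))
frequent = {i: c for i, c in counts.items() if c >= MIN_COUNT}
order = sorted(frequent, key=lambda i: (-frequent[i], i))

# Keep only frequent items, in frequency-descending order.
ordered = {tid: [i for i in order if i in items]
           for tid, items in TRANSACTIONS.items()}
for tid, items in ordered.items():
    print(tid, "".join(items))
```

The ordered transactions are KEMOY, KEOY, KEM, KMY, and KEO, which are the paths inserted into the tree.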


Association Rules with FP Tree

Item   Conditional pattern base   Conditional FP-tree   Frequent patterns
Y      KEMO:1, KEO:1, KM:1        K:3                   KY
O      KEM:1, KE:2                KE:3                  KO, EO, KEO
M      KE:2, K:1                  K:3                   KM
E      K:4                        K:4                   KE


FP-Growth vs. Apriori: Scalability With the Support Threshold

[Figure: run time (sec.) versus support threshold (%) for D1 FP-growth and D1 Apriori; data set T25I20D10K]


Why Is FP-Growth the Winner?

Divide-and-conquer:
  Decompose both the mining task and the DB according to the frequent patterns obtained so far
  Leads to focused search of smaller databases

Other factors
  No candidate generation, no candidate test
  Compressed database: FP-tree structure
  No repeated scan of the entire database
  Basic operations are counting local frequent items and building sub FP-trees; no pattern search and matching


Example 5.8 Misleading “Strong” Association Rule

Of the 10,000 transactions analyzed, the data show that 6,000 of the customer transactions included computer games, while 7,500 included videos, and 4,000 included both computer games and videos.


Misleading “Strong” Association Rule

For this example:
  Support(Game & Video) = 4,000 / 10,000 = 40%
  Confidence(Game => Video) = 4,000 / 6,000 ≈ 66.7%

Suppose the minimum support and confidence are 30% and 60%, respectively; the rule passes both.


Misleading “Strong” Association Rule

However, the truth is: “computer games and videos are negatively associated”.

This means that the purchase of one of these items actually decreases the likelihood of purchasing the other.

(How do we reach this conclusion?)


Misleading “Strong” Association Rule

If the two purchases were independent:
  60% of customers buy the game
  75% of customers buy the video
  Therefore, 60% * 75% = 45% of customers would be expected to buy both
  That equals 4,500 transactions, which is more than 4,000 (the actual value)


From Association Analysis to Correlation Analysis

Lift is a simple correlation measure, defined as follows. Here P(A ∪ B) denotes the probability that a transaction contains both itemset A and itemset B.

The occurrence of itemset A is independent of the occurrence of itemset B if
  $P(A \cup B) = P(A)\,P(B)$
Otherwise, itemsets A and B are dependent and correlated as events.

  $\mathrm{lift}(A, B) = \dfrac{P(A \cup B)}{P(A)\,P(B)}$

If the value is less than 1, the occurrence of A is negatively correlated with the occurrence of B.
If the value is greater than 1, then A and B are positively correlated.
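Applying the lift formula to the games-and-videos numbers from Example 5.8 confirms the negative correlation. A short sketch using exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

N = 10_000                      # transactions analyzed
p_game = Fraction(6_000, N)     # P(game)
p_video = Fraction(7_500, N)    # P(video)
p_both = Fraction(4_000, N)     # P(game and video)

expected_both = p_game * p_video * N   # count expected under independence
lift = p_both / (p_game * p_video)

print(expected_both)   # 4500
print(lift)            # 8/9
print(float(lift) < 1) # True: negatively correlated
```

A lift of 8/9 (about 0.89) is below 1, so the rule Game => Video is misleading even though it passes the support and confidence thresholds.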


Mining Multiple-Level Association Rules

Items often form hierarchies


Mining Multiple-Level Association Rules

Flexible support settings
  Items at the lower level are expected to have lower support

Level     Item [support]              Uniform support   Reduced support
Level 1   Milk [support = 10%]        min_sup = 5%      min_sup = 5%
Level 2   2% Milk [support = 6%]      min_sup = 5%      min_sup = 3%
Level 2   Skim Milk [support = 4%]    min_sup = 5%      min_sup = 3%
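The effect of the two threshold schemes can be checked on the milk example. In this sketch the supports are stored as integer percentages, and `frequent_items` is a hypothetical helper.

```python
# Item -> (hierarchy level, support in percent), taken from the milk example.
SUPPORT = {
    "Milk": (1, 10),
    "2% Milk": (2, 6),
    "Skim Milk": (2, 4),
}

def frequent_items(min_sup_by_level):
    """Items whose support meets the threshold set for their level."""
    return [name for name, (level, sup) in SUPPORT.items()
            if sup >= min_sup_by_level[level]]

uniform = frequent_items({1: 5, 2: 5})   # same threshold at both levels
reduced = frequent_items({1: 5, 2: 3})   # lower threshold at level 2

print(uniform)  # ['Milk', '2% Milk']
print(reduced)  # ['Milk', '2% Milk', 'Skim Milk']
```

With uniform support, Skim Milk (4%) is lost at level 2; lowering the level 2 threshold to 3% keeps it.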


Multi-level Association: Redundancy Filtering

Some rules may be redundant due to “ancestor” relationships between items.

Example
  milk => wheat bread [support = 8%, confidence = 70%]
  2% milk => wheat bread [support = 2%, confidence = 72%]

We say the first rule is an ancestor of the second rule.