Ch5 Mining Frequent Patterns, Associations, and Correlations
Dr. Bernard Chen, Ph.D.
University of Central Arkansas
Outline
Association Rules
Association Rules with FP tree
Misleading Rules
Multi-level Association Rules
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
What Is Frequent Pattern Analysis?
Motivation: Finding inherent regularities in data
What products were often purchased together? Bread and milk?
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale
campaign analysis, Web log (click stream) analysis, and DNA
sequence analysis.
Association Rules
Association Rules
support, s, probability that a transaction contains X ∪ Y (that is, every item of both X and Y)
confidence, c, conditional probability that a transaction having X also contains Y
Association Rules
Let’s have an example
T100: 1, 2, 5
T200: 2, 4
T300: 2, 3
T400: 1, 2, 4
T500: 1, 3
T600: 2, 3
T700: 1, 3
T800: 1, 2, 3, 5
T900: 1, 2, 3
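As a minimal sketch of the two definitions, the following computes support and confidence directly from the nine example transactions. The function names `support` and `confidence` are illustrative choices, not part of the original slides.

```python
# The nine example transactions (TID -> set of item ids).
transactions = {
    "T100": {1, 2, 5}, "T200": {2, 4}, "T300": {2, 3},
    "T400": {1, 2, 4}, "T500": {1, 3}, "T600": {2, 3},
    "T700": {1, 3}, "T800": {1, 2, 3, 5}, "T900": {1, 2, 3},
}


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(x, y):
    """Conditional probability that a transaction having X also contains Y."""
    return support(set(x) | set(y)) / support(x)


print(support({1, 2}))       # support of the itemset {1, 2}
print(confidence({1}, {2}))  # confidence of the rule 1 -> 2
```

Here items 1 and 2 appear together in four of the nine transactions, so the support of {1, 2} is 4/9, and the confidence of 1 -> 2 is 4/6 because item 1 appears in six transactions.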
Association Rules with Apriori
Minimum support = 2/9
Minimum confidence = 60%
The Apriori Algorithm
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
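The pseudo-code can be sketched in Python roughly as follows. This is one possible reading rather than a reference implementation: the function name `apriori`, the use of an absolute `min_count` instead of a fraction, and the simple pairwise join step are all assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)

    k = 1
    while level:
        # Join: union pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # Count: one pass over the database for this level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent


db = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
      {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
result = apriori(db, min_count=2)  # minimum support 2/9
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On the nine example transactions with minimum support 2/9, this yields five frequent 1-itemsets, six frequent 2-itemsets, and two frequent 3-itemsets ({1, 2, 3} and {1, 2, 5}).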
Strong Association Rule
A strong association rule is a rule drawn from a frequent itemset that also meets the minimum confidence.
For example, for the frequent itemset {I1, I2}:
Confidence(I1->I2) = 4/6 ≈ 67% (strong association rule!)
Confidence(I2->I1) = 4/7 ≈ 57% (below the 60% minimum confidence, so not strong)
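To illustrate how strong rules can be enumerated from one frequent itemset, here is a small sketch that tries every non-empty proper subset as the left-hand side. The helper names `count` and `strong_rules` are hypothetical, and the data is the nine-transaction example.

```python
from itertools import combinations

db = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
      {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]


def count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in db if set(itemset) <= t)


def strong_rules(itemset, min_conf):
    """List (lhs, rhs, confidence) for rules from `itemset` meeting min_conf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            conf = count(itemset) / count(lhs)
            if conf >= min_conf:
                rules.append((set(lhs), set(itemset - lhs), conf))
    return rules


for lhs, rhs, conf in strong_rules({1, 2}, min_conf=0.6):
    print(lhs, "->", rhs, round(conf, 3))
```

With a 60% minimum confidence, only 1 -> 2 survives for the itemset {1, 2}; calling `strong_rules({1, 2, 5}, 0.6)` gives the three rules whose left-hand side contains item 5.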
Exercise
A dataset has five transactions; let min_support = 60% and min_confidence = 80%
Find all frequent itemsets using Apriori and all strong association rules
TID  Items_bought
T1   M, O, N, K, E, Y
T2   D, O, N, K, E, Y
T3   M, A, K, E
T4   M, U, C, K, Y
T5   C, O, O, K, I, E
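Association Rules with Apriori
Frequent 1-itemsets: K:5, E:4, M:3, O:3, Y:3
Candidate 2-itemsets: KE:4, KM:3, KO:3, KY:3, EM:2, EO:3, EY:2, MO:1, MY:2, OY:2
=> Frequent 2-itemsets: KE, KM, KO, KY, EO
=> Frequent 3-itemset: KEO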
Mining Frequent Itemsets without Candidate Generation
In many cases, the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good performance gain.
However, it suffers from two nontrivial costs:
It may generate a huge number of candidates (for example, 10^4 frequent 1-itemsets may generate more than 10^7 candidate 2-itemsets)
It may need to scan the database many times
Association Rules with Apriori
Minimum support = 2/9
Minimum confidence = 70%
Bottleneck of Frequent-pattern Mining
Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
To find the frequent itemset i1 i2 … i100:
# of scans: 100
# of candidates: $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ !
Bottleneck: candidate-generation-and-test
Can we avoid candidate generation?
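The candidate count can be checked with a few lines of Python. This is only an arithmetic check of the sum of binomial coefficients, using the standard-library `math.comb`.

```python
from math import comb

n = 100
# Sum of C(100, k) for k = 1..100: every non-empty subset of 100 items.
total = sum(comb(n, k) for k in range(1, n + 1))

print(total == 2**n - 1)  # the closed form for the sum
print(f"{total:.2e}")     # order of magnitude of the candidate count
```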
Mining Frequent Patterns Without Candidate Generation
Grow long patterns from short ones using local frequent items
“abc” is a frequent pattern
Get all transactions having “abc”: DB|abc
“d” is a local frequent item in DB|abc => “abcd” is a frequent pattern
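A small sketch of this pattern-growth idea, using the nine example transactions instead of the abstract “abc”: project the database onto a known frequent pattern, then count the remaining items locally. The function names `projected_db` and `local_frequent_items` are illustrative, not from the slides.

```python
from collections import Counter

db = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
      {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]


def projected_db(db, pattern):
    """DB|pattern: transactions containing `pattern`, with the pattern removed."""
    pattern = set(pattern)
    return [t - pattern for t in db if pattern <= t]


def local_frequent_items(db, pattern, min_count):
    """Items that are frequent inside the projected database DB|pattern."""
    counts = Counter(item for rest in projected_db(db, pattern) for item in rest)
    return {item: n for item, n in counts.items() if n >= min_count}


# {1, 2} is frequent; which items extend it to a longer frequent pattern?
local = local_frequent_items(db, {1, 2}, min_count=2)
print(local)
```

Items 3 and 5 each occur twice in DB|{1, 2}, so {1, 2, 3} and {1, 2, 5} are frequent patterns at minimum support 2/9, while item 4 occurs only once there.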
Process of FP growth
Scan DB once, find frequent 1-itemset (single item pattern)
Sort frequent items in frequency descending order
Scan DB again, construct FP-tree
Association Rules
Let’s have an example
T100: 1, 2, 5
T200: 2, 4
T300: 2, 3
T400: 1, 2, 4
T500: 1, 3
T600: 2, 3
T700: 1, 3
T800: 1, 2, 3, 5
T900: 1, 2, 3
FP Tree
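As a rough sketch of the two-scan construction, the following builds an FP-tree for the nine example transactions with minimum support 2/9. The `Node` class, the function names, and the tie-breaking rule (equal counts ordered by item id) are assumptions for illustration; node-links and the header table are left out to keep it short.

```python
from collections import Counter


class Node:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> Node


def build_fp_tree(db, min_count):
    # Scan 1: count items and keep the frequent ones.
    counts = Counter(item for t in db for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Frequency-descending order; ties broken by item id (an assumption).
    ordered = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(ordered)}

    # Scan 2: insert each transaction's frequent items in rank order.
    root = Node(None)
    for t in db:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1
    return root


def show(node, depth=0):
    """Return the tree as indented 'item:count' lines."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(show(child, depth + 1))
    return lines


db = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
      {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
root = build_fp_tree(db, min_count=2)
print("\n".join(show(root)))
```

Because item 2 is the most frequent, seven of the nine transactions share the branch that starts at 2, which is where the compression comes from.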
Mining the FP tree
Benefits of the FP-tree Structure
Completeness
Preserves complete information for frequent pattern mining
Never breaks a long pattern of any transaction
Compactness
Reduces irrelevant info: infrequent items are gone
Items are in frequency descending order: the more frequently occurring, the more likely to be shared
Never larger than the original database (not counting node-links and the count field)
For the Connect-4 DB, the compression ratio could be over 100
Exercise
A dataset has five transactions; let min_support = 60% and min_confidence = 80%
Find all frequent itemsets using FP Tree
TID  Items_bought
T1   M, O, N, K, E, Y
T2   D, O, N, K, E, Y
T3   M, A, K, E
T4   M, U, C, K, Y
T5   C, O, O, K, I, E
Association Rules with FP Tree
K:5, E:4, M:3, O:3, Y:3
Association Rules with FP Tree
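Item: conditional pattern base => conditional FP-tree => frequent patterns
Y: KEMO:1, KEO:1, KM:1 => K:3 => KY
O: KEM:1, KE:2 => KE:3 => KO, EO, KEO
M: KE:2, K:1 => K:3 => KM
E: K:4 => K:4 => KE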
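The prefix paths listed for each item (its conditional pattern base) can be reproduced with a short sketch. It skips building the tree and instead reads the prefixes off the frequency-ordered transactions, which yields the same bases. The function name and the alphabetical tie-break among M, O, and Y are assumptions.

```python
from collections import Counter

# The five exercise transactions; each string spells out the items bought.
db = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE")]
min_count = 3  # 60% of five transactions

counts = Counter(item for t in db for item in t)
# Frequent items in descending frequency; ties broken alphabetically (assumption).
order = sorted((i for i in counts if counts[i] >= min_count),
               key=lambda i: (-counts[i], i))


def conditional_pattern_base(item):
    """Prefix paths (with counts) that precede `item` in the ordered transactions."""
    base = Counter()
    for t in db:
        ordered = [i for i in order if i in t]
        if item in ordered:
            prefix = "".join(ordered[:ordered.index(item)])
            if prefix:
                base[prefix] += 1
    return dict(base)


print(order)
for item in reversed(order):
    print(item, conditional_pattern_base(item))
```

For Y, transaction T4 (M, U, C, K, Y) reduces to K, M, Y once infrequent items are dropped, so its prefix path is KM.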
FP-Growth vs. Apriori: Scalability With the Support Threshold
[Figure: run time (sec.) versus support threshold (%), comparing D1 FP-growth runtime and D1 Apriori runtime on data set T25I20D10K]
Why Is FP-Growth the Winner?
Divide-and-conquer:
decompose both the mining task and DB according to the frequent patterns obtained so far
leads to focused search of smaller databases
Other factors:
no candidate generation, no candidate test
compressed database: FP-tree structure
no repeated scan of entire database
basic ops: counting local frequent items and building sub FP-trees, no pattern search and matching
Example 5.8 Misleading “Strong” Association Rule
Of the 10,000 transactions analyzed, the data show that 6,000 of the customer transactions included computer games, while 7,500 included videos, and 4,000 included both computer games and videos
Misleading “Strong” Association Rule
For this example:
Support (Game & Video) = 4,000 / 10,000 = 40%
Confidence (Game => Video) = 4,000 / 6,000 ≈ 66.7%
Suppose the rule passes our minimum support and confidence thresholds (30% and 60%, respectively)
Misleading “Strong” Association Rule
However, the truth is: “computer games and videos are negatively associated”,
which means the purchase of one of these items actually decreases the likelihood of purchasing the other.
(How do we reach this conclusion?)
Misleading “Strong” Association Rule
If the two purchases were independent:
60% of customers buy the game
75% of customers buy the video
Therefore, we would expect 60% * 75% = 45% of customers to buy both
That equals 4,500 transactions, which is more than 4,000 (the actual value)
From Association Analysis to Correlation Analysis
Lift is a simple correlation measure that is given as follows
The occurrence of itemset A is independent of the occurrence of itemset B if $P(A \cup B) = P(A)P(B)$
Otherwise, itemsets A and B are dependent and correlated as events
$\mathrm{Lift}(A, B) = \dfrac{P(A \cup B)}{P(A)P(B)}$
If the value is less than 1, the occurrence of A is negatively correlated with the occurrence of B
If the value is greater than 1, then A and B are positively correlated
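Applying the lift formula to the computer games and videos example gives a quick numeric check. The variable names below are illustrative; the counts are the ones stated in Example 5.8.

```python
# Counts from Example 5.8.
n_total = 10_000
n_game = 6_000
n_video = 7_500
n_both = 4_000

p_game = n_game / n_total    # P(game)
p_video = n_video / n_total  # P(video)
p_both = n_both / n_total    # P(game and video in the same transaction)

expected_both = p_game * p_video * n_total  # count expected under independence
lift = p_both / (p_game * p_video)

print(expected_both)
print(round(lift, 3))
```

The lift comes out below 1 (0.40 / 0.45), so the rule Game => Video is negatively correlated even though it passes the support and confidence thresholds.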
Mining Multiple-Level Association Rules
Items often form hierarchies
Mining Multiple-Level Association Rules
Flexible support settings: items at the lower level are expected to have lower support
Level 1: Milk [support = 10%]
Level 2: 2% Milk [support = 6%], Skim Milk [support = 4%]
Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
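The effect of the two threshold schemes on the milk hierarchy can be sketched as below. The dictionary layout and the function name `frequent_items` are assumptions for illustration; the support values and thresholds are those on the slide.

```python
# item -> (hierarchy level, support)
support = {
    "Milk": (1, 0.10),
    "2% Milk": (2, 0.06),
    "Skim Milk": (2, 0.04),
}


def frequent_items(min_sup_by_level):
    """Items whose support meets the minimum support set for their level."""
    return [name for name, (level, s) in support.items()
            if s >= min_sup_by_level[level]]


uniform = frequent_items({1: 0.05, 2: 0.05})  # same threshold at every level
reduced = frequent_items({1: 0.05, 2: 0.03})  # lower threshold at level 2

print("uniform:", uniform)
print("reduced:", reduced)
```

Under uniform support, Skim Milk (4%) misses the 5% threshold; under reduced support, the 3% threshold at level 2 keeps it.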
Multi-level Association: Redundancy Filtering
Some rules may be redundant due to “ancestor” relationships between items.
Example:
milk => wheat bread [support = 8%, confidence = 70%]
2% milk => wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.