Association Rule Mining, Part 2
(under construction!)
Introduction to Data Mining with Case Studies
Author: G. K. Gupta
Prentice Hall India, 2006.
December 2008 ©GKGupta
Bigger Example

TID  Items
1    Biscuits, Bread, Cheese, Coffee, Yogurt
2    Bread, Cereal, Cheese, Coffee
3    Cheese, Chocolate, Donuts, Juice, Milk
4    Bread, Cheese, Coffee, Cereal, Juice
5    Bread, Cereal, Chocolate, Donuts, Juice
6    Milk, Tea
7    Biscuits, Bread, Cheese, Coffee, Milk
8    Eggs, Milk, Tea
9    Bread, Cereal, Cheese, Chocolate, Coffee
10   Bread, Cereal, Chocolate, Donuts, Juice
11   Bread, Cheese, Juice
12   Bread, Cheese, Coffee, Donuts, Juice
13   Biscuits, Bread, Cereal
14   Cereal, Cheese, Chocolate, Donuts, Juice
15   Chocolate, Coffee
16   Donuts
17   Donuts, Eggs, Juice
18   Biscuits, Bread, Cheese, Coffee
19   Bread, Cereal, Chocolate, Donuts, Juice
20   Cheese, Chocolate, Donuts, Juice
21   Milk, Tea, Yogurt
22   Bread, Cereal, Cheese, Coffee
23   Chocolate, Donuts, Juice, Milk, Newspaper
24   Newspaper, Pastry, Rolls
25   Rolls, Sugar, Tea
Frequency of Items
Item No  Item name  Frequency
1        Biscuits   4
2        Bread      13
3        Cereal     10
4        Cheese     11
5        Chocolate  9
6        Coffee     9
7        Donuts     10
8        Eggs       2
9        Juice      11
10       Milk       6
11       Newspaper  2
12       Pastry     1
13       Rolls      2
14       Sugar      1
15       Tea        4
16       Yogurt     2
Frequent Items
Assume 25% support. With 25 transactions, a frequent item must therefore occur in at least 7 of them. The frequent 1-itemsets (L1) are given below. How many candidates are in C2? List them.
Item       Frequency
Bread      13
Cereal     10
Cheese     11
Chocolate  9
Coffee     9
Donuts     10
Juice      11
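As a sketch of how L1 is computed (not from the book, and using only the first five transactions of the table above for brevity; for all 25 rows the minimum count would be 7):

```python
from collections import Counter

# First five transactions of the example table, for brevity.
transactions = [
    {"Biscuits", "Bread", "Cheese", "Coffee", "Yogurt"},
    {"Bread", "Cereal", "Cheese", "Coffee"},
    {"Cheese", "Chocolate", "Donuts", "Juice", "Milk"},
    {"Bread", "Cheese", "Coffee", "Cereal", "Juice"},
    {"Bread", "Cereal", "Chocolate", "Donuts", "Juice"},
]

def frequent_1_itemsets(transactions, min_count):
    """Count each item's frequency and keep those meeting min_count."""
    counts = Counter(item for t in transactions for item in t)
    return {item: n for item, n in counts.items() if n >= min_count}

L1 = frequent_1_itemsets(transactions, min_count=3)
```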
L2
The following pairs are frequent. Now find C3 and then L3 and the rules.
Frequent 2-itemset    Frequency
{Bread, Cereal}       9
{Bread, Cheese}       8
{Bread, Coffee}       8
{Cheese, Coffee}      9
{Chocolate, Donuts}   7
{Chocolate, Juice}    7
{Donuts, Juice}       9
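One way to sketch the step from L2 to C3 is the classic Apriori join, which merges two sorted 2-itemsets sharing their first item; the prune step (discussed later) then removes any candidate with an infrequent 2-subset. A hypothetical Python version:

```python
def join_step(prev_frequent):
    """Apriori join: merge two (k-1)-itemsets that agree on their
    first k-2 items (in sorted order) into one k-itemset candidate."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            if prev[i][:-1] == prev[j][:-1]:
                candidates.add(frozenset(prev[i] + (prev[j][-1],)))
    return candidates

# L2 from the table above.
L2 = [{"Bread", "Cereal"}, {"Bread", "Cheese"}, {"Bread", "Coffee"},
      {"Cheese", "Coffee"}, {"Chocolate", "Donuts"},
      {"Chocolate", "Juice"}, {"Donuts", "Juice"}]
C3 = join_step(L2)   # candidate 3-itemsets, before pruning
```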
Rules
The full set of rules is given below. Could some rules be removed?
Cheese → Bread
Cheese → Coffee
Coffee → Bread
Coffee → Cheese
Cheese, Coffee → Bread
Bread, Coffee → Cheese
Bread, Cheese → Coffee
Chocolate → Donuts
Chocolate → Juice
Donuts → Chocolate
Donuts → Juice
Donuts, Juice → Chocolate
Chocolate, Juice → Donuts
Chocolate, Donuts → Juice
Bread → Cereal
Cereal → Bread
Comment: Study the above rules carefully.
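Rules such as these are generated from each frequent itemset by enumerating antecedents and keeping those whose confidence meets a threshold. A minimal sketch, using the supports of Chocolate (9), Donuts (10) and {Chocolate, Donuts} (7) from the earlier tables, and an assumed 75% confidence threshold:

```python
from itertools import combinations

def generate_rules(support, itemset, min_conf):
    """Emit rules X -> (itemset - X) whose confidence
    support(itemset) / support(X) meets min_conf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            a = frozenset(antecedent)
            conf = support[itemset] / support[a]
            if conf >= min_conf:
                rules.append((a, itemset - a, conf))
    return rules

# Support counts taken from the earlier tables.
support = {frozenset({"Chocolate"}): 9,
           frozenset({"Donuts"}): 10,
           frozenset({"Chocolate", "Donuts"}): 7}
rules = generate_rules(support, {"Chocolate", "Donuts"}, min_conf=0.75)
```

Here Chocolate → Donuts has confidence 7/9 and survives, while Donuts → Chocolate (7/10) falls below the threshold.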
Improving the Apriori Algorithm
Many techniques for improving the efficiency have been proposed:
• Pruning (already mentioned)
• Hashing based technique
• Transaction reduction
• Partitioning
• Sampling
• Dynamic itemset counting
Pruning
Pruning can reduce the size of the candidate set Ck, and hence the work of transforming Ck into the set of frequent itemsets Lk. To prune, we use the rule that every subset of a frequent itemset must itself be frequent: any candidate in Ck with an infrequent (k−1)-subset can be removed.
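This subset check can be sketched as follows; the L2 and C3 values here are hypothetical, chosen only to illustrate the rule:

```python
from itertools import combinations

def prune(candidates, prev_frequent):
    """Keep only candidates all of whose (k-1)-subsets are frequent."""
    prev = {frozenset(s) for s in prev_frequent}
    return [c for c in candidates
            if all(frozenset(sub) in prev
                   for sub in combinations(sorted(c), len(c) - 1))]

# Hypothetical L2 and candidate C3, for illustration only.
L2 = [{"A", "C"}, {"A", "P"}, {"C", "P"}, {"E", "P"}, {"E", "V"}]
C3 = [frozenset({"A", "C", "P"}), frozenset({"E", "P", "V"})]
```

{E, P, V} is pruned because its subset {P, V} is not in L2.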
Example
• Suppose the items are A, B, C, D, E, F, .., X, Y, Z
• Suppose L1 is A, C, E, P, Q, S, T, V, W, X
• Suppose L2 is {A, C}, {A, F}, {A, P}, {C, P}, {E, P}, {E, G}, {E, V}, {H, J}, {K, M}, {Q, S}, {Q, X}
• Are you able to identify errors in the L2 list?
• What is C3?
• How to prune C3?
• C3 is {A, C, P}, {E, P, V}, {Q, S, X}
Hashing
The direct hashing and pruning (DHP) algorithm attempts to generate large itemsets efficiently and reduces the transaction database size.
When generating L1, the algorithm also generates all the 2-itemsets for each transaction, hashes them to a hash table and keeps a count.
Example
Consider the transaction database in the first table below, used in an earlier example. The second table shows all possible 2-itemsets for each transaction.

Transaction ID   Items
100              Bread, Cheese, Eggs, Juice
200              Bread, Cheese, Juice
300              Bread, Milk, Yogurt
400              Bread, Juice, Milk
500              Cheese, Juice, Milk

TID   2-itemsets
100   (B, C) (B, E) (B, J) (C, E) (C, J) (E, J)
200   (B, C) (B, J) (C, J)
300   (B, M) (B, Y) (M, Y)
400   (B, J) (B, M) (J, M)
500   (C, J) (C, M) (J, M)
Hashing Example
The possible 2-itemsets in the last table are now hashed to the hash table below. The last column is not required in the hash table, but we include it to explain the technique. (Note that bucket 0 receives five pairs: (C, J) occurs three times, plus (B, Y) and (M, Y).)

Bit vector  Bucket number  Count  Pairs                               C2
1           0              5      (C, J) (C, J) (C, J) (B, Y) (M, Y)  (C, J)
0           1              1      (C, M)
0           2              1      (E, J)
0           3              0
0           4              2      (B, C) (B, C)
1           5              3      (B, E) (J, M) (J, M)                (J, M)
1           6              3      (B, J) (B, J) (B, J)                (B, J)
1           7              3      (C, E) (B, M) (B, M)                (B, M)
Hash Function Used
For each pair, a numeric value is obtained by representing B by 1, C by 2, E by 3, J by 4, M by 5 and Y by 6. Each pair can then be represented by a two-digit number, for example (B, E) by 13 and (C, M) by 25.

This two-digit number is then taken modulo 8 (the remainder after dividing by 8), which gives the bucket address.

A count of the number of pairs hashed to each bucket is kept. Buckets whose count reaches the support threshold have their bit set to 1; the rest are set to 0.
All pairs in rows that have zero bit are removed.
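The whole hashing step can be sketched in Python as below (a sketch, not the book's code; the support threshold of 3 is taken from the example):

```python
from itertools import combinations

# The hash function described above: B=1, C=2, E=3, J=4, M=5, Y=6;
# a pair becomes a two-digit number, then modulo 8 gives the bucket.
CODE = {"B": 1, "C": 2, "E": 3, "J": 4, "M": 5, "Y": 6}

def bucket(x, y):
    return (10 * CODE[x] + CODE[y]) % 8

transactions = [{"B", "C", "E", "J"}, {"B", "C", "J"}, {"B", "M", "Y"},
                {"B", "J", "M"}, {"C", "J", "M"}]

counts = [0] * 8
for t in transactions:
    # Sort so the smaller code always forms the first digit.
    for x, y in combinations(sorted(t, key=CODE.get), 2):
        counts[bucket(x, y)] += 1

# Buckets whose count meets the support threshold (3) get bit 1.
bits = [1 if n >= 3 else 0 for n in counts]
```

Note that bucket 0 ends up with count 5, since (C, J) hashes there three times along with (B, Y) and (M, Y).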
Find C2
The major aim of the algorithm is to reduce the size of C2. It is therefore essential that the hash table be large enough to keep collisions low, since collisions reduce the effectiveness of the hash table. That is what happened in this example: three of the eight buckets had collisions, which required a further check to determine which of the colliding pairs were actually frequent.
Transaction Reduction
As discussed earlier, any transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets and such a transaction may be marked or removed.
Example
Frequent items (L1) are A, B, D, M, T. These cannot be used to eliminate any transactions, since every transaction contains at least one item of L1. The frequent pairs (L2) are {A, B} and {B, M}. How can we reduce transactions using these?
TID Items bought
001 B, M, T, Y
002 B, M
003 T, S, P
004 A, B, C, D
005 A, B
006 T, Y, E
007 A, B, M
008 B, C, D, T, P
009 D, T, S
010 A, B, M
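One possible reduction step, sketched in Python with the transactions and frequent pairs above: a transaction containing no frequent 2-itemset cannot contain any frequent 3-itemset, so it can be dropped.

```python
transactions = {
    "001": {"B", "M", "T", "Y"}, "002": {"B", "M"}, "003": {"T", "S", "P"},
    "004": {"A", "B", "C", "D"}, "005": {"A", "B"}, "006": {"T", "Y", "E"},
    "007": {"A", "B", "M"}, "008": {"B", "C", "D", "T", "P"},
    "009": {"D", "T", "S"}, "010": {"A", "B", "M"},
}

frequent_pairs = [{"A", "B"}, {"B", "M"}]

def reduce_transactions(transactions, frequent_itemsets):
    """Keep only transactions containing at least one frequent itemset;
    the rest cannot contribute to any larger frequent itemset."""
    return {tid: t for tid, t in transactions.items()
            if any(s <= t for s in frequent_itemsets)}

reduced = reduce_transactions(transactions, frequent_pairs)
```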
Partitioning
The set of transactions may be divided into a number of disjoint subsets, and each partition is then searched for frequent itemsets. These are called local frequent itemsets.

How can information about local frequent itemsets be used in finding frequent itemsets of the global set of transactions?

In the example on the next slide, a set of transactions has been divided into two partitions. Find the frequent itemsets for each partition. Are these local frequent itemsets useful?
Example

1 2 5 7
2 3 4 5
5 6 11
2 5 7 4 13
6 11 13 2 4
14 19
1 2 5 7 14
12 14 19
2 4 5 6 7
2 4 6 11 13
2 4 6 11 13
2 13
2 5 7 11 13
1 2 3
4 5 6
1 2 5 7
2 4 6 11 13
5 6 11 13
Partitioning
Phase 1
– Divide n transactions into m partitions
– Find the frequent itemsets in each partition
– Combine all local frequent itemsets to form candidate itemsets

Phase 2
– Find global frequent itemsets
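The two phases can be sketched as follows for 2-itemsets (a sketch, not the book's code). The key property is that a globally frequent itemset must be locally frequent in at least one partition, so the union of local results is a valid candidate set; one full scan then verifies the candidates.

```python
from collections import Counter
from itertools import combinations

def local_frequent_pairs(partition, min_count):
    """Phase 1 (per partition): frequent 2-itemsets within the partition."""
    counts = Counter()
    for t in partition:
        for pair in combinations(sorted(t), 2):
            counts[frozenset(pair)] += 1
    return {s for s, n in counts.items() if n >= min_count}

def global_frequent_pairs(partitions, min_support):
    # Phase 1: the local threshold scales with the partition size.
    candidates = set()
    for p in partitions:
        candidates |= local_frequent_pairs(p, max(1, int(min_support * len(p))))
    # Phase 2: one scan over all transactions verifies each candidate.
    all_t = [t for p in partitions for t in p]
    return {c for c in candidates
            if sum(1 for t in all_t if c <= t) >= min_support * len(all_t)}

# Tiny illustration with two partitions and 50% support.
demo = global_frequent_pairs([[{1, 2}, {1, 2}, {1, 3}],
                              [{1, 2}, {2, 3}, {1, 2}]], min_support=0.5)
```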
Sampling
A random sample (usually large enough to fit in the main memory) may be obtained from the overall set of transactions and the sample is searched for frequent itemsets. These frequent itemsets are called sample frequent itemsets.
How can information about sample itemsets be used in finding frequent itemsets of the global set of transactions?
Sampling
Not guaranteed to be accurate: we sacrifice accuracy for efficiency. A lower support threshold may be used for the sample, to reduce the chance of missing any frequent itemsets.
The actual frequencies of the sample frequent itemsets are then obtained.
More than one sample could be used to improve accuracy.
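A sketch of the sampling approach (the 0.8 slack factor that lowers the sample threshold is an assumption for illustration): mine the sample at the lowered threshold, then verify candidate frequencies against the full data.

```python
import random
from collections import Counter

def frequent_items(transactions, min_count):
    counts = Counter(i for t in transactions for i in t)
    return {i for i, n in counts.items() if n >= min_count}

def sample_frequent_items(transactions, sample_size, support, slack=0.8):
    """Mine a random sample at a lowered threshold (slack < 1 reduces the
    chance of missing a truly frequent item), then verify on the full data."""
    sample = random.sample(transactions, sample_size)
    candidates = frequent_items(sample, slack * support * sample_size)
    # Obtain the actual frequencies of the sample frequent itemsets.
    counts = Counter(i for t in transactions for i in t)
    return {i for i in candidates
            if counts[i] >= support * len(transactions)}

# With sample_size equal to the database size, the result is exact.
demo = sample_frequent_items([{"a", "b"}] * 3 + [{"b"}] * 2,
                             sample_size=5, support=0.5)
```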
Problems with Association Rules Algorithms
• Users are overwhelmed by the number of rules identified ─ how can the number of rules be reduced to those relevant to the user's needs?
• The Apriori algorithm assumes sparsity, since the number of items in each record is quite small.
• Some applications produce dense data, which may also have:
  • many frequently occurring items
  • strong correlations
  • many items in each record
Problems with Association Rules
Also consider:
AB → C (90% confidence)
and
A → C (92% confidence)
Clearly the first rule is of no use. We should look for more complex rules only if they are better than simple rules.
Top Down Approach
Algorithms considered so far were bottom up, i.e. they started by looking at each frequent item, then each pair, and so on.

Is it possible to design top-down algorithms that consider the largest group of items first and then find the smaller groups? Let us first look at the itemset ABCD, which can be frequent only if all its subsets are frequent.
Closed and Maximal Itemsets
A frequent closed itemset is a frequent itemset X such that there exists no superset of X with the same support count as X. A frequent itemset Y is maximal if it is not a proper subset of any other frequent itemset. Therefore a maximal itemset is a closed itemset but a closed itemset is not necessarily a maximal itemset.
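These definitions can be checked directly from support counts. A sketch, using hypothetical counts that echo the {B, D}, {C, D}, {B, C, D} example discussed below:

```python
def closed_and_maximal(frequent):
    """frequent: dict mapping frozenset -> support count.
    Returns (closed, maximal) per the definitions above."""
    closed, maximal = set(), set()
    for s, supp in frequent.items():
        supersets = [t for t in frequent if s < t]   # frequent proper supersets
        if all(frequent[t] != supp for t in supersets):
            closed.add(s)        # no superset has the same support
        if not supersets:
            maximal.add(s)       # no frequent proper superset at all
    return closed, maximal

# Hypothetical support counts, for illustration.
frequent = {frozenset("BD"): 8, frozenset("CD"): 9, frozenset("BCD"): 8}
closed, maximal = closed_and_maximal(frequent)
```

Here {B, D} is not closed (its superset {B, C, D} has the same support), {C, D} is closed but not maximal, and {B, C, D} is both.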
Closed and Maximal Itemsets
Frequent maximal itemsets: the maximal frequent itemsets uniquely determine all frequent itemsets (though not their support counts). The aim of an association rule algorithm can therefore be reduced to finding all maximal frequent itemsets.
Closed and Maximal Itemsets
In an earlier example, we found that {B, D} and {B, C, D} had the same support of 8, while {C, D} had a support of 9. {C, D} is therefore a closed itemset but not maximal. On the other hand, {B, C} was frequent but no superset of the two items is frequent. This pair is therefore maximal as well as closed.
Closed and Maximal Itemsets

[Venn diagram: Maximal Frequent Itemsets ⊂ Closed Frequent Itemsets ⊂ Frequent Itemsets]
Performance Evaluation of Algorithms
• The FP-growth method was usually better than the best implementation of the Apriori algorithm.
• CHARM was also usually better than Apriori. In some cases, CHARM was better than the FP-growth method.
• Apriori was generally better than other algorithms if the support required was high since high support leads to a smaller number of frequent items which suits the Apriori algorithm.
• At very low support, the number of frequent items became large and none of the algorithms were able to handle large frequent sets gracefully.