data mining association rules
DESCRIPTION
Data Mining Association Rules. Yao Meng Hongli Li 91.574 Database II Fall 2002 . Outline. Overview Apriori AprioriTid DIC Data Structure Experiment Environment Experiment Result and Analysis. Overview – Apriori Algorithm. Overview – AprioriTid. Overview – DIC . Read M transaction - PowerPoint PPT PresentationTRANSCRIPT
Data Mining Association Rules
Yao MengHongli Li91.574 Database IIFall 2002
Outline
Overview Apriori AprioriTid DIC
Data Structure Experiment Environment Experiment Result and Analysis
Overview – Apriori AlgorithmGenerate Large
1-itemset
Merge and pruneto Generatecandidate of
next size
Go throughwhole DB to
Count Support
> MinSupport?
large itemset
HasCandidate?
End
No
Yes
Yes
Overview – AprioriTid
Generate Large1-itemset
Merge and pruneto Generatecandidate of
next size
Go throughDB_Sub to
Count Supportand generatenew DB_Sub
> MinSupport?
large itemset
HasCandidate?
End
No
Yes
Yes
Overview – DIC Read M transaction Increment those itemset
that are current counting If all the child of a itemset
turned to large, begin to counting this itemset
If an itemset has been counted through all the transaction, remove it from the current counting list
If at the end of the DB, go to the first step
Stop if no itemset are need to counting
Hypothesis of Performance AnalysisGiven a memory size AprioriTid generally has better performance
than Apriori due to I/O saving DIC has better performance than Apriori in
fairly homogenenous data environment. DIC performance should approach that of
Apriori while M approaches number of total transaction.
Experiment Environment
Data Sets IBM Synthetic Dataset Generation Code for
Association Rules Enviroments
Operating System: Microsoft Windows XP Professional
Computer Intel Pentium III processor 550MHz RAM 384 MB
Source code written in Java
Data Structure
Apriori and DIC Candidate Itemset
stored in a hash-tree Each internal node is
are hashtables The leaves stored the
candidate itemset AprioriTid
Use array to keep candidates
Size vs. Execution Time
Number of Items = 8Avg transaction length = 5M = 500
Support Threshold
Size = 16410 transaction Number of Items = 8Average Length per transaction = 5 M = 500
DIC – Different M value
Size = 12291 transaction Number of Items = 8Average Length per transaction = 5
DIC – “Non-Homogeneous” Dataset
Size = 6000 transaction Number of Items = 8M = 500
Conclusions
AprioriTid is the best in our experiment I/O saving AprioriTid use small Data structure
Apriori and DIC are very similar Apriori is Special Case of DIC They use same data structure
DIC Sensitive to data M affects performance
Reference1. Rakesh Agrawal, Tomasz Imielinski, Arun Swami. Mining Association Rules between
Sets of Items in Large Database. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 1993
2. Rakesh Agrawal, Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. Proc. 20th Int. Conf. Very Large Data Bases, VLDB, page 487-499. 1994
3. Ashok Savasere, Edward Omiecinski, Shamkant Navathe. An Efficient Algorithm for Mining Association Rules in Large Databases. Proc. of the 21st VLDB Conf., pp. 432-444, 1995.
4. Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, Shalom Tsur. Dynamic Itemset Counting and Implication Rules for Market Basket Data. SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data. Tucson, Arizona, USA. 1997.
5. J. Hipp, U. Güntzer, G. Nakhaeizadeh. Mining Association Rules: Deriving a Superior Algorithm by Analysing Today's Approaches. Proceedings of the 4th European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD '00), Lyon, France. 2000.
6. Jochen Hipp, Ulrich Güntzer, Gholamreza Nakhaeizadeh. Algorithms for Association Rule Mining – A General Survey and Comparison. SIGKDD Explorations. 2(1): 58-64. 2000.
7. R. Srikant, R. Agrawal. Mining Generalized Association Rule. In Proc. of the VLDB Conference, September 1995