Application of Computational Intelligence in Data Mining

Post on 24-Apr-2015


<p>Brasov, 2011
Transilvania University of Brașov
Faculty of Electrical Engineering and Computer Science
Applications of computational intelligence in data mining
By Ioan Bogdan CRIVAȚ
A thesis submitted in partial fulfillment of the requirements for the degree of PhD
Advisor: Prof. Univ. Dr. Razvan Andonie

Abstract

The objective of this work is to synthesize recent efforts in the extraction and processing of predictive and associative rules, and to present certain original contributions to the area.

The first two chapters of the thesis present data mining and some recent results in the area of rule extraction. The second chapter, Rules in the Data Mining Context, introduces data mining with a focus on rule extraction. We discuss association rules and their properties as well as some notions of fuzzy modeling and fuzzy rules. The third chapter, Methods for Rule Extraction, presents the most commonly used methods for extracting rules. A special section describes the specifics of rule analysis in Microsoft SQL Server.

The following chapters present original contributions in their context. The fourth chapter, Contributions to Rule Generalization, reviews some of the existing methods for simplifying rule models and focuses on measures for detecting rule similarity. Similar rules can be merged, resulting in simpler rule systems. The fifth chapter, Measuring the Usage Prediction Accuracy of Recommendation Systems, presents the area of accuracy measurements for recommendation systems, one of the most common applications of association rules. A new instrument for assessing the accuracy of a recommender is presented, together with some experimental results.

The sixth chapter presents some experimental results for the techniques introduced in the third and fourth chapters. The results are detailed for the datasets used in presenting the methods, or compared against results from other authors.
The last chapter contains the conclusions of this thesis as well as certain directions for further research.

Contents

List of figures
Acknowledgments
Publications, Patents and Patent Applications by the Author
  Books
  Articles
  Issued Patents (United States Patents and Trademark Office)
  Pending patent applications (United States Patents and Trademark Office)
1 Introduction
  1.1 Objectives
  1.2 Contributions
  1.3 The Structure of this Thesis
2 Rules in the Data Mining context
  2.1 Data mining in industry: an overview
  2.2 Data Mining Problems, Tasks and Processes
    2.2.1 Business Problems
    2.2.2 Implementation Tasks
    2.2.3 Data Mining Project Cycle
  2.3 Rules in Data Mining
    2.3.1 Association Rules
    2.3.2 Classifications of association rules
    2.3.3 The Market Basket Analysis problem
    2.3.4 Itemsets and Rules in dense representation
    2.3.5 Equivalence of dense and sparse representations
  2.4 Fuzzy Rules
    2.4.1 Conceptualizing in Fuzzy Terms
    2.4.2 Fuzzy Modeling
3 Methods for Rule Extraction
  3.1 Extraction of Association Rules
    3.1.1 The Apriori algorithm
    3.1.2 The FP-Growth algorithm
    3.1.3 Other algorithms and a performance comparison
    3.1.4 Problems raised by Minimum Support itemset extraction systems
  3.2 An implementation perspective: Support for association analysis in Microsoft SQL Server 2008
  3.3 Rules as expression of patterns detected by other algorithms
    3.3.1 Rules based on Decision Trees
    3.3.2 Rules from Neural Networks
4 Contributions to Rule Generalization
  4.1 Fuzzy Rules Generalization
    4.1.1 Redundancy
    4.1.2 Similarity
    4.1.3 Interpolation based rule generalization techniques
  4.2 Rule Model Simplification Techniques
    4.2.1 Feature set alterations
    4.2.2 Changes of the Fuzzy sets definition
    4.2.3 Merging and Removal Based Reduction
  4.3 Similarity Measures and Rule Base Simplification
  4.4 Rule Generalization
    4.4.1 Problem and context
    4.4.2 The rule generalization algorithm
    4.4.3 Applying the RGA to an apriori-derived set of rules
  4.5 Conclusion
    4.5.1 Future directions for the basic rule generalization algorithm
    4.5.2 Further work for the apriori specialization of the RGA
5 Measuring the Usage Prediction Accuracy of Recommendation Systems
  5.1 Association Rules as Recommender Systems
  5.2 Evaluating Recommendation Systems
  5.3 Instruments for offline measuring the accuracy of usage predictions
    5.3.1 Accuracy measurements for a single user
    5.3.2 Accuracy Measurements for Multiple Users
  5.4 The Itemized Accuracy Curve
    5.4.1 A visual interpretation of the itemized accuracy curve
    5.4.2 Impact of the N parameter on the Lift and Area Under Curve measures
  5.5 An Implementation for the Itemized Accuracy Curve
    5.5.1 Accuracy measures
    5.5.2 Real data test strategies
    5.5.3 The algorithm for constructing Itemized Accuracy Curve
  5.6 Conclusions and further work
6 Experimental Results
  6.1 Datasets used in this material
    6.1.1 IC50 prediction dataset
    6.1.2 Movies Recommendation
    6.1.3 Movie Lens
    6.1.4 Iris
  6.2 Experimental results for the Rule Generalization algorithm
    6.2.1 Rule set and results used in Section 4.4 on generalization
    6.2.2 Results for the apriori post-processing algorithm
  6.3 Experimental results for the Itemized Accuracy Curve
    6.3.1 Movie Recommendation Results
    6.3.2 Movie Lens Results
7 Conclusions and directions for further research
  7.1 Conclusions
  7.2 Further Work
Appendix A: Key Algorithms
  Apriori
  FP-Growth
Bibliography

List of figures

Figure 2-1 The CRISP-DM process
Figure 2-2 Standard types of membership functions (from (20))
Figure 3-1 Finding frequent itemsets
Figure 3-2 An FP-Tree structure
Figure 3-3 A mining case containing tabular features
Figure 3-4 A RDBMS representation of the data supporting mining cases with nested tables
Figure 3-5 Using a structure nested table as source for multiple model nested tables
Figure 3-6 A decision tree built for rules extraction (part of a SQL Server forest)
Figure 3-7 An artificial neural network
Figure 4-1 Creating a fuzzy set C to replace two similar sets A and B
Figure 4-2 Merging of similar rules
Figure 4-3 A visual representation of the RGA
Figure 4-4 A finer grain approach to rule generalization
Figure 4-5 Accuracy of a fuzzy rule as a measure of similarity with the universal set
Figure 5-1 Example of ROC Curve
Figure 5-2 Itemized Accuracy Curve for a top-N recommender
Figure 5-3 Evolution of Lift and Area Under Curve for different values of N
Figure 5-4 Aggregated Itemized Accuracy Curve based on the Movie Recommendations dataset (for N=5 recommendations)
Figure 6-1 Itemized Accuracy Chart for n=3 (Movie recommendations)
Figure 6-2 Evolution of Lift for various values of N for test models (Movie Recommendations dataset)
Figure 6-3 Evolution of Lift for various values of N for test models (Movie Lens dataset)

Acknowledgments

I would like to express my deepest gratitude to Prof. Dr. Răzvan Andonie for his guidance, patience and encouragement. Above all, I would like to thank him for rekindling my passion for academic research after years of industrial experience. Deep thanks also go to the Faculty of Electrical Engineering and Computer Science at the Transilvania University for their help and...</p>
