association rules and frequent pattern growth algorithms

Upload: francisco-e-figueroa-nigaglioni

Post on 11-Apr-2017

Page 1: Association rules and frequent pattern growth algorithms

Association Rules and Frequent Pattern Growth Algorithms

CIS 435 Francisco E. Figueroa

Executive Summary

Over the last several years, the amount of data generated and stored has grown exponentially across all fields, including science, business, and retailing. Data mining can be defined as the process of applying computational techniques to find patterns in data, generating knowledge and wisdom that create new value for companies. Mining association rules from the given historical sales data provides actionable intelligence to the business leadership team so that the store can prepare for the heavy snowstorm.

Association Rules Overview

The goal of association rule mining is to identify all frequent itemsets whose support exceeds a user-specified threshold (minimum support) and then, using these frequent itemsets as input, to generate all association rules whose confidence exceeds a second threshold (minimum confidence). Association analysis is useful for discovering relationships hidden in large data sets. The uncovered relationships can be represented as association rules or as sets of frequent items. Retailers can use such rules to identify new opportunities for cross-selling products to their customers. For example, the following rule could be extracted from the data: {milk} ---> {bread}. The rule suggests that a strong relationship exists between the sale of milk and bread because many customers who buy milk also buy bread. An association rule is an implication expression of the form X ---> Y, where X and Y are disjoint itemsets.

Strength, Confidence and Lift

The strength of an association rule can be measured in terms of its support and confidence. Support determines how often a rule is applicable to a given data set, while confidence determines how frequently items in Y appear in transactions that contain X. Support is an important measure because a rule with very low support may occur simply by chance and is likely to be uninteresting from a business perspective. Support can therefore be used to eliminate uninteresting rules.

Confidence measures the reliability of the inference made by the rule. For a given rule X ---> Y, the higher the confidence, the more likely it is for Y to be present in transactions that contain X. Confidence also provides an estimate of the conditional probability of Y given X. The inference made by an association rule suggests a strong co-occurrence relationship between the items in the antecedent and the consequent of the rule.

Lift is equal to the confidence of the rule divided by the expected confidence, that is, the support of the consequent. A credible rule has a relatively large confidence, a relatively large level of support, and a lift greater than 1. Rules having a high level of confidence but little support should be interpreted with caution. (SAS, 2000)
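As a minimal sketch of the three measures described in this section, the following Python computes support, confidence, and lift for the rule {milk} ---> {bread}. The five baskets are made-up toy data and the function names are illustrative; neither comes from the sales data analyzed in this paper.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Estimate of P(Y | X): support(X and Y together) / support(X)."""
    both = frozenset(x) | frozenset(y)
    return support(both, transactions) / support(x, transactions)


def lift(x: Iterable[str], y: Iterable[str],
         transactions: List[Transaction]) -> float:
    """Confidence divided by the expected confidence, i.e. support(Y)."""
    return confidence(x, y, transactions) / support(y, transactions)


# Hypothetical toy baskets, for illustration only.
baskets: List[Transaction] = [
    frozenset({"milk", "bread"}),
    frozenset({"milk", "bread", "beer"}),
    frozenset({"milk"}),
    frozenset({"bread"}),
    frozenset({"beer"}),
]

print(round(support({"milk", "bread"}, baskets), 3))    # 0.4
print(round(confidence({"milk"}, {"bread"}, baskets), 3))  # 0.667
print(round(lift({"milk"}, {"bread"}, baskets), 3))     # 1.111
```

In this toy data the lift is slightly above 1, so milk and bread appear together a little more often than independence would predict.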

In other words, the lift of the rule X ---> Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets are independent. It follows that:

- If lift is greater than 1, X and Y appear together more often than expected; the occurrence of X has a positive effect on the occurrence of Y, or X is positively correlated with Y.

- If lift is smaller than 1, X and Y appear together less often than expected; the occurrence of X has a negative effect on the occurrence of Y, or X is negatively correlated with Y.

- If lift is near 1, X and Y appear together almost as often as expected; the occurrence of X has almost no effect on the occurrence of Y, or X and Y are essentially uncorrelated.

Apriori

The Apriori algorithm was proposed for mining frequent itemsets to obtain strong Boolean association rules. A frequent itemset is a set of items that occurs in the transactions with at least a specified minimum support. A strong rule is one that satisfies both minimum support and minimum confidence. Apriori uses an iterative level-wise search, in which k-itemsets (itemsets that contain k items) are used to explore (k+1)-itemsets, to mine frequent itemsets from a transactional database. The procedure first finds the set of frequent 1-itemsets (k=1), denoted L1. L1 is then used to find the set of frequent 2-itemsets, L2, which is in turn used to find L3, and so on, until no more frequent k-itemsets can be found. Each iteration involves two steps: 1) generate candidate k-itemsets and 2) determine the support of each candidate using the transaction database. Infrequent itemsets are then pruned, and strong rules are generated from the frequent itemsets.

FP Growth

FP-Growth is an improvement on Apriori designed to eliminate some of its heaviest bottlenecks. It does so by using a structure called an FP-Tree. In the FP-Tree, each node represents an item and its current count, and each branch represents a different association. The whole algorithm is divided into five steps:

1. Count all the items in all transactions.
2. Apply the threshold.
3. Sort the list by the count of each item.
4. Build the tree from each transaction, inserting its items in the order they appear in the sorted list.
5. Walk every branch of the tree and include in the association only the nodes whose count passed the threshold.
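The first four of the five steps above (counting, thresholding, sorting, and tree building) can be sketched in Python as follows. This is a simplified illustration, not the implementation used by Weka: the class name `FPNode`, the function `build_fp_tree`, the tie-breaking rule, and the sample baskets are all assumptions made for the example, and the final mining step over the branches is left out.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


class FPNode:
    """One tree node: an item, its running count, and its child nodes."""

    def __init__(self, item: Optional[str], parent: Optional["FPNode"] = None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children: Dict[str, "FPNode"] = {}


def build_fp_tree(transactions: List[Iterable[str]],
                  min_count: int) -> Tuple[FPNode, Dict[str, int]]:
    # Step 1 (first pass over the data): count the transactions per item.
    counts = Counter(item for t in transactions for item in set(t))
    # Step 2: apply the threshold, keeping only frequent items.
    frequent = {item: c for item, c in counts.items() if c >= min_count}
    # Step 3: sort by descending count (ties broken alphabetically here).
    ordered = sorted(frequent, key=lambda item: (-frequent[item], item))
    rank = {item: position for position, item in enumerate(ordered)}

    # Step 4 (second pass): insert each transaction along a shared path.
    root = FPNode(None)
    for t in transactions:
        path = sorted((i for i in set(t) if i in rank), key=rank.__getitem__)
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


# Hypothetical baskets, for illustration only.
sample = [
    ["water", "soap", "beer"],
    ["water", "soap"],
    ["water", "beer"],
    ["water", "soap", "milk"],
]
tree, kept = build_fp_tree(sample, min_count=2)
print(kept)                                 # milk is dropped (count 1)
print(tree.children["water"].count)         # 4
```

Because every basket is inserted in the same global order, baskets that share frequent items also share a prefix of the tree, which is what keeps the structure compact.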
The biggest advantage of FP-Growth is that the algorithm only needs to read the file twice, removes the need to calculate the pairs to be counted, and does not require as much memory as Apriori. (Alfaro, 2016)

Top 10 Products

Applying Apriori (MetricType: confidence; numrules: 40; car: True), we obtained the following products: bath tissue, hat, water, soap, beer, flashlights, rock salt, protein bars, blankets, and milk.

Top 10 Association Rules

When applying Apriori and FP Growth we obtained the following results:

FP Growth: MetricType: confidence; numrules: 40
1. [WATER=T]: 99 ==> [Soap=T]: 99 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.99)
2. [Soap=T]: 99 ==> [WATER=T]: 99 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.99)
3. [Beer=T]: 88 ==> [WATER=T]: 88 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.88)
4. [Flashlights=T]: 77 ==> [WATER=T]: 77 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.77)
5. [Milk=T]: 64 ==> [WATER=T]: 64 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.64)
6. [Blankets=T]: 64 ==> [WATER=T]: 64 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.64)
7. [Beer=T]: 88 ==> [Soap=T]: 88 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.88)
8. [Flashlights=T]: 77 ==> [Soap=T]: 77 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.77)
9. [Milk=T]: 64 ==> [Soap=T]: 64 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.64)
10. [Blankets=T]: 64 ==> [Soap=T]: 64 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.64)

Apriori: MetricType: confidence; numrules 40; car:True
1. Bath Tissue=T 55 ==> Hat=T 55 conf:(1)
2. WATER=T Bath Tissue=T 54 ==> Hat=T 54 conf:(1)
3. Bath Tissue=T Soap=T 54 ==> Hat=T 54 conf:(1)
4. WATER=T Bath Tissue=T Soap=T 54 ==> Hat=T 54 conf:(1)
5. Beer=T Bath Tissue=T 48 ==> Hat=T 48 conf:(1)
6. WATER=T Beer=T Bath Tissue=T 48 ==> Hat=T 48 conf:(1)
7. Beer=T Bath Tissue=T Soap=T 48 ==> Hat=T 48 conf:(1)
8. WATER=T Beer=T Bath Tissue=T Soap=T 48 ==> Hat=T 48 conf:(1)
9. Flashlights=T Bath Tissue=T 39 ==> Hat=T 39 conf:(1)
10. Flashlights=T WATER=T Bath Tissue=T 39 ==> Hat=T 39 conf:(1)

These rules show that water, soap, beer, and flashlights are strong products.

Top 2 Products Purchased

FP Growth found 19 rules associated with “Generator”. A lift greater than 1 indicates the degree to which the two occurrences depend on one another, and makes those rules potentially useful for predicting the consequent in future data sets. In addition, conviction indicates how often the rule can be expected to be incorrect.
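To make the `lift` and `lev` (leverage) columns in the FP Growth listing easier to read, the sketch below recomputes them from raw counts. It assumes a total of 100 transactions; that figure is not stated in this paper, but it is consistent with the reported counts and lift values. The function name `rule_metrics` is illustrative. Conviction is computed with the textbook formula, which is unbounded when confidence is 1, so it is not expected to reproduce the finite `conv` values in the listing.

```python
import math
from typing import Dict


def rule_metrics(n_total: int, n_x: int, n_y: int, n_xy: int) -> Dict[str, float]:
    """Metrics for a rule X ---> Y from transaction counts.

    n_total: all transactions (ASSUMED to be 100 in the example below)
    n_x, n_y: transactions containing X, containing Y
    n_xy: transactions containing both X and Y
    """
    supp_x = n_x / n_total
    supp_y = n_y / n_total
    supp_xy = n_xy / n_total
    conf = n_xy / n_x
    # Lift: confidence relative to the expected confidence supp(Y).
    lift = conf / supp_y
    # Leverage: observed joint support minus the support expected
    # if X and Y were independent.
    leverage = supp_xy - supp_x * supp_y
    # Textbook conviction; infinite when the rule is never wrong.
    conviction = math.inf if conf == 1.0 else (1 - supp_y) / (1 - conf)
    return {"confidence": conf, "lift": lift,
            "leverage": leverage, "conviction": conviction}


# Rule 1 in the FP Growth listing: WATER (99) ==> Soap (99), joint count 99.
water_soap = rule_metrics(n_total=100, n_x=99, n_y=99, n_xy=99)
print(round(water_soap["lift"], 2), round(water_soap["leverage"], 2))  # 1.01 0.01

# Rule 3: Beer (88) ==> WATER (99 overall), joint count 88.
beer_water = rule_metrics(n_total=100, n_x=88, n_y=99, n_xy=88)
print(round(beer_water["lift"], 2), round(beer_water["leverage"], 2))  # 1.01 0.01
```

Note that the second number printed in each Weka rule is the joint count of antecedent and consequent, so the overall count of the consequent has to be taken from a rule in which that item is the antecedent.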
Based on those measures, the top two products purchased with the “Generator” are water and soap, because each of those rules has a lift of 1.01 and a conviction of 0.1. Beer is another strong product to purchase with the “Generator”: its rule has a lift of 1.14 and a conviction of 1.2, which under the usual reading of conviction means the rule would be incorrect about 1.2 times as often if generator and beer purchases were independent.

FPGrowth found 19 rules (displaying top 19)
Showing only rules that contain: Generator
1. [Generator=T]: 10 ==> [WATER=T]: 10 <conf:(1)> lift:(1.01) lev:(0) conv:(0.1)
2. [Generator=T]: 10 ==> [Soap=T]: 10 <conf:(1)> lift:(1.01) lev:(0) conv:(0.1)

3. [Generator=T]: 10 ==> [Beer=T]: 10 <conf:(1)> lift:(1.14) lev:(0.01) conv:(1.2)
4. [Generator=T]: 10 ==> [WATER=T, Soap=T]: 10 <conf:(1)> lift:(1.01) lev:(0) conv:(0.1)
5. [WATER=T, Generator=T]: 10 ==> [Soap=T]: 10 <conf:(1)> lift:(1.01) lev:(0) conv:(0.1)
6. [Soap=T, Generator=T]: 10 ==> [WATER=T]: 10 <conf:(1)> lift:(1.01) lev:(0) conv:(0.1)
7. [Generator=T]: 10 ==> [WATER=T, Beer=T]: 10 <conf:(1)> lift:(1.14) lev:(0.01) conv:(1.2)
8. [WATER=T, Generator=T]: 10 ==> [Beer=T]: 10 <conf:(1)> lift:(1.14) lev:(0.01) conv:(1.2)
9. [Beer=T, Generator=T]: 10 ==> [WATER=T]: 10 <conf:(1)> lift:(1.01) lev:(0) conv:(0.1)

9 Large Itemsets

To obtain large itemsets containing 9 items (L(9)), Weka had to be configured with the following parameters: Apriori, car: True, lowerBoundMinSupport: 0.1, metricType: Confidence, minMetric: 0.09, numRules: 400, outputItemSets: True. We obtained the following large itemsets L(9).

Large Itemsets L(9):
Rock salt=T WATER=T Snow shovels=T Blankets=T Protien Bars=T Bath Tissue=T Soap=T Hygine Products=T Milk=T 10
Flashlights=T WATER=T Blankets=T Canned food=T Protien Bars=T Beer=T Bath Tissue=T Soap=T Milk=T 10
Flashlights=T WATER=T Blankets=T Canned food=T Protien Bars=T Beer=T Bath Tissue=T Soap=T Bread=T 10
Flashlights=T WATER=T Blankets=T Canned food=T Protien Bars=T Beer=T Bath Tissue=T Milk=T Bread=T 10
Flashlights=T WATER=T Blankets=T Canned food=T Protien Bars=T Beer=T Soap=T Milk=T Bread=T 10
Flashlights=T WATER=T Blankets=T Canned food=T Protien Bars=T Bath Tissue=T Soap=T Milk=T Bread=T 10
Flashlights=T WATER=T Blankets=T Canned food=T Beer=T Bath Tissue=T Soap=T Milk=T Bread=T 13
Flashlights=T WATER=T Blankets=T Protien Bars=T Beer=T Bath Tissue=T Soap=T Milk=T Bread=T 10
Flashlights=T WATER=T Canned food=T Protien Bars=T Beer=T Bath Tissue=T Soap=T Milk=T Bread=T 10
Flashlights=T Blankets=T Canned food=T Protien Bars=T Beer=T Bath Tissue=T Soap=T Milk=T Bread=T 10
WATER=T Snow shovels=T Blankets=T Protien Bars=T Beer=T Bath Tissue=T Soap=T Hygine Products=T Milk=T
WATER=T Blankets=T Canned food=T Protien Bars=T Beer=T Bath Tissue=T Soap=T Milk=T Bread=T 10

Real-World Association Rules - Healthcare

The Institute for Integrated and Intelligent Systems (IIIS) implemented a system prototype, named the CSCP system, that applies association rule mining to an (assumed) patient database in order to discover patterns of diseases that a patient might carry. The patterns recognised by this implementation can improve healthcare services and help medical researchers further explore trends among correlated diseases. The technique allows the IIIS to generate correlations among diseases. (Rashid)

Real-World Association Rules - Retailing

Retailers collect data every day, such as transactional data, customer demographics, and product sales broken down by parameters such as seasons and festivals. To convert this data into knowledge and wisdom, it is necessary to discover and understand the underlying patterns in the organisation’s operations. Analysis of past transaction data is a commonly used approach for improving the quality of such decisions. Extraction of frequent itemsets is essential for mining interesting patterns from datasets. A typical usage scenario for searching frequent patterns is the so-called “market basket analysis”, which involves analysing the transactional data of a supermarket or retail store to determine which products are purchased together and how often, and to examine customer purchase preferences. (Prasad, 2011)

Real-World Association Rules - Finance

Bankruptcy prediction is very important for any organization. The financial statement, which comprises both the balance sheet and the income statement, is examined through financial analysis and then used to build a bankruptcy prediction model. The association rule mining algorithm improves the efficiency of the proposed method by providing relevant results based on the associations between the businesses’ financial statements. (Martin, 2011)

References:

Rashid, M., Hoque, T., Sattar, A. Association Rules Mining Based Clinical Observations. Institute for Integrated and Intelligent Systems (IIIS). Retrieved from https://arxiv.org/pdf/1401.2571.pdf

Kouris, I.N., Makris, C., Theodoridis, E., Tsakalidis, A. Association Rules Mining for Retail Organizations. Retrieved from http://www.igi-global.com/viewtitlesample.aspx?id=13583&ptid=362&t=association+rules+mining+for+retail+organizations

Prasad, P., Malik, L. Using Association Rule Mining for Extracting Product Sales Patterns in Retail Store Transactions. 2011. International Journal on Computer Science and Engineering. Retrieved from http://www.enggjournals.com/ijcse/doc/IJCSE11-03-05-185.pdf

SAS. The Assoc Procedure. Retrieved from http://support.sas.com/documentation/onlinedoc/miner/em43/assoc.pdf

Martin, A., Manjula, M., Venkatesan, P. A Business Intelligence Model to Predict Bankruptcy using Financial Domain Ontology with Association Rule Mining Algorithm. 2011. International Journal of Computer Science. Retrieved from https://arxiv.org/pdf/1109.1087.pdf

Alfaro, F., Solano, J. Apriori vs. FP-Growth for Frequent Item Set Mining. 2016. Retrieved from http://singularities.com/blog/2015/08/apriori-vs-fpgrowth-for-frequent-item-set-mining