

ASSOCIATION RULE MINING USING SAS E-MINER

Anshu Bharadwaj I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012

1. Introduction

Association rule mining, one of the most important and well-researched techniques of data mining, was first introduced in 1993 (Agrawal, 1993) and is used to identify relationships among a set of items in a database. It aims to extract interesting correlations, frequent patterns, associations or causal structures among sets of items in transaction databases or other data repositories. These relationships are not based on inherent properties of the data themselves (as in the case of functional dependencies), but on the co-occurrence of the data items.

Association rules are widely used in areas such as telecommunication networks, market and risk management, and inventory control, and are mainly used to analyse transactional data. In management they help to increase the effectiveness and/or reduce the cost associated with advertising, marketing, inventory, stock location on the floor etc. Association rules also assist in other applications such as prediction, by identifying which events occur before a particular set of events. An association rule may be one of the following types: Boolean, Spatial, Temporal, Generalised, Quantitative, Interval and Multiple Min-Support Association etc., or a mix of them.

An association rule (Agrawal, 1993; Cheung, 1996) describes an association among the attributes in a transactional database. Let D be a transaction database and $I = \{I_1, I_2, \ldots, I_m\}$ be a set of m distinct items (attributes) of D, where each transaction (record) T has a unique identifier and contains a set of items such that $T \subseteq I$. A transaction T is said to contain a set of items A if and only if $A \subseteq T$. An association rule is an implication of the form $A \Rightarrow B$, where $A, B \subset I$ are sets of items called itemsets and $A \cap B = \emptyset$. Here, A is called the antecedent and B the consequent.

The rule $A \Rightarrow B$ holds in the transaction database D with support s, where s is the ratio (in percent) of the number of records that contain $A \cup B$ (i.e. both A and B) to the total number of records in the database. This is taken to be the probability $P(A \cup B)$. The rule $A \Rightarrow B$ has confidence c in D, where c is the ratio (in percent) of the number of records that contain $A \cup B$ to the number of records that contain A. This is taken to be the conditional probability $P(B|A)$.

Association rule mining finds the association rules that satisfy the predefined minimum support and confidence in a given database. The problem is usually decomposed into two sub-problems (Agrawal, 1994):

1. Find all sets of items (the frequent itemsets) that occur with a frequency greater than or equal to the user-specified threshold support, say s.

2. Generate, from the frequent itemsets, the rules that have confidence greater than or equal to the user-specified threshold confidence, say c.
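The two sub-problems can be illustrated with a minimal Python sketch. It is a brute-force illustration under stated assumptions, not the algorithm used by any particular tool: the five-transaction database, the function names and the thresholds are all hypothetical, and support and confidence are kept as fractions rather than percentages.

```python
from itertools import combinations

# Hypothetical toy database: each transaction is a set of items.
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in db) / len(db)

def frequent_itemsets(db, min_support):
    """Sub-problem 1: all itemsets with support >= min_support (brute force)."""
    items = sorted(set().union(*db))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            s = support(candidate, db)
            if s >= min_support:
                result[frozenset(candidate)] = s
                found = True
        if not found:
            break  # no frequent itemset of this size, so none of a larger size
    return result

def generate_rules(freq, min_confidence):
    """Sub-problem 2: rules A => B with confidence >= min_confidence."""
    rules = []
    for itemset, s in freq.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                a = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent, so freq[a] exists.
                conf = s / freq[a]
                if conf >= min_confidence:
                    rules.append((a, itemset - a, s, conf))
    return rules

freq = frequent_itemsets(transactions, min_support=0.2)
rules = generate_rules(freq, min_confidence=0.5)
for a, b, s, c in rules:
    print(sorted(a), "=>", sorted(b), f"support={s:.2f} confidence={c:.2f}")
```

The early exit in `frequent_itemsets` relies on the fact that an itemset cannot be more frequent than any of its subsets; practical miners exploit the same property far more aggressively to avoid enumerating every candidate.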

2. Evaluation methods for Association Rule Mining

To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence. Since the database is large and users are concerned only with frequently purchased items, thresholds of support and confidence are usually predefined by users to drop those rules that are not interesting or useful. The two thresholds are called minimal support and minimal confidence respectively. A few more measures of interestingness for association rule mining are Lift, Conviction and Succinctness.

2.1 Support

The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. In the example database, the itemset {milk, bread, butter} has a support of 1/5 = 0.2 since it occurs in 20% of all transactions (1 out of 5 transactions).

2.2 Confidence

The confidence of a rule is defined as:

$$\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}$$

For example, the rule $\{\text{milk}, \text{bread}\} \Rightarrow \{\text{butter}\}$ has a confidence of 0.2/0.4 = 0.5 in the database, which means that the rule is correct for 50% of the transactions containing milk and bread. Confidence can be interpreted as an estimate of the probability P(Y | X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.

2.3 Lift

The lift of a rule is defined as:

$$\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)\,\mathrm{supp}(Y)}$$

or the ratio of the observed support to that expected if X and Y were independent. For the rule $\{\text{milk}, \text{bread}\} \Rightarrow \{\text{butter}\}$, the lift is $0.2 / (0.4 \times \mathrm{supp}(\{\text{butter}\}))$.

3. Illustration

Consider the following scenario. A store wants to examine its customer base and to understand which of its products tend to be purchased together. It has chosen to conduct a market-basket analysis of a sample of its customer base. This information might help the store decide when to distribute coupons, when to put a product on sale, or how to present items in store displays. To perform the association analysis, follow the steps given below. The ASSOCS data set lists the grocery products that are purchased by 1,001 customers. Twenty possible items are represented:

Table 1. Selected Variables in the ASSOCS Data Set

Code        Product
apples      apples
artichok    artichokes
avocado     avocado
baguette    baguettes
bordeaux    wine
bourbon     bourbon
chicken     chicken
coke        cola
corned_b    corned beef
cracker     cracker
ham         ham
heineken    beer
herring     fish
ice_crea    ice cream
olives      olives
peppers     peppers
sardines    sardines
soda        soda
steak       steak
turkey      turkey

Seven items were purchased by each of 1,001 customers, which yields 7,007 rows in the data set. Each row of the data set represents a customer-product combination. In most data sets, not all customers have the same number of products.

3.1 Create a Process Flow Diagram

1. Create a data source ASSOCS by using the SAS sample data set called SAMPSIO.ASSOCS.


2. Select the ASSOCS data set from the SAMPSIO library.

3. Click the Variables tab.


Using either the Basic or the Advanced Metadata Advisor, assign the following roles to the variables:

4. Set the model role for CUSTOMER to Id.

5. Set the model role for PRODUCT to Target.

6. Set the model role for TIME to Rejected.


Note: TIME is a variable that identifies the sequence in which the products were purchased. In this example, all of the products were purchased at the same time, so the order relates only to the order in which they are priced at the register. When order is taken into account, association analysis is known as sequence analysis. Sequence analysis is not demonstrated here.

7. Close and save changes to the Input Data Source node.

8. Assign the data source the role of Transaction in the Data Source Attributes window of the Data Source Wizard and save SAMPSIO.ASSOCS.

9. Add the data source SAMPSIO.ASSOCS to your diagram workspace.

10. Add an Association node to the diagram workspace and connect it to the data source ASSOCS.

11. Change the Maximum Items property to 2.


12. Run the Association node.

After the node runs successfully, open the Results window.
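What the configured flow computes can be pictured with a small Python sketch. This is only a conceptual analogue under stated assumptions, not the Association node's implementation: the rows are made-up values in the CUSTOMER, TIME, PRODUCT layout (not taken from SAMPSIO.ASSOCS), the function names are hypothetical, and Maximum Items = 2 is interpreted here as limiting each rule to one LHS item and one RHS item.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical rows in long format: (CUSTOMER, TIME, PRODUCT).
rows = [
    (1, 0, "cracker"), (1, 1, "heineken"), (1, 2, "olives"),
    (2, 0, "heineken"), (2, 1, "cracker"),
    (3, 0, "olives"), (3, 1, "herring"),
    (4, 0, "heineken"), (4, 1, "herring"),
]

def build_baskets(rows):
    """Group products by customer (the Id role); TIME is ignored (Rejected)."""
    baskets = defaultdict(set)
    for customer, _time, product in rows:
        baskets[customer].add(product)
    return baskets

def two_item_rules(baskets):
    """Support, confidence and lift for every rule LHS => RHS with single items."""
    n = len(baskets)
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for items in baskets.values():
        for item in items:
            item_count[item] += 1
        for lhs, rhs in permutations(sorted(items), 2):
            pair_count[(lhs, rhs)] += 1
    rules = []
    for (lhs, rhs), count in pair_count.items():
        confidence = count / item_count[lhs]
        rules.append({
            "lhs": lhs,
            "rhs": rhs,
            "support": count / n,                        # customers with both items
            "confidence": confidence,                    # of those with LHS, share with RHS
            "lift": confidence / (item_count[rhs] / n),  # confidence / supp(RHS)
        })
    return rules

rules = two_item_rules(build_baskets(rows))
for r in sorted(rules, key=lambda r: r["lift"], reverse=True):
    print(f'{r["lhs"]} => {r["rhs"]}: support={r["support"]:.2f} '
          f'confidence={r["confidence"]:.2f} lift={r["lift"]:.2f}')
```

Each pair of items produces two rules, one in each direction, with the same support but generally different confidence.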


Support (%) is the percentage of customers who have all the items that are involved in the rule. For example, 36.56% of the 1,001 customers purchased crackers and beer (rule 1), and 25.57% purchased olives and herring (rule 7). Consider the Confidence (%) column above.

Confidence (%) represents the percentage of customers who have the right-hand side (RHS) item among those who have the left-hand side (LHS) item. For example, of the customers who purchased crackers, 75% purchased beer (rule 2). Of the customers who purchased beer, however, only 61% purchased crackers (rule 1).

Lift, in the context of association rules, is the ratio of the confidence of a rule to the confidence that would be expected if the RHS were independent of the LHS.

Consequently, lift is a measure of association between the LHS and RHS of the rule. Values that are greater than one represent positive association between the LHS and RHS. Values that are equal to one represent independence. Values that are less than one represent a negative association between the LHS and RHS. Click the LIFT column with the right mouse button and select

The lift for rule 1 indicates that a customer who buys peppers and avocados is about 5.67 times as likely to purchase sardines and apples as a customer taken at random. Support (%) for this rule, unfortunately, is very low (8.99%), indicating that the event in which all four products are purchased together is a relatively rare occurrence.
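As a rough cross-check of how the three measures relate, lift can be recovered from the support of a two-item rule and the confidences in both directions, because lift = supp(A ∪ B) / (supp(A) × supp(B)) and supp(A) = supp(A ∪ B) / conf(A ⇒ B). The Python sketch below applies this to the crackers and beer figures quoted earlier (36.56%, 75% and 61%); the function names are hypothetical, and because the inputs are rounded percentages the result is only approximate and is not a value read from the Results window.

```python
def lift_from_confidences(support_ab, conf_a_to_b, conf_b_to_a):
    """Lift of A => B from supp(A and B) and the two directional confidences.

    All arguments are fractions in [0, 1], not percentages.
    """
    return conf_a_to_b * conf_b_to_a / support_ab

def describe(lift, tol=1e-9):
    """Interpret a lift value as described in the text."""
    if abs(lift - 1.0) <= tol:
        return "independence"
    return "positive association" if lift > 1.0 else "negative association"

# Figures quoted in the text for crackers and beer, converted to fractions.
lift = lift_from_confidences(support_ab=0.3656, conf_a_to_b=0.75, conf_b_to_a=0.61)
print(f"approximate lift: {lift:.2f} ({describe(lift)})")
```

The lift of a rule and of its reverse are equal, since the formula is symmetric in A and B.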