10 Association analysis
The detection of association rules is another descriptive method which is very popular in data
mining, especially in such areas as web mining, where it is used to analyse the pages visited by
a web user, and the retail industry, where it can analyse the products bought by a customer on a
single visit. This explains the alternative name for this method: market basket analysis. Of
course, this method can be usefully applied to other activities as well. It does not have the same
theoretical difficulties as clustering and classification methods; instead, the difficulties arise
from the need to process enormous volumes of data (up to several million till receipts, for
example) and to pick out new and interesting associations from the overwhelming majority of
irrelevant or previously known associations.
10.1 Principles
Finding association rules is a matter of finding rules of the following type: ‘If, for any one
individual, variable $A = x_A$, variable $B = x_B$, and so on, then, in 80% of cases, variable $Z = x_Z$,
and this configuration is found for 20% of the individuals.’ In other words, the aim is to find
the most frequent combined values of a set of variables of a data set. In market basket analysis,
the variables are the indicators of the products, and the rules are applied to indicators equal
to 1, in other words the products bought. Note that some recent research has been carried out
on ‘negative’ rules, where we are interested in the products that are not bought.
The value of 80% is called the index of confidence and the value of 20% is called the
support index of the rule $\{A = x_A, B = x_B, \ldots\} \Rightarrow \{Z = x_Z\}$. The first part of the rule is called
the ‘antecedent’ or ‘condition’; the second part is called the ‘consequent’ or ‘result’; and
expressions of the form $\{A = x_A\}$ are called ‘items’. In an association rule, an item can never
be in both the condition and the result simultaneously.
Data Mining and Statistics for Decision Making, First Edition. Stéphane Tufféry.
© 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd. ISBN: 978-0-470-68829-8
A rule is therefore an expression of the form:
If Condition, then Result.
Here is an example taken from marketing (mythical, if not veracious):
If Nappies and Saturday, then Beer.
The support index is the probability
\[ \mathrm{Prob}(\text{Condition and Result}). \]
The confidence index is the probability
\[ \frac{\mathrm{Prob}(\text{Condition and Result})}{\mathrm{Prob}(\text{Condition})}. \]
Naturally, the aim is to find association rules for which the support and confidence are
above specified minimum thresholds.
For example, in the transactions shown in Table 10.1, where each row corresponds to a
market basket TX, and each column corresponds to a product A, B, . . . , the confidence index of
the association B ⇒ E is 3/4 and its support index is 3/5. Similarly, the confidence index of the
association C ⇒ B is 2/3 and its support index is 2/5. One thing is evident: B is present in almost all
the transactions, or more precisely the a priori probability of having B there is 0.8. This
probability is greater than the confidence index for C ⇒ B, and therefore the rule C ⇒ B is not
helpful for predicting B. If we say that a transaction taken at random contains B, there is
only one chance in five that we will be wrong, as against one chance in three if we follow the
rule C ⇒ B.
Table 10.1 Set of transactions.

Transaction   Products
T26           A B C D E
T163          B C E F
T1728         B E
T2718         A B D
T3141         C D

The improvement brought by a rule, by comparison with a random response, is called the
lift (or simply the ‘improvement’), and is as follows:
\[
\mathrm{lift}(\text{rule}) = \frac{\text{confidence index}(\text{rule})}{\mathrm{Prob}(\text{Result})}
= \frac{\mathrm{Prob}(\text{Condition and Result})}{\mathrm{Prob}(\text{Condition}) \times \mathrm{Prob}(\text{Result})}.
\]
When the ‘result’ is independent of the ‘condition’, the lift is clearly equal to 1. If the lift is less
than 1, the rule does not help. Thus we find that lift(C ⇒ B) = 5/6 (useless rule) and
lift(B ⇒ E) = 5/4 (useful rule). But note that, if the lift of the rule
\[ \text{Condition} \Rightarrow \text{Result} \]
is less than 1, then the lift of the inverse rule, i.e. the rule
\[ \text{Condition} \Rightarrow \text{NOT Result}, \]
is greater than 1, since
\[ \text{confidence index}(\text{inverse rule}) = 1 - \text{confidence index}(\text{rule}) \]
and
\[ \mathrm{Prob}(\text{NOT Result}) = 1 - \mathrm{Prob}(\text{Result}). \]
If a rule is not useful, we can try using the inverse rule, in the hope that it will be helpful for
business or marketing purposes.
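The figures quoted for Table 10.1 can be checked with a few lines of Python. This is only a minimal sketch: the function names (`prob`, `support`, `confidence`, `lift`, `inverse_lift`) are illustrative choices, not part of any package mentioned in this chapter, and exact fractions are used so that the results can be compared directly with the text.

```python
from fractions import Fraction

# Transactions of Table 10.1 (basket identifier -> set of products).
TRANSACTIONS = {
    "T26": {"A", "B", "C", "D", "E"},
    "T163": {"B", "C", "E", "F"},
    "T1728": {"B", "E"},
    "T2718": {"A", "B", "D"},
    "T3141": {"C", "D"},
}


def prob(items):
    """Proportion of baskets that contain every product in `items`."""
    hits = sum(1 for basket in TRANSACTIONS.values() if items <= basket)
    return Fraction(hits, len(TRANSACTIONS))


def support(condition, result):
    """Prob(Condition and Result)."""
    return prob(condition | result)


def confidence(condition, result):
    """Prob(Condition and Result) / Prob(Condition)."""
    return support(condition, result) / prob(condition)


def lift(condition, result):
    """Confidence divided by the a priori probability of the result."""
    return confidence(condition, result) / prob(result)


def inverse_lift(condition, result):
    """Lift of the inverse rule 'Condition => NOT Result'."""
    return (1 - confidence(condition, result)) / (1 - prob(result))


for cond, res in [({"B"}, {"E"}), ({"C"}, {"B"})]:
    print(sorted(cond), "=>", sorted(res),
          "support", support(cond, res),
          "confidence", confidence(cond, res),
          "lift", lift(cond, res),
          "inverse lift", inverse_lift(cond, res))
```

For C ⇒ B the lift is below 1 while the lift of the inverse rule is above 1, which is the behaviour described above.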
The main algorithm for detecting association rules is the Apriori algorithm proposed by
Agrawal and other researchers.¹
Apriori operates in two steps, which have become standard for this type of algorithm:
• It starts by searching for the subsets of items having a probability of appearance
(support) above a certain threshold.
• Then it attempts to break down each subset in a form {Condition ∪ Result} such that the
quotient Prob(Condition and Result)/Prob(Condition), i.e. the confidence index, is
above a certain threshold.
In the first step, Apriori starts by making a first pass through the data, to eliminate all the
items which are less frequent than the specified minimum support. It then performs a second
pass, in order to construct all the sets of items with two elements, formed from the items retained
previously. Of these sets, it only retains those whose frequency exceeds the specified minimum
support. On each pass, Apriori retains only the sets of items which are more frequent than the
support threshold, out of all those constructed on the basis of the sets from the previous pass and
the items selected in the first pass. The sets of items of size $n$ which are useful for our
purposes are those constructed from sets of size $n - 1$ which are themselves frequent. The
first optimization of Apriori is that only a single pass is required for each value of $n$.
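The level-wise search described in this paragraph can be sketched in Python as follows. This is a simplified illustration in the spirit of Apriori, not the optimized algorithm of Agrawal et al.: the function name `frequent_itemsets` is an assumption, candidates are formed here by joining pairs of frequent sets of the previous size, and no attempt is made to count efficiently. The data are the baskets of Table 10.1.

```python
from itertools import combinations


def frequent_itemsets(baskets, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    def count_pass(candidates):
        # One pass through the data: count each candidate, keep the frequent ones.
        counts = dict.fromkeys(candidates, 0)
        for basket in baskets:
            for cand in candidates:
                if cand <= basket:
                    counts[cand] += 1
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Pass 1: single items.
    level = count_pass({frozenset([item]) for b in baskets for item in b})
    frequent = dict(level)
    size = 2
    while level:
        previous = set(level)
        # Candidates of the current size: unions of two frequent sets of the previous size ...
        candidates = {a | b for a in previous for b in previous if len(a | b) == size}
        # ... all of whose subsets of the previous size are themselves frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous for s in combinations(c, size - 1))}
        level = count_pass(candidates)  # a single pass for this size
        frequent.update(level)
        size += 1
    return frequent


TABLE_10_1 = [set("ABCDE"), set("BCEF"), set("BE"), set("ABD"), set("CD")]
for itemset, s in sorted(frequent_itemsets(TABLE_10_1, 0.5).items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The pruning step is what keeps the number of candidates manageable: a set is counted only if every subset one item smaller was already found to be frequent.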
The difficulty of implementing the search for rules is due to the exponential growth of the
number of rules with the number of items. For each subset of items $E$ with $n$ elements, there are
$2^{n-1} - 1$ ways of splitting $E$ into two non-empty parts and, since each split can be read in
either direction, $2^n - 2$ rules of the form $A \Rightarrow \{E - A\}$, and therefore the same number of possible breakdowns
in the second step. Another improvement provided by the designers of Apriori is a way of
quickly identifying the rules which may exceed the fixed threshold of the confidence index.
Because of these advantages, the Apriori algorithm is the most widespread and most
commonly implemented algorithm for detecting association rules.
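To make the second step concrete, the sketch below enumerates the possible breakdowns of one frequent set into a condition and a result, and keeps those whose confidence reaches a threshold. It is a brute-force illustration, not the shortcut used by Apriori; the names `candidate_rules` and `confident_rules` are hypothetical. Counting each direction separately, a set of n items yields 2^n − 2 ordered splits (2^(n−1) − 1 if the two directions of a split are not distinguished). The supports used in the example are those read off Table 10.1.

```python
from fractions import Fraction
from itertools import combinations


def candidate_rules(itemset):
    """Yield every (condition, result) split of `itemset` with both parts non-empty."""
    items = sorted(itemset)
    for k in range(1, len(items)):
        for condition in combinations(items, k):
            result = [i for i in items if i not in condition]
            yield frozenset(condition), frozenset(result)


def confident_rules(itemset, support_of, min_confidence):
    """Keep the splits whose confidence is at least `min_confidence`.

    `support_of` maps each frequent set (a frozenset) to its support.
    """
    whole = support_of[frozenset(itemset)]
    kept = []
    for condition, result in candidate_rules(itemset):
        conf = whole / support_of[condition]
        if conf >= min_confidence:
            kept.append((condition, result, conf))
    return kept


# Supports of {B}, {E} and {B, E} in Table 10.1.
SUPPORT = {
    frozenset({"B"}): Fraction(4, 5),
    frozenset({"E"}): Fraction(3, 5),
    frozenset({"B", "E"}): Fraction(3, 5),
}

for condition, result, conf in confident_rules({"B", "E"}, SUPPORT, Fraction(7, 10)):
    print(sorted(condition), "=>", sorted(result), "confidence", conf)
```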
In practice, however, there are still a very large number of rules remaining, and most
packages offer an option for storing these rules in a file, in which the Condition ⇒ Result rules
can be filtered up to a certain value of the support index, and can be sorted according to their
support, confidence or lift. This file is often a text file, but SAS Enterprise Miner can store the
rules in an SAS table.
¹ Agrawal, R., Imielinski, T. and Swami, A.N. (1993). Mining association rules between sets of items in large
databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pp. 207–216.
New York: ACM Press.
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A. I. (1995). Fast discovery of association rules.
In Advances in Knowledge Discovery and Data Mining, pp. 307–328. Cambridge, MA: AAAI Press/MIT Press.
The requirement in respect of the confidence threshold is generally stricter than for the
support threshold; a common example of a filter is 75% for confidence and 5% for support
(and 1 for lift, of course).
However, even with these filters, the number of rules soon becomes dizzyingly high, up to
several million for just a few hundred items and a few thousand observations. Indeed, this
number increases exponentially with the decrease in the minimum support and an increase in
the number of items in each rule. In fact, not only are almost all of these rules uninteresting or
well known already (cheese goes with bread and wine, white wine goes with oysters, nails go
with a hammer, and so on), but, purely in terms of computing power, it may be impossible to
process and store so many rules. So some packages offer a useful option for adding a filter on
the content of the rules, making it possible to retain only the rules which contain a certain item
in their consequent or antecedent. This functionality is even more useful because we often
seek rules that ‘predict’ a certain behaviour, where the consequent contains certain items
specified in advance. Among the commercial software programs, IBM SPSS Modeler has this
functionality (Figure 10.1).
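The filters discussed above (thresholds on support, confidence and lift, plus a filter on the content of the consequent) can be sketched as follows. The `Rule` class and the `select_rules` function are hypothetical names introduced for illustration and do not reproduce the interface of any package; the three rules and their indices are taken from the Titanic example of Section 10.5, expressed as proportions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    condition: frozenset
    result: frozenset
    support: float
    confidence: float
    lift: float


def select_rules(rules, min_support=0.05, min_confidence=0.75, min_lift=1.0,
                 required_in_result=None, sort_by="lift"):
    """Filter rules on the three indices and, optionally, on an item of the result."""
    kept = [
        r for r in rules
        if r.support >= min_support
        and r.confidence >= min_confidence
        and r.lift > min_lift  # a lift of exactly 1 brings no improvement
        and (required_in_result is None or required_in_result in r.result)
    ]
    # Highest value of the chosen index first.
    return sorted(kept, key=lambda r: getattr(r, sort_by), reverse=True)


RULES = [
    Rule(frozenset({"SEX=1"}), frozenset({"AGE=1"}), 0.7574, 0.9630, 1.01),
    Rule(frozenset({"SEX=0", "CLASS=1"}), frozenset({"SURVIVED=1"}), 0.0641, 0.9724, 3.01),
    Rule(frozenset({"CLASS=0"}), frozenset({"SURVIVED=0", "SEX=1"}), 0.3044, 0.7571, 1.22),
]

for rule in select_rules(RULES, required_in_result="SURVIVED=1"):
    print(sorted(rule.condition), "=>", sorted(rule.result), rule.lift)
```

Restricting the consequent before sorting is the cheaper order of operations when, as here, only rules predicting a given behaviour are of interest.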
The packages also enable us to set a limit to the size of the rules, in other words to the
number of items they contain. We would rarely need to go beyond 10 items. Note that some
packages, but not all, permit consequents with more than one item. This is the case with
SAS Enterprise Miner and IBM SPSS Modeler, but not the freeware developed by Christian
Borgelt.² However, this package is often mentioned and used, or implemented in other
software (such as R and its arules package,³ and also Tanagra⁴), because of its high speed,
making it suitable for detecting a large number of rules.
Figure 10.1 Parameter setting in IBM SPSS Modeler.
² Downloadable from http://fuzzy.cs.uni-magdeburg.de/%7Eborgelt/software.html, or more directly from http://www.borgelt.net//apriori.html.
Interesting rules are those which are non-trivial, usable in practice, and preferably explicable.
10.2 Using taxonomy
Products can be defined at a more or less fine level of detail. For example, we may consider:
• savings products in banking, finance, etc.;
• among the bank savings products, there are current accounts, passbooks, etc.;
• among passbooks, there are instant savings, building society savings, post office savings
accounts, and so on.
The taxonomy of products is the set of these levels, with its hierarchy. The finest level
enables us to undertake more accurate marketing operations. However, working at the finest
level multiplies the rules, many of which will only have low support and must therefore be
eliminated. Working at the most general level enables us to have stronger rules. Both
viewpoints have their advantages and disadvantages. A good compromise is to adapt the
level of generality to each product, based on its scarcity, for example.
Products which are scarcest and most expensive (e.g. microcomputers or hi-fi in a
department store) will be coded at a finer level, whereas more common products (e.g. food
products) will be coded at a more general level. By way of example, we can group all
yogurts, cheeses, creams, etc., into ‘dairy products’, while making a distinction between
DVD players and camcorders. Even in this example, we can see that the finest level that is
of any use is most often the level of the product type (e.g. television), in other words
the level of the department or sub-department, rather than the identification number of
the product (such as the European Article Numbering (EAN) code, or Stock Keeping Unit (SKU), which
is the reference number of the product in the stores or in the catalogue). A level as fine as
the SKU, which identifies everything down to the format and colour of the product, is
rarely useful.
The value of this procedure is that it can provide more relevant rules, in which the
commonest products do not hide the less common ones purely because of their frequency.
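One way of adapting the level of generality to each product is sketched below: a product is replaced by its department when that department is common, and kept at the fine level when it is scarce. The taxonomy, the product names, the function name `recode` and the 75% threshold are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical two-level taxonomy: product -> department.
PARENT = {
    "plain yogurt": "dairy products",
    "goat cheese": "dairy products",
    "single cream": "dairy products",
    "DVD player": "video equipment",
    "camcorder": "video equipment",
}


def recode(baskets, parent, common_share):
    """Code products of common departments at the general level, others at the fine level."""
    n = len(baskets)
    # Number of baskets in which each department appears at least once.
    dept_count = Counter()
    for basket in baskets:
        dept_count.update({parent.get(product, product) for product in basket})

    def code(product):
        dept = parent.get(product, product)
        return dept if dept_count[dept] / n >= common_share else product

    return [{code(product) for product in basket} for basket in baskets]


BASKETS = [
    {"plain yogurt", "camcorder"},
    {"goat cheese", "single cream"},
    {"plain yogurt", "DVD player"},
    {"goat cheese"},
]

for basket in recode(BASKETS, PARENT, common_share=0.75):
    print(sorted(basket))
```

In this toy data set the dairy department appears in every basket and is therefore coded at the general level, while DVD players and camcorders remain distinct.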
The best market basket analyses are therefore generally carried out on the basis of
different levels of the product taxonomy. In all cases, even if just one level is used, the products
in the transactions analysed must be carefully coded, to clearly distinguish a separate product
from an option which is not to be taken into account. For each product, we must also ask what
the most important property in the associations is to be: is it the type of product, its brand, or
maybe its size (for clothes)?
³ See: http://cran.univ-lyon1.fr/web/packages/arules/index.html and http://rss.acs.unt.edu/Rdoc/library/arules/html/apriori.html.
⁴ See: http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/fr_Tanagra_Assoc_Rules_Comparison.pdf.
10.3 Using supplementary variables
In addition to the products in a market basket, events relating to customers, etc., the
transaction lines analysed may include supplementary variables such as the date and time
of the transaction, or the method of payment. These enable us to detect rules such as:
If Nappies and Saturday, then Beer.
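In practice, a supplementary variable is simply turned into one more item of the transaction. The sketch below adds the day of the week and the method of payment to a basket; the function name, the `Payment=` prefix and the example date are assumptions made for illustration.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def with_supplementary_items(products, purchase_date, payment_method):
    """Return the basket enriched with a day-of-week item and a payment item."""
    items = set(products)
    items.add(DAY_NAMES[purchase_date.weekday()])  # weekday(): Monday is 0
    items.add("Payment=" + payment_method)
    return items


basket = with_supplementary_items({"Nappies", "Beer"}, date(2011, 1, 8), "card")
print(sorted(basket))
```

Once the day is an item like any other, a rule such as the one above can be detected by the same algorithm without modification.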
By adding temporal variables, we can look for the sequence of events which ends with the
purchase of a new product, the departure of the customer, or the like. In this case, we speak of
temporal associations.
Other information may be found here, such as the name of a manufacturer which
is included with some product types. Thus a market basket analysis can detect brand
loyalty phenomena.
For this purpose, the data to be analysed are presented as follows:
             Product 1                      Product 2                      . . .
Customer A   Type  Brand  Purchase date     Type  Brand  Purchase date     . . .
Customer B   Type  Brand  Purchase date     Type  Brand  Purchase date     . . .
. . .        . . .                          . . .                          . . .
In mail order, insurance and banking, we can also add some information about the distribution
channel: shop/agency, telephone, Internet, etc.
10.4 Applications
The method of finding association rules has been used widely since the 1960s in the retail
industry for analysing market baskets, stocking departments, organizing promotions,
managing stocks to prevent shortages and overstocks, etc. It is also useful for detecting
associations of options chosen in packaged products (in banking, telephony, insurance, etc.)
or associations of terms in a corpus of documents. It can be applied to any kind of items; for
example, it can be used to detect rules in sports, such as: if player X is on the field and the
match takes place in given circumstances, then player Y scores more goals in 70% of cases.
As mentioned above, the main problem in implementing this method is the large number
of irrelevant association rules which may submerge the relevant ones. This problem can be
mitigated by using filters and taxonomies. However, some rules with high lifts and confidence
indices may pass unnoticed because their support indices are below the threshold which had
to be specified in order to prevent the numbers of rules becoming impossible to process.
Hastie et al. (2009)⁵ offer the light-hearted example of ‘vodka ⇒ caviar’ which is penalized
by the scarcity of the consequent.
⁵ Hastie, T., Tibshirani, R. and Friedman, J.H. (2009) The Elements of Statistical Learning: Data Mining,
Inference and Prediction, 2nd edn. New York: Springer.
Naturally, a huge amount of computation power is needed to analyse the market baskets of
a hypermarket with several tens of millions of products on its lists and several million
transactions per year.⁶ Association detection algorithms are provided in data mining programs
available for a client–server system, such as SAS Enterprise Miner and IBM SPSS Modeler
(Figure 10.2), as well as in freeware such as R, Tanagra, RapidMiner and Weka.
Figure 10.2 Association rules detection in IBM SPSS Modeler.
⁶ The total number of transactions in all the Wal-Mart stores is more than 20 million per day!
10.5 Example of use
If we start with an ordinary data set in the form of ‘individuals × variables’, most software
packages require one or two preliminary procedures of data preparation. We can illustrate this
using the Titanic data set which we examined in Section 3.12, dealing with interactions, and
which we will use again in Section 11.8.13 for the development of a logistic model.
To start with, we must convert the data from observations in form 1 (tabular):
Individual  Age    Sex    Class  Survived
1           A      F      1      Y
2           A      M      3      N
3           C      M      1      Y
. . .       . . .  . . .  . . .  . . .
to form 2:
1      Age=A  Sex=F  Class=1  Survived=Y
2      Age=A  Sex=M  Class=3  Survived=N
3      Age=C  Sex=M  Class=1  Survived=Y
. . .  . . .  . . .  . . .    . . .
and sometimes to form 3 (transactional):
1      Age=A
1      Sex=F
1      Class=1
1      Survived=Y
2      Age=A
2      Sex=M
2      Class=3
. . .  . . .
To consider only three examples, the freeware by C. Borgelt processes form 2, but SAS
Enterprise Miner requires form 3, while IBM SPSS Modeler can handle forms 1 and 3.
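The relationship between the three forms can be illustrated independently of any package with a short Python sketch. The function names `to_form2` and `to_form3` are illustrative assumptions, and the three rows are those shown in form 1 above.

```python
# Form 1: one record per individual, one field per variable.
ROWS = [
    {"Individual": 1, "Age": "A", "Sex": "F", "Class": "1", "Survived": "Y"},
    {"Individual": 2, "Age": "A", "Sex": "M", "Class": "3", "Survived": "N"},
    {"Individual": 3, "Age": "C", "Sex": "M", "Class": "1", "Survived": "Y"},
]


def to_form2(rows, key="Individual"):
    """One line per individual: the key followed by its 'variable=value' items."""
    return {
        row[key]: [f"{name}={value}" for name, value in row.items() if name != key]
        for row in rows
    }


def to_form3(rows, key="Individual"):
    """Transactional form: one (key, item) line per individual and variable."""
    return [
        (row[key], f"{name}={value}")
        for row in rows
        for name, value in row.items()
        if name != key
    ]


print(to_form2(ROWS)[1])
for key, item in to_form3(ROWS)[:4]:
    print(key, item)
```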
The following SAS code can be used to create form 3, used by SAS, directly. First, we
must add a key, in other words a unique identifier of each individual, if the file does not already
have one. The input file will then be transposed with respect to this key by an association rules
detection program. Note that the variables are numeric in this example.
DATA titanic ;
  SET sasuser.titanic ;
  id = _n_ ;   /* observation number used as the key */
RUN ;
Obs  CLASS  AGE  SEX  SURVIVED  ID
  1      1    1    1         1   1
  2      1    1    1         1   2
  3      1    1    1         1   3
  4      1    1    1         1   4
  5      1    1    1         1   5
  6      1    1    1         1   6
  7      1    1    1         1   7
  8      1    1    1         1   8
  9      1    1    1         1   9
 10      1    1    1         1  10
The following transposition transforms the ‘individuals × variables’ data set into a data set
with one line per (individual, variable) pair with the name of the variable in _name_ (‘name of
the former variable’) and its content in var1, where ‘var’ is the prefix specified in the
TRANSPOSE procedure. Since the variable ‘ID’ has also been transposed (all the variables
have been transposed, by the VAR _all_ instruction), the corresponding lines are deleted from
the TRANSPO file.
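/* Input is the TITANIC data set created above; the comparison is made   */
/* case-insensitive so that the transposed key lines are always removed. */
PROC TRANSPOSE DATA=titanic OUT=transpo (WHERE = (UPCASE(_name_) NE "ID"))
     PREFIX=var ;
  BY id ;
  VAR _all_ ;
RUN ;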
Obs  ID  NAME OF THE FORMER VARIABLE  var1
  1   1  CLASS                           1
  2   1  AGE                             1
  3   1  SEX                             1
  4   1  SURVIVED                        1
  5   2  CLASS                           1
  6   2  AGE                             1
  7   2  SEX                             1
  8   2  SURVIVED                        1
  9   3  CLASS                           1
 10   3  AGE                             1
 11   3  SEX                             1
 12   3  SURVIVED                        1
A DATA step then transforms the preceding data set into form 3, as mentioned above, by
concatenating the name of each variable with its content:
DATA titanic_assoc (KEEP = id item) ;
  SET transpo ;
  LENGTH item $20 ;
  item = CATX('=', _name_, var1) ;   /* e.g. CLASS=1 */
RUN ;
Obs  key  item
  1    1  CLASS=1
  2    1  AGE=1
  3    1  SEX=1
  4    1  SURVIVED=1
  5    2  CLASS=1
  6    2  AGE=1
  7    2  SEX=1
  8    2  SURVIVED=1
  9    3  CLASS=1
 10    3  AGE=1
 11    3  SEX=1
 12    3  SURVIVED=1
A data set in the above form (form 3) can be analysed in Enterprise Miner. Other packages
require form 2, and the file can be transposed again from form 3 to form 2 by using the ID as
a pivot.
PROC TRANSPOSE DATA=titanic_assoc OUT=titanic_assoc2 (DROP =_name_)
PREFIX=var ;
BY id ;
VAR item ;
RUN ;
Obs  ID  var1     var2   var3   var4
  1   1  CLASS=1  AGE=1  SEX=1  SURVIVED=1
  2   2  CLASS=1  AGE=1  SEX=1  SURVIVED=1
  3   3  CLASS=1  AGE=1  SEX=1  SURVIVED=1
  4   4  CLASS=1  AGE=1  SEX=1  SURVIVED=1
  5   5  CLASS=1  AGE=1  SEX=1  SURVIVED=1
  6   6  CLASS=1  AGE=1  SEX=1  SURVIVED=1
  7   7  CLASS=1  AGE=1  SEX=1  SURVIVED=1
  8   8  CLASS=1  AGE=1  SEX=1  SURVIVED=1
  9   9  CLASS=1  AGE=1  SEX=1  SURVIVED=1
 10  10  CLASS=1  AGE=1  SEX=1  SURVIVED=1
Figure 10.3 shows the parameter setting screen of the Association node of SAS Enterprise
Miner, which is applied to the data set in form 3 (transactional). This results in the 17 rules
shown in Figure 10.4. As mentioned above, the package provides an option for storing these
rules in an SAS data set or exporting them in another format.
The first rule is: male ⇒ adult (SEX=1 ⇒ AGE=1). It relates to 1667 individuals out of
2201 passengers on the Titanic, i.e. the support index is 75.74%. As there are 1731 males, of
whom 1667 are adults, the confidence index is 96.30%. The lift of this rule is its confidence
index divided by the probability of being an adult, which is 95.05% (2092 out of 2201
passengers). This is only 1.01, and the 96.30% is only very slightly greater than the 95.05%
confidence achieved by trivial prediction. This rule is therefore of low interest.
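The indices of the rule male ⇒ adult can be re-derived from the four counts quoted in this paragraph. The sketch below is illustrative only: the function name `rule_indices` is an assumption, and ‘expected confidence’ is used here for the a priori probability of the result, i.e. the confidence achieved by trivial prediction.

```python
def rule_indices(n_total, n_condition, n_result, n_both):
    """Return (support, confidence, expected confidence, lift) from raw counts."""
    support = n_both / n_total
    confidence = n_both / n_condition
    expected_confidence = n_result / n_total  # a priori probability of the result
    lift = confidence / expected_confidence
    return support, confidence, expected_confidence, lift


# 2201 individuals, 1731 males, 2092 adults, 1667 male adults.
s, c, e, l = rule_indices(2201, 1731, 2092, 1667)
print(f"support {s:.2%}, confidence {c:.2%}, expected confidence {e:.2%}, lift {l:.2f}")
```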
The rule with the strongest lift is ‘SURVIVED=0 & CLASS=0 ⇒ SEX=1 & AGE=1’:
drowned + member of crew ⇒ male + adult. The lift is 99.55% (confidence) divided by
75.74% (this percentage of passengers are male adults), i.e. 1.31. But is this prediction really
useful? What we need is rules in which survival or drowning appears in the consequents (results)
and not in the antecedents (conditions). None of the above 17 rules meets this condition.
Figure 10.3 Parameter setting for association detection in SAS Enterprise Miner.
Figure 10.4 Result of association detection in SAS Enterprise Miner.
If we choose a support threshold of 5% and a confidence threshold of 75%, we go from 17 to
62 rules. The first three rules concern the prediction of survival, and they also have interesting
lifts, all three being greater than 3. The second rule of the three has the strongest confidence and
support indices. It is stated thus: female + first class ⇒ survived. As the survivors are only
32.30% of the total, this rule, which is true in 141 out of 145 cases of first class and females, i.e.
97.24% confidence, provides real information with a lift of 97.24/32.30 = 3.01. This very
reliable criterion of survival will also appear in the decision tree in Section 11.4.2.
SET_SIZE  EXP_CONF   CONF  SUPPORT  LIFT   COUNT  RULE
       4     29.71  96.55     6.36  3.25  140.00  SEX=0 & CLASS=1 ⇒ SURVIVED=1 & AGE=1
       3     32.30  97.24     6.41  3.01  141.00  SEX=0 & CLASS=1 ⇒ SURVIVED=1
       4     32.30  97.22     6.36  3.01  140.00  SEX=0 & CLASS=1 & AGE=1 ⇒ SURVIVED=1
       4     60.38  75.71    30.44  1.25  670.00  CLASS=0 ⇒ SURVIVED=0 & SEX=1 & AGE=1
       3     61.97  75.71    30.44  1.22  670.00  CLASS=0 ⇒ SURVIVED=0 & SEX=1
       4     61.97  75.71    30.44  1.22  670.00  CLASS=0 & AGE=1 ⇒ SURVIVED=0 & SEX=1
   . . .     . . .  . . .    . . .  . . .  . . .  . . .
       2     95.05  96.51    65.33  1.02  1438.0  SURVIVED=0 ⇒ AGE=1
       2     95.05  96.30    75.74  1.01  1667.0  SEX=1 ⇒ AGE=1
As shown in Figure 10.5, SAS Enterprise Miner also displays the most frequent items,
namely those whose frequency exceeds 5% (the support threshold that was set previously) of
the number of individuals, which is 2201 in this case.
Figure 10.5 The most frequent items in SAS Enterprise Miner.
A note on redundant rules. In the example above, rule 13,
SURVIVED=0 & CLASS=0 ⇒ SEX=1,
and rule 15,
SURVIVED=0 & CLASS=0 ⇒ SEX=1 & AGE=1,
have exactly the same support (670 observations), because rule 14,
SURVIVED=0 & CLASS=0 ⇒ AGE=1,
is always true. Rule 13 is therefore redundant with respect to rule 15, because rule 15 implies rule 13.
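This kind of redundancy can be detected mechanically. The sketch below flags a rule as redundant when another rule has the same condition, the same count and a strictly larger result; this criterion is an assumption chosen to match the case just described, not the definition used by any particular package, and the function name is hypothetical.

```python
def redundant_rules(rules):
    """Return the rules implied by another rule with the same condition and count.

    Each rule is a (condition, result, count) triple, with frozensets of items.
    """
    redundant = []
    for condition, result, count in rules:
        for other_condition, other_result, other_count in rules:
            if (condition == other_condition
                    and count == other_count
                    and result < other_result):  # strict subset of the other result
                redundant.append((condition, result, count))
                break
    return redundant


CONDITION = frozenset({"SURVIVED=0", "CLASS=0"})
RULE_13 = (CONDITION, frozenset({"SEX=1"}), 670)
RULE_15 = (CONDITION, frozenset({"SEX=1", "AGE=1"}), 670)

for condition, result, count in redundant_rules([RULE_13, RULE_15]):
    print(sorted(condition), "=>", sorted(result), count)
```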