data mining and statistics for decision making (tufféry/data mining and statistics for decision...

10

Association analysis

The detection of association rules is another descriptive method which is very popular in data mining, especially in such areas as web mining, where it is used to analyse the pages visited by a web user, and the retail industry, where it can analyse the products bought by a customer on a single visit. This explains the alternative name for this method: market basket analysis. Of course, this method can be usefully applied to other activities as well. It does not have the same theoretical difficulties as clustering and classification methods; instead, the difficulties arise from the need to process enormous volumes of data (up to several million till receipts, for example) and to pick out new and interesting associations from the overwhelming majority of irrelevant or previously known associations.

10.1 Principles

Finding association rules is a matter of finding rules of the following type: ‘If, for any one individual, variable A = x_A, variable B = x_B, and so on, then, in 80% of cases, variable Z = x_Z, and this configuration is found for 20% of the individuals.’ In other words, the aim is to find the most frequent combined values of a set of variables of a data set. In market basket analysis, the variables are the indicators of the products, and the rules are applied to indicators equal to 1, in other words the products bought. Note that some recent research has been carried out on ‘negative’ rules, where we are interested in the products that are not bought.

The value of 80% is called the index of confidence and the value of 20% is called the support index of the rule {A = x_A, B = x_B, ...} ⇒ {Z = x_Z}. The first part of the rule is called the ‘antecedent’ or ‘condition’; the second part is called the ‘consequent’ or ‘result’; and expressions of the form {A = x_A} are called ‘items’. In an association rule, an item can never be in both the condition and the result simultaneously.

Data Mining and Statistics for Decision Making, First Edition. Stéphane Tufféry.

© 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd. ISBN: 978-0-470-68829-8

A rule is therefore an expression of the form:

If Condition, then Result.

Here is an example taken from marketing (mythical, if not veracious):

If Nappies and Saturday, then Beer.

The support index is the probability

\[ \mathrm{Prob}(\text{Condition and Result}). \]

The confidence index is the probability

\[ \frac{\mathrm{Prob}(\text{Condition and Result})}{\mathrm{Prob}(\text{Condition})}. \]

Naturally, the aim is to find association rules for which the support and confidence are

above specified minimum thresholds.

For example, in the transactions shown in Table 10.1, where each row corresponds to a market basket TX, and each column corresponds to a product A, B, ..., the confidence index of the association B ⇒ E is 3/4 and its support index is 3/5. Similarly, the confidence index of the association C ⇒ B is 2/3 and its support index is 2/5. One thing is evident: B is present in almost all the transactions, or more precisely the a priori probability of having B there is 0.8. This probability is greater than the confidence index for C ⇒ B, and therefore the rule C ⇒ B is not helpful for predicting B. If we say that a transaction taken at random contains B, there is only one chance in five that we will be wrong, as against one chance in three if we follow the rule C ⇒ B.

The improvement brought by a rule, by comparison with a random response, is called the

lift (or simply the ‘improvement’), and is as follows:

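\[ \mathrm{lift}(\text{rule}) = \frac{\text{confidence index}(\text{rule})}{\mathrm{Prob}(\text{Result})} = \frac{\mathrm{Prob}(\text{Condition and Result})}{\mathrm{Prob}(\text{Condition}) \times \mathrm{Prob}(\text{Result})}. \]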

When the ‘result’ is independent of the ‘condition’, the lift is clearly equal to 1. If the lift is less than 1, the rule does not help. Thus we find that lift(C ⇒ B) = 5/6 (useless rule) and lift(B ⇒ E) = 5/4 (useful rule).

Table 10.1 Set of transactions.

T26     A B C D E
T163    B C E F
T1728   B E
T2718   A B D
T3141   C D

But note that, if the lift of the rule

Condition ⇒ Result

is less than 1, then the lift of the inverse rule, i.e. the rule

Condition ⇒ NOT Result,

is greater than 1, since

\[ \text{confidence index}(\text{inverse rule}) = 1 - \text{confidence index}(\text{rule}) \]

and

\[ \mathrm{Prob}(\text{NOT Result}) = 1 - \mathrm{Prob}(\text{Result}). \]

If a rule is not useful, we can try using the inverse rule, in the hope that it will be helpful for business or marketing purposes.
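The indices defined above can be checked on the five transactions of Table 10.1 with a short Python sketch. The function names (`prob`, `support`, `confidence`, `lift`, `inverse_lift`) are illustrative choices, not taken from any package, and exact fractions are used so that the results can be compared directly with the values quoted in the text.

```python
from fractions import Fraction

# Transactions of Table 10.1: each market basket is a set of products.
TRANSACTIONS = [
    {"A", "B", "C", "D", "E"},  # T26
    {"B", "C", "E", "F"},       # T163
    {"B", "E"},                 # T1728
    {"A", "B", "D"},            # T2718
    {"C", "D"},                 # T3141
]


def prob(items):
    """Proportion of transactions that contain every item in `items`."""
    hits = sum(1 for basket in TRANSACTIONS if set(items) <= basket)
    return Fraction(hits, len(TRANSACTIONS))


def support(condition, result):
    """Prob(Condition and Result)."""
    return prob(set(condition) | set(result))


def confidence(condition, result):
    """Prob(Condition and Result) / Prob(Condition)."""
    return support(condition, result) / prob(condition)


def lift(condition, result):
    """Confidence index divided by Prob(Result)."""
    return confidence(condition, result) / prob(result)


def inverse_lift(condition, result):
    """Lift of the inverse rule Condition => NOT Result.

    Undefined (division by zero) if the result is in every transaction.
    """
    return (1 - confidence(condition, result)) / (1 - prob(result))


for condition, result in [({"B"}, {"E"}), ({"C"}, {"B"})]:
    print(sorted(condition), "=>", sorted(result),
          "support", support(condition, result),
          "confidence", confidence(condition, result),
          "lift", lift(condition, result))
```

For the rule C ⇒ B, whose lift is below 1, `inverse_lift({"C"}, {"B"})` is greater than 1, which illustrates the remark about inverse rules.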

The main algorithm for detecting association rules is the Apriori algorithm proposed by

Agrawal and other researchers.1

Apriori operates in two steps, which have become standard for this type of algorithm:

. It starts by searching for the subsets of items having a probability of appearance

(support) above a certain threshold.

. Then it attempts to break down each subset in a form {Condition ∪ Result} such that the

quotient Prob(Condition and Result)/Prob(Condition), i.e. the confidence index, is

above a certain threshold.

In the first step, Apriori starts by making a first pass through the data, to eliminate all the items which are less frequent than the specified minimum support. It then performs a second pass, in order to construct all the sets of items with two elements, formed from the items retained previously. Of these sets, it only retains those whose frequency exceeds the specified minimum support. On each pass, Apriori retains only the sets of items which are more frequent than the support threshold, out of all those constructed on the basis of the sets from the previous pass and the items selected in the first pass. The frequent sets of items with a size of n which are useful for our purposes are those constructed from sets with a size of n − 1 which are themselves frequent. The first optimization of Apriori is that only a single pass is required for each value of n.

The difficulty of implementing the search for rules is due to the exponential growth of the number of rules with the number of items. For each subset of items E with n elements, there are 2^(n−1) − 1 rules of the form A ⇒ {E − A}, and therefore the same number of possible breakdowns in the second step. Another improvement provided by the designers of Apriori is a way of quickly identifying the rules which may exceed the fixed threshold of the confidence index. Because of these advantages, the Apriori algorithm is the most widespread and most commonly implemented algorithm for detecting association rules.
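The two steps described above can be sketched in Python as follows. This is a minimal illustration of the level-wise principle, not the optimized algorithm of Agrawal et al. or the code of any package: the function names `frequent_itemsets` and `association_rules` are hypothetical, transactions are assumed to be Python sets, and thresholds are best passed as exact fractions to avoid rounding effects at the boundary.

```python
from fractions import Fraction
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset reaching min_support.

    `transactions` is a list of sets of items.
    """
    n = len(transactions)

    def count_pass(candidates):
        # One pass over the data counts all candidates of the current size.
        counts = dict.fromkeys(candidates, 0)
        for basket in transactions:
            for candidate in candidates:
                if candidate <= basket:
                    counts[candidate] += 1
        return {s: Fraction(c, n) for s, c in counts.items()
                if Fraction(c, n) >= min_support}

    # Size 1: collect the distinct items, then keep the frequent ones.
    items = {frozenset([item]) for basket in transactions for item in basket}
    current = count_pass(items)
    frequent = dict(current)
    size = 2
    while current:
        previous = set(current)
        # Candidates are unions of frequent sets of the previous size...
        candidates = {a | b for a in previous for b in previous
                      if len(a | b) == size}
        # ...whose subsets of the previous size are all frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous
                             for s in combinations(c, size - 1))}
        current = count_pass(candidates)
        frequent.update(current)
        size += 1
    return frequent


def association_rules(frequent, min_confidence):
    """Break each frequent itemset down into Condition => Result rules."""
    rules = []
    for itemset, supp in frequent.items():
        for k in range(1, len(itemset)):
            for condition in map(frozenset, combinations(itemset, k)):
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already available.
                conf = supp / frequent[condition]
                if conf >= min_confidence:
                    rules.append((condition, itemset - condition, supp, conf))
    return rules
```

Applied to the transactions of Table 10.1 with a minimum support of `Fraction(2, 5)` and a minimum confidence of `Fraction(3, 4)`, the sketch keeps B ⇒ E and rejects C ⇒ B, whose confidence is only 2/3.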

In practice, however, there are still a very large number of rules remaining, and most packages offer an option for storing these rules in a file, in which the Condition ⇒ Result rules can be filtered up to a certain value of the support index, and can be sorted according to their support, confidence or lift. This file is often a text file, but SAS Enterprise Miner can store the rules in an SAS table.

1 Agrawal, R., Imielinski, T. and Swami, A.N. (1993). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pp. 207–216. New York: ACM Press.
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A.I. (1995). Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pp. 307–328. Cambridge, MA: AAAI Press/MIT Press.

The requirement in respect of the confidence threshold is generally stricter than for the

support threshold; a common example of a filter is 75% for confidence and 5% for support

(and 1 for lift, of course).

However, even with these filters, the number of rules soon becomes dizzyingly high, up to

several million for just a few hundred items and a few thousand observations. Indeed, this

number increases exponentially with the decrease in the minimum support and an increase in

the number of items in each rule. In fact, not only are almost all of these rules uninteresting or

well known already (cheese goes with bread and wine, white wine goes with oysters, nails go

with a hammer, and so on), but, purely in terms of computing power, it may be impossible to

process and store so many rules. So some packages offer a useful option for adding a filter on

the content of the rules, making it possible to retain only the rules which contain a certain item

in their consequent or antecedent. This functionality is even more useful because we often

seek rules that ‘predict’ a certain behaviour, where the consequent contains certain items

specified in advance. Among the commercial software programs, IBM SPSS Modeler has this

functionality (Figure 10.1).

The packages also enable us to set a limit to the size of the rules, in other words to the

number of items they contain. We would rarely need to go beyond 10 items. Note that some

packages, but not all, permit consequents with more than one item. This is the case with

SAS Enterprise Miner and IBM SPSS Modeler, but not the freeware developed by Christian

Borgelt.2 However, this package is often mentioned and used, or implemented in other software (such as R and its arules package,3 and also Tanagra4), because of its high speed, making it suitable for detecting a large number of rules.

Figure 10.1 Parameter setting in IBM SPSS Modeler.

2 Downloadable from http://fuzzy.cs.uni-magdeburg.de/%7Eborgelt/software.html, or more directly from http://www.borgelt.net//apriori.html.

Interesting rules are those which are non-trivial, usable in practice, and preferably explicable.

10.2 Using taxonomy

Products can be defined at a more or less fine level of detail. For example, we may consider:

. savings products in banking, finance, etc.;

. among the bank savings products, there are current accounts, passbooks, etc.;

. among passbooks, there are instant savings, building society savings, post office savings

accounts, and so on.

The taxonomy of products is the set of these levels, with its hierarchy. The finest level

enables us to undertake more accurate marketing operations. However, working at the finest

level multiplies the rules, many of which will only have low support and must therefore be

eliminated. Working at the most general level enables us to have stronger rules. Both

viewpoints have their advantages and disadvantages. A good compromise is to adapt the

level of generality to each product, based on its scarcity, for example.

Products which are scarcest and most expensive (e.g. microcomputers or hi-fi in a

department store) will be coded at a finer level, whereas more common products (e.g. food

products) will be coded at a more general level. By way of example, we can group all

yogurts, cheeses, creams, etc., into ‘dairy products’, while making a distinction between

DVD players and camcorders. Even in this example, we can see that the finest level that is

of any use is most often the level of the product type (e.g. television), in other words

the level of the department or sub-department, rather than the identification number of

the product (such as the European Article Number (EAN), or the Stock Keeping Unit (SKU), which

is the reference number of the product in the stores or in the catalogue). A level as fine as

the SKU, which identifies everything down to the format and colour of the product, is

rarely useful.
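As an illustration of coding each product at a level of generality adapted to its scarcity, here is a minimal Python sketch. The product names, the two-level taxonomy and the set `FINE_DEPARTMENTS` are all hypothetical; in practice they would come from the retailer's product reference tables.

```python
# Hypothetical taxonomy: product reference -> (department, product type).
TAXONOMY = {
    "strawberry yogurt 4x125g": ("dairy products", "yogurt"),
    "camembert 250g": ("dairy products", "cheese"),
    "single cream 20cl": ("dairy products", "cream"),
    "DVD player X100": ("hi-fi and video", "DVD player"),
    "camcorder Z5": ("hi-fi and video", "camcorder"),
}

# Departments whose products are scarce or expensive enough
# to be coded at the finer level (product type).
FINE_DEPARTMENTS = {"hi-fi and video"}


def generalize(basket):
    """Recode a basket of product references as a set of items,
    each taken at the level of generality chosen for its department."""
    items = set()
    for product in basket:
        department, product_type = TAXONOMY[product]
        if department in FINE_DEPARTMENTS:
            items.add(product_type)   # finer level
        else:
            items.add(department)     # more general level
    return items


print(generalize(["strawberry yogurt 4x125g", "camembert 250g", "DVD player X100"]))
```

With this coding, the yogurt and the cheese both become the single item ‘dairy products’, while the DVD player remains distinct from a camcorder.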

The value of this procedure is that it can provide more relevant rules, in which the

commonest products do not hide the less common ones purely because of their frequency.

The best market basket analyses are therefore generally carried out on the basis of

different levels of the product taxonomy. In all cases, even if just one level is used, the products

in the transactions analysed must be carefully coded, to clearly distinguish a separate product

from an option which is not to be taken into account. For each product, we must also ask what

the most important property in the associations is to be: is it the type of product, its brand, or

maybe its size (for clothes)?

3 See: http://cran.univ-lyon1.fr/web/packages/arules/index.html and http://rss.acs.unt.edu/Rdoc/library/arules/html/apriori.html.

4 See: http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/fr_Tanagra_Assoc_Rules_Comparison.pdf.

10.3 Using supplementary variables

In addition to the products in a market basket, events relating to customers, etc., the

transaction lines analysed may include supplementary variables such as the date and time

of the transaction, or the method of payment. These enable us to detect rules such as:

If Nappies and Saturday, then Beer.
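One simple way of making such rules detectable is to treat each supplementary variable as an extra item in the transaction. The Python sketch below assumes a hypothetical helper `basket_items` and an optional payment field; the day of the week is derived from the transaction date with a fixed list of English names so that the result does not depend on the system locale.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def basket_items(products, purchase_date, payment=None):
    """Return the items of a transaction: the products bought plus
    items derived from the supplementary variables."""
    items = set(products)
    # date.weekday() returns 0 for Monday ... 6 for Sunday.
    items.add(WEEKDAYS[purchase_date.weekday()])
    if payment is not None:
        items.add("Payment=" + payment)
    return items


print(basket_items(["Nappies", "Beer"], date(2011, 3, 5), payment="card"))
```

The resulting sets can then be passed to any association detection program in the same way as ordinary market baskets.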

By adding temporal variables, we can look for the sequence of events which ends with the

purchase of a new product, the departure of the customer, or the like. In this case, we speak of

temporal associations.

Other information may be found here, such as the name of a manufacturer which

is included with some product types. Thus a market basket analysis can detect brand

loyalty phenomena.

For this purpose, the data to be analysed are presented as follows:

             Product 1                      Product 2                      ...
Customer A   Type  Brand  Purchase date     Type  Brand  Purchase date     ...
Customer B   Type  Brand  Purchase date     Type  Brand  Purchase date     ...
...          ...   ...    ...               ...   ...    ...               ...

In mail order, insurance and banking, we can also add some information about the distribution

channel: shop/agency, telephone, Internet, etc.

10.4 Applications

The method of finding association rules has been used widely since the 1960s in the retail

industry for analysing market baskets, stocking departments, organizing promotions, managing stocks to prevent shortages and overstocks, etc. It is also useful for detecting

associations of options chosen in packaged products (in banking, telephony, insurance, etc.)

or associations of terms in a corpus of documents. It can be applied to any kind of items; in sport, for instance, it can be used to detect rules such as: if player X is on the field and the match takes place in given circumstances, then player Y scores more goals in 70% of cases.

As mentioned above, the main problem in implementing this method is the large number

of irrelevant association rules which may submerge the relevant ones. This problem can be

mitigated by using filters and taxonomies. However, some rules with high lifts and confidence

indices may pass unnoticed because their support indices are below the threshold which had

to be specified in order to prevent the numbers of rules becoming impossible to process.

Hastie et al. (2009)5 offer the light-hearted example of ‘vodka ⇒ caviar’ which is penalized

by the scarcity of the consequent.

5 Hastie, T., Tibshirani, R. and Friedman, J.H. (2009) The Elements of Statistical Learning: Data Mining,

Inference and Prediction, 2nd edn. New York: Springer.


Naturally, a huge amount of computation power is needed to analyse the market baskets of

a hypermarket with several tens of millions of products on its lists and several million

transactions per year.6 Association detection algorithms are provided in data mining programs

available for a client–server system, such as SAS Enterprise Miner and IBM SPSS Modeler

(Figure 10.2), as well as in freeware such as R, Tanagra, RapidMiner and Weka.

Figure 10.2 Association rules detection in IBM SPSS Modeler.

6 The total number of transactions in all the Wal-Mart stores is more than 20 million per day!


10.5 Example of use

If we start with an ordinary data set in the form of ‘individuals × variables’, most software

packages require one or two preliminary procedures of data preparation. We can illustrate this

using the Titanic data set which we examined in Section 3.12, dealing with interactions, and

which we will use again in Section 11.8.13 for the development of a logistic model.

To start with, we must convert the data from observations in form 1 (tabular):

Individual   Age   Sex   Class   Survived
1            A     F     1       Y
2            A     M     3       N
3            C     M     1       Y
...          ...   ...   ...     ...

to form 2:

1     Age=A   Sex=F   Class=1   Survived=Y
2     Age=A   Sex=M   Class=3   Survived=N
3     Age=C   Sex=M   Class=1   Survived=Y
...   ...     ...     ...       ...

and sometimes to form 3 (transactional):

1     Age=A
1     Sex=F
1     Class=1
1     Survived=Y
2     Age=A
2     Sex=M
2     Class=3
...   ...

To consider only three examples, the freeware by C. Borgelt processes form 2, but SAS

Enterprise Miner requires form 3, while IBM SPSS Modeler can handle forms 1 and 3.
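Outside any particular package, the passage from form 1 to forms 2 and 3 can be sketched in a few lines of Python. The names `FORM1`, `to_form2` and `to_form3` are illustrative only, and the three rows are those shown above.

```python
# Form 1 (tabular): one row per individual, one column per variable.
FORM1 = [
    {"Individual": 1, "Age": "A", "Sex": "F", "Class": "1", "Survived": "Y"},
    {"Individual": 2, "Age": "A", "Sex": "M", "Class": "3", "Survived": "N"},
    {"Individual": 3, "Age": "C", "Sex": "M", "Class": "1", "Survived": "Y"},
]


def to_form2(rows, key="Individual"):
    """Form 2: one line per individual, holding its 'variable=value' items."""
    return [
        (row[key], [f"{name}={value}" for name, value in row.items() if name != key])
        for row in rows
    ]


def to_form3(rows, key="Individual"):
    """Form 3 (transactional): one line per (individual, item) pair."""
    return [(ident, item) for ident, items in to_form2(rows, key) for item in items]


for ident, items in to_form2(FORM1):
    print(ident, *items)
for ident, item in to_form3(FORM1)[:4]:
    print(ident, item)
```

The key column plays the same role as the identifier that has to be added before transposing the data in the package-specific steps.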

The following SAS code can be used to create form 3, used by SAS, directly. First, we

must add a key, in other words a unique identifier of each individual, if the file does not already

have one. The input file will then be transposed with respect to this key by an association rules

detection program. Note that the variables are numeric in this example.

/* Add a unique key (the observation number) to each individual */
DATA titanic ;
  SET sasuser.titanic ;
  id = _n_ ;
RUN ;


Obs   CLASS   AGE   SEX   SURVIVED   ID
  1       1     1     1          1    1
  2       1     1     1          1    2
  3       1     1     1          1    3
  4       1     1     1          1    4
  5       1     1     1          1    5
  6       1     1     1          1    6
  7       1     1     1          1    7
  8       1     1     1          1    8
  9       1     1     1          1    9
 10       1     1     1          1   10

The following transposition transforms the ‘individuals × variables’ data set into a data set

with one line per (individual, variable) pair with the name of the variable in _name_ (‘name of

the former variable’) and its content in var1, where ‘var’ is the prefix specified in the

TRANSPOSE procedure. Since the variable ‘ID’ has also been transposed (all the variables

have been transposed, by the VAR _all_ instruction), the corresponding lines are deleted from

the TRANSPO file.

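/* Transpose the keyed data set created above (titanic), one line per (id, variable) pair; */
/* the lines produced for the key itself are dropped by the WHERE= option */
PROC TRANSPOSE DATA=titanic OUT=transpo (WHERE = (_name_ NE "id"))
  PREFIX=var ;
  BY id ;
  VAR _all_ ;
RUN ;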

Obs   ID   NAME OF THE FORMER VARIABLE   var1
  1    1   CLASS                            1
  2    1   AGE                              1
  3    1   SEX                              1
  4    1   SURVIVED                         1
  5    2   CLASS                            1
  6    2   AGE                              1
  7    2   SEX                              1
  8    2   SURVIVED                         1
  9    3   CLASS                            1
 10    3   AGE                              1
 11    3   SEX                              1
 12    3   SURVIVED                         1

A DATA step transforms the preceding data set into form 3 as mentioned above, by

concatenating the name of each variable with its content:


/* Build each item as 'NAME=value' and keep only the key and the item */
DATA titanic_assoc (KEEP = id item) ;
  SET transpo ;
  LENGTH item $20 ;
  item = CATX('=', _name_, var1) ;
RUN ;

Obs   key   item
  1     1   CLASS=1
  2     1   AGE=1
  3     1   SEX=1
  4     1   SURVIVED=1
  5     2   CLASS=1
  6     2   AGE=1
  7     2   SEX=1
  8     2   SURVIVED=1
  9     3   CLASS=1
 10     3   AGE=1
 11     3   SEX=1
 12     3   SURVIVED=1

A data set in the above form (form 3) can be analysed in Enterprise Miner. Other packages

require form 2, and the file can be transposed again from form 3 to form 2 by using the ID as

a pivot.

PROC TRANSPOSE DATA=titanic_assoc OUT=titanic_assoc2 (DROP =_name_)

PREFIX=var ;

BY id ;

VAR item ;

RUN ;

Obs   ID   var1      var2    var3    var4
  1    1   CLASS=1   AGE=1   SEX=1   SURVIVED=1











Figure 10.3 shows the parameter setting screen of the Association node of SAS Enterprise

Miner, which is applied to the data set in form 3 (transactional). This results in the 17 rules

shown in Figure 10.4. As mentioned above, the package provides an option for storing these

rules in an SAS data set or exporting them in another format.

The first rule is: male ⇒ adult (SEX=1 ⇒ AGE=1). It relates to 1667 individuals out of

2201 passengers on the Titanic, i.e. the support index is 75.74%. As there are 1731 males, of

whom 1667 are adults, the confidence index is 96.30%. The lift of this rule is its confidence

index divided by the probability of being an adult, which is 95.05% (2092 out of 2201

passengers). This is only 1.01, and the 96.30% is only very slightly greater than the 95.05% of

confidence achieved by trivial prediction. This rule is therefore of low interest.

The rule with the strongest lift is ‘SURVIVED=0 & CLASS=0 ⇒ SEX=1 & AGE=1’: drowned + member of crew ⇒ male + adult. The lift is 99.55% (confidence) divided by 75.74% (this percentage of passengers are male adults), i.e. 1.31. But is this prediction really useful? What we need is rules in which survival or drowning appears in the consequents (results) and not in the antecedents (conditions). None of the above 17 rules meets this condition.

Figure 10.3 Parameter setting for association detection in SAS Enterprise Miner.

Figure 10.4 Result of association detection in SAS Enterprise Miner.

If we choose a support threshold of 5% and a confidence threshold of 75%, we go from 17 to 62 rules. The first three rules concern the prediction of survival, and they also have interesting lifts, all three being greater than 3. The second rule of the three has the strongest confidence and support indices. It is stated thus: female + first class ⇒ survived. As the survivors are only 32.30% of the total, this rule, which is true in 141 out of 145 cases of first class and females, i.e. 97.24% confidence, provides real information with a lift of 97.24/32.30 = 3.01. This very reliable criterion of survival will also appear in the decision tree in Section 11.4.2.

SET_SIZE   EXP_CONF   CONF    SUPPORT   LIFT   COUNT    RULE
4          29.71      96.55    6.36     3.25   140.00   SEX=0 & CLASS=1 ⇒ SURVIVED=1 & AGE=1
3          32.30      97.24    6.41     3.01   141.00   SEX=0 & CLASS=1 ⇒ SURVIVED=1
4          32.30      97.22    6.36     3.01   140.00   SEX=0 & CLASS=1 & AGE=1 ⇒ SURVIVED=1
4          60.38      75.71   30.44     1.25   670.00   CLASS=0 ⇒ SURVIVED=0 & SEX=1 & AGE=1
3          61.97      75.71   30.44     1.22   670.00   CLASS=0 ⇒ SURVIVED=0 & SEX=1
4          61.97      75.71   30.44     1.22   670.00   CLASS=0 & AGE=1 ⇒ SURVIVED=0 & SEX=1
...        ...        ...     ...       ...    ...      ...
2          95.05      96.51   65.33     1.02   1438.0   SURVIVED=0 ⇒ AGE=1
2          95.05      96.30   75.74     1.01   1667.0   SEX=1 ⇒ AGE=1
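The indices quoted for the rules male ⇒ adult and female + first class ⇒ survived can be recomputed from the counts given in the text, as in the Python sketch below. The helper `rule_metrics` is a hypothetical name, and the count of 711 survivors is an assumption inferred from the 32.30% of 2201 individuals stated above, not a figure given in the text.

```python
N_TOTAL = 2201  # individuals in the Titanic data set


def rule_metrics(n_condition, n_both, n_result, n_total=N_TOTAL):
    """Return (support, confidence, lift) of a rule from raw counts.

    n_condition: individuals satisfying the condition
    n_both:      individuals satisfying condition and result
    n_result:    individuals satisfying the result
    """
    support = n_both / n_total
    confidence = n_both / n_condition
    lift = confidence / (n_result / n_total)
    return support, confidence, lift


# male => adult: 1731 males, 1667 adult males, 2092 adults.
male_adult = rule_metrics(1731, 1667, 2092)

# female + first class => survived: 145 cases, 141 survivors among them;
# 711 survivors overall is inferred from 32.30% of 2201 (assumption).
female_first = rule_metrics(145, 141, 711)

for name, (s, c, l) in [("male => adult", male_adult),
                        ("female + first class => survived", female_first)]:
    print(f"{name}: support {100 * s:.2f}%, confidence {100 * c:.2f}%, lift {l:.2f}")
```

The rounded results agree with the SUPPORT, CONF and LIFT columns of the corresponding rows of the table.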

As shown in Figure 10.5, SAS Enterprise Miner also displays the most frequent items,

namely those whose frequency exceeds 5% (the support threshold that was set previously) of

the number of individuals, which is 2201 in this case.

Figure 10.5 The most frequent items in SAS Enterprise Miner.


A note on redundant rules. In the example above, rule 13,

SURVIVED=0 & CLASS=0 ⇒ SEX=1,

and rule 15,

SURVIVED=0 & CLASS=0 ⇒ SEX=1 & AGE=1,

have exactly the same support (670 observations), because rule 14,

SURVIVED=0 & CLASS=0 ⇒ AGE=1,

is always true. Rule 13 is therefore redundant with respect to rule 15, because 15 ⇒ 13.
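This kind of redundancy can be screened for automatically. The Python sketch below uses a deliberately simple criterion, which is an assumption of this example rather than a standard definition: a rule is flagged as redundant with respect to another when both have the same condition and the same support count, and its result is strictly contained in the other's result. The function name `is_redundant` and the tuple layout are illustrative.

```python
def is_redundant(rule, other):
    """True if `rule` says nothing more than `other`.

    Each rule is a tuple (condition, result, count), where condition and
    result are frozensets of items and count is the support count.
    """
    condition, result, count = rule
    other_condition, other_result, other_count = other
    return (condition == other_condition
            and count == other_count
            and result < other_result)  # strict subset of the other result


CONDITION = frozenset({"SURVIVED=0", "CLASS=0"})
rule_13 = (CONDITION, frozenset({"SEX=1"}), 670)
rule_15 = (CONDITION, frozenset({"SEX=1", "AGE=1"}), 670)

print(is_redundant(rule_13, rule_15))  # rule 13 against rule 15
print(is_redundant(rule_15, rule_13))  # the converse check
```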
