

2 The development of a data mining study

Before examining the key points of a data mining study or project in the following chapters,

I shall briefly run through the different phases in this chapter. Each phase is mentioned here,

even though some of them are optional (those covered in Sections 2.5 and 2.7), and most of

them can be handed over to a specialist firm (except for those covered in Section 2.1 and parts

of Sections 2.2 and 2.10). I will provide further details and illustrations of some of these steps

in the chapter on scoring. We shall assume that statistical or data mining software has already

been acquired; the chapter on software deals with the criteria for choice and comparison.

As a general rule, the phases of a data mining project are as follows:

. defining the aims;

. listing the existing data;

. collecting the data;

. exploring and preparing the data;

. population segmentation;

. drawing up and validating the predictive models;

. deploying the models;

. training the model users;

. monitoring the models;

. enriching the models.



2.1 Defining the aims

We must start by choosing the subject, defining the target population (e.g. prospects and

customers, customers only, loyal customers only, all patients, only those patients who can be

cured by the treatment under test), defining the statistical entity to be studied (e.g. a person,

a household consisting of spouses only, a household including dependent children, a business

with or without its subsidiaries), defining some essential criteria and especially the
phenomenon to be predicted, planning the project, deciding on the expected operational use of

the information extracted and the models produced, and specifying the expected results.

To do this, we must arrange for meetings to be held between the clients (risk management

and marketing departments, specialists, future users, etc., as appropriate) and the service

providers (statisticians and IT specialists). As some data mining projects are mainly

horizontal, operating across several departments, it will be useful for the general management

to be represented at this stage, so that the inevitable arbitration can take place. The top

management will also be able to promote the new data mining tools, if their introduction, and

the changes in procedure that they entail, encounter opposition in the business. The contacts

present at this stage will subsequently attend periodic steering committee meetings to monitor

the progress of the project.

This stage will partly determine the choice of the data mining tools to be used. For

example, if the aim is to set out explicit rules for a marketing service or to discover the factors

in the remission of a disease, neural networks will be excluded.

The aims must be very precise and must lead to specific actions, such as the refining of

targeting for a direct marketing campaign. In the commercial field, the aims must also be

realistic (see Section 13.1) and allow for the economic realities, marketing operations that

have already been conducted, the penetration rate, market saturation, etc.

2.2 Listing the existing data

The second step is to list the data that are useful, accessible (inside or outside the

business or organization), legally and technically exploitable, reliable and sufficiently up to

date as regards the characteristics and behaviour of the individuals (customers, patients, users,

etc.) being studied. These data are obtained from the IT system of the business, or are stored in

the business outside the central IT system (in Excel or Access files, for example), or are bought

or retrieved outside the business, or are calculated from earlier data (indicators, ratios,

changes over time). If the aim is to construct a predictive model, it will also be necessary to

find a second type of data, namely the historical data on the phenomenon to be predicted. Thus

we need to have the results of medical trials, marketing campaigns, etc., in order to discover

how patients, customers, etc. with known characteristics have reacted to a medicine or a mail

shot, for example. In the second case, side effects are rarer, and we simply need to
know whether the customer who has been approached has or has not bought the products being

offered, or some other product, or nothing at all. The model to be constructed must therefore

correlate the fact of the purchase, or the cure, with the other data held on the individual.

A problem may arise if the business does not have the necessary data, either because it has

not archived them, or because it is creating a new activity, or simply because it has little direct

contact with its customers. In this case, it must base its enquiries on samples of customers,

if necessary by offering gifts as an incentive to reply to questionnaires. It could also use


geomarketing, mega-databases (Acxiom, Wegener Direct Marketing) or tools such as ‘first

name scoring’ (see Chapter 4). Finally, the business can use standard models designed by

specialist companies. For example, generic risk scores are available in the banking sector

(see Section 12.6.2).

Sometimes the business may have data that are, unfortunately, unsuitable for data mining,

for example when they are:

. detailed data available on microfilm, useful for one-off searches but exorbitantly

expensive to convert to a usable computerized format;

. data on individual customers that are aggregated at the end of the year to provide an

annual summary (e.g. the number and value of purchases, but with no details of dates,

products, etc.);

. monthly data of a type which would be useful, but aggregated at the store level, not the

customer level.

In such cases, the necessary data must be collected as a matter of urgency (see also

Section 4.1.7).

A note on terminology. Variable is taken to mean any characteristic of an entity (person,

organization, object, event, case, etc.) which can be expressed by a numerical value

(a measurement) or a coded value (an attribute). The different possible values that a

variable can take in the whole set of entities concerned are called the categories of the

variable. The statistical entity that is studied is often called the individual, even if it is not

a person. According to statistical usage, we may also use the term observation.

2.3 Collecting the data

This step leads to the construction of the database that will be used to build the

models. This analysis base is usually in the form of a table (DB2, Oracle, SAS, etc.) or a file

(flat file, CSV file, etc.) having one record (one row) for each statistical individual studied and

one field (one column) for each variable relating to this individual. There are exceptions to this

principle: among the most important of these are time series and predictive models with

repeated measurement. There are also spatial data, in which the full meaning of the record of

each entity is only brought out as a function of the other entities (propagation of a polluting

product, installation of parking meters or ticket machines, etc.). In some cases, the variables

are not the same for all the individuals (for example, medical examinations differ with the

patients). In general, however, the data can be represented by a table in which the rows

correspond to the individuals studied and the columns correspond to the variables describing

them, which are the same for all the individuals. This representation may be rather too

simplistic in some cases, but has the merit of convenience.

The analysis base is often built up from snapshots of the data taken at regular intervals

(monthly, for example) over a certain period (a year, for example). On the other hand, these

data are often not determined at the level of the individual, but at the level of the product,

account, invoice, medical examination, etc. In order to draw up the analysis base, we must

therefore provide syntheses or aggregations along a number of axes. These include the

individual axis, where data determined at a finer level (the level of the products owned by the


individual, for example) are synthesized at the individual level, and the time axis, where

n indicators determined at different instants are replaced by a single indicator determined for

the whole period. For example, in the case of wages and other income paid into a bank, we
start by calculating the monthly sum of income elements paid into the different accounts of the
household, before calculating the mean of these monthly income elements over 12 months. By

observing the independent variables over a year, we can make allowance for a full economic

cycle and smooth out the effect of seasonal variations.

As regards the aggregation of data at the individual level, this can be done (to take an

example from the commercial sector) by aggregating the ‘product’ level data (one row per

product in the files) into ‘customer’ level data, thus producing, in the file to be examined, one

row per customer, in which the ‘product’ data are replaced by sums, means, etc., of all the

customer’s products. By way of example, we can move from this file:

    customer no.    date of purchase   product purchased   value of purchase
    customer 1      21/02/2004         jacket              25
    customer 1      17/03/2004         shirt               15
    customer 2      08/06/2004         T-shirt             10
    customer 2      15/10/2004         socks                8
    customer 2      15/10/2004         pullover            12
    ...             ...                ...                 ...
    customer 2000   18/05/2004         shoes               50
    ...             ...                ...                 ...

to this file:

    customer no.    date of 1st purchase   date of last purchase   value of purchases
    customer 1      21/02/2004             17/03/2004              40
    customer 2      08/06/2004             15/10/2004              30
    ...             ...                    ...                     ...
    customer 2000   18/05/2004             18/05/2004              50
    ...             ...                    ...                     ...

Clearly, the choice of the statistical operation used to aggregate the ‘product’ data for
each customer is important. Where risk is concerned, for example, very different results may be
obtained, depending on whether the number of debit days of a customer is defined as the mean or
the maximum number of debit days for his current accounts. The choice of each operation must
be carefully thought out and must have a functional purpose. For risk indicators, it is generally
the most serious situation of the customer that is considered, not his average situation.

When a predictive model is to be constructed, the analysis base will be of the kind shown
in Figure 2.1. This has four types of variable: the identifier (or key) of the individual, the target
variable (i.e. the dependent variable), the independent variables, and a ‘sample’ variable
which shows whether each individual is participating in the development (learning) or the
testing of the model (see Section 11.16.4).


[Figure 2.1 Developing a predictive analysis base. The figure sketches the analysis table: one row per customer, with the customer no.; the target variable ‘purchaser (Y/N)’, which is the dependent variable, observed in year n; the independent variables (occupational & social category, age, family status, no. of purchases, value of purchases, ..., variable m), observed in year n-1; and a ‘sample’ variable assigning each customer at random to the training or test sample. The distribution of customers must contain at least 500 customers targeted in year n who purchased (Y) and at least 500 targeted in year n who did not (N), the two samples together comprising at least 1000 cases.]


The analysis base is constructed in accordance with the aim of the study. The aim may be to

find a function f of the independent variables of the base, such that f (independent variables of a

customer) is the probability that the customer (or patient) will buy the product offered, or will fall
ill, etc. The data are often observed over 18 months (better still, 24 months): the most recent 6 (or

12) months are the period of observation for the purchase of a product or the onset of a disease,

and the previous 12 months are the period in which the customer’s data are analysed to explain

what has been observed in the recent period. In this case, we are seeking a function f such that

Probability(dependent variable = x) = f(independent variables).

The periods of observation of the independent and dependent variables must be separated;

otherwise, the values of the independent variables could be the consequence, not the cause,

of the value of the dependent variable.

For example, if we examine the payments made with a card after the acquisition of a

second card in a household, they will probably be higher than for a household which has not

acquired a second card. However, this does not tell us anything useful. We need to know if the

household that has acquired a second card used to make more payments by card before

acquiring its second card than the household that did not acquire one.

2.4 Exploring and preparing the data

In this step, checks on the origin of the data, frequency tables, two-way tables, descriptive
statistics and the experience of users make three operations possible.

The first operation is to make the data reliable, by replacing or removing data that are

incorrect because they contain too many missing values, aberrant values or extreme values

(‘outliers’) that are too remote from the normally accepted values. If only a few rare

individuals show one of these anomalies for one of the variables studied, they can be removed

from the study (at least in the model construction phase), or their anomalous values can be

replaced by a correct and plausible value (see Section 3.3). On the other hand, if a variable is

anomalous for too many individuals, this variable is unsuitable for use. Be careful about

numerical variables: we must not confuse a significant value of 0 with a value set to 0 by

default because no information is provided. The latter 0 should be replaced with the ‘null’

value in DB2, ‘NA’ in R or the missing value ‘.’ in SAS. Remember that a variable whose

reliability cannot be assured must never be used in a model. A model with one variable

missing is more useful than a model with a false variable. The same applies if we are uncertain

whether a variable will always be available or always correctly updated.
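As an illustration of this first operation, here is a minimal sketch in Python with pandas; the column name and the 50% threshold are hypothetical, chosen only to make the logic concrete:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"annual_income": [2000, 0, 3500, 0, 2800]})

    # A 0 that merely means 'no information provided' must become a true
    # missing value (NULL in DB2, NA in R, '.' in SAS), not remain as a
    # significant 0
    df["annual_income"] = df["annual_income"].replace(0, np.nan)

    # A variable that is missing or anomalous for too many individuals
    # is unsuitable for use and should be dropped from the model
    if df["annual_income"].isna().mean() > 0.5:
        df = df.drop(columns=["annual_income"])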

The second operation is the creation of relevant indicators from the raw data which have

been checked and corrected where necessary. This can be done as follows:

. by replacing absolute values with ratios, often the most relevant ones;

. by calculating the changes of variables over time (for example the mean for the recent

period, divided by the mean for the previous period);

. by making linear combinations of variables;

. by composing variables with other functions (such as logarithms or square roots of

continuous variables, to smooth their distribution and compress a highly right-skewed

distribution, as commonly found in financial or reliability studies);


. by recoding certain variables, for example by converting ‘low, medium, high’ into

‘1, 2, 3’;

. by modifying units of measurement;

. by replacing the dates by durations, customer lifetimes or ages;

. by replacing geographical locations with coordinates (latitude and longitude).

Thus we can create relevant indicators, following the advice of specialists if necessary. In

the commercial field, these indicators may be:

. the age of the customer when his relationship with the business begins, based on the date

of birth and the date of the first purchase;

. the number of products purchased, based on the set of variables ‘product Pi purchased

(Yes/No)’;

. the mean value of a purchase, based on the number and value of the purchases;

. the recentness and frequency of the purchases, based on the purchase dates;

. the rate of use of revolving credit, deduced from the ceiling of the credit line and the

proportion actually used.
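Some of the indicators just listed can be derived in a few lines. A sketch in Python with pandas, with hypothetical column names:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "date_of_birth": pd.to_datetime(["1960-05-01", "1975-11-23"]),
        "date_of_1st_purchase": pd.to_datetime(["2003-02-21", "2004-06-08"]),
        "no_of_purchases": [2, 3],
        "value_of_purchases": [40, 30],
    })

    # Age of the customer when the relationship begins, from two dates
    df["age_at_1st_purchase"] = (
        df["date_of_1st_purchase"] - df["date_of_birth"]
    ).dt.days // 365

    # Mean value of a purchase: a ratio replacing two absolute values
    df["mean_purchase_value"] = (
        df["value_of_purchases"] / df["no_of_purchases"]
    )

    # Logarithm to compress a highly right-skewed distribution
    df["log_value_of_purchases"] = np.log1p(df["value_of_purchases"])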

The third operation is the reduction of the number of dimensions of the problem:

specifically, the reduction of the number of individuals, the number of variables, and the

number of categories of the variables.

As mentioned above, the reduction of the number of individuals is a matter of eliminating

certain extreme individuals from the population, in a proportion which should not normally

be more than 1–2%. It is also common practice to use a sample of the population (see

Section 3.15), although care must be taken to ensure that the sample is representative: this may

require a more complex sampling procedure than ordinary random sampling.

The reduction of the number of variables consists of:

. disregarding some variables which are too closely correlated with each other and which,

if taken into account simultaneously, would infringe the commonly required assumption

of non-collinearity (linear independence) of the independent variables (in linear

regression, linear discriminant analysis or logistic regression);

. disregarding certain variables that are not at all relevant or not discriminant with respect

to the specified objective, or to the phenomenon to be detected;

. combining several variables into a single variable; for example, the number of

products purchased with a two-year guarantee and the number of products purchased

with a five-year guarantee can be added together to give the number of products

purchased without taking the guarantee period into account, if this period is not

important for the problem under investigation; it is also possible to replace the

‘number of purchases’ variables established for each product code with the ‘number

of purchases’ variables for each product line, which would also reduce the number

of variables;


. using factor analysis (if appropriate) to convert some of the initial variables into a

smaller number of variables (by eliminating the correlated variables), these variables

being chosen to maximize the variance (the new variables are also more stable in

relation to sampling and random fluctuations).

Reducing the number of variables by factor analysis, for example by principal component

analysis (PCA: see Section 7.1), is useful when these variables are used as the input of a neural

network. This is because a decrease in the number of input variables allows us to decrease the

number of nodes in the network, thus reducing its complexity and the risk of convergence of the

network towards a non-optimal solution. PCA may also be used before automatic clustering.
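As a sketch of this reduction, using scikit-learn's PCA on a stand-in data set (standardizing first, so that no variable dominates merely through its unit of measurement; the 90% threshold is an arbitrary choice for the example):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = load_iris().data                  # stand-in for the analysis base

    # Standardize, then keep just enough components to explain 90% of
    # the variance
    Z = StandardScaler().fit_transform(X)
    pca = PCA(n_components=0.90).fit(Z)
    X_reduced = pca.transform(Z)          # e.g. input for a neural network

    print(pca.n_components_, pca.explained_variance_ratio_)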

Quite commonly, 100 to 200 or more variables may be investigated during the development
of a classification or prediction model, although the number of ultimately discriminant

variables selected for the calculation of the model is much smaller, the order of magnitude

being as follows:

. 3 or 4 for a simple model (although the model can still classify more than 80% of the

individuals);

. 5–10 for a model of normal quality;

. 11–20 for a very fine (or refined) model.

Beyond 20 variables, the model is very likely to lose its capacity for generalization. These

figures may have to be adjusted according to the size of the population.

The reduction of the number of categories is achieved by:

. combining categories which are too numerous or contain too few observations, in the

case of discrete and qualitative variables;

. combining the categories of discrete and qualitative variables which have the same

functional significance, and are only distinguished for practical reasons which are

unrelated to the analysis to be carried out;

. discretizing (binning) some continuous variables (i.e. converting them into ranges of

values such as quartiles, or into ranges of values that are significant from the user’s point

of view, or into ranges of values conforming to particular rules, as in medicine).
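Both the binning of a continuous variable and the combination of categories can be sketched in Python with pandas (hypothetical data; in practice the ranges would be quartiles, user-defined bands or domain rules, as noted above):

    import pandas as pd

    df = pd.DataFrame({
        "age": [23, 35, 47, 58, 62, 71],
        "county": ["Kent", "Surrey", "Devon", "Cornwall", "Essex", "Norfolk"],
    })

    # Discretize (bin) a continuous variable into quartiles
    df["age_band"] = pd.qcut(df["age"], q=4)

    # Combine categories that are too numerous: counties into regions
    to_region = {"Kent": "South East", "Surrey": "South East",
                 "Essex": "East", "Norfolk": "East",
                 "Devon": "South West", "Cornwall": "South West"}
    df["region"] = df["county"].map(to_region)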

Examples of categories which are combined because they are too numerous are the

individual articles in a product catalogue (which can be replaced by product lines), social and

occupational categories (which can be aggregated into about 10 values), the segments of a

housing type (which can be combined into families as indicated in Section 4.2.1), counties

(which can be regrouped into regions), the activity codes of merchants (the Merchant

Category Code used in bank card transactions) and of businesses (NACE, from the French name
of the Statistical Classification of Economic Activities in the European Community, equivalent
to the Standard Industrial Classification in the United Kingdom and North America),

(which can be grouped into industrial sectors), etc. By way of example, the sectors of activity

may be:

. Agriculture, Forestry, Fishing and Hunting

. Mining


. Utilities

. Construction

. Manufacturing

. Wholesale Trade

. Retail Trade

. Transportation and Warehousing

. Information

. Finance and Insurance

. Real Estate and Rental and Leasing

. Professional, Scientific, and Technical Services

. Management of Companies and Enterprises

. Administrative and Support and Waste Management and Remediation Services

. Education Services

. Health Care and Social Assistance

. Arts, Entertainment, and Recreation

. Accommodation and Food Services

. Other Services (except Public Administration)

. Public Administration

Extreme examples of variables having too many categories are variables which are

identifiers (customer number, account number, invoice number, etc.). These are essential keys

for accessing data files, but cannot be used as predictive variables.

2.5 Population segmentation

It may be necessary to segment the population into groups that are homogeneous in relation

to the aims of the study, in order to construct a specific model for each segment, before making
a synthesis of the models. This method is called the stratification of models. It can only be used

where the volume of data is large enough for each segment to contain enough individuals of

each category for prediction. In this case, the pre-segmentation of the population is a method

that can often improve the results significantly. It can be based on rules drawn up by experts or

by the statistician. It can also be carried out more or less automatically, using a statistical

algorithm: this kind of ‘automatic clustering’, or simply ‘clustering’, algorithm is described

in Chapter 9.

There are many ways of segmenting a population in order to establish predictive models

based on homogeneous sub-populations. In the first method, the population is segmented


according to general characteristics which have no direct relationship with the dependent

variable. This pre-segmentation is unsupervised. In the case of a business, it may relate to its

size, its legal status or its sector of activity. At the Banque de France, the financial health of

businesses is modelled by sector: the categories are industry, commerce, transport, hotels,

cafés and restaurants, construction, and business services. For a natural person, we may use

sociodemographic characteristics such as age or occupation. In this case, this approach can be

justified by the existence of specific marketing offers aimed at certain customer segments,

such as ‘youth’, ‘senior’, and other groups. From the statistical viewpoint, this method is not

always the best, because behaviour is not always related to general characteristics, but it may

be very useful in some cases. It is usually applied according to rules provided by experts,

rather than by statistical clustering.

Another type of pre-segmentation is carried out on behavioural data linked to the

dependent variable, for example the product to which the scoring relates. This pre-

segmentation is supervised. It produces segments that are more homogeneous in terms of

the dependent variable. Supervised pre-segmentation is generally more effective than

unsupervised pre-segmentation, because it has some of the discriminant power required in

the model. It can be implemented according to expert rules, or simply on a common-sense

basis, or by a statistical method such as a decision tree with one or two levels of depth. Even a

decision tree with only one level of depth can be used to separate the population to be

modelled into two clearly differentiated classes (or more with chi-square automatic

interaction detection), without having the instability that is a feature of deeper trees. One

common-sense rule could be to establish a consumer credit propensity score for two

segments: namely, those customers who already have this kind of credit, and the rest. In

this example, we can see that the rules can be expressed as a decision tree. This method often

enables the results to be improved.
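A supervised pre-segmentation by a decision tree with one level of depth can be sketched with scikit-learn; the data here are synthetic, standing in for the behavioural variables and the dependent variable:

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

    # Depth 1 = a single split, giving two clearly differentiated segments
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
    segments = stump.apply(X)       # leaf id = segment of each individual

    # A specific model would then be built on each segment
    for leaf in sorted(set(segments)):
        print("segment", leaf, ":", (segments == leaf).sum(), "individuals")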

A third type of pre-segmentation can be required because of the nature of the available

data: for example, they will be much less rich for a prospect than for a customer, and these

two populations must therefore be separated into two segments for which specific models

will be constructed.

In all cases, pre-segmentation must follow explicit rules for classifying every individual.

The following features are essential:

. simplicity of pre-segmentation (there must not be too many rules);

. a limited number of segments and stability of the segments (these two aspects

are related);

. segment sizes generally of the same order of magnitude, because, unless we have

reasons for examining a highly specific sub-population, a 99%/1% distribution is rarely

considered satisfactory;

. uniformity of the segments in terms of the independent variables;

. uniformity of the segments in terms of the dependent variable (for example, care must

be taken to avoid combining high-risk and low-risk individuals in the same segment).

We should note that some operations described in the preceding section (reduction of the

number of individuals, variables and categories) can be carried out either before any

segmentation, or on each segment.


2.6 Drawing up and validating predictive models

This step is the heart of the data mining activity, even if it is not the most time-consuming. It

may take the form of the calculation of a score, or more generally a predictive model, for each

segment produced in the preceding step, followed by verification of the results in a test sample

that is different from the learning sample. It may also be concerned with detecting the profiles

of customers, for example according to their consumption of products or their use of the

services of a business. Sometimes it will be a matter of analysing the contents of customers’

trolleys in department stores and supermarkets, in order to find associations between products

that are often purchased together. A detailed review of the different statistical and data mining

methods for developing a model, with their principles, fields of application, strong points

and limitations, is provided in Chapters 6–11.

In the case of scoring, the principal methods are logistic regression, linear discriminant

analysis and decision trees. Trees provide entirely explicit rules and can process heterogeneous

and possibly incomplete data, without any assumptions concerning the distribution of the

variables, and can detect non-linear phenomena. However, they lack reliability and the capacity

for generalization, and this can prove troublesome. Logistic regression directly calculates the

probability of the appearance of the phenomenon that is being sought, provided that the

independent variables have no missing values and are not interrelated. Discriminant analysis

also requires the independent variables to be continuous and to follow a normal distribution (plus

another assumption that will be discussed below), but in these circumstances it is very reliable.

Logistic regression has an accuracy very similar to that of discriminant analysis, but is more

general because it deals with qualitative independent variables, not just quantitative ones.

In this step, several models are usually developed, in the same family or in different

families. The development of a number of models in the same family is most common,

especially when the aim is to optimize the parameters of a model by adjusting the independent

variables chosen, the number of their categories, the depth or number of leaves on a decision

tree, the number of neurons in a hidden layer, etc. We can also develop models in different

families (e.g. logistic regression and linear discriminant analysis) which are run jointly to

maximize the performance of the model finally chosen.

The aim is therefore to choose the model with the best performance out of a number of

models. I will discuss in Section 11.16.5 the importance of evaluating the performance of a

model on a test sample which is separate from the learning sample used to develop the model,

while being equally representative in statistical terms.

For comparing models of the same kind, there are statistical indicators which we will

examine in Chapter 11. However, since the statistical indicators of two models of different kinds
(e.g. R² for a discriminant analysis and R² for a logistic regression) are rarely comparable, the

models are compared either by comparing their error rates in a confusion matrix (see Section

11.16.4) or by superimposing their lift curves or receiver operating characteristic (ROC) curves

(see Section 11.16.5). These indicators can be used to select the best model in each family of

models, and then the best model from all the families considered together.

The ROC curve can be used to visualize the separating power of a model, representing the

percentage of correctly detected events (the Y axis) as a function of the percentage of

incorrectly detected events (the X axis) when the score separation threshold is varied. If this

curve coincides with the diagonal, the performance of the model is no better than that of a

random prediction; as the curve approaches the upper left-hand corner of the square, the


model improves. Several ROC curves can be superimposed to show the progressive

contribution of each independent variable in the model, as in Figure 2.2. This shows that

a single variable, if well chosen, already explains much of the phenomenon to be predicted,

and that an increase in the number of variables added to the model decreases the marginal gain

they provide.
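To make the comparison concrete, here is a sketch with scikit-learn in which two models of different families (logistic regression and linear discriminant analysis) are developed on a learning sample and compared on a separate test sample through their ROC curves and the areas under them; the data are synthetic:

    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

    # Learning sample for development, test sample for honest evaluation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    for model in (LogisticRegression(max_iter=1000),
                  LinearDiscriminantAnalysis()):
        p = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, p)   # points of the ROC curve
        print(type(model).__name__, "AUC =",
              round(roc_auc_score(y_test, p), 3))

Superimposing the (fpr, tpr) points of the two models would reproduce a comparison of the kind shown in Figure 2.2.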

In addition to the qualities of mathematical accuracy of the model, other selection criteria

may sometimes have to be taken into account, such as the readability of the model from the

user’s viewpoint, and the ease of implementing it in the production IT system.

2.7 Synthesizing predictive models of different segments

This small step only takes place if customer segmentation has been carried out. We need to

compare the scores of the different segments, and calculate a single normalized score that is

comparable from one segment to another. This new score allows for the proportion of

customers in each segment, and each score band, who have already achieved the aim,

i.e. bought the product, repaid the loan, etc.

The synthesis of the scores for the different segments is easy in a logistic regression model,

since the score is a probability which is inherently normalized and is suitable for the same

interpretation in all the segments (provided that the sampling has been comparable in the

different segments).

In all cases, it is possible to draw up a table showing the score points (for propensity in this

case) of the different segments, indicating the subscription rate corresponding to this score

and this segment on each occasion (Table 2.1). The rows of this table are then sorted by

subscription rate, from lowest to highest. The rows sorted in this way are then combined into ten

groups, each covering about 10% of the customers, with the rows of each group following each

other and therefore having similar subscription rates. One point is given to the customers of the

first group, and so on up to the customers of the tenth group, who have 10 points.
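A sketch of this synthesis in Python with pandas, starting from a small hypothetical version of Table 2.1 (in practice there would be one row per segment and score band, and ten final groups each holding about 10% of the customers):

    import pandas as pd

    rows = pd.DataFrame({
        "segment": [1, 1, 1, 2, 2, 2],
        "score": [1, 2, 3, 1, 2, 3],
        "n_customers": [120, 90, 60, 200, 150, 80],
        "subscription_rate": [0.01, 0.04, 0.10, 0.02, 0.05, 0.15],
    })

    # Sort the (segment, score) rows by subscription rate, then cut the
    # cumulative share of customers into ten groups of ~10% each: rows in
    # the same group follow each other, so their rates are similar
    rows = rows.sort_values("subscription_rate")
    cum = rows["n_customers"].cumsum() / rows["n_customers"].sum()
    rows["normalized_score"] = pd.cut(
        cum, bins=[i / 10 for i in range(11)], labels=range(1, 11))
    print(rows)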

[Figure 2.2 ROC curve.]


2.8 Iteration of the preceding steps

The procedure described in Sections 2.4 to 2.7 is usually iterated until completely satisfactory

results are obtained. The input data are adjusted, new variables are selected or the parameters are

changed in the data mining tools, which, if operating in an exploratory mode, as mentioned above,

may be highly sensitive to the choice of initial data and parameters, especially in neural networks.

Before accepting a model, it is useful to have it validated by experts in the field, for example

credit analysts in the case of scoring, or market researchmanagers. These experts can be shown

some real cases of individuals to which different competing models have been applied, and they
can then be asked to say which model appears to provide the best results, giving reasons for their
response. Thus they can detect the characteristics of a model which would not have appeared
during modelling. This may be done in the field of marketing and propensity scores. The method

of modelling on historic data tends to reproduce the commercial habits of the business, even if

precautions are taken to avoid this (by neutralizing the effect of sales campaigns, for example).

Thus, recent customers may be rather too well scored, since they are usually most likely to be

approached by sales staff. In the test phase, we may discover that older customers are not

described well enough by the propensity score, and that the use of this score would result in their
not being targeted. This phase could thus lead to a reduction in the effect of the variables relating

to customer lifetime; these variables may even be removed from the model. In the field of risk,

such variables may be those which forecast the risk too late and mask variables which would

allow earlier prediction; or they may simply be variables which are not obvious to experts in the

field, and which are discarded from the model.

2.9 Deploying the models

Deployment involves the implementation of the data mining models in a computer system,

before using the results for action (adapting procedures, targeting, etc.) and making them

available to users (information at the workstation, etc.). It is harder to update the ‘workstation’

applications to incorporate the score, especially if we wish to allow users to input new

Table 2.1 Synthesizing the score points of the different segments.

    Segment   Score   Number of customers   Subscription rate
    1         1       n1,1                  t1,1
    1         2       n1,2                  t1,2
    ...       ...     ...                   ...
    1         10      n1,10                 t1,10
    2         1       n2,1                  t2,1
    2         2       n2,2                  t2,2
    ...       ...     ...                   ...
    5         1       n5,1                  t5,1
    ...       ...     ...                   ...
    5         10      n5,10                 t5,10


information in order to upload it to the central production files or to execute the model in

real time. For an initial test, therefore, we may limit ourselves to creating a computer file

containing the data yielded by data mining, for example the customers’ score points; this file

may or may not be uploaded to the computerized production system, and may be used for

targeting in direct marketing. A spreadsheet file can be used to create a mail merge, using word

processing software.

In this step we must decide on the output of the data at different levels of fineness. There

may be fine grades in the production files (from 1 to 1000, for example), which are aggregated

at the workstation (from 1 to 10) or even regrouped into bands (low/medium/high).

The confidentiality of the score output will be considered in this step, if not before: for

example, is a salesman allowed to see the scores of all customers, or only the ones for which he

is responsible? The question of the frequency of updating of the data must also be considered:

this is usually monthly, but a quarterly frequency may be suitable for data that do not change

very often, while a daily frequency may be required for sensitive risk data.

Before deploying a data mining tool in a commercial network, it is obviously preferable to

test it for several months with volunteers. There will also be a phase of adoption of the tool,

possibly followed by a phase in which the initial volunteers train other users.

2.10 Training the model users

To ensure a smooth adoption of the new decision tools by the future users, it is essential to
spend some time familiarizing them with these tools. They must know the objective (and stick

to it), the principles of the tools, how they work (without going into technical details, which

would subsequently have to be justified individually), their limits (noting that these are only

statistical tools which may not be useful in special cases: the difficulty in educational terms is

to maintain a minimum of vigilance and critical approach without casting doubt on the

reliability of the tools), the methods for using them (while pointing out that these are decision

support tools, not tools for automatic decision making), what the tools will contribute

(the most important point), and how the users’ work patterns will change, in both operational

and organizational terms (adaptation of procedures, increased delegation, rules for
responsibility if the user does or does not follow the recommendations of the tool, in risk scoring

for example).

2.11 Monitoring the models

Two forms of monitoring will be distinguished in this section. The first is one-off. Whenever

a new data mining application is brought into use, the results must be analysed. Let us take

the example of a major marketing campaign for which a propensity score has been

developed to improve response rates. After the campaign, it is important to ensure that

the response rates match the score values, and that the customers with the highest scores

have in fact responded best. If a category of customers with a high score has not responded

well to the marketing offer, this must be examined in detail to see if it is possible to segment

the category or apply a decision tree to it, in order to discover the profiles of the

non-purchasers and take this into account in the next campaign. It will be necessary to

check whether the definition of this profile uses a variable which is not allowed for in the

score calculation. It may be useful to complement this quantitative analysis with a


qualitative analysis based on the opinions of the sales staff who have used the score: they

may have some idea of the phenomena which were not taken into account or were

underestimated by the score; they may also have intuitively identified the poorly scored

populations. The analysis of the low score results is also useful, especially when the score

has been developed in the absence of data on customers’ refusal to purchase, and when

‘negative’ profiles for the score modelling therefore had to be defined in advance, instead of

being determined on the basis of evidence, resulting in a degree of arbitrariness. An

examination of the feedback from the campaign will make it possible to corroborate or to

reject certain hypotheses on the ‘negative’ profiles.

The second form of monitoring is continuous. When a tool such as a risk score is used in

a business, it is used constantly and both its correct operation and its use must be checked.

Its operation can be monitored by using tables like the following:

    score \ month   M-1   M-2   M-3   M-4   M-5   M-6   M-7   M-8   M-9   M-10   M-11   M-12
    1
    2
    3
    ...
    TOTAL

The second column of this table shows, for the month M-1 before the analysis and for each

score value, the number of customers, the percentage of customers in the total population,

and the default rate. The sum of the losses of the business can be added if required. Clearly,

the rate of non-payments will be replaced by the subscription rate in the case of a propensity

score, and by the departure rate in the case of an attrition score. The third column contains the

same figures for the month M – 2, and so on. If required, a column containing the figures at

the time when the score was created may be added. These figures can be used to check whether
the score is still relevant in terms of prediction, and especially whether the rate of non-payments

(or subscription, or attrition) is still at the level predicted by the score. It is even better to

calculate the figures in column M-n in the following way. If the goal of the score is (for example)
to give a 12-month prediction, we measure, for each value of the score, the number of customers
at M-n-12 and the number of defaults between M-n-12 and M-n.
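As a sketch (hypothetical columns), the cells of one such column can be computed from a customer history in which each customer carries the score value he had 12 months earlier and a flag for default since then:

    import pandas as pd

    hist = pd.DataFrame({
        "customer": [1, 2, 3, 4, 5, 6],
        "score_12m_ago": [1, 1, 2, 2, 3, 3],
        "defaulted_since": [True, False, False, False, False, True],
    })

    # For each score value: customers 12 months ago, their share of the
    # population, and the default rate observed since
    ops = hist.groupby("score_12m_ago").agg(
        n_customers=("customer", "size"),
        default_rate=("defaulted_since", "mean"),
    )
    ops["pct_of_population"] = ops["n_customers"] / ops["n_customers"].sum()
    print(ops)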

The use of the score is monitored by using a table similar to the preceding one, in which the

rate of non-payments is replaced by the number and value of transactions concluded in a given

month and for a given score value. Thus it is possible to see if sales staff are taking the risk

score into account to limit their involvement with the most high-risk customers. Indicators

such as ‘infringements’ of the score by its users can be added to this table.

These tables are useful for departmental, risk and marketing experts, but may also be

distributed in the business network to enable the results of actions to be measured. They also

enable the business to quantify the prime target customers with which it conducts its

transactions; these are preferably customers whose scores have a certain given value (low

for risk, or high for propensity).


    score M-1 \ score M-2    1    2    3    ...
    1
    2
    3
    ...

A ‘transition matrix’ like the one above enables the stability of the score to be monitored.
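Such a matrix is simply a cross-tabulation of each customer's score in two successive months; a short pandas sketch with hypothetical data:

    import pandas as pd

    scores = pd.DataFrame({
        "score_m1": [1, 2, 2, 3, 3, 3, 1, 2],   # score in month M-1
        "score_m2": [1, 1, 2, 2, 3, 3, 1, 2],   # score in month M-2
    })

    # Rows: score at M-1; columns: score at M-2.
    # A heavy diagonal indicates a stable score.
    print(pd.crosstab(scores["score_m1"], scores["score_m2"]))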

                                 Original       Now            χ² probability
    variable 1    category 1     number         number
                  category 2     number         number
                  ...
                  TOTAL          total number   total number
    variable 2    category 1     number         number
                  ...
    variable k    ...            ...            ...

It is also worth using charts to monitor the distribution of the variables explaining the

score, comparing the actual distribution to the distribution at the time of creation of the score

(using a χ² or Cramér's V test, for example: see Appendix A), to discover if the distribution of

a variable has changed significantly, or if the rate of missing values is increasing, which may

mean that the score has to be revised.
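This comparison can be sketched with a χ² test on the two sets of category counts (SciPy; the counts here are hypothetical):

    from scipy.stats import chi2_contingency

    # Counts per category of one input variable: at the creation of the
    # score versus now
    original = [500, 300, 200]
    now = [420, 310, 270]

    chi2, p, dof, expected = chi2_contingency([original, now])
    if p < 0.05:
        print("distribution has shifted; the score may need to be revised")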

2.12 Enriching the models

When the results of the use of a first data mining model are measured, the model can be

improved, as we have seen, either by adding independent variables that were not considered

initially, or by using feedback from experience to determine the profiles to be predicted (e.g.

‘purchaser/non-purchaser’) if these results were not available before.

At this point, we can also supplement the results of step 2.2 if they did not provide enough

historic details of the variable to be predicted.

When we have done this, steps 2.3, 2.4, . . . can be repeated to construct a new model which

will perform more satisfactorily than the first one. This iterative process must become

permanent if we are to achieve a continuous improvement of the results. There is one proviso,

however: in the learning phase of a new model, we must never rule out individuals whose
previous scores were considered to be low. Otherwise this model can never be used to question


the validity of previous models, which may sometimes be necessary. For example, if we use

a calculated scoring model to target only the individuals with the best score points, without

ever approaching the others, there is a risk that we will fail to detect a change in behaviour of

some of the initially low-scoring individuals. So it is always desirable to include a small

proportion of individuals with medium or low scores in any targeting which is based on

a model. We will return to this matter in Section 12.4.

2.13 Remarks

The procedure described in this chapter can use a number of data mining techniques with

different origins: for example, multiple correspondence analysis for the transformation of

variables, agglomerative hierarchical clustering for segmentation, and logistic regression for

scoring. There are four important points to be considered:

. For step 2.3, it is essential to collect the data to be studied for several months or even

several years, at least where predictive methods such as scoring are concerned, in order

to provide valid analyses.

. Steps 2.3 and 2.4 are much more time-consuming than steps 2.5–2.7, but are

very important.

. The data mining process is highly iterative, with many possible loopbacks to one or

more of steps 2.4–2.7.

. If the population is large, it will be useful to segment it before calculating a score or

carrying out classification, so that we can work on homogeneous groups of individuals,

making processing more reliable. I will discuss this matter further in the chapter on

automatic clustering.

2.14 Life cycle of a model

Data mining models (especially scores) go through a small-scale trial phase, for adjustment

and validation, and to enable their application to be tested. When models are in production,

they must be applied regularly to refreshed data. However, these models must be reviewed

regularly, as described in the chapter on scoring.

2.15 Costs of a pilot project

The factors determining the number of man-days for a data mining project are so numerous

that I can only give very approximate figures here. They will have to be modified according to

the context. These numbers are mainly useful because they demonstrate the proportions of the

different stages. Furthermore, these figures do not relate to major projects, only to small- and

medium-scale ones such as pilot projects intended to initiate and promote the use of data

mining in businesses. When the processes are fully established, the tools are in full production

and the data are thoroughly known, a reduction in the costs shown in the table in Figure 2.3

may be expected. These costs relate to the data mining team only, excluding costs for the other

personnel involved.


    No.   Step                                       Costs in days        Remarks
                                                     Small-    Medium-
                                                     scale     scale
                                                     project   project

    1     Definition of the target and objectives    4         8          Preliminary analyses and provision of numerical bases for decision making
    2     Listing the existing usable data           7         10         The cost depends on the organization's knowledge of the data
    3, 4  Collection, formatting and exploration     15        28         This cost is largely determined by the number of variables, their quality, the number of tables accessed, etc.
          of the data
    5–8   Developing and validating the models       15        25         Must be multiplied if more than one population segment is to be processed
          (clustering/scoring)
    9     Complementary analyses and delivery        9         12         For example, setting the scoring thresholds
          of results
    10    Documentation and presentations            5         7
    11    Analysis of the first tests conducted      5         10         Statistical study leading to subsequent action

          TOTAL                                      60        100

[Bar chart: the percentage of the total cost accounted for by each stage: defining the aims 8%, data inventory 10%, data preparation 20%, creating the analysis base 8%, construction and validation of models 25%, output of models 12%, documentation and presentations 7%, analysis of the first tests 10%.]

Figure 2.3 Costs of a data mining project.
