2

The development of a data mining study
Before examining the key points of a data mining study or project in the following chapters,
I shall briefly run through the different phases in this chapter. Each phase is mentioned here,
even though some of them are optional (those covered in Sections 2.5 and 2.7), and most of
them can be handed over to a specialist firm (except for those covered in Section 2.1 and parts
of Sections 2.2 and 2.10). I will provide further details and illustrations of some of these steps
in the chapter on scoring. We shall assume that statistical or data mining software has already
been acquired; the chapter on software deals with the criteria for choice and comparison.
As a general rule, the phases of a data mining project are as follows:
. defining the aims;
. listing the existing data;
. collecting the data;
. exploring and preparing the data;
. population segmentation;
. drawing up and validating the predictive models;
. deploying the models;
. training the model users;
. monitoring the models;
. enriching the models.
Data Mining and Statistics for Decision Making, First Edition. Stéphane Tufféry.
© 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd. ISBN: 978-0-470-68829-8
2.1 Defining the aims
We must start by choosing the subject, defining the target population (e.g. prospects and
customers, customers only, loyal customers only, all patients, only those patients who can be
cured by the treatment under test), defining the statistical entity to be studied (e.g. a person,
a household consisting of spouses only, a household including dependent children, a business
with or without its subsidiaries), defining some essential criteria and especially the phenom-
enon to be predicted, planning the project, deciding on the expected operational use of
the information extracted and the models produced, and specifying the expected results.
To do this, we must arrange for meetings to be held between the clients (risk management
and marketing departments, specialists, future users, etc., as appropriate) and the service
providers (statisticians and IT specialists). As some data mining projects are mainly
horizontal, operating across several departments, it will be useful for the general management
to be represented at this stage, so that the inevitable arbitration can take place. The top
management will also be able to promote the new data mining tools, if their introduction, and
the changes in procedure that they entail, encounter opposition in the business. The contacts
present at this stage will subsequently attend periodic steering committee meetings to monitor
the progress of the project.
This stage will partly determine the choice of the data mining tools to be used. For
example, if the aim is to set out explicit rules for a marketing service or to discover the factors
in the remission of a disease, neural networks will be excluded.
The aims must be very precise and must lead to specific actions, such as the refining of
targeting for a direct marketing campaign. In the commercial field, the aims must also be
realistic (see Section 13.1) and allow for the economic realities, marketing operations that
have already been conducted, the penetration rate, market saturation, etc.
2.2 Listing the existing data
The second step is the collection of data that are useful, accessible (inside or outside the
business or organization), legally and technically exploitable, reliable and sufficiently up to
date as regards the characteristics and behaviour of the individuals (customers, patients, users,
etc.) being studied. These data are obtained from the IT system of the business, or are stored in
the business outside the central IT system (in Excel or Access files, for example), or are bought
or retrieved outside the business, or are calculated from earlier data (indicators, ratios,
changes over time). If the aim is to construct a predictive model, it will also be necessary to
find a second type of data, namely the historical data on the phenomenon to be predicted. Thus
we need to have the results of medical trials, marketing campaigns, etc., in order to discover
how patients, customers, etc. with known characteristics have reacted to a medicine or a mail
shot, for example. In the second case, the side effects are rarer: we simply need to
know whether the customer who has been approached has or has not bought the products being
offered, or some other product, or nothing at all. The model to be constructed must therefore
correlate the fact of the purchase, or the cure, with the other data held on the individual.
A problem may arise if the business does not have the necessary data, either because it has
not archived them, or because it is creating a new activity, or simply because it has little direct
contact with its customers. In this case, it must base its enquiries on samples of customers,
if necessary by offering gifts as an incentive to reply to questionnaires. It could also use
geomarketing, mega-databases (Acxiom, Wegener Direct Marketing) or tools such as ‘first
name scoring’ (see Chapter 4). Finally, the business can use standard models designed by
specialist companies. For example, generic risk scores are available in the banking sector
(see Section 12.6.2).
Sometimes the business may have data that are, unfortunately, unsuitable for data mining,
for example when they are:
. detailed data available on microfilm, useful for one-off searches but exorbitantly
expensive to convert to a usable computerized format;
. data on individual customers that are aggregated at the end of the year to provide an
annual summary (e.g. the number and value of purchases, but with no details of dates,
products, etc.);
. monthly data of a type which would be useful, but aggregated at the store level, not the
customer level.
In such cases, the necessary data must be collected as a matter of urgency (see also
Section 4.1.7).
A note on terminology. Variable is taken to mean any characteristic of an entity (person,
organization, object, event, case, etc.) which can be expressed by a numerical value
(a measurement) or a coded value (an attribute). The different possible values that a
variable can take in the whole set of entities concerned are called the categories of the
variable. The statistical entity that is studied is often called the individual, even if it is not
a person. According to statistical usage, we may also use the term observation.
2.3 Collecting the data
This step leads to the construction of the database that will be used to build the
models. This analysis base is usually in the form of a table (DB2, Oracle, SAS, etc.) or a file
(flat file, CSV file, etc.) having one record (one row) for each statistical individual studied and
one field (one column) for each variable relating to this individual. There are exceptions to this
principle: among the most important of these are time series and predictive models with
repeated measurement. There are also spatial data, in which the full meaning of the record of
each entity is only brought out as a function of the other entities (propagation of a polluting
product, installation of parking meters or ticket machines, etc.). In some cases, the variables
are not the same for all the individuals (for example, medical examinations differ with the
patients). In general, however, the data can be represented by a table in which the rows
correspond to the individuals studied and the columns correspond to the variables describing
them, which are the same for all the individuals. This representation may be rather too
simplistic in some cases, but has the merit of convenience.
The analysis base is often built up from snapshots of the data taken at regular intervals
(monthly, for example) over a certain period (a year, for example). On the other hand, these
data are often not determined at the level of the individual, but at the level of the product,
account, invoice, medical examination, etc. In order to draw up the analysis base, we must
therefore provide syntheses or aggregations along a number of axes. These include the
individual axis, where data determined at a finer level (the level of the products owned by the
individual, for example) are synthesized at the individual level, and the time axis, where
n indicators determined at different instants are replaced by a single indicator determined for
the whole period. For example, in the case of wages and other income paid into a bank, we
start by calculating the monthly sum of income elements paid into the different accounts of the
household, before calculating the mean of these monthly income elements over 12 months. By
observing the independent variables over a year, we can make allowance for a full economic
cycle and smooth out the effect of seasonal variations.
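This two-step aggregation along the time axis can be sketched as follows (the income figures are hypothetical, chosen purely for illustration; the book itself works with SQL, SAS or similar tools rather than Python):

```python
from statistics import mean

# Monthly income paid into each account of one household (hypothetical figures).
# Keys are account identifiers; values are 12 monthly income amounts.
accounts = {
    "A1": [1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500],
    "A2": [200, 0, 0, 300, 0, 0, 250, 0, 0, 0, 400, 0],
}

# Step 1: monthly sum of income over all the household's accounts.
monthly_totals = [sum(month) for month in zip(*accounts.values())]

# Step 2: mean of these monthly totals over the 12 months, which smooths
# out seasonal variations over a full economic cycle.
annual_mean_income = mean(monthly_totals)
```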
As regards the aggregation of data at the individual level, this can be done (to take an
example from the commercial sector) by aggregating the ‘product’ level data (one row per
product in the files) into ‘customer’ level data, thus producing, in the file to be examined, one
row per customer, in which the ‘product’ data are replaced by sums, means, etc., of all the
customer’s products. By way of example, we can move from this file:

customer no.    date of purchase    product purchased    value of purchase
customer 1      21/02/2004          jacket               25
customer 1      17/03/2004          shirt                15
customer 2      08/06/2004          T-shirt              10
customer 2      15/10/2004          socks                8
customer 2      15/10/2004          pullover             12
...             ...                 ...                  ...
customer 2000   18/05/2004          shoes                50
...             ...                 ...                  ...

to this file:

customer no.    date of 1st purchase    date of last purchase    value of purchases
customer 1      21/02/2004              17/03/2004               40
customer 2      08/06/2004              15/10/2004               30
...             ...                     ...                      ...
customer 2000   18/05/2004              18/05/2004               50
...             ...                     ...                      ...

Clearly, the choice of the statistical operation used to aggregate the ‘product’ data for
each customer is important. Where risk is concerned, for example, very different results may be
obtained, depending on whether the number of debit days of a customer is defined as the mean or
the maximum number of debit days for his current accounts. The choice of each operation must
be carefully thought out and must have a functional purpose. For risk indicators, it is generally
the most serious situation of the customer that is considered, not his average situation.

When a predictive model is to be constructed, the analysis base will be of the kind shown
in Figure 2.1. This has four types of variable: the identifier (or key) of the individual, the target
variable (i.e. the dependent variable), the independent variables, and a ‘sample’ variable
which shows whether each individual is participating in the development (learning) or the
testing of the model (see Section 11.16.4).
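The move from a ‘one row per purchase’ file to a ‘one row per customer’ file can be sketched as follows (Python is used purely for illustration, with the same data as in the tables above):

```python
from collections import defaultdict
from datetime import date

# One row per purchase, as in the first file (dates given as dd/mm/yyyy above).
purchases = [
    ("customer 1", date(2004, 2, 21), "jacket", 25),
    ("customer 1", date(2004, 3, 17), "shirt", 15),
    ("customer 2", date(2004, 6, 8), "T-shirt", 10),
    ("customer 2", date(2004, 10, 15), "socks", 8),
    ("customer 2", date(2004, 10, 15), "pullover", 12),
]

# Group the 'product' level rows by customer.
rows = defaultdict(list)
for cust, d, product, value in purchases:
    rows[cust].append((d, value))

# One row per customer: first purchase date, last purchase date, total value.
by_customer = {
    cust: (min(d for d, _ in items),
           max(d for d, _ in items),
           sum(v for _, v in items))
    for cust, items in rows.items()
}
```

The same synthesis would be expressed as a GROUP BY with MIN, MAX and SUM in SQL.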
[Figure 2.1 Developing a predictive analysis base. The base has one row per customer, with four types of column: the customer number; the target (dependent) variable ‘purchaser (Y/N)’, observed in year n; the independent variables 1 to m (occupational and social category, age, family status, number of purchases, value of purchases, etc.), observed in year n-1; and a ‘sample’ variable assigning each customer at random to the training or test sample. The Y class (customers targeted in year n who purchased) and the N class (customers targeted in year n who did not purchase) should each contain at least 500 customers, giving at least 1000 cases in all for the prediction.]
The analysis base is constructed in accordance with the aim of the study. The aim may be to
find a function f of the independent variables of the base, such that f (independent variables of a
customer) is the probability that the customer (or patient)will buy the product offered, orwill fall
ill, etc. The data are often observed over 18 months (better still, 24 months): the most recent 6 (or
12) months are the period of observation for the purchase of a product or the onset of a disease,
and the previous 12 months are the period in which the customer’s data are analysed to explain
what has been observed in the recent period. In this case, we are seeking a function f such that

Probability(dependent variable = x) = f(independent variables).

The periods of observation of the independent and dependent variables must be separated;
otherwise, the values of the independent variables could be the consequence, not the cause,
of the value of the dependent variable.
For example, if we examine the payments made with a card after the acquisition of a
second card in a household, they will probably be higher than for a household which has not
acquired a second card. However, this does not tell us anything useful. We need to know if the
household that has acquired a second card used to make more payments by card before
acquiring its second card than the household that did not acquire one.
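A minimal sketch of this separation of periods, with hypothetical card payments: the independent variable is computed only from payments made before the cutoff that opens the observation period, so that the behaviour to be explained cannot leak into the features.

```python
from datetime import date

CUTOFF = date(2004, 1, 1)  # start of the observation period (hypothetical)

# Card payments with their dates for one household (hypothetical figures).
payments = [(date(2003, 5, 10), 80), (date(2003, 11, 2), 120),
            (date(2004, 3, 15), 200), (date(2004, 6, 1), 150)]

# Independent variable: payments strictly BEFORE the cutoff.
spend_before = sum(v for d, v in payments if d < CUTOFF)

# Behaviour observed AFTER the cutoff: this is what the model must explain,
# never what it is allowed to use as an input.
spend_after = sum(v for d, v in payments if d >= CUTOFF)
```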
2.4 Exploring and preparing the data
In this step, checks on the origin of the data, frequency tables, two-way tables, descriptive
statistics and the experience of users make three operations possible.
The first operation is to make the data reliable, by replacing or removing data that are
incorrect because they contain too many missing values, aberrant values or extreme values
(‘outliers’) that are too remote from the normally accepted values. If only a few rare
individuals show one of these anomalies for one of the variables studied, they can be removed
from the study (at least in the model construction phase), or their anomalous values can be
replaced by a correct and plausible value (see Section 3.3). On the other hand, if a variable is
anomalous for too many individuals, this variable is unsuitable for use. Be careful about
numerical variables: we must not confuse a significant value of 0 with a value set to 0 by
default because no information is provided. The latter 0 should be replaced with the ‘null’
value in DB2, ‘NA’ in R or the missing value ‘.’ in SAS. Remember that a variable whose
reliability cannot be assured must never be used in a model. A model with one variable
missing is more useful than a model with a false variable. The same applies if we are uncertain
whether a variable will always be available or always correctly updated.
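The distinction between a significant 0 and a default 0 can be illustrated as follows (the `information_provided` flag is a hypothetical data layout, not something prescribed by the book):

```python
# Raw balances where 0 may mean "no information supplied" (a default) rather
# than a true zero; a separate flag records which is the case.
raw = [(0, False), (0, True), (150, True)]  # (balance, information_provided)

# Replace uninformative zeros by None, the Python analogue of SQL's NULL,
# R's NA or SAS's missing value '.', so that they are treated as missing
# rather than as genuine zeros in later statistics.
cleaned = [balance if provided else None for balance, provided in raw]
```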
The second operation is the creation of relevant indicators from the raw data which have
been checked and corrected where necessary. This can be done as follows:
. by replacing absolute values with ratios, often the most relevant ones;
. by calculating the changes of variables over time (for example the mean for the recent
period, divided by the mean for the previous period);
. by making linear combinations of variables;
. by composing variables with other functions (such as logarithms or square roots of
continuous variables, to smooth their distribution and compress a highly right-skewed
distribution, as commonly found in financial or reliability studies);
. by recoding certain variables, for example by converting ‘low, medium, high’ into
‘1, 2, 3’;
. by modifying units of measurement;
. by replacing the dates by durations, customer lifetimes or ages;
. by replacing geographical locations with coordinates (latitude and longitude).
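A few of the transformations above can be sketched in one place (the raw figures are hypothetical):

```python
import math

# Hypothetical raw figures for one customer.
recent_mean, previous_mean = 420.0, 350.0
value_of_purchases, number_of_purchases = 1200.0, 30

# Change over time expressed as a ratio of two period means.
trend = recent_mean / previous_mean

# A ratio replacing two absolute values: mean value of a purchase.
mean_basket = value_of_purchases / number_of_purchases

# A logarithm to compress a highly right-skewed distribution.
log_value = math.log(value_of_purchases)
```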
Thus we can create relevant indicators, following the advice of specialists if necessary. In
the commercial field, these indicators may be:
. the age of the customer when his relationship with the business begins, based on the date
of birth and the date of the first purchase;
. the number of products purchased, based on the set of variables ‘product Pi purchased
(Yes/No)’;
. the mean value of a purchase, based on the number and value of the purchases;
. the recentness and frequency of the purchases, based on the purchase dates;
. the rate of use of revolving credit, deduced from the ceiling of the credit line and the
proportion actually used.
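The recentness and frequency indicators, for instance, can be derived directly from the purchase dates (hypothetical dates below; the analysis date is assumed to be the end of the observation period):

```python
from datetime import date

purchase_dates = [date(2004, 2, 21), date(2004, 3, 17), date(2004, 10, 15)]
analysis_date = date(2004, 12, 31)

frequency = len(purchase_dates)                       # number of purchases
recency = (analysis_date - max(purchase_dates)).days  # days since last purchase
```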
The third operation is the reduction of the number of dimensions of the problem:
specifically, the reduction of the number of individuals, the number of variables, and the
number of categories of the variables.
As mentioned above, the reduction of the number of individuals is a matter of eliminating
certain extreme individuals from the population, in a proportion which should not normally
be more than 1–2%. It is also common practice to use a sample of the population (see
Section 3.15), although care must be taken to ensure that the sample is representative: this may
require a more complex sampling procedure than ordinary random sampling.
The reduction of the number of variables consists of:
. disregarding some variables which are too closely correlated with each other and which,
if taken into account simultaneously, would infringe the commonly required assumption
of non-collinearity (linear independence) of the independent variables (in linear
regression, linear discriminant analysis or logistic regression);
. disregarding certain variables that are not at all relevant or not discriminant with respect
to the specified objective, or to the phenomenon to be detected;
. combining several variables into a single variable; for example, the number of
products purchased with a two-year guarantee and the number of products purchased
with a five-year guarantee can be added together to give the number of products
purchased without taking the guarantee period into account, if this period is not
important for the problem under investigation; it is also possible to replace the
‘number of purchases’ variables established for each product code with the ‘number
of purchases’ variables for each product line, which would also reduce the number
of variables;
. using factor analysis (if appropriate) to convert some of the initial variables into a
smaller number of variables (by eliminating the correlated variables), these variables
being chosen to maximize the variance (the new variables are also more stable in
relation to sampling and random fluctuations).
Reducing the number of variables by factor analysis, for example by principal component
analysis (PCA: see Section 7.1), is useful when these variables are used as the input of a neural
network. This is because a decrease in the number of input variables allows us to decrease the
number of nodes in the network, thus reducing its complexity and the risk of convergence of the
network towards a non-optimal solution. PCA may also be used before automatic clustering.
Quite commonly, 100 to 200 or more variables may be investigated during the develop-
ment of a classification or prediction model, although the number of ultimately discriminant
variables selected for the calculation of the model is much smaller, the order of magnitude
being as follows:
. 3 or 4 for a simple model (although the model can still classify more than 80% of the
individuals);
. 5–10 for a model of normal quality;
. 11–20 for a very fine (or refined) model.
Beyond 20 variables, the model is very likely to lose its capacity for generalization. These
figures may have to be adjusted according to the size of the population.
The reduction of the number of categories is achieved by:
. combining categories which are too numerous or contain too few observations, in the
case of discrete and qualitative variables;
. combining the categories of discrete and qualitative variables which have the same
functional significance, and are only distinguished for practical reasons which are
unrelated to the analysis to be carried out;
. discretizing (binning) some continuous variables (i.e. converting them into ranges of
values such as quartiles, or into ranges of values that are significant from the user’s point
of view, or into ranges of values conforming to particular rules, as in medicine).
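Discretization into quartiles, the first option above, can be sketched as follows (the values are hypothetical):

```python
from statistics import quantiles

values = [3, 7, 8, 12, 13, 14, 18, 21, 23, 27, 30, 35]

# Quartile cut points (the default 'exclusive' method of statistics.quantiles).
q1, q2, q3 = quantiles(values, n=4)

def bin_quartile(x):
    """Convert a continuous value into one of four ranges of values."""
    if x <= q1:
        return "Q1"
    if x <= q2:
        return "Q2"
    if x <= q3:
        return "Q3"
    return "Q4"

binned = [bin_quartile(x) for x in values]
```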
Examples of categories which are combined because they are too numerous are the
individual articles in a product catalogue (which can be replaced by product lines), social and
occupational categories (which can be aggregated into about 10 values), the segments of a
housing type (which can be combined into families as indicated in Section 4.2.1), counties
(which can be regrouped into regions), the activity codes of merchants (the Merchant
Category Code used in bank card transactions) and of businesses (NACE, the French name
of the Statistical Classification of Economic Activities in the European Community, equiva-
lent to the Standard Industrial Classification in the United Kingdom and North America),
which can be grouped into industrial sectors, and so on. By way of example, the sectors of activity
may be:
. Agriculture, Forestry, Fishing and Hunting
. Mining
. Utilities
. Construction
. Manufacturing
. Wholesale Trade
. Retail Trade
. Transportation and Warehousing
. Information
. Finance and Insurance
. Real Estate and Rental and Leasing
. Professional, Scientific, and Technical Services
. Management of Companies and Enterprises
. Administrative and Support and Waste Management and Remediation Services
. Education Services
. Health Care and Social Assistance
. Arts, Entertainment, and Recreation
. Accommodation and Food Services
. Other Services (except Public Administration)
. Public Administration
Extreme examples of variables having too many categories are variables which are
identifiers (customer number, account number, invoice number, etc.). These are essential keys
for accessing data files, but cannot be used as predictive variables.
2.5 Population segmentation
It may be necessary to segment the population into groups that are homogeneous in relation
to the aims of the study, in order to construct a specific model for each segment, before making
a synthesis of the models. This method is called the stratification of models. It can only be used
where the volume of data is large enough for each segment to contain enough individuals of
each category for prediction. In this case, the pre-segmentation of the population is a method
that can often improve the results significantly. It can be based on rules drawn up by experts or
by the statistician. It can also be carried out more or less automatically, using a statistical
algorithm: this kind of ‘automatic clustering’, or simply ‘clustering’, algorithm is described
in Chapter 9.
There are many ways of segmenting a population in order to establish predictive models
based on homogeneous sub-populations. In the first method, the population is segmented
according to general characteristics which have no direct relationship with the dependent
variable. This pre-segmentation is unsupervised. In the case of a business, it may relate to its
size, its legal status or its sector of activity. At the Banque de France, the financial health of
businesses is modelled by sector: the categories are industry, commerce, transport, hotels,
cafés and restaurants, construction, and business services. For a natural person, we may use
sociodemographic characteristics such as age or occupation. In this case, this approach can be
justified by the existence of specific marketing offers aimed at certain customer segments,
such as ‘youth’, ‘senior’, and other groups. From the statistical viewpoint, this method is not
always the best, because behaviour is not always related to general characteristics, but it may
be very useful in some cases. It is usually applied according to rules provided by experts,
rather than by statistical clustering.
Another type of pre-segmentation is carried out on behavioural data linked to the
dependent variable, for example the product to which the scoring relates. This pre-
segmentation is supervised. It produces segments that are more homogeneous in terms of
the dependent variable. Supervised pre-segmentation is generally more effective than
unsupervised pre-segmentation, because it has some of the discriminant power required in
the model. It can be implemented according to expert rules, or simply on a common-sense
basis, or by a statistical method such as a decision tree with one or two levels of depth. Even a
decision tree with only one level of depth can be used to separate the population to be
modelled into two clearly differentiated classes (or more with chi-square automatic
interaction detection), without having the instability that is a feature of deeper trees. One
common-sense rule could be to establish a consumer credit propensity score for two
segments: namely, those customers who already have this kind of credit, and the rest. In
this example, we can see that the rules can be expressed as a decision tree. This method often
enables the results to be improved.
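The common-sense rule just described amounts to a one-level decision tree, which might be sketched as (the field name `has_consumer_credit` is a hypothetical data layout):

```python
def credit_propensity_segment(customer):
    """One-level rule splitting the population into two segments, each of
    which will receive its own consumer credit propensity score:
    customers who already hold this kind of credit, and the rest."""
    return "holders" if customer.get("has_consumer_credit") else "others"

segments = [credit_propensity_segment(c) for c in
            [{"has_consumer_credit": True}, {"has_consumer_credit": False}]]
```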
A third type of pre-segmentation can be required because of the nature of the available
data: for example, they will be much less rich for a prospect than for a customer, and these
two populations must therefore be separated into two segments for which specific models
will be constructed.
In all cases, pre-segmentation must follow explicit rules for classifying every individual.
The following features are essential:
. simplicity of pre-segmentation (there must not be too many rules);
. a limited number of segments and stability of the segments (these two aspects
are related);
. segment sizes generally of the same order of magnitude, because, unless we have
reasons for examining a highly specific sub-population, a 99%/1% distribution is rarely
considered satisfactory;
. uniformity of the segments in terms of the independent variables;
. uniformity of the segments in terms of the dependent variable (for example, care must
be taken to avoid combining high-risk and low-risk individuals in the same segment).
We should note that some operations described in the preceding section (reduction of the
number of individuals, variables and categories) can be carried out either before any
segmentation, or on each segment.
2.6 Drawing up and validating predictive models
This step is the heart of the data mining activity, even if it is not the most time-consuming. It
may take the form of the calculation of a score, or more generally a predictive model, for each
segment produced in the preceding step, followed by verification of the results in a test sample
that is different from the learning sample. It may also be concerned with detecting the profiles
of customers, for example according to their consumption of products or their use of the
services of a business. Sometimes it will be a matter of analysing the contents of customers’
trolleys in department stores and supermarkets, in order to find associations between products
that are often purchased together. A detailed review of the different statistical and data mining
methods for developing a model, with their principles, fields of application, strong points
and limitations, is provided in Chapters 6–11.
In the case of scoring, the principal methods are logistic regression, linear discriminant
analysis and decision trees. Trees provide entirely explicit rules and can process heterogeneous
and possibly incomplete data, without any assumptions concerning the distribution of the
variables, and can detect non-linear phenomena. However, they lack reliability and the capacity
for generalization, and this can prove troublesome. Logistic regression directly calculates the
probability of the appearance of the phenomenon that is being sought, provided that the
independent variables have no missing values and are not interrelated. Discriminant analysis
also requires the independent variables to be continuous and to follow a normal law (plus
another assumption that will be discussed below), but in these circumstances it is very reliable.
Logistic regression has an accuracy very similar to that of discriminant analysis, but is more
general because it deals with qualitative independent variables, not just quantitative ones.
In this step, several models are usually developed, in the same family or in different
families. The development of a number of models in the same family is most common,
especially when the aim is to optimize the parameters of a model by adjusting the independent
variables chosen, the number of their categories, the depth or number of leaves on a decision
tree, the number of neurons in a hidden layer, etc. We can also develop models in different
families (e.g. logistic regression and linear discriminant analysis) which are run jointly to
maximize the performance of the model finally chosen.
The aim is therefore to choose the model with the best performance out of a number of
models. I will discuss in Section 11.16.5 the importance of evaluating the performance of a
model on a test sample which is separate from the learning sample used to develop the model,
while being equally representative in statistical terms.
For comparing models of the same kind, there are statistical indicators which we will
examine in Chapter 11. However, since the statistical indicators of two models of different kinds
(e.g. R2 for a discriminant analysis and R2 for a logistic regression) are rarely comparable, the
models are compared either by comparing their error rates in a confusion matrix (see Section
11.16.4) or by superimposing their lift curves or receiver operating characteristic (ROC) curves
(see Section 11.16.5). These indicators can be used to select the best model in each family of
models, and then the best model from all the families considered together.
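Comparing error rates from the confusion matrix can be sketched as follows (the test-sample outcomes for the two competing models are hypothetical):

```python
def error_rate(actual, predicted):
    """Overall error rate from the confusion matrix: off-diagonal cases
    (misclassified individuals) divided by the total number of cases."""
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)

# Hypothetical outcomes on the same test sample for two competing models.
actual  = ["Y", "Y", "N", "N", "Y", "N", "N", "Y"]
model_a = ["Y", "N", "N", "N", "Y", "N", "Y", "Y"]
model_b = ["Y", "Y", "N", "N", "N", "N", "N", "Y"]

better = "A" if error_rate(actual, model_a) < error_rate(actual, model_b) else "B"
```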
The ROC curve can be used to visualize the separating power of a model, representing the
percentage of correctly detected events (the Y axis) as a function of the percentage of
incorrectly detected events (the X axis) when the score separation threshold is varied. If this
curve coincides with the diagonal, the performance of the model is no better than that of a
random prediction; as the curve approaches the upper left-hand corner of the square, the
model improves. Several ROC curves can be superimposed to show the progressive
contribution of each independent variable in the model, as in Figure 2.2. This shows that
a single variable, if well chosen, already explains much of the phenomenon to be predicted,
and that an increase in the number of variables added to the model decreases the marginal gain
they provide.
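The construction of an ROC curve by varying the score separation threshold can be sketched as follows (scores and event labels are hypothetical; the area under the curve, computed by the trapezoidal rule, equals 0.5 for a random prediction and 1 for a perfect one):

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs as the score threshold is lowered through each case;
    labels are 1 for a detected event, 0 otherwise."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)

# Area under the ROC curve by the trapezoidal rule.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
```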
In addition to the qualities of mathematical accuracy of the model, other selection criteria
may sometimes have to be taken into account, such as the readability of the model from the
user’s viewpoint, and the ease of implementing it in the production IT system.
2.7 Synthesizing predictive models of different segments
This small step only takes place if customer segmentation has been carried out. We need to
compare the scores of the different segments, and calculate a unique normalized score that is
comparable from one segment to another. This new score allows for the proportion of
customers in each segment, and each score band, who have already achieved the aim,
i.e. bought the product, repaid the loan, etc.
The synthesis of the scores for the different segments is easy in a logistic regression model,
since the score is a probability which is inherently normalized and is suitable for the same
interpretation in all the segments (provided that the sampling has been comparable in the
different segments).
In all cases, it is possible to draw up a table showing the score points (for propensity in this
case) of the different segments, indicating the subscription rate corresponding to this score
and this segment on each occasion (Table 2.1). The rows of this table are then sorted by
subscription rate, from lowest to highest. The rows sorted in this way are then combined into ten
groups, each covering about 10% of the customers, with the rows of each group following each
other and therefore having similar subscription rates. One point is given to the customers of the
first group, and so on up to the customers of the tenth group, who have 10 points.
Figure 2.2 ROC curve.
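The decile construction just described (sorting the segment/score rows by subscription rate, then cutting them into ten groups each covering about 10% of the customers) can be sketched as follows; the row structure and field names are illustrative assumptions, and the groups are cut on each row's cumulative midpoint rather than on exact decile boundaries:

```python
def assign_points(rows, n_groups=10):
    """rows: list of dicts with 'n' (customer count) and 'rate'
    (subscription rate), one per (segment, score) cell.
    Returns the points (1..n_groups) of the rows sorted by rising rate."""
    rows = sorted(rows, key=lambda r: r['rate'])
    total = sum(r['n'] for r in rows)
    points, cum = [], 0
    for r in rows:
        # Group index of the midpoint of this row's block of customers.
        g = min(int((cum + r['n'] / 2) / total * n_groups), n_groups - 1)
        points.append(g + 1)
        cum += r['n']
    return points
```

Customers in the lowest-rate group thus receive 1 point and those in the highest-rate group 10 points, matching the normalized score of Table 2.1.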
2.8 Iteration of the preceding steps
The procedure described in Sections 2.4 to 2.7 is usually iterated until completely satisfactory results are obtained. The input data are adjusted, new variables are selected or the parameters are changed in the data mining tools, which, if operating in an exploratory mode, as mentioned above, may be highly sensitive to the choice of initial data and parameters, especially in neural networks.
Before accepting a model, it is useful to have it validated by experts in the field, for example credit analysts in the case of scoring, or market research managers. These experts can be shown some real cases of individuals to which different competing models have been applied, and they can then be asked to say which model appears to provide the best results, giving reasons for their response. Thus they can detect the characteristics of a model which would not have appeared during modelling. This may be done in the field of marketing and propensity scores. The method of modelling on historic data tends to reproduce the commercial habits of the business, even if precautions are taken to avoid this (by neutralizing the effect of sales campaigns, for example). Thus, recent customers may be rather too well scored, since they are usually the most likely to be approached by sales staff. In the test phase, we may discover that older customers are not described well enough by the propensity score, and that the use of this score would result in their not being targeted. This phase could thus lead to a reduction in the effect of the variables relating to customer lifetime; these variables may even be removed from the model. In the field of risk, such variables may be those which forecast the risk too late and mask variables which would allow earlier prediction; or they may simply be variables which are not obvious to experts in the field, and which are discarded from the model.
2.9 Deploying the models
Deployment involves the implementation of the data mining models in a computer system,
before using the results for action (adapting procedures, targeting, etc.) and making them
available to users (information on the workplace, etc.). It is harder to update the ‘workplace’
applications to incorporate the score, especially if we wish to allow users to input new
Table 2.1 Synthesizing the score points of the different segments.

Segment   Score   Number of customers   Subscription rate
1         1       n1,1                  t1,1
1         2       n1,2                  t1,2
. . .     . . .   . . .                 . . .
1         10      n1,10                 t1,10
2         1       n2,1                  t2,1
2         2       n2,2                  t2,2
. . .     . . .   . . .                 . . .
5         1       n5,1                  t5,1
. . .     . . .   . . .                 . . .
5         10      n5,10                 t5,10
information in order to upload them to the central production files or to execute the model in
real time. For an initial test, therefore, we may limit ourselves to creating a computer file
containing the data yielded by data mining, for example the customers’ score points; this file
may or may not be uploaded to the computerized production system, and may be used for
targeting in direct marketing. A spreadsheet file can be used to create a mail merge, using word processing software.
In this step we must decide on the output of the data at different levels of fineness. There
may be fine grades in the production files (from 1 to 1000, for example), which are aggregated
at the workstation (from 1 to 10) or even regrouped into bands (low/medium/high).
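The successive coarsenings can be implemented as simple integer mappings, as in the sketch below (the 1-1000 fine range and the low/medium/high cut-offs are illustrative assumptions, not values prescribed by the text):

```python
def coarsen(fine, fine_max=1000, coarse_levels=10):
    """Map a fine production grade (1..fine_max) onto 1..coarse_levels
    for display at the workstation."""
    return (fine - 1) * coarse_levels // fine_max + 1

def band(coarse):
    """Regroup a 1..10 workstation grade into low/medium/high bands;
    the cut-offs are hypothetical."""
    return 'low' if coarse <= 3 else 'medium' if coarse <= 7 else 'high'
```

Keeping the fine grade in the production files while deriving the coarser views on demand means that the bands can be redefined later without recomputing the score.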
The confidentiality of the score output will be considered in this step, if not before: for
example, is a salesman allowed to see the scores of all customers, or only the ones for which he
is responsible? The question of the frequency of updating of the data must also be considered:
this is usually monthly, but a quarterly frequency may be suitable for data that do not change
very often, while a daily frequency may be required for sensitive risk data.
Before deploying a data mining tool in a commercial network, it is obviously preferable to
test it for several months with volunteers. There will also be a phase of take-over of the tool,
possibly followed by a phase in which the initial volunteers train other users.
2.10 Training the model users
To ensure a smooth take-over of the new decision tools by the future users, it is essential to
spend some time familiarizing them with these tools. They must know the objective (and stick
to it), the principles of the tools, how they work (without going into technical details, which
would subsequently have to be justified individually), their limits (noting that these are only
statistical tools which may not be useful in special cases: the difficulty in educational terms is
to maintain a minimum of vigilance and critical approach without casting doubt on the
reliability of the tools), the methods for using them (while pointing out that these are decision
support tools, not tools for automatic decision making), what the tools will contribute
(the most important point), and how the users’ work patterns will change, in both operational
and organizational terms (adaptation of procedures, increased delegation, rules for res-
ponsibility if the user does or does not follow the recommendations of the tool, in risk scoring
for example).
2.11 Monitoring the models
Two forms of monitoring will be distinguished in this section. The first is one-off. Whenever
a new data mining application is brought into use, the results must be analysed. Let us take
the example of a major marketing campaign for which a propensity score has been
developed to improve response rates. After the campaign, it is important to ensure that
the response rates match the score values, and that the customers with the highest scores
have in fact responded best. If a category of customers with a high score has not responded
well to the marketing offer, this must be examined in detail to see if it is possible to segment
the category or apply a decision tree to it, in order to discover the profiles of the
non-purchasers and take this into account in the next campaign. It will be necessary to
check whether the definition of this profile uses a variable which is not allowed for in the
score calculation. It may be useful to complement this quantitative analysis with a
qualitative analysis based on the opinions of the sales staff who have used the score: they
may have some idea of the phenomena which were not taken into account or were
underestimated by the score; they may also have intuitively identified the poorly scored
populations. The analysis of the low score results is also useful, especially when the score
has been developed in the absence of data on customers’ refusal to purchase, and when
‘negative’ profiles for the score modelling therefore had to be defined in advance, instead of
being determined on the basis of evidence, resulting in a degree of arbitrariness. An
examination of the feedback from the campaign will make it possible to corroborate or to
reject certain hypotheses on the ‘negative’ profiles.
The second form of monitoring is continuous. When a tool such as a risk score is used in
a business, it is used constantly and both its correct operation and its use must be checked.
Its operation can be monitored by using tables like the following:
score    M-1   M-2   M-3   M-4   M-5   M-6   M-7   M-8   M-9   M-10   M-11   M-12
1
2
3
. . .
TOTAL
The second column of this table shows, for the month M-1 before the analysis and for each score value, the number of customers, the percentage of customers in the total population, and the default rate. The sum of the losses of the business can be added if required. Clearly, the rate of non-payments will be replaced by the subscription rate in the case of a propensity score, and by the departure rate in the case of an attrition score. The third column contains the same figures for the month M-2, and so on. If required, a column containing the figures at the time when the score was created may be added. These figures can be used to check whether the score is still relevant in terms of prediction, and especially whether the rate of non-payments (or subscription, or attrition) is still at the level predicted by the score. It is even better to calculate the figures in column M-n in the following way: if the goal of the score is (for example) to give a 12-month prediction, we measure, for each value of the score, the number of customers at M-n-12 and the number of defaults between M-n-12 and M-n.
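One column of such a monitoring table can be computed as in the sketch below (the dictionary inputs are assumptions; in the cohort variant just described, `cohort` would hold the customer counts at M-n-12 and `defaults` the defaults observed between M-n-12 and M-n):

```python
def monitoring_column(cohort, defaults):
    """cohort: {score value: number of customers};
    defaults: {score value: number of defaults observed}.
    Returns, per score value, (count, share of population, default rate),
    plus a TOTAL row."""
    total = sum(cohort.values())
    table = {}
    for s, n in sorted(cohort.items()):
        d = defaults.get(s, 0)
        table[s] = (n, n / total, d / n if n else 0.0)
    table['TOTAL'] = (total, 1.0, sum(defaults.values()) / total)
    return table
```

For a propensity or attrition score, the same function applies with the defaults replaced by subscriptions or departures.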
The use of the score is monitored by using a table similar to the preceding one, in which the
rate of non-payments is replaced by the number and value of transactions concluded in a given
month and for a given score value. Thus it is possible to see if sales staff are taking the risk
score into account to limit their involvement with the most high-risk customers. Indicators
such as ‘infringements’ of the score by its users can be added to this table.
These tables are useful for departmental, risk and marketing experts, but may also be
distributed in the business network to enable the results of actions to be measured. They also
enable the business to quantify the prime target customers with which it conducts its
transactions; these are preferably customers whose scores have a certain given value (low
for risk, or high for propensity).
score at M-1 \ score at M-2    1    2    3    . . .
1
2
3
. . .
A ‘transition matrix’ like the one above enables the stability of the score to be monitored.
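Such a matrix can be tallied directly from two monthly score vectors, as in this sketch (assumed inputs: one score value per customer, in the same customer order in both months); a mass concentrated on the diagonal indicates a stable score:

```python
from collections import Counter

def transition_matrix(score_m1, score_m2):
    """Count customers by (score at M-1, score at M-2).
    Returns the matrix (rows = M-1 values, columns = M-2 values)
    and the sorted list of score values."""
    counts = Counter(zip(score_m1, score_m2))
    values = sorted(set(score_m1) | set(score_m2))
    matrix = [[counts.get((i, j), 0) for j in values] for i in values]
    return matrix, values
```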
                           Original        Now             χ² probability
variable 1   category 1    number          number
             category 2    number          number
             . . .
             TOTAL         total number    total number
variable 2   category 1    number          number
             . . .
variable k   . . .         . . .           . . .
It is also worth using charts to monitor the distribution of the variables explaining the
score, comparing the actual distribution to the distribution at the time of creation of the score
(using a χ² or Cramér’s V test, for example: see Appendix A), to discover if the distribution of
a variable has changed significantly, or if the rate of missing values is increasing, which may
mean that the score has to be revised.
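The drift check itself reduces to a goodness-of-fit statistic comparing the current category counts of a variable with those observed when the score was built. The sketch below (plain Python) treats the original distribution as the reference and leaves the comparison with a χ² critical value, at k-1 degrees of freedom for k categories, to the reader:

```python
def chi2_drift(original, current):
    """Pearson chi-squared statistic between two count vectors over the
    same categories: `original` at the creation of the score, `current` now.
    A large value signals that the variable's distribution has shifted."""
    n_orig, n_cur = sum(original), sum(current)
    stat = 0.0
    for o, c in zip(original, current):
        expected = o / n_orig * n_cur   # expected count if nothing changed
        stat += (c - expected) ** 2 / expected
    return stat
```

A rising rate of missing values can be tracked the same way, by treating 'missing' as one more category.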
2.12 Enriching the models
When the results of the use of a first data mining model are measured, the model can be
improved, as we have seen, either by adding independent variables that were not considered
initially, or by using feedback from experience to determine the profiles to be predicted (e.g.
‘purchaser/non-purchaser’) if these results were not available before.
At this point, we can also supplement the results of the step in Section 2.2 if they did not provide enough historic details of the variable to be predicted.
When we have done this, steps 2.3, 2.4, . . . can be repeated to construct a new model which
will perform more satisfactorily than the first one. This iterative process must become
permanent if we are to achieve a continuous improvement of the results. There is one proviso,
however: in the learning phase of a new model, we must never rule out individuals whose
previous levels were considered to be low. Otherwise this model can never be used to question
the validity of previous models, which may sometimes be necessary. For example, if we use
a calculated scoring model to target only the individuals with the best score points, without
ever approaching the others, there is a risk that we will fail to detect a change in behaviour of
some of the initially low-scoring individuals. So it is always desirable to include a small
proportion of individuals with medium or low scores in any targeting which is based on
a model. We will return to this matter in Section 12.4.
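The recommendation above can be made concrete as a small 'exploration' share in the targeting itself, as in this sketch (the 5% default share, the input format and the function name are illustrative assumptions):

```python
import random

def select_targets(customers_by_score, budget, explore_share=0.05, seed=0):
    """customers_by_score: list of (customer id, score) pairs.
    Pick `budget` customers: mostly the best-scored, plus a small random
    sample of medium/low scorers so that a change in their behaviour can
    still be detected and used to challenge the model later."""
    rng = random.Random(seed)
    ranked = sorted(customers_by_score, key=lambda c: -c[1])
    n_explore = int(budget * explore_share)
    top = ranked[:budget - n_explore]          # exploit: best-scored customers
    rest = ranked[budget - n_explore:]         # explore: everyone else
    return top + rng.sample(rest, min(n_explore, len(rest)))
```

The feedback from the explored customers then feeds the learning phase of the next model, as discussed in Section 12.4.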
2.13 Remarks
The procedure described in this chapter can use a number of data mining techniques with
different origins: for example, multiple correspondence analysis for the transformation of
variables, agglomerative hierarchical clustering for segmentation, and logistic regression for
scoring. There are four important points to be considered:
. For step 2.3, it is essential to collect the data to be studied for several months or even
several years, at least where predictive methods such as scoring are concerned, in order
to provide valid analyses.
. Steps 2.3 and 2.4 are much more time-consuming than steps 2.5–2.7, but are
very important.
. The data mining process is highly iterative, with many possible loopbacks to one or
more of steps 2.4–2.7.
. If the population is large, it will be useful to segment it before calculating a score or
carrying out classification, so that we can work on homogeneous groups of individuals,
making processing more reliable. I will discuss this matter further in the chapter on
automatic clustering.
2.14 Life cycle of a model
Data mining models (especially scores) go through a small-scale trial phase, for adjustment
and validation, and to enable their application to be tested. When models are in production,
they must be applied regularly to refreshed data. However, these models must be reviewed
regularly, as described in the chapter on scoring.
2.15 Costs of a pilot project
The factors determining the number of man-days for a data mining project are so numerous
that I can only give very approximate figures here. They will have to be modified according to
the context. These numbers are mainly useful because they demonstrate the proportions of the
different stages. Furthermore, these figures do not relate to major projects, only to small- and
medium-scale ones such as pilot projects intended to initiate and promote the use of data
mining in businesses. When the processes are fully established, the tools are in full production
and the data are thoroughly known, a reduction in the costs shown in the table in Figure 2.3
may be expected. These costs relate to the data mining team only, excluding costs for the other
personnel involved.
No.    Step                                        Costs (in days)            Remarks
                                                   Small-scale  Medium-scale
1      Definition of the target and objectives          4            8        Preliminary analyses and provision of numerical bases for decision making.
2      Listing the existing usable data                 7           10        The cost depends on the organization's knowledge of the data.
3, 4   Collection, formatting and exploration          15           28        This cost is largely determined by the number of variables, their quality, the number of tables accessed, etc.
       of the data
5-8    Developing and validating the models            15           25        Must be multiplied if more than one population segment is to be processed.
       (clustering/scoring)
9      Complementary analyses and delivery of           9           12        For example, setting the scoring thresholds.
       results
10     Documentation and presentations                  5            7
11     Analysis of the first tests conducted            5           10        Statistical study leading to subsequent action.
       TOTAL                                           60          100
Figure 2.3 Costs of a data mining project (bar chart of the shares of the medium-scale total: defining the aims 8%, data inventory 10%, data preparation 20%, creating the analysis base 8%, construction and validation of models 25%, output of models 12%, documentation and presentations 7%, analysis of the first tests 10%).