6 An outline of data mining methods
This chapter introduces the five chapters which form the core technical content of this book.
They are rather more accessible than some specialist books on statistics, data analysis and
neural networks, and I hope that they will be enjoyable to read. However, a reader who is only
interested in the applications of data mining and the procedures for implementing it in
a business may omit these chapters. On the other hand, they are essential for anyone wishing
not only to understand the working of the tools, in order to use them more successfully, but
also to know when and where to use any particular algorithm. In this first technical chapter, I
shall outline the descriptive and predictive methods of data mining and statistics as a whole,
and compare their main features, which will be discussed in detail in the following chapters.
It is important to note that the logarithms used in this book are Napierian (natural)
logarithms in all cases.
6.1 Classification of the methods
As mentioned in Chapter 1, the main data mining and data analysis methods can be divided
into two large families: descriptive methods and predictive methods. Descriptive methods, which reduce, summarize and group the data, involve no dependent variable, that is, no privileged variable. Predictive methods, which explain the data, involve a dependent variable, in other words a variable to be explained, or privileged variable.
A more detailed version of this classification is shown in Table 6.1, where methods
forming part of conventional statistics and data analysis have been given grey backgrounds.
Considering predictive methods only (Table 6.2), we can be more precise by distinguishing the differences relating to the type of variable, namely independent (in the columns) and dependent (in the rows). Clearly, the rows ‘n quantitative (representing different quantities)’ and ‘n qualitative’ are only relevant if the dependent variables are correlated
with each other. Otherwise, it is sufficient to carry out n analyses of the ‘1 quantitative’ or
‘1 qualitative’ type.
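To make the distinction concrete, here is a minimal sketch in Python (an illustration language of my choosing, not the book's): a descriptive treatment summarizes a variable with no privileged variable in sight, while a predictive treatment models a dependent variable y from an independent variable x.

```python
def summarize(xs):
    """Descriptive: reduce a variable to (mean, variance) -- no target."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var

def fit_simple_linear(xs, ys):
    """Predictive: ordinary least squares for y = a*x + b,
    where y is the dependent (privileged) variable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    b = my - a * mx
    return a, b
```

For example, `fit_simple_linear([1, 2, 3, 4], [3, 5, 7, 9])` recovers the slope 2.0 and intercept 1.0 of the line the points lie on, whereas `summarize` uses the x values alone.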
Data Mining and Statistics for Decision Making, First Edition. Stéphane Tufféry.
© 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd. ISBN: 978-0-470-68829-8
Table 6.1 Classification of methods.

Descriptive methods
  Geometrical models
    Factor analysis (projection and visualization in a space of lower dimension):
      - principal component analysis (PCA) (continuous variables)
      - correspondence analysis (CA) (qualitative and binary variables)
      - multiple correspondence analysis (MCA) (qualitative and binary variables)
    Cluster analysis (grouping in homogeneous clusters in the whole space):
      - partitioning methods (moving centres, k-means, dynamic clouds, k-medoids, etc.)
      - hierarchical methods (agglomerative, divisive)
    Cluster analysis + dimension reduction:
      - neural clustering (Kohonen maps)
  Combinatory models
    - clustering by aggregation of similarities (qualitative variables)
  Logical (rule-based) models
    Link detection:
      - search for association rules
      - search for similar sequences

Predictive methods
  Logical (rule-based) models
    Decision trees:
      - decision trees (dependent variable is numeric or qualitative)
  Models based on mathematical functions
    Neural networks:
      - supervised learning networks (perceptron, radial basis function network, etc.)
    Parametric or semi-parametric models:
      - linear regression, ANOVA, MANOVA, ANCOVA, MANCOVA, general linear model (GLM), PLS regression (continuous dependent variable)
      - Fisher's discriminant analysis, logistic regression, PLS logistic regression (qualitative dependent variable)
      - log-linear model (dependent variable = count = number of individuals having a given combination of categories of qualitative variables)
      - generalized linear model (GLM), generalized additive model (GAM) (dependent variable continuous, discrete, count or qualitative)
  Prediction without model
    Probabilistic analysis:
      - k nearest neighbours
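Among the partitioning methods of Table 6.1, the moving-centres (k-means) idea can be sketched in a few lines of Python. This is an illustrative toy, not the book's implementation; the fixed iteration count and the caller-supplied initial centres are simplifying assumptions.

```python
def kmeans(points, centres, n_iter=10):
    """Toy moving-centres / k-means: alternately (1) assign each point to
    its nearest centre, then (2) move each centre to the mean of the
    points assigned to it."""
    for _ in range(n_iter):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centres[j])))
            clusters[nearest].append(p)
        # Recompute each centre; keep the old centre if its cluster is empty.
        centres = [tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl
                   else centres[j]
                   for j, cl in enumerate(clusters)]
    return centres, clusters
```

Note that the number of clusters is fixed by the initial centres, which is precisely the assumption Table 6.3 flags for this family of methods.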
Table 6.2 Predictive methods. Each block below corresponds to a type of dependent variable (the rows of the original table); each entry within a block corresponds to a type of independent variable(s) (the columns).

Dependent: 1 quantitative
  - 1 quantitative (covariable): simple linear regression, spline regression, robust regression, decision trees, MARS, SVR (support vector regression), k nearest neighbours
  - n quantitative (covariables): multiple linear regression, spline regression, robust regression*, PLS regression, decision trees, MARS, neural networks, SVR, k nearest neighbours
  - 1 qualitative (factor): ANOVA, decision trees, MARS, SVR, k nearest neighbours
  - n qualitative (factors): ANOVA, decision trees, MARS, neural networks, SVR, k nearest neighbours
  - Combination: ANCOVA, univariate GLM, decision trees, MARS, neural networks, SVR, k nearest neighbours

Dependent: n quantitative (representing different quantities)
  - 1 quantitative (covariable): multivariate regression, PLS2 regression
  - n quantitative (covariables): multivariate regression, PLS2 regression, neural networks
  - 1 qualitative (factor): MANOVA
  - n qualitative (factors): MANOVA, neural networks
  - Combination: MANCOVA, multivariate GLM, neural networks

Dependent: 1 qualitative, nominal or binary
  - 1 quantitative (covariable): Fisher's discriminant analysis, logistic regression, regularized generalized linear models, decision trees, MARS, SVM, naive Bayesian classifier, k nearest neighbours
  - n quantitative (covariables): Fisher's discriminant analysis, logistic regression, PLS logistic regression, regularized generalized linear models, decision trees, MARS, neural networks, SVM, naive Bayesian classifier, k nearest neighbours
  - 1 qualitative (factor): logistic regression, DISQUAL discriminant analysis, regularized generalized linear models, decision trees, MARS, SVM, naive Bayesian classifier, k nearest neighbours
  - n qualitative (factors): logistic regression, DISQUAL discriminant analysis, regularized generalized linear models, decision trees, MARS, neural networks, SVM, naive Bayesian classifier, k nearest neighbours
  - Combination: logistic regression, regularized generalized linear models, decision trees, MARS, neural networks, SVM, naive Bayesian classifier, k nearest neighbours

Dependent: n qualitative, nominal or binary (representing different characteristics)
  - 1 quantitative (covariable): decision trees, vector generalized linear model, vector generalized additive model
  - n quantitative (covariables): decision trees, vector generalized linear model, vector generalized additive model, neural networks
  - 1 qualitative (factor): decision trees, vector generalized linear model, vector generalized additive model
  - n qualitative (factors): decision trees, vector generalized linear model, vector generalized additive model, neural networks
  - Combination: decision trees, vector generalized linear model, vector generalized additive model, neural networks

Dependent: 1 quantitative, asymmetrical
  - all types of independent variables: gamma and log-normal regressions

Dependent: 1 discrete (counting)
  - all types of independent variables: Poisson regression, log-linear model

Dependent: 1 qualitative, ordinal (at least 3 groups)
  - all types of independent variables: ordinal logistic regression

Dependent: n quantitative or qualitative (representing repeated measurements of the same characteristic)
  - all types of independent variables: generalized linear models with repeated measures

* Also LOESS, ridge, lasso, LARS, and other robust regressions.
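The k-nearest-neighbours method appears in almost every cell of Table 6.2, and Table 6.1 files it under 'prediction without model': nothing is fitted, and prediction reads the training data directly. A possible Python sketch follows (my own illustration; Euclidean distance and a simple majority vote are assumed choices, not prescribed by the book).

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """k nearest neighbours: classify `query` by majority vote among the
    k training points closest in squared Euclidean distance.
    `train` is a list of (features, label) pairs; no model is fitted."""
    neighbours = sorted(train,
                        key=lambda fl: sum((a - b) ** 2
                                           for a, b in zip(fl[0], query)))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```

Because the whole training set is scanned at prediction time, the method trades fitting cost for prediction cost, which is one reason it can be applied to almost any combination of variable types in Table 6.2.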
Table 6.3 Comparison of methods. For each method, three qualities are assessed: (a) absence of assumptions concerning the problem to be solved; (b) exhaustive processing of databases; (c) processing of heterogeneous or incomplete data.

Clustering
  Moving centres method and its variants:
    (a) no (fixed number of initial clusters and centres)
    (b) yes
    (c) numerical variables and variables without missing values
  Hierarchical clustering:
    (a) yes, but the clusters at level n are determined by those at level n - 1
    (b) no (non-linear algorithm); impossible to process more than several thousand observations
    (c) yes (possible to process non-numeric variables with an ad hoc distance)
  Neural clustering (Kohonen):
    (a) no (fixed number of clusters)
    (b) yes
    (c) variables outside [0, 1] must be transformed
  Clustering by aggregation of similarities:
    (a) yes, in principle
    (b) yes, but depends on the implementation
    (c) qualitative variables

Classification and prediction
  Decision trees:
    (a) as for hierarchical clustering (a kind of 'reverse tree')
    (b) no (but the limit is not reached as soon as with hierarchical clustering)
    (c) some trees, such as CHAID, must discretize continuous variables
  Neural networks (perceptrons):
    (a) yes (but the number of hidden neurons must be specified)
    (b) no (no learning on several hundred variables)
    (c) variables outside [0, 1] must be transformed
  Radial basis function networks:
    (a) as for perceptrons
    (b) yes
    (c) variables outside [0, 1] must be transformed
  Discriminant analysis:
    (a) no (assumptions on the conditional distributions Xi | Y)
    (b) yes
    (c) numerical variables and variables without missing values
  Discriminant analysis on factorial coordinates of MCA (DISQUAL method):
    (a) yes (assumptions on the conditional distributions Xi | Y can generally be dispensed with)
    (b) yes
    (c) yes (missing values are treated as entirely separate values)
  Linear regression:
    (a) no (linearity in x of E(Y | X = x), plus assumptions on the residuals)
    (b) yes
    (c) numerical variables and variables without missing values
  Logistic regression, generalized linear model:
    (a) no (linearity in x of g(E(Y | X = x)), plus non-complete separation (see Section 11.8.7))
    (b) yes (provided that a sufficiently powerful machine is used, if the number of observations is very large)
    (c) yes (continuous variables with missing values are divided into classes)

Associations
  Search for association rules:
    (a) yes
    (b) depends on the parameter settings
    (c) yes
  Similar sequences:
    (a) yes
    (b) yes (the same remarks apply)
    (c) yes
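Table 6.3 notes twice that variables outside [0, 1] must be transformed before Kohonen maps or perceptrons can use them. A common way of doing this is min-max scaling; the short Python sketch below is my own illustration (mapping a constant variable to 0.0 is an arbitrary convention, not something the book specifies).

```python
def minmax_scale(xs):
    """Rescale a numeric variable linearly into the interval [0, 1]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        # A constant variable carries no information; map it to 0.0.
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]
```

For example, `minmax_scale([2, 4, 6])` gives `[0.0, 0.5, 1.0]`; the minimum and maximum of each variable are taken from the training data.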
As for the descriptive methods of clustering, these are detailed in a summary table at the
end of Chapter 9.
6.2 Comparison of the methods
Table 6.3 summarizes the advantages and disadvantages of the various data mining methods in relation to three essential qualities that are expected of them:

- the absence of restrictive assumptions concerning the problem to be solved;
- the capacity to process the data exhaustively, within a reasonable time in all cases;
- the ability to handle incomplete and heterogeneous data, which may or may not be numerical (in the case of independent variables, for the classification and prediction methods).