6 An outline of data mining methods
This chapter introduces the five chapters which form the core technical content of this book.
They are rather more accessible than some specialist books on statistics, data analysis and
neural networks, and I hope that they will be enjoyable to read. However, a reader who is only
interested in the applications of data mining and the procedures for implementing it in
a business may omit these chapters. On the other hand, they are essential for anyone wishing
not only to understand the working of the tools, in order to use them more successfully, but
also to know when and where to use any particular algorithm. In this first technical chapter, I
shall outline the descriptive and predictive methods of data mining and statistics as a whole,
and compare their main features, which will be discussed in detail in the following chapters.
It is important to note that the logarithms used in this book are Napierian (natural)
logarithms in all cases.
6.1 Classification of the methods
As mentioned in Chapter 1, the main data mining and data analysis methods can be divided
into two large families: descriptive methods and predictive methods. Descriptive methods, which reduce, summarize and group the data, involve no dependent variable, that is, no privileged variable. Predictive methods, which explain the data, involve a dependent variable, in other words a variable to be explained, or privileged variable.
A more detailed version of this classification is shown in Table 6.1, where methods
forming part of conventional statistics and data analysis have been given grey backgrounds.
Considering predictive methods only (Table 6.2), we can be more precise by distinguishing the differences relating to the type of variable, namely independent (in the columns) and dependent (in the rows). Clearly, the rows ‘n quantitative (representing different quantities)’ and ‘n qualitative’ are only relevant if the dependent variables are correlated
with each other. Otherwise, it is sufficient to carry out n analyses of the ‘1 quantitative’ or
‘1 qualitative’ type.
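To make the distinction concrete, here is a minimal sketch in Python (an illustration language of my choosing, not the book's): a descriptive treatment summarizes a variable with no privileged variable in sight, while a predictive treatment models a dependent variable y from an independent variable x.

```python
def summarize(xs):
    """Descriptive: reduce a variable to (mean, variance) -- no target."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var

def fit_simple_linear(xs, ys):
    """Predictive: ordinary least squares for y = a*x + b,
    where y is the dependent (privileged) variable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    b = my - a * mx
    return a, b
```

For example, `fit_simple_linear([1, 2, 3, 4], [3, 5, 7, 9])` recovers the slope 2.0 and intercept 1.0 of the line the points lie on, whereas `summarize` uses the x values alone.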
Data Mining and Statistics for Decision Making, First Edition. Stéphane Tufféry.
© 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd. ISBN: 978-0-470-68829-8
Table 6.1 Classification of methods.

Descriptive methods
  Geometrical models
    Factor analysis (projection and visualization in a space of lower dimension):
      - principal component analysis (PCA) (continuous variables)
      - correspondence analysis (CA) (qualitative and binary variables)
      - multiple correspondence analysis (MCA) (qualitative and binary variables)
    Cluster analysis (grouping in homogeneous clusters in the whole space):
      - partitioning methods (moving centres, k-means, dynamic clouds, k-medoids, etc.)
      - hierarchical methods (agglomerative, divisive)
    Cluster analysis + dimension reduction:
      - neural clustering (Kohonen maps)
  Combinatory models
    - clustering by aggregation of similarities (qualitative variables)
  Logical (rule-based) models
    Link detection:
      - search for association rules
      - search for similar sequences

Predictive methods
  Logical (rule-based) models
    Decision trees:
      - decision trees (dependent variable is numeric or qualitative)
  Models based on mathematical functions
    Neural networks:
      - supervised learning networks (perceptron, radial basis function network, etc.)
    Parametric or semi-parametric models:
      - linear regression, ANOVA, MANOVA, ANCOVA, MANCOVA, general linear model (GLM), PLS regression (continuous dependent variable)
      - Fisher's discriminant analysis, logistic regression, PLS logistic regression (qualitative dependent variable)
      - log-linear model (dependent variable = count = number of individuals having a given combination of categories of qualitative variables)
      - generalized linear model (GLM), generalized additive model (GAM) (dependent variable continuous, discrete, count or qualitative)
  Prediction without model
    Probabilistic analysis:
      - k nearest neighbours
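Among the partitioning methods of Table 6.1, the moving-centres (k-means) idea can be sketched in a few lines of Python. This is an illustrative toy, not the book's implementation; the fixed iteration count and the caller-supplied initial centres are simplifying assumptions.

```python
def kmeans(points, centres, n_iter=10):
    """Toy moving-centres / k-means: alternately (1) assign each point to
    its nearest centre, then (2) move each centre to the mean of the
    points assigned to it."""
    for _ in range(n_iter):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centres[j])))
            clusters[nearest].append(p)
        # Recompute each centre; keep the old centre if its cluster is empty.
        centres = [tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl
                   else centres[j]
                   for j, cl in enumerate(clusters)]
    return centres, clusters
```

Note that the number of clusters is fixed by the initial centres, which is precisely the assumption Table 6.3 flags for this family of methods.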
Table 6.2 Predictive methods. Each block below corresponds to a type of dependent variable (the rows of the original table); each entry within a block corresponds to a type of independent variable(s) (the columns).

Dependent: 1 quantitative
  - 1 quantitative (covariable): simple linear regression, spline regression, robust regression, decision trees, MARS, SVR (support vector regression), k nearest neighbours
  - n quantitative (covariables): multiple linear regression, spline regression, robust regression*, PLS regression, decision trees, MARS, neural networks, SVR, k nearest neighbours
  - 1 qualitative (factor): ANOVA, decision trees, MARS, SVR, k nearest neighbours
  - n qualitative (factors): ANOVA, decision trees, MARS, neural networks, SVR, k nearest neighbours
  - Combination: ANCOVA, univariate GLM, decision trees, MARS, neural networks, SVR, k nearest neighbours

Dependent: n quantitative (representing different quantities)
  - 1 quantitative (covariable): multivariate regression, PLS2 regression
  - n quantitative (covariables): multivariate regression, PLS2 regression, neural networks
  - 1 qualitative (factor): MANOVA
  - n qualitative (factors): MANOVA, neural networks
  - Combination: MANCOVA, multivariate GLM, neural networks

Dependent: 1 qualitative, nominal or binary
  - 1 quantitative (covariable): Fisher's discriminant analysis, logistic regression, regularized generalized linear models, decision trees, MARS, SVM, naive Bayesian classifier, k nearest neighbours
  - n quantitative (covariables): Fisher's discriminant analysis, logistic regression, PLS logistic regression, regularized generalized linear models, decision trees, MARS, neural networks, SVM, naive Bayesian classifier, k nearest neighbours
  - 1 qualitative (factor): logistic regression, DISQUAL discriminant analysis, regularized generalized linear models, decision trees, MARS, SVM, naive Bayesian classifier, k nearest neighbours
  - n qualitative (factors): logistic regression, DISQUAL discriminant analysis, regularized generalized linear models, decision trees, MARS, neural networks, SVM, naive Bayesian classifier, k nearest neighbours
  - Combination: logistic regression, regularized generalized linear models, decision trees, MARS, neural networks, SVM, naive Bayesian classifier, k nearest neighbours

Dependent: n qualitative, nominal or binary (representing different characteristics)
  - 1 quantitative (covariable): decision trees, vector generalized linear model, vector generalized additive model
  - n quantitative (covariables): decision trees, vector generalized linear model, vector generalized additive model, neural networks
  - 1 qualitative (factor): decision trees, vector generalized linear model, vector generalized additive model
  - n qualitative (factors): decision trees, vector generalized linear model, vector generalized additive model, neural networks
  - Combination: decision trees, vector generalized linear model, vector generalized additive model, neural networks

Dependent: 1 quantitative, asymmetrical
  - all types of independent variables: gamma and log-normal regressions

Dependent: 1 discrete (counting)
  - all types of independent variables: Poisson regression, log-linear model

Dependent: 1 qualitative, ordinal (at least 3 groups)
  - all types of independent variables: ordinal logistic regression

Dependent: n quantitative or qualitative (representing repeated measurements of the same characteristic)
  - all types of independent variables: generalized linear models with repeated measures

* Also LOESS, ridge, lasso, LARS, and other robust regressions.
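The k-nearest-neighbours method appears in almost every cell of Table 6.2, and Table 6.1 files it under 'prediction without model': nothing is fitted, and prediction reads the training data directly. A possible Python sketch follows (my own illustration; Euclidean distance and a simple majority vote are assumed choices, not prescribed by the book).

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """k nearest neighbours: classify `query` by majority vote among the
    k training points closest in squared Euclidean distance.
    `train` is a list of (features, label) pairs; no model is fitted."""
    neighbours = sorted(train,
                        key=lambda fl: sum((a - b) ** 2
                                           for a, b in zip(fl[0], query)))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```

Because the whole training set is scanned at prediction time, the method trades fitting cost for prediction cost, which is one reason it can be applied to almost any combination of variable types in Table 6.2.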
Table 6.3 Comparison of methods. For each method, three qualities are assessed: (a) absence of assumptions concerning the problem to be solved; (b) exhaustive processing of databases; (c) processing of heterogeneous or incomplete data.

Clustering
  Moving centres method and its variants:
    (a) no (fixed number of initial clusters and centres)
    (b) yes
    (c) numerical variables and variables without missing values
  Hierarchical clustering:
    (a) yes, but the clusters at level n are determined by those at level n - 1
    (b) no (non-linear algorithm); impossible to process more than several thousand observations
    (c) yes (possible to process non-numeric variables with an ad hoc distance)
  Neural clustering (Kohonen):
    (a) no (fixed number of clusters)
    (b) yes
    (c) variables outside [0, 1] must be transformed
  Clustering by aggregation of similarities:
    (a) yes, in principle
    (b) yes, but depends on the implementation
    (c) qualitative variables

Classification and prediction
  Decision trees:
    (a) as for hierarchical clustering (a kind of 'reverse tree')
    (b) no (but the limit is not reached as soon as with hierarchical clustering)
    (c) some trees, such as CHAID, must discretize continuous variables
  Neural networks (perceptrons):
    (a) yes (but the number of hidden neurons must be specified)
    (b) no (no learning on several hundred variables)
    (c) variables outside [0, 1] must be transformed
  Radial basis function networks:
    (a) as for perceptrons
    (b) yes
    (c) variables outside [0, 1] must be transformed
  Discriminant analysis:
    (a) no (assumptions on the conditional distributions Xi | Y)
    (b) yes
    (c) numerical variables and variables without missing values
  Discriminant analysis on factorial coordinates of MCA (DISQUAL method):
    (a) yes (assumptions on the conditional distributions Xi | Y can generally be dispensed with)
    (b) yes
    (c) yes (missing values are treated as entirely separate values)
  Linear regression:
    (a) no (linearity in x of E(Y | X = x), plus assumptions on the residuals)
    (b) yes
    (c) numerical variables and variables without missing values
  Logistic regression, generalized linear model:
    (a) no (linearity in x of g(E(Y | X = x)), plus non-complete separation (see Section 11.8.7))
    (b) yes (provided that a sufficiently powerful machine is used, if the number of observations is very large)
    (c) yes (continuous variables with missing values are divided into classes)

Associations
  Search for association rules:
    (a) yes
    (b) depends on the parameter settings
    (c) yes
  Similar sequences:
    (a) yes
    (b) yes (the same remarks apply)
    (c) yes
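Table 6.3 notes twice that variables outside [0, 1] must be transformed before Kohonen maps or perceptrons can use them. A common way of doing this is min-max scaling; the short Python sketch below is my own illustration (mapping a constant variable to 0.0 is an arbitrary convention, not something the book specifies).

```python
def minmax_scale(xs):
    """Rescale a numeric variable linearly into the interval [0, 1]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        # A constant variable carries no information; map it to 0.0.
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]
```

For example, `minmax_scale([2, 4, 6])` gives `[0.0, 0.5, 1.0]`; the minimum and maximum of each variable are taken from the training data.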
As for the descriptive methods of clustering, these are detailed in a summary table at the
end of Chapter 9.
6.2 Comparison of the methods
Table 6.3 summarizes the advantages and disadvantages of the various data mining methods in relation to three essential qualities that are expected of them:

- the absence of restrictive assumptions concerning the problem to be solved;
- the capacity to process the data exhaustively, within a reasonable time in all cases;
- the ability to handle incomplete and heterogeneous data, which may or may not be numerical (in the case of independent variables, for the classification and prediction methods).