
THEORETICAL ADVANCES

Hierarchical clustering of subpopulations with a dissimilarity based on the likelihood ratio statistic: application to clustering massive data sets

Antonio Ciampi · Yves Lechevallier · Manuel Castejón Limas · Ana González Marcos

Received: 27 June 2006 / Accepted: 9 April 2007 / Published online: 30 October 2007
© Springer-Verlag London Limited 2007

A. Ciampi, Department of Epidemiology and Biostatistics, McGill University, Montreal, P.Q., Canada; e-mail: [email protected]
Y. Lechevallier, INRIA—Rocquencourt, 87153 Le Chesnay Cedex, France
M. C. Limas and A. G. Marcos, Department of Mechanical, Informatical and Aerospace Engineering, Universidad de León, 24007 León, Spain

Abstract The problem of clustering subpopulations on the basis of samples is considered within a statistical framework: a distribution for the variables is assumed for each subpopulation and the dissimilarity between any two subpopulations is defined as the likelihood ratio statistic which compares the hypothesis that the two subpopulations differ in the parameter of their distributions to the hypothesis that they do not. A general algorithm for the construction of a hierarchical classification is described which has the important property of not having inversions in the dendrogram. The essential elements of the algorithm are specified for the case of well-known distributions (normal, multinomial and Poisson) and an outline of the general parametric case is also discussed. Several applications are discussed, the main one being a novel approach to dealing with massive data in the context of a two-step approach. After clustering the data into a reasonable number of 'bins' by a fast algorithm such as k-Means, we apply a version of our algorithm to the resulting bins. Multivariate normality for the means calculated on each bin is assumed: this is justified by the central limit theorem and the assumption that each bin contains a large number of units, an assumption generally justified when dealing with truly massive data such as currently found in modern data analysis. However, no assumption is made about the data generating distribution.

Keywords Cluster analysis · Binned data · Dissimilarity · Likelihood ratio statistic · Dendrogram · Large data sets

1 Introduction

Both supervised [1–4] and unsupervised [5–9] classification are active areas of research. This paper focuses on the latter, in particular on the development of a hierarchical clustering algorithm of special interest due to the various problems that it addresses.

Hierarchical clustering algorithms yield useful insights into the structure of a data set by providing a global representation of the pairwise dissimilarities amongst observational units. Classically, a unit or singleton is represented by the recorded values of a set of features of special interest, hence by a vector of numbers or characters. The dissimilarity between two units is chosen to be a function of the vectors representing the two units.

In modern research, however, the data analyst often encounters situations in which observational units are composite and of varying sizes, i.e., contain several singletons. For example, suppose we have data on families and children within families; according to the specific question at hand, we may wish to consider the family rather than the child as the observational unit. As another example, consider longitudinal data arising from measuring a quantity at several points in time on a number of subjects. The times of the measurements need not be the same for different subjects; also, the number of measurements may differ from one subject to another. Here, we may focus on the individual at a particular point in time, or, perhaps more naturally, on the individual's time curve. Therefore, it is useful to study clustering of observational units that consist of groups of simpler units. We will limit ourselves to the case in which each (composite) observational unit consists of a random sample from a subpopulation. Moreover, we assume that a unique probability distribution is associated with each subpopulation, which generates the feature vectors of its simple units. Thus we are led to define dissimilarities between composite units in terms of dissimilarities between probability distributions.

There is nothing profoundly new in the idea of clustering subpopulations. The idea has been proposed (or rediscovered) more or less directly in a variety of areas. One example is the problem of multiple comparisons in one-way ANOVA, which is addressed by constructing a dendrogram (hierarchical classification tree) in which the initial units are the sets of subjects at each level of the class variable [10–12]. Another important area in which the general idea is currently used is model-based clustering [13–17]. Here, one constructs a hierarchy of subpopulations based on the likelihood ratio dissimilarity assuming that the data are generated by a mixture of distributions (usually multivariate normal). Interestingly, this construction, known as hierarchical model-based clustering, is justified as a way to solve the classification maximum likelihood problem, that is the maximization not of a mixture likelihood, but of a likelihood in which one of the parameters is the label of the distribution of each subject. Indeed, in this context, the process of constructing such a hierarchy is seen as a preliminary exploration of the data, which guides the eventual maximization of the mixture likelihood by a version of the EM algorithm [14, 18, 19]. A third modern problem in which clustering of subpopulations plays a role is clustering of massive data sets [20, 21]. Indeed, one of the most common approaches to this problem consists of preprocessing the data so as to reduce the number of initial simple units to a manageable number of composite units (preprocessing or preclustering). Then, a hierarchical model-based clustering algorithm (or other forms of clustering) can be applied to the composite units.

In this paper, we attempt to bring a unifying perspective to the study of clustering algorithms for composite observational units, which we refer to as atoms, and treat as samples from subpopulations of a given population. Our central tool is a well-known dissimilarity between the distributions of two subpopulations defined as a likelihood ratio statistic (LRS). Based on this dissimilarity we outline a general approach to hierarchical agglomerative clustering, and we show that a number of interesting problems can be addressed by specifying the general algorithm.

The general framework is outlined in the first part of Sect. 2, which defines the LRS dissimilarity and outlines a general algorithm of hierarchical agglomerative clustering (HAC). A HAC algorithm is essentially specified by a dissimilarity measure between pairs of units and an agglomeration rule which defines the dissimilarity between two aggregates of the original subpopulations. It proceeds by successively agglomerating, starting from the atoms, the pair of aggregates with the smallest dissimilarity. Perhaps the most natural way to extend the dissimilarity from populations to aggregates is to define it again as a LRS. However, as we shall see, classical cluster analysis suggests a variety of other possibilities. In spite of the limited novelty of this first part, we thought it useful to point out some of the properties of the likelihood-based dissimilarity and of the associated general approach to hierarchical clustering of subpopulations.

In Sects. 2.1–2.4 we develop dissimilarity measures for clustering subpopulations with some familiar distributions of the exponential family (multivariate normal, Poisson, multinomial). Developing these formulas is, again, little more than an exercise in mathematical statistics; however, the use we make of them in later sections shows that they can help solve interesting problems such as clustering of rows or columns of a contingency table. A proposal, based on asymptotic large sample theory, for adapting the general algorithm to any distribution satisfying appropriate regularity conditions is outlined in Sect. 2.5; we believe this to be new. In Sect. 2.6, which also contains new material, we discuss the relationship of our framework to the simultaneous testing procedure (STP) approach to multiple comparisons, as developed by Calinski and Corsten [10]; see also [11, 12]. We conclude by outlining a general proposal for generating simultaneous test procedures (STPs) within a broad class of problems, such as MANOVA and the generalization of ANOVA to non-normal distributions.

Section 3 proposes a new algorithm for handling analysis of massive data sets. As discussed at the beginning of the section, our algorithm bears strong similarity to others [14, 17]. However, unlike others, it does not assume normality (conditional on knowing the class) for the original data. On the other hand, our algorithm starts from preprocessed data on composite units, assuming that they satisfy some regularity assumptions that justify the application of the central limit theorem.

The next section (Sect. 4) is devoted to a limited evaluation of the proposed methodology. This is done using data artificially generated from statistical models with known clustering structures. In particular: in Sect. 4.1, we compared our proposal for handling large data sets with an approach considered state-of-the-art [14, 17]; in Sect. 4.2 we retrieved several aspects of a large multilevel data set, which was artificially generated to imitate the structure of the real data set analyzed in Sect. 5, and performed a limited sensitivity analysis to assess the dependence of the results on the degree of separation of the clusters and on the number of 'bins' used in the preprocessing; in Sect. 4.3, we demonstrated how, by changing the agglomeration rule of our basic algorithm, we can retrieve from a large data set a notoriously difficult-to-retrieve clustering structure.

The real data set which has motivated this work is analyzed in Sect. 5 as an example of the potential of our methodology.

A brief discussion summarizes the work in Sect. 6 and, finally, the proofs of the propositions and formulas developed in this work are given in Sect. 7.

2 An algorithm for hierarchical clustering of subpopulations

This section is devoted to the general framework (2.1); to the development of explicit formulas for the LRS dissimilarity in some special cases involving well-known distributions (2.2–2.4); to a general proposal to handle more complex distributions (2.5); and, finally, to a discussion of the relationship between our framework and STPs, concluding with the formulation of a general proposal for developing STPs (2.6).

We suppose that we have data from M subpopulations of a given population: P1, P2, ..., PM. Initially we assume that each data set Dr, r = 1, ..., M, is an independent random sample from the corresponding subpopulation and that the distribution of each population is known except for an unknown parameter. Thus each subpopulation is represented by its distribution: $f_1(x \mid \theta_1), f_2(x \mid \theta_2), \ldots, f_M(x \mid \theta_M)$.

2.1 The general algorithm

Suppose that we wish to cluster the subpopulations using the samples to estimate the unknown parameters. We define here a dissimilarity from which we will develop a general HAC algorithm. We want to express the idea that two subpopulations are similar to the extent that they cannot be easily distinguished statistically. Thus it seems reasonable to define the dissimilarity as the LRS to test the hypothesis H0 that two subpopulations have the same parameter against the hypothesis H1 that the two parameters are different:

$$d(P_i, P_j) = 2 \ln \frac{\widehat{L}_1}{\widehat{L}_0} \qquad (1)$$

As in all hierarchical clustering algorithms, we begin by joining the two subpopulations with minimum dissimilarity, thus representing the merged pair by a unique parameter, estimated from the joint samples. Then we need to recalculate the dissimilarity between the merged pair and all other subpopulations. This necessitates the specification of an agglomeration rule. In this paper we use, unless otherwise specified, the model-based agglomeration rule, which defines the dissimilarity between two aggregates Ai and Aj as a LRS:

$$d(A_i, A_j) = 2 \ln \frac{\widehat{L}_1}{\widehat{L}_0} \qquad (2)$$

The above definitions and description clearly constitute a specification of the typical HAC algorithm (see Algorithm 1), with dissimilarity given by Eq. 1 and agglomeration rule given by Eq. 2 or variants thereof.

To go from the general algorithm to a specific one, we have to develop detailed formulas for calculating the dissimilarity between atoms and the dissimilarity between aggregates (agglomeration rule).

As stated in the introduction, HAC algorithms provide a global representation of the pairwise dissimilarities amongst atoms. A tree or dendrogram is a powerful representation of the results of an HAC algorithm. The aggregate obtained at each step is represented by a node of the dendrogram drawn at a height proportional to an appropriately defined index. For the model-based agglomeration rule (Eq. 2) the index is calculated as:

$$f(P_i) = 0 \qquad (3)$$

$$f(A_i \cup A_j) = f(A_i) + f(A_j) + d(A_i, A_j) \qquad (4)$$

It has the following important property (the proof of this and the other propositions of this paper is given in Sect. 7):

Proposition 1

$$f(A_i \cup A_j) \geq \max\{f(A_i), f(A_j)\} \qquad (5)$$

A corollary of this proposition is that the hierarchy has no inversion.

Algorithm 1 General HAC algorithm
1: Specify the dissimilarity and the agglomeration rule.
2: Start from {P1, P2, ..., PM}.
3: Calculate d(Ai, Aj) for all pairs of aggregates and merge the pair with the smallest dissimilarity.
4: Substitute the two merged subpopulations by their union.
5: Repeat from step 3 until all subpopulations have been merged.

Remark 1 As already mentioned, there are alternative definitions of the agglomeration rule. We could define the dissimilarity of two aggregates Ai and Aj as the smallest dissimilarity between their members, i.e. the smallest of the LRS's comparing one subpopulation from Ai with one subpopulation from Aj. This would be single-link clustering based on the original dissimilarity matrix. Clearly, one can also develop complete-link and average-link versions of the general algorithm. Indeed, one could use any clustering algorithm based on a dissimilarity matrix [22]. The interest of this depends on the problem at hand. Here we wish only to point out that each method of clustering constitutes an additional tool for exploring a data set of complex structure. We will give one example of a situation in which we have used single-link clustering of subpopulations (see Sect. 4.3).
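To make the structure of Algorithm 1 concrete, the following minimal Python sketch implements the generic agglomeration loop. The functions lrs_dissimilarity and merge are placeholders, to be supplied with the model-specific formulas of Eqs. 1–2 and Sects. 2.2–2.4; they are assumptions of this sketch, not part of the original algorithm specification.

import itertools

def hac_subpopulations(atoms, lrs_dissimilarity, merge):
    # Agglomerate `atoms` (a list of samples or aggregates); return the merge history.
    active = list(atoms)            # current aggregates
    heights = [0.0] * len(active)   # index f of Eq. 3
    history = []
    while len(active) > 1:
        # find the pair of aggregates with the smallest LRS dissimilarity
        (i, j), d = min(
            ((pair, lrs_dissimilarity(active[pair[0]], active[pair[1]]))
             for pair in itertools.combinations(range(len(active)), 2)),
            key=lambda t: t[1])
        # index of the new node, Eq. 4: f(Ai u Aj) = f(Ai) + f(Aj) + d(Ai, Aj)
        h = heights[i] + heights[j] + d
        history.append((i, j, h))
        merged = merge(active[i], active[j])
        # substitute the two merged aggregates by their union
        active = [a for k, a in enumerate(active) if k not in (i, j)] + [merged]
        heights = [v for k, v in enumerate(heights) if k not in (i, j)] + [h]
    return history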

Remark 2 Under regularity conditions and for large samples, it is well known that the Wald and score statistics are asymptotically equivalent to the LRS. An alternative definition of the dissimilarity of Eq. 1 could then be one of these statistics.

When developing a specific algorithm, the main task is to calculate the maximum likelihood estimates of the parameters and the corresponding maximized likelihoods under specific hypotheses. This can be done in closed form for some important cases, in particular when the underlying distribution belongs to the exponential family [23–25]. We do not give here the general formulas, but limit ourselves to the most frequently used distributions of this family: the normal, the Poisson and the multinomial distributions.

2.2 Clustering multivariate normal subpopulations

We will suppose here that $P_r$ has distribution $N(\mu_r, \Sigma_r)$. From standard normal theory we can write:

$$d_N(A_i, A_j) = D_{A_i \cup A_j} - D_{A_i} - D_{A_j} \qquad (6)$$

where

$$D_{A_i} = \sum_{k \in A_i} (x_k - \widehat{\mu}_{A_i})^T \widehat{\Sigma}_{A_i}^{-1} (x_k - \widehat{\mu}_{A_i}) + n_{A_i} \ln |\widehat{\Sigma}_{A_i}| \qquad (7)$$

$$D_{A_j} = \sum_{k \in A_j} (x_k - \widehat{\mu}_{A_j})^T \widehat{\Sigma}_{A_j}^{-1} (x_k - \widehat{\mu}_{A_j}) + n_{A_j} \ln |\widehat{\Sigma}_{A_j}| \qquad (8)$$

$$D_{A_i \cup A_j} = \sum_{k \in A_i \cup A_j} (x_k - \widehat{\mu}_{A_i \cup A_j})^T \widehat{\Sigma}_{A_i \cup A_j}^{-1} (x_k - \widehat{\mu}_{A_i \cup A_j}) + (n_{A_i} + n_{A_j}) \ln |\widehat{\Sigma}_{A_i \cup A_j}| \qquad (9)$$

Here $\widehat{\mu}_{A_r}, \widehat{\Sigma}_{A_r}$ $(r = i, j)$ and $\widehat{\mu}_{A_i \cup A_j}, \widehat{\Sigma}_{A_i \cup A_j}$ are the maximum likelihood estimators of $\mu_{A_r}, \Sigma_{A_r}$ $(r = i, j)$ and $\mu_{A_i \cup A_j}, \Sigma_{A_i \cup A_j}$, respectively; and $n_{A_r}$ is the cardinality of $A_r$. This is sufficient to completely define the algorithm for our case. Notice that [16] has developed efficient approaches to the computation of these formulas in the context of maximum likelihood hierarchical clustering. Also, the clustering algorithm of Bradley and Fayyad [26] and Bradley et al. [27], though not hierarchical, uses similar formulas to achieve its efficiency.
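As a concrete illustration of Eqs. 6–9, the following minimal Python/NumPy sketch computes the normal LRS dissimilarity between two samples; the function names are ours, and the sketch assumes the groups are large enough for the ML covariance estimates to be non-singular.

import numpy as np

def deviance(x):
    # D_A of Eqs. 7-9 for the rows of x: Mahalanobis terms plus n * log|Sigma_hat|
    n = x.shape[0]
    mu = x.mean(axis=0)
    sigma = np.cov(x, rowvar=False, bias=True)   # ML estimate (divides by n)
    diff = x - mu
    quad = np.einsum('ij,jk,ik->', diff, np.linalg.inv(sigma), diff)
    return quad + n * np.log(np.linalg.det(sigma))

def d_normal(xi, xj):
    # LRS dissimilarity of Eq. 6 between samples xi (n_i x p) and xj (n_j x p)
    return deviance(np.vstack([xi, xj])) - deviance(xi) - deviance(xj)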

2.3 Clustering Poisson subpopulations

We consider now subpopulations $P_r$, $r = 1, \ldots, M$, with Poisson distribution $\mathcal{P}(\mu_r, N_r)$ where $N_r$ is known for each $P_r$. Direct likelihood maximization yields:

$$d_P(A_i, A_j) = n_i \ln \widehat{\mu}_i + n_j \ln \widehat{\mu}_j - (n_i + n_j) \ln \widehat{\mu}_{i \cup j} \qquad (10)$$

where

$$\widehat{\mu}_i = \frac{n_i}{N_i} \qquad (11)$$

$$\widehat{\mu}_j = \frac{n_j}{N_j} \qquad (12)$$

$$\widehat{\mu}_{i \cup j} = \frac{n_i + n_j}{N_i + N_j} \qquad (13)$$

and $n_r$ represents the sampled value for the r-th cell $(r = i, j)$.
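A minimal Python sketch of Eqs. 10–13 (the helper is ours, with the usual convention 0 · log 0 = 0 for empty cells):

import numpy as np

def d_poisson(n_i, N_i, n_j, N_j):
    # Dissimilarity of Eq. 10 between two Poisson cells with counts n and known totals N
    mu_i, mu_j = n_i / N_i, n_j / N_j                # Eqs. 11-12
    mu_ij = (n_i + n_j) / (N_i + N_j)                # Eq. 13
    xlogy = lambda x, y: x * np.log(y) if x > 0 else 0.0
    return xlogy(n_i, mu_i) + xlogy(n_j, mu_j) - xlogy(n_i + n_j, mu_ij)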

2.4 Clustering multinomial subpopulations

Assuming the multinomial distribution $\mathcal{M}(p_r, N_r)$, where $N_r$ is known for each $P_r$, we obtain:

$$d_M(A_i, A_j) = \sum_{l=1}^{p} \left[ n_{il} \ln \widehat{p}_{il} + n_{jl} \ln \widehat{p}_{jl} - (n_{il} + n_{jl}) \ln \widehat{p}_{i \cup j, l} \right] \qquad (14)$$

where $n_{rl}$ represents the l-th component of the r-th sample $(r = i, j)$, and $\widehat{p}_{i \cup j, l} = \frac{n_{il} + n_{jl}}{N_i + N_j}$, $\widehat{p}_{gl} = \frac{n_{gl}}{N_g}$ are the components of the maximum likelihood estimators of the $p_r$ parameters $(r = i, j, i \cup j)$. This case is interesting, for instance, when we wish to cluster rows or columns of a contingency table, as we will illustrate in Sects. 4.2 and 5.
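For illustration, a small Python sketch of Eq. 14, treating each subpopulation as a row of counts of a contingency table (the function name is ours):

import numpy as np

def d_multinomial(n_i, n_j):
    # Dissimilarity of Eq. 14 between two rows of counts n_i, n_j (length-p arrays)
    n_i, n_j = np.asarray(n_i, float), np.asarray(n_j, float)
    p_i, p_j = n_i / n_i.sum(), n_j / n_j.sum()
    p_ij = (n_i + n_j) / (n_i.sum() + n_j.sum())
    xlogy = lambda n, p: np.where(n > 0, n * np.log(np.where(n > 0, p, 1.0)), 0.0)
    return float(np.sum(xlogy(n_i, p_i) + xlogy(n_j, p_j) - xlogy(n_i + n_j, p_ij)))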

2.5 Other distributions

In most cases the distributions representing the subpopulations do not permit closed form estimates, and therefore the calculation of the dissimilarity may be quite a heavy task. One general simplification may be introduced by relying on large sample theory. Suppose we first estimate the parameters for each of the initial subpopulations. Then we can use the fact that these estimates are asymptotically normal with mean equal to the parameter itself and variance–covariance matrix given by the inverse of the information matrix. The latter is in general a function of the unknown parameter. However, we can, as a first approximation, substitute the maximum likelihood estimates of the parameters in the Hessian of the likelihood (observed information matrix) and proceed as though the actual variances were known. Then at each stage we only have to estimate the mean of the distribution of the parameters; these estimates can also be obtained in an approximate way as functions of the estimates of the initial population parameters and their variance–covariance matrices, treated as known. We do not give details here, but point out that there are similarities with what we will develop in the next section.

2.6 Relationship with STPs and a proposal for developing STPs

As mentioned in the introduction, the problem of multiple comparisons in classical ANOVA was one of the motivations for considering clustering subpopulations, in particular to avoid logical inconsistencies. In ANOVA one is interested in studying how a variable of interest y (response to treatment) varies across a number of subpopulations (treatment groups). One starts by testing, at a fixed level of Type I error α, the overall homogeneity hypothesis that the expectation of the variable of interest does not vary across a given set of subpopulations. But in most practical cases, one also wants to test whether various subgroups of treatments can be considered homogeneous at the same level α, and this is notoriously difficult; furthermore, there are several ways of formulating the problem, depending on the type of questions that the investigator poses. The problem was formulated as follows by Calinski and Corsten [10]: ''the question is raised how to produce not an enumeration of all possible homogeneous subsets, but a partition of the sample means into non-overlapping subsets such that each subset (or subset of that) may be considered internally homogeneous. For obtaining such a partition, clustering methods may be invoked''. This formulation of the problem of multiple comparisons in ANOVA leads directly to the concept of STP. An STP is a procedure for deciding whether to simultaneously accept or reject all the hypotheses of a given family. It should possess the following properties:

(a) For a fixed α, the STP rejects at least one true implied null hypothesis with probability at most α, and exactly α if the overall null hypothesis is true.
(b) Acceptance of any implied null hypothesis implies acceptance of all hypotheses implied by it (monotonicity).
(c) Homogeneity of any group will be rejected only if the homogeneity of at least a pair of its members is rejected (strict monotonicity).

A clustering procedure is clearly a good start for an STP. For instance, cutting a dendrogram at a certain height h0 (a specified value of the hierarchy index) can be seen as declaring all groupings which result from the cut as 'internally homogeneous' for the level of detail h0. Such a procedure enjoys properties (b) and (c) above. To turn it into an STP, we would have to be able to attach to h0 a statement about the probability of rejecting at least one true homogeneity hypothesis, as in property (a) above. Calinski and Corsten [10] achieved this by basing the construction of their classification schemes on two criteria of homogeneity: the studentized range and the F-test statistic. Since they assumed equal group sizes as well as homoscedasticity, the distributions of both these criteria turned out to be well known. They recognized that when using the studentized range, their construction was an agglomerative hierarchical algorithm with dissimilarity given by the ordinary Euclidean distance and complete link as agglomeration rule. Clearly, their distance is equivalent to our LRS dissimilarity, but indeed they used an alternative definition of the agglomeration rule (see Remark 1). The key point in their STP is to cut the tree at the height corresponding to the α-critical level for testing the overall homogeneity hypothesis, and to declare all groups (and all groups of groups) generated by the cut as homogeneous at the level α. With this definition, the properties (a)–(c) above are verified. Notice that if the overall homogeneity hypothesis is not rejected, then the tree cannot be cut and all nodes of the hierarchy, as well as groups of nodes, can be considered homogeneous.

From our perspective, the algorithms proposed in [10] can be generalized as follows. Assume that we know the distribution of the statistic that tests the overall homogeneity hypothesis for a parameter of interest. Choose an HAC procedure with index h such that for any node A of the resulting tree, h(A) is a monotonic function of the statistic that tests the homogeneity of A. Construct a hierarchy with this method and cut the resulting tree at a height which corresponds to the critical value of the statistic which tests the overall homogeneity hypothesis. If the latter is accepted, the tree cannot be cut; however, if the tree can be cut at height h_α, then the sets of the resulting partition can be declared homogeneous, and similarly for their subsets. If the distribution of the test statistic of the overall hypothesis is not known in closed form, a permutation test can be used to estimate it [28].
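A minimal sketch of this generalized STP in Python, under the illustrative assumption that the overall homogeneity LRS is approximately chi-square distributed (when this is not the case, the critical value would be estimated by a permutation test, as noted above); it consumes the merge history produced by the HAC sketch of Sect. 2.1.

from scipy.stats import chi2

def stp_cut_height(df_overall, alpha=0.05):
    # alpha-critical value of a chi-square reference distribution for the
    # overall homogeneity LRS (an assumption used here only for illustration)
    return chi2.ppf(1.0 - alpha, df=df_overall)

def homogeneous_groups(history, n_atoms, h_alpha):
    # Cut a merge history [(i, j, height), ...] at height h_alpha and return
    # the resulting partition of the atom indices.
    groups = [{k} for k in range(n_atoms)]
    active = list(range(n_atoms))
    for i, j, h in history:
        if h > h_alpha:
            break                              # stop merging above the cut
        merged = groups[active[i]] | groups[active[j]]
        groups.append(merged)
        active = [v for k, v in enumerate(active) if k not in (i, j)]
        active.append(len(groups) - 1)
    return [groups[k] for k in active]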

3 Clustering massive data sets

This section is devoted to the main application of our framework: hierarchical clustering of massive data sets. After a brief review of the area in Sect. 3.1, we outline our approach in Sect. 3.2; we give detailed formulas to define an appropriate HAC algorithm in Sect. 3.3; and we discuss in Sect. 3.4 some attempts to evaluate our work and compare it with a comparable method. Our approach consists of a preprocessing step to obtain a manageable number of bins containing the original data, followed by a specific version of the general algorithm of Sect. 2.1. The distinguishing feature of our approach is that we do not assume that the distribution generating the data is known. However, assumptions about the results of the preprocessing allow us to apply the central limit theorem and use the multivariate normal distribution.

3.1 Clustering massive data sets: a brief review

As the size of available data bases increases, eventually any algorithm will run into a data set that requires more time or memory than the analyst can afford. This problem has plagued developers of clustering algorithms for quite some time and several approaches to address it have been developed. One can broadly classify these approaches into three classes, which we shall refer to as the direct, the sampling and the two-step approaches.

Researchers adopting the direct approach attempt to redesign traditional algorithms with the aim of reducing their complexity [27, 29–36]. Within the context of model-based clustering, accelerations of the straightforward EM algorithm have been proposed by [16, 19]. These are based on the development of formulas that are valid, in principle, if the distribution generating the data is multivariate normal. While these authors do consider hierarchical agglomerative versions of their model-based algorithms, they have not developed graphical techniques to fully exploit the advantages of hierarchical clustering. Also, the normality assumption may be a drawback of these otherwise remarkable efforts. The main problem, however, is that these improved algorithms still fail to handle, within reasonable time limits, data sets larger than, say, 100,000 samples and 10 variables (our experimentation). For such sizes, the only successful approaches to reduce the computational burden are essentially non-hierarchical developments of the classic k-Means algorithm [37]. For instance the scalable EM algorithm of Bradley et al. [27] requires only a single scan of the data. A feature common to all these algorithms is that they require a priori specification of the number of clusters [38]. The approach of MCLUST [17] includes a suggestion for determining the number of clusters, but it is based on the knowledge of the distribution generating the data. In general these algorithms lack the graphical power of hierarchical algorithms, which often may suggest how many clusters one should look for by simple inspection of the dendrogram.

The sampling approach is based on a very simple idea: choose a sample of manageable size, apply to it a convenient clustering algorithm and then assign the rest of the original data set to one of the clusters by an appropriate rule. This is the path followed by Banfield and Raftery [14] in their model-based approach to clustering large data sets; they use the classification likelihood and Bayes' rule to assign the rest of the data set to one of the clusters. This has been criticized [39, 40] for overlooking the problem of small but important clusters. These may be underrepresented or not represented at all in the sample, and therefore easily missed. Also, even if all the important clusters are represented, the sample-based solution may be very different from the solution based on the whole data set. An extension of sampling termed fractionation was proposed by Douglass et al. [41]; it successfully reduces calculations. More recently, fractionation has been extended and improved by Tantrum et al. [42] who propose, with very promising results, a hierarchical approach termed refractionation.

Both the direct and the sampling approaches eventually run into size limitations. In principle, the two-step class of methods is very general and open-ended. The idea is to first reduce the data to a manageable number of bins by some form of inexpensive preprocessing. Then, in a second step, one has to choose a reasonable way of representing these bins and devise an algorithm to cluster them. Therefore, the burden of the calculations may be shifted in part to the preprocessing step. Clearly the main problem at this stage is to decide how to preprocess the data, because this choice defines the level of detail below which we lose information. In view of this, one can argue that the choice of preprocessing depends more on the problem at hand than on general methodological considerations. Several approaches to preprocessing have been proposed. Probably the most common one in daily practice is, as suggested by the SAS manual [43], to apply to the data some fast version of k-Means so as to obtain a few hundred preclusters. There is also an implicit preprocessing in the older version of MCLUST. This version only allowed singletons as inputs, hence no explicit preprocessing. However, in the early steps of the agglomeration algorithm, the unconstrained multivariate normal was replaced by multivariate normal distributions with variance–covariance matrices constrained to a very simple form, and these restrictions were removed only after reaching a reasonable size of agglomerates. The gain in size of data sets that can be treated by this approach was, unfortunately, relatively modest.

While there are numerous attempts to develop new methods of preprocessing [44–48], not much new has been proposed for the second step of the two-step approach. For the second step, the SAS manual suggests that the bins be represented simply by the average value of the variables, so that any clustering algorithm can be applied to them. A much better choice is that of Banfield and Raftery [14], who suggest considering the bins as samples from subpopulations, assumed to be well represented by a multivariate normal distribution, and applying to them their hierarchical model-based clustering algorithm. The current version of MCLUST explicitly allows the user to initialize the algorithm with preclusters (our bins). Arguably, the model-based HAC algorithm of Banfield and Raftery, as implemented in MCLUST and with appropriate preprocessing, is the state-of-the-art in the area of hierarchical clustering of massive data sets. Its main drawbacks are the assumption of normality and the fact that, in spite of the very remarkable improvements in speed [16], the algorithm may still be too slow in some cases.

Our approach, explained in what follows, attempts to improve on these two aspects of the second step. It does not attempt to improve on the preprocessing step.

3.2 The basic setup

As usual we start from a data matrix $D = [x_1 \,|\, x_2 \,|\, \ldots \,|\, x_p]$, which represents the values that p continuous random variables (columns) take on N subjects (rows). We denote by x the vector random variable that is assumed to generate D, and by F its distribution. We consider a situation in which the N subjects have been grouped into M 'bins', $B_1, B_2, \ldots, B_M$, of size $n_1, n_2, \ldots, n_M$. We assume M large and both N and $n_i$, $i = 1, \ldots, M$, very large. We assume also that in each bin, $x_r$, the restriction of x to $B_r$, has an unknown but fixed distribution $F_r$. We do not assume $F_r$ normal, but we assume regularity conditions such that the central limit theorem holds for the sample mean of $x_r$ taken over a large, independent random sample, with location parameter $\mu_r$ and variance–covariance matrix $\Sigma_r$. We assume the variance–covariance matrices known: in practice we will use the bin sample variance–covariance matrix $S_r$ to estimate $\Sigma_r$. It follows from the above assumptions that the sample mean over bin r may be considered approximately multivariate normal:

$$\bar{x}_r = \frac{1}{n_r} \sum_{x_i \in B_r} x_i \;\rightarrow\; N(\mu_r, V_r) \qquad (15)$$

where $V_r = \frac{1}{n_r}\Sigma_r$, for $r = 1, \ldots, M$.

Therefore a bin is represented by the distribution of its mean vector, asymptotically normal by the central limit theorem.

The second step of our proposal is a special case of the general algorithm. To define the special case, we only need to work out formulas for the dissimilarities between atoms and between nodes.

During the analysis process, pairs of bins are successively merged to form new groups at a higher level, thus creating a hierarchy. Let $A_r = \{B_s : s \in a_r\}$, $a_r \subset \{1, \ldots, M\}$, $r = i, j$, $a_i \cap a_j = \emptyset$. The nodes of this hierarchy will be represented again by the distribution of their mean vectors.
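A minimal sketch of the preprocessing step and of the bin representation of Eq. 15, using k-Means from scikit-learn (an assumed choice of library; any fast partitioning algorithm would do): each bin is summarized by its mean, by $V_r = S_r / n_r$, and by its size.

import numpy as np
from sklearn.cluster import KMeans

def make_bins(X, n_bins=100, random_state=0):
    # First step: bin the N x p data matrix X with k-Means, then represent each
    # bin B_r by (xbar_r, V_r, n_r) as in Eq. 15, with V_r = S_r / n_r.
    labels = KMeans(n_clusters=n_bins, n_init=10,
                    random_state=random_state).fit_predict(X)
    bins = []
    for r in range(n_bins):
        xr = X[labels == r]
        n_r = xr.shape[0]
        bins.append((xr.mean(axis=0), np.cov(xr, rowvar=False) / n_r, n_r))
    return bins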

3.3 Dissimilarity formulas

We define first the dissimilarity between two bins and then extend the definition to the dissimilarity between any two nodes of the hierarchical classification. Since we are representing any two bins $B_s$ and $B_t$ by the distribution of their mean, it is reasonable to define the dissimilarity between them as the LRS for testing the hypothesis H0 that the two bins have the same mean vector against the hypothesis H1 that their mean vectors differ:

$$d_{s,t} = 2 \ln \frac{\widehat{L}_1}{\widehat{L}_0} \qquad (16)$$

where $\widehat{L}_1$ and $\widehat{L}_0$ represent the maximized likelihood functions under the two hypotheses. Now, the likelihood functions $L_1$ and $L_0$ are:

$$L_1 = N(\bar{x}_s \mid \mu_s, V_s)\, N(\bar{x}_t \mid \mu_t, V_t) \qquad (17)$$

$$L_0 = N(\bar{x}_s \mid \mu_{s \cup t}, V_s)\, N(\bar{x}_t \mid \mu_{s \cup t}, V_t) \qquad (18)$$

or

$$L_1 = \frac{e^{-\frac{1}{2}(\bar{x}_s - \mu_s)^T V_s^{-1}(\bar{x}_s - \mu_s)}}{(2\pi)^{p/2} |V_s|^{1/2}} \cdot \frac{e^{-\frac{1}{2}(\bar{x}_t - \mu_t)^T V_t^{-1}(\bar{x}_t - \mu_t)}}{(2\pi)^{p/2} |V_t|^{1/2}} \qquad (19)$$

$$L_0 = \frac{e^{-\frac{1}{2}(\bar{x}_s - \mu_{s \cup t})^T V_s^{-1}(\bar{x}_s - \mu_{s \cup t})}}{(2\pi)^{p/2} |V_s|^{1/2}} \cdot \frac{e^{-\frac{1}{2}(\bar{x}_t - \mu_{s \cup t})^T V_t^{-1}(\bar{x}_t - \mu_{s \cup t})}}{(2\pi)^{p/2} |V_t|^{1/2}} \qquad (20)$$

Standard algebraic calculations yield:


$$d_{s,t} = (\bar{x}_s - \bar{x}_{s \cup t})^T V_s^{-1} (\bar{x}_s - \bar{x}_{s \cup t}) + (\bar{x}_t - \bar{x}_{s \cup t})^T V_t^{-1} (\bar{x}_t - \bar{x}_{s \cup t}) \qquad (21)$$

where:

$$\bar{x}_{s \cup t} = \left(V_s^{-1} + V_t^{-1}\right)^{-1}\left(V_s^{-1}\bar{x}_s + V_t^{-1}\bar{x}_t\right) \qquad (22)$$

We obtain $V_{s \cup t}$, the variance–covariance of $\bar{x}_{s \cup t}$, from the general property:

$$\mathrm{Var}(w_1 x_1 + w_2 x_2) = w_1 \mathrm{Var}(x_1) w_1^T + w_2 \mathrm{Var}(x_2) w_2^T \qquad (23)$$

Thus:

$$V_{s \cup t} = \mathrm{Var}\!\left[\left(V_s^{-1} + V_t^{-1}\right)^{-1} V_s^{-1}\bar{x}_s + \left(V_s^{-1} + V_t^{-1}\right)^{-1} V_t^{-1}\bar{x}_t\right] \qquad (24)$$

whence

$$V_{s \cup t} = \left(V_s^{-1} + V_t^{-1}\right)^{-1} \qquad (25)$$

It is interesting to also formulate these results in terms of precision. We recall that the precision is the inverse of the variance–covariance matrix: $U_s = V_s^{-1}$ and $U_t = V_t^{-1}$. Then from (25) we can see that the precision of the mean of the combined bins is the sum of their respective precisions:

$$V_{s \cup t}^{-1} = V_s^{-1} + V_t^{-1} \;\Leftrightarrow\; U_{s \cup t} = U_s + U_t \qquad (26)$$

Equation 25 also allows us to re-write the mean of the combined bins as

$$\bar{x}_{s \cup t} = V_{s \cup t}\left(V_s^{-1}\bar{x}_s + V_t^{-1}\bar{x}_t\right) \qquad (27)$$

These results lead us to the simpler formula for the dissimilarity of the two bins:

$$d_{s,t} = (\bar{x}_s - \bar{x}_{s \cup t})^T V_s^{-1} (\bar{x}_s - \bar{x}_{s \cup t}) + (\bar{x}_t - \bar{x}_{s \cup t})^T V_t^{-1} (\bar{x}_t - \bar{x}_{s \cup t}) \qquad (28)$$

It is useful for computational purposes to further simplify the formula for the dissimilarity:

Proposition 2 Two alternative expressions for $d_{s,t}$ are:

$$d_{s,t} = (\bar{x}_s - \bar{x}_t)^T (V_s + V_t)^{-1} (\bar{x}_s - \bar{x}_t) \qquad (29)$$

$$d_{s,t} = (\bar{x}_s - \bar{x}_t)^T V_s^{-1} V_{s \cup t} V_t^{-1} (\bar{x}_s - \bar{x}_t) \qquad (30)$$
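As a small illustration, the compact form of Eq. 29 can be computed directly (a sketch with NumPy; the function name is ours):

import numpy as np

def d_bins(xbar_s, V_s, xbar_t, V_t):
    # LRS dissimilarity between two bins, compact form of Eq. 29:
    # (xbar_s - xbar_t)^T (V_s + V_t)^{-1} (xbar_s - xbar_t)
    diff = xbar_s - xbar_t
    return float(diff @ np.linalg.solve(V_s + V_t, diff))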

We can now extend the definition of dissimilarity to any two nodes $A_i$ and $A_j$. Following the same approach, we define the dissimilarity as the LRS for testing the hypothesis H0 that groups $A_i$ and $A_j$ have the same mean vector against the hypothesis H1 that their mean vectors differ:

$$d_{i,j} = 2 \ln \frac{\widehat{L}_1}{\widehat{L}_0} \qquad (31)$$

where:

$$L_1 = \prod_{r \in a_i} N(\bar{x}_r \mid \mu_{A_i}, V_r)\; \prod_{r \in a_j} N(\bar{x}_r \mid \mu_{A_j}, V_r) \qquad (32)$$

$$L_0 = \prod_{r \in a_i \cup a_j} N(\bar{x}_r \mid \mu_{A_i \cup A_j}, V_r) \qquad (33)$$

Replacing the location parameters $\mu_{A_i}$, $\mu_{A_j}$ and $\mu_{A_i \cup A_j}$ by their maximum likelihood estimators $\bar{x}_{A_i}$, $\bar{x}_{A_j}$ and $\bar{x}_{A_i \cup A_j}$, respectively, we now obtain:

$$d_{i,j} = \sum_{r \in a_i \cup a_j} (\bar{x}_r - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i \cup A_j}) \;-\; \sum_{s = i, j}\; \sum_{r \in a_s} (\bar{x}_r - \bar{x}_{A_s})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_s}) \qquad (34)$$

with

$$\bar{x}_{A_i} = V_{A_i} \left( \sum_{r \in a_i} V_r^{-1} \bar{x}_r \right) \qquad (35)$$

$$V_{A_i}^{-1} = \sum_{r \in a_i} V_r^{-1} \qquad (36)$$

$$\bar{x}_{A_i \cup A_j} = V_{A_i \cup A_j} \left( \sum_{s = i, j} V_{A_s}^{-1} \bar{x}_{A_s} \right) \qquad (37)$$

$$V_{A_i \cup A_j}^{-1} = \sum_{r \in a_i \cup a_j} V_r^{-1} \qquad (38)$$

As above, the following proposition is useful to simplify the calculations.

Proposition 3 Two alternative expressions for $d_{i,j}$ are

$$d_{i,j} = (\bar{x}_{A_i} - \bar{x}_{A_j})^T (V_{A_i} + V_{A_j})^{-1} (\bar{x}_{A_i} - \bar{x}_{A_j}) \qquad (39)$$

$$d_{i,j} = (\bar{x}_{A_i} - \bar{x}_{A_j})^T V_{A_i}^{-1} \left(V_{A_i}^{-1} + V_{A_j}^{-1}\right)^{-1} V_{A_j}^{-1} (\bar{x}_{A_i} - \bar{x}_{A_j}) \qquad (40)$$

Notice that Proposition 3 refers to the dissimilarity between two aggregates, while Proposition 2 refers to the dissimilarity between two bins, so that the latter can be seen as a particular case of the former.
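Putting Eqs. 35–39 together, a minimal sketch (the helper functions are ours) of how two nodes are pooled and compared; precisions add when nodes are merged, and the pooled mean is their precision-weighted average.

import numpy as np

def merge_nodes(xbar_i, V_i, xbar_j, V_j):
    # Pool two nodes: precisions add (Eqs. 36 and 38) and the pooled mean is the
    # precision-weighted average of the node means (Eqs. 35 and 37).
    U_i, U_j = np.linalg.inv(V_i), np.linalg.inv(V_j)
    V_ij = np.linalg.inv(U_i + U_j)
    return V_ij @ (U_i @ xbar_i + U_j @ xbar_j), V_ij

def d_nodes(xbar_i, V_i, xbar_j, V_j):
    # Dissimilarity between two nodes, Eq. 39 (Eq. 29 is the two-bin special case)
    diff = xbar_i - xbar_j
    return float(diff @ np.linalg.solve(V_i + V_j, diff))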

3.4 Evaluation and comparisons

As discussed in Sect. 3.1, our approach proposes an innovation at the second step of a two-step technique.

We are arguing that our method is essentially exploratory and that the graphical representations obtained with it are an important asset. One way to evaluate what we propose is therefore to show how our dendrograms correctly retrieve the number of clusters in artificial data sets where it is known.

As for comparisons with existing methods, it seems natural to compare the performance of ours with the MCLUST approach [14, 17], considered state-of-the-art. We expect an improvement in speed, at least in an appropriate setting. Also, since we do not assume normality, it would be reasonable to expect that if we work with data that are increasingly non-normal, our approach should give better results than the state-of-the-art. To do this, we will use the multivariate t distribution, systematically decreasing the number of degrees of freedom (df), since this distribution tends to a multivariate normal for large df but has thicker tails for small df.

4 Clustering artificial data

This section studies aspects of the performance of our methodology using artificial data with a known cluster structure. We are interested in assessing how well we can retrieve these structures. The purpose of the exercise is threefold. First, we intend to evaluate the proposal described in Sect. 3 for the second step of a two-step procedure, by comparing its performance with what we consider state-of-the-art. This is done using the statistical model described in Sect. 4.1. Second, we want to demonstrate the flexibility of the idea of HAC for subpopulations, by applying it systematically to the study of a data set that is not only huge, but also complex, as it has a multilevel structure. This is done in Sect. 4.2 using the statistical model described therein. Third, we want to test the performance of our approach in retrieving a cluster structure known to be difficult to retrieve for most methods. This is done in Sect. 4.3 using the well-known two-spiral model.

4.1 An evaluation of the HAC approach as the second step of our two-step procedure

In a first experiment, we generated four-dimensional data from three distributions with the means and variance–covariance matrices specified in Tables 2 and 3. These tables refer to the later examples in which they are used within the context of a more complex experiment. Using multivariate t distributions with degrees of freedom ranging from 2 to 10, we generated one data set of 230,000 samples for each value of df.
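The exact simulation code is not given here; a minimal sketch of how such multivariate t samples could be drawn (a standard construction: a multivariate normal draw scaled by an independent chi-square variate), where mean and cov would come from Tables 2 and 3:

import numpy as np

def sample_multivariate_t(mean, cov, df, size, seed=None):
    # t = mean + z / sqrt(g/df), with z ~ N(0, cov) and g ~ chi-square(df)
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(mean)), cov, size=size)
    g = rng.chisquare(df, size=(size, 1))
    return np.asarray(mean) + z / np.sqrt(g / df)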

We analyzed each data set by two different two-step procedures with the same preprocessing step: k-Means clustering with k = 25. The 25 bins were then clustered in the second step by two HAC algorithms: the one described in Sect. 3 and the one by Banfield and Raftery as implemented in MCLUST (R version). Figure 1 presents a summary of this experiment. For df = 2, 5 and 10, we give three graphs. The first two from the left are obtained by our procedure and show a dendrogram and the hierarchical index (height) plotted against the step number; the third graph represents the BIC as a function of the number of steps, which is a generally accepted tool for choosing the number of clusters. As mentioned above, MCLUST does not have the facility to plot dendrograms nor to extract the relevant information to do so. The dendrograms show clearly a three-class structure in all cases. The height and the BIC graphs are generally used by looking for an elbow in the curve; the step number corresponding to the elbow is taken as the best guess at the number of clusters. The sharper the elbow, the stronger is the suggestion given by the data. It can be seen from Fig. 1 that, for high df, the BIC graphs point to a three-cluster solution, but the picture becomes increasingly uncertain as df decreases, which corresponds to departures from normality and a certain overlap of the three clusters. On the other hand, the elbow in the height curve is clearly at 3, regardless of the number of df's of the data-generating distribution. Furthermore, looking at the dendrograms gives a global picture of the data, a feature that is absent in the other approach. This confirms our expectations. The state-of-the-art approach is based on the normal assumption, and therefore it will perform better if the data are normal and not as well when departures from normality become important. By contrast, our procedure does not assume normality, and therefore it is insensitive to changes in the data-generating distribution.

In a second experiment, we generated data from three multivariate normal distributions with dimensions varying from 5 to 25 and sample sizes of 30,000 and 300,000. As above, we applied the two-step procedures under study, with the same first step resulting in 25 bins, and measured the computing time for each data set and each procedure. The results are summarized in Fig. 2. The two panels of the figure correspond to sample sizes of 30,000 and 300,000; in each graph the x-axis represents the dimension of the multivariate distributions, and the y-axis represents the computing time of the analysis measured in seconds. Our approach shows definite advantages for larger sample sizes. Notice also that our procedure is not as sensitive to the dimension of the data-generating distribution. This was to be expected because our procedure does not pass through the whole data set during the HAC algorithm, but only during the preprocessing step. By contrast, the state-of-the-art procedure requires passing through the data at each step of the HAC, which penalizes its performance as the size of the data set increases.


Fig. 1 Results from the three-populations data set analysis, for df = 2, 3, 4, 5 and 10. For each value of df, the tree obtained by our approach is shown on the left, the height graph in the center, and the BIC obtained from the state-of-the-art clustering algorithm on the right.

4.2 Performance of our HAC approach on large multilevel data sets

The data-generating model used here mimics one of the real data sets studied below, which comes from a study on nutrition. We imagine that we have obtained samples from seven regions of three countries, for a total of 230,000 study subjects. To fix ideas we call the countries France, Italy and Spain. France has three regions, viz. north, center and south, while Italy and Spain have two regions each, north and south. The data are supposed to describe the dietary habits of the study subjects. They are generated according to the following model: there are three dietary patterns common to all regions and countries: high animal, Mediterranean and quasi-vegetarian. However, the proportions of the three patterns differ according to country and region, as shown in Table 1, which also indicates the number of samples generated for each center. We imagine that dietary patterns are described by four variables only, viz. vegetables, meat, butter and oil. A pattern is characterized by a multivariate normal distribution of the four variables measuring daily intake (in grams). Tables 2 and 3 define the means, variances and correlation matrices associated with the three patterns. The two-stage clustering approach used here consists of applying an optimal partition algorithm, in this case k-Means, to the 230,000 × 4 data matrix, followed by our clustering method. We first report the results obtained from a particular simulation. Next we report the results of other simulations aimed at exploring the performance of the algorithm under different degrees of cluster overlap. Clearly, small variances yield well-separated clusters and large variances yield overlapping clusters. Therefore, the exploration was performed by simulating data for several values of the variables' variances.

Fig. 2 Computing time for each data set and each procedure, as a function of the dimension of the data set, for sample sizes of 30,000 and 300,000 (squares: hc from MCLUST; triangles: our approach).

Table 1 Distribution of dietary patterns in countries and regions

Dietary pattern       Fr-north  Fr-center  Fr-south  France   It-north  It-south  Italy    Sp-north  Sp-south  Spain
High animal           70%       15%        5%        30%      15%       5%        10%      20%       5%        10%
Mediterranean         10%       50%        90%       50%      50%       90%       70%      30%       90%       70%
Quasi-vegetarian      20%       35%        5%        20%      35%       5%        20%      50%       5%        20%
Approx. no. samples   30,000    30,000     30,000    90,000   20,000    40,000    60,000   40,000    40,000    80,000

4.2.1 Analysis of a copy of the artificial data set

We generated 230,000 samples from the above model. In this data set, used throughout this section, the dietary patterns are clearly separated. Applying the k-Means algorithm (k = 100), we partitioned the original data into 100 bins. The clear separation of the dietary patterns is reflected in the empirical finding that each of the 100 bins contains subjects from a unique dietary pattern. Next, we applied our algorithm for binned data. In Fig. 3a we show the hierarchical tree obtained by analyzing each of the seven regions separately; in Fig. 3b we show the hierarchical tree for each of the three countries. Each tree grows from these bins, represented in the bottom line. Successive combinations are made to connect those bins that lie closer in the original space, i.e., are very similar according to the chosen dissimilarity measure. Such connections are represented in the tree by segments that join the respective nodes. The combinations take place at a height defined by the hierarchical index; since this index has been defined in accordance with the dissimilarity, the height of each union gives us information about the relative positions of the nodes in the original space. We find in the trees of Fig. 3a and b that most of the combinations take place at low heights. Each bin has been color-coded for this example, red for high animal, green for quasi-vegetarian, and light blue for Mediterranean; this shows clearly that the lowest steps of the dendrograms agglomerate bins of the same color, hence corresponding to the same pattern. Indeed, it can be seen in the figures that every tree starts by correctly grouping all bins of the same pattern together. Only when this is accomplished do we find agglomerations of distinct dietary patterns at higher heights, according to the dissimilarities amongst them. As expected, three clusters are clearly visible in each tree, corresponding to the three dietary patterns.

Next, we cut all trees so as to have three clusters. Looking at the data regionally, we formed 21 subpopulations from the 7 regions and again applied our algorithm to these to see if we could again obtain three dietary patterns; indeed this is the case, as shown in Fig. 4a, which also shows that the three patterns were correctly recovered (see color coding). Looking at the data by country, we formed nine subpopulations, and again, as shown in Fig. 4b, the three dietary patterns were correctly retrieved.

An interesting alternative way of looking at the data consists of taking each region as a subpopulation and applying our clustering algorithm directly to these subpopulations. The results are shown in Fig. 5. Figure 6 shows the results of our clustering algorithm as applied to the subpopulations consisting of each country. The results are transparent and confirm that the algorithm correctly retrieves the simulated structure.

Until now, our algorithm has only made use of the multivariate model for binned data. We will now illustrate the use of clustering with the multinomial model. It is interesting to try to replicate the latest results by using the multinomial approach. Thus, we apply again the k-Means algorithm to our artificial data set to obtain 300 bins and construct again a hierarchical tree. Again, not surprisingly, we find three dietary patterns, which correctly retrieve our structure (Fig. 7). Next, we cross the patterns with the seven regions (and the three countries) and present the frequency of occurrence of each dietary pattern in the seven regions (and the three countries) in a two-way table of counts (see Table 4). Now, the rows of this table can be interpreted as a sample from a multinomial model $\mathcal{M}(p_r, N_r)$, where $N_r$ is the marginal total of the r-th row. Clustering the rows of this table, at the region level, we obtain the tree of Fig. 8. Similarly, clustering the rows, now at the country level, we obtain the tree in Fig. 9. The results (including data not shown here) obtained from the two points of view confirm that the data have been generated from three basic dietary patterns, present in different proportions in each region and country.
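To illustrate how the multinomial dissimilarity of Sect. 2.4 is applied to the rows of such a region-by-pattern table of counts, here is a small sketch using the d_multinomial helper defined earlier; the counts below are hypothetical and purely illustrative, since Table 4 is not reproduced here.

import itertools

# hypothetical counts of (high animal, Mediterranean, quasi-vegetarian) per region
regions = ["Fr-N", "Fr-C", "Fr-S", "It-N", "It-S", "Sp-N", "Sp-S"]
counts = [[21000, 3000, 6000], [4500, 15000, 10500], [1500, 27000, 1500],
          [3000, 10000, 7000], [2000, 36000, 2000], [8000, 12000, 20000],
          [2000, 36000, 2000]]

# pairwise dissimilarity matrix between rows, using d_multinomial from Sect. 2.4
D = {(a, b): d_multinomial(counts[a], counts[b])
     for a, b in itertools.combinations(range(len(regions)), 2)}
a, b = min(D, key=D.get)
print("first pair to be merged:", regions[a], regions[b])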

4.2.2 A limited sensitivity analysis

The dietary patterns of the artificial data analyzed above

happen to be clearly separated. This happens in most cases

when generating the data from the model with the same

parameters. A simple way to obtain patterns that are not

clearly separated is to change the model by multiplying the

variances of the four variables by a common scale factor s.

As s increases, points from different dietary patterns tend to

overlap. It can also be suspected that the number of bins

may have an influence on the ability of the algorithm to

retrieve the actual dietary patterns when these are not

clearly separated.

To study the performance of our algorithm under vary-

ing conditions we have performed a limited sensitivity

analysis, by varying s and the number of bins nB. In Fig. 11

we show hierarchical trees obtained from simulated data.

The rows of the figure correspond to a value of s, and the

columns to the number of bins from which the tree was

constructed. We have reported the results for s = 1, 10, 25,

50, 75, 100, and nB = 25, 50, 100, 150, 200. In each case of

Table 2 Means (variance) in

gramsDiet type Vegetables Meat Butter Oil

High animal 0 (4) 100 (25) 50 (25) 0 (4)

Mediterranean 50 (16) 50 (16) 0 (4) 50 (16)

Quasi vegeterian 90 (25) 10 (1) 20 (4) 30 (16)

Table 3 Correlation matricesHigh animal Mediterranean Quasi vegeterian

1 0:5 �0:3 �0:31 �0:5 0

1 0:31

0

BB@

1

CCA

1 0:5 0 0

1 0 0

1 0:31

0

BB@

1

CCA

1 0:5 0:5 0

1 0 0

1 0

1

0

BB@

1

CCA

210 Pattern Anal Applic (2008) 11:199–220

123

In each case of the figure we have also reported the value of the index of Fowlkes and Mallows [49] (the FW index), to compare the original dietary patterns with the clustering obtained by cutting the tree so as to obtain three clusters. This index is widely used in simulations to compare the cross-classification of known and retrieved clusters [42, 50]. We use it here not so much as a gold standard but as a pragmatic tool for evaluating our results. Overall, the performance of

the algorithm is good: for low s, the FW index is close to 1

and is insensitive to the values of nB. As expected, the FW

index deteriorates as s increases, while the effect of the

number of bins seems to be subtler. In the worst case

considered, s = 100, the FW index is 0.6341 for nB = 150,

and seems to reach an optimum of 0.6740 for nB = 100.

Figure 10 shows the boxplots of the FW indexes obtained

running 100 simulations for each pair of values of s and nB.

On the other hand, it should be remarked that the shape of

the tree does not consistently suggest a three-cluster solu-

tion, especially as s increases. These results are not

surprising: most algorithms fail to retrieve strongly over-

lapping clusters. In Table 5 we collect the FW indexes

obtained running 100 simulations for each pair of values of

s and nB; and the FW indexes for the quadratic discriminant

analysis (QDA) predictions. Indeed, in our case, we obtain

a very similar behavior of the FW index when applied to

Fig. 3 Hierarchies from artificial nutrition data: a trees from 7 regions; b trees from 3 countries

Fig. 4 Classification of dietary patterns obtained cutting the trees in Fig. 3: a tree of 21 diets from 7 regions; b tree of 9 diets from 3 countries

Fig. 5 Classification obtained considering each region as a bin

Fig. 6 Classification obtained considering each country as a bin


the results of a QDA to recognize the actual dietary pat-

terns. As for detecting the actual number of clusters, this

constitutes a difficult problem for which no universal

solution has yet been devised.
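For completeness, the FW index of Fowlkes and Mallows [49] can be computed directly from the contingency table that crosses the true and the retrieved cluster labels; the sketch below is our own illustration, with hypothetical toy labels.

import numpy as np

def fowlkes_mallows(labels_true, labels_pred):
    """Fowlkes-Mallows index between two flat clusterings."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = labels_true.size
    # Contingency table of the two labelings
    classes, ci = np.unique(labels_true, return_inverse=True)
    clusters, cj = np.unique(labels_pred, return_inverse=True)
    cont = np.zeros((classes.size, clusters.size))
    np.add.at(cont, (ci, cj), 1)
    tk = (cont ** 2).sum() - n               # pairs together in both clusterings
    pk = (cont.sum(axis=1) ** 2).sum() - n   # pairs together in the true clustering
    qk = (cont.sum(axis=0) ** 2).sum() - n   # pairs together in the retrieved one
    return tk / np.sqrt(pk * qk)

# Toy example: three true patterns versus a retrieved three-cluster cut
true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred = [0, 0, 1, 1, 1, 1, 2, 2, 2]
print(fowlkes_mallows(true, pred))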

4.3 Retrieval of an unusual cluster structure

by a variant of the HAC algorithm based

on our dissimilarity

We now show that our dissimilarity can also be used in conjunction with any traditional HAC algorithm, such as single link. It suffices to calculate once and for all the

dissimilarity between all pairs of subpopulations and sub-

mit the resulting matrix to any efficient and reliable

algorithm. This approach extends considerably the scope of

traditional HAC algorithms, which are especially vulnera-

ble to the difficulties caused by massive data sets.

The power of this idea is demonstrated in the following

experiment, in which we retrieved a complex clustering

structure that most traditional algorithms find difficult to

retrieve. We have generated a data set of 300,000 samples

from the popular benchmark two-spiral problem proposed

by Alexis Wieland and now part of the Carnegie Mellon

repository [51]. We have chosen a standard deviation of

0.05, which causes a certain overlap of the two spirals, as

shown in the graphical representation of our data of

Fig. 12a. While single link should be able to retrieve such a

structure, it cannot be applied efficiently to data sets of our

size, unless we use a two-step approach.

We have deployed a two-step approach, with k-Means (k = 25) in the first step and single link applied to the resulting dissimilarity matrix in the second step. Figure 12b shows the dendrogram thus obtained, which clearly points to a 2-cluster solution. Moreover, this solution clearly identifies the shape

of the two spirals of Fig. 12a. By contrast, using both the

algorithm of Sect. 3 and the HAC procedure MCLUST, we

completely miss the correct solution, since both approaches

attempt to cover the whole data set by balls.
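A minimal sketch of this two-step strategy with standard scientific-Python tools might look as follows. The variance attached to each bin mean is estimated as the within-bin covariance divided by the bin size, and the pairwise dissimilarity takes the Gaussian form d(B_s, B_t) = (x̄_s − x̄_t)^T (V_s + V_t)^{-1} (x̄_s − x̄_t) derived in the Appendix; the data-generation line is a stand-in for the two-spiral sample, and all names are ours.

import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(300_000, 2))        # stand-in for the two-spiral data

# Step 1: reduce the data to a moderate number of bins
km = KMeans(n_clusters=25, n_init=10, random_state=0).fit(X)
means = km.cluster_centers_
V = []                                   # covariance of each bin mean
for k in range(25):
    Xk = X[km.labels_ == k]
    V.append(np.cov(Xk, rowvar=False) / len(Xk))

# Step 2: pairwise dissimilarity between bins, then single link
nb = len(means)
D = np.zeros((nb, nb))
for s in range(nb):
    for t in range(s + 1, nb):
        diff = means[s] - means[t]
        D[s, t] = D[t, s] = diff @ np.linalg.solve(V[s] + V[t], diff)

Z = linkage(squareform(D), method="single")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut into two clusters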

5 Analysis of dietary data

We show here an analysis of the real data set that motivated many of the developments presented in this work. It

has a multilevel structure similar to the artificial data of

Sect. 4.2. The interest of this analysis is in the deployment

of many aspects of the methodology outlined in this paper,

in the context of a study that addresses an important pop-

ulation health problem.

Fig. 7 Hierarchy from the whole artificial nutrition data set

Table 4 Detected frequencies

                  Mediterranean   High animal   Quasi vegetarian
France-Center     15,199          4,686         11,008
France-North      3,031           21,638        6,149
France-South      27,450          1,497         1,497
Total France      45,680          27,821        18,654
Italy-North       20,032          5,807         13,930
Italy-South       35,558          1,990         1,954
Total Italy       55,590          7,797         15,884
Spain-North       5,888           3,915         9,674
Spain-South       35,198          1,927         1,972
Total Spain       41,086          5,842         11,646

Fig. 8 Classification of seven regions with a multinomial tree

Fig. 9 Classification of three countries with a multinomial tree


Our data was collected in the course of the EPIC study

on nutrition and cancer [44] and comprises measurements

of dietary variables on 4,852 women from seven regions of

France. Table 6 shows the number of samples registered in

each region. The variables are 24-h food intake (in grams)

for 16 food groups. Prior to the analysis, the variables for each subject were re-expressed as percentages of the total food intake. A profile of the variables is given in Fig. 13. In

the first stage of the analysis, we performed a k-Means

clustering. A choice of k = 30 seems appropriate and we

only report here the results obtained from clustering 30

bins. In the second stage, we applied our hierarchical

clustering algorithm, obtaining the tree of Fig. 14. The

graph does not compellingly suggest a specific cut. We

give here the results of the four-cluster cut, which leads to a

reasonable interpretation. The profiles for the four clusters

are given in Fig. 15. It can be seen that dietary pattern 1 is

characterized by high intake of alcohol, dietary pattern 2 by

high intake of dairy products, dietary pattern 3 by high

intake of vegetables and fruit, and dietary pattern 4 by high

intake of soups and relatively high intakes of dairy prod-

ucts and fruit. The frequency and the proportions of the

four dietary patterns in each of the seven regions are given

in Table 7. The multinomial clustering algorithm was

applied to the rows of Table 7a, obtaining the tree of

Fig. 16. Cutting the tree at the level corresponding to four

clusters, we can see that Nord-Pas-de-Calais and Ile-de-France constitute two 1-region clusters, while the remaining five regions cluster in two groups.
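The overall pipeline of this section can be sketched as follows; the input data are simulated stand-ins for the EPIC measurements, and the second stage uses Ward linkage on the bin means merely as a placeholder for the LRS-based algorithm, so the sketch illustrates the flow of the analysis rather than reproducing it.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical inputs: 'intake' is a (4852 x 16) table of grams per food group,
# 'region' a vector of region labels; neither is the actual EPIC data.
rng = np.random.default_rng(1)
intake = pd.DataFrame(rng.gamma(2.0, 20.0, size=(4852, 16)))
region = rng.choice(["Ile-de-France", "Aquitaine", "Bretagne"], size=4852)

# Re-express each subject's intakes as percentages of her total intake
profile = intake.div(intake.sum(axis=1), axis=0) * 100

# First stage: k-Means into 30 bins
km = KMeans(n_clusters=30, n_init=10, random_state=0).fit(profile)

# Second stage: hierarchical clustering of the 30 bin means
# (Ward on the bin means, as a stand-in for the LRS-based algorithm)
Z = linkage(km.cluster_centers_, method="ward")
bin_to_pattern = fcluster(Z, t=4, criterion="maxclust")

# Map every subject to a dietary pattern and cross with region (cf. Table 7)
pattern = bin_to_pattern[km.labels_]
print(pd.crosstab(pd.Series(region, name="region"),
                  pd.Series(pattern, name="pattern")))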

Figure 17 gives the distribution of the four dietary pat-

terns in each of the four clusters of regions. The largest

differences are seen in the proportions of dietary patterns 3

and 4. It can be seen that the Nord-Pas-de-Calais region is associated with a large proportion of dietary pattern 4, while the Ile-de-France region is associated with a large proportion of dietary pattern 3. The group consisting of Alsace-Lorraine, Aquitaine and Rhone-Alpes is very similar to Ile-de-France, but it has a slightly higher proportion of dietary pattern 4. Finally, the group consisting of Bretagne and Languedoc-Roussillon contains similar proportions of dietary patterns 3 and 4.

All in all, the analysis presented here leads to a clear and

concise description of the dietary patterns found in this

population as well as of the geographical distribution of

these patterns across France.

Fig. 10 FW index for different values of s and nB


6 Summary and conclusion

In this work we have reformulated the old idea of clus-

tering subpopulations on the basis of available samples

from them. We do this in order to show that algorithms for

clustering composite observational units provide excellent

tools for dealing with complex data structures and huge

data sets.

We start by introducing a general framework for

developing HAC algorithms. Most clustering algorithms

Fig. 11 Hierarchies obtained with different values of s and nB

Table 5 FW indexes for QDA and our algorithm

            s = 1    s = 10   s = 25   s = 50   s = 75   s = 100
FW index for the hierarchical classification
nB = 25     1.0000   0.9990   0.9719   0.8520   0.7517   0.6531
nB = 50     1.0000   0.9994   0.9735   0.8865   0.8071   0.6643
nB = 100    1.0000   0.9995   0.9750   0.9019   0.8107   0.6740
nB = 150    1.0000   0.9996   0.9752   0.9088   0.8492   0.6341
nB = 200    1.0000   0.9996   0.9757   0.9012   0.8444   0.6463
FW index for the QDA
nB = 25     1.0000   0.9991   0.9763   0.8636   0.7684   0.6678
nB = 50     1.0000   0.9994   0.9780   0.9007   0.8306   0.6817
nB = 75     1.0000   0.9996   0.9797   0.9174   0.8327   0.6943
nB = 100    1.0000   0.9996   0.9800   0.9250   0.8759   0.6515
nB = 200    1.0000   0.9997   0.9805   0.9168   0.8711   0.6659


explicitly rely on the notion of dissimilarity between two

atoms. We argue that when dealing with atoms that are

samples from several probability distributions, known in

parametric form, a natural choice for the dissimilarity

between any two units is the likelihood ratio statistic (LRS)

of the hypothesis that the two units come from two dif-

ferent distributions to the hypothesis that they come from

the same one. This established, we need one more impor-

tant specification: the agglomeration rule, i.e. the rule that allows us to extend the notion of dissimilarity between two

atoms to the notion of dissimilarity between two aggregates

of atoms. These two specifications are sufficient to define a

hierarchical agglomerative classification (HAC) algorithm

for subpopulations. This paper uses almost entirely a sim-

ple and natural model based agglomeration rule: define the

dissimilarity between two aggregates as the LRS of the

hypothesis that the aggregates have the same distribution to

the hypothesis that they have two distinct ones. However, we also show that other agglomeration rules, such as single link and complete link, can easily be adopted instead, thus changing the point of view from which we look at a data set.
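For the Gaussian bin model, the model-based agglomeration rule can be illustrated by the following naive sketch (our own, not the authors' implementation): at each step the closest pair of aggregates under d(A_i, A_j) = (x̄_{A_i} − x̄_{A_j})^T (V_{A_i} + V_{A_j})^{-1} (x̄_{A_i} − x̄_{A_j}) is merged, the precisions are pooled and the mean is updated as in Eq. (55) of the Appendix, so the same formula applies at every level.

import numpy as np

def lrs_hac(means, covs):
    """Naive agglomerative clustering of Gaussian 'atoms' with the
    LRS dissimilarity d(A,B) = (xA - xB)' (VA + VB)^{-1} (xA - xB).
    Merging pools the precisions and the precision-weighted means."""
    U = [np.linalg.inv(V) for V in covs]      # precision of each aggregate
    x = [np.asarray(m, dtype=float) for m in means]
    active = list(range(len(means)))
    merges = []
    while len(active) > 1:
        # Find the closest pair among the active aggregates
        best = None
        for a in range(len(active)):
            for b in range(a + 1, len(active)):
                i, j = active[a], active[b]
                diff = x[i] - x[j]
                Vij = np.linalg.inv(U[i]) + np.linalg.inv(U[j])
                d = diff @ np.linalg.solve(Vij, diff)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Merge: U_{i∪j} = U_i + U_j, x_{i∪j} = U_{i∪j}^{-1}(U_i x_i + U_j x_j)
        U_new = U[i] + U[j]
        x_new = np.linalg.solve(U_new, U[i] @ x[i] + U[j] @ x[j])
        U.append(U_new)
        x.append(x_new)
        merges.append((i, j, d))
        active = [k for k in active if k not in (i, j)] + [len(U) - 1]
    return merges

# Three toy bins in two dimensions
ms = [np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([5.0, 5.0])]
Vs = [0.01 * np.eye(2)] * 3
print(lrs_hac(ms, Vs))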

We discuss the relationship between clustering sub-

populations and three important domains of statistical

applications: model based clustering, handling of massive

data sets and Simultaneous Testing Procedures in the

area of multiple comparisons. One of the most appealing

ways of producing a clustering is to construct a hierar-

chy, represented by a dendrogram, which appropriately

summarizes the relationships amongst objects and/or

variables. Unfortunately the construction of a dendrogram

seems impractical for data sets of the size encountered in

Fig. 12 The 2-spiral problem in a large data set context: a identified clusters; b 2-step single link dendrogram

Table 6 Samples registered in seven regions

Region                    Samples registered
Alsace-Lorraine           478
Aquitaine                 443
Bretagne-Pays-de-Loire    635
Ile-de-France             1,201
Languedoc-Roussillon      625
Nord-Pas-de-Calais        452
Rhone-Alpes               1,018

Table 7 Frequency and percentage of each dietary pattern

                         Frequency               Percentage
                         D1    D2    D3    D4    D1     D2     D3     D4
Alsace-Lorraine          92    73    176   137   19.2   15.3   36.8   28.7
Aquitaine                78    72    162   131   17.6   16.3   36.6   29.6
Bretagne-Pays-de-Loire   124   80    210   221   19.5   12.6   33.1   34.8
Ile-de-France            256   188   491   266   21.3   15.7   40.9   22.1
Languedoc-Roussillon     107   90    220   208   17.1   14.4   35.2   33.3
Nord-Pas-de-Calais       90    62    108   192   19.9   13.7   23.9   42.5
Rhone-Alpes              173   143   428   274   17.0   14.0   42.0   26.9


current applications. Usually, however, interest focuses

less on the whole dendrogram than on its upper portion,

which provides insight into the hierarchy of relatively

few and large clusters. As the number of ‘‘interesting’’

classes is far smaller than the number of observations,

one may disregard the details about the lowest levels of

the dendrogram with negligible information loss.

We then devote an entire section to the development of a

new approach to HAC for huge data sets. In the next sec-

tion, we present analyses of artificial data sets with the

purpose of evaluating the approach to huge data sets we

Fig. 13 Profile of the EPIC data set variables

Fig. 14 Tree of the nutrition data

Fig. 15 Profiles of the four dietary patterns (panels: Dietary Pattern 1 to Dietary Pattern 4)

Fig. 16 Nutrition data multinomial tree


propose. This is followed by a detailed analysis of a real

data set, which was the motivation of this work.

Indeed, clustering of subpopulations emerges as a uni-

fying tool for formulating and solving a variety of

problems in modern data analysis.

Acknowledgements We wish to thank Dr. F. Clavel for having

shared the EPIC French data with us and Dr. E. Riboli for general

assistance in becoming familiar with EPIC. The authors gratefully acknowledge the hospitality of the members of the Department of Epidemiology and Statistics during M. Castejon's and A. Gonzalez's visits to McGill University. M. Castejon also thanks the members of INRIA for their hospitality during his visit. We gratefully acknowledge support from

the Ministerio de Educacion y Ciencia de Espana, Direccion General

de Investigacion, by means of the DPI2006-14784 and DPI2007-

61090 research projects.

7 Appendix: Proofs

Proof of Proposition 1 Since d(A_i, A_j) > 0 we have that f(A_r) > 0, r = 1, 2, ..., M. Thus f(A_i \cup A_j) > f(A_i) + f(A_j) and obviously f(A_i \cup A_j) > max{f(A_i), f(A_j)}.

Proof of Proposition 2

d(B_s, B_t) = (\bar{x}_s - \bar{x}_{s \cup t})^T V_s^{-1} (\bar{x}_s - \bar{x}_{s \cup t}) + (\bar{x}_t - \bar{x}_{s \cup t})^T V_t^{-1} (\bar{x}_t - \bar{x}_{s \cup t})   (41)

Now:

\bar{x}_s - \bar{x}_{s \cup t} = (V_s^{-1} + V_t^{-1})^{-1} \big[ (V_s^{-1} + V_t^{-1}) \bar{x}_s - V_s^{-1} \bar{x}_s - V_t^{-1} \bar{x}_t \big]   (42)

\bar{x}_s - \bar{x}_{s \cup t} = (V_s^{-1} + V_t^{-1})^{-1} V_t^{-1} (\bar{x}_s - \bar{x}_t)   (43)

\bar{x}_t - \bar{x}_{s \cup t} = (V_s^{-1} + V_t^{-1})^{-1} V_s^{-1} (\bar{x}_t - \bar{x}_s)   (44)

Therefore:

d(B_s, B_t) = (\bar{x}_s - \bar{x}_t)^T V_t^{-1} (V_s^{-1} + V_t^{-1})^{-1} V_s^{-1} (V_s^{-1} + V_t^{-1})^{-1} V_t^{-1} (\bar{x}_s - \bar{x}_t)
  + (\bar{x}_s - \bar{x}_t)^T V_s^{-1} (V_s^{-1} + V_t^{-1})^{-1} V_t^{-1} (V_s^{-1} + V_t^{-1})^{-1} V_s^{-1} (\bar{x}_s - \bar{x}_t)
  = (\bar{x}_s - \bar{x}_t)^T (V_s + V_t)^{-1} (V_s^{-1} + V_t^{-1})^{-1} V_t^{-1} (\bar{x}_s - \bar{x}_t)
  + (\bar{x}_s - \bar{x}_t)^T (V_s + V_t)^{-1} (V_s^{-1} + V_t^{-1})^{-1} V_s^{-1} (\bar{x}_s - \bar{x}_t)
  = (\bar{x}_s - \bar{x}_t)^T (V_s + V_t)^{-1} (V_s^{-1} + V_t^{-1})^{-1} (V_s^{-1} + V_t^{-1}) (\bar{x}_s - \bar{x}_t)
  = (\bar{x}_s - \bar{x}_t)^T (V_s + V_t)^{-1} (\bar{x}_s - \bar{x}_t)   (45)

where we have used

V_s^{-1} (V_s^{-1} + V_t^{-1})^{-1} V_t^{-1} = (V_s + V_t)^{-1}   (46)

Notice also that we can write

d(B_s, B_t) = (\bar{x}_s - \bar{x}_t)^T (V_s^{-1} + V_t^{-1})^{-1} V_s^{-1} V_t^{-1} (\bar{x}_s - \bar{x}_t)   (47)

owing to the commutativity of symmetric matrices.
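As a quick numerical sanity check of this identity (our own illustration, not part of the paper), one can verify that the two forms of d(B_s, B_t) in Eqs. (41) and (45) coincide for arbitrary positive-definite V_s, V_t:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3)); Vs = A @ A.T + np.eye(3)   # random SPD matrices
B = rng.normal(size=(3, 3)); Vt = B @ B.T + np.eye(3)
xs, xt = rng.normal(size=3), rng.normal(size=3)

Us, Ut = np.linalg.inv(Vs), np.linalg.inv(Vt)
x_union = np.linalg.solve(Us + Ut, Us @ xs + Ut @ xt)   # precision-weighted mean

d41 = (xs - x_union) @ Us @ (xs - x_union) + (xt - x_union) @ Ut @ (xt - x_union)
d45 = (xs - xt) @ np.linalg.solve(Vs + Vt, xs - xt)
print(np.isclose(d41, d45))   # True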

Proof of Proposition 3

d(A_i, A_j) = \sum_{r \in A_i \cup A_j} (\bar{x}_r - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i \cup A_j})
  - \sum_{i=1}^{2} \sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i})   (48)

d(A_i, A_j) = \sum_{i=1}^{2} \Big\{ \sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i \cup A_j})
  - \sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i}) \Big\}   (49)

Let us develop the first term within the curly bracket above:

Fig. 17 Distribution of the four dietary patterns in each of the four clusters of regions (panels: Nord-Pas-de-Calais; Languedoc-Roussillon and Bretagne-Pays-de-Loire; Ile-de-France; Alsace-Lorraine, Aquitaine and Rhone-Alpes)


\sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i \cup A_j})
  = \sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i} + \bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i} + \bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})
  = \sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i})
  + \sum_{r \in A_i} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})
  + 2 \sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i})^T V_r^{-1} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})   (50)

Now notice that the last term is zero. This follows from:

\sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i})^T V_r^{-1} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})
  = (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})^T \sum_{r \in A_i} V_r^{-1} (\bar{x}_r - \bar{x}_{A_i})   (51)

and

\sum_{r \in A_i} V_r^{-1} (\bar{x}_r - \bar{x}_{A_i}) = \sum_{r \in A_i} V_r^{-1} \bar{x}_r - \sum_{r \in A_i} V_r^{-1} \bar{x}_{A_i}
  = V_{A_i}^{-1} \bar{x}_{A_i} - V_{A_i}^{-1} \bar{x}_{A_i} = 0   (52)

since by definition

V_{A_i}^{-1} = \sum_{r \in A_i} V_r^{-1}, \qquad \bar{x}_{A_i} = V_{A_i} \sum_{r \in A_i} V_r^{-1} \bar{x}_r   (53)

We now have

d(A_i, A_j) = \sum_{i=1}^{2} \sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i})
  + \sum_{i=1}^{2} \sum_{r \in A_i} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})
  - \sum_{i=1}^{2} \sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i})
  = \sum_{i=1}^{2} \Big\{ \sum_{r \in A_i} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j}) \Big\}   (54)

and we can proceed as in Proposition 2. To avoid confusion we write

U_r = V_r^{-1}, \qquad U_{A_i} = \sum_{r \in A_i} V_r^{-1} = \sum_{r \in A_i} U_r, \qquad U_{A_i \cup A_j} = U_{A_i} + U_{A_j},

whence

\bar{x}_{A_i \cup A_j} = U_{A_i \cup A_j}^{-1} (U_{A_i} \bar{x}_{A_i} + U_{A_j} \bar{x}_{A_j})   (55)

and therefore

\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j} = U_{A_i}^{-1} \sum_{r \in A_i} U_r \bar{x}_r - U_{A_i \cup A_j}^{-1} (U_{A_i} \bar{x}_{A_i} + U_{A_j} \bar{x}_{A_j})
  = U_{A_i \cup A_j}^{-1} (U_{A_i \cup A_j} \bar{x}_{A_i} - U_{A_i} \bar{x}_{A_i} - U_{A_j} \bar{x}_{A_j})
  = U_{A_i \cup A_j}^{-1} (U_{A_i} \bar{x}_{A_i} + U_{A_j} \bar{x}_{A_i} - U_{A_i} \bar{x}_{A_i} - U_{A_j} \bar{x}_{A_j})
  = U_{A_i \cup A_j}^{-1} U_{A_j} (\bar{x}_{A_i} - \bar{x}_{A_j})   (56)

and similarly

\bar{x}_{A_j} - \bar{x}_{A_i \cup A_j} = U_{A_i \cup A_j}^{-1} U_{A_i} (\bar{x}_{A_j} - \bar{x}_{A_i})   (57)

Substituting in the expression for d(A_i, A_j) above (Eq. 54):

d(A_i, A_j) = \sum_{r \in A_i} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})
  + \sum_{r \in A_j} (\bar{x}_{A_j} - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_{A_j} - \bar{x}_{A_i \cup A_j})
  = (\bar{x}_{A_i} - \bar{x}_{A_j})^T U_{A_j} U_{A_i \cup A_j}^{-1} \Big( \sum_{r \in A_i} V_r^{-1} \Big) U_{A_j} U_{A_i \cup A_j}^{-1} (\bar{x}_{A_i} - \bar{x}_{A_j})
  + (\bar{x}_{A_i} - \bar{x}_{A_j})^T U_{A_i} U_{A_i \cup A_j}^{-1} \Big( \sum_{r \in A_j} V_r^{-1} \Big) U_{A_i} U_{A_i \cup A_j}^{-1} (\bar{x}_{A_i} - \bar{x}_{A_j})
  = (\bar{x}_{A_i} - \bar{x}_{A_j})^T U_{A_j} U_{A_i \cup A_j}^{-1} U_{A_i} U_{A_j} U_{A_i \cup A_j}^{-1} (\bar{x}_{A_i} - \bar{x}_{A_j})
  + (\bar{x}_{A_i} - \bar{x}_{A_j})^T U_{A_i} U_{A_i \cup A_j}^{-1} U_{A_j} U_{A_i} U_{A_i \cup A_j}^{-1} (\bar{x}_{A_i} - \bar{x}_{A_j})
  = (\bar{x}_{A_i} - \bar{x}_{A_j})^T U_{A_i \cup A_j}^{-1} U_{A_i} U_{A_j} (U_{A_i} + U_{A_j}) U_{A_i \cup A_j}^{-1} (\bar{x}_{A_i} - \bar{x}_{A_j})
  = (\bar{x}_{A_i} - \bar{x}_{A_j})^T U_{A_i \cup A_j}^{-1} U_{A_i} U_{A_j} (\bar{x}_{A_i} - \bar{x}_{A_j})   (58)

where we have used the commutativity of symmetric matrices, the definition U_{A_i} = \sum_{r \in A_i} V_r^{-1}, and U_{A_i \cup A_j} = U_{A_i} + U_{A_j}. Finally, noting that

U_{A_i \cup A_j}^{-1} U_{A_i} U_{A_j} = (V_{A_i}^{-1} + V_{A_j}^{-1})^{-1} V_{A_i}^{-1} V_{A_j}^{-1} = (V_{A_i} + V_{A_j})^{-1}   (59)

we have

d(A_i, A_j) = (\bar{x}_{A_i} - \bar{x}_{A_j})^T (V_{A_i} + V_{A_j})^{-1} (\bar{x}_{A_i} - \bar{x}_{A_j})   (60)

References

1. Conversano C (2002) Bagged mixtures of classifiers using model

scoring criteria. Pattern Anal Appl 5(4):351–362

2. Barandela R, Valdovinos RM, Sanchez JS (2003) New applica-

tions of ensembles of classifiers. Pattern Anal Appl 6(3):245–256


3. Yang J, Ye H, Zhang D (2004) A new lda-kl combined method

for feature extraction and its generalisation. Pattern Anal Appl

7(1):40–50. doi:10.1007/s10044-004-0205-6

4. Masip D, Kuncheva LI, Vitria J (2005) An ensemble-based

method for linear feature extraction for two-class problems.

Pattern Anal Appl 227–237

5. Chou C-H, Su M-C (2004) A new cluster validity measure and its

application to image compression. Pattern Anal Appl 7(2):205–

220. doi:10.1007/s10044-004-0218-1

6. Gyllenberg M, Koski T, Lund T, Nevalainen O (2000) Clustering

by adaptive local search with multiple search operators. Pattern

Anal Appl 3(4):348–357. doi:10.1007/s100440070006

7. Omran MGH, Salman A, Engelbrech AP (2006) Dynamic clus-

tering using particle swarm optimization with application in

image segmentation. Pattern Anal Appl 34:332–344. doi:

10.1007/978-3-540-34956-3_6

8. Frigui H (2005) Unsupervised learning of arbitrarily shaped

clusters using ensembles of gaussian models. Pattern Anal Appl

8(1–2):32–49. doi:10.1007/s10044-005-0240-y

9. Franti P, Kivijarvi J (2000) Randomised local search algorithm

for the clustering problem. Pattern Anal Appl 3:358–369

10. Calinski T, Corsten LCA (1985) Clustering means in anova by

simultaneous testing. Biometrics 41:39–48

11. Gabriel KR (1964) A procedure for testing the homogeneity of all

sets of means in analysis of variance. Biometrics 20(3):459–477.

doi: http://dx.doi.org/10.2307/2528488

12. Gabriel KR (1969) Simultaneous test procedures—some theory

of multiple comparisons. Ann Math Stat 40(1):224–250

13. Scott AJ, Symons MJ (1971) Clustering methods based on like-

lihood ratio criteria. Biometrics 27:387–397

14. Banfield JD, Raftery AE (1993) Model-based gaussian and non-

gaussian clustering. Biometrics 49:803–821

15. Bensmail H, Celeux G, Raftery AE, Robert CP (1997) Inference

in model-based cluster analysis. Stat Comput 7(1):1–10. doi:

http://dx.doi.org/10.1023/A:1018510926151

16. Fraley C (1998) Algorithms for model-based gaussian hierar-

chical clustering. SIAM J Sci Comput 20:270–281

17. Fraley C, Raftery AE (1999) Mclust: software for model-based

cluster analysis. J Classification 16:297–306

18. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood

from incomplete data via the em algorithm. J R Stat Soc 39(1):

1–38

19. Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332, ISSN 0167-9473. doi:10.1016/0167-9473(92)90042-E

20. Maitra Ranjan (1998) Clustering massive datasets. Technical

report, http://www.math.umbc.edu/*maitra/papers/jsm98.ps

21. Murtagh F (2002) Clustering in massive data sets, chap 14.

Kluwer, Dordrecht, pp 401–545

22. Kaufman L, Rousseeuw PJ (1990) Finding groups in data. An

introduction to cluster analysis. Wiley series in probability and

mathematical statistics. Applied probability and statistics. Wiley,

New York

23. Barndorff-Nielsen OE (1978) Information and exponential fam-

ilies in statistical theory. Wiley, Chichester

24. Cox DR, Hinckley DV (1974) Theoretical statistics. Chapman

and Hall, London

25. McCullagh P, Nelder JA (1989) Generalized linear models, 2nd

edn. Chapman and Hall, London

26. Bradley PS, Fayyad UM (1998) Refining initial points for k-

Means clustering. In: Proceedings of the 15th international con-

ference on machine learning. Morgan Kaufmann, San Francisco,

pp 91–99

27. Bradley PS, Fayyad U, Reina C (1998) Scaling cluster algorithms

to large databases. In: Proceedings of the 4th international

conference on knowledge discovery and data mining. American

Association for Artificial Intelligence Press, Menlo Park, pp 9–15

28. Good P (2005) Permutation, parametric, and bootstrap tests of

hypotheses. A practical guide to resampling methods for testing

hypotheses. Springer, Heidelberg

29. Zhang T, Ramakrishnan R, Livny M (1997) Birch: A new data

clustering algorithm and its applications. Data Mining and Knowl

Discov 1(2):141–182

30. Ng RT, Han J (1994) Efficient and effective clustering methods

for spatial data mining. In: Bocca J, Jarke M, Zaniolo C (eds)

20th International Conference on Very Large Data Bases, 12–15

September 1994, Santiago, Chile proceedings. Morgan Kaufmann

Publishers, Los Altos, CA 94022, pp 144–155

31. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Auto-

matic subspace clustering of high dimensional data for data

mining applications. In: Proceedings of ACM SIGMOD inter-

national conference on management of data. ACM, New York, pp

94–105

32. Guha S, Rastogi R, Shim K (1998) Cure: an efficient clustering

algorithm for large databases. In: Proceedings of ACM SIGMOD

international conference on management of data. ACM, New

York, pp 73–84

33. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based

algorithm for discovering clusters in large spatial databases

with noise. In: Proceeding of the 2nd international conference

on knowledge discovery and data mining, Portland, pp 226–

231

34. Ester M, Kriegel H-P, Sander J, Wimmer M, Xu X (1998)

Incremental clustering for mining in a data warehousing envi-

ronment. In: Proceedings of the 24th international conference on

very large data bases, New York City, NY, pp 323–333

35. Nittel S, Leung KT, Braverman A (2004) Scaling clustering

algorithms for massive data sets using data streams. In: Pro-

ceedings of the 20th international conference on data engineering (ICDE'04)

36. Guha S, Rastogi R, Shim K (1999) Rock: a robust clustering

algorithm for categorical attributes. In: Proceedings of the 15th international conference on data engineering, Sydney, Australia

37. Hartigan JA (1975) Clustering algorithms. Wiley, New York

38. Fraley C, Raftery AE (1998) How many clusters?

Which clustering method? Answers via model-based cluster

analysis. Comput J 41(8):578–588. http://citeseer.nj.nec.com/

article/fraley98how.html

39. Fayyad U, Smyth P (1996) From massive datasets to science

catalogs: Applications and challenges. http://www.citeseer.ist.

psu.edu/fayyad95from.html

40. Gordon AD (1986) Links between clustering and assignment

procedures. In: Proceedings of computational statistics, pp 149–

156. doi: http://dx.doi.org/10.1016/0167-9473(92)90042-E

41. Cutting DR, Pedersen JO, Karger D, Tukey JW (1992) Scatter/

gather: a cluster-based approach to browsing large document

collections. In: Proceedings of the 15th annual international

ACM SIGIR conference on research and development in

information retrieval, pp 318–329. http://citeseer.ist.psu.edu/

cutting92scattergather.html

42. Tantrum J, Murua A, Stuetzle W (2002) Hierarchical model-

based clustering of large datasets through fractionation and

refractionation. http://citeseer.nj.nec.com/572414.html

43. SAS Institute Inc (2004) SAS/STAT1 9.1 User’s Guide. SAS

Institute Inc, Cary. ISBN 1-59047-243-8

44. Ciampi A, Lechevallier Y (2000) Clustering large, multi-level

data sets: an approach based on kohonen self organizing maps. In:

Principles of data mining and knowledge discovery: 4th european

conference, PKDD 2000, Lyon, France. Proceedings, volume

1910/2000 of Lecture Notes in Computer Science. Springer,

Heidelberg, pp 353–358


45. Lechevallier Y, Ciampi A (2007) Multilevel clustering for large

databases. In: Auget J-L, Balakrishnan N, Mesbah M, Mole-

nberghs G (eds) Advances in statistical methods for the health

sciences, statistics for industry and technology, chap 10. Applied

Probality and Statistics, Springer edition, pp 263–274

46. Posse C (2001) Hierarchical model-based clustering for large

datasets. J Comput Graph Stat 10:464–486

47. Guo H, Renaut R, Chen K, Reiman E (2003) Clustering huge data

sets for parametric pet imaging. Biosystems 71:81–92

48. Frigui H (2004) Fuzzy information. Processing NAFIPS’04.

IEEE annual meeting, 27–30 June 2004, vol 2, pp 967–972. doi:

10.1109/NAFIPS.2004.1337437

49. Fowlkes EB, Mallows CL (1983) A method for comparing two

hierarchical clusterings. J Am Stat Assoc 78:553–569

50. Meila M (2002) Comparing clusterings. Technical Report 418,

Department of Statistics, University of Washington

51. Singh S (1998) 2d spiral pattern recognition with possibilistic

measures. Pattern Recogn Lett 19(2):141–147, ISSN 0167-8655.

doi: http://dx.doi.org/10.1016/S0167-8655(97)00163-3

Author Biographies

Antonio Ciampi received his

M.Sc. and Ph.D. degrees from

Queen’s University, Kingston,

Ontario, Canada in 1973. He

taught at the University of Zam-

bia from 1973 to 1977. Returning

to Canada he worked as statiti-

cian in the Treasury of the

Ontario Government. From 1978

to 1985, he was Senior Scientist

in the Ontario Cancer Institute,

Toronto, and taught at the Uni-

versity of Toronto. In 1985 he

moved to Montreal where he is

Associate Professor in the Department of Epidemiology, Biostatistics

and Occupational Health, McGill University. He has also been Senior Scientist at the Montreal Children's Hospital Research Institute, the Montreal Heart Institute and the St. Mary's Hospital Community Health Research Unit. His research interests include Statistical Learning, Data Mining and Statistical Modeling.

Yves Lechevallier joined INRIA in 1976, where he was engaged in the Clustering and Pattern Recognition project. Since 1988 he has been teaching Clustering, Neural Networks and Data Mining at the University of Paris-IX, CNAM and ENSAE. He specializes in Mathematical Statistics, Applied Statistics, Data Analysis and Classification. His current research interests include: (1) clustering algorithms (dynamic clustering methods, Kohonen maps, divisive clustering methods); (2) discrimination problems and decision tree methods, including building efficient neural networks from classification trees.

Manuel Castejon Limas received his engineering degree from the Universidad de Oviedo in 1999 and his Ph.D. degree from the Universidad de La Rioja in 2004. Since 2002 he has taught project management at the Universidad de Leon. His research is oriented towards the development of data analysis procedures that may aid project managers in their decision-making processes.

Ana Gonzalez Marcos received

her M.Sc. and Ph.D. degrees from

the University of La Rioja, Spain.

In 2003, she joined the University

of Leon, Spain, where she works

as a Lecturer in the Department of

Mechanical, Informatic and

Aerospace Engineering. Her

research interests include the

application of multivariate analy-

sis and artificial intelligence

techniques in order to improve the

quality of industrial processes.
