THEORETICAL ADVANCES
Hierarchical clustering of subpopulations with a dissimilarity based on the likelihood ratio statistic: application to clustering massive data sets

Antonio Ciampi · Yves Lechevallier · Manuel Castejon Limas · Ana Gonzalez Marcos

Received: 27 June 2006 / Accepted: 9 April 2007 / Published online: 30 October 2007
© Springer-Verlag London Limited 2007

A. Ciampi: Department of Epidemiology and Biostatistics, McGill University, Montreal, P.Q., Canada. e-mail: [email protected]
Y. Lechevallier: INRIA—Rocquencourt, 78153 Le Chesnay Cedex, France
M. C. Limas, A. G. Marcos: Department of Mechanical, Informatical and Aerospace Engineering, Universidad de Leon, 24007 Leon, Spain

Pattern Anal Applic (2008) 11:199–220. DOI 10.1007/s10044-007-0088-4
Abstract The problem of clustering subpopulations on
the basis of samples is considered within a statistical
framework: a distribution for the variables is assumed for
each subpopulation and the dissimilarity between any two
populations is defined as the likelihood ratio statistic which
compares the hypothesis that the two subpopulations differ
in the parameter of their distributions to the hypothesis that
they do not. A general algorithm for the construction of a
hierarchical classification is described which has the
important property of not having inversions in the dendro-
gram. The essential elements of the algorithm are specified
for the case of well-known distributions (normal, multi-
nomial and Poisson) and an outline of the general
parametric case is also discussed. Several applications are presented, the main one being a novel two-step approach to dealing with massive data.
After clustering the data in a reasonable number of ‘bins’ by
a fast algorithm such as k-Means, we apply a version of our
algorithm to the resulting bins. Multivariate normality for
the means calculated on each bin is assumed: this is justified
by the central limit theorem and the assumption that each
bin contains a large number of units, an assumption gen-
erally justified when dealing with truly massive data such as
currently found in modern data analysis. However, no
assumption is made about the data generating distribution.
Keywords Cluster analysis · Binned data · Dissimilarity · Likelihood ratio statistic · Dendrogram · Large data sets
1 Introduction
Both supervised [1–4] and unsupervised [5–9] classifica-
tion are active areas of research. This paper focuses on the latter, in particular on the development of a hierarchical clustering algorithm of special interest due to the various problems that it addresses.
Hierarchical clustering algorithms yield useful insights
into the structure of a data set by providing a global rep-
resentation of the pairwise dissimilarities amongst
observational units. Classically, a unit or singleton is rep-
resented by the recorded values of a set of features of
special interest, hence by a vector of numbers or characters.
The dissimilarity between two units is chosen to be a
function of the vectors representing the two units.
In modern research, however, the data analyst often
encounters situations in which observational units are
composite and of varying sizes, i.e., contain several sin-
gletons. For example, suppose we have data on families
and children within families; according to the specific
question at hand, we may wish to consider the family rather
than the child as the observational unit. As another
example, consider longitudinal data arising from measuring
a quantity at several points in time on a number of subjects.
The times of the measurements need not be the same for
different subjects; also, the number of measurements may
differ from one subject to another. Here, we may focus on
the individual at a particular point in time, or, perhaps more
naturally, on the individual’s time curve. Therefore, it is
useful to study clustering of observational units that consist
of groups of simpler units. We will limit ourselves to the
case in which each (composite) observational unit consists
of a random sample from a subpopulation. Moreover we
assume that a unique probability distribution is associated
with each subpopulation, which generates the feature
vectors of its simple units. Thus we are led to define
dissimilarities between composite units in terms of dis-
similarities between probability distributions.
There is nothing profoundly new in the idea of cluster-
ing subpopulations. The idea has been proposed (or
rediscovered) more or less directly in a variety of areas.
One example is the problem of multiple comparisons in
one-way ANOVA, which is addressed by constructing a
dendrogram (hierarchical classification tree) in which the
initial units are the sets of subjects at each level of the class
variable [10–12]. Another important area in which the
general idea is currently used is model-based clustering
[13–17]. Here, one constructs a hierarchy of subpopula-
tions based on the likelihood ratio dissimilarity assuming
that the data are generated by a mixture of distributions
(usually multivariate normal). Interestingly, this construc-
tion known as hierarchical model-based clustering is
justified as a way to solve the classification maximum
likelihood problem, that is the maximization not of a
mixture likelihood, but of a likelihood in which one of the
parameters is the label of the distribution of each subject.
Indeed, in this context, the process of constructing such a hierarchy is seen as a preliminary exploration of the data,
which guides the eventual maximization of the mixture
likelihood by a version of the EM algorithm [14, 18, 19]. A
third modern problem in which clustering of subpopula-
tions plays a role is clustering of massive data sets [20, 21].
Indeed, one of the most common approaches to this prob-
lem consists of preprocessing the data so as to reduce the
number of initial simple units to a manageable number of
composite units (preprocessing or preclustering). Then, a
hierarchical model-based clustering algorithm (or other
forms of clustering) can be applied to the composite units.
In this paper, we attempt to bring a unifying perspective to the study of clustering algorithms for composite observational units, which we refer to as atoms, and treat as samples from subpopulations of a given population. Our central tool is a well-known dissimilarity between the distributions of two subpopulations, defined as a likelihood ratio statistic (LRS). Based on this dissimilarity we outline a general approach to hierarchical agglomerative clustering, and we show that a number of interesting problems can be addressed by specifying the general algorithm.
The general framework is outlined in the first part of
Sect. 2, which defines the LRS dissimilarity and outlines a
general algorithm of hierarchical agglomerative clustering
(HAC). A HAC algorithm is essentially specified by a
dissimilarity measure between pairs of units and an
agglomeration rule which defines the dissimilarity between
two aggregates of the original subpopulations. It proceeds
by successively agglomerating, starting from the atoms, the
pair of aggregates with the smallest dissimilarity. Perhaps
the most natural way to extend the dissimilarity from
populations to aggregates is to define it again as a LRS.
However, as we shall see, classical cluster analysis sug-
gests a variety of other possibilities. In spite of the limited
novelty of this first part, we thought it useful to point out
some of the properties of the likelihood-based dissimilarity
and of the associated general approach to hierarchical
clustering of subpopulations.
In Sects. 2.1–2.4 we develop dissimilarity measures for
clustering subpopulations with some familiar distributions
of the exponential family (multivariate normal, Poisson,
multinomial). Developing these formulas is, again, little
more than an exercise in mathematical statistics; however
the use we make of them in later sections shows that they
can help solve interesting problems such as clustering of
rows or columns of a contingency table. A proposal, based
on asymptotic large sample theory, for adapting the general
algorithm to any distribution satisfying appropriate regu-
larity conditions, is outlined in Sect. 2.5; we believe this to
be new. In Sect. 2.6, which also contains new material, we
discuss the relationship of our framework to the simulta-
neous testing procedure (STP) approach to multiple
comparisons, such as developed by Calinski and Corsten
[10], see also [11, 12]. We conclude by outlining a general
proposal for generating simultaneous test procedures
(STPs) within a broad class of problems, such as MANOVA and the generalization of ANOVA to non-normal
distributions.
Section 3 proposes a new algorithm for the analysis of massive data sets. As discussed at the beginning of
the section, our algorithm bears strong similarity to others
[14, 17]. However, unlike others, it does not assume nor-
mality (conditional on knowing the class) for the original
data. On the other hand, our algorithm starts from pre-
processed data on composite units, assuming that they
satisfy some regularity assumptions that justify the appli-
cation of the central limit theorem.
The next section (Sect. 4) is devoted to a limited eval-
uation of the proposed methodology. This is done using
data artificially generated from statistical models with
known clustering structures. In particular: in Sect. 4.1, we
compared our proposal for handling large data sets with an
approach considered state-of-the-art [14, 17]; in Sect. 4.2
we retrieved several aspects of a large multilevel data set,
which was artificially generated to imitate the structure of
the real data set analyzed in Sect. 5, and performed a
limited sensitivity analysis to assess the dependence of the results on the degree of separation of the clusters and on the number of 'bins' used in the preprocessing; in Sect. 4.3, we
demonstrated how by changing the agglomeration rule of
our basic algorithm, we can retrieve from a large data set a
notoriously difficult-to-retrieve clustering structure.
The real data set which has motivated this work is
analyzed in Sect. 5 as an example of the potential of our
methodology.
A brief discussion summarizes the work in Sect. 6 and, finally, the proofs of the propositions and formulas developed in this work are given in Sect. 7.
2 An algorithm for hierarchical clustering of subpopulations
This section is devoted to the general framework (2.1); to
the development of explicit formulas for the LRS dissim-
ilarity in some special cases involving well-known
distributions (2.2–2.4); to a general proposal to handle
more complex distributions (2.5); and, finally, to a dis-
cussion of the relationship between our framework and
STPs, concluding with the formulation of a general pro-
posal for developing STPs (2.6).
We suppose that we have data from M subpopulations of
a given population: P1, P2, ... , PM. Initially we assume that
each data set Dr, r = 1,...,M, is an independent random
sample from the corresponding subpopulation and that the
distribution of each population is known except for an
unknown parameter. Thus each subpopulation is repre-
sented by its distribution: $f_1(x \mid \theta_1), f_2(x \mid \theta_2), \ldots, f_M(x \mid \theta_M)$.
2.1 The general algorithm
Suppose that we wish to cluster the subpopulations using
the samples to estimate the unknown parameters. We
define here a dissimilarity from which we will develop a
general HAC algorithm. We want to express the idea that
two subpopulations are similar to the extent that they
cannot be easily distinguished statistically. Thus it seems
reasonable to define the dissimilarity as the LRS for testing the hypothesis H0 that two subpopulations have the same parameter against the hypothesis H1 that the two parameters are different:

$$d(P_i, P_j) = 2 \ln \frac{\hat{L}_1}{\hat{L}_0} \qquad (1)$$

where $\hat{L}_1$ and $\hat{L}_0$ denote the likelihood functions maximized under H1 and H0, respectively.
As in all hierarchical clustering algorithms, we begin by
joining the two subpopulations with minimum
dissimilarity, thus representing the merged pair by a
unique parameter, estimated from the joint samples. Then
we need to recalculate the dissimilarity between the
merged pair and all other subpopulations. This
necessitates the specification of an agglomeration rule. In
this paper we use, except where otherwise specified, the model-based agglomeration rule, which defines the dissimilarity between two aggregates $A_i$ and $A_j$ as a LRS:

$$d(A_i, A_j) = 2 \ln \frac{\hat{L}_1}{\hat{L}_0} \qquad (2)$$
The above definitions and description clearly constitute
a specification of the typical HAC algorithm (see
Algorithm 1), with dissimilarity given by Eq. 1 and
agglomeration rule given by Eq. 2 or variants thereof.
To go from the general algorithm to a specific one, we
have to develop detailed formulas for calculating the dis-
similarity between atoms and the dissimilarity between
aggregates (agglomeration rule).
As stated in the introduction, HAC algorithms provide a
global representation of the pairwise dissimilarities
amongst atoms. A tree or dendrogram is a powerful rep-
resentation of the results of an HAC algorithm. The
aggregate obtained at each step is represented by a node of
the dendrogram drawn at a height proportional to an
appropriately defined index. For the model-based
agglomeration rule (Eq. 2) the index is calculated as:
$$f(P_i) = 0 \qquad (3)$$

$$f(A_i \cup A_j) = f(A_i) + f(A_j) + d(A_i, A_j) \qquad (4)$$
It has the following important property (the proofs of this and the other propositions of this paper are given in Sect. 7):

Proposition 1

$$f(A_i \cup A_j) \ge \max\{f(A_i),\, f(A_j)\} \qquad (5)$$

A corollary of this proposition is that the hierarchy has no inversions.
Algorithm 1 General HAC algorithm
1: Specify the dissimilarity and the agglomeration rule.
2: Start from {P_1, P_2, ..., P_M}.
3: Calculate d(A_i, A_j) for all pairs of aggregates and merge the pair with the smallest dissimilarity.
4: Substitute the two merged subpopulations by their union.
5: Repeat from step 3 until all subpopulations have been merged.

Remark 1 As already mentioned, there are alternative definitions of the agglomeration rule. We could define the dissimilarity of two aggregates $A_i$ and $A_j$ as the smallest dissimilarity between their members, i.e., the smallest of the LRSs comparing one subpopulation from $A_i$ with one subpopulation from $A_j$. This would be single-link clustering based on the original dissimilarity matrix. Clearly, one can also develop complete-link and average-link versions of the general algorithm. Indeed, one could use any clustering algorithm based on a dissimilarity matrix [22]. The interest of this depends on the problem at hand. Here we wish only to point out that each method of clustering constitutes an additional tool for exploring a data set of complex structure. We will give one example of a situation in which we have used single-link clustering of subpopulations (see Sect. 4.3).
Remark 2 Under regularity conditions and for large samples, it is well known that the Wald and score statistics
are asymptotically equivalent to the LRS. An alternative
definition of the dissimilarity of Eq. 1 could then be one of
these statistics.
When developing a specific algorithm, the main task is
to calculate the maximum likelihood estimates of the
parameters and the corresponding maximized likelihoods
under specific hypotheses. This can be done in closed form
for some important cases, in particular when the underlying
distribution belongs to the exponential family [23–25]. We
do not give here the general formulas, but limit ourselves to
the most frequently used distributions of this family, the
normal, the Poisson and the multinomial distributions.
2.2 Clustering multivariate normal subpopulations
We will suppose here that $P_r$ has distribution $N(\mu_r, \Sigma_r)$. From standard normal theory we can write:

$$d_N(A_i, A_j) = D_{A_i \cup A_j} - D_{A_i} - D_{A_j} \qquad (6)$$

where

$$D_{A_i} = \sum_{k \in A_i} (x_k - \hat{\mu}_{A_i})^T \hat{\Sigma}_{A_i}^{-1} (x_k - \hat{\mu}_{A_i}) + n_{A_i} \ln |\hat{\Sigma}_{A_i}| \qquad (7)$$

$$D_{A_j} = \sum_{k \in A_j} (x_k - \hat{\mu}_{A_j})^T \hat{\Sigma}_{A_j}^{-1} (x_k - \hat{\mu}_{A_j}) + n_{A_j} \ln |\hat{\Sigma}_{A_j}| \qquad (8)$$

$$D_{A_i \cup A_j} = \sum_{k \in A_i \cup A_j} (x_k - \hat{\mu}_{A_i \cup A_j})^T \hat{\Sigma}_{A_i \cup A_j}^{-1} (x_k - \hat{\mu}_{A_i \cup A_j}) + (n_{A_i} + n_{A_j}) \ln |\hat{\Sigma}_{A_i \cup A_j}| \qquad (9)$$

Here $\hat{\mu}_{A_r}, \hat{\Sigma}_{A_r}$ (r = i, j) and $\hat{\mu}_{A_i \cup A_j}, \hat{\Sigma}_{A_i \cup A_j}$ are the maximum likelihood estimators of $\mu_{A_r}, \Sigma_{A_r}$ (r = i, j) and $\mu_{A_i \cup A_j}, \Sigma_{A_i \cup A_j}$, respectively; and $n_{A_r}$ is the cardinality of $A_r$.
This is sufficient to completely define the algorithm for our
case. Notice that [16] has developed efficient approaches to
the computation of these formulas in the context of
maximum likelihood hierarchical clustering. Also, the clustering algorithms of Bradley and Fayyad [26] and Bradley et al. [27], though not hierarchical, use similar formulas to achieve their efficiency.
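A minimal numerical sketch of Eqs. 6–9, assuming each aggregate is stored as an n × p sample matrix (the function names are ours, not the paper's):

```python
import numpy as np

def deviance(X):
    """Mahalanobis term plus n*log|Sigma_hat| for one aggregate (Eqs. 7-9)."""
    n = len(X)
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)      # ML estimate (divisor n)
    centred = X - mu
    quad = np.einsum('ij,jk,ik->', centred, np.linalg.inv(S), centred)
    return quad + n * np.log(np.linalg.det(S))

def d_normal(X_i, X_j):
    """LRS dissimilarity between two multivariate normal aggregates (Eq. 6)."""
    return deviance(np.vstack([X_i, X_j])) - deviance(X_i) - deviance(X_j)
```

Note that the Mahalanobis term equals n·p identically when the ML covariance estimate is plugged in, so the dissimilarity is effectively driven by the log-determinant terms.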
2.3 Clustering Poisson subpopulations

We consider now subpopulations $P_r$, r = 1, ..., M, with Poisson distribution $P(\mu_r, N_r)$, where $N_r$ is known for each $P_r$. Direct likelihood maximization yields:

$$d_P(A_i, A_j) = n_i \ln \hat{\mu}_i + n_j \ln \hat{\mu}_j - (n_i + n_j) \ln \hat{\mu}_{i \cup j} \qquad (10)$$

where

$$\hat{\mu}_i = \frac{n_i}{N_i} \qquad (11)$$

$$\hat{\mu}_j = \frac{n_j}{N_j} \qquad (12)$$

$$\hat{\mu}_{i \cup j} = \frac{n_i + n_j}{N_i + N_j} \qquad (13)$$

and $n_r$ represents the sampled value for the r-th cell (r = i, j).
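A direct transcription of Eqs. 10–13 (Eq. 10 as printed omits the factor 2 of Eq. 1; a constant factor does not change the order of the merges):

```python
import numpy as np

def d_poisson(n_i, N_i, n_j, N_j):
    """LRS dissimilarity between two Poisson cells (Eqs. 10-13).

    n_r: observed count and N_r: known exposure of cell r.
    """
    mu_i, mu_j = n_i / N_i, n_j / N_j                 # Eqs. 11-12
    mu_union = (n_i + n_j) / (N_i + N_j)              # Eq. 13
    return (n_i * np.log(mu_i) + n_j * np.log(mu_j)
            - (n_i + n_j) * np.log(mu_union))         # Eq. 10
```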
2.4 Clustering multinomial subpopulations

Assuming the multinomial distribution $M(p_r, N_r)$, where $N_r$ is known for each $P_r$, we obtain:

$$d_M(A_i, A_j) = \sum_{l=1}^{p} \left( n_{il} \ln \hat{p}_{il} + n_{jl} \ln \hat{p}_{jl} - (n_{il} + n_{jl}) \ln \hat{p}_{i \cup j, l} \right) \qquad (14)$$

where $n_{rl}$ represents the l-th component of the r-th sample (r = i, j), and $\hat{p}_{i \cup j, l} = \frac{n_{il} + n_{jl}}{N_i + N_j}$, $\hat{p}_{gl} = \frac{n_{gl}}{N_g}$ are the components of the maximum likelihood estimators of the $p_r$ parameters (r = i, j, i ∪ j). This case is interesting, for instance, when we wish to cluster rows or columns of a contingency table, as we will illustrate in Sects. 4.2 and 5.
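For instance, Eq. 14 applied to two rows of a contingency table might be coded as follows (a sketch; the zero-count convention 0·log 0 = 0 is ours):

```python
import numpy as np

def d_multinomial(n_i, n_j):
    """LRS dissimilarity between two multinomial samples (Eq. 14).

    n_i, n_j: count vectors, e.g., two rows of a contingency table.
    """
    n_i, n_j = np.asarray(n_i, float), np.asarray(n_j, float)
    N_i, N_j = n_i.sum(), n_j.sum()
    n_u = n_i + n_j
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = (np.where(n_i > 0, n_i * np.log(n_i / N_i), 0.0)
                 + np.where(n_j > 0, n_j * np.log(n_j / N_j), 0.0)
                 - np.where(n_u > 0, n_u * np.log(n_u / (N_i + N_j)), 0.0))
    return float(terms.sum())
```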
2.5 Other distributions
In most cases the distributions representing the subpopulations do not permit closed-form estimates, and therefore the calculation of the dissimilarity may be computationally demanding. One general simplification may be introduced by
relying on large sample theory. Suppose we first estimate
the parameters for each of the initial subpopulations. Then
we can use the fact that these estimates are asymptotically
normal with mean equal to the parameter itself and vari-
ance–covariance matrix given by the inverse of the
information matrix. The latter is in general a function of the
unknown parameter. However, we can, as a first approxi-
mation, substitute the maximum likelihood estimates of the
parameters in the Hessian of the likelihood (observed
information matrix) and proceed as though the actual
variances were known. Then at each stage we only have to
estimate the mean of the distribution of the parameters;
these estimates can also be obtained in an approximate way
as functions of the estimates of the initial population
parameters and their variance–covariance matrices, treated
as known. We do not give details here, but point out that
there are similarities with what we will develop in the next
section.
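As a sketch of the resulting approximation, suppose each atom has been fitted separately, yielding an estimate θ̂_r and an observed information matrix I_r; treating θ̂_r as normal with known variance I_r^{-1} reduces the dissimilarity between two atoms to the same form as Eq. 29 of Sect. 3.3 (all names are illustrative):

```python
import numpy as np

def d_asymptotic(theta_i, info_i, theta_j, info_j):
    """Approximate LRS dissimilarity between two fitted atoms.

    theta_r: ML estimate for atom r; info_r: observed information matrix,
    so that Var(theta_r) ~ inv(info_r) is treated as known.
    """
    diff = theta_i - theta_j
    V = np.linalg.inv(info_i) + np.linalg.inv(info_j)   # sum of variances
    return float(diff @ np.linalg.solve(V, diff))
```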
2.6 Relationship with STPs and a proposal for developing STPs
As mentioned in the introduction, the problem of mul-
tiple comparisons in classical ANOVA was one of the
motivations for considering clustering subpopulations, in
particular to avoid logical inconsistencies. In ANOVA one is interested in studying how a variable of interest y (response to treatment) varies across a number of subpopulations (treatment groups). One starts by testing, at a fixed level of Type I error α, the overall homogeneity hypothesis that the expectation of y does not vary across the given set of subpopulations. But in most practical cases, one also wants to test whether various subgroups of treatments can be considered homogeneous at the same level α, and this is notoriously difficult;
furthermore, there are several ways of formulating the
problem, depending on the type of questions that the
investigator poses. The problem was formulated as fol-
lows by Calinski and Corsten [10]: ‘‘the question is
raised how to produce not an enumeration of all possible
homogeneous subsets, but a partition of the sample
means into non-overlapping subsets such that each sub-
set (or subset of that) may be considered internally
homogeneous. For obtaining such a partition, clustering
methods may be invoked’’. This formulation of the
problem of multiple comparisons in ANOVA leads
directly to the concept of STP. An STP is a procedure
for deciding whether to simultaneously accept or reject
all the hypotheses of a given family. It should possess
the following properties:
(a) For a fixed α, the STP rejects at least one true implied null hypothesis with probability at most α, and exactly α if the overall null hypothesis is true.
(b) Acceptance of any implied null hypothesis implies
acceptance of all hypotheses implied by it
(monotonicity).
(c) Homogeneity of any group will be rejected only if the
homogeneity of at least a pair of its members is
rejected (strict monotonicity).
A clustering procedure is clearly a good start for an STP. For instance, cutting a dendrogram at a certain height h_0 (a specified value of the hierarchy index) can be seen as declaring all groupings which result from the cut as 'internally homogeneous' for the level of detail h_0. Such a procedure enjoys properties (b) and (c) above. To turn it into an STP, we would have to be able to attach to h_0 a statement about the probability of rejecting at least one true homogeneity hypothesis, as in property (a)
above. Calinski and Corsten [10] achieved this by basing
the construction of their classification schemes on two
criteria of homogeneity: the studentized range and the F-
test statistic. Since they assumed equal group sizes as
well as homoscedasticity, the distributions of both these
criteria turned out to be well known. They recognized
that when using the studentized range, their construction
was an agglomerative hierarchical algorithm with dis-
similarity given by the ordinary Euclidean distance and
complete link as agglomeration rule. Clearly, their
distance is equivalent to our LRS dissimilarity, but
indeed they used an alternative definition of agglomer-
ation rule (see Remark 1). The key point in their STP is to cut the tree at the height corresponding to the α-critical level for testing the overall homogeneity hypothesis, and to declare all groups (and all groups of groups) generated by the cut as homogeneous at the level α.
With this definition, the properties (a)–(c) above are
verified. Notice that if the overall homogeneity hypothesis is not rejected, then the tree cannot be cut, and all nodes of the hierarchy, as well as groups of nodes, can be considered homogeneous.
From our perspective, the algorithms proposed in [10]
can be generalized as follows. Assume that we know the
distribution of the statistic that tests the overall homoge-
neity hypothesis for a parameter of interest. Choose an
HAC procedure with index h such that for any node A of
the resulting tree, h(A) is a monotonic function of the
statistic that tests the homogeneity of A. Construct a hier-
archy with this method and cut the resulting tree at a height
which corresponds to the critical value of the statistic
which tests the overall homogeneity hypothesis. If the latter is accepted, the tree cannot be cut; however, if the tree can be cut at height h_α, then the sets of the resulting partition can be declared homogeneous, and similarly for their subsets. If
the distribution of the test statistic of the overall hypothesis
is not known in closed form, a permutation test can be used
to estimate it [28].
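A hedged sketch of such a permutation estimate of the α-critical cut height: pool all observations, reshuffle them into groups of the original sizes, and record the root height of the hierarchy built on each reshuffled data set. Here build_tree_height stands for whatever routine constructs the hierarchy of Sect. 2.1 and returns the height of its root; all names are illustrative.

```python
import numpy as np

def critical_height(pooled, sizes, build_tree_height,
                    n_perm=999, alpha=0.05, seed=0):
    """Estimate the alpha-critical cut height by a permutation test."""
    rng = np.random.default_rng(seed)
    heights = []
    for _ in range(n_perm):
        perm = rng.permutation(pooled)                  # break group structure
        groups = np.split(perm, np.cumsum(sizes)[:-1])  # original group sizes
        heights.append(build_tree_height(groups))
    # Cutting below this height declares the resulting groups homogeneous.
    return np.quantile(heights, 1 - alpha)
```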
3 Clustering massive data sets
This section is devoted to the main application of our
framework: hierarchical clustering of massive data sets.
After a brief review of the area in Sect. 3.1, we outline our
approach in Sect. 3.2; we give detailed formulas to define
an appropriate HAC algorithm in Sect. 3.3; and discuss in
Sect. 3.4 some attempts to evaluate our work and compare
it with a comparable method. Our approach consists of a
preprocessing step to obtain a manageable number of bins
containing the original data, followed by a specific version
of the general algorithm of Sect. 2.1. The distinguishing
feature of our approach is that we do not assume that the
distribution generating the data is known. However,
assumptions about the results of the preprocessing allow us
to apply the central limit theorem and use the multivariate
normal distribution.
3.1 Clustering massive data sets: a brief review
As the size of available data bases increases, eventually
any algorithm will run into a data set that requires more
time or memory than the analyst can afford. This problem
has plagued developers of clustering algorithms for quite
some time and several approaches to address it have been
developed. One can broadly classify these approaches into
three classes, which we shall refer to as: the direct, the
sampling and the two-step approaches.
Researchers adopting the direct approach attempt to
redesign traditional algorithms with the aim to reduce their
complexity [27, 29–36]. Within the context of model-based
clustering, accelerations of the straightforward EM algo-
rithm have been proposed by [16, 19]. These are based on
the development of formulas that are valid, in principle, if
the distribution generating the data is multivariate normal.
While these authors do consider hierarchical agglomerative
versions of their model-based algorithms, they have not
developed graphical techniques to fully exploit the
advantages of hierarchical clustering. Also, the normality
assumption may be a drawback of these otherwise
remarkable efforts. The main problem, however, is that these improved algorithms still fail to handle, within reasonable time limits, data sets larger than, say, 100,000 samples and 10 variables (our experimentation). For such sizes,
the only successful approaches to reduce computational
burden are essentially non-hierarchical developments of
the classic k-Means algorithm [37]. For instance the scal-
able EM algorithm of Bradley et al. [27] requires only a
single scan of the data. A feature common to all these
algorithms is that they require a priori specification of the
number of clusters [38]. The approach of MCLUST [17]
includes a suggestion for determining the number of clus-
ters, but it is based on the knowledge of the distribution
generating the data. In general these algorithms lack the
graphical power of hierarchical algorithms, which often
may suggest how many clusters one should look for by
simple inspection of the dendrogram.
The sampling approach is based on a very simple idea:
choose a sample of manageable size, apply to it a conve-
nient clustering algorithm and then assign the rest of the
original data set to one of the clusters by an appropriate
rule. This is the path followed by Banfield and Raftery [14]
in their model-based approach to clustering large data sets;
they use the classification likelihood and Bayes’ rule to
assign the rest of the data set to one of the clusters. This has
been criticized [39, 40] for overlooking the problem of
small but important clusters. These may be underrepre-
sented or not represented at all in the sample, and therefore
easily missed. Also, even if all the important clusters are
represented, the sample-based solution may be very dif-
ferent from the solution based on the whole data set. An
extension of sampling termed fractionation was proposed
by Douglass et al. [41]; it successfully reduces calcula-
tions. More recently, fractionation has been extended and
improved by Tantrum et al. [42] who propose, with very
promising results, a hierarchical approach termed
refractionation.
Both the direct and the sampling approaches eventually
run into size limitations. In principle, the two-step class of
methods is very general and open-ended. The idea is to first
reduce the data to a manageable number of bins by some
form of inexpensive preprocessing. Then, in a second step,
one has to choose a reasonable way of representing these
bins and devise an algorithm to cluster them. Therefore, the
burden of the calculations may be shifted in part to the
preprocessing step. Clearly the main problem at this stage
is to decide on how to preprocess the data, because this
choice defines the level of detail below which we lose information.
of preprocessing depends more on the problem at hand than
on general methodological considerations. Several
approaches to preprocessing have been proposed. Probably
the most common one in daily practice is, as suggested by
the SAS manual [43], to apply to the data some fast version
of k-Means so as to obtain a few hundred preclusters. There is also an implicit preprocessing in the
older version of MCLUST. This version only allowed
singletons as inputs, hence no explicit preprocessing.
However, in the early steps of the agglomeration algorithm,
the unconstrained multivariate normal was replaced by
multivariate normal distributions with variance–covariance
matrices constrained to a very simple form, and these restrictions were removed only after the agglomerates reached a reasonable size. The gain in the size of data sets that can be treated by this approach was, unfortunately, relatively modest.
While there are numerous attempts to develop new
methods of preprocessing [44–48], not much new has been
proposed for the second step of the two-step approach. For
the second step, the SAS manual suggests that the bins be
represented simply by the average value of the variables, so
that any clustering algorithm can be applied to them. A
much better choice is that of Banfield and Raftery [14], who suggest considering the bins as samples from subpopulations, assumed to be well represented by a multivariate normal distribution, and applying to them their
hierarchical model-based clustering algorithm. The current
version of MCLUST explicitly allows the user to initialize
the algorithm with preclusters (our bins). Arguably, the
model-based HAC algorithm of Banfield and Raftery, as implemented in MCLUST and with appropriate preprocessing, is the state of the art in the area of hierarchical
clustering of massive data sets. Its main drawbacks are the
assumption of normality and the fact that in spite of the
very remarkable improvements in speed [16], the algorithm
may still be too slow in some cases.
Our approach, explained in what follows, attempts to
improve on these two aspects of the second step. It does not
attempt to improve on the preprocessing step.
3.2 The basic setup
As usual we start from a data matrix D = [x1 |x2| ... |xp ],
which represents the values that p continuous random
variables (columns) take on N subjects (rows). We denote
by x the vector random variable that is assumed to generate
D, and by F its distribution. We consider a situation in
which the N subjects have been grouped into M ‘bins’, B1,
B2, ..., BM, of size n1, n2, ..., nM. We assume M large and
both N and ni, i = 1,...,M, very large. We assume also that
in each bin, xr, the restriction of x to Br, has an unknown
but fixed distribution Fr. We do not assume Fr normal, but
we assume regularity conditions such that the central limit
theorem holds for the sample mean of xr taken over a large,
independent random sample with location parameter lr and
variance–covariance matrix Rr. We assume the variance–
covariance matrices known: in practice we will use the bin
sample variance–covariance matrix Sr to estimate Rr. It
follows from the above assumptions that the sample mean
over bin r may be considered approximately multivariate
normal:
$$\bar{x}_r = \frac{1}{n_r} \sum_{x_i \in B_r} x_i \;\to\; N(\mu_r, V_r) \qquad (15)$$

where $V_r = \frac{1}{n_r}\,\Sigma_r$, for r = 1, ..., M.
Therefore a bin is represented by the distribution of its
mean vector, asymptotically normal by the central limit
theorem.
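In practice, the preprocessing step and the bin summaries of Eq. 15 might be computed as follows (a sketch using scikit-learn's k-Means; the function name is ours):

```python
import numpy as np
from sklearn.cluster import KMeans

def summarize_bins(D, n_bins=100, seed=0):
    """Bin the N x p data matrix D with k-Means and represent each bin
    by its mean and V_r = S_r / n_r (Eq. 15)."""
    labels = KMeans(n_clusters=n_bins, n_init=10,
                    random_state=seed).fit_predict(D)
    means, variances = [], []
    for r in range(n_bins):
        X = D[labels == r]
        means.append(X.mean(axis=0))
        variances.append(np.cov(X, rowvar=False) / len(X))  # S_r / n_r
    return np.array(means), np.array(variances)
```

After this single pass over the data, the original N rows are no longer needed: the second step works entirely on the M pairs (x̄_r, V_r).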
The second step of our proposal is a special case of the
general algorithm. To define the special case, we only need
to work out formulas for the dissimilarities between atoms
and between nodes.
During the analysis process, pairs of bins are succes-
sively merged to form new groups at a higher level, thus
creating a hierarchy. Let $A_r = \{B_s : s \in a_r\}$, with $a_r \subset \{1, \ldots, M\}$ for $r = i, j$ and $a_i \cap a_j = \emptyset$. The nodes of this hierarchy will be represented again by the distribution of their mean vectors.
3.3 Dissimilarity formulas
We define first the dissimilarity between two bins and then
extend the definition to the dissimilarity between any two
nodes of the hierarchical classification. Since we are rep-
resenting any two bins Bs and Bt by the distribution of their
mean, it is reasonable to define the dissimilarity between
them as the LRS for testing the hypothesis H0 that the two
bins have the same mean vector against the hypothesis H1
that their mean vectors differ:
$$d_{s,t} = 2 \ln \frac{\hat{L}_1}{\hat{L}_0} \qquad (16)$$

where $\hat{L}_1$ and $\hat{L}_0$ represent the maximized likelihood functions under the two hypotheses. Now, the likelihood functions $L_1$ and $L_0$ are:

$$L_1 = N(\bar{x}_s \mid \mu_s, V_s)\, N(\bar{x}_t \mid \mu_t, V_t) \qquad (17)$$

$$L_0 = N(\bar{x}_s \mid \mu_{s \cup t}, V_s)\, N(\bar{x}_t \mid \mu_{s \cup t}, V_t) \qquad (18)$$

or

$$L_1 = \frac{e^{-\frac{1}{2}(\bar{x}_s - \mu_s)^T V_s^{-1} (\bar{x}_s - \mu_s)}}{(2\pi)^{p/2} |V_s|^{1/2}} \cdot \frac{e^{-\frac{1}{2}(\bar{x}_t - \mu_t)^T V_t^{-1} (\bar{x}_t - \mu_t)}}{(2\pi)^{p/2} |V_t|^{1/2}} \qquad (19)$$

$$L_0 = \frac{e^{-\frac{1}{2}(\bar{x}_s - \mu_{s \cup t})^T V_s^{-1} (\bar{x}_s - \mu_{s \cup t})}}{(2\pi)^{p/2} |V_s|^{1/2}} \cdot \frac{e^{-\frac{1}{2}(\bar{x}_t - \mu_{s \cup t})^T V_t^{-1} (\bar{x}_t - \mu_{s \cup t})}}{(2\pi)^{p/2} |V_t|^{1/2}} \qquad (20)$$
Standard algebraic calculations yield:
$$d_{s,t} = (\bar{x}_s - \bar{x}_{s \cup t})^T V_s^{-1} (\bar{x}_s - \bar{x}_{s \cup t}) + (\bar{x}_t - \bar{x}_{s \cup t})^T V_t^{-1} (\bar{x}_t - \bar{x}_{s \cup t}) \qquad (21)$$

where:

$$\bar{x}_{s \cup t} = \left(V_s^{-1} + V_t^{-1}\right)^{-1} \left(V_s^{-1} \bar{x}_s + V_t^{-1} \bar{x}_t\right) \qquad (22)$$
We obtain $V_{s \cup t}$, the variance–covariance of $\bar{x}_{s \cup t}$, from the general property:

$$\mathrm{Var}(w_1 x_1 + w_2 x_2) = w_1 \mathrm{Var}(x_1) w_1^T + w_2 \mathrm{Var}(x_2) w_2^T \qquad (23)$$

Thus:

$$V_{s \cup t} = \mathrm{Var}\left[ \left(V_s^{-1} + V_t^{-1}\right)^{-1} V_s^{-1} \bar{x}_s + \left(V_s^{-1} + V_t^{-1}\right)^{-1} V_t^{-1} \bar{x}_t \right] \qquad (24)$$

whence

$$V_{s \cup t} = \left(V_s^{-1} + V_t^{-1}\right)^{-1} \qquad (25)$$
It is interesting to also formulate these results in terms of precision. We recall that the precision is the inverse of the variance–covariance matrix: $U_s = V_s^{-1}$ and $U_t = V_t^{-1}$. Then from (25) we can see that the precision of the mean of the combined bins is the sum of their respective precisions:

$$V_{s \cup t}^{-1} = V_s^{-1} + V_t^{-1} \iff U_{s \cup t} = U_s + U_t \qquad (26)$$

Equation 25 also allows us to re-write the mean of the combined bins as

$$\bar{x}_{s \cup t} = V_{s \cup t} \left(V_s^{-1} \bar{x}_s + V_t^{-1} \bar{x}_t\right) \qquad (27)$$

These results lead us to the simpler formula for the dissimilarity of the two bins:

$$d_{s,t} = (\bar{x}_s - \bar{x}_{s \cup t})^T V_s^{-1} (\bar{x}_s - \bar{x}_{s \cup t}) + (\bar{x}_t - \bar{x}_{s \cup t})^T V_t^{-1} (\bar{x}_t - \bar{x}_{s \cup t}) \qquad (28)$$
It is useful for computational purposes to further
simplify the formula for the dissimilarity:
Proposition 2 Two alternative expressions for $d_{s,t}$ are:

$$d_{s,t} = (\bar{x}_s - \bar{x}_t)^T (V_s + V_t)^{-1} (\bar{x}_s - \bar{x}_t) \qquad (29)$$

$$d_{s,t} = (\bar{x}_s - \bar{x}_t)^T V_s^{-1} V_{s \cup t} V_t^{-1} (\bar{x}_s - \bar{x}_t) \qquad (30)$$
We can now extend the definition of dissimilarity to any
two nodes Ai and Aj. Following the same approach, we
define the dissimilarity as the LRS for testing the
hypothesis H0 that groups Ai and Aj have the same mean
vector against the hypothesis H1 that their mean vectors
differ:
$$d_{i,j} = 2 \ln \frac{\hat{L}_1}{\hat{L}_0} \qquad (31)$$

where:

$$L_1 = \prod_{r \in a_i} N(\bar{x}_r \mid \mu_{A_i}, V_r) \prod_{r \in a_j} N(\bar{x}_r \mid \mu_{A_j}, V_r) \qquad (32)$$

$$L_0 = \prod_{r \in a_i \cup a_j} N(\bar{x}_r \mid \mu_{A_i \cup A_j}, V_r) \qquad (33)$$
Replacing the location parameters $\mu_{A_i}$, $\mu_{A_j}$ and $\mu_{A_i \cup A_j}$ by their maximum likelihood estimators $\bar{x}_{A_i}$, $\bar{x}_{A_j}$ and $\bar{x}_{A_i \cup A_j}$, respectively, we obtain now:

$$d_{i,j} = \sum_{r \in a_i \cup a_j} (\bar{x}_r - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i \cup A_j}) - \sum_{s = i,j} \sum_{r \in a_s} (\bar{x}_r - \bar{x}_{A_s})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_s}) \qquad (34)$$

with

$$\bar{x}_{A_i} = V_{A_i} \left( \sum_{r \in a_i} V_r^{-1} \bar{x}_r \right) \qquad (35)$$

$$V_{A_i}^{-1} = \sum_{r \in a_i} V_r^{-1} \qquad (36)$$

$$\bar{x}_{A_i \cup A_j} = V_{A_i \cup A_j} \sum_{s = i,j} V_{A_s}^{-1} \bar{x}_{A_s} \qquad (37)$$

$$V_{A_i \cup A_j}^{-1} = \sum_{r \in a_i \cup a_j} V_r^{-1} \qquad (38)$$
As above, the following proposition is useful to simplify the calculations.

Proposition 3 Two alternative expressions for $d_{i,j}$ are

$$d_{i,j} = (\bar{x}_{A_i} - \bar{x}_{A_j})^T (V_{A_i} + V_{A_j})^{-1} (\bar{x}_{A_i} - \bar{x}_{A_j}) \qquad (39)$$

$$d_{i,j} = (\bar{x}_{A_i} - \bar{x}_{A_j})^T V_{A_i}^{-1} \left(V_{A_i}^{-1} + V_{A_j}^{-1}\right)^{-1} V_{A_j}^{-1} (\bar{x}_{A_i} - \bar{x}_{A_j}) \qquad (40)$$

Notice that Proposition 3 refers to the dissimilarity between two aggregates, while Proposition 2 refers to the dissimilarity between two bins, so that the latter can be seen as a particular case of the former.
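In code, merging two nodes amounts to adding precisions and precision-weighting the means; a sketch of Eqs. 36–38:

```python
import numpy as np

def merge_nodes(xbar_i, V_i, xbar_j, V_j):
    """Merge two nodes: precisions add (Eqs. 26 and 38) and the merged mean
    is the precision-weighted average (Eqs. 27 and 37)."""
    U_i, U_j = np.linalg.inv(V_i), np.linalg.inv(V_j)
    V_union = np.linalg.inv(U_i + U_j)
    xbar_union = V_union @ (U_i @ xbar_i + U_j @ xbar_j)
    return xbar_union, V_union
```

Together with the pairwise formula of Eq. 39 (which has the same form as Eq. 29, with bins replaced by aggregates), this is all that the HAC loop of Sect. 2.1 needs.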
3.4 Evaluation and comparisons
As discussed in Sect. 3.1, our approach proposes an inno-
vation at the second step of a two-step technique.
We are arguing that our method is essentially explor-
atory and that the graphical representations obtained with it
are an important asset. One way to evaluate what we
propose is therefore to show how our dendrograms cor-
rectly retrieve the number of clusters in artificial data sets
where it is known.
As for comparisons with existing methods, it seems natural to compare the performance of our approach with the MCLUST approach [14, 17], considered state-of-the-art.
We expect an improvement in speed at least in an appro-
priate setting. Also, since we do not assume normality, it
would be reasonable to expect that if we work with data
that are increasingly not normal, our approach should give
better results than the state-of-the-art. To do this, we will
use the multivariate t distribution, systematically decreas-
ing the number of degrees of freedom (df), since this
distribution tends to a multivariate normal for large df but
has thicker tails compared to it for small df.
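A standard way to generate such data, and the one we would assume here, is to divide multivariate normal draws by the square root of an independent scaled chi-square variable:

```python
import numpy as np

def multivariate_t(mean, cov, df, n, rng):
    """Draw n samples from a multivariate t distribution: heavier tails for
    small df, approaching the multivariate normal as df grows."""
    z = rng.multivariate_normal(np.zeros(len(mean)), cov, size=n)
    w = rng.chisquare(df, size=n) / df
    return np.asarray(mean) + z / np.sqrt(w)[:, None]
```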
4 Clustering artificial data
This section studies aspects of the performance of our
methodology using artificial data with a known cluster
structure. We are interested in assessing how well we can
retrieve these structures. The purpose of the exercise is
threefold. First, we intend to evaluate the proposal descri-
bed in Sect. 3 for the second step of a two-step procedure,
by comparing its performance with what we consider state-
of-the art. This is done using the statistical model described
in Sect. 4.1. Second, we want to demonstrate the flexibility
of the idea of HAC for subpopulations, by applying it
systematically to the study of a data set that is not only
huge, but also complex, as it has a multilevel structure.
This is done in Sect. 4.2 using the statistical model
described therein. Third, we want to test the performance
of our approach in retrieving a cluster structure known to
be difficult to retrieve for most methods. This is done in
Sect. 4.3 using the well-known two-spiral model.
4.1 An evaluation of the HAC approach as the second step of our two-step procedure
In a first experiment, we generated four-dimensional data
from three distributions with mean and variance–covari-
ance matrices specified in Tables 2 and 3. These tables
refer to the next examples in which they are used within the
context of a more complex experiment. Using multivariate
t distributions with degrees of freedom ranging from 2 to
10, we generated one data set of 230,000 samples for each
value of df.
We analyzed each data set by two different two-step
procedures with the same preprocessing step: k-Means
clustering with k = 25. The 25 bins were then clustered
in the second step by two HAC algorithms: the one
described in Sect. 3 and the one by Banfield and Raftery
as implemented in MCLUST (R version). Figure 1 pre-
sents a summary of this experiment. For df = 2, 5 and 10,
we give three graphs. The first two from the left are
obtained by our procedure and show a dendrogram and
the hierarchical index (height) plotted against the step
number; the third graph represents the BIC as a function
of the number of steps, which is a generally accepted tool
for choosing the number of clusters. As mentioned above,
MCLUST does not have the facility to plot dendrograms
nor to extract the relevant information to do so. The
dendrograms show clearly a three-class structure in all
cases. The height and the BIC graphs are generally used
by looking at an elbow in the curve; the step number
corresponding to the elbow is taken as the best guess at
the number of clusters. The sharper the elbow, the
stronger is the suggestion given by the data. It can be
seen from Fig. 1 that, for high df, the BIC graphs point to
a three-cluster solution, but the picture becomes increas-
ingly uncertain as df decreases, which corresponds to
departures from normality and a certain overlap of the
three clusters. On the other hand, the elbow in the height
curve is clearly at 3, regardless of the df of the data-generating distribution. Furthermore, looking at
the dendrograms gives a global picture of the data, a
feature that is absent in the other approach. This confirms
our expectations. The state-of-the-art approach is based
on the normal assumption, and therefore it will perform
better if the data are normal and not as well when
departures from normality become important. By contrast,
our procedure does not assume normality, and therefore it
is insensitive to changes in the data-generating
distribution.
In a second experiment, we generated data from three
multivariate normal distributions with dimensions varying
from 5 to 25 and sample sizes of 30,000 and 300,000. As
above, we applied the two-step procedures under study,
with the same first step resulting in 25 bins, and measured
the computing time for each data set and each procedure.
The results are summarized in Fig. 2. The two panels of the
figure correspond to sample sizes of 30,000 and 300,000; in
each graph the x-axis represents the dimension of the
multivariate distributions, and the y-axis represents the
computing time of the analysis measured in seconds. Our
approach shows definite advantages for larger sample sizes.
Notice also that our procedure is not as sensitive to the
dimension of the data-generating distribution. This was to
be expected because our procedure does not pass through
the whole data set during the HAC algorithm, but only
during the preprocessing step. By contrast, the state-of-the-
art procedure requires passing through the data at each step
of the HAC, which penalizes its performance as the size of
the data set increases.
[Fig. 1 Results from the three-populations data set analysis (one row of panels for each of df = 2, 3, 4, 5, 10). The tree obtained by our approach is shown on the left, the height graph (tree height against number of clusters) in the center, and the BIC from the state-of-the-art clustering algorithm on the right.]
4.2 Performance of our HAC approach on large multilevel data sets
The data-generating model used here mimics one of the
real data sets studied below, which comes from a study on
nutrition. We imagine that we have obtained samples from
seven regions of three countries for a total of 230,000 study
subjects. In order to fix ideas, we call the countries France, Italy and Spain. France has three regions, viz.,
north, center and south, while Italy and Spain have two
regions each, north and south. The data are supposed to
describe the dietary habits of the study subjects. They are
generated according to the following model: there are three
dietary patterns common to all regions and countries: high
animal, Mediterranean and quasi-vegetarian. However, the proportions of the three patterns differ according to country and region, as shown in Table 1, which also indicates the
number of samples generated for each center. We imagine
that dietary patterns are described by four variables only,
viz., vegetables, meat, butter and oil. A pattern is charac-
terized by a multivariate normal distribution of the four
variables measuring daily intake (in grams). Tables 2 and 3
define means, variances and correlation matrices associated
with the three patterns. The two-stage clustering approach used here consists of applying an optimal partition algorithm, in this case k-Means, to the 230,000 × 4 data matrix, followed by our clustering method. We first report
the results obtained from a particular simulation. Next we
report the results of other simulations aimed to explore the
performance of the algorithm under different degrees of
cluster overlap. Clearly, small variances yield well-separated clusters and large variances overlapping clusters. Therefore, the exploration was performed by simulating data for several values of the variables' variances.
4.2.1 Analysis of a copy of the artificial data set
We generated 230,000 samples from the above model. In
this data set, used throughout this section, dietary patterns
are clearly separated. Applying the k-Means algorithm (k
= 100), we partitioned the original data into 100 bins. The
clear separation of the dietary patterns is reflected in the
empirical finding that each of the 100 bins contains sub-
jects from a unique dietary pattern. Next, we applied our
algorithm for binned data. In Fig. 3a we show the hierar-
chical tree obtained by analyzing each of the seven regions
separately; in Fig. 3b we show the hierarchical tree for
each of the three countries. Each tree grows from these
bins, represented in the bottom line. Successive combina-
tions are made to connect those bins that lie closer in the original space, i.e., are very similar according to the chosen dissimilarity measure. Such connections are represented in
the tree by segments that join the respective nodes. The
combinations take place at a height defined by the hierar-
chical index; since this index has been defined in
accordance with the dissimilarity, the height of each union
gives us information about the relative positions of the
nodes in the original space. We find in the trees of Figs. 3a and b that most of the combinations take place at low heights.
[Fig. 2 Computing time (seconds) for each data set and each procedure (square: hc from MCLUST; triangle: our approach); panels for 30,000 and 300,000 samples, x-axis: dimension of the data set.]
Table 1 Distribution of dietary patterns in countries and regions

Dietary pattern    Fr.—north  Fr.—center  Fr.—south  France   It.—north  It.—south  Italy    Sp.—north  Sp.—south  Spain
High animal        70%        15%         5%         30%      15%        5%         10%      20%        5%         10%
Mediterranean      10%        50%         90%        50%      50%        90%        70%      30%        90%        70%
Quasi vegetarian   20%        35%         5%         20%      35%        5%         20%      50%        5%         20%
Approx. samples    30,000     30,000      30,000     90,000   20,000     40,000     60,000   40,000     40,000     80,000
Each bin has been color-coded for this example: red for high animal, green for quasi-vegetarian, and light blue for Mediterranean; this makes it easy to see that the lowest steps of the dendrograms agglomerate bins of the same color, hence corresponding to the same pattern. Indeed, it
can be seen in the figures that every tree starts by correctly
grouping all bins of the same pattern together. Only when
this is accomplished do we find agglomerations of distinct
dietary patterns at higher heights according to the dissim-
ilarities amongst them. As expected, three clusters are
clearly visible in each tree, corresponding to the three
dietary patterns.
Next, we cut all trees so as to have three clusters.
Looking at the data regionally, we formed 21 subpopula-
tions from 7 regions and again applied our algorithm to
these to see if we could again obtain three dietary patterns;
indeed this is the case as shown in Fig. 4a, which also
shows that the three patterns were correctly recovered (see
color coding). Looking at the data by country, we formed
nine subpopulations, and again as shown in Fig. 4b, the
three dietary patterns were correctly retrieved.
An interesting alternative way of looking at the data
consists of taking each region as a subpopulation and
applying our clustering algorithm directly to these sub-
populations. The results are shown in Fig. 5. Figure 6
shows the results of our clustering algorithm as applied to
the subpopulations consisting of each country. The results
are transparent and confirm that the algorithm correctly
retrieves the simulated structure.
Until now, our algorithm has only made use of the multivariate normal model for binned data. We will now illustrate clustering with the multinomial model, trying to replicate the latest results. Thus, we apply again the k-Means
algorithm to our artificial data set to obtain 300 bins and
construct again a hierarchical tree. Again, not surprisingly,
we find three dietary patterns, which correctly retrieve our
structure (Fig. 7). Next, we cross the patterns with the
seven regions (and the three countries) and present the
frequency of occurrence of each dietary pattern in the
seven regions (and the three countries) in a two-way table
of counts (see Table 4). Now, the rows of this table can be
interpreted as samples from a multinomial model $M(p_r, N_r)$, where $N_r$ is the marginal total of the r-th row.
Clustering the rows of this table, at the region level, we
obtain the tree of Fig. 8. Similarly, clustering the rows,
now at the country level, we obtain the tree in Fig. 9. The
results (including data not shown here) obtained from the
two points of view do confirm that the data have been
generated from three basic dietary patterns, present in
different proportions in each region and country.
4.2.2 A limited sensitivity analysis
The dietary patterns of the artificial data analyzed above
happen to be clearly separated. This happens in most cases
when generating the data from the model with the same
parameters. A simple way to obtain patterns that are not
clearly separated is to change the model by multiplying the
variances of the four variables by a common scale factor s.
As s increases, points from different dietary patterns tend to
overlap. It can also be suspected that the number of bins
may have an influence on the ability of the algorithm to
retrieve the actual dietary patterns when these are not
clearly separated.
Table 2 Means (variance) in grams

Diet type         Vegetables  Meat      Butter   Oil
High animal       0 (4)       100 (25)  50 (25)  0 (4)
Mediterranean     50 (16)     50 (16)   0 (4)    50 (16)
Quasi vegetarian  90 (25)     10 (1)    20 (4)   30 (16)

Table 3 Correlation matrices

High animal:
$$\begin{pmatrix} 1 & 0.5 & -0.3 & -0.3 \\ & 1 & -0.5 & 0 \\ & & 1 & 0.3 \\ & & & 1 \end{pmatrix}$$

Mediterranean:
$$\begin{pmatrix} 1 & 0.5 & 0 & 0 \\ & 1 & 0 & 0 \\ & & 1 & 0.3 \\ & & & 1 \end{pmatrix}$$

Quasi vegetarian:
$$\begin{pmatrix} 1 & 0.5 & 0.5 & 0 \\ & 1 & 0 & 0 \\ & & 1 & 0 \\ & & & 1 \end{pmatrix}$$

To study the performance of our algorithm under varying conditions we have performed a limited sensitivity analysis, varying s and the number of bins n_B. In Fig. 11 we show hierarchical trees obtained from simulated data. The rows of the figure correspond to a value of s, and the columns to the number of bins from which the tree was constructed. We have reported the results for s = 1, 10, 25, 50, 75, 100, and n_B = 25, 50, 100, 150, 200. In each case of
Fowlkes and Mallows [49], FW index, to compare the
original dietary patterns with the clustering obtained by
cutting the tree so as to obtain three clusters. This index is
widely used in simulations to compare the cross-classifi-
cation of known and retrieved clusters [42, 50]. We use it
here not so much as a gold standard but as a pragmatic tool
for evaluation of our results. Overall the performance of
the algorithm is good: for low s, the FW index is close to 1
and is insensitive to the values of nB. As expected, the FW
index deteriorates as s increases, while the effect of the
number of bins seems to be subtler. In the worst case
considered, s = 100, the FW index is 0.6341 for nB = 150,
and seems to reach an optimum of 0.6740 for nB = 100.
Figure 10 shows the boxplots of the FW indexes obtained
running 100 simulations for each pair of values of s and nB.
On the other hand, it should be remarked that the shape of
the tree does not consistently suggest a three-cluster solu-
tion, especially as s increases. These results are not
surprising: most algorithms fail to retrieve strongly overlapping clusters. In Table 5 we collect the FW indexes obtained running 100 simulations for each pair of values of s and n_B, and the FW indexes for the quadratic discriminant analysis (QDA) predictions. Indeed, in our case, we obtain
a very similar behavior of the FW index when applied to the results of a QDA to recognize the actual dietary patterns. As for detecting the actual number of clusters, this constitutes a difficult problem for which no universal solution has yet been devised.

[Fig. 3 Hierarchies from artificial nutrition data: (a) trees from the seven regions (France—center, France—north, France—south, Italy—north, Italy—south, Spain—north, Spain—south); (b) trees from the three countries (France, Italy, Spain).]
[Fig. 4 Classification of dietary patterns obtained by cutting the trees in Fig. 3: (a) tree of 21 diets from 7 regions; (b) tree of 9 diets from 3 countries.]
[Fig. 5 Classification obtained considering each region as a bin.]
[Fig. 6 Classification obtained considering each country as a bin.]
4.3 Retrieval of an unusual cluster structure by a variant of the HAC algorithm based on our dissimilarity
We show now that our dissimilarity can be used as well in
conjunction with any traditional HAC algorithm, like sin-
gle link. It suffices to calculate once and for all the
dissimilarity between all pairs of subpopulations and sub-
mit the resulting matrix to any efficient and reliable
algorithm. This approach extends considerably the scope of
traditional HAC algorithms, which are especially vulnera-
ble to the difficulties caused by massive data sets.
The power of this idea is demonstrated in the following
experiment, in which we retrieved a complex clustering
structure that most traditional algorithms find difficult to
retrieve. We have generated a data set of 300,000 samples
from the popular benchmark two-spiral problem proposed
by Alexis Wieland and now part of the Carnegie Mellon
repository [51]. We have chosen a standard deviation of
0.05, which causes a certain overlap of the two spirals, as
shown in the graphical representation of our data of
Fig. 12a. While single link should be able to retrieve such a
structure, it cannot be applied efficiently to data sets of our
size, unless we use a two-step approach.
We have deployed a two-step approach with k-Means (k = 25) in the first step and single link on the LRS dissimilarity matrix in the second step. Figure 12b shows the dendrogram thus obtained, which clearly points to a two-cluster solution. Moreover, this solution clearly identifies the shape
of the two spirals of Fig. 12a. By contrast, using both the
algorithm of Sect. 3 and the HAC procedure MCLUST, we
completely miss the correct solution, since both approaches
attempt to cover the whole data set by balls.
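A compact sketch of this experiment, assuming the data matrix D holds the two-spiral sample (function and parameter names are ours):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans

def two_step_single_link(D, n_bins=25, n_clusters=2, seed=0):
    """k-Means bins, LRS dissimilarity matrix (Eq. 29), then classical
    single-link HAC on that matrix."""
    labels = KMeans(n_clusters=n_bins, n_init=10,
                    random_state=seed).fit_predict(D)
    xbar = np.array([D[labels == r].mean(axis=0) for r in range(n_bins)])
    V = np.array([np.cov(D[labels == r], rowvar=False) / (labels == r).sum()
                  for r in range(n_bins)])
    dm = np.zeros((n_bins, n_bins))
    for s in range(n_bins):
        for t in range(s + 1, n_bins):
            diff = xbar[s] - xbar[t]
            dm[s, t] = dm[t, s] = diff @ np.linalg.solve(V[s] + V[t], diff)
    Z = linkage(squareform(dm), method='single')   # single link on the matrix
    return fcluster(Z, t=n_clusters, criterion='maxclust')
```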
5 Analysis of dietary data
We show here an analysis of the real data set that motivated many of the developments presented in this work. It
has a multilevel structure similar to the artificial data of
Sect. 4.2. The interest of this analysis is in the deployment
of many aspects of the methodology outlined in this paper,
in the context of a study that addresses an important pop-
ulation health problem.
[Fig. 7 Hierarchy from the whole artificial nutrition data set]
Table 4 Detected frequencies

                 Mediterranean  High animal  Quasi vegetarian
France—center    15,199         4,686        11,008
France—north     3,031          21,638       6,149
France—south     27,450         1,497        1,497
Total France     45,680         27,821       18,654
Italy—north      20,032         5,807        13,930
Italy—south      35,558         1,990        1,954
Total Italy      55,590         7,797        15,884
Spain—north      5,888          3,915        9,674
Spain—south      35,198         1,927        1,972
Total Spain      41,086         5,842        11,646
[Fig. 8 Classification of the seven regions with a multinomial tree]
[Fig. 9 Classification of the three countries with a multinomial tree]
Our data was collected in the course of the EPIC study
on nutrition and cancer [44] and comprises measurements
of dietary variables on 4,852 women from eight regions of
France. Table 6 shows the number of samples registered in
each region. The variables are 24-h food intake (in grams)
for 16 food groups. Prior to the analysis, the variables for
each subject were re-expressed as percentage of the total
food intake. A profile of the variables is given in Fig. 13. In
the first stage of the analysis, we performed a k-Means
clustering. A choice of k = 30 seems appropriate and we
only report here the results obtained from clustering 30
bins. In the second stage, we applied our hierarchical
clustering algorithm, obtaining the tree of Fig. 14. The
graph does not compellingly suggest a specific cut. We
give here the results of the four-cluster cut, which leads to a
reasonable interpretation. The profiles for the four clusters
are given in Fig. 15. It can be seen that dietary pattern 1 is
characterized by high intake of alcohol, dietary pattern 2 by
high intake of dairy products, dietary pattern 3 by high
intake of vegetables and fruit, and dietary pattern 4 by high
intake of soups and relatively high intakes of dairy prod-
ucts and fruit. The frequency and the proportions of the
four dietary patterns in each of the seven regions are given
in Table 7. The multinomial clustering algorithm was
applied to the rows of Table 7a, obtaining the tree of
Fig. 16. Cutting the tree at the level corresponding to four
clusters, we can see that North-Pas-de-Calais and Ile-de-
France constitute two 1-region clusters, while the remain-
ing six regions cluster in two groups.
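Concretely, the dissimilarity driving the tree of Fig. 16 compares, for each pair of regions, the hypothesis of two distinct multinomials to that of one common multinomial. The following is a minimal sketch, assuming the standard G-statistic form of that LRS; the two count vectors are the Alsace-Lorraine and Aquitaine frequency rows of Table 7.

```python
import numpy as np

def multinomial_lrs(ns: np.ndarray, nt: np.ndarray) -> float:
    """LRS dissimilarity between two vectors of category counts:
    2 * [loglik under separate multinomials - loglik under a pooled one]."""
    pooled = (ns + nt) / (ns.sum() + nt.sum())   # MLE proportions under H0
    def g(n):
        p_own = n / n.sum()                      # MLE proportions under H1
        mask = n > 0                             # 0 * log(0) taken as 0
        return np.sum(n[mask] * np.log(p_own[mask] / pooled[mask]))
    return 2.0 * (g(ns) + g(nt))

alsace = np.array([92, 73, 176, 137])        # D1-D4 counts from Table 7
aquitaine = np.array([78, 72, 162, 131])
print(multinomial_lrs(alsace, aquitaine))    # small (about 0.5): similar rows
```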
Figure 17 gives the distribution of the four dietary patterns in each of the four clusters of regions. The largest differences are seen in the proportions of dietary patterns 3 and 4. It can be seen that the Nord-Pas-de-Calais region is associated with a large proportion of dietary pattern 4, while the Ile-de-France region is associated with a large proportion of dietary pattern 3. The group consisting of Alsace-Lorraine, Aquitaine and Rhone-Alpes is very similar to Ile-de-France, but has a slightly higher proportion of dietary pattern 4. Finally, the group consisting of Bretagne-Pays-de-Loire and Languedoc-Roussillon contains similar proportions of dietary patterns 3 and 4.

All in all, the analysis presented here leads to a clear and concise description of the dietary patterns found in this population, as well as of the geographical distribution of these patterns across France.

Fig. 17 Distribution of the four dietary patterns in each of the four clusters of regions (panels: Nord-Pas-de-Calais; Languedoc-Roussillon and Bretagne-Pays-de-Loire; Ile-de-France; Alsace-Lorraine, Aquitaine and Rhone-Alpes; bars D1 to D4, vertical axis 0 to 50)
Fig. 10 FW index for different values of s and nB (one panel per nB = 25, 50, 100, 150, 200; horizontal axis s = 1, 10, 25, 50, 75, 100; vertical axis 0 to 1)
6 Summary and conclusion

In this work we have reformulated the old idea of clustering subpopulations on the basis of available samples from them. We do this in order to show that algorithms for clustering composite observational units provide excellent tools for dealing with complex data structures and huge data sets.
Fig. 11 Hierarchies obtained with different values of s and nB (one panel per combination of nB = 25, 50, 100, 150, 200 and s = 1, 10, 25, 50, 75, 100)
Table 5 FW indexes for QDA and our algorithm

                s = 1    s = 10   s = 25   s = 50   s = 75   s = 100
FW index for the hierarchical classification
nB = 25         1.0000   0.9990   0.9719   0.8520   0.7517   0.6531
nB = 50         1.0000   0.9994   0.9735   0.8865   0.8071   0.6643
nB = 100        1.0000   0.9995   0.9750   0.9019   0.8107   0.6740
nB = 150        1.0000   0.9996   0.9752   0.9088   0.8492   0.6341
nB = 200        1.0000   0.9996   0.9757   0.9012   0.8444   0.6463
FW index for the QDA
nB = 25         1.0000   0.9991   0.9763   0.8636   0.7684   0.6678
nB = 50         1.0000   0.9994   0.9780   0.9007   0.8306   0.6817
nB = 75         1.0000   0.9996   0.9797   0.9174   0.8327   0.6943
nB = 100        1.0000   0.9996   0.9800   0.9250   0.8759   0.6515
nB = 200        1.0000   0.9997   0.9805   0.9168   0.8711   0.6659
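For readers wishing to reproduce a comparison of this kind: the FW index [49] compares two hierarchical clusterings cut at the same number of groups; at a fixed cut it is the Fowlkes-Mallows B_k statistic, which scikit-learn implements for flat label vectors. The sketch below uses hypothetical labels, not the simulated data of Sect. 4.2.

```python
# Minimal sketch, assuming the FW index at a fixed cut reduces to the
# Fowlkes-Mallows B_k statistic [49]; label vectors here are hypothetical.
from sklearn.metrics import fowlkes_mallows_score

labels_a = [0, 0, 0, 1, 1, 2, 2, 2]   # e.g., a cut of the reference hierarchy
labels_b = [0, 0, 1, 1, 1, 2, 2, 2]   # e.g., the same cut after perturbation
print(fowlkes_mallows_score(labels_a, labels_b))  # 1.0 means identical cuts
```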
We start by introducing a general framework for developing HAC algorithms. Most clustering algorithms explicitly rely on the notion of dissimilarity between two atoms. We argue that, when dealing with atoms that are samples from several probability distributions known in parametric form, a natural choice for the dissimilarity between any two units is the likelihood ratio statistic (LRS) comparing the hypothesis that the two units come from two different distributions to the hypothesis that they come from the same one. This established, we need one more important specification: the agglomeration rule, i.e., the rule that extends the notion of dissimilarity between two atoms to the notion of dissimilarity between two aggregates of atoms. These two specifications are sufficient to define a hierarchical agglomerative classification (HAC) algorithm for subpopulations. Throughout most of this paper we use a simple and natural model-based agglomeration rule: define the dissimilarity between two aggregates as the LRS comparing the hypothesis that the aggregates have two distinct distributions to the hypothesis that they have the same one. However, we also show that other agglomeration rules, such as single link and complete link, can easily be adopted instead, thus changing the point of view from which we look at a data set.
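Schematically, the pair (dissimilarity, agglomeration rule) is all the model has to supply to a generic agglomerative loop. The sketch below is our own illustration, with dissim and merge left abstract; for the normal case of the Appendix, merge would pool by inverse-variance weighting as in Eq. 53.

```python
from itertools import combinations

def hac(atoms, dissim, merge):
    """Generic agglomerative loop: `atoms` is a list of model parameters,
    `dissim(a, b)` the LRS dissimilarity between two (aggregates of) atoms,
    `merge(a, b)` the agglomeration rule producing aggregate parameters."""
    clusters, history = list(atoms), []
    while len(clusters) > 1:
        # Find the closest pair under the supplied dissimilarity.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dissim(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters[j]
        history.append((a, b, dissim(a, b)))     # record the merge level
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merge(a, b))
    return history
```

A production version would cache the pairwise dissimilarity matrix rather than recompute it inside the loop, but the structure is the same.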
We discuss the relationship between clustering subpopulations and three important domains of statistical applications: model-based clustering, the handling of massive data sets, and simultaneous testing procedures in the area of multiple comparisons. One of the most appealing ways of producing a clustering is to construct a hierarchy, represented by a dendrogram, which appropriately summarizes the relationships amongst objects and/or variables.
Fig. 12 The 2-spiral problem in a large data set context: a identified clusters (the 25 bins, labelled 1 to 25, in the plane with axes from -1 to 1), b 2-step single link dendrogram
Table 6 Samples registered in seven regions

Region                    Samples registered
Alsace-Lorraine           478
Aquitaine                 443
Bretagne-Pays-de-Loire    635
Ile-de-France             1,201
Languedoc-Roussillon      625
Nord-Pas-de-Calais        452
Rhone-Alpes               1,018
Table 7 Frequency and percentage of each dietary pattern

                         Frequency               Percentage
                         D1    D2    D3    D4    D1    D2    D3    D4
Alsace-Lorraine          92    73    176   137   19.2  15.3  36.8  28.7
Aquitaine                78    72    162   131   17.6  16.3  36.6  29.6
Bretagne-Pays-de-Loire   124   80    210   221   19.5  12.6  33.1  34.8
Ile-de-France            256   188   491   266   21.3  15.7  40.9  22.1
Languedoc-Roussillon     107   90    220   208   17.1  14.4  35.2  33.3
Nord-Pas-de-Calais       90    62    108   192   19.9  13.7  23.9  42.5
Rhone-Alpes              173   143   428   274   17.0  14.0  42.0  26.9
Unfortunately, the construction of a dendrogram seems impractical for data sets of the size encountered in current applications. Usually, however, interest focuses less on the whole dendrogram than on its upper portion, which provides insight into the hierarchy of relatively few, large clusters. As the number of "interesting" classes is far smaller than the number of observations, one may disregard the details of the lowest levels of the dendrogram with negligible information loss.

We then devote an entire section to the development of a new approach to HAC for huge data sets.
Fig. 13 Profile of the EPIC data set variables (Potatoes, Vegetables, Legumes, Fruits, Dairy, Cereals, Meat, Fish, Eggs, Fat, Sugar, Cakes, Alcohol, Sauces, Soups, Miscellaneous; vertical axis 0 to 80)
Fig. 15 Profiles of the four dietary patterns (panels Dietary Pattern 1 to Dietary Pattern 4, same 16 food groups as Fig. 13; vertical axis 0 to 40)
Fig. 16 Nutrition data multinomial tree (leaves: Languedoc-Roussillon, Alsace-Lorraine, Aquitaine, Ile-de-France, Rhone-Alpes, Nord-Pas-de-Calais, Bretagne-Pays-de-Loire)
Fig. 14 Tree of the nutrition data
In the next section, we present analyses of artificial data sets with the purpose of evaluating the approach to huge data sets we
propose. This is followed by a detailed analysis of a real data set, which was the motivation for this work.

Indeed, clustering of subpopulations emerges as a unifying tool for formulating and solving a variety of problems in modern data analysis.
Acknowledgements We wish to thank Dr. F. Clavel for having shared the EPIC French data with us and Dr. E. Riboli for general assistance in becoming familiar with EPIC. The authors gratefully acknowledge the hospitality of the members of the Department of Epidemiology and Statistics during M. Castejon's and A. Gonzalez's visits to McGill University. M. Castejon also thanks the members of INRIA for their hospitality during his visit. We gratefully acknowledge support from the Ministerio de Educacion y Ciencia de Espana, Direccion General de Investigacion, by means of the DPI2006-14784 and DPI2007-61090 research projects.
7 Appendix: Proofs
Proof of Proposition 1 Since $d(A_i, A_j) > 0$ we have that $f(A_r) > 0$, $r = 1, 2, \ldots, M$. Thus $f(A_i \cup A_j) > f(A_i) + f(A_j)$ and obviously $f(A_i \cup A_j) > \max\{f(A_i), f(A_j)\}$.
Proof of Proposition 2

$$d(B_s, B_t) = (\bar{x}_s - \bar{x}_{s \cup t})^T V_s^{-1} (\bar{x}_s - \bar{x}_{s \cup t}) + (\bar{x}_t - \bar{x}_{s \cup t})^T V_t^{-1} (\bar{x}_t - \bar{x}_{s \cup t}) \quad (41)$$

Now:

$$\bar{x}_s - \bar{x}_{s \cup t} = (V_s^{-1} + V_t^{-1})^{-1} \left[ (V_s^{-1} + V_t^{-1}) \bar{x}_s - V_s^{-1} \bar{x}_s - V_t^{-1} \bar{x}_t \right] \quad (42)$$

$$\bar{x}_s - \bar{x}_{s \cup t} = (V_s^{-1} + V_t^{-1})^{-1} V_t^{-1} (\bar{x}_s - \bar{x}_t) \quad (43)$$

$$\bar{x}_t - \bar{x}_{s \cup t} = (V_s^{-1} + V_t^{-1})^{-1} V_s^{-1} (\bar{x}_t - \bar{x}_s) \quad (44)$$

Therefore:

$$\begin{aligned}
d(B_s, B_t) &= (\bar{x}_s - \bar{x}_t)^T V_t^{-1} (V_s^{-1} + V_t^{-1})^{-1} V_s^{-1} (V_s^{-1} + V_t^{-1})^{-1} V_t^{-1} (\bar{x}_s - \bar{x}_t) \\
&\quad + (\bar{x}_s - \bar{x}_t)^T V_s^{-1} (V_s^{-1} + V_t^{-1})^{-1} V_t^{-1} (V_s^{-1} + V_t^{-1})^{-1} V_s^{-1} (\bar{x}_s - \bar{x}_t) \\
&= (\bar{x}_s - \bar{x}_t)^T (V_s + V_t)^{-1} (V_s^{-1} + V_t^{-1})^{-1} V_t^{-1} (\bar{x}_s - \bar{x}_t) \\
&\quad + (\bar{x}_s - \bar{x}_t)^T (V_s + V_t)^{-1} (V_s^{-1} + V_t^{-1})^{-1} V_s^{-1} (\bar{x}_s - \bar{x}_t) \\
&= (\bar{x}_s - \bar{x}_t)^T (V_s + V_t)^{-1} (V_s^{-1} + V_t^{-1})^{-1} (V_s^{-1} + V_t^{-1}) (\bar{x}_s - \bar{x}_t) \\
&= (\bar{x}_s - \bar{x}_t)^T (V_s + V_t)^{-1} (\bar{x}_s - \bar{x}_t)
\end{aligned} \quad (45)$$

where we have used

$$V_s^{-1} (V_s^{-1} + V_t^{-1})^{-1} V_t^{-1} = (V_s + V_t)^{-1} \quad (46)$$

Notice also that we can write

$$d(B_s, B_t) = (\bar{x}_s - \bar{x}_t)^T (V_s^{-1} + V_t^{-1})^{-1} V_s^{-1} V_t^{-1} (\bar{x}_s - \bar{x}_t) \quad (47)$$

owing to the commutativity of symmetric matrices.
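Since the chain of matrix identities in Eqs. (41)-(46) is easy to mistype, a small numerical sanity check may be reassuring. This is our own verification sketch, not part of the original derivation.

```python
# Numeric check of Proposition 2: the two-term form of d(B_s, B_t) collapses
# to a single quadratic form with pooled covariance V_s + V_t.
import numpy as np

rng = np.random.default_rng(1)
p = 3
A = rng.normal(size=(p, p)); Vs = A @ A.T + p * np.eye(p)   # random SPD matrix
B = rng.normal(size=(p, p)); Vt = B @ B.T + p * np.eye(p)
xs, xt = rng.normal(size=p), rng.normal(size=p)

Us, Ut = np.linalg.inv(Vs), np.linalg.inv(Vt)
xst = np.linalg.solve(Us + Ut, Us @ xs + Ut @ xt)           # pooled mean

lhs = (xs - xst) @ Us @ (xs - xst) + (xt - xst) @ Ut @ (xt - xst)   # Eq. 41
rhs = (xs - xt) @ np.linalg.solve(Vs + Vt, xs - xt)                 # Eq. 45
assert np.isclose(lhs, rhs)
```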
Proof of Proposition 3

$$d(A_i, A_j) = \sum_{r \in A_i \cup A_j} (\bar{x}_r - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i \cup A_j}) - \sum_{i=1}^{2} \sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i}) \quad (48)$$

$$d(A_i, A_j) = \sum_{i=1}^{2} \left\{ \sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i \cup A_j}) - \sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i}) \right\} \quad (49)$$

Let us develop the first term within the curly brackets above:
$$\begin{aligned}
\sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i \cup A_j})
&= \sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i} + \bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i} + \bar{x}_{A_i} - \bar{x}_{A_i \cup A_j}) \\
&= \sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i})
 + \sum_{r \in A_i} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j}) \\
&\quad + 2 \sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i})^T V_r^{-1} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})
\end{aligned} \quad (50)$$

Now notice that the last term is zero. This follows from:

$$\sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i})^T V_r^{-1} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j}) = (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})^T \sum_{r \in A_i} V_r^{-1} (\bar{x}_r - \bar{x}_{A_i}) \quad (51)$$

and

$$\sum_{r \in A_i} V_r^{-1} (\bar{x}_r - \bar{x}_{A_i}) = \sum_{r \in A_i} V_r^{-1} \bar{x}_r - \sum_{r \in A_i} V_r^{-1} \bar{x}_{A_i} = V_{A_i}^{-1} \bar{x}_{A_i} - V_{A_i}^{-1} \bar{x}_{A_i} = 0 \quad (52)$$

since by definition

$$V_{A_i}^{-1} = \sum_{r \in A_i} V_r^{-1}, \qquad \bar{x}_{A_i} = V_{A_i} \sum_{r \in A_i} V_r^{-1} \bar{x}_r \quad (53)$$

We now have

$$\begin{aligned}
d(A_i, A_j) &= \sum_{i=1}^{2} \sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i})
 + \sum_{i=1}^{2} \sum_{r \in A_i} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j}) \\
&\quad - \sum_{i=1}^{2} \sum_{r \in A_i} (\bar{x}_r - \bar{x}_{A_i})^T V_r^{-1} (\bar{x}_r - \bar{x}_{A_i}) \\
&= \sum_{i=1}^{2} \left\{ \sum_{r \in A_i} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j}) \right\}
\end{aligned} \quad (54)$$

and we can proceed as in Proposition 2. To avoid confusion we write

$$U_r = V_r^{-1}, \qquad U_{A_i} = \sum_{r \in A_i} V_r^{-1} = \sum_{r \in A_i} U_r, \qquad U_{A_i \cup A_j} = U_{A_i} + U_{A_j}$$

whence

$$\bar{x}_{A_i \cup A_j} = U_{A_i \cup A_j}^{-1} (U_{A_i} \bar{x}_{A_i} + U_{A_j} \bar{x}_{A_j}) \quad (55)$$

and therefore

$$\begin{aligned}
\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j}
&= U_{A_i}^{-1} \sum_{r \in A_i} U_r \bar{x}_r - U_{A_i \cup A_j}^{-1} (U_{A_i} \bar{x}_{A_i} + U_{A_j} \bar{x}_{A_j}) \\
&= U_{A_i \cup A_j}^{-1} (U_{A_i \cup A_j} \bar{x}_{A_i} - U_{A_i} \bar{x}_{A_i} - U_{A_j} \bar{x}_{A_j}) \\
&= U_{A_i \cup A_j}^{-1} (U_{A_i} \bar{x}_{A_i} + U_{A_j} \bar{x}_{A_i} - U_{A_i} \bar{x}_{A_i} - U_{A_j} \bar{x}_{A_j}) \\
&= U_{A_i \cup A_j}^{-1} U_{A_j} (\bar{x}_{A_i} - \bar{x}_{A_j})
\end{aligned} \quad (56)$$

and similarly

$$\bar{x}_{A_j} - \bar{x}_{A_i \cup A_j} = U_{A_i \cup A_j}^{-1} U_{A_i} (\bar{x}_{A_j} - \bar{x}_{A_i}) \quad (57)$$

Substituting in the expression for $d(A_i, A_j)$ above (Eq. 54):

$$\begin{aligned}
d(A_i, A_j) &= \sum_{r \in A_i} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_{A_i} - \bar{x}_{A_i \cup A_j})
 + \sum_{r \in A_j} (\bar{x}_{A_j} - \bar{x}_{A_i \cup A_j})^T V_r^{-1} (\bar{x}_{A_j} - \bar{x}_{A_i \cup A_j}) \\
&= (\bar{x}_{A_i} - \bar{x}_{A_j})^T U_{A_j} U_{A_i \cup A_j}^{-1} \Big( \sum_{r \in A_i} V_r^{-1} \Big) U_{A_j} U_{A_i \cup A_j}^{-1} (\bar{x}_{A_i} - \bar{x}_{A_j}) \\
&\quad + (\bar{x}_{A_i} - \bar{x}_{A_j})^T U_{A_i} U_{A_i \cup A_j}^{-1} \Big( \sum_{r \in A_j} V_r^{-1} \Big) U_{A_i} U_{A_i \cup A_j}^{-1} (\bar{x}_{A_i} - \bar{x}_{A_j}) \\
&= (\bar{x}_{A_i} - \bar{x}_{A_j})^T U_{A_j} U_{A_i \cup A_j}^{-1} U_{A_i} U_{A_j} U_{A_i \cup A_j}^{-1} (\bar{x}_{A_i} - \bar{x}_{A_j})
 + (\bar{x}_{A_i} - \bar{x}_{A_j})^T U_{A_i} U_{A_i \cup A_j}^{-1} U_{A_j} U_{A_i} U_{A_i \cup A_j}^{-1} (\bar{x}_{A_i} - \bar{x}_{A_j}) \\
&= (\bar{x}_{A_i} - \bar{x}_{A_j})^T U_{A_i \cup A_j}^{-1} U_{A_i} U_{A_j} (U_{A_i} + U_{A_j}) U_{A_i \cup A_j}^{-1} (\bar{x}_{A_i} - \bar{x}_{A_j}) \\
&= (\bar{x}_{A_i} - \bar{x}_{A_j})^T U_{A_i \cup A_j}^{-1} U_{A_i} U_{A_j} (\bar{x}_{A_i} - \bar{x}_{A_j})
\end{aligned} \quad (58)$$

where we have used the commutativity of symmetric matrices, the definition $U_{A_i} = \sum_{r \in A_i} V_r^{-1}$, and $U_{A_i \cup A_j} = U_{A_i} + U_{A_j}$. Finally, noting that

$$U_{A_i \cup A_j}^{-1} U_{A_i} U_{A_j} = (V_{A_i}^{-1} + V_{A_j}^{-1})^{-1} V_{A_i}^{-1} V_{A_j}^{-1} = (V_{A_i} + V_{A_j})^{-1} \quad (59)$$

we have

$$d(A_i, A_j) = (\bar{x}_{A_i} - \bar{x}_{A_j})^T (V_{A_i} + V_{A_j})^{-1} (\bar{x}_{A_i} - \bar{x}_{A_j}) \quad (60)$$
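As with Proposition 2, a numerical check of the reduction from Eq. (48) to Eq. (60) is straightforward; the sketch below is our own verification under the definitions of Eq. (53), not part of the original proof.

```python
# Numeric check of Proposition 3: the within-vs-merged sum of quadratic forms
# over atoms reduces to a single term in the aggregate means and variances.
import numpy as np

rng = np.random.default_rng(2)
p = 2

def random_spd():
    M = rng.normal(size=(p, p))
    return M @ M.T + p * np.eye(p)

# Two aggregates of atoms (bins), each atom r with mean x_r and variance V_r.
Ai = [(rng.normal(size=p), random_spd()) for _ in range(3)]
Aj = [(rng.normal(size=p), random_spd()) for _ in range(4)]

def aggregate(atoms):
    """x_A = V_A @ sum(V_r^{-1} x_r), V_A^{-1} = sum(V_r^{-1})  (Eq. 53)."""
    U = sum(np.linalg.inv(V) for _, V in atoms)
    VA = np.linalg.inv(U)
    xA = VA @ sum(np.linalg.inv(V) @ x for x, V in atoms)
    return xA, VA

def within(atoms, center):
    return sum((x - center) @ np.linalg.inv(V) @ (x - center) for x, V in atoms)

xi, Vi = aggregate(Ai)
xj, Vj = aggregate(Aj)
xm, _ = aggregate(Ai + Aj)

lhs = within(Ai + Aj, xm) - within(Ai, xi) - within(Aj, xj)   # Eq. 48
rhs = (xi - xj) @ np.linalg.solve(Vi + Vj, xi - xj)           # Eq. 60
assert np.isclose(lhs, rhs)
```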
References
1. Conversano C (2002) Bagged mixtures of classifiers using model
scoring criteria. Pattern Anal Appl 5(4):351–362
2. Barandela R, Valdovinos RM, Sanchez JS (2003) New applica-
tions of ensembles of classifiers. Pattern Anal Appl 6(3):245–256
3. Yang J, Ye H, Zhang D (2004) A new lda-kl combined method
for feature extraction and its generalisation. Pattern Anal Appl
7(1):40–50. doi:10.1007/s10044-004-0205-6
4. Masip D, Kuncheva LI, Vitria J (2005) An ensemble-based
method for linear feature extraction for two-class problems.
Pattern Anal Appl 227–237
5. Chou C-H, Su M-C (2004) A new cluster validity measure and its
application to image compression. Pattern Anal Appl 7(2):205–
220. doi:10.1007/s10044-004-0218-1
6. Gyllenberg M, Koski T, Lund T, Nevalainen O (2000) Clustering
by adaptive local search with multiple search operators. Pattern
Anal Appl 3(4):348–357. doi:10.1007/s100440070006
7. Omran MGH, Salman A, Engelbrecht AP (2006) Dynamic clustering using particle swarm optimization with application in image segmentation. Pattern Anal Appl 34:332–344. doi:10.1007/978-3-540-34956-3_6
8. Frigui H (2005) Unsupervised learning of arbitrarily shaped
clusters using ensembles of gaussian models. Pattern Anal Appl
8(1–2):32–49. doi:10.1007/s10044-005-0240-y
9. Franti P, Kivijarvi J (2000) Randomised local search algorithm
for the clustering problem. Pattern Anal Appl 3:358–369
10. Calinski T, Corsten LCA (1985) Clustering means in anova by
simultaneous testing. Biometrics 41:39–48
11. Gabriel KR (1964) A procedure for testing the homogeneity of all
sets of means in analysis of variance. Biometrics 20(3):459–477.
doi: http://dx.doi.org/10.2307/2528488
12. Gabriel KR (1969) Simultaneous test procedures—some theory
of multiple comparisons. Ann Math Stat 40(1):224–250
13. Scott AJ, Symons MJ (1971) Clustering methods based on like-
lihood ratio criteria. Biometrics 27:387–397
14. Banfield JD, Raftery AE (1993) Model-based gaussian and non-
gaussian clustering. Biometrics 49:803–821
15. Bensmail H, Celeux G, Raftery AE, Robert CP (1997) Inference
in model-based cluster analysis. Stat Comput 7(1):1–10. doi:
http://dx.doi.org/10.1023/A:1018510926151
16. Fraley C (1998) Algorithms for model-based gaussian hierar-
chical clustering. SIAM J Sci Comput 20:270–281
17. Fraley C, Raftery AE (1999) Mclust: software for model-based
cluster analysis. J Classification 16:297–306
18. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood
from incomplete data via the em algorithm. J R Stat Soc 39(1):
1–38
19. Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332. doi:10.1016/0167-9473(92)90042-E
20. Maitra R (1998) Clustering massive datasets. Technical report. http://www.math.umbc.edu/~maitra/papers/jsm98.ps
21. Murtagh F (2002) Clustering in massive data sets, chap 14.
Kluwer, Dordrecht, pp 401–545
22. Kaufman L, Rousseeuw PJ (1990) Finding groups in data. An
introduction to cluster analysis. Wiley series in probability and
mathematical statistics. Applied probability and statistics. Wiley,
New York
23. Barndorff-Nielsen OE (1978) Information and exponential fam-
ilies in statistical theory. Wiley, Chichester
24. Cox DR, Hinkley DV (1974) Theoretical statistics. Chapman and Hall, London
25. McCullagh P, Nelder JA (1989) Generalized linear models, 2nd
edn. Chapman and Hall, London
26. Bradley PS, Fayyad UM (1998) Refining initial points for k-
Means clustering. In: Proceedings of the 15th international con-
ference on machine learning. Morgan Kaufmann, San Francisco,
pp 91–99
27. Bradley PS, Fayyad U, Reina C (1998) Scaling cluster algorithms
to large databases. In: Proceedings of the 4th international
conference on knowledge discovery and data mining. American
Association for Artificial Intelligence Press, Menlo Park, pp 9–15
28. Good P (2005) Permutation, parametric, and bootstrap tests of
hypotheses. A practical guide to resampling methods for testing
hypotheses. Springer, Heidelberg
29. Zhang T, Ramakrishnan R, Livny M (1997) Birch: A new data
clustering algorithm and its applications. Data Mining and Knowl
Discov 1(2):141–182
30. Ng RT, Han J (1994) Efficient and effective clustering methods
for spatial data mining. In: Bocca J, Jarke M, Zaniolo C (eds)
20th International Conference on Very Large Data Bases, 12–15
September 1994, Santiago, Chile proceedings. Morgan Kaufmann
Publishers, Los Altos, CA 94022, pp 144–155
31. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Auto-
matic subspace clustering of high dimensional data for data
mining applications. In: Proceedings of ACM SIGMOD inter-
national conference on management of data. ACM, New York, pp
94–105
32. Guha S, Rastogi R, Shim K (1998) Cure: an efficient clustering
algorithm for large databases. In: Proceedings of ACM SIGMOD
international conference on management of data. ACM, New
York, pp 73–84
33. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based
algorithm for discovering clusters in large spatial databases
with noise. In: Proceeding of the 2nd international conference
on knowledge discovery and data mining, Portland, pp 226–
231
34. Ester M, Kriegel H-P, Sander J, Wimmer M, Xu X (1998)
Incremental clustering for mining in a data warehousing envi-
ronment. In: Proceedings of the 24th international conference on
very large data bases, New York City, NY, pp 323–333
35. Nittel S, Leung KT, Braverman A (2004) Scaling clustering algorithms for massive data sets using data streams. In: Proceedings of the 20th international conference on data engineering (ICDE'04)
36. Guha S, Rastogi R, Shim K (1999) Rock: a robust clustering algorithm for categorical attributes. In: Proceedings of the 15th international conference on data engineering, Sydney, Australia
37. Hartigan JA (1975) Clustering algorithms. Wiley, New York
38. Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41(8):578–588. http://citeseer.nj.nec.com/article/fraley98how.html
39. Fayyad U, Smyth P (1996) From massive datasets to science
catalogs: Applications and challenges. http://www.citeseer.ist.
psu.edu/fayyad95from.html
40. Gordon AD (1986) Links between clustering and assignment procedures. In: Proceedings of computational statistics, pp 149–156
41. Cutting DR, Pedersen JO, Karger D, Tukey JW (1992) Scatter/
gather: a cluster-based approach to browsing large document
collections. In: Proceedings of the 15th annual international
ACM SIGIR conference on research and development in
information retrieval, pp 318–329. http://citeseer.ist.psu.edu/
cutting92scattergather.html
42. Tantrum J, Murua A, Stuetzle W (2002) Hierarchical model-
based clustering of large datasets through fractionation and
refractionation. http://citeseer.nj.nec.com/572414.html
43. SAS Institute Inc (2004) SAS/STAT 9.1 user's guide. SAS Institute Inc, Cary. ISBN 1-59047-243-8
44. Ciampi A, Lechevallier Y (2000) Clustering large, multi-level
data sets: an approach based on kohonen self organizing maps. In:
Principles of data mining and knowledge discovery: 4th european
conference, PKDD 2000, Lyon, France. Proceedings, volume
1910/2000 of Lecture Notes in Computer Science. Springer,
Heidelberg, pp 353–358
45. Lechevallier Y, Ciampi A (2007) Multilevel clustering for large
databases. In: Auget J-L, Balakrishnan N, Mesbah M, Mole-
nberghs G (eds) Advances in statistical methods for the health
sciences, statistics for industry and technology, chap 10. Applied
Probality and Statistics, Springer edition, pp 263–274
46. Posse C (2001) Hierarchical model-based clustering for large
datasets. J Comput Graph Stat 10:464–486
47. Guo H, Renaut R, Chen K, Reiman E (2003) Clustering huge data
sets for parametric pet imaging. Biosystems 71:81–92
48. Frigui H (2004) In: Fuzzy information processing, NAFIPS'04, IEEE annual meeting, 27–30 June 2004, vol 2, pp 967–972. doi:10.1109/NAFIPS.2004.1337437
49. Fowlkes EB, Mallows CL (1983) A method for comparing two
hierarchical clusterings. J Am Stat Assoc 78:553–569
50. Meila M (2002) Comparing clusterings. Technical Report 418,
Department of Statistics, University of Washington
51. Singh S (1998) 2d spiral pattern recognition with possibilistic
measures. Pattern Recogn Lett 19(2):141–147, ISSN 0167-8655.
doi: http://dx.doi.org/10.1016/S0167-8655(97)00163-3
Author Biographies
Antonio Ciampi received his M.Sc. and Ph.D. degrees from Queen's University, Kingston, Ontario, Canada in 1973. He taught at the University of Zambia from 1973 to 1977. Returning to Canada, he worked as a statistician in the Treasury of the Ontario Government. From 1978 to 1985, he was Senior Scientist at the Ontario Cancer Institute, Toronto, and taught at the University of Toronto. In 1985 he moved to Montreal, where he is Associate Professor in the Department of Epidemiology, Biostatistics and Occupational Health, McGill University. He has also been Senior Scientist at the Montreal Children's Hospital Research Institute, the Montreal Heart Institute and the St. Mary's Hospital Community Health Research Unit. His research interests include Statistical Learning, Data Mining and Statistical Modeling.
Yves Lechevallier joined INRIA in 1976, where he was engaged in the Clustering and Pattern Recognition project. Since 1988 he has been teaching Clustering, Neural Networks and Data Mining at the University of PARIS-IX, CNAM and ENSAE. He specializes in Mathematical Statistics, Applied Statistics, Data Analysis and Classification. His current research interests are: (1) clustering algorithms (Dynamic Clustering Method, Kohonen Maps, Divisive Clustering Method); (2) discrimination problems and decision tree methods; (3) building efficient neural networks by means of classification trees.
Manuel Castejon Limas received his engineering degree from the Universidad de Oviedo in 1999 and his Ph.D. degree from the Universidad de La Rioja in 2004. Since 2002 he has taught project management at the Universidad de Leon. His research is oriented towards the development of data analysis procedures that may aid project managers in their decision-making processes.
Ana Gonzalez Marcos received
her M.Sc. and Ph.D. degrees from
the University of La Rioja, Spain.
In 2003, she joined the University
of Leon, Spain, where she works
as a Lecturer in the Department of
Mechanical, Informatic and
Aerospace Engineering. Her
research interests include the
application of multivariate analy-
sis and artificial intelligence
techniques in order to improve the
quality of industrial processes.