Mixture-Model Cluster Analysis using Information Theoretical Criteria
Jaime R. S. Fonseca1 ISCSP-Instituto Superior de Ciências Sociais e Políticas,
R. Almerindo Lessa, Pólo Universitário do Alto da Ajuda,
1349-055 Lisboa. Portugal
Margarida G. M. S. Cardoso ISCTE – Business School. Department of Quantitative Methods.
Av. das Forças Armadas 1649-026 Lisboa. Portugal [email protected]
Abstract. The estimation of mixture models has been proposed for quite some time as an approach
for cluster analysis. Several variants of the Expectation-Maximization algorithm are currently
available for this purpose. Estimation of mixture models simultaneously allows the determination
of the number of clusters and yields distributional parameters for clustering base variables. There
are several information criteria that help to support the selection of a particular model or
clustering structure. However, a question remains concerning the selection of specific criteria that
may be more suitable for particular applications. In the present work we analyze the relationship
between the performance of information criteria and the type of measurement of clustering
variables. In order to study this relationship we perform the analysis of forty-two data sets with
known clustering structure and with clustering variables that are categorical, continuous and
mixed type. We then compare eleven information-based criteria on their ability to recover the data
sets' clustering structures. As a result, we select the AIC3, BIC and ICL-BIC criteria as the best
candidates for model selection when the clustering variables are categorical, continuous and mixed
type, respectively.
Keywords: Cluster Analysis; Finite Mixture Models; Model Selection; Information Theoretical
Criteria
1. Introduction
The work of Newcomb [52] may be the first contribution to the modelling of a mixture of
homogeneous groups. But it was only in 1914 that Pearson [54] explicitly referred to the
"dissection of a given abnormal frequency curve into m components", and his work may
be the first on the decomposition of a mixture of two normal clusters. Since then, finite
mixtures of probability distributions have increasingly been used to model data
collected from populations composed of several homogeneous subpopulations.
Several types of variables may be considered as a base for clustering. Hall and
Titterington [31] directed their study to categorical variables; others studied models for
continuous variables, such as Wang, Zhang, Luo, and Wei [65]; but most real clustering
problems involve both continuous and discrete variables, and methodologies for mixed
clustering variables were also used [28], [34], [51].
1 Corresponding author: Jaime Raúl Seixas Fonseca, Tel.: +351 213 619 430 (3179); Fax: +351 213 619 430
There are several information criteria that help to support the selection of a particular
mixture model (associated with a clustering structure). However, a question remains
concerning the performance of specific criteria that may be more suitable for particular
applications.
This paper analyzes the association between the performance of information criteria used for
selecting the number of clusters in mixture models and the measurement type of the clustering
variables. As a result of this analysis, we indicate some preliminary guidelines concerning the
selection of a specific information criterion when specific types of attributes are considered for
clustering in an application.
The paper is organized as follows: in section 2.1, we define notation and review finite
mixture models and cluster analysis via mixture models; in section 2.2, we review
previous work on maximum likelihood, estimator properties and the EM algorithm; in
section 3, we review several model selection criteria proposed to estimate the number of
clusters of a mixture, and some comparisons between them; in section 4, we describe the
methodology; in section 5, we report on data analysis and experimental results; finally,
in section 6, we present some concluding remarks and prospects for future work.
2. Clustering of data via finite mixture models
2.1. Finite mixture model
Clustering is a task in which one seeks to identify a finite set of clusters to describe the
data [33]. Maronna and Jacovkis [44] or McLachlan and Basford [49] illustrated the use
of mixture models in the field of cluster analysis. Finite mixture models assume that
parameters of a statistical model of interest differ across unobserved or latent clusters.
They provide a useful method for grouping observations into clusters. In the mixture
method of clustering, each different cluster in the population is assumed to be described
by a different probability distribution, which may belong to the same family but differ
in the values they take for the parameters of the distribution.
This approach to clustering offers some advantages when compared with other
techniques: it identifies clusters [25]; it provides means to select the number of clusters
[47]; it is able to deal with diverse types of data (different measurement levels) [63]; it
outperforms more traditional approaches [64].
In order to present Mixture Models we give some notation below.
| Symbol | Meaning |
|---|---|
| $n$ | sample size |
| $(Y_1, \dots, Y_p)$ | clustering variables (random variables) |
| $(y_1, \dots, y_n)$ | measurements on variables $Y_1, \dots, Y_p$ |
| $y_i$ | measurement vector of individual $i$ on variables $Y_1, \dots, Y_p$ |
| $z = (z_1, \dots, z_n)$ | cluster-label vectors |
| $z_i$ | binary vector |
| $x = (y, z)$ | complete data |
| $f$ | probability density function |
| $S$ | number of unknown clusters |
| $S^*$ | true number of clusters |
| $\theta_s$ | vector of all unknown p(d)f parameters of the $s$th cluster |
| $\theta = (\theta_1, \dots, \theta_S)$ | vector of mixture model parameters, without weights |
| $\lambda = (\lambda_1, \dots, \lambda_{S-1})$ | vector of weights (mixing proportions) |
| $\tau_{is}$ | probability that individual $i$ belongs to the $s$th cluster, given $y_i$ |
| $\psi = (\theta, \lambda)$ | vector of all unknown mixture model parameters |
| $\hat{\psi} = (\hat{\theta}, \hat{\lambda})$ | estimate of the vector of all unknown parameters |
| $L$ | likelihood function, $L(\psi)$ |
| $LL$ | log-likelihood function, $\log L(\psi)$ |
| $LL_c$ | complete-data log-likelihood function |
| $n_\psi$ | number of mixture model parameters |
| $EN(S)$ | entropy associated with the model with $S$ clusters |
The mixture model approach to clustering assumes that the data come from a mixture of an
unknown number $S$ of clusters in some unknown proportions $\lambda_1, \dots, \lambda_S$; it is assumed
that the probability (density) function of $y_i$, $f(y_i \mid \psi)$, is a mixture, or weighted sum, of
$S$ cluster-specific probability (density) functions $f_s(y_i \mid \theta_s)$ with parameters $\theta_s$,
$s = 1, \dots, S$; that is, each data point $y_i$ is taken to be a realization of the mixture

$$f(y_i \mid \psi) = \sum_{s=1}^{S} \lambda_s f_s(y_i \mid \theta_s) \qquad (1)$$

where

$$\lambda_s \ge 0, \; s = 1, \dots, S, \quad \text{and} \quad \sum_{s=1}^{S} \lambda_s = 1 \qquad (2)$$

and $\theta = (\theta_1, \dots, \theta_S)$, $\lambda = (\lambda_1, \dots, \lambda_{S-1})$.
The data are arranged in an $(n \times P)$ matrix denoted by $Y$, where $n$ is the number of cases
and $P$ is the number of attributes. Let $y_{ip}$ be the result of the $i$th case on the $p$th attribute,
for $i = 1, \dots, n$ and $p = 1, \dots, P$. Let $y_i = (y_{i1}, \dots, y_{iP})^T$ be the $i$th column of the matrix $Y^T$, i.e.
a $(P \times 1)$ vector of the results of case $i$ on all attributes. Then, the log-likelihood
function for the parameters $\psi$ is

$$\log L(\psi) = \sum_{i=1}^{n} \log f(y_i \mid \psi) = \sum_{i=1}^{n} \log \sum_{s=1}^{S} \lambda_s f_s(y_i \mid \theta_s) \qquad (3)$$
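For concreteness, the log-likelihood (3) can be evaluated directly once the component densities are specified. Below is a minimal Python sketch for the Gaussian case; the function name and arguments are illustrative and not part of the original paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_log_likelihood(Y, weights, means, covs):
    """Evaluate log L(psi) of eq. (3) for a Gaussian mixture:
    sum_i log sum_s lambda_s f_s(y_i | theta_s)."""
    # weighted[i, s] = lambda_s * f_s(y_i | theta_s)
    weighted = np.column_stack([
        w * multivariate_normal.pdf(Y, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ])
    return float(np.log(weighted.sum(axis=1)).sum())
```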
The particularization of mixture models for multinomial, multivariate normal and mixed
models can be seen in works such as [31], [65], and [35], respectively.
2.2. Maximum likelihood estimation
2.2.1. Introduction
With the maximum likelihood approach to the estimation of $\psi$, an estimate is provided
by a suitable root of the likelihood equation

$$\frac{\partial \log L(\psi)}{\partial \psi} = 0 \qquad (4)$$
In order to derive meaningful results from clustering, the mixture model must be
identifiable; this simply means that a unique solution to the maximum likelihood (ML)
problem is possible and that the model parameters of the distributions are estimable and well
defined [16].
Several researchers have studied the consistency of the ML estimators of both the number of
clusters ($S$) and the other model parameters. The consistency of the estimator of the number
of clusters was established by Leroux [42] (with a maximum-penalized-likelihood method),
considering the information criteria AIC and BIC. This property was also discussed by
Keribin [40], James, Priebe and Marchette [37], and Boucheron and Gassiat [14], all
concluding that the estimator of the number of clusters converges to the true (though
unknown) number of clusters.
Kabir [39] presented one of the earliest works referring to the consistency of the ML
estimators of the remaining parameters: he stated that mixture model estimators are consistent
and asymptotically normally distributed when the clustering variables are assumed to
belong to the exponential family of distributions. Other researchers who addressed this issue
include Day [22], Hathaway [32], and Fryer and Robertson [29].
When dealing with mixture models for clustering purposes, we may define each
complete-data observation $x_i = (y_i, z_i)$ as having arisen from one of the clusters of the
mixture (1). Values of the clustering base variables $y_i$ are then regarded as
incomplete data, augmented by the cluster-label variables $z_{is}$; that is, $z_i = (z_{i1}, \dots, z_{iS})$ is the
unobserved portion of the data. The $z_{is}$ are binary indicator latent variables, with
$z_{is} = 1$ or $0$ according as $y_i$ belongs or does not belong to the $s$th
cluster, for $i = 1, \dots, n$ and $s = 1, \dots, S$.
Assuming that the $\{Z_i\}$ are independent and identically distributed, each according to a
multinomial distribution over $S$ categories with probabilities $\lambda_1, \dots, \lambda_S$, the complete-data
log-likelihood for estimating $\psi$, if the complete data $x_i = (y_i, z_i)$ were observed [45], is

$$\log L_c(\psi) = \sum_{i=1}^{n} \sum_{s=1}^{S} z_{is} \{\log \lambda_s + \log f_s(y_i \mid \theta_s)\} \qquad (5)$$
2.2.2. The EM algorithm
Fitting the finite mixture model (1) provides a probabilistic clustering of the $n$ entities in
terms of their posterior probabilities of membership of the $S$ clusters of the mixture of
distributions. Since the ML estimates of the finite mixture model (1) cannot be found
analytically, estimation of finite mixture models iteratively computes the estimates of the
clusters' posterior probabilities and updates the estimates of the distributional parameters
and mixing probabilities [41].
The Expectation-Maximization (EM) algorithm [23] is a widely used class of iterative
algorithms for ML estimation in the context of incomplete data, e.g. fitting mixture
models to observed data.
The EM algorithm proceeds by alternately applying two steps, until some convergence
criterion is met.
The E-step, on the $k$th iteration, calculates the expected complete-data log-likelihood,
given $y$, defined by the so-called Q function

$$Q(\psi; \psi^{(k)}) = E[\log L_c(\psi) \mid y; \psi^{(k)}] = \sum_{i=1}^{n} \sum_{s=1}^{S} \tau_{is}(y_i; \psi^{(k)}) \{\log \lambda_s + \log f_s(y_i \mid \theta_s)\} \qquad (6)$$
where

$$\tau_{is}^{(k)} = E(Z_{is} \mid y_i; \psi^{(k)}) = \frac{\hat{\lambda}_s^{(k)} f_s(y_i \mid \hat{\theta}_s^{(k)})}{\sum_{j=1}^{S} \hat{\lambda}_j^{(k)} f_j(y_i \mid \hat{\theta}_j^{(k)})} \qquad (7)$$

is the membership probability of pattern $y_i$ in cluster $s$ (posterior probability), for
$i = 1, \dots, n$ and $s = 1, \dots, S$.
The M-step, on the $(k+1)$th iteration, maximizes (6) with respect to $\psi$ to update the
parameter estimates, yielding $\hat{\psi}^{(k+1)}$.

Then, by the Bayes rule, after convergence of the EM algorithm the $i$th pattern is
probabilistically assigned to cluster $s$ if

$$\hat{\lambda}_s^{(k)} f_s(y_i \mid \hat{\theta}_s^{(k)}) \ge \hat{\lambda}_{s'}^{(k)} f_{s'}(y_i \mid \hat{\theta}_{s'}^{(k)}), \quad s' = 1, \dots, S.$$
Since the mixture likelihood $L(\psi)$ can never decrease during the EM sequence,

$$L(\hat{\psi}^{(k+1)}) \ge L(\hat{\psi}^{(k)}),$$

it follows that $L(\hat{\psi}^{(k)})$ converges to some $L^*$ for a sequence of likelihood values
bounded above. Since the likelihood surface of a mixture model typically has many local
maxima, the selection of suitable starting values for the EM algorithm is crucial [11].
It is therefore usual to obtain several values of the maximized log-likelihood from
different sets of initial values applied to the given sample, and then to take the maximum
value as the solution.
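As an illustration of the E-step (7), the M-step, and the multiple-restart strategy just described, the following sketch implements EM for a univariate Gaussian mixture. It is a simplified example under our own naming, not the estimation routine used in the paper.

```python
import numpy as np
from scipy.stats import norm

def em_univariate_gaussian(y, S, n_iter=200, n_starts=10, seed=0):
    """EM for a univariate Gaussian mixture, keeping the restart
    with the highest maximized log-likelihood."""
    rng = np.random.default_rng(seed)
    best_ll, best_params = -np.inf, None
    for _ in range(n_starts):
        lam = np.full(S, 1.0 / S)                  # mixing proportions
        mu = rng.choice(y, size=S, replace=False)  # random initial means
        sig = np.full(S, y.std())
        for _ in range(n_iter):
            # E-step, eq. (7): posterior membership probabilities tau[i, s]
            dens = lam * norm.pdf(y[:, None], mu, sig)
            tau = dens / dens.sum(axis=1, keepdims=True)
            # M-step: closed-form maximization of Q, eq. (6)
            nk = tau.sum(axis=0)
            lam, mu = nk / len(y), (tau * y[:, None]).sum(axis=0) / nk
            sig = np.sqrt((tau * (y[:, None] - mu) ** 2).sum(axis=0) / nk)
        ll = np.log((lam * norm.pdf(y[:, None], mu, sig)).sum(axis=1)).sum()
        if ll > best_ll:
            best_ll, best_params = ll, (lam, mu, sig)
    return best_ll, best_params
```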
3. Model selection
3.1. Model selection criteria
Model selection is an increasingly important part of data analysis. In fact, when dealing
with applications, one always has to decide which model is the most appropriate to
characterize the data structure, and several model selection criteria may be available for
this purpose. In the context of clustering via finite mixture models, the problem of
simultaneously choosing an adequate clustering structure and a particular number of
clusters can be approached by several methods:
1) Hypothesis tests
With a mixture model-based approach to clustering, regularity conditions fail to hold
for the likelihood ratio statistic (LRS): it does not have its usual asymptotic null
distribution of chi-square with $n_{\psi_1} - n_{\psi_0}$ degrees of freedom for testing the null
hypothesis that the true number of clusters is $S_0$ versus the alternative $S_1 > S_0$.
However, a re-sampling approach may be used to assess the p-value of the
LRS in testing those hypotheses [48]. Bootstrap samples are generated from the mixture
model fitted under the null hypothesis of $S_0$ clusters, and the value of the LRS is
computed for each bootstrap sample after fitting mixture models for $S_0$ and $S_1$ in turn.
The process is repeated independently a number of times, and the replicated values of
the LRS obtained from the successive bootstrap samples provide an assessment of the
bootstrap null distribution of the LRS, and hence of its true null distribution, thus
enabling an approximation to the p-value.
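A sketch of this bootstrap procedure, using scikit-learn's GaussianMixture as the fitting engine (our illustrative choice; the cited works do not prescribe a particular implementation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bootstrap_lrs_pvalue(y, s0, s1, n_boot=99, seed=0):
    """Parametric-bootstrap p-value of the LRS for H0: S = s0 vs H1: S = s1.
    y must be a 2-D array of shape (n, p)."""
    fit = lambda data, s: GaussianMixture(n_components=s, n_init=5,
                                          random_state=seed).fit(data)
    m0, m1 = fit(y, s0), fit(y, s1)
    # observed LRS: 2 * (LL_1 - LL_0); score_samples gives per-point log density
    lrs_obs = 2 * (m1.score_samples(y).sum() - m0.score_samples(y).sum())
    exceed = 0
    for _ in range(n_boot):
        yb, _ = m0.sample(len(y))          # simulate a sample under H0
        b0, b1 = fit(yb, s0), fit(yb, s1)  # refit both models on the bootstrap sample
        lrs_b = 2 * (b1.score_samples(yb).sum() - b0.score_samples(yb).sum())
        exceed += lrs_b >= lrs_obs
    return (exceed + 1) / (n_boot + 1)
```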
Regarding the specific case of testing the null hypothesis $H_0: f(y) = f_0(y; \mu, \sigma^2)$ versus
$H_1: f(y) = \lambda f_1(y; \mu_1, \sigma^2) + (1 - \lambda) f_2(y; \mu_2, \sigma^2)$, Kathryn Roeder [57] proposes the
construction of a "Mixture Detection Plot" (MDP). She considers the function
$f_1(y)/f_0(y) - 1$, which she shows is a good indicator for the presence of a mixture, since it
has the same number of sign changes as $f_1(x) - f_0(x)$.
The proposed diagnostic obtains a nonparametric density estimate (using the normal
kernel estimator) of $f_1(y)$, and the method-of-moments estimate of $f_0(y)$. By this plot
diagnostic, if the true model has 1 cluster, the mixture detection plot of $f_1(y)/f_0(y) - 1$
approximates a stationary Gaussian process; otherwise, if the true model has 2 clusters,
the plot approximates a bimodal function which tends to exhibit the expected four
sign changes (the MDP shows two relatively large modes with a well-defined dip).
2) Validity indices
Indices of validity for partitions are frequently interpreted as measures of partition
quality rather than of goodness of fit. Some cluster validity indices available in the
literature are Dunn's index [4], the Davies-Bouldin (DB) index [53], the Xie-Beni index [67],
and the stability index [26]. Dunn's index is a ratio of within-cluster and between-cluster
separations; the DB index is a function of the ratio of the sum of within-cluster scatter to
between-cluster scatter; the Xie-Beni index is a ratio of the fuzzy-cluster sum of squared
distances to the product of the number of elements and the minimum between-cluster
separation; the stability index is a stability-based technique based on the clusters'
immovability across partitions. See [30] for a good overview of this issue.
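As an illustration (our sketch, not code from the cited works), Dunn's index can be computed from a hard partition as the ratio of the minimum between-cluster separation to the maximum within-cluster diameter:

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn's index: min between-cluster separation / max within-cluster diameter.
    X: (n, p) data matrix; labels: (n,) hard cluster assignments."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # largest pairwise distance inside any cluster (diameter)
    diam = max(np.max(np.linalg.norm(a[:, None] - a[None, :], axis=-1))
               for a in clusters if len(a) > 1)
    # smallest pairwise distance between points of different clusters
    sep = min(np.min(np.linalg.norm(a[:, None] - b[None, :], axis=-1))
              for i, a in enumerate(clusters) for b in clusters[i + 1:])
    return sep / diam
```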
Bayes' rule, together with the fact that the posterior-probability matrix and a fuzzy partition
share the same mathematical structure, allows every cluster validity index to be transformed
into a measure that may be useful for mixture validation [7]. By doing so, Bezdek, Li,
Attikiouzel, and Windham [7] showed that validity measures that assess geometric properties
of partitions matching the expected structure in samples from mixtures of normal
distributions can be as effective as information criteria for estimating the number of
components.
3) Information Criteria
In the present work we specifically refer to the use of Information Criteria for mixture
model selection and we focus on the determination of the true number of clusters for
these models.
Several criteria are available for this purpose. AIC, the Akaike Information Criterion [2], and
BIC, the Bayesian Information Criterion [58], are perhaps the best known. These and
some other information criteria are presented in Table 1.
Table 1. Some information-based criteria for assessing the number of clusters in finite
mixture models

| Criterion | Definition | Author / Bibliography |
|---|---|---|
| AIC | $-2LL + 2n_\psi$ | [1] |
| AIC3 | $-2LL + 3n_\psi$ | [16] |
| AICc | $\mathrm{AIC} + 2n_\psi(n_\psi + 1)/(n - n_\psi - 1)$ | [36] |
| AICu | $\mathrm{AICc} + n \log(n/(n - n_\psi - 1))$ | [50] |
| CAIC | $-2LL + n_\psi(1 + \log n)$ | [17] |
| BIC/MDL | $-2LL + n_\psi \log n$ | [58] / [56] |
| CLC | $-2LL + 2EN(S)$ | [9] |
| ICL-BIC | $\mathrm{BIC} + 2EN(S)$ | [10] |
| NEC¹ | $\mathrm{NEC}(S) = EN(S)/(L(S) - L(1))$ | [12] |
| AWE | $-2LL_c + 2n_\psi(3/2 + \log n)$ | [5] |
| L | $-LL + \frac{n_\psi}{2}\sum_{s=1}^{S} \log\frac{n\lambda_s}{12} + \frac{S}{2}\log\frac{n}{12} + \frac{S(n_\psi+1)}{2}$ | [27] |

¹ We choose $S^*$ if $\mathrm{NEC}(S^*) \le 1$ ($2 \le S^* \le S$), with the convention $\mathrm{NEC}(1) = 1$; otherwise we declare no
clustering structure in the data.
These criteria balance fit (trying to maximize the likelihood function) and
parsimony (using penalties associated with measures of model complexity), trying to
avoid overfitting. Furthermore, fitting a model with a large number of clusters requires the
estimation of a very large number of parameters, with a consequent loss of precision in
these estimates [43].
The general form of the information criteria is

$$-2 \log L(\hat{\psi}) + C, \qquad (8)$$

where the first term is minus twice the logarithm of the maximized likelihood, which
decreases as model complexity increases; the second term, or penalty term,
penalizes overly complex models and increases with the number of model parameters.
Thus, the selected mixture model should show a good trade-off between a good
description of the data and the number of model parameters.
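As a worked example of the form (8), the penalties of Table 1 that depend only on $n_\psi$ and $n$ can be computed as follows (a sketch; the function and variable names are ours):

```python
import math

def information_criteria(ll, n_params, n):
    """Criteria of the form -2 log L + C (eq. 8), for the penalties
    of Table 1 that depend only on n_params (n_psi) and n."""
    base = -2.0 * ll
    aic = base + 2 * n_params
    aicc = aic + (2 * n_params * (n_params + 1)) / (n - n_params - 1)
    return {
        "AIC": aic,
        "AIC3": base + 3 * n_params,
        "AICc": aicc,
        "AICu": aicc + n * math.log(n / (n - n_params - 1)),
        "CAIC": base + n_params * (1 + math.log(n)),
        "BIC": base + n_params * math.log(n),
    }
```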
The emphasis on information criteria begins with the pioneering work of Akaike [2] and
the Akaike information criterion. AIC chooses the model with S clusters that minimises
(8) with $C = 2n_\psi$.
Later, Bozdogan [17] suggested the modified AIC criterion (AIC3) in the context of
multivariate normal mixture models, using 3 instead of 2 as the penalising factor, that is,
$C = 3n_\psi$. When a parameter vector lies on the boundary of the parameter space (as in the
case of the standard mixture problem), in comparing two models with $n_\psi$ and $n_{\psi^*}$
parameters, respectively, the likelihood ratio statistic has a non-central chi-square
distribution with $2(n_\psi - n_{\psi^*})$ degrees of freedom, instead of the $(n_\psi - n_{\psi^*})$ considered in AIC. As
a result, he obtained a penalization term $C = 2n_\psi + n_\psi = 3n_\psi$.
Another variant of AIC, the corrected AIC (AICc), was proposed in [36], focusing on a
small-sample bias adjustment (AIC may perform poorly if there are too many
parameters in relation to the sample size); AICc thus selects the model with S clusters that
minimises (8) with $C = 2 n_\psi n/(n - n_\psi - 1)$.
Since AICc still tends to overfit as the sample size increases [50], a new criterion was then
proposed, AICu, which considers a greater penalty for overfitting, especially as the
sample size increases.
The consistent AIC criterion (CAIC), with $C = n_\psi(1 + \log n)$, was derived by Bozdogan
[17]. It tends to select models with fewer parameters than AIC does.
The Bayesian Information Criterion (BIC) was proposed by Schwarz [58], who looked for
the appropriate modification of maximum likelihood by studying the asymptotic
behaviour of Bayes estimators under a class of proper priors which assign positive
probability to some lower-dimensional subspaces of the parameter vector. It corresponds to
$C = n_\psi \log n$, and is equivalent to MDL, the Minimum Description Length [56].
The CLC (Classification Likelihood Criterion) [47] originates from the
link between the observed log-likelihood and the log classification likelihood, $LL_c = LL -
EN(S)$. It considers $C = 2EN(S)$, where the term $2EN(S)$ penalizes poorly separated
clusters, with

$$EN(S) = -\sum_{i=1}^{n} \sum_{s=1}^{S} \tau_{is} \log \tau_{is}.$$
In order to account for the ability of the mixture model to give evidence for a clustering
structure of the data, Biernacki, Celeux, and Govaert [10] considered the integrated
likelihood of the complete data $(x, z)$, or Integrated Classification Likelihood criterion
(ICL); an approximation, referred to as ICL-BIC [47], chooses the model with S clusters
that minimises (8) with $C = 2EN(S) + n_\psi \log n$.
Biernacki, Celeux and Govaert [12] suggested an improved NEC (originally
introduced by Celeux and Soromenho [20]), which chooses the model with $S^*$ clusters if
$\mathrm{NEC}(S^*) \le 1$ for some $2 \le S^* \le S$, with the convention $\mathrm{NEC}(1) = 1$; otherwise NEC
declares that there is no clustering structure in the data.
Banfield and Raftery [5] suggested a Bayesian solution to the choice of the
number of clusters, based on an approximation of the classification likelihood: the so-called
approximate weight of evidence (AWE), which penalizes complex models more drastically
than BIC; it therefore selects more parsimonious models than BIC, except
for well-separated clusters, and chooses the model with S clusters that minimizes (8) with
$C = 2EN(S) + 2n_\psi(3/2 + \log n)$.
Finally, Figueiredo and Jain [27] proposed the L criterion for any type of parametric
mixture model for which an EM algorithm can be written; this criterion chooses
the model with S clusters that minimizes

$$-LL + \frac{n_\psi}{2}\sum_{s=1}^{S} \log\frac{n\lambda_s}{12} + \frac{S}{2}\log\frac{n}{12} + \frac{S(n_\psi+1)}{2}.$$
The penalty terms of AIC and AIC3 (see Table 1) depend only on the number of
parameters. Other penalties depend on both the number of parameters and the sample size,
as in AICc, AICu, CAIC and BIC/MDL; others depend on entropy, as in CLC and NEC;
some depend on the number of parameters, the sample size and entropy, as in ICL-BIC and
AWE; L depends on the number of parameters, the sample size and the mixing
proportions $\lambda_s$.
In the present work we specifically refer to the information criteria presented in Table 1.
All are currently in use for the estimation of mixture models. Their origins are diverse, as
illustrated in Table 2.
Table 2. Some history of the criteria for model selection on finite mixture models

| Proposed for | Criterion | Aim | Example of use on mixtures |
|---|---|---|---|
| Regression models | AIC | To select the order of an autoregressive model | [18] |
| | AICc | To correct AIC for bias, on regression models | [19] |
| | AICu | To achieve a better performance than AICc | --- |
| | CAIC | To make AIC asymptotically consistent | [8] |
| | BIC/MDL | To select the order of models in polynomial regression (Schwarz) or minimum description message length (Rissanen) | [62] |
| Clustering | AIC3 | Bozdogan's AIC correction for model selection on mixtures of multivariate normals | [3] |
| | CLC | For data sets with well-separated clusters, using a measure of entropy | [47] |
| | ICL-BIC | Related to BIC; favours well-separated clusters, using a measure of entropy | [10] |
| | NEC | Entropy criterion, making a compromise between clustering quality and fit quality of S clusters relative to one cluster | [27] |
| | AWE | Adds a third dimension to the information criteria, weighing fit, parsimony, and clustering performance | [6] |
| | L | Much less initialization dependent; automatically avoids the boundary of the parameter space | [68] |
3.2. Information criteria comparisons
There are some studies which refer to the comparison of Information Criteria for
mixture model selection.
Cutler and Windham [21] based their work on simulations of mixtures of bivariate
normal distributions and used AIC, AIC3, BIC/MDL, and ICOMP [15], among others,
as model selection criteria. They generated 500 data sets for each combination of sample
size, number of clusters, and level of separation, and showed that BIC/MDL and
AIC performed well, in that order, for the model with $\Sigma_s = \sigma^2 I$ and
$\lambda_s = 1/S$, $s = 1, \dots, S$; ICOMP was similar to AIC, but had lower success rates in general
in recovering the true structure of the data. For the model with $\Sigma_s = \sigma_s^2 I$, $s = 1, \dots, S$,
ICOMP had the best performance, and for the full model (the most general specification,
since no restrictions are imposed on the parameters) ICOMP had a good performance,
but BIC/MDL might be preferred, especially when the clusters are well separated. Thus,
this study suggests that the type of model affects the performance of information
criteria.
Bozdogan [16] simulated n = 300 observations from a three-dimensional multivariate
normal distribution with S = 3 clusters. He developed a Monte Carlo study and
replicated the experiment 100 times. The goal was to identify and estimate the
appropriate choice for the number S of clusters and to identify the best fitting model
using the ICOMP, AIC3, and CAIC model selection criteria, whose utility he
demonstrated.
Bezdek, Li, Attikiouzel, and Windham [7] studied the performance of several criteria,
namely AIC, AIC3, BIC/MDL, AWE, ICOMP, and NEC, as probabilistic indices. They
simulated 12 data sets from a two-dimensional normal mixture and showed that AIC3
had the best performance, followed by AIC, BIC/MDL, AWE, NEC, and ICOMP. They
also used other indices, such as the Xie-Beni and Dunn indices and their generalizations,
which were as effective as information criteria for assessing the number of clusters in a
normal mixture.
Biernacki [9] studied mixtures of normal distributions and showed that AIC3 and BIC
performed best; ICOMP performed worse, staying basically between AIC and AIC3.
However, in the situation of a non-normal cluster (uniform + normal), AIC, AIC3, BIC,
and ICOMP underestimated the number of clusters; on the contrary, NEC and AWE
showed a better performance.
The last two works mentioned above suggest that the type of clustering variables
may influence the performance of certain information criteria. A methodological
approach is therefore proposed in order to explore this hypothesis.
4. Methodology
The goal of the present paper is to try to establish a relationship between the type of
clustering variables and Information Criteria performance, when using mixture-model
cluster analysis. We focus on the capacity of several Information Criteria to discover the
true number of clusters and how it associates with the type of clustering variables:
categorical, continuous or mixed.
In order to meet this objective we rely on the analysis of several data sets. We thus
analyse forty-two data sets (available on the web2) for which the true
number of clusters is known. This approach enables comparisons between the true
(S*) and estimated (S) number of clusters for each data set.
As the main result of this methodological approach we then present a ranking of the
information criteria based on the proportion of data sets whose structure each criterion is
able to recover, i.e. the proportion of data sets yielding S = S*.
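In outline, the per-data-set procedure can be sketched as below, using BIC as an example criterion and scikit-learn for the Gaussian case (an illustrative choice, not the software used in the study):

```python
from sklearn.mixture import GaussianMixture

def estimated_number_of_clusters(y, s_max=6):
    """Fit mixtures with S = 1..s_max and return the S minimizing BIC."""
    bics = [GaussianMixture(n_components=s, n_init=10, random_state=0)
            .fit(y).bic(y) for s in range(1, s_max + 1)]
    return 1 + bics.index(min(bics))

# Structure recovery for one criterion, over data sets with known S*:
# recovery = mean(estimated_number_of_clusters(y) == s_star
#                 for y, s_star in data_sets)
```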
Table 3 summarizes the number of data sets analysed for each type of clustering
variables and identifies the data sets.
Table 3 Data sets analysed

| Type of clustering variables | Data sets analysed | Number |
|---|---|---|
| Categorical | Landis 77; Judges; Universalistic/particularistic; MAT; Store; Heinen 2; Heinen 3; Financial; Vdheijden; LSAT 6; Political; Gss82White; Gss82; Midtown Manhattan; Depression; Gss94; Coleman; Eye and Hair color; Hannover | 19 |
| Continuous | Glucose in C5YSF1; haemophilia A carriers; Les Papillons; Chaetocnema insect data; H. oleracea / H. carduorum; Males of 3 sp. Chaetocnema; With and Angle; Diabetes; Iris of Fisher; Japanese black pine; Halibut; K-means; Pearson trypanosome | 13 |
| Mixed: continuous and categorical | Bird survival; Methadone treatment; North Central Wisconsin; Hepatitis; Neolithic Tools; Imports-85; Heart; Cancer; AIDS; Ethylene Glycol | 10 |
Tables A1, A2, and A3 (in the Appendix) include additional information concerning the data
sets used in this study, namely the type of clustering variables, the sample size, and
the true number of clusters.
Since the data sets' sample sizes and data dimensions are known, we also provide separate
results for different sample size and data dimension categories.
In order to evenly distribute the available data sets we consider two sample size
categories: n ≤ 130 and n > 130 for the normal multivariate and mixed cases, and n ≤ 1000
and n > 1000 for the multinomial cases. This cut-off also takes into account the fact that,
empirically, n = 100 is large enough to support asymptotic approximations of the results
[61].
2 http://www.statisticalinnovations.com/products/latentgold datasets and
http://www.ics.uci.edu/~mlearn/MLSummary.
In presenting the results we also consider an entropy cut-off value of 0.7, since several
criteria (AICu, CAIC, BIC, L, CLC, ICL-BIC, NEC, and AWE) were found to perform
better for values above this cut-off.
In addition, we measure the degree of separation between clusters using the index of Jedidi,
Ramaswamy, and DeSarbo [38],

$$E_s = 1 - \frac{-\sum_{i=1}^{n}\sum_{s=1}^{S} \hat{\tau}_{is} \ln \hat{\tau}_{is}}{n \ln S},$$

where $-\sum_{i=1}^{n}\sum_{s=1}^{S} \hat{\tau}_{is} \ln \hat{\tau}_{is}$ is the entropy term which measures the clusters' overlap. It is a
relative measure and is bounded between 0 and 1. Values close to 1 indicate that the
clusters are well separated, and values close to 0 indicate that we are dealing with
overlapping clusters.
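Given the matrix of posterior probabilities $\hat{\tau}_{is}$, both the entropy term $EN(S)$ (used in the CLC and ICL-BIC penalties) and the separation index $E_s$ are straightforward to compute; a minimal sketch, with names of our own choosing:

```python
import numpy as np

def entropy_and_separation(tau):
    """tau: (n, S) matrix of posterior membership probabilities.
    Returns EN(S) and the Jedidi-Ramaswamy-DeSarbo index E_s."""
    n, S = tau.shape
    # clip to avoid log(0) for near-degenerate posteriors
    en = -np.sum(tau * np.log(np.clip(tau, 1e-300, None)))
    e_s = 1.0 - en / (n * np.log(S))
    return en, e_s
```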
5. Experiments
In the present work forty-two mixture models are estimated on the data sets
presented in Table 3. In order to illustrate the modelling procedure, we detail it
for three data sets: Landis 77 (categorical clustering variables),
Diabetes (continuous clustering variables), and Heart (mixed clustering variables). For each
data set we focus in particular on the capacity of the alternative information criteria to yield
the true number of clusters, S*. Tables 4, 5 and 6 report the results for
Landis 77, Diabetes and Heart, respectively. In those tables, "< true S*" means that the
criterion underestimates S*, and "> true S*" means that the criterion overestimates S*.
The multivariate mixture models for clustering considered for the data sets
deal with clustering base variables which are categorical, continuous, or mixed:

- When all of the clustering base variables are categorical, a mixture of multinomial probability functions is adopted.
- When the clustering base variables are all continuous, we propose a multivariate Gaussian mixture model.
- When using mixed clustering base variables, we specify the appropriate univariate distribution function for each element of $Y_i$: normal for continuous variables and multinomial for categorical variables.
In all mixture models it is generally assumed that the clustering base variables are
mutually independent within clusters. In general, it is possible to include local
dependences between clustering base variables by using appropriate multivariate
rather than univariate distributions for sets of locally dependent variables (a multivariate
normal distribution for sets of continuous variables, and a joint multinomial distribution
for a set of categorical variables) [63].
5.1. Landis 77 data set
Landis 77 data set consists of 118 entities and 7 categorical clustering variables. The
clustering base variables are presence/absence of carcinoma in the uterine cervix,
according to 7 pathologists, and the true number of clusters (S*) is known to be 3.
The adopted mixture model considers the local independence assumption.
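Under local independence, the multinomial mixture density for binary items factorizes over the variables; a minimal sketch of the corresponding log-likelihood (names are illustrative, not from the paper):

```python
import numpy as np

def latent_class_log_likelihood(Y, lam, theta):
    """Y: (n, p) binary data; lam: (S,) mixing proportions;
    theta: (S, p) item probabilities, theta[s, j] = P(Y_j = 1 | cluster s).
    Local independence: f_s(y_i) = prod_j theta_sj^y_ij * (1 - theta_sj)^(1 - y_ij)."""
    log_f = Y @ np.log(theta).T + (1 - Y) @ np.log(1 - theta).T  # (n, S)
    return float(np.log(np.exp(log_f) @ lam).sum())
```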
Table 4. Information criteria results for the Landis 77 data set

| Clusters | LL | AIC | AIC3 | AICc | AICu | CAIC | BIC | CLC | ICL-BIC | NEC | AWE | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -524.5 | 1062.9 | 1069.9 | 1063.9 | 1072.2 | 1089.3 | 1082.3 | 1048.9 | | | | -537.6 |
| 2 | -318.2 | 666.3 | 681.3 | 671.0 | 688.2 | 722.9 | 707.9 | 643.0 | | | | -378.9 |
| 3 | -294.9 | 635.9 | 658.9 | 647.6 | 674.4 | 722.6 | 699.6 | 611.9 | < true S* | < true S* | < true S* | -339.4 |
| 4 | -290.7 | 643.5 | 674.5 | 666.5 | 703.9 | 760.4 | 729.4 | 629.8 | | | | -447.4 |
Table 4 reports the results for this data set. The criteria AIC, AIC3, AICc, AICu, CAIC,
BIC, CLC, and L attain their minima at S = 3; they are all able to recover the known data
structure.
The relative cluster sizes are 44.5%, 37.5% and 18%. $E_s$ = 0.915 indicates that the clusters
are well separated.
5.2. Diabetes data set
The Diabetes data set includes 145 entities, described by three continuous clustering
variables, with three clusters. The clustering base variables are: glucose, insulin and
SSPG.
The adopted mixture model considers the local independence assumption, except for
clustering variables glucose and insulin. The criteria selecting a mixture model with
three clusters (Table 5) are CAIC, BIC, and ICL-BIC.
Table 5. Information criteria results for the Diabetes data set

| Clusters | LL | AIC | AIC3 | AICc | AICu | CAIC | BIC | CLC | ICL-BIC | NEC | AWE | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -2560.4 | | | | | 5162.6 | 5155.6 | | 5155.6 | | | |
| 2 | -2380.3 | | | | | 4850.2 | 4835.2 | | 4842.7 | | | |
| 3 | -2320.6 | > true S* | > true S* | > true S* | > true S* | 4778.6 | 4755.6 | > true S* | 4804.2 | < true S* | < true S* | < true S* |
| 4 | -2303.1 | | | | | 4791.6 | 4760.6 | | 4812.3 | | | |
The overlap index $E_s$ is 0.847, indicating moderately separated clusters.
5.3. Heart data set
The Heart data set contains 270 entities, described by 12 clustering variables (five
continuous, and seven categorical), with two clusters. The clustering base variables are:
age, resting blood pressure, serum cholesterol (mg/dl), maximum heart rate achieved
and ST depression induced by exercise relative to rest, which are continuous; sex, chest
pain type (1, 2, 3, 4), fasting blood sugar > 120mg/dl (1 = true, 0 = false), resting
electrocardiographic results (0, 1, 2), exercise induced angina (0, 1), slope (1, 2, 3),
number of major vessels colored by flourosopy (0, 1, 2, 3), which are categorical. Table
6 reports the results of the analysis; the adopted mixture model considers the local
independence assumption, and selects a mixture model with two clusters, based on the
following criteria: BIC, ICL-BIC, CAIC, AWE and L. Relative sizes of two clusters are
60% and 40%. Es = 0.816, indicates that the clusters are moderately separated.
Table 6. Information criteria results for the Heart data set

| Clusters | LL | AIC | AIC3 | AICc | AICu | CAIC | BIC | CLC | ICL-BIC | NEC | AWE | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -6745.8 | | | | | 13643.4 | 13620.4 | | 14184.7 | | 13818.1 | -6795.2 |
| 2 | -6576.0 | > true S* | > true S* | > true S* | > true S* | 13429.2 | 13387.2 | > true S* | 13882.3 | > true S* | 13817.1 | -6722.9 |
| 3 | -6528.4 | | | | | 13459.3 | 13398.3 | | 13915.2 | | 13997.3 | -6779.7 |
An additional result is presented in Table 7 (confusion matrix), which shows the
percentage of correctly and incorrectly classified entities in each cluster: 95.3% of the
entities of cluster 1 are correctly classified in cluster 1, and 86.7% of the entities of cluster 2
are correctly classified in cluster 2.
Table 7. Confusion matrix (Heart data set)

| True clusters | Model cluster 1 | Model cluster 2 | Total |
|---|---|---|---|
| 1 | 143 (95.3%) | 7 (4.7%) | 150 |
| 2 | 16 (13.3%) | 104 (86.7%) | 120 |
| Total | 159 | 111 | 270 |
6. Discussion and perspectives
In Table 8 we report, for each information-based criterion and each type of clustering base
variables, the percentage of cases in which the true structure of the data sets is recovered,
based on the entire sample of data sets. The bold values identify the criterion with the best
performance for each type of clustering variables.
Table 8 Structure recovery (%)

| Clustering variables | AIC | AIC3 | AICc | AICu | CAIC | BIC | CLC | ICL-BIC | NEC | AWE | L |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Multinormal | 62 | 62 | 39 | 31 | 69 | **77** | 23 | 54 | 46 | 23 | 54 |
| Multinomial | 84 | **95** | 84 | 90 | 74 | 84 | 21 | 16 | 21 | 16 | 54 |
| Mixed | 40 | 30 | 20 | 40 | 70 | 70 | 10 | **80** | 40 | 40 | 60 |
According to the obtained results, we conclude that AIC3 is the best performing
criterion when categorical variables are considered (multinomial distribution adopted),
BIC performs better when continuous variables are considered (normal multivariate
distribution adopted), and ICL-BIC has the best performance when mixed variables are
considered (multinomial and normal multivariate distributions adopted).
For data sets with categorical clustering variables the AIC3 criterion achieves an
excellent performance (see Table 8), with 95% structure recovery. AIC3 is the best
information criterion to use across a large variety of multinomial data configurations.
As can be seen (Table A5), its performance is very good (100%) for data dimension
p ≤ 4, very good (100%) for small sample sizes, and very good (92%) for
overlapped clusters ($E_s$ ≤ 0.7). It was followed by AICu, with 90% structure
recovery.
This conclusion is in accordance with Dias [24], who used a Monte Carlo study to
compare the performance of information criteria for selecting the number of
clusters of latent class models, i.e. finite mixtures of conditionally independent
multinomial distributions. His results showed that AIC3 had the best overall success
rate, 72.9%, and outperformed other criteria such as AIC, BIC, and CAIC.
BIC (Table 8) is the criterion with the best performance in the normal multivariate cases,
with 77% structure recovery, followed by CAIC with 69%. BIC (Table A9) performs
very well (100%) for p ≤ 4, and performs well (80%) for small sample sizes and for
overlapping clusters (77%).
ICL-BIC (Table 8) is the criterion with the best performance on mixed data sets, with
80% structure recovery, followed by CAIC and BIC, both with 70%, and L (60%); it
performs well regardless of the number of variables and the sample size (Table A11).
Fig. 1. Criteria results (% of structure recovery) on multinomial, multivariate normal, and mixed base clustering
variables. [Bar chart: % of structure recovery for each criterion (AIC, AIC3, AICc, AICu, AWE, BIC, CAIC, CLC, ICL-BIC, L, NEC) on the three types of clustering base variables.]
The performance of the AIC family of criteria (AICu, AICc, AIC3, AIC) and also of ICL-
BIC seems to be particularly sensitive to the type of clustering variables (see Fig. 1). This
conclusion supports the initial hypothesis of the present paper.
It also seems (Fig. 1) that the L criterion (followed by CAIC) is the least influenced by the type
of clustering variables. This may be associated with the fact that the L criterion was
proposed for any type of parametric mixture model.
In general, this study indicates the existence of a relationship between the performance
of some information criteria and the types of variables considered for clustering with
mixture models.
This conclusion needs further research: the link between the performance of an
information criterion and the types of clustering variables is, so far, only empirically
observed. In future work, the analysis of simulated data sets should be conducted,
providing means to confirm and describe this link.
Acknowledgements
The authors wish to thank the referees, for their many valuable suggestions, which led
to a significant improvement of the article.
References
[1] H. Akaike, Information Theory and an Extension of the Maximum Likelihood
Principle, in B. N. Petrov and F. Csaki, eds., Proceedings of the Second
International Symposium on Information Theory, Akademiai Kiado, Budapest,
1973, pp. 267-281; reprinted in E. Parzen, K. Tanabe and G. Kitagawa, eds.,
Selected Papers of Hirotugu Akaike, Springer-Verlag, New York.
[2] H. Akaike, Maximum likelihood identification of Gaussian autoregressive
moving average models, Biometrika, 60 (1973), pp. 255-265.
[3] R. L. Andrews and I. S. Currim, A comparison of segment retention for finite
Mixture Logit Models, Journal of Marketing Research, XI (2003a), pp. 235-243.
[4] S. Bandyopadhyay and U. Maulik, Nonparametric Genetic Clustering:
Comparison of Validity Indices, IEEE Transactions on Systems, Man, and
Cybernetics-Part C: Applications and Reviews, 31 (2001), pp. 120-125.
[5] J. D. Banfield and A. E. Raftery, Model-Based Gaussian and Non-Gaussian
Clustering, Biometrics, 49 (1993), pp. 803-821.
[6] H. Bensmail and J. J. Meulman, Model-based Clustering with Noise: Bayesian
Inference and Estimation, Journal of Classification, 20 (2003), pp. 49-76.
[7] J. C. Bezdek, W. Q. Li, Y. Attikiouzel and M. Windham, A geometric approach
to cluster validity for normal mixtures, Soft Computing, 1 (1997), pp. 166-179.
[8] A. Bhatnagar and S. Ghose, A latent class segmentation analysis of e-shoppers,
Journal of Business Research, 57 (2004), pp. 758-767.
[9] C. Biernacki, Choix de modèles en classification, Ph.D. Thesis, 1997.
[10] C. Biernacki, G. Celeux and G. Govaert, Assessing a Mixture Model for
Clustering with the Integrated Completed Likelihood, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 22 (2000), pp. 719-725.
[11] C. Biernacki, G. Celeux and G. Govaert, Choosing starting values for the EM
algorithm for getting the highest likelihood in multivariate Gaussian mixture
models, Computational Statistics & Data Analysis, 41 (2003), pp. 561-575.
[12] C. Biernacki, G. Celeux and G. Govaert, An improvement of the NEC criterion
for assessing the number of clusters in mixture model, Pattern Recognition
Letters, 20 (1999), pp. 267-272.
[13] C. Biernacki and G. Govaert, Using the classification likelihood to choose the
number of clusters, Computing Science and Statistics, 29 (1997), pp. 451-457.
[14] S. Boucheron and E. Gassiat, Order Estimation and Model Selection, in O.
Cappé and T. Rydén, eds., Inference in Hidden Markov Models, 2002, pp. 25.
[15] H. Bozdogan, ICOMP: A New Model Selection Criterion, in H. H. Bock, ed.,
Classification and Related Methods of Data Analysis, North-Holland,
Amsterdam, 1988, pp. 599-608.
[16] H. Bozdogan, Mixture-Model Cluster Analysis using Model Selection Criteria
and a New Informational Measure of Complexity, in H. Bozdogan, ed.,
Proceedings of the First US/Japan Conference on the Frontiers of Statistical
Modeling: An Informational Approach, Kluwer Academic Publishers, 1994,
pp. 69-113.
[17] H. Bozdogan, Model Selection and Akaike's Information Criterion (AIC): The
General Theory and its Analytical Extensions, Psychometrika, 52 (1987), pp.
345-370.
[18] H. Bozdogan and P. Bearse, Information complexity criteria for detecting
influential observations in dynamic multivariate linear models using the genetic
algorithm, Journal of Statistical Planning and Inference, 114 (2003), pp. 31-44.
[19] K. P. Burnham and D. R. Anderson, Multimodel Inference. Understanding AIC
and BIC in Model Selection, Sociological Methods & Research, 33 (2004), pp.
261-304.
[20] G. Celeux and G. Soromenho, An entropy criterion for assessing the number of
clusters in a mixture model, Journal of Classification, 13 (1996), pp. 195-212.
[21] A. Cutler and M. P. Windham, Information-Based Validity Functionals for
Mixture Analysis, in H. Bozdogan, ed., First US/Japan Conference on the
Frontiers of Statistical Modeling: An Informational Approach, Kluwer
Academic Publishers, 1994.
[22] N. E. Day, Estimating the Components of a Mixture of Normal Distributions,
Biometrika, 56 (1969), pp. 463-474.
[23] A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum Likelihood from
Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society,
Series B, 39 (1977), pp. 1-38.
[24] J. G. Dias, Finite Mixture Models: Review, Applications, and Computer-
intensive Methods, Ph.D. Thesis, Economics, Groningen University, Groningen,
2004, pp. 199.
[25] W. R. Dillon and A. Kumar, Latent structure and other mixture models in
marketing: An integrative survey and overview, chapter 9 in R. P. Bagozzi, ed.,
Advanced Methods of Marketing Research, Blackwell Publishers, Cambridge,
1994, pp. 352-388.
[26] A. F. Famili, G. Liu and Z. Liu, Evaluation and optimization of clustering in
gene expression data analysis, Bioinformatics, 20 (2004), pp. 1535-1545.
[27] M. A. T. Figueiredo and A. K. Jain, Unsupervised Learning of Finite Mixture
Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24
(2002), pp. 1-16.
[28] J. R. S. Fonseca and M. G. M. S. Cardoso, Retail Clients Latent Segments, in
C. Bento, A. Cardoso and G. Dias, eds., Progress in Artificial Intelligence:
Proceedings of the 12th Portuguese Conference on Artificial Intelligence, EPIA
2005, December 5-8, Lecture Notes in Computer Science, Vol. 3808,
Springer-Verlag, 2005, pp. 348-358.
[29] J. G. Fryer and C. A. Robertson, A Comparison of Some Methods for
Estimating Mixed Normal Distributions, Biometrika, 59 (1972), pp. 639-648.
[30] M. Halkidi, Y. Batistakis and M. Vazirgiannis, On Clustering Validation
Techniques, Journal of Intelligent Information Systems, 17 (2001), pp. 107-145.
[31] P. Hall and D. M. Titterington, Efficient Nonparametric Estimation of Mixture
Proportions, Journal of the Royal Statistical Society, Series B, 46 (1984), pp.
465-473.
[32] R. J. Hathaway, A Constrained Formulation of Maximum-Likelihood Estimation
for Normal Mixture Distributions, The Annals of Statistics, 13 (1985), pp. 795-
800.
[33] E. R. Hruschka and N. F. F. Ebecken, A genetic algorithm for cluster analysis,
Intelligent Data Analysis, 7 (2003), pp. 15-25.
[34] L. Hunt and M. Jorgensen, Mixture model clustering for mixed data with missing
information, Computational Statistics & Data Analysis, 41 (2003), pp. 429-440.
[35] L. A. Hunt and K. E. Basford, Fitting a Mixture Model to Three-Mode Three-
Way Data with Categorical and Continuous Variables, Journal of Classification,
16 (1999), pp. 283-296.
[36] C. M. Hurvich and C.-L. Tsai, Regression and Time Series Model Selection in
Small Samples, Biometrika, 76 (1989), pp. 297-307.
[37] L. F. James, C. E. Priebe and D. J. Marchette, Consistent Estimation of Mixture
Complexity, The Annals of Statistics, 29 (2001), pp. 1281-1296.
[38] K. Jedidi, V. Ramaswamy and W. S. DeSarbo, A Maximum Likelihood Method
for Latent Class Regression Involving a Censored Dependent Variable,
Psychometrika, 58 (1993), pp. 375-394.
[39] A. B. M. L. Kabir, Estimation of Parameters of a Finite Mixture of
Distributions, Journal of the Royal Statistical Society, Series B
(Methodological), 30 (1968), pp. 472-482.
[40] C. Keribin, Estimation consistante de l'ordre de modèles de mélange, Comptes
Rendus de l'Académie des Sciences, Paris, t. 326, Série I (1998), pp. 243-248.
[41] Y. Kim, W. N. Street and F. Menczer, Evolutionary model selection in
unsupervised learning, Intelligent Data Analysis, 6 (2002), pp. 531-556.
[42] B. G. Leroux, Consistent Estimation of a Mixing Distribution, The Annals of
Statistics, 20 (1992), pp. 1350-1360.
[43] B. G. Leroux and M. L. Puterman, Maximum-Penalized-Likelihood Estimation
for Independent and Markov-Dependent Mixture Models, Biometrics, 48 (1992),
pp. 545-558.
[44] R. Maronna and P. M. Jacovkis, Multivariate Clustering Procedures with
variable Metrics, Biometrics, 30 (1974), pp. 499-505.
[45] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions, John Wiley
& Sons, New York, 1997.
[46] G. J. McLachlan and N. Khan, On a resampling approach for tests on the
number of clusters with mixture model-based clustering of tissue samples,
Journal of Multivariate Analysis, 90 (2004), pp. 90-105.
[47] G. J. McLachlan and D. Peel, Finite Mixture Models, John Wiley & Sons, Inc.,
2000.
[48] G. J. McLachlan, On Bootstrapping the Likelihood Ratio Test Statistic for the
Number of Components in a Normal Mixture, Applied Statistics, 36 (1987), pp.
318-324.
[49] G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications
to Clustering, Marcel Dekker, Inc., New York, 1988.
[50] A. McQuarrie, R. Shumway and C.-L. Tsai, The model selection criterion AICu,
Statistics & Probability Letters, 34 (1997), pp. 285-292.
[51] I. Moustaki and I. Papageorgiou, Latent class models for mixed variables with
applications in Archaeometry, Computational Statistics & Data Analysis, In
Press (2004).
[52] S. Newcomb, A Generalized Theory of the Combination of Observations so as to
Obtain the Best Result, American Journal of Mathematics, 8 (1886), pp. 343-
366.
[53] M. K. Pakhira, S. Bandyopadhyay and U. Maulik, Validity index for crisp and
fuzzy clusters, Pattern Recognition, 37 (2004), pp. 487-501.
[54] K. Pearson, On the probability that two independent Distributions of frequency
are really samples of the same population with special reference to recent work
on the identity of trypanosome strains, Biometrika, 10 (1914), pp. 85-143.
[55] Y. Qu and S. Xu, Supervised cluster analysis for microarray data based on
multivariate Gaussian mixture, Bioinformatics, 20 (2004), pp. 1905-1913.
[56] J. Rissanen, Modeling by shortest data description, Automatica, 14 (1978), pp.
465-471.
[57] K. Roeder, A Graphical Technique for Determining the Number of Components
in a Mixture of Normals, Journal of the American Statistical Association, 89
(1994), pp. 487-495.
[58] G. Schwarz, Estimating the Dimension of a Model, The Annals of Statistics, 6
(1978), pp. 461-464.
[59] J. Seo, M. Bakay, Y.-W. Chen, S. Hilmer, B. Shneiderman and E. P. Hoffman,
Interactively optimizing signal-to-noise ratios in expression profiling: project-
specific algorithm selection and detection p-value weighting in Affymetrix
microarrays, Bioinformatics, 20 (2004), pp. 2534-2544.
[60] M. Shaked, On Mixtures from Exponential Families, Journal of the Royal
Statistical Society, Series B (Methodological), 42 (1980), pp. 192-198.
[61] R. Shibata, Information Criteria for Statistical Model Selection, Electronics and
Communications in Japan, Part 3, 85 (2002), pp. 605-611.
[62] J. Tao, N.-Z. Shi and S.-Y. Lee, Drug risk assessment with determining the
number of sub-populations under finite mixture normal models, Computational
Statistics & Data Analysis, 46 (2004), pp. 661-676.
[63] J. K. Vermunt and J. Magidson, Latent class cluster analysis., J.A. Hagenaars
and A.L. McCutcheon (eds.), Applied Latent Class Analysis, 89-106., Cambridge
University Press, 2002.
[64] M. Vriens, Market Segmentation: Analytical Developments and Applications
Guidelines, Technical Overview Series, Millward Brown IntelliQuest, 2001, pp.
1-42.
[65] H. X. Wang, Q. B. Zhang, B. Luo and S. Wei, Robust mixture modelling using
multivariate t-distribution with missing information, Pattern Recognition Letters,
25 (2004), pp. 701-710.
[66] R. Wehrens, L. M. C. Buydens, C. Fraley and A. E. Raftery, Model-Based
Clustering for Image Segmentation and Large Datasets via Sampling, Journal of
Classification, 21 (2004), pp. 231-253.
[67] X. L. Xie and G. Beni, A Validity Measure for Fuzzy Clustering, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 13 (1991), pp. 841-
847.
[68] X. Yang and S. M. Krishnan, Image segmentation using finite mixtures and
spatial information, Image and Vision Computing, 22 (2004), pp. 735-745.
Appendix
Table A1 Categorical clustering variables

| Data set | Clustering variables | Sample size | Number of clusters |
|---|---|---|---|
| Landis 77 | 7 categorical | 118 | 3 |
| Judges | 3 categorical | 164 | 3 |
| Universalistic/particularistic | 4 categorical | 216 | 2 |
| MAT | 4 categorical | 264 | 2 |
| Store | 5 categorical | 412 | 2 |
| Heinen 2 | 5 categorical | 542 | 3 |
| Heinen 3 | 5 categorical | 542 | 3 |
| Financial | 4 categorical | 743 | 2 |
| Vdheijden | 3 categorical (2 Cov.) | 811 | 2 |
| LSAT 6 | 5 categorical | 1000 | 2 |
| Political | 5 categorical | 1156 | 2 |
| Gss82 White | 5 categorical | 1202 | 3 |
| Gss82 | 5 categorical (1 Cov.) | 1644 | 3 |
| Midtown Manhattan | 2 categorical | 1660 | 2 |
| Depression | 5 categorical (1 Cov.) | 1710 | 3 |
| Gss94 | 3 categorical (1 Cov.) | 1850 | 2 |
| Coleman | 2 categorical | 3398 | 2 |
| Eye and Hair color | 2 categorical | 5387 | 3 |
| Hannover | 5 categorical | 7162 | 4 |
Table A2 Normal clustering variables

| Data set | Clustering variables | Sample size | Number of clusters |
|---|---|---|---|
| Glucose in C5YSF1 | 5 Normal | 22 | 3 |
| Haemophilia A carriers | 2 Normal | 23 | 2 |
| Les Papillons | 3 Normal | 23 | 4 |
| Chaetocnema insect data | 3 Normal | 30 | 3 |
| H. oleracea / H. carduorum | 4 Normal | 39 | 2 |
| Males of 3 sp. Chaetocnema | 6 Normal | 74 | 3 |
| With and Angle | 2 Normal | 74 | 3 |
| Diabetes | 3 Normal | 145 | 3 |
| Iris of Fisher | 4 Normal | 150 | 3 |
| Japanese black pine | 2 Normal | 204 | 3 |
| Halibut | 2 Normal | 208 | 2 |
| K-means | 2 Normal | 300 | 2 |
| Pearson trypanosome | 2 Normal | 1000 | 2 |
Table A3 Mixed clustering variables

| Data set | Clustering variables | Sample size | Number of clusters |
|---|---|---|---|
| Bird survival | 3 Normal, 1 Categorical | 50 | 2 |
| Methadone treatment | 2 Normal, 2 Categorical | 238 | 2 |
| North Central Wisconsin | 4 log-Normal, 1 Categorical | 34 | 3 |
| Hepatitis | 5 Normal, 5 Categorical | 80 | 2 |
| Neolithic Tools | 2 Normal, 3 Categorical | 103 | 3 |
| Imports-85 | 15 Normal, 11 Categorical | 160 | 7 |
| Heart | 6 Normal, 7 Categorical | 270 | 2 |
| Cancer | 8 Normal, 4 Categorical | 471 | 3 |
| AIDS | 3 Normal, 3 Categorical | 944 | 4 |
| Ethylene Glycol | 2 Normal, 1 Categorical | 1028 | 2 |
Table A4 AIC performance (structure recovery, %)

| AIC | Multinomial | Normal multivariate | Mixed |
|---|---|---|---|
| Overall | 84% (16/19) | 62% (8/13) | 40% (4/10) |
| Number of variables | ≤4: 78% (7/9); >4: 90% (9/10) | ≤2: 67% (4/6); >2: 57% (4/7) | <10: 40% (2/5); ≥10: 40% (2/5) |
| Sample size | ≤1000: 80% (8/10); >1000: 89% (8/9) | ≤130: 86% (6/7); >130: 33% (2/6) | ≤130: 40% (2/5); >130: 40% (2/5) |
| Entropy ($E_s$) | ≤0.7: 92% (12/13); >0.7: 67% (4/6) | ≤0.7: -; >0.7: 62% (8/13) | ≤0.7: -; >0.7: 40% (4/10) |
Table A5 AIC3 performance (structure recovery, %)

| AIC3 | Multinomial | Normal multivariate | Mixed |
|---|---|---|---|
| Overall | 95% (18/19) | 62% (8/13) | 30% (3/10) |
| Number of variables | ≤4: 100% (9/9); >4: 90% (9/10) | ≤2: 67% (4/6); >2: 57% (4/7) | <10: 40% (2/5); ≥10: 20% (1/5) |
| Sample size | ≤1000: 100% (10/10); >1000: 89% (8/9) | ≤130: 57% (4/7); >130: 67% (4/6) | ≤130: 20% (1/5); >130: 40% (2/5) |
| Entropy ($E_s$) | ≤0.7: 92% (12/13); >0.7: 83% (5/6) | ≤0.7: -; >0.7: 62% (8/13) | ≤0.7: -; >0.7: 40% (4/10) |
Table A6 AICc performance (structure recovery, %)

| AICc | Multinomial | Normal multivariate | Mixed |
|---|---|---|---|
| Overall | 84% (16/19) | 39% (5/13) | 20% (2/10) |
| Number of variables | ≤4: 78% (7/9); >4: 100% (9/9) | ≤2: 50% (3/6); >2: 29% (2/7) | <10: 20% (1/5); ≥10: 20% (1/5) |
| Sample size | ≤1000: 80% (8/10); >1000: 89% (8/9) | ≤130: 29% (2/7); >130: 50% (3/6) | ≤130: 20% (1/5); >130: 20% (1/5) |
| Entropy ($E_s$) | ≤0.7: 92% (12/13); >0.7: 67% (4/6) | ≤0.7: -; >0.7: 39% (5/13) | ≤0.7: -; >0.7: 20% (2/10) |
Table A7 AICu performance (structure recovery, %)

| AICu | Multinomial | Normal multivariate | Mixed |
|---|---|---|---|
| Overall | 90% (17/19) | 31% (4/13) | 40% (4/10) |
| Number of variables | ≤4: 100% (9/9); >4: 80% (8/10) | ≤2: 50% (3/6); >2: 14% (1/7) | <10: 40% (2/5); ≥10: 40% (2/5) |
| Sample size | ≤1000: 90% (9/10); >1000: 89% (8/9) | ≤130: 29% (2/7); >130: 33% (2/6) | ≤130: 40% (2/5); >130: 40% (2/5) |
| Entropy ($E_s$) | ≤0.7: 85% (11/13); >0.7: 100% (6/6) | ≤0.7: -; >0.7: 31% (4/13) | ≤0.7: -; >0.7: 40% (4/10) |
Table A8 CAIC performance (structure recovery, %)

| CAIC | Multinomial | Normal multivariate | Mixed |
|---|---|---|---|
| Overall | 74% (14/19) | 69% (9/13) | 70% (7/10) |
| Number of variables | ≤4: 100% (9/9); >4: 50% (5/10) | ≤2: 83% (5/6); >2: 57% (4/7) | <10: 40% (2/5); ≥10: 100% (5/5) |
| Sample size | ≤1000: 78% (7/9); >1000: 70% (7/10) | ≤130: 57% (4/7); >130: 83% (5/6) | ≤130: 60% (3/5); >130: 80% (4/5) |
| Entropy ($E_s$) | ≤0.7: 62% (9/13); >0.7: 100% (6/6) | ≤0.7: -; >0.7: 69% (9/13) | ≤0.7: -; >0.7: 70% (7/10) |
Table A9 BIC performance (structure recovery, %)

| BIC | Multinomial | Normal multivariate | Mixed |
|---|---|---|---|
| Overall | 84% (16/19) | 77% (10/13) | 70% (7/10) |
| Number of variables | ≤4: 100% (9/9); >4: 70% (7/10) | ≤2: 83% (5/6); >2: 71% (5/7) | <10: 60% (3/5); ≥10: 80% (4/5) |
| Sample size | ≤1000: 80% (8/10); >1000: 89% (8/9) | ≤130: 71% (5/7); >130: 83% (5/6) | ≤130: 60% (3/5); >130: 80% (4/5) |
| Entropy ($E_s$) | ≤0.7: 77% (10/13); >0.7: 100% (6/6) | ≤0.7: -; >0.7: 77% (10/13) | ≤0.7: -; >0.7: 70% (7/10) |
Table A10 CLC performance (structure recovery, %)

| CLC | Multinomial | Normal multivariate | Mixed |
|---|---|---|---|
| Overall | 21% (4/19) | 23% (3/13) | 10% (1/10) |
| Number of variables | ≤4: 33% (3/9); >4: 10% (1/10) | ≤2: 33% (2/6); >2: 14% (1/7) | <10: 20% (1/5); ≥10: - |
| Sample size | ≤1000: 30% (3/10); >1000: 11% (1/9) | ≤130: 28% (2/7); >130: 17% (1/6) | ≤130: -; >130: 20% (1/5) |
| Entropy ($E_s$) | ≤0.7: -; >0.7: 67% (4/6) | ≤0.7: -; >0.7: 23% (3/13) | ≤0.7: -; >0.7: 10% (1/10) |
Table A11 ICL-BIC performance (structure recovery, %)

| ICL-BIC | Multinomial | Normal multivariate | Mixed |
|---|---|---|---|
| Overall | 16% (3/19) | 54% (7/13) | 80% (8/10) |
| Number of variables | ≤4: 33% (3/9); >4: - | ≤2: 50% (3/6); >2: 57% (4/7) | <10: 80% (4/5); ≥10: 80% (4/5) |
| Sample size | ≤1000: 20% (2/10); >1000: 11% (1/9) | ≤130: 57% (4/7); >130: 50% (3/6) | ≤130: 80% (4/5); >130: 80% (4/5) |
| Entropy ($E_s$) | ≤0.7: -; >0.7: 50% (3/6) | ≤0.7: -; >0.7: 54% (7/13) | ≤0.7: -; >0.7: 80% (8/10) |
Table A12 NEC performance (structure recovery, %)

| NEC | Multinomial | Normal multivariate | Mixed |
|---|---|---|---|
| Overall | 21% (4/19) | 46% (6/13) | 40% (4/10) |
| Number of variables | ≤4: 33% (3/9); >4: 10% (1/10) | ≤2: 50% (3/6); >2: 43% (3/7) | <10: 40% (2/5); ≥10: 40% (2/5) |
| Sample size | ≤1000: 30% (3/10); >1000: 11% (1/9) | ≤130: 71% (5/7); >130: 17% (1/6) | ≤130: 40% (2/5); >130: 40% (2/5) |
| Entropy ($E_s$) | ≤0.7: 8% (1/13); >0.7: 50% (3/6) | ≤0.7: -; >0.7: 46% (6/13) | ≤0.7: -; >0.7: 40% (4/10) |
Table A13 AWE performance (structure recovery, %)

| AWE | Multinomial | Normal multivariate | Mixed |
|---|---|---|---|
| Overall | 16% (3/19) | 23% (3/13) | 40% (4/10) |
| Number of variables | ≤4: 33% (3/9); >4: - | ≤2: 33% (2/6); >2: 14% (1/7) | <10: 40% (2/5); ≥10: 40% (2/5) |
| Sample size | ≤1000: 20% (2/10); >1000: 11% (1/9) | ≤130: 29% (2/7); >130: 17% (1/6) | ≤130: 40% (2/5); >130: 40% (2/5) |
| Entropy ($E_s$) | ≤0.7: -; >0.7: 16% (3/19) | ≤0.7: -; >0.7: 23% (3/13) | ≤0.7: -; >0.7: 40% (4/10) |
Table A14 L performance (structure recovery, %)

| L | Multinomial | Normal multivariate | Mixed |
|---|---|---|---|
| Overall | 54% (11/19) | 54% (7/13) | 60% (6/10) |
| Number of variables | ≤4: 78% (7/9); >4: 40% (4/10) | ≤2: 100% (6/6); >2: 14% (1/7) | <10: 60% (3/5); ≥10: 40% (2/5) |
| Sample size | ≤1000: 70% (7/10); >1000: 44% (4/9) | ≤130: 29% (2/7); >130: 83% (5/6) | ≤130: 40% (2/5); >130: 80% (4/5) |
| Entropy ($E_s$) | ≤0.7: 39% (5/13); >0.7: 100% (6/6) | ≤0.7: -; >0.7: 54% (7/13) | ≤0.7: -; >0.7: 60% (6/10) |