
Mixture-Model Cluster Analysis using Information Theoretical Criteria

Jaime R. S. Fonseca1 ISCSP-Instituto Superior de Ciências Sociais e Políticas,

R. Almerindo Lessa, Pólo Universitário do Alto da Ajuda,

1349-055 Lisboa. Portugal

[email protected]

Margarida G. M. S. Cardoso ISCTE – Business School. Department of Quantitative Methods.

Av. das Forças Armadas 1649-026 Lisboa. Portugal [email protected]

Abstract. The estimation of mixture models has been proposed for quite some time as an approach

for cluster analysis. Several variants of the Expectation-Maximization algorithm are currently

available for this purpose. Estimation of mixture models simultaneously allows the determination

of the number of clusters and yields distributional parameters for clustering base variables. There

are several information criteria that help to support the selection of a particular model or

clustering structure. However, a question remains concerning the selection of specific criteria that

may be more suitable for particular applications. In the present work we analyze the relationship

between the performance of information criteria and the type of measurement of clustering

variables. In order to study this relationship we perform the analysis of forty-two data sets with

known clustering structure and with clustering variables that are categorical, continuous and

mixed type. We then compare eleven information-based criteria in their ability to recover the data

sets’ clustering structures. As a result, we select the AIC3, BIC and ICL-BIC criteria as the best candidates for selecting models with categorical, continuous and mixed-type clustering variables, respectively.

Keywords: Cluster Analysis; Finite Mixture Models; Model Selection; Information Theoretical

Criteria

1. Introduction

The work of Newcomb [52] may be the first contribution to the modelling of a mixture of homogeneous groups. But it was only in 1914 that Pearson [54] explicitly referred to the “dissection of a given abnormal frequency curve into m components”, and his work may be the first on the decomposition of a mixture of two normal clusters. Since then, finite mixtures of probability distributions have increasingly been used to model data collected from a population composed of several homogeneous subpopulations.

Several types of variables may be considered as a base for clustering. Hall and

Titterington [31] directed their study to categorical variables; others studied models for

continuous variables, such as Wang, Zhang, Luo, and Wei [65]; but most real clustering

problems involve both continuous and discrete variables, and methodologies for mixed clustering variables have also been proposed [28], [34], [51].

1 Corresponding author: Jaime Raúl Seixas Fonseca, Tel.: +351 213 619 430 (3179); Fax: +351 213 619 430


There are several information criteria that help to support the selection of a particular

mixture model (associated with a clustering structure). However, a question remains

concerning the performance of specific criteria that may be more suitable for particular

applications.

This paper analyzes the association between the performance of information criteria that may be used for selecting the number of clusters in mixture models and the measurement type of the clustering variables. As a result of this analysis, we may indicate some preliminary guidelines for the selection of a specific information criterion when specific types of attributes are considered for clustering in an application.

The paper is organized as follows: in section 2.1, we define notation and review finite

mixture models and cluster analysis via mixture models; in section 2.2, we review

previous work on maximum likelihood, estimator properties and the EM algorithm; in

section 3, we review several model selection criteria proposed to estimate the number of

clusters of a mixture, and some comparisons between them; in section 4, we describe the methodology; in section 5 we report on data analysis and experimental results; and finally, in section 6, we present some concluding remarks and prospects for future work.

2. Clustering of data via finite mixture models

2.1. Finite mixture model

Clustering is a task in which one seeks to identify a finite set of clusters to describe the

data [33]. Maronna and Jacovkis [44] or McLachlan and Basford [49] illustrated the use

of mixture models in the field of cluster analysis. Finite mixture models assume that

parameters of a statistical model of interest differ across unobserved or latent clusters.

They provide a useful method for grouping observations into clusters. In the mixture

method of clustering, each different cluster in the population is assumed to be described

by a different probability distribution, which may belong to the same family but differ

in the values they take for the parameters of the distribution.

This approach to clustering offers some advantages when compared with other

techniques: it identifies clusters [25]; it provides means to select the number of clusters

[47]; it is able to deal with diverse types of data (different measurement levels) [63]; it

outperforms more traditional approaches [64].

In order to present Mixture Models we give some notation below.


$n$ : sample size
$(Y_1, \dots, Y_p)$ : $p$ clustering variables (random variables)
$(y_1, \dots, y_n)$ : measurements on variables $Y_1, \dots, Y_p$
$y_i$ : vector of measurements of individual $i$ on variables $Y_1, \dots, Y_p$
$z = (z_1, \dots, z_n)$ : cluster-label vectors
$z_i$ : binary vector
$x = (y, z)$ : complete data
pdf : probability density function
$S$ : number of unknown clusters
$S^*$ : true number of clusters
$\theta_s$ : vector of all unknown p(d)f parameters of the $s$th cluster
$\theta = (\theta_1, \dots, \theta_S)$ : vector of mixture model parameters, without weights
$\pi = (\pi_1, \dots, \pi_{S-1})$ : vector of weights (mixing proportions)
$\pi_{is}$ : probability that individual $i$ belongs to the $s$th cluster, given $y_i$
$\Phi = (\theta, \pi)$ : vector of all unknown mixture model parameters
$\hat{\Phi} = (\hat{\theta}, \hat{\pi})$ : estimate of the vector of all unknown parameters
$L$ : likelihood function, $L(\Phi)$
$LL$ : log-likelihood function, $\log L(\Phi)$
$LL_c$ : complete-data log-likelihood function
$n_\psi$ : number of mixture model parameters
$EN(S)$ : entropy associated with the model with $S$ clusters

The mixture model approach to clustering assumes that the data come from a mixture of an unknown number $S$ of clusters in some unknown proportions $\pi_1, \dots, \pi_S$. It is assumed that the probability (density) function of $y_i$, $f(y_i \mid \Phi)$, is a mixture, or weighted sum, of $S$ cluster-specific probability (density) functions $f_s(y_i \mid \theta_s)$, with parameters $\theta_s$, $s = 1, \dots, S$; that is, each data point $y_i$ is taken to be a realization of the mixture

$$f(y_i \mid \Phi) = \sum_{s=1}^{S} \pi_s f_s(y_i \mid \theta_s) \qquad (1)$$

where

$$\pi_s \ge 0, \ s = 1, \dots, S, \quad \text{and} \quad \sum_{s=1}^{S} \pi_s = 1, \qquad (2)$$

with $\theta = (\theta_1, \dots, \theta_S)$, $\pi = (\pi_1, \dots, \pi_{S-1})$, and $\Phi = (\theta, \pi)$.


The data are arranged in an (n × P) matrix denoted by Y, where n is the number of cases and P is the number of attributes. Let $y_{ip}$ be the result of the $i$th case on the $p$th attribute, for $i = 1, \dots, n$ and $p = 1, \dots, P$, and let $y_i = (y_{i1}, \dots, y_{iP})^T$ be the $i$th column of the matrix $Y^T$, i.e. a (P × 1) vector of the results of case $i$ on all attributes. Then, the log-likelihood function for the parameters is

$$\log L(\Phi) = \sum_{i=1}^{n} \log f(y_i \mid \Phi) = \sum_{i=1}^{n} \log \sum_{s=1}^{S} \pi_s f_s(y_i \mid \theta_s). \qquad (3)$$

The particularization of mixture models for multinomial, multivariate normal and mixed

models can be seen in works such as [31], [65], and [35], respectively.
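To make (1) and (3) concrete, the following is a minimal sketch in Python of the observed-data log-likelihood of a univariate Gaussian mixture; the use of NumPy/SciPy and all parameter names are illustrative assumptions of the example, not part of the paper.

```python
import numpy as np
from scipy.stats import norm

def mixture_log_likelihood(y, pi, mu, sigma):
    """Observed-data log-likelihood, Eq. (3), of a univariate Gaussian mixture.

    y     : (n,) array of observations
    pi    : (S,) mixing proportions, non-negative and summing to 1 (Eq. 2)
    mu    : (S,) cluster means
    sigma : (S,) cluster standard deviations
    """
    # f(y_i | Phi) = sum_s pi_s f_s(y_i | theta_s), Eq. (1)
    dens = sum(p * norm.pdf(y, m, s) for p, m, s in zip(pi, mu, sigma))
    return np.log(dens).sum()
```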

2.2. Maximum likelihood estimation

2.2.1. Introduction

With the maximum likelihood approach to the estimation of $\Phi$, an estimate is provided by a suitable root of the likelihood equation

$$\frac{\partial \log L(\Phi)}{\partial \Phi} = 0. \qquad (4)$$

In order to derive meaningful results from clustering, the mixture model must be identifiable; this simply means that a unique solution to the maximum likelihood (ML) problem is possible and that the model parameters of the distributions are estimable and well defined [16].

Several researchers have studied the consistency of the ML estimators of both the number of clusters (S) and the other model parameters. The consistency of the estimator of the number of clusters was stated by Leroux [42] (with a maximum-penalized-likelihood method) considering the information criteria AIC and BIC. This property was also discussed by Keribin [40], James, Priebe and Marchette [37], and Boucheron and Gassiat [14], all concluding that the estimator is consistent for the true (though unknown) number of clusters.

Kabir [39] presented one of the earliest works on the consistency of the ML estimators of the remaining parameters, stating that mixture model estimators are consistent and asymptotically normally distributed when the clustering variables are assumed to belong to the exponential family of distributions. Other researchers have addressed this issue, such as Day [22], Hathaway [32], and Fryer and Robertson [29].

When dealing with mixture models for clustering purposes, we may define each complete-data observation $x_i = (y_i, z_i)$ as having arisen from one of the clusters of the mixture (1). The values of the clustering base variables $y_i$ are then regarded as incomplete data, augmented by the cluster-label variables $z_{is}$; that is, $z_i = (z_{i1}, \dots, z_{iS})$ is the unobserved portion of the data. The $z_{is}$ are binary indicator latent variables, with $z_{is} = 1$ or $0$ according to whether $y_i$ belongs or does not belong to the $s$th cluster, for $i = 1, \dots, n$ and $s = 1, \dots, S$.

Assuming that the $Z_i$ are independent and identically distributed, each according to a multinomial distribution over $S$ categories with probabilities $\pi_1, \dots, \pi_S$, the complete-data log-likelihood for estimating $\Phi$, if the complete data $x_i = (y_i, z_i)$ were observed [45], is

$$\log L_c(\Phi) = \sum_{i=1}^{n} \sum_{s=1}^{S} z_{is} \{ \log \pi_s + \log f_s(y_i \mid \theta_s) \}. \qquad (5)$$

2.2.2. The EM algorithm

Fitting finite mixture models (1) provides a probabilistic clustering of the n entities in

terms of their posterior probabilities of membership of the S clusters of the mixture of

distributions. Since the ML estimates of the finite mixture model (1) cannot be found

analytically, estimation of finite mixture models iteratively computes the estimates of

clusters posterior probabilities and updates the estimates of the distributional parameters

and mixing probabilities [41].

The expectation-maximization (EM) algorithm [23] is a widely used class of iterative algorithms for ML estimation in the context of incomplete data, e.g. fitting mixture models to observed data.

The EM algorithm proceeds by alternately applying two steps, until some convergence

criterion is met.

The E-step, on the $k$th iteration, calculates the expected complete-data log-likelihood, given $y$, defined by the so-called Q function

$$Q(\Phi; \Phi^{(k)}) = E[\log L_c(\Phi) \mid y; \Phi^{(k)}] = \sum_{i=1}^{n} \sum_{s=1}^{S} \hat{\pi}_{is}(\Phi^{(k)}) \{ \log \pi_s + \log f_s(y_i \mid \theta_s) \} \qquad (6)$$

where

$$\hat{\pi}_{is}(\Phi^{(k)}) = E[Z_{is} \mid y_i; \Phi^{(k)}] = \frac{\hat{\pi}_s^{(k)} f_s(y_i \mid \hat{\theta}_s^{(k)})}{\sum_{j=1}^{S} \hat{\pi}_j^{(k)} f_j(y_i \mid \hat{\theta}_j^{(k)})} \qquad (7)$$

is the membership (posterior) probability of pattern $y_i$ in cluster $s$ ($i = 1, \dots, n$ and $s = 1, \dots, S$).

The M-step, on the $(k+1)$th iteration, maximizes (6) with respect to $\Phi$ to update the parameter estimates, obtaining $\hat{\Phi}^{(k+1)}$.

Then, by the Bayes rule, after the EM algorithm has converged, the $i$th pattern is probabilistically assigned to cluster $s$ if

$$\hat{\pi}_s^{(k)} f_s(y_i \mid \hat{\theta}_s^{(k)}) \ge \hat{\pi}_{s'}^{(k)} f_{s'}(y_i \mid \hat{\theta}_{s'}^{(k)}), \quad s' = 1, \dots, S.$$

Since the mixture likelihood $L(\Phi)$ can never decrease during the EM sequence,

$$L(\hat{\Phi}^{(k+1)}) \ge L(\hat{\Phi}^{(k)}),$$

$L(\hat{\Phi}^{(k)})$ converges to some $L^*$ whenever the sequence of likelihood values is bounded above. Since the likelihood surface of a mixture model typically has many local maxima, the selection of suitable starting values for the EM algorithm is crucial [11]. Therefore, it is usual to obtain several values of the maximized log-likelihood from different sets of initial values applied to the given sample, and then take the maximum of these values as the solution.
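To illustrate the two steps, here is a minimal, self-contained EM sketch for a univariate Gaussian mixture; the restriction to univariate normal components and the use of NumPy/SciPy are assumptions of the example, not the paper's general algorithm. It alternates the E-step of (7) with the closed-form M-step updates that maximize (6), and stops when the log-likelihood gain falls below a tolerance.

```python
import numpy as np
from scipy.stats import norm

def em_gaussian_mixture(y, S, n_iter=200, tol=1e-6, seed=0):
    """EM for a univariate Gaussian mixture with S clusters (a sketch)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    pi = np.full(S, 1.0 / S)                  # equal initial mixing proportions
    mu = rng.choice(y, S, replace=False)      # initial means drawn from the data
    sigma = np.full(S, y.std())               # common initial spread
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior membership probabilities, Eq. (7); shape (S, n)
        dens = pi[:, None] * norm.pdf(y[None, :], mu[:, None], sigma[:, None])
        post = dens / dens.sum(axis=0)
        # M-step: closed-form updates maximizing Q, Eq. (6)
        ns = post.sum(axis=1)
        pi = ns / n
        mu = (post * y).sum(axis=1) / ns
        sigma = np.sqrt((post * (y - mu[:, None]) ** 2).sum(axis=1) / ns)
        ll = np.log(dens.sum(axis=0)).sum()   # non-decreasing across iterations
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, sigma, ll
```

In line with the remark above, one would run this routine from several random starts and keep the solution with the largest log-likelihood.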

3. Model selection

3.1. Model selection criteria

Model selection is an increasingly important part of data analysis. In fact, when dealing with applications, one always has to decide which model is the most appropriate to characterize the data structure, and several model selection criteria may be available to this end. In the context of clustering via finite mixture models, the problem of simultaneously choosing an adequate clustering structure and a particular number of clusters can be approached by several methods:

1) Hypothesis tests


With a mixture model-based approach to clustering, regularity conditions fail to hold for the likelihood ratio statistic (LRS). It does not have its usual asymptotic null distribution of chi-square with $(n_{\psi_1} - n_{\psi_0})$ degrees of freedom when testing the null hypothesis that the true number of clusters is $S_0$ against the alternative $S_1 > S_0$.

However, a re-sampling approach may be used to assess the p-value of the LRS in testing those hypotheses [48]. Bootstrap samples are generated from the mixture model fitted under the null hypothesis of $S_0$ clusters, and the value of the LRS is computed for each bootstrap sample after fitting mixture models for $S_0$ and $S_1$ in turn. The process is repeated independently a number of times, and the replicated values of the LRS obtained from the successive bootstrap samples provide an assessment of the bootstrap null distribution of the LRS, and hence of its true null distribution, thus enabling an approximation to be made to the p-value.
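A sketch of this resampling procedure, reusing the em_gaussian_mixture helper above; the univariate Gaussian setting and both function names are illustrative assumptions.

```python
import numpy as np

def sample_from_mixture(fit, n, seed=None):
    """Draw n points from a fitted univariate Gaussian mixture (pi, mu, sigma, ll)."""
    pi, mu, sigma, _ = fit
    rng = np.random.default_rng(seed)
    labels = rng.choice(len(pi), size=n, p=pi)     # pick a cluster for each point
    return rng.normal(mu[labels], sigma[labels])   # then draw from that cluster

def bootstrap_lrs_pvalue(y, s0, s1, B=99):
    """Approximate p-value for H0: S = s0 vs H1: S = s1 via the bootstrap LRS."""
    fit0 = em_gaussian_mixture(y, s0)
    fit1 = em_gaussian_mixture(y, s1)
    lrs_obs = 2 * (fit1[-1] - fit0[-1])            # observed LRS
    exceed = 0
    for b in range(B):
        yb = sample_from_mixture(fit0, len(y), seed=b)  # sample under H0
        lrs_b = 2 * (em_gaussian_mixture(yb, s1)[-1]
                     - em_gaussian_mixture(yb, s0)[-1])
        exceed += lrs_b >= lrs_obs
    return (exceed + 1) / (B + 1)                  # bootstrap p-value estimate
```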

Regarding the specific case of testing the null hypothesis $H_0: f(y) = f(y; \mu, \sigma^2)$ versus $H_1: f(y) = \pi f(y; \mu_1, \sigma_1^2) + (1 - \pi) f(y; \mu_2, \sigma_2^2)$, Roeder [57] proposes the construction of a “Mixture Detection Plot” (MDP). She considers the function $f_1(y)/f_0(y) - 1$, which she shows is a good indicator of the presence of a mixture, since it has the same number of sign changes as $f_1(x) - f_0(x)$.

The proposed diagnostic obtains a nonparametric density estimate (using the normal kernel estimator) of $f_1(y)$, and the method-of-moments estimate of $f_0(y)$. By this plot diagnostic, if the true model has one cluster, the mixture detection plot of $f_1(y)/f_0(y) - 1$ approximates a stationary Gaussian process; if the true model has two clusters, the plot approximates a bimodal function which tends to exhibit the expected four sign changes (the MDP shows two relatively large modes with a well-defined dip).

2) Validity indices

Indices of validity for partitions are frequently interpreted as measures of partition

quality, rather than of goodness of fit. Some cluster validity indices available in the literature are Dunn’s index [4], the Davies-Bouldin (DB) index [53], the Xie-Beni index [67], and the stability index [26]. Dunn’s index is a ratio of within-cluster and between-cluster separations; the DB index is a function of the ratio of the sum of within-cluster scatter to between-cluster scatter; the Xie-Beni index is a ratio of the fuzzy-cluster sum of squared distances to the product of the number of elements and the minimum between-cluster separation; the stability index is a stability-based technique that relies on the immovability of clusters across partitions. See [30] for a good overview of this issue.

The Bayes rule, together with the common mathematical structure shared by the posterior-probability matrix and a fuzzy partition, allows every cluster validity index to be transformed into a measure that may be useful for mixture validation [7]. By doing so, Bezdek, Li, Attikiouzel, and Windham [7] showed that validity measures that assess geometric properties of partitions matching the expected structure in samples from mixtures of normal distributions can be as effective as information criteria for estimating the number of components.

3) Information Criteria

In the present work we specifically refer to the use of Information Criteria for mixture

model selection and we focus on the determination of the true number of clusters for

these models.

Several criteria are available to this end. AIC, Akaike’s Information Criterion [2], and BIC, the Bayesian Information Criterion [58], are perhaps the best known. These and some other information criteria are presented in Table 1.

Table 1. Some information-based criteria for assessing the number of clusters in finite mixture models

Criterion | Definition | Author / Bibliography
AIC | $-2LL + 2n_\psi$ | [1]
AIC3 | $-2LL + 3n_\psi$ | [16]
AICc | $\mathrm{AIC} + 2n_\psi(n_\psi + 1)/(n - n_\psi - 1)$ | [36]
AICu | $\mathrm{AICc} + n\log(n/(n - n_\psi - 1))$ | [50]
CAIC | $-2LL + n_\psi(1 + \log n)$ | [17]
BIC/MDL | $-2LL + n_\psi \log n$ | [58] / [56]
CLC | $-2LL + 2EN(S)$ | [9]
ICL-BIC | $\mathrm{BIC} + 2EN(S)$ | [10]
NEC¹ | $NEC(S) = EN(S)/(L(S) - L(1))$ | [12]
AWE | $-2LL_c + 2n_\psi(3/2 + \log n)$ | [5]
L | $\frac{n_\psi}{2}\sum_{s}\log(n\pi_s/12) + \frac{S}{2}\log(n/12) + \frac{S(n_\psi+1)}{2} - LL$ | [27]

¹ We choose $S^*$ if $NEC(S^*) \le 1$ ($2 \le S^* \le S$), with the convention $NEC(1) = 1$; otherwise we declare no clustering structure in the data.

These criteria balance fit (trying to maximize the likelihood function) and parsimony (using penalties associated with measures of model complexity), trying to avoid overfitting. Furthermore, fitting a model with a large number of clusters requires the estimation of a very large number of parameters, with a consequent loss of precision in these estimates [43].

The general form of the information criteria is as follows:

$$-2\log L(\hat{\Phi}) + C, \qquad (8)$$

where the first term is minus twice the maximized log-likelihood, which decreases when model complexity increases, and the second (penalty) term $C$ penalizes overly complex models and increases with the number of model parameters. Thus, the selected mixture model should show a good trade-off between description of the data and the number of model parameters.
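The criteria of Table 1 with closed-form penalties are straightforward to evaluate once a model has been fitted; a minimal sketch follows (the function name and the NumPy dependency are assumptions of the example).

```python
import numpy as np

def information_criteria(ll, n_psi, n, entropy=0.0):
    """Evaluate several criteria of the general form -2*LL + C (Eq. 8); a sketch.

    ll      : maximized log-likelihood of the fitted mixture
    n_psi   : number of free model parameters
    n       : sample size
    entropy : EN(S), the classification entropy (used by CLC and ICL-BIC)
    """
    aic = -2 * ll + 2 * n_psi
    aicc = aic + 2 * n_psi * (n_psi + 1) / (n - n_psi - 1)
    bic = -2 * ll + n_psi * np.log(n)
    return {
        "AIC": aic,
        "AIC3": -2 * ll + 3 * n_psi,
        "AICc": aicc,
        "AICu": aicc + n * np.log(n / (n - n_psi - 1)),
        "CAIC": -2 * ll + n_psi * (1 + np.log(n)),
        "BIC": bic,
        "CLC": -2 * ll + 2 * entropy,
        "ICL-BIC": bic + 2 * entropy,
    }
```

For a sequence of candidate models with S = 1, 2, 3, ..., one retains the S that minimizes the chosen criterion.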

The emphasis on information criteria begins with the pioneering work of Akaike [2] on the Akaike information criterion: AIC chooses the model with S clusters that minimizes (8) with $C = 2n_\psi$.

Later, Bozdogan [17] suggested the modified AIC criterion (AIC3) in the context of multivariate normal mixture models, using 3 instead of 2 in the penalty term, that is, $C = 3n_\psi$. When a vector parameter lies on the boundary of the parameter space (as in the standard mixture problem), in comparing two models with $n_\psi$ and $n_\psi^*$ parameters, respectively, the likelihood ratio statistic has a non-central chi-square distribution with $2(n_\psi - n_\psi^*)$ degrees of freedom, instead of the $(n_\psi - n_\psi^*)$ considered in AIC. As a result, he obtained the penalization factor $C = 2n_\psi + n_\psi = 3n_\psi$.

Another variant of AIC, the corrected AIC (AICc) [36], focuses on a small-sample bias adjustment (AIC may perform poorly if there are too many parameters in relation to the sample size); AICc selects the model with S clusters that minimizes (8) with $C = 2n_\psi\, n/(n - n_\psi - 1)$.

Since AICc still tends to overfit as the sample size increases [50], a further criterion, AICu, was proposed, which imposes a greater penalty for overfitting, especially as the sample size increases.

The consistent AIC criterion (CAIC), with $C = n_\psi(1 + \log n)$, was derived by Bozdogan [17]. It tends to select models with fewer parameters than AIC does.

The Bayesian Information Criterion (BIC) was proposed by Schwarz [58], who looked for the appropriate modification of maximum likelihood by studying the asymptotic behaviour of Bayes estimators under a class of proper priors which assign positive probability to some lower-dimensional subspaces of the parameter vector. It uses $C = n_\psi \log n$ and is equivalent to the MDL (minimum description length) criterion [56].

The CLC (classification likelihood) criterion [47] originated from the link between the observed log-likelihood and the log classification likelihood, $LL_c = LL - EN(S)$. It considers $C = 2EN(S)$, where the term $2EN(S)$ penalizes poorly separated clusters, with

$$EN(S) = -\sum_{i=1}^{n} \sum_{s=1}^{S} \pi_{is} \log \pi_{is}.$$
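EN(S) is directly computable from the matrix of posterior probabilities produced by the E-step; a small sketch follows (the `post` array of shape (S, n) follows the EM example above, an assumption of the illustration).

```python
import numpy as np

def classification_entropy(post, eps=1e-12):
    """EN(S) = -sum_i sum_s pi_is * log(pi_is), from an (S, n) posterior matrix."""
    return float(-np.sum(post * np.log(post + eps)))  # eps guards against log(0)
```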

In order to account for the ability of the mixture model to give evidence of a clustering structure in the data, Biernacki, Celeux, and Govaert [10] considered the integrated likelihood of the complete data $(x, z)$, the Integrated Classification Likelihood (ICL) criterion; an approximation, referred to as ICL-BIC [47], chooses the model with S clusters that minimizes (8) with $C = 2EN(S) + n_\psi \log n$.

Biernacki, Celeux and Govaert [12] suggested the improved NEC criterion (originally introduced by Celeux and Soromenho [20]), which selects $s$ clusters if $NEC(s) \le 1$ ($2 \le s \le S$), with the convention $NEC(1) = 1$; otherwise NEC declares that there is no clustering structure in the data.

Banfield and Raftery [5] suggested a Bayesian solution to the choice of the number of clusters, based on an approximation of the classification likelihood: the so-called approximate weight of evidence (AWE), which penalizes complex models more drastically than BIC. It will therefore select more parsimonious models than BIC, except for well-separated clusters, and chooses the model with S clusters that minimizes (8) with $C = 2EN(S) + 2n_\psi(3/2 + \log n)$.

Finally, Figueiredo and Jain [27] proposed the L criterion for any type of parametric mixture model for which an EM algorithm can be written; this criterion chooses the model with S clusters that minimizes

$$\frac{n_\psi}{2} \sum_{s=1}^{S} \log\left(\frac{n\pi_s}{12}\right) + \frac{S}{2} \log\frac{n}{12} + \frac{S(n_\psi + 1)}{2} - LL.$$

The measures of model complexity associated with the criteria in Table 1 differ in what they depend on. For AIC and AIC3, the penalty depends only on the number of parameters; for AICc, AICu, CAIC and BIC/MDL, it depends on both the number of parameters and the sample size; CLC and NEC depend on the entropy; ICL-BIC and AWE depend on the number of parameters, the sample size, and the entropy; and L depends on the number of parameters, the sample size and the mixing proportions $\pi_s$.

In the present work we specifically refer to the information criteria presented in Table 1, which were described above. All are currently in use for the estimation of mixture models. Their origins are diverse, as illustrated in Table 2.

Table 2. Some history of the criteria for model selection on finite mixture models

Proposed for | Criterion | Aim | Example of use on mixtures
Regression models | AIC | To select the order of an autoregressive model | [18]
Regression models | AICc | To correct AIC for bias in regression models | [19]
Regression models | AICu | To achieve a better performance than AICc | ---
Regression models | CAIC | To make AIC asymptotically consistent | [8]
Regression models | BIC/MDL | To select the order of models in polynomial regression (Schwarz), or minimum description message length (Rissanen) | [62]
Clustering | AIC3 | Bozdogan's AIC correction for model selection on mixtures of multivariate normals | [3]
Clustering | CLC | For data sets with well-separated clusters, using a measure of entropy | [47]
Clustering | ICL-BIC | Related to BIC; favours well-separated clusters, using a measure of entropy | [10]
Clustering | NEC | Entropy criterion, making a compromise between clustering quality and fit quality of S clusters relative to one cluster | [27]
Clustering | AWE | Adds a third dimension to the information criteria, weighing fit, parsimony, and clustering performance | [6]
Clustering | L | Much less initialization-dependent, and automatically avoids the boundary of the parameter space | [68]

3.2. Information criteria comparisons

There are some studies which refer to the comparison of Information Criteria for

mixture model selection.

Cutler and Windham [21] based their work on simulations of mixtures of bivariate normal distributions and used AIC, AIC3, BIC/MDL, and ICOMP [15], among others, as model selection criteria. They generated 500 data sets for each combination of sample size, number of clusters, and level of separation, and showed that BIC/MDL and AIC performed well, in this order, for the model with $\Sigma_s = \sigma^2 I$ and $\pi_s = 1/S$, $s = 1, \dots, S$; ICOMP was similar to AIC, but had lower success rates in general in recovering the true structure of the data. For the model with $\Sigma_s = \sigma_s^2 I$, $s = 1, \dots, S$, ICOMP had the best performance, and for the full model, the most general specification because no restrictions are imposed on the parameters, ICOMP performed well, but BIC/MDL might be preferred, especially when the clusters are well separated. Thus, this study suggests that the type of model affects the performance of information criteria.

Bozdogan [16] simulated n = 300 observations from a three-dimensional multivariate normal distribution with S = 3 clusters. He developed a Monte Carlo study, replicating the experiment 100 times, with the goal of identifying and estimating the appropriate number S of clusters and the best-fitting model, and demonstrated the utility of the ICOMP, AIC3, and CAIC model selection criteria.

Bezdek, Li, Attikiouzel, and Windham [7] studied the performance of several criteria, namely AIC, AIC3, BIC/MDL, AWE, ICOMP, and NEC, as probabilistic indices. They simulated 12 data sets from a two-dimensional normal mixture and showed that AIC3 had the best performance, followed by AIC, BIC/MDL, AWE, NEC, and ICOMP. They also used other indices, such as the Xie-Beni and Dunn indices and their generalizations, which were as effective as the information criteria for assessing the number of clusters in a normal mixture.

Biernacki [9] studied mixtures of normal distributions and showed that AIC3 and BIC performed best; ICOMP performed worse, staying basically between AIC and AIC3. However, in the situation of a non-normal cluster (uniform + normal), AIC, AIC3, BIC, and ICOMP underestimated the number of clusters; on the contrary, NEC and AWE showed better performance.

The last two works mentioned above suggest that the type of clustering variables may influence the performance of a given information criterion. A methodological approach is proposed next in order to explore this hypothesis.

4. Methodology

The goal of the present paper is to try to establish a relationship between the type of clustering variables and the performance of information criteria in mixture-model cluster analysis. We focus on the capacity of several information criteria to discover the true number of clusters, and on how this capacity relates to the type of clustering variables: categorical, continuous or mixed.


In order to meet this objective, we conduct the analysis of forty-two data sets (available on the web2) for which the true number of clusters is known. This approach enables the comparison between the true (S*) and estimated (S) number of clusters for each data set.

As the main result of this methodological approach, we present a ranking of the information criteria based on the proportion of data sets whose structure each criterion is able to recover, i.e. the proportion of data sets yielding S = S*.

Table 3 summarizes the number of data sets analysed for each type of clustering variables and identifies the data sets.

Table 3. Data sets analysed

Type of clustering variables | Data sets analysed | Number
Categorical | Landis 77; Judges; Universalistic/particularistic; MAT; Store; Heinen 2; Heinen 3; Financial; Vdheijden; LSAT 6; Political; Gss82White; Gss82; Midtown Manhattan; Depression; Gss94; Coleman; Eye and Hair color; Hannover | 19
Continuous | Glucose in C5YSF1; haemophilia A carriers; Les Papillons; Chaetocnema insect data; H. oleracea H. carduorum; Males of 3 sp. Chaetocnema; With and Angle; Diabetes; Iris of Fisher; Japanese black pine; Halibut; K-means; Pearson trypanosome | 13
Mixed (continuous and categorical) | Bird survival; Methadone treatment; North Central Wisconsin; Hepatitis; Neolithic Tools; Imports-85; Heart; Cancer; AIDS; Ethylene Glycol | 10

Tables A1, A2, and A3 (in the Appendix) include additional information on the data sets used in this study, namely the type of clustering variables, the sample size, and the true number of clusters.

Since the data sets’ sample sizes and data dimensions are known, we also provide separate results for different sample size and data dimension categories.

In order to evenly distribute the available data sets, we consider two sample size categories: n ≤ 130 and n > 130 for the multivariate normal and mixed cases, and n ≤ 1000 and n > 1000 for the multinomial cases. This cut-off also takes into account the fact that, empirically, n = 100 is large enough to support asymptotic approximations of the results [61].

2 http://www.statisticalinnovations.com/products/latentgold datasets and

http://www.ics.uci.edu/~mlearn/MLSummary.


In order to present the results, we also consider an entropy cut-off value of 0.7, since several criteria (AICu, CAIC, BIC, L, CLC, ICL-BIC, NEC, and AWE) showed better performance for values above it.

In addition, we measure the degree of separation between the clusters using the index of Jedidi, Ramaswamy, and DeSarbo [38],

$$E_s = 1 - \frac{-\sum_{i=1}^{n} \sum_{s=1}^{S} \hat{\pi}_{is} \ln \hat{\pi}_{is}}{n \ln S},$$

where $-\sum_{i=1}^{n} \sum_{s=1}^{S} \hat{\pi}_{is} \ln \hat{\pi}_{is}$ is the entropy term, which measures the clusters’ overlap. It is a relative measure bounded between 0 and 1. Values close to 1 indicate that the clusters are well separated, and values close to 0 indicate that we are dealing with overlapping clusters.
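As a sketch, this separation index can be computed from the posterior matrix with the classification_entropy helper defined earlier (both function names are illustrative assumptions).

```python
import numpy as np

def separation_index(post):
    """E_s of Jedidi, Ramaswamy and DeSarbo [38]: 1 - EN(S) / (n ln S)."""
    S, n = post.shape
    return 1.0 - classification_entropy(post) / (n * np.log(S))
```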

5. Experiments

In the present work, forty-two mixture models are estimated from the data sets presented in Table 3. In order to illustrate the modelling procedure, we detail it for three data sets: Landis 77 (categorical clustering variables), Diabetes (continuous clustering variables) and Heart (mixed clustering variables). For each data set we particularly focus on the capacity of the alternative information criteria to yield the true number of clusters, S*. Tables 4, 5 and 6 report the results for Landis 77, Diabetes and Heart, respectively. In those tables, (< true S*) means that the criterion underestimates S*, and (> true S*) that it overestimates S*.

The multivariate mixture models for clustering which are considered for the data sets

deal with clustering base variables which are categorical, continuous, or mixed.

When all of the clustering base variables are categorical, a mixture model of

multinomial probability functions is adopted.

When there are only continuous variables among the clustering base variables, we propose a multivariate Gaussian mixture model.

When using mixed clustering base variables, we specify the appropriate univariate distribution function for each element of $Y_i$: normal for continuous variables and multinomial for categorical variables.


In all mixture models it is generally assumed that the clustering base variables are

mutually independent within clusters. In general, it is possible to include local

dependences between clustering base variables by using the appropriate multivariate

rather than univariate distributions for sets of locally dependent variables (multivariate

normal distribution for sets of continuous variables, and joint multinomial distribution

for a set of categorical variables) [63].
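Under local independence, the cluster-specific density of a mixed-type observation factorizes into a product of univariate terms; the following is a minimal sketch, where the variable split and all parameter names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def mixed_cluster_log_density(y_cont, y_cat, mu, sigma, cat_probs):
    """log f_s(y_i | theta_s) for one cluster under local independence.

    y_cont    : array of continuous values of observation i
    y_cat     : list of category indices of observation i
    mu, sigma : per-variable normal parameters for the continuous part
    cat_probs : list of probability vectors, one per categorical variable
    """
    log_f = norm.logpdf(y_cont, mu, sigma).sum()                   # normal terms
    log_f += sum(np.log(p[c]) for p, c in zip(cat_probs, y_cat))   # multinomial terms
    return log_f
```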

5.1. Landis 77 data set

Landis 77 data set consists of 118 entities and 7 categorical clustering variables. The

clustering base variables are presence/absence of carcinoma in the uterine cervix,

according to 7 pathologists, and the true number of clusters (S*) is known to be 3.

The adopted mixture model considers the local independence assumption.

Table 4. Information criteria results for the Landis 77 data set

cluster | LL | AIC | AIC3 | AICc | AICu | CAIC | BIC | CLC | ICL-BIC | NEC | AWE | L
1 | -524.5 | 1062.9 | 1069.9 | 1063.9 | 1072.2 | 1089.3 | 1082.3 | 1048.9 | | | | -537.6
2 | -318.2 | 666.3 | 681.3 | 671.0 | 688.2 | 722.9 | 707.9 | 643.0 | | | | -378.9
3 | -294.9 | 635.9 | 658.9 | 647.6 | 674.4 | 722.6 | 699.6 | 611.9 | < true S* | < true S* | < true S* | -339.4
4 | -290.7 | 643.5 | 674.5 | 666.5 | 703.9 | 760.4 | 729.4 | 629.8 | | | | -447.4

Table 4 reports the results for this data set. The criteria AIC, AIC3, AICc, AICu, CAIC, BIC, CLC, and L attain their minima at S = 3; they are all able to recover the known data structure. The relative cluster sizes are 44.5%, 37.5% and 18%, and Es = 0.915 indicates that the clusters are well separated.

5.2. Diabetes data set

The Diabetes data set includes 145 entities, described by three continuous clustering

variables, with three clusters. The clustering base variables are: glucose, insulin and

SSPG.

The adopted mixture model considers the local independence assumption, except for

clustering variables glucose and insulin. The criteria selecting a mixture model with

three clusters (Table 5) are CAIC, BIC, and ICL-BIC.


Table 5. Information criteria results for the Diabetes data set

cluster | LL | AIC | AIC3 | AICc | AICu | CAIC | BIC | CLC | ICL-BIC | NEC | AWE | L
1 | -2560.4 | | | | | 5162.6 | 5155.6 | | 5155.6 | | |
2 | -2380.3 | | | | | 4850.2 | 4835.2 | | 4842.7 | | |
3 | -2320.6 | > true S* | > true S* | > true S* | > true S* | 4778.6 | 4755.6 | > true S* | 4804.2 | < true S* | < true S* | < true S*
4 | -2303.1 | | | | | 4791.6 | 4760.6 | | 4812.3 | | |

The overlap index (Es) value is 0.847, indicating moderately separated clusters.

5.3. Heart data set

The Heart data set contains 270 entities described by 12 clustering variables (five continuous and seven categorical), with two clusters. The continuous clustering base variables are age, resting blood pressure, serum cholesterol (mg/dl), maximum heart rate achieved, and ST depression induced by exercise relative to rest; the categorical ones are sex, chest pain type (1, 2, 3, 4), fasting blood sugar > 120 mg/dl (1 = true, 0 = false), resting electrocardiographic results (0, 1, 2), exercise-induced angina (0, 1), slope (1, 2, 3), and number of major vessels colored by fluoroscopy (0, 1, 2, 3). Table 6 reports the results of the analysis. The adopted mixture model considers the local independence assumption, and a mixture model with two clusters is selected by the following criteria: BIC, ICL-BIC, CAIC, AWE and L. The relative sizes of the two clusters are 60% and 40%, and Es = 0.816 indicates that the clusters are moderately separated.

Table 6. Information criteria results for the Heart data set

cluster | LL | AIC | AIC3 | AICc | AICu | CAIC | BIC | CLC | ICL-BIC | NEC | AWE | L
1 | -6745.8 | | | | | 13643.4 | 13620.4 | | 14184.7 | | 13818.1 | -6795.2
2 | -6576.0 | > true S* | > true S* | > true S* | > true S* | 13429.2 | 13387.2 | > true S* | 13882.3 | > true S* | 13817.1 | -6722.9
3 | -6528.4 | | | | | 13459.3 | 13398.3 | | 13915.2 | | 13997.3 | -6779.7

An additional result, the confusion matrix, is presented in Table 7; it shows the percentage of correctly and incorrectly classified entities in each cluster: 95.3% of the entities of cluster 1 are correctly classified in cluster 1, and 86.7% of the entities of cluster 2 are correctly classified in cluster 2.


Table 7. Confusion matrix (Heart data set)

True clusters | Model cluster 1 | Model cluster 2 | Total
1 | 143 (95.3%) | 7 (4.7%) | 150
2 | 16 (13.3%) | 104 (86.7%) | 120
Total | 159 | 111 | 270

6. Discussion and perspectives

In Table 8 we report, for each information-based criterion and each type of clustering base variables, the percentage of data sets whose true structure is recovered, based on the entire sample of data sets. The bold values identify the criterion with the best performance for each type of clustering variables.

Table 8. Structure recovery (%)

Clustering variables | AIC | AIC3 | AICc | AICu | CAIC | BIC | CLC | ICL-BIC | NEC | AWE | L
Multinormal | 62 | 62 | 39 | 31 | 69 | 77 | 23 | 54 | 46 | 23 | 54
Multinomial | 84 | 95 | 84 | 90 | 74 | 84 | 21 | 16 | 21 | 16 | 54
Mixed | 40 | 30 | 20 | 40 | 70 | 70 | 10 | 80 | 40 | 40 | 60

According to the obtained results, we conclude that AIC3 is the best-performing criterion when categorical variables are considered (multinomial distribution adopted), BIC performs best when continuous variables are considered (multivariate normal distribution adopted), and ICL-BIC has the best performance when mixed variables are considered (multinomial and multivariate normal distributions adopted).

For data sets with categorical clustering variables, the AIC3 criterion achieves an excellent performance (see Table 8), with 95% structure recovery. AIC3 is the best information criterion to use across a large variety of multinomial data configurations. As Table A5 shows, its performance is very good (100%) for data dimension p ≤ 4, very good (100%) for small sample sizes, and very good (92%) for overlapped clusters (Es ≤ 0.7). It is followed by AICu, with 90% structure recovery.

This conclusion is in accordance with Dias [24]. He used a Monte Carlo study to

compare the performance of information criteria for the selection of the number of

clusters of latent class models as a finite mixture model of conditionally independent

multinomial distributions. His results showed that AIC3 had the best overall success

rate of 72.9%, and outperformed other criteria such as AIC, BIC, and CAIC.

BIC (Table 8) is the criterion with the best performance in the multivariate normal cases, with 77% structure recovery, followed by CAIC with 69%. BIC (Table A9) performs very well (100%) for p ≤ 4, and performs well (80%) for small sample sizes and for overlapping clusters (77%).

ICL-BIC (Table 8) is the criterion with the best performance on the mixed data sets, with 80% structure recovery, followed by CAIC and BIC, both with 70%, and L (60%); it performs well regardless of the number of variables and the sample size (Table A11).

Fig. 1. Criteria results (% of structure recovery) on multinomial, multivariate normal, and mixed clustering base variables. [Chart: % of structure recovery for each criterion (AIC, AIC3, AICc, AICu, AWE, BIC, CAIC, CLC, ICL-BIC, L, NEC), by type of clustering variables.]


The performance of the AIC-family criteria (AICu, AICc, AIC3, AIC), and also of ICL-BIC, seems to be particularly sensitive to the type of clustering variables (see Fig. 1). This conclusion supports the hypothesis put forward in the present paper.

It also appears (Fig. 1) that the L criterion (followed by CAIC) is the least influenced by the type of clustering variables. This may be related to the fact that the L criterion was proposed for any type of parametric mixture model.

In general, this study indicates the existence of a relationship between the performance of some information criteria and the types of variables considered for clustering with mixture models.

This conclusion needs further research: the link between the performance of an information criterion and the types of clustering variables is, so far, only empirically observed. In future work, the analysis of simulated data sets should be conducted, providing means to confirm and characterize this link.

Acknowledgements

The authors wish to thank the referees, for their many valuable suggestions, which led

to a significant improvement of the article.

References

[1] H. Akaike, Information Theory and an Extension of the Maximum Likelihood Principle, in B. N. Petrov and F. Csaki, eds., Proceedings of the Second International Symposium on Information Theory, Akademiai Kiado, Budapest, 1973, pp. 267-281; reprinted in E. Parzen, K. Tanabe and G. Kitagawa, eds., Selected Papers of Hirotugu Akaike, Springer-Verlag, New York.

[2] H. Akaike, Maximum likelihood identification of Gaussian autoregressive moving average models, Biometrika, 60 (1973), pp. 255-265.


[3] R. L. Andrews and I. S. Currim, A Comparison of Segment Retention Criteria for Finite Mixture Logit Models, Journal of Marketing Research, XL (2003), pp. 235-243.

[4] S. Bandyopadhyay and U. Maulik, Nonparametric Genetic Clustering:

Comparison of Validity Indices, IEEE Transactions on Systems, Man, and

Cybernetics-Part C: Applications and Reviews, 31 (2001), pp. 120-125.

[5] J. D. Banfield and A. E. Raftery, Model-Based Gaussian and Non-Gaussian

Clustering, Biometrics, 49 (1993), pp. 803-821.

[6] H. Bensmail and J. J. Meulman, Model-based Clustering with Noise: Bayesian

Inference and Estimation, Journal of Classification, 20 (2003), pp. 49-76.

[7] J. C. Bezdek, W. Q. Li, Y. Attikiouzel and M. Windham, A geometric approach

to cluster validity for normal mixtures, Soft Computing, 1 (1997), pp. 166-179.

[8] A. Bhatnagar and S. Ghose, A latent class segmentation analysis of e-shoppers,

Journal of Business Research, 57 (2004), pp. 758-767.

[9] C. Biernacki, Choix de modèles en classification, Ph.D. thesis, 1997.

[10] C. Biernacki, G. Celeux and G. Govaert, Assessing a Mixture model for

Clustering with the integrated Completed Likelihood, IEEE Transactions on

Pattern analysis and Machine Intelligence, 22 (2000), pp. 719-725.

[11] C. Biernacki, G. Celeux and G. Govaert, Choosing starting values for the EM

algorithm for getting the highest likelihood in multivariate Gaussian mixture

models, Computational Statistics & Data Analysis, 41 (2003), pp. 561-575.

[12] C. Biernacki, G. Celeux and G. Govaert, An improvement of the NEC criterion for assessing the number of clusters in a mixture model, Pattern Recognition Letters, 20 (1999), pp. 267-272.

[13] C. Biernacki and G. Govaert, Using the classification likelihood to choose the

number of clusters, Computing Science and Statistics, 29 (1997), pp. 451-457.

[14] S. Boucheron and E. Gassiat, Order Estimation and Model Selection, in O. Cappé and T. Rydén, eds., Inference in Hidden Markov Models, 2002, pp. 25.

[15] H. Bozdogan, ICOMP: A New Model Selection Criterion, in H. H. Bock, ed., Classification and Related Methods of Data Analysis, North-Holland, Amsterdam, 1988, pp. 599-608.

[16] H. Bozdogan, Mixture-Model Cluster Analysis using Model Selection Criteria and a New Informational Measure of Complexity, in H. Bozdogan, ed., Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, Kluwer Academic Publishers, 1994, pp. 69-113.

[17] H. Bozdogan, Model Selection and Akaike's Information Criterion (AIC): The General Theory and its Analytical Extensions, Psychometrika, 52 (1987), pp. 345-370.

[18] H. Bozdogan and P. Bearse, Information complexity criteria for detecting

influential observations in dynamic multivariate linear models using the genetic

algorithm, Journal of Statistical Planning and Inference, 114 (2003), pp. 31-44.

[19] K. P. Burnham and D. R. Anderson, Multimodel Inference. Understanding AIC

and BIC in Model Selection, Sociological Methods & Research, 33 (2004), pp.

261-304.

[20] G. Celeux and G. Soromenho, An entropy criterion for assessing the number of clusters in a mixture model, Journal of Classification, 13 (1996), pp. 195-212.

[21] A. Cutler and M. P. Windham, Information-Based Validity Functionals for

Mixture Analysis, in H. Bozdogan, ed., First US/Japan Conference on the

Frontiers of Statistical Modeling: An Informational Approach, Kluwer

Academic Publishers, 1994.

[22] N. E. Day, Estimating the Components of a Mixture of Normal Distributions, Biometrika, 56 (1969), pp. 463-474.

[23] A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B, 39 (1977), pp. 1-38.

[24] J. G. Dias, Finite Mixture Models: Review, Applications, and Computer-Intensive Methods, Ph.D. thesis, Economics, Groningen University, Groningen, 2004, pp. 199.

[25] W. R. Dillon and A. Kumar, Latent structure and other mixture models in marketing: An integrative survey and overview, chapter 9 in R. P. Bagozzi, ed., Advanced Methods of Marketing Research, Blackwell Publishers, Cambridge, 1994, pp. 352-388.

[26] A. F. Famili, G. Liu and Z. Liu, Evaluation and optimization of clustering in

gene expression data analysis, Bioinformatics, 20 (2004), pp. 1535-1545.

[27] M. A. T. Figueiredo and A. K. Jain, Unsupervised Learning of Finite Mixture

Models, IEEE Transactions on pattern analysis and Machine Intelligence, 24

(2002), pp. 1-16.


[28] J. R. S. Fonseca and M. G. M. S. Cardoso, Retail Clients Latent Segments, in C. Bento, A. Cardoso and G. Dias, eds., Progress in Artificial Intelligence: Proceedings of the 12th Portuguese Conference on Artificial Intelligence, EPIA 2005, December 5-8, Lecture Notes in Computer Science, Vol. 3808, Springer-Verlag, 2005, pp. 348-358.

[29] J. G. Fryer and C. A. Robertson, A Comparison of Some Methods for Estimating Mixed Normal Distributions, Biometrika, 59 (1972), pp. 639-648.

[30] M. Halkidi, Y. Batistakis and M. Vazirgiannis, On Clustering Validation

Techniques, Journal of Intelligent Information Systems, 17 (2001), pp. 107-145.

[31] P. Hall and D. M. Titterington, Efficient Nonparametric Estimation of Mixture

Proportions, Journal of the Royal Statistical Society, Series B, 46 (1984), pp.

465-473.

[32] R. J. Hathaway, A Constrained Formulation of Maximum-Likelihood Estimation for Normal Mixture Distributions, The Annals of Statistics, 13 (1985), pp. 795-800.

[33] E. R. Hruschka and N. F. F. Ebecken, A genetic algorithm for cluster analysis,

Intelligent Data Analysis, 7 (2003), pp. 15-25.

[34] L. Hunt and M. Jorgensen, Mixture model clustering for mixed data with missing

information, Computational Statistics & Data Analysis, 41 (2003), pp. 429-440.

[35] L. A. Hunt and K. E. Basford, Fitting a Mixture Model to Three-Mode Three-Way Data with Categorical and Continuous Variables, Journal of Classification, 16 (1999), pp. 283-296.

[36] C. M. Hurvich and C.-L. Tsai, Regression and Time Series Model Selection in Small Samples, Biometrika, 76 (1989), pp. 297-307.

[37] L. F. James, C. E. Priebe and D. J. Marchette, Consistent Estimation of Mixture Complexity, The Annals of Statistics, 29 (2001), pp. 1281-1296.

[38] K. Jedidi, V. Ramaswamy and W. S. DeSarbo, A Maximum Likelihood Method for Latent Class Regression Involving a Censored Dependent Variable, Psychometrika, 58 (1993), pp. 375-394.

[39] A. B. M. L. Kabir, Estimation of Parameters of a Finite Mixture of

Distributions, Journal of the Royal Statistical Society, Series B

(Methodological), 30 (1968), pp. 472-482.

[40] C. Keribin, Estimation consistante de l'ordre de modèles de mélange, Comptes Rendus de l'Académie des Sciences, Paris, t. 326, Série I (1998), pp. 243-248.


[41] Y. Kim, W. N. Street and F. Menczer, Evolutionary model selection in unsupervised learning, Intelligent Data Analysis, 6 (2002), pp. 531-556.

[42] B. G. Leroux, Consistent Estimation of a Mixing Distribution, The Annals of

Statistics, 20 (1992), pp. 1350-1360.

[43] B. G. Leroux and M. L. Puterman, Maximum-Penalized-Likelihood Estimation

for Independent and Markov-Dependent Mixture Models, Biometrics, 48 (1992),

pp. 545-558.

[44] R. Maronna and P. M. Jacovkis, Multivariate Clustering Procedures with

variable Metrics, Biometrics, 30 (1974), pp. 499-505.

[45] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions, John Wiley

& Sons, New York, 1997.

[46] G. J. McLachlan and N. Khan, On a resampling approach for tests on the number of clusters with mixture model-based clustering of tissue samples, Journal of Multivariate Analysis, 90 (2004), pp. 90-105.

[47] G. J. McLachlan and D. Peel, Finite Mixture Models, John Wiley & Sons, Inc., 2000.

[48] G. J. McLachlan, On Bootstrapping the Likelihood Ratio Test Statistic for the Number of Components in a Normal Mixture, Applied Statistics, 36 (1987), pp. 318-324.

[49] G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications to Clustering, Marcel Dekker, Inc., New York, 1988.

[50] A. McQuarrie, R. Shumway and C.-L. Tsai, The model selection criterion AICu,

Statistics & Probability Letters, 34 (1997), pp. 285-292.

[51] I. Moustaki and I. Papageorgiou, Latent class models for mixed variables with

applications in Archaeometry, Computational Statistics & Data Analysis, In

Press (2004).

[52] S. Newcomb, A Generalized Theory of the Combination of Observations so as to

Obtain the Best Result, American journal of Mathematics, 8 (1886), pp. 343-

366.

[53] M. K. Pakhira, S. Bandyopadhyay and U. Maulik, Validity index for crisp and fuzzy clusters, Pattern Recognition, 37 (2004), pp. 487-501.

[54] K. Pearson, On the probability that two independent Distributions of frequency

are really samples of the same population with special reference to recent work

on the identity of trypanosome strains, Biometrika, 10 (1914), pp. 85-143.


[55] Y. Qu and S. Xu, Supervised cluster analysis for microarray data based on

multivariate Gaussian mixture, Bioinformatics, 20 (2004), pp. 1905-1913.

[56] J. Rissanen, Modeling by shortest data description, Automatica, 14 (1978), pp.

465-471.

[57] K. Roeder, A graphical Technique for Determining the Number of components

in a Mixture of Normals, Journal of the American Statistical Association, 89

(1994), pp. 487-495.

[58] G. Schwarz, Estimating the Dimension of a Model, The Annals of Statistics, 6 (1978), pp. 461-464.

[59] J. Seo, M. Bakay, Y.-W. Chen, S. Hilmer, B. Shneiderman and E. P. Hoffman, Interactively optimizing signal-to-noise ratios in expression profiling: project-specific algorithm selection and detection p-value weighting in Affymetrix microarrays, Bioinformatics, 20 (2004), pp. 2534-2544.

[60] M. Shaked, On Mixtures from Exponential Families, Journal of the Royal

Statistical Society, Series B (Methodological), 42 (1980), pp. 192-198.

[61] R. Shibata, Information Criteria for Statistical Model Selection, Electronics and

Communications in Japan, Part 3,, 85 (2002), pp. 605-611.

[62] J. Tao, N.-Z. Shi and S.-Y. Lee, Drug risk assessment with determining the

number of sub-populations under finite mixture normal models, Computational

Statistics & Data Analysis, 46 (2004), pp. 661-676.

[63] J. K. Vermunt and J. Magidson, Latent class cluster analysis., J.A. Hagenaars

and A.L. McCutcheon (eds.), Applied Latent Class Analysis, 89-106., Cambridge

University Press, 2002.

[64] M. Vriens, Market Segmentation. Analytical Developments and Applications

Guidlines, Technical Overview Series, Millward Brown IntelliQuest, 2001, pp.

1-42.

[65] H. x. Wang, Q. b. Zhang, B. Luo and S. Wei, Robust mixture modelling using

multivariate t-distribution with missing information, Pattern Recognition Letters,

25 (2004), pp. 701-710.

[66] R. Wehrens, L. M. C. Buydens, C. Fraley and A. E. Raftery, Model-Based

Clustering for Image Segmentation and Large Datasets via Sampling, Journal of

Classification, 21 (2004), pp. 231-253.


[67] X. L. Xie and G. Beni, A Validity Measure for Fuzzy Clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, 13 (1991), pp. 841-847.

[68] X. Yang and S. M. Krishnan, Image segmentation using finite mixtures and spatial information, Image and Vision Computing, 22 (2004), pp. 735-745.

Appendix

Table A1 Categorical clustering variables

Data set                          Clustering variables      Sample size   Number of clusters
Landis 77                         7 categorical             118           3
Judges                            3 categorical             164           3
Universalistic/particularistic    4 categorical             216           2
MAT                               4 categorical             264           2
Store                             5 categorical             412           2
Heinen 2                          5 categorical             542           3
Heinen 3                          5 categorical             542           3
Financial                         4 categorical             743           2
Vdheijiden                        3 categorical (2 Cov.)    811           2
LSAT 6                            5 categorical             1000          2
Political                         5 categorical             1156          2
Gss82 White                       5 categorical             1202          3
Gss82                             5 categorical (1 Cov.)    1644          3
Midtown Manhattan                 2 categorical             1660          2
Depression                        5 categorical (1 Cov.)    1710          3
Gss94                             3 categorical (1 Cov.)    1850          2
Coleman                           2 categorical             3398          2
Eye and Hair color                2 categorical             5387          3
Hannover                          5 categorical             7162          4

Table A2 Normal clustering variables

Data set                       Clustering variables   Sample size   Number of clusters
Glucose in C5YSF1              5 Normal               22            3
Haemophilia A carriers         2 Normal               23            2
Les Papillons                  3 Normal               23            4
Chaetocnema insect data        3 Normal               30            3
H. oleracea / H. carduorum     4 Normal               39            2
Males of 3 sp. Chaetocnema     6 Normal               74            3
Width and Angle                2 Normal               74            3
Diabetes                       3 Normal               145           3
Fisher's Iris                  4 Normal               150           3
Japanese black pine            2 Normal               204           3
Halibut                        2 Normal               208           2
K-means                        2 Normal               300           2
Pearson trypanosome            2 Normal               1000          2


Table A3 Mixed clustering variables

Data set                  Clustering variables          Sample size   Number of clusters
Bird survival             3 Normal, 1 categorical       50            2
Methadone treatment       2 Normal, 2 categorical       238           2
North Central Wisconsin   4 log-Normal, 1 categorical   34            3
Hepatitis                 5 Normal, 5 categorical       80            2
Neolithic Tools           2 Normal, 3 categorical       103           3
Imports-85                15 Normal, 11 categorical     160           7
Heart                     6 Normal, 7 categorical       270           2
Cancer                    8 Normal, 4 categorical       471           3
AIDS                      3 Normal, 3 categorical       944           4
Ethylene Glycol           2 Normal, 1 categorical       1028          2
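
The criteria compared in Tables A4 to A14 are penalized log-likelihood statistics: each combines a fitted model's maximized log-likelihood with a penalty built from the number of free parameters, the sample size and, for the classification-based criteria, the entropy of the posterior membership matrix, and the candidate model minimizing the criterion is selected. The Python sketch below is a minimal illustration of the standard definitions, not code used in this study; AICu, NEC and the L criterion are omitted because they require additional inputs (for instance, NEC needs the one-cluster log-likelihood).

```python
import numpy as np

def classification_entropy(tau, eps=1e-12):
    """EN(tau): entropy of the (n x K) posterior membership matrix."""
    return -np.sum(tau * np.log(tau + eps))

def information_criteria(logL, p, n, tau):
    """Standard criteria for one fitted mixture model; smaller is better.

    logL : maximized log-likelihood
    p    : number of free parameters
    n    : sample size
    tau  : (n, K) posterior cluster-membership probabilities
    """
    en = classification_entropy(tau)
    bic = -2.0 * logL + p * np.log(n)
    return {
        "AIC":     -2.0 * logL + 2.0 * p,
        "AIC3":    -2.0 * logL + 3.0 * p,
        "AICc":    -2.0 * logL + 2.0 * p * n / (n - p - 1.0),
        "CAIC":    -2.0 * logL + p * (np.log(n) + 1.0),
        "BIC":     bic,
        "CLC":     -2.0 * logL + 2.0 * en,  # classification likelihood criterion
        "ICL-BIC": bic + 2.0 * en,          # BIC plus an entropy penalty
        "AWE":     -2.0 * logL + 2.0 * en + 2.0 * p * (1.5 + np.log(n)),
    }
```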

Table A4 AIC performance: structure recovery (%)

                      Multinomial          Multivariate Normal   Mixed
Overall               84% (16/19)          62% (8/13)            40% (4/10)
Number of variables   ≤4: 78% (7/9)        ≤2: 67% (4/6)         <10: 40% (2/5)
                      >4: 90% (9/10)       >2: 57% (4/7)         ≥10: 40% (2/5)
Sample size           ≤1000: 80% (8/10)    ≤130: 86% (6/7)       ≤130: 40% (2/5)
                      >1000: 89% (8/9)     >130: 33% (2/6)       >130: 40% (2/5)
Entropy (Es)          ≤0.7: 92% (12/13)    ≤0.7: -               ≤0.7: -
                      >0.7: 67% (4/6)      >0.7: 62% (8/13)      >0.7: 40% (4/10)

Table A5 AIC3 performance: structure recovery (%)

                      Multinomial          Multivariate Normal   Mixed
Overall               95% (18/19)          62% (8/13)            30% (3/10)
Number of variables   ≤4: 100% (9/9)       ≤2: 67% (4/6)         <10: 40% (2/5)
                      >4: 90% (9/10)       >2: 57% (4/7)         ≥10: 20% (1/5)
Sample size           ≤1000: 100% (10/10)  ≤130: 57% (4/7)       ≤130: 20% (1/5)
                      >1000: 89% (8/9)     >130: 67% (4/6)       >130: 40% (2/5)
Entropy (Es)          ≤0.7: 92% (12/13)    ≤0.7: -               ≤0.7: -
                      >0.7: 83% (5/6)      >0.7: 62% (8/13)      >0.7: 30% (3/10)


Table A6 AICc performance: structure recovery (%)

                      Multinomial          Multivariate Normal   Mixed
Overall               84% (16/19)          39% (5/13)            20% (2/10)
Number of variables   ≤4: 78% (7/9)        ≤2: 50% (3/6)         <10: 20% (1/5)
                      >4: 90% (9/10)       >2: 29% (2/7)         ≥10: 20% (1/5)
Sample size           ≤1000: 80% (8/10)    ≤130: 29% (2/7)       ≤130: 20% (1/5)
                      >1000: 89% (8/9)     >130: 50% (3/6)       >130: 20% (1/5)
Entropy (Es)          ≤0.7: 92% (12/13)    ≤0.7: -               ≤0.7: -
                      >0.7: 67% (4/6)      >0.7: 39% (5/13)      >0.7: 20% (2/10)

Table A7 AICu performance: structure recovery (%)

                      Multinomial          Multivariate Normal   Mixed
Overall               90% (17/19)          31% (4/13)            40% (4/10)
Number of variables   ≤4: 100% (9/9)       ≤2: 50% (3/6)         <10: 40% (2/5)
                      >4: 80% (8/10)       >2: 14% (1/7)         ≥10: 40% (2/5)
Sample size           ≤1000: 90% (9/10)    ≤130: 29% (2/7)       ≤130: 40% (2/5)
                      >1000: 89% (8/9)     >130: 33% (2/6)       >130: 40% (2/5)
Entropy (Es)          ≤0.7: 85% (11/13)    ≤0.7: -               ≤0.7: -
                      >0.7: 100% (6/6)     >0.7: 31% (4/13)      >0.7: 40% (4/10)

Table A8 CAIC performance: structure recovery (%)

                      Multinomial          Multivariate Normal   Mixed
Overall               74% (14/19)          69% (9/13)            70% (7/10)
Number of variables   ≤4: 100% (9/9)       ≤2: 83% (5/6)         <10: 40% (2/5)
                      >4: 50% (5/10)       >2: 57% (4/7)         ≥10: 100% (5/5)
Sample size           ≤1000: 70% (7/10)    ≤130: 57% (4/7)       ≤130: 60% (3/5)
                      >1000: 78% (7/9)     >130: 83% (5/6)       >130: 80% (4/5)
Entropy (Es)          ≤0.7: 62% (8/13)     ≤0.7: -               ≤0.7: -
                      >0.7: 100% (6/6)     >0.7: 69% (9/13)      >0.7: 70% (7/10)

Table A9 BIC performance: structure recovery (%)

                      Multinomial          Multivariate Normal   Mixed
Overall               84% (16/19)          77% (10/13)           70% (7/10)
Number of variables   ≤4: 100% (9/9)       ≤2: 83% (5/6)         <10: 60% (3/5)
                      >4: 70% (7/10)       >2: 71% (5/7)         ≥10: 80% (4/5)
Sample size           ≤1000: 80% (8/10)    ≤130: 71% (5/7)       ≤130: 60% (3/5)
                      >1000: 89% (8/9)     >130: 83% (5/6)       >130: 80% (4/5)
Entropy (Es)          ≤0.7: 77% (10/13)    ≤0.7: -               ≤0.7: -
                      >0.7: 100% (6/6)     >0.7: 77% (10/13)     >0.7: 70% (7/10)


Table A10 CLC performance: structure recovery (%)

                      Multinomial          Multivariate Normal   Mixed
Overall               21% (4/19)           23% (3/13)            10% (1/10)
Number of variables   ≤4: 33% (3/9)        ≤2: 33% (2/6)         <10: 20% (1/5)
                      >4: 10% (1/10)       >2: 14% (1/7)         ≥10: -
Sample size           ≤1000: 30% (3/10)    ≤130: 29% (2/7)       ≤130: -
                      >1000: 11% (1/9)     >130: 17% (1/6)       >130: 20% (1/5)
Entropy (Es)          ≤0.7: -              ≤0.7: -               ≤0.7: -
                      >0.7: 67% (4/6)      >0.7: 23% (3/13)      >0.7: 10% (1/10)

Table A11 ICL-BIC performance: structure recovery (%)

                      Multinomial          Multivariate Normal   Mixed
Overall               16% (3/19)           54% (7/13)            80% (8/10)
Number of variables   ≤4: 33% (3/9)        ≤2: 50% (3/6)         <10: 80% (4/5)
                      >4: -                >2: 57% (4/7)         ≥10: 80% (4/5)
Sample size           ≤1000: 20% (2/10)    ≤130: 57% (4/7)       ≤130: 80% (4/5)
                      >1000: 11% (1/9)     >130: 50% (3/6)       >130: 80% (4/5)
Entropy (Es)          ≤0.7: -              ≤0.7: -               ≤0.7: -
                      >0.7: 50% (3/6)      >0.7: 54% (7/13)      >0.7: 80% (8/10)

Table A12 NEC performance: structure recovery (%)

                      Multinomial          Multivariate Normal   Mixed
Overall               21% (4/19)           46% (6/13)            40% (4/10)
Number of variables   ≤4: 33% (3/9)        ≤2: 50% (3/6)         <10: 40% (2/5)
                      >4: 10% (1/10)       >2: 43% (3/7)         ≥10: 40% (2/5)
Sample size           ≤1000: 30% (3/10)    ≤130: 71% (5/7)       ≤130: 40% (2/5)
                      >1000: 11% (1/9)     >130: 17% (1/6)       >130: 40% (2/5)
Entropy (Es)          ≤0.7: 8% (1/13)      ≤0.7: -               ≤0.7: -
                      >0.7: 50% (3/6)      >0.7: 46% (6/13)      >0.7: 40% (4/10)

Table A13 AWE performance: structure recovery (%)

                      Multinomial          Multivariate Normal   Mixed
Overall               16% (3/19)           23% (3/13)            40% (4/10)
Number of variables   ≤4: 33% (3/9)        ≤2: 33% (2/6)         <10: 40% (2/5)
                      >4: -                >2: 14% (1/7)         ≥10: 40% (2/5)
Sample size           ≤1000: 20% (2/10)    ≤130: 29% (2/7)       ≤130: 40% (2/5)
                      >1000: 11% (1/9)     >130: 17% (1/6)       >130: 40% (2/5)
Entropy (Es)          ≤0.7: -              ≤0.7: -               ≤0.7: -
                      >0.7: 50% (3/6)      >0.7: 23% (3/13)      >0.7: 40% (4/10)


Table A14 L performance: structure recovery (%)

                      Multinomial          Multivariate Normal   Mixed
Overall               58% (11/19)          54% (7/13)            60% (6/10)
Number of variables   ≤4: 78% (7/9)        ≤2: 100% (6/6)        <10: 60% (3/5)
                      >4: 40% (4/10)       >2: 14% (1/7)         ≥10: 40% (2/5)
Sample size           ≤1000: 70% (7/10)    ≤130: 29% (2/7)       ≤130: 40% (2/5)
                      >1000: 44% (4/9)     >130: 83% (5/6)       >130: 80% (4/5)
Entropy (Es)          ≤0.7: 39% (5/13)     ≤0.7: -               ≤0.7: -
                      >0.7: 100% (6/6)     >0.7: 54% (7/13)      >0.7: 60% (6/10)
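
To make the selection rule behind these tables concrete, the sketch below fits Gaussian mixtures with one to six components and keeps the number of components minimizing BIC, the rule whose recovery rates Table A9 summarizes. It is illustrative only: it uses Python and scikit-learn, which are not tools of this study, and synthetic data rather than one of the data sets in Tables A1 to A3.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated synthetic Gaussian clusters, 100 points each
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])

# Fit candidate models with K = 1..6 components and record each BIC
bics = {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    bics[k] = gm.bic(X)

best_k = min(bics, key=bics.get)
print(f"BIC selects {best_k} components")
```

On data this well separated, the minimum-BIC rule recovers the true two-component structure, consistent with BIC's strong showing on Normal clustering variables in Table A9.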