Competitive clustering algorithms based on ultrametric properties
S. Fouchal a,∗,1, M. Ahat b,1, S. Ben Amor c,2, I. Lavallée a,1, M. Bui a,1
a University Paris 8, France
b Ecole Pratique des Hautes Etudes, France
c University of Versailles Saint-Quentin-en-Yvelines, France

a r t i c l e  i n f o

Article history:
Received 30 July 2011
Received in revised form 16 October 2011
Accepted 24 November 2011
Available online xxx

Keywords:
Clustering
Ultrametric
Complexity
Amortized analysis
Average analysis
Ordered space
a b s t r a c t

We propose in this paper two new competitive unsupervised clustering algorithms: the first algorithm deals with ultrametric data; it has a computational cost of O(n). The second algorithm has two strong features: it is fast and flexible on the processed data type as well as in terms of precision. The second algorithm has a computational cost, in the worst case, of O(n²), and in the average case, of O(n). These complexities are due to the exploitation of ultrametric distance properties. In the first method, we use the order induced by an ultrametric in a given space to demonstrate how we can explore data proximity quickly. In the second method, we create an ultrametric space from a data sample, chosen uniformly at random, in order to obtain a global view of the proximities in the data set according to the similarity criterion. Then, we use this proximity profile to cluster the global set. We present an example of our algorithms and compare their results with those of a classic clustering method.
© 2012 Elsevier B.V. All rights reserved.
1. Introduction
Clustering is a useful process that aims to divide a set of elements into a finite number of groups. These groups are organized such that the similarity between elements in the same group is maximal, while the similarity between elements from different groups is minimal [15,17].
There are several approaches to clustering (hierarchical, partitioning, density-based), which are used in a large variety of fields, such as astronomy, physics, medicine, biology, archaeology, geology, geography, psychology, and marketing [24].
Clustering aims to group the objects of a data set into meaningful subclasses, so it can be used as a stand-alone tool to gain insight into the distribution of the data [1,24].
The clustering of high-dimensional data is an open problem encountered by clustering algorithms in different areas [31]. Since the computational cost increases with the size of the data set, feasibility cannot always be guaranteed.
We suggest in this paper two novel unsupervised clustering algorithms. The first is devoted to ultrametric data. It aims to reveal rapidly the inner closeness of the data set by providing a general view of the proximities between elements. It has a computational cost of O(n); thus, it guarantees the clustering of high-dimensional data in ultrametric spaces. It can also be used as a preprocessing algorithm to get a rapid idea of the behavior of the data with the similarity measure used.
∗ Corresponding author. E-mail addresses: [email protected] (S. Fouchal), [email protected] (M. Ahat), [email protected] (S. Ben Amor), [email protected] (I. Lavallée), [email protected] (M. Bui).
1 Laboratoire d'Informatique et des Systèmes Complexes, 41 rue Gay Lussac, 75005 Paris, France, http://www.laisc.net.
2 Laboratoire PRiSM, 45 avenue des Etats-Unis, F-78035 Versailles, France, http://www.prism.uvsq.fr/.
The second method is general: it is applicable to all kinds of data, since it uses a metric measure of proximity. This algorithm rapidly provides a view of the proximities between the elements of a data set, with the desired accuracy. It is based on a sampling approach (see details in [1,15]) and on ultrametric spaces (see details in [20,23,25]).
The computational complexity of the second method is, in the worst case (which is rare), O(n²) + O(m²), where n is the size of the data set and m the size of the sample. The cost in the average case (which is frequent) is O(n) + O(m²). In both cases m is insignificant; we give proofs of these complexities in Proposition 9. Therefore, we write O(n²) + ε and O(n) + ε to represent the two complexities, respectively.
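To give a rough order of magnitude (our own numbers, using the sampling rate suggested later in Section 6): for n = 10⁶ and a sample of m = n/10,000 = 100 elements, ε = O(m²) amounts to about 10⁴ operations, i.e. roughly 1% of the O(n) term and a vanishing fraction of the O(n²) term.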
This algorithm guarantees the clustering of high-dimensional data sets with the desired precision, by giving more flexibility to the user.
Our approaches are based in particular on properties of ultrametric spaces. Ultrametric spaces are ordered spaces such that data from a same cluster are "equidistant" to those of another one (e.g. in genealogy: two species belonging to the same family, "brothers", are at the same distance from the species of another family, "cousins") [8,9].
We utilize ultrametric properties in the first algorithm to cluster data without calculating all mutual similarities.
Fig. 1. Steps of clustering process.
The structure induced by an ultrametric distance allows us to get general information about the proximity between all elements from just one data point; consequently, we reduce the computational cost. We can find ultrametric spaces in many kinds of data sets, such as phylogeny, where the distance of evolution is constant [16], genealogy, library, information and social sciences, to name a few.
In the second algorithm, we use ultrametrics to acquire a first insight into the proximity between elements in just a sample of the data w.r.t. the similarity measure used. Once this view is obtained, we extend it to cluster the whole data set.
The rest of the text is organized as follows. In Section 2, we present a brief overview of clustering strategies. In Section 3, we introduce the notions of metric and ultrametric spaces, distances and balls. Our first algorithm is presented in Section 4. We present an example of this first algorithm in Section 5. The second algorithm is introduced in Section 6. In Section 7, we present an example of the second algorithm and we compare our results with those of a classic clustering algorithm. Finally, in Section 8 we give our conclusion and future work.
2. Related work
The term "clustering" is used in several research communities to describe methods for grouping unlabeled data. The typical pattern of this process can be summarized by the three steps of Fig. 1 [21].
Feature selection and extraction are preprocessing techniques which can be used, either one or both, to obtain an appropriate set of features to use in clustering. Pattern proximity is calculated by a similarity measure, defined on the data set, between pairs of objects. This proximity measure is fundamental to the clustering: the calculation of mutual measures between elements is essential to most clustering procedures. The grouping step can be carried out in different ways; the best-known strategies are defined below [21].
Hierarchical clustering is either agglomerative ("bottom-up") or divisive ("top-down"). The agglomerative approach starts with each element as a cluster and merges them successively until a unique cluster (i.e. the whole set) is formed (e.g. WPGMA [9,10], UPGMA [14]). The divisive approach begins with the whole set and divides it iteratively until it reaches the elementary data. The outcome of hierarchical clustering is generally a dendrogram, which is difficult to interpret when the data set size exceeds a few hundred elements. The complexity of these clustering algorithms is at least O(n²) [28].
Partitional clustering creates clusters by dividing the whole set into k subsets. It can also be used as a divisive step in hierarchical clustering. Among the typical partitional algorithms we can name K-means [5,6,17] and its variants K-medoids, PAM, CLARA and CLARANS. In this kind of algorithm the results depend on the k initially selected data. Since the number of clusters is defined upstream of the clustering, some clusters can be empty.
Density-based clustering is a process where the clusters are regarded as dense regions, which leads to the elimination of the noise. DBSCAN, OPTICS and DENCLUE are typical algorithms based on this strategy [1,4,7,18,24].
Since most clustering algorithms calculate the similarities between all data prior to the grouping phase (whatever the similarity measure used), the computational complexity already reaches O(n²) before the grouping itself is executed.
Our first approach deals with ultrametric spaces: we propose the first – to the best of our knowledge – unsupervised clustering algorithm on this kind of data that does not calculate the similarities between all pairs of data. Thus, we reduce the computational cost of the process from O(n²) to O(n). We give proofs that, since the data processed are described by an ultrametric distance, we do not need to calculate all mutual distances to obtain information about proximity in the data set (cf. Section 4).
Our second approach is a new flexible and fast unsupervised clustering algorithm which costs mostly O(n) + ε and seldom O(n²) + ε (in the rare worst case), where n is the size of the data set and ε is equal to O(m²), m being the size of an insignificant part (sample) of the global set.
Even if the size of the data increases, the complexity of the second proposed algorithm – its amortized complexity [30,32,34] – remains O(n) + ε in the average case and O(n²) + ε in the worst case; thus it can process dynamic data, such as Web and social network data, with the same features.
The two approaches can provide overlapping clusters, where one element can belong to more than one cluster, or even more than two (more general than a weak hierarchy); see [2,4,7] for detailed definitions of overlapping clustering.
3. Definitions
Definition 1. A metric space is a set endowed with a distance between its elements. It is a particular case of a topological space.
Definition 2. We call a distance on a given set E a mapping d: E × E → ℝ⁺ which has the following properties for all x, y, z ∈ E:
1. (Symmetry) d(x, y) = d(y, x),
2. (Positive definiteness) d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y,
3. (Triangle inequality) d(x, z) ≤ d(x, y) + d(y, z).
Example 1. The most familiar metric space is the Euclidean space of dimension n, which we will denote by ℝⁿ, with the standard formula for the distance: d((x₁, …, xₙ), (y₁, …, yₙ)) = ((x₁ − y₁)² + ⋯ + (xₙ − yₙ)²)^(1/2) (1).
Definition 3. Let (E, d) be a metric space. If the metric d satisfies the strong triangle inequality:
∀x, y, z ∈ E, d(x, y) ≤ max{d(x, z), d(z, y)}
then it is called an ultrametric on E [19]. The couple (E, d) is an ultrametric space [11,12,29].
Definition 4. We call the open ball centered on a ∈ E with radius r ∈ ℝ⁺ the set {x ∈ E : d(x, a) < r} ⊆ E, denoted Bᵣ(a) or B(a, r).
Definition 5. We call the closed ball centered on a ∈ E with radius r ∈ ℝ⁺ the set {x ∈ E : d(x, a) ≤ r} ⊆ E, denoted Bf(a, r).
Remark 1. Let d be an ultrametric on E. The closed ball on a ∈ E with radius r > 0 is the set B(a, r) = {x ∈ E : d(x, a) ≤ r}.
Proposition 1. Let d be an ultrametric on E; the following properties hold [11]:
1. If a, b ∈ E, r > 0, and b ∈ B(a, r), then B(a, r) = B(b, r),
2. If a, b ∈ E, 0 < i ≤ r, then either B(a, r) ∩ B(b, i) = ∅ or B(b, i) ⊆ B(a, r). This is not true for every metric space,
3. Every ball is clopen (closed and open) in the topology defined by d (i.e. every ultrametrizable topology is zero-dimensional). Thus, the parts are disconnected in this topology.
Hence, the space defined by d is homeomorphic to a subspace of a countable product of discrete spaces (cf. Remark 2) (see the proof in [11]).
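Property 1 above ("every point of a ball is also a center of it") can be observed directly on a toy ultrametric (our sketch; the three-point distances are invented for illustration):

    # A toy ultrametric on three points: d(x, y) = 2, d(x, z) = d(y, z) = 5.
    D = {('x', 'y'): 2, ('x', 'z'): 5, ('y', 'z'): 5}

    def d(a, b):
        return 0 if a == b else D.get((a, b), D.get((b, a)))

    def ball(S, dist, a, r):
        # closed ball B(a, r) = {p in S : dist(p, a) <= r}
        return {p for p in S if dist(p, a) <= r}

    E = {'x', 'y', 'z'}
    # y lies in B(x, 2), hence B(y, 2) coincides with B(x, 2).
    assert ball(E, d, 'x', 2) == ball(E, d, 'y', 2) == {'x', 'y'}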
Please cite this article in press as: S. Fouchal, et al., Competitive clustering algorithms based on ultrametric properties, J. Comput. Sci.(2012), doi:10.1016/j.jocs.2011.11.004
ARTICLE IN PRESSG ModelJOCS 107 1–13
S. Fouchal et al. / Journal of Computational Science xxx (2012) xxx–xxx 3
Fig. 2. Illustration of some ultrametric distances on a plane.
Remark 2. A topological space is ultrametrizable if and only if it is homeomorphic to a subspace of a countable product of discrete spaces [11].
Definition 6. Let E be a finite set endowed with a distance d. E is classifiable for d if, ∀α ∈ ℝ⁺, the relation on E:
∀x, y ∈ E, x Rα y ⟺ d(x, y) ≤ α
is an equivalence relation.
Thus, we can derive a partition of E as [33]:
• d(x, y) ≤ α ⟹ x and y belong to the same cluster, or,
• d(x, y) > α ⟹ x and y belong to two distinct clusters.
Example 2. Let x, y and z be three points of a plane endowed with the Euclidean distance d, with:
d(x, y) = 2, d(y, z) = 3, d(x, z) = 4.
The set E is not classifiable for α = 3: x R₃ y and y R₃ z hold but x R₃ z does not, so R₃ is not transitive and the classification leads to inconsistency.
Proposition 2. A finite set E endowed with a distance d is classifiable if and only if d is an ultrametric distance on E [11].
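Since for an ultrametric the relation d(x, y) ≤ α is an equivalence, the classes at level α can be read off by comparing each element with a single representative per class. A minimal sketch of ours, assuming d is ultrametric:

    # Partition E into the equivalence classes of d(x, y) <= alpha.
    # Transitivity (guaranteed by the ultrametric) makes one comparison
    # per existing class sufficient for each element.
    def partition(E, d, alpha):
        classes = []                      # each class stores its representative first
        for x in E:
            for c in classes:
                if d(x, c[0]) <= alpha:
                    c.append(x)
                    break
            else:
                classes.append([x])       # x starts a new class
        return classes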
Proposition 3. An ultrametric distance generates an order in a set: viewed three by three, the points form isosceles triangles. Thus, the representation of all the data is fixed whatever the angle of view (see Fig. 2).
Proof 1. Let (E, d) be an ultrametric space, and let x, y, z ∈ E. Assume:
d(x, y) ≤ d(x, z) (1)
d(x, y) ≤ d(y, z) (2)
Then (1) and (2) imply (3) and (4):
d(x, z) ≤ max{d(x, y), d(y, z)} ⟹ d(x, z) ≤ d(y, z) (3)
d(y, z) ≤ max{d(x, y), d(x, z)} ⟹ d(y, z) ≤ d(x, z) (4)
(3) and (4) ⟹ d(x, z) = d(y, z): the two largest sides are equal, so the triangle x, y, z is isosceles.
4. First approach: an O(n) unsupervised clustering method on ultrametric data
We propose here the first – to the best of our knowledge – approach to clustering in ultrametric spaces that does not calculate the similarity matrix. Ultrametric spaces represent many kinds of data sets, such as genealogy, library, information and social science data, to name a few.
Our proposed method has a computational cost of O(n), which makes it tractable to cluster very large databases. It provides an insight into the proximities between all the data.
The idea consists in using the ultrametric space properties, in particular the order induced by the ultratriangular inequality, to cluster all elements by comparison with only one of them, chosen uniformly at random.
Since the structure of an ultrametric space is frozen (cf. Proposition 3), we do not need to calculate all mutual distances: the distances to just one element are sufficient to determine the clusters, as clopen balls (or closed balls; cf. Proposition 1.3). Consequently, we avoid the O(n²) computation. Indeed, our objective is to propose a solution to the problem of computational cost, notably the feasibility of clustering algorithms, on large databases.
4.1. Principle

Hypothesis 1. Consider E a set of data endowed with an ultrametric distance d.
Our method is composed of the 3 following steps:
Step 1. Choose uniformly at random one element of E (see Fig. 3);
Step 2. Calculate the distances between the chosen element and all the others;
Step 3. Define thresholds and represent the distribution of all the data according to these thresholds and the distances calculated in step 2 (see Fig. 4).
Fig. 3. Step 1: choosing uniformly at random one element A. We use the same scheme as in Fig. 2 to depict the structure of the elements, but with our algorithm we do not need to calculate all the distances (i.e. step 2).
Proposition 4. The clusters are distinguished by the closed balls (or thresholds) around the chosen element.
Proof 2. An ultrametric space is a classifiable set (cf. Proposition 2). Thus, comparing mutual distances with a given threshold shows whether or not elements belong to the same cluster. Hence, the distribution of the data around any element of the set entirely reflects their proximity.
Proposition 5. The random choice of the initial point does not affect the resulting clusters.
Proof 3. Since an ultrametric distance generates an order in a set, the representation of all the data is fixed whatever the angle of view. Hence, for every chosen element the same isosceles triangles are formed.
Proposition 6. The computational cost of this algorithm is equal to O(n).
Proof 4. The algorithm consists in calculating the distances between the chosen element and the remaining n − 1 elements.
4.2. Algorithm

Procedure UCMUMD:
• Variables:
  1. Ultrametric space:
     a. data set E of size n;
     b. ultrametric distance du.
• Begin:
  1. Choose uniformly at random one element x ∈ E;
  2. Determine the intervals of the clusters; // flexibility depending on context
  3. For i < n − 1:
     a. Calculate du(x, i); // complexity = O(n)
     b. Allocate i to the cluster of the corresponding interval.
  4. End for
  5. Return result.
• End
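The procedure above translates almost line by line into code. A minimal runnable sketch (ours; the interval representation is an assumption, since the paper leaves the choice of thresholds to the user):

    import random

    # UCMUMD sketch: cluster an ultrametric data set around one random element.
    # E         -- list of elements
    # du        -- ultrametric distance function du(x, y)
    # intervals -- list of (lo, hi) threshold pairs, e.g. [(0.0, 0.44), (0.44, 0.443)]
    def ucmumd(E, du, intervals):
        x = random.choice(E)                       # step 1: random reference element
        clusters = {iv: [] for iv in intervals}
        for e in E:                                # one distance per element -> O(n)
            dist = du(x, e)                        # step 2
            for lo, hi in intervals:               # step 3: allocate to an interval
                if lo <= dist <= hi:
                    clusters[(lo, hi)].append(e)
                    break
        return clusters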
5. First approach: example
We present in this section a comparison of our first algorithm with WPGMA. The Weighted Pair Group Method with Arithmetic mean (WPGMA) is a hierarchical clustering algorithm developed by McQuitty in 1966; it has a computational complexity of O(n²) [27]. We tested the two methods on sets of 34 and 100 data, respectively.
Let us consider the ultrametric distance matrix of 34 data (see Fig. 5), calculated from a similarity matrix based on the Normalized Information Distance NID [10].
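For the WPGMA side of the comparison, an off-the-shelf implementation can be used; a sketch with SciPy, where 'weighted' is SciPy's name for the WPGMA linkage (the 3 × 3 matrix below is only a placeholder for the 34 × 34 matrix of Fig. 5):

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    D = np.array([[0.00, 0.19, 0.45],      # placeholder distance matrix
                  [0.19, 0.00, 0.45],
                  [0.45, 0.45, 0.00]])
    Z = linkage(squareform(D), method='weighted')   # 'weighted' = WPGMA
    print(Z)   # the merge history behind a dendrogram such as Fig. 6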
The clusters resulting from WPGMA on the 34 data are shown in the dendrogram of Fig. 6.
To test our method we determine (arbitrarily) the following intervals for the desired clusters (step 2): [0.000, 0.440], [0.440, 0.443], [0.443, 0.445], [0.445, 0.453], [0.453, 0.457], [0.457, 0.460], [0.463, 0.466], [0.466, 0.469], [0.469, 0.470], [0.470, 0.900].
The results with our algorithm (Figs. 7 and 8), using two different data as starting points, show the same proximities between the data as those of the WPGMA clustering (cf. Fig. 6). Since an ultrametric distance induces an ordered space, the inter-cluster and intra-cluster inertia required for the clustering are guaranteed.
We see that even if we change the starting point, the proximity in the data set stays the same (cf. Proposition 5). The clusters are the same, as shown in the figures.
Fig. 4. Representation of all elements around A, according to the calculated distances and thresholds: the distances are those between clusters, and the thresholds are the radii of the balls.
Fig. 9 summarizes the results of the two methods on the same dendrogram; it shows that the clusters generated by our algorithm (balls) are similar to those of WPGMA.
We also compared the two methods on 100 words chosen randomly from a dictionary. The WPGMA results are shown in Fig. 10. Our results are the same whatever the chosen data (as we saw in the first example); here we choose the data "toit", and Fig. 11 shows the resulting clusters (see also Fig. 12).
We see in these comparisons that, first, the results of our method are similar to those of the hierarchical clustering (WPGMA), hence the consistency of our algorithm; second, the clusters remain unchanged whatever the selected data.
NB: A few differences between the results are due to the (arbitrarily) chosen threshold values, just as slightly different classes are obtained with different cuts of the dendrograms.
Our first objective was to propose a method which provides a general view of data proximity, in particular in large databases. This objective is achieved by giving a consistent result in just O(n) iterations, which makes the clustering of high-dimensional data feasible.
6. Second approach: fast and flexible unsupervised clustering algorithm

Our second approach is a new unsupervised clustering algorithm which is valuable for any kind of data, since it uses a distance to measure the proximity between objects. The computational cost of this algorithm is O(n) + ε in the best and average cases and O(n²) + ε in the worst case, which is rare. The value ε = O(m²) (where m is the size of a sample) is negligible in front of n.
Fig. 5. Ultrametric distance matrix of 34 data.
Thus, our algorithm is able to treat very large databases. The user can obtain the desired accuracy on the proximities between elements.
The idea consists in exploiting ultrametric properties, in particular the order induced by the ultratriangular inequality on a given space. First we deduce the behavior of all the data relative to the distance used, by creating an ultrametric (ordered) space from just a sample of the data (a subset) chosen uniformly at random, of size m (small compared to n). Then, we cluster the global data set according to this order information.
The construction of the ordered space from the sample costs ε = O(m²) operations. Once this ultrametric space (structured sample) is built, we use the ultrametric distances to fix the thresholds (or intervals) of the clusters, which depend on the choice of the user. This freedom of choice makes our algorithm useful for all kinds of data and makes the user an actor in the construction of the algorithm.
After determining the clusters and their intervals, we select uniformly at random one representative per cluster; then we compare them with the rest of the data in the global set (of size n − m), one by one. Next, we add each data point to the cluster of the closest representative according to the intervals.
Fig. 6. Clustering of the 34 data with WPGMA.
Fig. 7. Results using fin-whale as the chosen origin.
If the compared data point is far from all the representatives, we create a new cluster.
Remark 3. This step costs O(x · n) + ε operations in the average case, where x is the number of clusters, which is generally insignificant compared to n; therefore we keep only O(n) + ε. But in the worst case, where the clustering provides only singletons (which is rare, since the aim of clustering is to group objects), x is equal to n, and consequently the computational cost is O(n²) + ε.
6.1. Principle

Hypothesis 2. Consider E a set of data and a distance d.
The method is composed of the following steps:
Step 1. Choose uniformly at random a sample of data from E, of size m (e.g. m = n/10,000 if n = 1 M).
Remark 4. The choice of the sample depends on the processed data and on the expert of the domain. The user is an actor in the execution of the algorithm: he can intervene in the choice of the input and refine the output (see details in step 4).
Remark 5. The value of m depends on the value of n compared to the bounds of d [3].
Example 3. Consider a set of n = 100,000 data and a distance d ∈ [0, 0.5]: if we use a sample of 50 data chosen uniformly at random, we can easily get a broad idea of the behavior of the data according to d. But if n = 500 and d ∈ [0, 300], and we choose a sample of 5 data, we are not sure to get enough information about how the data behave with d.
Remark 6. The larger the size n is, the smaller the size m is compared to n [3].
Step 2. Execute a classic hierarchical clustering algorithm (e.g. WPGMA) with the distance d on the chosen sample.
Fig. 8. Results using horse as the chosen origin.
Step 3. Represent the distances in the resulting dendrogram; the ultrametric space is thus built.
Step 4. Deduce the cluster intervals (closed balls or thresholds), which depend on d and on the clustering sought, as a broad or precise view of the proximities between elements (this step must be specified by the user or a specialist of the domain).
Step 5. Choose uniformly at random one representative per cluster from the result of step 2.
Step 6. Pick the rest of the data one by one and compare them, according to d, with the cluster representatives of the previous step:
• If the compared data point is close to one (or more) of the representatives w.r.t. the defined thresholds, then add it to the cluster of that representative.
• Else, create a new cluster.
Remark 7. If the compared data point is close to more than one representative, then it is added to more than one cluster, consequently generating an overlapping clustering [4,7].
Proposition 7. The clusters are distinguished by the closed balls (or thresholds).
Proof 5. An ultrametric space is a classifiable set (cf. Proposition 2). Thus, comparing mutual distances with a given threshold shows whether or not elements belong to the same cluster.
Proposition 8. The random choice of the initial points does not affect the resulting clusters.
Proof 6. Since the generated space is ultrametric, it is endowed with a strong intra-cluster inertia that allows this choice.
Proposition 9. The computational cost of our algorithm is, in the average case, O(n), and in the rare worst case, O(n²).
Fig. 9. Clustering with our method and WPGMA.
Table 1
Example 1: clustering of the 34 data in O(n) + ε.

Thresholds                          Clusters
Pig: closest items by d ≤ 0.881     whiterhinoceros, cow
Pig: d ∈ [0.881, 0.889]             indian-rhinoceros, gray-seal, harbor-seal, cat, sheep
Pig: d ∈ [0.889, 0.904]             dog
Pig: d ∈ [0.904, 0.919]             pigmy-chimpanzee, orangutan, gibbon, human, gorilla
Pig: d ∈ [0.919, 0.929]             chimpanzee, baboon
Bluewhale: d ≤ 0.881                fin-whale
Bluewhale: d ∈ [0.881, 0.889]       –
Bluewhale: d ∈ [0.889, 0.904]       donkey, horse
Bluewhale: d ∈ [0.904, 0.919]       –
Bluewhale: d ∈ [0.919, 0.929]       –
Aardvark: d ≤ 0.881                 –
Aardvark: d ∈ [0.881, 0.889]        –
Aardvark: d ∈ [0.889, 0.904]        –
Aardvark: d ∈ [0.904, 0.919]        rat, mouse, hippopotamus, fat-dormouse, armadillo, fruit-bat, squirrel, rabbit
Aardvark: d ∈ [0.919, 0.929]        elephant, opossum, wallaroo
Guineapig: d ≤ 0.881                –
Guineapig: d ∈ [0.881, 0.889]       –
Guineapig: d ∈ [0.889, 0.904]       –
Guineapig: d ∈ [0.904, 0.919]       –
Guineapig: d ∈ [0.919, 0.929]       –
Platypus: d ∈ [0.881, 0.929]        –
Fig. 10. Clustering of 100 words with WPGMA.
Proof 7.
• In both cases we ignore the complexity of the clustering of the sample and of the construction of the ultrametric space, ε (i.e. steps 2 and 3), because the size of the sample is insignificant compared to the cardinality of the global data set.
• The worst case, where the clustering provides only singletons, is rare, because this process aims to group similar data into clusters whose number is usually small compared to the data set size.
• Since there is homogeneity between the data according to the proximity measure used, the number of clusters is unimportant compared to the cardinality of the data set; consequently, the complexity of our algorithm is usually O(n).
6.2. Algorithm

Procedure FFUCA
• Variables:
  1. Metric space:
     a. data set E of size n;
     b. distance d.
• Begin:
  1. Choose a sample of data, of size adjustable according to the data type (we use √|E| as the sample size); // flexibility for the data expert
  2. Do a clustering of the sample with a classic hierarchical clustering algorithm; // complexity O(m²), where m = sample size (O(n) here)
  3. Deduce an ultrametric space from the clustered sample; // to exploit ultrametric properties in clustering; complexity O(n)
  4. Determine intervals from the hierarchy of step 3; // flexibility for the user to refine the size of the output clusters
  5. Choose uniformly at random one representative per cluster from the clusters of step 4;
  6. For i < n − nb; i++: // nb = number of chosen representatives
     For j < nb; j++:
       a. Calculate d(i, j); // complexity = O(nb × n); rare worst case O(n²), providing only singletons
          If d(i, j) ∈ interval of the cluster of j:
            Then: allocate i to the cluster of j;
              If i ∈ more than one cluster:
              Then: keep it only in the cluster of the closest representative; // an element can belong to two clusters if it is nearest to two representatives by the same distance
              End If
          Else: create a new cluster and allocate i to this cluster.
          End If
     End for
  7. End for
  8. Return result.
• End
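A condensed runnable sketch of FFUCA follows (ours; steps 2–4 are stood in for by a single median cutoff instead of a full WPGMA dendrogram, and the overlapping membership of Remark 7 is collapsed to the closest representative):

    import math
    import random

    def ffuca(E, d):
        n = len(E)
        m = max(2, math.isqrt(n))                 # step 1: sample of size ~ sqrt(|E|)
        sample = random.sample(E, m)
        # Steps 2-4 (assumption): one cutoff derived from the sample distances
        ds = sorted(d(a, b) for a in sample for b in sample if a is not b)
        cutoff = ds[len(ds) // 2]
        reps, clusters = [sample[0]], {0: [sample[0]]}   # step 5: first representative
        for e in E:                               # step 6: single pass over the data
            if e is sample[0]:
                continue
            best, best_d = None, cutoff
            for k, r in enumerate(reps):          # O(nb) comparisons per element
                dist = d(r, e)
                if dist <= best_d:
                    best, best_d = k, dist
            if best is None:                      # far from every representative:
                reps.append(e)                    # open a new cluster
                clusters[len(reps) - 1] = [e]
            else:
                clusters[best].append(e)
        return clusters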
Fig. 11. Clustering of 100 words with chosen data toit.
7. Example

This section is devoted to a comparison of our algorithm with WPGMA (cf. Section 5). We tested the two methods on the same set of 34 mitochondrial DNA sequences as in the first example, but here we used the metric (not ultrametric) Normalized Information Distance NID [10] to calculate the proximities between elements. For WPGMA, we calculate a similarity matrix, which costs O(n²) (see Appendix A). With our proposed method, we do not need to calculate the matrix: we calculate only the distances between the chosen representatives and the rest of the elements of the whole set; this step costs O(nb · n) operations (where nb is the number of representatives). The dendrogram resulting from WPGMA on the 34 data is shown in Fig. 6 (cf. Section 5).
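The NID itself is not computable exactly; in practice it is approximated by compression. A sketch (ours) of the standard Normalized Compression Distance, the usual practical stand-in for the NID of [10]:

    import zlib

    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    # where C(s) is the compressed size of s.
    def ncd(x: bytes, y: bytes) -> float:
        cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
        cxy = len(zlib.compress(x + y))
        return (cxy - min(cx, cy)) / max(cx, cy)

    print(ncd(b'ACGT' * 100, b'ACGA' * 100))   # similar sequences -> small distance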
We execute FFUCA two times with different inputs chosen randomly. The first step consists in choosing uniformly at random a sample of data and clustering this sample with a classic clustering algorithm according to the similarity measure used; then we create an ultrametric space from the resulting clusters.
In this first example we choose arbitrarily the following data: bluewhale, hippopotamus, pig, aardvark, guineapig and chimpanzee.
We did the clustering and established the ordered (ultrametric) space using WPGMA on the chosen data; the result is depicted in Fig. 13.
The second step is the deduction of the intervals (clusters); for this we use the valuation of the dendrogram of Fig. 13. We keep here five thresholds: [0, 0.881], [0.881, 0.889], [0.889, 0.904], [0.904, 0.919], [0.919, 0.929]. Since the size of the data set is small (34), we use the chosen data as representatives of the clusters. Table 1 shows the resulting clusters with the first inputs; the intervals are represented in the left column and the clusters in the right column.
NB: The chosen elements belong only to the first rows of their block (e.g. pig is grouped with {whiterhinoceros, cow}). The last rows of the table represent new clusters, which are far from all the other clusters; they are considered as outliers.
In the second example we choose donkey, gray-seal, pig, fat-dormouse and gorilla as starting data. After the clustering and the construction of the ultrametric space from this chosen sample, we keep the following intervals: [0, 0.8827], [0.8827, 0.8836], [0.8836, 0.8900], [0.8900, 0.9000]. We do not list the details of the different steps in this second example; we show the results in Table 2.
Fig. 12. Clustering of 100 words with chosen data toits.
Table 2
Example 2: clustering of the 34 data in O(n) + ε.

Thresholds                              Clusters
Donkey: closest items by d ≤ 0.8827     indian-rhinoceros, whiterhinoceros, horse
Donkey: d ∈ [0.8827, 0.8836]            –
Donkey: d ∈ [0.8836, 0.8900]            opossum
Donkey: d ∈ [0.8900, 0.9000]            –
Gray-seal: d ≤ 0.8827                   harbor-seal, cat
Gray-seal: d ∈ [0.8827, 0.8836]         dog
Gray-seal: d ∈ [0.8836, 0.8900]         –
Gray-seal: d ∈ [0.8900, 0.9000]         –
Pig: d ≤ 0.8827                         cow, fin-whale, bluewhale
Pig: d ∈ [0.8827, 0.8836]               sheep
Pig: d ∈ [0.8836, 0.8900]               hippopotamus
Pig: d ∈ [0.8900, 0.9000]               squirrel, rabbit
Fat-dormouse: d ≤ 0.8827                –
Fat-dormouse: d ∈ [0.8827, 0.8836]      –
Fat-dormouse: d ∈ [0.8836, 0.8900]      –
Fat-dormouse: d ∈ [0.8900, 0.9000]      –
Gorilla: d ≤ 0.8827                     chimpanzee, pigmy-chimpanzee, human, orangutan, gibbon
Gorilla: d ∈ [0.8827, 0.8836]           –
Gorilla: d ∈ [0.8836, 0.8900]           –
Gorilla: d ∈ [0.8900, 0.9000]           baboon
Armadillo: d ∈ [0.8827, 0.9000]         –
Platypus: d ∈ [0.8827, 0.9000]          –
Aardvark: d ∈ [0.8827, 0.9000]          –
Mouse: d ∈ [0.8827, 0.9000]             rat
Wallaroo: d ∈ [0.8827, 0.9000]          –
Elephant: d ∈ [0.8827, 0.9000]          –
Guineapig: d ∈ [0.8827, 0.9000]         –
Fig. 13. Clustering of chosen data: construction of the ordered space.
We see in the tables that, first, our results are similar to those of the hierarchical clustering (WPGMA), hence the consistency of our algorithm; second, the clusters remain unchanged whatever the selected data.
The main objective of this section was to propose an unsupervised, flexible clustering method which rapidly provides the desired view of data proximity. This objective is achieved by giving consistent results, from the point of view of intra-cluster and inter-cluster inertia according to the aim of the clustering (i.e. indicating which elements are closest and which are farthest according to the similarity), in generally O(n) + ε and rarely O(n²) + ε iterations, which consequently makes the clustering of high-dimensional data feasible.
This algorithm can be applied to large data sets, thanks to its complexity. It can be used easily, thanks to its simplicity. It can treat dynamic data while keeping a competitive computational cost: the amortized complexity [30,32,34] is O(n) + ε on average and rarely O(n²) + ε. Our method can also be used as a hierarchical clustering method, by inserting the elements into the first (sample) dendrogram according to the intervals.
8. Conclusion
We proposed in this paper two novel competitive unsupervised clustering algorithms which can overcome the problem of feasibility in large databases, thanks to their computation times.
The first algorithm treats ultrametric data; it has a complexity of O(n). We gave proofs that in an ultrametric space we can avoid the calculation of all mutual distances without losing information about the proximities in the data set.
The second algorithm deals with all kinds of data. It has two strong particularities: the first feature is flexibility; the user is an actor in the construction and the execution of the algorithm, and he can manage the input and refine the output. The second characteristic concerns the computational cost, which is O(n) in the average case and O(n²) in the rare worst case.
We showed that our methods treat data faster while preserving the information about proximities, and that the result does not depend on a specific parameter.
Our future work aims first to test our methods on real data and to compare them with other clustering algorithms using validation measures such as the V-measure and F-measure. Second, we intend to generalize the second method to treat dynamic data such as those of social networks and the detection of relevant information on web sites.
Uncited references
[13,22,26].
Appendix A. Metric distance matrix
See Fig. A.1.
Fig. A.1. Metric distance matrix of 34 data.
References
[1] M. Ankerst, M.M. Breunig, H.-P. Kriegel, J. Sander, OPTICS: ordering points to identify the clustering structure, in: ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, 1999.
[2] J.P. Barthélemy, F. Brucker, Binary clustering, Journal of Discrete Applied Mathematics 156 (2008) 1237–1250.
[3] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, When is nearest neighbor meaningful?, in: Proceedings of the 7th International Conference on Database Theory (ICDT), LNCS 1540, Springer-Verlag, London, UK, 1999, pp. 217–235.
[4] F. Brucker, Modèles de classification en classes empiétantes, PhD Thesis, Équipe d'accueil: Dép. IASC de l'École Supérieure des Télécommunications de Bretagne, France, 2001.
[5] M. Chavent, Y. Lechevallier, Empirical comparison of a monothetic divisive clustering method with the Ward and the k-means clustering methods, in: Proceedings of the IFCS Data Science and Classification, Springer, Ljubljana, Slovenia, 2006, pp. 83–90.
[6] G. Cleuziou, An extended version of the k-means method for overlapping clustering, in: Proceedings of the Nineteenth International Conference on Pattern Recognition, 2008.
[7] G. Cleuziou, Une méthode de classification non-supervisée pour l'apprentissage de règles et la recherche d'information, PhD Thesis, Université d'Orléans, France, 2006.
[8] S. Fouchal, M. Ahat, I. Lavallée, M. Bui, An O(N) clustering method on ultrametric data, in: Proceedings of the 3rd IEEE Conference on Data Mining and Optimization (DMO), Malaysia, 2011, pp. 6–11.
[9] S. Fouchal, M. Ahat, I. Lavallée, Optimal clustering method in ultrametric spaces, in: Proceedings of the 3rd IEEE International Workshop on Performance Evaluation of Communications in Distributed Systems and Web-based Service Architectures (PEDISWESA), Greece, 2011, pp. 123–128.
[10] S. Fouchal, M. Ahat, I. Lavallée, M. Bui, S. Benamor, Clustering based on Kolmogorov complexity, in: Proceedings of the 14th International Conference on Knowledge-based and Intelligent Information and Engineering Systems (KES), Part I, Cardiff, UK, Springer, 2010, pp. 452–460.
[11] L. Gajić, Metric locally constant function on some subset of ultrametric space, Novi Sad Journal of Mathematics (NSJOM) 35 (1) (2005) 123–125.
[12] L. Gajić, On ultrametric space, Novi Sad Journal of Mathematics (NSJOM) 31 (2) (2001) 69–71.
[13] G. Gardanel, Espaces de Banach ultramétriques, Séminaire Delange-Pisot-Poitou, Théorie des nombres, Tome 9 (2), exp. G2, G1–G8, 1967–1968.
[14] I. Gronau, S. Moran, Optimal implementations of UPGMA and other common clustering algorithms, Information Processing Letters 104 (6) (2007) 205–210.
[15] S. Guha, R. Rastogi, K. Shim, An efficient clustering algorithm for large databases, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, No. 28, 1998, pp. 73–84.
[16] S. Guindon, Méthodes et algorithmes pour l'approche statistique en phylogénie, PhD Thesis, Université Montpellier II, 2003.
[17] A. Hatamlou, S. Abdullah, Z. Othman, Gravitational search algorithm with heuristic search for clustering problems, in: 3rd IEEE Conference on Data Mining and Optimization (DMO), Malaysia, 2011, pp. 190–193.
[18] A. Hinneburg, D.A. Keim, An efficient approach to clustering in large multimedia databases with noise, in: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD), New York, USA, 1998, pp. 58–65.
[19] L. Hubert, P. Arabie, J. Meulman, Modeling dissimilarity: generalizing ultrametric and additive tree representations, British Journal of Mathematical and Statistical Psychology 54 (2001) 103–123.
[20] L. Hubert, P. Arabie, Iterative projection strategies for the least squares fitting of tree structures to proximity data, British Journal of Mathematical and Statistical Psychology 48 (1995) 281–317.
[21] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Computing Surveys 31 (3) (1999) 264–323.
[22] G. Karypis, E.H. Han, V. Kumar, Chameleon: hierarchical clustering using dynamic modeling, Computer 32 (8) (1999) 68–75.
[23] M. Krasner, Espace ultramétrique et valuation, Séminaire Dubreil, Algèbre et théorie des nombres, Tome 1, exp. 1, 1–17, 1947–1948.
[24] H.-P. Kriegel, P. Kröger, A. Zimek, Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering, ACM Transactions on Knowledge Discovery from Data 3 (1) (2009), Article 1.
[25] B. Leclerc, Description combinatoire des ultramétriques, Mathématiques et Sciences Humaines 73 (1981) 5–37.
[26] V. Levorato, T.V. Le, M. Lamure, M. Bui, Classification prétopologique basée sur la complexité de Kolmogorov, Studia Informatica 7 (1), 198–222, Hermann éditeurs, 2009.
[27] F. Murtagh, Complexities of hierarchic clustering algorithms: state of the art, Computational Statistics Quarterly 1 (2) (1984) 101–113.
[28] F. Murtagh, A survey of recent advances in hierarchical clustering algorithms, The Computer Journal 26 (4) (1983) 354–359.
[29] K.P.R. Rao, G.N.V. Kishore, T.R. Rao, Some coincidence point theorems in ultrametric spaces, International Journal of Mathematical Analysis 1 (18) (2007) 897–902.
[30] D.D. Sleator, R.E. Tarjan, Amortized efficiency of list update and paging rules, Communications of the ACM 28 (1985) 202–208.
[31] M. Sun, L. Xiong, H. Sun, D. Jiang, A GA-based feature selection for high-dimensional data clustering, in: Proceedings of the 3rd International Conference on Genetic and Evolutionary Computing (WGEC), 2009, pp. 769–772.
[32] R.E. Tarjan, Data Structures and Network Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1983.
[33] M. Terrenoire, D. Tounissoux, Classification hiérarchique, Université de Lyon, 2003.
[34] A.C.-C. Yao, Probabilistic computations: toward a unified measure of complexity, in: Proceedings of the 18th IEEE Conference on Foundations of Computer Science, 1977, pp. 222–227.
Said Fouchal is a PhD student at the University Paris 8. The topic of his research work is the clustering of complex objects. His research is based on ultrametrics, computational complexity, Kolmogorov complexity and clustering.