Competitive clustering algorithms based on ultrametric properties


S. Fouchal a,∗,1, M. Ahat b,1, S. Ben Amor c,2, I. Lavallée a,1, M. Bui a,1

a University Paris 8, France
b Ecole Pratique des Hautes Etudes, France
c University of Versailles Saint-Quentin-en-Yvelines, France

∗ Corresponding author. E-mail addresses: [email protected] (S. Fouchal), [email protected] (M. Ahat), [email protected] (S. Ben Amor), [email protected] (I. Lavallée), [email protected] (M. Bui).
1 Laboratoire d'Informatique et des Systèmes Complexes, 41 rue Gay Lussac, 75005 Paris, France, http://www.laisc.net.
2 Laboratoire PRiSM, 45 avenue des Etats-Unis, F-78035 Versailles, France, http://www.prism.uvsq.fr/.

Article history:
Received 30 July 2011
Received in revised form 16 October 2011
Accepted 24 November 2011
Available online xxx

Keywords:
Clustering
Ultrametric
Complexity
Amortized analysis
Average analysis
Ordered space

Abstract

We propose in this paper two new competitive unsupervised clustering algorithms. The first algorithm deals with ultrametric data and has a computational cost of O(n). The second algorithm has two strong features: it is fast and flexible with respect to the processed data type as well as in terms of precision. The second algorithm has a computational cost, in the worst case, of O(n²), and in the average case, of O(n). These complexities are due to the exploitation of ultrametric distance properties. In the first method, we use the order induced by an ultrametric in a given space to demonstrate how we can explore data proximity quickly. In the second method, we create an ultrametric space from a sample of the data, chosen uniformly at random, in order to obtain a global view of the proximities in the data set according to the similarity criterion. Then we use this proximity profile to cluster the global set. We present an example of our algorithms and compare their results with those of a classic clustering method.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Clustering is a useful process that aims to divide a set of elements into a finite number of groups. These groups are organized such that the similarity between elements within a group is maximal, while the similarity between elements from different groups is minimal [15,17].

There are several approaches to clustering (hierarchical, partitioning, density-based), which are used in a large variety of fields, such as astronomy, physics, medicine, biology, archaeology, geology, geography, psychology, and marketing [24].

Clustering groups the objects of a data set into meaningful subclasses, so it can be used as a stand-alone tool to gain insight into the distribution of data [1,24].

The clustering of high-dimensional data is an open problem encountered by clustering algorithms in different areas [31]. Since the computational cost increases with the size of the data set, feasibility cannot be fully guaranteed.

We suggest in this paper two novel unsupervised clustering algorithms. The first is devoted to ultrametric data. It aims to show rapidly the inner closeness in the data set by providing a general view of the proximities between elements. It has a computational cost of O(n); thus, it guarantees the clustering of high-dimensional data in ultrametric spaces. It can also be used as a preprocessing algorithm to get a rapid idea of how the data behave with the similarity measure used.

The second method is general: it is applicable to all kinds of data, since it uses a metric measure of proximity. This algorithm rapidly provides a view of the proximities between elements in a data set with the desired accuracy. It is based on a sampling approach (see details in [1,15]) and on ultrametric spaces (see details in [20,23,25]).

The computational complexity of the second method is, in the worst case (which is rare), O(n²) + O(m²), where n is the size of the data and m the size of the sample. The cost in the average case, which is frequent, is O(n) + O(m²). In both cases m is insignificant; we give proofs of these complexities in Proposition 9. Therefore, we use O(n²) + ε and O(n) + ε to represent the two complexities, respectively.

This algorithm guarantees the clustering of high-dimensional data sets with the desired precision by giving more flexibility to the user.

Our approaches are based in particular on properties of ultrametric spaces. Ultrametric spaces are ordered spaces such that data from one cluster are "equidistant" to those of another one (e.g. in genealogy: two species belonging to the same family, "brothers", are at the same distance from species of another family, "cousins") [8,9].

We utilize ultrametric properties in the first algorithm to cluster data without calculating all mutual similarities. The structure induced by an ultrametric distance allows us to obtain general information about the proximity between all elements from just one data point; consequently, we reduce the computational cost. We can find ultrametric spaces in many kinds of data sets, such as phylogeny (where the distance of evolution is constant [16]), genealogy, library, information and social sciences, to name a few.

In the second algorithm, we use an ultrametric to acquire a first insight into the proximity between elements in just a sample of the data w.r.t. the similarity used. Once this view is obtained, we expand it to cluster the whole data set.

Fig. 1. Steps of clustering process.

The rest of the text is organized as follows. In Section 2, we present a brief overview of clustering strategies. In Section 3, we introduce the notions of metric and ultrametric spaces, distances and balls. Our first algorithm is presented in Section 4. We present an example of this first algorithm in Section 5. The second algorithm is introduced in Section 6. In Section 7, we present an example of the second algorithm and compare our results with those of a classic clustering algorithm. Finally, in Section 8 we give our conclusion and future work.

2. Related work

The term "clustering" is used in several research communities to describe methods for grouping unlabeled data. The typical pattern of this process can be summarized by the three steps of Fig. 1 [21].

Feature selection and extraction are preprocessing techniques which can be used, either one or both, to obtain an appropriate set of features to use in clustering. Pattern proximity is calculated by a similarity measure, defined on the data set, between pairs of objects. This proximity measure is fundamental to clustering: the calculation of mutual measures between elements is essential to most clustering procedures. The grouping step can be carried out in different ways; the best-known strategies are defined below [21].

Hierarchical clustering is either agglomerative ("bottom-up") or divisive ("top-down"). The agglomerative approach starts with each element as a cluster and merges them successively until forming a unique cluster (i.e. the whole set) (e.g. WPGMA [9,10], UPGMA [14]). The divisive approach begins with the whole set and divides it iteratively until it reaches the elementary data. The outcome of hierarchical clustering is generally a dendrogram, which is difficult to interpret when the data set size exceeds a few hundred elements. The complexity of these clustering algorithms is at least O(n²) [28].

Partitional clustering creates clusters by dividing the whole set into k subsets. It can also be used as a divisive algorithm in hierarchical clustering. Among the typical partitional algorithms we can name K-means [5,6,17] and its variants K-medoids, PAM, CLARA and CLARANS. In this kind of algorithm the results depend on the k selected data; since the number of clusters is defined upstream of the clustering, clusters can be empty.

Density-based clustering is a process where the clusters are regarded as dense regions, leading to the elimination of noise. DBSCAN, OPTICS and DENCLUE are typical algorithms based on this strategy [1,4,7,18,24].

Since the major clustering algorithms calculate the similarities between all data prior to the grouping phase (for all types of similarity measure used), the computational complexity rises to O(n²) before the execution of the clustering algorithm.

Our first approach deals with ultrametric spaces: we propose the first – to the best of our knowledge – unsupervised clustering algorithm on this kind of data that does not calculate the similarities between all pairs of data. We thus reduce the computational cost of the process from O(n²) to O(n). We give proofs that, since the data processed are described with an ultrametric distance, we do not need to calculate all mutual distances to obtain information about the proximities in the data set (cf. Section 4).

Our second approach is a new flexible and fast unsupervised clustering algorithm which costs mostly O(n) + ε and seldom O(n²) + ε (in the rare worst case), where n is the size of the data set and ε is equal to O(m²), m being the size of an insignificant part (sample) of the global set.

Even if the size of the data increases, the complexity of the second proposed algorithm – its amortized complexity [30,32,34] – remains O(n) + ε in the average case and O(n²) + ε in the worst case; thus it can process dynamic data, such as those of the Web and social networks, with the same features.

The two approaches can provide overlapping clusters, where an element can belong to more than one, or even more than two, clusters (more general than a weak hierarchy); see [2,4,7] for detailed definitions of overlapping clustering.

3. Definitions

Definition 1. A metric space is a set endowed with a distance between its elements. It is a particular case of a topological space.

Definition 2. We call a distance on a given set E an application d : E × E → R+ which has the following properties for all x, y, z ∈ E:

1 (Symmetry) d(x, y) = d(y, x),
2 (Positive definiteness) d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y,
3 (Triangle inequality) d(x, z) ≤ d(x, y) + d(y, z).

Example 1. The most familiar metric space is the Euclidean space of dimension n, which we will denote by R^n, with the standard formula for the distance: d((x1, . . ., xn), (y1, . . ., yn)) = ((x1 − y1)² + · · · + (xn − yn)²)^(1/2) (1).

Definition 3. Let (E, d) be a metric space. If the metric d satisfies the strong triangle inequality

∀x, y, z ∈ E, d(x, y) ≤ max{d(x, z), d(z, y)},

then it is called an ultrametric on E [19]. The couple (E, d) is an ultrametric space [11,12,29].
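A minimal sketch of ours (not from the paper) showing how the strong triangle inequality can be checked on a finite symmetric distance matrix D given as a list of lists; requiring the inequality for all three orderings of a triple amounts to requiring that its two largest distances coincide (cf. Proposition 3 below):

from itertools import combinations

def is_ultrametric(D, tol=1e-12):
    # d(x, y) <= max(d(x, z), d(z, y)) holds for every triple exactly
    # when the two largest of the three pairwise distances are equal.
    for x, y, z in combinations(range(len(D)), 3):
        sides = sorted((D[x][y], D[x][z], D[y][z]))
        if sides[2] - sides[1] > tol:
            return False
    return True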

Definition 4. We call the open ball centered on a ∈ E with radius r ∈ R+ the set {x ∈ E : d(x, a) < r} ⊆ E; it is denoted Br(a) or B(a, r).

Definition 5. We call the closed ball centered on a ∈ E with radius r ∈ R+ the set {x ∈ E : d(x, a) ≤ r} ⊆ E; it is denoted Bf(a, r).

Remark 1. Let d be an ultrametric on E. The closed ball on a ∈ E with radius r > 0 is the set B(a, r) = {x ∈ E : d(x, a) ≤ r}.

Proposition 1. Let d be an ultrametric on E; the following properties hold [11]:

1 If a, b ∈ E, r > 0, and b ∈ B(a, r), then B(a, r) = B(b, r),
2 If a, b ∈ E and 0 < i ≤ r, then either B(a, r) ∩ B(b, i) = ∅ or B(b, i) ⊆ B(a, r). This is not true for every metric space,
3 Every ball is clopen (closed and open) in the topology defined by d (i.e. every ultrametrizable topology is zero-dimensional). Thus, the parts are disconnected in this topology.

Hence, the space defined by d is homeomorphic to a subspace of a countable product of discrete spaces (cf. Remark 1) (see the proof in [11]).


Fig. 2. Illustration of some ultrametric distances on a plane.

Remark 2. A topological space is ultrametrizable if and only if it is homeomorphic to a subspace of a countable product of discrete spaces [11].

Definition 6. Let E be a finite set endowed with a distance d. E is classifiable for d if, ∀α ∈ R+, the relation Rα on E defined by

∀x, y ∈ E, x Rα y ⇔ d(x, y) ≤ α

is an equivalence relation.

Thus, we can derive a partition of E as [33]:

• d(x, y) ≤ α ⇒ x and y belong to the same cluster, or,
• d(x, y) > α ⇒ x and y belong to two distinct clusters.

Example 2. Let x, y and z be three points of a plane endowed with a Euclidean distance d, with d(x, y) = 2, d(y, z) = 3 and d(x, z) = 4. The set E is not classifiable for α = 3: x R3 y and y R3 z hold, but x R3 z does not, so R3 is not transitive and the classification leads to an inconsistency.

Proposition 2. A finite set E endowed with a distance d is classifiable if and only if d is an ultrametric distance on E [11].
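Because Rα is an equivalence relation on an ultrametric set (Proposition 2), a single linear pass suffices to read off the partition for a threshold α. A minimal sketch of ours (the function name is hypothetical), with D an ultrametric distance matrix:

def partition_by_threshold(D, alpha):
    # Under an ultrametric, d(i, rep) <= alpha places i in the cluster
    # of rep unambiguously: transitivity of R_alpha makes any later
    # merging step unnecessary.
    reps, clusters = [], []
    for i in range(len(D)):
        for c, rep in enumerate(reps):
            if D[i][rep] <= alpha:
                clusters[c].append(i)
                break
        else:
            reps.append(i)        # i opens a new cluster
            clusters.append([i])
    return clusters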

Proposition 3. An ultrametric distance generates an order on a set: viewed three by three, the points form isosceles triangles. Thus, the representation of all the data is fixed whatever the angle of view (see Fig. 2).

Proof 1. Let (E, d) be an ultrametric space and let x, y, z ∈ E. Consider:

d(x, y) ≤ d(x, z)   (1)
d(x, y) ≤ d(y, z)   (2)

Then (1) and (2) give (3) and (4):

d(x, z) ≤ max{d(x, y), d(y, z)} ⇒ d(x, z) ≤ d(y, z)   (3)
d(y, z) ≤ max{d(x, y), d(x, z)} ⇒ d(y, z) ≤ d(x, z)   (4)

(3) and (4) ⇒ d(x, z) = d(y, z).

4. First approach: an O(n) unsupervised clustering method on ultrametric data

We propose here the first – to the best of our knowledge – approach to clustering in ultrametric spaces that does not compute a similarity matrix. Ultrametric spaces represent many kinds of data sets, such as genealogy, library, information and social sciences data, to name a few.

Our proposed method has a computational cost of O(n), which makes it tractable to cluster very large databases. It provides an insight into the proximities between all data.

The idea consists in using the ultrametric space properties, in particular the order induced by the ultratriangular inequality, to cluster all elements compared to only one of them, chosen uniformly at random.

Since the structure of an ultrametric space is frozen (cf. Proposition 3), we do not need to calculate all mutual distances. The distances calculated with respect to just one element are sufficient to determine the clusters, as clopen balls (or closed balls; cf. Proposition 1.3). Consequently, we avoid a computation of O(n²). Our objective is thus to propose a solution to the problem of computational cost, notably the feasibility of clustering algorithms, on large databases.

4.1. Principle

Hypothesis 1. Consider E a set of data endowed with an ultrametric distance d.

Our method is composed of the 3 following steps:

Step 1. Choose uniformly at random one data point from E (see Fig. 3);
Step 2. Calculate the distances between the chosen data point and all the others;
Step 3. Define thresholds and represent the distribution of all the data according to these thresholds and the distances calculated in step 2 (see Fig. 4).

Fig. 3. Step 1: choosing uniformly at random one element A. We took the same scheme as Fig. 2 to depict the structuring of the elements, but with our algorithm we do not need to calculate all distances (i.e. step 2).

Fig. 4. Representation of all elements around A, according to the calculated distances and thresholds: the distances are those between clusters, and the thresholds are the radii of the balls.

Proposition 4. The clusters are distinguished by the closed balls (or thresholds) around the chosen data point.

Proof 2. An ultrametric space is a classifiable set (cf. Proposition 2). Thus, comparing mutual distances with a given threshold shows whether or not elements belong to the same cluster. The distribution of the data around any data point of the set therefore entirely reflects their proximities.

Proposition 5. The random choice of the initial point does not affect the resulting clusters.

Proof 3. Since an ultrametric distance generates an order on a set, the representation of all the data is fixed whatever the angle of view. Hence, for every chosen data point the same isosceles triangles are formed.

Proposition 6. The computational cost of this algorithm is equal to O(n).

Proof 4. The algorithm consists in calculating the distances between the chosen data point and the n − 1 remaining data.

4.2. Algorithm

Procedure UCMUMD:

• Variables:
1. Ultrametric space:
a. data set E of size n;
b. ultrametric distance du.

• Begin:
1. Choose uniformly at random one element x ∈ E;
2. Determine intervals of clusters; // flexibility depending on context
3. For i < n − 1:
a. Calculate du(x, i); // complexity = O(n)
b. Allocate i to the cluster of the corresponding interval.
4. End for
5. Return result.

• End
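A runnable Python rendering of UCMUMD (our sketch under Hypothesis 1; the function and parameter names are ours, and the threshold list is the user's choice, as in step 2 of the pseudocode):

import random
from bisect import bisect_right

def ucmumd(E, du, thresholds):
    # E: data set; du: ultrametric distance function; thresholds: sorted
    # upper bounds of the cluster intervals, e.g. [0.440, 0.443, ..., 0.900].
    x = random.choice(E)                       # step 1: one random element
    clusters = [[] for _ in range(len(thresholds) + 1)]
    for e in E:
        d = 0.0 if e is x else du(x, e)        # step 3a: one distance each, O(n)
        clusters[bisect_right(thresholds, d)].append(e)  # step 3b: interval lookup
    return x, clusters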

5. First approach: example

We present in this section a comparison of our first algorithm with WPGMA. The Weighted Paired Group Method using Averages (WPGMA) is a hierarchical clustering algorithm developed by McQuitty in 1966; it has a computational complexity of O(n²) [27]. We have tested the two methods on sets of 34 and 100 data, respectively.

Let us consider the ultrametric distance matrix of 34 data (see Fig. 5), calculated from a similarity matrix based on the Normalized Information Distance (NID) [10].

The resulting clusters with WPGMA on the 34 data are shown in the dendrogram of Fig. 6.

To test our method we determine (arbitrarily) the following intervals for the desired clusters (step 2): [0.000, 0.440], [0.440, 0.443], [0.443, 0.445], [0.445, 0.453], [0.453, 0.457], [0.457, 0.460], [0.463, 0.466], [0.466, 0.469], [0.469, 0.470], [0.470, 0.900].

The results with our algorithm (Figs. 7 and 8), using two different data points as starting points, show the same proximities between data as those of the WPGMA clustering (cf. Fig. 6). Since an ultrametric distance induces an ordered space, the inter-cluster and intra-cluster inertia required for the purpose of the clustering are guaranteed.

We see that even if we change the starting point, the proximity in the data set remains the same (cf. Proposition 5). The clusters are the same, as shown in the figures.

Fig. 9 summarizes the results of the two methods on the same dendrogram; it shows that the clusters generated with our algorithm (balls) are similar to those of WPGMA.

We have also compared the two methods on 100 words chosen randomly from a dictionary. The WPGMA results are shown in Fig. 10. Our results are the same whatever the chosen data point (as we saw in the first example); choosing here the word "toit", Figs. 11 and 12 show the resulting clusters.

We see in these comparisons that, first, the results of our method are similar to those of the hierarchical clustering (WPGMA), hence the consistency of our algorithm. Second, the clusters remain unchanged whatever the selected data point.

NB: A few differences between the results are due to the (arbitrarily) chosen threshold values, just as we obtain slightly different classes with different cuts in dendrograms.

Our first objective was to propose a method which provides a general view of data proximity, in particular in large databases. This objective is achieved by giving a consistent result in just O(n) iterations, which makes clustering feasible on high-dimensional data.

6. Second approach: fast and flexible unsupervised clustering algorithm

Our second approach is a new unsupervised clustering algorithm which is valuable for any kind of data, since it uses a distance to measure the proximity between objects. The computational cost of this algorithm is O(n) + ε in the best and average cases and O(n²) + ε in the worst case, which is rare. The value of ε = O(m²) (where m is the size of a sample) is negligible compared to n.


Fig. 5. Ultrametric distance matrix of 34 data.

Thus, our algorithm is able to treat very large databases. The user can obtain the desired accuracy on the proximities between elements.

The idea consists in exploiting ultrametric properties, in particular the order induced by the ultratriangular inequality on a given space. First we deduce the behavior of all the data relative to the distance used, by creating an ultrametric (ordered) space from just a sample of the data (a subset) chosen uniformly at random, of size m (petty compared to n). Then we do the clustering on the global data set according to this order information.

Fig. 6. Clustering of the 34 data with WPGMA.

The construction of the ordered space from the sample costs ε = O(m²) operations. Once this ultrametric space (structured sample) is built, we use the ultrametric distances to fix the thresholds (or intervals) of the clusters, which depend on the choices of the user. This liberty of choice makes our algorithm useful for all kinds of data and implicates the user as an actor in the construction of the algorithm.

After determining the clusters and their intervals, we select uniformly at random one representative per cluster; then we compare the representatives with the rest of the data in the global set (of size n − m), one by one. Next, we add each data point to the cluster of the closest representative according to the intervals.

Fig. 7. Results using fin-whale as the chosen origin.


If the compared data point is far from all the representatives, we create a new cluster.

Remark 3. This step costs O(x · n) + ε operations in the average case, where x is the number of clusters, which is generally insignificant compared to n; therefore we keep only O(n) + ε. But in the worst case, where the clustering provides only singletons (which is rare, since the aim of clustering is to group objects), x is equal to n; consequently the computational cost is O(n²) + ε.
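As a rough back-of-the-envelope illustration of Remark 3 (our numbers, purely illustrative):

# Cost of the comparison step versus a full similarity matrix,
# for n = 10^6 data, a sample of m = 10^3 and x = 50 clusters.
n, m, x = 10**6, 10**3, 50
pairwise = n * (n - 1) // 2    # O(n^2): ~5 * 10^11 distance evaluations
sampled  = x * n + m * m       # O(x*n) + epsilon: ~5 * 10^7
print(pairwise // sampled)     # ~10^4: four orders of magnitude fewer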

6.1. Principle

Hypothesis 2. Consider E a set of data and a distance d.

The method is composed of the following steps:

Step 1. Choose uniformly at random a sample of data from E, of size m (e.g. m = n/10,000 if n = 1 M).

Remark 4. The choice of the sample depends on the processed data and on the expert in the domain. The user is an actor in the execution of the algorithm: he can intervene in the choice of the input and refine the output (see details in step 4).

Remark 5. The value of m depends on the value of n compared to the bounds of d [3].

Example 3. Consider a set of size n = 100,000 data and a distance d ∈ [0, 0.5]: if we use a sample of 50 data chosen uniformly at random, we can easily get a broad idea of the behavior of the data according to d. But if n = 500 and d ∈ [0, 300], and we choose a sample of 5 data, we are not sure to get enough information about the manner in which the data behave with d.

Remark 6. The larger the size n, the smaller the size m is compared to n [3].

Step 2. Execute a classic hierarchical clustering algorithm (e.g. WPGMA) with the distance d on the chosen sample.

Fig. 8. Results using horse as the chosen origin.

Step 3. Represent the distances in the resulting dendrogram; the ultrametric space is thus built.

Step 4. Deduce the cluster intervals (closed balls or thresholds), which depend on the clustering sought after and on d, as a broad or a precise view of the proximities between elements (this step must be specified by the user or a specialist of the domain).

Step 5. Choose uniformly at random one representative per cluster from the result of step 2.

Step 6. Pick the rest of the data one by one and compare them, according to d, with the cluster representatives of the previous step:

• If the compared data point is close to one (or more) of the representative data w.r.t. the defined thresholds, then add it to the cluster of that representative.
• Else, create a new cluster.

Remark 7. If the compared data point is close to more than one representative, then it will be added to more than one cluster, consequently generating an overlapping clustering [4,7].

Proposition 7. The clusters are distinguished by the closed balls (or thresholds).

Proof 5. An ultrametric space is a classifiable set (cf. Proposition 2). Thus, comparing mutual distances with a given threshold shows whether or not elements belong to the same cluster.

Proposition 8. The random choice of the initial points does not affect the resulting clusters.

Proof 6. Since the generated space is ultrametric, it is endowed with a strong intra-cluster inertia that allows this choice.

Proposition 9. The computational cost of our algorithm is, in the average case, equal to O(n), and in the rare worst case, equal to O(n²).

Fig. 9. Clustering with our method and WPGMA.

Table 1. Example 1: clustering of the 34 data in O(n) + ε (thresholds on the left, clusters on the right; empty intervals omitted).

• Pig, closest items by d ≤ 0.881: whiterhinoceros, cow
• Pig, d ∈ [0.881, 0.889]: indian-rhinoceros, gray-seal, harbor-seal, cat, sheep
• Pig, d ∈ [0.889, 0.904]: dog
• Pig, d ∈ [0.904, 0.919]: pigmy-chimpanzee, orangutan, gibbon, human, gorilla
• Pig, d ∈ [0.919, 0.929]: chimpanzee, baboon
• Bluewhale, d ≤ 0.881: fin-whale
• Bluewhale, d ∈ [0.889, 0.904]: donkey, horse
• Aardvark, d ∈ [0.904, 0.919]: rat, mouse, hippopotamus, fat-dormouse, armadillo, fruit-bat, squirrel, rabbit
• Aardvark, d ∈ [0.919, 0.929]: elephant, opossum, wallaroo
• Guineapig, d ∈ [0.881, 0.929]: no items
• Platypus, d ∈ [0.881, 0.929]: no items


Fig. 10. Clustering of 100 words with WPGMA.


Proof 7.

• In both cases we ignore the complexity of the clustering of the sample and of the construction of the ultrametric space, ε (i.e. steps 2 and 3), because the size of the sample is insignificant compared to the cardinality of the global data set.
• The worst case, where the clustering provides only singletons, is rare, because this process aims to group similar data into clusters whose number is frequently petty compared to the data set size.
• Since there exists homogeneity between data according to the proximity measure used, the number of clusters is unimportant compared to the cardinality of the data set; consequently the complexity of our algorithm is frequently O(n).

6.2. Algorithm

Procedure FFUCA

• Variables:
1. Metric space:
a. data set E of size n;
b. distance d.

• Begin:
1. Choose a sample of data of a size adjustable according to the data type (we use √|E| as the sample size); // flexibility for the data expert
2. Do a clustering of the sample with a classic hierarchical clustering algorithm; // complexity O(m²) where m = sample size (O(n) here)
3. Deduce an ultrametric space from the clustered sample; // to exploit ultrametric properties in clustering, complexity O(n)
4. Determine intervals from the hierarchy of step 3; // flexibility for the user to refine the size of the output clusters
5. Choose uniformly at random one representative per cluster from the clusters of step 4;
6. For i < n − nb; i++: // nb = number of chosen representatives
– For j < nb; j++:
a. Calculate d(i, j); // complexity = O(nb × n), rare worst case O(n²) providing only singletons
* If d(i, j) ∈ interval of the cluster of j:
* Then: allocate i to the cluster of j;
· If i ∈ more than one cluster:
· Then: keep it only in the cluster of the closest representative; // an element can belong to two clusters if it is nearest to two representatives by the same distance
· End if
* Else: create a new cluster and allocate i to this cluster.
* End if
– End for
End for
Return result.

• End
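A condensed Python sketch of FFUCA (ours, under stated assumptions: SciPy's 'weighted' linkage stands in for WPGMA in step 2, a single cut height replaces the user-chosen intervals of steps 3-4, and all names are ours). E is assumed to be a list of numeric vectors and d a metric on them, e.g. the Euclidean function of Example 1:

import math, random
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def ffuca(E, d, cut):
    # Steps 1-3: cluster a random sample of size m = sqrt(|E|) with a
    # classic hierarchical algorithm and read the sample clusters off
    # the dendrogram at height `cut` ('weighted' linkage is WPGMA).
    m = max(2, math.isqrt(len(E)))
    sample = random.sample(E, m)
    labels = fcluster(linkage(pdist(np.array(sample)), method="weighted"),
                      t=cut, criterion="distance")

    # Step 5: one representative per sample cluster.
    reps, clusters = {}, {}
    for point, lab in zip(sample, labels):
        reps.setdefault(lab, point)
        clusters.setdefault(lab, []).append(point)

    # Step 6: attach every remaining element to the cluster of its
    # closest representative, or open a new cluster if all are too far.
    sample_ids = {id(p) for p in sample}
    for e in E:
        if id(e) in sample_ids:
            continue                          # sample members already placed
        lab, dist = min(((l, d(e, r)) for l, r in reps.items()),
                        key=lambda t: t[1])   # O(nb) comparisons per element
        if dist <= cut:
            clusters[lab].append(e)
        else:                                 # far from all representatives
            new = max(clusters) + 1
            reps[new] = e
            clusters[new] = [e]
    return clusters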

Fig. 11. Clustering of 100 words with chosen data toit.

7. Example

This section is devoted to a comparison of our algorithm with WPGMA (cf. Section 5). We have tested the two methods on the same set of 34 mitochondrial DNA sequences as in the first example, but here we used the metric (not ultrametric) Normalized Information Distance (NID) [10] to calculate the proximities between elements. For WPGMA, we calculate a similarity matrix, which costs O(n²) (see Appendix A). With our proposed method, we do not need to calculate the matrix; we calculate only the distances between the chosen representatives and the rest of the elements of the whole set. This step costs O(nb · n) operations (where nb is the number of representatives). The resulting dendrogram with WPGMA on the 34 data is shown in Fig. 6 (cf. Section 5).

We execute FFUCA two times with different inputs chosen randomly. The first step consists in choosing uniformly at random a sample of the data and clustering this sample with a classic clustering algorithm according to the similarity used. Then we create an ultrametric space from the resulting clusters.

In this first example we choose arbitrarily the following data: bluewhale, hippopotamus, pig, aardvark, guineapig and chimpanzee.

We did a clustering and established the ordered (ultrametric) space using WPGMA on the chosen data; the result is depicted in Fig. 13.

The second step is the deduction of the intervals (clusters); for this, we use the valuation of the dendrogram of Fig. 13. We keep here five thresholds: [0, 0.881], [0.881, 0.889], [0.889, 0.904], [0.904, 0.919], [0.919, 0.929]. Since the size of the data set is small (34), we use these data as representatives of the clusters. Table 1 shows the resulting clusters with the first inputs; the intervals are represented in the left column and the clusters in the right column.

NB: The chosen elements belong only to the first lines (e.g. pig ∈ {whiterhinoceros, cow}). The last lines in the table represent the new clusters, which are far from all the other clusters; they are considered outliers.

In the second example we choose donkey, gray-seal, pig, fat-dormouse and gorilla as starting data. After the clustering and the construction of the ultrametric space from this chosen sample, we keep the following intervals: [0, 0.8827], [0.8827, 0.8836], [0.8836, 0.8900], [0.8900, 0.9000]. We do not list the details of the different steps in this second example; we show the results in Table 2.

Fig. 12. Clustering of 100 words with chosen data toits.

Fig. 13. Clustering of the chosen data: construction of the ordered space.

Table 2. Example 2: clustering of the 34 data in O(n) + ε (thresholds on the left, clusters on the right; empty intervals omitted).

• Donkey, closest items by d ≤ 0.8827: indian-rhinoceros, whiterhinoceros, horse
• Donkey, d ∈ [0.8836, 0.8900]: opossum
• Gray-seal, d ≤ 0.8827: harbor-seal, cat
• Gray-seal, d ∈ [0.8827, 0.8836]: dog
• Pig, d ≤ 0.8827: cow, fin-whale, bluewhale
• Pig, d ∈ [0.8827, 0.8836]: sheep
• Pig, d ∈ [0.8836, 0.8900]: hippopotamus
• Pig, d ∈ [0.8900, 0.9000]: squirrel, rabbit
• Fat-dormouse: no items in any interval
• Gorilla, d ≤ 0.8827: chimpanzee, pigmy-chimpanzee, human, orangutan, gibbon
• Gorilla, d ∈ [0.8900, 0.9000]: baboon
• Mouse, d ∈ [0.8827, 0.9000]: rat
• Armadillo, platypus, aardvark, wallaroo, elephant, guineapig (d ∈ [0.8827, 0.9000]): no items

We see in the tables that, first, our results are similar to those of the hierarchical clustering (WPGMA), hence the consistency of our algorithm. Second, the clusters remain unchanged whatever the selected data are.

The main objective in this section was to propose an unsupervised flexible clustering method which rapidly provides the desired view of data proximity. This objective is achieved by giving consistent results, from the point of view of intra-cluster and inter-cluster inertia according to the clustering aim (i.e. indicating what is closest and what is furthest according to the similarity), in generally O(n) + ε and rarely O(n²) + ε iterations, consequently making clustering feasible on high-dimensional data.

This algorithm can be applied to large data sets, thanks to its complexity. It can be used easily, thanks to its simplicity. It can treat dynamic data while keeping a competitive computational cost: the amortized complexity [30,32,34] is O(n) + ε on average and rarely O(n²) + ε. Our method can also be used as a hierarchical clustering method, by inserting the elements into the first (sample) dendrogram according to the intervals.

8. Conclusion

We proposed in this paper two novel competitive unsupervised clustering algorithms which can overcome the problem of feasibility in large databases thanks to their calculation times.

The first algorithm treats ultrametric data; it has a complexity of O(n). We gave proofs that in an ultrametric space we can avoid the calculation of all mutual distances without losing information about the proximities in the data set.

The second algorithm deals with all kinds of data. It has two strong particularities. The first feature is flexibility: the user is an actor in the construction and the execution of the algorithm; he can manage the input and refine the output. The second characteristic concerns the computational cost, which is, in the average case, O(n), and in the rare worst case, O(n²).

We showed that our methods treat data faster while preserving the information about proximities, and that the result does not depend on a specific parameter.

Our future work aims first to test our methods on real data and to compare them with other clustering algorithms using validation measures such as the V-measure and the F-measure. Second, we intend to generalize the second method to treat dynamic data such as those of social networks and the detection of relevant information on web sites.


Appendix A. Metric distance matrix

See Fig. A.1.


Fig. A.1. Metric distance matrix of 34 data.


References

[1] M. Ankerst, M.M. Breunig, H.P. Kriegel, J. Sander, OPTICS: ordering points to identify the clustering structure, in: ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, 1999.
[2] J.P. Barthélemy, F. Brucker, Binary clustering, Journal of Discrete Applied Mathematics 156 (2008) 1237–1250.
[3] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, When is nearest neighbor meaningful?, in: Proceedings of the 7th International Conference on Database Theory (ICDT), LNCS 1540, Springer-Verlag, London, UK, 1999, pp. 217–235.
[4] F. Brucker, Modèles de classification en classes empiétantes, PhD Thesis, Équipe d'accueil: Dép. IASC de l'École Supérieure des Télécommunications de Bretagne, France, 2001.
[5] M. Chavent, Y. Lechevallier, Empirical comparison of a monothetic divisive clustering method with the Ward and the k-means clustering methods, in: Proceedings of the IFCS Data Science and Classification, Springer, Ljubljana, Slovenia, 2006, pp. 83–90.
[6] G. Cleuziou, An extended version of the k-means method for overlapping clustering, in: Proceedings of the Nineteenth International Conference on Pattern Recognition, 2008.
[7] G. Cleuziou, Une méthode de classification non-supervisée pour l'apprentissage de règles et la recherche d'information, PhD Thesis, Université d'Orléans, France, 2006.
[8] S. Fouchal, M. Ahat, I. Lavallée, M. Bui, An O(N) clustering method on ultrametric data, in: Proceedings of the 3rd IEEE Conference on Data Mining and Optimization (DMO), Malaysia, 2011, pp. 6–11.
[9] S. Fouchal, M. Ahat, I. Lavallée, Optimal clustering method in ultrametric spaces, in: Proceedings of the 3rd IEEE International Workshop on Performance Evaluation of Communications in Distributed Systems and Web-based Service Architectures (PEDISWESA), Greece, 2011, pp. 123–128.
[10] S. Fouchal, M. Ahat, I. Lavallée, M. Bui, S. Benamor, Clustering based on Kolmogorov complexity, in: Proceedings of the 14th International Conference on Knowledge-based and Intelligent Information and Engineering Systems (KES), Part I, Cardiff, UK, Springer, 2010, pp. 452–460.
[11] L. Gajić, Metric locally constant function on some subset of ultrametric space, Novi Sad Journal of Mathematics (NSJOM) 35 (1) (2005) 123–125.
[12] L. Gajić, On ultrametric space, Novi Sad Journal of Mathematics (NSJOM) 31 (2) (2001) 69–71.
[13] G. Gardanel, Espaces de Banach ultramétriques, Séminaire Delange-Pisot-Poitou, Théorie des nombres, Tome 9 (2), exp. G2, G1–G8, 1967–1968.
[14] I. Gronau, S. Moran, Optimal implementations of UPGMA and other common clustering algorithms, Information Processing Letters 104 (6) (2007) 205–210.
[15] S. Guha, R. Rastogi, K. Shim, An efficient clustering algorithm for large databases, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, No. 28, 1998, pp. 73–84.
[16] S. Guindon, Méthodes et algorithmes pour l'approche statistique en phylogénie, PhD Thesis, Université Montpellier II, 2003.
[17] A. Hatamlou, S. Abdullah, Z. Othman, Gravitational search algorithm with heuristic search for clustering problems, in: 3rd IEEE Conference on Data Mining and Optimization (DMO), Malaysia, 2011, pp. 190–193.
[18] A. Hinneburg, D.A. Keim, An efficient approach to clustering in large multimedia databases with noise, in: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD), New York, USA, 1998, pp. 58–65.
[19] L. Hubert, P. Arabie, J. Meulman, Modeling dissimilarity: generalizing ultrametric and additive tree representations, British Journal of Mathematical and Statistical Psychology 54 (2001) 103–123.
[20] L. Hubert, P. Arabie, Iterative projection strategies for the least squares fitting of tree structures to proximity data, British Journal of Mathematical and Statistical Psychology 48 (1995) 281–317.
[21] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Computing Surveys 31 (3) (1999) 264–323.
[22] G. Karypis, E.H. Han, V. Kumar, Chameleon: hierarchical clustering using dynamic modeling, The Computer Journal 32 (8) (1999) 68–75.
[23] M. Krasner, Espace ultramétrique et valuation, Séminaire Dubreil, Algèbre et théorie des nombres, Tome 1, exp. 1, 1–17, 1947–1948.
[24] H.P. Kriegel, P. Kröger, A. Zimek, Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering, ACM Transactions on Knowledge Discovery from Data 3 (1) (2009), Article 1.
[25] B. Leclerc, Description combinatoire des ultramétriques, Mathématiques et Sciences Humaines 73 (1981) 5–37.
[26] V. Levorato, T.V. Le, M. Lamure, M. Bui, Classification prétopologique basée sur la complexité de Kolmogorov, Studia Informatica 7.1, 198–222, Hermann éditeurs, 2009.
[27] F. Murtagh, Complexities of hierarchic clustering algorithms: state of the art, Computational Statistics Quarterly 1 (2) (1984) 101–113.
[28] F. Murtagh, A survey of recent advances in hierarchical clustering algorithms, The Computer Journal 26 (4) (1983) 354–359.
[29] K.P.R. Rao, G.N.V. Kishore, T.R. Rao, Some coincidence point theorems in ultrametric spaces, International Journal of Mathematical Analysis 1 (18) (2007) 897–902.
[30] D.D. Sleator, R.E. Tarjan, Amortized efficiency of list update and paging rules, Communications of the ACM 28 (1985) 202–208.
[31] M. Sun, L. Xiong, H. Sun, D. Jiang, A GA-based feature selection for high-dimensional data clustering, in: Proceedings of the 3rd International Conference on Genetic and Evolutionary Computing (WGEC), vol. 76, 2009, pp. 9–772.
[32] R.E. Tarjan, Data Structures and Network Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1983.
[33] M. Terrenoire, D. Tounissoux, Classification hiérarchique, Université de Lyon, 2003.
[34] A.C.-C. Yao, Probabilistic computations: toward a unified measure of complexity, in: Proceedings of the 18th IEEE Conference on Foundations of Computer Science, 1977, pp. 222–227.

Said Fouchal is a PhD student at the University Paris 8. The topic of his research work is the clustering of complex objects. His research is based on ultrametrics, computational complexity, Kolmogorov complexity and clustering.