
An Efficient Approach to Clustering in Large Multimedia Databases with Noise

Alexander Hinneburg, Daniel A. Keim

Institute of Computer Science, University of Halle, Germany
hinneburg, [email protected]

Abstract

Several clustering algorithms can be applied to clustering in large multimedia databases. The effectiveness and efficiency of the existing algorithms, however, is somewhat limited, since clustering in multimedia databases requires clustering high-dimensional feature vectors and since multimedia databases often contain large amounts of noise. In this paper, we therefore introduce a new algorithm for clustering in large multimedia databases called DENCLUE (DENsity-based CLUstEring). The basic idea of our new approach is to model the overall point density analytically as the sum of influence functions of the data points. Clusters can then be identified by determining density-attractors, and clusters of arbitrary shape can be easily described by a simple equation based on the overall density function. The advantages of our new approach are (1) it has a firm mathematical basis, (2) it has good clustering properties in data sets with large amounts of noise, (3) it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets, and (4) it is significantly faster than existing algorithms. To demonstrate the effectiveness and efficiency of DENCLUE, we perform a series of experiments on a number of different data sets from CAD and molecular biology. A comparison with DBSCAN shows the superiority of our new approach.

Keywords: Clustering Algorithms, Density-based Clustering, Clustering of High-dimensional Data, Clustering in Multimedia Databases, Clustering in the Presence of Noise

1 Introduction

Because of the fast technological progress, the amount of data which is stored in databases increases very fast. The types of data which are stored in the computer become increasingly complex. In addition to numerical data, complex 2D and 3D multimedia data such as image, CAD, geographic, and molecular biology data are stored in databases. For an efficient retrieval, the complex data is usually transformed into high-dimensional feature vectors. Examples of feature vectors are color histograms [SH94], shape descriptors [Jag91, MG95], Fourier vectors [WW80], text descriptors [Kuk92], etc. In many of the mentioned applications, the databases are very large and consist of millions of data objects with several tens to a few hundreds of dimensions.

Automated knowledge discovery in large multimedia databases is an increasingly important research issue. Clustering and trend detection in such databases, however, is difficult since the databases often contain large amounts of noise and sometimes only a small portion of the large databases accounts for the clustering. In addition, most of the known algorithms do not work efficiently on high-dimensional data.

Copyright (c) 1998, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

The methods which are applicable to databases of high-dimensional feature vectors are the methods which are known from the area of spatial data mining. The most prominent representatives are partitioning algorithms such as CLARANS [NH94], hierarchical clustering algorithms, and locality-based clustering algorithms such as (G)DBSCAN [EKSX96, EKSX97] and DBCLASD [XEKS98]. The basic idea of partitioning algorithms is to partition the database into k clusters which are represented by the center of gravity of the cluster (k-means) or by one representative object of the cluster (k-medoid). Each object is assigned to the closest cluster. A well-known partitioning algorithm is CLARANS which uses a randomized and bounded search strategy to improve the performance. Hierarchical clustering algorithms decompose the database into several levels of partitionings which are usually represented by a dendrogram, a tree which splits the database recursively into smaller subsets. The dendrogram can be created top-down (divisive) or bottom-up (agglomerative). Although hierarchical clustering algorithms can be very effective in knowledge discovery, the cost of creating the dendrograms is prohibitively expensive for large data sets since the algorithms are usually at least quadratic in the number of data objects. More efficient are locality-based clustering algorithms since they usually group neighboring data elements into clusters based on local conditions and therefore allow the clustering to be performed in one scan of the database. DBSCAN, for example, uses a density-based notion of clusters and allows the discovery of arbitrarily shaped clusters. The basic idea is that for each point of a cluster the density of data points in the neighborhood has to exceed some threshold. DBCLASD also works locality-based but in contrast to DBSCAN assumes that the points inside the clusters are randomly distributed, allowing DBCLASD to work without any input parameters. A performance comparison [XEKS98] shows that DBSCAN is slightly faster than DBCLASD, and both DBSCAN and DBCLASD are much faster than hierarchical clustering algorithms and partitioning algorithms such as CLARANS. To improve the efficiency, optimized clustering techniques have been proposed. Examples include R*-Tree-based Sampling [EKX95], Gridfile-based clustering [Sch96], BIRCH [ZRL96], which is based on the Cluster-Feature-tree, and STING, which uses a quadtree-like structure containing additional statistical information [WYM97].

A problem of the existing approaches in the context of clustering multimedia data is that most algorithms are not designed for clustering high-dimensional feature vectors and therefore, the performance of existing algorithms degenerates rapidly with increasing dimension. In addition, few algorithms can deal with databases containing large amounts of noise, which is quite common in multimedia databases where usually only a small portion of the database forms the interesting subset which accounts for the clustering. Our new approach solves these problems. It works efficiently for high-dimensional data sets and allows arbitrary noise levels while still guaranteeing to find the clustering. In addition, our approach can be seen as a generalization of different previous clustering approaches: partitioning-based, hierarchical, and locality-based clustering. Depending on the parameter settings, we may initialize our approach to provide similar (or even the same) results as different other clustering algorithms, and due to its generality, in some cases our approach even provides a better effectiveness. In addition, our algorithm works efficiently on very large amounts of high-dimensional data, outperforming one of the fastest existing methods (DBSCAN) by a factor of up to 45.

The rest of the paper is organized as follows. In section 2, we introduce the basic idea of our approach and formally define influence functions, density functions, density-attractors, clusters, and outliers. In section 3, we discuss the properties of our approach, namely its generality and its invariance with respect to noise. In section 4, we then introduce our algorithm including its theoretical foundations such as the locality-based density function and its error-bounds. In addition, we also discuss the complexity of our approach. In section 5, we provide an experimental evaluation comparing our approach to previous approaches such as DBSCAN. For the experiments, we use real data from CAD and molecular biology. To show the ability of our approach to deal with noisy data, we also use synthetic data sets with a variable amount of noise. Section 6 summarizes our results and discusses their impact as well as important issues for future work.

2 A General Approach to Clustering

Before introducing the formal definitions required for describing our new approach, in the following we first try to give a basic understanding of our approach.

2.1 Basic Idea

Our new approach is based on the idea that the influence of each data point can be modeled formally using a mathematical function which we call influence function. The influence function can be seen as a function which describes the impact of a data point within its neighborhood. Examples of influence functions are parabolic functions, the square wave function, or the Gaussian function. The influence function is applied to each data point. The overall density of the data space can be calculated as the sum of the influence functions of all data points. Clusters can then be determined mathematically by identifying density-attractors. Density-attractors are local maxima of the overall density function. If the overall density function is continuous and differentiable at any point, determining the density-attractors can be done efficiently by a hill-climbing procedure which is guided by the gradient of the overall density function. In addition, the mathematical form of the overall density function allows clusters of arbitrary shape to be described in a very compact mathematical form, namely by a simple equation of the overall density function. Theoretical results show that our approach is very general and allows different clusterings of the data (partition-based, hierarchical, and locality-based) to be found. We also show (theoretically and experimentally) that our algorithm is invariant against large amounts of noise and works well for high-dimensional data sets.

The algorithm DENCLUE is an efficient implementation of our idea. The overall density function requires summing up the influence functions of all data points. Most of the data points, however, do not actually contribute to the overall density function. Therefore, DENCLUE uses a local density function which considers only the data points which actually contribute to the overall density function. This can be done while still guaranteeing tight error bounds. An intelligent cell-based organization of the data allows our algorithm to work efficiently on very large amounts of high-dimensional data.

2.2 Definitions

We first have to introduce the general notion of influence and density functions. Informally, the influence functions are a mathematical description of the influence a data object has within its neighborhood. We denote the d-dimensional feature space by $F^d$. The density function at a point $x \in F^d$ is defined as the sum of the influence functions of all data objects at that point (cf. [Schn64] or [FH75] for a similar notion of density functions).

Def. 1 (Influence & Density Function)
The influence function of a data object $y \in F^d$ is a function $f_B^y : F^d \rightarrow \mathbb{R}_0^+$, which is defined in terms of a basic influence function $f_B$:

$$f_B^y(x) = f_B(x, y).$$

The density function is defined as the sum of the influence functions of all data points. Given $N$ data objects described by a set of feature vectors $D = \{x_1, \ldots, x_N\} \subset F^d$, the density function is defined as

$$f_B^D(x) = \sum_{i=1}^{N} f_B^{x_i}(x).$$

In principle, the influence function can be an arbitrary function. For the definition of specific influence functions, we need a distance function $d : F^d \times F^d \rightarrow \mathbb{R}_0^+$ which determines the distance of two d-dimensional feature vectors. The distance function has to be reflexive and symmetric. For simplicity, in the following we assume a Euclidean distance function. Note, however, that the definitions are independent from the choice of the distance function. Examples of basic influence functions are:


Figure 1: Example for Density Functions: (a) Data Set, (b) Square Wave, (c) Gaussian.

1. Square Wave Influence Function

$$f_{Square}(x, y) = \begin{cases} 0 & \text{if } d(x, y) > \sigma \\ 1 & \text{otherwise} \end{cases}$$

2. Gaussian Influence Function

$$f_{Gauss}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$$

The density function which results from a Gaussian influence function is

$$f_{Gauss}^D(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}.$$
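To make these definitions concrete, here is a minimal NumPy sketch of the two density functions. It is our own illustration (the function names, the array conventions, and the brute-force computation are ours, not the paper's), assuming the data set is given as an N x d array and distances are Euclidean:

```python
import numpy as np

def gaussian_density(x, data, sigma):
    """Overall density f_Gauss^D(x): sum of the Gaussian influences of all data points."""
    dists = np.linalg.norm(data - x, axis=1)                 # Euclidean distances d(x, x_i)
    return float(np.sum(np.exp(-dists**2 / (2 * sigma**2))))

def square_wave_density(x, data, sigma):
    """Overall density with the square wave influence function: points within sigma of x."""
    dists = np.linalg.norm(data - x, axis=1)
    return int(np.sum(dists <= sigma))
```

For example, `gaussian_density(np.array([0.5, 0.5]), data, sigma=0.1)` evaluates the density at a single query point; evaluating it over a grid of query points reproduces density surfaces like the ones sketched in figure 1.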

Figure 1 shows an example of a set of data points in 2D space (cf. figure 1a) together with the corresponding overall density functions for a square wave (cf. figure 1b) and a Gaussian influence function (cf. figure 1c). In the following, we define two different notions of clusters: center-defined clusters (similar to k-means clusters) and arbitrary-shape clusters. For the definitions, we need the notion of density-attractors. Informally, density-attractors are local maxima of the overall density function and therefore, we also need to define the gradient of the density function.

Def. 2 (Gradient)
The gradient of a function $f_B^D(x)$ is defined as

$$\nabla f_B^D(x) = \sum_{i=1}^{N} (x_i - x) \cdot f_B^{x_i}(x).$$

In case of the Gaussian influence function, the gradient is defined as

$$\nabla f_{Gauss}^D(x) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}}.$$
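A corresponding sketch of the gradient, again our own illustration reusing the conventions of the previous snippet:

```python
def gaussian_gradient(x, data, sigma):
    """Gradient of the Gaussian density function at x (cf. Def. 2)."""
    diffs = data - x                                   # direction vectors (x_i - x)
    dists = np.linalg.norm(diffs, axis=1)
    weights = np.exp(-dists**2 / (2 * sigma**2))       # Gaussian influence of each point
    return np.sum(diffs * weights[:, None], axis=0)    # influence-weighted sum of directions
```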

In general, it is desirable that the influence function is a symmetric, continuous, and differentiable function. Note, however, that the definition of the gradient is independent of these properties. Now, we are able to define the notion of density-attractors.

Def. 3 (Density-Attractor)
A point $x^* \in F^d$ is called a density-attractor for a given influence function, iff $x^*$ is a local maximum of the density function $f_B^D$.

A point $x \in F^d$ is density-attracted to a density-attractor $x^*$, iff $\exists k \in \mathbb{N} : d(x^k, x^*) \le \epsilon$ with

$$x^0 = x, \quad x^i = x^{i-1} + \delta \cdot \frac{\nabla f_B^D(x^{i-1})}{\|\nabla f_B^D(x^{i-1})\|}.$$

Figure 2: Example of Density-Attractors (density over a one-dimensional data space).

Figure 2 shows an example of density-attractors in a one-dimensional space. For a continuous and differentiable influence function, a simple hill-climbing algorithm can be used to determine the density-attractor for a data point $x \in D$. The hill-climbing procedure is guided by the gradient of $f_B^D$.
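The hill-climbing procedure of definition 3 can be sketched as follows; the step size `delta`, the tolerance `eps`, and the iteration cap are illustrative choices of ours, and for simplicity the full density from the earlier snippets is used rather than the local density introduced in section 4:

```python
def density_attractor(x, data, sigma, delta=0.1, eps=1e-3, max_iter=200):
    """Gradient hill-climbing (cf. Def. 3): follow the normalized gradient until the
    density stops increasing; return the attractor and its density."""
    cur = np.asarray(x, dtype=float)
    cur_density = gaussian_density(cur, data, sigma)
    for _ in range(max_iter):
        grad = gaussian_gradient(cur, data, sigma)
        norm = np.linalg.norm(grad)
        if norm < eps:                          # (near) zero gradient: local maximum reached
            break
        nxt = cur + delta * grad / norm         # x^i = x^(i-1) + delta * normalized gradient
        nxt_density = gaussian_density(nxt, data, sigma)
        if nxt_density < cur_density:           # density decreases: cur is the attractor
            break
        cur, cur_density = nxt, nxt_density
    return cur, cur_density
```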

We are now able to introduce our definitions of clusters and outliers. Outliers are points which are not influenced by "many" other data points. We need a bound $\xi$ to formalize the "many".

Def. 4 (Center-Defined Cluster)
A center-defined cluster (wrt. $\sigma$, $\xi$) for a density-attractor $x^*$ is a subset $C \subseteq D$, with $x \in C$ being density-attracted by $x^*$ and $f_B^D(x^*) \ge \xi$. Points $x \in D$ are called outliers if they are density-attracted by a local maximum $x_o^*$ with $f_B^D(x_o^*) < \xi$.
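Putting the pieces together, a naive (quadratic-time) sketch of center-defined clustering with outlier handling might look as follows; rounding attractor coordinates to merge numerically coinciding attractors is our simplification, not part of the paper:

```python
def center_defined_clusters(data, sigma, xi, delta=0.1):
    """Assign every point to the cluster of its density-attractor (cf. Def. 4);
    points whose attractor density is below xi are reported as outliers."""
    clusters, outliers = {}, []
    for idx, point in enumerate(data):
        attractor, density = density_attractor(point, data, sigma, delta)
        if density < xi:
            outliers.append(idx)
            continue
        key = tuple(np.round(attractor, 3))     # merge attractors that coincide numerically
        clusters.setdefault(key, []).append(idx)
    return clusters, outliers
```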

This definition can be extended to define clusters of arbitrary shape.

Def. 5 (Arbitrary-Shape Cluster)
An arbitrary-shape cluster (wrt. $\sigma$, $\xi$) for the set of density-attractors $X$ is a subset $C \subseteq D$, where

1. $\forall x \in C\ \exists x^* \in X : f_B^D(x^*) \ge \xi$ and $x$ is density-attracted to $x^*$, and

2. $\forall x_1^*, x_2^* \in X$ there exists a path $P \subset F^d$ from $x_1^*$ to $x_2^*$ with $\forall p \in P : f_B^D(p) \ge \xi$.

Figure 3 shows examples of center-defined clusters for different $\sigma$. Note that the number of clusters found by our approach varies depending on $\sigma$. In figures 4a and 4c, we provide examples of the density function together with the plane for different $\xi$. Figures 4b and 4d show the resulting arbitrary-shape clusters. The parameter $\sigma$ describes the influence of a data point in the data space and $\xi$ describes when a density-attractor is significant. In subsection 3.3, we discuss the effects of the parameters $\sigma$ and $\xi$ and provide a formal procedure of how to determine the parameters.


Figure 3: Example of Center-Defined Clusters for different $\sigma$: (a) $\sigma = 0.2$, (b) $\sigma = 0.6$, (d) $\sigma = 1.5$.

Figure 4: Example of Arbitrary-Shape Clusters for different $\xi$: (a) $\xi = 2$, (b) $\xi = 2$, (c) $\xi = 1$, (d) $\xi = 1$.

3 Properties of Our Approach

In the following subsections we discuss some important properties of our approach including its generality, noise-invariance, and parameter settings.

3.1 Generality

As already mentioned, our approach generalizes other clustering methods, namely partition-based, hierarchical, and locality-based clustering methods. Due to the wide range of different clustering methods and the given space limitations, we cannot discuss the generality of our approach in detail. Instead, we provide the basic ideas of how our approach generalizes some other well-known clustering algorithms.

Let us first consider the locality-based clustering algorithm DBSCAN. Using a square wave influence function with $\sigma = $ EPS and an outlier-bound $\xi = $ MinPts, the arbitrary-shape clusters defined by our method (cf. definition 5) are the same as the clusters found by DBSCAN. The reason is that in case of the square wave influence function, the points $x \in D : f^D(x) > \xi$ satisfy the core-point condition of DBSCAN, and each non-core point $x \in D$ which is directly density-reachable from a core-point $x_c$ is attracted by the density-attractor of $x_c$. An example showing the identical clustering of DBSCAN and our approach is provided in figure 5. Note that the results are only identical for a very simple influence function, namely the square wave function.

Figure 5: Generality of our Approach: (a) DBSCAN, (b) DENCLUE.

To show the generality of our approach with respect to partition-based clustering methods such as k-means clustering, we have to use a Gaussian influence function. If $\epsilon$ of definition 3 equals $\sigma/2$, then there exists a $\sigma$ such that our approach will find $k$ density-attractors corresponding to the $k$ mean values of the k-means clustering, and the center-defined clusters (cf. definition 4) correspond to the k-means clusters. The idea of the proof of this property is based on the observation that by varying $\sigma$ we are able to obtain between 1 and $N^*$ clusters ($N^*$ is the number of non-identical data points), and we therefore only have to find the appropriate $\sigma$ to obtain a corresponding k-means clustering. Note that the results of our approach represent a globally optimal clustering while most k-means clustering methods only provide a local optimum of partitioning the data set $D$ into $k$ clusters. The results of k-means clustering and our approach are therefore only the same if the k-means clustering provides a global optimum.

The third class of clustering methods are the hierarchical methods. By using different values for $\sigma$ and the notion of center-defined clusters according to definition 4, we are able to generate a hierarchy of clusters. If we start with a very small value for $\sigma$, we obtain $N^*$ clusters. By increasing $\sigma$, density-attractors start to merge and we get the next level of the hierarchy. If we further increase $\sigma$, more and more density-attractors merge and finally, we only have one density-attractor representing the root of our hierarchy.

3.2 Noise-Invariance

In this subsection, we show the ability of our approach to handle large amounts of noise. Let $D \subseteq F^d$ be a given data set and $DS^D$ (data space) the relevant portion of $F^d$. Since $D$ consists of clusters and noise, it can be partitioned $D = D_C \cup D_N$, where $D_C$ contains the clusters and $D_N$ contains the noise (e.g., points that are uniformly distributed in $DS^D$). Let $X = \{x_1^*, \ldots, x_k^*\}$ be the canonically ordered set of density-attractors belonging to $D$ (wrt. $\sigma$, $\xi$) and $X_C = \{\hat{x}_1^*, \ldots, \hat{x}_{\hat{k}}^*\}$ the density-attractors for $D_C$ (wrt. $\sigma$, $\xi_C$) with $\xi$ and $\xi_C$ being adequately chosen, and let $\|S\|$ denote the cardinality of $S$. Then, we are able to show the following lemma.

Lemma 1 (Noise-Invariance)
The number of density-attractors of $D$ and $D_C$ is the same, and the probability that the density-attractors remain identical converges to 1 for $\|D_N\| \rightarrow \infty$. More formally,

$$\|X\| = \|X_C\| \quad \text{and} \quad \lim_{\|D_N\| \rightarrow \infty} \left( P\left( \sum_{i=1}^{\|X\|} d(x_i^*, \hat{x}_i^*) = 0 \right) \right) = 1.$$

Idea of the Proof: The proof is based on the fact that the normalized density $\tilde{f}^{D_N}$ of a uniformly distributed data set is nearly constant, with $\tilde{f}^{D_N} = c$ ($0 < c \le 1$). According to [Schu70] and [Nad65],

$$\lim_{\|D_N\| \rightarrow \infty} \; \sup_{y \in DS^D} \left|\, c - \frac{1}{\|D_N\| \, \sqrt{2\pi\sigma^2}^{\,d}} \, f^{D_N}(y) \,\right| = 0$$

for any $\sigma > 0$. So the density distribution of $D$ can be approximated by

$$f^D(y) = f^{D_C}(y) + \|D_N\| \, \sqrt{2\pi\sigma^2}^{\,d} \cdot c$$

for any $y \in DS^D$. The second portion of this formula is constant for a given $D$ and therefore, the density-attractors defined by $f^{D_C}$ do not change. $\square$

3.3 Parameter Discussion

As in most other approaches, the quality of the resulting clustering depends on an adequate choice of the parameters. In our approach, we have two important parameters, namely $\sigma$ and $\xi$. The parameter $\sigma$ determines the influence of a point in its neighborhood and $\xi$ describes whether a density-attractor is significant, allowing a reduction of the number of density-attractors and helping to improve the performance. In the following, we describe how the parameters should be chosen to obtain good results.

Choosing a good $\sigma$ can be done by considering different $\sigma$ and determining the largest interval between $\sigma_{max}$ and $\sigma_{min}$ where the number of density-attractors $m(\sigma)$ remains constant. The clustering which results from this approach can be seen as naturally adapted to the data set. In figure 6, we provide an example for the number of density-attractors depending on $\sigma$.

Figure 6: Number of Density-Attractors $m(\sigma)$ depending on $\sigma$.
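A possible sketch of this selection procedure, assuming a user-supplied grid of candidate values and reusing the helpers from the earlier snippets (the choice of the interval midpoint is our own simplification):

```python
def choose_sigma(data, sigma_candidates, delta=0.1):
    """Pick sigma from the largest run of candidate values over which the number
    of density-attractors m(sigma) stays constant (cf. Sec. 3.3)."""
    counts = []
    for sigma in sigma_candidates:
        clusters, _ = center_defined_clusters(data, sigma, xi=0.0, delta=delta)
        counts.append(len(clusters))              # m(sigma) for this candidate
    # find the longest run of consecutive candidates with the same attractor count
    best_start, best_len, start = 0, 1, 0
    for i in range(1, len(counts)):
        if counts[i] != counts[start]:
            start = i
        if i - start + 1 > best_len:
            best_start, best_len = start, i - start + 1
    stable = sigma_candidates[best_start : best_start + best_len]
    return (stable[0] + stable[-1]) / 2           # a sigma from the middle of the stable interval
```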

The parameter $\xi$ is the minimum density level for a density-attractor to be significant. If $\xi$ is set to zero, all density-attractors together with their density-attracted data points are reported as clusters. This, however, is often not desirable since, especially in regions with a low density, each point may become a cluster of its own. A good choice for $\xi$ helps the algorithm to focus on the densely populated regions and to save computational time. Note that only the density-attractors have to have a point-density $\ge \xi$. The points belonging to the resulting cluster, however, may have a lower point-density since the algorithm assigns all density-attracted points to the cluster (cf. definition 4). But what is a good choice for $\xi$? If we assume the database $D$ to be noise-free, all density-attractors of $D$ are significant and $\xi$ should be chosen in

$$0 \le \xi \le \min_{x^* \in X} \{ f^D(x^*) \}.$$

In most cases the database will contain noise ($D = D_C \cup D_N$; cf. subsection 3.2). According to lemma 1, the noise level can be described by the constant $\|D_N\| \cdot \sqrt{2\pi\sigma^2}^{\,d}$ and therefore, $\xi$ should be chosen in

$$\|D_N\| \cdot \sqrt{2\pi\sigma^2}^{\,d} \;\le\; \xi \;\le\; \min_{x^* \in X} \{ f^{D_C}(x^*) \}.$$

Note that the noise level also has an influence on the choice of $\xi$. If the difference between the noise level and the density of small density-attractors is large, then there is a large interval for choosing $\xi$.

4 The Algorithm

In this section, we describe an algorithm which implements our ideas described in sections 2 and 3. For an efficient determination of the clusters, we have to find a way to efficiently calculate the density function and the density-attractors. An important observation is that to calculate the density for a point $x \in F^d$, only points of the data set which are close to $x$ actually contribute to the density. All other points may be neglected without making a substantial error. Before describing the details of our algorithm, in the following we first introduce a local variant of the density function together with its error bound.

4.1 Local Density Function

The local density function is an approximation of the overall density function. The idea is to consider the influence of nearby points exactly whereas the influence of far points is neglected. This introduces an error which can, however, be guaranteed to stay within tight bounds. To define the local density function, we need the function $near(x)$ with $x_1 \in near(x) : d(x_1, x) \le \sigma_{near}$. Then, the local density is defined as follows.

Def. 6 (Local Density Function)
The local density $\hat{f}^D(x)$ is

$$\hat{f}^D(x) = \sum_{x_1 \in near(x)} f_B^{x_1}(x).$$

The gradient of the local density function is defined similarly to definition 2. In the following, we assume that $\sigma_{near} = k\sigma$. An upper bound for the error made by using the local density function $\hat{f}^D(x)$ instead of the real density function $f^D(x)$ is given by the following lemma.

Lemma 2 (Error-Bound)
If the points $x_i \in D : d(x_i, x) > k\sigma$ are neglected, the error is bounded by

$$Error = \sum_{x_i \in D,\ d(x_i, x) > k\sigma} e^{-\frac{d(x_i, x)^2}{2\sigma^2}} \;\le\; \|\{x_i \in D \mid d(x_i, x) > k\sigma\}\| \cdot e^{-k^2/2}.$$

Idea of the Proof: The error-bound assumes that all points are on a hypersphere of radius $k\sigma$ around $x$. $\square$

The error-bound given by lemma 2 is only an upper bound for the error made by the local density function. In real data sets, the error will be much smaller since many points are much further away from $x$ than $k\sigma$.
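A brute-force sketch of the local density function together with the error bound of lemma 2 (the cube-based neighborhood search of section 4.2 is replaced here by a simple distance test of our own):

```python
def local_gaussian_density(x, data, sigma, k=4):
    """Local density (cf. Def. 6): sum Gaussian influences only over points within
    k*sigma of x; also return the worst-case error bound of Lemma 2."""
    dists = np.linalg.norm(data - x, axis=1)
    near = dists <= k * sigma                            # near(x) with sigma_near = k*sigma
    density = float(np.sum(np.exp(-dists[near]**2 / (2 * sigma**2))))
    error_bound = int(np.sum(~near)) * np.exp(-k**2 / 2) # ||{x_i : d > k*sigma}|| * e^(-k^2/2)
    return density, error_bound
```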

4.2 The DENCLUE Algorithm

The DENCLUE algorithm works in two steps. Step one is a preclustering step, in which a map of the relevant portion of the data space is constructed. The map is used to speed up the calculation of the density function, which requires efficient access to neighboring portions of the data space. The second step is the actual clustering step, in which the algorithm identifies the density-attractors and the corresponding density-attracted points.

Step 1: The map is constructed as follows. The minimal bounding (hyper-)rectangle of the data set is divided into d-dimensional hypercubes with an edge length of $2\sigma$. Only hypercubes which actually contain data points are determined. The number of populated cubes ($\|C_p\|$) can range from 1 to $N^*$ depending on the chosen $\sigma$, but does not depend on the dimensionality of the data space. The hypercubes are numbered depending on their relative position from a given origin (cf. figure 7 for a two-dimensional example). In this way, the populated hypercubes (containing d-dimensional data points) can be mapped to one-dimensional keys. The keys of the populated cubes can be efficiently stored in a randomized search tree or a B+-tree.

Figure 7: Map in a 2-dim. data space (hypercubes numbered 1-36 row by row, relative to the origin).

For each populated cube $c \in C_p$, in addition to the key, the number of points ($N_c$) which belong to $c$, pointers to those points, and the linear sum $\sum_{x \in c} x$ are stored.

This information is used in the clustering step for a fast calculation of the mean of a cube ($mean(c)$). Since clusters can spread over more than one cube, neighboring cubes which are also populated have to be accessed. To speed up this access, we connect neighboring populated cubes. More formally, two cubes $c_1, c_2 \in C_p$ are connected if $d(mean(c_1), mean(c_2)) \le 4\sigma$. Doing that for all cubes would normally take $O(\|C_p\|^2)$ time. We can, however, use a second outlier-bound $\xi_c$ to reduce the time needed for connecting the cubes. Let $C_{sp} = \{c \in C_p \mid N_c \ge \xi_c\}$ be the set of highly populated cubes. In general, the number of highly populated cubes $C_{sp}$ is much smaller than $C_p$, especially in high-dimensional space. The time needed for connecting the highly populated cubes with their neighbors is then $O(\|C_{sp}\| \cdot \|C_p\|)$ with $\|C_{sp}\| \ll \|C_p\|$. The cardinality of $C_{sp}$ depends on $\xi_c$. A good choice for $\xi_c$ is $\xi_c = \xi/2d$, since in high-dimensional spaces the clusters are usually located on lower-dimensional hyperplanes. The data structure generated in step 1 of our algorithm has the following properties:

- The time to access the cubes for an arbitrary point is $O(\log(\|C_p\|))$.

- The time to access the relevant portion around a given cube (the connected neighboring cubes) is $O(1)$.
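The preclustering step can be sketched with an ordinary dictionary in place of the randomized search tree or B+-tree; the per-cube information (count, pointers, linear sum) follows the description above, while the function and key names are our own:

```python
from collections import defaultdict

def build_cube_map(data, sigma, xi_c):
    """Step 1 (preclustering): map points into hypercubes of edge length 2*sigma,
    keep per-cube counts, point indices, and linear sums, and connect highly
    populated cubes to neighbors whose means are within 4*sigma."""
    origin = data.min(axis=0)
    keys = [tuple(((p - origin) // (2 * sigma)).astype(int)) for p in data]

    cubes = defaultdict(lambda: {"count": 0, "linear_sum": None, "points": []})
    for idx, key in enumerate(keys):
        cube = cubes[key]
        cube["count"] += 1
        cube["points"].append(idx)
        cube["linear_sum"] = data[idx] if cube["linear_sum"] is None \
            else cube["linear_sum"] + data[idx]

    highly_populated = [k for k, c in cubes.items() if c["count"] >= xi_c]
    connections = defaultdict(set)
    for hk in highly_populated:                  # connect only from highly populated cubes
        h_mean = cubes[hk]["linear_sum"] / cubes[hk]["count"]
        for k, c in cubes.items():
            if k == hk:
                continue
            if np.linalg.norm(h_mean - c["linear_sum"] / c["count"]) <= 4 * sigma:
                connections[hk].add(k)
                connections[k].add(hk)
    return cubes, highly_populated, connections
```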

Step 2: The next step of our algorithm is the clustering step. Only the highly populated cubes and cubes which are connected to a highly populated cube are considered in determining clusters. This subset of $C_p$ is denoted as $C_r = C_{sp} \cup \{c \in C_p \mid \exists c_s \in C_{sp}\ \text{and}\ \exists connection(c_s, c)\}$. Using the data structure constructed in step 1 of the algorithm, we are able to efficiently calculate the local density function $\hat{f}^D(x)$. For $x \in c$ and $c, c_1 \in C_r$, we set $near(x) = \{x_1 \in c_1 \mid d(mean(c_1), x) \le k\sigma\ \text{and}\ \exists connection(c_1, c)\}$. The limit $k\sigma$ is chosen such that only marginal influences are neglected. A value of $k = 4$ is sufficient for practical purposes (compare to the $3\sigma$ rule for the Gaussian distribution [Schu70]). The resulting local density function is

$$\hat{f}_{Gauss}^D(x) = \sum_{x_1 \in near(x)} e^{-\frac{d(x, x_1)^2}{2\sigma^2}}.$$

Similarly, the local gradient $\nabla \hat{f}_{Gauss}^D(x)$ can be computed. With these settings and according to lemma 2, the error at a point $x$ can be approximated as

$$Error \le e^{-\frac{k^2}{2}} \cdot \|D \setminus near(x)\|,$$

where $k = 4$ according to the definition of $near(x)$. To determine the density-attractors for each point in the cubes of $C_r$, a hill-climbing procedure based on the local density function $\hat{f}^D$ and its gradient $\nabla \hat{f}^D$ is used. The density-attractor for a point $x$ is computed as

$$x^0 = x, \quad x^{i+1} = x^i + \delta \cdot \frac{\nabla \hat{f}_{Gauss}^D(x^i)}{\|\nabla \hat{f}_{Gauss}^D(x^i)\|}.$$

The calculation stops at $k \in \mathbb{N}$ if $\hat{f}^D(x^{k+1}) < \hat{f}^D(x^k)$ and takes $x^* = x^k$ as a new density-attractor.


After determining the density-attractor $x^*$ for a point $x$ and $\hat{f}_{Gauss}^D(x^*) \ge \xi$, the point $x$ is classified and attached to the cluster belonging to $x^*$. For efficiency reasons, the algorithm stores all points $x'$ with $d(x^i, x') \le \sigma/2$ for any step $0 \le i \le k$ during the hill-climbing procedure and attaches these points to the cluster of $x^*$ as well. Using this heuristic, all points which are located close to the path from $x$ to its density-attractor can be classified without applying the hill-climbing procedure to them.
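A sketch of the clustering step for a single point, including the path heuristic; for brevity it climbs on the plain Gaussian density of the earlier snippets instead of the cube-based local density, and the cluster label (rounded attractor coordinates) is our own convention:

```python
def classify_point(start_idx, data, assigned, sigma, xi, delta=0.1, max_iter=200):
    """Hill-climb from data[start_idx] and attach every still-unassigned point within
    sigma/2 of the climbing path to the same cluster (path heuristic of Sec. 4.2).
    `assigned` maps point index -> cluster label."""
    cur = data[start_idx].astype(float)
    cur_density = gaussian_density(cur, data, sigma)
    path = [cur]
    for _ in range(max_iter):
        grad = gaussian_gradient(cur, data, sigma)
        norm = np.linalg.norm(grad)
        if norm == 0:
            break
        nxt = cur + delta * grad / norm
        nxt_density = gaussian_density(nxt, data, sigma)
        if nxt_density < cur_density:              # density decreased: cur is the attractor
            break
        cur, cur_density = nxt, nxt_density
        path.append(cur)
    if cur_density < xi:                           # insignificant attractor: outlier
        assigned.setdefault(start_idx, "outlier")
        return "outlier"
    label = tuple(np.round(cur, 3))
    for step in path:                              # attach points close to the climbing path
        near_path = np.linalg.norm(data - step, axis=1) <= sigma / 2
        for idx in np.flatnonzero(near_path):
            assigned.setdefault(idx, label)
    return label
```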

4.3 Time Complexity

First, let us recall the main steps of our algorithm.

DENCLUE($D$, $\sigma$, $\xi$, $\xi_c$)
1. $MBR \leftarrow$ DetermineMBR($D$)
2. $C_p \leftarrow$ DetPopulatedCubes($D$, $MBR$, $\sigma$); $C_{sp} \leftarrow$ DetHighlyPopulatedCubes($C_p$, $\xi_c$)
3. $map, C_r \leftarrow$ ConnectMap($C_p$, $C_{sp}$, $\sigma$)
4. $clusters \leftarrow$ DetDensAttractors($map$, $C_r$, $\sigma$, $\xi$)

Step one is a linear scan of the data set and takes $O(\|D\|)$. Step two takes $O(\|D\| + \|C_p\| \cdot \log(\|C_p\|))$ time because a tree-based access structure (such as a randomized search tree or B+-tree) is used for storing the keys of the populated cubes. Note that $\|C_p\| \le \|D\|$ and therefore, in the worst case, we have $O(\|D\| \cdot \log(\|D\|))$. The complexity of step 3 depends on $\sigma$ and $\xi_c$, which determine the cardinality of $C_{sp}$ and $C_p$. Formally, the time complexity of step 3 is $O(\|C_{sp}\| \cdot \|C_p\|)$ with $\|C_{sp}\| \ll \|C_p\| \le \|D\|$. In step 4, only points which belong to a cube $c \in C_r$ are examined. All other points are treated as outliers and are not used. If the portion of $D$ without outliers is denoted by $D_r$, then the time complexity of this step is $O(\|D_r\| \cdot \log\|C_r\|)$, since the density-attractor of a point can be determined locally. Note that all mentioned time complexities are worst-case time complexities.

As we will see in the experimental evaluation (cf. section 5), in the average case the total time complexity of our approach is much better (in the order $O(\log(\|D\|))$). If there is a high level of noise in the data, the average time complexity is even better (in the order $O(\|\{x \in c \mid c \in C_r\}\|)$). Note that for a new $\sigma$ the data structure does not need to be built up from scratch, especially if $\sigma_{new} = k \cdot \sigma_{old}$. Note further that the algorithm is designed to work on very large data sets and therefore, all steps work secondary-memory based, which is already reflected in the time complexities above.
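For completeness, a hypothetical driver that wires the sketches above to the four steps listed in section 4.3; it builds the cube map but, unlike a full DENCLUE implementation, does not use it during hill-climbing:

```python
def denclue(data, sigma, xi, xi_c, delta=0.1):
    """Toy driver: steps 1-3 build the cube map, step 4 climbs from every
    unassigned point and groups points by the resulting cluster label."""
    cubes, highly_populated, connections = build_cube_map(data, sigma, xi_c)  # steps 1-3
    assigned = {}
    for idx in range(len(data)):                                              # step 4
        if idx not in assigned:
            classify_point(idx, data, assigned, sigma, xi, delta)
    clusters = defaultdict(list)                  # label -> point indices ("outlier" collects noise)
    for idx, label in assigned.items():
        clusters[label].append(idx)
    return dict(clusters)
```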

5 Experimental Evaluation

To show the practical relevance of our new method, we performed an experimental evaluation of the DENCLUE algorithm and compared it to one of the most efficient clustering algorithms, the DBSCAN algorithm [EKSX96]. All experimental results presented in this section are computed on an HP C160 workstation with 512 MBytes of main memory and several GBytes of secondary storage. The DENCLUE algorithm has been implemented in C++ using LEDA [MN89].

The test data used for the experiments are real multimedia data sets from a CAD and a molecular biology application. To show the invariance of our approach with respect to noise, we extended the CAD data set by large amounts of uniformly distributed noise.

5.1 Efficiency

The data used for the efficiency test is polygonal CAD data which is transformed into 11-dimensional feature vectors using a Fourier transformation. We varied the number of data points between 20,000 and 100,000. We chose the parameters of our algorithm and DBSCAN such that both algorithms produced the same clustering of the data set.

Figure 8: Comparison of DENCLUE and DBSCAN (CPU time vs. size of database).

Figure 9: Performance of DENCLUE: (a) Speed Up over DBSCAN, (b) CPU time of DENCLUE vs. size of database.

The results of our experiments are presented in figure 8. The figure shows that DBSCAN has a slightly superlinear performance, and in figure 9a, we show the speed up of our approach (DENCLUE is up to a factor of 45 faster than DBSCAN). Since in figure 8 the performance of our approach is difficult to discern, in figure 9b we show the performance of DENCLUE separately. Figure 9b shows that the performance of DENCLUE is logarithmic in the number of data items. The logarithmic performance of our approach results from the fact that only data points in highly populated regions are actually considered in the clustering step of our algorithm. All other data points are only considered in the construction of the data structure.

5.2 Application to Molecular Biology

To evaluate the effectiveness of DENCLUE, we applied our method to an application in the area of molecular biology. The data used for our tests comes from a complex simulation of a very small but flexible peptide. (Performing the simulation took several weeks of CPU time.) The data generated by the simulation describes the conformation of the peptide as a 19-dimensional point in the dihedral angle space [DJSGM97]. The simulation was done for a period of 50 nanoseconds with two snapshots taken every picosecond, resulting in about 100,000 data points. The purpose of the simulation was to determine the behavior of the molecule in the conformation space. Due to the large amount of high-dimensional data, it is difficult to find, for example, the preferred conformations of the molecule. This, however, is important for applications in the pharmaceutical industry since small and flexible peptides are the basis for many medicaments. The flexibility of the peptides, however, is also responsible for the fact that the peptide has many intermediate non-stable conformations, which causes the data to contain large amounts of noise (more than 50 percent).

(a) Folded State (b) Unfolded State

Figure 10: Folded Conformation of the Peptide

In figure 10, we show the most important conformations (corresponding to the largest clusters of conformations in the 19-dimensional dihedral angle space) found by DENCLUE. The two conformations correspond to a folded and an unfolded state of the peptide.

6 Conclusions

In this paper, we propose a new approach to clustering in large multimedia databases. We formally introduce the notion of influence and density functions as well as different notions of clusters which are both based on determining the density-attractors of the density function. For an efficient implementation of our approach, we introduce a local density function which can be efficiently determined using a map-oriented representation of the data. We show the generality of our approach, i.e., that we are able to simulate locality-based, partition-based, and hierarchical clustering. We also formally show the noise-invariance and error-bounds of our approach. An experimental evaluation shows the superior efficiency and effectiveness of DENCLUE. Due to the promising results of the experimental evaluation, we believe that our method will have a noticeable impact on the way clustering will be done in the future. Our plans for future work include an application and fine tuning of our method for specific applications and an extensive comparison to other clustering methods (such as BIRCH).

References

[DJSGM97] Daura X., Jaun B., Seebach D., van Gunsteren W. F., Mark A. E.: 'Reversible peptide folding in solution by molecular dynamics simulation', submitted to Science, 1997.

[EKSX96] Ester M., Kriegel H.-P., Sander J., Xu X.: 'A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise', Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, 1996.

[EKSX97] Ester M., Kriegel H.-P., Sander J., Xu X.: 'Density-Connected Sets and their Application for Trend Detection in Spatial Databases', Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, 1997.

[EKX95] Ester M., Kriegel H.-P., Xu X.: 'Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification', Proc. 4th Int. Symp. on Large Spatial Databases, 1995, in: Lecture Notes in Computer Science, Vol. 951, Springer, 1995, pp. 67-82.

[FH75] Fukunaga K., Hostetler L. D.: 'The estimation of the gradient of a density function, with applications in pattern recognition', IEEE Trans. Information Theory, Vol. IT-21, 1975, pp. 32-40.

[Jag91] Jagadish H. V.: 'A Retrieval Technique for Similar Shapes', Proc. ACM SIGMOD Int. Conf. on Management of Data, 1991, pp. 208-217.

[Kuk92] Kukich K.: 'Techniques for Automatically Correcting Words in Text', ACM Computing Surveys, Vol. 24, No. 4, 1992, pp. 377-440.

[MG95] Mehrotra R., Gray J. E.: 'Feature-Index-Based Similar Shape Retrieval', Proc. of the 3rd Working Conf. on Visual Database Systems, 1995.

[MN89] Mehlhorn K., Näher S.: 'LEDA, a library of efficient data types and algorithms', University of Saarland, FB Informatik, TR A 04/89.

[Nad65] Nadaraya E. A.: 'On nonparametric estimates of density functions and regression curves', Theory Prob. Appl. 10, 1965, pp. 186-190.

[NH94] Ng R. T., Han J.: 'Efficient and Effective Clustering Methods for Spatial Data Mining', Proc. 20th Int. Conf. on Very Large Data Bases, Morgan Kaufmann, 1994, pp. 144-155.

[Sch96] Schikuta E.: 'Grid clustering: An efficient hierarchical clustering method for very large data sets', Proc. 13th Conf. on Pattern Recognition, Vol. 2, IEEE Computer Society Press, pp. 101-105.

[Schn64] Schnell P.: 'A method to find point-groups', Biometrika, 6, 1964, pp. 47-48.

[Schu70] Schuster E. F.: 'Note on the uniform convergence of density estimates', Ann. Math. Statist. 41, 1970, pp. 1347-1348.

[SH94] Sawhney H., Hafner J.: 'Efficient Color Histogram Indexing', Proc. Int. Conf. on Image Processing, 1994, pp. 66-70.

[WW80] Wallace T., Wintz P.: 'An Efficient Three-Dimensional Aircraft Recognition Algorithm Using Normalized Fourier Descriptors', Computer Graphics and Image Processing, Vol. 13, 1980.

[WYM97] Wang W., Yang J., Muntz R.: 'STING: A Statistical Information Grid Approach to Spatial Data Mining', Proc. 23rd Int. Conf. on Very Large Data Bases, Morgan Kaufmann, 1997.

[XEKS98] Xu X., Ester M., Kriegel H.-P., Sander J.: 'A Nonparametric Clustering Algorithm for Knowledge Discovery in Large Spatial Databases', to appear in Proc. IEEE Int. Conf. on Data Engineering, 1998, IEEE Computer Society Press.

[ZRL96] Zhang T., Ramakrishnan R., Livny M.: 'BIRCH: An Efficient Data Clustering Method for Very Large Databases', Proc. ACM SIGMOD Int. Conf. on Management of Data, ACM Press, 1996, pp. 103-114.