
Journal of VLSI Signal Processing 37, 211–222, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

Optimal Smoothing of Kernel-Based Topographic Maps with Application to Density-Based Clustering of Shapes

MARC M. VAN HULLE AND TEMUJIN GAUTAMA
K.U. Leuven, Laboratorium voor Neuro- en Psychofysiologie, Campus Gasthuisberg, Herestraat 49, B-3000 Leuven, Belgium

Abstract. A crucial issue when applying topographic maps for clustering purposes is how to select the map's overall degree of smoothness. In this paper, we develop a new strategy for optimally smoothing, by a common scale factor, the density estimates generated by Gaussian kernel-based topographic maps. We also introduce a new representation structure for images of shapes, and a new metric for clustering them. These elements are incorporated into a hierarchical, density-based clustering procedure. As an application, we consider the clustering of shapes of marine animals taken from the SQUID image database. The results are compared to those obtained with the CSS retrieval system developed by Mokhtarian and co-workers, and with the more familiar Euclidean distance-based clustering metric.

Keywords: shape clustering, kernel-based topographic maps, density estimation, density-based clustering, optimal smoothing

1. Introduction

The visualization of clusters in high-dimensional spaces with topographic maps has recently attracted the attention of the data mining community ([1–6], and references therein). Topographic maps can be regarded as discrete lattice-based approximations to non-linear data manifolds, and, in this way, used for projecting and visualizing the data. What is less evident from the literature is that this required a shift in the clustering paradigm. Originally, topographic maps, such as the popular Self-Organizing Map (SOM) [7–9], were used for similarity-based clustering: the converged neuron weights correspond to the centers of the individual clusters, and the Voronoi regions defined by them correspond to the cluster regions. Data points are assigned to the neurons for which the Euclidean distances to their weights are minimal (Winner-Take-All neurons).¹

However, this requires prior knowledge of the number of clusters in the data set, which makes this approach less suited for an exploratory data analysis. Furthermore, similarity-based clustering assumes, albeit often tacitly, that the cluster shape is hyperspherical, at least when the Euclidean distance metric is used. The shape of a real-world cluster may not comply with this assumption.

There is another way by which clustering can be performed with topographic maps, and which is deemed more suitable for exploratory data analysis. The distribution of the converged neuron weights can be regarded as a non-parametric estimate of the data density. The high density regions are then hypothesized to correspond to individual clusters. This approach is called density-based clustering. Unfortunately, the weight density achieved with the SOM algorithm at convergence is not a linear function of the data density [10, 11], so that the quality of the density estimate is often inferior to what can be expected from other techniques, including several topographic map formation algorithms (for a review, see [12]). One way to improve the density estimation capabilities is to employ neurons with kernel-based activation functions, such as Gaussians, instead of Winner-Take-All functions.


Several versions of such kernel-based topographic maps, as they are called, have been suggested in the literature, together with a series of learning algorithms for training them (for an overview, see [13]). In an attempt not to assume prior knowledge of the number of clusters, and since the weight distribution achieved at convergence is related to the input density, several authors have developed ways of visualizing clusters in the input density using topographic maps (see [9], p. 117). A simple technique, called gray level clustering, represents the relative distances between the weights of neighboring neurons by gray scales.

There are two key issues we have to deal with when using topographic maps for density-based clustering purposes: topological defects and optimal smoothing of the density estimate. Indeed, when the topographic map contains topological defects (neighboring data points are not mapped onto neighboring neurons), a contiguous cluster could become split into separate clusters. In previous contributions [14, 15], also to this journal, we have introduced an algorithm that monitors the degree of topology preservation achieved by kernel-based maps during learning. As a real-world application, we considered the identification of musical instruments, and the notes played by them, by means of a hierarchical clustering analysis, starting from the music signal's spectrogram. Topographic map formation was achieved with the kernel-based Maximum Entropy learning Rule (kMER), and it was shown to yield an equiprobabilistic map of heteroscedastic Gaussian density mixtures [12, 16].

In this paper, we will concentrate on the second issue and develop a new technique for determining the overall degree of smoothness of the density estimate by scaling the widths of all kernels by a common factor. As a real-world application, we will consider the image database of contours of marine animals, compiled by Mokhtarian and co-workers [17], and perform a hierarchical clustering in order to group contours that show similar global shapes. We will introduce for this purpose a new clustering metric for assigning contours to clusters ("labeling"). The metric is based on outlier detection.

Figure 1. Kernel-based maximum entropy learning. (A) Neuron i has a localized receptive field K(v − w_i, σ_i), centered at w_i in input space V ⊆ ℝ^d. The intersection of K with the present threshold τ_i defines a region S_i, with radius σ_i, also in V-space. The present input v ∈ V is indicated by the black dot and falls outside S_i. (B) Receptive field region update. The arrow indicates the update of the RF center w_i, given the present input v (not to scale); the dashed circles indicate the updated RF regions S_i and S_j. For clarity's sake, the range of the neighborhood function is assumed to have vanished so that w_j is not updated. The shaded area indicates the overlap between S_i and S_j before the update.

The paper is organized as follows. First, we briefly re-introduce the kMER learning scheme that we will use for training the kernel-based topographic maps. Then, in Section 3, we show how density estimation can be performed with these maps, introduce the issue of optimal smoothness, and our solution for determining it. In Section 4, we describe the image database of marine animal contours, introduce our contour representation, and formulate the clustering problem. In Section 5, we detail our hierarchical clustering procedure as it is applied to the image database. We also introduce here our new labeling method based on outlier detection. Finally, we discuss our results and compare them to Mokhtarian's, and also to those obtained by using the more familiar Euclidean distance-based metric.

2. Kernel-Based Maximum Entropy Learning Rule

Consider a lattice A, with a regular and fixed topology, of arbitrary dimensionality d_A, in a d-dimensional input space V ⊆ ℝ^d. To each of the N positions in the lattice corresponds a formal neuron i which possesses, in addition to the traditional weight vector w_i, a (hyper)spherical activation region S_i, called receptive field (RF) region, with radius σ_i, in V-space (Fig. 1(A)).


The neural activation state is represented by the code membership function:

$$\mathbb{1}_i(\mathbf{v}) = \begin{cases} 1 & \text{if } \mathbf{v} \in S_i \\ 0 & \text{if } \mathbf{v} \notin S_i, \end{cases} \qquad (1)$$

with v ∈ V. As the definition of S_i suggests, several neurons may be active for a given input v. Hence, we need an alternative definition of competitive learning [16]. Define Ξ_i as the fuzzy code membership function of neuron i:

$$\Xi_i(\mathbf{v}) = \frac{\mathbb{1}_i(\mathbf{v})}{\sum_{k \in A} \mathbb{1}_k(\mathbf{v})}, \quad \forall i \in A, \qquad (2)$$

so that 0 ≤ Ξ_i(v) ≤ 1 and Σ_i Ξ_i(v) = 1.

With the kernel-based Maximum Entropy learning Rule (kMER), the weights w_i are adapted so as to produce a topology-preserving mapping; the radii σ_i are adapted so as to produce a lattice of which the neurons have an equal probability to be active (equiprobabilistic map), i.e., P(𝟙_i(v) = 1) = ρ/N, ∀i, with ρ a scale factor. The neuron weights w_i are updated as follows [16]:

$$\Delta \mathbf{w}_i = \eta \sum_{j \in A} \Lambda(i, j, \sigma_\Lambda)\, \Xi_j(\mathbf{v}^\mu)\, \mathrm{Sgn}(\mathbf{v}^\mu - \mathbf{w}_i), \qquad (3)$$

and their radii σ_i:

$$\Delta \sigma_i = \eta \left( \frac{\rho_r}{N} \bigl(1 - \mathbb{1}_i(\mathbf{v}^\mu)\bigr) - \mathbb{1}_i(\mathbf{v}^\mu) \right), \quad \forall i, \qquad (4)$$

with ρ_r ≜ ρN/(N − ρ), Sgn(x) the sign function acting componentwise, and Λ(·) the usual neighborhood function, e.g., a Gaussian, of which the range σ_Λ is gradually decreased during learning:

$$\sigma_\Lambda(t) = \sigma_{\Lambda,0} \exp\left( -2 \sigma_{\Lambda,0}\, \frac{t}{t_{\max}\, \gamma_{OV}} \right), \qquad (5)$$

where σ_{Λ,0} is the initial neighborhood range, and γ_OV is a parameter that controls the slope of the cooling scheme ("gain"). The combined effect of the radius and weight update rules is illustrated in Fig. 1(B).
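To make the update cycle concrete, the following is a minimal Python sketch of one kMER iteration (Eqs. (3)–(5)). The function and parameter names are ours, not the paper's, and details such as the initialization and the monitoring of topology preservation [14, 15] are omitted.

```python
import numpy as np

def kmer_step(v, W, sigma, grid, t, t_max, eta=0.001, rho_r=2.0,
              sigma_l0=2.5, gamma_ov=1.0):
    """One kMER update for a single input v (sketch of Eqs. (3)-(5)).

    v     : (d,)   input sample v^mu
    W     : (N, d) kernel centers w_i
    sigma : (N,)   receptive-field radii sigma_i
    grid  : (N, dA) lattice coordinates of the neurons
    rho_r : rho * N / (N - rho) in the paper's notation
    """
    N = W.shape[0]
    dist = np.linalg.norm(v - W, axis=1)               # ||v - w_i||
    active = (dist <= sigma).astype(float)             # code membership, Eq. (1)
    if active.sum() > 0:
        xi = active / active.sum()                     # fuzzy membership, Eq. (2)
    else:                                              # no neuron active: fall back to the nearest one
        xi = np.zeros(N)
        xi[np.argmin(dist)] = 1.0
    # neighborhood range "cooling", Eq. (5)
    sigma_l = sigma_l0 * np.exp(-2.0 * sigma_l0 * t / (t_max * gamma_ov))
    # Gaussian neighborhood function Lambda(i, j, sigma_l) on the lattice
    d_lat = np.linalg.norm(grid[:, None, :] - grid[None, :, :], axis=2)
    lam = np.exp(-d_lat**2 / (2.0 * sigma_l**2))
    # weight update, Eq. (3): a scalar neighborhood factor times the componentwise sign
    W = W + eta * (lam @ xi)[:, None] * np.sign(v - W)
    # radius update, Eq. (4)
    sigma = sigma + eta * (rho_r / N * (1.0 - active) - active)
    return W, sigma
```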

3. Non-Parametric Density Estimation and Optimal Smoothness

With kMER, in addition to the centers, the radii of the S_i are individually adapted such that they are activated with equal probabilities. The connection with variable kernel density estimation using radially symmetrical Gaussian kernels can be made easily. Hence, we obtain the following heteroscedastic Gaussian density model with equal mixtures:

$$p_{\rho_s}(\mathbf{v}^\mu) = \frac{1}{N} \sum_{i=1}^{N} \frac{\exp\left( -\frac{\|\mathbf{v}^\mu - \mathbf{w}_i\|^2}{2 (\rho_s \sigma_i)^2} \right)}{\left( \sqrt{2\pi}\, \rho_s \sigma_i \right)^d}, \qquad (6)$$

since the kernels are active with the same probability, with ρ_s a factor controlling the overall degree of smoothness. The question is now: how to choose ρ_s?
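As a rough illustration, Eq. (6) can be evaluated as in the sketch below (variable names are ours). Note that, as discussed later in Section 5.4, the factor (√(2π) ρ_s σ_i)^d makes this direct evaluation numerically unstable for large d.

```python
import numpy as np

def density_estimate(v, W, sigma, rho_s):
    """Heteroscedastic Gaussian mixture with equal weights, Eq. (6).
    Direct evaluation; unstable for large d because of the
    (sqrt(2*pi) * rho_s * sigma_i)**d normalization term."""
    N, d = W.shape
    s = rho_s * sigma                                  # scaled kernel radii
    sq = np.sum((v - W) ** 2, axis=1)                  # ||v - w_i||^2
    kernels = np.exp(-sq / (2.0 * s ** 2)) / (np.sqrt(2.0 * np.pi) * s) ** d
    return kernels.sum() / N
```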

Contrary to the univariate case, only very few methods are available for determining the optimal smoothing factor in multivariate density estimation. For example, the method described by Luc Devroye, which has been extensively tested on univariate examples [18], is not readily extendible to the multivariate case. The methods that have been described are twofold. A first class of methods is based on, or is an extension of, the Least-Squares Cross-Validation (LSCV) method (for an overview, see [19]). However, with these methods, a (Gaussian) kernel needs to be positioned at every data point, whereas in our case, there are typically many more data points than kernels. A second class of methods adopts a binning approach, and positions a kernel at every bin's center (for an overview, see [20]). The applicability of these methods is limited to low-dimensional cases, exactly due to the required binning of the input space. Finally, one should note that both classes of methods are iterative procedures for which the optimal smoothing factors need to be derived by an exhaustive search method such as grid search.

We suggest here a new method for determining the optimal smoothing factor ρ_{s,opt}. We locally match the data density contained in a d-dimensional hypersphere S_i, centered at w_i and with radius σ_i, to the density generated by a d-dimensional Gaussian with center w_i and radius ρ_s σ_i. For sufficiently high input dimensions, d > 10, the distribution of the distances of the data points to w_i becomes approximately Gaussian, with mean ρ_s σ_i √d and standard deviation ρ_s σ_i/√2. This can be shown as follows.

Assume a d-dimensional circularly symmetrical Gaussian with mean [μ_j], j = 1, ..., d, and standard deviation σ. The squared Euclidean distance to the center, x ≜ Σ_{j=1}^{d} (v_j − μ_j)², with v = [v_j], is known to obey the chi-squared distribution with θ = 2 and α = d/2 degrees of freedom [21]:

$$p_{\chi^2}(x) = \frac{\left( \frac{x}{\sigma^2} \right)^{\frac{d}{2}-1} \exp\left( -\frac{x}{2\sigma^2} \right)}{2^{\frac{d}{2}}\, \sigma^2\, \Gamma\!\left( \frac{d}{2} \right)}, \qquad (7)$$

for 0 ≤ x < ∞, and with Γ(·) the gamma function. Hence, the distribution of the Euclidean distance to the center becomes p(r) = 2r p_{χ²}(r²), with r = √x, following the fundamental law of probabilities. After some algebraic manipulations, we can write the distribution of the Euclidean distances as follows:

$$p(r) = \frac{2 \left( \frac{r}{\sigma} \right)^{d-1} \exp\left( -\frac{r^2}{2\sigma^2} \right)}{2^{\frac{d}{2}}\, \sigma\, \Gamma\!\left( \frac{d}{2} \right)}. \qquad (8)$$

The distribution is plotted in Fig. 2 (thick and thin continuous lines).

Figure 2. (A) Distribution functions of the Euclidean distance from the center of a unit-variance, radially-symmetrical Gaussian, parameterized with respect to the dimensionality d. The functions are plotted for d = 1, ..., 10 (from left to right); the thick line corresponds to the d = 1 case. (B) Skewness (continuous line) and Fisher kurtosis (dashed line) of the distance distributions as a function of the dimensionality d.

The mean of r equals μ_r = √2 σ Γ((d+1)/2) / Γ(d/2), which can be approximated as √d σ, for d large, using the approximation for the ratio of the gamma functions by Graham and co-workers [22]; the second moment around zero equals dσ². The distribution p(r) seems to quickly approach a Gaussian with mean √d σ and standard deviation σ/√2 when d increases. This can be shown formally by calculating the skewness and Fisher kurtosis:

$$\text{skewness} = E(r - \mu_r)^3 = \mu_r\left( -2d + 1 + 2\mu_r^2 \right), \qquad (9)$$

$$\text{Fisher kurtosis} = \frac{E(r - \mu_r)^4}{\left( E(r^2) - \mu_r^2 \right)^2} - 3 = \frac{d^2 + d\left( 2 + 2\mu_r^2 \right) - 4\mu_r^2 - 3\mu_r^4}{\left( d - \mu_r^2 \right)^2} - 3. \qquad (10)$$

We can verify that the skewness and Fisher kurtosis are equal to 0.116 and 3.70 × 10⁻² for d = 5, and 8.07 × 10⁻² and 8.71 × 10⁻³ for d = 10, respectively. They are plotted as a function of d in Fig. 2(B). Hence, we can safely state that, for d > 10, the distance distribution closely resembles a Gaussian.
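A quick Monte Carlo check of this approximation (our own illustration, not from the paper): sample a d-dimensional, unit-variance Gaussian and compare the empirical mean and standard deviation of the distances with √d and 1/√2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 100_000
r = np.linalg.norm(rng.standard_normal((n, d)), axis=1)   # distances to the center
print(r.mean(), np.sqrt(d))         # approx. 4.42 vs. 4.47
print(r.std(), 1.0 / np.sqrt(2))    # approx. 0.70 vs. 0.71
```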

We now suggest to determine the optimal degree of smoothing as the one for which the mean distance to the center matches the Gaussian prediction (averaged over all neurons i):

$$\rho_{s,\mathrm{opt}} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left\langle \|\mathbf{v}^\mu - \mathbf{w}_i\| \right\rangle_{\mathbf{v}^\mu \in S_i}}{\sigma_i \sqrt{d}}. \qquad (11)$$

Note that, unlike traditional methods, Eq. (11) does not require an iterative procedure. In order to test our method, we consider two types of input distributions: a d-dimensional Gaussian N(0, 1), and a d-dimensional uniform distribution [−1, 1]^d. We train a 5 × 5 lattice with kMER during t_max = 2000 time steps, with η = 0.001 and ρ_r = 2. The results for our method are shown in Fig. 3(A) and (B), for the Gaussian and uniform input distributions, respectively, together with the results obtained for the LSCV method, and the theoretically optimal results, which are found by minimizing the MSE between the estimated and actual probability density function. We clearly observe that our method outperforms the LSCV method, in particular in the high-dimensional case.

Figure 3. Optimal degrees of smoothness obtained for a Gaussian (A) and uniform input distribution (B), as a function of input dimensionality d, using our method, Eq. (11) (thick continuous lines), and the LSCV method (thin continuous lines). The theoretically optimal results (dashed lines) are also shown.
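A direct, non-iterative implementation of Eq. (11) could look as follows (a sketch under our own naming; neurons whose RF region happens to contain no data points are simply skipped here).

```python
import numpy as np

def optimal_smoothing(V, W, sigma):
    """Common smoothing factor rho_s,opt of Eq. (11).
    V: (M, d) data, W: (N, d) kernel centers, sigma: (N,) RF radii."""
    M, d = V.shape
    ratios = []
    for w_i, s_i in zip(W, sigma):
        dist = np.linalg.norm(V - w_i, axis=1)
        inside = dist <= s_i                    # samples falling inside the RF region S_i
        if inside.any():
            ratios.append(dist[inside].mean() / (s_i * np.sqrt(d)))
    return float(np.mean(ratios))
```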

4. Image Database

We will train our topographic maps on patterns taken from an image database with the purpose of detecting groups of similar images. In particular, we will consider the Shape Queries Using Image Databases (SQUID) database [17], which consists of M = 1100 images of contours of marine animals, with a large variety of shapes. A typical set of contours is shown in Fig. 4. One can immediately see that the contours display similarities in overall shape, tails and fins.

Figure 4. Examples of marine animal contours in the SQUID database. Numbers above the contours refer to the indices in the original database.

Figure 5. Normalized contour of marine animal 633, obtained by resampling and subtracting the mean of the original contour in the SQUID database.

Since the contours are described by a varying number of points (mean 692.8 and standard deviation 205.5), we resample them to 256 points, using linear interpolation. One such result is shown in Fig. 5. Afterwards, we translate every contour such that its center of mass is positioned at (0, 0), and compute its Fourier transform. Only the amplitude spectrum is retained, which is further normalized so that a translation-, rotation-, and scale-invariant representation of the contour is obtained. The resulting spectra are coded by 256-dimensional vectors. An example is shown in Fig. 6. Note that this representation is also invariant to the direction in which the contour is traversed (clock- or counter-clockwise), since this only affects the phase, but not the amplitude.

The idea is now to perform a hierarchical clustering of these Fourier-transformed contours in order to detect groups of contours with similar global shapes.
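The sketch below illustrates one plausible reading of this representation (treating the contour as a complex signal and resampling along the point index); the exact preprocessing used in the paper and in [17] may differ in detail.

```python
import numpy as np

def contour_descriptor(contour, n_points=256):
    """Translation-, rotation- and scale-invariant descriptor of a closed
    contour, following the recipe in the text. `contour` is an (L, 2) array."""
    L = len(contour)
    idx = np.linspace(0.0, L - 1.0, n_points)           # resample to 256 points
    x = np.interp(idx, np.arange(L), contour[:, 0])
    y = np.interp(idx, np.arange(L), contour[:, 1])
    z = (x - x.mean()) + 1j * (y - y.mean())             # center of mass at (0, 0)
    amp = np.abs(np.fft.fft(z))                          # amplitude spectrum only
    return amp / np.linalg.norm(amp)                     # normalize (scale invariance)
```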


Figure 6. Normalized amplitude spectrum corresponding to Fig. 5.

5. Hierarchical Clustering Procedure

We perform a hierarchical density-based clustering analysis of the M = 1100 normalized amplitude spectra, which have been obtained as explained in the previous section. The clustering analysis proceeds as follows. First, the probability density function (pdf) underlying the data is estimated with our kernel-based topographic map. Clusters correspond to high density peaks in the pdf estimate and are detected without a priori knowledge of their number. The data set is segmented into subsets by classifying each contour to its corresponding cluster, after which the next level of the clustering analysis is performed on every such subset (hierarchical clustering).

5.1. Topographic Map Formation

We train our lattices with kMER, Eqs. (3) and (4), using a Gaussian neighborhood function Λ(·) and the neighborhood "cooling" scheme given in Eq. (5). The initial neighborhood range σ_{Λ,0} is set equal to √N/2. The "gain" γ_OV in Eq. (5) is optimized for, in an iterative manner, using a monitoring algorithm that tries to minimize the variability in the degree of overlap between the neurons' RF regions (Overlap Variability, OV). The lattice is more likely to be disentangled when the OV is minimal. For more details, we refer to Van Hulle [14] and Van Hulle and Gautama [15].

A hierarchy of topographic maps is developed with which a hierarchical clustering analysis is performed (see further). For the root and Level 1 nodes in the clustering hierarchy, we use 7 × 7 lattices, and smaller, 5 × 5 lattices for the other nodes. If no clusters are detected for the smaller maps, and if the number of contours in the training set exceeds 49, we retrain the map but now using a 7 × 7 lattice (see Discussion).
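Putting the pieces together, a training run as described above might look like the sketch below. It reuses the kmer_step sketch from Section 2, replaces the amplitude spectra by placeholder data, and does not show the iterative optimization of γ_OV via the OV monitoring algorithm of [14, 15].

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((1100, 256))             # placeholder for the amplitude spectra
n_rows = n_cols = 7                              # 7 x 7 lattice for the root node
N = n_rows * n_cols
grid = np.array([(i, j) for i in range(n_rows) for j in range(n_cols)], dtype=float)
W = V[rng.choice(len(V), N, replace=False)].copy()   # initialize centers on data points
sigma = np.full(N, 1.0)
t_max = 2000
for t in range(t_max):
    v = V[rng.integers(len(V))]
    W, sigma = kmer_step(v, W, sigma, grid, t, t_max,    # sketch from Section 2
                         eta=0.001, rho_r=2.0,
                         sigma_l0=np.sqrt(N) / 2.0, gamma_ov=1.0)
```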

5.2. Density Estimation

We construct a kernel-based density estimate for each topographic map developed in the previous stage, and determine the optimal smoothing factor ρ_{s,opt}, as explained in Section 3.

5.3. Clustering and Labeling the Topographic Map

We apply the discrete hill-climbing algorithm described in Van Hulle [12, 14]. The algorithm determines the number c and the locations of the local density peaks in the pdf estimate, Eq. (6), evaluated only at the neurons' weights: peaks that are not surmounted by higher peaks within a range of k nearest neurons. The number of clusters is determined and plotted as a function of k, in the valid range k = 1, ..., N/2. As the final number of clusters, we take the number that corresponds to the longest plateau in the plot. The plateau for c = 1 is rejected, since this is a degenerate case. Figure 7(A) shows an example plot, namely that found for the root node of the cluster hierarchy. Finally, the neurons in the lattice are labeled according to which local density peak they belong. For this we take the clustering result for k at the beginning of the plateau. Each neuron receives a greyscale according to which cluster it belongs to. This results in a cluster map, an example of which is shown in Fig. 7(B), for the root node in the cluster hierarchy.
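The following sketch captures the idea of the plateau-based selection of the number of clusters; it is our paraphrase, not the exact discrete hill-climbing algorithm of [12, 14].

```python
import numpy as np

def count_peaks(p, W, k):
    """Number of neurons whose density p (Eq. (6) evaluated at the weights)
    is not surmounted within their k nearest neurons in weight space."""
    D = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=2)
    peaks = 0
    for i in range(len(W)):
        nn = np.argsort(D[i])[1:k + 1]          # k nearest neurons, excluding i itself
        peaks += int(np.all(p[i] >= p[nn]))
    return peaks

def select_cluster_count(p, W):
    """Scan k = 1..N/2, find the longest plateau in the cluster-count curve,
    and reject the degenerate c = 1 plateau."""
    N = len(W)
    counts = [count_peaks(p, W, k) for k in range(1, N // 2 + 1)]
    plateaus = {}                               # cluster count -> longest run length
    run_val, run_len = counts[0], 1
    for c in counts[1:] + [None]:               # the None sentinel flushes the last run
        if c == run_val:
            run_len += 1
        else:
            if run_val != 1:
                plateaus[run_val] = max(plateaus.get(run_val, 0), run_len)
            run_val, run_len = c, 1
    return max(plateaus, key=plateaus.get) if plateaus else 1
```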

5.4. Labeling the Training Set

We now need a metric for assigning input patterns (i.e., contour amplitude spectra) to clusters. We opt for one based on outlier detection, since outlier probabilities are one-dimensional quantities. To each cluster corresponds a (cluster-conditional) pdf estimate. An input pattern is classified to the cluster for which the probability of being an outlier is minimal (i.e., an outlier with respect to the cluster's pdf estimate). There is, however, a problem due to the high dimensionality of our data: the Gaussian kernels cannot be evaluated directly, since this leads to numerical instabilities (see the factor σ_i^d in Eq. (6)). We have, therefore, developed the following new method.


Figure 7. (A) Number of clusters that are detected as a function of the number of nearest neurons, k, in the discrete hill-climbing algorithm for the root node. The longest plateau is found for three clusters, starting at k = 5. (B) Cluster map for the root node corresponding to k = 5.

Rather than evaluating the high-dimensional Gaussians directly, we evaluate the one-dimensional distributions that describe the distances to the kernel centers, which are approximately Gaussian for d > 10, with means ρ_s σ_i √d and standard deviations ρ_s σ_i/√2 (see Section 3). By observing the corresponding cumulative distribution, we compute the cluster outlier probability for cluster n, p_n, using:

$$p_n(\mathbf{v}^\mu) = \frac{1}{2 N_n} \sum_{i \in \text{cluster } n} \left( 1 + \mathrm{erf}\!\left( \frac{\|\mathbf{v}^\mu - \mathbf{w}_i\| - \rho_{s,\mathrm{opt}}\, \sigma_i \sqrt{d}}{\rho_{s,\mathrm{opt}}\, \sigma_i} \right) \right), \qquad (12)$$

where N_n is the number of neurons that belong to cluster n. Input pattern v^μ is classified to the cluster for which the outlier probability, p_n, is the smallest. We apply this method, rather than a straightforward computation of the cluster probability, since the latter yields numerically unstable results for higher dimensions due to the normalization term for unit-volume Gaussians. Furthermore, it offers an advantage over a Euclidean distance-based metric, where an input pattern receives the same label as its nearest neuron (as is used in [23]), since it takes into account the widths of the clusters. However, since in high-dimensional spaces the erf function is steep, the case where p_n = 1 for all clusters can occur and, hence, a tie-breaking strategy is needed. We opt for the smallest Euclidean distance to the nearest neuron as a tie-break.
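In code, Eq. (12) and the resulting labeling rule can be sketched as follows (our own helper names; the tie-break on the Euclidean distance to the nearest neuron is omitted).

```python
import numpy as np
from scipy.special import erf

def outlier_probability(v, W_n, sigma_n, rho_s, d):
    """Cluster outlier probability p_n(v) of Eq. (12), given the centers W_n
    and radii sigma_n of the neurons belonging to cluster n."""
    dist = np.linalg.norm(v - W_n, axis=1)
    z = (dist - rho_s * sigma_n * np.sqrt(d)) / (rho_s * sigma_n)
    return float(np.mean(0.5 * (1.0 + erf(z))))

def assign_cluster(v, clusters, rho_s, d):
    """Label v with the cluster of minimal outlier probability.
    `clusters` is a list of (W_n, sigma_n) pairs (an illustrative structure)."""
    probs = [outlier_probability(v, W_n, s_n, rho_s, d) for W_n, s_n in clusters]
    return int(np.argmin(probs))
```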

5.5. Hierarchical Clustering

The set of input patterns on which a given node in the hierarchy is trained is segmented into subsets according to the patterns' classification labels obtained in the previous step. The clustering analysis is then re-applied to each subset. The procedure terminates, and produces a leaf node in the clustering hierarchy, if the discrete hill-climbing algorithm does not yield a plateau, or if the number of input patterns applied to a given node does not exceed the number of neurons in the node's topographic map.

In total, we have detected 65 clusters corresponding to leaf nodes in the clustering hierarchy (the deepest leaf node is situated at Level 10, with the root node being at Level 0). The first cluster is found at Level 2 and is shown in Fig. 8. Possibly, this cluster is found at an early level because: (1) the overall shape is very different from the others, and (2) there are sufficient similar contours in the database to define a cluster at this level.

Note that the labeling of the data occurs in a hierarchical manner: every data point "traverses" the tree until it arrives at a leaf node. An alternative method would be to construct a partial pdf estimate for every leaf node in the hierarchy. If every cluster is considered equiprobable, a contour could be classified to the corresponding pdf estimate for which the outlier probability, Eq. (12), is the smallest. This would yield different results for data points that lie in the border region between clusters: the pdf estimates become more detailed at deeper levels, due to which the boundaries can shift slightly.


Figure 8. Contours that are classified to a Level 2 cluster ([2 1]) (A), and a Level 7 cluster ([0 0 1 0 0 1 2]) (B). The numbers above the contours refer to the indices in the original database.

6. Discussion

6.1. Shape Clustering

In the literature, several clustering-based procedures are described for constructing representation structures for sets of contours [24, 25]. These procedures assume prior knowledge of the number of clusters in the data set. In a different context, namely that of querying large databases for contours that are similar to a target, various shape similarity metrics have been suggested [17, 26, 27]. We have introduced in this article a novel way of constructing a representation structure for a set of contours, which can be used for shape classification purposes. The advantage is that it uses a data model which, in general, greatly facilitates data handling. Our procedure relies on a hierarchical, density-based clustering approach.

6.2. Shape Clustering with Our Approach

There are two aspects of the shapes in the SQUID database that make it hard to cluster them. First, there is the wide variety in shapes, due to which one would expect a lot of clusters to be detected. Second, the number of patterns that defines a single cluster is very small, which makes kMER training more difficult (few training patterns and high dimensionality). For example, sometimes no clusters were detected in relatively large subsets (M > 49 input patterns), where clusters would be expected. In these cases, the estimated pdf is not detailed enough and is too smooth to indicate the presence of clusters. Indeed, since neuron i is activated by M_i = ρ_r M / (N − ρ_r) samples in the kMER algorithm [16], several clusters are modeled by a single kernel if the number of patterns belonging to one cluster (which is not known a priori) is less than M_i. Thus, the resulting kernels are too wide to yield distinct local peaks in the pdf estimate. Therefore, we retrain the kMER algorithm using a 7 × 7 lattice if no clusters are detected at a given node, and only terminate the analysis for that node if the latter map does not yield any clusters either. In this way, a more detailed pdf estimate can be developed if necessary. This was the case for 4 nodes in our clustering hierarchy. The converse situation has not posed any problems: the kMER algorithm and the subsequent pdf estimation have proven to be robust even when there are very few contours in the training set: with as few as 25 contours, we have trained 5 × 5 topographic maps that still showed the presence of clusters.

As a measure of the confidence with which a given contour is classified at a certain level in the hierarchy, we compute the difference between the lowest and second-lowest outlier probability taken over the subclusters of the same parent cluster (i.e., "sister" clusters), which will be equal to 1 if the contour is classified with 100% certainty, and 0 if both probabilities are the same, possibly when the contour is a definite outlier for both clusters and where the Euclidean distance metric has been used as a tie-break. We have computed this for the 65 leaf nodes in the clustering hierarchy. The mean classification confidence is 0.121. This is a relatively small number, possibly due to the high dimensionality of the data, d = 256, and the relatively small number of important dimensions: a principal component analysis shows that the cumulative eigenvalue plot reaches 99% for 20 principal components. This means that most of the variance is concentrated in a subspace, which has its implications for the computation of the outlier probability, Eq. (12), and possibly even for the computation of the optimal degree of smoothing, Eq. (11).

Figure 9. Contours that are classified to a Level 4 cluster [1 0 1 2] (A), and a Level 5 cluster [0 0 1 0 1] (B). The target contours are indicated by a rectangle.

Figure 9 shows two example clusters (Level 4 in Fig. 9(A) and Level 5 in Fig. 9(B)). For one contour in every cluster, which would intuitively be considered an outlier (the framed ones in Fig. 9), the classification confidences are computed. The classification confidence for contour 780 in Fig. 9(A) is 6.5 × 10⁻⁶ (the mean confidence for this cluster is 0.121). The Euclidean distances to the nearest neurons in the best and second-best clusters are 0.110 and 0.212, respectively (the mean distance between patterns and nearest neurons is 0.050). The large distances indicate that these contours are indeed located "in between" the sister clusters. The second case (Fig. 9(B)) shows a similar situation. Contour 543 (in cluster [0 0 1 0 1]) has a zero classification confidence (the Euclidean distance metric has been used as a tie-break), and is, thus, an outlier to all "sister" clusters, i.e., those coming from [0 0 1 0]. The mean classification confidence for this cluster is 0.077. The Euclidean distances between this contour and the nearest neurons in the two sister clusters are 0.174 and 0.228 (the mean distance between patterns belonging to this cluster and the nearest neuron is 0.068).
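For reference, this confidence measure amounts to the following small helper (our formulation).

```python
import numpy as np

def classification_confidence(outlier_probs):
    """Difference between the lowest and second-lowest outlier probability
    over the sister clusters; 1 means a fully confident labeling, 0 a tie."""
    p = np.sort(np.asarray(outlier_probs))
    return float(p[1] - p[0])
```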

6.3. Comparison with Mokhtarian’s Approach

Although there is no objective way of quantifying the clustering performance, since the database is unlabeled, we compare our results to those obtained with the Curvature Scale Space (CSS) retrieval system developed by Mokhtarian and co-workers [17]. Examples of their results are shown in Fig. 10(A) and (D): the upper left contour corresponds to the "target", and the results are ordered by their measure of similarity, downwards, then rightwards. The clusters that we find, and that contain the target contour, are shown in Fig. 10(B) and (E), respectively. In the first example (Fig. 10(A) and (B)), there is a good correspondence between the two sets: all but one contour that have been found by the CSS retrieval system are contained in the corresponding cluster of our analysis. However, in the second case, the two sets are disjoint (except for the target contour). This is due to the difference in similarity criteria. The CSS retrieval system takes into account only the positions of the topological landmarks (with zero curvature), and not so much the fine shape details. The set in Fig. 10(D), e.g., possibly groups contours with seven points with zero curvature (two for the tail, four fins and one for the snout), whereas the Fourier representation we use takes into account the global shape. However, we do not claim that the Fourier descriptor is better suited for representing shapes than the topological landmark-based descriptor used by the CSS retrieval system.

Figure 10. The contours that best match the target contours 262 and 102 (framed contours), obtained with three different methods (columns). (A, D) Best matching results obtained with the CSS retrieval system of Mokhtarian and co-workers, when querying the database with the upper left contour (indices 262 and 102, redrawn from Fig. 4(A) and (B) in [17]). The contours are ordered by decreasing similarity, downwards, then rightwards. (B, E) The corresponding clusters found with our hierarchical clustering procedure. (C, F) The clusters found with the Euclidean distance metric.

6.4. Comparison with Euclidean Distance-Based Approach

Finally, we can also perform our clustering analysis using a criterion based on the Euclidean distance, instead of outlier detection. This results in a different clustering hierarchy (only 53 nodes, with the deepest node at Level 7). The clusters for target contours 262 and 102 are shown in Fig. 10(C) and (F), respectively. It seems that the results are now inferior, since both clusters contain mixtures of shapes. However, again, since the database is unlabeled, we cannot objectively say which method yields the best results.

7. Conclusion

We have introduced a new way to optimally smooth, by a common scale factor, heteroscedastic Gaussian density models with equal mixtures. In addition, we have introduced a new way to build a representation structure for images of shapes, based on the Fourier representation of their contours, and a new method for labeling samples, based on outlier detection. All this has been incorporated into a hierarchical clustering procedure that relies on kernel-based topographic maps. The clustering procedure has been applied to the SQUID image database [17]. The advantage of our approach is that the number of shape clusters is obtained without prior knowledge, and that a hierarchy of shapes is generated. Finally, we have compared our clustering results to those obtained with the CSS retrieval system developed by Mokhtarian and co-workers [17, 26], as well as to those obtained when using the Euclidean distance metric.

Acknowledgments

The authors would like to thank Dr. Farzin Mokhtarian, University of Surrey (UK), for making the SQUID database available. M.M.V.H. is supported by research grants received from the Fund for Scientific Research (G.0185.96N), the National Lottery (Belgium) (9.0185.96), the Flemish Regional Ministry of Education (Belgium) (GOA 95/99-06; 2000/11), the Flemish Ministry for Science and Technology (VIS/98/012), and the European Commission (QLG3-CT-2000-30161 and IST-2001-32114). T.G. is supported by a scholarship from the Flemish Regional Ministry of Education (GOA 2000/11).

Note

1. In fact, when used in batch mode, the SOM algorithm has an intimate connection with the classic k-means clustering algorithm [28, 29] (see also [9], p. 127).

References

1. G.J. Deboeck and T. Kohonen, Visual Explorations in Finance with Self-Organizing Maps, Heidelberg: Springer, 1998.

2. M. Cottrell, P. Gaubert, P. Letremy, and P. Rousset, "Analyzing and Representing Multidimensional Quantitative and Qualitative Data: Demographic Study of the Rhone Valley. The Domestic Consumption of the Canadian Families," in Kohonen Maps, Proc. WSOM99, E. Oja and S. Kaski (Eds.), Helsinki, 1999, pp. 1–14.

3. K. Lagus and S. Kaski, "Keyword Selection Method for Characterizing Text Document Maps," in Proc. ICANN99, 9th Int. Conf. on Artificial Neural Networks, vol. 1, London: IEE, 1999, pp. 371–376.

4. J. Vesanto, "SOM-Based Data Visualization Methods," Intelligent Data Analysis, vol. 3, no. 2, 1999, pp. 111–126.

5. J. Vesanto and E. Alhoniemi, "Clustering of the Self-Organizing Map," IEEE Trans. Neural Networks, vol. 11, no. 3, 2000, pp. 586–600.

6. J. Himberg, J. Ahola, E. Alhoniemi, J. Vesanto, and O. Simula, "The Self-Organizing Map as a Tool in Knowledge Engineering," in Pattern Recognition in Soft Computing Paradigm, Nikhil R. Pal (Ed.), Singapore: World Scientific Publishing, 2001, pp. 38–65.

7. T. Kohonen, "Self-Organized Formation of Topologically Correct Feature Maps," Biol. Cybern., vol. 43, 1982, pp. 59–69.

8. T. Kohonen, Self-Organization and Associative Memory, Heidelberg: Springer, 1984.

9. T. Kohonen, Self-Organizing Maps, Heidelberg: Springer, 1995.

10. H. Ritter, "Asymptotic Level Density for a Class of Vector Quantization Processes," IEEE Trans. Neural Networks, vol. 2, no. 1, 1991, pp. 173–175.

11. D.R. Dersch and P. Tavan, "Asymptotic Level Density in Topological Feature Maps," IEEE Trans. Neural Networks, vol. 6, 1995, pp. 230–236.

12. M.M. Van Hulle, Faithful Representations and Topographic Maps: From Distortion- to Information-Based Self-Organization, S. Haykin (Ed.), New York: Wiley, 2000.

13. M.M. Van Hulle, "Kernel-Based Topographic Map Formation by Local Density Modeling," Neural Computation, vol. 14, 2002, pp. 1561–1573.

14. M.M. Van Hulle, "Monitoring the Formation of Kernel-Based Topographic Maps," in IEEE Neural Network for Signal Processing Workshop 2000, Sydney, 2000, pp. 241–250.

15. M.M. Van Hulle and T. Gautama, "Monitoring the Formation of Kernel-Based Topographic Maps with Application to Hierarchical Clustering of Music Signals," J. VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 32, 2002, pp. 119–134.

16. M.M. Van Hulle, "Kernel-Based Equiprobabilistic Topographic Map Formation," Neural Computation, vol. 10, 1998, pp. 1847–1871.

17. F. Mokhtarian, S. Abbasi, and J. Kittler, "Efficient and Robust Retrieval by Shape Content Through Curvature Scale Space," in Proc. International Workshop on Image DataBases and MultiMedia Search, Amsterdam, The Netherlands, 1996, pp. 35–42.

18. L. Devroye, "Universal Smoothing Factor Selection in Density Estimation: Theory and Practice (with discussion)," Test, vol. 6, 1997, pp. 223–320.

19. S.R. Sain, K.A. Baggerly, and D.W. Scott, "Cross-Validation of Multivariate Densities," Journal of the American Statistical Association, vol. 89, no. 427, 1994, pp. 807–817.

20. P. Hall and M.P. Wand, "On the Accuracy of Binned Kernel Density Estimators," Journal of Multivariate Analysis, vol. 56, 1996, pp. 165–184.

21. E.W. Weisstein, CRC Concise Encyclopedia of Mathematics, London: Chapman and Hall, 1999.

22. R.L. Graham, D.E. Knuth, and O. Patashnik, Answer to Problem 9.60 in Concrete Mathematics: A Foundation for Computer Science, Reading, MA: Addison-Wesley, 1994.

23. T. Gautama and M.M. Van Hulle, "Hierarchical Density-Based Clustering in High-Dimensional Spaces Using Topographic Maps," in IEEE Neural Network for Signal Processing Workshop 2000, Sydney, 2000, pp. 251–260.

24. M.P. Dubuisson Jolly, S. Lakshmanan, and A.K. Jain, "Vehicle Segmentation and Classification Using Deformable Templates," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 3, 1996, pp. 293–308.

25. K.-M. Lee and N. Street, "Automatic Image Segmentation and Classification Using On-Line Shape Learning," in Fifth IEEE Workshop on the Application of Computer Vision, Palm Springs, CA, USA, 2000.

26. F. Mokhtarian, "Silhouette-Based Isolated Object Recognition Through Curvature Scale Space," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, no. 5, 1995, pp. 539–544.

27. W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, and P. Yanker, "The QBIC Project: Querying Images By Content Using Color, Texture, and Shape," SPIE, vol. 1908, 1993, pp. 173–187.

28. J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," in Proc. Fifth Berkeley Symp. on Math. Stat. and Prob., vol. 1, 1967, pp. 281–296.

29. P.R. Krishnaiah and L.N. Kanal, Classification, Pattern Recognition, and Reduction of Dimensionality, Handbook of Statistics, vol. 2, Amsterdam: North Holland, 1982.

Marc M. Van Hulle received an M.Sc. in Electrotechnical Engineering (Electronics) and a Ph.D. in Applied Sciences from the K.U. Leuven, Leuven (Belgium) in 1985 and 1990, respectively. He also holds B.Sc. Econ. and MBA degrees. In 1992, he was with the Brain and Cognitive Sciences department of the Massachusetts Institute of Technology (MIT), Boston (USA), as a postdoctoral scientist. He is affiliated with the Neuro- and Psychophysiology Laboratory, Medical School, K.U. Leuven, as an associate Professor. He has authored the monograph Faithful Representations and Topographic Maps: From Distortion- to Information-Based Self-Organization, John Wiley, 2000, which has also been translated into Japanese, and more than 80 technical publications. (http://simone.neuro.kuleuven.ac.be)

Dr. Van Hulle is an Executive Member of the IEEE Signal Processing Society, Neural Networks for Signal Processing (NNSP) Technical Committee (1996–2003), the Publicity Chair of NNSP's 1999, 2000 and 2002 workshops, and the Program co-chair of NNSP's 2001 workshop, and reviewer and co-editor of several special issues for several neural network and signal processing journals. He is also founder and director of Synes N.V., the data mining spin-off of the K.U. Leuven (http://www.synes.com). His research interests include neural networks, biological modeling, vision, and data mining.
[email protected]

Temujin Gautama received a B.Sc. degree in electronic engineering from Groep T, Leuven, Belgium and a Master's degree in Artificial Intelligence from the Katholieke Universiteit Leuven, Belgium. He is currently with the Laboratorium voor Neuro- en Psychofysiologie at the Medical School of the K.U. Leuven, where he is working towards his Ph.D.

His research interests include nonlinear signal processing, biological modeling, self-organizing neural networks and their application to data analysis.
[email protected]