
Journal of VLSI Signal Processing 32, 119–134, 2002. © 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

Monitoring the Formation of Kernel-Based Topographic Maps with Application to Hierarchical Clustering of Music Signals

MARC M. VAN HULLE AND TEMUJIN GAUTAMA

K.U. Leuven, Laboratorium voor Neuro- en Psychofysiologie, Campus Gasthuisberg, Herestraat, B-3000 Leuven, Belgium

Received March 6, 2001; Revised July 26, 2001; Accepted November 19, 2001

Abstract. When using topographic maps for clustering purposes, which is now being considered in the data mining community, it is crucial that the maps are free of topological defects. Otherwise, a contiguous cluster could become split into separate clusters. We introduce a new algorithm for monitoring the degree of topology preservation of kernel-based maps during learning. The algorithm is applied to a real-world example concerned with the identification of 3 musical instruments and the notes played by them, in an unsupervised manner, by means of a hierarchical clustering analysis, starting from the music signal's spectrogram.

Keywords: kernel-based topographic maps, hierarchical clustering, monitoring, music

1. Introduction

Topographic maps have witnessed an impressive range of statistical applications such as clustering and pattern recognition, vector quantization, density estimation and regression (for overviews, see [1–3]). Basically, these maps can be regarded as discrete lattice-based approximations to non-linear data manifolds and can, in this way, be used for visualizing multidimensional data. In particular the clustering applications have attracted the attention of the data mining community [4–6]. The distribution of neuron weights can be regarded as an estimate of the data density and can, in turn, be used for detecting high density regions which correspond to clusters. The topographic map is then used as a cluster map. However, albeit often tacitly assumed, it is essential that the maps are truly topology-preserving, otherwise contiguous clusters in the input distribution could erroneously become split into separate clusters in the cluster map.

The most widely used topographic map formation algorithm is Kohonen's self-organizing map (SOM) [2, 7, 8]. The principle behind it is remarkably simple. Let V be the d-dimensional input space, with V ⊆ ℝ^d, and A a lattice of N neurons, labeled i = 1, 2, ..., N, with corresponding weight vectors w_i(t) = [w_ij(t)] ∈ V. The self-organization process consists of two stages, a competitive and a cooperative one. First, the neuron is selected whose weight vector is most "similar" to the current input vector v ("competition"), e.g., using the minimum Euclidean distance rule: i* = arg min_i ‖w_i − v‖ (neuron i* "wins" the competition). It then updates not only the weight vector of neuron i*, but also those of its nearest lattice neighbors ("cooperation") by means of a neighborhood function:

$$\Delta \mathbf{w}_i = \eta\,\Lambda(i, i^*, \sigma_\Lambda)(\mathbf{v} - \mathbf{w}_i) \qquad (1)$$

with Λ the neighborhood function, usually a monotonically decreasing function of the lattice distance to neuron i*, e.g., a Gaussian:

$$\Lambda(i, i^*, \sigma_\Lambda) = \exp\left(-\frac{\|\mathbf{r}_i - \mathbf{r}_{i^*}\|^2}{2\sigma_\Lambda^2}\right), \qquad (2)$$

with σ_Λ the neighborhood function range, and r_i neuron i's lattice coordinate (we assume discrete lattices with rectangular topologies).
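To make the two stages concrete, the following minimal sketch (not the authors' code) performs one incremental SOM update according to Eqs. (1) and (2); the array names W and R and the function name som_update are our own choices.

```python
import numpy as np

def som_update(W, R, v, eta, sigma):
    """One incremental SOM update, Eqs. (1)-(2).

    W     : (N, d) array of weight vectors w_i
    R     : (N, dA) array of lattice coordinates r_i
    v     : (d,) input sample
    eta   : learning rate
    sigma : neighborhood function range sigma_Lambda
    """
    # Competition: the winner i* minimizes the Euclidean distance to v.
    i_star = np.argmin(np.linalg.norm(W - v, axis=1))
    # Cooperation: Gaussian neighborhood function centered on the winner.
    lat_dist2 = np.sum((R - R[i_star]) ** 2, axis=1)
    Lambda = np.exp(-lat_dist2 / (2.0 * sigma ** 2))
    # Weight update, Eq. (1): all neurons move towards v, weighted by Lambda.
    W += eta * Lambda[:, None] * (v - W)
    return W
```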


1.1. Topological Defects

The neighborhood function plays a crucial role in the formation of topology-preserving maps. Although this point is widely accepted, the proof of the ordering is actually only valid for one-dimensional lattices developed in one-dimensional spaces [9, 10]: the proof for the higher-dimensional case is unlikely to be found [10]. In defense of this, Kohonen [2] notes that the ordering conditions are most stringent when the dimensionality of the lattice matches that of the input distribution. He also claims that the ordering will be much easier when the input space dimensionality is much higher. However, exactly for the high-dimensional case, there is a multitude of possibilities by which a given lattice can approximate the data manifold, especially when there is a mismatch in dimensionality. In addition, one should also be aware of the fact that the proof for the one-dimensional case relies on a rectangular neighborhood function of which the range is kept constant during learning. However, in practice, the neighborhood function range is decreased until it vanishes and hence, topological defects (violations in the topographic ordering) could result even in the one-dimensional case [11]. Examples of these are given in Figs. 1(D) and 2 (rightmost panel) for a two-dimensional input space and, respectively, a one- and a two-dimensional lattice. As a consequence, the success of the lattice disentangling phase is expected to critically depend on the rate with which the neighborhood function range is decreased over the finite simulation time. Unfortunately, the choice of this "cooling" rate cannot be motivated on analytical grounds.

Figure 1. (A) Scatter plot of a two-dimensional "curved" distribution. The dashed line indicates the distribution's principal curve. (B) Lattice (N = 25 neurons) obtained at convergence (solid line) for the distribution shown in (A), when the neighborhood function range has vanished. (C) Lattice obtained for a non-vanishing range. (D) Lattice obtained when the range is decreased too rapidly.

1.2. Solutions

One obvious way to deal with this problem is not to reduce the neighborhood function range to zero but only to a small, nonzero value. But the question is then: How small should this value be? When it is too large, the neuron weights will not properly span the input space [12, 13]. When it is too small, topological defects are likely to occur. Another way is to visually inspect the lattice for topological defects, but this can only be done for up to three-dimensional lattices and input spaces.

Figure 2. Evolution of a 24 × 24 lattice as a function of time. The outer squares outline the uniform input p.d.f. The values given below the squares represent time.

A more appropriate way is to measure the degree of topology preservation of the lattice. Topology preservation metrics were first introduced by Bauer and Pawelzik [14] ("topographic product"), and more recently by Villmann and co-workers [15] ("topographic function"), among others (for references, see [15]). The topographic product metric, and all other metrics that rely solely on the relative positions of the neuron weights, cannot distinguish a correct folding of the lattice, due to a folded, non-linear data manifold, from an incorrect folding, e.g., due to a topological defect. The topographic function alleviates this problem, but requires a Delaunay triangulation of the input space, which is computationally much more intensive than determining the topographic product (even though Villmann and co-workers approximate this triangulation by developing a lattice with the competitive Hebbian learning algorithm). Finally, it should be noted that these metrics are normally used for quantifying the degree of topology-preservation of the converged lattice, and for assessing the geometry and dimensionality of the input distribution by developing a series of lattices with different geometries and dimensionalities.

A different approach to topographic map formation is to consider neurons with mutually overlapping activation regions or kernels [16–19], instead of non-overlapping regions as in, e.g., the SOM algorithm (Voronoi tessellation). Except for Bishop's approach, which is topographic by construction, the overlapping activation regions generate local correlations which, in turn, have long been known to provide information about neighborhood relationships [20]. The question is now: How can this information be used for monitoring the degree of topology preservation during learning, for adjusting the rate at which the neighborhood function range is decreased? We will introduce a new metric, called the Overlap Variability (OV), which does not depend on the positions of the weight vectors. The OV-metric will be used in combination with our kernel-based Maximum Entropy learning Rule (kMER) [19].

Finally, independent of the type of metric used, one should be aware of the fact that the global ordering of a lattice can only be uniquely characterized for cases in which the input space and the lattice have the same dimensionality [21]: hence, strictly speaking, when the dimensionalities differ, any topology preservation metric should be regarded as a heuristic, since the results obtained will very much depend on the choice of the topology preservation metric.

1.3. Outlook

The article is structured as follows. We start with a brief introduction to the SOM and the effect of the rate with which the neighborhood function range is decreased (the "cooling" rate). We then show the performance of the topographic product metric when using it for monitoring the degree of topology-preservation achieved during training. In Section 3, we introduce, also briefly, kMER, and, in Sections 4 and 5, we explain the OV-metric and the monitoring algorithm in detail. We also show the performance of the metric on the same example as for the topographic product metric. Finally, a real-world example is considered for both kMER and the SOM algorithm, namely, the identification, in an unsupervised manner, of the notes, as well as the musical instruments that play them, from the temporal evolution of a music signal's spectral content (spectrogram).


2. Formation of Topographically-Ordered Lattices

When a lattice is disentangled, and thus "maximally" ordered, the lattice is said to be topographically ordered: neighboring neurons in the lattice will code for neighboring positions in the input space (but the inverse is not necessarily true; for an example, see below). The neighborhood function plays an important role in producing disentangled lattices; however, this does not imply that we are guaranteed to obtain a disentangled lattice. For example, consider the two-dimensional "curved" distribution shown in Fig. 1(A). The distribution (M = 500 samples) is generated by randomly sampling the circle segment (solid line), defined between 0.3 and π − 0.3 rad, and by adding Gaussian white noise (σ = 0.1) to both coordinates. The best description the lattice can generate is the distribution's "principal curve" (dashed line), which is defined in such a manner that for any two infinitesimally separated principal curve normals, the center of gravity of the distribution contained between these curve normals coincides with the curve itself.

We now show the performance of the SOM algorithm on this example. We take a one-dimensional lattice A (i.e., a chain) sized N = 25 neurons, and use the SOM algorithm in batch mode with a learning rate η = 0.001 and with a Gaussian neighborhood function Λ defined as in Eq. (2), of which the range σ_Λ is decreased as follows:

$$\sigma_\Lambda(t) = \sigma_{\Lambda 0} \exp\left(-2\sigma_{\Lambda 0}\,\frac{t}{t_{\max}}\right), \qquad (3)$$

with t the present epoch, t_max the maximum number of epochs, and σ_Λ0 the range spanned by the neighborhood function at t = 0. We take t_max = 200 and σ_Λ0 = 12.5 so that σ_Λ(t_max) ≈ 0, following Eq. (3). Furthermore, we initialize the weights by taking samples from the uniform distribution (0, 1] × (0, 1]. For the sake of exposition, and only in connection with the current example, we will refer to Eq. (3) as our "normal" rate of decreasing the neighborhood function range. Finally, since our lattice is one-dimensional, and our sample space two-dimensional, we can join the weights with straight lines in order to obtain an approximation of the principal curve. The result is shown in Fig. 1(B) (solid line).
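The exponential cooling schedule is simple to implement; in the sketch below (hypothetical function name neighborhood_range), the scale factor in the exponent is left as a parameter, since the slower and faster schedules used later in the text, Eqs. (4) and (5), differ from Eq. (3) only in that factor.

```python
import numpy as np

def neighborhood_range(t, t_max, sigma0, scale=2.0):
    """Exponential cooling of the neighborhood range, Eq. (3).

    scale = 2.0 gives the "normal" rate; 0.2 and 4.0 give the slower and
    faster schedules of Eqs. (4) and (5), respectively.
    """
    return sigma0 * np.exp(-scale * sigma0 * t / t_max)

# Example: the "normal" schedule used in the text.
sigma0, t_max = 12.5, 200
ranges = [neighborhood_range(t, t_max, sigma0) for t in range(t_max + 1)]
```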

We observe that the "curve" developed by the SOM algorithm deviates from the desired one. This is due to the fact that other Voronoi partitionings, namely, other than those bounded by (bisector) lines perpendicular to the actual principal curve, can contribute to the equilibrium location of the lattice weights. This is a general property of discrete approximations (see [22]). Hence, this result is usually seen in cases where the lattice dimension is lower than that of the space in which it is developed. For example, when the neurons of a chain are trained with samples taken from a uniform two-dimensional square distribution, then the chain will attempt to fill the square as much as possible. The resulting map then resembles a discrete approximation of the so-called Peano curve.1

This space-filling aspect of the SOM algorithm can be countered by the neighborhood function since it has an effect similar to a local "smearing out" of the sample distribution. The amount of "smearing" depends on the neighborhood function range σ_Λ but, when it is too large, then the lattice weights will be pulled towards the center of gravity of the sample distribution. In order to show the smearing effect, assume that we decrease the neighborhood function range at a ten times slower rate, until σ_Λ(t_max) ≅ 1:

$$\sigma_\Lambda(t) = \sigma_{\Lambda 0} \exp\left(-0.2\sigma_{\Lambda 0}\,\frac{t}{t_{\max}}\right). \qquad (4)$$

The result is shown in Fig. 1(C). Note that the chain more closely corresponds to the principal curve, so that a better generalization performance is expected, but that, at the same time, the end-points are already being pulled towards the center of gravity of the sample distribution.

If we decrease the neighborhood function range at a rate which is two times faster than the initial one:

$$\sigma_\Lambda(t) = \sigma_{\Lambda 0} \exp\left(-4\sigma_{\Lambda 0}\,\frac{t}{t_{\max}}\right), \qquad (5)$$

then the chain appears to be heavily tangled (Fig. 1(D)). There are several topological defects, which are called kinks. Hence, we conclude that the neighborhood function range has been too rapidly decreased. We can now also verify that neighboring neurons in a lattice may code for neighboring positions in the input space, but that the inverse is not necessarily true.

Consider, as a second simulation example, a rectangular lattice sized N = 24 × 24 neurons with the input samples taken randomly from a two-dimensional uniform probability density function (p.d.f.) p(v) within the unit square (0, 1] × (0, 1]. The initial weight vectors are randomly drawn from this distribution. We now perform incremental learning and decrease the range as follows:

$$\sigma_\Lambda(t) = \sigma_{\Lambda 0} \exp\left(-2\sigma_{\Lambda 0}\,\frac{t}{t_{\max}}\right), \qquad (6)$$

but now with t the present time step (single weight update step) rather than epoch number, and t_max = 275,000. For the learning rate, we take η = 0.015. The evolution is shown in Fig. 2. Seemingly, also for this case, the neighborhood function range was too rapidly decreased since the lattice is twisted and, even if we would continue the simulation with zero neighborhood function range, the twist will not be removed.

2.1. Monitoring with the Topographic Product Metric

Although it has so far only been used for quantifying the degree of topology-preservation of the converged lattice, the topographic product TP [14] is a valid candidate metric for "on-line" monitoring of the developing map. The topographic product is briefly explained in the Appendix. But how reliable is this metric and how does it behave?

Consider again our one-dimensional example with the three neighborhood function range reduction schemes, Eqs. (3)–(5). If we monitor TP during the learning phase, then we obtain the three plots shown in Fig. 3. If we decrease the range too quickly, then TP will be positive everywhere (dashed line). If we decrease the range at our normal rate (thin solid line), then TP will hover around zero most of the time, except that it slightly increases at the end of the simulation. Indeed, as evidenced by the graphs shown in Fig. 4, the lattice becomes less smooth when σ_Λ(t_max) approaches zero. In contrast, when we decrease the neighborhood function range slowly until σ_Λ(t_max) ≅ 1 (thick solid line), TP will fluctuate around zero even for t large. Hence, the rate at which σ_Λ decreases and the total training time need to be determined, that is, in our case, the scaling factor of σ_Λ0 and t_max.

Figure 3. Monitoring of topographic product TP during learning. The thick solid line corresponds to a slow rate at which the neighborhood function range is decreased, Eq. (4), the thin solid line to a normal rate, Eq. (3), and the dashed line to a fast rate, Eq. (5). Ideally, TP should fluctuate around zero.

This is what we propose as a strategy for determining an acceptable neighborhood "cooling" scheme: we start with a fast rate and monitor the TP value when the neighborhood function range has vanished. We then restart the simulation, but with a decreased rate, for example, by halving it, and monitor the TP value again. When the TP values of simulation runs with subsequently decreased rates are sufficiently close to one another, then we can assume that the cooling scheme is sufficiently slow for disentangling the lattice. We then still have to decide on the non-vanishing neighborhood function range that yields the "optimally" smoothened lattice, for example, by looking for the epoch number for which |TP| is minimal or for which TP is closest to zero. Such a strategy will serve as the basis for a fully-automatic procedure discussed in Section 5, albeit with a different neighborhood preservation metric.

3. Kernel-Based Maximum Entropy Learning

Consider a lattice A, with a regular and fixed topology, of arbitrary dimensionality d_A, in a d-dimensional input space V ⊆ ℝ^d. To each of the N nodes of the lattice corresponds a formal neuron i which possesses, in addition to the traditional weight vector w_i, a circular (or, in general, hyperspherical) activation region S_i, called receptive field (RF) region, with radius σ_i, in V-space (Fig. 5(A)). The neural activation state is represented by the code membership function:

$$\mathbb{1}_i(\mathbf{v}) = \begin{cases} 1 & \text{if } \mathbf{v} \in S_i \\ 0 & \text{if } \mathbf{v} \notin S_i, \end{cases} \qquad (7)$$

with v ∈ V. Depending on 𝟙_i, the weights w_i are adapted so as to produce a topology-preserving mapping; the radii σ_i are adapted so as to produce a lattice of which the neurons have an equal probability to be active (equiprobabilistic map), i.e., P(𝟙_i(v) = 1) = ρ/N, ∀i, with ρ a scale factor.

As the definition of S_i suggests, several neurons may be active for a given input v. Hence, we need an alternative definition of competitive learning [19]. Define Ξ_i as the fuzzy code membership function of neuron i:

$$\Xi_i(\mathbf{v}) = \frac{\mathbb{1}_i(\mathbf{v})}{\sum_{k \in A} \mathbb{1}_k(\mathbf{v})}, \quad \forall i \in A, \qquad (8)$$

so that 0 ≤ Ξ_i(v) ≤ 1 and Σ_i Ξ_i(v) = 1. Consider a training set M = {v^μ} of M input samples. In batch mode, the kernel-based Maximum Entropy learning Rule (kMER) updates the neuron weights w_i as follows:

$$\Delta \mathbf{w}_i = \eta \sum_{\mathbf{v}^\mu \in M} \sum_{j \in A} \Lambda(i, j, \sigma_\Lambda)\,\Xi_j(\mathbf{v}^\mu)\,\mathrm{Sgn}(\mathbf{v}^\mu - \mathbf{w}_i), \qquad (9)$$

and their radii σ_i:

$$\Delta \sigma_i = \eta \sum_{\mathbf{v}^\mu \in M} \left( \frac{\rho_r}{N}\,\bigl(1 - \mathbb{1}_i(\mathbf{v}^\mu)\bigr) - \mathbb{1}_i(\mathbf{v}^\mu) \right), \quad \forall i, \qquad (10)$$

with ρ_r ≜ ρN/(N − ρ), and Λ(·) the usual neighborhood function, e.g., a Gaussian, of which the range σ_Λ is gradually decreased during learning, and with Sgn(x) a function returning the vector containing the sign (−1 or 1) per dimension. The effect of the learning rule is illustrated in Fig. 5(B), in incremental mode, for clarity's sake.

Figure 4. Evolution of the N = 25 neurons chain when the neighborhood function range is decreased at the "normal" rate, Eq. (3). Shown are the lattices at t = 0, 5, 8, 10 epochs (A–D). The final result for t = 200 epochs is shown in Fig. 1(B), and it does not differ substantially from the t = 10 epochs case.

Figure 5. Kernel-based maximum entropy learning. (A) Neuron i has a localized receptive field K(v − w_i, σ_i), centered at w_i in input space V ⊆ ℝ^d. The intersection of K with the present threshold τ_i defines a region S_i, with radius σ_i, also in V-space. The present input v ∈ V is indicated by the black dot and falls outside S_i. (B) Receptive field update in incremental learning mode. The receptive field centers w_i and w_j are indicated with small open dots; the arrow indicates the update of w_i given the present input v (not to scale). The receptive field regions S_i and S_j are indicated with large circles drawn in solid and dashed lines, i.e., before and after these regions are updated given the present input, respectively (also not to scale). The shaded area indicates the intersection between regions S_i and S_j (overlap).
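The following sketch transcribes one batch kMER step according to Eqs. (9) and (10). It is not the authors' implementation: the function and variable names (kmer_batch_step, W, S, R, X) are our own, and practical details such as keeping the radii strictly positive are omitted.

```python
import numpy as np

def kmer_batch_step(W, S, R, X, eta, sigma_L, rho):
    """One batch kMER step, Eqs. (9)-(10).

    W: (N, d) RF centers, S: (N,) RF radii, R: (N, dA) lattice coordinates,
    X: (M, d) training samples, eta: learning rate,
    sigma_L: neighborhood range sigma_Lambda, rho: scale factor.
    """
    N = W.shape[0]
    rho_r = rho * N / (N - rho)
    # Gaussian neighborhood function Lambda(i, j, sigma_Lambda) on the lattice.
    lat2 = np.sum((R[:, None, :] - R[None, :, :]) ** 2, axis=2)
    Lam = np.exp(-lat2 / (2.0 * sigma_L ** 2))
    dW = np.zeros_like(W)
    dS = np.zeros(N)
    for v in X:
        act = (np.linalg.norm(W - v, axis=1) <= S).astype(float)  # 1_i(v)
        Xi = act / act.sum() if act.sum() > 0 else act             # fuzzy membership, Eq. (8)
        drive = Lam @ Xi                     # sum_j Lambda(i, j) Xi_j(v), Eq. (9)
        dW += drive[:, None] * np.sign(v - W)
        dS += rho_r / N * (1.0 - act) - act  # radius rule, Eq. (10)
    W += eta * dW
    S += eta * dS
    return W, S
```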

Finally, we will use the following "cooling" scheme:

$$\sigma_\Lambda(t) = \sigma_{\Lambda 0} \exp\left(-2\sigma_{\Lambda 0}\,\gamma_{OV}\,\frac{t}{t_{\max}}\right), \qquad (11)$$

with σ_Λ0 the initial range, and γ_OV a parameter that controls the slope of the cooling scheme ("gain").

4. Overlap Variability Metric

The overlap between active regions can be used for assessing the degree of topology-preservation of the map, albeit in a heuristic manner. Assume that the scale factor ρ is chosen in such a manner that the neurons of the disentangled map have overlapping activation regions (i.e., ρ > 1). In that case, provided that the neurons are equiprobabilistic, a given map is more likely to be disentangled if the number of neurons that are activated by a given input is constant over the training set. Indeed, the number of neurons that will be activated in the vicinity of a topological defect will be higher than in an already disentangled part of the lattice. Furthermore, if that number is constant, it also implies that the map is locally smooth. Hence, we wish to adjust the neighborhood "cooling" scheme in such a manner that the variability in the number of active neurons for a given input pattern, N^μ = Σ_i 𝟙_i(v^μ), is minimized over the training set. This "variability score," divided by the mean number of active neurons, is then our metric, which we will call the Overlap Variability (OV):

$$OV = \frac{\mathrm{sd}_N(t)}{\mathrm{mean}_N(t)}, \qquad (12)$$

with mean_N and sd_N the mean and standard deviation of the number of neurons that are active at epoch t:

$$\mathrm{mean}_N(t) = \langle N^\mu(t) \rangle_M = \frac{1}{M}\sum_\mu N^\mu(t), \qquad \mathrm{sd}_N(t) = \left\langle \bigl(N^\mu(t) - \mathrm{mean}_N(t)\bigr)^2 \right\rangle_M^{\frac{1}{2}}. \qquad (13)$$
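Computing OV is cheap, since only the number of active neurons per training sample is needed. A minimal sketch, assuming the kMER receptive-field definition above (the function name overlap_variability is our own):

```python
import numpy as np

def overlap_variability(W, S, X):
    """Overlap Variability, Eqs. (12)-(13): sd/mean of the number of
    neurons active per training sample.

    W: (N, d) RF centers, S: (N,) RF radii, X: (M, d) training samples.
    """
    # N_mu = number of neurons whose RF region contains sample v_mu.
    dists = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)  # (M, N)
    n_active = (dists <= S[None, :]).sum(axis=1).astype(float)
    return n_active.std() / n_active.mean()
```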

5. Monitoring Algorithm

The complete monitoring algorithm we propose consists of the following steps:

1. Since we start with a random initialization of our weights, we first train the map with a constant neighborhood function range, namely σ_Λ0, during a fixed number of epochs, in order to obtain a more or less disentangled lattice: this lattice then serves as a common starting point for all simulation runs.

2. We then continue and perform one complete simulation run, namely, until the neighborhood function range vanishes. This is run number 1. We take γ_OV^1 = 1 in Eq. (11) (the superscript 1 refers to our first run).

3. Next, we determine the number of epochs, and corresponding neighborhood function range, for which OV is minimal. Let us label these as follows: t^1 epochs, σ_Λ^1, and OV^1.

4. We now perform a new run, run j, 2 ≤ j ≤ maxruns, starting from the lattice obtained at step 1, with the same σ_Λ0, but using t_max^j ← 2t^{j−1} epochs and by ensuring that we end our simulation with a neighborhood function range identical to the previously optimal one, σ_Λ^{j−1}, plus a margin: σ_Λ(t_max^j) = 0.9 σ_Λ^{j−1}. In this way, we indeed cool at a slower rate, but only run the simulation as long as necessary. In order to achieve the proper neighborhood cooling scheme, we adjust the gain as follows:

$$\gamma_{OV}^{j} = -\frac{\ln\left(\dfrac{0.9\,\sigma_\Lambda^{j-1}}{\sigma_{\Lambda 0}}\right)}{2\sigma_{\Lambda 0}}. \qquad (14)$$

This yields a cooling scheme that starts with the initial neighborhood range and ends with 0.9 σ_Λ^{j−1}, i.e., a fraction of the optimal neighborhood range of the previous run.

5. We determine t^j and σ_Λ^j. (Note that, for the jth run, it could happen that the minimum coincides with the last epoch: in that case, we take the minimum and continue.)

6. As long as OV^j < OV^{j−1}, go to step 4, else stop.

We have used the following stopping criterion: when two successive runs lead to higher OV-values, we stop and retain the map which corresponds to the best OV-value. If a run yields a higher OV-value than the previous run, we start the next run with t_max 1.5 times longer.
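As an illustration, a schematic sketch of this monitoring loop is given below. It assumes a hypothetical routine train_kmer(lattice0, sigma0, gamma, t_max) that restarts from the common lattice of step 1, trains with the cooling scheme of Eq. (11), and returns the per-epoch OV values and neighborhood ranges; the bookkeeping is simplified (e.g., improvement is judged against the best run so far rather than strictly against the previous run).

```python
import numpy as np

def gain_for_run(sigma_prev, sigma0):
    """Gain update of Eq. (14): end the next run at 0.9 * previous optimum."""
    return -np.log(0.9 * sigma_prev / sigma0) / (2.0 * sigma0)

def monitor(train_kmer, lattice0, sigma0, t1_max, max_runs=10):
    """Run-doubling monitoring loop (steps 2-6 plus the stopping criterion)."""
    best = None                                   # (OV, run, epoch, sigma) of the best run
    fails = 0
    gamma, t_max = 1.0, t1_max                    # run 1: gamma_OV = 1
    for run in range(1, max_runs + 1):
        ov, sigma = train_kmer(lattice0, sigma0, gamma, t_max)
        t_opt = int(np.argmin(ov))                # epoch with minimal OV
        if best is None or ov[t_opt] < best[0]:
            best, fails = (ov[t_opt], run, t_opt, sigma[t_opt]), 0
            t_max = 2 * t_opt                     # step 4: twice the previous optimum
            gamma = gain_for_run(sigma[t_opt], sigma0)
        else:
            fails += 1
            if fails == 2:                        # two bad runs in a row: stop
                break
            t_max = int(1.5 * t_max)              # retry with a 1.5x longer run
    return best
```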

5.1. Example

Consider again the two-dimensional "curved" input distribution shown in Fig. 1(A) and, again, a one-dimensional lattice sized N = 25 neurons which we initialize in the same manner as before; the RF radii σ_i are initialized by taking samples from the uniform distribution (0; 0.3]. We apply kMER in batch mode and take ρ = 2 and the learning rate η = 0.001. The monitoring results are summarized in Fig. 6.

We first perform step 1 of our monitoring algorithm and run kMER for 25 epochs with σ_Λ0 = 12.5. The OV and the neighborhood cooling plots are shown in Fig. 6(A) (thick and thin solid lines). We observe that, after a transitional phase, OV stabilizes. We then perform step 2 of the monitoring algorithm over t_max^1 = 100 epochs. We observe a clear OV minimum, which is located at t^1 = 12 epochs (step 3) (OV = 0.394). We then perform step 4 of the algorithm and run kMER over t_max^2 = 2 × 12 = 24 epochs and with γ_OV^2 = 0.124 (thick line in Fig. 6(B)). The minimum OV is now at t^2 = 20 epochs (step 5) (OV = 0.376). Since OV^2 < OV^1 (step 6), we continue (step 4). Finally, since runs 7 and 8 do not lead to better OV-values than that obtained at epoch t^6 = 158 in run 6 (thick line in Fig. 6(C)), the monitoring algorithm stops (OV = 0.374).

Figure 6. Monitoring the overlap variability (OV) during kMER learning on the sample distribution shown in Fig. 1(A). (A) Zero-th and first simulation runs. The zero-th run extends from −25 to −1 epochs, the first run from 0 to 100 epochs. Shown are OV (thick solid line), the neighborhood function range σ_Λ (Λ-range; thin solid line), and the topographic product (TP) (thick dashed line). The thin dashed (horizontal) line indicates the optimal TP-value (zero). For the sake of exposition, the neighborhood function range is divided by 12.5 and the topographic product is multiplied by 10. (B) Second simulation run; t_max^2 = 24 epochs. (C) Sixth and "best" run; t_max^6 = 165 epochs.

For the sake of comparison, we have also plotted the topographic product (TP) (thick dashed line in Fig. 6(A)). Note that the desired TP-value here is zero (thin dashed line). However, unlike the overlap variability, there is no clear optimum in the topographic product plot.

Figure 7. (A, B) Lattices obtained with kMER, without and with monitoring, using the sample distribution shown in Fig. 1(A) (M = 500 training samples). The dashed line indicates the theoretical principal curve. Note that the neurons' RF regions are also indicated (circles). (C) Lattice obtained with monitoring, using M = 4000 training samples.

Finally, we show the lattices obtained without and with monitoring in Fig. 7(A) and (B), respectively: the former is the result of continuing the first run until 100 epochs have elapsed, while the latter is the result of the sixth run. We clearly see the effect of monitoring (Fig. 7(B)): the lattice is disentangled and closely matches the theoretical principal curve of the distribution (dashed line). The result for M = 4000 samples (η = 1.0 × 10⁻⁴) is also shown (Fig. 7(C)). We observe that the lattice is smoother and even closer to the theoretical curve than in the M = 500 case, as expected, since the RF radii σ_i are better defined and, thus, also the RF centers w_i (neuron weights).

6. Real-World Example

We now apply our technique to a real-world example concerned with the identification, in an unsupervised manner, of the notes, as well as the musical instruments that play them, from a music signal's spectral content (spectrogram). We develop a hierarchy of topographic maps which we in turn use for estimating the density of the map's inputs and, subsequently, for density-based clustering.

The music signal is generated on a Crystal 4232 audio controller (Yamaha OPL3 FM synthesizer) at a sampling rate of 11,025 Hz, using the simulated sounds of an oboe, a piano and a clarinet. The data set consists of eight notes (one minor scale, namely "F¹", "G", "A♭", "B♭", "C", "D♭", "E♭" and "F²"), played by the three instruments consecutively, resulting in a signal track of approximately 14 seconds. The naming convention is the following: subscripts indicate the instruments and superscripts are used to distinguish between octaves (only used in the case of the "F" notes), e.g., "F¹₃" corresponds to the "F" note in the first (lower) octave, played on the third instrument (clarinet). Since the time signal is computer generated and therefore shows hardly any variation, uniform white noise is added to the time signal in order to obtain a continuous density distribution, rather than a discrete one (SNR = 12.13 dB). As shown in Fig. 10(A), one can clearly see the repetition of eight notes of increasing pitch in the spectrogram (evolution of the frequency content as a function of time) by observing the fundamental frequencies only.

A Short-Time Fourier Transform (STFT) is computed every 256 samples, using time windows of 1024 samples, resulting in a data set of M = 621 amplitude spectra (phase information has been discarded), which are furthermore normalized. We have not applied any windowing prior to the STFT. We feel that this is unnecessary, since we do not make the inverse transformation to the time domain, and since this will not greatly influence the structure of the data in terms of separability.

In order to reduce the dimensionality, and to examine its effect, a Principal Component Analysis (PCA) has been performed on the amplitude spectra at every clustering level. The spectra are projected onto the subspace spanned by the first k_PC Principal Components (PCs). The projected result is indicated by the vector v = [v_1, ..., v_{k_PC}], where v_i is the length of the projection onto the ith PC. Unless otherwise stated, k_PC = 16.
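A preprocessing sketch along these lines is given below. The unit-norm normalization of the spectra and the mean-centering before the PCA projection are our assumptions (the text does not specify them), and the function name spectrogram_features is hypothetical.

```python
import numpy as np

def spectrogram_features(x, win=1024, hop=256, k_pc=16):
    """Amplitude spectra from a hop-256, window-1024 STFT (no window
    function, as in the text), normalized and projected onto the first
    k_pc principal components."""
    frames = np.stack([x[i:i + win] for i in range(0, len(x) - win + 1, hop)])
    spectra = np.abs(np.fft.rfft(frames, axis=1))               # phase discarded
    spectra /= np.linalg.norm(spectra, axis=1, keepdims=True)   # normalization (assumed unit norm)
    centered = spectra - spectra.mean(axis=0)                   # PCA by SVD of centered data
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:k_pc].T                               # (M, k_pc) projections
```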

A divisive, hierarchical clustering analysis is performed on the spectrogram. At each level of the hierarchy, the analysis consists of 4 consecutive steps, which will be outlined and illustrated with simulations for the first two levels. The second level is illustrated using a typical subset obtained at the first clustering level (Level 1, see the section Results). The original set of amplitude spectra is clustered, labeled and correspondingly subdivided into subsets. Each resulting subset is then subjected to the next level of analysis in order to detect "clusters within clusters", until a certain stopping criterion is met. The result of this process is a clustering hierarchy that can be visualized in a clustering tree or dendrogram. The algorithms that are used for each processing step are now explained.

Figure 8. Overlap variability plots obtained for the first (A) and last (B) runs when applying kMER on the music sample set.

6.1. Step 1: Topographic Map Formation

The first stage in estimating the p.d.f. is the development of a topographic map. We have used kMER and the SOM algorithm in incremental mode. Throughout the simulations, we have used 20 × 20 lattices at Level 1 and 7 × 7 ones at the other levels. The initial weights of kMER and the SOM algorithm are drawn from a uniform distribution (0; 0.2], and the initial radii of kMER are drawn from a uniform distribution (0.3; 1.2].

The neighborhood cooling schemes for both kMER and the SOM algorithm have been implemented as in Eq. (11). Both γ_OV and t_max are optimized by monitoring the quality of the map. We further take σ_Λ0 = √N/2, with N the number of lattice neurons.

The monitoring algorithm for kMER tries to ensure a locally smooth and disentangled map by minimizing the overlap variability (OV). The resulting cooling scheme (after an initial training run of 50 epochs, with σ_Λ0 and η = 0.01) used t_max = 250 epochs with γ_OV = 0.03 at Level 1, and t_max = 670 epochs with γ_OV = 0.196 at Level 2 (η = 0.001). Finally, an additional run over 50 epochs is performed, keeping the neighborhood function range constant at the "optimal" value determined in the monitoring algorithm. The OV cooling plots for the first and last (i.e., eighth) runs are shown in Fig. 8. In all kMER training sessions, we set ρ_r = 2, since ρ_r = ρN/(N − ρ) ≈ ρ.


The SOM cooling scheme has been optimized by observing the mean squared error (MSE) (mean squared Euclidean distance, in fact) between the samples of the training set and the corresponding winning neurons i*. We prefer to monitor the MSE here, since the TP metric fails to indicate a clear optimum (i.e., closest point to zero). We know that, for the SOM algorithm, when the neighborhood function range has vanished, the MSE will be minimized [23]. Initially, the map is trained for 50 epochs using σ_Λ0 (η = 0.01), after which training runs of t_max = 1000 epochs, with various values of γ_OV, are performed (i.e., γ_OV = 0.25, 0.5, 1, 2; η = 0.001), starting from the initial map. The run that yields the minimal MSE is selected (γ_OV = 1 at both Level 1 and Level 2) and t_max is doubled until no further MSE improvement is observed (t_max = 8000 at Level 1, t_max = 4000 at Level 2). In our experience, it is not necessary to optimize for both γ_OV and t_max simultaneously: determining the optimal γ_OV in the first run suffices.
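The monitored quantity here is simply the quantization error; a minimal sketch (hypothetical function name):

```python
import numpy as np

def quantization_mse(W, X):
    """Mean squared Euclidean distance between each training sample and its
    winning (nearest) neuron, used to compare SOM cooling schemes."""
    d2 = np.sum((X[:, None, :] - W[None, :, :]) ** 2, axis=2)   # (M, N)
    return d2.min(axis=1).mean()
```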

6.2. Step 2: Density Estimation

Figure 9. (A) Plots of MSE(p̂_ρs, p̂*_ρs) as a function of ρ_s (see text for explanation). (B) Plots of the number of clusters as a function of k (see text for explanation). In both panels, solid curves denote the results for kMER and dashed curves those for the SOM, at Level 1 (thick) and Level 2 (thin).

A kernel-based density estimate can now be computed by positioning equivolume Gaussian kernels at the neuron weights (kernels that are variable in width for kMER, proportional to the neuron radii, and that are fixed for the SOM) [3]. However, such an approach suffers from numerical problems in high-dimensional spaces, since the kernel volumes scale inversely proportional to σ^d (d is the dimensionality of the input space). In order to overcome this problem, we develop a two-dimensional density estimate as follows:

$$\hat{p}_{\rho_s}(\mathbf{v}) = \sum_{j=1}^{N} \frac{\exp\left(-\frac{\|\mathbf{v} - \mathbf{w}_j\|^2}{2S^2}\right)}{2\pi N S^2}, \qquad (15)$$

where S = ρ_s σ_j for kMER, and S = ρ_s for the SOM case, with ρ_s a scaling factor that determines the degree of smoothing. This parameter can be optimized for the kMER case (variable kernels) as explained in [3] by minimizing MSE(p̂_ρs, p̂*_ρs). In the SOM case, ρ_s,opt can be determined using the least-squares cross-validation method [24] by minimizing the score function M_0(ρ_s):

$$M_0(\rho_s) = \int_V \hat{p}_{\rho_s}^2(\mathbf{v})\, d\mathbf{v} - \frac{2}{N} \sum_i \hat{p}_{-i}(\mathbf{w}_i) \qquad (16)$$

$$\approx \frac{1}{N^2 \rho_s} \sum_{i,j} \frac{\exp\left(-\frac{\|\mathbf{w}_i - \mathbf{w}_j\|^2}{4\rho_s^2}\right)}{\left(2\pi\rho_s^2\right)^2} - \frac{2}{N} \sum_i \hat{p}_{-i}(\mathbf{w}_i), \qquad (17)$$

where p̂_{−i} is the density estimate constructed from all kernels except the one at w_i, and Eq. (17) is the approximation in the case of Gaussian kernels.
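A sketch of the density estimate of Eq. (15), evaluated at a single point, is given below; the radii are passed for kMER (variable kernel widths) and omitted for the SOM (fixed widths). The function name is our own.

```python
import numpy as np

def density_estimate(v, W, radii=None, rho_s=1.0):
    """Kernel density estimate of Eq. (15) at point v: one Gaussian kernel
    per neuron weight, with S = rho_s * sigma_j (kMER) or S = rho_s (SOM)."""
    N = W.shape[0]
    S = rho_s * radii if radii is not None else np.full(N, rho_s)
    d2 = np.sum((W - v) ** 2, axis=1)
    return np.sum(np.exp(-d2 / (2.0 * S ** 2)) / (2.0 * np.pi * N * S ** 2))
```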

Figure 9(A) shows the MSE(p̂_ρs, p̂*_ρs) and M_0(ρ_s) as a function of ρ_s for the kMER (solid lines) and the SOM density estimates (dashed lines); the thick lines show the results for Level 1, while the thin ones show those for Level 2. The plots are normalized for visualization purposes by translating them along the Y-axis (on a linear scale), such that the minima are located at 1. The optimal value of ρ_s for each plot is found at the minimal MSE or M_0 value (optimal smoothness).

6.3. Step 3: Clustering and Labeling

Clustering is performed on the topographic map using the discrete hill-climbing algorithm [3], discrete meaning here that hill climbing is performed on the discrete lattice, at the positions of the neuron weights. It determines the local density peaks in the topographic map that are not surmounted by higher peaks in a range of k nearest neurons. We plot the number of clusters found as a function of k, and look for the longest plateau in the plot to decide on the number of clusters in the data set. The neurons in the lattice are labeled according to the clustering results for k at the beginning of this plateau.
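The peak-detection part of this procedure could be sketched as follows. Whether the k nearest neurons are taken in weight space or on the lattice is not spelled out here, so the weight-space choice below is an assumption, as is the function name.

```python
import numpy as np

def count_density_peaks(W, density, k):
    """Count neurons that are local density peaks: a neuron is a peak if
    none of its k nearest neurons (nearest in weight space, assumed) has a
    higher density value."""
    N = W.shape[0]
    n_peaks = 0
    for i in range(N):
        d = np.linalg.norm(W - W[i], axis=1)
        nearest = np.argsort(d)[1:k + 1]          # k nearest neurons, excluding i
        if np.all(density[i] >= density[nearest]):
            n_peaks += 1
    return n_peaks

# Number of clusters vs. k; the longest plateau indicates the cluster count:
# counts = [count_density_peaks(W, density, k) for k in range(1, W.shape[0] // 3)]
```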

Figure 9(B) shows the number of clusters that are found at Level 1 (thick lines) and Level 2 (thin lines) using the discrete hill-climbing algorithm for kMER (solid, 8 clusters at Level 1 and 3 at Level 2) and the SOM results (dashed, 7 clusters at Level 1 and 2 at Level 2). The X-axis has been normalized for visualization purposes only.

6.4. Step 4: Training Set Labeling

Finally, every input pattern is assigned the label of its nearest neuron (in Euclidean distance terms). Next, subsets of input patterns (512-dimensional) are created which carry the same label. A PCA is performed on every subset and the dimensionality is again reduced by projecting the original spectra onto the subspace spanned by the first k_PC PCs. This is done at every level in the clustering hierarchy, due to which the data will have a better spectral resolution with increasing levels using the same dimensionality.

Figure 10. (A) Thresholded spectrogram and Level 1 kMER labeling results (top grey level bar). (B) Cluster map of the Level 1 kMER simulation.

6.5. Step 5: Hierarchical Clustering

For each of the subsets created in step 4, a new clustering analysis (steps 1–4) is performed. We have opted for a simple stopping criterion: hierarchical clustering will continue as long as the Continued Clustering Analysis (CCA) passes:

1. if there is a valid minimum in the optimal smoothness curve, that is, one that is different from the degenerate cases ρ_s → 0 and ρ_s → ∞, and

2. if, in the hill-climbing step (step 3), the trivial case n = 1 is not reached for k ≤ N/3.

If the CCA-test is not passed, the cluster is regarded as a leaf node in the clustering tree. The hierarchical clustering analysis is continued until all branches in the tree end in leaf nodes.

6.6. Results

When observing the spectrogram of Fig. 10(A), it is clear that the data set contains different levels of similarity (notes and instruments). Intuitively, one would expect 8 clusters at the first level (notes) and 3 for every subset at the second level (instruments), or vice versa. There is also a higher level of similarity between the spectra corresponding to the "F¹" and the "F²" classes, since these notes differ by exactly one octave, leading to similar harmonic structures. The Level 1 kMER clustering results are visualized in Fig. 10(A) (top grey level bar). The algorithm detects 8 clusters, distinguishing between the seven notes ("F¹" and "F²" are joined into one cluster) and a cluster containing the "transient" spectra at the note transitions (labeled "Tr"). The cluster map (Fig. 10(B)), that is, the two-dimensional lattice for which every neuron is color coded according to its cluster membership, shows a clearly unfolded map. The "Tr" cluster is located at the center, since these spectra are similar to all other clusters (a transition between two notes in the music signal results in a change from one note cluster to another via the "Tr" cluster). The "F" cluster at the bottom right is roughly double in size compared to the others since it represents twice the amount of data. For every note that has been detected at Level 1, the separate instruments are detected at Level 2 (3 clusters; the cluster maps are not shown). Figure 11(A) visualizes the complete clustering tree, showing that the intuitive levels of similarity noted earlier correspond to the first two levels. The leaf nodes, defined when the CCA-test no longer passes, are indicated by arrows and correspond to subsets of the data set where a single note is played by a single instrument, e.g., "A♭₂": an A♭ played by a piano. The "F" subset takes two additional levels to separate, due to the high level of similarity.

Figure 11. Clustering tree for kMER (A) and the SOM algorithm (B). Arrows indicate the leaf nodes. The "X" refers to a meaningless cluster (see text).

In order to show the effect of monitoring kMER, we consider two cases. In the first case, we initialize the lattice weights and radii in the same way as before, and train the lattice with kMER over t_max = 100 epochs with γ_OV = 1. Note that this is the same configuration as in our first monitoring run. The cluster map at t_max is shown in Fig. 12(A) and it reveals several topological defects. In the second case, we again start from the same randomly initialized weights and radii, but now train the lattice over t_max = 250 epochs, thus with the same t_max as in the last monitoring run, but still with γ_OV = 1. The cluster map at t_max is shown in Fig. 12(B) and it again reveals several topological defects. Hence, at least for the music example, these cases clearly show the necessity of the monitoring algorithm.

The cluster map obtained with the SOM at Level 1 (Fig. 12(C)) is not completely unfolded and has topological defects; hence, not all clusters correspond to contiguous regions in the cluster map. Figure 11(B) shows the clustering tree obtained with the SOM. At Level 1, it detects 7 clusters, which are identical to the kMER result, except for the "Tr" class. At Level 2, "A♭" and "E♭" are divided into two subsets, which are in turn divided into the separate instruments at Level 3. Note that the clustering at Level 2 does not consistently group the same instruments. The "F" class again requires two additional levels of clustering to separate. The structure of the tree is different from the kMER tree and, at the last level, a meaningless cluster "X" is found, that is, one that does not correspond to notes (contiguous regions in time), but did pass the CCA-test.

Figure 12. (A, B) Level 1 cluster maps obtained without monitoring kMER. (A) Cluster map for t_max = 100 epochs and γ_OV = 1, and (B) idem but for t_max = 250 epochs. See text. (C) Level 1 cluster map obtained with the SOM algorithm.

The complete data set has been labeled at Level 1 and the subset corresponding to the "A♭" notes played by the three instruments consecutively has been used to illustrate the Level 2 clustering. This {A♭}_{1,2,3} set consists of 73 patterns for kMER and 75 for the SOM algorithm. Ideally, that is, disregarding transitional effects due to the STFT and the note onsets, there should be 78 spectra in the "A♭" cluster (the ideal leaf node subsets can be obtained by observing the dips in the energy over time, and considering them as the note boundaries; grouping these yields subsets for non-leaf nodes in the clustering tree). The misclassification rates are shown in Table 1. At every level in the clustering tree, the ideal subsets are determined and the misclassification rate is computed as the ratio between the number of misclassifications at this level (including the leaf nodes of the previous level) and the total number of patterns. At Level 1, the SOM performs better than kMER, possibly because the patterns labeled "Tr" by kMER are considered as misclassified (the SOM did not find a "Tr" cluster). At the next levels, kMER performs consistently better. Although the clusters "G", "B♭", "C" and "D♭" at Level 2 passed the CCA-test, they have not been further analyzed in this paper due to their analogy to the "A♭" and "E♭" clusters at that level.

Table 1. Misclassification percentages per level for kMER and the SOM algorithm.

Level    1       2       3       4
kMER     8.86    9.98    9.98    10.47
SOM      7.73    10.14   10.26   12.08

7. Discussion

We have introduced a new heuristic, called the Overlap Variability (OV) metric, for quantifying the degree of topology preservation during topographic map formation. The metric is compatible with learning algorithms that rely on neurons with overlapping activation regions or kernels. The basic idea is that overlapping activation regions carry, by the resulting correlations in activity, information about neighborhood relationships [20]. The OV metric has been developed in order to meet two concerns. First, a metric such as the topographic product (TP) is not sensitive enough to detect small topological defects, or to distinguish them from locally non-smooth portions of the map [15]. Furthermore, the effect of the neighborhood function range on TP is confounded with that due to a mismatch between lattice and input space geometries and dimensionalities. Hence, when used for monitoring the learning process, it can lead to an incorrect judgment of the topology-preserving state of the lattice. Second, the usual metrics are computationally quite heavy to run as a monitoring tool (see Introduction). However, in the case of kMER [3, 19], when using the OV-metric, the overhead is minimal since only the mean and the standard deviation of the number of active neurons per input sample need to be calculated.

Finally, we have used our heuristic for solving a difficult real-world task, namely, given a music signal, find the sequence of notes, together with the musical instruments that played these notes. Hence, one could call this application the "music typewriter", in analogy with Kohonen's "phonetic typewriter" [2].

Appendix: Topographic Product

Consider a lattice A of N neurons. Let r_i be the lattice coordinate of neuron i and let i(k, A) be the kth nearest neighbor of neuron i, with the nearest neighbor defined in terms of the Euclidean distance in lattice coordinates:

$$i(1, A) = \arg\min_{j \in A \setminus \{i\}} \|\mathbf{r}_i - \mathbf{r}_j\|, \quad i(2, A) = \arg\min_{j \in A \setminus \{i,\, i(1,A)\}} \|\mathbf{r}_i - \mathbf{r}_j\|, \quad \ldots \qquad (18)$$

Similarly, let i(k, V) be the kth nearest neighbor of neuron i, with the nearest neighbor defined in terms of the Euclidean distance in the input space V. Define the following two ratios:

$$Q_1(j, k) = \frac{\|\mathbf{w}_j - \mathbf{w}_{j(k,A)}\|}{\|\mathbf{w}_j - \mathbf{w}_{j(k,V)}\|}, \qquad Q_2(j, k) = \frac{\|\mathbf{r}_j - \mathbf{r}_{j(k,A)}\|}{\|\mathbf{r}_j - \mathbf{r}_{j(k,V)}\|}. \qquad (19)$$

Following these definitions, we have that Q_1(j, k) = Q_2(j, k) = 1 only when the nearest neighbors of order k in the input and output (i.e., lattice) spaces coincide. However, this measure is still sensitive to local variations in the input density. In order to overcome this, the Q values are multiplied for all orders k. After some algebraic manipulations, the following equation is obtained:

$$T(j, k) = \left( \prod_{l=1}^{k} Q_1(j, l)\, Q_2(j, l) \right)^{\frac{1}{2k}}, \qquad (20)$$

which, in turn, needs to be averaged in a suitable manner. Bauer and Pawelzik [14] suggest the following averaging procedure, which also results in the definition of the topographic product TP:

$$TP = \frac{1}{N(N-1)} \sum_{j=1}^{N} \sum_{k=1}^{N-1} \log(T(j, k)). \qquad (21)$$

The topographic product should now be as close as possible to zero for maximum lattice disentangling. Note that TP can be larger or smaller than zero.
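A direct, unoptimized sketch of Eqs. (18)–(21) is given below (our own transcription, not the authors' code); the logarithm of T(j, k) is accumulated instead of the product itself for numerical stability.

```python
import numpy as np

def topographic_product(W, R):
    """Topographic product, Eqs. (18)-(21).

    W: (N, d) weight vectors, R: (N, dA) lattice coordinates.
    """
    N = W.shape[0]
    dV = np.linalg.norm(W[:, None] - W[None, :], axis=2)   # input-space distances
    dA = np.linalg.norm(R[:, None] - R[None, :], axis=2)   # lattice distances
    k = np.arange(1, N)
    tp = 0.0
    for j in range(N):
        nnA = np.argsort(dA[j])[1:]      # neighbors ordered by lattice distance
        nnV = np.argsort(dV[j])[1:]      # neighbors ordered by input-space distance
        q1 = dV[j, nnA] / dV[j, nnV]     # Q1(j, k), Eq. (19)
        q2 = dA[j, nnA] / dA[j, nnV]     # Q2(j, k), Eq. (19)
        log_T = np.cumsum(np.log(q1 * q2)) / (2.0 * k)   # log T(j, k), Eq. (20)
        tp += log_T.sum()
    return tp / (N * (N - 1))            # Eq. (21)
```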

Acknowledgments

T.G. is supported by a scholarship from the Flemish Regional Ministry of Education (GOA 2000/11). M.M.V.H. is supported by research grants received from the Fund for Scientific Research (G.0185.96N), the National Lottery (Belgium) (9.0185.96), the Flemish Regional Ministry of Education (Belgium) (GOA 95/99-06; 2000/11), the Flemish Ministry for Science and Technology (VIS/98/012) and the European Commission, 5th framework programme (QLG3-CT-2000-30161 and IST-2001-32114).

Note

1. A Peano curve is an infinitely and recursively convoluted fractal curve which represents the continuous mapping of, e.g., a one-dimensional interval onto a two-dimensional surface [2].

References

1. H. Ritter, T. Martinetz, and K. Schulten, Neural Computation and Self-Organizing Maps: An Introduction, Reading, MA: Addison-Wesley, 1992.

2. T. Kohonen, Self-Organizing Maps, Heidelberg: Springer, 1995.

3. M.M. Van Hulle, Faithful Representations and Topographic Maps: From Distortion- to Information-Based Self-Organization, New York: Wiley, 2000.

4. G.J. Deboeck and T. Kohonen, Visual Explorations in Finance with Self-Organizing Maps, Heidelberg: Springer, 1998.

5. M. Cottrell, P. Gaubert, P. Letremy, and P. Rousset, "Analyzing and Representing Multidimensional Quantitative and Qualitative Data: Demographic Study of the Rhone Valley. The Domestic Consumption of the Canadian Families," in Kohonen Maps, Proc. WSOM99, Helsinki, E. Oja and S. Kaski (Eds.), 1999, pp. 1–14.

6. K. Lagus and S. Kaski, "Keyword Selection Method for Characterizing Text Document Maps," in Proc. ICANN99, 9th Int. Conf. on Artificial Neural Networks, IEE: London, vol. 1, 1999, pp. 371–376.

7. T. Kohonen, "Self-Organized Formation of Topologically Correct Feature Maps," Biol. Cybern., vol. 43, 1982, pp. 59–69.

8. T. Kohonen, Self-Organization and Associative Memory, Heidelberg: Springer, 1984.

9. M. Cottrell and J.C. Fort, "Etude d'un processus d'auto-organisation," Ann. Inst. Henri Poincare, vol. 23, 1987, pp. 1–20.

10. E. Erwin, K. Obermayer, and K. Schulten, "Self-Organizing Maps: Ordering, Convergence Properties and Energy Functions," Biol. Cybern., vol. 67, 1992, pp. 47–55.

11. T. Geszti, Physical Models of Neural Networks, Singapore: World Scientific Press, 1990.

12. H. Ritter, "Asymptotic Level Density for a Class of Vector Quantization Processes," IEEE Trans. Neural Networks, vol. 2, no. 1, 1991, pp. 173–175.

13. D.R. Dersch and P. Tavan, "Asymptotic Level Density in Topological Feature Maps," IEEE Trans. Neural Networks, vol. 6, 1995, pp. 230–236.

14. H.-U. Bauer and K.R. Pawelzik, "Quantifying the Neighborhood Preservation of Self-Organizing Feature Maps," IEEE Trans. Neural Networks, vol. 3, 1992, pp. 570–579.

15. T. Villmann, R. Der, M. Herrmann, and T.M. Martinetz, "Topology Preservation in Self-Organizing Feature Maps: Exact Definition and Measurement," IEEE Trans. Neural Networks, vol. 8, no. 2, 1997, pp. 256–266.

16. T. Graepel, M. Burger, and K. Obermayer, "Phase Transitions in Stochastic Self-Organizing Maps," Physical Rev. E, vol. 56, no. 4, 1997, pp. 3876–3890.

17. J. Sum, C.-S. Leung, L.-W. Chan, and L. Xu, "Yet Another Algorithm which can Generate Topography Map," IEEE Trans. Neural Networks, vol. 8, no. 5, 1997, pp. 1204–1207.

18. C.M. Bishop, M. Svensen, and C.K.I. Williams, "GTM: The Generative Topographic Mapping," Neural Computat., vol. 10, 1998, pp. 215–234.

19. M.M. Van Hulle, "Kernel-Based Equiprobabilistic Topographic Map Formation," Neural Computat., vol. 10, no. 7, 1998, pp. 1847–1871.

20. J.J. Koenderink, "Simultaneous Order in Nervous Nets from a Functional Standpoint," Biol. Cybern., vol. 50, 1984, pp. 35–41.

21. V.T. Ruoppila, T. Sorsa, and H.N. Koivo, "Recursive Least-Squares Approach to Self-Organizing Maps," in Proc. IEEE Int. Conf. on Neural Networks, San Francisco, 1993, pp. 1480–1485.

22. T. Hastie and W. Stuetzle, "Principal Curves," J. Am. Statist. Assoc., vol. 84, 1989, pp. 502–516.

23. S.P. Luttrell, "Derivation of a Class of Training Algorithms," IEEE Trans. Neural Networks, vol. 1, 1990, pp. 229–232.

24. B.W. Silverman, Density Estimation for Statistics and Data Analysis, London: Chapman and Hall, 1992.

Marc M. Van Hulle received a M.Sc. in Electrotechnical Engineering (Electronics) and a Ph.D. in Applied Sciences from the K.U. Leuven, Leuven (Belgium) in 1985 and 1990, respectively. He also holds B.Sc. Econ. and MBA degrees. In 1992, he was with the Brain and Cognitive Sciences department of the Massachusetts Institute of Technology (MIT), Boston (USA), as a postdoctoral scientist. He is affiliated with the Neuro- and Psychophysiology Laboratory, Medical School, K.U. Leuven, as an associate Professor. He has authored the monograph Faithful Representations and Topographic Maps: From Distortion- to Information-Based Self-Organization, John Wiley, 2000, and more than 60 technical publications. (http://simone.neuro.kuleuven.ac.be)

Dr. Van Hulle is an Executive Member of the IEEE Signal Processing Society, Neural Networks for Signal Processing (NNSP) Technical Committee (1996–1999, 2000–2003), the Publicity Chair of NNSP's 1999 and 2000 workshops, the Program co-chair of NNSP's 2001 workshop, and reviewer and co-editor of several special issues for several neural network and signal processing journals. He is also founder and director of Synes N.V., the data mining spin-off of the K.U. Leuven (http://www.synes.com). His research interests include neural networks, biological modeling, vision, data mining and signal processing.

Temujin Gautama received a B.Sc. degree in electronic engineering from Groep T, Leuven, Belgium and a Master's degree in Artificial Intelligence from the Katholieke Universiteit Leuven, Belgium. He is currently with the Laboratorium voor Neuro- en Psychofysiologie at the Medical School of the K.U. Leuven, where he is working towards his Ph.D. His research interests include nonlinear signal processing, biological modeling, self-organizing neural networks and their application to data mining.