
Learning Vector Quantization: The Dynamics of Winner-Takes-All Algorithms

Michael Biehl¹, Anarta Ghosh¹, and Barbara Hammer²

1- Rijksuniversiteit Groningen - Mathematics and Computing Science

P.O. Box 800, NL-9700 AV Groningen - The Netherlands

2- Clausthal University of Technology - Institute of Computer Science

D-98678 Clausthal-Zellerfeld - Germany

Abstract

Winner-Takes-All (WTA) prescriptions for Learning Vector Quantization (LVQ) are studied in the framework of a model situation: Two competing prototype vectors are updated according to a sequence of example data drawn from a mixture of Gaussians. The theory of on-line learning allows for an exact mathematical description of the training dynamics, even if an underlying cost function cannot be identified. We compare the typical behavior of several WTA schemes including basic LVQ and unsupervised Vector Quantization. The focus is on the learning curves, i.e. the achievable generalization ability as a function of the number of training examples.

Keywords: Learning Vector Quantization, prototype-based classification, Winner-Takes-All algorithms, on-line learning, competitive learning

1 Introduction

Learning Vector Quantization (LVQ) as originally proposed by Kohonen [10] is a widely used approach to classification. It is applied in a variety of practical problems, including medical image and data analysis, e.g. proteomics, classification of satellite spectral data, fault detection in technical processes, and language recognition, to name only a few. An overview and further references can be obtained from [1].

LVQ procedures are easy to implement and intuitively clear. The classification of data is based on a comparison with a number of so-called prototype vectors. The similarity is frequently measured in terms of Euclidean distance in feature space. Prototypes are determined in a training phase from labeled examples and can be interpreted in a straightforward way as they directly represent typical data in the same space. This is in contrast with, say, adaptive weights in feedforward neural networks or support vector machines, which do not allow for immediate interpretation as easily. Among the most attractive features of LVQ is the natural way in which it can be applied to multi-class problems.

In the simplest, so-called hard or crisp schemes, any feature vector is assigned to the closest of all prototypes and the corresponding class. In general, several prototypes will be used to represent each class. Extensions of the deterministic assignment to a probabilistic soft classification are conceptually straightforward but will not be considered here.

Plausible training prescriptions exist which employ the concept of on-line competitive learning: Prototypes are updated according to their distance from a given example in a sequence of training data. Schemes in which only the winner, i.e. the currently closest prototype, is updated have been termed Winner-Takes-All (WTA) algorithms, and we will concentrate on this class of prescriptions here.

The ultimate goal of the training process is, of course, to obtain a classifier which labels novel data correctly with high probability after training. This so-called generalization ability will be the focus of our analysis in the following.

Several modifications and potential improvements of Kohonen's original LVQ procedure have been suggested. They aim at achieving better approximations of Bayes optimal decision boundaries, faster or more robust convergence, or the incorporation of more flexible metrics, to name a few examples [5, 8-10, 16].

Many of the suggested algorithms are based on plausible but purely heuristic arguments, and they often lack a thorough theoretical understanding. Other procedures can be derived from an underlying cost function, such as Generalized Relevance LVQ [8, 9] or LVQ2.1, the latter being a limit case of a statistical model [14, 15]. However, the connection of the cost functions with the ultimate goal of training, i.e. the generalization ability, is often unclear. Furthermore, several learning rules display instabilities and divergent behavior and require modifications such as the window rule for LVQ2.1 [10].

Clearly, a better theoretical understanding of the training algorithms should be helpful in improving their performance and in designing novel, more efficient schemes.

In this work we employ a theoretical framework which makes possible a systematic investigation and comparison of LVQ training procedures. We consider on-line training from a sequence of uncorrelated, random training data which is generated according to a model distribution. Its purpose is to define a non-trivial structure of data and to facilitate our analytical approach. We would like to point out, however, that the training algorithms do not make use of the form of this distribution, in contrast to, for instance, density estimation schemes.

The dynamics of training is studied by applying the successful theory of on-line learning [3, 6, 12], which relates to ideas and concepts known from Statistical Physics. The essential ingredients of the approach are (1) the consideration of high-dimensional data and large systems in the so-called thermodynamic limit and (2) the evaluation of averages over the randomness or disorder contained in the sequence of examples. The typical properties of large systems are fully described by only a few characteristic quantities. Under simplifying assumptions, the evolution of these so-called order parameters is given by deterministic coupled ordinary differential equations (ODE) which describe the dynamics of on-line learning exactly in the thermodynamic limit. For reviews of this very successful approach to the investigation of machine learning processes consult, for instance, [3, 6, 17].

The formalism enables us to compare the dynamics and generalization ability of different WTA schemes including basic LVQ1 and unsupervised Vector Quantization (VQ). The analysis can readily be extended to more general schemes, approaching the ultimate goal of designing novel and efficient LVQ training algorithms with precise mathematical foundations.

The paper is organized as follows: in the next section we introduce the model, i.e. the specific learning scenarios and the assumed statistical properties of the training data. The mathematical description of the training dynamics is briefly summarized in section 3; technical details are given in an appendix. Results concerning the WTA schemes are presented and discussed in section 4, and we conclude with a summary and an outlook on forthcoming projects.

2 The model

2.1 Winner-Takes-All algorithms

We study situations in which input vectors ξ ∈ ℝ^N belong to one of two possible classes denoted as σ = ±1. Here, we restrict ourselves to the case of two prototype vectors w_S, where the label S = ±1 (or ± for short) corresponds to the represented class.


In all WTA schemes, the squared Euclidean distances d_S(ξ) = (ξ − w_S)² are evaluated for S = ±1, and the vector ξ is assigned to class σ if d_{+σ} < d_{−σ}. We investigate incremental learning schemes in which a sequence of single, uncorrelated examples {ξ^µ, σ^µ} is presented to the system. The analysis can be applied to a larger variety of algorithms, but here we treat only updates of the form

\[
w_S^{\mu} = w_S^{\mu-1} + \Delta w_S^{\mu}
\quad \text{with} \quad
\Delta w_S^{\mu} = \frac{\eta}{N}\,\Theta_S^{\mu}\, g(S,\sigma^{\mu})\left(\xi^{\mu} - w_S^{\mu-1}\right). \tag{1}
\]

Here, the vector w_S^µ denotes the prototype after presentation of µ examples, and the learning rate η is rescaled with the vector dimension N. The Heaviside term

\[
\Theta_S^{\mu} = \Theta\!\left(d_{-S}^{\mu} - d_{+S}^{\mu}\right)
\quad \text{with} \quad
\Theta(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{else} \end{cases}
\]

singles out the current prototype w_S^{µ−1} which is closest to the new input ξ^µ in the sense of the measure d_S^µ = (ξ^µ − w_S^{µ−1})². In this formulation, only the winner, say w_S, can be updated, whereas the loser w_{−S} remains unchanged. The change of the winner is always along the direction ±(ξ^µ − w_S^{µ−1}). The function g(S, σ^µ) further specifies the update rule. Here, we focus on three special cases of WTA learning:

I) LVQ1: g(S, σ) = Sσ = +1 (resp. −1) for S = σ (resp. S ≠ σ). This extension of competitive learning to labeled data corresponds to Kohonen's original LVQ1. The update is towards ξ^µ if the example belongs to the class represented by the winning prototype, the correct winner. On the contrary, a wrong winner is moved away from the current input.

II) LVQ+: g(S, σ) = Θ(Sσ) = +1 (resp. 0) for S = σ (resp. S ≠ σ). In this scheme the update is non-zero only for a correct winner and, then, always positive. Hence, a prototype w_S can only accumulate updates from its own class σ = S. We will use the abbreviation LVQ+ for this prescription.

III) VQ: g(S, σ) = 1. This update rule disregards the actual data label and always moves the winner towards the example input. It corresponds to unsupervised Vector Quantization and aims at finding prototypes which yield a good representation of the data in the sense of Euclidean distances. The choice g(S, σ) = 1 can also be interpreted as describing two prototypes which represent the same class and compete for updates from examples of this very class only.
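To make the prescription concrete, the single update step of Eq. (1) can be written out in a few lines of code. The following sketch (in Python with NumPy; the function names are ours and purely illustrative, not part of the original paper) stores the two prototypes under their labels S = ±1 and selects g(S, σ) according to the three schemes above.

    import numpy as np

    def g(rule, S, sigma):
        """Modulation function g(S, sigma) for the three WTA prescriptions."""
        if rule == "LVQ1":   # g = S*sigma in {+1, -1}
            return float(S * sigma)
        if rule == "LVQ+":   # g = Theta(S*sigma): only a correct winner is updated
            return 1.0 if S == sigma else 0.0
        if rule == "VQ":     # unsupervised: the label is ignored
            return 1.0
        raise ValueError(rule)

    def wta_update(w, xi, sigma, eta, rule):
        """One step of Eq. (1): only the winner (closest prototype) is moved.

        w     : dict mapping the label S in {+1, -1} to a prototype vector
        xi    : example input of shape (N,)
        sigma : class label of xi, +1 or -1
        """
        N = xi.shape[0]
        d = {S: np.sum((xi - w[S]) ** 2) for S in (+1, -1)}   # squared distances
        S_win = +1 if d[+1] < d[-1] else -1                   # role of the Heaviside term
        w[S_win] = w[S_win] + (eta / N) * g(rule, S_win, sigma) * (xi - w[S_win])
        return w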


Figure 1: Data as generated according to the model density (3) in N = 200 dimensions with p− = 0.6, p+ = 0.4. Open (filled) circles represent 240 (160) vectors ξ from clusters centered about the orthonormal vectors λB+ (λB−) with λ = 1.5. The left panel shows the projections h± = w± · ξ of the data on a randomly chosen pair of orthogonal unit vectors w±, whereas the right panel displays b± = B± · ξ, i.e. the plane spanned by B+, B−; crosses mark the position of the cluster centers.

Note that the VQ procedure (III) can be readily formulated as a stochastic gradient descent with respect to the quantization error

\[
e(\xi^{\mu}) = \sum_{S=\pm 1} \frac{1}{2}\left(\xi^{\mu} - w_S^{\mu-1}\right)^2
\,\Theta\!\left(d_{-S}^{\mu} - d_{+S}^{\mu}\right), \tag{2}
\]

see e.g. [7] for details. While intuitively clear and well motivated, LVQ1 (I) and LVQ+ (II) lack such a straightforward interpretation and relation to a cost function.
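As a small illustration (ours, not taken from the paper), the quantization error (2) of a data set is simply half the squared distance of each example to its winning prototype; the VQ step of Eq. (1) moves the winner along the negative gradient of this term.

    import numpy as np

    def quantization_error(xi, w_plus, w_minus):
        """Average of e(xi), Eq. (2), over a data set xi of shape (P, N)."""
        d_plus = np.sum((xi - w_plus) ** 2, axis=1)    # d_{+1}(xi)
        d_minus = np.sum((xi - w_minus) ** 2, axis=1)  # d_{-1}(xi)
        # Theta(d_{-S} - d_{+S}) selects the winner, i.e. the smaller distance
        return 0.5 * np.mean(np.minimum(d_plus, d_minus))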

2.2 The model data

In order to analyse the behavior of the above algorithms we assume that randomized data is generated according to a model distribution P(ξ). As a simple yet non-trivial situation we consider input data drawn from a binary mixture of Gaussian clusters

\[
P(\xi) = \sum_{\sigma=\pm 1} p_{\sigma}\, P(\xi \mid \sigma)
\quad \text{with} \quad
P(\xi \mid \sigma) = \frac{1}{(2\pi)^{N/2}}
\exp\!\left[-\frac{1}{2}\left(\xi - \lambda B_{\sigma}\right)^2\right], \tag{3}
\]


where the weights p_σ correspond to the prior class membership probabilities and p+ + p− = 1. Clusters are centered about λB+ and λB−, respectively. We assume that B_σ · B_τ = Θ(στ), i.e. B_σ² = 1 and B+ · B− = 0. The latter condition fixes the location of the cluster centers with respect to the origin, while the parameter λ controls their separation.

We consider the case where the cluster membership σ coincides with the class label of the data. The corresponding classification scheme is not linearly separable because the Gaussian contributions P(ξ|σ) overlap. According to Eq. (3) a vector ξ consists of statistically independent components with unit variance. Denoting the average over P(ξ|σ) by ⟨···⟩_σ we have, for instance, ⟨ξ_j⟩_σ = λ(B_σ)_j for a single component and correspondingly

\[
\left\langle \xi^2 \right\rangle_{\sigma}
= \sum_{j=1}^{N} \left\langle \xi_j^2 \right\rangle_{\sigma}
= \sum_{j=1}^{N} \left( 1 + \langle \xi_j \rangle_{\sigma}^2 \right)
= N + \lambda^2 .
\]

Averages over the full P(ξ) will be written as ⟨···⟩ = Σ_{σ=±1} p_σ ⟨···⟩_σ.

Note that in high dimensions, i.e. for large N, the Gaussians overlap significantly. The cluster structure of the data becomes apparent only when projected into the plane spanned by {B+, B−}. Projections into a randomly chosen two-dimensional subspace, however, would overlap completely. Figure 1 illustrates the situation for Monte Carlo data in N = 200 dimensions. In an attempt to learn the classification scheme, the relevant directions B± ∈ ℝ^N have to be identified to a certain extent. Obviously this task becomes highly non-trivial for large N.
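The model density (3) is simple to sample from, which is convenient for Monte Carlo checks of the kind shown in Figure 1. The sketch below is our own illustration (in particular, choosing B± as the first two canonical basis vectors is an arbitrary assumption); it generates labeled examples and forms the projections displayed in the two panels.

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_data(P, N=200, p_plus=0.4, lam=1.5):
        """Draw P examples from the mixture density (3): unit-variance Gaussian
        components, cluster center lam * B_sigma for class sigma = +/-1."""
        B_plus, B_minus = np.eye(N)[0], np.eye(N)[1]     # one convenient orthonormal pair
        sigma = np.where(rng.random(P) < p_plus, +1, -1)
        centers = lam * np.where(sigma[:, None] == +1, B_plus, B_minus)
        return rng.standard_normal((P, N)) + centers, sigma

    xi, sigma = generate_data(P=400)
    N = xi.shape[1]
    B_plus, B_minus = np.eye(N)[0], np.eye(N)[1]
    b = np.column_stack([xi @ B_plus, xi @ B_minus])      # right panel: clusters separate
    w = np.linalg.qr(rng.standard_normal((N, 2)))[0]      # random orthonormal directions
    h = xi @ w                                            # left panel: clusters overlap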

3 The dynamics of learning

The following analysis is along the lines of on-line learning; see e.g. [3, 6, 12] for comprehensive overviews and example applications. In this section we briefly outline the treatment of WTA algorithms in LVQ and refer to appendix A for technical details.

The actual configuration of prototypes is characterized by the projections

\[
R_{S\sigma}^{\mu} = w_S^{\mu} \cdot B_{\sigma}
\quad \text{and} \quad
Q_{ST}^{\mu} = w_S^{\mu} \cdot w_T^{\mu},
\qquad S, T, \sigma = \pm 1. \tag{4}
\]

The self-overlaps Q++ and Q−− specify the lengths of the vectors w±, whereas the remaining five overlaps correspond to projections, i.e. angles, between w+ and w− and between the prototypes and the center vectors B±.


The algorithm (1) directly implies recursions for the above defined quantities upon presentation of a novel example:

\[
\begin{aligned}
N\left(R_{S\sigma}^{\mu} - R_{S\sigma}^{\mu-1}\right)
&= \eta\,\Theta_S^{\mu}\, g(S,\sigma^{\mu})\left(b_{\sigma}^{\mu} - R_{S\sigma}^{\mu-1}\right) \\
N\left(Q_{ST}^{\mu} - Q_{ST}^{\mu-1}\right)
&= \eta\,\Theta_S^{\mu}\, g(S,\sigma^{\mu})\left(h_{T}^{\mu} - Q_{ST}^{\mu-1}\right)
 + \eta\,\Theta_T^{\mu}\, g(T,\sigma^{\mu})\left(h_{S}^{\mu} - Q_{ST}^{\mu-1}\right) \\
&\quad + \eta^2\,\Theta_S^{\mu}\,\Theta_T^{\mu}\, g(S,\sigma^{\mu})\, g(T,\sigma^{\mu}) + O(1/N)
\end{aligned} \tag{5}
\]

Here, the actual input ξ^µ enters only through the projections

\[
h_S^{\mu} = w_S^{\mu-1} \cdot \xi^{\mu}
\quad \text{and} \quad
b_{\sigma}^{\mu} = B_{\sigma} \cdot \xi^{\mu}. \tag{6}
\]

Note in this context that Θ_S^µ = Θ(Q_{−S−S}^{µ−1} − 2h_{−S}^µ − Q_{SS}^{µ−1} + 2h_S^µ) also depends on ξ^µ only through these projections.

An important assumption is that all examples in the training sequence are independently drawn from the model distribution and, hence, are uncorrelated with previous data and with w_±^{µ−1}. As a consequence, the statistics of the projections (6) are well known for large N. By means of the Central Limit Theorem their joint density becomes a mixture of Gaussians, which is fully specified by the corresponding conditional first and second moments:

\[
\begin{aligned}
\left\langle h_S^{\mu} \right\rangle_{\sigma} &= \lambda R_{S\sigma}^{\mu-1}, &
\left\langle b_{\tau}^{\mu} \right\rangle_{\sigma} &= \lambda\,\Theta(\tau\sigma), \\
\left\langle h_S^{\mu} h_T^{\mu} \right\rangle_{\sigma}
- \left\langle h_S^{\mu} \right\rangle_{\sigma}\left\langle h_T^{\mu} \right\rangle_{\sigma} &= Q_{ST}^{\mu-1}, &
\left\langle h_S^{\mu} b_{\tau}^{\mu} \right\rangle_{\sigma}
- \left\langle h_S^{\mu} \right\rangle_{\sigma}\left\langle b_{\tau}^{\mu} \right\rangle_{\sigma} &= R_{S\tau}^{\mu-1}, \\
\left\langle b_{\rho}^{\mu} b_{\tau}^{\mu} \right\rangle_{\sigma}
- \left\langle b_{\rho}^{\mu} \right\rangle_{\sigma}\left\langle b_{\tau}^{\mu} \right\rangle_{\sigma} &= \Theta(\rho\tau), &&
\end{aligned} \tag{7}
\]

see the appendix for derivations. This observation enables us to perform an average of the recursions w.r.t. the latest example data in terms of Gaussian integrations. Details of the calculation are presented in appendix A, see also [4]. On the right hand sides of (5), terms of order 1/N have been neglected, using, for instance, ⟨ξ²⟩/N = 1 + λ²/N ≈ 1 for large N.

The limit N → ∞ has further simplifying consequences. First, the averaged recursions can be interpreted as ordinary differential equations (ODE) in continuous training time α = µ/N. Second, the overlaps {R_Sσ, Q_ST} as functions of α become self-averaging with respect to the random sequence of examples. Fluctuations of these quantities, as for instance observed in computer simulations of the learning process, vanish with increasing N, and the description in terms of mean values is sufficient. For a detailed mathematical discussion of this property see [11].

Given initial conditions {R_Sσ(0), Q_ST(0)}, the resulting system of coupled ODE can be integrated numerically. This yields the evolution of the overlaps (4) with increasing α in the course of training. The behavior of the system will depend on the characteristics of the data, i.e. the separation λ, the learning rate η, and the actual algorithm as specified by the choice of g(S, σ) in Eq. (1). Monte Carlo simulations of the learning process are in excellent agreement with the N → ∞ theory for dimensions as low as N = 200 already, see Figure 2 for a comparison in the case of LVQ1. Our simulations furthermore confirm that the characteristic overlaps {R_Sσ(α), Q_ST(α)} are self-averaging quantities: their standard deviation determined from independent runs vanishes as N → ∞; details will be published elsewhere.

Figure 2: The characteristic overlaps R_Sσ (left panel) and Q_ST (right panel) vs. the rescaled number of examples α = µ/N for LVQ1 with p+ = 0.8, λ = 1, and η = 0.2. The solid lines display the result of integrating the system of ODE, symbols correspond to Monte Carlo simulations for N = 200, averaged over 100 independent runs. Error bars are smaller than the symbol size. Prototypes were initialized close to the origin with R_Sσ = Q+− = 0 and Q++ = Q−− = 10⁻⁴.
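The symbols in Figure 2 correspond to exactly this kind of experiment. A minimal Monte Carlo sketch of our own (function and variable names are illustrative, not the authors') iterates the WTA update (1) at finite N and records the overlaps (4) once per unit of α = µ/N:

    import numpy as np

    def simulate_lvq(rule="LVQ1", N=200, alpha_max=300, eta=0.2,
                     lam=1.0, p_plus=0.8, seed=0):
        """Monte Carlo run of update (1); returns the overlaps (4) vs. alpha."""
        rng = np.random.default_rng(seed)
        B = {+1: np.eye(N)[0], -1: np.eye(N)[1]}               # orthonormal cluster centers
        w = {+1: 1e-2 * rng.standard_normal(N) / np.sqrt(N),   # start close to the origin,
             -1: 1e-2 * rng.standard_normal(N) / np.sqrt(N)}   # Q_SS of order 1e-4
        g = {"LVQ1": lambda S, s: S * s,
             "LVQ+": lambda S, s: float(S == s),
             "VQ":   lambda S, s: 1.0}[rule]
        history = []
        for mu in range(alpha_max * N):
            sigma = +1 if rng.random() < p_plus else -1
            xi = rng.standard_normal(N) + lam * B[sigma]
            S_win = min((+1, -1), key=lambda S: np.sum((xi - w[S]) ** 2))
            w[S_win] += (eta / N) * g(S_win, sigma) * (xi - w[S_win])
            if mu % N == 0:                                    # record once per unit of alpha
                history.append([w[S] @ B[s] for S in (+1, -1) for s in (+1, -1)]
                               + [w[+1] @ w[+1], w[-1] @ w[-1], w[+1] @ w[-1]])
        return np.array(history)   # columns: R++, R+-, R-+, R--, Q++, Q--, Q+-

Averaging such runs over independent seeds reproduces the symbols of Figure 2 and can be compared with the ODE integration.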

The success of learning can be quantified as the probability of misclassifying novel random data, the generalization error ε_g = Σ_{σ=±1} p_σ ⟨Θ_{−σ}⟩_σ. Performing the averages is done along the lines discussed above, see the appendix and [4]. One obtains ε_g as a function of the overlaps {Q_ST, R_Sσ}:

\[
\varepsilon_g \;=\; \sum_{S=\pm 1} p_S\,
\Phi\!\left( \frac{Q_{SS} - Q_{-S-S} - 2\lambda\left[R_{SS} - R_{-SS}\right]}
{2\sqrt{Q_{++} + Q_{--} - 2Q_{+-}}} \right), \tag{8}
\]

where Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt. Hence, by inserting {Q_ST(α), R_Sσ(α)} we can evaluate the learning curve ε_g(α), i.e. the typical generalization error achieved from a sequence of αN examples.
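Once the overlaps are known, whether from the ODE integration or from a simulation, Eq. (8) is immediate to evaluate. The helper below is our own sketch (standard library only) and uses Φ(x) = (1 + erf(x/√2))/2.

    import math

    def Phi(x):
        """Standard normal CDF, Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def generalization_error(R, Q, lam, p_plus):
        """Evaluate Eq. (8) from the order parameters R[S, sigma] and Q[S, T]."""
        p = {+1: p_plus, -1: 1.0 - p_plus}
        denom = 2.0 * math.sqrt(Q[+1, +1] + Q[-1, -1] - 2.0 * Q[+1, -1])
        return sum(p[S] * Phi((Q[S, S] - Q[-S, -S]
                               - 2.0 * lam * (R[S, S] - R[-S, S])) / denom)
                   for S in (+1, -1))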


Figure 3: Left panel: Typical learning curves ε_g(α) of unsupervised VQ (dotted), LVQ+ (dashed), and LVQ1 (solid line) for λ = 1.0 and learning rate η = 0.2. Right panel: asymptotic ε_g for η → 0, ηα → ∞ and λ = 1.2 as a function of the prior weight p+. The lowest, solid curve corresponds to the optimum ε_g^min, whereas the dashed line represents the typical outcome of LVQ1. The horizontal line is the p±-independent result for LVQ+; it can even exceed ε_g = min{p+, p−} (thin solid line). Unsupervised VQ yields an asymptotic ε_g as marked by the chain line.

4 Results

4.1 The dynamics

The dynamics of unsupervised VQ, setting III in section 2.1, has been studied for p+ = p− in an earlier publication [7]. Because data labels are disregarded or unavailable, the prototypes could be exchanged with no effect on the achieved quantization error (2). This permutation symmetry causes the initial learning dynamics to take place mainly in the subspace orthogonal to λ(B+ − B−). It is reflected in a weakly repulsive fixed point of the ODE in which all R_Sσ are equal. Generically, the prototypes remain unspecialized up to rather large values of α, depending on the precise initial conditions. Without prior knowledge, of course, R_Sσ(0) ≈ 0 holds. This key feature, as discussed in [7], persists for p+ ≠ p−.

While VQ does not aim at good generalization, we can still obtain ε_g(α) from the prototype configuration, see Figure 3 (left panel) for an example. The very slow initial decrease relates to the above mentioned effect.

Figure 2 shows the evolution of the order parameters {Q_ST, R_Sσ} for an example set of model parameters in LVQ1 training, scenario I in sec. 2.1. Monte Carlo simulations of the system for N = 200 already display excellent agreement with the N → ∞ theory.


In LVQ1, data and prototypes are labeled and, hence, specialization is enforced as soon as α > 0. The corresponding ε_g displays a very fast initial decrease, cf. Figure 3. The non-monotonic intermediate behavior of ε_g(α) is particularly pronounced for very different prior weights, e.g. p+ > p−, and for strongly overlapping clusters (small λ).

Qualitatively, the typical behavior of LVQ+ is similar to that of LVQ1. However, unless p+ = p−, the achieved ε_g(α) is significantly larger, as shown in Figure 3. This effect becomes clearer from the discussion of asymptotic configurations in the next section. The typical trajectories of prototype vectors in the space of order parameters will be discussed in greater detail for a variety of LVQ algorithms in forthcoming publications.

4.2 Asymptotic generalization

Apart from the behavior for small and intermediate values of α, we are also interested in the generalization ability that can be achieved, in principle, from an unlimited number of examples. This provides an important means of judging the potential performance of training algorithms.

For stochastic gradient descent procedures like VQ, the expectation value of the associated cost function is minimized in the simultaneous limits η → 0 and α → ∞ such that the rescaled training time α̃ = ηα → ∞. In the absence of a cost function we can still consider this limit, in which the system of ODE simplifies and can be expressed in terms of α̃ after neglecting terms of order O(η²). A fixed point analysis then yields well-defined asymptotics, see [7] for a treatment of VQ. Figure 3 (right panel) summarizes our findings for the asymptotic ε_g^∞ as a function of p+.

It is straightforward to work out the decision boundary with minimal generalization error ε_g^min in our model. For symmetry reasons it is a plane orthogonal to (B+ − B−) which contains all ξ with p+ P(ξ|+1) = p− P(ξ|−1) [5]. The lowest, solid line in Figure 3 (right panel) represents ε_g^min for λ = 1.2 as a function of p+. For comparison, the trivial classification according to the priors p± yields ε_g^triv = min{p+, p−} and is also included in Figure 3.

In unsupervised VQ a strong prevalence, e.g. p+ ≈ 1, will be accounted for by placing both vectors inside the stronger cluster, thus achieving a low quantization error (2). Obviously this yields a poor classification, as indicated by ε_g^∞ = 1/2 in the limiting cases p+ = 0 or 1. In the special case p+ = 1/2 the aim of representation happens to coincide with good generalization, and ε_g indeed becomes optimal.

LVQ1, setting I in section 2.1, yields a classification scheme which is very close to optimal for all values of p+, cf. Figure 3 (right panel). On the contrary, LVQ+ (algorithm II) updates each w_S only with data from class S. As a consequence, the asymptotic positions of the w± are always symmetric about the geometrical center of the cluster means, λ(B+ + B−)/2, and ε_g^∞ is independent of the priors p±. Thus, LVQ+ is robust w.r.t. a variation of p± after training, i.e. it is optimal in the sense of the minmax criterion sup_{p±} ε_g(α) [5].

5 Conclusions

LVQ type learning models constitute popular learning algorithms due to their simple learning rule, their intuitive formulation of a classifier by means of prototypical locations in the data space, and their efficient applicability to any given number of classes. However, only very few theoretical results have been achieved so far which explain the behavior of such algorithms in mathematical terms, including large margin bounds as derived in [2, 8] and variations of LVQ type algorithms which can be derived from a cost function [9, 13-15]. In general, the often excellent generalization ability of LVQ type algorithms is not guaranteed by theoretical arguments, and the stability and convergence of various LVQ learning schemes is only poorly understood. Often, further heuristics such as the window rule for LVQ2.1 are introduced to overcome the problems of the original, heuristically motivated algorithms [10]. This has led to a large variety of (often only slightly) different LVQ type learning rules whose drawbacks and benefits are hardly understood. Apart from an experimental benchmarking of these proposals, there is clearly a need for a thorough theoretical comparison of the behavior of the models in order to judge their efficiency and to select the most powerful learning schemes among them.

We have investigated different Winner-Takes-All algorithms for Learning Vector Quantization in the framework of an analytically treatable model. The theory of on-line learning enables us to describe the dynamics of training in terms of differential equations for a few characteristic quantities. The formalism becomes exact in the limit N → ∞ of very high-dimensional data and many degrees of freedom.

This framework opens the way towards an exact investigation of situations where, so far, only heuristic evaluations have been possible, and it can serve as a uniform approach for investigating and comparing LVQ type learning schemes on a solid theoretical basis: the generalization ability can be evaluated even for heuristic training prescriptions which lack a corresponding cost function, including Kohonen's basic LVQ algorithm. Already in our simple model setting, this formal analysis shows pronounced characteristics and differences of the typical learning behavior. On the one hand, the learning curves display common features such as an initial non-monotonic behavior. This behavior can be attributed to the necessity of symmetry breaking in the fully unsupervised case and to an overshooting effect due to different prior probabilities of the two classes in supervised settings. On the other hand, slightly different intuitive learning algorithms yield fundamentally different asymptotic configurations, as demonstrated for VQ, LVQ1, and LVQ+. These differences can be attributed to the different inherent goals of the learning schemes. Consequently, they are particularly pronounced for unbalanced class distributions, where the goals of minimizing the quantization error, the minmax error, and the generalization error do not coincide. It is quite remarkable that the conceptually simple, original learning rule LVQ1 shows close to optimal generalization over the entire range of the prior weights p±.

Despite the relative simplicity of our model situation, the training of only two prototypes from two-class data captures many non-trivial features of more realistic settings: in terms of the itemization in section 2.1, situation I (LVQ1) describes the generic competition at class borders in basic LVQ schemes. Scenario II (LVQ+) corresponds to an intuitive variant which can be seen as a mixture of VQ and LVQ, i.e. Vector Quantization within the different classes. Finally, setting III (VQ) models the representation of one class by several competing prototypes.

Here, we have only considered WTA schemes for training. It is possible to extend the same analysis to more complex update rules such as LVQ2.1 or recent proposals [9, 14, 15] which update more than one prototype at a time. The treatment of LVQ type learning in the on-line scenario can easily be extended to algorithms which obey a more general learning rule than Eq. (1), e.g. rules which, by substituting the Heaviside term, adapt more than one prototype at a time. Our preliminary studies along these lines show remarkable differences in the generalization ability and convergence properties of such variations of LVQ type algorithms.

Clearly, our model cannot describe all features of more complex real-world problems. Forthcoming investigations will concern, for instance, the extension to situations where the model complexity and the structure of the data do not match perfectly. This requires the treatment of scenarios with more than two prototypes and cluster centers. We also intend to study the influence of different variances within the classes and of non-spherical clusters, aiming at a more realistic modeling. We are convinced that this line of research will lead to a deeper understanding of LVQ type training and will facilitate the development of novel efficient algorithms.


References

[1] Bibliography on the Self-Organizing Map (SOM) and Learning Vector Quantization (LVQ). Neural Networks Research Centre, Helsinki University of Technology, 2002.

[2] K. Crammer, R. Gilad-Bachrach, A. Navot, and A. Tishby. Margin analysis of the LVQ algorithm. In: Advances in Neural Information Processing Systems, 2002.

[3] M. Biehl and N. Caticha. The statistical physics of learning and generalization. In: M. Arbib (ed.), Handbook of Brain Theory and Neural Networks, second edition, MIT Press, 2003.

[4] M. Biehl, A. Freking, A. Ghosh, and G. Reents. A theoretical framework for analysing the dynamics of Learning Vector Quantization. Technical Report 2004-9-02, Institute of Mathematics and Computing Science, University of Groningen, available from www.cs.rug.nl/~biehl, 2004.

[5] R. Duda, P. Hart, and D. Stork. Pattern Classification, Wiley, 2001.

[6] A. Engel and C. van den Broeck. The Statistical Mechanics of Learning, Cambridge University Press, 2001.

[7] A. Freking, G. Reents, and M. Biehl. The dynamics of competitive learning, Europhysics Letters 38 (1996) 73-78.

[8] B. Hammer, M. Strickert, and T. Villmann. On the generalization capability of GRLVQ networks, Neural Processing Letters 21 (2005) 109-120.

[9] B. Hammer and T. Villmann. Generalized relevance learning vector quantization, Neural Networks 15 (2002) 1059-1068.

[10] T. Kohonen. Self-Organizing Maps, Springer, Berlin, 1995.

[11] G. Reents and R. Urbanczik. Self-averaging and on-line learning, Physical Review Letters 80 (1998) 5445-5448.

[12] D. Saad, editor. Online Learning in Neural Networks, Cambridge University Press, 1998.

[13] S. Sato and K. Yamada. Generalized learning vector quantization. In: G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 423-429, MIT Press, 1995.

[14] S. Seo, M. Bode, and K. Obermayer. Soft nearest prototype classification, IEEE Transactions on Neural Networks 14 (2003) 390-398.

[15] S. Seo and K. Obermayer. Soft Learning Vector Quantization, Neural Computation 15 (2003) 1589-1604.

[16] P. Somervuo and T. Kohonen. Self-organizing maps and learning vector quantization for feature sequences, Neural Processing Letters 10 (1999) 151-159.

[17] T.L.H. Watkin, A. Rau, and M. Biehl. The statistical mechanics of learning a rule, Reviews of Modern Physics 65 (1993) 499-556.


A The theoretical framework

In this appendix we outline key steps of the calculations referred to in the text. The formalism was first used in the context of unsupervised Vector Quantization [7], and the calculations were recently detailed in a Technical Report [4].

Throughout this appendix, the indices l, m, k, s, σ ∈ {±1} (or ± for short) represent the class labels and cluster memberships.

Note that the following corresponds to a slightly more general input distribution with variances v_σ of the Gaussian clusters σ = ±1,

\[
P(\xi) = \sum_{\sigma=\pm 1} p_{\sigma}\, P(\xi \mid \sigma)
\quad \text{with} \quad
P(\xi \mid \sigma) = \frac{1}{(2\pi v_{\sigma})^{N/2}}
\exp\!\left[-\frac{1}{2 v_{\sigma}}\left(\xi - \lambda B_{\sigma}\right)^2\right]. \tag{9}
\]

All results in the main text refer to the special choice v+ = v− = 1, which corresponds to Eq. (3). The influence of choosing different variances on the behavior of LVQ type training will be studied in forthcoming publications.

A.1 Statistics of the projections

To a large extent, our analysis is based on the observation that the projections h± = w± · ξ and b± = B± · ξ are correlated Gaussian random quantities for a vector ξ drawn from one of the clusters contributing to the density (9). Where convenient, we will combine the projections into a four-dimensional vector denoted as x = (h+, h−, b+, b−)^T.

We will assume implicitly that ξ is statistically independent of the considered weight vectors w±. This is obviously the case in our on-line prescription, where the novel example ξ^µ is uncorrelated with all previous data and hence with w_±^{µ−1}.

The first and second conditional moments given in Eq. (7) are obtained from the following elementary considerations.

First moments

Exploiting the above mentioned statistical independence, we can show immediately that

\[
\langle h_l \rangle_k = \langle w_l \cdot \xi \rangle_k
= w_l \cdot \langle \xi \rangle_k = w_l \cdot (\lambda B_k) = \lambda R_{lk}. \tag{10}
\]

Similarly we get for b_l:

\[
\langle b_l \rangle_k = \langle B_l \cdot \xi \rangle_k
= B_l \cdot \langle \xi \rangle_k = B_l \cdot (\lambda B_k) = \lambda\,\delta_{lk}, \tag{11}
\]


where δ_{lk} is the Kronecker delta and we exploit that B+ and B− are orthonormal. Now the conditional means µ_k = ⟨x⟩_k can be written as

\[
\mu_{k=+1} = \begin{pmatrix} \lambda R_{++} \\ \lambda R_{-+} \\ \lambda \\ 0 \end{pmatrix}
\quad \text{and} \quad
\mu_{k=-1} = \begin{pmatrix} \lambda R_{+-} \\ \lambda R_{--} \\ 0 \\ \lambda \end{pmatrix}. \tag{12}
\]

Second moments

In order to compute the conditional variance or covariance ⟨h_l h_m⟩_k − ⟨h_l⟩_k ⟨h_m⟩_k, we first consider the average

\[
\begin{aligned}
\langle h_l h_m \rangle_k &= \langle (w_l \cdot \xi)(w_m \cdot \xi) \rangle_k
= \Big\langle \Big(\sum_{i=1}^{N} (w_l)_i (\xi)_i\Big)\Big(\sum_{j=1}^{N} (w_m)_j (\xi)_j\Big) \Big\rangle_k \\
&= \sum_{i=1}^{N} (w_l)_i (w_m)_i \langle (\xi)_i (\xi)_i \rangle_k
+ \sum_{i=1}^{N} \sum_{j=1,\, j\neq i}^{N} (w_l)_i (w_m)_j \langle (\xi)_i (\xi)_j \rangle_k \\
&= \sum_{i=1}^{N} (w_l)_i (w_m)_i \left[ v_k + \lambda^2 (B_k)_i (B_k)_i \right]
+ \sum_{i=1}^{N} \sum_{j=1,\, j\neq i}^{N} (w_l)_i (w_m)_j\, \lambda^2 (B_k)_i (B_k)_j \\
&= v_k \sum_{i=1}^{N} (w_l)_i (w_m)_i
+ \lambda^2 \sum_{i=1}^{N}\sum_{j=1}^{N} (w_l)_i (w_m)_j (B_k)_i (B_k)_j \\
&= v_k\, w_l \cdot w_m + \lambda^2 (w_l \cdot B_k)(w_m \cdot B_k)
= v_k\, Q_{lm} + \lambda^2 R_{lk} R_{mk}.
\end{aligned}
\]

Here we have used once more that the components of ξ from cluster k have variance v_k and are independent. This implies, for all i, j ∈ {1, ..., N},

\[
\langle (\xi)_i (\xi)_i \rangle_k - \langle (\xi)_i \rangle_k \langle (\xi)_i \rangle_k = v_k
\;\Rightarrow\;
\langle (\xi)_i (\xi)_i \rangle_k = v_k + \langle (\xi)_i \rangle_k \langle (\xi)_i \rangle_k,
\qquad
\langle (\xi)_i (\xi)_j \rangle_k = \langle (\xi)_i \rangle_k \langle (\xi)_j \rangle_k \;\text{ for } i \neq j.
\]

Finally, we obtain the conditional second moment

\[
\langle h_l h_m \rangle_k - \langle h_l \rangle_k \langle h_m \rangle_k
= v_k Q_{lm} + \lambda^2 R_{lk} R_{mk} - \lambda^2 R_{lk} R_{mk} = v_k Q_{lm}. \tag{13}
\]

Similarly, we find for the quantities b+, b−:

\[
\langle b_l b_m \rangle_k - \langle b_l \rangle_k \langle b_m \rangle_k
= v_k \delta_{lm} + \lambda^2 \delta_{lk}\delta_{mk} - \lambda^2 \delta_{lk}\delta_{mk} = v_k \delta_{lm}. \tag{14}
\]

Eventually, we evaluate the covariance ⟨h_l b_m⟩_k − ⟨h_l⟩_k ⟨b_m⟩_k by considering the average

\[
\langle h_l b_m \rangle_k = \langle (w_l \cdot \xi)(B_m \cdot \xi) \rangle_k
= v_k\, w_l \cdot B_m + \lambda^2 (w_l \cdot B_k)(B_m \cdot B_k)
= v_k R_{lm} + \lambda^2 R_{lk}\,\delta_{mk}
\]

and obtain

\[
\langle h_l b_m \rangle_k - \langle h_l \rangle_k \langle b_m \rangle_k
= v_k R_{lm} + \lambda^2 R_{lk}\delta_{mk} - \lambda^2 R_{lk}\delta_{mk} = v_k R_{lm}. \tag{15}
\]

The above results are summarized in Eq. (7). The conditional covariance matrix of x can be expressed explicitly in terms of the order parameters as follows:

\[
C_k = v_k \begin{pmatrix}
Q_{++} & Q_{+-} & R_{++} & R_{+-} \\
Q_{+-} & Q_{--} & R_{-+} & R_{--} \\
R_{++} & R_{-+} & 1 & 0 \\
R_{+-} & R_{--} & 0 & 1
\end{pmatrix}. \tag{16}
\]

The conditional density of x for data from class k is a Gaussian N(µ_k, C_k), where µ_k is the conditional mean vector, Eq. (12), and C_k is given above.
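The mean (12) and covariance (16) can be verified directly by sampling. The short check below is our own illustration (the specific choice of B± and of the fixed prototypes is arbitrary); it compares empirical moments of the projections with the closed-form expressions.

    import numpy as np

    rng = np.random.default_rng(1)
    N, lam, v, k = 200, 1.2, 1.0, +1

    B = {+1: np.eye(N)[0], -1: np.eye(N)[1]}                  # orthonormal center directions
    w = {+1: rng.standard_normal(N) / np.sqrt(N),             # arbitrary fixed prototypes,
         -1: rng.standard_normal(N) / np.sqrt(N)}             # independent of the data

    R = {(l, m): w[l] @ B[m] for l in (+1, -1) for m in (+1, -1)}
    Q = {(l, m): w[l] @ w[m] for l in (+1, -1) for m in (+1, -1)}

    mu_k = np.array([lam * R[+1, k], lam * R[-1, k],
                     lam * float(k == +1), lam * float(k == -1)])          # Eq. (12)
    C_k = v * np.array([[Q[+1, +1], Q[+1, -1], R[+1, +1], R[+1, -1]],
                        [Q[+1, -1], Q[-1, -1], R[-1, +1], R[-1, -1]],
                        [R[+1, +1], R[-1, +1], 1.0,       0.0      ],
                        [R[+1, -1], R[-1, -1], 0.0,       1.0      ]])     # Eq. (16)

    xi = np.sqrt(v) * rng.standard_normal((20000, N)) + lam * B[k]         # samples from cluster k
    x = np.column_stack([xi @ w[+1], xi @ w[-1], xi @ B[+1], xi @ B[-1]])  # (h+, h-, b+, b-)
    print(np.allclose(x.mean(axis=0), mu_k, atol=0.05))                    # empirical mean vs. Eq. (12)
    print(np.allclose(np.cov(x, rowvar=False), C_k, atol=0.05))            # empirical cov. vs. Eq. (16)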

A.2 Differential Equations

For the training prescriptions considered here it is possible to use a particularly compact notation. The function g(l, σ^µ) as introduced in Eq. (1) can be written as

\[
g(l, \sigma) = a + b\, l\, \sigma
\]

where a, b ∈ ℝ and l, σ ∈ {+1, −1}. The WTA schemes listed in section 2.1 are recovered as follows:

I) a = 0, b = 1: LVQ1 with g(l, σ) = lσ = ±1

II) a = b = 1/2: LVQ+ with g(l, σ) = Θ(lσ) = 1 if l = σ, 0 else

III) a = 1, b = 0: VQ with g(l, σ) = 1.
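A brief check (our own, not part of the paper) confirms that the parametrization g(l, σ) = a + b l σ reproduces the three modulation functions on all four label combinations:

    rules = {"LVQ1": (0.0, 1.0), "LVQ+": (0.5, 0.5), "VQ": (1.0, 0.0)}   # (a, b)
    explicit = {"LVQ1": lambda l, s: l * s,
                "LVQ+": lambda l, s: 1.0 if l == s else 0.0,
                "VQ":   lambda l, s: 1.0}
    for name, (a, b) in rules.items():
        assert all(a + b * l * s == explicit[name](l, s)
                   for l in (+1, -1) for s in (+1, -1)), name
    print("g(l, sigma) = a + b*l*sigma reproduces LVQ1, LVQ+, and VQ")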


The recursion relations, Eq. (5), are to be averaged over the density of a new, independent input ξ drawn from the density (9). In the limit N → ∞ this yields a system of coupled differential equations in continuous learning time α = µ/N, as argued in the text. Exploiting the notations defined above, it can be formally expressed in the following way:

\[
\begin{aligned}
\frac{dR_{lm}}{d\alpha} &= \eta\,\Big[\,
a\big(\langle b_m \Theta_l\rangle - \langle\Theta_l\rangle R_{lm}\big)
+ b\,l\,\big(\langle\sigma\, b_m \Theta_l\rangle - \langle\sigma\,\Theta_l\rangle R_{lm}\big)
\Big] \\[1mm]
\frac{dQ_{lm}}{d\alpha} &= \eta\,\Big[\,
b\big( l\,\langle\sigma\, h_m \Theta_l\rangle - l\,\langle\sigma\,\Theta_l\rangle Q_{lm}
+ m\,\langle\sigma\, h_l \Theta_m\rangle - m\,\langle\sigma\,\Theta_m\rangle Q_{lm}\big) \\
&\qquad\; + a\big(\langle h_m \Theta_l\rangle - \langle\Theta_l\rangle Q_{lm}
+ \langle h_l \Theta_m\rangle - \langle\Theta_m\rangle Q_{lm}\big) \\
&\qquad\; + \delta_{lm}\,\eta\,\big(a^2 + b^2\, lm\big)
\Big(\sum_{\sigma=\pm 1} v_\sigma\, p_\sigma \langle\Theta_l\rangle_\sigma\Big)
+ \delta_{lm}\,\eta\, a b\,(l+m)
\Big(\sum_{\sigma=\pm 1} \sigma\, v_\sigma\, p_\sigma \langle\Theta_l\rangle_\sigma\Big)
\Big]
\end{aligned} \tag{17}
\]

where ⟨···⟩ = Σ_{k=±1} p_k ⟨···⟩_k. Note that the equations for Q_lm contain terms of order η², whereas the ODE for R_lm are linear in the learning rate.

A.2.1 Averages

We introduce the vectors and auxiliary quantities

\[
\alpha_l = (+2l,\, -2l,\, 0,\, 0)^{T}
\quad \text{and} \quad
\beta_l = l\,\big(Q_{+1+1} - Q_{-1-1}\big) \tag{18}
\]

which allow us to rewrite the Heaviside terms as

\[
\Theta_l = \Theta\big(Q_{-l-l} - 2h_{-l} - Q_{+l+l} + 2h_{+l}\big)
= \Theta\big(\alpha_l \cdot x - \beta_l\big). \tag{19}
\]

Performing the averages in (17) involves conditional means of the form ⟨(x)_n Θ_s⟩_k and ⟨Θ_s⟩_k, where (x)_n is the nth component of the vector x = (h_{+1}, h_{−1}, b_{+1}, b_{−1}). We first address the term

\[
\begin{aligned}
\langle (x)_n \Theta_s \rangle_k
&= \frac{1}{(2\pi)^{2}\,(\det C_k)^{1/2}} \int_{\mathbb{R}^4} (x)_n\,
\Theta\!\left(\alpha_s \cdot x - \beta_s\right)
\exp\!\left(-\tfrac{1}{2}\,(x - \mu_k)^{T} C_k^{-1} (x - \mu_k)\right) dx \\
&= \frac{1}{(2\pi)^{2}\,(\det C_k)^{1/2}} \int_{\mathbb{R}^4} \left(x' + \mu_k\right)_n\,
\Theta\!\left(\alpha_s \cdot x' + \alpha_s \cdot \mu_k - \beta_s\right)
\exp\!\left(-\tfrac{1}{2}\, x'^{T} C_k^{-1} x'\right) dx'
\end{aligned}
\]

with the substitution x′ = x − µ_k.

Let x′ = C_k^{1/2} y, where C_k^{1/2} is defined by C_k = C_k^{1/2} C_k^{1/2}. Since C_k is a covariance matrix, it is positive semidefinite and C_k^{1/2} exists. Hence we have dx′ = det(C_k^{1/2}) dy = (det C_k)^{1/2} dy and

\[
\begin{aligned}
\langle (x)_n \Theta_s \rangle_k
&= \frac{1}{(2\pi)^{2}} \int_{\mathbb{R}^4} \big(C_k^{1/2} y\big)_n\,
\Theta\!\left(\alpha_s \cdot C_k^{1/2} y + \alpha_s \cdot \mu_k - \beta_s\right)
\exp\!\left(-\tfrac{1}{2}\, y^2\right) dy \;+\; (\mu_k)_n\,\langle \Theta_s \rangle_k \\
&= \frac{1}{(2\pi)^{2}} \int_{\mathbb{R}^4} \sum_{j=1}^{4} \big(C_k^{1/2}\big)_{nj} (y)_j\,
\Theta\!\left(\alpha_s \cdot C_k^{1/2} y + \alpha_s \cdot \mu_k - \beta_s\right)
\exp\!\left(-\tfrac{1}{2}\, y^2\right) dy \;+\; (\mu_k)_n\,\langle \Theta_s \rangle_k \\
&= I + (\mu_k)_n\,\langle \Theta_s \rangle_k \qquad \text{(with the abbreviation } I\text{)}
\end{aligned} \tag{20}
\]

Consider the integrals contributing to I:

\[
I_j = \int_{\mathbb{R}} \big(C_k^{1/2}\big)_{nj} (y)_j\,
\Theta\!\left(\alpha_s \cdot C_k^{1/2} y + \alpha_s \cdot \mu_k - \beta_s\right)
\exp\!\left(-\tfrac{1}{2}\,(y)_j^2\right) d(y)_j .
\]

We can perform an integration by parts, ∫ u dv = uv − ∫ v du, with

\[
\begin{aligned}
u &= \Theta\!\left(\alpha_s \cdot C_k^{1/2} y + \alpha_s \cdot \mu_k - \beta_s\right), &
v &= \big(C_k^{1/2}\big)_{nj} \exp\!\left(-\tfrac{1}{2}\,(y)_j^2\right), \\
du &= \frac{\partial}{\partial (y)_j}\,
\Theta\!\left(\alpha_s \cdot C_k^{1/2} y + \alpha_s \cdot \mu_k - \beta_s\right) d(y)_j, &
dv &= -\big(C_k^{1/2}\big)_{nj}\, (y)_j \exp\!\left(-\tfrac{1}{2}\,(y)_j^2\right) d(y)_j,
\end{aligned}
\]

and obtain

\[
\begin{aligned}
I_j &= -\underbrace{\left[\Theta\!\left(\alpha_s \cdot C_k^{1/2} y + \alpha_s \cdot \mu_k - \beta_s\right)
\big(C_k^{1/2}\big)_{nj} \exp\!\left(-\tfrac{1}{2}\,(y)_j^2\right)\right]_{-\infty}^{+\infty}}_{=\,0} \\
&\quad + \int_{\mathbb{R}} \big(C_k^{1/2}\big)_{nj}\,
\frac{\partial}{\partial (y)_j}\,
\Theta\!\left(\alpha_s \cdot C_k^{1/2} y + \alpha_s \cdot \mu_k - \beta_s\right)
\exp\!\left(-\tfrac{1}{2}\,(y)_j^2\right) d(y)_j .
\end{aligned} \tag{21}
\]

The sum over j gives

\[
\begin{aligned}
I &= \frac{1}{(2\pi)^{2}} \sum_{j=1}^{4} \big(C_k^{1/2}\big)_{nj}
\int_{\mathbb{R}^4} \frac{\partial}{\partial (y)_j}\,
\Theta\!\left(\alpha_s \cdot C_k^{1/2} y + \alpha_s \cdot \mu_k - \beta_s\right)
\exp\!\left(-\tfrac{1}{2}\, y^2\right) dy \\
&= \frac{1}{(2\pi)^{2}} \sum_{j=1}^{4} \Big( \big(C_k^{1/2}\big)_{nj}
\sum_{i=1}^{4} (\alpha_s)_i \big(C_k^{1/2}\big)_{ij} \Big)
\int_{\mathbb{R}^4} \delta\!\left(\alpha_s \cdot C_k^{1/2} y + \alpha_s \cdot \mu_k - \beta_s\right)
\exp\!\left(-\tfrac{1}{2}\, y^2\right) dy \\
&= \frac{1}{(2\pi)^{2}}\, \big(C_k\, \alpha_s\big)_n
\int_{\mathbb{R}^4} \delta\!\left(\alpha_s \cdot C_k^{1/2} y + \alpha_s \cdot \mu_k - \beta_s\right)
\exp\!\left(-\tfrac{1}{2}\, y^2\right) dy .
\end{aligned} \tag{22}
\]

In the last step we have used

\[
\frac{\partial}{\partial (y)_j}\,
\Theta\!\left(\alpha_s \cdot C_k^{1/2} y + \alpha_s \cdot \mu_k - \beta_s\right)
= \sum_{i=1}^{4} (\alpha_s)_i \big(C_k^{1/2}\big)_{ij}\,
\delta\!\left(\alpha_s \cdot C_k^{1/2} y + \alpha_s \cdot \mu_k - \beta_s\right),
\]

where δ(·) is the Dirac delta function, together with the symmetry of C_k^{1/2}, which yields Σ_j (C_k^{1/2})_{nj} Σ_i (α_s)_i (C_k^{1/2})_{ij} = (C_k α_s)_n.

Now, note that exp[−½ y²] dy is a measure which is invariant under rotations of the coordinate axes. We rotate the system in such a way that one of the axes, say y, is aligned with the vector C_k^{1/2} α_s. The remaining three coordinates can be integrated over and we get

\[
I = \frac{1}{\sqrt{2\pi}}\, \big(C_k\, \alpha_s\big)_n
\int_{\mathbb{R}} \delta\!\left(\big\|C_k^{1/2}\alpha_s\big\|\, y + \alpha_s \cdot \mu_k - \beta_s\right)
\exp\!\left[-\tfrac{1}{2}\, y^2\right] dy . \tag{23}
\]

We define

\[
\alpha_{sk} = \big\|C_k^{1/2}\alpha_s\big\| = \sqrt{\alpha_s \cdot C_k\, \alpha_s}
\quad \text{and} \quad
\beta_{sk} = \alpha_s \cdot \mu_k - \beta_s \tag{24}
\]

and obtain

\[
\begin{aligned}
I &= \frac{1}{\sqrt{2\pi}}\, \big(C_k\, \alpha_s\big)_n
\int_{\mathbb{R}} \delta\!\left(\alpha_{sk}\, y + \beta_{sk}\right)
\exp\!\left[-\tfrac{1}{2}\, y^2\right] dy \\
&= \frac{\big(C_k\, \alpha_s\big)_n}{\sqrt{2\pi}\,\alpha_{sk}}
\int_{\mathbb{R}} \delta\!\left(z + \beta_{sk}\right)
\exp\!\left[-\tfrac{1}{2}\Big(\frac{z}{\alpha_{sk}}\Big)^{2}\right] dz
\qquad (\text{with } z = \alpha_{sk}\, y) \\
&= \frac{\big(C_k\, \alpha_s\big)_n}{\sqrt{2\pi}\,\alpha_{sk}}
\exp\!\left[-\tfrac{1}{2}\Big(\frac{\beta_{sk}}{\alpha_{sk}}\Big)^{2}\right].
\end{aligned} \tag{25}
\]

Now we compute the remaining average in (20) in an analogous way and get

\[
\langle \Theta_s \rangle_k
= \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} \Theta\!\left(\alpha_{sk}\, y + \beta_{sk}\right)
\exp\!\left[-\tfrac{1}{2}\, y^2\right] dy
= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\beta_{sk}/\alpha_{sk}}
\exp\!\left[-\tfrac{1}{2}\, y^2\right] dy
= \Phi\!\left(\frac{\beta_{sk}}{\alpha_{sk}}\right), \tag{26}
\]

where Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−z²/2} dz.

Finally, we obtain the required average using (25) and (26) as follows:

\[
\langle (x)_n \Theta_s \rangle_k
= \frac{\big(C_k\, \alpha_s\big)_n}{\sqrt{2\pi}\,\alpha_{sk}}
\exp\!\left[-\tfrac{1}{2}\Big(\frac{\beta_{sk}}{\alpha_{sk}}\Big)^{2}\right]
+ (\mu_k)_n\, \Phi\!\left(\frac{\beta_{sk}}{\alpha_{sk}}\right). \tag{27}
\]

The quantities α_{sk} and β_{sk} are defined through Eqs. (24) and (18).
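Equations (26) and (27) lend themselves to a quick numerical check. The sketch below is our own illustration with arbitrary example values of the order parameters (chosen so that the matrix of the form (16) is positive definite); sampling x from N(µ_k, C_k) and averaging reproduces both closed forms.

    import numpy as np
    from math import erf, sqrt, pi, exp

    Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))

    # Example order parameters and model constants (illustrative values only).
    lam, v_k, k = 1.0, 1.0, +1
    R = {(+1, +1): 0.4, (+1, -1): 0.1, (-1, +1): 0.2, (-1, -1): 0.5}
    Q = {(+1, +1): 1.0, (-1, -1): 1.2, (+1, -1): 0.3}

    mu_k = np.array([lam * R[+1, k], lam * R[-1, k],
                     lam * float(k == +1), lam * float(k == -1)])      # Eq. (12)
    C_k = v_k * np.array([[Q[+1, +1], Q[+1, -1], R[+1, +1], R[+1, -1]],
                          [Q[+1, -1], Q[-1, -1], R[-1, +1], R[-1, -1]],
                          [R[+1, +1], R[-1, +1], 1.0,       0.0      ],
                          [R[+1, -1], R[-1, -1], 0.0,       1.0      ]])  # Eq. (16)

    s = +1
    alpha_s = np.array([2.0 * s, -2.0 * s, 0.0, 0.0])                  # Eq. (18)
    beta_s = s * (Q[+1, +1] - Q[-1, -1])
    alpha_sk = sqrt(alpha_s @ C_k @ alpha_s)                           # Eq. (24)
    beta_sk = alpha_s @ mu_k - beta_s

    theta_avg = Phi(beta_sk / alpha_sk)                                # Eq. (26)
    gauss = exp(-0.5 * (beta_sk / alpha_sk) ** 2) / (sqrt(2.0 * pi) * alpha_sk)
    x_theta_avg = (C_k @ alpha_s) * gauss + mu_k * theta_avg           # Eq. (27)

    rng = np.random.default_rng(2)
    x = rng.multivariate_normal(mu_k, C_k, size=400000)
    ind = (x @ alpha_s > beta_s).astype(float)                         # Theta_s(x)
    print(ind.mean(), theta_avg)                                       # sample vs. Eq. (26)
    print((x * ind[:, None]).mean(axis=0), x_theta_avg)                # sample vs. Eq. (27)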

A.2.2 Final form of the differential equations

Using (26) and (27), the system of differential equations reads

\[
\begin{aligned}
\frac{dR_{lm}}{d\alpha} = \eta\,\Bigg[\;
& b\,l \left( \sum_{\sigma=\pm 1} \sigma\, p_{\sigma}
\left[ \frac{(C_{\sigma}\alpha_l)_{n_{b_m}}}{\sqrt{2\pi}\,\alpha_{l\sigma}}
\exp\!\Big[-\tfrac{1}{2}\Big(\tfrac{\beta_{l\sigma}}{\alpha_{l\sigma}}\Big)^2\Big]
+ (\mu_{\sigma})_{n_{b_m}}\,\Phi\!\Big(\tfrac{\beta_{l\sigma}}{\alpha_{l\sigma}}\Big) \right]
- \sum_{\sigma=\pm 1} \sigma\, p_{\sigma}\,\Phi\!\Big(\tfrac{\beta_{l\sigma}}{\alpha_{l\sigma}}\Big)\, R_{lm} \right) \\
& + a \left( \sum_{\sigma=\pm 1} p_{\sigma}
\left[ \frac{(C_{\sigma}\alpha_l)_{n_{b_m}}}{\sqrt{2\pi}\,\alpha_{l\sigma}}
\exp\!\Big[-\tfrac{1}{2}\Big(\tfrac{\beta_{l\sigma}}{\alpha_{l\sigma}}\Big)^2\Big]
+ (\mu_{\sigma})_{n_{b_m}}\,\Phi\!\Big(\tfrac{\beta_{l\sigma}}{\alpha_{l\sigma}}\Big) \right]
- \sum_{\sigma=\pm 1} p_{\sigma}\,\Phi\!\Big(\tfrac{\beta_{l\sigma}}{\alpha_{l\sigma}}\Big)\, R_{lm} \right)
\Bigg] \tag{28}
\end{aligned}
\]

\[
\begin{aligned}
\frac{dQ_{lm}}{d\alpha} = \eta\,\Bigg(\;
& b\,l \sum_{\sigma=\pm 1} \sigma\, p_{\sigma}
\left[ \frac{(C_{\sigma}\alpha_l)_{n_{h_m}}}{\sqrt{2\pi}\,\alpha_{l\sigma}}
\exp\!\Big[-\tfrac{1}{2}\Big(\tfrac{\beta_{l\sigma}}{\alpha_{l\sigma}}\Big)^2\Big]
+ (\mu_{\sigma})_{n_{h_m}}\,\Phi\!\Big(\tfrac{\beta_{l\sigma}}{\alpha_{l\sigma}}\Big) \right]
- b\,l \sum_{\sigma=\pm 1} \sigma\, p_{\sigma}\,\Phi\!\Big(\tfrac{\beta_{l\sigma}}{\alpha_{l\sigma}}\Big)\, Q_{lm} \\
& + b\,m \sum_{\sigma=\pm 1} \sigma\, p_{\sigma}
\left[ \frac{(C_{\sigma}\alpha_m)_{n_{h_l}}}{\sqrt{2\pi}\,\alpha_{m\sigma}}
\exp\!\Big[-\tfrac{1}{2}\Big(\tfrac{\beta_{m\sigma}}{\alpha_{m\sigma}}\Big)^2\Big]
+ (\mu_{\sigma})_{n_{h_l}}\,\Phi\!\Big(\tfrac{\beta_{m\sigma}}{\alpha_{m\sigma}}\Big) \right]
- b\,m \sum_{\sigma=\pm 1} \sigma\, p_{\sigma}\,\Phi\!\Big(\tfrac{\beta_{m\sigma}}{\alpha_{m\sigma}}\Big)\, Q_{lm} \\
& + a \sum_{\sigma=\pm 1} p_{\sigma}
\left[ \frac{(C_{\sigma}\alpha_l)_{n_{h_m}}}{\sqrt{2\pi}\,\alpha_{l\sigma}}
\exp\!\Big[-\tfrac{1}{2}\Big(\tfrac{\beta_{l\sigma}}{\alpha_{l\sigma}}\Big)^2\Big]
+ (\mu_{\sigma})_{n_{h_m}}\,\Phi\!\Big(\tfrac{\beta_{l\sigma}}{\alpha_{l\sigma}}\Big) \right]
- a \sum_{\sigma=\pm 1} p_{\sigma}\,\Phi\!\Big(\tfrac{\beta_{l\sigma}}{\alpha_{l\sigma}}\Big)\, Q_{lm} \\
& + a \sum_{\sigma=\pm 1} p_{\sigma}
\left[ \frac{(C_{\sigma}\alpha_m)_{n_{h_l}}}{\sqrt{2\pi}\,\alpha_{m\sigma}}
\exp\!\Big[-\tfrac{1}{2}\Big(\tfrac{\beta_{m\sigma}}{\alpha_{m\sigma}}\Big)^2\Big]
+ (\mu_{\sigma})_{n_{h_l}}\,\Phi\!\Big(\tfrac{\beta_{m\sigma}}{\alpha_{m\sigma}}\Big) \right]
- a \sum_{\sigma=\pm 1} p_{\sigma}\,\Phi\!\Big(\tfrac{\beta_{m\sigma}}{\alpha_{m\sigma}}\Big)\, Q_{lm}
\Bigg) \\
& + \delta_{lm}\,\eta^2\,\big(a^2 + b^2\, lm\big)
\sum_{\sigma=\pm 1} v_{\sigma}\, p_{\sigma}\,\Phi\!\Big(\tfrac{\beta_{l\sigma}}{\alpha_{l\sigma}}\Big)
+ \delta_{lm}\,\eta^2\, a\, b\,(l+m)
\sum_{\sigma=\pm 1} \sigma\, v_{\sigma}\, p_{\sigma}\,\Phi\!\Big(\tfrac{\beta_{l\sigma}}{\alpha_{l\sigma}}\Big). \tag{29}
\end{aligned}
\]

Here n_{b_m} denotes the component index of b_m in x, i.e. n_{b_m} = 3 if m = +1 and n_{b_m} = 4 if m = −1; correspondingly, n_{h_m} = 1 if m = +1 and n_{h_m} = 2 if m = −1. The quantities α_{lσ}, β_{lσ} are those of Eq. (24), evaluated with the covariance matrix C_σ and mean µ_σ of cluster σ.

A.3 The generalization error

Using (26) we can easily compute the generalization error as follows:

\[
\varepsilon_g = \sum_{k=\pm 1} p_{-k}\, \langle \Theta_{k} \rangle_{-k}
= \sum_{k=\pm 1} p_{-k}\, \Phi\!\left(\frac{\beta_{k,-k}}{\alpha_{k,-k}}\right), \tag{30}
\]

which amounts to Eq. (8) in the text after inserting β_{k,−k} and α_{k,−k} as given above with v+ = v− = 1.
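As a final consistency check (our own sketch, with arbitrary example values of the order parameters), evaluating Eq. (30) via the quantities of Eqs. (12), (16), (18), and (24) reproduces the compact form (8) for v+ = v− = 1:

    import numpy as np
    from math import erf, sqrt

    Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))

    lam, p = 1.2, {+1: 0.7, -1: 0.3}
    R = {(+1, +1): 0.9, (+1, -1): 0.1, (-1, +1): 0.2, (-1, -1): 0.8}
    Q = {(+1, +1): 1.1, (-1, -1): 0.9, (+1, -1): 0.2}

    def eps_g_eq8():
        denom = 2.0 * sqrt(Q[+1, +1] + Q[-1, -1] - 2.0 * Q[+1, -1])
        return sum(p[S] * Phi((Q[S, S] - Q[-S, -S]
                               - 2.0 * lam * (R[S, S] - R[-S, S])) / denom)
                   for S in (+1, -1))

    def eps_g_eq30():
        eps = 0.0
        for k in (+1, -1):
            mu = np.array([lam * R[+1, -k], lam * R[-1, -k],
                           lam * float(-k == +1), lam * float(-k == -1)])   # Eq. (12), cluster -k
            C = np.array([[Q[+1, +1], Q[+1, -1], R[+1, +1], R[+1, -1]],
                          [Q[+1, -1], Q[-1, -1], R[-1, +1], R[-1, -1]],
                          [R[+1, +1], R[-1, +1], 1.0,       0.0      ],
                          [R[+1, -1], R[-1, -1], 0.0,       1.0      ]])    # Eq. (16), v = 1
            alpha = np.array([2.0 * k, -2.0 * k, 0.0, 0.0])                 # Eq. (18)
            beta = k * (Q[+1, +1] - Q[-1, -1])
            eps += p[-k] * Phi((alpha @ mu - beta) / sqrt(alpha @ C @ alpha))  # Eq. (30)
        return eps

    print(eps_g_eq8(), eps_g_eq30())   # identical up to rounding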
