

    Learning Vector Quantization: The Dynamics of

    Winner-Takes-All Algorithms

    Michael Biehl1, Anarta Ghosh1, and Barbara Hammer2

    1- Rijksuniversiteit Groningen - Mathematics and Computing Science

    P.O. Box 800, NL-9700 AV Groningen - The Netherlands

    2- Clausthal University of Technology - Institute of Computer Science

D-38678 Clausthal-Zellerfeld - Germany

    Abstract

Winner-Takes-All (WTA) prescriptions for Learning Vector Quantization (LVQ) are studied in the framework of a model situation: two competing prototype vectors are updated according to a sequence of example data drawn from a mixture of Gaussians. The theory of on-line learning allows for an exact mathematical description of the training dynamics, even if an underlying cost function cannot be identified. We compare the typical behavior of several WTA schemes including basic LVQ and unsupervised Vector Quantization. The focus is on the learning curves, i.e. the achievable generalization ability as a function of the number of training examples.

Keywords: Learning Vector Quantization, prototype-based classification, Winner-Takes-All algorithms, on-line learning, competitive learning

    1 Introduction

Learning Vector Quantization (LVQ) as originally proposed by Kohonen [10] is a widely used approach to classification. It is applied in a variety of practical problems, including medical image and data analysis, e.g. proteomics, classification of satellite spectral data, fault detection in technical processes, and language recognition, to name only a few. An overview and further references can be obtained from [1].

LVQ procedures are easy to implement and intuitively clear. The classification of data is based on a comparison with a number of so-called prototype vectors. The similarity is frequently measured in terms of Euclidean


distance in feature space. Prototypes are determined in a training phase from labeled examples and can be interpreted in a straightforward way as they directly represent typical data in the same space. This is in contrast with, say, adaptive weights in feedforward neural networks or support vector machines, which do not allow for immediate interpretation as easily. Among the most attractive features of LVQ is the natural way in which it can be applied to multi-class problems.

In the simplest, so-called hard or crisp schemes, any feature vector is assigned to the closest of all prototypes and the corresponding class. In general, several prototypes will be used to represent each class. Extensions of the deterministic assignment to a probabilistic soft classification are conceptually straightforward but will not be considered here.

Plausible training prescriptions exist which employ the concept of on-line competitive learning: prototypes are updated according to their distance from a given example in a sequence of training data. Schemes in which only the winner, i.e. the currently closest prototype, is updated have been termed Winner-Takes-All (WTA) algorithms, and we will concentrate on this class of prescriptions here.

The ultimate goal of the training process is, of course, to find a classifier which labels novel data correctly with high probability after training. This so-called generalization ability will be the focus of our analysis in the following.

Several modifications and potential improvements of Kohonen's original LVQ procedure have been suggested. They aim at achieving better approximations of Bayes optimal decision boundaries, faster or more robust convergence, or the incorporation of more flexible metrics, to name a few examples [5, 8-10, 16].

Many of the suggested algorithms are based on plausible but purely heuristic arguments, and they often lack a thorough theoretical understanding. Other procedures can be derived from an underlying cost function, such as Generalized Relevance LVQ [8, 9] or LVQ2.1, the latter being a limit case of a statistical model [14, 15]. However, the connection of the cost functions with the ultimate goal of training, i.e. the generalization ability, is often unclear. Furthermore, several learning rules display instabilities and divergent behavior and require modifications such as the window rule for LVQ2.1 [10].

Clearly, a better theoretical understanding of the training algorithms should be helpful in improving their performance and in designing novel, more efficient schemes.

In this work we employ a theoretical framework which makes a systematic investigation and comparison of LVQ training procedures possible. We


consider on-line training from a sequence of uncorrelated, random training data which is generated according to a model distribution. Its purpose is to define a non-trivial structure of data and to facilitate our analytical approach. We would like to point out, however, that the training algorithms do not make use of the form of this distribution as, for instance, density estimation schemes would.

The dynamics of training is studied by applying the successful theory of on-line learning [3, 6, 12], which relates to ideas and concepts known from Statistical Physics. The essential ingredients of the approach are (1) the consideration of high-dimensional data and large systems in the so-called thermodynamic limit and (2) the evaluation of averages over the randomness or disorder contained in the sequence of examples. The typical properties of large systems are fully described by only a few characteristic quantities. Under simplifying assumptions, the evolution of these so-called order parameters is given by deterministic coupled ordinary differential equations (ODE) which describe the dynamics of on-line learning exactly in the thermodynamic limit. For reviews of this very successful approach to the investigation of machine learning processes consult, for instance, [3, 6, 17].

The formalism enables us to compare the dynamics and generalization ability of different WTA schemes including basic LVQ1 and unsupervised Vector Quantization (VQ). The analysis can readily be extended to more general schemes, approaching the ultimate goal of designing novel and efficient LVQ training algorithms with precise mathematical foundations.

The paper is organized as follows: in the next section we introduce the model, i.e. the specific learning scenarios and the assumed statistical properties of the training data. The mathematical description of the training dynamics is briefly summarized in section 3; technical details are given in an appendix. Results concerning the WTA schemes are presented and discussed in section 4, and we conclude with a summary and an outlook on forthcoming projects.

    2 The model

2.1 Winner-Takes-All algorithms

We study situations in which input vectors $\xi \in \mathbb{R}^N$ belong to one of two possible classes denoted as $\sigma = \pm 1$. Here, we restrict ourselves to the case of two prototype vectors $w_S$ where the label $S = \pm 1$ (or $\pm$ for short) corresponds to the represented class.


In all WTA schemes, the squared Euclidean distances

$$ d_S(\xi) = (\xi - w_S)^2 $$

are evaluated for $S = \pm 1$, and the vector $\xi$ is assigned to class $+1$ if $d_+ < d_-$.

We investigate incremental learning schemes in which a sequence of single, uncorrelated examples $\{\xi^\mu, \sigma^\mu\}$ is presented to the system. The analysis can be applied to a larger variety of algorithms, but here we treat only updates of the form

$$ w_S^\mu = w_S^{\mu-1} + \Delta w_S^\mu \quad \text{with} \quad \Delta w_S^\mu = \frac{\eta}{N}\, \Theta_S\, g(S, \sigma^\mu)\, \left( \xi^\mu - w_S^{\mu-1} \right). \qquad (1) $$

Here, the vector $w_S^\mu$ denotes the prototype after presentation of $\mu$ examples, and the learning rate $\eta$ is rescaled with the vector dimension $N$. The Heaviside term

$$ \Theta_S = \Theta(d_{-S} - d_{+S}) \quad \text{with} \quad \Theta(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{else} \end{cases} $$

singles out the current prototype $w_S^{\mu-1}$ which is closest to the new input $\xi^\mu$ in the sense of the measure $d_S = (w_S^{\mu-1} - \xi^\mu)^2$. In this formulation, only the winner, say, $w_S$, can be updated whereas the loser $w_{-S}$ remains unchanged. The change of the winner is always along the direction $\pm(\xi^\mu - w_S^{\mu-1})$. The function $g(S, \sigma^\mu)$ further specifies the update rule. Here, we focus on three special cases of WTA learning:

I) LVQ1: $g(S, \sigma) = S\sigma = +1$ (resp. $-1$) for $S = \sigma$ (resp. $S \neq \sigma$). This extension of competitive learning to labeled data corresponds to Kohonen's original LVQ1. The update is towards $\xi^\mu$ if the example belongs to the class represented by the winning prototype, the correct winner. On the contrary, a wrong winner is moved away from the current input.

II) LVQ+: $g(S, \sigma) = \Theta(S\sigma) = +1$ (resp. $0$) for $S = \sigma$ (resp. $S \neq \sigma$). In this scheme the update is non-zero only for a correct winner and, then, always positive. Hence, a prototype $w_S$ can only accumulate updates from its own class $\sigma = S$. We will use the abbreviation LVQ+ for this prescription.

III) VQ: $g(S, \sigma) = 1$. This update rule disregards the actual data label and always moves the winner towards the example input. It corresponds to unsupervised Vector Quantization and aims at finding prototypes which yield a good representation of the data in the sense of Euclidean distances. The choice $g(S, \sigma) = 1$ can also be interpreted as describing two prototypes which represent the same class and compete for updates from examples of this very class, only.
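For illustration, the single WTA step of Eq. (1) with the three choices of $g$ can be sketched in a few lines. This is our minimal re-implementation, not code from the paper; the names `wta_update` and `G_RULES` are ours.

```python
import numpy as np

# The modulation factor g(S, sigma) for the three WTA schemes of section 2.1.
G_RULES = {
    "LVQ1": lambda S, sigma: float(S * sigma),   # +1 correct winner, -1 wrong winner
    "LVQ+": lambda S, sigma: float(S == sigma),  # update only the correct winner
    "VQ":   lambda S, sigma: 1.0,                # ignore the label entirely
}

def wta_update(w, xi, sigma, eta, rule="LVQ1"):
    """One step of Eq. (1): w is a dict {+1: w_plus, -1: w_minus},
    xi the current example, sigma its label, eta the learning rate.
    Only the winner (smaller squared Euclidean distance) is moved."""
    N = xi.shape[0]
    d = {S: np.sum((xi - w[S]) ** 2) for S in (+1, -1)}
    winner = +1 if d[+1] < d[-1] else -1
    g = G_RULES[rule](winner, sigma)
    w[winner] = w[winner] + (eta / N) * g * (xi - w[winner])
    return w, winner
```

Note that for LVQ1 a correct winner is pulled towards $\xi$ and a wrong winner pushed away, while for LVQ+ a wrong winner is simply left unchanged, exactly as described above.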


Figure 1: Data as generated according to the model density (3) in $N = 200$ dimensions with $p_+ = 0.6$, $p_- = 0.4$. Open (filled) circles represent 240 (160) vectors from clusters centered about orthonormal vectors $B_+$ ($B_-$) with $\lambda = 1.5$. The left panel shows the projections $h_\pm = w_\pm \cdot \xi$ of the data on a randomly chosen pair of orthogonal unit vectors $w_\pm$, whereas the right panel displays $b_\pm = B_\pm \cdot \xi$, i.e. the plane spanned by $B_+, B_-$; crosses mark the position of the cluster centers.

Note that the VQ procedure (III) can readily be formulated as a stochastic gradient descent with respect to the quantization error

$$ e(\xi) = \sum_{S=\pm 1} \frac{1}{2} \left( \xi - w_S^{\mu-1} \right)^2 \Theta(d_{-S} - d_{+S}), \qquad (2) $$

see e.g. [7] for details. While intuitively clear and well motivated, LVQ1 (I) and LVQ+ (II) lack such a straightforward interpretation and relation to a cost function.

    2.2 The model data

In order to analyse the behavior of the above algorithms we assume that randomized data is generated according to a model distribution $P(\xi)$. As a simple yet non-trivial situation we consider input data drawn from a binary mixture of Gaussian clusters

$$ P(\xi) = \sum_{\sigma=\pm 1} p_\sigma\, P(\xi \mid \sigma) \quad \text{with} \quad P(\xi \mid \sigma) = \frac{1}{(2\pi)^{N/2}} \exp\left[ -\frac{1}{2} \left( \xi - \lambda B_\sigma \right)^2 \right], \qquad (3) $$


where the weights $p_\sigma$ correspond to the prior class membership probabilities and $p_+ + p_- = 1$. Clusters are centered about $\lambda B_+$ and $\lambda B_-$, respectively. We assume that $B_l \cdot B_m = \delta_{lm}$, i.e. $B_\pm^2 = 1$ and $B_+ \cdot B_- = 0$. The latter condition fixes the location of the cluster centers with respect to the origin, while the parameter $\lambda$ controls their separation.

We consider the case where the cluster membership $\sigma$ coincides with the class label of the data. The corresponding classification scheme is not linearly separable because the Gaussian contributions $P(\xi \mid \sigma)$ overlap. According to Eq. (3), a vector $\xi$ consists of statistically independent components with unit variance. Denoting the average over $P(\xi \mid \sigma)$ by $\langle \cdots \rangle_\sigma$, we have, for instance, $\langle \xi_j \rangle_\sigma = \lambda (B_\sigma)_j$ for a component, and correspondingly

$$ \langle \xi^2 \rangle_\sigma = \sum_{j=1}^{N} \langle \xi_j^2 \rangle_\sigma = \sum_{j=1}^{N} \left[ 1 + \lambda^2 (B_\sigma)_j^2 \right] = N + \lambda^2. $$

Averages over the full $P(\xi)$ will be written as $\langle \cdots \rangle = \sum_{\sigma=\pm 1} p_\sigma \langle \cdots \rangle_\sigma$. Note that in high dimensions, i.e. for large $N$, the Gaussians overlap significantly. The cluster structure of the data becomes apparent only when projected into the plane spanned by $\{B_+, B_-\}$, whereas projections in a randomly chosen two-dimensional subspace would overlap completely. Figure 1 illustrates the situation for Monte Carlo data in $N = 200$ dimensions. In an attempt to learn the classification scheme, the relevant directions $B_\pm \in \mathbb{R}^N$ have to be identified to a certain extent. Obviously this task becomes highly non-trivial for large $N$.
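A sampler for the model density (3) is straightforward to write down; the sketch below is ours (the paper provides no code), with $B_\pm$ chosen as the first two canonical basis vectors, which is one valid orthonormal pair among many.

```python
import numpy as np

# Draw n examples from the binary Gaussian mixture of Eq. (3):
# cluster centers at lam*B_plus and lam*B_minus, unit-variance components.
def sample_mixture(n, N, p_plus=0.6, lam=1.5, rng=None):
    rng = np.random.default_rng(rng)
    B = np.zeros((2, N))
    B[0, 0] = 1.0   # B_plus  = e_1 (any orthonormal pair works)
    B[1, 1] = 1.0   # B_minus = e_2
    sigma = np.where(rng.random(n) < p_plus, 1, -1)          # class labels
    centers = np.where(sigma[:, None] == 1, lam * B[0], lam * B[1])
    xi = centers + rng.standard_normal((n, N))               # add unit-variance noise
    return xi, sigma, B
```

A useful sanity check is the second moment derived above: the mean of $\xi^2$ over many samples should approach $N + \lambda^2$, and the class-conditional mean of $b_+ = B_+ \cdot \xi$ should approach $\lambda$ for class $+1$.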

    3 The dynamics of learning

The following analysis is along the lines of on-line learning; see e.g. [3, 6, 12] for a comprehensive overview and example applications. In this section we briefly outline the treatment of WTA algorithms in LVQ and refer to appendix A for technical details.

The actual configuration of prototypes is characterized by the projections

$$ R_{S\sigma}^{\mu} = w_S^{\mu} \cdot B_\sigma \quad \text{and} \quad Q_{ST}^{\mu} = w_S^{\mu} \cdot w_T^{\mu}, \quad \text{for } S, T, \sigma = \pm 1. \qquad (4) $$

The self-overlaps $Q_{++}$ and $Q_{--}$ specify the lengths of the vectors $w_\pm$, whereas the remaining five overlaps correspond to projections, i.e. angles, between $w_+$ and $w_-$ and between the prototypes and the center vectors $B_\pm$.


The algorithm (1) directly implies recursions for the above defined quantities upon presentation of a novel example $\xi^\mu$:

$$ N \left( R_{S\sigma}^{\mu} - R_{S\sigma}^{\mu-1} \right) = \eta\, \Theta_S\, g(S, \sigma^\mu) \left( b_\sigma - R_{S\sigma}^{\mu-1} \right) $$

$$ N \left( Q_{ST}^{\mu} - Q_{ST}^{\mu-1} \right) = \eta\, \Theta_S\, g(S, \sigma^\mu) \left( h_T - Q_{ST}^{\mu-1} \right) + \eta\, \Theta_T\, g(T, \sigma^\mu) \left( h_S - Q_{ST}^{\mu-1} \right) + \eta^2\, \Theta_S \Theta_T\, g(S, \sigma^\mu)\, g(T, \sigma^\mu) + O(1/N). \qquad (5) $$

Here, the actual input $\xi^\mu$ enters only through the projections

$$ h_S = w_S^{\mu-1} \cdot \xi^\mu \quad \text{and} \quad b_\sigma = B_\sigma \cdot \xi^\mu. \qquad (6) $$

Note in this context that $\Theta_S = \Theta\!\left( Q_{-S-S}^{\mu-1} - 2 h_{-S} - Q_{SS}^{\mu-1} + 2 h_S \right)$ also does not depend on $\xi^\mu$ explicitly, as the term $(\xi^\mu)^2$ cancels in the difference of distances.

An important assumption is that all examples in the training sequence are independently drawn from the model distribution and, hence, are uncorrelated with previous data and with $w_\pm^{\mu-1}$. As a consequence, the statistics of the projections (6) are well known for large $N$. By means of the Central Limit Theorem their joint density becomes a mixture of Gaussians, which is fully specified by the corresponding conditional first and second moments:

$$ \langle h_S \rangle_\sigma = \lambda\, R_{S\sigma}^{\mu-1}, \qquad \langle b_\tau \rangle_\sigma = \lambda\, \delta_{\tau\sigma}, $$
$$ \langle h_S h_T \rangle_\sigma - \langle h_S \rangle_\sigma \langle h_T \rangle_\sigma = Q_{ST}^{\mu-1}, $$
$$ \langle h_S b_\tau \rangle_\sigma - \langle h_S \rangle_\sigma \langle b_\tau \rangle_\sigma = R_{S\tau}^{\mu-1}, \qquad \langle b_\rho b_\tau \rangle_\sigma - \langle b_\rho \rangle_\sigma \langle b_\tau \rangle_\sigma = \delta_{\rho\tau}, \qquad (7) $$

see the appendix for derivations. This observation enables us to perform an average of the recursions with respect to the latest example data in terms of Gaussian integrations. Details of the calculation are presented in appendix A, see also [4]. On the right hand sides of (5), terms of order $O(1/N)$ have been neglected using, for instance, $\langle \xi^2 \rangle / N = 1 + \lambda^2 / N \to 1$ for large $N$.

The limit $N \to \infty$ has further simplifying consequences. First, the averaged recursions can be interpreted as ordinary differential equations (ODE) in continuous training time $\alpha = \mu / N$. Second, the overlaps $\{R_{S\sigma}, Q_{ST}\}$ as functions of $\alpha$ become self-averaging with respect to the random sequence of examples. Fluctuations of these quantities, as for instance observed in computer simulations of the learning process, vanish with increasing $N$, and the description in terms of mean values is sufficient. For a detailed mathematical discussion of this property see [11].

Given initial conditions $\{R_{S\sigma}(0), Q_{ST}(0)\}$, the resulting system of coupled ODE can be integrated numerically. This yields the evolution of the overlaps (4) with increasing $\alpha$ in the course of training. The behavior of the
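As a complement to the ODE integration, the learning process can be simulated directly and the overlaps (4) measured at the end of training. The following Monte Carlo sketch for LVQ1 is our re-implementation (not the authors' code); $B_\pm$ are taken as the first two basis vectors, and the initialization mimics the near-origin conditions quoted for Figure 2.

```python
import numpy as np

# Monte Carlo estimate of the order parameters (4) after LVQ1 training.
def simulate_lvq1(alpha_max=50.0, N=200, p_plus=0.8, lam=1.0, eta=0.2, seed=0):
    rng = np.random.default_rng(seed)
    B = np.zeros((2, N)); B[0, 0] = B[1, 1] = 1.0          # orthonormal B_plus, B_minus
    w = {+1: 1e-2 * rng.standard_normal(N) / np.sqrt(N),    # near the origin:
         -1: 1e-2 * rng.standard_normal(N) / np.sqrt(N)}    # R ~ 0, Q ~ 1e-4
    for _ in range(int(alpha_max * N)):                     # alpha = mu / N
        s = 1 if rng.random() < p_plus else -1
        xi = lam * B[0 if s == 1 else 1] + rng.standard_normal(N)
        d = {S: np.sum((xi - w[S]) ** 2) for S in (+1, -1)}
        winner = +1 if d[+1] < d[-1] else -1
        w[winner] += (eta / N) * (winner * s) * (xi - w[winner])   # g = S*sigma
    R = {(S, k): w[S] @ B[0 if k == 1 else 1] for S in (+1, -1) for k in (+1, -1)}
    Q = {(S, T): w[S] @ w[T] for S in (+1, -1) for T in (+1, -1)}
    return R, Q
```

For moderate $N$ such a single run already tracks the ODE solution closely, which is the comparison shown in Figure 2.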


Figure 2: The characteristic overlaps $R_{S\sigma}$ (left panel) and $Q_{ST}$ (right panel) vs. the rescaled number of examples $\alpha = \mu/N$ for LVQ1 with $p_+ = 0.8$, $\lambda = 1$, and $\eta = 0.2$. The solid lines display the result of integrating the system of ODE; symbols correspond to Monte Carlo simulations for $N = 200$, on average over 100 independent runs. Error bars are smaller than the symbol size. Prototypes were initialized close to the origin with $R_{S\sigma} = Q_{+-} = 0$ and $Q_{++} = Q_{--} = 10^{-4}$.

system will depend on the characteristics of the data, i.e. the separation $\lambda$, the learning rate $\eta$, and the actual algorithm as specified by the choice of $g(S, \sigma)$ in Eq. (1). Monte Carlo simulations of the learning process are in excellent agreement with the $N \to \infty$ theory for dimensions as low as $N = 200$ already; see Figure 2 for a comparison in the case of LVQ1. Our simulations furthermore confirm that the characteristic overlaps $\{R_{S\sigma}(\alpha), Q_{ST}(\alpha)\}$ are self-averaging quantities: their standard deviation determined from independent runs vanishes as $N \to \infty$; details will be published elsewhere.

The success of learning can be quantified as the probability of misclassifying novel random data, the generalization error $\epsilon_g = \sum_{\sigma=\pm 1} p_\sigma \langle \Theta_{-\sigma} \rangle_\sigma$. Performing the averages is done along the lines discussed above, see the appendix and [4]. One obtains $\epsilon_g$ as a function of the overlaps $\{Q_{ST}, R_{S\sigma}\}$:

$$ \epsilon_g = \sum_{S=\pm 1} p_S\, \Phi\!\left( \frac{Q_{SS} - Q_{-S-S} - 2\lambda \left[ R_{SS} - R_{-SS} \right]}{2 \sqrt{Q_{++} + Q_{--} - 2 Q_{+-}}} \right), \qquad (8) $$

where $\Phi(x) = \int_{-\infty}^{x} \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2}$. Hence, by inserting $\{Q_{ST}(\alpha), R_{S\sigma}(\alpha)\}$ we can evaluate the learning curve $\epsilon_g(\alpha)$, i.e. the typical generalization error as achieved from a sequence of $\alpha N$ examples.
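Eq. (8) is easy to evaluate numerically. The helper below is our sketch; the dict-based parametrization of $R$ and $Q$ is an assumption of convenience, not the paper's notation.

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def eps_g(R, Q, lam, p_plus):
    """Generalization error, Eq. (8), from the order parameters.
    R and Q are dicts keyed by (S, sigma) and (S, T), S, T, sigma in {+1, -1}."""
    p = {+1: p_plus, -1: 1.0 - p_plus}
    denom = 2.0 * sqrt(Q[(+1, +1)] + Q[(-1, -1)] - 2.0 * Q[(+1, -1)])
    total = 0.0
    for S in (+1, -1):
        num = Q[(S, S)] - Q[(-S, -S)] - 2.0 * lam * (R[(S, S)] - R[(-S, S)])
        total += p[S] * Phi(num / denom)
    return total
```

For instance, placing the prototypes exactly at the cluster centers ($w_\pm = \lambda B_\pm$, so $R_{SS} = \lambda$, $R_{-SS} = 0$, $Q_{SS} = \lambda^2$, $Q_{+-} = 0$) yields $\epsilon_g = \Phi(-\lambda/\sqrt{2})$ independently of the priors.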


Figure 3: Left panel: typical learning curves $\epsilon_g(\alpha)$ of unsupervised VQ (dotted), LVQ+ (dashed), and LVQ1 (solid line) for $\lambda = 1.0$ and learning rate $\eta = 0.2$. Right panel: asymptotic $\epsilon_g$ for $\eta \to 0$, $\eta\alpha \to \infty$, for $\lambda = 1.2$ as a function of the prior weight $p_+$. The lowest, solid curve corresponds to the optimum $\epsilon_g^{\min}$, whereas the dashed line represents the typical outcome of LVQ1. The horizontal line is the $p_\pm$-independent result for LVQ+; it can even exceed $\epsilon_g = \min\{p_+, p_-\}$ (thin solid line). Unsupervised VQ yields an asymptotic $\epsilon_g$ as marked by the chain line.

    4 Results

    4.1 The dynamics

The dynamics of unsupervised VQ, setting III in section 2.1, has been studied for $p_+ = p_-$ in an earlier publication [7]. Because data labels are disregarded or unavailable, the prototypes could be exchanged with no effect on the achieved quantization error (2). This permutation symmetry causes the initial learning dynamics to take place mainly in the subspace orthogonal to $(B_+ - B_-)$. It is reflected in a weakly repulsive fixed point of the ODE in which all $R_{S\sigma}$ are equal. Generically, the prototypes remain unspecialized up to rather large values of $\alpha$, depending on the precise initial conditions. Without prior knowledge, of course, $R_{S\sigma}(0) \approx 0$ holds. This key feature, as discussed in [7], persists for $p_+ \neq p_-$.

While VQ does not aim at good generalization, we can still obtain $\epsilon_g(\alpha)$ from the prototype configuration; see Figure 3 (left panel) for an example. The very slow initial decrease relates to the above mentioned effect.

Figure 2 shows the evolution of the order parameters $\{Q_{ST}, R_{S\sigma}\}$ for an example set of model parameters in LVQ1 training, scenario I in sec. 2.1. Monte Carlo simulations of the system for $N = 200$ already display excellent agreement with the $N \to \infty$ theory.


In LVQ1, data and prototypes are labeled and, hence, specialization is enforced as soon as $\alpha > 0$. The corresponding $\epsilon_g$ displays a very fast initial decrease, cf. Figure 3. The non-monotonic intermediate behavior of $\epsilon_g(\alpha)$ is particularly pronounced for very different prior weights, e.g. $p_+ \gg p_-$, and for strongly overlapping clusters (small $\lambda$).

Qualitatively, the typical behavior of LVQ+ is similar to that of LVQ1. However, unless $p_+ = p_-$, the achieved $\epsilon_g(\alpha)$ is significantly larger, as shown in Figure 3. This effect becomes clearer from the discussion of asymptotic configurations in the next section. The typical trajectories of prototype vectors in the space of order parameters will be discussed in greater detail for a variety of LVQ algorithms in forthcoming publications.

    4.2 Asymptotic generalization

Apart from the behavior for small and intermediate values of $\alpha$, we are also interested in the generalization ability that can be achieved, in principle, from an unlimited number of examples. This provides an important means for judging the potential performance of training algorithms.

For stochastic gradient descent procedures like VQ, the expectation value of the associated cost function is minimized in the simultaneous limits of $\eta \to 0$ and $\alpha \to \infty$ such that $\eta\alpha \to \infty$. In the absence of a cost function we can still consider this limit, in which the system of ODE simplifies and can be expressed in the rescaled time $\eta\alpha$ after neglecting terms of order $O(\eta^2)$. A fixed point analysis then yields well defined asymptotics; see [7] for a treatment of VQ. Figure 3 (right panel) summarizes our findings for the asymptotic $\epsilon_g$ as a function of $p_+$.

It is straightforward to work out the decision boundary with minimal generalization error $\epsilon_g^{\min}$ in our model. For symmetry reasons it is a plane orthogonal to $(B_+ - B_-)$ which contains all $\xi$ with $p_+ P(\xi \mid +1) = p_- P(\xi \mid -1)$ [5]. The lowest, solid line in Figure 3 (right panel) represents $\epsilon_g^{\min}$ for $\lambda = 1.2$ as a function of $p_+$. For comparison, the trivial classification according to the priors $p_\pm$ yields $\epsilon_g^{\mathrm{triv}} = \min\{p_+, p_-\}$ and is also included in Figure 3.
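The condition $p_+ P(\xi \mid +1) = p_- P(\xi \mid -1)$ reduces to a one-dimensional two-Gaussian problem along $u = \xi \cdot (B_+ - B_-)/\sqrt{2}$, where the classes are unit-variance Gaussians at $\pm\lambda/\sqrt{2}$. The closed-form evaluation below is our own derivation under this reduction, not a formula quoted from the paper.

```python
from math import erf, log, sqrt

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def eps_g_min(lam, p_plus):
    """Best possible generalization error in the model: the Bayes threshold
    u* on u = xi.(B_plus - B_minus)/sqrt(2) solves p_+ P(u|+) = p_- P(u|-)."""
    p_minus = 1.0 - p_plus
    if p_plus in (0.0, 1.0):
        return 0.0                      # trivial classifier is already optimal
    a = lam / sqrt(2.0)                 # class-conditional means of u: +a and -a
    u_star = log(p_minus / p_plus) / (sqrt(2.0) * lam)
    return p_plus * Phi(u_star - a) + p_minus * Phi(-u_star - a)
```

For equal priors this gives $\Phi(-\lambda/\sqrt{2})$, e.g. about $0.198$ at $\lambda = 1.2$, matching the symmetric-configuration value of Eq. (8); by construction it never exceeds the trivial error $\min\{p_+, p_-\}$.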

In unsupervised VQ a strong prevalence, e.g. $p_+ \approx 1$, will be accounted for by placing both vectors inside the stronger cluster, thus achieving a low quantization error (2). Obviously this yields a poor classification, as indicated by $\epsilon_g = 1/2$ in the limiting cases $p_+ = 0$ or $1$. In the special case $p_+ = 1/2$ the aim of representation happens to coincide with good generalization, and $\epsilon_g$ becomes optimal, indeed.

    LVQ1, setting I in section 2.1, yields a classification scheme which is veryclose to being optimal for all values of p+, cf. Figure 3 (right panel). On the


contrary, LVQ+ (algorithm II) updates each $w_S$ only with data from class $S$. As a consequence, the asymptotic positions of the $w_\pm$ are always symmetric about the geometrical center $\lambda (B_+ + B_-)/2$ and are independent of the priors $p_\pm$. Thus, LVQ+ is robust with respect to a variation of $p_\pm$ after training, i.e. it is optimal in the sense of the minmax criterion $\sup_{p_\pm} \epsilon_g(\infty)$ [5].

    5 Conclusions

LVQ type learning models constitute popular learning algorithms due to their simple learning rule, their intuitive formulation of a classifier by means of prototypical locations in the data space, and their efficient applicability to any given number of classes. However, only very few theoretical results have been achieved so far which explain the behavior of such algorithms in mathematical terms, including large margin bounds as derived in [2, 8] and variations of LVQ type algorithms which can be derived from a cost function [9, 13-15]. In general, the often excellent generalization ability of LVQ type algorithms is not guaranteed by theoretical arguments, and the stability and convergence of various LVQ learning schemes is only poorly understood. Often, further heuristics such as the window rule for LVQ2.1 are introduced to overcome the problems of the original, heuristically motivated algorithms [10]. This has led to a large variety of (often only slightly) different LVQ type learning rules, the drawbacks and benefits of which are hardly understood.

Apart from an experimental benchmarking of these proposals, there is clearly a need for a thorough theoretical comparison of the behavior of the models, in order to judge their efficiency and to select the most powerful learning schemes among these proposals.

We have investigated different Winner-Takes-All algorithms for Learning Vector Quantization in the framework of an analytically treatable model. The theory of on-line learning enables us to describe the dynamics of training in terms of differential equations for a few characteristic quantities. The formalism becomes exact in the limit $N \to \infty$ of very high dimensional data and many degrees of freedom.

This framework opens the way towards an exact investigation of situations where, so far, only heuristic evaluations have been possible, and it can serve as a uniform approach to investigate and compare LVQ type learning schemes on a solid theoretical ground: the generalization ability can be evaluated also for heuristic training prescriptions which lack a corresponding cost function, including Kohonen's basic LVQ algorithm. Already in our simple model setting, this formal analysis shows pronounced


characteristics and differences of the typical learning behavior. On the one hand, the learning curves display common features such as an initial non-monotonic behavior. This behavior can be attributed to the necessity of symmetry breaking in the fully unsupervised case, and to an overshooting effect due to different prior probabilities of the two classes in supervised settings. On the other hand, slightly different intuitive learning algorithms yield fundamentally different asymptotic configurations, as demonstrated for VQ, LVQ1, and LVQ+. These differences can be attributed to the different inherent goals of the learning schemes. In consequence, they are particularly pronounced for unbalanced class distributions, where the goals of minimizing the quantization error, the minmax error, and the generalization error do not coincide. It is quite remarkable that the conceptually simple, original learning rule LVQ1 shows close to optimal generalization for the entire range of the prior weights $p_\pm$.

Despite the relative simplicity of our model situation, the training of only two prototypes from two-class data captures many non-trivial features of more realistic settings: in terms of the itemization in section 2.1, situation I (LVQ1) describes the generic competition at class borders in basic LVQ schemes. Scenario II (LVQ+) corresponds to an intuitive variant which can be seen as a mixture of VQ and LVQ, i.e. Vector Quantization within the different classes. Finally, setting III (VQ) models the representation of one class by several competing prototypes.

Here, we have only considered WTA schemes for training. It is possible to extend the analysis to more complex update rules such as LVQ2.1 or recent proposals [9, 14, 15] which update more than one prototype at a time. The treatment of LVQ type learning in the on-line scenario extends easily to algorithms which obey a more general learning rule than Eq. (1), e.g. rules which, by substituting the Heaviside term, adapt more than one prototype at a time. Our preliminary studies along this line show remarkable differences in the generalization ability and convergence properties of such variations of LVQ-type algorithms.

Clearly, our model cannot describe all features of more complex real-world problems. Forthcoming investigations will concern, for instance, the extension to situations where the model complexity and the structure of the data do not match perfectly. This requires the treatment of scenarios with more than two prototypes and cluster centers. We also intend to study the influence of different variances within the classes and of non-spherical clusters, aiming at a more realistic modeling. We are convinced that this line of research will lead to a deeper understanding of LVQ type training and will facilitate the development of novel efficient algorithms.


    References

[1] Bibliography on the Self-Organizing Map (SOM) and Learning Vector Quantization (LVQ). Neural Networks Research Centre, Helsinki University of Technology, 2002.

[2] K. Crammer, R. Gilad-Bachrach, A. Navot, and A. Tishby. Margin analysis of the LVQ algorithm. In: Advances in Neural Information Processing Systems, 2002.

[3] M. Biehl and N. Caticha. The statistical physics of learning and generalization. In: M. Arbib (ed.), Handbook of Brain Theory and Neural Networks, second edition, MIT Press, 2003.

[4] M. Biehl, A. Freking, A. Ghosh, and G. Reents. A theoretical framework for analysing the dynamics of Learning Vector Quantization. Technical Report 2004-9-02, Institute of Mathematics and Computing Science, University of Groningen, available from www.cs.rug.nl/ biehl, 2004.

    [5] R. Duda, P. Hart, and D. Stork, Pattern Classification, Wiley, 2001.

[6] A. Engel and C. van den Broeck. The Statistical Mechanics of Learning, Cambridge University Press, 2001.

[7] A. Freking, G. Reents, and M. Biehl. The dynamics of competitive learning, Europhysics Letters 38 (1996) 73-78.

[8] B. Hammer, M. Strickert, and T. Villmann. On the generalization capability of GRLVQ networks, Neural Processing Letters 21 (2005) 109-120.

[9] B. Hammer and T. Villmann. Generalized relevance learning vector quantization, Neural Networks 15 (2002) 1059-1068.

    [10] T. Kohonen. Self-organizing maps, Springer, Berlin, 1995.

[11] G. Reents and R. Urbanczik. Self-averaging and on-line learning, Physical Review Letters 80 (1998) 5445-5448.

[12] D. Saad, editor. On-line Learning in Neural Networks, Cambridge University Press, 1998.

[13] A. Sato and K. Yamada. Generalized learning vector quantization. In: G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 423-429, MIT Press, 1995.

[14] S. Seo, M. Bode, and K. Obermayer. Soft nearest prototype classification, IEEE Transactions on Neural Networks 14 (2003) 390-398.

[15] S. Seo and K. Obermayer. Soft Learning Vector Quantization, Neural Computation 15 (2003) 1589-1604.

[16] P. Somervuo and T. Kohonen. Self-organizing maps and learning vector quantization for feature sequences, Neural Processing Letters 10 (1999) 151-159.

[17] T.L.H. Watkin, A. Rau, and M. Biehl. The statistical mechanics of learning a rule, Reviews of Modern Physics 65 (1993) 499-556.


    A The theoretical framework

In this appendix we outline key steps of the calculations referred to in the text. The formalism was first used in the context of unsupervised Vector Quantization [7], and the calculations were recently detailed in a Technical Report [4].

Throughout this appendix, indices $l, m, k, s, \sigma \in \{\pm 1\}$ (or $\pm$ for short) represent the class labels and cluster memberships.

Note that the following corresponds to a slightly more general input distribution with variances $v_\sigma$ of the Gaussian clusters $\sigma = \pm 1$:

$$ P(\xi) = \sum_{\sigma=\pm 1} p_\sigma\, P(\xi \mid \sigma) \quad \text{with} \quad P(\xi \mid \sigma) = \frac{1}{(2\pi v_\sigma)^{N/2}} \exp\left[ -\frac{1}{2 v_\sigma} \left( \xi - \lambda B_\sigma \right)^2 \right]. \qquad (9) $$

All results in the main text refer to the special choice $v_+ = v_- = 1$, which corresponds to Eq. (3). The influence of choosing different variances on the behavior of LVQ type training will be studied in forthcoming publications.

A.1 Statistics of the projections

To a large extent, our analysis is based on the observation that the projections $h_l = \vec{w}_l \cdot \vec{\xi}$ and $b_l = \vec{B}_l \cdot \vec{\xi}$ are correlated Gaussian random quantities for a vector $\vec{\xi}$ drawn from one of the clusters contributing to the density (9). Where convenient, we will combine the projections into a four-dimensional vector denoted as $x = (h_{+1}, h_{-1}, b_{+1}, b_{-1})^T$.

We will assume implicitly that $\vec{\xi}$ is statistically independent from the considered weight vectors $\vec{w}_l$. This is obviously the case in our on-line prescription, where the novel example $\vec{\xi}^{\,\mu}$ is uncorrelated with all previous data and hence with $\vec{w}_l^{\,\mu-1}$.

The first and second conditional moments given in Eq. (7) are obtained from the following elementary considerations.

First moments

Exploiting the above-mentioned statistical independence we can show immediately that

$$\langle h_l \rangle_k = \langle \vec{w}_l \cdot \vec{\xi} \rangle_k = \vec{w}_l \cdot \langle \vec{\xi} \rangle_k = \vec{w}_l \cdot (\lambda \vec{B}_k) = \lambda\, R_{lk}. \quad (10)$$

Similarly we get for $b_l$:

$$\langle b_l \rangle_k = \langle \vec{B}_l \cdot \vec{\xi} \rangle_k = \vec{B}_l \cdot \langle \vec{\xi} \rangle_k = \vec{B}_l \cdot (\lambda \vec{B}_k) = \lambda\, \delta_{lk}, \quad (11)$$


Finally, we obtain the conditional second moment

$$\langle h_l h_m \rangle_k - \langle h_l \rangle_k \langle h_m \rangle_k = v_k\, Q_{lm} + \lambda^2 R_{lk} R_{mk} - \lambda^2 R_{lk} R_{mk} = v_k\, Q_{lm}. \quad (13)$$

Similarly, we find for the quantities $b_+, b_-$:

$$\langle b_l b_m \rangle_k - \langle b_l \rangle_k \langle b_m \rangle_k = v_k\, \delta_{lm} + \lambda^2 \delta_{lk}\delta_{mk} - \lambda^2 \delta_{lk}\delta_{mk} = v_k\, \delta_{lm}. \quad (14)$$

Eventually, we evaluate the covariance $\langle h_l b_m \rangle_k - \langle h_l \rangle_k \langle b_m \rangle_k$ by considering the average

$$\langle h_l b_m \rangle_k = \langle (\vec{w}_l \cdot \vec{\xi})(\vec{B}_m \cdot \vec{\xi}) \rangle_k = v_k\, \vec{w}_l \cdot \vec{B}_m + \lambda^2 (\vec{w}_l \cdot \vec{B}_k)(\vec{B}_m \cdot \vec{B}_k) = v_k\, R_{lm} + \lambda^2 R_{lk}\,\delta_{mk}$$

and obtain

$$\langle h_l b_m \rangle_k - \langle h_l \rangle_k \langle b_m \rangle_k = v_k\, R_{lm} + \lambda^2 R_{lk}\,\delta_{mk} - \lambda^2 R_{lk}\,\delta_{mk} = v_k\, R_{lm}. \quad (15)$$

The above results are summarized in Eq. (7). The conditional covariance matrix of $x$ can be expressed explicitly in terms of the order parameters as follows:

$$C_k = v_k \begin{pmatrix} Q_{++} & Q_{+-} & R_{++} & R_{+-} \\ Q_{+-} & Q_{--} & R_{-+} & R_{--} \\ R_{++} & R_{-+} & 1 & 0 \\ R_{+-} & R_{--} & 0 & 1 \end{pmatrix}. \quad (16)$$

The conditional density of $x$ for data from class $k$ is a Gaussian $N(\mu_k, C_k)$, where $\mu_k$ is the conditional mean vector, Eq. (12), i.e. $\mu_k = \lambda\,(R_{+k}, R_{-k}, \delta_{+k}, \delta_{-k})^T$ by Eqs. (10) and (11), and $C_k$ is given above.
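The conditional statistics translate directly into code. In this sketch (a hypothetical helper, not from the original work) index 0 stands for label $+1$ and index 1 for $-1$; the accompanying check compares $\mu_k$ and $C_k$ against empirical moments of the projections for explicit high-dimensional vectors.

```python
import numpy as np

def projection_stats(R, Q, k, lam=1.0, v=(1.0, 1.0)):
    """Conditional mean mu_k and covariance C_k of x = (h_+, h_-, b_+, b_-)
    for data from cluster k, expressed through the order parameters.
    R, Q are 2x2 arrays indexed [l, m] with 0 meaning +1 and 1 meaning -1;
    k is the cluster index (0 or 1)."""
    mu = lam * np.array([R[0, k], R[1, k], float(k == 0), float(k == 1)])
    C = v[k] * np.block([[Q, R], [R.T, np.eye(2)]])
    return mu, C
```

The block structure of $C_k$ in Eq. (16) maps one-to-one onto `np.block([[Q, R], [R.T, I]])`.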

A.2 Differential Equations

For the training prescriptions considered here it is possible to use a particularly compact notation. The function $g(l, \sigma)$ as introduced in Eq. (1) can be written as

$$g(l, \sigma) = a + b\,\sigma l,$$

where $a, b \in \mathbb{R}$ and $l, \sigma \in \{+1, -1\}$. The WTA schemes listed in section 2.1 are recovered as follows:

I) $a = 0,\ b = 1$: LVQ1 with $g(l, \sigma) = l\sigma = \pm 1$

II) $a = b = 1/2$: LVQ+ with $g(l, \sigma) = \Theta(l\sigma) = \begin{cases} 1 & \text{if } l = \sigma \\ 0 & \text{else} \end{cases}$

III) $a = 1,\ b = 0$: VQ with $g(l, \sigma) = 1$.
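The parametrization is captured in a few lines; the dictionary names for the three schemes below are illustrative, not from the original text.

```python
def g(l, sigma, a, b):
    """Modulation function g(l, sigma) = a + b*sigma*l for prototype label l
    and cluster membership sigma (both +/-1)."""
    return a + b * sigma * l

# The three WTA schemes of section 2.1:
LVQ1 = dict(a=0.0, b=1.0)   # g = sigma*l: attract correct winner, repel wrong one
LVQp = dict(a=0.5, b=0.5)   # g = Theta(sigma*l): update only the correct winner
VQ   = dict(a=1.0, b=0.0)   # g = 1: unsupervised vector quantization
```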


The recursion relations, Eq. (5), are to be averaged over the density of a new, independent input $\vec{\xi}$ drawn from the density (9). In the limit $N \to \infty$ this yields a system of coupled differential equations in continuous learning time $\alpha = P/N$, as argued in the text. Exploiting the notations defined above, it can be formally expressed in the following way:

$$\frac{dR_{lm}}{d\alpha} = \eta \left( a \left[ \langle b_m \Theta_l \rangle - \langle \Theta_l \rangle R_{lm} \right] + b\,l \left[ \langle \sigma\, b_m \Theta_l \rangle - \langle \sigma\, \Theta_l \rangle R_{lm} \right] \right)$$

$$\begin{aligned}
\frac{dQ_{lm}}{d\alpha} = \eta \Big( & \, b \left[ l \langle \sigma\, h_m \Theta_l \rangle - l \langle \sigma\, \Theta_l \rangle Q_{lm} + m \langle \sigma\, h_l \Theta_m \rangle - m \langle \sigma\, \Theta_m \rangle Q_{lm} \right] \\
& + a \left[ \langle h_m \Theta_l \rangle - \langle \Theta_l \rangle Q_{lm} + \langle h_l \Theta_m \rangle - \langle \Theta_m \rangle Q_{lm} \right] \Big) \\
& + \eta^2\, \delta_{lm} \left( a^2 + b^2 \right) \sum_{\sigma=\pm 1} v_\sigma\, p_\sigma \langle \Theta_l \rangle_\sigma + \eta^2\, \delta_{lm}\, a b\,(l + m) \sum_{\sigma=\pm 1} \sigma\, v_\sigma\, p_\sigma \langle \Theta_l \rangle_\sigma
\end{aligned} \quad (17)$$

where $\langle \cdot \rangle = \sum_{k=\pm 1} p_k \langle \cdot \rangle_k$. Note that the equations for $Q_{lm}$ contain terms of order $\eta^2$, whereas the ODE for $R_{lm}$ are linear in the learning rate.
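Before solving the ODEs, one can cross-check them against a direct Monte Carlo realization of the on-line updates at finite $N$. The following sketch (a hypothetical function; zero initial prototypes and the parameter defaults are arbitrary illustrative choices) implements the WTA prescription for $v_+ = v_- = 1$ and returns the order parameters at the end of training.

```python
import numpy as np

def simulate(a=0.0, b=1.0, eta=0.2, N=200, alpha_max=20.0,
             lam=1.0, p_plus=0.5, seed=0):
    """Monte Carlo realization of the on-line WTA dynamics
    w_l <- w_l + (eta/N) * g(l, sigma) * Theta_l * (xi - w_l)
    at finite N, for v_+ = v_- = 1.  Returns the order parameters
    R_lm = w_l . B_m and Q_lm = w_l . w_m after alpha_max * N examples."""
    rng = np.random.default_rng(seed)
    B = np.eye(N)[:2]                      # rows: B_+, B_-
    W = np.zeros((2, N))                   # prototypes w_+, w_- (zero init)
    labels = np.array([+1, -1])
    for _ in range(int(alpha_max * N)):
        sigma = rng.choice(labels, p=[p_plus, 1.0 - p_plus])
        xi = lam * B[0 if sigma == +1 else 1] + rng.standard_normal(N)
        d = ((xi - W) ** 2).sum(axis=1)    # squared distances to prototypes
        win = d.argmin()                   # Winner-Takes-All
        gval = a + b * sigma * labels[win]
        W[win] += (eta / N) * gval * (xi - W[win])
    return W @ B.T, W @ W.T
```

For $N$ of a few hundred, the trajectories of $R_{lm}(\alpha)$ and $Q_{lm}(\alpha)$ obtained this way fluctuate only weakly around the deterministic $N \to \infty$ solution.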

A.2.1 Averages

We introduce the vectors and auxiliary quantities

$$\alpha_l = l\,(+2, -2, 0, 0)^T \quad \text{and} \quad \beta_l = l\,\left(Q_{+1+1} - Q_{-1-1}\right), \quad (18)$$

which allow us to rewrite the Heaviside terms as

$$\Theta_l = \Theta\!\left( Q_{-l\,-l} - 2 h_{-l} - Q_{+l\,+l} + 2 h_{+l} \right) = \Theta\!\left( \alpha_l \cdot x - \beta_l \right). \quad (19)$$
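The identity (19) is easy to verify numerically. In this sketch (a hypothetical helper) the winner indicator computed from $\alpha_l$ and $\beta_l$ is compared, in the accompanying check, with the direct comparison of squared Euclidean distances.

```python
import numpy as np

def theta_winner(l, x, Q):
    """Winner indicator Theta_l written as Theta(alpha_l . x - beta_l), Eq. (19).
    x = (h_+, h_-, b_+, b_-); Q is the 2x2 matrix of Q_lm (index 0: +1, 1: -1)."""
    alpha = l * np.array([2.0, -2.0, 0.0, 0.0])
    beta = l * (Q[0, 0] - Q[1, 1])
    return float(alpha @ x - beta > 0.0)
```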

Performing the averages in (17) involves conditional means of the form

$$\langle (x)_n\, \Theta_s \rangle_k \quad \text{and} \quad \langle \Theta_s \rangle_k\,,$$

where $(x)_n$ is the $n$th component of the vector $x = (h_{+1}, h_{-1}, b_{+1}, b_{-1})^T$. We first address the term

$$\langle (x)_n\, \Theta_s \rangle_k = \frac{1}{(2\pi)^{4/2}\left(\det C_k\right)^{1/2}} \int_{\mathbb{R}^4} (x)_n\, \Theta\!\left( \alpha_s \cdot x - \beta_s \right) \exp\!\left[ -\frac{1}{2} \left( x - \mu_k \right)^T C_k^{-1} \left( x - \mu_k \right) \right] dx. \quad (20)$$

A linear transformation of the integration variables reduces this to one-dimensional Gaussian integrals. Two contributions emerge: a term $I$ proportional to $(C_k \alpha_s)_n$ and a term proportional to $(\mu_k)_n \langle \Theta_s \rangle_k$; they involve the abbreviations $\tilde{\alpha}_{sk} = \alpha_s \cdot C_k\, \alpha_s$ and $\tilde{\beta}_{sk} = \alpha_s \cdot \mu_k - \beta_s$ introduced in Eq. (24). For the first contribution we carry out the remaining integration

and obtain

$$\begin{aligned}
I &= \frac{(C_k \alpha_s)_n}{\sqrt{2\pi\, \tilde{\alpha}_{sk}}} \int_{\mathbb{R}} \Theta\!\left( \sqrt{\tilde{\alpha}_{sk}}\, y + \tilde{\beta}_{sk} \right) y\, \exp\!\left( -\frac{1}{2} y^2 \right) dy \\
&= \frac{(C_k \alpha_s)_n}{\sqrt{2\pi}\, \tilde{\alpha}_{sk}^{3/2}} \int_{\mathbb{R}} \Theta\!\left( z + \tilde{\beta}_{sk} \right) z\, \exp\!\left( -\frac{1}{2} \Big( \frac{z}{\sqrt{\tilde{\alpha}_{sk}}} \Big)^2 \right) dz \qquad (\text{with } z = \sqrt{\tilde{\alpha}_{sk}}\, y) \\
&= \frac{(C_k \alpha_s)_n}{\sqrt{2\pi\, \tilde{\alpha}_{sk}}}\, \exp\!\left( -\frac{1}{2} \Big( \frac{\tilde{\beta}_{sk}}{\sqrt{\tilde{\alpha}_{sk}}} \Big)^2 \right). \quad (25)
\end{aligned}$$

Now we compute the remaining average in (20) in an analogous way and get

$$\langle \Theta_s \rangle_k = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} \Theta\!\left( \sqrt{\tilde{\alpha}_{sk}}\, y + \tilde{\beta}_{sk} \right) \exp\!\left( -\frac{1}{2} y^2 \right) dy = \frac{1}{\sqrt{2\pi}} \int_{-\tilde{\beta}_{sk}/\sqrt{\tilde{\alpha}_{sk}}}^{\infty} \exp\!\left( -\frac{1}{2} y^2 \right) dy = \Phi\!\left( \frac{\tilde{\beta}_{sk}}{\sqrt{\tilde{\alpha}_{sk}}} \right), \quad (26)$$

where $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-z^2/2}\, dz$.

Finally we obtain the required average using (25) and (26) as follows:

$$\langle (x)_n\, \Theta_s \rangle_k = \frac{(C_k \alpha_s)_n}{\sqrt{2\pi\, \tilde{\alpha}_{sk}}}\, \exp\!\left( -\frac{1}{2} \Big( \frac{\tilde{\beta}_{sk}}{\sqrt{\tilde{\alpha}_{sk}}} \Big)^2 \right) + (\mu_k)_n\, \Phi\!\left( \frac{\tilde{\beta}_{sk}}{\sqrt{\tilde{\alpha}_{sk}}} \right). \quad (27)$$

The quantities $\tilde{\alpha}_{sk}$ and $\tilde{\beta}_{sk}$ are defined through Eqs. (24) and (18).
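The closed form (27) can be checked against a brute-force Gaussian average. This sketch (a hypothetical helper, not from the original text) takes $\tilde{\alpha} = \alpha \cdot C \alpha$ and $\tilde{\beta} = \alpha \cdot \mu - \beta$, consistent with Eqs. (25)-(27); the accompanying check compares it with a Monte Carlo estimate for an arbitrary Gaussian.

```python
import numpy as np
from math import erf, exp, sqrt, pi

def avg_xn_theta(n, mu, C, alpha, beta):
    """Closed form (27) of <(x)_n Theta(alpha . x - beta)> for x ~ N(mu, C),
    with alpha-tilde = alpha . C alpha and beta-tilde = alpha . mu - beta."""
    at = alpha @ C @ alpha
    bt = alpha @ mu - beta
    Phi = 0.5 * (1.0 + erf(bt / sqrt(2.0 * at)))
    return (C @ alpha)[n] / sqrt(2.0 * pi * at) * exp(-0.5 * bt ** 2 / at) \
        + mu[n] * Phi
```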

A.2.2 Final form of the differential equations

Using (26) and (27), the system of differential equations reads

$$\begin{aligned}
\frac{dR_{lm}}{d\alpha} = \eta \Bigg\{ & \, b\,l \sum_{\sigma=\pm 1} p_\sigma\, \sigma \left[ \frac{(C_\sigma \alpha_l)_{n_{b_m}}}{\sqrt{2\pi\, \tilde{\alpha}_{l\sigma}}}\, \exp\!\left( -\frac{1}{2} \Big( \frac{\tilde{\beta}_{l\sigma}}{\sqrt{\tilde{\alpha}_{l\sigma}}} \Big)^2 \right) + (\mu_\sigma)_{n_{b_m}}\, \Phi\!\left( \frac{\tilde{\beta}_{l\sigma}}{\sqrt{\tilde{\alpha}_{l\sigma}}} \right) \right] - b\,l \sum_{\sigma=\pm 1} p_\sigma\, \sigma\, \Phi\!\left( \frac{\tilde{\beta}_{l\sigma}}{\sqrt{\tilde{\alpha}_{l\sigma}}} \right) R_{lm} \\
& + a \sum_{\sigma=\pm 1} p_\sigma \left[ \frac{(C_\sigma \alpha_l)_{n_{b_m}}}{\sqrt{2\pi\, \tilde{\alpha}_{l\sigma}}}\, \exp\!\left( -\frac{1}{2} \Big( \frac{\tilde{\beta}_{l\sigma}}{\sqrt{\tilde{\alpha}_{l\sigma}}} \Big)^2 \right) + (\mu_\sigma)_{n_{b_m}}\, \Phi\!\left( \frac{\tilde{\beta}_{l\sigma}}{\sqrt{\tilde{\alpha}_{l\sigma}}} \right) \right] - a \sum_{\sigma=\pm 1} p_\sigma\, \Phi\!\left( \frac{\tilde{\beta}_{l\sigma}}{\sqrt{\tilde{\alpha}_{l\sigma}}} \right) R_{lm} \Bigg\} \quad (28)
\end{aligned}$$

$$\begin{aligned}
\frac{dQ_{lm}}{d\alpha} = \eta \Bigg\{ & \, b\,l \sum_{\sigma=\pm 1} p_\sigma\, \sigma \left[ \frac{(C_\sigma \alpha_l)_{n_{h_m}}}{\sqrt{2\pi\, \tilde{\alpha}_{l\sigma}}}\, \exp\!\left( -\frac{1}{2} \Big( \frac{\tilde{\beta}_{l\sigma}}{\sqrt{\tilde{\alpha}_{l\sigma}}} \Big)^2 \right) + (\mu_\sigma)_{n_{h_m}}\, \Phi\!\left( \frac{\tilde{\beta}_{l\sigma}}{\sqrt{\tilde{\alpha}_{l\sigma}}} \right) \right] - b\,l \sum_{\sigma=\pm 1} p_\sigma\, \sigma\, \Phi\!\left( \frac{\tilde{\beta}_{l\sigma}}{\sqrt{\tilde{\alpha}_{l\sigma}}} \right) Q_{lm} \\
& + b\,m \sum_{\sigma=\pm 1} p_\sigma\, \sigma \left[ \frac{(C_\sigma \alpha_m)_{n_{h_l}}}{\sqrt{2\pi\, \tilde{\alpha}_{m\sigma}}}\, \exp\!\left( -\frac{1}{2} \Big( \frac{\tilde{\beta}_{m\sigma}}{\sqrt{\tilde{\alpha}_{m\sigma}}} \Big)^2 \right) + (\mu_\sigma)_{n_{h_l}}\, \Phi\!\left( \frac{\tilde{\beta}_{m\sigma}}{\sqrt{\tilde{\alpha}_{m\sigma}}} \right) \right] - b\,m \sum_{\sigma=\pm 1} p_\sigma\, \sigma\, \Phi\!\left( \frac{\tilde{\beta}_{m\sigma}}{\sqrt{\tilde{\alpha}_{m\sigma}}} \right) Q_{lm} \\
& + a \sum_{\sigma=\pm 1} p_\sigma \left[ \frac{(C_\sigma \alpha_l)_{n_{h_m}}}{\sqrt{2\pi\, \tilde{\alpha}_{l\sigma}}}\, \exp\!\left( -\frac{1}{2} \Big( \frac{\tilde{\beta}_{l\sigma}}{\sqrt{\tilde{\alpha}_{l\sigma}}} \Big)^2 \right) + (\mu_\sigma)_{n_{h_m}}\, \Phi\!\left( \frac{\tilde{\beta}_{l\sigma}}{\sqrt{\tilde{\alpha}_{l\sigma}}} \right) \right] - a \sum_{\sigma=\pm 1} p_\sigma\, \Phi\!\left( \frac{\tilde{\beta}_{l\sigma}}{\sqrt{\tilde{\alpha}_{l\sigma}}} \right) Q_{lm} \\
& + a \sum_{\sigma=\pm 1} p_\sigma \left[ \frac{(C_\sigma \alpha_m)_{n_{h_l}}}{\sqrt{2\pi\, \tilde{\alpha}_{m\sigma}}}\, \exp\!\left( -\frac{1}{2} \Big( \frac{\tilde{\beta}_{m\sigma}}{\sqrt{\tilde{\alpha}_{m\sigma}}} \Big)^2 \right) + (\mu_\sigma)_{n_{h_l}}\, \Phi\!\left( \frac{\tilde{\beta}_{m\sigma}}{\sqrt{\tilde{\alpha}_{m\sigma}}} \right) \right] - a \sum_{\sigma=\pm 1} p_\sigma\, \Phi\!\left( \frac{\tilde{\beta}_{m\sigma}}{\sqrt{\tilde{\alpha}_{m\sigma}}} \right) Q_{lm} \Bigg\} \\
& + \eta^2\, \delta_{lm} \left( a^2 + b^2 \right) \sum_{\sigma=\pm 1} v_\sigma\, p_\sigma\, \Phi\!\left( \frac{\tilde{\beta}_{l\sigma}}{\sqrt{\tilde{\alpha}_{l\sigma}}} \right) + \eta^2\, \delta_{lm}\, a b\, (l + m) \sum_{\sigma=\pm 1} \sigma\, v_\sigma\, p_\sigma\, \Phi\!\left( \frac{\tilde{\beta}_{l\sigma}}{\sqrt{\tilde{\alpha}_{l\sigma}}} \right). \quad (29)
\end{aligned}$$

Here $n_{b_m} = \begin{cases} 3 & \text{if } m = +1 \\ 4 & \text{if } m = -1 \end{cases}$ and $n_{h_m} = \begin{cases} 1 & \text{if } m = +1 \\ 2 & \text{if } m = -1 \end{cases}$.
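The closed system (28), (29) can be integrated numerically, e.g. by a simple Euler scheme. The following sketch is not from the original work; the helper names are hypothetical, the averages are implemented via Eqs. (26), (27) with $\tilde{\alpha}_{l\sigma} = \alpha_l \cdot C_\sigma \alpha_l$ and $\tilde{\beta}_{l\sigma} = \alpha_l \cdot \mu_\sigma - \beta_l$, and small non-zero initial conditions are assumed (at $R = Q = 0$ the argument of $\Phi$ is undefined).

```python
import numpy as np
from math import erf, exp, sqrt, pi

IDX = {+1: 0, -1: 1}
N_H = {+1: 0, -1: 1}     # position of h_m in x = (h_+, h_-, b_+, b_-)
N_B = {+1: 2, -1: 3}     # position of b_m

def Phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def averages(R, Q, lam, v):
    """<Theta_l>_k and <(x)_n Theta_l>_k from Eqs. (26) and (27)."""
    Eth = np.zeros((2, 2))               # indexed [winner l, cluster k]
    Ex = np.zeros((4, 2, 2))             # indexed [component n, winner l, cluster k]
    for l in (+1, -1):
        al = l * np.array([2.0, -2.0, 0.0, 0.0])
        bl = l * (Q[0, 0] - Q[1, 1])
        for k in (+1, -1):
            ki = IDX[k]
            mu = lam * np.array([R[0, ki], R[1, ki],
                                 float(ki == 0), float(ki == 1)])
            C = v[ki] * np.block([[Q, R], [R.T, np.eye(2)]])
            at = al @ C @ al             # alpha-tilde
            bt = al @ mu - bl            # beta-tilde
            Eth[IDX[l], ki] = Phi(bt / sqrt(at))
            Ex[:, IDX[l], ki] = (C @ al) / sqrt(2.0 * pi * at) \
                * exp(-0.5 * bt * bt / at) + mu * Eth[IDX[l], ki]
    return Eth, Ex

def ode_step(R, Q, a, b, eta, lam=1.0, p=(0.5, 0.5), v=(1.0, 1.0), dt=0.01):
    """One Euler step of Eqs. (28) and (29) for the order parameters."""
    Eth, Ex = averages(R, Q, lam, v)
    sg = np.array([+1.0, -1.0])          # sigma value of each cluster index
    pv, vv = np.array(p), np.array(v)
    dR, dQ = np.zeros((2, 2)), np.zeros((2, 2))
    for l in (+1, -1):
        for m in (+1, -1):
            li, mi = IDX[l], IDX[m]
            Db = Ex[N_B[m], li, :] - Eth[li, :] * R[li, mi]
            Dl = Ex[N_H[m], li, :] - Eth[li, :] * Q[li, mi]
            Dm = Ex[N_H[l], mi, :] - Eth[mi, :] * Q[li, mi]
            dR[li, mi] = eta * (a * (pv * Db).sum() + b * l * (pv * sg * Db).sum())
            dQ[li, mi] = eta * (b * (l * (pv * sg * Dl).sum()
                                     + m * (pv * sg * Dm).sum())
                                + a * ((pv * Dl).sum() + (pv * Dm).sum()))
            if l == m:
                w = vv * pv * Eth[li, :]
                dQ[li, mi] += eta ** 2 * ((a * a + b * b) * w.sum()
                                          + a * b * (l + m) * (sg * w).sum())
    return R + dt * dR, Q + dt * dQ
```

Iterating `ode_step` from small random order parameters traces out the learning curves discussed in the main text; for LVQ1 the prototype with label $+1$ specializes to the cluster $\sigma = +1$, i.e. $R_{++}$ grows while $R_{+-}$ does not.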

A.3 The generalization error

Using (26) we can easily compute the generalization error as follows:

$$\epsilon_g = \sum_{k=\pm 1} p_k\, \langle \Theta_{-k} \rangle_k = \sum_{k=\pm 1} p_k\, \Phi\!\left( \frac{\tilde{\beta}_{-k,k}}{\sqrt{\tilde{\alpha}_{-k,k}}} \right), \quad (30)$$

which amounts to Eq. (8) in the text after inserting $\tilde{\alpha}_{-k,k}$ and $\tilde{\beta}_{-k,k}$ as given above with $v_+ = v_- = 1$.
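Eq. (30) expresses the generalization error solely through the order parameters. The sketch below (a hypothetical helper, restricted to $v_+ = v_- = 1$) evaluates it; the accompanying check compares the result with a direct Monte Carlo estimate of the nearest-prototype classification error for explicit vectors.

```python
import numpy as np
from math import erf, sqrt

def eps_g(R, Q, lam=1.0, p_plus=0.5):
    """Generalization error, Eq. (30), for v_+ = v_- = 1, with
    alpha_s = s*(2, -2, 0, 0) and beta_s = s*(Q_++ - Q_--) as in Eq. (18)."""
    Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))
    p = {+1: p_plus, -1: 1.0 - p_plus}
    err = 0.0
    for k, ki in ((+1, 0), (-1, 1)):
        s = -k                           # the wrong prototype wins: an error
        alpha = s * np.array([2.0, -2.0, 0.0, 0.0])
        beta = s * (Q[0, 0] - Q[1, 1])
        mu = lam * np.array([R[0, ki], R[1, ki], float(ki == 0), float(ki == 1)])
        C = np.block([[Q, R], [R.T, np.eye(2)]])
        at = alpha @ C @ alpha
        bt = alpha @ mu - beta
        err += p[k] * Phi(bt / sqrt(at))
    return err
```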
