
Physica A 185 (1992) 428-432 North-Holland

A model for a multi-class classification machine

Albrecht Rau (a) and Jean-Pierre Nadal (b)

(a) Department of Theoretical Physics, University of Oxford, 1 Keble Rd., Oxford OX1 3NP, UK

(b) Laboratoire de Physique Statistique*, Ecole Normale Supérieure, 24, rue Lhomond, F-75231 Paris Cedex 05, France

* Laboratory associated with the CNRS (URA 1306) and with the Universities Paris VI and Paris VII.

We consider the properties of multi-class neural networks, where each neuron can be in several different states. The motivations for considering such systems are manifold. In image processing, for example, the different states correspond to the different grey-tone levels. Other multi-class classification tasks implemented on feed-forward networks are the analysis of DNA sequences and the prediction of the secondary structure of proteins from the sequence of amino acids.

To investigate the behaviour of such systems, one specific dynamical rule, the "winner-take-all" rule, is studied. Gauge invariances of the model are analysed. For a multi-class perceptron with N Q-state input neurons and one Q'-state output neuron, the maximal number of patterns that can be stored in the large-N limit is found to be proportional to N(Q-1) f(Q'), where f(Q') is a slowly increasing and bounded function of order 1.

This contribution to the Trieste conference mainly summarizes the results of ref. [1] and comments on new implications of the work. The Hopfield model of a formal neural network uses neurons having two possible states and is based on an analogy with spin-glass systems. A natural extension of this model of associative memory is to consider neurons taking more than two states. Extensions can be made in the direction of graded-response neurons [2], which still possess neighbourhood relations between the different states. It is, however, also possible to devise systems with a complete symmetry between the states. The corresponding extensions of Ising systems are known in the physics literature as "Potts" systems [3]. Their analogues among connectionists' models have existed for some time in the computer literature under the name of "linear machines" [4].

One can thus consider "Potts-perceptrons", which are general multi-class (possibly multi-layer) feed-forward networks. The number of possible states may differ from layer to layer. In the following we will consider, however, only the simplest case, that is, one input layer with N Q-state neurons and one output layer with one Q'-state neuron (and no hidden layers).


The binary perceptron is one particular case. In fact the perceptron algorithm, as well as its variants, can easily be adapted to multi-class neurons [4, 5]. Recently, the authors of ref. [6] analysed the problem of learning from examples in such systems and were able to demonstrate that multi-class perceptrons solve many problems more economically than more complicated architectures of binary feed-forward networks. A review of the statistical mechanics of learning a rule from examples has recently been given in ref. [7].

The motivations for considering such systems are manifold. In the analysis of DNA sequences, each input node would be in one of Q = 4 states, corresponding to the letters A, T, G and C, and the output might be binary, for example codon versus exon (see e.g. ref. [8]). In the analysis of proteins, if one is interested in predicting the secondary structure from the sequence of amino acids, Q would be 20, the number of different amino acids, and Q' would be 3, the number of different possible structures (α helix, β sheet or random coil).

Here we will be concerned with multi-class neural networks for classification tasks, without specifying the learning algorithm. In particular we will follow the approach initiated by Gardner [9] for computing the maximal storage capacity of such a network. Namely, we will compute the fractional volume of the weights which realize the learning of a set of randomly chosen patterns. We will limit our study to unbiased patterns, although the computation can be extended to biased patterns. In the following we will present the model, discuss gauge invariance properties and give the results for the storage capacity.

Let us define the model for the simplest feed-forward case with N inputs (Q-state neurons) and one Q'-state output neuron. The local field at state s' is given by

h_{s'} = \sum_{j,s} J_j^{s's} (Q \delta_{s,n_j} - 1) .    (1)

The synaptic matrix J_j^{s's} indicates the weight of a signal coming from node j, which is in state s, on the state s' of the processing unit. Unless otherwise indicated, j ∈ {1, ..., N}, s ∈ {1, ..., Q} and s' ∈ {1, ..., Q'}. A pattern of input activities will be denoted by n_j, j = 1, ..., N, with n_j ∈ {1, ..., Q}. The decision rule is

s'_{out} = \{ s'_0 \mid h_{s'_0} - \theta_{s'_0} > h_{s'} - \theta_{s'} \;\; \forall s' \neq s'_0 \} ,    (2)

where the {θ_{s'}} are thresholds. This is a "winner-take-all" rule: one can consider the Q'-state output neuron as Q' binary neurons; neuron s' receives the input field h_{s'} defined by (1), and only the neuron with the highest local field above the threshold will become active.
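To make the model concrete, here is a minimal NumPy sketch (ours, not part of the original paper) of the local field (1) and the winner-take-all rule (2); the function names local_fields and wta_output, and the zero-based state labels 0, ..., Q-1, are our own conventions.

```python
import numpy as np

def local_fields(J, n):
    """Local fields of eq. (1): h[s'] = sum_{j,s} J[s', s, j] * (Q*delta(s, n_j) - 1)."""
    Qp, Q, N = J.shape
    x = Q * (np.arange(Q)[:, None] == n[None, :]) - 1   # Potts encoding of the input, shape (Q, N)
    return np.einsum('psj,sj->p', J, x)

def wta_output(J, n, theta):
    """Winner-take-all rule (2): the output state maximizing h[s'] - theta[s'] wins."""
    return int(np.argmax(local_fields(J, n) - theta))

# example: N = 10 four-state inputs (Q = 4), one three-state output (Q' = 3)
rng = np.random.default_rng(0)
N, Q, Qp = 10, 4, 3
J = rng.normal(size=(Qp, Q, N))
theta = np.zeros(Qp)
n = rng.integers(Q, size=N)          # an input pattern with n_j in {0, ..., Q-1}
print(wta_output(J, n, theta))
```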

Updating rule (2) is invariant under the following translations:

\theta_{s'} \to \theta_{s'} + u_0 , \qquad J_j^{s's} \to J_j^{s's} + u_j^s ,    (3)

and under

J_j^{s's} \to J_j^{s's} + v_j^{s'} , \qquad \theta_{s'} \to \theta_{s'} - \sum_j v_j^{s'} ,    (4)

where u_0, the u_j^s and the v_j^{s'} are arbitrary real numbers. The first transformation adds to each local field a term which, although a function of the input pattern, is independent of the output states. The second set of translations on the couplings modifies the local fields by a term independent of the input pattern, which can then be absorbed into the redefinition of the thresholds. In addition, the updating rule is invariant under a global rescaling

J_j^{s's} \to \lambda J_j^{s's} , \qquad \theta_{s'} \to \lambda \theta_{s'} ,    (5)

for any strictly positive real number λ. These gauge invariances allow one to reduce the number of parameters by "fixing the gauge". One possible choice is to pick one particular input state s_0 and one particular output state s'_0, and to set all the couplings J_j^{s' s_0} and J_j^{s'_0 s} to zero. This is the choice made by several authors [4, 8]. It is clear that all gauge choices are equivalent. Here we will, however, make another choice, namely

\sum_s J_j^{s's} = 0 \;\; \forall s' , \qquad \sum_{s'} J_j^{s's} = 0 \;\; \forall s ,    (6)

which, as one can show by considering the dynamics, is more natural.
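The gauge (6) can be imposed on arbitrary couplings by double-centring each coupling slice, which amounts to translations of the types (3) and (4). Continuing the sketch above (fix_gauge is our own helper, and the thresholds are taken to be zero), one can check that the constraints hold and that the decision of rule (2) is unaffected:

```python
def fix_gauge(J):
    """Project couplings onto the gauge (6): sum_s J[s', s, j] = 0 and sum_{s'} J[s', s, j] = 0."""
    J = J - J.mean(axis=1, keepdims=True)   # translations of type (4): centre over s
    J = J - J.mean(axis=0, keepdims=True)   # translations of type (3): centre over s'
    return J

Jg = fix_gauge(J)
assert np.allclose(Jg.sum(axis=1), 0.0) and np.allclose(Jg.sum(axis=0), 0.0)
assert wta_output(Jg, n, theta) == wta_output(J, n, theta)   # decision is gauge invariant
```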

We now consider the storage capacity of the multi-class perceptron described above. To do that we will compute the maximal number of random input-output pairs {(n_j^μ, j = 1, ..., N), n'^μ}, μ = 1, ..., p, for which there exists a set of couplings such that, for every μ, (2) is true with s'_out = n'^μ, provided the input is pattern number μ. All states occur in the set of patterns with the same probability, which is equivalent to ⟨Q δ_{s,n_j} - 1⟩_P = 0, where the pattern average is indicated by ⟨...⟩_P. Since the patterns are unbiased, we can set all the thresholds to zero. The translational freedom of the local fields is fixed by (6). We will consider only the case where the scalar degree of freedom is fixed independently for each s' by

\sum_{j,s} (J_j^{s's})^2 = N (Q-1)(Q'-1)/Q' .    (7)

From the gauge invariances we see that there are only N(Q-1)(Q'-1) free parameters. For each pattern there are Q'-1 inequalities which need to be satisfied.
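This parameter count can be verified numerically: the double-centring of the earlier sketch is a linear projector onto the gauge (6), and its rank is exactly the number of free parameters (again our own check, reusing fix_gauge from above).

```python
# rank of the projector onto the gauge (6) for small sizes
N, Q, Qp = 3, 4, 3
dim = Qp * Q * N
P = np.zeros((dim, dim))
for k in range(dim):
    e = np.zeros(dim)
    e[k] = 1.0
    P[:, k] = fix_gauge(e.reshape(Qp, Q, N)).ravel()
print(np.linalg.matrix_rank(P), N * (Q - 1) * (Qp - 1))   # both give 18
```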


Thus one can expect that the maximal number p_max of patterns that can be stored is p_max = α(Q-1)N. We will see that this is indeed the case, with α a slowly varying, bounded function of Q' only. We follow Gardner's maximum entropy method [9] and define the partition function of the system as

Z = \int d\mu(J) \, \prod_\mu \prod_{s' (\neq n'^\mu)} \Theta\big( h^\mu_{n'^\mu} - h^\mu_{s'} - \kappa \big) .    (8)

Here h_{s'}^μ is the local field when the input is pattern μ, and dμ(J) is the natural measure in the space of interactions which is compatible with (6) and (7). As in ref. [9], we are asking for a minimal stability κ. In order to evaluate the quenched average over the distribution of patterns of the entropy S ≡ ⟨ln Z⟩, we use the replica method. Assuming the validity of a replica-symmetric solution and shrinking the volume of interactions to zero, we find for the maximal storage capacity α

\alpha^{-1} = \int Dy \, \frac{ \tfrac{1}{2} H_2 (1 - H_0)^{Q'-1} + Q' H_0 - \tfrac{1}{2} (H_0)^2 \, Q'(Q'-1) \, \big[ 2 (H_1)^2 + H_0 H_2 \big] }{ Q'(Q'-1) (H_0)^3 } .    (9)

The Gaussian measure is indicated by Dy ≡ exp(-y²/2)/√(2π) dy, and we have introduced the functions H_i(y) ≡ ∫_{-y}^{+∞} Dt (t + y + κ)^i, i = 0, 1, 2. Details of the derivation of (9) can be found in ref. [1]. For Q' = 2 one recovers the well-known result α(κ = 0) = 2. For Q' = 3 we find the analytical result α(κ = 0) = 2/(1 - √3/4π) ≈ 2.320. In the large-Q' limit one gets α(Q' ≫ 1) = ∫ dy [H_2 - (H_1)²/H_0] ≈ 3.850. For Q' = 4, 5, 6 we find values of approximately 2.546, 2.714 and 2.844. The critical capacity as measured by α is thus an increasing function of Q', saturating at a value of about 3.85. However, the information content i(Q') = α(Q') ln Q'/(Q' - 1), in nats per (free) parameter, is a decreasing function of Q', going to zero when Q' goes to infinity. Hence we find that, qualitatively, the optimal behaviour with Q' is similar to the one found for the Hebb rule.
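As a quick arithmetic check (ours, not in the paper), the analytical value quoted for Q' = 3 and the claimed decrease of the information content follow directly from the numbers above:

```python
import math

alpha3 = 2.0 / (1.0 - math.sqrt(3.0) / (4.0 * math.pi))
print(f"alpha(Q'=3) = {alpha3:.3f}")                    # -> 2.320

# i(Q') = alpha(Q') ln(Q') / (Q' - 1), in nats per free parameter,
# with the capacities quoted in the text: it decreases monotonically
for Qp, a in [(2, 2.0), (3, alpha3), (4, 2.546), (5, 2.714), (6, 2.844)]:
    print(Qp, round(a * math.log(Qp) / (Qp - 1), 3))    # 1.386, 1.274, 1.177, 1.092, 1.019
```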

There are perceptron-type algorithms which allow one to find a set of couplings whenever at least one solution exists [4, 5]. Moreover, such algorithms make it possible to respect any particular gauge choice. To see this, let us give the perceptron algorithm for the gauge choice (6). One starts with zero couplings. Then the following is repeated until convergence: pick a pattern μ at random; for any s' ≠ n'^μ such that h_{s'} > h_{n'^μ}, make a learning step for every j and every s by

J_j^{n'^\mu s} \to J_j^{n'^\mu s} + (Q \delta_{s, n_j^\mu} - 1) , \qquad J_j^{s's} \to J_j^{s's} - (Q \delta_{s, n_j^\mu} - 1) .    (10)

It is clear that at each time step the current couplings will satisfy (6).
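A direct transcription of this procedure, as a sketch under the conventions of the earlier code fragments (it reuses local_fields; ties are treated as errors as well, so that learning can proceed from the all-zero initialization):

```python
def train_potts_perceptron(patterns, targets, Q, Qp, max_sweeps=1000, seed=None):
    """Perceptron algorithm respecting the gauge (6), with learning step (10)."""
    rng = np.random.default_rng(seed)
    p, N = patterns.shape
    J = np.zeros((Qp, Q, N))                    # zero couplings satisfy (6)
    for _ in range(max_sweeps):
        errors = 0
        for mu in rng.permutation(p):
            n, t = patterns[mu], targets[mu]
            h = local_fields(J, n)
            x = Q * (np.arange(Q)[:, None] == n[None, :]) - 1
            for sp in range(Qp):
                if sp != t and h[sp] >= h[t]:   # a wrong output state wins (or ties)
                    J[t] += x                   # step (10): reinforce the correct output ...
                    J[sp] -= x                  # ... and weaken the offending one
                    errors += 1
        if errors == 0:
            break                               # all patterns stored
    return J

# demo well below capacity: p = N = 20 random patterns, Q = 4, Q' = 3
rng = np.random.default_rng(1)
pats = rng.integers(4, size=(20, 20))
targs = rng.integers(3, size=20)
J_learned = train_potts_perceptron(pats, targs, Q=4, Qp=3, seed=2)
assert np.allclose(J_learned.sum(axis=1), 0) and np.allclose(J_learned.sum(axis=0), 0)
```

Each learning step adds (Q δ_{s,n_j^μ} - 1) to one output row and subtracts it from another, so both sums in (6) indeed remain zero at every step.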

In this article we have formulated a model for a system which can solve multi-class classification problems and derived the storage capacity for the simplest case of such a system. It is clear that most of the analysis done for the binary perceptron (Q = Q' = 2) can be generalized to the Potts perceptron. In particular, it would be interesting to consider cost functions other than the number of errors considered here, and to investigate the properties above saturation. In practical applications of neural networks, winner-take-all systems are widely used, which is at present the main reason for pursuing the analytical study of such systems.

One of us (A.R.) wants to thank the group at the Ecole Normale Supérieure for their kind hospitality during his stay there. He further acknowledges financial support by the Studienstiftung des deutschen Volkes, the SERC and Corpus Christi College, Oxford. The work has been supported by EEC BRAIN twinning contracts (ST2J.0312.C(EDB) and ST2J.0422.C(EDB)). We would like to thank Marc Mézard for several discussions.

References

[1] J.-P. Nadal and A. Rau, J. Phys. (Paris) I 1 (1991) 1109.
[2] S. Mertens, H. Köhler and S. Bös, Universität Göttingen preprint (1991).
[3] R.B. Potts, Proc. Camb. Phil. Soc. 48 (1952) 106; F.Y. Wu, Rev. Mod. Phys. 54 (1982) 235.
[4] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis (Wiley, New York, 1973) ch. 5.
[5] S.I. Gallant, IEEE Trans. Neural Networks 1 (1990) 179.
[6] A. Rau, T. Watkin, D. Bollé and J. van Mourik, J. Phys. (Paris) I 2 (1992) 167.
[7] T.L.H. Watkin, A. Rau and M. Biehl, The statistical mechanics of learning a rule, University of Oxford preprint (1992).
[8] A. Lapedes, preprint (1990).
[9] E. Gardner, J. Phys. A 21 (1988) 257.