
Artificial Intelligence Review 7, 313--328, 1993. © 1993 Kluwer Academic Publishers. Printed in the Netherlands.

Adaptive generalisation

NOEL E. SHARKEY & AMANDA J. C. SHARKEY

Center for Connection Science, University of Exeter

Abstract. Adaptive generalisation is the ability to use prior knowledge in the performance of novel tasks. Thus, if we are to model intelligent behaviour with neural nets, they must be able to generalise across task domains. Our objective is to elucidate the aetiology of transfer of information between connectionist nets. First, a method is described that provides a standardised score for the quantification of how much task structure a net has extracted, and to what degree knowledge has been transferred between tasks. This method is then applied in three simulation studies to examine Input-to-Hidden (IH) and Hidden-to-Output (HO) decision hyperplanes as determinants of transfer effects. In the first study, positive transfer is demonstrated between functions that require the vertices of their spaces to be divided similarly, and negative transfer between functions that require decision regions of different shapes. In the other two studies, input and output similarity are varied independently in a series of paired associate learning tasks. Further explanation of transfer effects is provided through the use of a new technique that permits better visualisation of the entire computational space by showing both the relative position of inputs in Hidden Unit space, and the HO decision regions implemented by the set of weights.

Key Words: artificial neural nets, transfer of training, hyperplane method, machine learning

1. INTRODUCTION

Over 2000 years ago, in his principle of association by resemblance, Aristotle recognised the importance of generalisation and suggested that the recognition of similarity was fundamental to mental processes. In more recent times, James (1890), in the heyday of the doctrine of the neuron, and later Hebb (1949), discussed how learning and generalisation could emerge from the nervous system. Indeed their research has given rise to much of the modern field of connectionism or neural computing, and the ability of trained nets to generalise has become a standard performance measure. However, connectionist research, unlike earlier work on intelligent behaviour, has been concerned almost exclusively with generalisation from one set of examples to another set within the same task domain. Although important, such performance would be a poor indicator of animal intelligence.

More fundamental to the nature of intelligence is adaptive generalisation, or the ability to generalise across task domains. We know that people tackle novel tasks by building on past experience; not just in terms of their frequency of exposure to pattern sets for a particular task, but by drawing analogies to other similar problems. This is what gives them the power to deal with an ever-changing world. For these reasons, studies of cross-task generalisation, or transfer, were first conducted by such notable figures as Ebbinghaus (1885), William James (1890), and Thorndyke & Woodworth (1901). In the middle of this century, methods were developed to show how the learning of one task may facilitate or inhibit learning of a second task, either because of a general factor of learning to learn (e.g. Harlow 1949), or because specific aspects of the first task facilitate or inhibit the performance on the second (Gibson 1941; Hamilton 1943; Osgood 1949). It is our purpose here to exploit this stable methodology to assess the ability of connectionist nets to generalise between task domains.

In terms of connectionist research, adaptive generalisation can be identified with the process by which a set of weights developed during the learning of one task is adapted for use on a second (novel) task. The starting point for the second task can therefore be one of experience; its initialising weights incorporate the knowledge gained during training on the first task. The question here then is one of understanding when knowledge can be transferred between nets: identifying the circumstances under which pretraining on one task will assist (or interfere with) the performance of a subsequent task. There are a number of reasons why connectionists should be concerned to apprehend the conditions under which transfer between nets can be obtained: (i) as is evident from the preceding paragraph, it makes little sense, from a cognitive science perspective, to have neural nets that are trained from scratch on every new task and that are unable to draw on any prior knowledge; (ii) insofar as neural nets are used for psychological modelling, training nets that are in effect tabula rasa flies in the face of the mass of evidence of the important role of innate structure in the brain. In fact it has been argued elsewhere (Bates & Elman, in press) that prestructured connectionist nets could provide a strong basis for constructivist accounts of development, according to which cognition emerges from the interaction between innate structures and relevant experience; and finally (iii) it is simply not practical to have to create and train an entirely new net as each fresh problem is approached.

The problem addressed in this paper is that of understanding the determinants of transfer in neural nets. In the first section of the paper we compare the standard within-task generalisation with adaptive generalisation, and show how the standard methods for assessing generalisation performance do not assist our comprehension of transfer relationships between tasks. In the next section, we describe a method that provides a standardised score for the quantification of how much of the structure of a task has been extracted by a net, and to what degree knowledge has been transferred between tasks. We then turn to examine the role of Input-to-Hidden mappings as determinants of transfer effects. Examples of positive and negative transfer between three simple functions (Autoassociation, Bit displacement and Parity) are given, and their explanation in terms of IH hyperplanes considered. Subsequently, in two detailed simulation studies, the effects of input and output similarity are examined in a series of paired associate learning tasks. Varying the similarity of the training data in this manner has the effect of distorting the 'appropriate' hyperplanes along various numbers of axes in either input or hidden unit space. A new method of visually presenting the computational space of a net is described, and used in conjunction with the empirical findings to provide an explanation of the transfer effects obtained.

2. A COMPARISON OF WITHIN-TASK AND ADAPTIVE GENERALISATION

Before describing the methods we have developed for examining adaptive generalisation, we first briefly examine the problem of within-task generalisation and how generalisation performance is measured. This enables us to see how the two problems differ in fundamental ways and thus require different assessment procedures.

Within-Task Generalisation

Ideally, a net is trained on a function f from a set of inputs, V, to a set of outputs, O, such that it assigns to each v ∈ V a unique output value f(v) ∈ O. This is not difficult, say using a universal approximator such as backpropagation in a multilayer net (see White 1992), provided that the total function is used for training, i.e. the total set of input/output pairs in the pattern set, P = {(v, o) | v ∈ V, o ∈ O}. The within-task generalisation problem arises when P is too large for it to be practical to train a net on all p ∈ P within a reasonable time, and thus it is typically trained on a sample, S, of the input/output pairings. The problem is that any sample S, by definition of the word sample, will be a subset of at least one other set of ordered pairs P′ that are the range and domain of another function g, that is S ⊂ (P ∩ P′). Indeed, as illustrated in Fig. 1, depending on the size of S relative to P, there could be many other functions that are compatible with, i.e. not excluded by, S: g_i(v) = o, for every (v, o) ∈ S, i = 1, 2, ..., n (see also Sharkey & Partridge 1992 for an examination of the statistical independence of network solutions to the same task starting from different random initial conditions). In other words, the net may extract some function g from exposure to S instead of extracting the desired function f. So, in order to determine 'how well' a net has extracted f, it is standardly tested on a further set of input/output pairs T ⊂ P, where S ∩ T = ∅. Although several different methods have been employed for assessing the correctness of each n(v) for (v, o) ∈ T (where n is the actual rather than the desired function learned by the net), they mostly boil down to calculating how many n(v) = o, for all (v, o) ∈ T. The degree of accuracy on the set T is known as the 'percentage correct generalisation'.
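
The standard measure is easily stated in code. Below is a minimal sketch, assuming the trained net is available as a callable n and the test set T as a sequence of input/output pairs; the names are ours, not the paper's.

    from typing import Callable, Sequence, Tuple

    Pair = Tuple[tuple, tuple]

    def percent_correct(n: Callable[[tuple], tuple], T: Sequence[Pair]) -> float:
        # Share of held-out pairs (v, o) for which the net's actual
        # function n produces the desired output, n(v) == o.
        correct = sum(1 for v, o in T if n(v) == o)
        return 100.0 * correct / len(T)

    # Example: a net that happens to compute identity, tested on two pairs.
    print(percent_correct(lambda v: v, [((0, 1), (0, 1)), ((1, 1), (1, 0))]))  # 50.0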

Adaptive Generalisation

The problem for adaptive generalisation, in its simplest form, is to train a net on some initial function f such that the structure extracted on the weights can be used to facilitate the learning of a second function g.


Fig. 1. Venn diagram showing the relationship between a Sample set S, and sets of ordered pairs (P, P1, P2 and P3). See the text for details.

In this case, the percentage correct generalisation is of little value, since, at its best, it can only tell us what function the net is currently performing accurately. It tells us nothing about what structure a net has extracted on the weights that may be beneficial to the training of other functions, or, for that matter, which other functions it could be useful for. This is obvious since the two sets of ordered pairs, P and P′, for two functions may not overlap, P ∩ P′ = ∅, or they may have a very small intersection. For example, if we wished to study the relationship between autoassociation and bit displacement, the best percentage correct generalisation that we could obtain after training a net on one total function would be 2/(2^n × 0.01), because P ∩ P′ would contain only 2 pairs: all 0s and all 1s. Nonetheless, there is a relationship between the input/output pairs in the two tasks, e.g. the average Hamming distance, in 5-point space, between the vectors v ∈ V and v′ ∈ V′ for shared inputs in V is 2.5.
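
The figure quoted above can be checked directly. The sketch below assumes that Bit displacement means a circular shift of each 5-bit input by one position (our assumption; the function is not defined at this point in the text), and averages the Hamming distance between each vector and its shifted counterpart over all 32 inputs.

    from itertools import product

    def hamming(a, b):
        # Number of positions at which two equal-length vectors differ.
        return sum(x != y for x, y in zip(a, b))

    vectors = list(product([0, 1], repeat=5))
    shifted = [v[1:] + v[:1] for v in vectors]  # assumed: circular shift by one bit
    print(sum(hamming(v, s) for v, s in zip(vectors, shifted)) / len(vectors))  # 2.5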

To summarise, the problem for within-task generalisation is to ensure that a net has extracted the desired function f and not some other function g. Conversely, the problem for adaptive generalisation is to train a net on a function f such that it will facilitate the learning of another function g. Of course these problems are not mutually exclusive. If adaptive generalisation is attempted with samples from f and g, the performance of the net will still have to be assessed using the standard generalisation tests. In this paper we partial out the within-task generalisation problems by training the nets on total functions. This enables us to focus exclusively on the problem of adaptive generalisation.

3. A METHOD FOR ASSESSING TRANSFER EFFECTS

One method for examining the amount of useful structure extracted from a task was employed in the psychology of the 1950s. The concern then was to determine the amount of transfer between different subjects taught in school, e.g. does training in Greek or Latin improve performance in other school subjects? To answer such questions, a basic transfer methodology was devised in which one group of people received training on Task A followed by training on Task B. Their training time on Task B was then compared to that of another group who learned Task B without pretraining on Task A. In this way, it was possible to calculate whether there was positive, negative, or no transfer from Task A to Task B. The same method can be employed for quantifying the transfer of structure between nets.

To demonstrate the efficacy of this method, we give a simple example that investigates the relation between Autoassociation and Bit displacement. First a net was pretrained using the backpropagation learning algorithm on the total function of Autoassociation, f_a. Then the Input-Hidden weights from this net were used to initialise the lower half of a net to be trained on the total function of Bit displacement, f_b (only the lower weights were used here so that transfer could be compared with functions differing in the size of their output space). The number of cycles required to train the prestructured net could then be compared with the number of cycles required to learn f_b on a net with random initial structure. To ensure that our results were stable, 40 nets were trained from different random initial conditions. Each of these 5-5-5 nets was trained with a learning rate η = 0.75 and an error tolerance of 0.1.
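
The procedure can be sketched as follows. This is not the authors' code: it uses full-batch rather than per-pattern weight updates, an initialisation range of our own choosing, and assumes Bit displacement to be a one-bit circular shift; but it follows the 5-5-5 architecture, learning rate η = 0.75, and error tolerance of 0.1 given above.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def with_bias(a):
        # Append an always-'on' bias unit to every pattern.
        return np.hstack([a, np.ones((len(a), 1))])

    def train(X, T, W1, W2, lr=0.75, tol=0.1, max_cycles=10000):
        # Backpropagation until every output is within tol of its target;
        # returns the number of training cycles used. W1 and W2 are updated
        # in place, so pretrained weights can be reused afterwards.
        for cycle in range(1, max_cycles + 1):
            H = sigmoid(with_bias(X) @ W1)        # Input-to-Hidden (IH)
            O = sigmoid(with_bias(H) @ W2)        # Hidden-to-Output (HO)
            err = T - O
            if np.all(np.abs(err) < tol):
                return cycle
            dO = err * O * (1 - O)                # output-layer deltas
            dH = (dO @ W2[:-1].T) * H * (1 - H)   # hidden deltas (bias row dropped)
            W2 += lr * with_bias(H).T @ dO
            W1 += lr * with_bias(X).T @ dH
        return max_cycles

    def init(n_in, n_out):
        return rng.uniform(-0.5, 0.5, (n_in + 1, n_out))

    X = np.array([[int(b) for b in f"{i:05b}"] for i in range(32)], dtype=float)
    f_a = X.copy()                  # Autoassociation: identity mapping
    f_b = np.roll(X, -1, axis=1)    # Bit displacement (assumed: circular shift)

    W1, W2 = init(5, 5), init(5, 5)
    train(X, f_a, W1, W2)                               # pretrain on f_a
    baseline = train(X, f_b, init(5, 5), init(5, 5))    # random initial weights
    pretrained = train(X, f_b, W1.copy(), init(5, 5))   # reuse only the IH weights
    print(baseline, pretrained)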

The results in Table I show a mean reduction in the training times of 97.7 cycles for pretraining on f_a then f_b, compared to training on f_b from random initial conditions. A comparison of the two conditions with an independent t-test yielded a statistically reliable difference (t = 5.93, df = 78, p < 0.001).

TABLE I
Transfer from Autoassociation to Bit displacement task

                               Bit displacement    Bit displacement      Reduction in
                               from random seed    in net pretrained     training cycles
                                                   on Autoassociation
Mean no. of training cycles    233.7               136                   97.7


This is clear evidence of positive transfer between f_a and f_b. However, quantifying transfer with absolute values presents a problem when comparisons with transfer to other functions are required. For example, a 97.7 cycle reduction would be insignificant if the number of training cycles had been 10,000 (the same problem applies to the savings measure considered by Hetherington and Seidenberg, 1989). To overcome this problem we normalised the transfer measure:

    τ = (β − ρ) / (β + ρ)

where β (Baseline) is the number of cycles needed to train a net on the target task starting from a set of random weights, and ρ is the training time taken to learn the target task for a prestructured net. Thus −1 ≤ τ ≤ 1 provides a means of quantifying the relationship between tasks on a −1 to 1 scale and allows direct comparison between transfer scores on any tasks. Accordingly, the positive transfer between f_a and f_b was τ = 0.26 (see Mundy & Sharkey 1992 for a different method of examining structure extraction).
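
In code, the score is a one-liner; a small sketch with our own naming:

    def transfer_score(beta, rho):
        # beta: cycles to train from random weights (Baseline);
        # rho: cycles for the prestructured net.
        return (beta - rho) / (beta + rho)

    print(round(transfer_score(233.7, 136.0), 2))  # 0.26, the value reported above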

4. SOME DETERMINANTS OF TRANSFER EFFECTS

Although the simulations reported here show a positive transfer relationship between the two tasks of Autoassociation and Bit displacement, nothing has been said about the possible determinants. In these two tasks, and in all of the research reported here, we used feedforward nets consisting of two weight layers, with a sigmoidal transfer function f(v) = 1/(1 + e^(−Σ_i w_ij v_i)), where w_ij is the weight between the ith and jth units, and v_i is the activation value on the ith unit. For the first weight layer of the net, f(v) = h, the hidden unit values, and for the second layer, f(f(v)) = f(h) = o, the output values. We may thus think of the net as being divided into two subnets with the output of one acting as input to the other: (i) Input to Hidden (IH) with f(v) = h, and (ii) Hidden to Output (HO) with f(h) = o. This means that we can examine how transfer works by analysing the mappings that have been implemented in the input space by the IH weights and those that have been implemented in hidden unit space by the HO weights.

With the inputs being a set of n-dimensional binary vectors, V, consisting of all 2^n combinations of 1s and 0s, the input space may be described as a unit hypercube (or square or cube), {0, 1}^n = {v = (v_1, v_2, ..., v_n) ∈ R^n | v_i ∈ {0, 1}, for all 1 ≤ i ≤ n}. The input vectors are vertices of the hypercube. One way to examine the computational space created by the IH weights is in terms of decision hyperplanes through the input hypercube (e.g. Pratt et al. 1991). That is, the IH weights describe decision hyperplanes that divide the hypercube into distinct regions. There is one such hyperplane for each hidden unit, perpendicular to each of their respective weights, where the decision hyperplane for the jth hidden unit is given by Ψ_j = Σ_i w_ij v_i + w_0j = θ, where w_0j is a weight to j from a bias unit that is always 'on' and θ is a threshold value. For example, in a net with two input units and one hidden unit, let θ = 0; then the x and y intercepts for the decision line of the hidden unit are x = −w_0/w_1 and y = −w_0/w_2. If training a net has been successful, such decision lines separate the input vertices in a way that enables f(h) to complete the computation of the function.
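
The intercepts and the resulting partition of the unit square can be computed directly; the weight values below are illustrative only.

    def intercepts(w0, w1, w2):
        # Axis intercepts of the decision line w1*x + w2*y + w0 = 0 (theta = 0).
        return -w0 / w1, -w0 / w2

    def above(v, w0, w1, w2):
        # Which side of the decision line the vertex v falls on.
        return w1 * v[0] + w2 * v[1] + w0 > 0

    w0, w1, w2 = -1.5, 1.0, 2.0         # illustrative weights
    print(intercepts(w0, w1, w2))       # (1.5, 0.75)
    for v in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(v, above(v, w0, w1, w2))  # only (0, 1) and (1, 1) lie above this line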

Now, if the IH decision hyperplanes for f_a also segment the input space appropriately for f_b, there should be positive transfer between f_a and f_b (see also Pratt et al. 1991). To illustrate this point, two nets were trained on f_a and f_b in two point space with two hidden units. This enabled us to graph the IH decision hyperplanes (lines) for both hidden units for each of the two functions, as shown in Figs. 2(a) and 2(b). Note that in both figures, the plane is divided in the same way into 4 quadrants such that each vertex of the square is isolated. This is a requirement for both functions (which need n hyperplanes to separate the 2^n vertices of the input hypercube). It is not surprising then that pretraining a net on f_a results in positive transfer for f_b, since in this analysis the IH hyperplanes will have already been set up appropriately, and all that is needed is to alter the HO weights such that the correct output is produced for the new task.

Contrast Figs. 2a and b with the decision hyperplanes required for the XOR task shown in Fig. 2c. In the latter case, the role of the hyperplanes is to linearly separate the two input classes, V = {(0, 1), (1, 0)} and {(1, 1), (0, 0)}. The vertices are thus being grouped rather than separated as in f_a and f_b. Consequently, we should find negative transfer effects for training a net on the parity task, f_p (XOR for n > 2), that has been pretrained on f_a. Using the same experimental paradigm and learning parameters as in the previous simulations, forty 5-5-5 nets were first trained on f_a and then the IH weights were copied over to a 5-5-1 net for training f_p. This was contrasted with f_p being trained from random initial conditions. As expected there was negative transfer, τ = −0.34, as shown in Table II (the difference in training times was statistically significant in an independent t-test, t = −2.735, df = 57, p < 0.005).

Fig. 2. (a) IH decision hyperplanes for Autoassociation, (b) IH decision hyperplanes for Bit displacement, (c) IH decision hyperplanes for the XOR task. The coordinates in the corners indicate the input patterns.


TABLE II
Transfer from Autoassociation to Parity task

                               Parity from        Parity (in net        Transfer
                               random seed        pretrained on
                                                  Autoassociation)
Mean no. of training cycles    1257.4 (39)        2534.3 (20)           −0.34

(Numbers in brackets show the number of nets that converged within 10,000 cycles.)

On the basis of the analyses presented here, it is clear that setting the IH weights appropriately (or not) is an important determinant of transfer. In the first case, pretraining on Autoassociation set up the hyperplane decision regions appropriately for Bit displacement and positive transfer was found. In the second case, pretraining on Autoassociation set up the hyperplane decision regions inappropriately for Parity and thus there was negative transfer.

But is this the whole story? In related work, Pratt (1992) used a symbolic algorithm (DBT) to set up approximately correct IH weights. The findings from a number of experiments led Pratt to the conclusion that IH weights were the main determinants of transfer, and that 'since HO representations depend completely on proper IH representations ..., it is best to set HO weights to low random values to achieve maximum flexibility.' However, Pratt did not provide any direct comparison between IH and HO transfer. This is the function of the simulation studies reported next.

5. THE CAUSAL ROLES OF IH AND HO DECISION HYPERPLANES

To investigate whether the HO hyperplanes are a factor in determining transfer of training, two simulation studies were conducted in which either the IH or HO weights were systematically manipulated in a paired associate learning task. This is a well known psychological task in which people are presented with a stimulus set and are required to memorise a response set over a number of training trials. The sorts of pairings that have been used range from pairing nonsense figures with words (Gibson 1941), to using pairs of nonsense letter trigrams (Bugelski 1942). We were particularly interested to know whether our results would have the same general characteristics as those found in the psychological studies. These were summarised by Osgood (1949) in terms of 3 empirical generalisations:

1. Stimuli varied and Responses held constant produces positive transfer.
2. Stimuli held constant and Responses varied produces negative transfer.
3. Stimuli and Responses both varied: negative transfer that increases as the similarity of the responses increases.

The paired-associates used in the simulations reported here were all letter trigrams, as shown in Tables III and V. Unlike the functions we have been examining up until now, paired associate learning is not ruleful since the input/output mappings are arbitrary. Thus, there can be no generalisation to novel pairs. The advantage is that it is possible (as in the psychology studies) to independently vary the similarity of the stimuli and the responses used in the tasks without regard for any global function that may be learned over the whole training set.

Simulation Study 1: The Manipulation of the IH Decision Hyperplanes

The same transfer of training procedure was used here as in the preliminary simulations reported earlier. In the Pretrained condition, a net was first trained on an Initial set of 5 pairs of input/output trigrams (one bit is used to represent each letter). Then the whole net was retrained on one of three Variants (Table III). These altered each input trigram by 1, 2, or 3 letters. This is equivalent to distorting each of the two IH decision hyperplanes along 5, 8, or 12 of the 12 input axes (there are 12 rather than 30 axes because only 12 of the possible 30 inputs are actually used). The 12 axes of the input space reflect the number of times a different letter appears in a different position: in the first position, 5 different letters are used; in the second position, 3 different letters; and in the third position, 4 different letters. If the IH hyperplanes are the sole determinant of transfer, then transfer performance should deteriorate to Baseline over the three input variants. Since the letters used as input in Variant 3 were orthogonal to the Initial training set, the weights from those input units were never altered; they maintain their random initial values. Thus, if IH weights were the sole determinants of performance, training time for Variant 3 should be no different to training time from random initial conditions.

TABLE III
Paired Associate materials

Initial     Variant 1   Variant 2   Variant 3
EBA-ABC     JBA-ABC     JGA-ABC     JGF-ABC
DEC-BCD     IEC-BCD     IJC-BCD     IJH-BCD
AEB-CDE     FEB-CDE     FJB-CDE     FJG-CDE
CDA-DEA     HDA-DEA     HIA-DEA     HIF-DEA
BDE-EAB     GDE-EAB     GIE-EAB     GIJ-EAB

In total, six simulations were conducted: one for each of the Variant conditions from random initial conditions, and one for each of the Variant conditions following pretraining on the Initial pairs. All nets had a 30-2-30 architecture, and were trained with a learning rate of 0.75 and an error tolerance of 0.1.
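
On our reading, 'one bit per letter' is a one-hot code over the ten letters A--J for each of the three positions, which accounts for the 30 input (and output) units. A sketch of that encoding (the helper name is ours):

    ALPHABET = "ABCDEFGHIJ"  # the ten letters appearing in Tables III and V

    def encode_trigram(trigram):
        # 30-bit vector: one-hot code for each letter within its position.
        vec = [0] * 30
        for pos, letter in enumerate(trigram):
            vec[10 * pos + ALPHABET.index(letter)] = 1
        return vec

    print(encode_trigram("EBA"))  # 1s at positions 4 (E), 11 (B) and 20 (A)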

The transfer results (averaged over 20 nets), summarised in Table IV,¹ clearly show that the HO decision hyperplanes are a powerful determinant of transfer.


TABLE IV
Varying inputs, holding outputs constant

Net         Transfer
Variant 1   0.9
Variant 2   0.9
Variant 3   0.88

Using independent t-tests, it was found that the transfer effects for all three variants were based on a significant difference (p < 0.0001) between the training times from random weights and the training times for pretrained nets. These findings run counter to Pratt's proposal that the IH weights are the main determinants of transfer effects. We incrementally shifted the approximation of the IH representations (from similar to random) and found positive transfer in all cases. Even when the IH weights were set to small random values, the correct HO hyperplanes yielded a strong positive transfer effect, τ = 0.89. We shall return to examine these differences in more detail after describing a study in which output similarity is systematically manipulated.

Simulation Study 2: The Manipulation of the HO Decision Hyperplanes

The same paired associate task, the same transfer of training procedure, and the same learning parameters were used as in Study 1. The only change is that this time the inputs were held constant whilst the outputs were varied (see Table V). For the pretraining condition in these simulations, the IH hyperplanes are set up appropriately for at least one solution to the task. However, in order to make use of this solution, the learning algorithm has to suppress 1, 2, or 3 of the outputs from a given input and also create 1, 2, or 3 new outputs.

The results from Study 2 were markedly different from those in Study 1. Negative transfer was observed for all three 'similarity' variants (see Table VI), although only two of these transfer effects turned out to be statistically reliable. Using independent t-tests, it was found that the negative transfer effect obtained in Variant II was based on a significant difference (p < 0.05), as was the negative transfer effect obtained in Variant III (p < 0.05).

TABLE V
Paired Associate materials

Initial     Variant 1   Variant 2   Variant 3
ABC-EBA     ABC-JBA     ABC-JGA     ABC-JGF
BCD-DEC     BCD-IEC     BCD-IJC     BCD-IJH
CDE-AEB     CDE-FEB     CDE-FJB     CDE-FJG
DEA-CDA     DEA-HDA     DEA-HIA     DEA-HIF
EAB-BDE     EAB-GDE     EAB-GIE     EAB-GIJ

TABLE VI
Holding inputs constant, varying outputs

Net         Transfer
Variant 1   −0.016
Variant 2   −0.036
Variant 3   −0.041

The negative transfer effect in Variant I was not statistically significant. These results, in combination with those from Study 1, demonstrate that the way in which the HO decision hyperplanes carve up hidden unit space is an important determinant of network transfer. The message here is that even if the IH decision hyperplanes are set up for an appropriate intermediate solution in a task, if the HO weights are inappropriate there will be negative transfer.

In preparing this paper we employed a new visualisation technique, computational space diagrams, that enables us to view the entire computational space of the nets at a glance. This method was used to take a closer look at the computation being performed in Simulation 2. First, the five inputs are plotted in hidden unit space. Since two hidden units were used in all of the simulations, the hidden unit activations resulting from the presentation of each input can be used as co-ordinates for plotting the relative position of the inputs in hidden unit space. This results in a more accurate description of the inputs in hidden unit space than drawing IH hyperplanes. Second, on the same graph, the HO hyperplanes can be drawn. The HO hyperplanes divide up the hidden unit space into distinct regions, each corresponding to the required outputs (in the same way that the IH hyperplanes divide up the input space into decision regions). The combination of HO hyperplanes with the positioning of the inputs in the same hidden unit space makes it possible to gain a comprehensive picture of the computation performed by the net.
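
A sketch of how such a diagram might be drawn, assuming a trained net with two hidden units; the hidden activations and HO weights passed in would come from that net, and all names here are illustrative:

    import numpy as np
    import matplotlib.pyplot as plt

    def computational_space(hidden_acts, labels, W_ho, b_ho):
        # hidden_acts: (n_inputs, 2) hidden activations, one row per input;
        # W_ho: (2, n_outputs) HO weights; b_ho: (n_outputs,) HO biases.
        fig, ax = plt.subplots()
        for (h1, h2), name in zip(hidden_acts, labels):
            ax.plot(h1, h2, "k*")        # position of the input in hidden unit space
            ax.annotate(name, (h1, h2))
        h1s = np.linspace(0.0, 1.0, 2)
        for j in range(W_ho.shape[1]):   # one decision line per output unit:
            w1, w2 = W_ho[:, j]          #   w1*h1 + w2*h2 + b = 0
            if abs(w2) > 1e-9:           # skip (near-)vertical lines for brevity
                ax.plot(h1s, -(w1 * h1s + b_ho[j]) / w2)
        ax.set_xlim(0, 1); ax.set_ylim(0, 1)
        ax.set_xlabel("hidden unit 1"); ax.set_ylabel("hidden unit 2")
        plt.show()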

In Fig. 3, four computational space diagrams have been produced using this method of showing both the position of the inputs in Hidden Unit space and the HO hyperplanes. In these diagrams, the labelled stars represent the positions of the five inputs in Hidden Unit space, and the lines drawn within the square show the position of the relevant HO hyperplanes. Thus for example, in 3(a) there is an HO decision region that contains only the input ABC. This is the only input that is paired with an output that contains an E in the first position (EBA). By contrast, two inputs are paired with an output with an A in the third position, ABC and DEA (outputs EBA and CDA respectively). These are segregated into a decision region by the slanted HO plane that runs diagonally from the top centre of the square to the bottom right.

The four diagrams shown in Fig. 3 correspond to the initial paired-associate set and its three variants used in Simulation 2, where the inputs were kept the same and the outputs varied. It is apparent from these diagrams that the input positions change very little from one variant to another. Even more surprising is the apparent similarity of the arrangement of the HO hyperplanes across the four variants.

Fig. 3. Computational Space diagrams showing positions of inputs in hidden unit space, and arrangement of hidden-output hyperplanes for Simulation set 2 (same inputs, different outputs). 3(a) corresponds to the computational space developed by the Initial training set, 3(b) corresponds to Variant 1, 3(c) corresponds to Variant 2, and 3(d) corresponds to Variant 3. Only the relevant hyperplanes have been drawn: the random hyperplanes (where the required output is zero) would be located outside the squares. See the text for further details.

However, when we consider the task being performed by particular HO hyperplanes, it can be seen that although the overall appearance remains more or less the same, the individual hyperplanes have been transposed. In Fig. 4, for example, it is possible to look at the movement of HO hyperplanes relevant to one of the paired associates.

In Figs. 4(a to d) we have labelled the relevant hyperplanes for one of the paired associates, 'ABC-EBA'.


Fig. 4. Representational Space diagrams showing movement of decision hyperplanes. In particular, it is possible to see the movement of the hyperplane associated with output letter J, trained to output a zero for all inputs in the Initial training condition. The labelled star shows the position of one of the input trigrams (ABC) and its corresponding output trigrams across the different conditions. The dashed labels for the decision hyperplanes (e.g. E--, -B-, --A) indicate the position of the letter in its output trigram. See the text for details.

Across the four diagrams in Fig. 4, the HO hyperplanes drawn within the square remain in more or less the same position and orientation. However, closer examination shows that the labels of the hyperplanes have changed. Fig. 4(a) shows the representation space for the Initial training condition, and the other three diagrams should be individually compared to it. In Fig. 4(a) the relevant hyperplanes are labelled E-- and -B- (two hyperplanes overlaying each other) and --A. These hyperplanes define the required output to the stimulus ABC in the original net, namely 'EBA'. However, in Fig. 4(b), when the required output is 'JBA', one of these hyperplanes has been relabelled J--, and the HO hyperplane associated with E-- can be seen outside the square, off to the right. In other words, although HO hyperplanes of the same orientation and position are required for each variant, the actual hyperplanes used vary. The hyperplane labelled E-- has in Variant 1 been moved out of the square and replaced by J--. The original position of J-- can be seen in Fig. 4(a), off to the left of the diagram. This basic pattern is repeated in the other diagrams. The previously unused hyperplanes are moved into the square as required, whilst the irrelevant hyperplanes (e.g. E-- in (b), -B- in (c), and --A in (d)) are moved out of the square, off to the right.

In Study 2 we have seen that evidence of negative transfer was obtained in circumstances in which the output hyperplanes had to be adjusted whilst the inputs were held constant. Further experiments are being conducted in our laboratory to establish whether the same negative transfer effect is obtained when an entirely new training set is used, in which both the inputs and outputs are altered from the original. Preliminary results indicate that the transfer effects thus obtained are negligible. Our contribution in this paper is to have undertaken a systematic manipulation of both the input and output similarity of paired-associate stimuli, leading to an indication of the circumstances under which positive and negative transfer effects can be obtained. In addition, an alternative visualisation technique is provided here which permits the simultaneous display of both input groupings and the HO hyperplanes.

6. CONCLUSIONS

The claim was made, at the outset of this paper, that adaptive generalisation is fundamental to intelligence. The study of adaptive generalisation, or cross-task transfer of knowledge, has a long history in psychology (e.g. Bartlett 1932; Pavlov 1927; Piaget 1952; Volkmann 1858), but has largely been ignored in connectionist research. Connectionist nets have often been trained from the starting point of a set of random weights. This fact makes some comparisons between their learning and that of humans seem at least tenuous. For example, McCloskey & Cohen (1989) investigate the catastrophic forgetting that occurs when a net is trained on the 'ones' addition facts, and then the 'twos' addition facts; but they ignore the incongruity of trying to teach addition facts to a net that has no experience of sorting and categorising objects, or even of objects themselves.

If connectionist nets are to be able to exhibit adaptive behaviour, they need to be prestructured. Such prestructuring can be accomplished through training on related tasks, and the transfer assessment method described in this paper provides a means of evaluating the resulting costs and benefits. A net can be said to exhibit a degree of adaptive generalisation when training on one task results in positive transfer to another task. In such a case, information has been extracted that facilitates the performance of a second task. On the other hand, when negative transfer is obtained, prior experience interferes with subsequent learning. In this way, not only can previous knowledge be incorporated by means of positive transfer, but a net can be seen as having a predisposition to learn certain tasks rather than others. An interesting parallel with innate cognitive abilities can be drawn here; humans similarly can be thought of as having a predisposition to learn certain things rather than others, whilst still requiring input from the environment.

Even apart from its implications for the study of cognition, positive transfer between nets could be useful from a practical point of view in reducing long training times. However, if positive transfer is to be reliably obtained, an understanding of its determinants is required. The hyperplane analysis considered here provides at least a partial explanation. In preliminary experiments we found that setting the IH mappings appropriately resulted in a reduction in training time, whilst setting them inappropriately resulted in an increase. In the two more detailed simulation studies, it was found that a change in input similarity resulted in positive transfer, while a change in output similarity resulted in negative transfer. Interestingly, the data obtained here have the same general characteristics as those found in the psychological studies summarised by Osgood (1949).

Further light is shed on the reasons for the negative transfer through the use of a new technique which permits visualisation of the entire computational space. By plotting, in the same diagram, both the relative positions of the inputs in hidden unit space and the HO hyperplanes, it is apparent that the additional processing is attributable to the time taken to move HO hyperplanes in and out of position, whilst the inputs remain in approximately the same positions. Of course, it is easy to draw computational space diagrams when there are only two or three hidden units. For hidden unit sets of higher dimensionality, we have been exploring the use of Principal Components Analysis for dimensional reduction.
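
A minimal sketch of that reduction, using an ordinary SVD-based principal components projection (the paper does not specify an implementation):

    import numpy as np

    def first_two_components(H):
        # Project hidden activations (one row per input) onto the first two
        # principal components, so computational space diagrams stay drawable.
        centred = H - H.mean(axis=0)
        _, _, vt = np.linalg.svd(centred, full_matrices=False)
        return centred @ vt[:2].T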

The provision, in this paper, of methods for identifying some of the determinants of transfer effects represents an important first step in the study of adaptive generalisation. It opens the way for attempts to ensure a closer match between what is known about human cognition and the behaviour of neural nets. Once a proper theory of knowledge transfer has been developed, future research will be able to build on this understanding and investigate the beneficial precursors of particular abilities in particular domains.

NOTES

* This research was supported by an award from the Economic and Social Research Council, Grant No R000233441. An earlier version of this paper appears in the Proceedings of the Second Irish Neural Networks Conference, Belfast 1992.

¹ The results in Table IV, and later in Table VI, are slightly different from those reported in our earlier paper (Sharkey & Sharkey, in press), since the simulations were rerun for the current paper with different random seeds.


REFERENCES

Bartlett, F. C. (1932). Remembering. Cambridge: Cambridge University Press.
Bates, E. & Elman, J. L. (in press). Connectionism and the Study of Change. In M. H. Johnson (ed.), Brain Development and Cognition: A Reader. Oxford: Blackwells.
Bugelski, B. R. (1942). Interferences with Recall of Original Responses after Learning New Responses to Old Stimuli. Journal of Experimental Psychology 30: 368--379.
Ebbinghaus, H. (1885). Memory: A Contribution to Experimental Psychology (H. A. Ruger & C. E. Bussenius, Tr.). New York: Teachers College, Columbia University, 1913.
Gibson, E. J. (1941). Retroactive Inhibition as a Function of Degree of Generalisation Between Tasks. Journal of Experimental Psychology 28 (2): 93--115.
Hebb, D. O. (1949). The Organisation of Behavior. New York: Wiley.
Hetherington, P. A. & Seidenberg, M. S. (1989). Is there 'Catastrophic Interference' in Connectionist Nets? In Proceedings of the 11th Annual Conference of the Cognitive Science Society: Michigan.
James, W. (1890). Principles of Psychology. New York: Holt.
McCloskey, M. & Cohen, N. J. (1989). Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. The Psychology of Learning and Motivation 24: 109--165.
Mundy, D. & Sharkey, N. E. (1992). Type Generalisations on Distributed Representations. In R. Trappl (ed.), Cybernetics and Systems Research, 1327--1334. Kluwer Academic Publishers: Dordrecht, The Netherlands.
Osgood, C. E. (1949). The Similarity Paradox in Human Learning: A Resolution. Psychological Review 56: 132--143.
Pavlov, I. P. (1927). Conditioned Reflexes (translated by G. V. Anrep). London: Oxford.
Piaget, J. (1952). The Origins of Intelligence (translated by M. Cook). New York: International Universities.
Pratt, L. Y. (1992). Non-Literal Transfer of Information among Inductive Learners. Computer Science Department, Rutgers University Working Paper, May 1992.
Pratt, L. Y. & Kamm, C. A. (1991). Improving a Phoneme Classification Task Through Problem Decomposition. In Proceedings of the International Joint Conference on Neural Networks (IJCNN-91).
Sharkey, N. E. & Partridge, D. (1992). The Statistical Independence of Network Generalisation: An Application in Software Engineering. In P. G. Lisboa & M. J. Taylor (eds.), Neural Networks: Techniques and Applications. Ellis Horwood: Chichester, UK.
Sharkey, N. E. & Sharkey, A. J. C. (in press). Prestructured Neural Nets and the Transfer of Knowledge. In G. Orchard (ed.), Proceedings of the Second Irish Neural Networks Conference. Adam Hilger.
Thorndyke, E. L. & Woodworth, R. S. (1901). The Influence of Improvement in One Mental Function upon the Efficiency of Other Functions. I, II, III. Psychological Review 8: 247--261, 384--395, 553--564.
Volkmann, A. (1858). Über den Einfluss der Übung auf das Erkennen räumlicher Distanzen. Berichte Sächsischer Gesellschaft der Wissenschaften, Mathematisch-Physische Klasse 10: 38--69.
White, H. (1992). Artificial Neural Networks: Approximation and Learning Theory. Blackwell: Oxford, UK.