Neural Networks, Vol. 10, No. 1, pp. 51-59, 1997
Copyright © 1996 Elsevier Science Ltd. All rights reserved
Printed in Great Britain
0893-6080/97 $17.00 + .00
PII: S0893-6080(96)00072-X
CONTRIBUTED ARTICLE
Efficient Training of Recurrent Neural Network with Time Delays

BARAK COHEN,¹ DAVID SAAD² AND EMANUEL MAROM¹

¹Tel Aviv University and ²Aston University

(Received 18 April 1994; accepted 26 October 1995)
Abstract—Training recurrent neural networks to perform certain tasks is known to be difficult. The possibility of adding synaptic delays to the network properties makes the training task more difficult. However, the disadvantage of a tough training procedure is diminished by the improved network performance. During our research on training neural networks with time delays we encountered a robust method for accomplishing the training task. The method is based on the adaptive simulated annealing (ASA) algorithm, which was found to be superior to other training algorithms. It requires no tuning and is fast enough to enable training to be carried out on low-end platforms such as personal computers. The implementation of the algorithm is presented over a set of typical benchmark tests of training recurrent neural networks with time delays. Copyright © 1996 Elsevier Science Ltd.
Keywords—Recurrent neural networks, Synaptic time delays, Training.
1. INTRODUCTION
Recurrent neural networks (RNNs) are networks which include feedback connections in addition to the feedforward connections commonly used in neural networks (NNs). The nets examined in this work have continuous dynamics and activation functions common to all neurons, producing continuous trajectories. Time delay recurrent neural networks (TDRNNs) are an extension of conventional recurrent neural networks (RNNs), allowing the use of synaptic time delays, which may vary from connection to connection. A TDRNN with synaptic delays of value 0 reduces to the conventional RNN. In a conventional RNN, the information flows through the synapse instantly, i.e., when one of the neurons changes its output value, all other neurons will be instantly affected, while in TDRNNs the information coming from neuron i will arrive at neuron j after a delay τ_ij. One should note that biological networks do possess synaptic time delays which vary over a range of several orders of magnitude and play a significant role in the behaviour of these nets.
The applicative advantages of TDRNNs over RNNs are:
Requests for reprints should be sent to David Saad, Department of Computer Science and Applied Mathematics, Aston University, Birmingham B4 7ET, UK.
● Increased capacity, resulting from the added degree of freedom. These nets can easily handle tasks that conventional RNN techniques can hardly be trained to handle, such as representation of non-derivable functions.
● Reduced network size. This results from the fact that one can perform similar tasks with a smaller number of free parameters. Using nets with a smaller number of parameters usually results in better generalization capabilities.
● Transferring past information. Information on past behaviour of the system is usually stored as different internal representations requiring a larger number of hidden neurons, but can also be provided using time delays with no additional hidden neurons.
The drawback of the additional parameters is that it is harder to implement the training procedure. The adaptation of previous RNN training algorithms (Pearlmutter, 1989) to TDRNNs results in a very long training procedure as well as sensitivity to the choice of parameters. This problem is especially significant when implementing training algorithms on low-end platforms such as personal computers (PCs). The training algorithm presented in this paper is based on adaptive simulated annealing (ASA), presented by Ingber (1989). The algorithm overcomes some of these problems, having a mechanism to tackle the different sensitivities of the various parameters (by employing a reannealing schedule) in multidimensional parameter space.
In this work we shall concentrate on training a TDRNN designed to produce a continuous trajectory both in its representation and in its dynamics. The network dynamics used in this work have been mentioned and implemented before without synaptic time delays (Pearlmutter, 1989). The network dynamics are described by the following equations:

x_i = \sum_{j=1}^{N} w_{ij} \, y_j(t - \tau_{ij})  \quad (1)

T_i \, \dot{y}_i = -y_i + \sigma(x_i) + I_i  \quad (2)
where I_i(t) is an external dynamical input function to neuron i, σ(x) is the neural response function, defined by σ(x_i) = 1/(1 + e^{-x_i}), y_i(t) is the output of neuron i, T_i is a time constant that controls the dynamical behaviour of each neuron, w_ij is the weight matrix element connecting neurons i and j, x_i represents the total input to neuron i and N represents the number of neurons in the system. The new parameter τ_ij represents the synaptic delay between neurons i and j. A cost function for such nets is defined in the usual manner:
E = \sum_{r=1}^{M} \int_{t_0}^{t_1} \left( y_r(t) - f_r(t) \right)^2 dt  \quad (3)
whereby E measures the squared error between the actual trajectory y_r(t) and the desired one f_r(t), and M represents the number of output neurons in the system.
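As a concrete illustration of eqns (1)-(3), the following Python sketch integrates the delayed dynamics with a forward-Euler scheme and a discrete delay buffer. The paper itself uses adaptive Runge-Kutta integration; Euler stepping, zero external input I_i and all function names here are our simplifying assumptions, not the authors' implementation.

```python
import math

def sigma(x):
    # Neural response function from eqn (2): sigma(x) = 1 / (1 + e^(-x)).
    return 1.0 / (1.0 + math.exp(-x))

def simulate_tdrnn(w, tau, T, y0, t_end, dt=0.01):
    """Forward-Euler integration of eqns (1) and (2), with the external
    input I_i set to zero for brevity. w[i][j] and tau[i][j] are the weight
    and synaptic delay from neuron j to neuron i; T[i] is the time constant."""
    n = len(T)
    # Convert each delay tau_ij into an integer number of time steps.
    d = [[int(round(tau[i][j] / dt)) for j in range(n)] for i in range(n)]
    dmax = max(max(row) for row in d)
    traj = [list(y0) for _ in range(dmax + 1)]  # constant initial history
    for k in range(dmax, dmax + int(t_end / dt)):
        y = traj[k]
        new = []
        for i in range(n):
            # eqn (1): total input is a sum over delayed outputs.
            x_i = sum(w[i][j] * traj[k - d[i][j]][j] for j in range(n))
            # eqn (2): leaky integration of the squashed input.
            new.append(y[i] + dt * (-y[i] + sigma(x_i)) / T[i])
        traj.append(new)
    return traj

def cost(traj, target, dt=0.01):
    # Discretized eqn (3): squared error integrated over the trajectory.
    return sum((y[r] - f[r]) ** 2 * dt
               for y, f in zip(traj, target) for r in range(len(f)))
```

With all weights zero, eqn (2) reduces to T ẏ = −y + σ(0), so a single neuron simply relaxes towards 0.5, which is a convenient sanity check of the integrator.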
In Section 2 we introduce the ASA training algorithm, and in Section 3 we present the tasks used to examine the algorithm and the simulation results.
2. SIMULATED ANNEALING BASED TRAINING ALGORITHM
Simulated annealing techniques are implemented for finding a global minimum (or maximum) of a target function in parameter space. The method is adopted from the physical annealing procedure, where a liquid is cooled down in order to obtain a minimum energy formation (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953).
Stochastic methods such as simulated annealing have been found to be extremely useful tools for a wide variety of minimization problems related to large non-linear systems, which are otherwise very difficult to minimize.
We try to minimize a cost function E(q), where E is a function of a vector of parameters q. For each parameter vector, indexed i, we define the next set of parameters by performing the following stages:
● Select a candidate for the next step. The state j is selected with a probability density g(i, j).
● Compute the cost function value. The value of the cost function for the new state, E(j), is computed, obtaining the cost function difference ΔE = E(j) − E(i).
● New state decision. If ΔE < 0, the new state is adopted as the state of the system and a new iteration begins; otherwise the new state is adopted with a probability h(k), usually defined as:
h(k) = e^{-\Delta E / T(k)}  \quad (4)
where T(k) is defined as the temperature at time step k.
The process is repeated while the temperature T(k) is slowly lowered (a quasi-stationary process), and if the annealing process is carried out adequately (Geman & Geman, 1984), the system is expected to converge to the global minimum energy position. One thus has three main properties to consider:
TABLE 1
A comparison between different simulated annealing techniques: the probability g for choosing the next state, the acceptance probability h and the annealing schedule T(k). Δq denotes the step in parameter space, ΔE represents the cost function difference between two states, y_i ∈ [−1, 1] is a random number generating the new parameter set and c_i is a parameter that controls the annealing schedule. In the practical algorithm this parameter is modified dynamically according to the sensitivity of the cost function to parameter i, which is a key feature of the ASA algorithm.

Parameter                | BA                          | CFA                        | ASA
Probability density g    | (2πT)^{-D/2} e^{-Δq²/2T}    | T/(Δq² + T²)^{(D+1)/2}     | Π_{i=1}^{D} 1/[2(|y_i| + T_i) ln(1 + 1/T_i)]
Acceptance probability h | 1/(1 + e^{ΔE/T})            | 1/(1 + e^{ΔE/T})           | 1/(1 + e^{ΔE/T})
Annealing schedule T(k)  | T_0/ln(k)                   | T_0/k                      | T_0 e^{-c_i k^{1/D}}
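The three annealing schedules in Table 1 can be written down directly. The function names and the sample parameters used below (T_0 = 1, c = 1, D = 2) are illustrative choices of ours, not values taken from the paper:

```python
import math

def t_boltzmann(k, t0):
    # BA schedule: T(k) = T0 / ln(k).
    return t0 / math.log(k)

def t_cauchy(k, t0):
    # CFA schedule: T(k) = T0 / k.
    return t0 / k

def t_asa(k, t0, c, dim):
    # ASA schedule: T(k) = T0 * exp(-c * k^(1/D)).
    return t0 * math.exp(-c * k ** (1.0 / dim))
```

For example, by step k = 1000 with T_0 = 1, c = 1 and D = 2, the ASA temperature has dropped many orders of magnitude below both T_0/k and T_0/ln(k), which is what allows ASA to afford a separate, rapidly decaying temperature for every parameter.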
● g(q) is the probability density for selecting a parameter vector from the D-dimensional state-space.
● h(q) is the probability of accepting a new parameter vector.
● T(k) is the annealing "temperature" in step k, affecting the probability h(q) of transition to a new state.

The annealing schedule function is actually a result of the probability functions g(q) and h(q).
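The accept/reject loop described in this section can be sketched as follows. This is a minimal generic simulated annealer under our own naming, not the ASA implementation used in the paper: reannealing and the ASA generating density of Table 1 are deliberately omitted, and the proposal, schedule and toy cost function are illustrative assumptions.

```python
import math
import random

def anneal(cost, q0, step, temperature, n_iter, seed=0):
    """Generic simulated-annealing loop: candidates come from the proposal
    `step` (playing the role of g), downhill moves are always accepted, and
    uphill moves are accepted with probability exp(-dE / T(k)), where T(k)
    is the annealing schedule."""
    rng = random.Random(seed)
    q, e = list(q0), cost(q0)
    best_q, best_e = list(q), e
    for k in range(1, n_iter + 1):
        cand = step(q, rng)                 # select a candidate state
        de = cost(cand) - e                 # cost function difference dE
        if de < 0 or rng.random() < math.exp(-de / temperature(k)):
            q, e = cand, e + de             # adopt the new state
            if e < best_e:
                best_q, best_e = list(q), e
    return best_q, best_e

# Toy usage: minimize (q - 3)^2 with a uniform proposal and a 1/k schedule.
best_q, best_e = anneal(
    cost=lambda q: (q[0] - 3.0) ** 2,
    q0=[0.0],
    step=lambda q, rng: [q[0] + rng.uniform(-0.5, 0.5)],
    temperature=lambda k: 1.0 / k,
    n_iter=2000,
)
```

On this convex toy problem the loop drives best_e close to zero; the interesting regime, as the text notes, is non-convex landscapes where the uphill acceptances allow escape from local minima.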
Several simulated annealing techniques have been suggested over the years, among them Boltzmann annealing (BA) (a variation of Metropolis et al., 1953), Cauchy fast annealing (CFA) (Szu & Hartley, 1987) and, recently, ASA (Ingber, 1989). The main difference between them is the selection of the probability functions g(q) and h(q), resulting in significantly different convergence rates. The main properties of these algorithms are listed in Table 1. For further details see Ingber (1989).
Though all of these methods had been considered as possible training techniques, the only one that was actually tested was ASA, since it offers the possibility of reannealing, whereby an individual (optimal) temperature decay rate is adopted for each parameter. This is most suitable for tackling the sensitivity of the training process to the modification of time delays. The actual implementation involved the coding of the cost function [eqn (3)] using adaptive integration techniques (based on Runge-Kutta algorithms) into the ASA algorithm, while defining boundary and initial conditions (Ingber, 1989). The ASA algorithm was found to be a robust, stable algorithm with relatively fast convergence, able to train neural networks with synaptic time delays to perform various tasks, showing both computational efficiency (most applications of network training were implemented on a 486 PC platform) and stability.

FIGURE 1. Some of the task teacher signals: the circle, the eight pattern, the quadrangle and the BST pattern. The patterns are shown as X-Y plots, where X values are the output of one neuron and Y values the output of the other.
3. COMPUTER SIMULATIONS
3.1. Task Description
Several tasks were chosen to demonstrate the capabilities and main incentives for using synaptic time delays. The following tasks are of different difficulty, and an effort was made to implement them using minimum-size networks:
● The oscillator. The task here is to force a neural network to oscillate with a certain frequency.
● The circle. This problem is a common benchmark problem, used by Pearlmutter (1989) and by Toomarian and Barhen (1991), where a second oscillation with a phase difference of π/2 with respect to the first one is required as well.
● The "eight" shape. The eight shape problem, also a common benchmark problem, requires some sort of internal memory (implemented in the internal representations) to decide on the direction of the trajectory at the cross-point.
● The quadrangle. The quadrangle problem shows the ability of TDRNNs to produce non-derivable trajectories. The quadrangle trajectory requires jigsaw-shaped signals which are virtually impossible to create within the conventional RNN framework.
● The bidirectional single trace (BST). This problem requires the production of different types of periodic signals with non-derivable end points and a continuous memory for the direction of the trajectory. Here the network is required to produce a "camel-back" shaped function.

Teacher signals for the various tasks are shown in Figure 1.
3.2. Computer Resources
Including time delays in the neural network causes both memory and computation time requirements to increase.
A comparison between the memory and computation time requirements of the ASA algorithm for RNN and TDRNN is shown in Table 2, for a network of size N (neurons), a trajectory of length L (time steps) and a maximum time delay¹ S.
¹ The time delay S represents the "memory" of the neuron. The output of a neuron is a function of the inputs at times t_j = t_i − τ_ij; thus S is the maximum delay available, i.e., S_i = max(τ_{i1}, ..., τ_{iN}).
TABLE 2
Computer resource consumption of the ASA algorithm for TDRNN and for RNN: memory and time required to perform one iteration over the whole interval. The values are stated in terms of the network size N, the trajectory length L and the time delay extent S.

Resource | TDRNN         | RNN
Memory   | O(2N² + LN)   | O(N² + LN)
Time     | O(LSN)        | O(LN²)
The estimated computation time consumption relates to the additional computation time per iteration. Since the number of iterations cannot be determined in advance, the only meaningful presentation is the time required to complete one iteration, showing the complexity of the calculation. In practice, since convergence was very fast, we have not encountered computation time difficulties in spite of the increase in training complexity.
Based on the cases tested, one can deduce that while the resources needed to train the network grow with network size, the network size itself shrinks due to the improved capabilities of the TDRNN, and the overall performance improves.
3.3. Execution Overview
All the described tasks were implemented using a TDRNN with fewer than four neurons. The convergence time of the algorithm was relatively short. The ASA algorithm was found to be a reliable training algorithm, showing robustness and stability with minimal tuning requirements. Figure 2 shows the typical behaviour of the ASA algorithm in the course of training. The trained TDRNNs show a significant capability improvement over conventional RNNs, enabling smaller nets to produce trajectories similar to those obtained by conventional techniques, as well as trajectories that cannot be produced by conventional RNN techniques. As an example, the quadrangle and BST tasks emphasize the capability of TDRNNs to produce non-derivable trajectories.
Tables 3 and 4 below show a comparison between TDRNN and RNN performance for the circle and eight problems. TDRNN training was performed using ASA, while the RNN data was taken from the previous works of Pearlmutter (1989) and Toomarian and Barhen (1991) using gradient descent based methods. The comparison is based on the following criteria:
● Network size. Comparing both the number of neurons and the number of connections.
● Time. The time it takes (in iterations) to complete the training procedure. The completion of the training procedure is determined by a minimal error threshold criterion, as in Toomarian and Barhen (1991), while Pearlmutter (1989) examined the propagation of error as a criterion for completion.

FIGURE 2. Convergence behaviour of the ASA algorithm. The graph represents a typical training scenario.
Table 3 presents the network size comparison between RNNs and TDRNNs for two of the tested tasks.

It is easy to see that a TDRNN requires significantly fewer neurons and connections for performing similar tasks². The trained networks produced the trajectories described in Figure 3. Following is a description of the training results for the various tasks:
● The oscillator. Training the network to produce an oscillator requires a two-neuron configuration. The key to creating an oscillation is a local feedback loop with a time delay for the oscillating neuron. The smoothness of the oscillation is influenced mostly by the time constant T_i controlling the magnitude of the signal derivative. Using ASA, the network coincided with the required pattern after 1,200 epochs. Output signals for all tests, excluding the oscillator (which is a subset of the circle problem), are shown in Figure 3.
● The circle. A circle trajectory was obtained using a three-neuron network, reflecting a significant decrease in network size in comparison to a conventional RNN, where a seven-neuron net was required.
TABLE 3
Network size comparison. One notices that the TDRNN requires significantly fewer neurons and connections than the RNN.

Network type                     | Circle shape task          | Eight shape task
TDRNN                            | 3 neurons, 6 connections   | 3 neurons, 7 connections
RNN (Pearlmutter, 1989)          | 7 neurons, 21 connections  | 13 neurons, 78 connections
RNN (Toomarian and Barhen, 1991) | 7 neurons, 21 connections  | 7 neurons, 21 connections
● The eight shape. The eight shape is generated by two related internal frequencies. Using a TDRNN, the eight shape was achieved with three neurons. It takes two training cycles (repeating the eight shape) for the net to generate the proper signal.
● The quadrangle. The quadrangle task shows the TDRNN capability to produce non-derivable trajectories. The three-neuron network stabilizes after two cycles. It is clear that the non-derivable points of the trajectory are generated by the synaptic delay mechanism, where the sign of one of the differential equation components is changed.
● BST. This non-standard repeated pattern is learned by the network, showing the capability of the TDRNN to store the direction of a repeated trajectory with non-derivable end points using the time delays. It has been previously shown (Day & Camprose, 1991) that time delayed networks can learn to produce such signals.

² It is fair to indicate that using a TDRNN adds free parameters to the system. However, even if the additional parameters are considered, one is better off using a TDRNN rather than a larger RNN.
Table 4 summarizes the configurations and the training results for the various tasks. The ASA algorithm reaches the vicinity of the desired parameters quite fast, though the final convergence process is somewhat longer. This behaviour is typical for ASA; an example of the decrease in training error [eqn (3)] is shown in Figure 2.
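The "local feedback loop with time delay" mechanism credited above for the oscillator task can be demonstrated on a hand-picked, untrained single neuron. The gain w = −20, delay τ = 5 and time constant T = 1 below are illustrative assumptions of ours, not the trained values of Table 4:

```python
import math

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def delayed_feedback(w=-20.0, tau=5.0, T=1.0, dt=0.01, t_end=100.0):
    """One neuron with a delayed negative self-feedback loop,
    T dy/dt = -y + sigma(w * y(t - tau)).
    Without the delay the neuron relaxes to a fixed point; with a strong
    enough gain and delay the fixed point loses stability and the neuron
    settles into a sustained, roughly square-wave oscillation."""
    d = int(tau / dt)
    y = [0.55] * (d + 1)                      # constant initial history
    for k in range(d, d + int(t_end / dt)):
        y.append(y[k] + dt * (-y[k] + sigma(w * y[k - d])) / T)
    return y

tail = delayed_feedback()[-2000:]             # last 20 time units
amplitude = max(tail) - min(tail)
mean = sum(tail) / len(tail)
crossings = sum(1 for a, b in zip(tail, tail[1:])
                if (a - mean) * (b - mean) < 0)
```

With these parameters the delayed signal alternates between suppressing the neuron towards 0 and releasing it towards σ(0) = 0.5, producing a period of roughly twice the delay; this illustrates why a single delayed self-connection suffices for the oscillator task where a conventional RNN needs at least a two-neuron loop.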
It is important to note that eliminating optional connections in the network contributes to a faster training procedure. However, if no external suppression of redundant weights is implemented, the training procedure carries it out by enlarging the time delay to the maximum or reducing the weight itself, requiring additional computation time.
Several typical behaviours of both the training procedure and the trained nets themselves have been noticed:
● Signal mismatch. In all the tasks trained, a slight mismatch between the output signal and the required output was observed. This mismatch was also observed in conventional RNN networks
FIGURE 3. The TDRNN output signals. The top pair of graphs presents (left to right) the network response in time (two output neurons) and the X-Y output for the circle problem. The middle graphs present the response in time and the X-Y plot for the quadrangle problem, and the bottom graphs are X-Y plots of the eight figure and the BST pattern. The circle, eight and BST trajectories are plotted over the entire training process.
TABLE 4
ASA training performance summary. The ASA algorithm was shown to be a stable algorithm for solving various tasks. In all of the tasks excluding the oscillator problem, the goal was to force two output neurons to follow the "X" and "Y" components of a trajectory. The oscillator task was to force one output neuron to produce an oscillation. The architecture column specifies the directions of the connections from source to destination, i.e., 1-2 means a connection from neuron 1 to neuron 2. The number of epochs is based on several executions of the algorithm for each task.

Task        | Network size | Connections | Epochs | Architecture
Oscillator  | 2            | 2           | 1,200  | 1-2, 2-2
Circle      | 3            | 6           | 1,800  | 1-2, 1-3, 2-2, 2-3, 2-1, 3-1
Eight shape | 3            | 7           | 2,200  | 1-3, 1-3, 2-2, 2-3, 2-1, 3-1, 2-3
Quadrangle  | 3            | 7           | 1,900  | 1-3, 1-3, 2-2, 2-3, 2-1, 3-1, 2-3
BST         | 3            | 7           | 2,400  | 1-3, 1-3, 2-2, 2-3, 2-1, 3-1, 2-3
(Toomarian & Barhen, 1991) and training procedures, and is a result of the discrete dynamics of the network. One can assume that larger networks will result in a smaller mismatch.
● Perturbation reaction. The sensitivity of the trained net to changes in the input signal was examined for the quadrangle task. Modifications to the input signal of the first neuron, I(t) = h(0) (where h(t) equals 1 when t > t_0 and 0 otherwise), were implemented in several ways:
— DC gain. The input gain was modified: I(t) = 2h(0) and I(t) = 0.5h(0). The only effect on the trajectory was on the transient behaviour (until the second cycle). The free neuron³ was affected, changing its amplitude. Figure 4 shows the network behaviour.
— Frequency. A sinusoidal signal was added to the input in two cases: I(t) = h(0) + 0.5 sin(t) and I(t) = h(0) − sin(10t). Here too, the only effect on the trajectory was on the transient behaviour (until the second cycle). The free neuron was affected, and changed its behaviour.
The input signal in the quadrangle task triggers the oscillation while affecting only the transient behaviour of the oscillation. The oscillation itself is mainly controlled by the other neurons. Similar behaviour in conventional RNNs was mentioned in the previous works of Pearlmutter (1989) and Toomarian and Barhen (1991). Neither system was explicitly trained to handle input signal perturbation. This phenomenon is consistent as long as the input signal does not cause the system to saturate.
³ The free neuron is the neuron that does not need to comply with an external teaching rule. In our case neurons 2 and 3 have to produce a trajectory, and neuron 1 is free.
4. CONCLUSIONS
In this paper we have introduced a robust way to train continuous neural networks with synaptic time delays: the ASA algorithm. These networks are usually more powerful than conventional RNNs. Moreover, there is evidence that synaptic time delays play a significant role in the behaviour of biological neural networks, thus strengthening the interest in such networks.
The algorithm showed fast convergence andstability, enabling training of small scale TDRNNsto perform rather hard benchmark tasks on a low-endplatform.
Using ASA in training TDRNNs leads to successful training of rather difficult tasks, with a significant reduction in both training time and network size relative to those previously reported by Pearlmutter (1989) and Toomarian and Barhen (1991) for conventional RNNs.
Although the training procedure of TDRNNs isgenerally slower than training RNNs of the same size,the overall performance usually improves due to thefact that TDRNNs can perform the same tasks in amuch smaller configuration.
The training results of this work emphasize several potential advantages of TDRNNs with respect to conventional RNNs. Among these advantages are:

● Enlargement of the set of trainable trajectories.
● Reduction in network size for a given task.
● Increase in internal state-space.
● Potential improvement of generalization capabilities.
TDRNNs are expected to be useful for a variety ofpossible applications, such as: signal generators,switches, prediction nets, control systems, etc. Thepromising results reported in this work indicate theadvantages, possible use and training techniques forusing TDRNNs. These subjects should be exploredfurther in future work.
FIGURE 4. A neural network trained by ASA to produce a quadrangle trajectory with an input I(t) = h(0); response to input perturbations. The original trajectory is shown in Figure 3. The graphs (from top to bottom) show the response of the TDRNN (3 neurons) to the following perturbations in the input signal: I(t) = 0.5h(0), I(t) = 2h(0), I(t) = h(0) + 0.5 sin(t) and I(t) = h(0) − sin(10t). The right-hand graphs show the trajectory on an X-Y plane, while the left-hand graphs show the response as a function of time. On the left-hand graphs one notices the changes in the behaviour of the free neuron, while after attaining the oscillation the other neurons maintain a stable trajectory.
REFERENCES
Day, S. P., & Camprose, D. S. (1991). Continuous time temporal back propagation. In International Joint Conference on Neural Networks, Vol. 2 (pp. 95-100).

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.

Ingber, L. (1989). Very fast simulated re-annealing. Mathematical and Computer Modelling, 12, 967-973.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087-1092.

Pearlmutter, B. A. (1989). Learning state space trajectories in recurrent neural networks. Neural Computation, 1, 263-269.

Szu, H., & Hartley, R. (1987). Fast simulated annealing. Physics Letters A, 122, 157-162.

Toomarian, N., & Barhen, J. (1991). Adjoint operators and nonadiabatic algorithms in neural networks. Applied Mathematics Letters, 4, 69-73.