IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 5, NO. 2, MARCH 1994 185

Recurrent Neural Network Training with Feedforward Complexity

Oluseyi Olurotimi, Member, IEEE

Abstract—This paper presents a training method that is of no more than feedforward complexity for fully recurrent networks. The method is not approximate, but rather depends on an exact transformation that reveals an embedded feedforward structure in every recurrent network. It turns out that given any unambiguous training data set, such as samples of the state variables and their derivatives, we need only to train this embedded feedforward structure. The necessary recurrent network parameters are then obtained by an inverse transformation that consists only of linear operators. As an example of modeling a representative nonlinear dynamical system, the method is applied to learn Bessel's differential equation, thereby generating Bessel functions within, as well as outside, the training set.

I. INTRODUCTION

MUCH SUCCESS has been achieved in the use of static, feedforward neural networks for modeling nonlinear maps [1]. It is now known that such networks are in fact capable of approximating not only any continuous map arbitrarily closely [2], [3], but also the derivatives of such maps [4]-[6]. There is growing interest in extending this performance to the dynamic case: modeling nonlinear dynamical systems using recurrent neural networks. Networks so designed will readily have applications in the area of control and robotics, among several others [7]-[10].

Consequently, several training schemes have been proposed for training recurrent neural networks. We mention only a few in this introduction. Recurrent backpropagation, as developed by Pineda [11], employs a key steady-state approximation that allows the gradient at the output of a recurrent network to be expressed analytically. Nonetheless, the feedback nature of recurrent networks necessitates the solution of an associated auxiliary dynamical system in order to obtain the quantities needed to actually compute the gradient. Although Pineda's original proposal was for the storage of static patterns, others have attempted to modify this version of recurrent backpropagation for the storage of dynamic patterns [12]. Pearlmutter has presented a method specifically suited for dynamic patterns [13]. The result in this algorithm can also be obtained using the classical variational calculus or optimal control approaches. For the discrete-time case, a generalized view of neural networks based on interconnections of linear and nonlinear blocks in various feedforward/feedback architectures has been

Manuscript received February 9, 1993; revised November 1, 1993. This work was supported by the National Science Foundation under Grant ECS-9209456.

The author is with the Electrical and Computer Engineering Department and the C3I Center, George Mason University, 4400 University Drive, Fairfax, VA 22030.

IEEE Log Number 9215776.

proposed by Narendra et al. [14], [15]. This leads, in a systematic way, to the derivation of rules for computing the gradients for blocks connected in a recurrent fashion. The resulting dynamic backpropagation scheme also results in auxiliary dynamical systems whose outputs are necessary for the computation of the actual dynamical system gradients. Williams and Zipser also propose an algorithm in the discrete-time case, in which teacher forcing is used to improve the efficiency of the computation [16]. Rumelhart et al. [17], and also Werbos [18], describe a scheme of backpropagation-through-time that effectively unfolds the time axis of the recurrence such that the network can be visualized as a (theoretically infinite) series of neural network layers in feedforward connection, with shared weights. Methods for the learning of trajectories using adjoint functions and teacher forcing have also been proposed by Barhen et al. [19]-[21]. Although this method also generates auxiliary sets of dynamical equations, much faster convergence than other methods is reported.

The analytical complexity of recurrent network training stems from the nature of the credit assignment problem in a situation where there is a combinatorially large number of cyclic paths for signal propagation. In the feedforward case, the unidirectional flow of the architecture sidesteps this problem. When the credit assignment problem for the feedback network is expressed analytically, such expression necessarily inherits the dynamic nature of the underlying system. In several of the algorithms based on the variational calculus, the dynamic credit assignment system must be solved backwards [13]. This is a potential cause of computational complexity for the resulting training algorithm, relative to the feedforward counterpart. However, algorithms such as those of Williams and Zipser [16], and Toomarian and Barhen [21], allow the error gradients to be computed forward in time. Nonetheless, little control can be exerted on the recurrent hidden units, leading to possibly long training times.

The main result of this paper deals with this fundamental problem by presenting a transformation that reveals an embedded feedforward system in a fully recurrent architecture. We describe here a systematic method for training the recurrent system through conventional feedforward training (e.g., using backpropagation) of the embedded feedforward system, followed by certain transformations back to the fully recurrent form. The training complexity of the method is thus no more than that of feedforward training.

Section II sets up and discusses the problem. We also state the main result of this paper. Section III derives the result. Also in this section, we compare the algorithm with others

1045-9227/94$04.00 © 1994 IEEE

in the literature, and we also highlight several distinguishing features. In Section IV, we give design "recipes" for the use of the algorithm. An example is presented in Section V, where a recurrent neural network is trained to learn Bessel's differential equation, thereby generating Bessel functions. In Section VI, we take advantage of the developments of the preceding sections to present the discrete-time version of the result in a brief manner. Concluding remarks are given in Section VII.

II. PROBLEM DISCUSSION AND MAIN RESULT

In this section we describe the problem that we studied. We then state the main result of this paper with respect to the problem.


A. Problem Discussion

First, we put our goal in perspective by taking a particular viewpoint on the temporal learning problem. The perspective we use is that of modeling an arbitrary, continuous dynamical model. The fundamental assumption is that the time-varying output we observe is generated by this unknown dynamical model. We seek to design a fully recurrent neural network whose dynamic behavior mimics that of the unknown model. If successful, the network will be able to generate outputs similar to those of the unknown system in response to inputs similar to those applied to the unknown system.

A simple block diagram depiction of the dynamical system is shown in Fig. 1(a).

Fig. 1. (a) The unknown dynamical system, with input I(t) and output x(t). (b) The fully recurrent neural network model, with input I(t) and output y(t).

We note that both the input I(t) and the output x(t) are in general time-varying. The input and output signals can be test and response signals in the case of system identification, or the input can be viewed as an address for a temporal associative memory or sequence recognition device. The input could also be a function of previous outputs in the case of time-series prediction. In any case, this simple block model accounts for the dynamic pattern recognition problem with sufficient generality for our purposes. Our goal will be to model this system with a fully recurrent neural network as shown in Fig. 1(b).

To be specific, suppose that we observe the state x(t) of some unknown dynamical system:

ẋ(t) = f(x(t), I(t)),   (1)

where f(·) is some unknown continuous vector function, and I(t) is the vector of inputs to the system.

A practical observation concerns (1). This equation is in the standard form of a first-order differential equation for the output x(t). Since our tacit assumption is that this output possesses a dynamic relationship with the input, we can form, from the output and its time derivatives, a state vector that unambiguously represents the dynamical system. A novel idea might be to observe that the output y(t) is a nonrecurrent function of t. We can therefore in principle always reconstruct y(t) by training a feedforward neural network to map t to y(t) using the given training set. Such a scheme may use some of the recently proposed training methods which also learn function derivatives [4], [5].

The dimension n of the formed state vector should ideally coincide with the order of the system. However, this system order will not be known in general, and a design choice will have to be made. This is similar to the kind of design choice required for the number of neurons in the hidden layers of multilayer perceptron networks: too small a number can guarantee that the model will never be close enough, or that the wrong system will be accurately modeled, while beyond some critical number the design is always achievable in principle. In this case, too small an n corresponds to incomplete observation of the system states, while beyond some critical n, all the information needed to model the system is available, possibly in redundant (but consistent) form.
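As a concrete illustration (our own sketch, not from the paper), a sampled scalar output can be augmented with finite-difference estimates of its derivatives to form an n-dimensional state trajectory; the function name and the use of `np.gradient` are illustrative choices:

```python
import numpy as np

def form_state_vector(y, dt, n):
    """Stack y and its first n-1 finite-difference derivative estimates
    to form an n-dimensional state trajectory.
    y: 1-D array of output samples taken at uniform spacing dt."""
    rows = [np.asarray(y, dtype=float)]
    for _ in range(n - 1):
        # Central differences in the interior, one-sided at the ends.
        rows.append(np.gradient(rows[-1], dt))
    # Each column is the state vector at one sample instant.
    return np.vstack(rows)
```

For example, applied to samples of sin(t) with n = 2, the second row approximates cos(t), recovering a standard phase-space state for a second-order oscillator.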

We define the dynamic neural network state equation as follows:

u̇(t) = -au(t) + Wg(u(t)) + J(t).   (2)

In addition, we generalize this description slightly by also introducing input and output equations:

J(t) = H_I[I(t)],   (3)

and

y(t) = H_O[u(t), I(t)].   (4)

As we shall see, the input and output operators H_I[·] and H_O[·, ·] are simple linear operators. In contrast to the usual recurrent additive model, we have a possibly time-varying additive term J(t). This is to allow for a time-varying input to the system being modeled. Refer to Fig. 1 and (3). Furthermore, since in the temporal mapping problem we are not necessarily interested in steady-state, fixed-point attractor dynamics, we do not require the weight matrix W to be symmetric. a is the decay term, which we will assume to be identical for all the neurons. g(u(t)) is a vector of neuron output functions, often sigmoidal, with each function g_i(·) acting only on the corresponding neuron state u_i(t).
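To make the additive model of (2) concrete, it can be integrated with a simple forward-Euler scheme. The sketch below is illustrative only (the function name, step size, and tanh nonlinearity are our own assumptions, not the paper's):

```python
import numpy as np

def simulate_recurrent(W, a, J, u0, dt=0.01, steps=1000, g=np.tanh):
    """Forward-Euler integration of the additive recurrent model
    u'(t) = -a*u(t) + W g(u(t)) + J(t), as in equation (2).
    J is a function of time returning the external input vector."""
    u = np.array(u0, dtype=float)
    traj = [u.copy()]
    for k in range(steps):
        t = k * dt
        u = u + dt * (-a * u + W @ g(u) + J(t))
        traj.append(u.copy())
    return np.array(traj)
```

With W = 0 and J ≡ 0, the model reduces to pure decay u̇ = -au, so the state relaxes exponentially toward the origin, which is a quick sanity check on the sign conventions in (2).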


B. Main Result

The main result is stated in the form of a Proposition.

Proposition 1 (Recurrent Neural Network Training with Feedforward Complexity): Let N_{i1,i2,...,iN+1} be the class of functions generated by a multilayer feedforward network with i1 inputs, i_{N+1} outputs, and N - 1 hidden layers with i2, ..., iN nodes, and let a be a constant.

Then for every dynamical system

ẋ(t) = f(x(t), I(t)),   t0 ≤ t ≤ tf,   (5)

with input I(t) ∈ R^p and output x(t) ∈ R^n, f(·,·) a continuous function, such that

f(x, I) + ax ∈ N_{n+p,m,n},   m ≥ n,

there exist constant matrices W1, W2, Ω, and constant vectors b, c, such that the recurrent neural network

u̇(t) = -au(t) + Wg(u(t)) + J(t),   t0 ≤ t ≤ tf,   (6)

with the weight matrix W specified by

W = W1W2,   (7)

an input equation defined by

J(t) = H_I[I(t)] = Ω(aI(t) + İ(t)) + ab + W1c,   (8)

and an output equation defined by

y(t) = H_O[u(t), I(t)] = W1^{-1}(u(t) - ΩI(t) - b),   (9)

has

y(t) = x(t),   t0 ≤ t ≤ tf,

provided

y(t0) = x(t0).   (10)

Here u(t) ∈ R^m, g(·) is a vector of neuron output functions, and H_I[·], H_O[·, ·] are linear operators defined as shown.

Remarks: Since it is well known that N_{n+p,m,n} can approximate any continuous function (with the appropriate input and output dimensions) arbitrarily closely, Proposition 1 is a de facto universal approximation (of dynamical systems) result for recurrent neural networks. Since in practice training is rarely exact, this implication will be practically useful only for systems that are robust with respect to parameter variations. This is a reasonable requirement for systems that we desire to actually build.

This result can be used in practice as follows. I(t) is the input to the dynamical system, and is given as part of the input training set. The state variable vector x(t) constitutes the rest of the input training set. Recall, as discussed earlier, that the state variables can either be observed at the system output or in many cases formed from the output variable and its time derivatives. We choose the neural network decay term a, a constant. The values of f(x(t), I(t)) at the training instants are obtained as the derivatives of these state variables at these instants according to (5). As will be seen later, we are then able to form another output training set corresponding to the left-hand side of (11). In accordance with the right-hand side of (11), a two-layer feedforward network is then trained to obtain the parameters W1, W2, Ω, b, c. At this point all the training required is complete. Notice that only feedforward training has been employed.

The recurrent network (6) is then designed as follows. The recurrent network weight matrix W is formed according to W = W1W2. The input J(t) is formed, according to (8), as a proportional-plus-derivative operation on the actual system input, and the output y(t) = x(t) is recovered according to the linear transformation given by (9).

This design methodology implicitly specifies the recurrent network size as the size of the hidden layer in the feedforward map. Thus existing results on choosing the size of feedforward architectures can be used as a guide in designing recurrent networks [1]. This design procedure is systematically laid out in Section IV. Since it is typical for the hidden layer size to be larger than the input layer dimension, the column rank of W1 will be full for well-specified problems (otherwise there are redundant degrees of freedom in the input data dimension). Thus, the matrix inversion of (9) will be possible for typical nontrivial problems.

This formulation may be extended to allow for multilayer feedforward networks with more than one hidden layer in (11).

However, the corresponding output function g(·) that results for the fully recurrent network in equation (6) is no longer sigmoidal (assuming we were using sigmoidal nonlinearities). Rather, the recurrent network output function depends on the particular architecture of the multilayer network, as well as on the function being learned. See further discussion in Section III-B.
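The design steps in the remarks above can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions: the "trained" feedforward parameters W1, W2, Ω, b, c are taken as given, the function name is ours, and the Moore-Penrose pseudoinverse stands in for the inversion of W1 in (9) (valid when W1 has full column rank, as the remarks anticipate):

```python
import numpy as np

def recurrent_from_feedforward(W1, W2, Omega, b, c, a):
    """Assemble the recurrent network of Proposition 1 from feedforward
    parameters assumed to satisfy (11).
    Shapes: W1 (m, n), W2 (n, m), Omega (m, p), b (m,), c (n,)."""
    W = W1 @ W2                     # recurrent weight matrix, equation (7)
    W1_inv = np.linalg.pinv(W1)     # left inverse; W1 assumed full column rank

    def J(I, I_dot):                # input equation (8)
        return Omega @ (a * I + I_dot) + a * b + W1 @ c

    def y(u, I):                    # output equation (9)
        return W1_inv @ (u - Omega @ I - b)

    return W, J, y
```

A simple consistency check: since (13) in the proof gives u(t) = W1 y(t) + ΩI(t) + b, feeding that u back through the output map y(·, ·) must return the original state vector.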

III. DERIVATION OF THE RESULT

A. Proof of Proposition 1

Since f(x, I) + ax ∈ N_{n+p,m,n}, m ≥ n, it is possible to choose W1, W2, Ω, b, c according to the parameters in the following valid expression:

f(x(t), I(t)) + ax(t) = W2 g(W1 x(t) + ΩI(t) + b) + c.   (11)

In particular, the right-hand side of (11) can be obtained after training a two-layer feedforward neural network for mapping [x(t), I(t)] to f(x(t), I(t)) + ax(t). b is the vector of bias inputs to the hidden layer, and c is the vector of bias inputs to the output layer. The weight matrix from the input to the hidden layer is

[W1 Ω],

and the weight matrix from the hidden to the output layer is W2.
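Concretely, the feedforward training set for (11) simply pairs the stacked inputs [x(t_k), I(t_k)] with targets ẋ(t_k) + a x(t_k). The helper below is an illustrative sketch (its name and array layout are our own conventions, not the paper's):

```python
import numpy as np

def feedforward_training_set(x, x_dot, I, a):
    """Build input/target pairs for training the two-layer network
    of equation (11).
    x, x_dot: arrays of shape (T, n); I: array of shape (T, p)."""
    inputs = np.hstack([x, I])      # network input  [x(t), I(t)]
    targets = x_dot + a * x         # network target f(x, I) + a*x
    return inputs, targets
```

Note that each sample instant t_k yields one independent static input/target pair, which is what lets the recurrent design problem be handed to an ordinary feedforward trainer.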


The recurrent network of Proposition 1, along with its input and output functions, is thus completely specified. Now consider (9),

y(t) = W1^{-1}(u(t) - ΩI(t) - b).

Taking time derivatives,

ẏ(t) = W1^{-1}(u̇(t) - Ωİ(t)).

Substituting for u̇(t) from (6),

ẏ(t) = W1^{-1}(-au(t) + Wg(u(t)) + J(t) - Ωİ(t)).

Substituting for J(t) from (8),

ẏ(t) = W1^{-1}(-au(t) + Wg(u(t)) + Ω(aI(t) + İ(t)) + ab + W1c - Ωİ(t))
     = -aW1^{-1}(u(t) - ΩI(t) - b) + W2 g(u(t)) + c,   (12)

where we have used the fact that W = W1W2 from (7). Now from (9), we can also write

u(t) = W1 y(t) + ΩI(t) + b.   (13)

Using both (9) and (13) in (12),

ẏ(t) = -ay(t) + W2 g(W1 y(t) + ΩI(t) + b) + c = f(y(t), I(t)).

Therefore, over the interval of interest, t0 ≤ t ≤ tf, the neural network output y(t) obeys the same dynamical equation (5) as does the dynamical system output, and by the provision of (10), the initial conditions are identical. Therefore, by uniqueness, y(t) = x(t). □

B. Further Interpretation of the Result and Comparison with Other Algorithms

1) On system realization and the availability of a canonical set of state variables: It is well known in realization theory that for a model to appropriately realize some unknown system, the observable output of the unknown system must encode all the system's states in some retrievable fashion. This leads to the system-theoretic notion of observability, which is specified for linear [23], as well as nonlinear [24], systems. Neural networks are of course included in the latter class. If the state variables are not fully retrievable from the output observations, we simply do not have enough information to model the system using any technique.

Many algorithms in current use, such as those of Giles et al. [25], [26], Williams and Zipser [16], and several others [13], [14], [18], develop an implicit representation of the state from the output observations. This, along with the input, is subsequently used (internally) to compute the next state vector.

The result of this paper takes advantage of an elementary fact of state-space representation. Namely, one set of state variables is as good as any other, at least from a theoretical viewpoint. Furthermore, a suitable set of state variables can be obtained by taking an appropriate number of derivatives (or time-delays in the discrete-time case) of the observable outputs. This idea has already been used in adaptations of feedforward structures for modeling dynamical processes. Examples are the time delay neural network (TDNN) [27], and the series-parallel identification model of Narendra et al. [14]. In these cases, however, strictly static, and not dynamic, maps are learned.

Although it is well known that state information can be readily obtained through conventional methods, significant direct training benefit to fully recurrent neural networks has not, to the best of our knowledge, been reported up till now. In this work, we show that in fact, for fully recurrent neural networks, the training need not be of more than feedforward complexity provided state information is first retrieved. Note that the process of obtaining state information by conventional means is well specified and of bounded complexity. Thus it is preferable to conventionally obtain the states prior to training, instead of lumping this task with the overall model learning problem.

Our approach is applicable directly to the problems studied in the other algorithms mentioned above. The generality of the method stems from, via the embedding transformation, the universal approximation property of multilayer feedforward networks. In the method, we first realize that for the problem to be solvable as desired, the state variables must be retrievable from the output observations. Such retrieval is done using, for example, time derivatives in the continuous-time case, or time-delayed samples in the discrete-time case. Training is then performed as specified by Proposition 1 or Proposition 2 as appropriate (see also Section IV). This is followed by simple transformations to obtain the fully recurrent model. The output of the fully recurrent model is passed through simple linear operators to recover the variables in the problem space.

2) Graphical Interpretation and Further Comparisons: Our result says, in neural network architectural terms, that the states of a fully recurrent network behave like the hidden layer states of a two-layer neural network with recurrent output layer, and with output feedback. See Fig. 2.

We emphasize that Fig. 2 is not the physical network we are interested in. We are interested in designing the weights for a fully recurrent network. However, the input and output of the network in the figure are simply related to the input and output of the desired fully recurrent network. Therefore, we shall use the former network equivalently to illustrate certain features and make comparisons.

The choice of the input and output terminals for training this effective network is what separates the algorithm of this paper from several others. Refer to Fig. 3. One of the recurrent neurons in the output layer has been enlarged to show detail regarding the mathematical operations performed in order to produce an output. In particular, notice that integrator blocks are necessary to produce the state output vector, since the evolution of the state is described by a differential equation.
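The exactness claimed by Proposition 1 can be checked numerically in a small case. The sketch below is our own illustration, not the paper's example: it takes the exactly representable scalar system ẋ = tanh(x) - ax, for which (11) holds with W1 = W2 = [[1]], Ω = 0, b = c = 0, so that W = W1W2 = [[1]] and J(t) = 0 in (8). Euler-integrating the system (5) and the recurrent network (6) from the same initial condition, per (10), gives coinciding trajectories:

```python
import numpy as np

# Exactly representable test system (illustrative choice): x' = tanh(x) - a*x,
# no external input. Then f(x) + a*x = tanh(x), realized by a two-layer net
# with W1 = W2 = [[1]], Omega = 0, b = c = 0, hence W = W1 @ W2 = [[1]].
a, dt, steps = 1.0, 0.001, 2000
x = 0.5                 # system state
u = 0.5                 # recurrent network state; y(t0) = x(t0), as in (10)
for _ in range(steps):
    x += dt * (np.tanh(x) - a * x)           # original system, equation (5)
    u += dt * (-a * u + 1.0 * np.tanh(u))    # recurrent network, equation (6)
y = u                   # output equation (9) reduces to y = u in this case
err = abs(y - x)        # discrepancy between the two trajectories
```

Here the two update rules are algebraically identical, so the discrepancy is at the level of floating-point round-off, illustrating that the construction is exact rather than approximate.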


Fig. 2. Fully recurrent network is equivalent to two-layer, recurrent output layer, output-feedback network. (Figure labels: state vector x(t); input I(t); output y(t); weights W1, W2, Ω; bias b.)

Fig. 3. Mathematical detail of equivalent network, indicating the domain and range for different training algorithms (D_D, R_D for recurrent BP, dynamic BP, BPTT, teacher forcing, etc.; D_S, R_S for the method presented here). The output layer recurrent self-connections of Fig. 2 are now included as part of the output feedback connection bundle for clarity of description. The rightmost neuron in the recurrent output layer is enlarged to show relevant detail. In particular, note the integrator block separating the range set of recurrent backpropagation from that of the method presented here.

Recurrent network training algorithms such as recurrent or dynamic backpropagation, backpropagation-through-time, as well as versions of these algorithms incorporating teacher forcing, attempt to capture, within the weights and architecture of a fully recurrent network, the overall system dynamic map f_D, which carries the initial state x(t0) and the input history I(τ), t0 ≤ τ ≤ t, to the state x(t); t0 is the initial time. The domain of this transformation with respect to the figure is indicated by the line labeled D_D, and the range by R_D. Notice that the hardware corresponding to this transformation includes active devices, namely integrator blocks. Thus the transformation cannot be memoryless. For this reason the transformation is sensitive to the state vector feedback, and the corresponding training algorithms have to account for recurrent credit assignment or error propagation.

It is appropriate to mention here the context and role of teacher forcing. The context of teacher forcing is in the implementation of the recurrent training algorithms mentioned above, where it provides a heuristic variation on the exact formulas. Specifically, it calls for the use of the teacher signals wherever they would be expected if the network were running ideally, instead of the actual, intermediate network outputs. The role of teacher forcing, first from an analytical viewpoint, is that it reduces the complexity of the error propagation system. Second, from a practical viewpoint, it has been observed to train faster than the exact formulas, as well as to perform more satisfactorily in certain oscillating trajectory problems [16], [21].
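As a rough illustration of the teacher-forcing heuristic (a generic discrete-time toy, not the specific algorithms of [16] or [21]), a forward pass with teacher forcing feeds the teacher value from the previous step, rather than the network's own previous output, back into the state update:

```python
import numpy as np

def forward_pass(W_in, W_rec, inputs, targets, teacher_forcing=True):
    """One forward pass of a toy scalar discrete-time recurrent network.
    With teacher forcing, the fed-back signal is the teacher value
    targets[t] instead of the network's own output outputs[t]."""
    T = len(inputs)
    outputs = np.zeros(T)
    prev = 0.0                       # fed-back signal from the previous step
    for t in range(T):
        outputs[t] = np.tanh(W_in * inputs[t] + W_rec * prev)
        prev = targets[t] if teacher_forcing else outputs[t]
    return outputs
```

With teacher forcing enabled, the output at each step depends only on the teacher signals, not on the network's own earlier (possibly wrong) outputs, which is precisely why the error propagation system simplifies.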

The algorithm that we describe in this paper implements the static map f_S, which, per (11), carries the instantaneous values x(t) and I(t) to f(x(t), I(t)) + ax(t).

The domain of this transformation with respect to the figure is indicated by the line labeled D_S, and the range by R_S. Notice that the hardware corresponding to this transformation contains only passive devices. In particular, notice that the integrator blocks are not contained in this transformation. Thus, from elementary signal and system theory [28], this map is memoryless regardless of the state vector feedback. Hence the corresponding training algorithm has only to account for static credit assignment or error propagation. In this case, backpropagation will be adequate. However, we quickly point out that unlike the earlier algorithms, the training in this case does not directly address the recurrent learning task. It is the main result of Proposition 1 which provides the transformations that complete the design process.

Our main result shows that the hidden layer of Fig. 2, although possessing no intralayer connections, nonetheless behaves exactly like a fully recurrent network whose input and output are related to those of the full network of Fig. 2 by linear operators. Furthermore, the recurrent weight matrix is the product of the input-hidden and hidden-output weight matrices. In particular, note that this weight matrix relationship cannot be obtained by trivially passing the hidden layer output through the hidden-output layer weights and then feeding it back directly to the input to pass through the input-hidden layer weights, thereby achieving the matrix product. This is not possible for two reasons. First, straightforward analysis will reveal that such a procedure is not sufficient to induce a continuous-time recurrent neural network relationship among the hidden layer variables. Second, we cannot trivially eliminate the output layer nodes, because their behavior is dynamic, containing an essential memory component. In other words, the output layer nodes process the hidden layer outputs through a specific neural network dynamic system before feedback. The result of Proposition 1 essentially realizes that, utilizing the particular form of the simple additive model of equation (6), we are able to propagate similar recurrent behavior unaltered (to within linear operators) to the space of hidden layer variables. It is significant to note that such preservation of recurrent behavior does not appear possible with some other recurrent models, such as the shunting models [29].

The method presented here is distinct from teacher forcing from several viewpoints. One viewpoint observes from Fig. 3 that the domain and range of the respective systems which the two algorithms learn are quite different. In fact, the context



for teacher forcing vanishes in this new algorithm: the static transformation f_s can be trained, for example, by static backpropagation. To highlight this fact, consider that the input and output time samples for training can be randomized while learning f_s (and this is indeed done in the example presented later in this paper). Such randomization will completely destroy learning (teacher forced or otherwise) for the dynamic map. A more fundamental difference can be seen by considering that teacher forcing starts out as an approximation to the exact recurrent training formula. In fact, it is possible to graphically represent the equivalent network that teacher forcing trains. This network is initially different from the actual network to be trained. In the ideal teacher forcing case, the former network approaches the actual network as training proceeds. Provided this training is successful, the teacher forcing equivalent network converges to the actual network required. Thus teacher forcing is asymptotically accurate provided the task is learned. If this learning is not achieved, some compensation will have to be made for the non-vanishing teacher forcing term. Toomarian and Barhen, for example, address this by discounting the teacher forcing term as time progresses [21]. On the other hand, the algorithm presented here is not an approximation. The training and consequent operation of the network are mathematically exact with respect to the learning requirements. This is demonstrated in the statement of Proposition 1 and its proof.

One can also observe in Fig. 3 that teacher forcing, by definition, is not capable of transforming the range of the f mapping to any range set occurring prior to the integrator block. After the teacher forcing heuristic is applied, the range of the mapping still lies after the integrator block. Thus, the resulting credit assignment system still has memory, and the learning is of recurrent complexity.

On the other hand, the algorithm described here actually specifies the range of the f_s mapping to lie prior to the integrator block. In order for the algorithm to remain exact with respect to the learning problem, it must then specify how to deal with the outstanding issue of passage through the integrator block. This issue is resolved in the development of the main result of Proposition 1. Meanwhile, the memoryless transformation f_s enables the learning to be of feedforward complexity.

Let the output vector of the conceptual two-layer, output-feedback network in Fig. 2 be x(t). This output is instantaneously fed back to the input. Let the state of the neurons in the hidden layer be u(t). Let the vector of output functions for the hidden layer be g(·). Due to recurrence in the output layer, we have:

ẋ(t) = −a x(t) + W2 g(u(t)) + c
     = −a x(t) + W2 g(W1 x(t) + Ω I(t) + b) + c,

where W1, W2 are the weight matrices from the input x(t) to the hidden, and from the hidden to the output layers, respectively. Ω is the weight matrix from the input I(t) to the hidden layer, and b, c are bias vectors.

We therefore notice that the architecture of Fig. 2 is "natural" for learning a dynamical system of the form

ẋ(t) = f(x(t), I(t)),

since this simply involves matching the right-hand-side expressions according to

f(x(t), I(t)) = −a x(t) + W2 g(W1 x(t) + Ω I(t) + b) + c,

or

f(x(t), I(t)) + a x(t) = W2 g(W1 x(t) + Ω I(t) + b) + c, (14)

and since the universal approximation theorem of neural networks guarantees that we can achieve this equality with arbitrary accuracy for virtually every function of practical significance.

We also see an equivalence to the fully recurrent network by considering the equation for the hidden layer states,

u(t) = W1 x(t) + Ω I(t) + b,

so that

x(t) = W1⁻¹(u(t) − Ω I(t) − b).

Substituting this in the equation for the recurrent output layer,

ẋ(t) = −a x(t) + W2 g(u(t)) + c,

we find

u̇(t) = −a u(t) + W1 W2 g(u(t)) + Ω(a I(t) + İ(t)) + a b + W1 c,

which is the equation of a fully recurrent network with weight matrix W = W1 W2, and input J(t) = Ω(a I(t) + İ(t)) + a b + W1 c.

Consequently, the task of training the fully recurrent network is identical to that of training the feedforward section of the output-feedback network with the appropriate data transformations.
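The equivalence derived above is easy to check numerically. The following sketch (not from the paper: random weights, a constant external input so that İ(t) = 0, a square invertible W1, and simple Euler integration) simulates both the output-feedback network and the embedded fully recurrent network, then recovers x(t) from u(t) through the stated linear operator:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, a, h = 3, 2, 1.0, 1e-3          # state dim, input dim, decay, Euler step

# Illustrative assumption: m = n and W1 invertible, so W1^{-1} is an ordinary inverse.
W1 = rng.normal(size=(n, n)) + 2.0 * np.eye(n)
W2 = rng.normal(size=(n, n))
Om = rng.normal(size=(n, p))          # Omega: external-input weight matrix
b, c = rng.normal(size=n), rng.normal(size=n)
I = rng.normal(size=p)                # constant input, so dI/dt = 0
g = np.tanh

x = rng.normal(size=n)                # output-feedback network state
u = W1 @ x + Om @ I + b               # consistent hidden-layer initial condition
J = Om @ (a * I) + a * b + W1 @ c     # recurrent input term (with İ = 0)

for _ in range(2000):                 # integrate both systems with Euler steps
    x = x + h * (-a * x + W2 @ g(W1 @ x + Om @ I + b) + c)
    u = u + h * (-a * u + (W1 @ W2) @ g(u) + J)

x_rec = np.linalg.solve(W1, u - Om @ I - b)   # x(t) = W1^{-1}(u - Omega I - b)
print(np.max(np.abs(x - x_rec)))
```

Because each Euler step commutes exactly with the affine map u = W1 x + Ω I + b, the two trajectories agree to machine precision.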

C. Some Features of the New Algorithm

In this section, we highlight some specific implications of the main result with respect to fully recurrent network design.

1) The main result precisely specifies for the first time an analytical expression relating the weights, inputs, and outputs of a fully recurrent network to the weights, biases, inputs, and outputs of a feedforward structure (embedded in recurrent dynamics). Consequently, this result provides a systematic vehicle for the transfer of useful results from the extensive body of feedforward neural network research to the recurrent domain. This will have the advantage of reducing or eliminating some ad hoc procedures in recurrent network design. For example, results derived from the Vapnik-Chervonenkis dimension (VCdim) [30] relating the number of hidden layer nodes to network generalization ability will be found useful at least in guiding the choice of recurrent network size.


2) The resulting training algorithm is less constraining on the training data collection process. For example, although it models a system with temporal pattern output, the algorithm does not require time-stamping of the training pairs. It also does not require, during a specific training run, that the training data samples be homogeneous in the sense that they are all from the same temporal pattern. Both of these requirements are necessary in virtually all other recurrent algorithms in current use. This is because the latter algorithms need the temporal sequence information, otherwise a totally unrelated system may be modeled. On the other hand, the new algorithm models an embedded static system, whose mapping is independent of time. Refer again to Fig. 3. Thus training data can be presented in a sequence that is independent of time, and independent of the particular (time-varying) pattern used in training. In fact, in the example of Section V, the training samples were presented in random order. In this example, note that the time variable input is needed only because the time-varying nature of the system is equivalent to the explicit dependence of an external input on the time variable. A consideration of the time-invariant case (external input dependence on time is not explicit) illustrates this point in a more obvious way.

3) While one intuitively expects that some advantages, such as network size reduction, may be achieved by using more general, non-sigmoid nonlinearities in recurrent network design, it is not clear how one can go about selecting such nonlinearities. For example, reported work suggests that deeper multilayer networks can lead to significant reduction in the hidden layer sizes [31]. Extension of the result of this paper to the general multilayer case provides a systematic solution to this problem. Such a solution can take advantage of depth-size tradeoffs to reduce the recurrent network size at the cost of more complex nonlinearities. This is briefly described below. Consider the learning equation (11) from the proof of Proposition 1:

f(x(t), I(t)) + a x(t) = W2 g(W1 x(t) + Ω I(t) + b) + c.

For the multilayer case, this learning equation can be rewritten as

f(x(t), I(t)) + a x(t) = W2 g_V(W1 x(t) + Ω I(t) + b) + c. (15)

In this last equation, W1 is the weight matrix between the input layer and the first hidden layer, and W2 is the weight matrix between the last hidden layer and the output layer. Ω is the weight matrix between the external input I(t) and the first hidden layer. b is the vector of bias inputs to the first hidden layer. c is the vector of bias inputs to the output layer. The nonlinearity g_V(·) is now viewed as the equivalent function mapping the vector variable at the input to the first hidden layer to the vector output of the last hidden layer. Since this equivalent function is necessarily dependent on the weights and biases between its input and output terminals, the subscript V is included, where V is a vector containing the respective weights and biases. Note that the dimension of g_V(·) may be different from that of its argument, since the last hidden layer size may be different from the first hidden layer size. In any case, analysis in the same spirit as that of the main result leads to the fully recurrent description

u̇(t) = −a u(t) + W g_V(u(t)) + J(t),  t0 ≤ t ≤ tf, (16)

where the weight matrix, input, and output equations are again defined as:

W = W1 W2
J(t) = H_I[I(t)] = Ω(a I(t) + İ(t)) + a b + W1 c
y(t) = H_O[u(t), I(t)] = W1⁻¹(u(t) − Ω I(t) − b).

In the general case, this describes a recurrent neural network with arbitrary nonlinear units depending on the dynamic mapping being learned. Furthermore, the nonlinear units no longer perform local operations, their outputs being a nonlinear combination of the neuron states. In fact, the nonlinear units can no longer be viewed as attached to any particular neuron, and the "connectivity" as it applies to signal feedback is now abstract. Such a system supports the possible disparity in the input and output dimensions of g_V(·): the maximum number of feedback terms to each neuron does not have to equal the network dimension. Note, however, that it will often be possible to insert an additional layer of linear units just before the output layer with dimension equal to the first hidden layer. This then equalizes the recurrent network dimension and the number of feedback terms. The design is systematic, and will typically lead to smaller recurrent network sizes at the cost of increased feedback complexity.

4) The algorithm suggests that the inclusion of a few linear units in the recurrent network may significantly enhance the training. This is because the transformation of Proposition 1 shows that the nonlinear feedback portion of the neural network in the additive model effectively has to model the (presumably) purely nonlinear system plus the linear decay term of the neural model. Thus the typical all-sigmoid (or all-nonlinear) design is unnecessarily required to model some linear range. The neural network has to work harder than is necessary. This translates into long training times. The number of linear units needed to alleviate this problem is ideally equal to the system order. See Section IV on design guidelines.

5) The transformation of Proposition 1 allows the training algorithm to directly control all the units in a fully recurrent network, just as the hidden units of a regular two-layer feedforward network can be controlled in training. This leads to the possibility of faster training, which is ultimately reflected in the fact that the training need not be of more than feedforward complexity. Many current algorithms can at most apply heuristics to the visible units (e.g., teacher forcing), with little control over the other units.
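The multilayer equivalence in (15) and (16) admits the same kind of numerical check as the two-layer case. The sketch below rests on illustrative assumptions not fixed by the text: one extra hidden layer inside g_V, a square first-layer matrix W1, a constant input, and Euler integration; all sizes and weight values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, a, h = 3, 2, 8, 1.0, 1e-3    # state, input, inner-layer sizes; decay; step

W1 = rng.normal(size=(n, n)) + 2.0 * np.eye(n)   # input -> first hidden (square, invertible)
A  = rng.normal(size=(q, n))                     # inner hidden layer weights (part of V)
d  = rng.normal(size=q)                          # inner hidden layer biases (part of V)
W2 = rng.normal(size=(n, q))                     # last hidden -> output
Om = rng.normal(size=(n, p))
b, c = rng.normal(size=n), rng.normal(size=n)
I = rng.normal(size=p)                           # constant input, so İ = 0

def g_V(z):
    # equivalent map from first-hidden pre-activation to last-hidden output
    return np.tanh(A @ np.tanh(z) + d)

x = rng.normal(size=n)
u = W1 @ x + Om @ I + b
J = Om @ (a * I) + a * b + W1 @ c

for _ in range(2000):
    x = x + h * (-a * x + W2 @ g_V(W1 @ x + Om @ I + b) + c)
    u = u + h * (-a * u + (W1 @ W2) @ g_V(u) + J)     # equation (16), W = W1 W2

x_rec = np.linalg.solve(W1, u - Om @ I - b)
print(np.max(np.abs(x - x_rec)))
```

As in the two-layer case, the hidden-layer trajectory is the exact affine image of the output trajectory, so the recovered state matches to machine precision.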


IV. DESIGN RECIPE

This section outlines some design guidelines. The guidelines are divided into two parts, one for the multiple-input single-output (MISO) case, and another for the multiple-input multiple-output (MIMO) case. Although the multiple-output case includes the single-output case, degeneracies in the single-output case lead to simplifications worth pointing out.

A. Multiple-Input Single-Output (MISO) Nonlinear System

To be specific, suppose we wish to design a fully recurrent network to model the dynamical system

d^n y(t)/dt^n = F( d^{n-1}y(t)/dt^{n-1}, d^{n-2}y(t)/dt^{n-2}, ..., ẏ(t), y(t), I1(t), I2(t), ..., Ip(t) ).

It is assumed that we have training samples

{ I1(t_i), I2(t_i), ..., Ip(t_i), y(t_i), ẏ(t_i), ÿ(t_i), ..., d^n y(t_i)/dt^n },  i = 1, 2, ..., N,

for some N. It is also assumed that we have determined the system order n that will be learned, either by prior knowledge, or by making a guess that we know is at least as large as the system order.

In the following, the system has been transformed to a state variable representation:

ẋ1(t) = f1(x(t)) = x2(t)
ẋ2(t) = f2(x(t)) = x3(t)
...
ẋn(t) = fn(x(t)) = F(x(t), I(t)),

where x1(t) = y(t), x2(t) = ẏ(t), x3(t) = ÿ(t), ..., xn(t) = d^{n-1}y(t)/dt^{n-1}. Also, x = [x1, x2, ..., xn]^T, f = [f1(x(t)), f2(x(t)), ..., fn(x(t))]^T, and I = [I1, I2, ..., Ip]^T.

1) Data set for feedforward training. Form a new input data set consisting of the state vector and the input vector, D_I = {I(t_i), x(t_i)}, and a new output data set D_O = {ẋn(t_i) = fn(x(t_i)) = F(x(t_i))}, i = 1, 2, ..., N. See further discussion about this output data set in Step 5.

2) Recurrent network decay term. Select a decay term a for the recurrent network.

3) Conventional feedforward training. Train a network in the class N²_{n+p,ñ,1} to map D_I to D_O. Note that the elements of D_I should be kept in the order shown above for the transformations given below to be consistent.

4) Weight processing. Process the weights as follows. Denote by I_n the n × n identity matrix, and denote by 0_{n,m} the n × m matrix of zeros.

a) Let W̃1 be the ñ × (n + p + 1) weight matrix from the input layer to the hidden layer. The extra column contains the bias weights. Let W̃2 be the 1 × (ñ + 1) weight matrix from the hidden layer to the output layer. Again, the extra column is for the bias weight.

b) Write

W̃2 = [ W̄2 ⋮ c ],

where c is the bias weight. Also let W̄1 be the n × n matrix with a on the main diagonal, 1 on the superdiagonal, and zeros elsewhere:

W̄1 = [ a  1  0  ...  0 ]
     [ 0  a  1  ...  0 ]
     [ ...         ... ]
     [ 0  ...  0  a  1 ]
     [ 0  ...  0  0  a ].

c) Now form weight matrices W1' and W2' according to

W1' = [        W̃1            ]
      [ 0_{n,p}  W̄1  0_{n,1} ]

and

W2' = [ 0_{n-1,ñ}   I_{n-1}    0_{n-1,1}   0_{n-1,1} ]
      [    W̄2      0_{1,n-1}      1           c      ].

d) Finally, extract the matrices necessary for the transformation to recurrent form as follows. First define m = ñ + n. Also let A(:, i) denote the ith column of the matrix A, and let A(:, i : j) denote the submatrix formed by taking columns i through j of A. Let A(i, j) denote the element in row i and column j. Then,

Ω = W1'(:, 1 : p)
W1 = W1'(:, p + 1 : p + n)
b = W1'(:, p + n + 1)
W2 = W2'(:, 1 : m)
c = W2'(n, m + 1).

5) Recurrent network design. The recurrent neural network is a fully connected network with m neurons. The first ñ of the neurons have output functions corresponding to those used in the hidden layer of the feedforward training step above. For convenience, we will assume they are sigmoidal. The remaining n neurons are linear, with unity gradient. This last specification is theoretically not necessary. It is a result of taking advantage of the state space form as shown above. More specifically, note from Fig. 2 that what we are seeking to do is replicate the dynamic behavior of the given system in the recurrent output layer. That is, as indicated in (14),

f(x(t), I(t)) + a x(t) = W2 g(W1 x(t) + Ω I(t) + b) + c.

By applying this to the state space form above, it is straightforward to see that the first n - 1 equations are purely linear, while the nth equation is the arbitrary, continuous function, plus a linear term. The design steps presented here take advantage of this, using simple linear units to supply the linear terms. We thus need to learn only a one-dimensional output function. This is


reflected in the output data set D_O in Step 1. In fact, all the neurons could have sigmoidal output functions if we so desire. In such a case, the design recipe given here will have to be modified accordingly. It is obvious, however, that given the state space representation, using all nonlinear output functions will typically render the problem more difficult than it needs to be. This is a significant point, since most recurrent network designs contain all-sigmoid output functions. With the result presented here, it is clear that in many cases the difficulty of learning will be reduced with the introduction of a few linear units (note that usually m >> n). The equation of the network is

u̇(t) = −a u(t) + W g(u(t)) + J(t),

with the weights specified by

W = W1 W2,

and the input obtained as

J(t) = Ω(a I(t) + İ(t)) + a b + W1 c.

The desired output y(t) is the first element of the vector

x(t) = W1⁻¹(u(t) − Ω I(t) − b).
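The MISO weight processing of Step 4 can be sketched directly. The snippet below uses illustrative sizes and randomly generated stand-in weights (no trained network is assumed); it assembles W1' and W2', extracts the recurrent-form matrices, and confirms that the augmented feedforward map reproduces ẋ_i + a x_i row by row, as the state space argument of Step 5 requires.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, nh, a = 3, 2, 6, 1.0            # system order, inputs, hidden size ñ, decay

# Stand-ins for a "trained" two-layer net mapping [I; x; 1] -> F(x, I) (scalar).
Wt1 = rng.normal(size=(nh, p + n + 1))   # W̃1, includes bias column
Wb2 = rng.normal(size=(1, nh))           # W̄2, hidden -> output weights
cw  = rng.normal()                       # output bias c

# Step 4b: W̄1 has a on the diagonal and 1 on the superdiagonal.
Wb1 = a * np.eye(n) + np.diag(np.ones(n - 1), 1)

# Step 4c: assemble the augmented matrices W1' and W2'.
m = nh + n
W1p = np.vstack([Wt1, np.hstack([np.zeros((n, p)), Wb1, np.zeros((n, 1))])])
top = np.hstack([np.zeros((n - 1, nh)), np.eye(n - 1),
                 np.zeros((n - 1, 1)), np.zeros((n - 1, 1))])
bot = np.hstack([Wb2, np.zeros((1, n - 1)), [[1.0]], [[cw]]])
W2p = np.vstack([top, bot])

# Step 4d: extract the recurrent-form matrices.
Om, W1, bb = W1p[:, :p], W1p[:, p:p + n], W1p[:, p + n]
W2, cv = W2p[:, :m], W2p[:, m]

# Check: with ñ sigmoid units and n linear units, the assembled map gives
# h_i = a x_i + x_{i+1} for i < n, and h_n = F(x, I) + a x_n.
x, I = rng.normal(size=n), rng.normal(size=p)
pre = W1 @ x + Om @ I + bb
out = np.concatenate([np.tanh(pre[:nh]), pre[nh:]])
h = W2 @ out + cv
F = Wb2 @ np.tanh(Wt1 @ np.concatenate([I, x, [1.0]])) + cw
print(h[:n - 1] - (a * x[:n - 1] + x[1:]))   # zeros
print(h[n - 1] - (F[0] + a * x[n - 1]))      # zero
```

The first n − 1 rows are supplied entirely by the linear units, so only the one-dimensional map F needed to be learned, exactly as the recipe states.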

B. Multiple-Input Multiple-Output (MIMO) Nonlinear System

In this case, we wish to design a fully recurrent network to model a dynamical system

ẋ(t) = f(x(t), I(t)).

1) Data set for feedforward training. Form an input data set consisting of the state vector and the input vector, D_I = {I(t_i), x(t_i)}, and an output data set consisting of the derivative of the state vector, D_O = {ẋ(t_i)}, i = 1, 2, ..., N.

2) Recurrent network decay term. Select a decay term a for the recurrent network.

3) Conventional feedforward training. Train a network to map D_I to D_O. Note that the elements of D_I should be kept in the order shown above for the transformations given below to be consistent.

4) Weight processing. Process the weights as follows. Denote by I_n the n × n identity matrix, and denote by 0_{n,m} the n × m matrix of zeros.

a) Let W̃1 be the ñ × (n + p + 1) weight matrix from the input layer to the hidden layer. The extra column contains the bias weights. Let W̃2 be the n × (ñ + 1) weight matrix from the hidden layer to the output layer. Again, the extra column contains the bias weights.

b) Write

W̃2 = [ W̄2 ⋮ c ],

where c is the bias weight vector. Also let W̄1 be the n × n diagonal matrix

W̄1 = a I_n.

c) Now form weight matrices W1' and W2' according to

W1' = [        W̃1            ]
      [ 0_{n,p}  W̄1  0_{n,1} ]

and

W2' = [ W̄2  I_n  c ].

d) Finally, extract the matrices necessary for the transformation to recurrent form as follows. First define m = ñ + n. Also let A(:, i) denote the ith column of the matrix A, and let A(:, i : j) denote the submatrix formed by taking columns i through j of A. Then,

Ω = W1'(:, 1 : p)
W1 = W1'(:, p + 1 : p + n)
b = W1'(:, p + n + 1)
W2 = W2'(:, 1 : m)
c = W2'(:, m + 1).

5) Recurrent network design. The recurrent neural network is a fully connected network with m neurons. The first ñ of the neurons have output functions corresponding to those used in the hidden layer of the feedforward training step above. For convenience, we will assume they are sigmoidal. The remaining n neurons are linear, with unity gradient. This last specification is theoretically not necessary. It is a result of taking advantage of the linear decay term in the additive model used for the recurrent network. As before, the equation of the network is

u̇(t) = −a u(t) + W g(u(t)) + J(t),

with the weights specified by

W = W1 W2,

(Training data matrix for the design example of Section V:)

[ 1         1         ...  1           3         3         ...  3         ]
[ t1        t2        ...  t101        t1        t2        ...  t101      ]
[ J1(t1)    J1(t2)    ...  J1(t101)    J3(t1)    J3(t2)    ...  J3(t101)  ]
[ J̇1(t1)    J̇1(t2)    ...  J̇1(t101)    J̇3(t1)    J̇3(t2)    ...  J̇3(t101)  ]

Fig. 4. Neural network Bessel function generator.

and the input obtained as

J(t) = Ω(a I(t) + İ(t)) + a b + W1 c.

The desired output is

x(t) = y(t) = W1⁻¹(u(t) − Ω I(t) − b).

V. DESIGN EXAMPLE: A BESSEL FUNCTION GENERATOR

In this section we illustrate the method of this paper by presenting the results of its application to learning Bessel's differential equation,

t² ẍ(t) + t ẋ(t) + (t² − α²) x(t) = 0.

If we consider I1(t) = α and I2(t) = t as the inputs, and x(t) as the output, this system is a representative nonlinear dynamical system. Since its solutions are well known, it is a convenient system for testing our result.

Our goal is to train a recurrent network such that given the order α and the time variable t as input, it generates at its output a Bessel function of the first kind, of order α, J_α(t). The network is trained with a few Bessel functions, and is then expected to produce Bessel functions not only within its training set, but also to generalize to Bessel functions outside the training set, provided the appropriate order is applied at its input. This situation is diagrammed in simple block form in Fig. 4.

A. Design Procedure

A set of state variables for this Bessel system is

{x1(t), x2(t)} = {x(t), ẋ(t)},

and the set of inputs is

{I1(t), I2(t)} = {α, t}.

We can write

ẋ1(t) = x2(t) = f1(x(t), I(t))
ẋ2(t) = f2(x(t), I(t)).

Note that x1(t) = x(t) = J_α(t), and x2(t) = ẋ(t) = J̇_α(t). Using the design recipe for the MISO system (Section IV-A), the input data set is

D_I = { α, t_i, J_α(t_i), J̇_α(t_i) },  i = 1, 2, ..., N,

for N training samples. Similarly, the output data set is

D_O = { J̈_α(t_i) },  i = 1, 2, ..., N.

The decay term a was set equal to unity. Note that the system order is n = 2, and the input vector I(t) = [α t]^T has dimensionality p = 2. A hidden layer size of 50 was arbitrarily chosen. Hyperbolic tangent sigmoids were used as the output function in the hidden layer. A two-layer feedforward neural network was then trained to map the input data set to the output data set. The training algorithm used was the standard backpropagation algorithm, with learning rate 0.3 and momentum factor 0.7. The training proceeded for 16,703 iterations, at which point the fractional error in the output was 0.01. The parameters at this point were used to design the recurrent network as outlined below.

It should be remarked that no attempt was made to optimize this feedforward procedure. Almost all the choices were made arbitrarily, from the choice of the neural network size to the choice of training parameters. The reason for this is that the main result being demonstrated is independent of whether these parameters are optimal or not. The literature contains many methods of improving backpropagation training [1], and the focus of our investigation was not to optimize the feedforward training technique, but to demonstrate that the generically less expensive feedforward training procedure will yield the appropriate fully recurrent network. It is therefore almost sure that the network size and training time reported here can be improved upon.

The process of Step 4 in Section IV-A was carried out to obtain the weights needed for the transformation to recurrent form. The design process is completed in a straightforward manner according to Step 5 in Section IV-A.
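The training data construction for this example can be sketched as follows, using scipy's Bessel routines as a stand-in for whatever table of values was used in the paper; the second derivative is obtained from Bessel's equation itself rather than tabulated separately:

```python
import numpy as np
from scipy.special import jv, jvp

# Samples of [alpha, t, J, J'] as inputs and J'' as the target, for alpha = 1, 3.
t = np.linspace(7.0, 17.0, 101)
D_I_blocks, D_O_blocks = [], []
for alpha in (1.0, 3.0):
    J, Jp = jv(alpha, t), jvp(alpha, t, 1)
    # From Bessel's equation: x'' = -x'/t - (1 - alpha^2/t^2) x
    Jpp = -Jp / t - (1.0 - alpha**2 / t**2) * J
    D_I_blocks.append(np.stack([np.full_like(t, alpha), t, J, Jp]))
    D_O_blocks.append(Jpp)
D_I, D_O = np.hstack(D_I_blocks), np.hstack(D_O_blocks)   # 4 x 202 inputs, 202 targets

# Sanity check: the ODE-derived second derivative matches scipy's jvp(.., 2).
print(np.max(np.abs(D_O[:101] - jvp(1.0, t, 2))))
```

The columns of D_I can then be presented to the feedforward trainer in any (e.g., random) order, since the embedded map being learned is static.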

1) Performance on Bessel functions in the training set, J1(t) and J3(t): The performance of the trained network when prompted first with α = 1 and then with α = 3 is shown in Fig. 5.

The maximum error between the curves in each case is 0.03.


Fig. 5. Samples from both of these Bessel functions were used in the training set. (a) J1(t) and the corresponding neural network output. (b) J3(t) and the corresponding neural network output.

It is evident that the neural network output behavior is close to that of the corresponding Bessel function. Theoretically, if we had achieved perfect training, the two curves would be identical. Practically, we would expect that if the system being modeled is robust to small parameter deviations (corresponding to imperfect training), the closeness of the two curves should improve as our training improves. That is, accuracy is limited only by the feedforward training fidelity.

2) Generalization to Bessel functions outside the training set, J0.5(t), J1.5(t), J2(t), J2.5(t), J3.5(t), J4(t): The neural network was then tested on six other Bessel functions not in the training set. In other words, inputs I(t) = [α t]^T were applied to the network for α = 0.5, 1.5, 2, 2.5, 3.5, 4, and the output observed. The performance is illustrated in Fig. 6.

Similar remarks apply here as were made for J1(t) and J3(t). The maximum error between the curves is 0.03 for α = 1.5 and α = 3.5; 0.04 for α = 2 and α = 2.5; 0.05 for α = 0.5; and 0.07 for α = 4.

These tests (including the training set) cover the range of Bessel functions taken at half order increments from order 0.5


Fig. 6. (None of these Bessel functions were used in the training set.) (a) J0.5(t) and the corresponding neural network output. (b) J1.5(t) and the corresponding neural network output. (c) J2(t) and the corresponding neural network output. (d) J2.5(t) and the corresponding neural network output. (e) J3.5(t) and the corresponding neural network output. (f) J4(t) and the corresponding neural network output.

to order 4. In all cases the time interval was 7 ≤ t ≤ 17.

VI. THE DISCRETE-TIME CASE

The extension to the discrete-time case is straightforward, and it is succinctly presented in this section.

Proposition 2: Let N^N_{i1,i2,...,iN+1} be the class of functions generated by a multilayer feedforward network with i1 inputs, iN+1 outputs, and N − 1 hidden layers with i2, ..., iN nodes, and let "a" be a constant.

Then for every dynamical system

x(k + 1) = f(x(k), I(k)),  k0 ≤ k ≤ kf, (17)

with input I(k) ∈ R^p and output x(k) ∈ R^n, f(·, ·) a continuous function, such that

f(x, I) ∈ N²_{n+p,m,n},  m ≥ n,

there exist constant matrices W1, W2, Ω, and constant vectors b, c, such that the recurrent neural network

u(k + 1) = W g(u(k)) + J(k + 1),  k0 ≤ k ≤ kf, (18)

with the weight matrix W specified by

W = W1 W2, (19)

an input equation defined by

J(k + 1) = Ω I(k + 1) + W1 c + b, (20)


and an output equation defined by

y(k) = W1⁻¹(u(k) − Ω I(k) − b), (21)

has

y(k) = x(k),  k0 ≤ k ≤ kf,

provided the initial condition

u(k0) = W1 x(k0) + Ω I(k0) + b. (22)

Here u(k) ∈ R^m, and g(·) is a vector of neuron output functions.

Proof of Proposition 2: Since f(x, I) ∈ N²_{n+p,m,n}, it is possible to choose W1, W2, Ω, b, c according to the parameters in the following valid expression:

f(x(k), I(k)) = W2 g(W1 x(k) + Ω I(k) + b) + c. (23)

In particular, the right-hand side of (23) can be obtained after training a two-layer feedforward neural network for mapping [x(k), I(k)] to f(x(k), I(k)). b is the vector of bias inputs to the hidden layer, and c is the vector of bias inputs to the output layer. The weight matrix from the input to the hidden layer is [W1 ⋮ Ω], and the weight matrix from the hidden to the output layer is W2.

The recurrent network of Proposition 2, along with its input and output functions, is thus completely specified. Now consider (21),

y(k) = W1⁻¹(u(k) − Ω I(k) − b).

Advancing one time step,

y(k + 1) = W1⁻¹(u(k + 1) − Ω I(k + 1) − b).

Substituting for u(k + 1) from (18),

y(k + 1) = W1⁻¹(W g(u(k)) + J(k + 1) − Ω I(k + 1) − b).

Substituting for J(k + 1) from (20),

y(k + 1) = W1⁻¹(W g(u(k)) + Ω I(k + 1) + W1 c + b − Ω I(k + 1) − b)
         = W1⁻¹(W g(u(k)) + W1 c)
         = W2 g(u(k)) + c, (24)

where we have also used the fact that W1 W2 = W from (19). Now from (21), we can also write

u(k) = W1 y(k) + Ω I(k) + b. (25)

Using (25) in (24),

y(k + 1) = W2 g(W1 y(k) + Ω I(k) + b) + c
         = f(y(k), I(k)),  by equation (23).

Therefore, over the interval of interest, k0 ≤ k ≤ kf, the neural network output y(k) obeys the same dynamical equation (17) as does the dynamical system output x(k). Furthermore, by provision (22) of Proposition 2, the initial conditions are identical. Therefore, by uniqueness, y(k) = x(k), k0 ≤ k ≤ kf.
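Proposition 2 can be checked numerically in a few lines. The sketch below (random weights, m = n so that W1⁻¹ is an ordinary matrix inverse; these choices are illustrative, not from the paper) runs the plant (17) and the recurrent network (18)-(21) side by side and confirms y(k) = x(k):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, K = 3, 2, 50                     # state dim, input dim, number of steps

W1 = rng.normal(size=(n, n)) + 2.0 * np.eye(n)   # square, invertible
W2 = rng.normal(size=(n, n))
Om = rng.normal(size=(n, p))
b, c = rng.normal(size=n), rng.normal(size=n)
g = np.tanh
W = W1 @ W2                            # (19)

def f(x, I):                           # the map realized by the trained net, (23)
    return W2 @ g(W1 @ x + Om @ I + b) + c

I_seq = rng.normal(size=(K + 1, p))    # arbitrary external input sequence
x = rng.normal(size=n)
u = W1 @ x + Om @ I_seq[0] + b         # initial condition, provision (22)

for k in range(K):
    x = f(x, I_seq[k])                              # plant: x(k+1) = f(x(k), I(k))
    J = Om @ I_seq[k + 1] + W1 @ c + b              # input equation (20)
    u = W @ g(u) + J                                # recurrent network (18)
    y = np.linalg.solve(W1, u - Om @ I_seq[k + 1] - b)   # output equation (21)

print(np.max(np.abs(y - x)))
```

Unlike the continuous-time check, no integration error enters here: the discrete updates reproduce the plant trajectory exactly, up to floating-point roundoff.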

VII. CONCLUSION

We have presented a method for learning with only feedforward complexity in fully recurrent networks. No approximations are involved in this method. Rather, it depends on a mathematically exact transformation that reveals an embedded feedforward structure in the recurrent formulation. Furthermore, through simple linear operations, we are able to relate the inputs and outputs of the fully recurrent network and its embedded feedforward component. It turns out that achieving the feedforward map automatically achieves the dynamical map (to within a linear operation). Consequently, recurrent training can be achieved with feedforward complexity. In particular, no secondary or auxiliary dynamical equations are needed to keep track of error propagation. The method utilizes all the state variables of the dynamical system, which must be accessible in some form or other for meaningful modeling to be attainable by any method. The available system output and its derivatives will suffice in many cases. We compared the method here to others that have been reported in the literature, and we also highlighted some particular features of the new method. We presented a step-by-step design rule for typical dynamical systems. We also presented simulation results for the modeling of a representative nonlinear dynamical system: Bessel's differential equation with the order as the input. The neural network designed provided close approximations to the Bessel function outputs not only within the training set, but for a significant number of different orders of Bessel functions outside the training set.

The development of the method is sufficiently general to be applicable to generic temporal learning problems such as system identification, sequence recognition, temporal associative memory, time-series modeling, and so on.

The method was also extended in a straightforward manner to discrete-time fully recurrent networks.

ACKNOWLEDGMENT

The author is deeply grateful to the reviewers, whose comments have made it possible to focus the presentation on the relevant contributions of this work. The author would also like to acknowledge very helpful discussions with Ronald J. Williams.

REFERENCES

[l] D. R. Hush and B. G. Home, “Progress in supervised neural networks,” IEEE Signal Processing Magazine , vol. 10, no. 1, pp. 8-39, 1993.

[2] K. Homik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks , vol. 2, pp. 359-366, 1989.

[3] R. Hecht-Nielsen, “Kolmogorov’s mapping neural network existence theorem,” in Proceedings of the IEEE First International Conference on Neural Networks , vol. 111, 1987, pp. 11-14.

[4] K. Homik, M. Stinchcombe and H. White, “Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks,” Neural Networks , vol. 3, pp. 551-560, 1990.

Page 13: Recurrent neural network training with feedforward complexity

OLUROTIMI: RECURRENT NEURAL NETWORK TRAINING WITH FEEDFORWARD COMPLEXITY I97

[5] P. Cardaliaguet and G. Euvrard, “Approximation of a function and its derivative with a neural network,” Neural Nefworks , vol. 5, pp. 207-220. 1992.

161 A. R. Gallant and H. White, “On learning the derivatives of an unknown mapping with multilayer feedforward networks,” Neural Networks , vol. 5, pp. 129-138, 1992.

171 W. Thomas Miller, III, R. S. Sutton, and Paul J. Werbos, Neural Networks for Control, Cambridge, MA: MIT Press, 1990,

[S] P. J. Werbos, “Neurocontrol and supervised learning: An overview and evaluation,” in Handbook of intelligent control, D. White and D. Sofge, Eds. Van Nostrand, 1992.

[9] M. I. Jordan and R. A. Jacobs, “Learning to control an unstable system with forward modeling,” in Neural Informarion Processing Systems 2 , David S. Touretzky, Ed. Morgan Kaufmann, 1990, pp. 324-331.

[10] H. Miyamoto, M. Kawato, T. Setoyama, and R. Suzuki, “Feedback-error-learning neural network for trajectory control of a robotic manipulator,” Neural Networks, vol. 1, pp. 251-265, 1988.

[11] F. J. Pineda, “Generalization of backpropagation to recurrent and higher order neural networks,” in Neural Information Processing Systems, Dana Z. Anderson, Ed. American Institute of Physics, 1988, pp. 602-611.

[12] R. Hecht-Nielsen, Neurocomputing. Addison-Wesley, 1990.

[13] B. A. Pearlmutter, “Learning state space trajectories in recurrent neural networks,” Neural Computation, vol. 1, no. 2, pp. 263-269, 1989.

[14] K. S. Narendra and K. Parthasarathy, “Identification and control of dynamical systems using neural networks,” IEEE Transactions on Neural Networks, vol. 1, no. 1, pp. 4-27, 1990.

[15] K. S. Narendra and K. Parthasarathy, “Gradient methods for the optimization of dynamical systems containing neural networks,” IEEE Transactions on Neural Networks, vol. 2, pp. 252-262, 1991.

[16] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation, vol. 1, no. 2, pp. 270-280, 1989.

[17] D. Rumelhart and J. McClelland, Parallel Distributed Processing, Vol. 1. Cambridge, MA: MIT Press, 1987.

[18] P. J. Werbos, “Generalization of backpropagation with application to a recurrent gas market model,” Neural Networks, vol. 1, pp. 339-356, 1988.

[19] J. Barhen, N. Toomarian, and S. Gulati, “Application of adjoint operators to neural learning,” Applied Mathematics Letters, vol. 3, no. 3, pp. 13-18.

[20] J. Barhen, N. Toomarian, and S. Gulati, “Adjoint operator algorithms for faster learning in dynamical neural networks,” in Advances in Neural Information Processing Systems 2, David S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, pp. 498-508.

[21] N. B. Toomarian and J. Barhen, “Learning a trajectory using adjoint functions and teacher forcing,” Neural Networks, vol. 5, no. 3, pp. 473-484.

[22] D. G. Luenberger, Introduction to Dynamic Systems. New York: John Wiley, 1979.

[23] T. Kailath, Linear Systems. Englewood Cliffs, NJ: Prentice-Hall, 1980.

[24] S. P. Banks, Mathematical Theories of Nonlinear Systems. Englewood Cliffs, NJ: Prentice-Hall, 1988.

[25] C. L. Giles, G. Z. Sun, H. H. Chen, Y. C. Lee, and D. Chen, “Higher order recurrent networks and grammatical inference,” in Neural Information Processing Systems 2, David S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, pp. 380-387.

[26] C. L. Giles, C. B. Miller, D. Chen, H. H. Chen, G. Z. Sun, and Y. C. Lee, “Learning and extracting finite state automata with second-order recurrent neural networks,” Neural Computation, vol. 4, pp. 393-405, 1992.

[27] K. J. Lang, A. H. Waibel, and G. E. Hinton, “A time-delay neural network architecture for isolated word recognition,” Neural Networks, vol. 3, pp. 23-43, 1990.

[28] C.-T. Chen, System and Signal Analysis. Holt, Rinehart and Winston, 1989.

[29] S. Grossberg, “Nonlinear neural networks: principles, mechanisms, and architectures,” Neural Networks, vol. 1, no. 1, pp. 17-61, 1988.

[30] E. B. Baum and D. Haussler, “What size net gives valid generalization?” Neural Computation, vol. 1, pp. 151-160, 1989.

[31] K.-Y. Siu, V. P. Roychowdhury, and T. Kailath, “Depth-size tradeoffs for neural computation,” IEEE Transactions on Computers, vol. 40, no. 12, pp. 1402-1412, 1991.

Oluseyi Olurotimi (M’92) received the B.Sc. degree in electrical and electronics engineering from the University of Ife (now Obafemi Awolowo University, Ile-Ife, Nigeria) in 1981. He received the M.S. and Ph.D. degrees in electrical engineering from Stanford University in 1986 and 1990, respectively.

Currently, he is an Assistant Professor with the Electrical and Computer Engineering Department and the Center of Excellence in C3I at George Mason University, Fairfax, VA. His main research interests are deterministic and stochastic recurrent neural network learning and applications.