

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 8, NO. 6, NOVEMBER 1997 1301

A Neural-Network Learning Theory and a Polynomial Time RBF Algorithm

Asim Roy, Senior Member, IEEE, Sandeep Govil, and Raymond Miranda

Abstract—This paper presents a new learning theory (a set of principles for brain-like learning) and a corresponding algorithm for the neural-network field. The learning theory defines computational characteristics that are much more brain-like than those of classical connectionist learning. Robust and reliable learning algorithms would result if these learning principles were followed rigorously when developing neural-network algorithms. This paper also presents a new algorithm for generating radial basis function (RBF) nets for function approximation. The design of the algorithm is based on the proposed set of learning principles. The net generated by this algorithm is not a typical RBF net, but a combination of “truncated” RBF and other types of hidden units. The algorithm uses random clustering and linear programming (LP) to design and train this “mixed” RBF net. Polynomial time complexity of the algorithm is proven, and computational results are provided for the well-known Mackey–Glass chaotic time series problem, the logistic map prediction problem, various neuro-control problems, and several time series forecasting problems. The algorithm can also be implemented as an on-line adaptive algorithm.

Index Terms—Designing neural networks, feedforward nets, learning complexity, learning theory, linear programming, polynomial time complexity, radial basis function networks.

I. A NEURAL-NETWORK LEARNING THEORY

LACK of a robust and reliable learning theory has been a significant impediment to the field of neural networks and has held back its wide application. A robust neural-network learning theory, one that specifies a set of principles for learning, has been proposed recently. These principles are summarized and explained below. The algorithm that follows is based on these learning principles.

A. Perform Network Design Task:

A neural-network learning method must be able to design an appropriate network for a given problem, since that is a task performed by the brain. A predesigned net should not be provided to the method as part of its external input, since a network design never is an external input to the brain. An algorithm definitely is not “brain-like” if it has to be provided with a design of the network. This is an essential property for any “stand-alone” learning system, whether a human being, an animal, or a robot that is expected to learn “on its own” without any external design assistance.

Manuscript received January 16, 1995; revised May 25, 1995 and May 19, 1996. This work was supported in part by the National Science Foundation Grant IRI-9113370 and by grants from the College of Business, Arizona State University.

The authors are with the Department of Decision and Information Systems, Arizona State University, Tempe, AZ 85287 USA.

Publisher Item Identifier S 1045-9227(97)05236-3.

Some might argue that parts of the brain (like the vision preprocessing part) are predesigned for us and that the brain is responsible only for fine-tuning and adjusting them. However, in most other cases, the knowledge to be acquired is completely new (e.g., knowledge of mathematics, language, science, and so on) and often unknown to the parent biological system (most parents of the current generation of, say, fifty to sixty year olds knew nothing about computers and so could not pass on relevant net designs to their children); thus a solution to the learning problem (a network design) could not be inherited at all. Hence, not all parts of the brain can feasibly come predesigned for us; the brain must know how to design the network in order to acquire and store all new knowledge. Therefore, neural-net algorithms also must know how to perform the same task.

B. Robustness in Learning:

The method must be robust, so as not to have the local minima problem, the problems of oscillation and catastrophic forgetting, uncertain storage and recall of memories, or similar learning difficulties. Some might argue that ordinary brains, and particularly those with learning disabilities, do exhibit such problems and that the robustness criterion imposes the requirements of a “super” brain. From a computer science and engineering perspective, however, building learning systems that do not have learning problems is definitely the desired goal. In aeronautics, for instance, the task is to build flying machines that can fly at high speeds for long distances with a substantial load, that can fly under different weather conditions, that can fly at very high altitudes, and that have other not so “bird-like” features. The task is not to build systems that replicate the “natural” flying characteristics of birds, with all their pitfalls; such systems would be useless. The engineering goal in neural networks is to design and build learning systems that are robust, reliable, and powerful. There is no interest in creating weak and problematic learning devices that purportedly replicate “ordinary” biological systems with all their problems.

C. Quickness in Learning:

The method must be quick in its learning and learn rapidly from only a few examples. For example, a method that learns from only ten examples learns faster than one that requires 100 or 1000 examples. As before, some might argue that



fast learning is not a characteristic of all human beings. But it is hard to justify building learning systems that behave like learners characterized by such sayings as: “Told him a million times and he still doesn’t understand.” From an engineering point of view, on-line learning systems must learn rapidly from only a few examples.

D. Efficiency in Learning:

The method must be computationally efficient in its learning when provided with a finite number of training examples [19]. It must be able to both design and train an appropriate net in polynomial time. That is, given N examples, the learning time (i.e., both design and training time) must be a polynomial function of N. This property has its origins in the belief that animal brains (of insects and birds, for example) could not be solving NP-hard problems, especially when efficient, polynomial time learning methods can be designed and developed. This, again, is a critical computational property from an engineering and computer science perspective.

E. Generalization in Learning:

The method must be able to generalize reasonably well so that only a small amount of network resources is used. That is, it must try to design the smallest possible net, although it might not be able to do so every time. This property must be an explicit part of the algorithm. The property is based on the notion that the brain could not be wasteful of its limited resources, so it must be trying to design the smallest possible net for every task.

This learning theory defines algorithmic characteristics that are obviously much more brain-like than those of classical connectionist learning, which is characterized by predefined nets, local learning laws, and memoryless learning (no storing of training examples). Beyond being brain-like, these characteristics are also extremely desirable from a computational point of view. Judging by this theory, classical connectionist learning is not very powerful or robust. First of all, it does not even address the issue of network design, a task that should be central to any neural-network learning mechanism. It is also plagued by efficiency problems (lack of polynomial time complexity [6], [11], need for an excessive number of teaching examples) and robustness problems (local minima, oscillation, catastrophic forgetting, and uncertain storage and recall of memories). Classical connectionist learning, therefore, is not very brain-like at all.

The learning scheme postulated here does not specify how learning is to take place. For example, it does not specify whether memory is to be used or not to store training examples for learning, or whether learning is to be through local learning at the nodes or through some global mechanism. It merely defines broad computational characteristics that are brain-like. But there is complete freedom otherwise in designing the algorithms.

All of the previous algorithms [22], [31]–[33] based on these learning principles were for classification. This paper presents an algorithm for function approximation, based on radial basis function (RBF) ideas.

II. RBF NETS

RBF nets belong to the group of kernel function nets that utilize simple kernel functions, distributed in different neighborhoods of the input space, whose responses are essentially local in nature. The architecture consists of one hidden and one output layer. This shallow architecture has a great advantage in terms of computing speed compared to multiple hidden layer nets.

Each hidden node in an RBF net represents one of the kernel functions. An output node simply computes the weighted sum of the hidden node outputs. A kernel function is a local function and the range of its effect is determined by its center and width. Its output is high when the input is close to the center and it decreases rapidly to zero as the input’s distance from the center increases. The Gaussian function is a popular kernel function and will be used in this algorithm. The design and training of an RBF net consists of 1) determining how many kernel functions to use; 2) finding their centers and widths; and 3) finding the weights that connect them to the output node.

Mathematically, the overall response function of such a network with a single output node is given by

f(x) = \sum_{k=1}^{K} w_k R_k(x)   (1)

R_k(x) = \phi( \| x - c_k \| / \sigma_k )   (2)

Here, K is the total number of hidden or RBF nodes, x is the input pattern vector, R_k(x) is the response function of the kth kernel unit (RBF node), \phi is a radially symmetric function whose output is maximum at the center and decreases rapidly to zero as the input’s distance from the center increases, c_k and \sigma_k are the center and width of the kth unit, and w_k is the weight associated with the kth unit. Generally, a Gaussian function with unit normalization is chosen as the kernel function

R_k(x) = \exp( - \| x - c_k \|^2 / (2 \sigma_k^2) )   (3)

and

\| x - c_k \|^2 = \sum_{j=1}^{n} (x_j - c_{kj})^2   (4)

where an input is represented by the n-dimensional vector x = (x_1, ..., x_n) and the training set consists of N vectors of the form (x_i, y_i), where y_i is the predicted value when the input vector is x_i.
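As a minimal illustration of (1)–(3), the response of a small Gaussian RBF net can be sketched in Python (the function and variable names here are ours, not from the paper):

```python
import math

def gaussian_rbf(x, center, width):
    """Unit-normalized Gaussian kernel (3): the output is 1 when the
    input coincides with the center and decays rapidly to zero as the
    input moves away from it."""
    sq_dist = sum((xj - cj) ** 2 for xj, cj in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def rbf_net_output(x, centers, widths, weights):
    """Overall response (1) of a single-output RBF net: a weighted sum
    of the hidden (kernel) node outputs."""
    return sum(w * gaussian_rbf(x, c, s)
               for c, s, w in zip(centers, widths, weights))

# A tiny two-unit net on two-dimensional inputs.
centers = [(0.0, 0.0), (1.0, 1.0)]
widths = [0.5, 0.5]
weights = [1.0, -1.0]
output = rbf_net_output((0.0, 0.0), centers, widths, weights)
```

At the input (0, 0) the first unit responds with its maximum value of 1, while the distant second unit contributes almost nothing, illustrating the local nature of the kernel response.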

RBF nets have been studied extensively in recent years ([2], [3], [5], [7], [8], [12], [20], [21], [23], [27]–[30], [36], [37], [39], [40], and others). It has also been shown that RBF nets have the universal approximation property [10], [25], [26].

III. BASIC IDEAS AND THE ALGORITHM

In forecasting, regression, signal processing, and control theory, a linear or nonlinear function is usually constructed from the data to model a certain phenomenon. The unknown parameters of the function are generally found by some


error minimizing technique. The following LP model can also be used to construct certain best fit functions f(x) for the data [9], [17]:

Minimize   \sum_{i=1}^{N} ( d_i^+ + d_i^- )   (5)

subject to

a_0 + \sum_{k=1}^{m} a_k g_k(x_i) + d_i^- - d_i^+ = y_i,   i = 1, ..., N
d_i^+, d_i^- \ge 0,   i = 1, ..., N
a_k unrestricted in sign,   k = 0, 1, ..., m

where g_k(x) is a known real function of x, m is the total number of such functions, (a_1, ..., a_m) is the vector of coefficients of f(x) = a_0 + \sum_k a_k g_k(x), y_i is the desired output for the ith input vector x_i, d_i^+ and d_i^- are the positive and negative deviation variables, respectively, for the ith input, and a_0 is a constant. The function f(x) in the LP can take a rich variety of forms.

The LP solves for the unknown coefficients (a_0, a_1, and so on) of the function f(x). Note that the LP minimizes the L_1 norm of the error and not the L_2 norm, as is the usual practice.

In the proposed algorithm, the idea of a compact and fixed model of the phenomenon, similar to one of the functions above, is combined with the idea of using RBF’s to handle any unknown nonlinearities in the phenomenon. From a purely RBF point of view, the addition of a fixed model to the net can potentially assist the generalization process and reduce the need for a large number of RBF units. In the computational experiments reported in this paper, a linear function has been used in combination with Gaussian RBF units. Hence, the function f(x) for the LP in (5) is more properly defined, for this algorithm, as

f(x) = a_0 + \sum_{j=1}^{n} a_j x_j + \sum_{k=1}^{K} w_k R_k(x)   (6)

which corresponds to the output of a “mixed” RBF net. When the effect of an RBF unit is small, it can be safely ignored. This idea of ignoring small RBF outputs leads to the definition of a truncated RBF unit as follows:

\tilde{R}_k(x) = R_k(x)   if R_k(x) \ge \epsilon
\tilde{R}_k(x) = 0        otherwise   (7)

where \tilde{R}_k(x) is the truncated RBF function and \epsilon is a small constant. In the computational experiments, \epsilon was set to a small value. Thus, the function f(x) for the LP in (5) is redefined in terms of \tilde{R}_k(x) as follows:

\tilde{f}(x) = a_0 + \sum_{j=1}^{n} a_j x_j + \sum_{k=1}^{K} w_k \tilde{R}_k(x)   (8)

where \tilde{f}(x) now corresponds to the output of a “mixed” RBF net with truncated RBF units. Henceforth, solving the LP in (5) implies solving it using the function \tilde{f}(x) instead of f(x).
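To make the LP concrete, the following sketch builds the cost vector and the equality constraints of (5) for the mixed net with truncated Gaussian units. The construction and names are ours, and the truncation threshold value is illustrative; in practice the resulting arrays would be handed to any standard LP solver (e.g., scipy.optimize.linprog).

```python
import math

EPSILON = 0.1  # illustrative truncation threshold; the paper uses a small constant

def truncated_gaussian(x, center, width, eps=EPSILON):
    """Truncated RBF unit (7): responses below eps are clamped to zero."""
    sq = sum((a - b) ** 2 for a, b in zip(x, center))
    response = math.exp(-sq / (2.0 * width ** 2))
    return response if response >= eps else 0.0

def build_lp(train_x, train_y, centers, widths):
    """Set up the L1-fit LP (5) for the mixed net (8).
    Variable order: a_0, a_1..a_n (linear part), w_1..w_K (Gaussian
    weights), then a (d_i+, d_i-) pair of deviation variables per example.
    Returns (cost vector, equality-constraint rows, right-hand sides)."""
    n, K, N = len(train_x[0]), len(centers), len(train_x)
    # Only the deviation variables appear in the objective.
    cost = [0.0] * (1 + n + K) + [1.0] * (2 * N)
    rows, rhs = [], []
    for i, (x, y) in enumerate(zip(train_x, train_y)):
        row = [1.0] + list(x)  # a_0 + sum_j a_j x_j
        row += [truncated_gaussian(x, c, s) for c, s in zip(centers, widths)]
        devs = [0.0] * (2 * N)
        devs[2 * i] = -1.0     # coefficient of d_i+
        devs[2 * i + 1] = 1.0  # coefficient of d_i-
        rows.append(row + devs)
        rhs.append(y)
    return cost, rows, rhs

X = [(0.0,), (0.5,), (1.0,)]
Y = [0.0, 1.0, 0.0]
cost, rows, rhs = build_lp(X, Y, centers=[(0.5,)], widths=[0.3])
```

Each constraint row encodes the fit equation at one training example, with the Gaussian and linear terms on the left and the deviation pair absorbing the residual; minimizing the summed deviations then gives the L1 best fit.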

A. Generation of Gaussian Units

The network constructed by this algorithm deviates from a typical RBF net. For example, there is truncation of RBF node outputs and, in addition, non-RBF units are used in the hidden layer. Besides that, in a clear departure from the kernel function idea, the basis functions are no longer viewed as purely local units, since that generally results in a very large net. Here, the basis functions are a combination of local and global feature detectors. For that purpose, a variety of overlapping Gaussians (different centers and widths) are created to provide for both global and local feature detection. Though both “fat” (i.e., ones with large widths) and “narrow” Gaussians can be provided, the “fat” ones, which detect global features, are created and explored first to see how well the broad territorial features work. The Gaussians, therefore, are generated incrementally, in stages, starting with the fat ones, and they gradually become narrow local feature detectors in later stages. As new Gaussians are generated at each stage, the LP in (5) is solved using all of the Gaussians generated up to that stage [the Gaussians are used in the function \tilde{f}(x)] and the resulting net is evaluated for error. Whenever the incremental changes in the error rates (both training and validation or test set errors) become small or overfitting occurs on the training set, the algorithm stops and either the current net (if the incremental changes in the error rates become small) or the previous net (if overfitting occurs) is used as the final net. This incremental process of net generation, going from global to local Gaussian units, is consistent with the learning theory requirement that an explicit attempt be made by the algorithm to obtain good generalization.

The Gaussians are generated incrementally, in stages, by random clustering, and several Gaussians can be generated in a stage. Let s = 1, 2, 3, ... denote a stage of this process. A stage is characterized by a parameter \delta that specifies the maximum radius for the hypersphere that includes the random cluster of points that is to define a Gaussian. The parameter \delta essentially controls the nature of the Gaussian generated (fat or narrow). In random clustering, the starting point for a cluster is randomly selected and the cluster is grown around that starting point up to a radius \delta. Therefore, each random cluster includes all points in the \delta-neighborhood of the starting point. Let \delta_s be the neighborhood radius at stage s.

The Gaussians at any stage s are randomly selected in the following way. Randomly select an input vector from


TABLE I(a): DETECTION OF OVERFITTING BY THE SA ALGORITHM

the training set and search for all other training vectors within the \delta_s-neighborhood of the selected vector. The training vectors in the \delta_s-neighborhood are used to define a Gaussian and are then removed from the training set. To define the next Gaussian, another input vector is randomly selected from the remaining training set and its \delta_s-neighborhood is similarly searched for other vectors. This process of randomly picking an input vector from the remaining training set and searching for vectors in its \delta_s-neighborhood to define the next Gaussian is then repeated until the remaining training set is empty.

Let V be the set of training vectors within the \delta_s-neighborhood of the starting vector for a random cluster generated in stage s. The centroid of the set V becomes the center c_k of the kth Gaussian, and the standard deviation of the distances of the points in V from the centroid becomes the width \sigma_k of the Gaussian, where the Gaussian being defined is the kth overall Gaussian. That is, k is the cumulative number of Gaussians generated over all of the past and current stages. When the number of vectors in V is less than a certain minimum, no Gaussian is created; the vectors in the set are, however, removed from the remaining training set. So a Gaussian is not necessarily produced from every random cluster.

Gaussians of various widths can be generated by repeating this whole process for various \delta values. \delta can be varied in many different ways. In the particular variation of the algorithm stated here, \delta is initially set to the standard deviation of the distances of the training vectors from their centroid, and it is reduced at a fixed rate \beta, 0 < \beta < 1. In the computational experiments, \beta values in the range 0.5–0.8 were tried, and it was discovered that the method is fairly insensitive to the \delta-reduction rate as far as the final error rate is concerned, although the size of the net may vary some.
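A single stage of the random clustering procedure described above can be sketched as follows. The names are ours, and the width is computed here as the root-mean-square distance of the cluster points from the centroid, which is one reading of “standard deviation of the distances”:

```python
import math
import random

def random_cluster_stage(points, delta, min_points=2, rng=random):
    """One stage of random clustering: repeatedly pick a random starting
    vector, gather every remaining vector within distance delta of it,
    and turn the cluster into a Gaussian (center = centroid, width =
    spread of the cluster about the centroid).  Clusters with fewer than
    min_points vectors are removed without defining a Gaussian."""
    remaining = list(points)
    gaussians = []
    while remaining:
        start = rng.choice(remaining)
        cluster = [p for p in remaining if math.dist(p, start) <= delta]
        remaining = [p for p in remaining if p not in cluster]
        if len(cluster) < min_points:
            continue  # cluster too small: points removed, no Gaussian
        dim, n = len(start), len(cluster)
        center = tuple(sum(p[j] for p in cluster) / n for j in range(dim))
        width = math.sqrt(sum(math.dist(p, center) ** 2 for p in cluster) / n)
        gaussians.append((center, width))
    return gaussians

# Two well-separated groups of points yield two Gaussians at this radius.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
units = random_cluster_stage(pts, delta=1.0)
```

Running the same function again with a smaller delta would produce narrower Gaussians, which is exactly how the stages progress from global to local feature detectors.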

B. The Algorithm

The following notation is used to describe the algorithm. T_0 denotes the initial training set and T denotes the training set remaining in a stage at any time after removal of random clusters. |V| denotes the number of vectors in the random cluster V, s is the stage counter, and k is the counter for Gaussians generated. c_k and \sigma_k are the center and width of the kth Gaussian unit. TRE_s is the training set error and TSE_s is the validation or testing set error at the sth stage. \delta_{min} is the minimum value for \delta, and P_{min} is the minimum number of points in a random cluster for a Gaussian to be defined. \beta is the \delta-reduction rate, and \sigma_0 is the standard deviation of the distances of the training points from their centroid. \delta_s is the neighborhood radius at stage s. The sequential approximation (SA) algorithm is summarized below.

The SA Algorithm:

1) Initialize counters and constants: s = 0, k = 0, \delta_0 = \sigma_0, \beta = some fraction (say 0.8).

2) Increment stage counter: s = s + 1. Reduce neighborhood radius: \delta_s = \beta \delta_{s-1} if s > 1; otherwise \delta_1 = \delta_0. If \delta_s < \delta_{min}, stop.

3) Select Gaussian units for the sth stage:

a) Set T = T_0.
b) Select an input vector at random from T, the remaining training set.
c) Search for all vectors in T within the \delta_s-neighborhood of the selected vector. Let this set of vectors be V.
d) Remove the set V from T. If |V| < P_{min}, go to f).
e) Increment Gaussian counter: k = k + 1. Compute the center c_k and width \sigma_k of the kth Gaussian unit: c_k = centroid of the set V, and \sigma_k = standard deviation of the distances from the centroid of the points in the random cluster V.
f) If T is not empty, go to b), else go to 4).

4) Solve LP (5) using function \tilde{f}(x) with k as the number of Gaussians.

5) Compute TSE_s and TRE_s.

a) If TSE_s \le TSE_{s-1}, go to 2);
b) If TSE_s > TSE_{s-1} and TRE_s > TRE_{s-1}, go to 2);
c) Otherwise, stop. Overfitting has occurred. Use the net generated in the previous stage.

Other stopping criteria, like a maximum number of Gaussians used or the incremental change in TSE, can also be used. Table I(a) shows how overfitting is detected during training. The two neurocontrol problems are described in the next section. In Problem 4, overfitting occurs at Stage 2, when the
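The stage-transition test in step 5) amounts to a small decision rule, sketched below (the function name and return labels are ours):

```python
def stage_decision(tre_prev, tre_cur, tse_prev, tse_cur):
    """Decide how to proceed after solving the LP for the current stage.
    Training continues while the validation (test) error TSE still
    improves, or while both errors worsen together; it stops, rolling
    back to the previous stage's net, when TSE rises while TRE falls,
    the signature of overfitting the training set."""
    if tse_cur <= tse_prev:
        return "continue"  # rule 5a: validation error did not worsen
    if tre_cur > tre_prev:
        return "continue"  # rule 5b: both errors rose together
    return "overfit"       # rule 5c: use the previous stage's net

# Problem 4 of Table I(a): at Stage 2 the training error decreases while
# the validation error increases, so the Stage 1 net is kept.
decision = stage_decision(tre_prev=0.10, tre_cur=0.08,
                          tse_prev=0.09, tse_cur=0.12)
```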


training set error TRE decreases, but the validation set error TSE increases. So the net generated in the first stage is used, which has 15 Gaussian units. In Problem 2, overfitting occurs in Stage 4. So the net generated in the third stage is used, which has 43 Gaussian units.

Many parts of this algorithm can be parallelized to make it a distributed algorithm. For example, Gaussians of different stages can be generated in parallel and combined appropriately to set up the stagewise LP’s. All such LP’s can then be solved in parallel. In this distributed framework, the algorithm can be seen as first studying the acquired information to obtain some general problem features (e.g., one part trying to extract broad global features, while another part tries to extract local features) and then integrating these features in the LP model to develop the network. In the distributed form, networks of various sizes, with different numbers of Gaussians, will be solved and the smallest net with the best error rate selected for final use. Many parts of the Gaussian generation process itself can be parallelized. Thus, even though the algorithm is presented as a sequential algorithm, it is ultimately meant to be a distributed algorithm implemented in special-purpose parallel hardware.

C. Polynomial Time Convergence of the Algorithm

Polynomial time convergence of the SA algorithm is proved next.

Proposition: The SA algorithm terminates in polynomial time.

Proof: Let S be the number of completed stages of the SA algorithm at termination. The largest number of linear programs is solved when the algorithm does not stop until \delta_s reaches its terminal value of \delta_{min}. In this worst case, with a \delta-reduction rate of \beta, the number of completed stages is given by S = \lceil \log(\delta_{min}/\delta_0) / \log \beta \rceil. At each stage, new Gaussian units are generated. Let P_{min} = 2, so that a minimum of two points is required per Gaussian. Suppose, in a worst case scenario, only two-point Gaussians are generated at each stage and, without any loss of generality, assume the number of training examples N is even. Thus, in this worst case, a total of N/2 Gaussians are produced from the N examples at each stage. At the sth stage, a total of sN/2 Gaussians are accumulated. Assume that these Gaussians, which have been randomly generated, have different centers and widths.

In this worst case, to generate the N/2 two-point Gaussians at each stage,

(N - 1) + (N - 3) + ... + 1 = N^2/4

distances are computed and compared with \delta_s to find the points within the \delta_s-neighborhood. For S stages, SN^2/4 distances are computed and compared, which is a polynomial function of N.

Assuming that only Gaussian units are used to approximate the function, the number of variables in LP (5) is equal to (sN/2 + 2N) at the sth stage. The sN/2 LP variables represent the weights of the Gaussians and the other 2N correspond to the deviation variables. Let n_s = sN/2 + 2N (where n_s \le SN/2 + 2N) be the total number of variables and L_s be the binary encoding length of the input data for the LP in stage s. For the LP in (5), the number of constraints is always equal to N. The binary encoding length of the input data for each constraint is proportional to n_s in stage s when Gaussian truncation is ignored. Hence L_s \le cNn_s, where c is a proportionality constant.

Khachian’s method [14] solves a linear program in O(n^4 L) arithmetic operations, where L is the binary encoding length of the input data and n is the number of variables in the LP. Karmarkar’s method [13] solves a linear program in O(n^{3.5} L) arithmetic operations. More recent algorithms [18], [35] have a complexity of O(n^3 L). Using the more recent results, the worst case total LP solution time for all stages is proportional to

\sum_{s=1}^{S} n_s^3 L_s

which again is a polynomial function of N. Thus, both the Gaussian generation and the LP solution parts of the algorithm have polynomial time complexity.
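Under the worst-case bookkeeping above (n_s = sN/2 + 2N variables at stage s, encoding length L_s \le cNn_s, and an O(n^3 L) LP method), the total LP work can be bounded explicitly; the following chain is our own arithmetic, not taken from the paper:

```latex
\sum_{s=1}^{S} n_s^{3} L_s
  \;\le\; c\,N \sum_{s=1}^{S} n_s^{4}
  \;\le\; c\,N\,S \left( \tfrac{SN}{2} + 2N \right)^{4}
  \;=\; O\!\left( S^{5} N^{5} \right)
```

Since the number of stages S depends only on \delta_0, \delta_{min}, and \beta, and not on N, the bound is polynomial in the number of training examples.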

In practice, LP solution times have been found to grow at a much slower rate than these worst-case bounds suggest.

IV. COMPUTATIONAL RESULTS

This section presents computational results on a variety of problems that have appeared in the literature. All problems were solved on a SUN Sparc2 workstation. Linear programs were solved using R. Marsten’s OB1 interior point code from the Georgia Institute of Technology. OB1 has a number of interior point methods implemented; the dual log barrier penalty method was used.

In these problems, the test sets have been used as validation sets during training; no independent validation sets were used. In general, if the test set is representative of the population, there should not be a problem of overfitting to a particular test set when it is used for validation. Except for the time series problems (Section IV-D), large test sets, representative of the population, were used in all cases.

In general, whenever possible, the principle that more training examples mean better training information was followed. Thus, in generating a training and a test set, as for the logistic map and the four nonlinear dynamical systems problems, a sufficiently large set was generated. Since Gaussian generation is computationally cheap, all of the training set was used to generate Gaussians. Compared to Gaussian generation, linear programming solutions take more time, and the time increases with the number of training examples used. Hence, in cases like the Mackey–Glass equation and the four nonlinear dynamical systems, a smaller part of the training set was used for the LP model. The examples for the LP model were selected randomly from the larger training set. One should always use the whole training set in the LP model whenever the training set is small. This has been done for the other problems.


TABLE I(b): MACKEY–GLASS EQUATION RESULTS WITH 80% DELTA REDUCTION RATE. (800 TRAINING POINTS, 500 TESTING POINTS, MINIMUM FIVE POINTS PER GAUSSIAN, GAUSSIANS GENERATED FROM 3000 POINTS)

TABLE II: MACKEY–GLASS EQUATION RESULTS WITH 65% DELTA REDUCTION RATE. (800 TRAINING POINTS, 500 TESTING POINTS, MINIMUM FIVE POINTS PER GAUSSIAN, GAUSSIANS GENERATED FROM 3000 POINTS)

A. Chaotic Time Series Prediction

The chaotic time series generated by the Mackey–Glass delay-difference equation

x(t+1) = x(t) - b x(t) + a x(t - \tau) / (1 + x(t - \tau)^{10})

has been used by many as a test problem [15], [21], [27], who have used the particular series generated by a standard setting of the parameters a, b, and \tau. The actual data set for this problem was obtained from J. Platt. The networks are trained to predict a future value of the series from four past samples of the series. The testing set consists of 500 points of the time series beyond the training data. Normalized error is used to measure prediction accuracy and is the root mean square (rms) error divided by the standard deviation of the output of the time series. Six parameters define an RBF unit for this problem: the center is determined by four values, and the width and weight are determined by two values. The size of a net is measured by the number of parameters required to define it. For example, a 200 node RBF net for this problem will require 1200 parameters to define it. From the various graphs in [27], it appears that the RAN algorithm can achieve a normalized rms (nrms) error of about 0.04 (4% error) with a net of approximately 1000 parameters with off-line training using 1000 training points. It can obtain about 0.028 (2.8%) nrms error with 3000 training

points and a slightly bigger net. In on-line learning, RAN’sbest performance is 0.055 (5.5%) normalized error using 189RBF units or 1134 parameters. Reference [27] also reportsthat a backpropagation net with 541 adjustable parameters(weights and thresholds) (it is the same three-layer perceptronstudied by [15], with 20 hidden nodes in each layer) tookapproximately 30–60 min on a CRAY X-MP to achieve acomparable normalized error of about 0.05 (5%), whereas hisRAN algorithm took only about 8 min of SUN-4 CPU time.Reference [21] reports achieving a normalized error of 0.061(6.1%) with 1000 RBF units (or 6000 parameters) that took1858 s to train on a SUN 3/60 machine. Their graphs showthat the error is more than 0.1 (10%) when RBF nets haveless than 500 RBF units. They, however, can obtain an errorrate as low as 2%, but with a very large net (5012 RBF unitsor 30 072 parameters).
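For reference, the benchmark series and the normalized error measure described above can be sketched as follows. The parameter values a = 0.2, b = 0.1, τ = 17, the initial condition, and the series length are assumptions matching the commonly used benchmark setup, not values taken from the tables.

```python
# Sketch: generate a Mackey-Glass series with the delay-difference
# recursion x(t+1) = (1-b)x(t) + a*x(t-tau)/(1 + x(t-tau)**10).
# Parameters a=0.2, b=0.1, tau=17 are assumed benchmark values.

def mackey_glass(n, a=0.2, b=0.1, tau=17, x0=1.2):
    x = [x0] * (tau + 1)          # constant history before t = 0 (assumed)
    for t in range(tau, tau + n):
        x.append((1 - b) * x[t] + a * x[t - tau] / (1 + x[t - tau] ** 10))
    return x[tau + 1:]            # the n generated points

def nrms(predicted, actual):
    """Normalized error: rms error divided by the standard deviation
    of the actual series, as defined in the text."""
    n = len(actual)
    rms = (sum((p - y) ** 2 for p, y in zip(predicted, actual)) / n) ** 0.5
    mean = sum(actual) / n
    std = (sum((y - mean) ** 2 for y in actual) / n) ** 0.5
    return rms / std

series = mackey_glass(1300)       # e.g., 800 training + 500 testing points
```

Note that predicting the series mean everywhere gives an nrms error of exactly 1, which is why nrms values well below 1 indicate genuine predictive power.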

Tables I(b)–III show the performance of the SA algorithm on this problem for various rates of reduction of δ. It can be seen that the algorithm is not very sensitive to the δ-reduction rate, in the sense that all of the rates produce nets of reasonable size with error rates of 5% or below. Hence, fine tuning of the reduction rate is not a necessity. The tables also show that error rates of 3–5% can easily be obtained with nets ranging in size from 200 to 300 Gaussian units. It can also be seen that generating the Gaussian units takes very little time compared to the LP solution times. Some of the large LP's were solved on an IBM RISC machine, where it takes only a


TABLE III
MACKEY–GLASS EQUATION RESULTS WITH 50% DELTA REDUCTION RATE (800 TRAINING POINTS, 500 TESTING POINTS, MINIMUM FIVE POINTS PER GAUSSIAN, GAUSSIANS GENERATED FROM 3000 POINTS)

TABLE IV
LOGISTIC MAP RESULTS (500 TESTING POINTS, MINIMUM FIVE POINTS PER GAUSSIAN, GAUSSIANS GENERATED FROM 100 POINTS)

few minutes to solve, compared to the many hours on a SUN Sparc2 workstation.

B. The Logistic Map—Another Chaotic Time Series

The logistic map, x(t + 1) = 4x(t)(1 − x(t)), is another function often used as a test problem [15], [21]. It is a simple quadratic approximation problem, and [21] reports approximating it with RBF nets of four, five, and six Gaussian units and with a comparable backpropagation net of five sigmoidal hidden units and a linear output unit. They used a training set of 1000 points and a testing set of 500 points. Their lowest test error rate of 0.27% was obtained with six Gaussian units that took 4945 s to train on a SUN 3/50 machine. By comparison, the backpropagation net had an error rate of 0.59% and took 3802 s to train. Reference [15] trained an identical backpropagation net on a CRAY X-MP to obtain higher accuracy and faster convergence. They obtained solutions in the order of minutes and an error rate as low as 0.014%.

Table IV shows the performance of the SA algorithm on this problem for various rates of reduction of δ and various training set sizes. It can be seen again that the algorithm is not very sensitive to the δ-reduction rate. It also shows that an error rate of less than 0.5% can easily be obtained with as little as 100 training examples and six Gaussian units. Note that the training time is only a few seconds, and an error rate as low as 0.111% was achieved. Only two Gaussian passes were made in each case, and only pass-two error rates and LP solution times are reported.
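The training data for this benchmark can be sketched as follows; the fully chaotic coefficient of 4 in the map is an assumption matching the form commonly used in the cited benchmarks.

```python
# Sketch: generate one-step-ahead training data for the logistic map
# x(t+1) = 4 x(t) (1 - x(t)); the coefficient 4 is an assumed value.

def logistic_series(n, x0=0.2):
    xs = [x0]
    for _ in range(n - 1):
        xs.append(4.0 * xs[-1] * (1.0 - xs[-1]))
    return xs

def make_pairs(series):
    """One-step-ahead (input, target) pairs for function approximation."""
    return [(series[t], series[t + 1]) for t in range(len(series) - 1)]

pairs = make_pairs(logistic_series(101))   # e.g., 100 training examples
```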

C. Identification of Nonlinear Dynamical Systems

Narendra and Parthasarathy [24] have demonstrated that neural networks can be used effectively for both identification and control of nonlinear dynamical systems or plants. They presented four nonlinear difference equation models of discrete-time single-input single-output (SISO) plants that can be generalized to the multivariable case. These models have the property that they are continuous, time-invariant, bounded, and causal in nature. They demonstrated the use of neural networks in identification on several example plants from these four basic model categories. One example from each model category has been solved here.

The most general of the four models is defined by the nonlinear difference equation

y_p(k + 1) = f[y_p(k), …, y_p(k − n + 1); u(k), …, u(k − m + 1)]   (9)

where [u(k), y_p(k)] represents the input–output pair of the SISO plant at time k and m ≤ n.


TABLE V
PLANT 1 IDENTIFICATION RESULTS (500 TRAINING POINTS, 500 TESTING POINTS, MINIMUM FIVE POINTS PER GAUSSIAN, GAUSSIANS GENERATED FROM 5000 POINTS)

TABLE VI
PLANT 2 IDENTIFICATION RESULTS (500 TRAINING POINTS, 500 TESTING POINTS, MINIMUM FIVE POINTS PER GAUSSIAN, GAUSSIANS GENERATED FROM 5000 POINTS)

Example 1: The plant to be identified is defined by the nonlinear difference equation

y_p(k + 1) = 0.3y_p(k) + 0.6y_p(k − 1) + f[u(k)]

where f is the unknown function and has the form

f(u) = 0.6 sin(πu) + 0.3 sin(3πu) + 0.1 sin(5πu).   (10)

Reference [24] used a three-layer perceptron with one input and one output node, 20 nodes in the first hidden layer, and ten in the second. They trained it using backpropagation for 50 000 time steps with random input in the range [−1, 1]. Table V shows the performance of the SA algorithm on this problem for a δ-reduction rate of 80%. (Other δ-reduction rates provide similar results.) Here, 500 randomly generated input values in the range [−1, 1] were used for training and another 500 used for testing the RBF net that approximated the function f. The table shows that less than 0.1% error (0.001 nrms) was obtained with only 23 Gaussians, which compares well with the 260-parameter backpropagation net of [24].
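The training setup for this example can be sketched as follows. The form of f and of the plant recursion used here is reconstructed from the cited reference [24] and is an assumption, as is the random seed.

```python
import math
import random

# Sketch: training data for Example 1, assuming the plant takes the
# form y_p(k+1) = 0.3 y_p(k) + 0.6 y_p(k-1) + f(u(k)), with
# f(u) = 0.6 sin(pi u) + 0.3 sin(3 pi u) + 0.1 sin(5 pi u).
# This form is reconstructed from [24], not stated in the tables above.

def f(u):
    return (0.6 * math.sin(math.pi * u)
            + 0.3 * math.sin(3 * math.pi * u)
            + 0.1 * math.sin(5 * math.pi * u))

def plant1_step(y, y_prev, u):
    """One step of the assumed plant recursion."""
    return 0.3 * y + 0.6 * y_prev + f(u)

random.seed(0)
# 500 random inputs in [-1, 1] paired with f(u), which is the function
# the RBF net is trained to approximate in this example.
train = [(u, f(u)) for u in (random.uniform(-1, 1) for _ in range(500))]
```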

Example 2: The plant to be identified is defined by the second-order difference equation

y_p(k + 1) = f[y_p(k), y_p(k − 1)] + u(k)

where

f[y_p(k), y_p(k − 1)] = y_p(k)y_p(k − 1)[y_p(k) + 2.5]/[1 + y_p²(k) + y_p²(k − 1)].   (11)

Reference [24] used the same net as in Example 1, except that it now had two inputs, y_p(k) and y_p(k − 1), and trained it using backpropagation for 100 000 time steps with random input in the range [−2, 2]. Table VI shows the performance of the SA algorithm on this problem for a δ-reduction rate of 80%. (Other δ-reduction rates provide similar results.) Here again, 500 randomly generated input values were used for training and another 500 used for testing the RBF net that approximated the function f[y_p(k), y_p(k − 1)]. The table shows that less than 7% error (0.07 nrms) was obtained with only 43 Gaussians, compared to 280 parameters in the backpropagation net of [24]. With 1000 training points and 116 Gaussians, the error rate goes down to 3.153%.

Example 3: The plant to be identified is defined by the nonlinear difference equation

y_p(k + 1) = y_p(k)/[1 + y_p²(k)] + u³(k).   (12)

Reference [24] trained two different nets, N_f and N_g, each identical in structure to the net of Example 1, to identify the two subsystems of the plant

f[y_p(k)] = y_p(k)/[1 + y_p²(k)]

and

g[u(k)] = u³(k).

The nets were trained using backpropagation for 100 000 time steps with random input in the range [−2, 2].

Tables VII and VIII show the performance of the SA algorithm on this problem for a δ-reduction rate of 80%. (Other δ-reduction rates provide similar results.) As before, 500 randomly generated input values in the interval [−2, 2] were used for training and another 500 used for testing. The tables show that very low error rates were obtained with only 38 (subsystem N_f) + 19 (subsystem N_g) = 57 total Gaussians


TABLE VII
PLANT 3 (SUBSYSTEM N_f) IDENTIFICATION RESULTS (500 TRAINING POINTS, 500 TESTING POINTS, MINIMUM FIVE POINTS PER GAUSSIAN, GAUSSIANS GENERATED FROM 5000 POINTS)

TABLE VIII
PLANT 3 (SUBSYSTEM N_g) IDENTIFICATION RESULTS (500 TRAINING POINTS, 500 TESTING POINTS, MINIMUM FIVE POINTS PER GAUSSIAN, GAUSSIANS GENERATED FROM 5000 POINTS)

TABLE IX
PLANT 4 IDENTIFICATION RESULTS WITH 100 TRAINING POINTS (500 TESTING POINTS, MINIMUM FIVE POINTS PER GAUSSIAN, GAUSSIANS GENERATED FROM 5000 POINTS)

in all. This compares well with the number of parameters used by the two backpropagation nets of [24].

Example 4: The plant to be identified is defined by the nonlinear difference equation

y_p(k + 1) = f[y_p(k), y_p(k − 1), y_p(k − 2), u(k), u(k − 1)]   (13)

where the unknown function f has the form

f[x₁, x₂, x₃, x₄, x₅] = [x₁x₂x₃x₅(x₃ − 1) + x₄]/[1 + x₂² + x₃²].

This is an example of the most general form of a plant, as shown in [24]. Reference [24] again used the same net as in Example 1, except that it now had five inputs instead of one. The net was trained using backpropagation for 100 000 time steps with random input in the interval [−1, 1].

Tables IX–XI show the performance of the SA algorithm on this problem for different δ-reduction rates and different training set sizes. In all cases, randomly generated input values in the range [−1, 1] were used to create training examples, and another 500 random examples were used for testing. The tables show that an error rate of 3.5% or less can easily be achieved with only 15 Gaussians, which is much less than the 340 parameters used in the backpropagation net of [24]. An error rate as low as 1.85% was obtained in this problem.

D. Time Series Forecasting

Reference [34] compared the performance of neural networks trained by backpropagation with that of the widely used Box–Jenkins method [4] as models for time series forecasting. They used 75 time series of various nature from the


TABLE X
PLANT 4 IDENTIFICATION RESULTS WITH 250 TRAINING POINTS (500 TESTING POINTS, MINIMUM FIVE POINTS PER GAUSSIAN, GAUSSIANS GENERATED FROM 5000 POINTS)

TABLE XI
PLANT 4 IDENTIFICATION RESULTS WITH 500 TRAINING POINTS (500 TESTING POINTS, MINIMUM FIVE POINTS PER GAUSSIAN, GAUSSIANS GENERATED FROM 5000 POINTS)

well-known M-Competition series [16]. Their results show that the Box–Jenkins method can outperform the neural-network model in 36 of the 75 cases. Ten of the 75 series, on which neural networks performed badly compared to the Box–Jenkins method, were used for the test here. The error is measured by mean absolute percentage error (MAPE). The ten series include both monthly and quarterly data. Only one-period-ahead prediction is required of these models. As per convention, the last eight values of each quarterly series and the last 18 values of each monthly series were retained for testing. Each series was first normalized to be in the interval [0, 1] by dividing by its maximum value. The first difference of the series was then taken to remove any trend. This normalization and trend removal is different from that of [34]. For forecasting, the past year's data is used as input to the model (four inputs for a quarterly series, 12 for a monthly series). Reference [34] used two-layer feedforward nets whose number of hidden units equaled the number of inputs, and trained them with the backpropagation algorithm for 500 epochs.
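The preprocessing, windowing, and error measure described above can be sketched as follows; the function names are illustrative, not taken from the original software.

```python
# Sketch of the preprocessing described in the text: normalize the
# series by its maximum, take first differences to remove trend, and
# use the past year's values as inputs (four for quarterly data).

def preprocess(series):
    m = max(series)
    scaled = [x / m for x in series]                     # into [0, 1]
    return [scaled[t] - scaled[t - 1] for t in range(1, len(scaled))]

def window(diffed, n_inputs):
    """(inputs, target) pairs: past year's differenced values -> next value."""
    return [(diffed[t - n_inputs:t], diffed[t])
            for t in range(n_inputs, len(diffed))]

def mape(forecasts, actuals):
    """Mean absolute percentage error, the measure used in Table XII."""
    return 100.0 * sum(abs((a - f_) / a)
                       for f_, a in zip(forecasts, actuals)) / len(actuals)

# A quarterly toy series: four past differenced values per example.
pairs = window(preprocess([10, 12, 11, 13, 14, 13, 15, 16]), 4)
```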

Table XII compares the performance of the SA algorithm with that of the other two methods as reported by [34]. In general, the SA algorithm does significantly better than Sharda and Patil's backpropagation nets in terms of prediction accuracy (MAPE) on nine of the ten cases, and it does better than the Box–Jenkins method ([1], a Box–Jenkins automatic forecasting expert system, was used by Sharda and Patil) on seven of the ten cases. The SA algorithm took less than a minute to train in almost all of the cases. These test results show that the method does well even when the data is sparse. Use of zero Gaussians is equivalent to a linear model being used.

V. ADAPTIVE LEARNING WITH MEMORY

In neural networks, on-line adaptive algorithms are not supposed to store any training data. The algorithms can observe and learn whatever they can from an example, but must forget that example completely thereafter. This process of learning is obviously very memory efficient. However, as discussed in [33], this process of learning is slow because it does not allow learning by comparison. Humans learn rapidly when allowed to compare the objects and concepts to be learned; comparison provides useful extra information to the learner. Remembering relevant facts and examples is very much a part of the human learning process, since it facilitates mental comparison of facts and information for rapid learning. And in order to remember, compare, and learn, humans need memory. Reference [33] has shown that on-line adaptive learning without using memory is an inefficient


TABLE XII
TIME SERIES FORECASTING RESULTS

TABLE XIII
ON-LINE ADAPTATION OF THE MACKEY–GLASS EQUATION

form of learning because it requires the observation of many more examples for the same level of error-free learning than methods that use memory.

An on-line adaptive learning method for function approximation that uses memory is proposed next. It is based on the SA algorithm; an example follows the algorithm. The algorithm is very similar to the one for classification in [33]. The basic ideas are as follows. Suppose that some memory is available to the algorithm to store examples. It uses part of the memory to store some testing examples and the remaining part to store training examples. Assume it first collects and stores some test examples. Training examples are then collected and stored incrementally, and the SA algorithm is used on the available training set at each stage to generate (regenerate) the RBF net. Once the training and testing set errors converge and stabilize, on-line training is complete. During the operational phase, the system continues to monitor its error rate by collecting and testing batches of incoming examples. If the test error is found to have increased beyond a certain level, it proceeds to retrain itself in the manner described above.

The following notation is used to describe the algorithm. M denotes the maximum number of examples that can be stored by the algorithm; M1 corresponds to the number of testing examples stored and M2 to the number of training examples stored, where M1 + M2 ≤ M. ΔM is the incremental addition to the training set. TSE_k and TRE_k correspond to the testing and training set errors, respectively, after the kth incremental addition to the training set. TSE* denotes the test set error after completion of training and TSE_N the test error on a new batch of on-line examples. ε1 is the tolerance for the difference between TSE_N and TSE*, and ε2 the tolerance for the error rate difference during incremental learning or adaptation. The adaptive algorithm is summarized below.

On-Line Adaptive Learning with Fixed Memory:

1) Collect M1 examples on-line for testing.

2) Initialize counters and constants: k = 0, M2 = 0.

3) Increment the collection counter: k = k + 1.

4) Collect ΔM (additional) examples for training; add them to the training set and set M2 = M2 + ΔM. If M1 + M2 ≥ M, go to 7).

5) Regenerate the RBF net with the SA algorithm using the M2 training examples and M1 testing examples.

6) Compute TSE_k and TRE_k. If k = 1, go to 3). If |TSE_k − TSE_{k−1}| ≤ ε2 and |TRE_k − TRE_{k−1}| ≤ ε2, go to 7); else go to 3).

7) Current adaptation is complete. Set TSE* = TSE_k. Test the system continuously for any significant change in the error rate.

8) Collect new examples on-line for testing; test and compute TSE_N.

9) If |TSE_N − TSE*| ≤ ε1, go to 8). Otherwise, it is time to retrain; go to 2).
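The incremental collect-and-retrain loop above (steps 1 through 7) can be sketched as follows. Here `collect`, `train_sa`, and `error` are placeholder hooks standing in for on-line data collection, the SA algorithm, and the error measure; none of these names come from the paper, and a single tolerance stands in for the pair of tolerances in the text.

```python
# Sketch of the adaptive procedure: collect data incrementally and
# retrain until the test and training errors stabilize or memory
# fills up. `train_sa` is assumed to return a predictor function.

def adapt(collect, train_sa, error, n_test, batch, max_store, tol):
    """Return (net, test_error) once adaptation is complete."""
    test_set = [collect() for _ in range(n_test)]     # step 1
    train_set, prev = [], None                        # step 2
    while True:                                       # steps 3-6
        train_set += [collect() for _ in range(batch)]
        net = train_sa(train_set)                     # regenerate the net
        tse, tre = error(net, test_set), error(net, train_set)
        full = len(train_set) + n_test >= max_store   # memory limit reached
        converged = (prev is not None
                     and abs(tse - prev[0]) <= tol
                     and abs(tre - prev[1]) <= tol)
        if full or converged:
            return net, tse        # step 7: monitor on-line from here on
        prev = (tse, tre)
```

Steps 8 and 9, the operational-phase monitoring, would simply compare the error on each new batch against the returned test error and call `adapt` again when the tolerance is exceeded.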


Table XIII shows how the Mackey–Glass equation is adapted on-line. RBF nets were generated with the SA algorithm at increments of 200 examples; the table shows results from 600 examples onwards. The tolerance was set to 0.015 (1.5%). With this tolerance, adaptation is complete within 1000 examples. If the tolerance is further reduced, adaptation would take longer (i.e., it would need more examples). Unlike the previous Mackey–Glass results (Section IV-A), where Gaussians were generated from a larger set of points, the Gaussians here are generated from the actual training set available. A δ-reduction rate of 80% was used in these cases.

VI. CONCLUSIONS

Through the learning principles, this paper has tried to define the "nature" of algorithms that should be developed in the neural-network field. The theory defines very broadly what neural-network algorithms should do. However, within this broad framework, many different algorithms can be developed. The algorithms that the authors have developed demonstrate that this framework is workable and can indeed produce very robust and reliable algorithms. Increased competition within this framework can be very beneficial to the field and produce very powerful methods. The algorithms presented by the authors are only a beginning in that direction.

Most of the issues addressed by the learning theory have been long-standing issues for the neural-network field. They include issues like polynomial time complexity [19], network design, generalization, local minima, and catastrophic forgetting. The algorithms presented by the authors demonstrate that all these problems can be dealt with using very simple ideas.

REFERENCES

[1] Automatic Forecasting Systems, AUTOBOX Software User Manual, Hatboro, PA, 1988.

[2] P. Baldi, "Computing with arrays of bell-shaped and sigmoid functions," in Proc. IEEE Neural Inform. Processing Syst., vol. 3, 1990, pp. 728–734.

[3] S. M. Botros and C. G. Atkeson, "Generalization properties of radial basis functions," in Advances in Neural Information Processing Systems, vol. 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, Eds. San Mateo, CA: Morgan Kaufmann, 1991, pp. 707–713.

[4] G. E. P. Box and G. M. Jenkins, Time Series Analysis: Forecasting and Control. San Francisco, CA: Holden-Day, 1976.

[5] D. Broomhead and D. Lowe, "Multivariable functional interpolation and adaptive networks," Complex Syst., vol. 2, pp. 321–355, 1988.

[6] A. L. Blum and R. L. Rivest, "Training a 3-node neural network is NP-complete," Neural Networks, vol. 5, no. 1, pp. 117–127, 1992.

[7] S. Chen, C. F. N. Cowan, and P. M. Grant, "Orthogonal least squares learning algorithm for radial basis function networks," IEEE Trans. Neural Networks, vol. 2, pp. 302–309, 1991.

[8] F. Girosi and T. Poggio, "Networks and the best approximation property," Biol. Cybern., vol. 63, pp. 169–176, 1990.

[9] F. Glover, "Improved linear programming models for discriminant analysis," Decision Sci., vol. 21, no. 4, pp. 771–785, 1990.

[10] E. J. Hartman, J. D. Keeler, and J. M. Kowalski, "Layered neural networks with Gaussian hidden units as universal approximations," Neural Computa., vol. 2, pp. 210–215, 1990.

[11] J. S. Judd, Neural Network Design and the Complexity of Learning. Cambridge, MA: MIT Press, 1990.

[12] V. Kadirkamanathan, M. Niranjan, and F. Fallside, "Sequential adaptation of radial basis function neural networks and its application to time-series prediction," in Advances in Neural Information Processing Systems, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, Eds. San Mateo, CA: Morgan Kaufmann, 1991, vol. 3, pp. 721–727.

[13] N. Karmarkar, "A new polynomial time algorithm for linear programming," Combinatorica, vol. 4, pp. 373–395, 1984.

[14] L. G. Khachian, "A polynomial algorithm in linear programming," Dokl. Akad. Nauk SSSR, vol. 244, no. 5, pp. 1093–1096, 1979.

[15] A. Lapedes and R. Farber, "Nonlinear signal processing using neural networks: Prediction and system modeling," Los Alamos Nat. Lab. Tech. Rep. LA-UR-87-2262, 1987.

[16] S. Makridakis et al., "The accuracy of extrapolation (time series) methods: Results of a forecasting competition," J. Forecasting, vol. 1, pp. 111–153, 1982.

[17] O. L. Mangasarian, R. Setiono, and W. H. Wolberg, "Pattern recognition via linear programming: Theory and application to medical diagnosis," in Proc. Wkshp. Large-Scale Numerical Optimization, Cornell Univ., Ithaca, NY, Oct. 19–20, 1989, T. F. Coleman and Y. Li, Eds. Philadelphia, PA: SIAM, 1990, pp. 22–31.

[18] R. C. Monteiro and I. Adler, "Interior path following primal–dual algorithms. Part I: Linear programming," Math. Programming, vol. 44, pp. 27–41, 1989.

[19] M. Minsky and S. Papert, Perceptrons. Cambridge, MA: MIT Press, 1988.

[20] J. Moody and C. Darken, "Learning with localized receptive fields," in Proc. 1988 Connectionist Models Summer School, D. S. Touretzky, G. E. Hinton, and T. Sejnowski, Eds. San Mateo, CA: Morgan Kaufmann, 1988, pp. 133–143.

[21] ——, "Fast learning in networks of locally-tuned processing units," Neural Computa., vol. 1, no. 2, pp. 281–294, 1989.

[22] S. Mukhopadhyay, A. Roy, L. S. Kim, and S. Govil, "A polynomial time algorithm for generating neural networks for pattern classification—Its stability properties and some test results," Neural Computa., vol. 5, no. 2, pp. 225–238, 1993.

[23] M. T. Musavi, W. Ahmed, K. H. Chan, K. B. Faris, and D. M. Hummels, "On the training of radial basis function classifiers," Neural Networks, vol. 5, no. 4, pp. 595–603, 1992.

[24] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Trans. Neural Networks, vol. 1, pp. 4–27, 1990.

[25] J. Park and I. W. Sandberg, "Universal approximation using radial-basis-function networks," Neural Computa., vol. 3, pp. 246–257, 1991.

[26] ——, "Universal approximation using radial-basis-function networks," Neural Computa., vol. 5, pp. 305–316, 1993.

[27] J. Platt, "A resource-allocating network for function interpolation," Neural Computa., vol. 3, no. 2, pp. 213–225, 1991.

[28] T. Poggio and F. Girosi, "Regularization algorithms for learning that are equivalent to multilayer networks," Science, vol. 247, pp. 978–982, 1990.

[29] M. J. D. Powell, "Radial basis functions for multivariable interpolation: A review," in Algorithms for Approximation, J. C. Mason and M. G. Cox, Eds. Oxford, U.K.: Clarendon, 1987, pp. 143–167.

[30] S. Renals and R. Rohwer, "Phoneme classification experiments using radial basis functions," in Proc. Int. Joint Conf. Neural Networks, vol. I, 1989, pp. 461–467.

[31] A. Roy and S. Mukhopadhyay, "Pattern classification using linear programming," ORSA J. Computing, vol. 3, no. 1, pp. 66–80, Winter 1991.

[32] A. Roy, L. S. Kim, and S. Mukhopadhyay, "A polynomial time algorithm for the construction and training of a class of multilayer perceptrons," Neural Networks, vol. 6, no. 4, pp. 535–545, 1993.

[33] A. Roy, S. Govil, and R. Miranda, "An algorithm to generate radial basis function (RBF)-like nets for classification problems," Neural Networks, vol. 8, no. 2, pp. 179–201, 1995.

[34] R. Sharda and R. Patil, "Neural networks as forecasting experts: An empirical test," in Proc. Int. Joint Conf. Neural Networks, vol. II, 1990, pp. 491–494.

[35] M. Todd and Y. Ye, "A centered projective algorithm for linear programming," Math. Operations Res., vol. 15, no. 3, pp. 508–529, 1990.

[36] G. Vrckovnik, C. R. Carter, and S. Haykin, "Radial basis function classification of impulse radar waveforms," in Proc. Int. Joint Conf. Neural Networks, 1990, vol. I, pp. 45–50.

[37] N. Weymaere and J. Martens, "A fast robust learning algorithm for feedforward neural networks," Neural Networks, vol. 4, pp. 361–369, 1991.

[38] B. Widrow and M. Hoff, "Adaptive switching circuits," in 1960 IRE WESCON Convention Rec., New York, 1960, pp. 96–104.

[39] L. Xu, S. Klasa, and A. L. Yuille, "Recent advances on techniques of static feedforward networks with supervised learning," Int. J. Neural Syst., vol. 3, no. 3, pp. 253–290, 1992.

[40] L. Xu, A. Krzyzak, and E. Oja, "Rival penalized competitive learning for clustering analysis, RBF net, and curve detection," IEEE Trans. Neural Networks, vol. 4, pp. 636–649, 1993.


Asim Roy (S’78–M’82–SM’89) received the B.E. degree in mechanical engineering from Calcutta University, India, the M.S. degree in operations research from Case Western Reserve University, Cleveland, OH, and the Ph.D. degree in operations research from the University of Texas at Austin. He also studied industrial engineering at Rutgers University, New Brunswick, NJ.

He has been a Visiting Scholar at Stanford University and a Visiting Scientist at the Robotics and Intelligent Systems Group at Oak Ridge National Laboratory. He is currently an Associate Professor of Computer Information Systems. His research interests include neural networks, pattern recognition, prediction and forecasting, computational learning theory, and nonlinear multiple objective optimization. He designed and developed the software system IFPS/OPTIMUM, which pioneered the idea of incorporating optimization tools in financial and planning languages for managerial use. Following in its footsteps, such optimization systems are now widely available with spreadsheet systems such as Excel and Lotus 1-2-3. He has recently published a new theory for brain-like learning/computing that challenges the classical connectionist learning ideas.

He has served on the organizing committees of many scientific conferences. He was the Program Chair for the ORSA/TIMS (Operations Research Society of America/The Institute of Management Sciences) National Meeting in Las Vegas and the General Chair of the ORSA/TIMS National Meeting in Phoenix. He is a member of INNS (International Neural Network Society) and INFORMS (Institute for Operations Research and the Management Sciences). He is an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS and is on the editorial boards of other journals.

Sandeep Govil received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Kanpur, the M.B.A. degree from Northeast Louisiana University, Monroe, and the Ph.D. degree in management science from Arizona State University, Tempe.

He is currently a Senior Consultant with American Airlines Decision Technologies. His research interests include the application of neural networks to business problems.

Raymond Miranda received the B.Tech. degree in electrical engineering from the Birla Institute of Technology, India, and the M.S. degree in decision and information systems from Arizona State University, Tempe.

He is currently a Decision Support Specialist with OptimalCare, Inc., a company that deals with occupational health information systems. His research interests include the application of neural networks to medical diagnosis, pattern classification, and trend forecasting.