International Journal of Systems Science, 1998, volume 29, number 3, pages 233-247
GA-based reinforcement learning for neural networks
CHIN-TENG LIN†, CHONG-PING JOU† and CHENG-JIANG LIN†
A genetic reinforcement neural network (GRNN) is proposed to solve various reinforcement learning problems. The proposed GRNN is constructed by integrating two feedforward multilayer networks. One neural network acts as an action network for determining the outputs (actions) of the GRNN, and the other as a critic network to help the learning of the action network. Using the temporal difference prediction method, the critic network can predict the external reinforcement signal and provide a more informative internal reinforcement signal to the action network. The action network uses the genetic algorithm (GA) to adapt itself according to the internal reinforcement signal. The key concept of the proposed GRNN learning scheme is to formulate the internal reinforcement signal as the fitness function for the GA. This learning scheme forms a novel hybrid GA, which consists of the temporal difference and gradient descent methods for the critic network learning, and the GA for the action network learning. By using the internal reinforcement signal as the fitness function, the GA can evaluate the candidate solutions (chromosomes) regularly, even during the period without external reinforcement feedback from the environment. Hence, the GA can proceed to new generations regularly without waiting for the arrival of the external reinforcement signal. This can usually accelerate the GA learning because a reinforcement signal may only be available at a time long after a sequence of actions has occurred in the reinforcement learning problems. Computer simulations have been conducted to illustrate the performance and applicability of the proposed learning scheme.
1. Introduction
In general, neural learning methods can be distinguished into three classes: supervised learning, reinforcement learning and unsupervised learning. In supervised learning a teacher provides the desired objective at each time step to the learning system. In reinforcement learning the teacher's response is not as direct, immediate or informative as that in supervised learning and serves more to evaluate the state of the system. The presence of a teacher or a supervisor to provide the correct response is not assumed in unsupervised learning, which is called learning by observation. Unsupervised learning does not require any feedback, but the disadvantage is that the learner cannot receive any external guidance, and this is inefficient, especially for applications in control and decision-making. If supervised learning is used in control (e.g. when the input-output training data are available), it has been shown that it is more efficient than reinforcement learning (Barto and Jordan 1987). However, many control problems require selecting control actions whose consequences emerge over uncertain periods for which input-output training data are not readily available. In such a case, a reinforcement learning system can be used to learn the unknown desired outputs by providing the system with a suitable evaluation of its performance, so reinforcement learning techniques are more appropriate than supervised learning. Hence, in this paper we are interested in reinforcement learning for neural networks.

Received 5 July 1997. Revised 19 August 1997. Accepted 29 August 1997.

† Department of Electrical and Control Engineering, National Chiao-Tung University, Hsinchu, Taiwan, Republic of China.
For the reinforcement learning problems, Barto et al. (1983) used neuron-like adaptive elements to solve difficult learning control problems with only reinforcement signal feedback. The idea of their architecture, called the actor-critic architecture, and their adaptive heuristic critic (AHC) algorithm were fully developed by Sutton (1984). The AHC algorithms rely on both a learned
critic function and a learned action function. Adaptive critic algorithms are designed for reinforcement learning with delayed rewards. The AHC algorithm uses the temporal difference method to train a critic network that learns to predict failure. The prediction is then used to heuristically generate plausible target outputs at each time step, thereby allowing the use of backpropagation in a separate neural network that maps state variables to output actions. They also proposed the associative reward-penalty (A_R-P) algorithm for adaptive elements called A_R-P elements (Barto and Anandan 1985). Several generalizations of the A_R-P algorithm have been proposed (Barto and Jordan 1987). Williams formulated the reinforcement learning problem as a gradient-following procedure (Williams 1987), and he identified a class of algorithms, called REINFORCE algorithms, that possess the gradient ascent property. Anderson (1987) developed a numerical connectionist learning system by replacing the neuron-like adaptive elements in the actor-critic architecture with multilayer networks. With multilayer networks, the learning system can enhance its representation by learning new features that are required by or that facilitate the search for the task's solution. In this paper we shall develop a reinforcement learning system also based on the actor-critic architecture. However, because the use of the gradient descent (ascent) method in the above approaches usually suffers from the local minima problem in network learning, we shall combine the genetic algorithm (GA) with the AHC algorithm into a novel hybrid GA for reinforcement learning to make use of the global optimization capability of GAs.
GAs are general-purpose optimization algorithms with a probabilistic component that provide a means to search poorly understood, irregular spaces. Holland (1975) developed GAs to simulate some of the processes observed in natural evolution. Evolution is a process that operates on chromosomes (organic devices that encode the structure of living beings). Natural selection links chromosomes with the performance of their decoded structures. The processes of natural selection cause those chromosomes that encode successful structures to reproduce more often than those that do not. Recombination processes create different chromosomes in children by combining material from the chromosomes of the two parents. Mutation may cause the chromosomes of children to be different from those of their parents. GAs appropriately incorporate these features of natural evolution in computer algorithms to solve different problems in the way that nature has done through evolution. GAs require the problem of maximization (or minimization) to be stated in the form of a cost (fitness) function. In a GA a set of variables for a given problem is encoded into a string (or other coding structure), analogous to a chromosome in nature. Each string
contains a possible solution to the problem. To determine how well a chromosome solves the problem, it is first broken down into the individual substrings which represent each variable, and these values are then used to evaluate the cost function, yielding a fitness value. GAs select parents from a pool of strings (population) according to the basic criterion of survival of the fittest. They create new strings by recombining parts of the selected parents in a random manner. In this manner, GAs are able to use historical information as a guide through the search space.
The GA is a parallel and global search technique. Because it simultaneously evaluates many points in the search space, it is more likely to converge toward the global solution. The GA applies operators inspired by the mechanics of natural selection to a population of binary strings encoding the parameter space; at each generation, it explores different areas of the parameter space, and then directs the search to regions where there is a high probability of finding improved performance. By working with a population of solutions, the algorithm can effectively seek many local minima and thereby increase the likelihood of finding the global minimum. In this paper the GA is applied to find the near-optimal weights of a neural network controller in the reinforcement learning environment.
Recent applications of GAs to neural network weight optimization produce results that are roughly competitive with standard backpropagation (Montana and Davis 1989, Whitley et al. 1990). The application of GAs to the evolution of neural network topologies has also produced interesting results (Harp et al. 1990, Schaffer 1990). There are domains, however, where GAs can make a unique contribution to neural network learning. In particular, because GAs do not require or use derivative information, the most appropriate applications are problems where gradient information is unavailable or costly to obtain. Reinforcement learning is one example of such a domain. Whitley et al. (1993) demonstrated how GAs can be used to train neural networks for reinforcement learning and neurocontrol applications. They used the external reinforcement signal from the environment directly as the fitness function for the GA, and thus their system required the action network only. This structure is different from the actor-critic architecture (Barto et al. 1983), which consists of an action network and a critic network. In our system we shall use a critic network to provide the action network with a more informative internal reinforcement signal such that the GA can perform a more effective search on the weight space of the action network.
In this paper we propose a genetic reinforcement neural network (GRNN) to solve various reinforcement learning problems. The proposed learning system can
construct a neural network controller according to a reward-penalty (i.e. good-bad) signal, called the reinforcement signal. It allows a long delay between an action and the resulting reinforcement feedback information. The proposed GRNN is constructed by integrating two neural networks, each of which is a feedforward multilayer network, and forms an actor-critic architecture. In the GRNN the network used as the action network can determine a proper action according to the current input vector (environment state). There is no 'teacher' to indicate the output errors of the action network for its learning in the reinforcement learning problems. The other feedforward multilayer network (the critic network) can predict the external reinforcement signal and provide the action network with a more informative internal reinforcement signal, which is used as a fitness function for the GA-based learning of the action network.
Associated with the GRNN is a GA-based reinforcement learning algorithm. It is a hybrid GA, consisting of the temporal difference method, the gradient descent method and the normal GA. The temporal difference method is used to determine the output errors of the critic network for the multistep prediction of the external reinforcement signal (Anderson 1987). With the knowledge of the output errors, the gradient descent method (the backpropagation algorithm) can be used to train the critic network to find proper weights. For the action network, the GA is used to find the weights for the desired control performance. To optimize the connection weights of the action network, we encode these weights as a real-valued string (chromosome). In other words, each connection weight is viewed as a separate parameter and each real-valued string is simply the concatenated weight parameters of the action network. Initially, the GA generates a population of real-valued strings randomly, each of which represents one set of connection weights for the action network. After a new real-valued string has been created, an interpreter uses this string to set the connection weights of the action network. The action network then runs in a feedforward fashion to produce control actions acting on the environment. At the same time, the critic network predicts the external reinforcement signal from the controlled environment, and produces the internal reinforcement signal to indicate the performance or fitness of the action network. With such evaluative information (fitness values) from the critic network, the GA can look for a better set of weights (a better string) for the action network.
The proposed GA-based reinforcement learning scheme is different from the scheme proposed by Whitley et al. (1993), especially in the definition of the fitness function. In their system there is no critic network, and the external reinforcement signal is used as the fit-
ness function of the GA directly in training the action network. Hence, an action network (defined by a string) can be evaluated only after the appearance of an external reinforcement signal, which is usually available only after very many actions have acted on the environment in the reinforcement learning problems. The major feature of the proposed GA-based reinforcement learning scheme is that we formulate the internal reinforcement signal as the fitness function of the GA based on the actor-critic architecture. In this way, the GA can evaluate the candidate solutions (the weights of the action network) regularly even during the period without external reinforcement feedback from the environment. Hence, the GA can proceed to new generations regularly without waiting for the arrival of the external reinforcement signal. In other words, we can keep the time steps for evaluating each string (action network) fixed, because the critic network can give predicted reward/penalty information for a string without waiting for the final success or failure. This can usually accelerate the GA learning, because an external reinforcement signal may be available only at a time long after a sequence of actions has occurred in the reinforcement learning problems. This is similar to the fact that we usually evaluate a person according to his/her potential or performance during a period, not after he/she has done something really good or bad.
This paper is organized as follows. Section 2 describes the basics of GAs and hybrid GAs. Section 3 describes the structure of the proposed GRNN. The corresponding GA-based reinforcement learning algorithm is presented in section 4. In section 5 the cart-pole balancing system is simulated to demonstrate the learning capability of the proposed GRNN. Performance comparisons are also made in this section. Finally, conclusions are summarized in section 6.
2. Genetic algorithms
Genetic algorithms (GAs) were invented to mimic some of the processes observed in natural evolution. The underlying principles of GAs were first published by Holland (1962). The mathematical framework was developed in the 1960s and was presented by Holland (1975). GAs have been used primarily in two major areas: optimization and machine learning. In optimization applications GAs have been used in many diverse fields such as function optimization, image processing, the travelling salesman problem, system identification and control. In machine learning GAs have been used to learn syntactically simple string IF-THEN rules in an arbitrary environment. Excellent references on GAs and their implementations and applications can be found in (Goldberg 1989, Davis 1991, Michalewicz and Krawczyk 1992).
In applying GAs to neural network learning, the structures and parameters of neural networks are encoded as genes (or chromosomes), and GAs are then used to search for better solutions (optimal structures and parameters) for the neural networks. The fundamental concepts of genetic operators and hybrid genetic algorithms will be discussed in the following subsections.

2.1. Basics of genetic algorithms

GAs are search algorithms based on the mechanics of natural selection, genetics and evolution. It is widely accepted that the evolution of living beings is a process that operates on chromosomes, organic devices that encode the structure of living beings. Natural selection is the link between chromosomes and the performance of their decoded structures. Processes of natural selection cause those chromosomes that encode successful structures to reproduce more often than those that do not. In addition to reproduction, mutations may cause the chromosomes of biological children to be different from those of their biological parents, and recombination processes may create quite different chromosomes in the children by combining material from the chromosomes of two parents. These features of natural evolution influenced the development of GAs.

Roughly speaking, GAs manipulate strings of binary digits, 1s and 0s, called chromosomes, which represent multiple points in the search space through a proper encoding mechanism. Each bit in a string is called an allele. GAs carry out simulated evolution on populations of such chromosomes. Like nature, GAs solve the problem of finding good chromosomes by manipulating the material in the chromosomes blindly, without any knowledge of the type of problem they are solving; the only information they are given is an evaluation of each chromosome that they produce. The evaluation is used to bias the selection of chromosomes so that those with the best evaluations tend to reproduce more often than those with bad evaluations. GAs, using simple manipulations of chromosomes such as simple encodings and reproduction mechanisms, can display complicated behaviour and solve some extremely difficult problems without knowledge of the decoded world.

A high-level description of a GA is as follows (Davis 1991). Given a way or a method of encoding solutions to the problem on the chromosomes and a fitness (evaluation) function that returns a performance measurement of any chromosome in the context of the problem, a GA consists of the following steps.

Step 1. Initialize a population of chromosomes.

Step 2. Evaluate each chromosome in the population.

Step 3. Create new chromosomes by mating current chromosomes; apply mutation and recombination as the parent chromosomes mate.

Step 4. Delete members of the population to make room for the new chromosomes.

Step 5. Evaluate the new chromosomes and insert them into the population.

Step 6. If the stopping criterion is satisfied, then stop and return the best chromosome; otherwise, go to Step 3.
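To make the six steps above concrete, the following minimal Python sketch implements this generational loop for a generic encoding. The population size, the evaluate/crossover/mutate callbacks and the steady-state replacement of the two worst members are illustrative choices, not values prescribed by the paper.

```python
import random

def run_ga(init_chromosome, evaluate, crossover, mutate,
           pop_size=50, generations=100):
    """Generic GA loop following Steps 1-6 (Davis 1991)."""
    # Step 1: initialize a population of chromosomes.
    population = [init_chromosome() for _ in range(pop_size)]
    # Step 2: evaluate each chromosome in the population.
    fitness = [evaluate(c) for c in population]

    for _ in range(generations):                      # Step 6: stopping criterion
        # Step 3: create new chromosomes by mating current ones.
        parents = random.choices(population, weights=fitness, k=2)
        child_a, child_b = crossover(parents[0], parents[1])
        children = [mutate(child_a), mutate(child_b)]
        # Step 4: delete members of the population to make room.
        for _ in children:
            worst = fitness.index(min(fitness))
            population.pop(worst)
            fitness.pop(worst)
        # Step 5: evaluate the new chromosomes and insert them.
        population.extend(children)
        fitness.extend(evaluate(c) for c in children)

    return population[fitness.index(max(fitness))]    # best chromosome
```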
The encoding mechanisms and the fitness function form the links between the GA and the specific problem to be solved. The technique for encoding solutions may vary from problem to problem and from GA to GA. Generally, encoding is carried out using bit strings. The coding that has been shown to be the optimal one is binary coding (Holland 1975). Intuitively, it is better to have few possible options for many bits than to have many options for few bits. A fitness function takes a chromosome as input and returns a number or a list of numbers that is a measure of the chromosome's performance on the problem to be solved. Fitness functions play the same role in GAs as the environment plays in natural evolution. The interaction of an individual with its environment provides a measure of fitness. Similarly, the interaction of a chromosome with a fitness function provides a measure of fitness that the GA uses when carrying out reproduction.

The GA is a general-purpose stochastic optimization method for search problems. GAs differ from normal optimization and search procedures in several ways. First, the algorithm works with a population of strings, searching many peaks in parallel. By employing genetic operators, it exchanges information between the peaks, hence lowering the possibility of ending at a local minimum and missing the global minimum. Second, it works with a coding of the parameters, not the parameters themselves. Third, the algorithm only needs to evaluate the objective function to guide its search, and there is no requirement for derivatives or other auxiliary knowledge. The only available feedback from the system is the value of the performance measure (fitness) of the current population. Finally, the transition rules are probabilistic rather than deterministic. The randomized search is guided by the fitness value of each string and how it compares to others. Using the operators on the chromosomes taken from the population, the algorithm efficiently explores parts of the search space where the probability of finding improved performance is high.

The basic element processed by a GA is the string formed by concatenating substrings, each of which is a binary coding of a parameter of the search space. Thus, each string represents a point in the search space and hence a possible solution to the problem. Each string is decoded by an evaluator to obtain its objective function value. This function value, which should be maximized or minimized by the GA, is then converted to a fitness value which determines the probability of the individual undergoing genetic operators. The population then evolves from generation to generation through the application of the genetic operators. The total number of strings included in a population is kept unchanged through generations. A GA in its simplest form uses three operators: reproduction, crossover and mutation (Goldberg 1989). Through reproduction, strings with high fitnesses receive multiple copies in the next generation, whereas strings with low fitnesses receive fewer copies or even none at all. The crossover operator produces two offspring (new candidate solutions) by recombining the information from two parents in two steps. First, a given number of crossing sites are selected along the parent strings uniformly at random. Second, two new strings are formed by exchanging alternate sections between the selected sites. In the simplest form, crossover with a single crossing site refers to taking a string, splitting it into two parts at a randomly generated crossover point and recombining it with another string which has also been split at the same crossover point. This procedure serves to promote change in the best strings, which could give them even higher fitnesses. Mutation is the random alteration of a bit in the string which assists in keeping diversity in the population.
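As an illustration of these three operators on bit strings, the sketch below (a hedged example, not code from the paper) implements fitness-proportional reproduction, single-point crossover and bit-flip mutation; the mutation probability is an arbitrary placeholder.

```python
import random

def reproduce(population, fitness, k):
    """Fitness-proportional selection: fitter strings tend to get more copies."""
    return random.choices(population, weights=fitness, k=k)

def single_point_crossover(parent_a, parent_b):
    """Split both bit strings at one random site and swap the tails."""
    site = random.randint(1, len(parent_a) - 1)
    return (parent_a[:site] + parent_b[site:],
            parent_b[:site] + parent_a[site:])

def bit_mutation(bits, p_mut=0.01):
    """Flip each bit with a small probability to keep diversity in the population."""
    return ''.join(('1' if b == '0' else '0') if random.random() < p_mut else b
                   for b in bits)

# Example: offspring of two 8-bit parents.
child1, child2 = single_point_crossover('11110000', '00001111')
child1 = bit_mutation(child1)
```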
2.2. Hybrid genetic algorithms
Traditional simple GAs, though robust, are generally not the most successful optimization algorithms on any particular domain. Hybridizing a GA with algorithms currently in use can produce an algorithm better than both the GA and the current algorithms. Hence, for an optimization problem, when there exist algorithms, optimization heuristics or domain knowledge that can aid in optimization, it may be advantageous to consider a hybrid GA. GAs may be crossed with various problem-specific search techniques to form a hybrid that exploits the global perspective of the GA (global search) and the convergence of the problem-specific technique (local search).

There are numerous gradient techniques (e.g. the gradient descent method, the conjugate gradient method) and gradient-less techniques (e.g. the golden section search, the simplex method) available to find the local optimum of a calculus-friendly function (e.g. a continuous function) (Luenberger 1976). Even without a calculus-friendly function, there are well-developed heuristic search schemes for many popular problems. For example, the greedy algorithms in combinatorial optimization are a form of local search (Lawler 1976). An intuitive concept of hybridizing GAs with these local search techniques is that the GA finds the hills and the local searcher goes and climbs them. Thus, in this approach, we simply allow the GA to run to substantial convergence and then we permit the local optimization procedure to take over, perhaps searching from the top 5% or 10% of points in the last generation.

In some situations hybridization entails using the representation as well as the optimization techniques already in use in the domain, while tailoring the GA operators to the new representation. Moreover, hybridization can entail adding domain-based optimization heuristics to the GA operator set. In these cases we can no longer apply the familiar GA operators directly and must create their analogues to account for new representations and/or added optimization schemes. For example, Davis (1991) and Adler (1993) describe an approach to hybridizing a simple GA with the simulated annealing algorithm, and Tsinas and Dachwald (1994) and Petridis et al. (1992) describe an approach to hybridizing a simple GA with the backpropagation algorithm. A hybrid of both categories of learning methods results in a more powerful, more robust and faster learning procedure. In this paper we develop a novel hybrid GA, called the GA-based reinforcement learning algorithm, to train the proposed GRNN to solve various reinforcement learning problems.
3. The structure of the GRNN
Unlike the supervised learning problem, in which the correct target output values are given for each input pattern to instruct the network learning, the reinforcement learning problem has only very simple evaluative or critic information available for learning, rather than instructive information. In the extreme case there is only a single bit of information to indicate whether the output is right or wrong. Fig. 1 shows how a network and its training environment interact in a reinforcement learning problem. The environment supplies a time-varying input vector to the network, receives its time-varying output/action vector and then provides a time-varying scalar reinforcement signal. In this paper the reinforcement signal r(t) is two-valued, r(t) ∈ {-1, 0}, such that r(t) = 0 means a success and r(t) = -1 means a failure. We also assume that r(t) is the reinforcement signal available at time step t and is caused by the inputs and actions chosen at earlier time steps (i.e. at time steps t-1, t-2, ...). The goal of learning is to maximize a function of this reinforcement signal.
To resolve reinforcement learning problems, a system called the GRNN is proposed. As Fig. 1 shows, the GRNN consists of two neural networks; one acts as the action network and the other as the critic network. Each network has exactly the same structure as that shown in Fig. 2. The GRNN is basically in the form of the actor-critic architecture (Sutton 1984). As we want to solve the reinforcement learning problems in which the external reinforcement signal is available only after a long sequence of actions has been passed onto the environment, we need a multistep critic network to predict the external reinforcement signal. In the GRNN, the critic network models the environment such that it can perform a multistep prediction of the external reinforcement signal that will eventually be obtained from the environment for the current action chosen by the action network. With the multistep prediction, the critic network can provide a more informative internal reinforcement signal to the action network. The action network can then determine a better action to impose onto the environment in the next time step according to the current environment state and the internal reinforcement signal. The internal reinforcement signal from the critic network enables both the action network and the critic network to learn at each time step without waiting for the arrival of an external reinforcement signal, greatly accelerating the learning of both networks. The structures and functions of the critic network and the action network are described in the following subsections.

Figure 1. The proposed genetic reinforcement neural network (GRNN). (Block diagram: the critic network, via the temporal difference method, turns the external reinforcement signal and the state into an internal reinforcement signal; the genetic algorithm produces a string that a neural network builder decodes into the weights of the action network, which acts on the plant.)

3.1. The critic network

The critic network constantly predicts the reinforcement associated with different input states, and thus equivalently evaluates the goodness of the control actions determined by the action network. The only information received by the critic network is the state of the environment in terms of state variables and whether or not a failure has occurred. The critic network is a standard two-layer feedforward network with sigmoids everywhere except in the output layer. The input to the critic network is the state of the plant, and the output is an evaluation of the state, denoted by v. This value is suitably discounted and combined with the external failure signal to produce the internal reinforcement signal r̂(t).

Figure 2 shows the structure of the critic network. It includes h hidden nodes and n input nodes, including a bias node (i.e. x_1, x_2, ..., x_n). In this network each hidden node receives n inputs and has n weights, and each output node receives n + h inputs and has n + h weights. The output of the ith node in the hidden layer is given by

    y_i[t, t+1] = g\Big( \sum_{j=1}^{n} a_{ij}[t]\, x_j[t+1] \Big),    (1)

where

    g(s) = \frac{1}{1 + e^{-s}},    (2)

t and t+1 are successive time steps, and a_{ij} is the weight from the jth input node to the ith hidden node. The output node of the critic network receives inputs from the nodes in the hidden layer (i.e. y_i) and directly from the nodes in the input layer (i.e. x_j):

    v[t, t+1] = \sum_{j=1}^{n} b_j[t]\, x_j[t+1] + \sum_{i=1}^{h} c_i[t]\, y_i[t, t+1],    (3)

where v is the prediction of the external reinforcement value, b_j is the weight from the jth input node to the output node, and c_i is the weight from the ith hidden node to the output node. In (1) and (3) double time dependencies are used to avoid instabilities in updating weights (Anderson 1987, Berenji and Khedkar 1992).

Figure 2. The structure of the critic network and the action network in the GRNN.
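The forward pass of equations (1)-(3) can be summarized in a short sketch. This is an illustrative Python rendering under the paper's notation (the original simulator was written in C); the weight arrays a, b, c are assumed to be maintained by the caller.

```python
import numpy as np

def sigmoid(s):
    # Equation (2): g(s) = 1 / (1 + e^{-s})
    return 1.0 / (1.0 + np.exp(-s))

def critic_forward(a, b, c, x_next):
    """Evaluate the critic network on the next input vector.

    a : (h, n) hidden-layer weights a_ij
    b : (n,)   direct input-to-output weights b_j
    c : (h,)   hidden-to-output weights c_i
    x_next : (n,) state vector x[t+1], including the bias input
    """
    y = sigmoid(a @ x_next)   # equation (1): hidden-node outputs y_i[t, t+1]
    v = b @ x_next + c @ y    # equation (3): prediction v[t, t+1]
    return v, y
```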
The critic network evaluates the action recommended by the action network and represents the evaluated result as the internal reinforcement signal. The internal reinforcement signal is a function of the external failure signal and the change in state evaluation based on the state of the system at time t+1:

    \hat{r}(t+1) = \begin{cases} 0, & \text{start state,} \\ r[t+1] - v[t, t], & \text{failure state,} \\ r[t+1] + \gamma\, v[t, t+1] - v[t, t], & \text{otherwise,} \end{cases}    (4)

where 0 ≤ γ ≤ 1 is the discount rate. In other words, the change in the value of v plus the value of the external reinforcement signal constitutes the heuristic or internal reinforcement signal r̂(t), where the future values of v are discounted more, the further they are from the current state of the system.
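A direct transcription of equation (4) might look as follows; gamma and the state flags are arguments supplied by the surrounding training loop in this sketch.

```python
def internal_reinforcement(r_next, v_prev, v_next, gamma=0.9,
                           start_state=False, failure_state=False):
    """Equation (4): internal reinforcement signal r_hat(t+1).

    r_next : external reinforcement r[t+1] (0 for success, -1 for failure)
    v_prev : critic prediction v[t, t]
    v_next : critic prediction v[t, t+1]
    gamma  : discount rate, 0 <= gamma <= 1
    """
    if start_state:
        return 0.0
    if failure_state:
        return r_next - v_prev
    return r_next + gamma * v_next - v_prev
```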
The learning algorithm for the critic network is composed of Sutton's AHC algorithm (Sutton 1984) for the output node and the backpropagation algorithm for the hidden nodes. The AHC algorithm is a temporal difference prediction technique proposed by Sutton (1988). We shall go into the details of this learning algorithm in the next section.
3.2. The action network
The action network is to determine a proper action acting on the environment (plant) according to the current environment state. The structure of the action network is exactly the same as that of the critic network shown in Fig. 2. The only information received by the action network is the state of the environment in terms of state variables and the internal reinforcement signal from the critic network. As we want to use the GA to train the action network for control applications, we encode the weights of the action network as a real-valued string. More clearly, each connection weight is viewed as a separate parameter and each real-valued string is simply the concatenated weight parameters of an action network. Initially, the GA generates a population of real-valued strings randomly, each of which represents one set of connection weights for an action network. After a new real-valued string is created, an interpreter uses this string to set the connection weights of the action network. The action network then runs in a feedforward fashion to produce control actions acting on the environment according to (1) and (3). At the same time, the critic network constantly predicts the reinforcement associated with the changing environment states under the control of the current action network. After a fixed time period, the internal reinforcement signal from the critic network will indicate the fitness of the current action network. This evaluation process continues for each string (action network) in the population. When each string in the population has been evaluated and given a fitness value, the GA can look for a better set of weights (better strings) and apply genetic operators on them to form a new population as the next generation. Better actions can thus be chosen by the action network in the next generation. After a fixed number of generations, or when the desired control performance is achieved, the whole evolution process stops, and the string with the largest fitness value in the last generation is selected and decoded into the final action network. The detailed learning scheme for the action network will be discussed in the next section.
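The encoding/decoding step between a real-valued string and one set of action-network weights can be sketched as below; the (h, n) layer shapes follow the network of Fig. 2, while the function names and the example at the end (which reuses the n = 5, h = 5, POP = 200 and ±2.5 initialization quoted later in the simulations) are illustrative only.

```python
import numpy as np

def encode_weights(a, b, c):
    """Concatenate all connection weights of an action network into one
    real-valued string (chromosome): n*h hidden weights + n + h output weights."""
    return np.concatenate([a.ravel(), b, c])

def decode_string(string, n, h):
    """Interpreter: set action-network weights from a real-valued string."""
    a = string[:n * h].reshape(h, n)   # hidden-layer weights a_ij
    b = string[n * h:n * h + n]        # direct input-to-output weights b_j
    c = string[n * h + n:]             # hidden-to-output weights c_i
    return a, b, c

# Example: a random initial population of strings for a network with
# n = 5 inputs (including the bias) and h = 5 hidden nodes (35 weights each).
population = [np.random.uniform(-2.5, 2.5, 5 * 5 + 5 + 5) for _ in range(200)]
```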
4. GA-based reinforcement learning algorithm
Associated with the GRNN is a GA-based reinforcement learning algorithm to determine the proper weights of the GRNN. All the learning is performed on both the action network and the critic network simultaneously, and is conducted only by a reinforcement signal fed back from the environment. The key concept of the GA-based reinforcement learning algorithm is to formulate the internal reinforcement signal from the critic network as the fitness function of the GA for the action network. This learning scheme is in fact a novel hybrid GA, which consists of the temporal difference and gradient descent methods for the critic network learning, and the GA for the action network learning. This algorithm possesses an advantage of all hybrid GAs: hybridizing a GA with algorithms currently in use can produce an algorithm better than both the GA and the current algorithms. It will be observed that the proposed reinforcement learning algorithm is superior to the normal reinforcement learning schemes without GAs in its global optimization capability, and superior to the normal GAs without the temporal difference prediction technique in learning efficiency. The flowchart of the GA-based reinforcement learning algorithm is shown in Fig. 3. In the following subsections we first consider the reinforcement learning scheme for the critic network of the GRNN, and then introduce the GA-based reinforcement learning scheme for the action network of the GRNN.
4.1. Learning algorithm for the critic network
When both the reinforcement signal and the input patterns from the environment depend arbitrarily on the past history of the action network outputs and the action network only receives a reinforcement signal after a long sequence of outputs, the credit assignment problem becomes severe. This temporal credit assignment problem occurs because we need to assign credit or blame to each step individually in long sequences
leading up to eventual successes or failures. Thus, to handle this class of reinforcement learning problems we need to solve the temporal credit assignment problem, along with the original structural credit assignment problem concerning the attribution of network errors to different connections or weights. The solution to the temporal credit assignment problem in the GRNN is to use a multistep critic network that predicts the reinforcement signal at each time step in the period without any external reinforcement signal from the environment. This can ensure that both the critic network and the action network can update their parameters during the period without any evaluative feedback from the environment. To train the multistep critic network, we use a technique based on the temporal difference method, which is often closely related to the dynamic programming techniques (Barto et al. 1983, Sutton 1988, Werbos 1990). Unlike the single-step prediction and the supervised learning methods, which assign credit according to the difference between the predicted and actual outputs, the temporal difference methods assign credit according to the difference between temporally successive predictions. Note that the term multistep prediction used here means that the critic network can predict a value that will be available several time steps later, although it makes such a prediction at each time step to improve its prediction accuracy.

The goal of training the multistep critic network is to minimize the prediction error, i.e. to minimize the internal reinforcement signal r̂(t). It is similar to a reward/punishment scheme for the weight updating in the critic network. If positive (negative) internal reinforcement is observed, the values of the weights are rewarded (punished) by being changed in the direction which increases (decreases) their contribution to the total sum. The weights on the links connecting the nodes in the input layer directly to the nodes in the output layer are updated according to the following rule:

    b_i[t+1] = b_i[t] + \eta\, \hat{r}[t+1]\, x_i[t],    (5)

where η > 0 is the learning rate and r̂[t+1] is the internal reinforcement signal at time t+1. Similarly, for the weights on the links between the hidden layer and the output layer, we have the following weight update rule:

    c_i[t+1] = c_i[t] + \eta\, \hat{r}[t+1]\, y_i[t, t+1].    (6)

The weight update rule for the hidden layer is based on a modified version of the error backpropagation algorithm (Rumelhart et al. 1986). As no direct error measurement is possible (i.e. knowledge of the correct action is not available), r̂ plays the role of an error measure in the update of the output node weights; if r̂ is positive, the weights are altered to increase the output v for positive input, and vice versa. Therefore, the equation for updating the hidden weights is

    a_{ij}[t+1] = a_{ij}[t] + \eta\, \hat{r}[t+1]\, y_i[t, t]\,(1 - y_i[t, t])\,\mathrm{sgn}(c_i[t])\, x_j[t].    (7)

Note that the sign of a hidden node's output weight is used, rather than its value. This variation is based on Anderson's empirical study (Anderson 1987) showing that the algorithm is more robust if the sign of the weight is used, rather than its value.

Figure 3. Flowchart of the proposed GA-based reinforcement learning algorithm for the GRNN. (Input training data are propagated forward through the critic network, which performs temporal difference prediction, and through the action network, which determines the best action; the critic network is then updated by the gradient descent learning algorithm and the action network by the genetic algorithm.)
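Equations (5)-(7) translate into the following sketch (Python for illustration); r_hat is the internal reinforcement from equation (4), eta the learning rate, and the in-place updates mirror the reward/punishment interpretation above.

```python
import numpy as np

def update_critic(a, b, c, x_t, y_prev, y_next, r_hat, eta=0.1):
    """One critic-network weight update driven by the internal reinforcement.

    a      : (h, n) hidden-layer weights a_ij
    b      : (n,)   input-to-output weights b_j
    c      : (h,)   hidden-to-output weights c_i
    x_t    : (n,)   input vector x[t]
    y_prev : (h,)   hidden outputs y_i[t, t]
    y_next : (h,)   hidden outputs y_i[t, t+1]
    r_hat  : scalar internal reinforcement r_hat[t+1]
    """
    b += eta * r_hat * x_t                                       # equation (5)
    c += eta * r_hat * y_next                                    # equation (6)
    # Equation (7): backpropagation-like rule using only the sign of c_i.
    delta = eta * r_hat * y_prev * (1.0 - y_prev) * np.sign(c)
    a += np.outer(delta, x_t)
    return a, b, c
```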
4.2. Learning algorithm for the action network
The GA is used to train the action network by using the internal reinforcement signal from the critic network as the fitness function. Fig. 4 shows the flowchart of the GA-based learning scheme for the action network. Initially, the GA randomly generates a population of real-valued strings, each of which represents one set of connection weights for the action network. Hence, each connection weight is viewed as a separate parameter and each real-valued string is simply the concatenated connection weight parameters for the action network. The real-value encoding scheme instead of the normal binary encoding scheme in GAs is used here, so recombination can only occur between weights. As there are (n × h + n + h) links in the action network (see Fig. 2), each string used by the genetic search includes (n × h + n + h) real values concatenated together. A small population is used in our learning scheme. The use of a small population reduces the exploration of the multiple (representationally dissimilar) solutions for the same network.

After a new real-valued string has been created, an interpreter takes this real-valued string and uses it to
set the connection weights in the action network. The action network then runs in a feedforward fashion to control the environment (plant) for a fixed time period (determined by the constant TIME in Fig. 4) or until a failure occurs. At the same time, the critic network predicts the external reinforcement signal from the controlled environment and provides an internal reinforcement signal to indicate the fitness of the action network. In this way, according to a defined fitness function, a fitness value is assigned to each string in the population, where high fitness values mean a good fit. The fitness function FIT can be any nonlinear, non-differentiable, discontinuous, positive function, because the GA only needs a fitness value assigned to each string. In this paper we use the internal reinforcement signal from the critic network to define the fitness function. Two fitness functions are defined and used in the GRNN. The first fitness function is defined by

    FIT(t) = \frac{1}{|\hat{r}(t)|},    (8)

which reflects the fact that small internal reinforcement values (i.e. small prediction errors of the critic network) mean higher fitness of the action network. The second fitness function that we define is

    FIT(t) = \frac{1}{|\hat{r}(t)|} \times \frac{t}{TIME},    (9)

where t is the current time step, 1 ≤ t ≤ TIME, and the constant TIME is a fixed time period during which the performance of the action network is evaluated by the critic network. The added t/TIME in (9) reflects the credibility or strength of the term 1/|r̂(t)| in (8); if an action network receives a failure signal from the environment before the time limit (i.e. t ≤ TIME), then the action network that can keep the desired control goal longer before failure occurs will obtain a higher fitness value. The above two fitness functions are different from that defined by Whitley et al. (1993). Their relative measure of fitness takes the form of an accumulator that determines how long the experiment is still a success. Hence, a string (action network) cannot be assigned a fitness value until an external reinforcement signal arrives to indicate the final success or failure of the current action network.

Figure 4. Flowchart of the proposed hybrid GA for the action network.
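In code, the two fitness definitions (8) and (9) are one-liners; a small epsilon is added here (an implementation detail not in the paper) to guard against division by zero when the prediction error vanishes.

```python
def fitness_1(r_hat, eps=1e-6):
    """Equation (8): a small prediction error of the critic means high fitness."""
    return 1.0 / (abs(r_hat) + eps)

def fitness_2(r_hat, t, time_limit, eps=1e-6):
    """Equation (9): equation (8) weighted by how long the action network
    survived within the evaluation period TIME."""
    return (1.0 / (abs(r_hat) + eps)) * (t / time_limit)
```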
When each string in the population has been evaluated and given a fitness value, the GA then looks for a better set of weights (better strings) to form a new population as the next generation by using the genetic operators introduced in section 2 (i.e. the reproduction, crossover and mutation operators). Among the basic GA operators, the crossover operation can be generalized to multipoint crossover, in which the number of crossover points (N_c) is defined. With N_c set to 1, generalized crossover reduces to simple crossover. The multipoint crossover can solve one major problem of simple crossover: one-point crossover cannot combine certain combinations of features encoded on chromosomes. In the proposed GA-based reinforcement learning algorithm, we choose N_c = 2. For the mutation operator, because we use the real-value encoding scheme, we use a higher mutation probability in our algorithm. This is different from the traditional GAs that use the binary encoding scheme; the latter are largely driven by recombination, not mutation. The above learning process continues to new generations until the number of generations meets a predetermined generation size (G in Fig. 4). After a fixed number of generations G, the whole evolution process stops, and the string with the largest fitness value in the last generation is selected and decoded into the final action network.
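For the real-valued strings used here, the N_c = 2 crossover and the additive mutation described above can be sketched as follows; the ±10 mutation range matches the value quoted in the simulation section, while the per-weight mutation probability is an illustrative placeholder.

```python
import random

def two_point_crossover(parent_a, parent_b):
    """N_c = 2: swap the middle segment of two real-valued weight strings."""
    i, j = sorted(random.sample(range(1, len(parent_a)), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b

def real_mutation(string, p_mut=0.1, scale=10.0):
    """Mutate a real-valued string by adding a random value in [-scale, +scale]
    to each weight with probability p_mut (higher than in binary-coded GAs)."""
    return [w + random.uniform(-scale, scale) if random.random() < p_mut else w
            for w in string]
```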
The major feature of the proposed hybrid GA learning scheme is that we formulate the internal reinforcement signal as the fitness function for the GA based on the actor-critic architecture (GRNN). In this way, the GA can evaluate the candidate solutions (the weights of the action network) regularly during the period without external reinforcement feedback from
the environment. The GA can thus proceed to new generations in fixed time steps (specified by the constant TIME in (9)) without waiting for the arrival of the external reinforcement signal. In other words, we can keep the time steps TIME to evaluate each string (action network) and the generation size G fixed in our learning algorithm (see the flowchart in Fig. 4), because the critic network can give predicted reward/penalty information to a string without waiting for the final success or failure. This can usually accelerate the GA learning because an external reinforcement signal may only be available at a time long after a sequence of actions has occurred in the reinforcement learning problems. This is similar to the fact that we usually evaluate a person according to his/her potential or performance during a period, not after he/she has done something really good or bad.

5. Control of the cart-pole system

A general-purpose simulator for the GRNN model with the multistep critic network has been written in the C language and runs on an 80486-DX-based personal computer. Using this simulator, one typical example, the cart-pole balancing system (Anderson 1987), is presented in this section to show the performance of the proposed model.

The cart-pole balancing problem involves learning how to balance an upright pole as shown in Fig. 5. The bottom of the pole is hinged to a cart that travels along a finite-length track to its right or its left. Both the cart and the pole can move only in the vertical plane; that is, each has only one degree of freedom. There are four input state variables in this system: θ, the angle of the pole from an upright position (deg); θ̇, the angular velocity of the pole (deg/s); x, the horizontal position of the cart's centre (m); and ẋ, the velocity of the cart (m/s). The only control action is f, which is the amount of force (N) applied to the cart to move it left or right. This control action f is a discrete value, ±10 N. The system fails and receives a penalty signal of -1 when the pole falls past a certain angle (±12° was used) or the cart runs into the bounds of its track (the distance is 2.4 m from the centre to both bounds of the track). The goal of this control problem is to train the GRNN to determine a sequence of forces applied to the cart to balance the pole.

Figure 5. The cart-pole balancing system.

The model and the corresponding parameters of the cart-pole balancing system for our computer simulations are adopted from Anderson (1986, 1987) with the consideration of friction effects. The equations of motion that we used are

    \theta(t+1) = \theta(t) + \Delta\, \dot{\theta}(t),    (10)

    \dot{\theta}(t+1) = \dot{\theta}(t) + \Delta\, \frac{(m + m_p)\, g \sin\theta(t) - \cos\theta(t)\big[ f(t) + m_p l\, \dot{\theta}(t)^2 \sin\theta(t) - \mu_c\, \mathrm{sgn}(\dot{x}(t)) \big] - \dfrac{\mu_p (m + m_p)\, \dot{\theta}(t)}{m_p l}}{\frac{4}{3}(m + m_p)\, l - m_p l \cos^2\theta(t)},    (11)

    x(t+1) = x(t) + \Delta\, \dot{x}(t),    (12)

    \dot{x}(t+1) = \dot{x}(t) + \Delta\, \frac{f(t) + m_p l \big[ \dot{\theta}(t)^2 \sin\theta(t) - \ddot{\theta}(t) \cos\theta(t) \big] - \mu_c\, \mathrm{sgn}(\dot{x}(t))}{m + m_p},    (13)

where g = 9.8 m/s² is the acceleration due to gravity, m = 1 kg is the mass of the cart, m_p = 0.1 kg is the mass of the pole, l = 0.5 m is the half-pole length, μ_c = 0.0005 is the coefficient of friction of the cart on the track, μ_p = 0.000002 is the coefficient of friction of the pole on the cart, and Δ = 0.02 s is the sampling interval.

The constraints on the variables are -12° ≤ θ ≤ 12° and -2.4 m ≤ x ≤ 2.4 m. In designing the controller, the equations of motion of the cart-pole balancing system are assumed to be unknown to the controller. A more challenging part of this problem is that the only available feedback is a failure signal that notifies the controller only when a failure occurs; that is, either |θ| > 12° or |x| > 2.4 m. As no exact teaching information is available, this is a typical reinforcement learning problem and the feedback failure signal serves as the reinforcement signal. As a reinforcement signal may only be available after a long sequence of time steps in this failure avoidance task, a multistep critic network with temporal difference prediction capability is used in the GRNN. The (external) reinforcement signal in this problem is defined as

    r(t) = \begin{cases} -1, & \text{if } |\theta(t)| > 12^{\circ} \text{ or } |x(t)| > 2.4\ \text{m}, \\ 0, & \text{otherwise.} \end{cases}    (14)
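Equations (10)-(14) give the plant model used in the simulations; a compact sketch of one simulation step is shown below (a hedged Python rendering of the published equations, with the parameter values quoted above).

```python
import math

# Cart-pole parameters from the paper.
G, M, M_P, L = 9.8, 1.0, 0.1, 0.5          # gravity, cart mass, pole mass, half-pole length
MU_C, MU_P, DT = 0.0005, 0.000002, 0.02    # track friction, pole friction, sampling interval

def cart_pole_step(theta, theta_dot, x, x_dot, f):
    """One Euler step of equations (10)-(13); angles in radians here."""
    sgn = 1.0 if x_dot >= 0 else -1.0
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    # Angular acceleration: the fraction in equation (11).
    theta_acc = ((M + M_P) * G * sin_t
                 - cos_t * (f + M_P * L * theta_dot**2 * sin_t - MU_C * sgn)
                 - MU_P * (M + M_P) * theta_dot / (M_P * L)) \
                / ((4.0 / 3.0) * (M + M_P) * L - M_P * L * cos_t**2)
    # Cart acceleration: the fraction in equation (13).
    x_acc = (f + M_P * L * (theta_dot**2 * sin_t - theta_acc * cos_t)
             - MU_C * sgn) / (M + M_P)
    return (theta + DT * theta_dot,       # equation (10)
            theta_dot + DT * theta_acc,   # equation (11)
            x + DT * x_dot,               # equation (12)
            x_dot + DT * x_acc)           # equation (13)

def external_reinforcement(theta, x):
    """Equation (14): -1 on failure, 0 otherwise (theta in radians)."""
    return -1 if abs(theta) > math.radians(12) or abs(x) > 2.4 else 0
```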
In the first set of simulations, we use the fitness function defined in (8), i.e. FIT(t) = 1/|r̂(t)|, to train the GRNN, where r̂(t) is the internal reinforcement signal from the critic network. The critic network and the
action network both have five input nodes, five hidden nodes and one output node. Hence, there are 35 weights in each network. A bias node fixed at 0.5 is used as the fifth input to the network; a weight from the bias node to a hidden node (or to the output node) in effect changes the threshold behaviour of that node. The learning parameters used to train the GRNN are the learning rate η = 0.1, the population size POP = 200, the time limit TIME = 100, and the generation size G = 50. Initially, we set all the weights in the critic network and action network to random values between -2.5 and 2.5. Mutation of a string is carried out by adding a random value to each substring (weight) in the range ±10. The action network determines two possible actions ±10 N to act on the cart according to the system state. When the GA learning stops, we choose the best string in the population at the 50th generation and test it on the cart-pole system. Fig. 6 shows the pole position (angular deviation of the pole) when the cart-pole system was controlled by a well-trained GRNN starting at the initial state: θ(0) = 0, θ̇(0) = 0, x(0) = 0, ẋ(0) = 0.

Figure 6. Angular deviation of the pole (deg) versus time (s) produced by a trained GRNN with FIT(t) = 1/|r̂(t)|.

The same cart-pole balancing control problem was also simulated by Whitley et al. (1993) using the pure GA approach. Whitley and his colleagues defined a fitness function different from ours. Their relative measure of fitness takes the form of an accumulator that determines how long the pole stays up and how long the cart avoids the end of the track. Hence, a string (action network) cannot be assigned a fitness value before a failure really occurs (i.e. either |θ| > 12° or |x| > 2.4 m). As the fitness function was defined directly by the external reinforcement signal, Whitley's system used the action network only. Their structure is different from our GRNN, which consists of an action network and a critic network. Whitley et al. (1993) used a normal GA to update the action network according to their fitness function. As with the use of the GRNN in the cart-pole problem, a bang-bang control scheme is used in their system; the control output also has only two possible values, ±10 N. The constraints on the state variables are also -12° ≤ θ ≤ 12° and -2.4 m ≤ x ≤ 2.4 m. The only available feedback is a failure signal that notifies the controller only when a failure occurs. The time steps from the beginning to the occurrence of a failure signal indicate the fitness of a string (action network); a string is assigned a higher fitness value if it can balance the pole longer.

Table 1. Performance indices (in generations) of Whitley's system on the cart-pole balancing problem.

Pop size    Best    Median    Worst    Mean
100          129      3026    14446    4780
200          312      2601    11893    2912

The simulation results of using Whitley's system on the cart-pole system for two different population sizes are shown in Table 1. The first row of Table 1 shows the performance indices when the population size is set as POP = 100. We observe that Whitley's system learned to balance the pole at the 4780th generation on average. Whitley's system balanced the pole at the 129th generation in the best case and at the 14446th generation in the worst case. The median number of generations required to produce the first good action network (a network that is able to balance the pole for 100 000 time steps) is 3026. The second row of Table 1 shows the performance indices when the population size is set as POP = 200. The results show that Whitley's system learned to balance the pole at the 2912th generation on average. In the best case, it took 312 generations to learn to balance the pole, and in the worst case it took 11 893 generations to learn the task. The median number of generations required to produce the first good action network is 2601. From Table 1 we observe that the mean generation number needed by Whitley's system to obtain a good action network is more than 2500 generations. In comparison, the GRNN needs only TIME = 100 and G = 50 to learn a good action network.

From the point of view of CPU time, we compare the learning speed required by the GRNN to that required by Whitley's system. A control strategy was deemed successful if it could balance a pole for 120 000 time steps. When 120 000 pole balance steps were completed, the CPU times expended were measured and averaged over 50 simulations. The CPU times should be treated as
Angular deviation of the pole resulted by a trainedWhitley system.
3
2
4
5;r--~-~-~----'--~-~-~--'
·2
.,
-3
-5;L-_~-~--~-~-~-~~-~--o 5 10 15 20 25 30 35 40
Time (second)
Figure 8. Angular deviation of the pole resulted by a trainedCRNN with 1'1'1'(1) = I!Ii(I) I after a disturbance is given,
403530
'" "'" ..,, "'r' u "P'
15 20 25Time(second)
10
Figure 7.
244
5
4
3
2
~o ."0
·1
·2
-3
...-5
0 5
Table 2. The learning speed in CI'U seconds of C RNN andGENITOR
Table 3. Moriarty's results
Learning speed (in CPU TIM E (s)
Learning speed (in CPU '1'1 ME) (s)
Method
GRNNGENITOR
mean
7.9431.18
best
1.1010.38
worst
25.11323.8
Method mean best worst
one-layer AHC 130.6 17 3017two-layer AHC 99.1 17 863Q-Iearning 19.8 5 99GENITOR 9,5 4 45SANE 5.9 4 8
rough estimates because they arc sensitive to the implementation details. However, the CPU time differencesfound arc large enough to indicate real differences inIraining time on both methods. The results are shownin Table 2. It is noted that the time needed to evaluate astring in Whitley's system (called GENITOR in thetable) is about three to four times as long as ours, asour time steps arc bounded by the time limit TIME andby the arrival time of a failure signal, but the time stepsfor Whitley's system are bounded only by the arrivaltime of the failure signal. Hence, on average, the proposed hybrid GA is superior to the pure GA in learningspeed in solving the reinforcement learning problems.'Fig. 7 shows the pole position (angular deviation ofthe pole) when the cart-pole system was controlled bya well-trained Whitley system starting at the initial state:0(0) = 0, 0(0) = 0, x(O) = 0, .\'(0) = O. These angulardeviations arc slightly bigger than those produced bythe trained GRNN.
For more complete performance comparisons, we consider several other learning schemes. Moriarty and Miikkulainen (1996) proposed a new reinforcement learning method called SANE (symbiotic adaptive neuro-evolution), which evolves a population of neurons through GAs to form a neural network capable of performing a task. SANE achieves efficient learning through symbiotic evolution, where each individual in the population represents only a partial solution to the problem; complete solutions are formed by combining several individuals. According to their results, SANE is faster, more efficient and more consistent than the earlier AHC approaches (Barto et al. 1983, Anderson 1987) and Whitley's system (Whitley et al. 1993), as shown in Table 3. From the table we found that the proposed GRNN is superior to the normal reinforcement learning schemes that do not use GAs for global optimization, such as the single-layer AHC (Barto et al. 1983) and the two-layer AHC (Anderson 1987), and superior in learning efficiency to the normal GAs that do not use the temporal difference technique (the GENITOR) for reinforcement learning problems.
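A toy sketch of the symbiotic-evolution idea is given below. It only illustrates the credit-sharing principle described above; the way networks are assembled and the `evaluate_network` interface are our assumptions, not Moriarty and Miikkulainen's implementation.

```python
import numpy as np

def sane_generation(neurons, evaluate_network, hidden=5, trials=200, rng=None):
    """One round of fitness assignment in the spirit of symbiotic evolution.

    Each row of `neurons` encodes a single hidden neuron (a partial solution).
    Complete networks are formed by combining `hidden` randomly chosen neurons,
    and each neuron is credited with the average fitness of the networks it
    took part in. `evaluate_network` is a caller-supplied task evaluation
    (hypothetical interface).
    """
    rng = rng or np.random.default_rng()
    scores = np.zeros(len(neurons))
    counts = np.zeros(len(neurons))
    for _ in range(trials):
        members = rng.choice(len(neurons), size=hidden, replace=False)
        fitness = evaluate_network(neurons[members])   # score the combined network
        scores[members] += fitness                     # credit is shared symbiotically
        counts[members] += 1
    return scores / np.maximum(counts, 1)              # per-neuron (partial) fitness
```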
Similarly to the simulations of Berenji and Khedkar (1992), the adaptation capability of the proposed GRNN was tested. To demonstrate the disturbance-rejection capability of the trained GRNN, we applied a disturbance f = 20 N to the cart at the 10th second. The curve of angular deviations in Fig. 8 indicates that the GRNN system brought the pole back to the centre position quickly after the disturbance was given. In this test, the GRNN required no further trials for re-learning. We also changed the parameters of the cart-pole system to test the robustness of the GRNN. We first reduced the pole length l from 0.5 m to 0.25 m. Fig. 9 shows the simulation results produced by the GRNN. It is observed that the trained GRNN can still balance the pole without any re-learning. In another test, we doubled the mass of the cart m to 2 kg from 1 kg. Fig. 10 shows the simulation results produced by the GRNN. Again, the results show that the GRNN can keep the angular deviations within the range [-12°, +12°] without any re-learning. These robustness tests show that no further trials are required to re-learn a trained GRNN when the controlled system parameters are changed in the ways mentioned. The results show the good control and adaptation capabilities of the trained GRNN in the cart-pole balancing system.
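The disturbance-rejection test can be reproduced with the standard cart-pole equations of motion used in this benchmark (Barto et al. 1983). In the sketch below the pole mass of 0.1 kg, the Euler step of 0.02 s and the one-step duration of the 20 N push are conventional values we assume, since they are not restated here; `controller` is any function returning the ±10 N action, for example the action network sketched earlier.

```python
import math

GRAVITY, M_CART, M_POLE, HALF_LEN, TAU = 9.8, 1.0, 0.1, 0.5, 0.02  # 0.1 kg pole mass assumed

def cart_pole_step(state, force):
    """One Euler step of the standard cart-pole equations of motion."""
    theta, theta_dot, x, x_dot = state
    total_m = M_CART + M_POLE
    pm_l = M_POLE * HALF_LEN
    temp = (force + pm_l * theta_dot ** 2 * math.sin(theta)) / total_m
    theta_acc = (GRAVITY * math.sin(theta) - math.cos(theta) * temp) / (
        HALF_LEN * (4.0 / 3.0 - M_POLE * math.cos(theta) ** 2 / total_m))
    x_acc = temp - pm_l * theta_acc * math.cos(theta) / total_m
    return (theta + TAU * theta_dot, theta_dot + TAU * theta_acc,
            x + TAU * x_dot, x_dot + TAU * x_acc)

def disturbance_test(controller, seconds=40.0):
    """Apply an extra f = 20 N push at the 10th second and log the pole angle."""
    state, log = (0.0, 0.0, 0.0, 0.0), []
    for k in range(round(seconds / TAU)):
        force = controller(state)                 # +/-10 N from the action network
        if abs(k * TAU - 10.0) < TAU / 2:         # disturbance applied for one step (assumed)
            force += 20.0
        state = cart_pole_step(state, force)
        log.append(math.degrees(state[0]))        # angular deviation in degrees
    return log
```

For instance, `disturbance_test(lambda s: action(best_string, s))` would log the angular deviation of the best string found after 50 generations (with `action` as sketched above).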
Figure 9. Angular deviation of the pole resulted by a trained GRNN with FIT(t) = 1/|r̂(t)| when the half-length of the pole is reduced from 0.5 m to 0.25 m (pole angle in degrees versus time in seconds).

Figure 10. Angular deviation of the pole resulted by a trained GRNN with FIT(t) = 1/|r̂(t)| when the cart mass is doubled from 1 kg to 2 kg (pole angle in degrees versus time in seconds).

Figure 11. Angular deviation of the pole resulted by a trained GRNN with FIT(t) = 1/|r̂(t)| × t/TIME (pole angle in degrees versus time in seconds).
In another set of simulations we use the second fitness function defined in (9), i.e.

FIT(t) = (1/|r̂(t)|) × (t/TIME),
to train the GRNN. The goal of GA learning in the action network is to maximize this fitness function. The same learning parameters are used to train the GRNN in this case: the learning rate η = 0.1, the population size POP = 200, the time limit TIME = 100, and the generation size G = 50. Fig. 11 shows the pole position when the cart-pole system was controlled by a well-trained GRNN starting at the initial state θ(0) = 0, θ̇(0) = 0, x(0) = 0, ẋ(0) = 0. We also test the adaptation capability of the trained GRNN with the second fitness function. The curve of angular deviations in Fig. 12 shows that the GRNN brought the pole back to the centre position quickly after the disturbance f = 20 N was given at the 10th second. Figs 13 and 14 show, respectively, the simulation results produced by the trained GRNN when we reduced the pole length to l = 0.25 m and doubled the mass of the cart to m = 2 kg. The results also show the good control and adaptation capabilities of the trained GRNN with the second fitness function in (9) in the cart-pole balancing system. Comparing Fig. 13 with Fig. 9, we find that the GRNN trained with the second fitness function in (9) is more robust than that trained with the first fitness function in (8).
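A direct reading of (9) as code, assuming r_hat denotes the critic's internal reinforcement signal at step t and TIME the evaluation time limit; the guard `eps` is ours:

```python
def fitness_9(r_hat, t, TIME=100, eps=1e-6):
    """Second fitness function (9): FIT(t) = (1 / |r_hat(t)|) * (t / TIME).

    Compared with the first fitness function (8) = 1 / |r_hat(t)|, the extra
    factor t / TIME scales the score by how far into the time limit the string
    has survived, which favours strings that keep the pole up longer.
    """
    return (1.0 / (abs(r_hat) + eps)) * (t / TIME)
```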
In other simulations we tried to reduce the time limit TIME to 50 from 100 and kept the other learning parameters unchanged. The computer simulation results also show the good control and adaptation capabilities of the trained GRNN in the cart-pole balancing system. However, on average, the learned best action network does not keep the angular deviation as small as in the above simulations using TIME = 100.
Figure 12. Angular deviation of the pole resulted by a trained GRNN with FIT(t) = 1/|r̂(t)| × t/TIME after a disturbance is given (pole angle in degrees versus time in seconds).

Figure 14. Angular deviation of the pole resulted by a trained GRNN with FIT(t) = 1/|r̂(t)| × t/TIME when the cart mass is doubled from 1 kg to 2 kg (pole angle in degrees versus time in seconds).
Figure 13. Angular deviation of the pole resulted by a trained GRNN with FIT(t) = 1/|r̂(t)| × t/TIME when the half-length of the pole is reduced from 0.5 m to 0.25 m (pole angle in degrees versus time in seconds).

6. Conclusion

This paper describes a genetic reinforcement neural network (GRNN) to solve various reinforcement learning problems. By combining the temporal difference technique, the gradient descent method and genetic algorithms (GAs), a novel hybrid genetic algorithm called the GA-based reinforcement learning algorithm was derived for the GRNN. This learning algorithm possesses the advantage of other hybrid GAs: hybridizing a GA with algorithms currently in use can produce an algorithm better than both the GA and the current algorithms. The proposed reinforcement learning algorithm is superior to the normal reinforcement learning schemes without GAs in global optimization capability, and superior to the normal GAs without the temporal difference technique in learning efficiency for reinforcement learning problems. Using the proposed connectionist structure and learning algorithm, a neural network controller that controls a plant and a neural network predictor that models the plant can be set up according to a simple reinforcement signal. The proposed GRNN makes the design of neural network controllers more practical for real-world applications, because it greatly lessens the quality and quantity requirements of the teaching signals and reduces the long training time of a pure GA approach. Computer simulations of the cart-pole balancing problem satisfactorily verified the validity and performance of the proposed GRNN.

References

ADLER, D., 1993, Genetic algorithms and simulated annealing: a marriage proposal. Proceedings of the IEEE International Conference on Neural Networks, San Francisco, CA, Vol. II, pp. 1104-1109.

ANDERSON, C. W., 1986, Learning and problem solving with multilayer connectionist systems. PhD thesis, University of Massachusetts; 1987, Strategy learning with multilayer connectionist representations. Proceedings of the Fourth International Workshop on Machine Learning, Irvine, CA, pp. 103-114.

BARTO, A. G., and ANANDAN, P., 1985, Pattern-recognizing stochastic learning automata. IEEE Transactions on Systems, Man and Cybernetics, 15, 360-375.
BARTO, A. G., and JORDAN, M. I., 1987, Gradient following without backpropagation in layered networks. Proceedings of the International Joint Conference on Neural Networks, San Diego, CA, Vol. II, pp. 629-636.

BARTO, A. G., SUTTON, R. S., and ANDERSON, C. W., 1983, Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, 13, 834-847.

BERENJI, H. R., and KHEDKAR, P., 1992, Learning and tuning fuzzy logic controllers through reinforcements. IEEE Transactions on Neural Networks, 3, 724-740.

DAVIS, L., 1991, Handbook of Genetic Algorithms (New York: Van Nostrand Reinhold).

GOLDBERG, D. E., 1989, Genetic Algorithms in Search, Optimization and Machine Learning (Reading, MA: Addison-Wesley).

HARP, S., SAMAD, T., and GUHA, A., 1990, Designing application-specific neural networks using the genetic algorithm. Neural Information Processing Systems, Vol. 2 (San Mateo, CA: Morgan Kaufmann).

HOLLAND, J. H., 1962, Outline for a logical theory of adaptive systems. Journal of the Association for Computing Machinery, 3, 297-314; 1975, Adaptation in Natural and Artificial Systems (Ann Arbor, MI: University of Michigan Press).

LAWLER, E. L., 1976, Combinatorial Optimization: Networks and Matroids (New York: Holt, Rinehart and Winston).

LUENBERGER, D. G., 1976, Linear and Nonlinear Programming (Reading, MA: Addison-Wesley).

MICHALEWICZ, Z., and KRAWCZYK, J. B., 1992, A modified genetic algorithm for optimal control problems. Computers and Mathematics with Applications, 23, 83-94.

MONTANA, D., and DAVIS, L., 1989, Training feedforward neural networks using genetic algorithms. Proceedings of the International Joint Conference on Artificial Intelligence, pp. 762-767.

MORIARTY, D. E., and MIIKKULAINEN, R., 1996, Efficient reinforcement learning through symbiotic evolution. Machine Learning, 22, 11-32.

PETRIDIS, V., KAZARLIS, S., PAPAIKONOMOU, A., and FILELIS, A., 1992, A hybrid genetic algorithm for training neural networks. Artificial Neural Networks 2, edited by I. Aleksander and J. Taylor (North-Holland), pp. 953-956.

RUMELHART, D., HINTON, G., and WILLIAMS, R. J., 1986, Learning internal representations by error propagation. Parallel Distributed Processing, edited by D. Rumelhart and J. McClelland (Cambridge, MA: MIT Press), pp. 318-362.

SCHAFFER, J. D., CARUANA, R. A., and ESHELMAN, L. J., 1990, Using genetic search to exploit the emergent behavior of neural networks. Physica D, 42, 244-248.

SUTTON, R. S., 1984, Temporal credit assignment in reinforcement learning. PhD thesis, University of Massachusetts, Amherst, MA, USA; 1988, Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.

TSINAS, L., and DACHWALD, B., 1994, A combined neural and genetic learning algorithm. Proceedings of the IEEE International Conference on Neural Networks, Vol. I, pp. 770-774.

WERBOS, P. J., 1990, A menu of designs for reinforcement learning over time. Neural Networks for Control, edited by W. T. Miller, III, R. S. Sutton and P. J. Werbos (Cambridge, MA: MIT Press), Chapter 3.

WHITLEY, D., DOMINIC, S., DAS, R., and ANDERSON, C. W., 1993, Genetic reinforcement learning for neurocontrol problems. Machine Learning, 13, 259-284.

WHITLEY, D., STARKWEATHER, T., and BOGART, C., 1990, Genetic algorithms and neural networks: optimizing connections and connectivity. Parallel Computing, 14, 347-361.

WILLIAMS, R. J., 1987, A class of gradient-estimating algorithms for reinforcement learning in neural networks. Proceedings of the International Joint Conference on Neural Networks, San Diego, CA, Vol. II, pp. 601-608.