IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 6, JUNE 2013 913

Adaptive Learning in Tracking Control Based on the Dual Critic Network Design

Zhen Ni, Haibo He, Senior Member, IEEE, and Jinyu Wen, Member, IEEE

Abstract— In this paper, we present a new adaptive dynamic programming approach by integrating a reference network that provides an internal goal representation to help the system's learning and optimization. Specifically, we build the reference network on top of the critic network to form a dual critic network design that contains the detailed internal goal representation to help approximate the value function. This internal goal signal, working as the reinforcement signal for the critic network in our design, is adaptively generated by the reference network and can also be adjusted automatically. In this way, we provide an alternative choice rather than crafting the reinforcement signal manually from prior knowledge. In this paper, we adopt the online action-dependent heuristic dynamic programming (ADHDP) design and provide the detailed design of the dual critic network structure. Detailed Lyapunov stability analysis for our proposed approach is presented to support the proposed structure from a theoretical point of view. Furthermore, we also develop a virtual reality platform to demonstrate the real-time simulation of our approach under different disturbance situations. The overall adaptive learning performance has been tested on two tracking control benchmarks with a tracking filter. For comparative studies, we also present the tracking performance with the typical ADHDP, and the simulation results justify the improved performance with our approach.

Index Terms— Adaptive critic design (ACD), adaptive dynamic programming (ADP), internal goal, Lyapunov stability analysis, online learning, reinforcement learning, tracking control, virtual reality.

I. INTRODUCTION

DEVELOPING brain-like intelligence has become a cutting-edge research topic in the computational intelligence field. While the latest research on adaptive dynamic programming (ADP)/adaptive critic design (ACD) [1]–[6] has shown great success on machine intelligence and applications, there are still many grand challenges in reaching truly brain-like intelligence. One of the most important questions is how to design a general system that can learn and optimize adaptively

Manuscript received November 29, 2011; revised November 28, 2012; accepted February 10, 2013. Date of publication March 7, 2013; date of current version April 5, 2013. This work was supported by the National Science Foundation under grant CAREER ECCS 1053717 and grant CNS 1117314, the Army Research Office under grant W911NF-12-1-0378, the Defense Advanced Research Projects Agency under grant FA8650-11-1-7148, and the National Natural Science Foundation of China under grant 51228701.

Z. Ni and H. He are with the Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881 USA (e-mail: [email protected]; [email protected]).

J. Wen is with the College of Electrical and Electronic Engineering, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2013.2247627

and efficiently like the human brain to achieve the final goal over time [7], [8]. In this paper, we propose a novel structure based on heuristic dynamic programming (HDP) to tackle this problem.

In recent years, extensive efforts have been dedicated to developing machine intelligence, and ADP is one of the most critical approaches to (hopefully) bring such a level of intelligence closer to reality [8]–[10]. Generally speaking, ADP can be categorized into three typical structures [11], [12].

1) HDP, which was proposed to provide a control action sequence with the action network and use the critic network to approximate the value function. For instance, in the online HDP version in [13], the critic network was designed to critique the generated action by propagating a temporal difference (TD) between two consecutive estimates of the value function.

2) Dual heuristic programming (DHP), the key point of which is to employ the critic network to approximate the derivatives of the value function with respect to the state vectors.

3) Globalized dual heuristic programming (GDHP), which takes advantage of both HDP and DHP by employing the critic network to approximate both the value function and its derivatives with respect to the state vectors.

Moreover, various versions have been developed based on these typical structures, such as the action-dependent (AD) version [13]–[15], obtained by taking the control action as one of the inputs for the critic network, and the model-dependent version [16], [17], obtained by taking the estimates of the model network as part of the inputs for the critic network.

Currently, many implementations of these ADP designs have been tested on both mathematical benchmarks and engineering applications, such as the infinite-time optimal tracking control with the greedy HDP algorithm in [18]–[20], the HDP controller for nonlinear discrete-time tracking problems in [21]–[25], and the engine torque and exhaust air–fuel ratio tracking control based on AD HDP in [26]. In [27]–[29], the looper system control in the iron industry was improved with the Levenberg–Marquardt (LM) algorithm instead of backpropagation as the weight-updating rule in the neural networks, and in [16] and [30]–[33] the turbogenerator/power system control was improved with HDP and DHP controllers, among many others [34]–[39].

Although there are many successful applications with ADP approaches, many of them use either a binary reinforcement signal ("0" and "−1") or some specific discrete reinforcement signals ("0," "−0.4," and "−1") [13], [21], [22]. Some others crafted the reinforcement signal manually with past experience or prior knowledge. For instance, in [11], the square of the difference between the reference signal and the actual output of the system is set as the reinforcement signal. In [40], the reinforcement signal is defined with different weights for various components, with some of the error components even having nonlinear coefficients. Therefore, an intuitive question is: can the reinforcement signal be assigned without any (or minimum) prior knowledge or human intervention and be adjusted adaptively when the operation environment changes?

In this paper, we propose a novel ADP structure with dual critic networks that provide the internal goal representation adaptively according to the system's behavior, to tackle this question. Specifically, we introduce one reference network on top of the typical ADP design, through which we maintain the advantage of a model-free AD structure [13]. In this way, we build one reference network on top of the critic network to work as an integrated dual critic network structure, where the reference network provides the critic network with an internal reinforcement signal. Unlike the binary reinforcement signal from the external environment, this internal signal is continuous and bounded. It not only works as one of the input vectors of the critic network but also contributes to the parameter tuning for the critic network. Moreover, it can be adjusted adaptively and automatically by the reference network according to the system states and the control action. As this internal signal is not crafted manually or with any prior knowledge, we believe that our proposed dual critic network design can provide useful suggestions on how to adaptively and automatically assign a proper reinforcement signal for a general system to achieve the final goal over time. We would like to note that, although there are many successful applications on tracking control with adaptive/fuzzy controllers [41]–[43], many such approaches require an accurate system model to achieve the control or tracking performance. However, in real-world complex problems, it is not uncommon that such an accurate system model is very difficult, or sometimes even impossible, to obtain due to the complexity of the system. Therefore, we consider our proposed dual critic model-free ADP approach to be particularly useful for such complex systems.

Motivated by our previous work [7], [44], we extend our idea to tracking control problems with several major new contributions. First, we integrate a tracking filter in this paper to handle the tracking problem rather than the balancing problem. Second, we develop the Lyapunov stability analysis to provide theoretical support for our method, which is important for understanding its convergence. Third, we develop a virtual reality (VR) platform to demonstrate the performance of our method with active interaction with the external environment. Finally, we conduct many more experiments, including two types of noises/disturbances, to demonstrate the system's performance. We would like to note that the terminology of "reference network" used in this paper is similar to the "goal network" and "goal generator network" discussed in several of our recent papers [15], [45], and [25]. The underlying principle of both the reference

Fig. 1. Architecture design of the proposed HDP approach with a tracking filter.

Fig. 2. Description of the dual critic network.

network and the goal network/goal generator network is the same, i.e., to provide a detailed internal goal representation to facilitate learning, optimization, and control.

The rest of this paper is organized as follows. Section II provides the architecture description of the dual critic network and the tracking filter design. Section III presents the implementation of the dual critic network HDP design with a tracking filter, together with its associated learning algorithms. The detailed experiment setup and simulation results are given in Sections IV and V, respectively. For both cases, we provide comparative studies between our approach and the typical HDP approach under the same settings. Detailed Lyapunov stability analysis is presented in Section VI. Finally, Section VII concludes this work with some discussions on future research directions. The detailed pseudocode for the implementation of our method is given in the Appendix.

II. HDP WITH A DUAL CRITIC NETWORK FOR NONLINEAR TRACKING CONTROL

The schematic diagram of our proposed idea is presented in Fig. 1. The action network is kept the same as in [13], [21]. For the critic network, we integrate it with one reference network, and therefore there are two networks in the critic network block, as presented in Fig. 2. The tracking filter is added here to show the performance on the tracking control problem. In the rest of this section, we will introduce the dual critic network block and the tracking filter.

A. Dual Critic Network in ADP Design

The motivation of this dual critic network design is twofold: one is to provide an internal goal representation for the critic network; the other is to help approximate the cost-to-go in detail, since the internal reinforcement signal works as one of the input vectors for the critic network.

From the system-level view in Fig. 1, we can see that the parameters in the dual critic network block can not only be tuned by an external signal but also be adjusted by the block itself. Specifically, the reference network at the top of the block is tuned by the error function with the external reward signal, while the critic network at the bottom of the block is tuned by the error function with the internal reinforcement signal. As presented in Fig. 2, the reference network observes a regular reward signal r(k) (usually a binary value) from the external environment and provides the critic network with a detailed internal reinforcement signal s(k) (usually a continuous value) by evaluating the system state vector X(k) and the control action u(k). In order to approximate the value function J(k) well, the critic network keeps the same inputs as the reference network in addition to the internal signal s(k). Moreover, s(k) also contributes to the error function of the critic network, as the dashed line inside the block shows. Since s(k) can be automatically adjusted according to the state vector X(k) and the control action u(k), we regard it as an adaptive reinforcement signal. In summary, the key idea of our dual critic design is to use the reference network to automatically and adaptively generate the internal goal signal, rather than hand-crafting it as in traditional ADP approaches, to guide the decision-making process toward the optimal action at any time instance so as to accomplish the final goal. This reference network can also actively interact with the critic network and the action network, either directly (for the critic network) or indirectly (for the action network), to support the action selection in a principled way. We will further present the detailed learning architecture and algorithm for such a dual structure design in Section III-B.

B. Tracking Filter

In order to demonstrate the improvement of our proposed dual-critic controller, we test it on a nonlinear tracking control problem with a tracking filter, as presented in Fig. 1. The inner structure of the tracking filter is presented in Fig. 3; it originates from [46] and was later developed in [21] and [22]. To be clear, we introduce the design of the tracking filter as follows.

The nonlinear system is defined in a general form as

$$x(k+1) = f(x(k)) + u(k) + d(k) \quad (1)$$

where $f(x(k))$ is the nonlinear system function, $x(k) = [x_1(k)\ x_2(k)\ x_3(k)\ \cdots\ x_n(k)]$, and $x_i(k) \in \mathbb{R}$ is the state value of the $i$th dimension at time instance $k$. $u(k)$ is

Fig. 3. Description of the tracking filter.

the control action, and $d(k)$ is the disturbance, bounded in $[-d_m, d_m]$, where $d_m$ is a constant value.

The nonlinear system function $f(x(k))$ is assumed to be unknown in the simulation and is approximated by the action network here. The approximation is denoted as $\hat{f}(x(k))$, which works as one of the inputs of the filter. Moreover, the inputs of the filter also include the current state vector $X(k)$ and the desired trajectory values $x_d(k)$ and $x_d(k+1)$, as presented in Fig. 3. The error $e(k)$ is defined as the difference between the current state value and the desired value as follows:

$$e(k) = x(k) - x_d(k) \quad (2)$$

and the filtered tracking error $\bar{e}(k)$ is defined as

$$\bar{e}(k) = e_n(k) + \lambda_1 e_{n-1}(k) + \cdots + \lambda_{n-1} e_1(k) \quad (3)$$

where $e_{n-1}(k), \ldots, e_1(k)$ are the past error values, i.e., $e_{n-i}(k) = e_n(k-i)$, $i = 0, 1, \ldots, n-1$, $n \in \mathbb{N}$. For brevity, we define $\Lambda = [\lambda_{n-1}, \lambda_{n-2}, \ldots, \lambda_1]$, where $\lambda_i \in \mathbb{R}$. Therefore, (3) can be rewritten as

$$\bar{e}(k) = [\Lambda \;\; I]\, e(k). \quad (4)$$

Again, we can rewrite (3) for time instance $k+1$ as

$$\bar{e}(k+1) = e_n(k+1) + \lambda_1 e_{n-1}(k+1) + \cdots + \lambda_{n-1} e_1(k+1). \quad (5)$$

Substituting (1) into (5), we get

$$\bar{e}(k+1) = f(x(k)) - x_d(k+1) + \lambda_1 e_{n-1}(k+1) + \cdots + \lambda_{n-1} e_1(k+1) + u(k) + d(k)$$
$$= f(x(k)) - x_d(k+1) + \lambda_1 e_n(k) + \cdots + \lambda_{n-1} e_2(k) + u(k) + d(k). \quad (6)$$

Similar to [21] and [22], we define the control sequence

$$u(k) = x_d(k+1) - \hat{f}(x(k)) + K_v \bar{e}(k) - \lambda_1 e_n(k) - \cdots - \lambda_{n-1} e_2(k). \quad (7)$$

Substituting (7) into (6), we get

$$\bar{e}(k+1) = K_v \bar{e}(k) - \tilde{f}(x(k)) + d(k) \quad (8)$$


where $\tilde{f}(x(k))$ is the nonlinear system function approximation error, given by

$$\tilde{f}(x(k)) = \hat{f}(x(k)) - f(x(k)) \quad (9)$$

and $K_v$ is the gain. Assuming that $\tilde{f}(x(k))$ is bounded, the system will be stable if $0 < K_{v\max} < 1$, where $K_{v\max}$ is the maximum eigenvalue of $K_v$ [22].
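To make the filter concrete, the following Python sketch computes the filtered tracking error (3) and the control action (7) from a stored error history. It is a minimal illustration only; the function and variable names (tracking_filter, err_hist, lam, kv) are ours, not from the paper.

```python
import numpy as np

def tracking_filter(x, x_d, x_d_next, f_hat, err_hist, lam, kv):
    """One step of the tracking filter, following (2), (3), and (7).

    x, x_d, x_d_next : current state value, desired values at k and k+1
    f_hat            : action-network estimate of f(x(k))
    err_hist         : the n most recent errors [e_1(k-1), ..., e_n(k-1)],
                       oldest first; the new error is appended here
    lam              : Lambda = [lambda_{n-1}, ..., lambda_1]
    kv               : gain K_v (0 < K_vmax < 1 for stability)
    """
    e = x - x_d                                         # (2) current error
    err_hist = np.append(err_hist[1:], e)               # shift: newest last
    e_bar = err_hist[-1] + np.dot(lam, err_hist[:-1])   # (3) filtered error
    u = x_d_next - f_hat + kv * e_bar - np.dot(lam, err_hist[1:])  # (7)
    return u, e_bar, err_hist
```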

III. IMPLEMENTATION OF HDP WITH DUAL CRITIC NETWORK FOR NONLINEAR TRACKING CONTROL

In this section, we provide detailed procedures on how to implement our proposed dual critic network HDP approach with a tracking filter. We also define the external reinforcement signal and the internal reinforcement signal here. A brief pseudocode (Algorithm 1) is provided in this section to clearly demonstrate the implementation of the dual critic network. Moreover, a detailed algorithm-level implementation (Algorithm 2) is presented in the Appendix to show the implementation of our proposed approach. The multilayer perceptron (MLP) neural network has been one of the most popular techniques for approximating nonlinear functions in the ADP community [11], [27], [12]. Therefore, we follow this trend and adopt the MLP in our ADP design as well.

A. External and Internal Reinforcement Signal

As mentioned previously, there are two types of reinforcement signal in our proposed approach. One is the external reinforcement signal, which comes from the environment; the other is the internal reinforcement signal, which comes from the reference network and works as an internal goal that guides the system's behavior specifically. We discuss these two signals in detail in the following.

The external reinforcement signal $r(k)$ is defined according to the current filtered tracking error $\bar{e}(k)$ as

$$r(k) = [r_1(k)\ r_2(k)\ \ldots\ r_m(k)] \in \mathbb{R}^m \quad (10)$$

with

$$r_i(k) = \begin{cases} 0, & \text{if } \|\bar{e}_i(k)\| \le c \\ -1, & \text{if } \|\bar{e}_i(k)\| > c \end{cases}, \quad i = 1, 2, 3, \ldots, m \quad (11)$$

where $\|\cdot\|$ represents the Euclidean vector 2-norm and $c$ is the constant threshold for the filtered tracking error. The binary value $0$ represents a good tracking performance, while $-1$ means a poor one.
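A minimal Python sketch of (10) and (11) follows; the component-wise comparison and the default threshold (the value c = 1.5e−3 that appears later in Table I) are the only assumptions.

```python
import numpy as np

def external_reinforcement(e_bar, c=1.5e-3):
    """External reinforcement r(k) per (10)-(11): 0 for each component of the
    filtered tracking error within the threshold c, and -1 otherwise."""
    e_bar = np.atleast_1d(e_bar)
    return np.where(np.abs(e_bar) <= c, 0.0, -1.0)
```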

The internal reinforcement signal $s(k)$ is the output of the reference network, bounded in $[-1, 1]$, and can be adaptively adjusted as the system states change. At the feedforward stage, the internal signal works as one of the inputs of the critic network. At the backward stage, the parameters of the critic network are tuned by the error function with $s(k)$. Therefore, the internal reinforcement signal $s(k)$ closely connects the reference network and the critic network as a whole block.

Fig. 4. Learning schematic in dual critic network.

B. Learning and Optimization of Dual Critic Network

Compared to the typical HDP design in [13], the learning and optimization of the critic network here is associated with the reference network, as presented in Fig. 4. At the forward stage, the reference network takes the state vector $X(k)$ and the control action $u(k)$ as inputs and provides an internal reinforcement signal $s(k)$ for the critic network. Then the critic network updates the cost-to-go signal $J(k)$. At the backward stage, the reference network is first tuned by the error function (15) with $r(k)$ and the updated cost-to-go signal $J(k)$. After this is done, the reference network provides the updated internal reinforcement signal $s(k)$ to the critic network, which is then adjusted with the error function (12). The learning process repeats until the terminal conditions are satisfied. The pseudocode in Algorithm 1 shows exactly this learning procedure.

The error function of the critic network is defined as follows:

$$e_c(k) = \alpha J(k) - [J(k-1) - s(k)], \qquad E_c(k) = \frac{1}{2} e_c^2(k) \quad (12)$$

where

$$J(k) = \omega_c^{(2)}(k) \cdot \phi\bigl(\omega_c^{(1)}(k) \cdot x_c(k)\bigr) \quad (13)$$

and $\omega_c^{(1)}(k)$ and $\omega_c^{(2)}(k)$ refer to the weights from the input to the hidden layer and from the hidden to the output layer in the critic network, respectively. $x_c(k)$ is the input vector of the critic network, and it contains the state vector $X(k)$, the control action $u(k)$, and the internal signal $s(k)$. $\phi$ stands for the sigmoid function that confines its argument to the range $[-1, 1]$. $s(k)$ is defined as

$$s(k) = \phi\Bigl(\omega_r^{(2)}(k) \cdot \phi\bigl(\omega_r^{(1)}(k) \cdot x_r(k)\bigr)\Bigr) \quad (14)$$

where $\omega_r^{(1)}(k)$ and $\omega_r^{(2)}(k)$ refer to the weights from the input to the hidden layer and from the hidden to the output layer in the reference network, respectively. $x_r(k)$ is the input vector of the reference network, and it contains the state vector $X(k)$ and the control action $u(k)$.

The error function of the reference network is defined as

$$e_r(k) = \alpha J(k) - [J(k-1) - r(k)], \qquad E_r(k) = \frac{1}{2} e_r^2(k). \quad (15)$$


Algorithm 1 Outline of the Implementation of the Dual Critic Network HDP


Fig. 5. (a) Neural network structure of the reference network. (b) Neural network structure of the critic network.

Given that the state vector $X(k)$ has $n$ elements and the control action $u(k)$ is a single control unit, the numbers of inputs for the reference network and the critic network are $(n+1)$ and $(n+2)$, as presented in Fig. 5(a) and (b), respectively. The chain backpropagation rule is employed for the neural networks to learn and adapt their weights.

1) Reference Network: The reference network is introduced here to provide an internal goal representation for the critic network. The internal goal $s(k)$ is defined as

$$s(k) = \frac{1 - e^{-l(k)}}{1 + e^{-l(k)}} \quad (16)$$

$$l(k) = \sum_{i=1}^{N_{rh}} w_{r_i}^{(2)}(k)\, y_i(k) \quad (17)$$

$$y_i(k) = \frac{1 - e^{-z_i(k)}}{1 + e^{-z_i(k)}}, \quad i = 1, \ldots, N_{rh} \quad (18)$$

$$z_i(k) = \sum_{j=1}^{n+1} w_{r_{i,j}}^{(1)}(k)\, x_{r_j}(k), \quad i = 1, \ldots, N_{rh} \quad (19)$$

where $z_i$ is the input of the $i$th hidden node and $y_i$ is the corresponding output of this hidden node after the sigmoid function, $l$ is the input to the output node, $N_{rh}$ is the number of hidden neurons in the reference network, and $x_{r_j}$ is the input vector of the reference network, which has $(n+1)$ input nodes as presented in Fig. 5(a).
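As a sketch of the feedforward pass (16)-(19), the small numpy class below maps [X(k), u(k)] to the internal goal s(k); the class name RefNet and the bipolar_sigmoid helper are our own illustrative choices, not the paper's code.

```python
import numpy as np

def bipolar_sigmoid(z):
    """phi(z) = (1 - exp(-z)) / (1 + exp(-z)), bounded in (-1, 1)."""
    return (1.0 - np.exp(-z)) / (1.0 + np.exp(-z))

class RefNet:
    """Reference network of (16)-(19): (n+1) inputs -> N_rh hidden nodes -> s(k)."""
    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.uniform(-1, 1, size=(n_hidden, n_inputs))  # w_r^(1)
        self.w2 = rng.uniform(-1, 1, size=n_hidden)              # w_r^(2)

    def forward(self, x_r):
        z = self.w1 @ x_r          # (19) hidden-node inputs
        y = bipolar_sigmoid(z)     # (18) hidden-node outputs
        l = self.w2 @ y            # (17) input to the output node
        s = bipolar_sigmoid(l)     # (16) internal goal signal
        return s, (z, y, l)        # cache the intermediates for (20)-(23)
```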

The procedure of the backpropagation rule applied to the reference network is illustrated below.

1) $\Delta w_r^{(2)}$: reference network weight adjustment from the hidden layer to the output layer

$$\Delta w_{r_i}^{(2)}(k) = \eta_r(k)\left[-\frac{\partial E_r(k)}{\partial w_{r_i}^{(2)}(k)}\right] \quad (20)$$

where $\eta_r(k)$ is the learning rate of the reference network at time instance $k$, and

$$\frac{\partial E_r(k)}{\partial w_{r_i}^{(2)}(k)} = \frac{\partial E_r(k)}{\partial J(k)} \frac{\partial J(k)}{\partial s(k)} \frac{\partial s(k)}{\partial l(k)} \frac{\partial l(k)}{\partial w_{r_i}^{(2)}(k)} = \alpha e_r(k) \cdot \frac{1}{2}\bigl(1 - s^2(k)\bigr) \cdot y_i(k) \cdot \sum_{i=1}^{N_{ch}} \Bigl[ w_{c_i}^{(2)}(k)\, \frac{1}{2}\bigl(1 - p_i^2(k)\bigr)\, w_{c_{i,n+2}}^{(1)}(k) \Bigr]. \quad (21)$$

2) $\Delta w_r^{(1)}$: reference network weight adjustment from the input layer to the hidden layer

$$\Delta w_{r_{i,j}}^{(1)}(k) = \eta_r(k)\left[-\frac{\partial E_r(k)}{\partial w_{r_{i,j}}^{(1)}(k)}\right] \quad (22)$$

$$\frac{\partial E_r(k)}{\partial w_{r_{i,j}}^{(1)}(k)} = \frac{\partial E_r(k)}{\partial J(k)} \frac{\partial J(k)}{\partial s(k)} \frac{\partial s(k)}{\partial l(k)} \frac{\partial l(k)}{\partial y_i(k)} \frac{\partial y_i(k)}{\partial z_i(k)} \frac{\partial z_i(k)}{\partial w_{r_{i,j}}^{(1)}(k)}$$
$$= \alpha e_r(k) \cdot \frac{1}{2}\bigl(1 - y_i^2(k)\bigr) \cdot x_{r_j}(k) \cdot \frac{1}{2}\bigl(1 - s^2(k)\bigr) \cdot w_{r_i}^{(2)}(k) \cdot \sum_{i=1}^{N_{ch}} \Bigl[ w_{c_i}^{(2)}(k)\, \frac{1}{2}\bigl(1 - p_i^2(k)\bigr)\, w_{c_{i,n+2}}^{(1)}(k) \Bigr]. \quad (23)$$

Once the internal goal $s(k)$ is updated in the reference network, we can proceed with the weight tuning in the critic network.

2) Critic Network: In the literature, the critic network is applied to approximate the cost function, and its inputs normally contain the state vector $X(k)$ and the control unit $u(k)$. Here we add one more input, the internal goal $s(k)$, and expect that $s(k)$ can provide the critic network with a detailed goal representation that contributes to the system's decision making. The cost-to-go signal $J(k)$ is defined as

$$J(k) = \sum_{i=1}^{N_{ch}} w_{c_i}^{(2)}(k)\, p_i(k) \quad (24)$$

$$p_i(k) = \frac{1 - e^{-q_i(k)}}{1 + e^{-q_i(k)}}, \quad i = 1, \ldots, N_{ch} \quad (25)$$



Fig. 6. (a) Neural network structure of the action network. (b) Control action generator.

$$q_i(k) = \sum_{j=1}^{n+2} w_{c_{i,j}}^{(1)}(k)\, x_{c_j}(k), \quad i = 1, \ldots, N_{ch} \quad (26)$$

where $q_i$ and $p_i$ are the input and output of the $i$th hidden node in the critic network, respectively, and $x_{c_j}$ is the input vector of the critic network with $n+2$ nodes, as presented in Fig. 5(b).

The procedure of the backpropagation rule applied to the critic network is provided in the following.

1) $\Delta w_c^{(2)}$: critic network weight adjustment from the hidden layer to the output layer

$$\Delta w_{c_i}^{(2)}(k) = \eta_c(k)\left[-\frac{\partial E_c(k)}{\partial w_{c_i}^{(2)}(k)}\right] \quad (27)$$

where $\eta_c(k)$ is the learning rate of the critic network at time instance $k$, and

$$\frac{\partial E_c(k)}{\partial w_{c_i}^{(2)}(k)} = \frac{\partial E_c(k)}{\partial J(k)} \frac{\partial J(k)}{\partial w_{c_i}^{(2)}(k)} = \alpha e_c(k) \cdot p_i(k). \quad (28)$$

2) $\Delta w_c^{(1)}$: critic network weight adjustment from the input layer to the hidden layer

$$\Delta w_{c_{i,j}}^{(1)}(k) = \eta_c(k)\left[-\frac{\partial E_c(k)}{\partial w_{c_{i,j}}^{(1)}(k)}\right] \quad (29)$$

$$\frac{\partial E_c(k)}{\partial w_{c_{i,j}}^{(1)}(k)} = \frac{\partial E_c(k)}{\partial J(k)} \frac{\partial J(k)}{\partial p_i(k)} \frac{\partial p_i(k)}{\partial q_i(k)} \frac{\partial q_i(k)}{\partial w_{c_{i,j}}^{(1)}(k)} = \alpha e_c(k) \cdot w_{c_i}^{(2)}(k) \cdot \frac{1}{2}\bigl(1 - p_i^2(k)\bigr)\, x_{c_j}(k). \quad (30)$$
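The chain-rule expressions (27)-(30) collapse to a few lines of numpy, sketched below for one gradient step of the critic; the function assumes the forward pass of (24)-(26) has already produced the hidden outputs p and the value J, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def critic_update(w1_c, w2_c, x_c, p, J, J_prev, s, alpha, eta_c):
    """One gradient step of the critic network following (12) and (27)-(30).

    w1_c : (N_ch, n+2) input-to-hidden weights; w2_c : (N_ch,) hidden-to-output
    x_c  : critic input [X(k), u(k), s(k)]; p : hidden outputs p_i(k)
    J, J_prev : J(k) and J(k-1); s : internal reinforcement s(k)
    """
    e_c = alpha * J - (J_prev - s)                                     # (12)
    grad_w2 = alpha * e_c * p                                          # (28)
    grad_w1 = np.outer(alpha * e_c * w2_c * 0.5 * (1.0 - p**2), x_c)   # (30)
    w2_c = w2_c - eta_c * grad_w2                                      # (27)
    w1_c = w1_c - eta_c * grad_w1                                      # (29)
    return w1_c, w2_c, e_c
```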

C. Interaction of the Action Network and the Tracking Filter

The control action in this paper is generated by the tracking filter, as shown in Fig. 6(b). The action network here is used to approximate the nonlinear system function $f(x(k))$, and the approximation error $\tilde{f}(x(k))$ is added to the error function of the action network. Note that $\tilde{f}(x(k))$ cannot be obtained directly from (9), since $f(x(k))$ is assumed unknown in this paper. The model network is commonly used in ADP/ACD designs for tracking control [18], [26]. Here we apply the same technique to predict the state vector $x(k+1)$ and obtain $\hat{\bar{e}}(k+1)$. Then, $\tilde{f}(x(k))$ can be obtained from (8).

The error function of the action network is defined as follows:

$$e_a(k) = J(k) + \tilde{f}(k), \qquad E_a(k) = \frac{1}{2} e_a^2(k). \quad (31)$$

The nonlinear system function approximation $\hat{f}(x(k))$ is obtained as follows:

$$\hat{f}(x(k)) = \frac{1 - e^{-v(k)}}{1 + e^{-v(k)}} \quad (32)$$

$$v(k) = \sum_{i=1}^{N_{ah}} w_{a_i}^{(2)}(k)\, g_i(k) \quad (33)$$

$$g_i(k) = \frac{1 - e^{-h_i(k)}}{1 + e^{-h_i(k)}}, \quad i = 1, \ldots, N_{ah} \quad (34)$$

$$h_i(k) = \sum_{j=1}^{n} w_{a_{i,j}}^{(1)}(k)\, x_{a_j}(k), \quad i = 1, \ldots, N_{ah} \quad (35)$$

where $h_i$ and $g_i$ are the input and output of the $i$th hidden node in the action network, $v$ is the input to the output node, $\hat{f}$ is the output of the action network, and $N_{ah}$ is the total number of hidden nodes in the action network. $x_{a_j}$ is the input vector of the action network, as presented in Fig. 6(a). Note that the weight tuning of the action network should take the tracking filter into account as well.

1) $\Delta w_a^{(2)}$: action network weight adjustment from the hidden layer to the output layer

$$\Delta w_{a_i}^{(2)}(k) = \eta_a(k)\left[-\frac{\partial E_a(k)}{\partial w_{a_i}^{(2)}(k)}\right] \quad (36)$$

where $\eta_a(k)$ is the learning rate of the action network at time instance $k$, and

$$\frac{\partial E_a(k)}{\partial w_{a_i}^{(2)}(k)} = \frac{\partial E_a(k)}{\partial J(k)} \frac{\partial J(k)}{\partial u(k)} \frac{\partial u(k)}{\partial v(k)} \frac{\partial v(k)}{\partial w_{a_i}^{(2)}(k)} \quad (37)$$

$$= e_a(k) \Bigl[-\frac{1}{2}\bigl(1 - \hat{f}^2(k)\bigr)\Bigr] g_i(k) \cdot \sum_{i=1}^{N_{ch}} w_{c_i}^{(2)}(k) \Bigl[\frac{1}{2}\bigl(1 - p_i^2(k)\bigr)\Bigr] w_{c_{i,n+1}}^{(1)}(k). \quad (38)$$

2) $\Delta w_a^{(1)}$: action network weight adjustment from the input layer to the hidden layer

$$\Delta w_{a_{i,j}}^{(1)}(k) = \eta_a(k)\left[-\frac{\partial E_a(k)}{\partial w_{a_{i,j}}^{(1)}(k)}\right] \quad (39)$$

$$\frac{\partial E_a(k)}{\partial w_{a_{i,j}}^{(1)}(k)} = \frac{\partial E_a(k)}{\partial J(k)} \frac{\partial J(k)}{\partial u(k)} \frac{\partial u(k)}{\partial v(k)} \frac{\partial v(k)}{\partial g_i(k)} \frac{\partial g_i(k)}{\partial h_i(k)} \frac{\partial h_i(k)}{\partial w_{a_{i,j}}^{(1)}(k)} \quad (40)$$

$$= \sum_{l=1}^{N_{ch}} \Bigl( w_{c_l}^{(2)}(k)\, w_{c_{l,n+1}}^{(1)}(k) \Bigl[\frac{1}{2}\bigl(1 - p_l^2(k)\bigr)\Bigr] \Bigr)\, e_a(k) \times \Bigl\{ w_{a_i}^{(2)}(k)\, x_{a_j}(k) \Bigl[\frac{1}{2}\bigl(1 - \hat{f}^2(k)\bigr)\Bigr] \Bigl[\frac{1}{2}\bigl(1 - g_i^2(k)\bigr)\Bigr] \Bigr\}. \quad (41)$$


TABLE I
SUMMARY OF THE PARAMETERS USED IN SIMULATION STUDY ONE

Para.  | ηc     | ηa     | ηr      | Kv    | λ    | c
Value  | 5e−3   | 8e−3   | 2.5e−3  | 0.1   | 0.2  | 1.5e−3
Para.  | Nc     | Na     | Nr      | Tc    | Ta   | Tr     | α
Value  | 40     | 150    | 50      | 1e−4  | 1e−5 | 1e−5   | 0.95

We note that, similar to [13] and [27], normalization of the weights is employed during the learning and adaptation for all the networks used here. The weights are confined to the proper range by

$$w_r(k+1) = \frac{w_r(k) + \Delta w_r(k)}{a}, \quad a = \max\bigl(|a_{ij}|\bigr),\ \forall a_{ij} \in w_r(k) + \Delta w_r(k) \quad (42)$$

$$w_c(k+1) = \frac{w_c(k) + \Delta w_c(k)}{b}, \quad b = \max\bigl(|b_{ij}|\bigr),\ \forall b_{ij} \in w_c(k) + \Delta w_c(k) \quad (43)$$

$$w_a(k+1) = \frac{w_a(k) + \Delta w_a(k)}{c}, \quad c = \max\bigl(|c_{ij}|\bigr),\ \forall c_{ij} \in w_a(k) + \Delta w_a(k). \quad (44)$$
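A direct transcription of (42)-(44) for one weight matrix is sketched below; following the stability discussion in Section VI, the sketch only rescales when the largest weight magnitude actually exceeds 1, which is our reading of "confined to the proper range" rather than a detail stated explicitly in this excerpt.

```python
import numpy as np

def normalized_update(w, dw):
    """Weight update with the normalization of (42)-(44): add the increment,
    then divide by the largest absolute entry so the weights stay in [-1, 1].
    Rescaling only when the bound is exceeded is an implementation choice."""
    w_new = w + dw
    scale = np.max(np.abs(w_new))
    return w_new / scale if scale > 1.0 else w_new
```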

IV. SIMULATION STUDY ONE

In this section, we present two numerical simulations based on the same system function. The motivation is to compare the tracking performance of our proposed approach with that of the typical approach in [21], which originally comes from [13]. The system function is defined in the general nonlinear form as

$$x_1(k+1) = x_2(k)$$
$$x_2(k+1) = f(x(k)) + u(k) + d(k) \quad (45)$$

where

$$f(x(k)) = -\frac{4}{11} \cdot \frac{x_1(k)}{1 + x_2^2(k)} + \frac{2}{5}\, x_2(k) \quad (46)$$

with $f(x(k))$ assumed to be unknown in the tracking process. Instead, the approximation value $\hat{f}(x(k))$ can be obtained from the action network. $d(k)$ is the disturbance here.

A. Example One

The objective of Example 1 is to track a sinusoidal signal with some harmonic content using $x_2$. The desired signal is defined as $x_{2d} = \sin(\omega k T)\cos(2\omega k T + \tau)$, where $\omega = 0.2$ rad/s and $\tau = \pi/2$. We set the sample interval $T = 50$ ms and the total simulation time to 150 s. In the literature, noise or disturbance is normally added in the simulation to examine how robust the proposed approach is, as in [22], [47], and [48]. In this paper, we adopt a technique similar to that in [22]. A disturbance $d(k) = 1.5$ is introduced at $k = 1200$, corresponding to $t = 60$ s; otherwise it is set to 0. The input vector is defined as $X_{in}(k) = [e(k-1)\ e(k)\ x_{2d}(k-1)\ x_{2d}(k)]$, where $e(k) = x_2(k) - x_{2d}(k)$. The structures of the action network, the reference network, and the critic network are 4−4−1


Fig. 7. Typical tracking performance with our proposed approach.

(i.e., the network has four input nodes, four hidden nodes, and one output node), 5−4−1, and 6−4−1, respectively. The parameters used in the simulation are summarized in Table I, where

α: discount factor;
c: threshold on the mean square error;
ηc: initial learning rate of the critic network;
ηa: initial learning rate of the action network;
ηr: initial learning rate of the reference network;
Nc: internal cycle of the critic network;
Na: internal cycle of the action network;
Nr: internal cycle of the reference network;
Tc: internal training error threshold for the critic network;
Ta: internal training error threshold for the action network;
Tr: internal training error threshold for the reference network.

We note that the learning rates will drop once the tracking performance is "good" over time. Specifically, we compare the mean square error (MSE) expressed in (47) with a certain threshold c

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (x - x_d)^2 \quad (47)$$

where $x$ is the state vector, $x_d$ is the desired tracking signal, and $N$ is a preset integer.

If the MSE is less than the threshold, then the learning rates are divided by a certain number, and the new threshold is obtained by dividing the current one by another number correspondingly. This kind of evaluation is repeated during the whole tracking control process. The detailed implementation is presented in Algorithm 2 (lines 24 to 32).
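The learning-rate schedule just described might be sketched as follows; the divisors for the learning rates and for the threshold are not given in this excerpt (they appear in Algorithm 2, lines 24 to 32), so the factors of 2 below are placeholder assumptions.

```python
import numpy as np

def anneal_learning_rates(recent_errors, rates, threshold, rate_div=2.0, thr_div=2.0):
    """If the MSE of (47) over the last N samples falls below the current
    threshold, divide every learning rate and the threshold by fixed factors.
    rates is a dict such as {'eta_c': ..., 'eta_a': ..., 'eta_r': ...}."""
    mse = np.mean(np.square(recent_errors))      # (47)
    if mse < threshold:
        rates = {name: lr / rate_div for name, lr in rates.items()}
        threshold /= thr_div
    return rates, threshold, mse
```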

For a comparative study, we have conducted the simulation with both approaches under the same parameters and environment settings. The weights in the neural networks used in both approaches are randomly selected from [−1, 1]. The starting point of the state vector is (0, 1.5), which is the same for both approaches. The typical tracking performance with our proposed approach and with the typical HDP approach are



Fig. 8. Typical tracking performance with HDP approach.

presented in Figs. 7 and 8, respectively. From Fig. 7, we can see that the tracking signal (solid line) can exactly follow the desired signal (dashed line) within 1 s. In addition, the tracking signal can also quickly get back on the right track after the disturbance at 60 s. On the other hand, the tracking signal with the typical HDP approach can only follow the desired signal after 10 s. Moreover, the tracking signal needs more time (13 s) to get back to tracking the desired signal after the disturbance. From this example, we can see that our proposed approach shows not only a faster learning process than the typical HDP approach but also better robustness against the disturbance.

B. Example Two

In order to show the adaptiveness of the proposed approach, we conducted another numerical example to track a signal that changes from a saw signal to a square signal and finally to a sinusoidal signal. We used the same system function and environment settings as in Section IV-A, except that $d(k)$ was set to be white Gaussian noise with a standard deviation of 0.005 for all $k$. The objective is to track the desired signal with the state vector $x_2$. The desired tracking signal is defined as

$$x_{2d} = \begin{cases} A \cdot \Bigl(1 - \bigl|t - \tfrac{T_0}{2}\bigr|\Bigr), & 0 < t < T_0 \\ A \cdot \Bigl(1 - \bigl|t - \tfrac{3T_0}{2}\bigr|\Bigr), & T_0 < t < 2T_0 \\ -A, & 2T_0 < t < 3T_0 \ \text{or}\ 4T_0 < t < 5T_0 \\ A, & 3T_0 < t < 4T_0 \ \text{or}\ 5T_0 < t < 6T_0 \\ A \cdot \sin(\omega t), & 6T_0 < t < 7.5\,T_0 \end{cases} \quad (48)$$

where $t = kT$, $k$ is the step number, and $T$ is the sample time ($T = 50$ ms). $A = 0.95$ is the amplitude of the signal. $T_0 = 40$ s is taken as the time interval for which each signal lasts (i.e., the signal changes after every $T_0$).

The difficulty of this task is that the controller needs to learn to track the desired signal, which changes over time, under the white Gaussian noise. Fig. 9 shows the typical tracking performance with our proposed approach, and one can clearly see the good transient tracking performance when the signal changes. But for the typical HDP approach, we


Fig. 9. Typical tracking performance with our proposed approach.


Fig. 10. Typical tracking performance with HDP approach.

can see in Fig. 10 that the controller takes about 50 s to learn to follow the saw signal and also spends much time learning when the desired signal changes to the square signal. In addition, the controller requires about 20 s to catch up after the desired signal changes to the sinusoidal signal. This indicates that our proposed approach shows better adaptiveness to this tracking problem than the typical HDP approach under noisy conditions.

V. SIMULATION STUDY TWO

Instead of testing on two numerical cases, here we would like to evaluate our proposed approach on a continuous benchmark, i.e., the ball-and-beam tracking problem [49]. There are many versions of this benchmark, and in this paper we adopt the model presented in Fig. 11. The system contains a long beam that can be tilted by a servo or electric motor, with a ball rolling back and forth on top of the beam. In this system, the driver is located at the center of the beam. The angle of the beam with respect to the horizontal axis is measured by an incremental encoder, and the position of the ball can be obtained with cameras mounted on top of the system. Our proposed approach


Fig. 11. Schematic diagram of the ball-and-beam system.

will learn to track the desired signal with the position of the ball.

A. Problem Formulation

From [49], we can obtain the equations of motion from the Lagrange equation as follows:

$$\Bigl(m + \frac{I_b}{r^2}\Bigr)\ddot{x} + \bigl(m r^2 + I_b\bigr)\frac{1}{r}\,\ddot{\alpha} - m x \dot{\alpha}^2 = m g \sin\alpha \quad (49)$$

$$\bigl[m x^2 + I_b + I_\omega\bigr]\ddot{\alpha} + \bigl(2 m x \dot{x} + b l^2\bigr)\dot{\alpha} + K l^2 \alpha + \bigl(m r^2 + I_b\bigr)\frac{1}{r}\,\ddot{x} - m g x \cos\alpha = u l \cos\alpha \quad (50)$$

where
m = 0.0162 kg is the mass of the ball;
r = 0.02 m is the roll radius of the ball;
Ib = 4.32 × 10−5 kg·m² is the moment of inertia of the ball;
b = 1 Ns/m is the friction coefficient of the drive mechanics;
l = 0.48 m is the radius of force application;
lω = 0.5 m is the radius of the beam;
K = 0.001 N/m is the stiffness of the drive mechanics;
g = 9.8 N/kg is the gravitational acceleration;
Iω = 0.14025 kg·m² is the moment of inertia of the beam;
and u is the force of the drive mechanics.

In order to simplify the system model, we define $x_1 = x$ as the position of the ball, $x_2 = \dot{x}$ as the velocity of the ball, $x_3 = \alpha$ as the angle of the beam with respect to the horizontal axis, and $x_4 = \dot{\alpha}$ as the angular velocity of the beam. Therefore, the state vector can be defined as $X = [x_1\ x_2\ x_3\ x_4]$. In this way, the system functions (49) and (50) can be transformed into the following forms:

$$\Bigl(m + \frac{I_b}{r^2}\Bigr)\dot{x}_2 + \bigl(m r^2 + I_b\bigr)\frac{1}{r}\,\dot{x}_4 = m x_1 x_4^2 + m g \sin x_3 \quad (51)$$

$$\bigl(m r^2 + I_b\bigr)\frac{1}{r}\,\dot{x}_2 + \bigl[m x_1^2 + I_b + I_\omega\bigr]\dot{x}_4 = (u l + m g x_1)\cos x_3 - \bigl(2 m x_2 x_1 + b l^2\bigr) x_4 - K l^2 x_3. \quad (52)$$

TABLE II
SUMMARY OF THE PARAMETERS USED IN SIMULATION STUDY TWO

Para.  | ηc     | ηa     | ηr     | Kv    | λ    | c
Value  | 0.02   | 0.01   | 0.05   | 0.05  | 0.1  | 5e−5
Para.  | Nc     | Na     | Nr     | Tc    | Ta   | Tr     | α
Value  | 100    | 200    | 200    | 1e−4  | 1e−5 | 1e−5   | 0.95


Fig. 12. Typical tracking performance with our proposed approach.

To make it clearer, we substitute the specific values of all the parameters mentioned above and rewrite (51) and (52) as the approximate nonlinear state-space equations

$$\dot{x}_1 = x_2 \quad (53)$$
$$\dot{x}_2 = 1.717 \sin(x_3) \quad (54)$$
$$\dot{x}_3 = x_4 \quad (55)$$
$$\dot{x}_4 = -0.241\, x_4 + 0.157\, x_1 \cos(x_3) + 0.5 \cos(x_3) \cdot u. \quad (56)$$

The objective is to track the sinusoidal signal $x_{1d} = 0.1\sin(\omega t)$ with the position of the ball ($x_1$), where $\omega = 0.1$. This task requires the controller not only to keep the ball balanced on the beam but also to track the desired signal using the position of the ball ($x_1$). That is to say, if $x_1$ goes out of bounds ($[-0.48, 0.48]$ m), or $x_3$ exceeds the angle tolerance ($[-0.24, 0.24]$ rad), we reset the ball to the initial starting point ($[0\ 0\ 0\ 0]$). Since the learning process of the neural networks is continuous, we assume that the weights are carried over when the task is reset. Keen readers may also notice that this is a continuous-time benchmark rather than the discrete-time cases above. Here we apply the common technique, extensively used in the literature [13], [27], [44], of integrating the continuous system model with the ode45 function in MATLAB with a step size of 0.02 s. The parameters used in this case are summarized in Table II, and 5% uniform noise is also added to the sensor of $x_1$ to show the tracking performance under noisy conditions.
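For readers without MATLAB, the simplified model (53)-(56) can be stepped with any ODE solver; the sketch below uses a hand-rolled fixed-step RK4 at the same 0.02 s step size as a stand-in for ode45 (the choice of integrator is ours, not the paper's).

```python
import numpy as np

def ball_beam_dynamics(state, u):
    """Simplified ball-and-beam state-space model of (53)-(56)."""
    x1, x2, x3, x4 = state
    return np.array([
        x2,                                                             # (53)
        1.717 * np.sin(x3),                                             # (54)
        x4,                                                             # (55)
        -0.241 * x4 + 0.157 * x1 * np.cos(x3) + 0.5 * np.cos(x3) * u,   # (56)
    ])

def rk4_step(state, u, dt=0.02):
    """One fixed-step RK4 integration step with the control u held constant."""
    k1 = ball_beam_dynamics(state, u)
    k2 = ball_beam_dynamics(state + 0.5 * dt * k1, u)
    k3 = ball_beam_dynamics(state + 0.5 * dt * k2, u)
    k4 = ball_beam_dynamics(state + dt * k3, u)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
```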

Figs. 12 and 13 present the tracking performance with our proposed approach and with the typical HDP approach [21], respectively. Fig. 12 clearly shows that the ball goes out of bounds at the very beginning but can quickly track the desired signal within one period. The oscillation of the tracking signal (solid line) in the first period shows the learning process of the



Fig. 13. Typical tracking performance with HDP approach.

TABLE III
SUMMARY OF TOTAL TRACKING ERROR FOR THE SIMULATION CASES

Number       | Example 1 | Example 2 | Ball and Beam
DualCritHDP  | 7.454     | 9.292     | 3.761
HDP          | 223.9     | 712.8     | 54.49

controller with our proposed approach. In Fig. 13, we can see that the ball goes out of bounds many times before it can track the desired signal. In other words, the controller spends the whole first period learning to control the ball, and it succeeds only after about 80 s. The simulation results show that the controller with our proposed approach has better noise tolerance than the one with the typical HDP approach.

To provide a more accurate assessment of the tracking performance, here we summarize the quantitative measurements in terms of tracking error for all examples in study one (i.e., Examples 1 and 2) and study two (i.e., the ball-and-beam benchmark). We adopt the evaluation function in [18], defined as

$$\mathrm{PER} = \sum_{k=0}^{N} e^T(k)\, e(k)$$

where $e(k)$ is the tracking error in Fig. 3 and $N$ refers to the number of steps in the simulation. The PER values for Sections IV-A, IV-B, and V-A are summarized in Table III.

From Table III, one can see that our proposed dual critic HDP achieves a much lower PER (total tracking error) than the typical HDP approach [21]. The results also confirm that our proposed structure, with its informative and adaptive reinforcement signal, can outperform the typical HDP structure in terms of tracking accuracy.
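For reference, the PER metric of Table III is a one-line computation; the sketch assumes the per-step tracking errors have been logged during the run.

```python
import numpy as np

def per(tracking_errors):
    """Accumulated tracking error PER = sum_k e(k)^T e(k), as reported in Table III."""
    e = np.asarray(tracking_errors, dtype=float)
    return float(np.sum(e * e))
```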

B. VR Demonstration of the Dual Critic Network Design

In this paper, we further apply our algorithm in a VR environment to show real-time simulation of our approach during interaction with the external environment. VR can enable powerful human–computer interactions, and it is interesting to observe how the proposed algorithm works in real-time simulation without the requirement of setting up the real physical system. Here we would like to demonstrate the tracking performance of our proposed approach on the ball-and-beam benchmark [50], [51]. We also add disturbances to see how robustly our proposed approach can perform.

Fig. 14. Schematic of the VR platform.

Fig. 15. Typical tracking performance with our proposed approach in the VR/simulation platform.

The VR platform is developed as in Fig. 14, where one can see that the ball-and-beam system is at the center of the scene and the state vectors ($x_1$ and $x_3$) are displayed in the upper-left table of the figure. In order to be more realistic, we add a disturbance option (upper center) in the simulation. That is to say, the user can apply any disturbance between −8 and 8 N on the ball whenever needed, and the corresponding force will be applied to the system, as also displayed in the upper-right table.

The real-time simulation result on the tracking problem with our proposed approach is presented in Fig. 15, where one can clearly see the online learning process. In other words, the first period of the signal is like a distorted sine wave, while the signal in the second period almost tracks the desired sine signal. At about 64 s, we added a 2.1 N disturbance on the ball, and we can see that the controller spends about two periods learning to get back on the right track.

VI. LYAPUNOV STABILITY ANALYSIS

We now proceed to study the stability characteristics of our proposed method. Similar to [52], we employ $\phi_a$ for the output of the hidden neurons in the action network, s.t. $\phi_a(k) = [\phi_{a,1}(k)\ \phi_{a,2}(k)\ \cdots\ \phi_{a,N_{ah}}(k)]^T$ and $\|\phi_a(k)\|^2 = \phi_a^T(k)\phi_a(k)$. For the critic and reference networks, we adopt similar notations, s.t. $\phi_c$ and $\phi_r$ are the outputs of the hidden neurons in the critic network and the reference network, respectively. The estimated weights in the networks are denoted as $\hat{\omega}_a$, $\hat{\omega}_c$, and $\hat{\omega}_r$, while the ideal (expected) weights are denoted as $\omega_a$, $\omega_c$, and $\omega_r$. Therefore, the weight estimation errors are defined as $\tilde{\omega}_a = \hat{\omega}_a - \omega_a$, $\tilde{\omega}_c = \hat{\omega}_c - \omega_c$, and $\tilde{\omega}_r = \hat{\omega}_r - \omega_r$. We note that, although $\omega_a$, $\omega_c$, and $\omega_r$ are unknown parameters, we assume they have upper bounds in this analysis because, when the estimated weights exceed the preset upper bounds, we normalize them according to (42)–(44).

Define the Lyapunov function candidate as

$$V = V_1 + V_2 + V_3 + V_4 + V_5 \quad (57)$$

where

$$V_1 = \frac{1}{\gamma_1}\, \bar{e}^T(k)\bar{e}(k) \quad (58)$$

$$V_2 = \frac{1}{\eta_c}\, \mathrm{tr}\bigl(\tilde{\omega}_c^T(k)\tilde{\omega}_c(k)\bigr) \quad (59)$$

$$V_3 = \frac{1}{2}\, \|\zeta_c(k-1)\|^2 \quad (60)$$

$$V_4 = \frac{1}{\gamma_2 \eta_a}\, \mathrm{tr}\bigl(\tilde{\omega}_a^T(k)\tilde{\omega}_a(k)\bigr) \quad (61)$$

$$V_5 = \frac{1}{\gamma_3 \eta_r}\, \mathrm{tr}\bigl(\tilde{\omega}_r^T(k)\tilde{\omega}_r(k)\bigr). \quad (62)$$

In (60), $\zeta_c(k-1) = (\hat{\omega}_c(k-1) - \omega_c)^T \phi_c(k-1) = \tilde{\omega}_c^T(k-1)\phi_c(k-1)$, and $\gamma_i > 0$ for $i = 1, 2, 3$.

The first difference of the Lyapunov function candidate can be written as

$$\Delta V = \Delta V_1 + \Delta V_2 + \Delta V_3 + \Delta V_4 + \Delta V_5. \quad (63)$$

For $\Delta V_1$, we have

$$\bar{e}(k+1) = K_v \bar{e}(k) - \tilde{f}(x(k)) + d(k) \quad (64)$$

where

$$\tilde{f}(x(k)) = \hat{f}(x(k)) - f(x(k)), \qquad f(x(k)) = \omega_a^T(k)\phi_a(k) + \varepsilon_a(x(k)), \qquad \hat{f}(x(k)) = \hat{\omega}_a^T(k)\phi_a(k)$$
$$\tilde{f}(x(k)) = \bigl(\hat{\omega}_a(k) - \omega_a(k)\bigr)^T\phi_a(k) - \varepsilon_a(x(k)) = \tilde{\omega}_a^T(k)\phi_a(k) - \varepsilon_a(x(k)) = \zeta_a(k) - \varepsilon_a(x(k)) \quad (65)$$

with $\zeta_a(k) = \tilde{\omega}_a^T(k)\phi_a(k)$ and $\varepsilon_a(\cdot)$ the reconstruction error of the action network. Therefore

$$\Delta V_1 = \frac{1}{\gamma_1}\bigl(\bar{e}^T(k+1)\bar{e}(k+1) - \bar{e}^T(k)\bar{e}(k)\bigr)$$
$$= \frac{1}{\gamma_1}\Bigl[\bigl(K_v\bar{e}(k) - \zeta_a(k) + \varepsilon_a(k) + d(k)\bigr)^T \bigl(K_v\bar{e}(k) - \zeta_a(k) + \varepsilon_a(k) + d(k)\bigr) - \bar{e}^T(k)\bar{e}(k)\Bigr]$$
$$\le \frac{1}{\gamma_1}\Bigl[3\bigl(K_v^2\|\bar{e}(k)\|^2 + \|\zeta_a(k)\|^2 + \|\varepsilon_a(k) + d(k)\|^2\bigr) - \|\bar{e}(k)\|^2\Bigr]$$
$$\le \frac{3}{\gamma_1}\Bigl(\bigl(K_{v\max}^2 - \tfrac{1}{3}\bigr)\|\bar{e}(k)\|^2 + \|\zeta_a(k)\|^2 + \|\varepsilon_a(k) + d(k)\|^2\Bigr) \quad (66)$$

where $K_{v\max}$ is the maximum eigenvalue of $K_v$.

For $\Delta V_2$, we have

$$\hat{\omega}_c(k+1) = \hat{\omega}_c(k) - \eta_c \alpha \phi_c(k)\bigl(\alpha\hat{\omega}_c^T(k)\phi_c(k) + s(k) - \hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr)^T \quad (67)$$

and the corresponding $\tilde{\omega}_c(k+1)$ can be expressed as

$$\tilde{\omega}_c(k+1) = \tilde{\omega}_c(k) - \eta_c \alpha \phi_c(k)\bigl(\alpha\hat{\omega}_c^T(k)\phi_c(k) + s(k) - \hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr)^T$$
$$= \bigl(I - \eta_c\alpha^2\phi_c(k)\phi_c^T(k)\bigr)\tilde{\omega}_c(k) - \eta_c\alpha\phi_c(k)\bigl(\alpha\omega_c^T\phi_c(k) + s(k) - \hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr)^T. \quad (68)$$

Then

$$\Delta V_2 = \frac{1}{\eta_c}\mathrm{tr}\bigl(\tilde{\omega}_c^T(k+1)\tilde{\omega}_c(k+1) - \tilde{\omega}_c^T(k)\tilde{\omega}_c(k)\bigr)$$
$$= \frac{1}{\eta_c}\mathrm{tr}\Bigl(\tilde{\omega}_c^T(k)A^TA\tilde{\omega}_c(k) - \tilde{\omega}_c^T(k)\tilde{\omega}_c(k) + B\alpha^2\eta_c^2\phi_c^T(k)\phi_c(k)B^T - \tilde{\omega}_c^T(k)A^T\eta_c\alpha\phi_c(k)B^T - B\eta_c\alpha\phi_c^T(k)A\tilde{\omega}_c(k)\Bigr) \quad (69)$$

where $A = I - \eta_c\alpha^2\phi_c(k)\phi_c^T(k)$ and $B = \alpha\omega_c^T\phi_c(k) + s(k) - \hat{\omega}_c^T(k-1)\phi_c(k-1)$. We note that

$$\tilde{\omega}_c^T(k)A^TA\tilde{\omega}_c(k) - \tilde{\omega}_c^T(k)\tilde{\omega}_c(k) = \tilde{\omega}_c^T(k)\bigl[(I - \eta_c\alpha^2\phi_c\phi_c^T)^T(I - \eta_c\alpha^2\phi_c\phi_c^T)\bigr]\tilde{\omega}_c(k) - \tilde{\omega}_c^T(k)\tilde{\omega}_c(k)$$
$$= -\eta_c\alpha^2\|\zeta_c(k)\|^2 - \eta_c\alpha^2\tilde{\omega}_c^T(k)\phi_c\phi_c^T\bigl(I - \eta_c\alpha^2\phi_c\phi_c^T\bigr)\tilde{\omega}_c(k) \quad (70)$$

where $\zeta_c(k) = \tilde{\omega}_c^T(k)\phi_c(k)$.

Therefore

$$\Delta V_2(k) = -\alpha^2\|\zeta_c(k)\|^2 - \alpha^2\bigl(1 - \eta_c\alpha^2\|\phi_c(k)\|^2\bigr)\|\zeta_c(k)\|^2 + \eta_c\alpha^2\|\phi_c(k)\|^2\bigl\|\alpha\omega_c^T\phi_c(k) + s(k) - \hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr\|^2$$
$$- 2\,\mathrm{tr}\Bigl[\alpha\bigl(1 - \eta_c\alpha^2\|\phi_c(k)\|^2\bigr)\zeta_c(k)\bigl(\alpha\omega_c^T\phi_c(k) + s(k) - \hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr)^T\Bigr]. \quad (71)$$

We seek the upper bound of (71) by applying the Cauchy–Schwarz inequality to the fourth term. Therefore, we have

$$\Delta V_2(k) \le -\alpha^2\|\zeta_c(k)\|^2 - \alpha^2\bigl(1 - \eta_c\alpha^2\|\phi_c(k)\|^2\bigr)\|\zeta_c(k)\|^2 + \bigl\|\alpha\omega_c^T\phi_c(k) + s(k) - \hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr\|^2$$
$$- \alpha^2\bigl(1 - \eta_c\alpha^2\|\phi_c(k)\|^2\bigr)\bigl\|\zeta_c(k) + \omega_c^T\phi_c(k) + \alpha^{-1}s(k) - \alpha^{-1}\hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr\|^2. \quad (72)$$


From (72), we can further get

$$\Delta V_2(k) \le -\alpha^2\|\zeta_c(k)\|^2 + \frac{1}{2}\|\zeta_c(k-1)\|^2 + 2\bigl\|\alpha\omega_c^T\phi_c(k) + s(k) - \hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr\|^2$$
$$- \alpha^2\bigl(1 - \eta_c\alpha^2\|\phi_c(k)\|^2\bigr)\bigl\|\zeta_c(k) + \omega_c^T\phi_c(k) + \alpha^{-1}s(k) - \alpha^{-1}\hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr\|^2. \quad (73)$$

For $\Delta V_3(k)$, we have

$$\Delta V_3(k) = \frac{1}{2}\bigl(\|\zeta_c(k)\|^2 - \|\zeta_c(k-1)\|^2\bigr). \quad (74)$$

For $\Delta V_4(k)$, we have

$$\hat{\omega}_a(k+1) = \hat{\omega}_a(k) - \eta_a\phi_a(k)\hat{\omega}_c^T(k)C_a(k)\bigl(\hat{\omega}_c^T(k)\phi_c(k) + K_v\bar{e}(k) - \bar{e}(k+1)\bigr)^T \quad (75)$$

where $C_a(k)$ is an $N_{ch}\times 1$ vector whose elements are defined as $C_a(k) = \frac{1}{2}\bigl(1 - \phi_c(k)\bigr)\omega_{c,n+1}(k)$. Then

$$\tilde{\omega}_a(k+1) = \tilde{\omega}_a(k) - \eta_a\phi_a(k)\hat{\omega}_c^T(k)C_a(k)\bigl(\hat{\omega}_c^T(k)\phi_c(k) + K_v\bar{e}(k) - \bar{e}(k+1)\bigr)^T. \quad (76)$$

With the tracking error dynamics, we can rewrite (76) as

$$\tilde{\omega}_a(k+1) = \tilde{\omega}_a(k) - \eta_a\phi_a(k)\hat{\omega}_c^T(k)C_a(k)\bigl(\hat{\omega}_c^T(k)\phi_c(k) + \zeta_a(k) - \varepsilon_a(k) - d(k)\bigr)^T. \quad (77)$$

Therefore, we have

$$\Delta V_4(k) = \frac{1}{\gamma_2\eta_a}\mathrm{tr}\bigl[\tilde{\omega}_a^T(k+1)\tilde{\omega}_a(k+1) - \tilde{\omega}_a^T(k)\tilde{\omega}_a(k)\bigr]$$
$$= \frac{1}{\gamma_2}\mathrm{tr}\Bigl[-2\,\hat{\omega}_c^T(k)C_a(k)\zeta_a(k)\bigl(\hat{\omega}_c^T(k)\phi_c(k) + \zeta_a(k) - \varepsilon_a(k) - d(k)\bigr)^T$$
$$+ \eta_a\phi_a^2(k)\bigl\|\hat{\omega}_c^T(k)C_a(k)\bigr\|^2\bigl\|\hat{\omega}_c^T(k)\phi_c(k) + \zeta_a(k) - \varepsilon_a(k) - d(k)\bigr\|^2\Bigr]. \quad (78)$$

In order to further simplify the formula, we can rewrite (78) as

$$\Delta V_4(k) = \frac{1}{\gamma_2}\mathrm{tr}\bigl[-(1 - \eta_a\phi_a^2(k))\|D\|^2\|E\|^2 + \|D\cdot E - \zeta_a(k)\|^2 - \|\zeta_a(k)\|^2\bigr] \quad (79)$$

where

$$D = \hat{\omega}_c^T(k)C_a(k) \quad (80)$$
$$E = \hat{\omega}_c^T(k)\phi_c(k) + \zeta_a(k) - \varepsilon_a(k) - d(k). \quad (81)$$

Applying the Cauchy–Schwarz inequality to (79), we have

$$\Delta V_4(k) \le \frac{1}{\gamma_2}\mathrm{tr}\bigl[F + 2\|D\cdot E\|^2 + \|\zeta_a(k)\|^2\bigr] \quad (82)$$

where

$$F = -\bigl(1 - \eta_a\|\phi_a(k)\|^2\bigr)\|D\|^2\|E\|^2. \quad (83)$$

For $\Delta V_5(k)$, we have

$$\hat{\omega}_r(k+1) = \hat{\omega}_r(k) - \eta_r\phi_r(k)\hat{\omega}_c^T(k)C_r(k)\bigl(\alpha\hat{\omega}_c^T(k)\phi_c(k) + r(k) - \hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr)^T \quad (84)$$

where $C_r(k)$ is an $N_{ch}\times 1$ vector whose elements are defined as $C_r(k) = \frac{1}{2}\bigl(1 - \phi_c(k)\bigr)\omega_{c,n+2}(k)$.

Therefore

$$\Delta V_5(k) = \frac{1}{\gamma_3\eta_r}\mathrm{tr}\bigl[\tilde{\omega}_r^T(k+1)\tilde{\omega}_r(k+1) - \tilde{\omega}_r^T(k)\tilde{\omega}_r(k)\bigr]$$
$$= \frac{1}{\gamma_3}\mathrm{tr}\Bigl[-2\,\hat{\omega}_r^T(k)C_r(k)\zeta_r(k)\bigl(\alpha\hat{\omega}_c^T(k)\phi_c(k) + r(k) - \hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr)^T$$
$$+ \eta_r\phi_r(k)\bigl\|\hat{\omega}_r^T(k)C_r(k)\bigr\|^2\bigl\|\alpha\hat{\omega}_c^T(k)\phi_c(k) + r(k) - \hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr\|^2\Bigr]. \quad (85)$$

We can also simplify (85) by rewriting it as

$$\Delta V_5(k) = \frac{1}{\gamma_3}\mathrm{tr}\bigl[H + \|G\cdot I - \zeta_r(k)\|^2 - \|\zeta_r(k)\|^2\bigr] \quad (86)$$

where

$$G = \bigl(\alpha\hat{\omega}_c^T(k)\phi_c(k) + r(k) - \hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr)^T \quad (87)$$
$$H = -\bigl(1 - \eta_r\|\phi_r(k)\|^2\bigr)\|G\|^2\|I\|^2 \quad (88)$$
$$I = \hat{\omega}_r^T(k)C_r(k). \quad (89)$$

Applying the Cauchy–Schwarz inequality to (86), we have

$$\Delta V_5(k) \le \frac{1}{\gamma_3}\mathrm{tr}\bigl[H + 2\|G\cdot I\|^2 + \|\zeta_r(k)\|^2\bigr]. \quad (90)$$

Substituting (66), (73), (74), (82), and (90) into (63), we get the first difference of the Lyapunov function candidate as

$$\Delta V(k) \le \frac{3}{\gamma_1}\Bigl(\bigl(K_{v\max}^2 - \tfrac{1}{3}\bigr)\|\bar{e}(k)\|^2 + \|\zeta_a(k)\|^2 + \|\varepsilon_a(k) + d(k)\|^2\Bigr) - \alpha^2\|\zeta_c(k)\|^2$$
$$+ 2\bigl\|\alpha\omega_c^T\phi_c(k) + s(k) - \hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr\|^2 - \alpha^2\bigl(1 - \eta_c\alpha^2\|\phi_c(k)\|^2\bigr)\bigl\|\zeta_c(k) + \omega_c^T\phi_c(k) + \alpha^{-1}s(k) - \alpha^{-1}\hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr\|^2$$
$$+ \frac{1}{2}\|\zeta_c(k-1)\|^2 + \frac{1}{2}\bigl(\|\zeta_c(k)\|^2 - \|\zeta_c(k-1)\|^2\bigr)$$
$$+ \frac{1}{\gamma_2}\mathrm{tr}\bigl[F + 2\|D\cdot E\|^2 + \|\zeta_a(k)\|^2\bigr] + \frac{1}{\gamma_3}\mathrm{tr}\bigl[H + 2\|G\cdot I\|^2 + \|\zeta_r(k)\|^2\bigr]. \quad (91)$$


In order to express (91) in a clearer way, we rewrite it as

$$\Delta V(k) \le \frac{3}{\gamma_1}\Bigl(\bigl(K_{v\max}^2 - \tfrac{1}{3}\bigr)\|\bar{e}(k)\|^2 + \|\zeta_a(k)\|^2 + \|\varepsilon_a(k) + d(k)\|^2\Bigr) - \Bigl(\alpha^2 - \frac{1}{2}\Bigr)\|\zeta_c(k)\|^2$$
$$- \alpha^2\bigl(1 - \eta_c\alpha^2\|\phi_c(k)\|^2\bigr)\bigl\|\zeta_c(k) + \omega_c^T\phi_c(k) + \alpha^{-1}s(k) - \alpha^{-1}\hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr\|^2$$
$$- \frac{1}{\gamma_2}\bigl(1 - \eta_a\|\phi_a(k)\|^2\bigr)\|D\|^2\|E\|^2 - \frac{1}{\gamma_3}\bigl(1 - \eta_r\|\phi_r(k)\|^2\bigr)\|G\|^2\|I\|^2 + Z^2 \quad (92)$$

where $Z^2$ is defined as

$$Z^2 = \frac{1}{\gamma_2}\|\zeta_a(k)\|^2 + \frac{1}{\gamma_3}\|\zeta_r(k)\|^2 + 2\bigl\|\alpha\omega_c^T\phi_c(k) + s(k) - \hat{\omega}_c^T(k-1)\phi_c(k-1)\bigr\|^2 + \frac{2}{\gamma_2}\|D\cdot E\|^2 + \frac{2}{\gamma_3}\|G\cdot I\|^2. \quad (93)$$

The upper bound for $Z^2$ is

$$Z^2 \le \frac{2}{\gamma_2}\bigl\|\omega_{am}^T\phi_{am}\bigr\|^2 + \frac{2}{\gamma_3}\bigl\|\omega_{rm}^T\phi_{rm}\bigr\|^2 + 6(\alpha^2 + 1)\bigl\|\omega_{cm}^T\phi_{cm}\bigr\|^2 + 6 s_m^2$$
$$+ \frac{6}{\gamma_2}\bigl\|\omega_{cm}^T C_{am}\bigr\|^2\Bigl(\bigl\|\omega_{cm}^T\phi_{cm}\bigr\|^2 + 2\bigl\|\omega_{am}^T\phi_{am}\bigr\|^2 + \|\varepsilon_{am} + d_m\|^2\Bigr) + \frac{6}{\gamma_3}\bigl\|\omega_{rm}^T C_{rm}\bigr\|^2\Bigl((\alpha^2 + 1)\bigl\|\omega_{cm}^T\phi_{cm}\bigr\|^2 + \|r_m\|^2\Bigr)$$
$$= 6\Bigl(\alpha^2 + 1 + \frac{1}{\gamma_2}\bigl\|\omega_{cm}^T C_{am}\bigr\|^2 + \frac{1}{\gamma_3}(\alpha^2 + 1)\bigl\|\omega_{rm}^T C_{rm}\bigr\|^2\Bigr)\bigl\|\omega_{cm}^T\phi_{cm}\bigr\|^2 + \frac{2}{\gamma_2}\Bigl(1 + 6\bigl\|\omega_{cm}^T C_{am}\bigr\|^2\Bigr)\bigl\|\omega_{am}^T\phi_{am}\bigr\|^2$$
$$+ \frac{2}{\gamma_3}\bigl\|\omega_{rm}^T\phi_{rm}\bigr\|^2 + 6 s_m^2 + \frac{6}{\gamma_2}\bigl\|\omega_{cm}^T C_{am}\bigr\|^2\|\varepsilon_{am} + d_m\|^2 + \frac{6}{\gamma_3}\bigl\|\omega_{rm}^T C_{rm}\bigr\|^2\|r_m\|^2 = Z_m^2 \quad (94)$$

where $\omega_{cm}$, $\omega_{am}$, $\omega_{rm}$, $\phi_{cm}$, $\phi_{am}$, $\phi_{rm}$, $C_{am}$, $C_{rm}$, $s_m$, and $r_m$ are the upper bounds of $\omega_c$, $\omega_a$, $\omega_r$, $\phi_c$, $\phi_a$, $\phi_r$, $C_a$, $C_r$, $s$, and $r$, respectively.

Equation (94) further implies that $\Delta V(k) \le 0$ if the following conditions hold:
$$
0 < K_{v\max} < \frac{\sqrt{3}}{3} \tag{95}
$$
$$
\frac{\sqrt{2}}{2} < \alpha < 1 \tag{96}
$$
$$
\eta_c\alpha^2\|\phi_c(k)\|^2 < 1,\quad \eta_a\|\phi_a(k)\|^2 < 1,\quad \eta_r\|\phi_r(k)\|^2 < 1 \tag{97}
$$
and
$$
\|e(k)\| > \sqrt{\frac{\gamma_1}{1 - 3K_{v\max}^2}}\;Z_m \tag{98}
$$
or
$$
\|\zeta_c(k)\| > \sqrt{\frac{1}{\alpha^2 - \tfrac{1}{2}}}\;Z_m. \tag{99}
$$
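Taken together with (92)–(94), these conditions can be read as follows (a condensed, informal restatement of the argument above): (97) makes each of the three subtracted quadratic forms in (92) non-negative, (95) makes the coefficient of $\|e(k)\|^2$ negative, and (96) makes the coefficient of $\|\zeta_c(k)\|^2$ negative, so that, loosely collecting the remaining bounded terms into $Z_m^2$ as in (94),
$$
\Delta V(k) \lesssim -\frac{1-3K_{v\max}^2}{\gamma_1}\,\|e(k)\|^2 - \Bigl(\alpha^2-\frac{1}{2}\Bigr)\|\zeta_c(k)\|^2 + Z_m^2,
$$
and either (98) or (99) then makes the right-hand side non-positive.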

According to the standard Lyapunov extension theorem [53], [54], this demonstrates that the auxiliary error and the errors in the weight estimates are uniformly ultimately bounded. This further implies that the weight estimates themselves are bounded correspondingly.

VII. CONCLUSION

This paper proposed a novel ADP structure that combines a reference network with the original critic network into a dual critic network design. The reference network provides the critic network with an internal goal representation that helps it approximate the total cost-to-go signal in detail. Unlike the discrete reward signal from the external environment, this internal goal signal can be adjusted adaptively with regard to the system state and the control action, so past experience or prior knowledge is not required to assign the reinforcement signal value. The dual critic network therefore has a weight-tuning path not only from the outside but also from the inside. Compared with the typical HDP design under the same simulation environment settings, the proposed dual critic network HDP achieves better tracking performance in terms of both time cost and accumulated tracking error. In addition to various simulation studies and the development of a VR platform, we also presented a detailed theoretical analysis of our method in terms of Lyapunov stability.

As a new ADP approach, this work opens many interesting future research topics. For instance, in this paper we used only one reference network in our design. We are extending the dual design to see how it works with multiple reference networks stacked on top of the critic network, similar to a hierarchical neural network structure; some promising results have already been reported in [39] and [45]. Also, the weight-tuning rules in our design are based on classic chain-rule backpropagation, and we are interested in how much improvement can be achieved by incorporating the Levenberg–Marquardt (LM) algorithm or a Kalman filter into the neural network training.

APPENDIX

PSEUDO CODE

Algorithm 2 Algorithm-Level Implementation of the Dual Critic Network of ADP for Tracking Control

/* Notation:
ActNet: the action network; x: state vector; wa: weights of ActNet;
f(x) ⇐ ActNet(x, wa): nonlinear system function approximation, the output of ActNet;
Filter: the tracking filter; xd: the desired reference signal; u: control action;
u ⇐ Filter(xd, x, f(x)): control action calculation;
RefNet: the reference network; wr: weights of RefNet; s: internal goal signal;
s ⇐ RefNet(x, u, wr): internal goal representation with the reference network;
CritNet: the critic network; wc: weights of CritNet; J: total cost-to-go signal, the output of the critic network;
J ⇐ CritNet(x, u, s, wc): total cost-to-go signal approximated by the critic network.
Note: the parameters Nr, Tr, ηr, Nc, Tc, ηc, Na, Ta, and ηa are all defined in Table I. */

1) Initialize x(0)
2) Uniformly randomize wa(0), wr(0), wc(0) in [−1, 1]
3) f(x(0)) ⇐ ActNet(x(0), wa(0))
4) u(0) ⇐ Filter(xd, x(0), f(x(0)))
5) s(0) ⇐ RefNet(x(0), u(0), wr(0))
6) J(0) ⇐ CritNet(x(0), u(0), s(0), wc(0))
7) Jprev = J(0)
8) for 1 to MaxStep do
9)     // weights are carried over through the whole learning process
10)    CurrentState ⇐ (x(k−1), u(k−1)); // obtain current state vectors from the external environment
11)    wa(k) = wa(k−1);
12)    wc(k) = wc(k−1);
13)    wr(k) = wr(k−1);
14)    f(x(k)) ⇐ ActNet(x(k), wa(k));
15)    u(k) ⇐ Filter(xd, x(k), f(x(k)));
16)    s(k) ⇐ RefNet(x(k), u(k), wr(k));
17)    J(k) ⇐ CritNet(x(k), u(k), s(k), wc(k));
18)    Obtain the tracking error e(k) via (3)
19)    if ||e(k)|| < c then
20)        r(k) = 0; // reward
21)    else
22)        r(k) = −1; // punishment
23)    end if // corresponding to step 19
24)    if Step ≥ 100 then
25)        calculate MSE via (47);
26)        if MSE < threshold then
27)            ηr(k), ηc(k), ηa(k) are divided by 6, respectively;
28)            threshold = threshold/2; // update
29)        else if MSE ≥ 4 · threshold then
30)            ηr(k), ηc(k), ηa(k) and threshold are reset to their initial values, respectively;
31)        end if // corresponding to step 26
32)    end if // corresponding to step 24
33)    Er(k) = (1/2)(αJ(k) − (J(k−1) − r(k)))²;
34)    cyc = 0;
35)    while (Er(k) > Tr and cyc < Nr) do // update the weights recursively
36)        wr(k) = wr(k) + Δwr(k) via (20) and (23);
           // update s(k), J(k), Er(k), cyc correspondingly
37)        s(k) ⇐ RefNet(x(k), u(k), wr(k));
38)        J(k) ⇐ CritNet(x(k), u(k), s(k), wc(k));
39)        Er(k) = (1/2)(αJ(k) − (J(k−1) − r(k)))²;
40)        cyc = cyc + 1;
41)    end while // online learning of the reference network
42)    Ec(k) = (1/2)(αJ(k) − (J(k−1) − s(k)))²;
43)    cyc = 0;
44)    while (Ec(k) > Tc and cyc < Nc) do
45)        wc(k) = wc(k) + Δwc(k) via (27) and (30);
46)        J(k) ⇐ CritNet(x(k), u(k), s(k), wc(k));
47)        Ec(k) = (1/2)(αJ(k) − (J(k−1) − s(k)))²;
48)        cyc = cyc + 1;
49)    end while // online learning of the critic network
50)    Ea(k) = (1/2)(αJ(k) + f(x(k)) − Uc)²;
51)    cyc = 0;
52)    while (Ea(k) > Ta and cyc < Na) do
53)        wa(k) = wa(k) + Δwa(k) via (36) and (41);
           // update f(x(k)), u(k), s(k), J(k), Ea(k), cyc correspondingly
54)        f(x(k)) ⇐ ActNet(x(k), wa(k));
55)        u(k) ⇐ Filter(xd, x(k), f(x(k)));
56)        s(k) ⇐ RefNet(x(k), u(k), wr(k));
57)        J(k) ⇐ CritNet(x(k), u(k), s(k), wc(k));
58)        Ea(k) = (1/2)(αJ(k) + f(x(k)) − Uc)²;
59)        cyc = cyc + 1;
60)    end while // online learning of the action network
61) end for // corresponding to step 8
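To make the control/learning flow of Algorithm 2 concrete, the following Python sketch mirrors its outer loop under simplifying assumptions: the three approximators are reduced to linear maps, plain gradient steps stand in for the weight-update rules (20)/(23), (27)/(30), and (36)/(41), the `plant`, the `tracking_filter` with gain `kv`, the tracking error e(k) = x(k) − xd, and all parameter values are illustrative placeholders rather than the paper's design, and the MSE-based learning-rate adjustment of steps 24–32 is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- illustrative placeholders (not the paper's plant, filter, or networks) ---
def plant(x, u):
    """Toy scalar plant used only to close the loop in this sketch."""
    return 0.8 * np.sin(x) + u

def tracking_filter(xd, x, f_hat, kv=0.3):
    """Stand-in for the tracking filter: cancel the estimated drift, feed back the error."""
    return kv * (xd - x) - f_hat

# Linear stand-ins for the three networks:
#   ActNet: f(x) = wa*x    RefNet: s = wr.[x, u]    CritNet: J = wc.[x, u, s]
wa = rng.uniform(-1, 1)
wr = rng.uniform(-1, 1, 2)
wc = rng.uniform(-1, 1, 3)

alpha, Uc, c = 0.95, 0.0, 0.05           # discount, ultimate objective, tracking-error band
eta_r, eta_c, eta_a = 0.02, 0.02, 0.02   # learning rates (placeholders for Table I values)
Tr, Tc, Ta = 1e-6, 1e-6, 1e-6            # internal error tolerances
Nr, Nc, Na = 50, 50, 50                  # internal training cycles

x, xd = 0.0, 1.0
f_hat = wa * x
u = tracking_filter(xd, x, f_hat)
s = wr @ np.array([x, u])
J_prev = wc @ np.array([x, u, s])

for k in range(1, 300):                  # steps 8-61 of Algorithm 2
    x = plant(x, u)                      # new state from the external environment
    f_hat = wa * x
    u = tracking_filter(xd, x, f_hat)
    s = wr @ np.array([x, u])
    J = wc @ np.array([x, u, s])

    e = x - xd                           # stand-in for the tracking error of Eq. (3)
    r = 0.0 if abs(e) < c else -1.0      # binary reinforcement, steps 19-23

    # Reference network: reduce Er = 0.5*(alpha*J - (J_prev - r))^2  (steps 33-41)
    for _ in range(Nr):
        delta = alpha * J - (J_prev - r)
        if 0.5 * delta ** 2 <= Tr:
            break
        wr -= eta_r * delta * alpha * wc[2] * np.array([x, u])  # J depends on wr through s
        s = wr @ np.array([x, u])
        J = wc @ np.array([x, u, s])

    # Critic network: reduce Ec = 0.5*(alpha*J - (J_prev - s))^2  (steps 42-49)
    for _ in range(Nc):
        delta = alpha * J - (J_prev - s)
        if 0.5 * delta ** 2 <= Tc:
            break
        wc -= eta_c * delta * alpha * np.array([x, u, s])
        J = wc @ np.array([x, u, s])

    # Action network: reduce Ea = 0.5*(alpha*J + f_hat - Uc)^2  (steps 50-60)
    for _ in range(Na):
        delta = alpha * J + f_hat - Uc
        if 0.5 * delta ** 2 <= Ta:
            break
        # wa enters Ea directly through f_hat and indirectly through u = kv*(xd - x) - f_hat
        dJ_dwa = (wc[1] + wc[2] * wr[1]) * (-x)
        wa -= eta_a * delta * (alpha * dJ_dwa + x)
        f_hat = wa * x
        u = tracking_filter(xd, x, f_hat)
        s = wr @ np.array([x, u])
        J = wc @ np.array([x, u, s])

    J_prev = J
```

The point of this sketch is only the ordering of the three inner loops, reference network first, then critic, then action network, with the internal goal s(k) and the cost-to-go J(k) recomputed whenever upstream weights change, as in steps 35–60; it is not tuned for convergence or representative of the benchmark results in the paper.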

REFERENCES

[1] F. Lewis and D. Liu, Eds., Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Piscataway, NJ, USA: IEEE Press, 2013.

[2] J. Si, A. G. Barto, W. B. Powell, and D. C. Wunsch, Eds., Handbook of Learning and Approximate Dynamic Programming. Piscataway, NJ, USA: IEEE Press, 2004.

[3] D. Bertsekas, Dynamic Programming and Optimal Control. Belmont, MA, USA: Athena Scientific, 1995.

[4] W. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. New York, USA: Wiley, 2007.

[5] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, "Adaptive dynamic programming," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 32, no. 2, pp. 140–153, May 2002.

[6] J. Lee and J. Lee, "Approximate dynamic programming strategies and their applicability for process control: A review and future directions," Int. J. Control Autom. Syst., vol. 2, no. 3, pp. 263–278, Sep. 2004.

[7] H. He, Self-Adaptive Systems for Machine Intelligence. New York, USA: Wiley, 2011.

[8] P. Werbos, "Reinforcement learning and approximate dynamic programming (RLADP)—foundations, common misconceptions and challenges ahead," in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. New York, USA: Wiley, 2013, pp. 3–30.

[9] P. J. Werbos, "Intelligence in the brain: A theory of how it works and how to build it," Neural Netw., vol. 22, no. 3, pp. 200–212, Apr. 2009.

[10] P. J. Werbos, "ADP: The key direction for future research in intelligent control and understanding brain intelligence," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 898–900, Aug. 2008.

[11] D. Prokhorov and D. Wunsch, "Adaptive critic designs," IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 997–1007, Sep. 1997.

[12] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: An introduction," IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39–47, May 2009.


[13] J. Si and Y.-T. Wang, "Online learning control by association and reinforcement," IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 264–276, Mar. 2001.

[14] R. Enns and J. Si, "Helicopter trimming and tracking control using direct neural dynamic programming," IEEE Trans. Neural Netw., vol. 14, no. 4, pp. 929–939, Jul. 2003.

[15] Z. Ni, H. He, D. Zhao, and D. Prokhorov, "Reinforcement learning control based on multi-goal representation using hierarchical heuristic dynamic programming," in Proc. Int. Joint Conf. Neural Netw., Jun. 2012, pp. 1–8.

[16] G. K. Venayagamoorthy, R. G. Harley, and D. C. Wunsch, "Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator," IEEE Trans. Neural Netw., vol. 13, no. 3, pp. 764–773, May 2002.

[17] J. W. Park, R. G. Harley, and G. K. Venayagamoorthy, "Adaptive-critic-based optimal neurocontrol for synchronous generators in a power system using MLP/RBF neural networks," IEEE Trans. Ind. Appl., vol. 39, no. 5, pp. 1529–1540, Sep.–Oct. 2003.

[18] H. Zhang, Q. Wei, and Y. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 937–942, Aug. 2008.

[19] H. Zhang, L. Cui, X. Zhang, and Y. Luo, "Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method," IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 2226–2236, Dec. 2011.

[20] H. Zhang, R. Song, Q. Wei, and T. Zhang, "Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming," IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 1851–1862, Dec. 2011.

[21] L. Yang, J. Si, K. S. Tsakalis, and A. A. Rodriguez, "Direct heuristic dynamic programming for nonlinear tracking control with filtered tracking error," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 6, pp. 1617–1622, Dec. 2009.

[22] P. He and S. Jagannathan, "Reinforcement learning neural-network-based controller for nonlinear discrete-time systems with input constraints," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 37, no. 2, pp. 425–436, Apr. 2007.

[23] P. He and S. Jagannathan, "Reinforcement-based neuro-output feedback control of nonlinear discrete-time systems with input constraints," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 35, no. 1, pp. 150–154, Feb. 2005.

[24] S. Jagannathan and P. He, "Neural network-based state feedback control of nonlinear discrete-time system in non-strict feedback form," IEEE Trans. Neural Netw., vol. 19, no. 12, pp. 2073–2087, Dec. 2008.

[25] Z. Ni, X. Fang, H. He, D. Zhao, and X. Xu, "Real-time tracking control on adaptive critic design with uniformly ultimately bounded condition," in Proc. IEEE Symp. Adapt. Dyn. Program. Reinforcement Learn., IEEE Symp. Ser. Comput. Intell., Apr. 2013, to be published.

[26] D. Liu, H. Javaherian, O. Kovalenko, and T. Huang, "Adaptive critic learning techniques for engine torque and air-fuel ratio control," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 988–993, Aug. 2008.

[27] J. Fu, H. He, and X. Zhou, "Adaptive learning and control for MIMO system based on adaptive dynamic programming," IEEE Trans. Neural Netw., vol. 22, no. 7, pp. 1133–1148, Jul. 2011.

[28] J. Fu, H. He, Q. Liu, and Z. Ni, "An adaptive dynamic programming approach for closely-coupled MIMO system control," in Proc. 8th Int. Symp. Neural Netw., vol. 6677, Jun. 2011, pp. 1–10.

[29] Z. Ni, H. He, D. V. Prokhorov, and J. Fu, "An online actor-critic learning approach with Levenberg-Marquardt algorithm," in Proc. Int. Joint Conf. Neural Netw., Aug. 2011, pp. 2333–2340.

[30] G. K. Venayagamoorthy, R. G. Harley, and D. C. Wunsch, "Dual heuristic programming excitation neurocontrol for generators in a multimachine power system," IEEE Trans. Ind. Appl., vol. 39, no. 2, pp. 382–394, Apr.–May 2003.

[31] W. Qiao, G. Venayagamoorthy, and R. Harley, "DHP-based wide-area coordinating control of a power system with a large wind farm and multiple FACTS devices," in Proc. IEEE Int. Conf. Neural Netw., Aug. 2007, pp. 2093–2098.

[32] S. Ray, G. K. Venayagamoorthy, B. Chaudhuri, and R. Majumder, "Comparison of adaptive critics and classical approaches based wide area controllers for a power system," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 1002–1007, Aug. 2008.

[33] Y. Tang, H. He, Z. Ni, J. Wen, and X. Sui, "Reactive power control of grid-connected wind farm based on adaptive dynamic programming," Neurocomputing, to be published.

[34] D. Liu, Y. Zhang, and H. G. Zhang, "A self-learning call admission control scheme for CDMA cellular networks," IEEE Trans. Neural Netw., vol. 16, no. 5, pp. 1219–1228, Sep. 2005.

[35] H. G. Zhang, Y. H. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1490–1503, Sep. 2009.

[36] F.-Y. Wang, N. Jin, D. Liu, and Q. Wei, "Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound," IEEE Trans. Neural Netw., vol. 22, no. 1, pp. 24–36, Jan. 2011.

[37] P. H. Eaton, D. V. Prokhorov, and D. C. Wunsch, "Neurocontroller alternatives for 'fuzzy' ball-and-beam systems with nonuniform nonlinear friction," IEEE Trans. Neural Netw., vol. 11, no. 2, pp. 423–435, Mar. 2000.

[38] M. Fairbank, E. Alonso, and D. V. Prokhorov, "Simple and fast calculation of the second-order gradients for globalized dual heuristic dynamic programming in neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 10, pp. 1671–1676, Oct. 2012.

[39] H. He, Z. Ni, and D. Zhao, "Data-driven learning and control with multiple critic networks," in Proc. 10th World Congr. Intell. Control Autom., Jul. 2012, pp. 523–527.

[40] G. G. Lendaris and J. C. Neidhoefer, "Guidance in the use of adaptive critics for control," in Handbook of Learning and Approximate Dynamic Programming. Piscataway, NJ, USA: IEEE Press, 2004, pp. 97–118.

[41] H. Wong, M. de Queiroz, and V. Kapila, "Adaptive tracking control using synthesized velocity from attitude measurements," Automatica, vol. 37, no. 6, pp. 947–953, Jun. 2001.

[42] G. Tao, S. Joshi, and X. Ma, "Adaptive state feedback and tracking control of systems with actuator failures," IEEE Trans. Autom. Control, vol. 46, no. 1, pp. 78–95, Jan. 2001.

[43] H. Zhang and L. Cai, "Nonlinear adaptive control using the Fourier integral and its application to CSTR systems," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 32, no. 3, pp. 367–372, Jun. 2002.

[44] H. He, Z. Ni, and J. Fu, "A three-network architecture for on-line learning and optimization based on adaptive dynamic programming," Neurocomputing, vol. 78, no. 1, pp. 3–13, Feb. 2012.

[45] H. He, Z. Ni, and D. Zhao, "Learning and optimization in hierarchical adaptive critic design," in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Piscataway, NJ, USA: IEEE Press, 2013, pp. 78–95.

[46] F. L. Lewis, S. Jagannathan, and A. Yesildirek, "Neural network control of robot manipulators and nonlinear systems," in Neural Network Robot Control. New York, USA: Taylor & Francis, 1999, pp. 173–219.

[47] F. Yuan, L. Feldkamp, G. Puskorius, and L. Davis, "A simple solution to the bioreactor benchmark problem by application of Q-learning," in Proc. World Congr. Neural Netw., Washington, DC, USA, Jul. 1995, pp. 326–331.

[48] S. N. Balakrishnan and V. Biega, "Adaptive critic based neural networks for control (low order system applications)," in Proc. Amer. Control Conf., Seattle, WA, USA, Jun. 1995, pp. 335–339.

[49] T. L. Chien, C. C. Chen, Y. C. Huang, and W.-J. Lin, "Stability and almost disturbance decoupling analysis of nonlinear system subject to feedback linearization and feedforward neural network controller," IEEE Trans. Neural Netw., vol. 19, no. 7, pp. 1220–1230, Jul. 2008.

[50] X. Fang, H. He, Z. Ni, and Y. Tang, "Learning and control in virtual reality for machine intelligence," in Proc. 3rd Int. Conf. Intell. Control Inf. Process., Jul. 2012, pp. 63–67.

[51] [Online]. Available: http://www.youtube.com/watch?v=OeZEDBz6ki0&feature=youtu.be

[52] F. Liu, J. Sun, J. Si, W. Guo, and S. Mei, "A boundedness result for the direct heuristic dynamic programming," Neural Netw., vol. 32, pp. 229–235, Aug. 2012.

[53] J. Slotine and W. Li, Applied Nonlinear Control. Englewood Cliffs, NJ, USA: Prentice-Hall, 1991.

[54] H. Khalil, Nonlinear Systems, 3rd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 2001.


Zhen Ni received the B.S. degree from the Department of Control Science and Engineering, Huazhong University of Science and Technology (HUST), Wuhan, China, in 2010, and the M.S. degree from the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI, USA, in 2012, where he is currently pursuing the Ph.D. degree.

His current research interests include computational intelligence and reinforcement learning, specifically adaptive dynamic programming and optimal/adaptive control.

Haibo He (SM’11) received the B.S. and M.S. degrees in electrical engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 1999 and 2002, respectively, and the Ph.D. degree in electrical engineering from Ohio University, Athens, OH, USA, in 2006.

He was an Assistant Professor with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ, USA, from 2006 to 2009. He is currently an Associate Professor with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI, USA. His current research interests include adaptive dynamic programming (ADP), machine learning, computational intelligence, hardware design for machine intelligence, and various applications such as smart grid and renewable energy systems. He has authored one research book (Wiley), edited six conference proceedings (Springer), and authored or co-authored over 100 peer-reviewed journal and conference papers. His research has been covered by national and international media such as the IEEE Smart Grid Newsletter, The Wall Street Journal, and Providence Business News.

Dr. He is currently an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS and the IEEE TRANSACTIONS ON SMART GRID. He was the recipient of the National Science Foundation (NSF) CAREER Award in 2011 and the Providence Business News (PBN) Rising Star Innovator Award in 2011.

Jinyu Wen (M’10) is the Vice Dean of the College of Electrical and Electronic Engineering, Huazhong University of Science and Technology, Wuhan, China. He leads the Smart Grid Operation & Control Research Group in the China State Key Laboratory of Advanced Electromagnetic Engineering and Technology. His current research interests include smart grid, power system operation and control, renewable energy systems, and energy storage.

He is an active member of the IEEE Power & Energy Society and the Industry Applications Society.